๊ฐœ๋ฐœ Code/์ธ๊ณต์ง€๋Šฅ A.I.

[Python][AI] ํ•œ๊ตญ ๋กœ๋˜ ๋ถ„์„: ๋‹น์ฒจ ๋ฒˆํ˜ธ ๋ถ„์„๊ณผ ํŒจํ„ด ์ฐพ๊ธฐ(EDA)

5hr1rnp 2025. 2. 19. 21:03
๋ฐ˜์‘ํ˜•

์ถœ์ฒ˜ : https://dhlottery.co.kr/gameInfo.do?method=buyLotto

2025.02.18 - [๊ฐœ๋ฐœ Code/์ธ๊ณต์ง€๋Šฅ A.I.] - [Python][AI] ํ•œ๊ตญ ๋กœ๋˜ ๋ถ„์„: ๋‹น์ฒจ ํ™•๋ฅ ๊ณผ ์˜ˆ์ธก์˜ ๋ถˆ๊ฐ€๋Šฅ์„ฑ

 

[Python][AI] ํ•œ๊ตญ ๋กœ๋˜ ๋ถ„์„: ๋‹น์ฒจ ํ™•๋ฅ ๊ณผ ์˜ˆ์ธก์˜ ๋ถˆ๊ฐ€๋Šฅ์„ฑ

1. ํ•œ๊ตญ ๋กœ๋˜ ๊ฐœ๋…๋กœ๋˜๋Š” 1๋ถ€ํ„ฐ 45๊นŒ์ง€์˜ ์ˆซ์ž ์ค‘ 6๊ฐœ๋ฅผ ์„ ํƒํ•˜๋Š” ๋ฐฉ์‹์˜ ๋ณต๊ถŒ์ด๋‹ค. ๊ตฌ๋งค์ž๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฒˆํ˜ธ๋ฅผ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋‹ค.์ž๋™ ์„ ํƒ: 45๊ฐœ์˜ ๋ฒˆํ˜ธ ์ค‘ 6๊ฐœ ๋ฒˆํ˜ธ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ๋ถ€์—ฌ๋ฐ›

5hr1rnp.tistory.com

 

๋ณธ ๊ธ€์—์„œ๋Š” ๋กœ๋˜ ๋‹น์ฒจ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„(EDA)์„ ์ง„ํ–‰ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ ํ†ต๊ณ„์  ํŠน์ง•์„ ์‚ดํŽด๋ณด๊ฒ ๋‹ค.


1. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ


๋กœ๋˜ ๋ฐ์ดํ„ฐ๋Š” ๋™ํ–‰๋ณต๊ถŒ ์‚ฌ์ดํŠธ์—์„œ ์ œ๊ณตํ•˜๋Š” ์—‘์…€ ํŒŒ์ผ์„ ๊ฐ๊ฐ lotto-1.xls(1~600ํšŒ), lotto-2.xls(601~1,159ํšŒ)๋กœ ์ €์žฅํ•จ ('25. 02. 18. ๊ธฐ์ค€).

๋‘ ๋ฐ์ดํ„ฐ๋ฅผ Python์œผ๋กœ ๋ถˆ๋Ÿฌ์™€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ „์ฒ˜๋ฆฌํ•จ.

๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ์ •๋ฆฌ

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'NanumGothic'
plt.rcParams['axes.unicode_minus'] = False

%load_ext watermark
%watermark -v -m -p pandas,numpy,matplotlib,seaborn

# Python version       : 3.10.15
# IPython version      : 8.29.0

# pandas    : 2.2.3
# numpy     : 1.23.5
# matplotlib: 3.9.2
# seaborn   : 0.13.2

df1_raw = pd.read_html("./lotto/lotto-1.xls", encoding='euc-kr', index_col=False)
df2_raw = pd.read_html("./lotto/lotto-2.xls", encoding='euc-kr', index_col=False)

df1_table = df1_raw[1]
df2_table = df2_raw[1]

# ์ฒซ ๋ฒˆ์งธ ํ–‰ ์ œ๊ฑฐ ํ›„ ๋‘ ๋ฒˆ์งธ ํ–‰์„ ์ปฌ๋Ÿผ๋ช…์œผ๋กœ ์„ค์ •
df1_table.columns = df1_table.iloc[1]  # ๋‘ ๋ฒˆ์งธ ํ–‰์„ ์ปฌ๋Ÿผ๋ช…์œผ๋กœ ์„ค์ •
df2_table.columns = df2_table.iloc[1]

df1_table = df1_table[2:].reset_index(drop=True)  # ๋ถˆํ•„์š”ํ•œ ํ–‰ ์‚ญ์ œ
df2_table = df2_table[2:].reset_index(drop=True)

# ์ปฌ๋Ÿผ๋ช… ์ •๋ฆฌ
df1_table.columns = ['๋…„๋„', 'ํšŒ์ฐจ', '์ถ”์ฒจ์ผ', '1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜', '1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก', 
                     '2๋“ฑ ๋‹น์ฒจ์ž์ˆ˜', '2๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก', '3๋“ฑ ๋‹น์ฒจ์ž์ˆ˜', '3๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก', 
                     '4๋“ฑ ๋‹น์ฒจ์ž์ˆ˜', '4๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก', '5๋“ฑ ๋‹น์ฒจ์ž์ˆ˜', '5๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก', 
                     '๋ฒˆํ˜ธ1', '๋ฒˆํ˜ธ2', '๋ฒˆํ˜ธ3', '๋ฒˆํ˜ธ4', '๋ฒˆํ˜ธ5', '๋ฒˆํ˜ธ6', '๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ']

df2_table.columns = df1_table.columns  # ๋™์ผํ•œ ์ปฌ๋Ÿผ๋ช… ์ ์šฉ

# ์ˆซ์ž ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜
numeric_cols = ["1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜", "1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก", "2๋“ฑ ๋‹น์ฒจ์ž์ˆ˜", "2๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก",
                "3๋“ฑ ๋‹น์ฒจ์ž์ˆ˜", "3๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก", "4๋“ฑ ๋‹น์ฒจ์ž์ˆ˜", "4๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก", 
                "5๋“ฑ ๋‹น์ฒจ์ž์ˆ˜", "5๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก"]

def clean_numeric_columns(df, cols):
    for col in cols:
        df[col] = df[col].str.replace(",", "").str.replace("์›", "").astype(np.int64)
    return df

df1_table = clean_numeric_columns(df1_table, numeric_cols)
df2_table = clean_numeric_columns(df2_table, numeric_cols)

# ๋‚ ์งœ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜
df1_table["์ถ”์ฒจ์ผ"] = pd.to_datetime(df1_table["์ถ”์ฒจ์ผ"], format="%Y.%m.%d")
df2_table["์ถ”์ฒจ์ผ"] = pd.to_datetime(df2_table["์ถ”์ฒจ์ผ"], format="%Y.%m.%d")

# ๋‘ ๊ฐœ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํ•ฉ์น˜๊ธฐ
lotto_dataset = pd.concat([df1_table, df2_table], ignore_index=True).sort_values('๋…„๋„').reset_index(drop=True)

lotto_dataset['์ถ”์ฒจ์ผ'] = pd.to_datetime(lotto_dataset['์ถ”์ฒจ์ผ'], format="%Y-%m-%d")

lotto_obj_cols = ['๋…„๋„', 'ํšŒ์ฐจ', '๋ฒˆํ˜ธ1', '๋ฒˆํ˜ธ2', '๋ฒˆํ˜ธ3', '๋ฒˆํ˜ธ4', '๋ฒˆํ˜ธ5', '๋ฒˆํ˜ธ6', '๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ']

for col in lotto_obj_cols:
    lotto_dataset[col] = lotto_dataset[col].astype(np.int64)

# 	๋…„๋„	ํšŒ์ฐจ	์ถ”์ฒจ์ผ	1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	2๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	2๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	3๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	3๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	4๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	4๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	5๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	5๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	๋ฒˆํ˜ธ1	๋ฒˆํ˜ธ2	๋ฒˆํ˜ธ3	๋ฒˆํ˜ธ4	๋ฒˆํ˜ธ5	๋ฒˆํ˜ธ6	๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ
# 0	2002	1	2002-12-07	0	0	1	143934100	28	5140500	2537	113400	40155	10000	10	23	29	33	37	40	16
# 1	2002	2	2002-12-14	1	2002006800	2	94866800	103	1842000	3763	100800	55480	10000	9	13	21	25	32	42	2
# 2	2002	3	2002-12-21	1	2000000000	0	0	139	1174100	5940	54900	73256	10000	11	16	19	21	27	31	30
# 3	2002	4	2002-12-28	0	0	1	211191200	29	7282400	2777	152100	52382	10000	14	27	30	31	40	42	2
# 4	2003	21	2003-04-26	23	797475400	26	117576500	2008	1522400	82996	73600	1155368	10000	6	12	17	18	31	32	21
# ...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
# 1154	2025	1156	2025-01-25	21	1505207090	73	72167464	3494	1507792	162883	50000	2706409	5000	30	31	34	39	41	45	7
# 1155	2025	1155	2025-01-18	7	4066375179	78	60821851	3171	1496091	160366	50000	2654046	5000	10	16	19	27	37	38	13
# 1156	2025	1154	2025-01-11	15	1854965425	62	74796993	3190	1453735	154785	50000	2559114	5000	4	8	22	26	32	38	27
# 1157	2025	1153	2025-01-04	15	2027312925	79	64155473	3506	1445603	167692	50000	2848165	5000	1	9	10	13	35	44	5
# 1158	2025	1157	2025-02-01	12	2257842157	141	32026130	4549	992677	201847	50000	3100779	5000	5	7	12	20	25	26	28
# 1159 rows ร— 20 columns

2. ๋ฐ์ดํ„ฐ ๊ฐœ์š”


ํ†ตํ•ฉ๋œ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๊ตฌ์กฐ ํ™•์ธ

# ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ ํ™•์ธ
lotto_dataset.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1159 entries, 0 to 1158
# Data columns (total 20 columns):
#  #   Column   Non-Null Count  Dtype         
# ---  ------   --------------  -----         
#  0   ๋…„๋„       1159 non-null   int64         
#  1   ํšŒ์ฐจ       1159 non-null   int64         
#  2   ์ถ”์ฒจ์ผ      1159 non-null   datetime64[ns]
#  3   1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜  1159 non-null   int64         
#  4   1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก  1159 non-null   int64         
#  5   2๋“ฑ ๋‹น์ฒจ์ž์ˆ˜  1159 non-null   int64         
#  6   2๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก  1159 non-null   int64         
#  7   3๋“ฑ ๋‹น์ฒจ์ž์ˆ˜  1159 non-null   int64         
#  8   3๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก  1159 non-null   int64         
#  9   4๋“ฑ ๋‹น์ฒจ์ž์ˆ˜  1159 non-null   int64         
#  10  4๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก  1159 non-null   int64         
#  11  5๋“ฑ ๋‹น์ฒจ์ž์ˆ˜  1159 non-null   int64         
#  12  5๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก  1159 non-null   int64         
#  13  ๋ฒˆํ˜ธ1      1159 non-null   int64         
#  14  ๋ฒˆํ˜ธ2      1159 non-null   int64         
#  15  ๋ฒˆํ˜ธ3      1159 non-null   int64         
#  16  ๋ฒˆํ˜ธ4      1159 non-null   int64         
#  17  ๋ฒˆํ˜ธ5      1159 non-null   int64         
#  18  ๋ฒˆํ˜ธ6      1159 non-null   int64         
#  19  ๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ    1159 non-null   int64         
# dtypes: datetime64[ns](1), int64(19)
# memory usage: 181.2 KB

# ๊ธฐ์ˆ  ํ†ต๊ณ„ ๋ถ„์„
round(lotto_dataset.describe(), 3)

# 	๋…„๋„	ํšŒ์ฐจ	์ถ”์ฒจ์ผ	1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	2๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	2๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	3๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	3๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	4๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	4๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	5๋“ฑ ๋‹น์ฒจ์ž์ˆ˜	5๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก	๋ฒˆํ˜ธ1	๋ฒˆํ˜ธ2	๋ฒˆํ˜ธ3	๋ฒˆํ˜ธ4	๋ฒˆํ˜ธ5	๋ฒˆํ˜ธ6	๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ
# count	1159.000	1159.000	1159	1159.000	1.159000e+03	1159.000	1.159000e+03	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000	1159.000
# mean	2013.530	580.000	2014-01-11 00:00:00	8.097	2.665804e+09	49.023	6.332445e+07	1851.280	1585731.174	90760.637	57301.499	1487654.831	5375.324	6.752	13.090	19.954	26.234	32.840	39.478	22.334
# min	2002.000	1.000	2002-12-07 00:00:00	0.000	0.000000e+00	0.000	0.000000e+00	28.000	627634.000	2537.000	27300.000	40155.000	5000.000	1.000	2.000	3.000	5.000	9.000	18.000	1.000
# 25%	2008.000	290.500	2008-06-24 12:00:00	5.000	1.566595e+09	33.000	4.878907e+07	1294.000	1354515.500	64195.500	50000.000	1060499.000	5000.000	3.000	8.000	14.000	21.000	28.000	37.000	10.500
# 50%	2014.000	580.000	2014-01-11 00:00:00	7.000	2.082099e+09	44.000	5.712570e+07	1650.000	1478429.000	80303.000	50000.000	1324790.000	5000.000	5.000	12.000	19.000	27.000	34.000	41.000	23.000
# 75%	2019.000	869.500	2019-07-30 12:00:00	10.000	3.019093e+09	62.000	6.688088e+07	2339.000	1603055.000	114261.000	56207.000	1865719.500	5000.000	10.000	18.000	25.500	32.000	38.000	44.000	34.000
# max	2025.000	1159.000	2025-02-15 00:00:00	63.000	4.072296e+10	664.000	7.694565e+08	11247.000	9307100.000	703234.000	260000.000	3410846.000	10000.000	35.000	37.000	40.000	43.000	44.000	45.000	45.000
# std	6.415	334.719	NaN	4.860	2.398714e+09	28.782	3.692096e+07	784.666	641177.352	38495.307	20277.336	553517.220	1318.047	5.406	6.892	7.436	7.541	6.821	5.292	13.204

 

๋ฐ์ดํ„ฐ์…‹์€ 1159ํ–‰ร—20์—ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์œผ๋ฉฐ, ๋ชจ๋“  ์ปฌ๋Ÿผ์— ๊ฒฐ์ธก์น˜๊ฐ€ ์—†์Œ.

 

1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก์„ ๋ณด๋ฉด, ์ตœ๋Œ“๊ฐ’์ด 400์–ต์ด ๋„˜๋Š”๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ถ๊ธˆํ•ด์„œ 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก ์ˆœ์œผ๋กœ ํ™•์ธํ•ด ๋ดค์„๋•Œ ๋‹ค์Œ๊ณผ ๊ฐ™์•˜๋‹ค.

 

lotto_dataset['1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก'].sort_values(ascending=False)[:10]

# 18    40722959400
# 24    24227745300
# 19    19352212800
# 42    17749630800
# 14    17014245000
# 35    16014475800
# 61    15817286400
# 32    14903517600
# 81    14562494400
# 85    14252186400
# Name: 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก, dtype: int64

 

100์–ต์ด ๋„˜๋Š” 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก์ด ์ด๋ฆฌ๋„ ๋งŽ์•˜๋˜๊ฐ€? ํ•ด์„œ ์ถ”์ฒจ์ผ์„ ์ฐพ์•„๋ณด์•˜๋‹ค.

 

lotto_dataset.loc[lotto_dataset['1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก'].sort_values(ascending=False)[:10].index]['์ถ”์ฒจ์ผ']

# 18   2003-04-12
# 24   2003-05-24
# 19   2003-04-19
# 42   2003-09-27
# 14   2003-03-15
# 35   2003-08-09
# 61   2004-02-07
# 32   2003-07-19
# 81   2004-06-26
# 85   2004-07-24
# Name: ์ถ”์ฒจ์ผ, dtype: datetime64[ns]

 

2003 ~ 2004๋…„์— 100์–ต์ด ๋„˜๋Š” 1๋“ฑ ๋‹น์ฒจ๊ธˆ์„ ๋ฐ›๋Š” ์ผ์ด ๋งŽ์•˜๋˜ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.


728x90
๋ฐ˜์‘ํ˜•

3. ์‹œ๊ฐํ™” ๋ถ„์„


(1) ์ถ”์ฒจ์ผ์— ๋”ฐ๋ฅธ 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก & ๋‹น์ฒจ์ž์ˆ˜ ๋ณ€ํ™” ๋ฐ ์„ ํ˜• ์ถ”์„ธ์„ 

import statsmodels.api as sm

# ์‹œ๊ฐ„ ํ๋ฆ„(์ธ๋ฑ์Šค)์„ ํšŒ๊ท€๋ถ„์„์˜ ๋…๋ฆฝ๋ณ€์ˆ˜๋กœ ์‚ฌ์šฉ
lotto_dataset['ํšŒ์ฐจ์ธ๋ฑ์Šค'] = np.arange(len(lotto_dataset))

# (1) 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก์— ๋Œ€ํ•œ ์„ ํ˜• ํšŒ๊ท€
X = lotto_dataset[['ํšŒ์ฐจ์ธ๋ฑ์Šค']]
X = sm.add_constant(X)  # ์ƒ์ˆ˜ํ•ญ ์ถ”๊ฐ€
y = lotto_dataset['1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก']

model = sm.OLS(y, X).fit()
lotto_dataset['๊ธˆ์•ก_์ถ”์„ธ์„ '] = model.predict(X)

# (2) 1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜์— ๋Œ€ํ•œ ์„ ํ˜• ํšŒ๊ท€
X2 = lotto_dataset[['ํšŒ์ฐจ์ธ๋ฑ์Šค']]
X2 = sm.add_constant(X2)
y2 = lotto_dataset['1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜']

model2 = sm.OLS(y2, X2).fit()
lotto_dataset['๋‹น์ฒจ์ž์ˆ˜_์ถ”์„ธ์„ '] = model2.predict(X2)

# ์‹œ๊ฐํ™”
fig, axes = plt.subplots(2, 1, figsize=(16, 10), sharex=True)

# 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก (์› ๋ฐ์ดํ„ฐ + ํšŒ๊ท€์„ )
axes[0].plot(lotto_dataset['์ถ”์ฒจ์ผ'], lotto_dataset['1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก'], label='1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก(์›)', color='blue')
axes[0].plot(lotto_dataset['์ถ”์ฒจ์ผ'], lotto_dataset['๊ธˆ์•ก_์ถ”์„ธ์„ '], label='์ถ”์„ธ์„ (์„ ํ˜• ํšŒ๊ท€)', color='red')
axes[0].set_ylabel('1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก(์›)')
axes[0].set_title('์ถ”์ฒจ์ผ์— ๋”ฐ๋ฅธ 1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก ๋ณ€ํ™” ๋ฐ ์„ ํ˜• ์ถ”์„ธ์„ ')
axes[0].grid(True)
axes[0].legend(loc='upper center', fontsize=15)

# 1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜ (์› ๋ฐ์ดํ„ฐ + ํšŒ๊ท€์„ )
axes[1].plot(lotto_dataset['์ถ”์ฒจ์ผ'], lotto_dataset['1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜'], label='1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜(๋ช…)', color='orange')
axes[1].plot(lotto_dataset['์ถ”์ฒจ์ผ'], lotto_dataset['๋‹น์ฒจ์ž์ˆ˜_์ถ”์„ธ์„ '], label='์ถ”์„ธ์„ (์„ ํ˜• ํšŒ๊ท€)', color='red')
axes[1].set_xlabel('์ถ”์ฒจ์ผ')
axes[1].set_ylabel('1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜(๋ช…)')
axes[1].set_title('์ถ”์ฒจ์ผ์— ๋”ฐ๋ฅธ 1๋“ฑ ๋‹น์ฒจ์ž์ˆ˜ ๋ณ€ํ™” ๋ฐ ์„ ํ˜• ์ถ”์„ธ์„ ')
axes[1].grid(True)
axes[1].legend(loc='upper center', fontsize=15)

plt.tight_layout()
plt.show()

์ถ”์ฒจ์ผ์— ๋”ฐ๋ฅธ ๋ณ€ํ™” ๋ฐ ์ถ”์„ธ์„ 

 

1๋“ฑ ๋‹น์ฒจ๊ธˆ์•ก์€ ๊ฐ์†Œํ•˜๋Š” ์ถ”์„ธ๊ณ , 1๋“ฑ ๋‹น์ฒจ์ž๋Š” ์ฆ๊ฐ€ํ•˜๋Š” ์ถ”์„ธ๋ผ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‹น์ฒจ์ž์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ 1๋“ฑ ๋‹น์ฒจ ๊ธˆ์•ก์ด ์ค„์–ด๋“  ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. ๋ณต๊ถŒ ๊ตฌ๋งค๋Ÿ‰(ํŒ๋งค๋Ÿ‰)์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†์–ด์„œ ๊ตฌ๋งค์ž ๋Œ€๋น„ ๋‹น์ฒจ๋ฅ ์€ ํ™•์ธํ•  ์ˆ˜ ์—†์—ˆ๋‹ค.

(2) ๋กœ๋˜ ๋ฒˆํ˜ธ๋ณ„ ์ถœ๋ณ€ ๋นˆ๋„

from collections import Counter
palette = sns.color_palette("rocket", 45)

# ๋‹น์ฒจ ๋ฒˆํ˜ธ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘
all_numbers = lotto_dataset[['๋ฒˆํ˜ธ1', '๋ฒˆํ˜ธ2', '๋ฒˆํ˜ธ3', '๋ฒˆํ˜ธ4', '๋ฒˆํ˜ธ5', '๋ฒˆํ˜ธ6', '๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ']].values.flatten()
all_but_no_bonus_numbers = lotto_dataset[['๋ฒˆํ˜ธ1', '๋ฒˆํ˜ธ2', '๋ฒˆํ˜ธ3', '๋ฒˆํ˜ธ4', '๋ฒˆํ˜ธ5', '๋ฒˆํ˜ธ6', '๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ']].values.flatten()
only_bonus_numbers = lotto_dataset['๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ'].values.flatten()

# ๋ฒˆํ˜ธ๋ณ„ ์ถœํ˜„ ๋นˆ๋„ ๊ณ„์‚ฐ
number_counts = Counter(all_numbers)
all_but_no_bonus_counts = Counter(all_but_no_bonus_numbers)
only_bonus_numbers_counts = Counter(only_bonus_numbers)

# Counter์˜ key๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ
sorted_items = sorted(number_counts.items(), key=lambda x: x[0], reverse=True)
no_bonus_sorted_items = sorted(all_but_no_bonus_counts.items(), key=lambda x: x[0], reverse=True)
bonus_sorted_items = sorted(only_bonus_numbers_counts.items(), key=lambda x: x[0], reverse=True)

sorted_keys, sorted_values = zip(*sorted_items)
no_bonus_sorted_keys, no_bonus_sorted_values = zip(*no_bonus_sorted_items)
bonus_sorted_keys, bonus_sorted_values = zip(*bonus_sorted_items)

# ์‹œ๊ฐํ™”
fig, axes = plt.subplots(3, 1, figsize=(16, 10), sharex=True)

# ์‹œ๊ฐํ™”
axes[0].bar(sorted_keys, sorted_values, color=palette)
axes[0].set_ylabel("์ถœํ˜„ ํšŸ์ˆ˜")
axes[0].set_title("๋กœ๋˜ ๋ฒˆํ˜ธ๋ณ„ ์ถœํ˜„ ๋นˆ๋„(๋ฒˆํ˜ธ ์ „์ฒด)")
axes[0].set_ylim(140, 210)
axes[0].grid(axis="y", linestyle="--", alpha=0.8)

axes[1].bar(no_bonus_sorted_keys, no_bonus_sorted_values, color=palette)
axes[1].set_ylabel("์ถœํ˜„ ํšŸ์ˆ˜")
axes[1].set_title("๋กœ๋˜ ๋ฒˆํ˜ธ๋ณ„ ์ถœํ˜„ ๋นˆ๋„(๋ณด๋„ˆ์Šค ๋ฒˆํ˜ธ ์ œ์™ธ)")
axes[1].set_ylim(140, 210)
axes[1].grid(axis="y", linestyle="--", alpha=0.8)

axes[2].bar(bonus_sorted_keys, bonus_sorted_values, color=palette)
axes[2].set_xlabel("๋กœ๋˜ ๋ฒˆํ˜ธ")
axes[2].set_ylabel("์ถœํ˜„ ํšŸ์ˆ˜")
axes[2].set_title("๋กœ๋˜ ๋ฒˆํ˜ธ๋ณ„ ์ถœํ˜„ ๋นˆ๋„(๋ณด๋„ˆ์Šค ๋ฒˆํ˜ธ)")
axes[2].set_xticks(range(1, 46))
axes[2].set_ylim(10, 40)
axes[2].grid(axis="y", linestyle="--", alpha=0.8)

plt.tight_layout()

 

 

1๋“ฑ ๋‹น์ฒจ ๋ฒˆํ˜ธ๋กœ ๊ฐ€์žฅ ๋งŽ์ด ์„ ์ •๋œ ์ˆซ์ž๋Š” '34', ๊ฐ€์žฅ ์ ๊ฒŒ ์„ ์ •๋œ ์ˆซ์ž๋Š” '9'๋ผ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋ณด๋„ˆ์Šค ๋ฒˆํ˜ธ๋ฅผ ์ œ์™ธํ•˜๊ณ ์„œ๋„ ๊ฐ€์žฅ ๋งŽ์ด ์„ ์ •๋œ ์ˆซ์ž๋Š” '34,', ๊ฐ€์žฅ ์ ๊ฒŒ ์„ ์ •๋œ ์ˆซ์ž๋Š” '9'์˜€๋‹ค. ์ด์™€ ๋‹ฌ๋ฆฌ, ๋ณด๋„ˆ์Šค ๋ฒˆํ˜ธ ์ค‘์—์„œ๋Š” '43'์ด ๊ฐ€์žฅ ๋งŽ์ด ์„ ์ • ๋์œผ๋ฉฐ '29'๊ฐ€ ๊ฐ€์žฅ ์ ๊ฒŒ ์„ ์ •๋๋‹ค.

(3) ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ๋กœ๋˜ ์ˆซ์ž ์กฐํ•ฉ (TOP 10)

from itertools import combinations

# ๋ชจ๋“  ์กฐํ•ฉ์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜๋ฅผ ์ €์žฅํ•  Counter ์ƒ์„ฑ
combination_counter = Counter()

# ๊ฐ€๋Šฅํ•œ ์กฐํ•ฉ ํฌ๊ธฐ (2๊ฐœ ์กฐํ•ฉ๋ถ€ํ„ฐ 6๊ฐœ ์กฐํ•ฉ๊นŒ์ง€)
for r in range(2, 7):
    for row in lotto_dataset[['๋ฒˆํ˜ธ1', '๋ฒˆํ˜ธ2', '๋ฒˆํ˜ธ3', '๋ฒˆํ˜ธ4', '๋ฒˆํ˜ธ5', '๋ฒˆํ˜ธ6', '๋ณด๋„ˆ์Šค๋ฒˆํ˜ธ']].values:
        # ๊ฐ ํšŒ์ฐจ์˜ ๋‹น์ฒจ ๋ฒˆํ˜ธ์—์„œ r๊ฐœ์˜ ์ˆซ์ž๋กœ ์ด๋ฃจ์–ด์ง„ ์กฐํ•ฉ ์ƒ์„ฑ ํ›„ ์ •๋ ฌํ•˜์—ฌ ์นด์šดํŠธ
        for combo in combinations(sorted(row), r):
            combination_counter[combo] += 1

# ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ์กฐํ•ฉ 10๊ฐœ ์ฐพ๊ธฐ
most_common_combinations = combination_counter.most_common(10)

# ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜: ์กฐํ•ฉ์„ ๋ฌธ์ž์—ด๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์‹œ๊ฐํ™”
combo_labels = [str(combo) for combo, _ in most_common_combinations]
combo_counts = [count for _, count in most_common_combinations]

# ์‹œ๊ฐํ™”
plt.figure(figsize=(12, 6))
sns.barplot(x=combo_counts, y=combo_labels, palette="Blues_r")
plt.xlim(32,42)
plt.xlabel("๋“ฑ์žฅ ํšŸ์ˆ˜")
plt.ylabel("์ˆซ์ž ์กฐํ•ฉ")
plt.title("๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ๋กœ๋˜ ์ˆซ์ž ์กฐํ•ฉ (TOP 10)")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.tight_layout()

๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ๋กœ๋˜ ์ˆซ์ž ์กฐํ•ฉ (TOP 10)

 

๋ณด๋„ˆ์Šค ๋ฒˆํ˜ธ๋ฅผ ํฌํ•จํ•˜์—ฌ 2~6๊ฐœ๊นŒ์ง€ ๊ฐ€๋Šฅํ•œ ์กฐํ•ฉ์„ ์กฐํšŒํ•œ ๊ฒฐ๊ณผ, ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ์ˆซ์ž ์กฐํ•ฉ์€ (3,20)์ด์—ˆ๋‹ค.

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ์—ฌ๊ธฐ๊นŒ์ง€ ์ง„ํ–‰ํ•˜๊ณ , ๋‹ค์Œ ๊ธ€์—์„œ ์ถ”๊ฐ€๋กœ ์ง„ํ–‰ํ•˜๊ฒ ๋‹ค.

๋ฐ˜์‘ํ˜•