
[Python][AI] Korean Lotto Analysis: Additional EDA and ML Number Prediction

5hr1rnp 2025. 2. 24. 22:30

Source: https://dhlottery.co.kr/gameInfo.do?method=buyLotto

 

2025.02.18 - [Development Code/Artificial Intelligence A.I.] - [Python][AI] Korean Lotto Analysis: Winning Probability and the Impossibility of Prediction

2025.02.19 - [Development Code/Artificial Intelligence A.I.] - [Python][AI] Korean Lotto Analysis: Winning Number Analysis and Pattern Finding (EDA)

 

This post looks for patterns in lotto winning numbers through data analysis and introduces a way to predict the next winning numbers with XGBoost. The analysis covers the odd/even ratio, the ratio of low to high numbers, and the winning numbers by month, and then uses a machine learning model to try predicting numbers.


1. Odd vs. Even and Low Numbers (1~22) vs. High Numbers (23~45) Ratio Analysis


First, we count how often odd and even numbers, and low and high numbers, appear among the winning numbers.

def odd_even_ratio(df):
    # Count odd and even values across the six winning-number columns (번호1..번호6)
    odd_even_counts = {"Odd": 0, "Even": 0}
    for i in range(1, 7):
        odd_even_counts["Odd"] += (df[f"번호{i}"] % 2 == 1).sum()
        odd_even_counts["Even"] += (df[f"번호{i}"] % 2 == 0).sum()
    return odd_even_counts

def low_high_ratio(df):
    # Count low (1-22) and high (23-45) values across the six winning-number columns
    low_high_counts = {"Low numbers (1-22)": 0, "High numbers (23-45)": 0}
    for i in range(1, 7):
        low_high_counts["Low numbers (1-22)"] += (df[f"번호{i}"] <= 22).sum()
        low_high_counts["High numbers (23-45)"] += (df[f"번호{i}"] > 22).sum()
    return low_high_counts

 

These counts are then visualized as bar charts.

import matplotlib.pyplot as plt
import seaborn as sns

# Run the analysis
odd_even_counts = odd_even_ratio(lotto_dataset)
low_high_counts = low_high_ratio(lotto_dataset)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Odd vs. even ratio
sns.barplot(x=list(odd_even_counts.keys()), y=list(odd_even_counts.values()), palette="coolwarm", ax=axes[0])
axes[0].set_title("Odd vs. Even Ratio", fontsize=14)
axes[0].set_ylabel("Number of appearances")
axes[0].set_ylim(3200, 3600)  # truncated y-axis: this exaggerates the visual gap between the bars
axes[0].grid(axis="y", linestyle="--", alpha=0.8)

# Low vs. high number ratio
sns.barplot(x=list(low_high_counts.keys()), y=list(low_high_counts.values()), palette="viridis", ax=axes[1])
axes[1].set_title("Low Numbers (1-22) vs. High Numbers (23-45) Ratio", fontsize=14)
axes[1].set_ylabel("Number of appearances")
axes[1].set_ylim(3200, 3600)  # same truncated range as the left panel
axes[1].grid(axis="y", linestyle="--", alpha=0.8)

plt.tight_layout()
plt.show()

 

 

Ratios of odd, even, low, and high numbers

 

Odd numbers were drawn somewhat more often than even numbers, and high numbers were drawn more often than low numbers.
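Before reading much into these counts, it may help to check the baseline. The range 1 to 45 contains 23 odd and 22 even numbers, and the high bucket (23 to 45) holds 23 numbers against 22 in the low bucket, so a perfectly uniform draw would already favor odd and high numbers slightly. The sketch below computes that expected share and sets it next to an observed share. The helper names `expected_share` and `compare_to_uniform` are hypothetical and not part of the original code, and the counts in the example are made up purely for illustration.

```python
def expected_share(predicate, lo=1, hi=45):
    """Fraction of the numbers lo..hi that satisfy predicate, i.e. the
    share expected if every number were equally likely to be drawn."""
    numbers = range(lo, hi + 1)
    return sum(1 for n in numbers if predicate(n)) / len(numbers)


def compare_to_uniform(counts, shares):
    """Return {label: (observed_share, expected_share)} for a dict of counts."""
    total = sum(counts.values())
    return {label: (counts[label] / total, shares[label]) for label in counts}


odd_share = expected_share(lambda n: n % 2 == 1)   # 23 of 45 numbers are odd
high_share = expected_share(lambda n: n > 22)      # 23 of 45 numbers are above 22

# Made-up counts, NOT the real results; substitute the odd_even_counts or
# low_high_counts dictionaries produced by the analysis.
example_counts = {"odd": 3450, "even": 3350}
result = compare_to_uniform(example_counts, {"odd": odd_share, "even": 1 - odd_share})
for label, (observed, expected) in result.items():
    print(f"{label}: observed {observed:.3f}, expected {expected:.3f}")
```

If the observed shares sit close to 23/45 (about 0.511), the gap in the chart is roughly what uniform draws would produce rather than evidence of a pattern.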



2. Most Frequent Number by Month


import numpy as np

palette = sns.color_palette("rocket", 12)

# Add the month of each draw
lotto_dataset["월"] = lotto_dataset["추첨일"].dt.month

# Winning-number columns
number_columns = ["번호1", "번호2", "번호3", "번호4", "번호5", "번호6"]

# Count how often each number appears in each month
monthly_number_counts = {month: {} for month in range(1, 13)}

for month in range(1, 13):
    monthly_data = lotto_dataset[lotto_dataset["월"] == month]
    number_counts = monthly_data[number_columns].values.flatten()
    unique, counts = np.unique(number_counts, return_counts=True)
    monthly_number_counts[month] = dict(zip(unique, counts))

# Most frequent number for each month (ties go to the smallest number)
most_frequent_numbers = {
    month: max(monthly_number_counts[month], key=monthly_number_counts[month].get)
    for month in range(1, 13)
}

# How many times that number appeared in the month
most_frequent_counts = {
    month: monthly_number_counts[month][most_frequent_numbers[month]]
    for month in range(1, 13)
}

# Visualization
plt.figure(figsize=(12, 6))
plt.bar(list(most_frequent_numbers.keys()), list(most_frequent_counts.values()), tick_label=list(most_frequent_numbers.keys()), color=palette)
plt.xlabel("Month")
plt.ylabel("Number of wins")
plt.title("Most Frequently Drawn Lotto Number and Its Count by Month")
plt.xticks(range(1, 13))
plt.ylim(10, 30)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Label each bar with its number
for month, num in most_frequent_numbers.items():
    plt.text(month, most_frequent_counts[month], str(num), ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.show()

Most frequently drawn number and its count, by month

 

Looking only at the single most frequently drawn number for each month gives the chart above. Somewhat surprisingly, 1, which I expected to show up least, was the top number in two months, June and October.
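The per-month counting above can also be written with pandas reshaping instead of an explicit loop. The following is a minimal sketch, assuming the same `월` and `번호N` column names as the dataset. The function name `most_frequent_by_month` and the toy DataFrame are hypothetical and exist only for illustration (the toy data uses three number columns to stay short).

```python
import pandas as pd


def most_frequent_by_month(df, number_columns, month_col="월"):
    """Return {month: (number, count)} for the most frequent number per month.

    Ties resolve to the smallest number, because groupby sorts its keys
    and idxmax returns the first maximum.
    """
    # One row per (month, drawn number) pair
    long = df.melt(id_vars=[month_col], value_vars=number_columns, value_name="number")
    # Series indexed by (month, number) holding the appearance count
    counts = long.groupby([month_col, "number"]).size()
    # For each month, the (month, number) label with the highest count
    top = counts.groupby(level=0).idxmax()
    return {
        int(month): (int(number), int(counts.loc[(month, number)]))
        for month, (_, number) in top.items()
    }


# Toy data, not real draws
toy = pd.DataFrame(
    {
        "월": [1, 1, 2, 2, 2],
        "번호1": [3, 3, 7, 7, 1],
        "번호2": [10, 15, 8, 30, 7],
        "번호3": [21, 40, 9, 45, 44],
    }
)
top_numbers = most_frequent_by_month(toy, ["번호1", "번호2", "번호3"])
print(top_numbers)  # {1: (3, 2), 2: (7, 3)}
```

On the real data this would be called as `most_frequent_by_month(lotto_dataset, number_columns)`, and it should agree with the loop version, including how ties are broken.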


3. Lotto Number Prediction with XGBoost


XGBoost is used to predict the lotto numbers. The model is trained on whether each number (1~45) appears in a draw.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare the data
lotto_dataset["월"] = lotto_dataset["추첨일"].dt.month
lotto_dataset["년도"] = lotto_dataset["추첨일"].dt.year

# Build appearance indicators: 번호_i is 1 if number i was drawn in that round, else 0
number_columns = ["번호1", "번호2", "번호3", "번호4", "번호5", "번호6"]
for i in range(1, 46):
    lotto_dataset[f"번호_{i}"] = lotto_dataset[number_columns].eq(i).any(axis=1).astype(int)

features = ["년도", "월", "회차"]
X = lotto_dataset[features]
y = lotto_dataset[[f"번호_{i}" for i in range(1, 46)]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model (a 2-D multi-hot y needs an XGBoost version with multi-label support)
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
model = xgb.XGBClassifier(objective="binary:logistic", eval_metric="logloss")
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Predict the next draw
def predict_next_draw(model, last_draw):
    last_draw_scaled = scaler.transform([last_draw])
    probabilities = model.predict_proba(last_draw_scaled)
    predicted_numbers = np.argsort(probabilities[0])[-6:] + 1  # pick the 6 numbers with the highest probability
    return sorted(int(n) for n in predicted_numbers)

# Note: this passes the features of the last row in the dataset
latest_data = lotto_dataset[features].iloc[-1].values
predicted_numbers = predict_next_draw(model, latest_data)
print(f"Predicted lotto numbers: {predicted_numbers}")

# Model accuracy: 0.0560
# Predicted lotto numbers: [25, 26, 27, 32, 38, 39]
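The code above does not show how the accuracy figure was computed, so the number is hard to interpret on its own. One alternative way to judge a model like this, offered here only as a sketch and not as the author's metric, is to count how many of the six highest-probability numbers actually appear in each draw and compare that with chance. Six numbers picked at random from 45 are expected to match 6 × 6 / 45 = 0.8 winning numbers on average. The helper name `top6_hits` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def top6_hits(probabilities, y_true):
    """Number of winning numbers among the six highest-probability picks, per draw.

    probabilities: (n_draws, 45) array of predicted appearance probabilities
    y_true:        (n_draws, 45) multi-hot array, 1 where the number was drawn
    """
    probabilities = np.asarray(probabilities)
    y_true = np.asarray(y_true)
    top6 = np.argsort(probabilities, axis=1)[:, -6:]  # column indices of the six largest scores
    return np.take_along_axis(y_true, top6, axis=1).sum(axis=1)


# Expected hits for six numbers picked at random: mean of a hypergeometric
# distribution with 45 numbers, 6 winners, and 6 picks.
CHANCE_HITS = 6 * 6 / 45

# Synthetic check with random scores against random draws (not real lotto data)
rng = np.random.default_rng(0)
n_draws = 2000
y_demo = np.zeros((n_draws, 45), dtype=int)
for row in y_demo:
    row[rng.choice(45, size=6, replace=False)] = 1
scores_demo = rng.random((n_draws, 45))
mean_hits = top6_hits(scores_demo, y_demo).mean()
print(f"mean hits: {mean_hits:.2f} (chance: {CHANCE_HITS:.2f})")
```

With the model from this post, the call would look like `top6_hits(model.predict_proba(X_test_scaled), y_test.values)`, assuming `predict_proba` returns one column per number. A mean close to 0.8 would indicate the predictions are no better than random picks.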

Conclusion


In this analysis, we examined patterns in lotto numbers and implemented a model that predicts numbers with XGBoost. The accuracy is only 5.6% (0.056), which is very low, but I will test how far it can be raised with various techniques and write up the results in the next post.
