๊ฐœ๋ฐœ Code/์ธ๊ณต์ง€๋Šฅ A.I.

[Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 4

5hr1rnp 2025. 2. 4. 21:35
๋ฐ˜์‘ํ˜•

2025.01.23 - [Dev Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 1

2025.01.24 - [Dev Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 2

2025.02.04 - [Dev Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 3

1. ๋“ค์–ด๊ฐ€๋ฉฐ


์ด๋ฒˆ ๊ธ€์—์„œ๋Š” XGBoost ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ์™€์ธ์˜ ํ’ˆ์งˆ์„ ์˜ˆ์ธกํ•˜๊ณ , ๋‹ค์–‘ํ•œ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ด๋ณด๊ฒ ๋‹ค.
๋‹จ์ˆœํ•œ XGBoost ๋ชจ๋ธ์„ ๋จผ์ € ํ•™์Šตํ•œ ํ›„,

  1. StandardScaler๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜,
  2. ์ด์ƒ์น˜ ์ œ๊ฑฐ,
  3. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹
    ๋“ฑ์„ ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ ๋ณ€ํ™”๋ฅผ ํ™•์ธํ•ด ๋ณด๊ฒ ๋‹ค.

2. ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ๋ถ„ํ• 


์™€์ธ ํ’ˆ์งˆ ์˜ˆ์ธก์„ ์œ„ํ•ด Red Wine Quality Dataset์„ ์‚ฌ์šฉํ•˜๋ฉฐ,
์ž…๋ ฅ ๋ณ€์ˆ˜(X)์™€ ์ถœ๋ ฅ ๋ณ€์ˆ˜(y)๋ฅผ ๋ถ„๋ฆฌํ•œ ํ›„ 80:20 ๋น„์œจ๋กœ ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ๋‚˜๋ˆ„๊ฒ ์Œ

 
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y)
X = red_wine.drop(columns=['quality'])
y = red_wine['quality']

# Split into 80% training / 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
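The red_wine DataFrame above is assumed to have been loaded in the earlier posts of this series. As a reminder, the UCI wine-quality CSV files are semicolon-separated; here is a minimal, self-contained sketch of the load, using a two-row inline sample instead of the real file:

```python
import io

import pandas as pd

# Two illustrative rows in the semicolon-separated UCI format (values are made up)
sample = io.StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)
red_wine = pd.read_csv(sample, sep=";")
print(red_wine.shape)  # → (2, 4)
```

For the real data, `pd.read_csv("winequality-red.csv", sep=";")` follows the same pattern (the file name here is an assumption about where the dataset is stored).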

3. Prediction with an XGBoost Model


XGBoost๋Š” ๋ถ€์ŠคํŒ…(Boosting) ๊ธฐ๋ฐ˜์˜ ๊ฐ•๋ ฅํ•œ ํšŒ๊ท€ ๋ชจ๋ธ๋กœ, ์™€์ธ์˜ ํ’ˆ์งˆ์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์ ํ•ฉํ•จ

 

XGBoost ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์˜ˆ์ธก:

from xgboost import XGBRegressor

# Define and train the XGBoost model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Model evaluation helper
def evaluate_model(y_true, y_pred, model_name="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R² Score: {r2:.4f}\n")

# Evaluate the baseline model
evaluate_model(y_test, y_pred_xgb, "XGBoost (Baseline)")

# XGBoost (Baseline) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625


# 	Actual Quality	Predicted Quality
# 803		6		5.230904
# 124		5		5.347796
# 350		6		5.213949
# 682		5		5.278363
# 1326		6		5.996580
# 976		5		5.008490
# 1493		5		5.004802
# 706		5		5.094469
# 613		5		6.004089
# 1587		6		5.787866
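The actual-vs-predicted table above can be built by aligning the prediction array with y_test's index. A self-contained sketch of the pattern, with made-up values standing in for y_test and y_pred_xgb from the code above:

```python
import pandas as pd

# Stand-ins for y_test and y_pred_xgb (values are illustrative, not real results)
y_test = pd.Series([6, 5, 6], index=[803, 124, 350], name="quality")
y_pred_xgb = [5.230904, 5.347796, 5.213949]

# Predictions come back as a plain array, so reuse y_test's index to align rows
comparison = pd.DataFrame(
    {"Actual Quality": y_test, "Predicted Quality": y_pred_xgb},
    index=y_test.index,
)
print(comparison.head(10))
```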
 

 

๋ชจ๋ธ ํ‰๊ฐ€ ๊ฒฐ๊ณผ

Metric Score
MAE 0.4175
MSE 0.3513
R² Score 0.4625


4. Performance Comparison after Scaling with StandardScaler


์Šค์ผ€์ผ๋ง์ด ํ•„์š”ํ•œ ์ด์œ ?

  • XGBoost๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ •๊ทœํ™” ์—†์ด๋„ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ,
  • ๋ฐ์ดํ„ฐ ์Šค์ผ€์ผ์ด ํฐ ๊ฒฝ์šฐ ๋ชจ๋ธ ์ˆ˜๋ ด ์†๋„ ๊ฐœ์„  ๋ฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ์Œ

StandardScaler ์ ์šฉ ํ›„ XGBoost ํ•™์Šต

from sklearn.preprocessing import StandardScaler

# Apply feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Retrain XGBoost on the scaled data
xgb_model_scaled = XGBRegressor(random_state=42)
xgb_model_scaled.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred_xgb_scaled = xgb_model_scaled.predict(X_test_scaled)
evaluate_model(y_test, y_pred_xgb_scaled, "XGBoost (Scaled Data)")

# XGBoost (Scaled Data) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625

# 	Actual Quality	Predicted Quality (Scaled)
# 803		6		5.230904
# 124		5		5.347796
# 350		6		5.213949
# 682		5		5.278363
# 1326		6		5.996580
# 976		5		5.008490
# 1493		5		5.004802
# 706		5		5.094469
# 613		5		6.004089
# 1587		6		5.787866
 

์Šค์ผ€์ผ๋ง ํ›„ ๋ชจ๋ธ ํ‰๊ฐ€ ๊ฒฐ๊ณผ

Metric Score
MSE 0.3513
R² Score 0.4625
MAE 0.4175

 

๋น„๊ต ๊ฒฐ๊ณผ:

  • Baseline๊ณผ ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„ → ์Šค์ผ€์ผ๋ง์ด XGBoost ๋ชจ๋ธ์—๋Š” ํฐ ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

5. ์ด์ƒ์น˜ ์ œ๊ฑฐ ํ›„ ์„ฑ๋Šฅ ๋น„๊ต


์ด์ƒ์น˜ ์ œ๊ฑฐ๊ฐ€ ํ•„์š”ํ•œ ์ด์œ ?

  • ๋ฐ์ดํ„ฐ์— ๊ทน๋‹จ์ ์ธ ์ด์ƒ์น˜๊ฐ€ ํฌํ•จ๋  ๊ฒฝ์šฐ ๋ชจ๋ธ์ด ๊ณผ์ ํ•ฉ(overfitting)๋˜๊ฑฐ๋‚˜
  • ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ

์ด์ƒ์น˜ ์ œ๊ฑฐ ํ›„ ๋ชจ๋ธ ๊ฐœ์„ ํ•˜๊ธฐ:

# Outlier removal using the IQR (interquartile range) method
import numpy as np

# Compute the IQR per feature
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

# Outlier thresholds (1.5 * IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only rows with no outlying feature values (the test set is left untouched)
X_train_filtered = X_train[~((X_train < lower_bound) | (X_train > upper_bound)).any(axis=1)]
y_train_filtered = y_train.loc[X_train_filtered.index]

# Retrain XGBoost after outlier removal
xgb_model_filtered = XGBRegressor(random_state=42)
xgb_model_filtered.fit(X_train_filtered, y_train_filtered)

# Predict and evaluate
y_pred_xgb_filtered = xgb_model_filtered.predict(X_test)
evaluate_model(y_test, y_pred_xgb_filtered, "XGBoost (Outliers Removed)")

# XGBoost (Outliers Removed) Performance:
# Mean Absolute Error (MAE): 0.4383
# Mean Squared Error (MSE): 0.3492
# R² Score: 0.4656

# 	Actual Quality	Predicted Quality (Outliers Removed)
# 803		6			5.300020
# 124		5			5.235788
# 350		6			4.754515
# 682		5			5.131290
# 1326		6			5.996963
# 976		5			4.997323
# 1493		5			5.347618
# 706		5			5.048402
# 613		5			5.936689
# 1587		6			5.864721
 

์ด์ƒ์น˜ ์ œ๊ฑฐ ํ›„ ๋ชจ๋ธ ํ‰๊ฐ€ ๊ฒฐ๊ณผ

 

Metric Score
MAE 0.4383
MSE 0.3492
R² Score 0.4656

 

๋น„๊ต ๊ฒฐ๊ณผ

  • ์ด์ƒ์น˜ ์ œ๊ฑฐ ํ›„ ์„ฑ๋Šฅ ๋ณ€ํ™”๋Š” ํฌ์ง€ ์•Š์Œ
  • ๋‹ค๋งŒ, MSE๊ฐ€ ์•ฝ๊ฐ„ ๊ฐ์†Œ(R² Score ์ฆ๊ฐ€) → ์ด์ƒ์น˜ ์ œ๊ฑฐ๊ฐ€ ๋ฏธ์„ธํ•˜๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ด

6. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ†ตํ•œ ์ตœ์  ์„ฑ๋Šฅ ์ฐพ๊ธฐ


XGBoost๋Š” ๋‹ค์–‘ํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ
ํŠนํžˆ, n_estimators, max_depth, learning_rate ๋“ฑ์˜ ๊ฐ’์„ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•จ

 

GridSearchCV๋ฅผ ํ™œ์šฉํ•œ ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ ํƒ์ƒ‰:

from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values
param_grid = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
}

# Run the grid search (5-fold CV, scored by R²)
grid_search = GridSearchCV(XGBRegressor(random_state=42),
                           param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best Parameters:", grid_search.best_params_)

# Evaluate the best model
best_xgb = grid_search.best_estimator_
y_pred_xgb_tuned = best_xgb.predict(X_test)
evaluate_model(y_test, y_pred_xgb_tuned, "XGBoost (Tuned)")
 

์ตœ์  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๊ฒฐ๊ณผ

# Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
# XGBoost (Tuned) Performance:
# Mean Absolute Error (MAE): 0.4549
# Mean Squared Error (MSE): 0.3506
# R² Score: 0.4635

# 	Actual Quality	Predicted Quality (Tuned)
# 803		6		5.439800
# 124		5		5.003238
# 350		6		5.149049
# 682		5		5.308735
# 1326		6		5.826034
# 976		5		5.128599
# 1493		5		5.092262
# 706		5		5.184746
# 613		5		5.990964
# 1587		6		5.870989
 

ํŠœ๋‹ ํ›„ ๋ชจ๋ธ ํ‰๊ฐ€ ๊ฒฐ๊ณผ

Metric Score
MAE 0.4549
MSE 0.3506
R² Score 0.4635

 

ํŠœ๋‹ ๊ฒฐ๊ณผ

  • ์„ฑ๋Šฅ์ด Baseline๊ณผ ๊ฑฐ์˜ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ์•ฝ๊ฐ„ ๋‚ฎ์Œ → ๊ธฐ๋ณธ๊ฐ’๋„ ์ถฉ๋ถ„ํžˆ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Œ
  • ํŠœ๋‹์ด ํ•ญ์ƒ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜์ง€๋Š” ์•Š์Œ → ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ๋‹ค๋ฆ„

7. ์ตœ์ข… ์„ฑ๋Šฅ ๋น„๊ต ๋ฐ ๊ฒฐ๋ก 


| Model | MAE | MSE | R² Score |
| --- | --- | --- | --- |
| XGBoost (Baseline) | 0.4175 | 0.3513 | 0.4625 |
| XGBoost (Scaled) | 0.4175 | 0.3513 | 0.4625 |
| XGBoost (Outliers Removed) | 0.4383 | 0.3492 | 0.4656 |
| XGBoost (Tuned) | 0.4549 | 0.3506 | 0.4635 |
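The comparison can also be assembled in pandas for sorting and further inspection (scores copied from the evaluation output above):

```python
import pandas as pd

# Final scores of the four variants, copied from the evaluation output above
results = pd.DataFrame(
    {
        "MAE": [0.4175, 0.4175, 0.4383, 0.4549],
        "MSE": [0.3513, 0.3513, 0.3492, 0.3506],
        "R2 Score": [0.4625, 0.4625, 0.4656, 0.4635],
    },
    index=["Baseline", "Scaled", "Outliers Removed", "Tuned"],
)

# Rank the variants by R² (higher is better)
print(results.sort_values("R2 Score", ascending=False))
```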

 

๊ฒฐ๋ก :

  • Baseline ๋ชจ๋ธ์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Œ
  • ์ด์ƒ์น˜ ์ œ๊ฑฐ๋Š” ์ผ๋ถ€ ๊ฐœ์„  ํšจ๊ณผ๊ฐ€ ์žˆ์—ˆ์Œ
  • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์ด ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ค์ง€๋Š” ์•Š์Œ
๋ฐ˜์‘ํ˜•