
[Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 4

5hr1rnp 2025. 2. 10. 17:13
๋ฐ˜์‘ํ˜•

 

2025.02.10 - [Development Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 1

2025.02.10 - [Development Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 2

2025.02.10 - [Development Code/A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 3

 

1. Introduction


In this section, we will use the XGBoost regression model to predict wine quality and compare performance after applying various techniques such as:

  • Data scaling using StandardScaler
  • Outlier removal
  • Hyperparameter tuning

We will first train a basic XGBoost model, then apply these techniques to analyze their impact on model performance.


2. Data Preparation and Splitting


We will use the Red Wine Quality Dataset for this task. The input variables (X) and target variable (y) will be separated, and the dataset will be split into an 80:20 train-test ratio.
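The red_wine DataFrame is the same one loaded in Part 1 of this series. For reference, a minimal loading sketch (assuming the semicolon-delimited CSV from the UCI repository, saved locally; the file name here is an assumption, adjust it to match Part 1):

import pandas as pd

# Load the red wine data (assumed local file name; see Part 1)
red_wine = pd.read_csv('winequality-red.csv', sep=';')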

from sklearn.model_selection import train_test_split

# Separate input (X) and target (y) variables
X = red_wine.drop(columns=['quality'])
y = red_wine['quality']

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
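A quick optional check of the resulting split sizes:

# Confirm the 80/20 split
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)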

3. XGBoost Model for Prediction


XGBoost is a gradient-boosting ensemble model known for strong predictive performance, especially on tabular data.

Train and Predict using XGBoost

from xgboost import XGBRegressor

# Define and train the XGBoost model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Function to evaluate model performance
def evaluate_model(y_true, y_pred, model_name="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R² Score: {r2:.4f}\n")

# Evaluate baseline model
evaluate_model(y_test, y_pred_xgb, "XGBoost (Baseline)")

# XGBoost (Baseline) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625


# 	Actual Quality	Predicted Quality
# 803		6		5.230904
# 124		5		5.347796
# 350		6		5.213949
# 682		5		5.278363
# 1326		6		5.996580
# 976		5		5.008490
# 1493		5		5.004802
# 706		5		5.094469
# 613		5		6.004089
# 1587		6		5.787866
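The actual-vs-predicted table above can be generated with a short snippet like the following (not part of the original code; y_test keeps its original row indices, which is why indices such as 803 appear):

import pandas as pd

# Side-by-side view of the first ten test predictions
comparison = pd.DataFrame({'Actual Quality': y_test,
                           'Predicted Quality': y_pred_xgb})
print(comparison.head(10))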

Baseline Model Performance

Metric        Score
MAE           0.4175
MSE           0.3513
R² Score      0.4625

 


4. Applying StandardScaler for Data Scaling


Why Scaling?

XGBoost is a tree-based model and does not require feature scaling, but standardizing features is such a common preprocessing step that it is worth applying here and measuring its effect empirically.
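For reference, StandardScaler transforms each feature to z = (x − μ) / σ, using the mean and standard deviation learned from the training data. A tiny self-contained illustration (toy numbers, not the wine data):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Each column is mapped to (x - mean) / std, so every column
# ends up with mean 0 and standard deviation 1
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])
print(StandardScaler().fit_transform(toy))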

Train XGBoost with Scaled Data

from sklearn.preprocessing import StandardScaler

# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train XGBoost on scaled data
xgb_model_scaled = XGBRegressor(random_state=42)
xgb_model_scaled.fit(X_train_scaled, y_train)

# Make predictions and evaluate performance
y_pred_xgb_scaled = xgb_model_scaled.predict(X_test_scaled)
evaluate_model(y_test, y_pred_xgb_scaled, "XGBoost (Scaled Data)")

# XGBoost (Scaled Data) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625

# 	Actual Quality	Predicted Quality (Scaled)
# 803		6		5.230904
# 124		5		5.347796
# 350		6		5.213949
# 682		5		5.278363
# 1326		6		5.996580
# 976		5		5.008490
# 1493		5		5.004802
# 706		5		5.094469
# 613		5		6.004089
# 1587		6		5.787866

Performance after Scaling

Metric        Score
MAE           0.4175
MSE           0.3513
R² Score      0.4625

Comparison with Baseline:

  • Performance is identical to the baseline model, down to the individual predictions.
  • This is expected: tree splits depend only on the ordering of feature values, which standardization preserves, so scaling does not affect XGBoost.

5. Outlier Removal and Performance Comparison


Why Remove Outliers?

Extreme outliers can lead to overfitting or reduce predictive performance.

# Remove outliers using the IQR (interquartile range) method
import numpy as np

# Calculate IQR
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Select only non-outlier data
X_train_filtered = X_train[~((X_train < lower_bound) | (X_train > upper_bound)).any(axis=1)]
y_train_filtered = y_train.loc[X_train_filtered.index]

# Train XGBoost on filtered data
xgb_model_filtered = XGBRegressor(random_state=42)
xgb_model_filtered.fit(X_train_filtered, y_train_filtered)

# Predict and evaluate
y_pred_xgb_filtered = xgb_model_filtered.predict(X_test)
evaluate_model(y_test, y_pred_xgb_filtered, "XGBoost (Outliers Removed)")

# XGBoost (Outliers Removed) Performance:
# Mean Absolute Error (MAE): 0.4383
# Mean Squared Error (MSE): 0.3492
# R² Score: 0.4656

# 	Actual Quality	Predicted Quality (Outliers Removed)
# 803		6		5.300020
# 124		5		5.235788
# 350		6		4.754515
# 682		5		5.131290
# 1326		6		5.996963
# 976		5		4.997323
# 1493		5		5.347618
# 706		5		5.048402
# 613		5		5.936689
# 1587		6		5.864721

Performance after Outlier Removal

Metric        Score
MAE           0.4383
MSE           0.3492
R² Score      0.4656

Comparison with Baseline:

  • MSE decreased slightly and the R² score improved marginally, but MAE actually increased.
  • The net effect of outlier removal is therefore mixed and small; it is not a clear win on this dataset (see the row-count check below).
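One detail worth checking (not shown in the original output) is how many training rows the IQR filter discards; filtering on all features at once can remove a sizeable share of the data:

# Sanity check on the size of the filtered training set
print(f"Rows before filtering: {len(X_train)}")
print(f"Rows after filtering:  {len(X_train_filtered)}")
print(f"Rows removed:          {len(X_train) - len(X_train_filtered)}")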

6. Hyperparameter Tuning for Performance Optimization


Why Tune Hyperparameters?

Optimizing n_estimators, max_depth, and learning_rate can boost model performance.

Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
}

# Perform GridSearchCV
grid_search = GridSearchCV(XGBRegressor(random_state=42),
                           param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Train model with best parameters
best_xgb = grid_search.best_estimator_
y_pred_xgb_tuned = best_xgb.predict(X_test)
evaluate_model(y_test, y_pred_xgb_tuned, "XGBoost (Tuned)")

Best Hyperparameters Found

# Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
# XGBoost (Tuned) Performance:
# Mean Absolute Error (MAE): 0.4549
# Mean Squared Error (MSE): 0.3506
# R² Score: 0.4635

# 	Actual Quality	Predicted Quality (Tuned)
# 803		6		5.439800
# 124		5		5.003238
# 350		6		5.149049
# 682		5		5.308735
# 1326		6		5.826034
# 976		5		5.128599
# 1493		5		5.092262
# 706		5		5.184746
# 613		5		5.990964
# 1587		6		5.870989

Performance After Tuning

Metric        Score
MAE           0.4549
MSE           0.3506
R² Score      0.4635

Comparison with Baseline:

  • Performance is essentially identical to, or marginally worse than, the baseline.
  • Tuning does not guarantee better results; the best cross-validated configuration may not transfer to the held-out test set, as the check below illustrates.
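As a quick diagnostic, comparing the mean cross-validated R² of the best parameter combination (grid_search.best_score_, standard scikit-learn) with the test-set R² shows how well the tuned configuration transfers:

from sklearn.metrics import r2_score

# Mean cross-validated R² of the best parameter combination
print(f"Best CV R²: {grid_search.best_score_:.4f}")

# R² of the same model on the held-out test set
print(f"Test R²:    {r2_score(y_test, y_pred_xgb_tuned):.4f}")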

7. Final Model Comparison and Conclusion


 

Model                         MAE      MSE      R² Score
XGBoost (Baseline)            0.4175   0.3513   0.4625
XGBoost (Scaled)              0.4175   0.3513   0.4625
XGBoost (Outliers Removed)    0.4383   0.3492   0.4656
XGBoost (Tuned)               0.4549   0.3506   0.4635
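The table above can also be built programmatically from the stored predictions; a sketch:

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Recompute the test metrics for each model variant
predictions = {
    'XGBoost (Baseline)': y_pred_xgb,
    'XGBoost (Scaled)': y_pred_xgb_scaled,
    'XGBoost (Outliers Removed)': y_pred_xgb_filtered,
    'XGBoost (Tuned)': y_pred_xgb_tuned,
}
summary = pd.DataFrame({
    name: {'MAE': mean_absolute_error(y_test, pred),
           'MSE': mean_squared_error(y_test, pred),
           'R² Score': r2_score(y_test, pred)}
    for name, pred in predictions.items()
}).T.round(4)
print(summary)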

Key Takeaways:

  • The baseline model achieved the lowest MAE, while outlier removal gave the best MSE and R² score; the differences are all small.
  • Scaling had no effect, as expected for a tree-based model.
  • Hyperparameter tuning did not lead to meaningful improvements on this dataset.
๋ฐ˜์‘ํ˜•