1. Introduction
In this section, we will train an XGBoost regression model to predict wine quality and compare its performance after applying several techniques:
- Data scaling using StandardScaler
- Outlier removal
- Hyperparameter tuning
We will first train a basic XGBoost model, then apply these techniques to analyze their impact on model performance.
2. Data Preparation and Splitting
We will use the Red Wine Quality dataset (1,599 samples with 11 physicochemical input features) for this task. The input variables (X) and target variable (y) will be separated, and the dataset will be split into an 80:20 train-test ratio.
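The snippet below is a minimal sketch of loading the data; the file name winequality-red.csv and its location are assumptions (the UCI copy of this dataset is semicolon-separated).
import pandas as pd
# Load the Red Wine Quality dataset (file name/path assumed;
# the UCI CSV uses ';' as its separator)
red_wine = pd.read_csv('winequality-red.csv', sep=';')
print(red_wine.shape)  # (1599, 12): 11 input features plus 'quality'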
from sklearn.model_selection import train_test_split
# Separate input (X) and target (y) variables
X = red_wine.drop(columns=['quality'])
y = red_wine['quality']
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. XGBoost Model for Prediction
XGBoost is a gradient-boosted decision tree ensemble known for strong predictive performance on tabular data.
Train and Predict using XGBoost
from xgboost import XGBRegressor
# Define and train the XGBoost model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Function to evaluate model performance
def evaluate_model(y_true, y_pred, model_name="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R² Score: {r2:.4f}\n")
# Evaluate baseline model
evaluate_model(y_test, y_pred_xgb, "XGBoost (Baseline)")
# XGBoost (Baseline) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625
# Actual Quality Predicted Quality
# 803 6 5.230904
# 124 5 5.347796
# 350 6 5.213949
# 682 5 5.278363
# 1326 6 5.996580
# 976 5 5.008490
# 1493 5 5.004802
# 706 5 5.094469
# 613 5 6.004089
# 1587 6 5.787866
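The actual-vs-predicted table above can be produced with a short helper like the one below (a sketch; which rows appear depends on the random sample drawn).
import pandas as pd
# Compare a random sample of actual and predicted quality scores
comparison = pd.DataFrame({
    'Actual Quality': y_test,
    'Predicted Quality': y_pred_xgb,
})
print(comparison.sample(10, random_state=42))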
Baseline Model Performance
| Metric | Score |
|---|---|
| MAE | 0.4175 |
| MSE | 0.3513 |
| R² Score | 0.4625 |
4. Applying StandardScaler for Data Scaling
Why Scaling?
XGBoost is tree-based, and its splits depend only on the ordering of feature values, so it does not require feature scaling. We apply StandardScaler anyway to confirm empirically that it has no effect; for many other model families, scaling does matter when features have very different ranges.
Train XGBoost with Scaled Data
from sklearn.preprocessing import StandardScaler
# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train XGBoost on scaled data
xgb_model_scaled = XGBRegressor(random_state=42)
xgb_model_scaled.fit(X_train_scaled, y_train)
# Make predictions and evaluate performance
y_pred_xgb_scaled = xgb_model_scaled.predict(X_test_scaled)
evaluate_model(y_test, y_pred_xgb_scaled, "XGBoost (Scaled Data)")
# XGBoost (Scaled Data) Performance:
# Mean Absolute Error (MAE): 0.4175
# Mean Squared Error (MSE): 0.3513
# R² Score: 0.4625
# Actual Quality Predicted Quality (Scaled)
# 803 6 5.230904
# 124 5 5.347796
# 350 6 5.213949
# 682 5 5.278363
# 1326 6 5.996580
# 976 5 5.008490
# 1493 5 5.004802
# 706 5 5.094469
# 613 5 6.004089
# 1587 6 5.787866
Performance after Scaling
| Metric | Score |
|---|---|
| MAE | 0.4175 |
| MSE | 0.3513 |
| R² Score | 0.4625 |
Comparison with Baseline:
- Performance is identical to the baseline, down to the individual predictions.
- This is expected: tree splits depend only on feature orderings, which StandardScaler (a monotonic transformation) preserves.
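As a quick check, you can confirm that the two models produce the same outputs (in this run they match exactly; tiny floating-point differences are possible in general).
import numpy as np
# The scaled and unscaled models should yield the same predictions,
# since standardization preserves feature orderings
print(np.allclose(y_pred_xgb, y_pred_xgb_scaled))  # True in this run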
5. Outlier Removal and Performance Comparison
Why Remove Outliers?
Extreme outliers can distort the training signal, leading to overfitting or reduced predictive performance.
# Remove outliers using the IQR (interquartile range) method
# Calculate IQR
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Select only non-outlier data
X_train_filtered = X_train[~((X_train < lower_bound) | (X_train > upper_bound)).any(axis=1)]
y_train_filtered = y_train.loc[X_train_filtered.index]
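# Sanity check (an optional addition): report how many training rows were
# dropped as outliers; the exact counts depend on the random split
print(f"Training rows: {len(X_train)} -> {len(X_train_filtered)} after filtering")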
# Train XGBoost on filtered data
xgb_model_filtered = XGBRegressor(random_state=42)
xgb_model_filtered.fit(X_train_filtered, y_train_filtered)
# Predict and evaluate
y_pred_xgb_filtered = xgb_model_filtered.predict(X_test)
evaluate_model(y_test, y_pred_xgb_filtered, "XGBoost (Outliers Removed)")
# XGBoost (Outliers Removed) Performance:
# Mean Absolute Error (MAE): 0.4383
# Mean Squared Error (MSE): 0.3492
# R² Score: 0.4656
# Actual Quality Predicted Quality (Outliers Removed)
# 803 6 5.300020
# 124 5 5.235788
# 350 6 4.754515
# 682 5 5.131290
# 1326 6 5.996963
# 976 5 4.997323
# 1493 5 5.347618
# 706 5 5.048402
# 613 5 5.936689
# 1587 6 5.864721
Performance after Outlier Removal
| Metric | Score |
|---|---|
| MAE | 0.4383 |
| MSE | 0.3492 |
| R² Score | 0.4656 |
Comparison with Baseline:
- MSE decreased slightly and the R² score improved marginally, although MAE increased.
- Outlier removal yields a small, mixed improvement rather than a clear win.
6. Hyperparameter Tuning for Performance Optimization
Why Tune Hyperparameters?
Optimizing n_estimators, max_depth, and learning_rate can boost model performance.
Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [100, 300, 500, 1000],
'max_depth': [3, 5, 7, 9],
'learning_rate': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
}
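# Note: the grid spans 4 x 4 x 7 = 112 parameter combinations; with cv=5
# that is 560 model fits, so n_jobs=-1 (use all CPU cores) is worthwhile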
# Perform GridSearchCV
grid_search = GridSearchCV(XGBRegressor(random_state=42),
param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best Parameters:", grid_search.best_params_)
# Retrieve the best model (GridSearchCV refits it on the full training set by default)
best_xgb = grid_search.best_estimator_
y_pred_xgb_tuned = best_xgb.predict(X_test)
evaluate_model(y_test, y_pred_xgb_tuned, "XGBoost (Tuned)")
Best Hyperparameters Found
# Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
# XGBoost (Tuned) Performance:
# Mean Absolute Error (MAE): 0.4549
# Mean Squared Error (MSE): 0.3506
# R² Score: 0.4635
# Actual Quality Predicted Quality (Tuned)
# 803 6 5.439800
# 124 5 5.003238
# 350 6 5.149049
# 682 5 5.308735
# 1326 6 5.826034
# 976 5 5.128599
# 1493 5 5.092262
# 706 5 5.184746
# 613 5 5.990964
# 1587 6 5.870989
Performance After Tuning
| Metric | Score |
|---|---|
| MAE | 0.4549 |
| MSE | 0.3506 |
| R² Score | 0.4635 |
Comparison with Baseline:
- MSE and R² are essentially unchanged from the baseline, while MAE is noticeably worse.
- Tuning does not always guarantee better results, as it depends on dataset characteristics.
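When the grid grows large, a randomized search is a common, cheaper alternative to an exhaustive grid; below is a minimal sketch, with parameter ranges that are illustrative rather than tuned.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
# Sample a fixed number of configurations instead of trying every combination
param_dist = {
    'n_estimators': randint(100, 1001),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.29),  # uniform on [0.01, 0.30]
}
random_search = RandomizedSearchCV(XGBRegressor(random_state=42),
                                   param_dist, n_iter=30, cv=5,
                                   scoring='r2', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print("Best Parameters:", random_search.best_params_)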
7. Final Model Comparison and Conclusion
| Model | MAE | MSE | R² Score |
|---|---|---|---|
| XGBoost (Baseline) | 0.4175 | 0.3513 | 0.4625 |
| XGBoost (Scaled) | 0.4175 | 0.3513 | 0.4625 |
| XGBoost (Outliers Removed) | 0.4383 | 0.3492 | 0.4656 |
| XGBoost (Tuned) | 0.4549 | 0.3506 | 0.4635 |
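The table above can also be assembled programmatically; a sketch, assuming the y_pred_* arrays from the earlier sections are still in scope:
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Recompute each model's metrics from its stored test-set predictions
predictions = {
    'XGBoost (Baseline)': y_pred_xgb,
    'XGBoost (Scaled)': y_pred_xgb_scaled,
    'XGBoost (Outliers Removed)': y_pred_xgb_filtered,
    'XGBoost (Tuned)': y_pred_xgb_tuned,
}
summary = pd.DataFrame({
    name: {'MAE': mean_absolute_error(y_test, pred),
           'MSE': mean_squared_error(y_test, pred),
           'R² Score': r2_score(y_test, pred)}
    for name, pred in predictions.items()
}).T.round(4)
print(summary)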
Key Takeaways:
- No single configuration won across the board: the baseline achieved the best MAE, while outlier removal gave the best MSE and R² score.
- Outlier removal brought a small improvement in MSE and R², at the cost of a higher MAE.
- Hyperparameter tuning did not lead to meaningful improvements; its benefit depends on the dataset and the search space chosen.