[Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 1
Exploratory Data Analysis (EDA) is the first step in data analysis, where data is visually explored, summary statistics are examined, and patterns and characteristics of the dataset are identified. In this post, we will walk through the step-by-step proces
5hr1rnp.tistory.com
Continuing from the previous analysis, we will now load the dataset and examine its basic information and summary statistics.
Loading the Data
# Library Version
# pandas : 2.2.1
# numpy : 1.26.4
# matplotlib: 3.9.2
# library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# data load
# Define CSV file paths
red_wine_path = './wine+quality/winequality-red.csv'
white_wine_path = './wine+quality/winequality-white.csv'
# Load CSV files
red_wine = pd.read_csv(red_wine_path, sep=';')
white_wine = pd.read_csv(white_wine_path, sep=';')
# Display first few rows
red_wine.head()
# fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
# 0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
# 1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
# 2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
# 3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
# 4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
white_wine.head()
# fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
# 0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
# 1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
# 2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
# 3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
# 4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
Because the CSV files use a semicolon (;) as the field delimiter, we pass sep=';' to read_csv() so that the columns are parsed correctly.
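To see why the delimiter matters, here is a minimal sketch that uses a small in-memory sample instead of the real files. The three-column sample string is made up for illustration. With the default comma separator, pandas finds no commas and reads each line as a single column.

```python
import io
import pandas as pd

# Hypothetical three-column sample in the same ';'-separated layout
sample = (
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.70;5\n"
    "7.8;0.88;5\n"
)

# Default sep=',' finds no commas, so every line becomes one wide column
wrong = pd.read_csv(io.StringIO(sample))

# sep=';' splits the fields as intended
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)   # one column
print(right.shape)   # three columns
print(right.dtypes)
```

A quick look at shape right after loading is usually enough to catch a wrong delimiter.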
Exploring the Dataset
1. Basic Information Summary
red_wine.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1599 entries, 0 to 1598
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 fixed acidity 1599 non-null float64
# 1 volatile acidity 1599 non-null float64
# 2 citric acid 1599 non-null float64
# 3 residual sugar 1599 non-null float64
# 4 chlorides 1599 non-null float64
# 5 free sulfur dioxide 1599 non-null float64
# 6 total sulfur dioxide 1599 non-null float64
# 7 density 1599 non-null float64
# 8 pH 1599 non-null float64
# 9 sulphates 1599 non-null float64
# 10 alcohol 1599 non-null float64
# 11 quality 1599 non-null int64
# dtypes: float64(11), int64(1)
# memory usage: 150.0 KB
white_wine.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4898 entries, 0 to 4897
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 fixed acidity 4898 non-null float64
# 1 volatile acidity 4898 non-null float64
# 2 citric acid 4898 non-null float64
# 3 residual sugar 4898 non-null float64
# 4 chlorides 4898 non-null float64
# 5 free sulfur dioxide 4898 non-null float64
# 6 total sulfur dioxide 4898 non-null float64
# 7 density 4898 non-null float64
# 8 pH 4898 non-null float64
# 9 sulphates 4898 non-null float64
# 10 alcohol 4898 non-null float64
# 11 quality 4898 non-null int64
# dtypes: float64(11), int64(1)
# memory usage: 459.3 KB
Red Wine Dataset
- Total Samples: 1,599
- Columns: 12 (11 numerical features + 1 quality score)
- Data Types:
- 11 float64 columns (chemical properties)
- 1 int64 column (quality)
- Missing Values: None
White Wine Dataset
- Total Samples: 4,898
- Columns: 12 (same as red wine)
- Data Types:
- 11 float64 columns
- 1 int64 column (quality)
- Missing Values: None
Both datasets contain no missing values, and the quality column is the only integer variable.
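The non-null counts from info() can also be confirmed directly with isna().sum(). The head() output above also shows that red-wine rows 0 and 4 are identical, so counting duplicated rows may be worthwhile. The sketch below uses only those five rows, reduced to three columns, as a stand-in for the full DataFrame. Whether duplicates should be dropped is a judgment call this post does not make.

```python
import pandas as pd

# First five red-wine rows from head() above, three columns only
rows = pd.DataFrame({
    'fixed acidity':    [7.4, 7.8, 7.8, 11.2, 7.4],
    'volatile acidity': [0.70, 0.88, 0.76, 0.28, 0.70],
    'quality':          [5, 5, 5, 6, 5],
})

# Number of missing values in each column
missing_per_column = rows.isna().sum()

# Number of rows that repeat an earlier row exactly
n_duplicates = int(rows.duplicated().sum())

print(missing_per_column)
print(n_duplicates)
```

On the real data the same two calls would be applied to red_wine and white_wine.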
2. Descriptive Statistics
# round(df, ndigits) calls DataFrame.round(), which rounds half to even
round(red_wine.describe(), 4)
# fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
# count 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000 1599.0000
# mean 8.3196 0.5278 0.2710 2.5388 0.0875 15.8749 46.4678 0.9967 3.3111 0.6581 10.4230 5.6360
# std 1.7411 0.1791 0.1948 1.4099 0.0471 10.4602 32.8953 0.0019 0.1544 0.1695 1.0657 0.8076
# min 4.6000 0.1200 0.0000 0.9000 0.0120 1.0000 6.0000 0.9901 2.7400 0.3300 8.4000 3.0000
# 25% 7.1000 0.3900 0.0900 1.9000 0.0700 7.0000 22.0000 0.9956 3.2100 0.5500 9.5000 5.0000
# 50% 7.9000 0.5200 0.2600 2.2000 0.0790 14.0000 38.0000 0.9968 3.3100 0.6200 10.2000 6.0000
# 75% 9.2000 0.6400 0.4200 2.6000 0.0900 21.0000 62.0000 0.9978 3.4000 0.7300 11.1000 6.0000
# max 15.9000 1.5800 1.0000 15.5000 0.6110 72.0000 289.0000 1.0037 4.0100 2.0000 14.9000 8.0000
round(white_wine.describe(), 4)
# fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
# count 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000 4898.0000
# mean 6.8548 0.2782 0.3342 6.3914 0.0458 35.3081 138.3607 0.9940 3.1883 0.4898 10.5143 5.8779
# std 0.8439 0.1008 0.1210 5.0721 0.0218 17.0071 42.4981 0.0030 0.1510 0.1141 1.2306 0.8856
# min 3.8000 0.0800 0.0000 0.6000 0.0090 2.0000 9.0000 0.9871 2.7200 0.2200 8.0000 3.0000
# 25% 6.3000 0.2100 0.2700 1.7000 0.0360 23.0000 108.0000 0.9917 3.0900 0.4100 9.5000 5.0000
# 50% 6.8000 0.2600 0.3200 5.2000 0.0430 34.0000 134.0000 0.9937 3.1800 0.4700 10.4000 6.0000
# 75% 7.3000 0.3200 0.3900 9.9000 0.0500 46.0000 167.0000 0.9961 3.2800 0.5500 11.4000 6.0000
# max 14.2000 1.1000 1.6600 65.8000 0.3460 289.0000 440.0000 1.0390 3.8200 1.0800 14.2000 9.0000
Feature | Red Wine (Mean ± Std) | White Wine (Mean ± Std) |
Fixed Acidity | 8.32 ± 1.74 | 6.85 ± 0.84 |
Volatile Acidity | 0.53 ± 0.18 | 0.28 ± 0.10 |
Citric Acid | 0.27 ± 0.19 | 0.33 ± 0.12 |
Residual Sugar | 2.54 ± 1.41 | 6.39 ± 5.07 |
Chlorides | 0.088 ± 0.047 | 0.046 ± 0.022 |
Free Sulfur Dioxide | 15.87 ± 10.46 | 35.31 ± 17.01 |
Total Sulfur Dioxide | 46.47 ± 32.90 | 138.36 ± 42.50 |
Density | 0.9967 ± 0.0019 | 0.9940 ± 0.0030 |
pH | 3.31 ± 0.15 | 3.19 ± 0.15 |
Sulphates | 0.66 ± 0.17 | 0.49 ± 0.11 |
Alcohol | 10.42 ± 1.07 | 10.51 ± 1.23 |
Quality Score | 5.64 ± 0.81 | 5.88 ± 0.89 |
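A comparison table like the one above can be generated rather than typed by hand. The helper below is a sketch, and the name mean_std_table is hypothetical. It assumes both DataFrames share the same numeric columns and uses the sample standard deviation, the same one describe() reports.

```python
import pandas as pd

def mean_std_table(red, white, digits=2):
    """Build a 'mean ± std' comparison table from two DataFrames."""
    def fmt(df):
        # Rows: features, columns: mean and std
        stats = df.agg(['mean', 'std']).T
        return stats.apply(
            lambda row: f"{row['mean']:.{digits}f} ± {row['std']:.{digits}f}",
            axis=1,
        )
    return pd.DataFrame({'Red Wine': fmt(red), 'White Wine': fmt(white)})

# Toy data for illustration only
red_toy = pd.DataFrame({'pH': [1.0, 2.0, 3.0]})
white_toy = pd.DataFrame({'pH': [2.0, 4.0, 6.0]})

table = mean_std_table(red_toy, white_toy)
print(table)
```

With the real data, mean_std_table(red_wine, white_wine) would cover all twelve columns at once.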
Key Insights from Data Analysis
(1) Fixed Acidity
- Red Wine: Mean = 8.32, Std = 1.74
- White Wine: Mean = 6.85, Std = 0.84
- Interpretation:
- Red wine has a higher fixed acidity than white wine.
- This may translate into a stronger tartness in red wine, although perceived acidity also depends on pH and sugar.
- The larger standard deviation for red wine (1.74 vs 0.84) indicates a wider spread of acidity levels.
(2) Volatile Acidity
- Red Wine: Mean = 0.53, Std = 0.18
- White Wine: Mean = 0.28, Std = 0.10
- Interpretation:
- Red wine has almost double the volatile acidity compared to white wine.
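- Summary statistics alone do not show how volatile acidity relates to quality; that requires comparing the two columns directly.
- High volatile acidity is generally associated with a vinegar-like off-flavor, so red wine is the more likely candidate for a negative effect on quality.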
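One way to check whether volatile acidity actually moves with quality is a simple correlation. The sketch below uses a made-up four-row DataFrame, and the helper name quality_correlation is hypothetical. It shows the mechanics only and says nothing about the real datasets.

```python
import pandas as pd

def quality_correlation(df, feature, target='quality'):
    """Pearson correlation between one feature column and the quality score."""
    return df[feature].corr(df[target])

# Toy data: quality falls as volatile acidity rises
toy = pd.DataFrame({
    'volatile acidity': [0.2, 0.4, 0.6, 0.8],
    'quality': [7, 6, 5, 4],
})

r = quality_correlation(toy, 'volatile acidity')
print(round(r, 3))
```

Running the same call on red_wine and white_wine separately would show whether the relationship differs by wine type.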
(3) Citric Acid
- Red Wine: Mean = 0.27, Std = 0.19
- White Wine: Mean = 0.33, Std = 0.12
- Interpretation:
- White wine contains more citric acid on average, which may contribute to a fresher, crisper taste.
- Red wine has a larger standard deviation (0.19 vs 0.12), meaning greater variability in citric acid levels.
(4) Residual Sugar
- Red Wine: Mean = 2.54, Max = 15.5
- White Wine: Mean = 6.39, Max = 65.8
- Interpretation:
- White wine has much higher residual sugar on average (6.39 vs 2.54), so it tends to be sweeter.
- White wine's maximum (65.8) is far above its 75th percentile (9.9), and red wine's maximum (15.5) is far above its 75th percentile (2.6), which suggests outliers in both datasets.
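A common rule of thumb flags values above Q3 + 1.5 × IQR as potential outliers. The sketch below applies that rule to the quartiles reported by describe() above. The 1.5 multiplier is a convention rather than something this dataset prescribes, and the function name is hypothetical.

```python
def iqr_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence under the k * IQR rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)

# (25%, 75%, max) for residual sugar, copied from describe() above
residual_sugar = {
    'red':   (1.9, 2.6, 15.5),
    'white': (1.7, 9.9, 65.8),
}

fences = {}
for wine, (q1, q3, max_value) in residual_sugar.items():
    fences[wine] = iqr_upper_fence(q1, q3)
    print(f"{wine}: fence = {fences[wine]:.2f}, max = {max_value}, "
          f"max above fence: {max_value > fences[wine]}")
```

The fences come out at roughly 3.65 for red and 22.2 for white, so the maximum in each dataset lies well beyond its fence.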
(5) Chlorides
- Red Wine: Mean = 0.087, Max = 0.611
- White Wine: Mean = 0.046, Max = 0.346
- Interpretation:
- Red wine has almost twice the mean chloride concentration of white wine.
- Higher chloride levels may negatively affect wine quality, but the summary statistics here do not test that.
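A quick way to probe the chloride and quality relationship is to average chlorides within each quality score. The values below are invented to show the mechanics, not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'chlorides': [0.10, 0.12, 0.08, 0.06, 0.05],
    'quality':   [4, 4, 5, 6, 6],
})

# Mean chloride level and sample count per quality score
by_quality = toy.groupby('quality')['chlorides'].agg(['mean', 'count'])
print(by_quality)
```

Keeping the count column next to the mean helps avoid over-reading quality scores that have only a handful of samples.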
(6) Free Sulfur Dioxide
- Red Wine: Mean = 15.87, Max = 72
- White Wine: Mean = 35.31, Max = 289
- Interpretation:
- White wine has more than twice the free sulfur dioxide of red wine on average.
- A likely reason is that white wine relies more on sulfur dioxide to prevent oxidation and maintain freshness.
(7) Total Sulfur Dioxide
- Red Wine: Mean = 46.47, Max = 289
- White Wine: Mean = 138.36, Max = 440
- Interpretation:
- White wine contains almost three times the total sulfur dioxide.
- This highlights an important preservation difference between red and white wines.
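Ratio claims such as "almost double" and "almost three times" can be checked directly from the means in the describe() output above. This is plain arithmetic on the reported four-decimal values.

```python
# (red mean, white mean), copied from the describe() output above
means = {
    'volatile acidity':     (0.5278, 0.2782),
    'chlorides':            (0.0875, 0.0458),
    'free sulfur dioxide':  (15.8749, 35.3081),
    'total sulfur dioxide': (46.4678, 138.3607),
}

ratios = {}
for feature, (red, white) in means.items():
    higher = 'red' if red > white else 'white'
    ratios[feature] = max(red, white) / min(red, white)
    print(f"{feature}: {higher} is {ratios[feature]:.2f}x higher")
```

The ratios come out at roughly 1.9 for volatile acidity and chlorides (red higher), and about 2.2 and 3.0 for free and total sulfur dioxide (white higher).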
(8) Density
- Red Wine: Mean = 0.9967
- White Wine: Mean = 0.9940
- Interpretation:
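- White wine has a slightly lower mean density (a difference of about 0.0027). Alcohol lowers density while sugar raises it, so white wine's higher residual sugar does not explain the lower value; the summary statistics alone do not identify the cause.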
(9) pH
- Red Wine: Mean = 3.31
- White Wine: Mean = 3.19
- Interpretation:
- Red wine has a slightly higher pH, meaning it is less acidic than white wine.
- This could result in a smoother, rounder taste for red wine.
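Because pH is a base-10 logarithmic scale (pH = -log10 of the hydrogen ion concentration), a small difference in pH corresponds to a larger relative difference in acidity. The sketch below converts the gap between the two mean pH values into a concentration ratio. Treat it as a rough guide, since the mean of pH values is not the same as the pH of the mean concentration.

```python
# Mean pH values from the describe() output above
red_ph, white_ph = 3.3111, 3.1883

# Each pH unit is a factor of 10 in hydrogen ion concentration
concentration_ratio = 10 ** (red_ph - white_ph)

print(f"pH gap: {red_ph - white_ph:.4f}")
print(f"white/red hydrogen ion ratio: {concentration_ratio:.2f}")
```

A gap of about 0.12 pH units works out to roughly a one-third higher hydrogen ion concentration for white wine.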
(10) Alcohol Content
- Red Wine: Mean = 10.42, Max = 14.9
- White Wine: Mean = 10.51, Max = 14.2
- Interpretation:
- Both wines have similar alcohol distributions, with white wine being slightly higher on average.
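Side-by-side comparisons are easier when both datasets live in one DataFrame with a label column. The sketch below uses small made-up frames, and the column name type is an assumption rather than something from the original files.

```python
import pandas as pd

# Toy stand-ins for red_wine and white_wine
red_toy = pd.DataFrame({'alcohol': [9.4, 9.8, 10.5], 'quality': [5, 5, 6]})
white_toy = pd.DataFrame({'alcohol': [8.8, 9.5, 10.1, 11.4], 'quality': [6, 6, 6, 7]})

# Tag each frame with its wine type, then stack them
wine = pd.concat(
    [red_toy.assign(type='red'), white_toy.assign(type='white')],
    ignore_index=True,
)

# One describe() row per wine type
summary = wine.groupby('type')['alcohol'].describe()
print(summary)
```

The same pattern with red_wine and white_wine would reproduce the alcohol statistics for both types in a single table.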
(11) Quality Score
- Red Wine: Mean = 5.64, Max = 8
- White Wine: Mean = 5.88, Max = 9
- Interpretation:
- White wine has a slightly higher average quality score.
- The highest recorded quality score is higher for white wine.
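Since quality is an integer score, its mean hides how the scores are distributed. A normalized value_counts() shows the share of samples at each score. The scores below are invented for illustration.

```python
import pandas as pd

# Toy quality scores for illustration only
toy_quality = pd.Series([5, 5, 6, 6, 6, 7, 3, 8], name='quality')

# Share of samples at each score, ordered by score
share = toy_quality.value_counts(normalize=True).sort_index()
print(share)
```

Applied to red_wine['quality'] and white_wine['quality'], this would show how concentrated each dataset is around the middle scores.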
Summary of Findings
- Acidity: Red wine has higher fixed and volatile acidity, while white wine has a slightly lower mean pH (3.19 vs 3.31).
- Sweetness: White wine has significantly more residual sugar.
- Sulfur Dioxide: White wine contains roughly two to three times as much free and total sulfur dioxide on average.
- Quality: White wine has a slightly higher average quality score.