
[Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 2

5hr1rnp 2025. 2. 10. 16:27
๋ฐ˜์‘ํ˜•

 

2025.02.10 - [Development Code/Artificial Intelligence A.I.] - [Python][AI] Exploratory Data Analysis (EDA) - Wine Quality Dataset - 1

 


5hr1rnp.tistory.com

 

  Continuing from the previous analysis, we will now load the dataset and examine its basic information and summary statistics.


Loading the Data


# Library Version
# pandas    : 2.2.1
# numpy     : 1.26.4
# matplotlib: 3.9.2

# library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data load
# Define CSV file paths
red_wine_path = './wine+quality/winequality-red.csv'
white_wine_path = './wine+quality/winequality-white.csv'

# Load CSV files
red_wine = pd.read_csv(red_wine_path, sep=';')
white_wine = pd.read_csv(white_wine_path, sep=';')

# Display first few rows
red_wine.head()
#	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
# 0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
# 1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
# 2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
# 3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
# 4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

white_wine.head()
#	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
# 0	7.0	0.27	0.36	20.7	0.045	45.0	170.0	1.0010	3.00	0.45	8.8	6
# 1	6.3	0.30	0.34	1.6	0.049	14.0	132.0	0.9940	3.30	0.49	9.5	6
# 2	8.1	0.28	0.40	6.9	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6
# 3	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6
# 4	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6
 

These CSV files use a semicolon (;) as the field delimiter instead of a comma, so we pass sep=';' to read_csv() to split each row into its 12 columns.
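To see what the delimiter changes, the sketch below parses a small semicolon-separated sample with and without sep=';'. The sample text is made up for illustration and is not copied from the dataset files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout
sample = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# The default separator is a comma, so each line is read as one column
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';' the three fields become three columns
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

A quick look at DataFrame.shape right after loading is an easy way to catch a wrong delimiter.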


Exploring the Dataset


1. Basic Information Summary

red_wine.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1599 entries, 0 to 1598
# Data columns (total 12 columns):
#  #   Column                Non-Null Count  Dtype  
# ---  ------                --------------  -----  
#  0   fixed acidity         1599 non-null   float64
#  1   volatile acidity      1599 non-null   float64
#  2   citric acid           1599 non-null   float64
#  3   residual sugar        1599 non-null   float64
#  4   chlorides             1599 non-null   float64
#  5   free sulfur dioxide   1599 non-null   float64
#  6   total sulfur dioxide  1599 non-null   float64
#  7   density               1599 non-null   float64
#  8   pH                    1599 non-null   float64
#  9   sulphates             1599 non-null   float64
#  10  alcohol               1599 non-null   float64
#  11  quality               1599 non-null   int64  
# dtypes: float64(11), int64(1)
# memory usage: 150.0 KB

white_wine.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4898 entries, 0 to 4897
# Data columns (total 12 columns):
#  #   Column                Non-Null Count  Dtype  
# ---  ------                --------------  -----  
#  0   fixed acidity         4898 non-null   float64
#  1   volatile acidity      4898 non-null   float64
#  2   citric acid           4898 non-null   float64
#  3   residual sugar        4898 non-null   float64
#  4   chlorides             4898 non-null   float64
#  5   free sulfur dioxide   4898 non-null   float64
#  6   total sulfur dioxide  4898 non-null   float64
#  7   density               4898 non-null   float64
#  8   pH                    4898 non-null   float64
#  9   sulphates             4898 non-null   float64
#  10  alcohol               4898 non-null   float64
#  11  quality               4898 non-null   int64  
# dtypes: float64(11), int64(1)
# memory usage: 459.3 KB
 

Red Wine Dataset

  • Total Samples: 1,599
  • Columns: 12 (11 numerical features + 1 quality score)
  • Data Types:
    • 11 float64 columns (chemical properties)
    • 1 int64 column (quality)
  • Missing Values: None

White Wine Dataset

  • Total Samples: 4,898
  • Columns: 12 (same as red wine)
  • Data Types:
    • 11 float64 columns
    • 1 int64 column (quality)
  • Missing Values: None

Both datasets contain no missing values, and the quality column is the only integer variable.
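The same facts can be checked programmatically instead of being read off the info() output. This is a minimal sketch; summarize_structure and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> dict:
    """Return row/column counts, total missing cells, and integer-typed columns."""
    integer_columns = [
        name for name in df.columns
        if pd.api.types.is_integer_dtype(df[name])
    ]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
        "integer_columns": integer_columns,
    }

# Hypothetical toy frame standing in for red_wine / white_wine
toy = pd.DataFrame({"alcohol": [9.4, np.nan, 9.8], "quality": [5, 5, 6]})
result = summarize_structure(toy)
print(result)
```

If the loaded files match the output above, summarize_structure(red_wine) should report 1,599 rows, 12 columns, no missing values, and quality as the only integer column.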


2. Descriptive Statistics

 
# round(df, ndigits) delegates to DataFrame.round(), which rounds half to even
round(red_wine.describe(), 4)

# 	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
# count	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000	1599.0000
# mean	8.3196	0.5278	0.2710	2.5388	0.0875	15.8749	46.4678	0.9967	3.3111	0.6581	10.4230	5.6360
# std	1.7411	0.1791	0.1948	1.4099	0.0471	10.4602	32.8953	0.0019	0.1544	0.1695	1.0657	0.8076
# min	4.6000	0.1200	0.0000	0.9000	0.0120	1.0000	6.0000	0.9901	2.7400	0.3300	8.4000	3.0000
# 25%	7.1000	0.3900	0.0900	1.9000	0.0700	7.0000	22.0000	0.9956	3.2100	0.5500	9.5000	5.0000
# 50%	7.9000	0.5200	0.2600	2.2000	0.0790	14.0000	38.0000	0.9968	3.3100	0.6200	10.2000	6.0000
# 75%	9.2000	0.6400	0.4200	2.6000	0.0900	21.0000	62.0000	0.9978	3.4000	0.7300	11.1000	6.0000
# max	15.9000	1.5800	1.0000	15.5000	0.6110	72.0000	289.0000	1.0037	4.0100	2.0000	14.9000	8.0000

round(white_wine.describe(), 4)

# 	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
# count	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000	4898.0000
# mean	6.8548	0.2782	0.3342	6.3914	0.0458	35.3081	138.3607	0.9940	3.1883	0.4898	10.5143	5.8779
# std	0.8439	0.1008	0.1210	5.0721	0.0218	17.0071	42.4981	0.0030	0.1510	0.1141	1.2306	0.8856
# min	3.8000	0.0800	0.0000	0.6000	0.0090	2.0000	9.0000	0.9871	2.7200	0.2200	8.0000	3.0000
# 25%	6.3000	0.2100	0.2700	1.7000	0.0360	23.0000	108.0000	0.9917	3.0900	0.4100	9.5000	5.0000
# 50%	6.8000	0.2600	0.3200	5.2000	0.0430	34.0000	134.0000	0.9937	3.1800	0.4700	10.4000	6.0000
# 75%	7.3000	0.3200	0.3900	9.9000	0.0500	46.0000	167.0000	0.9961	3.2800	0.5500	11.4000	6.0000
# max	14.2000	1.1000	1.6600	65.8000	0.3460	289.0000	440.0000	1.0390	3.8200	1.0800	14.2000	9.0000

 

Feature                 Red Wine (Mean ± Std)   White Wine (Mean ± Std)
Fixed Acidity           8.32 ± 1.74             6.85 ± 0.84
Volatile Acidity        0.53 ± 0.18             0.28 ± 0.10
Citric Acid             0.27 ± 0.19             0.33 ± 0.12
Residual Sugar          2.54 ± 1.41             6.39 ± 5.07
Chlorides               0.087 ± 0.047           0.046 ± 0.022
Free Sulfur Dioxide     15.87 ± 10.46           35.31 ± 17.01
Total Sulfur Dioxide    46.47 ± 32.90           138.36 ± 42.50
Density                 0.9967 ± 0.0019         0.9940 ± 0.0030
pH                      3.31 ± 0.15             3.19 ± 0.15
Sulphates               0.66 ± 0.17             0.49 ± 0.11
Alcohol                 10.42 ± 1.07            10.51 ± 1.23
Quality Score           5.64 ± 0.81             5.88 ± 0.89
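A side-by-side mean ± std table like this one can be generated from describe() rather than typed by hand. The sketch below is one possible way to do it; mean_std_table and the toy frames are hypothetical and stand in for red_wine and white_wine.

```python
import pandas as pd

def mean_std_table(red: pd.DataFrame, white: pd.DataFrame, ndigits: int = 2) -> pd.DataFrame:
    """Build a 'mean ± std' string table with one row per column."""
    def fmt(df: pd.DataFrame) -> pd.Series:
        stats = df.describe().loc[["mean", "std"]]
        # Applied column by column: each col is a Series indexed by 'mean' and 'std'
        return stats.apply(
            lambda col: f"{col['mean']:.{ndigits}f} ± {col['std']:.{ndigits}f}"
        )
    return pd.DataFrame({"Red Wine": fmt(red), "White Wine": fmt(white)})

# Hypothetical toy data, not taken from the wine files
toy_red = pd.DataFrame({"pH": [3.0, 3.2, 3.4]})
toy_white = pd.DataFrame({"pH": [3.0, 3.1, 3.2]})
table = mean_std_table(toy_red, toy_white)
print(table)
```

Note that describe() reports the sample standard deviation (ddof=1), which is what the table above uses as well.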

 



Key Insights from Data Analysis


(1) Fixed Acidity

  • Red Wine: Mean = 8.32, Std = 1.74
  • White Wine: Mean = 6.85, Std = 0.84
  • Interpretation:
    • Red wine has a higher fixed acidity than white wine.
    • This suggests that red wine tends to have a stronger tartness.
    • The wider distribution in red wine indicates greater variety in acidity levels.

(2) Volatile Acidity

  • Red Wine: Mean = 0.53, Std = 0.18
  • White Wine: Mean = 0.28, Std = 0.10
  • Interpretation:
    • Red wine has almost double the mean volatile acidity of white wine (0.53 vs. 0.28).
    • Volatile acidity may therefore matter more for red wine quality, although summary statistics alone cannot confirm this.
    • White wine’s lower levels make a negative effect on quality less likely, but this also has to be checked against the quality column.
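Whether a feature such as volatile acidity actually relates to quality cannot be read from means and standard deviations; a correlation check is one simple way to look at it. The sketch below is not part of the original analysis. It assumes Pearson correlation is an acceptable first look, and quality_correlations and the toy data are hypothetical.

```python
import pandas as pd

def quality_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each numeric feature with the target, sorted ascending."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values()

# Hypothetical toy data: quality falls as volatile acidity rises
toy = pd.DataFrame({
    "volatile acidity": [0.9, 0.7, 0.5, 0.3],
    "alcohol": [9.0, 9.5, 10.5, 12.0],
    "quality": [4, 5, 6, 7],
})
correlations = quality_correlations(toy)
print(correlations)
```

Running quality_correlations(red_wine) and quality_correlations(white_wine) separately would show whether the sign and strength differ between the two wine types.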

(3) Citric Acid

  • Red Wine: Mean = 0.27, Std = 0.19
  • White Wine: Mean = 0.33, Std = 0.12
  • Interpretation:
    • White wine contains more citric acid on average, which may contribute to a fresher, crisper taste.
    • Red wine shows a wider distribution, meaning greater variability in citric acid levels.

(4) Residual Sugar

  • Red Wine: Mean = 2.54, Max = 15.5
  • White Wine: Mean = 6.39, Max = 65.8
  • Interpretation:
    • White wine has about 2.5 times the mean residual sugar of red wine, so it is sweeter on average.
    • White wine’s maximum (65.8) is far above its 75th percentile (9.9), which points to potential outliers; red wine’s maximum (15.5) is likewise well above its 75th percentile (2.6).
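One common way to make "potential outliers" concrete is Tukey's 1.5 × IQR rule, which the original analysis does not use; it is shown here only as an illustrative check. The quartiles are the 25% and 75% values for residual sugar from the describe() output above, and tukey_upper_fence is a hypothetical helper name.

```python
def tukey_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper outlier fence under Tukey's rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)

# Residual sugar quartiles taken from the describe() tables
white_fence = tukey_upper_fence(q1=1.7, q3=9.9)
red_fence = tukey_upper_fence(q1=1.9, q3=2.6)

# Compare each dataset's maximum with its fence
white_max_is_outlier = 65.8 > white_fence
red_max_is_outlier = 15.5 > red_fence

print(round(white_fence, 2), white_max_is_outlier)
print(round(red_fence, 2), red_max_is_outlier)
```

Under this rule both maxima lie above their fences, so the red wine maximum deserves the same scrutiny as the white wine one.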

(5) Chlorides

  • Red Wine: Mean = 0.087, Max = 0.611
  • White Wine: Mean = 0.046, Max = 0.346
  • Interpretation:
    • Red wine has almost twice the mean chloride concentration of white wine.
    • Higher chloride levels may negatively affect wine quality, which can be checked against the quality column.

(6) Free Sulfur Dioxide

  • Red Wine: Mean = 15.87, Max = 72
  • White Wine: Mean = 35.31, Max = 289
  • Interpretation:
    • White wine has over twice the free sulfur dioxide compared to red wine.
    • This is consistent with white wine generally needing more sulfur dioxide to prevent oxidation and maintain freshness.

(7) Total Sulfur Dioxide

  • Red Wine: Mean = 46.47, Max = 289
  • White Wine: Mean = 138.36, Max = 440
  • Interpretation:
    • White wine contains almost three times the total sulfur dioxide.
    • This highlights an important preservation difference between red and white wines.
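The "almost double", "over twice", and "almost three times" statements in items (2), (5), (6), and (7) can be checked with a quick ratio of the means reported by describe(). The dictionaries below simply copy those printed means; the variable names are hypothetical.

```python
# Means copied from the describe() output above
red_means = {
    "volatile acidity": 0.5278,
    "chlorides": 0.0875,
    "free sulfur dioxide": 15.8749,
    "total sulfur dioxide": 46.4678,
}
white_means = {
    "volatile acidity": 0.2782,
    "chlorides": 0.0458,
    "free sulfur dioxide": 35.3081,
    "total sulfur dioxide": 138.3607,
}

# Ratio of white to red for each feature (values below 1 mean red is higher)
ratios = {name: white_means[name] / red_means[name] for name in red_means}

for name, ratio in ratios.items():
    print(f"{name}: white/red = {ratio:.2f}")
```

With the full DataFrames loaded, white_wine.mean() / red_wine.mean() gives the same ratios for every column at once.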

(8) Density

  • Red Wine: Mean = 0.9967
  • White Wine: Mean = 0.9940
  • Interpretation:
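    • White wine has a slightly lower mean density (0.9940 vs. 0.9967). Alcohol lowers density, whereas sugar raises it, so white wine’s higher residual sugar does not explain the gap.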

(9) pH

  • Red Wine: Mean = 3.31
  • White Wine: Mean = 3.19
  • Interpretation:
    • Red wine has a slightly higher pH, meaning it is less acidic than white wine.
    • This could result in a smoother, rounder taste for red wine.

(10) Alcohol Content

  • Red Wine: Mean = 10.42, Max = 14.9
  • White Wine: Mean = 10.51, Max = 14.2
  • Interpretation:
    • Both wines have similar alcohol distributions, with white wine being slightly higher on average.

(11) Quality Score

  • Red Wine: Mean = 5.64, Max = 8
  • White Wine: Mean = 5.88, Max = 9
  • Interpretation:
    • White wine has a slightly higher average quality score.
    • The highest recorded quality score is higher for white wine.
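Mean and maximum only summarize the quality column; counting how many wines fall at each score gives a fuller picture. A minimal sketch, with quality_distribution and the toy frame as hypothetical names:

```python
import pandas as pd

def quality_distribution(df: pd.DataFrame, column: str = "quality") -> pd.DataFrame:
    """Count and share of rows at each quality score, ordered by score."""
    counts = df[column].value_counts().sort_index()
    share = (counts / counts.sum()).round(4)
    return pd.DataFrame({"count": counts, "share": share})

# Hypothetical toy data standing in for red_wine / white_wine
toy = pd.DataFrame({"quality": [5, 5, 6, 7]})
distribution = quality_distribution(toy)
print(distribution)
```

Applying it to both datasets shows how many wines sit at the extreme scores, which matters if quality is later used as a prediction target.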

Summary of Findings


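  • Acidity: Red wine has higher fixed and volatile acidity, but its slightly higher pH (3.31 vs. 3.19) means it is less acidic on the pH scale.
  • Sweetness: White wine has about 2.5 times the mean residual sugar of red wine.
  • Sulfur Dioxide: White wine contains more free and total sulfur dioxide, consistent with its greater need for preservation.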
  • Quality: White wine has a slightly higher average quality score.
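For further comparison it can be convenient to stack the two datasets into one frame with a label column and let groupby produce the per-type means. The sketch below is one way to do that; combine_wines and the wine_type column are hypothetical names that do not appear in the original files.

```python
import pandas as pd

def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack both datasets and label each row with its wine type."""
    red = red.assign(wine_type="red")        # assign returns a copy
    white = white.assign(wine_type="white")
    return pd.concat([red, white], ignore_index=True)

# Hypothetical toy frames standing in for red_wine / white_wine
toy_red = pd.DataFrame({"pH": [3.3, 3.4], "quality": [5, 6]})
toy_white = pd.DataFrame({"pH": [3.1, 3.2, 3.3], "quality": [6, 6, 7]})

wines = combine_wines(toy_red, toy_white)
means = wines.groupby("wine_type").mean()
print(means)
```

Because the two files share the same 12 columns, combine_wines(red_wine, white_wine) should yield a single frame of 6,497 rows with 13 columns.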
๋ฐ˜์‘ํ˜•