Exploratory Data Analysis (EDA) is the first step in data analysis, where data is visually explored, summary statistics are examined, and patterns and characteristics of the dataset are identified. In this post, we will walk through the step-by-step process of exploring data using the Wine Quality Dataset.
What is EDA?
EDA (Exploratory Data Analysis) is a crucial process for gaining a deeper understanding of a dataset. The primary objectives of EDA include:
- Understanding the distribution and structure of the data
- Detecting missing values and outliers
- Identifying key characteristics and patterns in the data
- Extracting insights necessary for preprocessing and modeling
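As a rough illustration of the first two objectives, the sketch below computes summary statistics for a single column and flags outliers with Tukey's 1.5 × IQR rule. It uses only the Python standard library so it runs anywhere; the function names `summarize` and `iqr_outliers` are hypothetical helpers, and the values in `toy_ph` are made up for illustration rather than taken from the dataset.

```python
import statistics


def summarize(values):
    """Return basic summary statistics, treating None as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Toy values for illustration only, not real measurements.
toy_ph = [3.2, 3.3, 3.1, None, 3.4, 3.3, 3.2, 4.9]

summary = summarize(toy_ph)
outliers = iqr_outliers(toy_ph)
print(summary)
print(outliers)
```

In practice a data-frame library would provide the same numbers in a call or two; the point here is only to show what "summary statistics" and "outlier detection" mean concretely.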
Dataset Information
- Dataset Name: Wine Quality Dataset
- Source: UCI Machine Learning Repository
- Dataset Composition:
- Total samples: 6,497 (red wine: 1,599; white wine: 4,898)
- Input variables (11 features): Chemical properties of wine (e.g., acidity, sugar content, pH, etc.)
- Output variable (1 target): Wine quality score (0–10)
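A minimal loading sketch follows, assuming the data comes as two semicolon-delimited CSV files (one for red, one for white) whose header names contain spaces, such as `fixed acidity`. The file names in the closing comment, the helper `load_wine_rows`, and the added `wine_type` field are assumptions for illustration; the shortened `SAMPLE` text only mimics the assumed layout so the example runs without downloading anything.

```python
import csv
import io


def load_wine_rows(file_obj, wine_type):
    """Parse a semicolon-delimited wine CSV into a list of dicts.

    Column names are converted to snake_case ("fixed acidity" -> "fixed_acidity"),
    values are converted to numbers, and a wine_type label is attached.
    """
    reader = csv.DictReader(file_obj, delimiter=";")
    rows = []
    for raw in reader:
        row = {name.strip().replace(" ", "_"): float(value)
               for name, value in raw.items()}
        row["quality"] = int(row["quality"])  # quality is an integer score
        row["wine_type"] = wine_type
        rows.append(row)
    return rows


# Shortened sample in the assumed file layout (illustration only).
SAMPLE = '"fixed acidity";"volatile acidity";"pH";"quality"\n7.4;0.7;3.51;5\n'

rows = load_wine_rows(io.StringIO(SAMPLE), "red")
print(rows[0])

# With the real files (names assumed), the two sets could be combined like this:
# with open("winequality-red.csv", newline="") as f:
#     red = load_wine_rows(f, "red")
# with open("winequality-white.csv", newline="") as f:
#     white = load_wine_rows(f, "white")
# wines = red + white
```

Keeping a `wine_type` label on each row makes it possible to analyze red and white wines separately after the two sets are merged.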
Input Variables (Features)
1. Fixed Acidity (fixed_acidity)
- Description: Fixed acidity refers to the concentration of non-volatile acids in wine, primarily including tartaric acid and malic acid.
- Unit: g/dm³
- Role:
- Influences the tartness of the wine
- Essential for maintaining the wine’s freshness
2. Volatile Acidity (volatile_acidity)
- Description: Volatile acidity represents the concentration of acids that can evaporate, primarily acetic acid.
- Unit: g/dm³
- Role:
- High volatile acidity can give the wine a vinegar-like taste, lowering its quality
- Moderate levels add complexity to the wine’s flavor
3. Citric Acid (citric_acid)
- Description: Citric acid is a natural component that helps regulate acidity and contributes to the wine’s freshness.
- Unit: g/dm³
- Role:
- Balances and enhances the wine’s tartness
- Low levels may indicate an aged or lower-quality wine
4. Residual Sugar (residual_sugar)
- Description: Residual sugar refers to the concentration of sugar that remains after fermentation.
- Unit: g/dm³
- Role:
- Directly affects the wine’s sweetness
- Most wines range from 1 g/dm³ to 10 g/dm³, while dessert wines have much higher levels
5. Chlorides (chlorides)
- Description: Chlorides represent the concentration of chloride salts (mainly sodium chloride) in wine.
- Unit: g/dm³
- Role:
- Affects the saltiness of the wine
- High chloride levels can degrade the wine’s quality
6. Free Sulfur Dioxide (free_sulfur_dioxide)
- Description: The amount of sulfur dioxide (SO₂) present in a chemically free (unbound) state in the wine.
- Unit: mg/dm³
- Role:
- Prevents oxidation and microbial activity, preserving the wine’s freshness
- Excessive amounts can negatively affect the wine’s flavor
7. Total Sulfur Dioxide (total_sulfur_dioxide)
- Description: The total concentration of free and bound sulfur dioxide in the wine.
- Unit: mg/dm³
- Role:
- Enhances wine preservation
- High levels may cause health concerns for sensitive individuals
8. Density (density)
- Description: The mass per unit volume of the wine; for reference, water is about 1.000 g/cm³.
- Unit: g/cm³
- Role:
- Reflects the balance of residual sugar and alcohol
- Dissolved sugar raises the density, while alcohol (which is less dense than water) lowers it
9. pH (pH)
- Description: The pH value measures the acidity or alkalinity of the wine.
- Range: 0–14 (typically between 3 and 4 for wine)
- Role:
- Affects the wine’s freshness and stability
- Lower pH gives a sharper, fresher taste, while higher pH increases the risk of oxidation and spoilage
10. Sulphates (sulphates)
- Description: Sulphates are sulfate salts (reported in this dataset as potassium sulphate) that impact the wine’s preservation and aroma.
- Unit: g/dm³
- Role:
- Enhances the wine’s aroma and oxidation resistance
- Excessive sulphates can cause a bitter taste
11. Alcohol (alcohol)
- Description: The alcohol content of the wine.
- Unit: % (by volume)
- Role:
- Directly affects the body and flavor of the wine
- Higher alcohol content is often associated with higher-quality wines
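The eleven features and their units can be captured in a small data dictionary, which is handy for sanity-checking samples before analysis. The sketch below is one possible way to do that: the names `FEATURE_UNITS` and `check_row` are hypothetical, and the two checks (pH within 0–14, free SO₂ not exceeding total SO₂) follow directly from the definitions above rather than from any official validation rule.

```python
# Units as listed in the feature descriptions above; pH has no unit.
FEATURE_UNITS = {
    "fixed_acidity": "g/dm³",
    "volatile_acidity": "g/dm³",
    "citric_acid": "g/dm³",
    "residual_sugar": "g/dm³",
    "chlorides": "g/dm³",
    "free_sulfur_dioxide": "mg/dm³",
    "total_sulfur_dioxide": "mg/dm³",
    "density": "g/cm³",
    "pH": None,
    "sulphates": "g/dm³",
    "alcohol": "% vol",
}


def check_row(row):
    """Return a list of human-readable problems found in one sample."""
    problems = [f"missing feature: {name}"
                for name in FEATURE_UNITS if name not in row]
    # pH is defined on a 0-14 scale.
    if "pH" in row and not 0 <= row["pH"] <= 14:
        problems.append(f"pH out of range: {row['pH']}")
    # Total SO2 includes the free fraction, so free cannot exceed total.
    if "free_sulfur_dioxide" in row and "total_sulfur_dioxide" in row:
        if row["free_sulfur_dioxide"] > row["total_sulfur_dioxide"]:
            problems.append("free SO2 exceeds total SO2")
    return problems


# Toy sample for illustration only, not a real measurement.
toy_row = {name: 1.0 for name in FEATURE_UNITS}
toy_row["pH"] = 3.5
print(check_row(toy_row))
```

A check like this will not catch subtle data errors, but it surfaces missing columns and physically impossible values early.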
Output Variable (Target)
Quality Score (quality)
- Description: The overall quality score of the wine, rated on a scale of 0 to 10.
- Role:
- Used as the target variable for prediction
- The goal of this dataset is to predict wine quality based on its chemical properties
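Before modeling a target like this, it usually helps to look at how the scores are distributed, since an imbalanced target affects both evaluation and model choice. The following sketch counts each score and its share of the total; `quality_distribution` is a hypothetical helper and `toy_scores` is made-up data used only to show the output shape.

```python
from collections import Counter


def quality_distribution(scores):
    """Map each quality score to (count, share of all samples), sorted by score."""
    counts = Counter(scores)
    total = len(scores)
    return {score: (counts[score], counts[score] / total)
            for score in sorted(counts)}


# Toy scores for illustration only, not the real label distribution.
toy_scores = [5, 6, 5, 7, 6, 5, 4, 6]

distribution = quality_distribution(toy_scores)
for score, (count, share) in distribution.items():
    print(f"quality {score}: {count} samples ({share:.1%})")
```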