
What is a CSV File?
One of the most commonly used formats in data analysis is CSV (Comma-Separated Values). CSV files store data in a simple text format, where values are separated by commas (or other delimiters).
Pandas provides a powerful function, read_csv(), to easily load CSV files into a DataFrame. In this guide, we will explore what a CSV file is, how to load it using Pandas, key parameters of read_csv(), and how to prevent or resolve common errors.
Understanding the CSV File Format
A CSV file is a plain text file where data is separated by commas (,). Each row represents a record, and the first row is usually used as the header (column names).
Example of a CSV file:
# Example CSV File
# Name,Age,City
# Alice,25,New York
# Bob,30,Los Angeles
# Charlie,35,Chicago
Characteristics of CSV Files
- Delimiter: Typically separated by commas (,), but can also use tabs (\t) or semicolons (;).
- Header: The first row often contains column names.
- Text Encoding: UTF-8 is the most common encoding and the one read_csv() assumes by default, but in some regions (such as Korea), files may be saved in CP949 or EUC-KR.
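To make the delimiter point concrete, here is a minimal sketch that parses the same records with three different separators. The sample text and variable names are made up for illustration, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical sample data: the same records with three different delimiters.
comma_text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"
semicolon_text = comma_text.replace(",", ";")
tab_text = comma_text.replace(",", "\t")

df_comma = pd.read_csv(io.StringIO(comma_text))              # default sep=','
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Check whether all three parse to the same table.
print(df_comma.equals(df_semicolon) and df_comma.equals(df_tab))
```

If the wrong separator is used, Pandas usually does not raise an error; it simply returns a single wide column, so checking df.columns after loading is a cheap sanity test.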
Loading a CSV File with Pandas
Pandas provides the read_csv() function to convert a CSV file into a DataFrame. Here's a basic example:
import pandas as pd
# Read CSV file
df = pd.read_csv('data.csv')
# Display DataFrame
print(df)
This code reads 'data.csv' from the current directory and loads it into a Pandas DataFrame.
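If you do not have a data.csv at hand, the same call can be tried on an in-memory string. This is only a sketch: the csv_text variable is an assumption for illustration, and io.StringIO is used in place of a file path.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data.csv.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # Age is inferred as an integer column
print(df.head())  # first rows of the DataFrame
```

Looking at shape and dtypes right after loading is a quick way to confirm that the header and the column types were read as expected.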
Key Parameters of read_csv()
The read_csv() function includes various parameters to customize how the data is read. Here are some commonly used options:
# 1. index_col - Setting an Index Column
# By default, Pandas assigns an automatic index,
# but you can specify a column to be used as the index.
df = pd.read_csv('data.csv', index_col=0) # Use the first column as the index
# 2. sep - Defining a Custom Delimiter
# If your data is separated by tabs (\t) instead of commas, specify the delimiter:
df = pd.read_csv('data.tsv', sep='\t') # Read a tab-separated file
# 3. encoding - Handling Character Encoding
# In some datasets, text encoding may not be UTF-8. In Korea, files are often encoded in CP949.
df = pd.read_csv('data.csv', encoding='cp949') # Read CP949-encoded data
# 4. header - Specifying the Header Row
# If your CSV file has no column names, set header=None:
df = pd.read_csv('data.csv', header=None) # Read file without headers
# 5. na_values - Handling Missing Values
# Certain values can be interpreted as missing data (NaN), such as "N/A" or "-".
df = pd.read_csv('data.csv', na_values=['N/A', '-']) # Treat "N/A" and "-" as missing values
# 6. usecols - Selecting Specific Columns
# To load only specific columns from a CSV file:
df = pd.read_csv('data.csv', usecols=['Name', 'Age']) # Read only "Name" and "Age" columns
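These parameters can be combined in a single call. The sketch below uses a made-up table with an ID column (not part of the earlier example) to show index_col, usecols, and na_values working together.

```python
import io
import pandas as pd

# Hypothetical data with an ID column and two kinds of missing-value markers.
csv_text = """ID,Name,Age,City
1,Alice,25,New York
2,Bob,N/A,Los Angeles
3,Charlie,-,Chicago
4,Dana,41,Seoul
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["ID", "Name", "Age"],  # load only these columns
    index_col="ID",                 # use ID as the row index
    na_values=["N/A", "-"],         # treat these strings as NaN
)

print(df)
print(df["Age"].isna().sum())  # number of missing ages
```

Note that "N/A" is already on Pandas' built-in list of missing-value markers, so in this sketch it is the "-" entry that actually changes the result. Because the column contains NaN, Age is loaded as a float column rather than an integer one.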
Preventing and Resolving Errors
1. ParserError โ Mismatched Number of Columns
If a file has an inconsistent number of columns, you may encounter this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 13, saw 7
This indicates that line 13 has 7 fields instead of the expected 6. To skip problematic rows:
# For Pandas versions before 1.3.0 (error_bad_lines was deprecated in 1.3.0 and removed in 2.0):
df = pd.read_csv('data.csv', error_bad_lines=False) # Skip problematic lines
# For Pandas 1.3.0 and later:
df = pd.read_csv('data.csv', on_bad_lines='skip') # Skip problematic lines
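# Skipping Specific Rows Manually
# skiprows takes 0-based line indices counted from the top of the file, header included.
df = pd.read_csv('data.csv', skiprows=[2, 4]) # Skip the 3rd and 5th lines of the file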
# Skipping Bad Rows While Reading in Chunks (for Large Files)
# If the dataset is large and you don't know which rows are problematic,
# combine chunksize with on_bad_lines='skip' to process the data in smaller parts.
# Checking row lengths after parsing does not help: every parsed row already has
# the same number of columns, and a malformed line raises ParserError before that.
valid_rows = []
for chunk in pd.read_csv('data.csv', chunksize=1000, on_bad_lines='skip'):
    valid_rows.append(chunk)
df = pd.concat(valid_rows, ignore_index=True) # Combine the chunks into a DataFrame
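The following sketch reproduces the ParserError on a tiny made-up dataset and then shows two ways around it. It assumes pandas 1.3.0 or later (for on_bad_lines); the sample text and variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical data: Bob's row (the 3rd line of the file) has an extra field.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles,extra
Charlie,35,Chicago
Dana,41,Seoul
"""

# 1. With default settings, the malformed row raises ParserError.
try:
    pd.read_csv(io.StringIO(csv_text))
    raised = False
except pd.errors.ParserError:
    raised = True

# 2. Skip malformed rows while reading in chunks (pandas >= 1.3.0).
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, on_bad_lines="skip"):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

# 3. Skip a known bad line by its 0-based position in the file (header is line 0).
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=[2])

print(raised)
print(df["Name"].tolist())
print(df_skip["Name"].tolist())
```

Skipping rows silently discards data, so it is worth counting how many rows were dropped, or using on_bad_lines='warn' while investigating a new file.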
2. Encoding Issues (UnicodeDecodeError)
If a CSV file is encoded in CP949 or EUC-KR, you may see this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte
To fix this, explicitly specify the correct encoding:
df = pd.read_csv('data.csv', encoding='cp949') # Use CP949 encoding
# OR
df = pd.read_csv('data.csv', encoding='euc-kr') # Use EUC-KR encoding
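When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The sketch below is an assumption-laden illustration: read_csv_with_fallback is a hypothetical helper, not part of Pandas, and the temporary file is created only so the example is self-contained.

```python
import os
import tempfile
import pandas as pd

# Write a small CP949-encoded file to a temporary directory (illustration only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "korean.csv")
with open(path, "w", encoding="cp949", newline="") as f:
    f.write("이름,도시\n가영,서울\n")


def read_csv_with_fallback(path, encodings=("utf-8", "cp949")):
    """Hypothetical helper: return (DataFrame, encoding) for the first encoding that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise last_error


df, used_encoding = read_csv_with_fallback(path)
print(used_encoding)
print(df)
```

This approach only catches encodings that fail loudly. Some wrong encodings decode without error and produce garbled text, so it is still worth looking at the loaded values.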
Saving Data with UTF-8 Encoding
To ensure compatibility across different systems, save the file with UTF-8-SIG encoding:
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
- UTF-8: Commonly used encoding on most operating systems.
- UTF-8-SIG: UTF-8 with a BOM (Byte Order Mark) at the start of the file, which helps programs such as Microsoft Excel on Windows detect the encoding.
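The difference between the two encodings is only the first three bytes of the file. The sketch below writes the same made-up DataFrame both ways into a temporary directory and inspects the raw bytes; the file names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.csv")
bom_path = os.path.join(tmp_dir, "bom.csv")

df.to_csv(plain_path, index=False, encoding="utf-8")
df.to_csv(bom_path, index=False, encoding="utf-8-sig")

# Read the first three raw bytes of each file.
with open(plain_path, "rb") as f:
    plain_start = f.read(3)
with open(bom_path, "rb") as f:
    bom_start = f.read(3)

print(plain_start)  # starts directly with the header text
print(bom_start)    # starts with the UTF-8 BOM bytes EF BB BF

# Reading back with utf-8-sig strips the BOM, so the first column name is clean.
df_back = pd.read_csv(bom_path, encoding="utf-8-sig")
print(list(df_back.columns))
```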
Summary
Issue | Solution |
---|---|
Mismatched Columns (ParserError) | Use on_bad_lines='skip' or skiprows |
Encoding Errors (UnicodeDecodeError) | Specify encoding (cp949, euc-kr) |
Skipping Unwanted Rows | Use skiprows |
Handling Large Files | Use chunksize |
Handling Missing Values | Use na_values |
Pandas' read_csv() function provides extensive options to handle real-world CSV files efficiently. By understanding its key parameters and common issues, you can load and process CSV data seamlessly.