
[Python][pandas] Loading Data - CSV

5hr1rnp 2025. 2. 11. 17:53
๋ฐ˜์‘ํ˜•

What is a CSV File?


One of the most commonly used formats in data analysis is CSV (Comma-Separated Values). CSV files store data in a simple text format, where values are separated by commas (or other delimiters).

Pandas provides a powerful function, read_csv(), to easily load CSV files into a DataFrame. In this guide, we will explore what a CSV file is, how to load it using Pandas, key parameters of read_csv(), and how to prevent or resolve common errors.


Understanding the CSV File Format


A CSV file is a plain text file where data is separated by commas (,). Each row represents a record, and the first row is usually used as the header (column names).

Example of a CSV file:

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
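
If you want to follow along with the later examples, the snippet below writes this sample data to a file. The filename data.csv is simply an assumption chosen to match the examples in this guide:

# Create the sample file used in the examples below
sample_csv = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(sample_csv)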

 


Characteristics of CSV Files


  • Delimiter: Values are typically separated by commas (,), but tabs (\t) or semicolons (;) are also common (see the short example after this list).
  • Header: The first row usually contains the column names.
  • Text Encoding: Most files are UTF-8, but in some regions (such as Korea) files are often saved in CP949 or EUC-KR.
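
For example, a semicolon-separated file can be loaded by passing the delimiter explicitly to read_csv(), which is covered in the next section. This is only a sketch; the filename data_semicolon.csv is hypothetical:

import pandas as pd

# Hypothetical file where rows look like "Alice;25;New York"
df = pd.read_csv('data_semicolon.csv', sep=';')
print(df.head())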

Loading a CSV File with Pandas


Pandas provides the read_csv() function to convert a CSV file into a DataFrame. Here's a basic example:

import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Display DataFrame
print(df)

 

This code reads 'data.csv' from the current directory and loads it into a Pandas DataFrame.
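
Once the file is loaded, it is worth taking a quick look at the result before doing anything else. A minimal sketch using standard DataFrame inspection methods:

# Quick checks on the loaded DataFrame
print(df.head())   # First five rows
df.info()          # Column names, dtypes, and non-null counts (prints directly)
print(df.shape)    # (number of rows, number of columns)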



Key Parameters of read_csv()


The read_csv() function includes various parameters to customize how the data is read. Here are some commonly used options:

# 1. index_col โ€“ Setting an Index Column
# By default, Pandas assigns an automatic index, 
# but you can specify a column to be used as the index.

df = pd.read_csv('data.csv', index_col=0)  # Use the first column as the index

# 2. sep โ€“ Defining a Custom Delimiter
# If your data is separated by tabs (\t) instead of commas, specify the delimiter:

df = pd.read_csv('data.tsv', sep='\t')  # Read a tab-separated file

# 3. encoding โ€“ Handling Character Encoding
# In some datasets, text encoding may not be UTF-8. In Korea, files are often encoded in CP949.

df = pd.read_csv('data.csv', encoding='cp949')  # Read CP949-encoded data

# 4. header โ€“ Specifying the Header Row
# If your CSV file has no column names, set header=None:

df = pd.read_csv('data.csv', header=None)  # Read file without headers

# 5. na_values โ€“ Handling Missing Values
# Certain values can be interpreted as missing data (NaN), such as "N/A" or "-".

df = pd.read_csv('data.csv', na_values=['N/A', '-'])  # Treat "N/A" and "-" as missing values

# 6. usecols โ€“ Selecting Specific Columns
# To load only specific columns from a CSV file:

df = pd.read_csv('data.csv', usecols=['Name', 'Age'])  # Read only "Name" and "Age" columns
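
These options can be combined in a single call. The example below is only an illustration; the column names and missing-value markers are assumptions based on the sample data above:

# Combine several options in one call
df = pd.read_csv(
    'data.csv',
    usecols=['Name', 'Age'],   # Load only these columns
    na_values=['N/A', '-'],    # Treat these strings as missing values
    encoding='utf-8',          # Explicit encoding (use 'cp949' for Korean Windows files)
)
print(df.head())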

Preventing and Resolving Errors


1. ParserError โ€“ Mismatched Number of Columns

If a file has an inconsistent number of columns, you may encounter this error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 13, saw 7

This indicates that line 13 has 7 fields instead of the expected 6. To skip problematic rows:

# For Pandas versions before 1.3.0 (error_bad_lines is deprecated in newer releases):
df = pd.read_csv('data.csv', error_bad_lines=False)  # Skip problematic lines

# For Pandas 1.3.0 and later:
df = pd.read_csv('data.csv', on_bad_lines='skip')  # Skip problematic lines

# Skipping Specific Rows Manually
df = pd.read_csv('data.csv', skiprows=[2, 4])  # Skip the file lines at 0-based positions 2 and 4

# Filtering Data While Reading (for Large Files)
# For large files, combine chunksize with on_bad_lines='skip' to process
# the data in smaller parts while dropping malformed lines:

valid_rows = []
for chunk in pd.read_csv('data.csv', chunksize=1000, on_bad_lines='skip'):
    valid_chunk = chunk.dropna()  # Optionally drop rows left with missing values
    valid_rows.append(valid_chunk)

df = pd.concat(valid_rows, ignore_index=True)  # Combine valid rows into a DataFrame
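
As an alternative, in Pandas 1.4 and later on_bad_lines also accepts a callable (with the Python engine), so you can record or repair bad lines instead of silently discarding them. A minimal sketch of that approach:

# Collect malformed lines for inspection instead of discarding them silently
bad_lines = []

def handle_bad_line(fields):     # Receives the bad line as a list of split fields
    bad_lines.append(fields)     # Record it for later review
    return None                  # Returning None drops the line from the result

df = pd.read_csv('data.csv', engine='python', on_bad_lines=handle_bad_line)
print(f"Skipped {len(bad_lines)} malformed lines")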

2. Encoding Issues (UnicodeDecodeError)


If a CSV file is encoded in CP949 or EUC-KR, you may see this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

 

To fix this, explicitly specify the correct encoding:

df = pd.read_csv('data.csv', encoding='cp949')  # Use CP949 encoding

# OR

df = pd.read_csv('data.csv', encoding='euc-kr')  # Use EUC-KR encoding
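
If you are not sure which encoding a file uses, a detection library can help. The sketch below assumes the third-party chardet package is installed (pip install chardet); the result is a heuristic guess, not a guarantee:

import chardet

# Guess the encoding from a sample of the raw bytes
with open('data.csv', 'rb') as f:
    result = chardet.detect(f.read(100_000))  # Inspect up to the first 100 KB

print(result)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99, ...}
df = pd.read_csv('data.csv', encoding=result['encoding'])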

Saving Data with UTF-8 Encoding

To avoid encoding problems when the file is opened in other programs (for example Excel on Windows), save it with UTF-8-SIG encoding:

df.to_csv('output.csv', index=False, encoding='utf-8-sig')

 

  • UTF-8: The standard encoding on most modern systems and the Pandas default.
  • UTF-8-SIG: UTF-8 with a BOM (Byte Order Mark) at the start of the file, which helps programs such as Excel on Windows detect the encoding correctly.

Summary


  • Mismatched columns (ParserError): use on_bad_lines='skip' or skiprows
  • Encoding errors (UnicodeDecodeError): specify encoding ('cp949' or 'euc-kr')
  • Skipping unwanted rows: use skiprows
  • Handling large files: use chunksize
  • Handling missing values: use na_values

 

Pandas' read_csv() function provides extensive options to handle real-world CSV files efficiently. By understanding its key parameters and common issues, you can load and process CSV data seamlessly.

 

๋ฐ˜์‘ํ˜•