
[Python][pandas] Loading Data - CSV

5hr1rnp 2025. 2. 11. 17:53
๋ฐ˜์‘ํ˜•

What is a CSV File?


One of the most commonly used formats in data analysis is CSV (Comma-Separated Values). CSV files store data in a simple text format, where values are separated by commas (or other delimiters).

Pandas provides a powerful function, read_csv(), to easily load CSV files into a DataFrame. In this guide, we will explore what a CSV file is, how to load it using Pandas, key parameters of read_csv(), and how to prevent or resolve common errors.


Understanding the CSV File Format


A CSV file is a plain text file where data is separated by commas (,). Each row represents a record, and the first row is usually used as the header (column names).

Example of a CSV file:

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
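
If you want to follow along with the later examples, the snippet below writes this sample data to a file. The filename data.csv is simply an assumption chosen to match the examples in this guide:

# Create the sample file used in the examples below
sample_csv = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(sample_csv)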

 


Characteristics of CSV Files


  • Delimiter: Values are typically separated by commas (,), but tabs (\t) or semicolons (;) are also common (see the short example after this list).
  • Header: The first row usually contains the column names.
  • Text Encoding: Most files are UTF-8, but in some regions (such as Korea) files are often saved in CP949 or EUC-KR.
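
For example, a semicolon-separated file can be loaded by passing the delimiter explicitly to read_csv(), which is covered in the next section. This is only a sketch; the filename data_semicolon.csv is hypothetical:

import pandas as pd

# Hypothetical file where rows look like "Alice;25;New York"
df = pd.read_csv('data_semicolon.csv', sep=';')
print(df.head())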

Loading a CSV File with Pandas


Pandas provides the read_csv() function to convert a CSV file into a DataFrame. Here's a basic example:

import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Display DataFrame
print(df)

 

This code reads 'data.csv' from the current directory and loads it into a Pandas DataFrame.
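
Once the file is loaded, it is worth taking a quick look at the result before doing anything else. A minimal sketch using standard DataFrame inspection methods:

# Quick checks on the loaded DataFrame
print(df.head())   # First five rows
df.info()          # Column names, dtypes, and non-null counts (prints directly)
print(df.shape)    # (number of rows, number of columns)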



Key Parameters of read_csv()


The read_csv() function includes various parameters to customize how the data is read. Here are some commonly used options:

# 1. index_col โ€“ Setting an Index Column
# By default, Pandas assigns an automatic index, 
# but you can specify a column to be used as the index.

df = pd.read_csv('data.csv', index_col=0)  # Use the first column as the index

# 2. sep โ€“ Defining a Custom Delimiter
# If your data is separated by tabs (\t) instead of commas, specify the delimiter:

df = pd.read_csv('data.tsv', sep='\t')  # Read a tab-separated file

# 3. encoding โ€“ Handling Character Encoding
# In some datasets, text encoding may not be UTF-8. In Korea, files are often encoded in CP949.

df = pd.read_csv('data.csv', encoding='cp949')  # Read CP949-encoded data

# 4. header โ€“ Specifying the Header Row
# If your CSV file has no column names, set header=None:

df = pd.read_csv('data.csv', header=None)  # Read file without headers

# 5. na_values โ€“ Handling Missing Values
# Certain values can be interpreted as missing data (NaN), such as "N/A" or "-".

df = pd.read_csv('data.csv', na_values=['N/A', '-'])  # Treat "N/A" and "-" as missing values

# 6. usecols โ€“ Selecting Specific Columns
# To load only specific columns from a CSV file:

df = pd.read_csv('data.csv', usecols=['Name', 'Age'])  # Read only "Name" and "Age" columns
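
These options can be combined in a single call. The example below is only an illustration; the column names and missing-value markers are assumptions based on the sample data above:

# Combine several options in one call
df = pd.read_csv(
    'data.csv',
    usecols=['Name', 'Age'],   # Load only these columns
    na_values=['N/A', '-'],    # Treat these strings as missing values
    encoding='utf-8',          # Explicit encoding (use 'cp949' for Korean Windows files)
)
print(df.head())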

Preventing and Resolving Errors


1. ParserError โ€“ Mismatched Number of Columns

If a file has an inconsistent number of columns, you may encounter this error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 13, saw 7

This indicates that line 13 has 7 fields instead of the expected 6. To skip problematic rows:

# For Pandas versions before 1.3.0 (error_bad_lines is deprecated in newer releases):
df = pd.read_csv('data.csv', error_bad_lines=False)  # Skip problematic lines

# For Pandas 1.3.0 and later:
df = pd.read_csv('data.csv', on_bad_lines='skip')  # Skip problematic lines

# Skipping Specific Rows Manually
df = pd.read_csv('data.csv', skiprows=[2, 4])  # Skip the file lines at 0-based positions 2 and 4

# Filtering Data While Reading (for Large Files)
# For large files, combine chunksize with on_bad_lines='skip' to process
# the data in smaller parts while dropping malformed lines:

valid_rows = []
for chunk in pd.read_csv('data.csv', chunksize=1000, on_bad_lines='skip'):
    valid_chunk = chunk.dropna()  # Optionally drop rows left with missing values
    valid_rows.append(valid_chunk)

df = pd.concat(valid_rows, ignore_index=True)  # Combine valid rows into a DataFrame
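
As an alternative, in Pandas 1.4 and later on_bad_lines also accepts a callable (with the Python engine), so you can record or repair bad lines instead of silently discarding them. A minimal sketch of that approach:

# Collect malformed lines for inspection instead of discarding them silently
bad_lines = []

def handle_bad_line(fields):     # Receives the bad line as a list of split fields
    bad_lines.append(fields)     # Record it for later review
    return None                  # Returning None drops the line from the result

df = pd.read_csv('data.csv', engine='python', on_bad_lines=handle_bad_line)
print(f"Skipped {len(bad_lines)} malformed lines")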

2. Encoding Issues (UnicodeDecodeError)


If a CSV file is encoded in CP949 or EUC-KR, you may see this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

 

To fix this, explicitly specify the correct encoding:

df = pd.read_csv('data.csv', encoding='cp949')  # Use CP949 encoding

# OR

df = pd.read_csv('data.csv', encoding='euc-kr')  # Use EUC-KR encoding
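
If you are not sure which encoding a file uses, a detection library can help. The sketch below assumes the third-party chardet package is installed (pip install chardet); the result is a heuristic guess, not a guarantee:

import chardet

# Guess the encoding from a sample of the raw bytes
with open('data.csv', 'rb') as f:
    result = chardet.detect(f.read(100_000))  # Inspect up to the first 100 KB

print(result)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99, ...}
df = pd.read_csv('data.csv', encoding=result['encoding'])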

Saving Data with UTF-8 Encoding

To avoid encoding problems when the file is opened in other programs (for example Excel on Windows), save it with UTF-8-SIG encoding:

df.to_csv('output.csv', index=False, encoding='utf-8-sig')

 

  • UTF-8: The standard encoding on most modern systems and the Pandas default.
  • UTF-8-SIG: UTF-8 with a BOM (Byte Order Mark) at the start of the file, which helps programs such as Excel on Windows detect the encoding correctly.

Summary


  • Mismatched columns (ParserError): use on_bad_lines='skip' or skiprows
  • Encoding errors (UnicodeDecodeError): specify encoding ('cp949' or 'euc-kr')
  • Skipping unwanted rows: use skiprows
  • Handling large files: use chunksize
  • Handling missing values: use na_values

 

Pandas' read_csv() function provides extensive options to handle real-world CSV files efficiently. By understanding its key parameters and common issues, you can load and process CSV data seamlessly.

 

๋ฐ˜์‘ํ˜•