๊ฐœ๋ฐœ Code/ํŒŒ์ด์ฌ Python

[Python][pandas] ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ - CSV

5hr1rnp 2025. 1. 22. 13:42
๋ฐ˜์‘ํ˜•

CSV๋ž€ ? 


  ๋ฐ์ดํ„ฐ ๋ถ„์„์—์„œ ๊ฐ€์žฅ ํ”ํžˆ ์ ‘ํ•˜๋Š” ํ˜•์‹ ์ค‘ ํ•˜๋‚˜๊ฐ€ CSV(Comma Separated Values) ํŒŒ์ผ์ด๋‹ค. CSV๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์‰ผํ‘œ(๋˜๋Š” ๋‹ค๋ฅธ ๊ตฌ๋ถ„์ž)๋กœ ๊ตฌ๋ถ„๋œ ํ…์ŠคํŠธ ํ˜•์‹์œผ๋กœ ์ €์žฅ๋œ๋‹ค. Pandas์˜ read_csv() ํ•จ์ˆ˜๋Š” ์ด๋Ÿฌํ•œ ํŒŒ์ผ์„ ๊ฐ„๋‹จํžˆ ์ฝ์–ด๋“ค์ผ ์ˆ˜ ์žˆ๋„๋ก ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค. ์ด๋ฒˆ ๊ธ€์—์„œ๋Š” CSV ํ˜•์‹์— ๋Œ€ํ•œ ๊ฐ„๋‹จํ•œ ์†Œ๊ฐœ์™€ ํ•จ๊ป˜ Pandas๋กœ CSV ํŒŒ์ผ์„ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๋ฐฉ๋ฒ•, ์ž์ฃผ ์‚ฌ์šฉํ•˜๋Š” ์ฃผ์š” ๋งค๊ฐœ๋ณ€์ˆ˜, ๊ทธ๋ฆฌ๊ณ  ์—๋Ÿฌ๋ฅผ ์˜ˆ๋ฐฉํ•˜๊ฑฐ๋‚˜ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค.


CSV ํŒŒ์ผ ํ˜•์‹์ด๋ž€ ?


  CSV ํŒŒ์ผ์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์‰ผํ‘œ(,)๋กœ ๊ตฌ๋ถ„๋˜์–ด ์ €์žฅ๋œ ๋‹จ์ˆœํ•œ ํ…์ŠคํŠธ ํŒŒ์ผ์ด๋‹ค. ๊ฐ ํ–‰์€ ๋ฐ์ดํ„ฐ์˜ ํ•œ ๋ ˆ์ฝ”๋“œ๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ์ฒซ ๋ฒˆ์งธ ํ–‰์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์—ด ์ด๋ฆ„(ํ—ค๋”)์œผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

# CSV ํŒŒ์ผ ์˜ˆ์‹œ
# Name,Age,City
# Alice,25,New York
# Bob,30,Los Angeles
# Charlie,35,Chicago

 


CSV ํŒŒ์ผ์˜ ํŠน์„ฑ


  • ๊ตฌ๋ถ„์ž: ์ผ๋ฐ˜์ ์œผ๋กœ ์‰ผํ‘œ(,)๋กœ ๊ตฌ๋ถ„๋˜์ง€๋งŒ, ํƒญ(\t) ๋˜๋Š” ์„ธ๋ฏธ์ฝœ๋ก (;)์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๋„ ์žˆ๋‹ค.
  • ํ—ค๋”: ์ฒซ ๋ฒˆ์งธ ํ–‰์— ์—ด ์ด๋ฆ„์ด ํฌํ•จ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.
  • ํ…์ŠคํŠธ ์ธ์ฝ”๋”ฉ: ๊ธฐ๋ณธ์ ์œผ๋กœ utf-8์ด ์‚ฌ์šฉ๋˜์ง€๋งŒ, ํ•œ๊ตญ์—์„œ๋Š” cp949๋‚˜ euc-kr๋กœ ์ €์žฅ๋œ ํŒŒ์ผ๋„ ํ”ํžˆ ์‚ฌ์šฉ๋œ๋‹ค.

Pandas๋กœ CSV ํŒŒ์ผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ


Pandas์˜ read_csv() ํ•จ์ˆ˜๋Š” CSV ํŒŒ์ผ์„ DataFrame์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

# pandas๋Š” ์ฃผ๋กœ `pd`๋ผ๋Š” ์ด๋ฆ„์œผ๋กœ ๋ถˆ๋Ÿฌ์™€์„œ ์‚ฌ์šฉํ•œ๋‹ค.
import pandas as pd

# CSV ํŒŒ์ผ ์ฝ๊ธฐ
df = pd.read_csv('data.csv')

# DataFrame ์ถœ๋ ฅ
print(df)

 

์œ„ ์ฝ”๋“œ๋Š” ํ˜„์žฌ ๋””๋ ‰ํ„ฐ๋ฆฌ์— ์žˆ๋Š” 'data.csv' ํŒŒ์ผ์„ ์ฝ์–ด DataFrame์œผ๋กœ ๋ณ€ํ™˜ํ•œ ํ›„ ๋ฐ์ดํ„ฐ๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค.


728x90
๋ฐ˜์‘ํ˜•

์ฃผ์š” ๋งค๊ฐœ๋ณ€์ˆ˜ (Parameters)


read_csv()๋Š” ๋‹ค์–‘ํ•œ ์˜ต์…˜์„ ์ œ๊ณตํ•˜์—ฌ ํŒŒ์ผ์˜ ๊ตฌ์กฐ์— ๋งž๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ๋‹ค. ์ž์ฃผ ์‚ฌ์šฉํ•˜๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

# index_col
# DataFrame์—์„œ ํŠน์ • ์—ด์„ ์ธ๋ฑ์Šค๋กœ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„์™€์คŒ
# ์ธ๋ฑ์Šค(index)๋Š” ๋ฐ์ดํ„ฐ์˜ ๊ณ ์œ ํ•œ ์‹๋ณ„์ž ์—ญํ• ์„ ํ•˜๋ฉฐ, ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ฒ€์ƒ‰๊ณผ
# ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

df = pd.read_csv('data.csv', index_col=0)  # ์ฒซ ๋ฒˆ์งธ ์—ด์„ ์ธ๋ฑ์Šค๋กœ ์„ค์ •

# sep
# ๊ตฌ๋ถ„์ž๋ฅผ ์ง€์ •ํ•˜๋ฉฐ, default๋Š” ์‰ผํ‘œ(,)

df = pd.read_csv('data.tsv', sep='\t') # ํƒญ(tab)์œผ๋กœ ๊ตฌ๋ถ„๋œ ํŒŒ์ผ ์ฝ๊ธฐ

# encoding
# ํ…ํŠธ์Šค ํŒŒ์ผ์˜ ๋ฌธ์ž ์ธ์ฝ”๋”ฉ์„ ์ง€์ •ํ•จ
# ๊ธฐ๊ด€์—์„œ ์ œ๊ณตํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ encoding ํ˜•์‹์€ `cp949`์ธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
# default๋Š” `utf-8` ํ˜•์‹

df = pd.read_csv('data.csv', encoding='cp949') # `cp949`๋กœ ์ธ์ฝ”๋”ฉ๋œ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

# header
# ํ—ค๋”(์—ด ์ด๋ฆ„)์ด ์žˆ๋Š” ํ–‰ ๋ฒˆํ˜ธ๋ฅผ ์ง€์ •ํ•œ๋‹ค. default๋Š” 0(์ฒซ ๋ฒˆ์งธ ํ–‰์„ ๋œปํ•จ)

df = pd.read_csv('data.csv', header=None) # ํ—ค๋” ์—†์ด ํŒŒ์ผ ์ฝ๊ธฐ

# na_values
# ํŠน์ • ๊ฐ’์„ ๊ฒฐ์ธก๊ฐ’(NaN, Not a Number)์œผ๋กœ ์ฒ˜๋ฆฌ
# * ์ฐธ๊ณ ๋กœ null์€ ์กด์žฌํ•˜์ง€ ์•Š์Œ์„ ๋œปํ•˜๋ฉฐ, NA๋Š” Not Available์˜ ์ค„์ž„๋ง
# ๊ธฐ๋ณธ์ ์œผ๋กœ ๋‹ค์Œ ๊ฐ’์€ NaN์œผ๋กœ ํ•ด์„๋จ : " ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", 
# "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", 
# "NaN", "None", "n/a", "nan", "null "

df = pd.read_csv('data.csv', na_values=['N/A', '-'])

# usecols
# ์ฝ๊ณ ์ž ํ•˜๋Š” ์—ด๋งŒ ์„ ํƒํ•จ

df = pd.read_csv('data.csv', usecols=['Name', 'Age'])

์—๋Ÿฌ ์˜ˆ๋ฐฉ ๋ฐ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•


1) ParserError

ํŒŒ์ผ์„ ๊ตฌ๋ถ„ ๋ถ„์„ํ•˜๋Š” ๋™์•ˆ ์—ด์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งž์ง€ ์•Š์•„ ๋ฐœ์ƒํ•˜๋Š” ์—๋Ÿฌ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 13, saw 7

 

13๋ฒˆ์งธ ์ค„์—์„œ 6๊ฐœ์˜ ํ•„๋“œ ๊ฐ’์ด ์žˆ์–ด์•ผ ํ•˜๋Š”๋ฐ 7๊ฐœ์˜ ๊ฐ’์ด ์žˆ์–ด ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ. ํ•ด๋‹น ๋ผ์ธ์—๋งŒ ๋ฌธ์ œ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์œผ๋‚˜, ๋” ๋งŽ์€ ํ–‰์—์„œ ๋ฌธ์ œ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ํ–‰์€ 'skip'์œผ๋กœ ๊ฑด๋„ˆ๋›ฐ๋Š”๊ฒŒ ํ•˜๋‚˜์˜ ๋ฐฉ๋ฒ•์ผ์ˆ˜๋„ ์žˆ๋‹ค.

 

# error_bad_lines
# Pandas 1.3.0 ์ด์ „ ๋ฒ„์ „์—์„œ ์ง€์›
# ์ด ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰์„ ๊ฑด๋„ˆ๋›ฐ๊ณ  ๋‚˜๋จธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์–ด์˜ฌ ์ˆ˜ ์žˆ์Œ

df = pd.read_csv('data.csv', error_bad_lines=False) # ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰ ๊ฑด๋„ˆ๋›ฐ๊ธฐ

# on_bad_lines
# Pandas 1.3.0 ์ดํ›„ ๋ฒ„์ „์—์„œ ์ง€์›
# ์ตœ์‹  Pandas์—์„œ๋Š” on_bad_lines ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Œ
# on_bad_lines ์˜ต์…˜
# 'error': ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ์˜ค๋ฅ˜๋ฅผ ๋ฐœ์ƒ์‹œํ‚ด (default)
# 'skip': ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰์„ ๊ฑด๋„ˆ๋œ€
# callable ํ•จ์ˆ˜: ํŠน์ • ํ–‰์„ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ์ง€ ์‚ฌ์šฉ์ž ์ •์˜ ํ•จ์ˆ˜๋กœ ์ง€์ •

df = pd.read_csv('data.csv', on_bad_lines='skip') # ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰ ๊ฑด๋„ˆ๋›ฐ๊ธฐ

# skiprows
# ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ํ–‰์ด ํŠน์ • ํ–‰ ๋ฒˆํ˜ธ์— ์œ„์น˜ํ•ด ์žˆ๋‹ค๋ฉด ํ•ด๋‹น ์˜ต์…˜์„ ์‚ฌ์šฉํ•ด
# ๋ช…์‹œ์ ์œผ๋กœ ๊ฑด๋„ˆ๋›ธ ์ˆ˜ ์žˆ์Œ

df = pd.read_csv('data.csv', skiprows=[2, 4]) # ํŠน์ • ํ–‰ ๊ฑด๋„ˆ๋›ฐ๊ธฐ (์˜ˆ: 2, 4๋ฒˆ ํ–‰ ๊ฑด๋„ˆ๋›ฐ๊ธฐ)

# ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์ „ ํ•„ํ„ฐ๋ง ๋ฐ์ดํ„ฐ ํŒŒ์ผ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๊ฑฐ๋‚˜ ํ–‰์˜ ์œ„์น˜๋ฅผ ๋ฏธ๋ฆฌ ์•Œ ์ˆ˜ ์—†๋Š” ๊ฒฝ์šฐ, 
# ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ์ค„์”ฉ ์ฝ์œผ๋ฉด์„œ ํŠน์ • ์กฐ๊ฑด์— ๋”ฐ๋ผ ํ•„ํ„ฐ๋งํ•˜๋Š” ๋ฐฉ์‹๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ
# ์ด๋ฅผ ์œ„ํ•ด Pandas์˜ `chunksize`๋ฅผ ์‚ฌ์šฉํ•จ

# ๋ฐ์ดํ„ฐ๋ฅผ ์ฒญํฌ ๋‹จ์œ„๋กœ ์ฝ๊ธฐ
valid_rows = []
for chunk in pd.read_csv('data.csv', chunksize=1000):
    # ํ•„ํ„ฐ๋ง ์กฐ๊ฑด ์ ์šฉ (์˜ˆ: ์—ด ๊ฐœ์ˆ˜๊ฐ€ ๋งž๋Š” ํ–‰๋งŒ)
    valid_chunk = chunk[chunk.apply(lambda x: len(x) == len(chunk.columns), axis=1)]
    valid_rows.append(valid_chunk)

# ์œ ํšจํ•œ ํ–‰๋“ค๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ
df = pd.concat(valid_rows, ignore_index=True)

2) ์ธ์ฝ”๋”ฉ ๊ด€๋ จ ๋ฌธ์ œ


ํ•œ๊ตญ์–ด ํŒŒ์ผ์€ `utf-8` ๋Œ€์‹  `cp949` ๋˜๋Š” `euc-kr`๋กœ ์ €์žฅ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—๋Ÿฌ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

 

ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” `encoding` ํ˜•์‹์„ ๋ช…์‹œ์ ์œผ๋กœ ์ง€์ •ํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๋Š”๋‹ค.

 

# encoding='cp949' ๋˜๋Š” encoding='euc-kr'๋ฅผ ์ง€์ •

df = pd.read_csv('data.csv', encoding='cp949')

# or

df = pd.read_csv('data.csv', encoding='euc-kr')

# ๋ฐ์ดํ„ฐ ์ €์žฅ ์‹œ `utf-8-sig`ํ˜•์‹ ์‚ฌ์šฉ

df.to_csv('output.csv', index=False, encoding='utf-8-sig')

# utf-8 	: ๋Œ€๋ถ€๋ถ„์˜ ์šด์˜ ์ฒด์ œ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹
# utf-8-sig : utf-8์— BOM(Byte Order Mark)์„ ์ถ”๊ฐ€ํ•˜์—ฌ Windows์™€ Linux ๋ชจ๋‘์—์„œ ํ˜ธํ™˜์„ฑ์„ ๋ณด์žฅ
๋ฐ˜์‘ํ˜•