
[Python][AI] CatBoost: A Powerful Gradient Boosting Library

5hr1rnp 2025. 2. 14. 15:30


1. What Is CatBoost?


CatBoost is a gradient boosting machine learning library developed by Yandex that builds its models from decision trees. It is known for strong performance and high accuracy, and it is used in a wide range of fields such as recommendation systems, search engines, autonomous driving, and weather forecasting.


2. Key Features of CatBoost


1) Strong performance without extensive parameter tuning

  • Performs well even with the default settings
  • Users do not have to spend much time on complex parameter tuning

2) Support for categorical features

  • Most machine learning models require categorical data to be converted to numbers beforehand; CatBoost encodes the categorical columns you declare (via cat_features) internally, which often improves model quality
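The idea behind this internal encoding can be sketched in plain Python. The snippet below is a simplified illustration of an ordered target statistic, where each row is encoded using only the labels of rows that precede it in a random permutation. The function name, the `prior` and `prior_weight` parameters, and the toy data are assumptions made for this example; CatBoost's actual implementation is more elaborate.

```python
import random
from collections import defaultdict


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode categories with a simplified ordered target statistic.

    Each row only sees the targets of rows that come earlier in a random
    permutation, so its own label never leaks into its encoding.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = defaultdict(float)  # running sum of targets per category
    counts = defaultdict(int)  # running number of rows per category
    encoded = [0.0] * len(categories)

    for idx in order:
        cat = categories[idx]
        # Smoothed mean of the targets seen so far for this category.
        encoded[idx] = (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        sums[cat] += targets[idx]
        counts[cat] += 1
    return encoded


colors = ["red", "blue", "red", "red", "blue"]  # hypothetical categorical column
labels = [1, 0, 1, 0, 0]                        # hypothetical binary target
encoded = ordered_target_stats(colors, labels)
print(encoded)
```

The first row seen for each category falls back to the prior, which is why some values in the output equal 0.5.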

3) Fast, scalable GPU version

  • Training can be accelerated on a GPU
  • Multi-GPU mode is supported for training across several GPUs
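As a rough sketch of how GPU training is selected, the helper below builds the keyword arguments involved. `task_type` and `devices` are CatBoost training parameters; the helper name `gpu_training_params` is made up for this example, and the commented-out lines assume a machine with CatBoost installed and a CUDA-capable NVIDIA GPU.

```python
def gpu_training_params(use_gpu=True, devices="0"):
    """Return keyword arguments that select CPU or GPU training."""
    params = {"task_type": "GPU" if use_gpu else "CPU"}
    if use_gpu:
        # "0" means the first GPU; "0:1" means GPUs 0 and 1 (multi-GPU mode).
        params["devices"] = devices
    return params


params = gpu_training_params(use_gpu=True, devices="0:1")
print(params)

# On a machine with CatBoost and suitable GPUs, the dict could be passed on like this:
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(iterations=500, **params)
```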

4) Stronger protection against overfitting

  • Uses a modified gradient boosting scheme (ordered boosting) that lowers the risk of overfitting
  • Improves the model's generalization performance
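Alongside the boosting scheme itself, CatBoost can also stop training early when a validation metric stops improving (for example through `early_stopping_rounds` together with an `eval_set`). The sketch below shows only the generic patience rule behind that kind of check; the function name and the loss values are invented for illustration, and it is not CatBoost's internal code.

```python
def stop_iteration(val_losses, patience):
    """Return (best_iteration, stopped_at) for per-iteration validation losses.

    Training stops once `patience` iterations pass without a new best loss.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_losses) - 1


# Hypothetical validation losses: they improve until iteration 3, then get worse.
losses = [0.69, 0.55, 0.48, 0.47, 0.49, 0.50, 0.52, 0.51]
best, stopped = stop_iteration(losses, patience=3)
print(best, stopped)  # 3 6
```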

5) Fast prediction

  • Trained models can be applied quickly, so CatBoost is also suitable for services that need real-time predictions

3. Installing CatBoost


CatBoost is available for Python and R and can be installed with a simple command.

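pip install catboost
# R users
# The R package is not published on CRAN, so install.packages('catboost') will not find it;
# follow the R installation instructions in the official documentation instead.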
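To confirm that the Python package was installed, a small check like the following may help. It relies only on the standard library; the helper name `installed_version` is an assumption for this example.

```python
from importlib import metadata


def installed_version(package="catboost"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "catboost is not installed")
```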
 


4. Basic CatBoost Usage (Python Example)


A simple example of training a classification model with CatBoost:

 

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data (replace 'your_data.csv' and the 'target' column with your own)
data = pd.read_csv('your_data.csv')
X = data.drop('target', axis=1)
y = data['target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the CatBoost model
model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.1, loss_function='Logloss', verbose=200)
# cat_features lists the categorical columns; [0, 1] assumes the first two columns of X are categorical
model.fit(X_train, y_train, cat_features=[0, 1])

# Predict and evaluate accuracy
y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
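
Hard-coded indices such as [0, 1] break as soon as the column order changes. One possible alternative, sketched below, is to collect the names of the non-numeric columns and pass those instead, since `cat_features` also accepts column names when the input is a DataFrame. The helper name `categorical_columns` and the toy DataFrame are assumptions for this example, and the rule is deliberately crude (it would also flag datetime columns).

```python
import pandas as pd


def categorical_columns(df):
    """Return the names of columns that are not numeric (strings, categories, ...)."""
    return [col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])]


# Hypothetical feature table with two categorical columns and one numeric column.
X = pd.DataFrame({
    "city": ["Seoul", "Busan", "Seoul"],
    "plan": ["free", "pro", "pro"],
    "age": [31, 45, 27],
})
cat_cols = categorical_columns(X)
print(cat_cols)  # ['city', 'plan']

# The resulting names could then replace the hard-coded indices:
# model.fit(X_train, y_train, cat_features=cat_cols)
```

Categorical columns passed this way should hold strings or integers; missing values in them generally need to be filled before training.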

5. CatBoost vs. Other Gradient Boosting Libraries


CatBoost can outperform XGBoost, LightGBM, and similar libraries under certain conditions. The main differences are as follows:

| Library  | Categorical data support                    | Training speed | Prediction speed | Overfitting protection |
|----------|---------------------------------------------|----------------|------------------|------------------------|
| CatBoost | Yes (handled automatically)                 | Moderate       | Fast             | Strong                 |
| XGBoost  | Limited (manual conversion usually needed)  | Fast           | Fast             | Moderate               |
| LightGBM | Limited (manual conversion usually needed)  | Very fast      | Moderate         | Moderate               |

In particular, CatBoost performs very well on datasets that contain categorical features.


6. Real-World Uses of CatBoost


CatBoost is used across several industries. Representative examples:

  • Yandex: used in various AI projects, including search, recommendation systems, and autonomous driving
  • Cloudflare: detecting malicious bot traffic
  • Careem (the largest ride-hailing service in the Middle East): predicting users' travel patterns

7. Conclusion


CatBoost is a powerful gradient boosting library that performs especially well on datasets containing categorical features. If you have been using XGBoost or LightGBM, CatBoost is worth a try.

For more in-depth information, see the official CatBoost documentation: 👉 Official documentation
