What is Pandas?
Pandas is an open-source Python library designed for data manipulation and analysis. It was developed by Wes McKinney in 2008 when he saw the need for an efficient and intuitive tool to handle financial data. The name "Pandas" originates from "PANel DAta," reflecting its focus on handling multidimensional data structures.
Built on top of NumPy, Pandas provides a powerful and flexible framework for working with structured data. Whether you're a beginner or an experienced data scientist, Pandas offers essential tools to process and analyze data efficiently.
It provides two primary data structures:
- Series: A one-dimensional labeled array that can store any data type.
- DataFrame: A two-dimensional, labeled, and resizable data structure similar to spreadsheets or SQL tables.
Pandas is widely used in data science for its flexibility and expressive data structures, and it has become a fundamental tool for real-world data analysis in Python. As one of the most popular open-source data analysis libraries, it remains under active development.
Key Features of Pandas
1. Handling Missing Data
- Easily manage missing values such as NaN, NA, or NaT.
2. Resizable Data Structures
- Insert or delete columns in DataFrames and higher-dimensional objects.
3. Automatic Data Alignment
- Align objects to a set of labels explicitly, or let Pandas align data by label automatically during computations.
4. Flexible Data Grouping
- Perform aggregation and transformation operations by grouping data.
5. Extensive Data Transformation
- Convert Python and NumPy data structures into DataFrame objects effortlessly.
6. Slicing and Indexing
- Supports slicing, fancy indexing, and subsetting of large datasets.
7. Merging and Joining
- Intuitively merge and join datasets.
8. Reshaping Data
- Supports reshaping and pivoting datasets.
9. Hierarchical Labeling
- Assign multiple levels of labels to an axis (a MultiIndex).
10. Powerful I/O Tools
- Load and store data in various formats such as CSV, Excel, and databases.
11. Time Series Support
- Generate date ranges, calculate moving averages, and transform time series data.
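A few of the features above can be sketched in a handful of lines. This is a minimal illustration rather than a complete tour: the `sales` and `managers` tables, their column names, and all values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one amount is missing (NaN).
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100.0, np.nan, 250.0, 50.0],
})

# Handling missing data: count the gaps, then fill them with 0.
missing_count = int(sales["amount"].isna().sum())
filled = sales.fillna({"amount": 0.0})

# Flexible grouping: total amount per region.
totals = filled.groupby("region")["amount"].sum()

# Merging and joining: attach a manager to each row via the shared "region" column.
managers = pd.DataFrame({"region": ["East", "West"], "manager": ["Ana", "Bo"]})
merged = filled.merge(managers, on="region", how="left")

# Time series support: a daily date range and a 2-day moving average.
dates = pd.date_range("2025-01-01", periods=4, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)
moving_avg = daily.rolling(window=2).mean()

print(missing_count)
print(totals)
print(merged)
print(moving_avg)
```

Filling missing values with 0 is only one possible choice here; depending on the data, dropping the rows or filling with a mean may be more appropriate.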
Getting Started with Pandas
Pandas can be installed using pip or conda:
# PyPI
pip install pandas
# conda
conda install -c conda-forge pandas
Basic Examples
Creating a Series
A Pandas Series is a one-dimensional labeled array that can store any data type, such as integers, strings, or Python objects.
# pandas series
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
# output
# a    10
# b    20
# c    30
# d    40
# dtype: int64
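Because a Series is labeled, values can be looked up by label, and arithmetic between two Series matches on labels rather than positions. The sketch below reuses the same data as the example above; the second Series, `other`, is a made-up addition for illustration.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label-based access: a single value and a subset of labels.
value_b = series["b"]
subset = series.loc[["a", "d"]]

# Boolean filtering: keep only values greater than 20.
large = series[series > 20]

# Automatic alignment: values are added by label, not by position.
# Labels present in only one Series ("b" and "c") produce NaN.
other = pd.Series([1, 2], index=["d", "a"])
aligned = series + other

print(value_b)
print(subset)
print(large)
print(aligned)
```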
Creating a DataFrame
A DataFrame is a two-dimensional structure consisting of rows and columns, similar to an Excel or SQL table.
# pandas dataframe
import pandas as pd
data = {
    "Name": ["Kim Seoul", "Lee Jeonju", "Song Gongju"],
    "Age": [25, 30, 35],
    "City": ["Seoul", "Jeonju", "Gongju"]
}
df = pd.DataFrame(data)
print(df)
# output
#           Name  Age    City
# 0    Kim Seoul   25   Seoul
# 1   Lee Jeonju   30  Jeonju
# 2  Song Gongju   35  Gongju
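Once a DataFrame exists, typical next steps are selecting columns, filtering rows, and adding or removing columns. The following is a small sketch using the same data as above; the `AgeNextYear` column and the age threshold of 28 are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Kim Seoul", "Lee Jeonju", "Song Gongju"],
    "Age": [25, 30, 35],
    "City": ["Seoul", "Jeonju", "Gongju"],
})

# Select a single column (returns a Series).
ages = df["Age"]

# Filter rows with a boolean condition.
over_28 = df[df["Age"] > 28]

# Look up one cell by row label and column name.
first_name = df.loc[0, "Name"]

# DataFrames are resizable: add a derived column, then drop another.
df["AgeNextYear"] = df["Age"] + 1
df = df.drop(columns=["City"])

print(over_28)
print(first_name)
print(df)
```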
When to Use Pandas?
Pandas is ideal for the following tasks:
- Data Cleaning & Preprocessing
- Exploratory Data Analysis (EDA)
- Working with Time Series or Structured Data
However, for datasets too large to fit in the memory of a single machine, Dask or PySpark may be more suitable.
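The cleaning and exploration tasks listed above might look like the following in practice. This is only a sketch under assumed data: the `raw` table, its columns, and the decision to fill missing temperatures with the mean are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent text, a missing city, and a missing reading.
raw = pd.DataFrame({
    "city": [" Seoul", "seoul", "Jeonju", None],
    "temp_c": [21.5, np.nan, 19.0, 18.0],
})

# Cleaning: drop rows without a city, normalize the city names,
# and fill the missing temperature with the mean of the remaining rows.
clean = raw.dropna(subset=["city"]).assign(
    city=lambda d: d["city"].str.strip().str.title(),
    temp_c=lambda d: d["temp_c"].fillna(d["temp_c"].mean()),
)

# Exploratory analysis: summary statistics, category counts, and a grouped mean.
summary = clean["temp_c"].describe()
counts = clean["city"].value_counts()
mean_by_city = clean.groupby("city")["temp_c"].mean()

print(clean)
print(summary)
print(counts)
print(mean_by_city)
```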
Pandas remains one of the most essential libraries in the data science ecosystem, providing efficient tools for working with structured data.