[Pandas / Basics] Pandas Data Structures - Series, DataFrame

씨주 2024. 1. 9. 21:17

📍 Pandas

: A data analysis library for Python

In [1]:

import pandas as pd

pd.__version__

Out[1]:

'2.0.3'

📍 Series

: One-dimensional data (integers, floats, strings, etc.)

✅ Dimensions

: Series.ndim

In [2]:

# A Series is a one-dimensional data structure, so this returns 1
s = pd.Series([1, 2, 3, 4])
s.ndim

Out[2]:

1

✅ Creating a Series object

: pd.Series([list], dtype)

Example: average temperature data for January through April (-20, -10, 10, 20)

In [3]:

temp = pd.Series([-20, -10, 10, 20])
print(temp)

0   -20
1   -10
2    10
3    20
dtype: int64

In [4]:

# Set the dtype to float
pd.Series([-20, -10, 10, 20], dtype='float')

Out[4]:

0   -20.0
1   -10.0
2    10.0
3    20.0
dtype: float64

In [5]:

# When the data mixes several types, the Series is created with dtype object
s = pd.Series([91, 2.5, 'sport'])
print(s)

0       91
1      2.5
2    sport
dtype: object

: np.arange()

In [6]:

import numpy as np

array = np.arange(5)
pd.Series(array)

Out[6]:

0    0
1    1
2    2
3    3
4    4
dtype: int64

: np.zeros()

In [7]:

array = np.zeros(3)
pd.Series(array)

Out[7]:

0    0.0
1    0.0
2    0.0
dtype: float64

: np.ones()

In [8]:

array = np.ones(3)
pd.Series(array)

Out[8]:

0    1.0
1    1.0
2    1.0
dtype: float64
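
A Series built from a NumPy array takes its dtype from that array, which is why np.arange() produced integers above while np.zeros() and np.ones() produced floats. The following is a minimal sketch of how the dtype can be steered on either side; the variable names are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# np.zeros() defaults to float64; passing dtype=int gives integer zeros instead
int_zeros = pd.Series(np.zeros(3, dtype=int))

# The dtype argument of pd.Series converts the data during construction
float_range = pd.Series(np.arange(5), dtype='float64')

print(int_zeros.dtype)    # an integer dtype (int64 on most platforms)
print(float_range.dtype)  # float64
```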

✅ Accessing Series elements

: Series[idx]

In [9]:

temp[0]  # January temperature

Out[9]:

-20

In [10]:

# Negative indexing fails with the default integer index: -1 is looked up as a label, not a position
temp[-1]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/range.py:345, in RangeIndex.get_loc(self, key)
    344 try:
--> 345     return self._range.index(new_key)
    346 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[7], line 2
      1 # Negative indexing fails with the default integer index
----> 2 temp[-1]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/series.py:1007, in Series.__getitem__(self, key)
   1004     return self._values[key]
   1006 elif key_is_scalar:
-> 1007     return self._get_value(key)
   1009 if is_hashable(key):
   1010     # Otherwise index.get_value will raise InvalidIndexError
   1011     try:
   1012         # For labels that don't resolve as scalars like tuples and frozensets

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/series.py:1116, in Series._get_value(self, label, takeable)
   1113     return self._values[label]
   1115 # Similar to Index.get_value, but we do not fall back to positional
-> 1116 loc = self.index.get_loc(label)
   1118 if is_integer(loc):
   1119     return self._values[loc]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/range.py:347, in RangeIndex.get_loc(self, key)
    345         return self._range.index(new_key)
    346     except ValueError as err:
--> 347         raise KeyError(key) from err
    348 if isinstance(key, Hashable):
    349     raise KeyError(key)

KeyError: -1
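
The KeyError above comes from a label lookup. When the goal is "the last element, whatever the labels are", positional access through .iloc is the explicit option. A small sketch using the same temperature values (temp_default is an illustrative name):

```python
import pandas as pd

temp_default = pd.Series([-20, -10, 10, 20])  # default index 0..3

# .iloc always works by position, so negative positions are allowed
last = temp_default.iloc[-1]

# .loc always works by label; here the labels happen to be 0..3
first = temp_default.loc[0]

print(last)   # 20
print(first)  # -20
```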

✅ Specifying a Series index

: pd.Series([list], index=[index_list])

In [11]:

temp = pd.Series([-20, -10, 10, 20], index=['Jan', 'Feb', 'Mar', 'Apr'])
temp

Out[11]:

Jan   -20
Feb   -10
Mar    10
Apr    20
dtype: int64

In [12]:

'Jan' in temp

Out[12]:

True

In [13]:

temp['Jan']

Out[13]:

-20

In [14]:

# Attribute-style access works only for string labels
temp.Jan

Out[14]:

-20

In [15]:

'Jun' in temp

Out[15]:

False

In [16]:

# Accessing a label that does not exist raises an error
temp['Jun']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Jun'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 temp['Jun']

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/series.py:1007, in Series.__getitem__(self, key)
   1004     return self._values[key]
   1006 elif key_is_scalar:
-> 1007     return self._get_value(key)
   1009 if is_hashable(key):
   1010     # Otherwise index.get_value will raise InvalidIndexError
   1011     try:
   1012         # For labels that don't resolve as scalars like tuples and frozensets

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/series.py:1116, in Series._get_value(self, label, takeable)
   1113     return self._values[label]
   1115 # Similar to Index.get_value, but we do not fall back to positional
-> 1116 loc = self.index.get_loc(label)
   1118 if is_integer(loc):
   1119     return self._values[loc]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'Jun'
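
If a missing label should not stop the program, Series.get is one way to avoid the KeyError shown above. A minimal sketch with the same temp Series; the default value 0 is an arbitrary choice for illustration:

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['Jan', 'Feb', 'Mar', 'Apr'])

# Series.get returns None (or the given default) instead of raising KeyError
missing = temp.get('Jun')
fallback = temp.get('Jun', 0)
present = temp.get('Jan')

print(missing)   # None
print(fallback)  # 0
print(present)   # -20
```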

In [17]:

# Integer positions also work here because the labels are strings (temp.iloc[0] is the explicit form)
temp[0]

Out[17]:

-20

In [18]:

# With a string index, an integer key is treated as a position, so -1 refers to the last element, 'Apr'
temp[-1]

Out[18]:

20

✅ Adding, updating, and deleting Series elements

In [19]:

# Add
temp['May'] = 5
temp

Out[19]:

Jan   -20
Feb   -10
Mar    10
Apr    20
May     5
dtype: int64

In [20]:

# Update
temp['May'] = -15
temp

Out[20]:

Jan   -20
Feb   -10
Mar    10
Apr    20
May   -15
dtype: int64

In [21]:

# Delete
del temp['May']
temp

Out[21]:

Jan   -20
Feb   -10
Mar    10
Apr    20
dtype: int64
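
The assignments and del above change temp in place. When the original Series should stay untouched, pd.concat and Series.drop return new objects instead. A short sketch under that assumption (with_may and without_jan are illustrative names):

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['Jan', 'Feb', 'Mar', 'Apr'])

# pd.concat returns a new Series with the extra element appended
with_may = pd.concat([temp, pd.Series([5], index=['May'])])

# Series.drop returns a new Series without the given label
without_jan = temp.drop('Jan')

# The original Series still has its four elements
print(len(temp), len(with_may), len(without_jan))  # 4 5 3
```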

✅ Series operations

In [22]:

temp + 10

Out[22]:

Jan   -10
Feb     0
Mar    20
Apr    30
dtype: int64

In [23]:

# Operations are applied between elements with the same index label
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([5, 6, 7, 8], index=['b', 'c', 'd', 'a'])
s1 + s2

Out[23]:

a     9
b     7
c     9
d    11
dtype: int64

In [24]:

# Labels present in both Series are computed; labels present in only one cannot be computed and become NaN
s3 = pd.Series([5, 6, 7, 8], index=['e', 'b', 'f', 'g'])
s4 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s3 - s4

Out[24]:

a    NaN
b    4.0
c    NaN
d    NaN
e    NaN
f    NaN
g    NaN
dtype: float64
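
If NaN is not the desired result for unmatched labels, the method form of the operator accepts a fill_value. A minimal sketch, assuming a missing value should count as 0:

```python
import pandas as pd

s3 = pd.Series([5, 6, 7, 8], index=['e', 'b', 'f', 'g'])
s4 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# A label missing from one side is treated as 0 instead of producing NaN
diff = s3.sub(s4, fill_value=0)

print(diff)
```

Here 'b' is still 6 - 2, while a label such as 'a' is computed as 0 - 1 and 'e' as 5 - 0.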

In [25]:

# .values extracts only the values as a NumPy array, so the labels are gone and elements at the same position are combined
s3.values - s4.values

Out[25]:

array([4, 4, 4, 4])
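
Positional arithmetic does not have to give up the Series form. One possible approach is to reset both indexes first so the labels line up by position; to_numpy() is shown alongside as another way to get the raw array. Variable names are illustrative.

```python
import pandas as pd

s3 = pd.Series([5, 6, 7, 8], index=['e', 'b', 'f', 'g'])
s4 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Dropping the labels gives both Series the same default index 0..3,
# so the subtraction lines up by position and the result stays a Series
by_position = s3.reset_index(drop=True) - s4.reset_index(drop=True)

# to_numpy() also returns the underlying values as a NumPy array
as_array = s3.to_numpy() - s4.to_numpy()

print(by_position.tolist())  # [4, 4, 4, 4]
print(as_array.tolist())     # [4, 4, 4, 4]
```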

📍 DataFrame

: Two-dimensional data made up of rows and columns (a collection of Series)
Each column has its own data type (dtype)

Example: data on eight main characters from Slam Dunk

In [26]:

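# Keys: 이름 = name, 학교 = school, 키 = height, 국어 = Korean, 영어 = English,
# 수학 = math, 과학 = science, 사회 = social studies, SW특기 = software specialty
data = {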
    '이름' : ['채치수', '정대만', '송태섭', '서태웅', '강백호', '변덕규', '황태산', '윤대협'],
    '학교' : ['북산고', '북산고', '북산고', '북산고', '북산고', '능남고', '능남고', '능남고'],
    '키' : [197, 184, 168, 187, 188, 202, 188, 190],
    '국어' : [90, 40, 80, 40, 15, 80, 55, 100],
    '영어' : [85, 35, 75, 60, 20, 100, 65, 85],
    '수학' : [100, 50, 70, 70, 10, 95, 45, 90],
    '과학' : [95, 55, 80, 75, 35, 85, 40, 95],
    '사회' : [85, 25, 75, 80, 10, 80, 35, 95],
    'SW특기' : ['Python', 'Java', 'Javascript', '', '', 'C', 'PYTHON', 'C#']
}

In [27]:

data['이름']

Out[27]:

['채치수', '정대만', '송태섭', '서태웅', '강백호', '변덕규', '황태산', '윤대협']

✅ Creating a DataFrame object

: pd.DataFrame({data})

In [28]:

df = pd.DataFrame(data)
df

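Out[28]:

    이름   학교    키   국어   영어   수학   과학   사회        SW특기
0  채치수  북산고  197   90   85  100   95   85      Python
1  정대만  북산고  184   40   35   50   55   25        Java
2  송태섭  북산고  168   80   75   70   80   75  Javascript
3  서태웅  북산고  187   40   60   70   75   80
4  강백호  북산고  188   15   20   10   35   10
5  변덕규  능남고  202   80  100   95   85   80           C
6  황태산  능남고  188   55   65   45   40   35      PYTHON
7  윤대협  능남고  190  100   85   90   95   95          C#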
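
To see the "two-dimensional, one dtype per column" description in practice, the following sketch inspects a small DataFrame. It uses a three-row subset of the data above so that it runs on its own; small is an illustrative name.

```python
import pandas as pd

# A three-row subset of the data above
small = pd.DataFrame({
    '이름': ['채치수', '정대만', '송태섭'],
    '학교': ['북산고', '북산고', '북산고'],
    '키': [197, 184, 168],
})

print(small.ndim)    # 2
print(small.shape)   # (3, 3)
print(small.dtypes)  # one dtype per column
```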

In [29]:

df['이름']

Out[29]:

0    채치수
1    정대만
2    송태섭
3    서태웅
4    강백호
5    변덕규
6    황태산
7    윤대협
Name: 이름, dtype: object

In [30]:

df[['이름', '키']]

Out[30]:

    이름    키
0  채치수  197
1  정대만  184
2  송태섭  168
3  서태웅  187
4  강백호  188
5  변덕규  202
6  황태산  188
7  윤대협  190
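
The two selections above return different types: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry. A minimal sketch on a three-row subset of the data (variable names are illustrative):

```python
import pandas as pd

small = pd.DataFrame({
    '이름': ['채치수', '정대만', '송태섭'],
    '키': [197, 184, 168],
})

single = small['이름']    # one column name -> Series
frame = small[['이름']]   # list of column names -> DataFrame

print(type(single).__name__)  # Series
print(type(frame).__name__)   # DataFrame
print(frame.shape)            # (3, 1)
```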

✅ Specifying a DataFrame index

: pd.DataFrame({data}, index=[index_list])

In [31]:

# The number of index labels must equal the number of rows in data
df = pd.DataFrame(data, index=['1번', '2번', '3번', '4번', '5번', '6번', '7번', '8번'])
df

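Out[31]:

      이름   학교    키   국어   영어   수학   과학   사회        SW특기
1번  채치수  북산고  197   90   85  100   95   85      Python
2번  정대만  북산고  184   40   35   50   55   25        Java
3번  송태섭  북산고  168   80   75   70   80   75  Javascript
4번  서태웅  북산고  187   40   60   70   75   80
5번  강백호  북산고  188   15   20   10   35   10
6번  변덕규  능남고  202   80  100   95   85   80           C
7번  황태산  능남고  188   55   65   45   40   35      PYTHON
8번  윤대협  능남고  190  100   85   90   95   95          C#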
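
The length requirement on the index can be checked directly. The sketch below passes two labels for three rows and catches the resulting ValueError; rows and mismatch_raised are illustrative names.

```python
import pandas as pd

rows = {'이름': ['채치수', '정대만', '송태섭']}

try:
    # Three rows of data but only two index labels
    pd.DataFrame(rows, index=['1번', '2번'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)
```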

✅ Specifying DataFrame columns

: pd.DataFrame({data}, columns=[column_list])

In [32]:

pd.DataFrame([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]])

Out[32]:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

In [33]:

pd.DataFrame([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]], columns=['가', '나', '다'])

Out[33]:

   가  나  다
0  1  2  3
1  4  5  6
2  7  8  9

In [34]:

# Select only the columns you want
df = pd.DataFrame(data, columns=['이름', '학교', '키'])
df

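Out[34]:

    이름   학교    키
0  채치수  북산고  197
1  정대만  북산고  184
2  송태섭  북산고  168
3  서태웅  북산고  187
4  강백호  북산고  188
5  변덕규  능남고  202
6  황태산  능남고  188
7  윤대협  능남고  190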

In [35]:

# Change the column order
df = pd.DataFrame(data, columns=['이름', '키', '학교'])
df

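Out[35]:

    이름    키   학교
0  채치수  197  북산고
1  정대만  184  북산고
2  송태섭  168  북산고
3  서태웅  187  북산고
4  강백호  188  북산고
5  변덕규  202  능남고
6  황태산  188  능남고
7  윤대협  190  능남고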
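
One detail worth knowing when passing columns together with a dict: keys that are not listed are left out, and listed names that are not keys of the dict become columns filled with NaN. A small sketch, where '몸무게' (weight) is a hypothetical column name that does not appear in the original data:

```python
import pandas as pd

rows = {'이름': ['채치수', '정대만'], '키': [197, 184]}

# '몸무게' (weight) is a hypothetical column that is not a key in rows,
# and '키' is left out because it is not listed in columns
df_extra = pd.DataFrame(rows, columns=['이름', '몸무게'])

print(df_extra.columns.tolist())        # ['이름', '몸무게']
print(df_extra['몸무게'].isna().all())  # True
```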

✅ Transposing a DataFrame

: df.T

In [36]:

df.T

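Out[36]:

      0    1    2    3    4    5    6    7
이름  채치수  정대만  송태섭  서태웅  강백호  변덕규  황태산  윤대협
키    197  184  168  187  188  202  188  190
학교  북산고  북산고  북산고  북산고  북산고  능남고  능남고  능남고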
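
Transposing swaps the row index and the column labels, which the following sketch checks on a three-row subset of the data (variable names are illustrative). Note that when the original columns hold different types, as here, each transposed column mixes strings and numbers and will typically no longer have a numeric dtype.

```python
import pandas as pd

small = pd.DataFrame({
    '이름': ['채치수', '정대만', '송태섭'],
    '키': [197, 184, 168],
})

transposed = small.T

print(small.shape)                  # (3, 2)
print(transposed.shape)             # (2, 3)
print(transposed.index.tolist())    # ['이름', '키']
print(transposed.columns.tolist())  # [0, 1, 2]
```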

Reference: 나도코딩, free Python coding course (Applications Part 5) - Data analysis and visualization, all in this one video

(https://youtu.be/PjhlUzp_cU0?si=LW_MjXLjZVY9PrUt)
