Pandas

[Pandas / Basics] Inspecting data in pandas - describe, info, head, tail, values, index, columns, shape, astype, cat, to_numeric, cut, qcut

씨주 2024. 1. 10. 05:54

📍 Inspecting DataFrame data

✅ Opening an Excel file

: pd.read_excel('filename.xlsx', index_col='column')

In [1]:

import pandas as pd

df = pd.read_excel('score.xlsx', index_col='지원번호') # use the 지원번호 (applicant number) column as the index
df

Out[1]:
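score.xlsx itself is not included with this post. To follow along without the file, a minimal stand-in can be rebuilt from the values, index, and column names printed later in Out[9], Out[10], and Out[11]. This is only a sketch under that assumption; the real file may hold more than what those outputs show.

```python
import numpy as np
import pandas as pd

# Stand-in for score.xlsx, rebuilt from the values shown in Out[9]-Out[11].
columns = ['이름', '학교', '키', '국어', '영어', '수학', '과학', '사회', 'SW특기']
rows = [
    ['채치수', '북산고', 197, 90, 85, 100, 95, 85, 'Python'],
    ['정대만', '북산고', 184, 40, 35, 50, 55, 25, 'Java'],
    ['송태섭', '북산고', 168, 80, 75, 70, 80, 75, 'Javascript'],
    ['서태웅', '북산고', 187, 40, 60, 70, 75, 80, np.nan],
    ['강백호', '북산고', 188, 15, 20, 10, 35, 10, np.nan],
    ['변덕규', '능남고', 202, 80, 100, 95, 85, 80, 'C'],
    ['황태산', '능남고', 188, 55, 65, 45, 40, 35, 'PYTHON'],
    ['윤대협', '능남고', 190, 100, 85, 90, 95, 95, 'C#'],
]
# Index labels '1번' .. '8번', named '지원번호' as in Out[10].
index = pd.Index([f'{i}번' for i in range(1, 9)], name='지원번호')

df = pd.DataFrame(rows, columns=columns, index=index)
print(df.shape)  # (8, 9)
```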

✅ describe()

: Provides summary statistics for the data

count : number of non-missing values
mean : mean
std : standard deviation
min : minimum
25% : first quartile
50% : second quartile (median)
75% : third quartile
max : maximum

In [2]:

df.describe()

Out[2]:

: df.describe(include='object')

Shows summary statistics (count, unique, top, freq) for object dtype (string) columns

In [3]:

df.describe(include='object')
# unique : number of distinct values (e.g. 북산고 and 능남고 in the 학교 column)

Out[3]:
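To make the two kinds of summary concrete, here is a small sketch on made-up data (the `sample` frame and its `school` and `score` columns are hypothetical, not taken from score.xlsx). The text column is created with an explicit object dtype so that `include='object'` selects it.

```python
import pandas as pd

# Hypothetical data: one text column with a missing value, one numeric column.
sample = pd.DataFrame({
    'school': pd.Series(['A', 'A', 'B', None], dtype='object'),
    'score': [90, 40, 80, 70],
})

numeric_summary = sample.describe()                # numeric columns only
text_summary = sample.describe(include='object')   # object (string) columns only

print(numeric_summary.loc['mean', 'score'])  # 70.0
# count skips the missing value; top is the most frequent value, freq its count.
print(text_summary['school'].to_dict())
```

Here count is 3 rather than 4 because the missing value is excluded, unique is 2, top is 'A', and freq is 2.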

✅ info()

: Shows an overview of each column (non-null count and dtype) along with memory usage

In [4]:

df.info()

✅ head(n)

: Returns the first n rows of the DataFrame

In [5]:

df.head() # n defaults to 5

Out[5]:

In [6]:

df.head(7)

Out[6]:

✅ tail(n)

: Returns the last n rows of the DataFrame

In [7]:

df.tail() # n defaults to 5

Out[7]:

In [8]:

df.tail(3)

Out[8]:
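head and tail also accept a negative n, which means "everything except the last (or first) |n| rows". A short sketch on a hypothetical ten-row frame:

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})  # hypothetical data: 0..9

first_three = numbers.head(3)         # rows 0, 1, 2
all_but_last_two = numbers.head(-2)   # rows 0..7
last_three = numbers.tail(3)          # rows 7, 8, 9
all_but_first_two = numbers.tail(-2)  # rows 2..9

print(all_but_last_two['n'].tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]
print(all_but_first_two['n'].tolist())  # [2, 3, 4, 5, 6, 7, 8, 9]
```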

✅ values

: Extracts the values of the DataFrame

The return type is a NumPy array

In [9]:

df.values

Out[9]:

array([['채치수', '북산고', 197, 90, 85, 100, 95, 85, 'Python'],
       ['정대만', '북산고', 184, 40, 35, 50, 55, 25, 'Java'],
       ['송태섭', '북산고', 168, 80, 75, 70, 80, 75, 'Javascript'],
       ['서태웅', '북산고', 187, 40, 60, 70, 75, 80, nan],
       ['강백호', '북산고', 188, 15, 20, 10, 35, 10, nan],
       ['변덕규', '능남고', 202, 80, 100, 95, 85, 80, 'C'],
       ['황태산', '능남고', 188, 55, 65, 45, 40, 35, 'PYTHON'],
       ['윤대협', '능남고', 190, 100, 85, 90, 95, 95, 'C#']], dtype=object)
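The array above has dtype=object because the columns mix strings and integers. The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for new code; the sketch below, on a hypothetical two-column frame, shows how the resulting dtype depends on which columns are selected.

```python
import pandas as pd

# Hypothetical frame mixing a text column and an integer column.
mixed = pd.DataFrame({'name': ['a', 'b'], 'height': [197, 184]})

whole = mixed.to_numpy()                   # mixed columns -> object array
numeric_only = mixed[['height']].to_numpy()  # integer column -> integer array

print(whole.dtype)         # object
print(numeric_only.shape)  # (2, 1)
```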

✅ index

: Extracts the index of the DataFrame

In [10]:

df.index

Out[10]:

Index(['1번', '2번', '3번', '4번', '5번', '6번', '7번', '8번'], dtype='object', name='지원번호')

✅ columns

: Extracts the column labels of the DataFrame

In [11]:

df.columns

Out[11]:

Index(['이름', '학교', '키', '국어', '영어', '수학', '과학', '사회', 'SW특기'], dtype='object')

✅ shape

: Returns the dimensions as a (rows, columns) tuple

In [12]:

df.shape # row, column

Out[12]:

(8, 9)
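Because shape is a plain tuple, it can be unpacked directly. A small sketch on a hypothetical frame, with `len` and `size` for comparison:

```python
import pandas as pd

table = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # hypothetical data

n_rows, n_cols = table.shape   # unpack (rows, columns)
print(n_rows, n_cols)          # 3 2
print(len(table))              # 3, the number of rows
print(table.size)              # 6, rows * columns
```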

✅ astype

: Converts a column to another type such as int32, float32, str (object), or category

In [13]:

df.dtypes

Out[13]:

이름      object
학교      object
키        int64
국어       int64
영어       int64
수학       int64
과학       int64
사회       int64
SW특기    object
dtype: object

In [14]:

df['학교'] = df['학교'].astype('category')
df['학교'].head()

Out[14]:

지원번호
1번    북산고
2번    북산고
3번    북산고
4번    북산고
5번    북산고
Name: 학교, dtype: category
Categories (2, object): ['능남고', '북산고']
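astype also accepts a dict mapping column names to target types, which converts several columns in one call and returns a new DataFrame. A sketch on a hypothetical `people` frame:

```python
import pandas as pd

# Hypothetical data using two of the school names from this post.
people = pd.DataFrame({
    'school': ['북산고', '능남고', '북산고'],
    'height': [197, 202, 184],
})

# Convert several columns at once; the original frame is left unchanged.
converted = people.astype({'school': 'category', 'height': 'float32'})

print(converted.dtypes)
```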

✅ cat

: df['column'].cat.codes

Converts each category to its integer code (its position in the category list)

In [15]:

df['학교'].cat.codes

Out[15]:

지원번호
1번    1
2번    1
3번    1
4번    1
5번    1
6번    0
7번    0
8번    0
dtype: int8

: df['column'].cat.categories

Lists the categories in code order, so the category at position i is the one with code i

In [16]:

[df['학교'].cat.categories]

Out[16]:

[Index(['능남고', '북산고'], dtype='object')]
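The relationship between codes and categories can be written out as a lookup table. The sketch below uses a hypothetical three-row Series; it also shows that a missing value gets the code -1.

```python
import pandas as pd

schools = pd.Series(['북산고', '능남고', '북산고'], dtype='category')

# Position in .cat.categories == integer code.
code_to_name = dict(enumerate(schools.cat.categories))
print(code_to_name)                  # {0: '능남고', 1: '북산고'}
print(schools.cat.codes.tolist())    # [1, 0, 1]

# Missing values are not a category; their code is -1.
with_missing = pd.Series(['a', None], dtype='category')
print(with_missing.cat.codes.tolist())  # [0, -1]
```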

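: df['column'].cat.rename_categories(new_names)

Renames the categories. new_names is a list with one new name per existing category, in category order

In [17]:

# the new names here are the existing names, so the categories stay the same
df['학교'] = df['학교'].cat.rename_categories([g for g in df['학교'].cat.categories])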
df['학교'].value_counts()

Out[17]:

학교
북산고    5
능남고    3
Name: count, dtype: int64
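For a rename that actually changes the labels, rename_categories also accepts a dict of old name to new name. In this sketch the English names 'Buksan' and 'Neungnam' are made up for illustration.

```python
import pandas as pd

schools = pd.Series(['북산고', '능남고', '북산고'], dtype='category')

# Map old category names to new ones (the new names are hypothetical).
renamed = schools.cat.rename_categories({'북산고': 'Buksan', '능남고': 'Neungnam'})

print(renamed.tolist())            # ['Buksan', 'Neungnam', 'Buksan']
print(renamed.cat.codes.tolist())  # [1, 0, 1], codes are unchanged
```

Only the labels change; the underlying integer codes stay as they were.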

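✅ Converting to a numerical type

: pd.to_numeric(arg, errors='raise' / 'coerce' / 'ignore')

By default an error is raised when the data contains a string that cannot be converted to a number (existing NaN values do not cause an error)

errors='raise' : raises an error and stops the code (default)
errors='coerce' : replaces invalid strings with NaN
errors='ignore' : invalid strings are not converted; the input is returned unchanged, so the whole column keeps its object dtype (this option is deprecated as of pandas 2.2)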
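The difference between the default and 'coerce' can be seen on a tiny hypothetical Series that contains the same `\N` marker as the bicycle data. 'ignore' is left out of this sketch because it is deprecated in recent pandas versions.

```python
import pandas as pd

# Hypothetical values: two numeric strings and one unparseable marker.
raw = pd.Series(['61.82', '0', r'\N'], dtype='object')

raised = False
try:
    pd.to_numeric(raw)            # errors='raise' is the default
except ValueError as exc:
    raised = True
    print('raise:', exc)

coerced = pd.to_numeric(raw, errors='coerce')  # invalid string -> NaN
print(coerced.isna().tolist())    # [False, False, True]
```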

In [18]:

df = pd.read_csv('seoul_bicycle.csv')
df.head()

Out[18]:

In [19]:

# 운동량 (exercise amount) looks like a numeric column, but its dtype is object
df['운동량'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 327231 entries, 0 to 327230
Series name: 운동량
Non-Null Count   Dtype 
--------------   ----- 
327231 non-null  object
dtypes: object(1)
memory usage: 2.5+ MB

In [20]:

pd.to_numeric(df['운동량'])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2280, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 pd.to_numeric(df['운동량'])

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/tools/numeric.py:217, in to_numeric(arg, errors, downcast, dtype_backend)
    215 coerce_numeric = errors not in ("ignore", "raise")
    216 try:
--> 217     values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]  # noqa
    218         values,
    219         set(),
    220         coerce_numeric=coerce_numeric,
    221         convert_to_masked_nullable=dtype_backend is not lib.no_default
    222         or isinstance(values_dtype, StringDtype),
    223     )
    224 except (ValueError, TypeError):
    225     if errors == "raise":

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2322, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N" at position 2344

In [21]:

# the error was raised at position 2344
df.loc[2344]

Out[21]:

대여일자      Jan-20-2020
대여소번호             165
대여소명      165. 중앙근린공원
대여구분코드         일일(회원)
성별                 \N
연령대코드         AGE_003
이용건수                1
운동량                \N
탄소량                \N
이동거리              0.0
이용시간               40
Name: 2344, dtype: object

In [22]:

pd.to_numeric(df['운동량'], errors='ignore')

Out[22]:

0           61.82
1           39.62
2          430.85
3            1.79
4         4501.96
           ...   
327226     689.57
327227          0
327228      19.96
327229      43.77
327230    4735.63
Name: 운동량, Length: 327231, dtype: object

In [23]:

# with errors='ignore' the invalid value is left as it is
pd.to_numeric(df['운동량'], errors='ignore').loc[2344]

Out[23]:

'\\N'

In [24]:

df['운동량']=pd.to_numeric(df['운동량'], errors='coerce')
df['운동량']

Out[24]:

0           61.82
1           39.62
2          430.85
3            1.79
4         4501.96
           ...   
327226     689.57
327227       0.00
327228      19.96
327229      43.77
327230    4735.63
Name: 운동량, Length: 327231, dtype: float64

In [25]:

# with errors='coerce' the invalid value is replaced with NaN
pd.to_numeric(df['운동량'], errors='coerce').loc[2344]

Out[25]:

nan
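Before coercing, it can help to see which values would be lost, and the `\N` marker can also be declared as missing when the file is read. The sketch below uses a small in-memory CSV as a stand-in for seoul_bicycle.csv (the three rows are made up; only the two column names come from the data above).

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for seoul_bicycle.csv.
csv_text = "운동량,이용시간\n61.82,10\n\\N,40\n0,5\n"

# 1) Find the values that to_numeric cannot parse (non-missing before, NaN after).
raw = pd.read_csv(io.StringIO(csv_text))
col = raw['운동량']
unparseable = pd.to_numeric(col, errors='coerce').isna() & col.notna()
print(col[unparseable].tolist())  # the single \N entry

# 2) Or treat \N as missing while reading, so the column is numeric from the start.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['\\N'])
print(clean['운동량'].isna().tolist())  # [False, True, False]
```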

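✅ cut

: Splits numeric values into intervals (binning)

Used to turn continuous values into categories

pd.cut(x, bins=[bin_edges] or int, labels=[labels_list], right=True/False)

bins is either a list of bin edges or the number of bins
labels must contain one fewer item than the list of bin edges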
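A minimal sketch of these rules on hypothetical values that sit exactly on the bin edges, which is where the `right` option matters. The labels 'low' and 'high' are made up.

```python
import pandas as pd

values = pd.Series([0, 200, 400])   # hypothetical values on the edges
edges = [0, 200, 400]               # 3 edges -> 2 bins -> 2 labels
labels = ['low', 'high']

# right=True (default): bins are (0, 200] and (200, 400]
right_closed = pd.cut(values, edges, labels=labels)
# right=False: bins are [0, 200) and [200, 400)
left_closed = pd.cut(values, edges, labels=labels, right=False)

print(right_closed.tolist())  # [nan, 'low', 'high']
print(left_closed.tolist())   # ['low', 'high', nan]
```

A value that falls outside every bin, such as 0 with right=True or 400 with right=False, becomes NaN.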

In [26]:

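# with right=False each bin excludes its right edge: [0, 200), [200, 400), [400, max)
# note: the maximum value itself then falls outside the last bin and becomes NaN
# labels: 운동부족 = too little exercise, 보통 = moderate, 많음 = a lot
bins = [0, 200, 400, df['운동량'].max()]
labels = ['운동부족', '보통', '많음']
pd.cut(df['운동량'], bins, labels=labels, right=False)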

Out[26]:

0         운동부족
1         운동부족
2           많음
3         운동부족
4           많음
          ... 
327226      많음
327227    운동부족
327228    운동부족
327229    운동부족
327230      많음
Name: 운동량, Length: 327231, dtype: category
Categories (3, object): ['운동부족' < '보통' < '많음']
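With right=False and the column maximum as the last edge, the row holding the maximum is not assigned to any bin. One common way around this, shown here as a sketch on hypothetical values, is to use np.inf as the last edge so the top bin is open-ended.

```python
import numpy as np
import pandas as pd

amount = pd.Series([0.0, 199.9, 200.0, 400.0, 5000.0])  # hypothetical values
labels = ['운동부족', '보통', '많음']

# Last edge = max: the maximum is excluded from [400, max) and becomes NaN.
capped = pd.cut(amount, [0, 200, 400, amount.max()], labels=labels, right=False)
# Last edge = inf: the top bin [400, inf) keeps every large value.
open_ended = pd.cut(amount, [0, 200, 400, np.inf], labels=labels, right=False)

print(int(capped.isna().sum()))  # 1
print(open_ended.tolist())       # ['운동부족', '운동부족', '보통', '많음', '많음']
```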

In [27]:

# with an integer, the range of the values is split into that many equal-width bins
df['운동량_cut'] = pd.cut(df['운동량'].values, bins=10)

In [28]:

df['운동량_cut'].value_counts()

Out[28]:

운동량_cut
(-163936.052, 16393605.23]      326816
(98361631.38, 114755236.61]          9
(32787210.46, 49180815.69]           2
(16393605.23, 32787210.46]           1
(114755236.61, 131148841.84]         1
(147542447.07, 163936052.3]          1
(49180815.69, 65574420.92]           0
(65574420.92, 81968026.15]           0
(81968026.15, 98361631.38]           0
(131148841.84, 147542447.07]         0
Name: count, dtype: int64
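Almost every row lands in the first bin because the bins have equal width and a few very large values stretch the range. The same effect can be reproduced on a small hypothetical Series; `retbins=True` additionally returns the computed edges.

```python
import numpy as np
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 1000])  # hypothetical data with one outlier

binned, edges = pd.cut(skewed, bins=4, retbins=True)

# Bin number per row: everything except the outlier is in bin 0.
print(binned.cat.codes.tolist())  # [0, 0, 0, 0, 0, 3]
# Widths of the bins after the first (the lowest edge is nudged down slightly
# so that the minimum value is included).
print(np.diff(edges[1:]))
```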

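✅ qcut

: Splits the data into bins that hold the same number of rows

Quantile-based: the bin edges are placed at quantiles so that the number of rows per bin stays as equal as possible

pd.qcut(x, q=int or [quantile_list], labels=[labels_list])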

In [29]:

# split into q bins that each hold (roughly) the same number of rows
df['운동량_qcut'] = pd.qcut(df['운동량'], q=10)

In [30]:

df['운동량_qcut'].value_counts()

Out[30]:

운동량_qcut
(93.414, 192.02]           32690
(-0.001, 24.737]           32683
(24.737, 93.414]           32683
(601.705, 1079.744]        32683
(1079.744, 1889.606]       32683
(1889.606, 3328.186]       32683
(3328.186, 6805.188]       32683
(6805.188, 163936052.3]    32683
(344.45, 601.705]          32680
(192.02, 344.45]           32679
Name: count, dtype: int64
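The contrast between cut (equal width) and qcut (equal count) is easiest to see side by side. A sketch on a hypothetical skewed Series:

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 1000])  # hypothetical data

equal_width = pd.cut(skewed, bins=2)   # splits the value range in half
equal_count = pd.qcut(skewed, q=2)     # splits the rows in half (at the median)

print(equal_width.cat.codes.tolist())  # [0, 0, 0, 0, 0, 0, 0, 1]
print(equal_count.cat.codes.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```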

In [31]:

# split at the given quantiles: bottom 20%, middle 60%, top 20%
qcut_bins = [0, 0.2, 0.8, 1]
qcut_labels = ['적음', '보통', '많음']
pd.qcut(df['운동량'], qcut_bins, labels=qcut_labels).value_counts()

Out[31]:

운동량
보통    196098
적음     65366
많음     65366
Name: count, dtype: int64
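When many rows share the same value, two quantiles can coincide and qcut raises an error about non-unique bin edges. That did not happen with the data above, but as a sketch on hypothetical data, the `duplicates='drop'` option merges the repeated edges instead.

```python
import pandas as pd

many_zeros = pd.Series([0, 0, 0, 0, 1, 2, 3, 4])  # hypothetical data with ties

failed = False
try:
    pd.qcut(many_zeros, q=4)   # the 0% and 25% quantiles are both 0
except ValueError as exc:
    failed = True
    print('error:', exc)

# Drop the duplicated edge; this leaves 3 bins instead of 4.
merged = pd.qcut(many_zeros, q=4, duplicates='drop')
print(merged.cat.codes.tolist())  # [0, 0, 0, 0, 1, 1, 2, 2]
```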

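📍 Inspecting Series data

✅ describe()

: Provides summary statistics for the data

count : number of non-missing values
mean : mean
std : standard deviation
min : minimum
25% : first quartile
50% : second quartile (median)
75% : third quartile
max : maximum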

In [33]:

# assumes df is the score DataFrame from In [1]; reload it if df now holds the bicycle data
df['키'].describe()

Out[33]:
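As a self-contained sketch, the same call can be run on the eight heights shown in the 키 column of Out[9]. The `percentiles` argument requests extra percentiles in addition to the defaults.

```python
import pandas as pd

# Heights (키) copied from the values printed in Out[9].
heights = pd.Series([197, 184, 168, 187, 188, 202, 188, 190], name='키')

summary = heights.describe()
print(summary['count'], summary['min'], summary['max'])  # 8.0 168.0 202.0
print(summary['mean'])  # 188.0

# Ask for the 10th and 90th percentiles as well.
wide = heights.describe(percentiles=[0.1, 0.9])
print(wide.index.tolist())
```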

Reference : 나도코딩, free Python coding lecture (Applications Part 5) - data analysis and visualization in a single video

(https://youtu.be/PjhlUzp_cU0?si=LW_MjXLjZVY9PrUt)
