[Pandas] Data Processing Practice
In [1]:
import pandas as pd
# store a frequently used path in a variable
path = './COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
In [2]:
doc.head()
Out[2]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
① Exploring a feature in detail by extracting a Series
- size : returns the number of entries, missing values included (see the toy sketch below)
- count() : returns the number of entries, excluding missing values
- unique() : returns only the distinct values
- value_counts() : returns the count of each value, excluding missing values
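A minimal toy sketch (the Series below is invented purely for illustration) makes the size vs. count() distinction concrete:

```python
import pandas as pd

s = pd.Series(['US', 'US', None, 'China'])
print(s.size)            # 4 -- every entry, missing values included
print(s.count())         # 3 -- missing values excluded
print(s.unique())        # ['US' None 'China'] -- distinct values (None included)
print(s.value_counts())  # US 2, China 1 -- missing values excluded
```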
In [3]:
countries = doc['Country_Region']
countries.head()
Out[3]:
0    US
1    US
2    US
3    US
4    US
Name: Country_Region, dtype: object
In [4]:
print (countries.size, countries.count())
2483 2483
In [5]:
# print (countries.unique(), len(countries.unique()))
print('Number of unique countries:', countries.unique().size)
Number of unique countries: 180
In [6]:
countries.value_counts()
Out[6]:
US                  2228
China                 33
Canada                15
France                10
United Kingdom        10
                    ...
Gabon                  1
Dominica               1
Diamond Princess       1
Holy See               1
Cambodia               1
Name: Country_Region, Length: 180, dtype: int64
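As a side note, value_counts(normalize=True) returns proportions instead of raw counts, which makes the US share immediately obvious (a sketch; the ratio in the comment is derived from the counts above):

```python
# Proportion of rows per country: 2228 / 2483 ≈ 0.897 for US
countries.value_counts(normalize=True).head()
```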
② Building a new DataFrame from only the needed columns
In [7]:
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
doc.head()
Out[7]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
In [8]:
# df[[list of column names]] : keep only the listed columns
covid_stat = doc[['Confirmed', 'Deaths', 'Recovered']]
covid_stat.head()
Out[8]:
| | Confirmed | Deaths | Recovered |
|---|---|---|---|
| 0 | 4 | 0 | 0 |
| 1 | 47 | 1 | 0 |
| 2 | 7 | 0 | 0 |
| 3 | 195 | 3 | 0 |
| 4 | 1 | 0 | 0 |
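Note the double brackets: a list of names inside [] returns a DataFrame, while a single name returns a Series. A quick sketch of the difference:

```python
print(type(doc['Confirmed']))    # <class 'pandas.core.series.Series'>
print(type(doc[['Confirmed']]))  # <class 'pandas.core.frame.DataFrame'>
```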
③ Selecting rows that match a specific condition
In [9]:
doc_us = doc[doc['Country_Region'] == 'US']
doc_us.head()
Out[9]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
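Conditions can also be combined with & (and), | (or), and ~ (not), with each comparison parenthesized, and .isin() covers membership tests. A sketch along the same lines:

```python
# US rows with at least 100 confirmed cases
doc_us_big = doc[(doc['Country_Region'] == 'US') & (doc['Confirmed'] >= 100)]

# Rows from any of several countries
doc_some = doc[doc['Country_Region'].isin(['US', 'China', 'Canada'])]
```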
④ Handling missing values (NaN)
- isnull() : returns True where a value is missing, False otherwise
- sum() : chained after isnull(), returns the number of missing values in each column
In [10]:
doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')
doc.isnull().head()
Out[10]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | True |
| 1 | False | False | False | False | True | True |
| 2 | False | False | False | False | True | True |
| 3 | False | False | False | False | True | True |
| 4 | False | False | False | True | True | True |
In [11]:
doc.isnull().sum()
Out[11]:
Province/State     3
Country/Region     0
Last Update        0
Confirmed          9
Deaths            37
Recovered         37
dtype: int64
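Since True counts as 1, chaining .mean() instead of .sum() gives the fraction of missing values per column, which is often easier to read (a sketch; this file has 38 rows, so Deaths and Recovered are each missing in 37/38 of them):

```python
# Fraction of missing values per column
doc.isnull().mean()
```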
- dropna(subset=[column names]) : drops rows with missing values in those columns
- fillna(value) : replaces missing values with the given value
In [12]:
doc_drop = doc.dropna(subset=['Confirmed'])
doc_drop.head()
Out[12]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 5 | Guangdong | Mainland China | 1/22/2020 17:00 | 26.0 | NaN | NaN |
In [13]:
doc_fill = doc.fillna({'Deaths' : 0})
doc_fill.head()
Out[13]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | NaN |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | NaN |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | NaN |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | NaN |
| 4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | 0.0 | NaN |
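fillna() accepts a dict mapping column names to fill values, so several columns can be handled in a single call (a sketch):

```python
# Fill all three count columns at once
doc_fill = doc.fillna({'Confirmed': 0, 'Deaths': 0, 'Recovered': 0})
```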
⑤ Combining data by a key column (Group by)
- groupby() : works like GROUP BY in SQL
In [14]:
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
doc.head()
Out[14]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
In [15]:
# Sum values grouped by the Country_Region column
# The result is a DataFrame indexed by Country_Region
doc = doc.groupby('Country_Region').sum()
doc.head()
Out[15]:
| Country_Region | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|
| Afghanistan | 0.0 | 33.93911 | 67.709953 | 237 | 4 | 5 | 228 |
| Albania | 0.0 | 41.15330 | 20.168300 | 259 | 15 | 67 | 177 |
| Algeria | 0.0 | 28.03390 | 1.659600 | 847 | 58 | 61 | 728 |
| Andorra | 0.0 | 42.50630 | 1.521800 | 390 | 14 | 10 | 366 |
| Angola | 0.0 | -11.20270 | 17.873900 | 8 | 2 | 1 | 5 |
In [16]:
# Country_Region no longer appears among the columns
doc.columns
Out[16]:
Index(['FIPS', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active'], dtype='object')
In [17]:
# the index is now Country_Region
doc.index
Out[17]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Ukraine', 'United Arab Emirates', 'United Kingdom', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Zambia',
       'Zimbabwe'],
      dtype='object', name='Country_Region', length=180)
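One caveat: sum() over every column also adds up Lat and Long_, which is meaningless (and, depending on the pandas version, non-numeric columns are either dropped or string-concatenated). Selecting the count columns before aggregating avoids this; a sketch, re-reading the file since doc was overwritten above:

```python
# Aggregate only the columns where a sum is meaningful
doc = pd.read_csv(path + '04-01-2020.csv', encoding='utf-8-sig')
doc_sum = doc.groupby('Country_Region')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum()
```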
⑥ Changing a column's type (astype)
In [18]:
doc = pd.read_csv(path + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]  # select only the needed columns
doc = doc.dropna(subset=['Confirmed'])      # drop rows missing data in a specific column
doc = doc.astype({'Confirmed': 'int64'})    # change a specific column's data type
doc.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Country/Region  29 non-null     object
 1   Confirmed       29 non-null     int64
dtypes: int64(1), object(1)
memory usage: 696.0+ bytes
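astype() raises if a column still contains values it cannot convert. When that is a concern, pd.to_numeric with errors='coerce' is a more forgiving alternative: unparseable entries become NaN and can then be dropped or filled (a sketch):

```python
# Bad values become NaN instead of raising an error
doc['Confirmed'] = pd.to_numeric(doc['Confirmed'], errors='coerce')
```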
⑦ Renaming columns
In [19]:
doc = pd.read_csv(path + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]  # select only the needed columns
In [20]:
doc.columns
Out[20]:
Index(['Country/Region', 'Confirmed'], dtype='object')
In [21]:
doc.columns = ['Country_Region', 'Confirmed']
In [22]:
doc.columns
Out[22]:
Index(['Country_Region', 'Confirmed'], dtype='object')
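Assigning to doc.columns replaces every name at once, so the list length must match the number of columns. When only some names change, rename() with a mapping is handier; a sketch that reaches the same result from the freshly loaded file:

```python
# Map old name -> new name; unlisted columns are left untouched
doc = pd.read_csv(path + '01-22-2020.csv', encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]
doc = doc.rename(columns={'Country/Region': 'Country_Region'})
```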
⑧ Checking for duplicate rows (duplicated)
In [23]:
doc = pd.read_csv(path + '../UID_ISO_FIPS_LookUp_Table.csv', encoding = 'utf-8-sig')
doc = doc[['iso2','Country_Region']]
doc
Out[23]:
| | iso2 | Country_Region |
|---|---|---|
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 3555 | US | US |
| 3556 | US | US |
| 3557 | US | US |
| 3558 | US | US |
| 3559 | US | US |

3560 rows × 2 columns
In [24]:
doc.duplicated()
Out[24]:
0       False
1       False
2       False
3       False
4       False
        ...
3555     True
3556     True
3557     True
3558     True
3559     True
Length: 3560, dtype: bool
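By default duplicated() flags every occurrence after the first; keep=False flags all members of a duplicate group instead, which is handy for actually inspecting them (a sketch):

```python
# Show every row whose (iso2, Country_Region) pair occurs more than once
doc[doc.duplicated(keep=False)]
```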
Removing duplicate rows (drop_duplicates)
In [25]:
doc = doc.drop_duplicates(subset = 'Country_Region')
doc
Out[25]:
| | iso2 | Country_Region |
|---|---|---|
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 175 | ZW | Zimbabwe |
| 199 | AU | Australia |
| 207 | CA | Canada |
| 222 | CN | China |
| 255 | AS | US |

180 rows × 2 columns
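drop_duplicates() keeps the first occurrence by default (keep='first'), which explains the odd-looking last row above: the first Country_Region == 'US' row in the lookup table is a territory with iso2 'AS' (American Samoa's code). keep='last' and keep=False change that behavior; a sketch, applied to the two-column frame before deduplication:

```python
# Keep the last occurrence of each country instead of the first
doc_last = doc.drop_duplicates(subset='Country_Region', keep='last')

# Drop every country that appears more than once
doc_once = doc.drop_duplicates(subset='Country_Region', keep=False)
```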