[Pandas] Data Processing Practice
In [1]:
import pandas as pd
# store a frequently used path in a variable
path = './COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
In [2]:
doc.head()
Out[2]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
① Exploring a feature in detail by extracting a Series
- size : returns the number of entries, missing values included (see the toy sketch below)
- count() : returns the number of entries, excluding missing values
- unique() : returns only the distinct values
- value_counts() : returns the count of each value, excluding missing values
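A minimal toy sketch (the Series below is invented purely for illustration) makes the size vs. count() distinction concrete:

```python
import pandas as pd

s = pd.Series(['US', 'US', None, 'China'])
print(s.size)            # 4 -- every entry, missing values included
print(s.count())         # 3 -- missing values excluded
print(s.unique())        # ['US' None 'China'] -- distinct values (None included)
print(s.value_counts())  # US 2, China 1 -- missing values excluded
```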
In [3]:
countries = doc['Country_Region']
countries.head()
Out[3]:
0    US
1    US
2    US
3    US
4    US
Name: Country_Region, dtype: object
In [4]:
print (countries.size, countries.count())
2483 2483
In [5]:
# print (countries.unique(), len(countries.unique()))
print('Number of unique countries:', countries.unique().size)
Number of unique countries: 180
In [6]:
countries.value_counts()
Out[6]:
US                  2228
China                 33
Canada                15
France                10
United Kingdom        10
                    ...
Gabon                  1
Dominica               1
Diamond Princess       1
Holy See               1
Cambodia               1
Name: Country_Region, Length: 180, dtype: int64
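As a side note, value_counts(normalize=True) returns proportions instead of raw counts, which makes the US share immediately obvious (a sketch; the ratio in the comment is derived from the counts above):

```python
# Proportion of rows per country: 2228 / 2483 ≈ 0.897 for US
countries.value_counts(normalize=True).head()
```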
② Building a new DataFrame from only the needed columns
In [7]:
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
doc.head()
Out[7]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
In [8]:
# df[[list of column names]] : keep only the listed columns
covid_stat = doc[['Confirmed', 'Deaths', 'Recovered']]
covid_stat.head()
Out[8]:
| | Confirmed | Deaths | Recovered |
|---|---|---|---|
| 0 | 4 | 0 | 0 |
| 1 | 47 | 1 | 0 |
| 2 | 7 | 0 | 0 |
| 3 | 195 | 3 | 0 |
| 4 | 1 | 0 | 0 |
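Note the double brackets: a list of names inside [] returns a DataFrame, while a single name returns a Series. A quick sketch of the difference:

```python
print(type(doc['Confirmed']))    # <class 'pandas.core.series.Series'>
print(type(doc[['Confirmed']]))  # <class 'pandas.core.frame.DataFrame'>
```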
③ Selecting rows that match a specific condition
In [9]:
doc_us = doc[doc['Country_Region'] == 'US']
doc_us.head()
Out[9]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
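Conditions can also be combined with & (and), | (or), and ~ (not), with each comparison parenthesized, and .isin() covers membership tests. A sketch along the same lines:

```python
# US rows with at least 100 confirmed cases
doc_us_big = doc[(doc['Country_Region'] == 'US') & (doc['Confirmed'] >= 100)]

# Rows from any of several countries
doc_some = doc[doc['Country_Region'].isin(['US', 'China', 'Canada'])]
```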
④ Handling missing values (NaN)
- isnull() : returns True where a value is missing, False otherwise
- sum() : chained after isnull(), returns the number of missing values in each column
In [10]:
doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')
doc.isnull().head()
Out[10]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | True |
| 1 | False | False | False | False | True | True |
| 2 | False | False | False | False | True | True |
| 3 | False | False | False | False | True | True |
| 4 | False | False | False | True | True | True |
In [11]:
doc.isnull().sum()
Out[11]:
Province/State     3
Country/Region     0
Last Update        0
Confirmed          9
Deaths            37
Recovered         37
dtype: int64
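Since True counts as 1, chaining .mean() instead of .sum() gives the fraction of missing values per column, which is often easier to read (a sketch; this file has 38 rows, so Deaths and Recovered are each missing in 37/38 of them):

```python
# Fraction of missing values per column
doc.isnull().mean()
```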
- dropna(subset=[column names]) : drops rows with missing values in those columns
- fillna(value) : replaces missing values with the given value
In [12]:
doc_drop = doc.dropna(subset=['Confirmed'])
doc_drop.head()
Out[12]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 5 | Guangdong | Mainland China | 1/22/2020 17:00 | 26.0 | NaN | NaN |
In [13]:
doc_fill = doc.fillna({'Deaths' : 0})
doc_fill.head()
Out[13]:
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | NaN |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | NaN |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | NaN |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | NaN |
| 4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | 0.0 | NaN |
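fillna() accepts a dict mapping column names to fill values, so several columns can be handled in a single call (a sketch):

```python
# Fill all three count columns at once
doc_fill = doc.fillna({'Confirmed': 0, 'Deaths': 0, 'Recovered': 0})
```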
⑤ Combining data by a key column (Group by)
- groupby() : works like GROUP BY in SQL
In [14]:
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
doc.head()
Out[14]:
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
In [15]:
# Sum values grouped by the Country_Region column
# The result is a DataFrame indexed by Country_Region
doc = doc.groupby('Country_Region').sum()
doc.head()
Out[15]:
| Country_Region | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|
| Afghanistan | 0.0 | 33.93911 | 67.709953 | 237 | 4 | 5 | 228 |
| Albania | 0.0 | 41.15330 | 20.168300 | 259 | 15 | 67 | 177 |
| Algeria | 0.0 | 28.03390 | 1.659600 | 847 | 58 | 61 | 728 |
| Andorra | 0.0 | 42.50630 | 1.521800 | 390 | 14 | 10 | 366 |
| Angola | 0.0 | -11.20270 | 17.873900 | 8 | 2 | 1 | 5 |
In [16]:
# Country_Region no longer appears among the columns
doc.columns
Out[16]:
Index(['FIPS', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active'], dtype='object')
In [17]:
# the index is now Country_Region
doc.index
Out[17]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Ukraine', 'United Arab Emirates', 'United Kingdom', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Zambia',
       'Zimbabwe'],
      dtype='object', name='Country_Region', length=180)
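One caveat: sum() over every column also adds up Lat and Long_, which is meaningless (and, depending on the pandas version, non-numeric columns are either dropped or string-concatenated). Selecting the count columns before aggregating avoids this; a sketch, re-reading the file since doc was overwritten above:

```python
# Aggregate only the columns where a sum is meaningful
doc = pd.read_csv(path + '04-01-2020.csv', encoding='utf-8-sig')
doc_sum = doc.groupby('Country_Region')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum()
```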
⑥ Changing a column's type (astype)
In [18]:
doc = pd.read_csv(path + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]  # select only the needed columns
doc = doc.dropna(subset=['Confirmed'])      # drop rows missing data in a specific column
doc = doc.astype({'Confirmed': 'int64'})    # change a specific column's data type
doc.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Country/Region  29 non-null     object
 1   Confirmed       29 non-null     int64
dtypes: int64(1), object(1)
memory usage: 696.0+ bytes
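astype() raises if a column still contains values it cannot convert. When that is a concern, pd.to_numeric with errors='coerce' is a more forgiving alternative: unparseable entries become NaN and can then be dropped or filled (a sketch):

```python
# Bad values become NaN instead of raising an error
doc['Confirmed'] = pd.to_numeric(doc['Confirmed'], errors='coerce')
```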
⑦ Renaming columns
In [19]:
doc = pd.read_csv(path + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]  # select only the needed columns
In [20]:
doc.columns
Out[20]:
Index(['Country/Region', 'Confirmed'], dtype='object')
In [21]:
doc.columns = ['Country_Region', 'Confirmed']
In [22]:
doc.columns
Out[22]:
Index(['Country_Region', 'Confirmed'], dtype='object')
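Assigning to doc.columns replaces every name at once, so the list length must match the number of columns. When only some names change, rename() with a mapping is handier; a sketch that reaches the same result from the freshly loaded file:

```python
# Map old name -> new name; unlisted columns are left untouched
doc = pd.read_csv(path + '01-22-2020.csv', encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]
doc = doc.rename(columns={'Country/Region': 'Country_Region'})
```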
⑧ Checking for duplicate rows (duplicated)
In [23]:
doc = pd.read_csv(path + '../UID_ISO_FIPS_LookUp_Table.csv', encoding = 'utf-8-sig')
doc = doc[['iso2','Country_Region']]
doc
Out[23]:
| | iso2 | Country_Region |
|---|---|---|
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 3555 | US | US |
| 3556 | US | US |
| 3557 | US | US |
| 3558 | US | US |
| 3559 | US | US |

3560 rows × 2 columns
In [24]:
doc.duplicated()
Out[24]:
0       False
1       False
2       False
3       False
4       False
        ...
3555     True
3556     True
3557     True
3558     True
3559     True
Length: 3560, dtype: bool
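By default duplicated() flags every occurrence after the first; keep=False flags all members of a duplicate group instead, which is handy for actually inspecting them (a sketch):

```python
# Show every row whose (iso2, Country_Region) pair occurs more than once
doc[doc.duplicated(keep=False)]
```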
Removing duplicate rows (drop_duplicates)
In [25]:
doc = doc.drop_duplicates(subset = 'Country_Region')
doc
Out[25]:
| | iso2 | Country_Region |
|---|---|---|
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 175 | ZW | Zimbabwe |
| 199 | AU | Australia |
| 207 | CA | Canada |
| 222 | CN | China |
| 255 | AS | US |

180 rows × 2 columns
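drop_duplicates() keeps the first occurrence by default (keep='first'), which explains the odd-looking last row above: the first Country_Region == 'US' row in the lookup table is a territory with iso2 'AS' (American Samoa's code). keep='last' and keep=False change that behavior; a sketch, applied to the two-column frame before deduplication:

```python
# Keep the last occurrence of each country instead of the first
doc_last = doc.drop_duplicates(subset='Country_Region', keep='last')

# Drop every country that appears more than once
doc_once = doc.drop_duplicates(subset='Country_Region', keep=False)
```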