Dropping duplicate data
Let's look at how to drop duplicate rows.
import pandas as pd

student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
{'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
{'name': 'Abraham', 'major': "Physics", 'sex': "male"},
{'name': 'Brian', 'major': "Psychology", 'sex': "male"},
{'name': 'Janny', 'major': "Economics", 'sex': "female"},
{'name': 'Yuna', 'major': "Economics", 'sex': "female"},
{'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
{'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
{'name': 'Zara', 'major': "Psychology", 'sex': "female"},
{'name': 'Wendy', 'major': "Economics", 'sex': "female"},
{'name': 'Sera', 'major': "Psychology", 'sex': "female"},
{'name': 'John', 'major': "Computer Science", 'sex': "male"},
]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df
| | name | major | sex |
|---|---|---|---|
| 0 | John | Computer Science | male |
| 1 | Nate | Computer Science | male |
| 2 | Abraham | Physics | male |
| 3 | Brian | Psychology | male |
| 4 | Janny | Economics | female |
| 5 | Yuna | Economics | female |
| 6 | Jeniffer | Computer Science | female |
| 7 | Edward | Computer Science | male |
| 8 | Zara | Psychology | female |
| 9 | Wendy | Economics | female |
| 10 | Sera | Psychology | female |
| 11 | John | Computer Science | male |
Checking for duplicate data
df.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
dtype: bool
The following example removes the duplicate rows with the drop_duplicates function.
df = df.drop_duplicates()
df
| | name | major | sex |
|---|---|---|---|
| 0 | John | Computer Science | male |
| 1 | Nate | Computer Science | male |
| 2 | Abraham | Physics | male |
| 3 | Brian | Psychology | male |
| 4 | Janny | Economics | female |
| 5 | Yuna | Economics | female |
| 6 | Jeniffer | Computer Science | female |
| 7 | Edward | Computer Science | male |
| 8 | Zara | Psychology | female |
| 9 | Wendy | Economics | female |
| 10 | Sera | Psychology | female |
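Before dropping anything, it can help to count or inspect the rows that duplicated() flags. The following is a minimal sketch on a shortened, hypothetical version of the student list; `keep=False` and `reset_index(drop=True)` are not used in the original example and are shown only as options.

```python
import pandas as pd

# Shortened sample data (hypothetical subset of the student list above)
students = pd.DataFrame(
    [
        {"name": "John", "major": "Computer Science", "sex": "male"},
        {"name": "Nate", "major": "Computer Science", "sex": "male"},
        {"name": "John", "major": "Computer Science", "sex": "male"},
    ]
)

# Number of rows flagged as duplicates (each True counts as 1)
duplicate_count = int(students.duplicated().sum())

# keep=False flags every member of a duplicate group, not only the later copies
all_copies = students[students.duplicated(keep=False)]

# drop_duplicates keeps the original index labels; reset_index renumbers them
deduplicated = students.drop_duplicates().reset_index(drop=True)

print(duplicate_count)
print(all_copies)
print(deduplicated)
```

Renumbering the index is optional. In the table above, the surviving rows simply keep their original labels 0 to 10.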
student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
{'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
{'name': 'Abraham', 'major': "Physics", 'sex': "male"},
{'name': 'Brian', 'major': "Psychology", 'sex': "male"},
{'name': 'Janny', 'major': "Economics", 'sex': "female"},
{'name': 'Yuna', 'major': "Economics", 'sex': "female"},
{'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
{'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
{'name': 'Zara', 'major': "Psychology", 'sex': "female"},
{'name': 'Wendy', 'major': "Economics", 'sex': "female"},
{'name': 'Nate', 'major': None, 'sex': "male"},
{'name': 'John', 'major': "Computer Science", 'sex': None},
]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df
| | name | major | sex |
|---|---|---|---|
| 0 | John | Computer Science | male |
| 1 | Nate | Computer Science | male |
| 2 | Abraham | Physics | male |
| 3 | Brian | Psychology | male |
| 4 | Janny | Economics | female |
| 5 | Yuna | Economics | female |
| 6 | Jeniffer | Computer Science | female |
| 7 | Edward | Computer Science | male |
| 8 | Zara | Psychology | female |
| 9 | Wendy | Economics | female |
| 10 | Nate | None | male |
| 11 | John | Computer Science | None |
The following example marks a row as a duplicate when its name column matches that of an earlier row.
df.duplicated(['name'])
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
dtype: bool
By passing first or last as the keep argument, you can choose which of the duplicated rows to keep.
The default is first.
df.drop_duplicates(['name'], keep='last')
| | name | major | sex |
|---|---|---|---|
| 2 | Abraham | Physics | male |
| 3 | Brian | Psychology | male |
| 4 | Janny | Economics | female |
| 5 | Yuna | Economics | female |
| 6 | Jeniffer | Computer Science | female |
| 7 | Edward | Computer Science | male |
| 8 | Zara | Psychology | female |
| 9 | Wendy | Economics | female |
| 10 | Nate | None | male |
| 11 | John | Computer Science | None |
df
| | name | major | sex |
|---|---|---|---|
| 0 | John | Computer Science | male |
| 1 | Nate | Computer Science | male |
| 2 | Abraham | Physics | male |
| 3 | Brian | Psychology | male |
| 4 | Janny | Economics | female |
| 5 | Yuna | Economics | female |
| 6 | Jeniffer | Computer Science | female |
| 7 | Edward | Computer Science | male |
| 8 | Zara | Psychology | female |
| 9 | Wendy | Economics | female |
| 10 | Nate | None | male |
| 11 | John | Computer Science | None |
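The df output above is unchanged because drop_duplicates returns a new DataFrame rather than modifying df, and the result was not assigned back. As a rough sketch of how the keep argument changes which rows survive, the example below uses a small hypothetical frame; `keep=False`, which drops every row that has a duplicate, is not covered in the original text and is included only for comparison.

```python
import pandas as pd

# Hypothetical sample with repeated names
people = pd.DataFrame(
    {
        "name": ["John", "Nate", "Abraham", "Nate", "John"],
        "major": ["Computer Science", "Computer Science", "Physics", None, "Computer Science"],
    }
)

# Keep the first occurrence of each name (the default)
first_kept = people.drop_duplicates(["name"], keep="first")

# Keep the last occurrence of each name
last_kept = people.drop_duplicates(["name"], keep="last")

# Drop every row whose name appears more than once
none_kept = people.drop_duplicates(["name"], keep=False)

print(list(first_kept.index))
print(list(last_kept.index))
print(list(none_kept.index))
print(len(people))  # the original frame still has all of its rows
```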
Handling None
school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
{'name': 'Nate', 'job': "teacher", 'age': 35},
{'name': 'Yuna', 'job': "teacher", 'age': 37},
{'name': 'Abraham', 'job': "student", 'age': 10},
{'name': 'Brian', 'job': "student", 'age': 12},
{'name': 'Janny', 'job': "student", 'age': 11},
{'name': 'Nate', 'job': "teacher", 'age': None},
{'name': 'John', 'job': "student", 'age': None}
]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df
| | name | job | age |
|---|---|---|---|
| 0 | John | teacher | 40.0 |
| 1 | Nate | teacher | 35.0 |
| 2 | Yuna | teacher | 37.0 |
| 3 | Abraham | student | 10.0 |
| 4 | Brian | student | 12.0 |
| 5 | Janny | student | 11.0 |
| 6 | Nate | teacher | NaN |
| 7 | John | student | NaN |
Checking for Null or NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
name 8 non-null object
job 8 non-null object
age 6 non-null float64
dtypes: float64(1), object(2)
memory usage: 272.0+ bytes
df.isna()
| | name | job | age |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
| 6 | False | False | True |
| 7 | False | False | True |
df.isnull()
| | name | job | age |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
| 6 | False | False | True |
| 7 | False | False | True |
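The two tables above are identical because isnull is an alias of isna. On larger frames a full boolean table is hard to read, so a per-column count is often more practical. The sketch below assumes a small hypothetical staff frame and shows one way to summarize the missing values.

```python
import pandas as pd

# Hypothetical sample with one missing age
staff = pd.DataFrame(
    {
        "name": ["John", "Nate", "Nate"],
        "job": ["teacher", "teacher", "teacher"],
        "age": [40, 35, None],
    }
)

# Count missing values per column (each True counts as 1)
missing_per_column = staff.isna().sum()

# Select only the rows that contain at least one missing value
rows_with_missing = staff[staff.isna().any(axis=1)]

# isna and isnull return the same result
same_result = staff.isna().equals(staff.isnull())

print(missing_per_column)
print(rows_with_missing)
print(same_result)
```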
Replacing Null or NaN values
The example below sets Null values to 0.
tmp = df.copy()  # copy, so that df itself keeps its NaN values
tmp["age"] = tmp["age"].fillna(0)
tmp
| | name | job | age |
|---|---|---|---|
| 0 | John | teacher | 40.0 |
| 1 | Nate | teacher | 35.0 |
| 2 | Yuna | teacher | 37.0 |
| 3 | Abraham | student | 10.0 |
| 4 | Brian | student | 12.0 |
| 5 | Janny | student | 11.0 |
| 6 | Nate | teacher | 0.0 |
| 7 | John | student | 0.0 |
Rather than setting the missing ages to 0, fill each one with the median age of its job group: the teachers' median age for teachers and the students' median age for students.
# fill missing age with median age for each group (teacher, student)
df["age"] = df["age"].fillna(df.groupby("job")["age"].transform("median"))
df
| | name | job | age |
|---|---|---|---|
| 0 | John | teacher | 40.0 |
| 1 | Nate | teacher | 35.0 |
| 2 | Yuna | teacher | 37.0 |
| 3 | Abraham | student | 10.0 |
| 4 | Brian | student | 12.0 |
| 5 | Janny | student | 11.0 |
| 6 | Nate | teacher | 37.0 |
| 7 | John | student | 11.0 |
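Note that fillna only changes values that are still missing, so the group-wise fill has to run on data whose age column has not already been replaced with 0. The following self-contained sketch rebuilds the school data under a hypothetical name (`school`) and shows the intermediate per-group medians that transform("median") produces.

```python
import pandas as pd

# Same values as the school_id_list example, rebuilt so that age still holds NaN
school = pd.DataFrame(
    {
        "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Janny", "Nate", "John"],
        "job": ["teacher", "teacher", "teacher", "student",
                "student", "student", "teacher", "student"],
        "age": [40, 35, 37, 10, 12, 11, None, None],
    }
)

# Median age of each job group, repeated once per row (NaN is skipped)
group_median = school.groupby("job")["age"].transform("median")

# Work on a copy so that the original frame keeps its NaN values
filled = school.copy()
filled["age"] = filled["age"].fillna(group_median)

print(group_median.tolist())
print(filled)
```

Assigning the result back to the column, instead of calling fillna with inplace=True on a selected column, avoids relying on chained in-place modification.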