2023년 1월 1일
08:00 AM
Buffering ...

최근 글 👑

pandas - 6

2025. 4. 6. 18:03ㆍ개발공부/생성형 AI 기반 개발자 과정
728x90

중복 데이터 드롭하기

중복된 데이터 드롭하는 방법에 대해 알아보겠습니다.

student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Abraham', 'major': "Physics", 'sex': "male"},
                {'name': 'Brian', 'major': "Psychology", 'sex': "male"},
                {'name': 'Janny', 'major': "Economics", 'sex': "female"},
                {'name': 'Yuna', 'major': "Economics", 'sex': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex': "female"},
                {'name': 'Wendy', 'major': "Economics", 'sex': "female"},
                {'name': 'Sera', 'major': "Psychology", 'sex': "female"},
                {'name': 'John', 'major': "Computer Science", 'sex': "male"},
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name major sex
0 John Computer Science male
1 Nate Computer Science male
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Sera Psychology female
11 John Computer Science male

중복된 데이터 확인 하기

df.duplicated()
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
dtype: bool

drop_duplicates 함수로 중복 데이터를 삭제하는 예제입니다.

df = df.drop_duplicates()
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name major sex
0 John Computer Science male
1 Nate Computer Science male
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Sera Psychology female
student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Abraham', 'major': "Physics", 'sex': "male"},
                {'name': 'Brian', 'major': "Psychology", 'sex': "male"},
                {'name': 'Janny', 'major': "Economics", 'sex': "female"},
                {'name': 'Yuna', 'major': "Economics", 'sex': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex': "female"},
                {'name': 'Wendy', 'major': "Economics", 'sex': "female"},
                {'name': 'Nate', 'major': None, 'sex': "male"},
                {'name': 'John', 'major': "Computer Science", 'sex': None},
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name major sex
0 John Computer Science male
1 Nate Computer Science male
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Nate None male
11 John Computer Science None

name 컬럼이 똑같을 경우, 중복된 데이터라고 표시하라는 예제입니다.

df.duplicated(['name'])
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
dtype: bool

keep 값을 first 또는 last라고 값을 줘서 중복된 값 중, 어느값을 살릴 지 결정하실 수 있습니다.
기본적으로 first로 설정되어 있습니다.

df.drop_duplicates(['name'], keep='last')

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name major sex
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Nate None male
11 John Computer Science None
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name major sex
0 John Computer Science male
1 Nate Computer Science male
2 Abraham Physics male
3 Brian Psychology male
4 Janny Economics female
5 Yuna Economics female
6 Jeniffer Computer Science female
7 Edward Computer Science male
8 Zara Psychology female
9 Wendy Economics female
10 Nate None male
11 John Computer Science None

None 처리 하기

school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
                {'name': 'Nate', 'job': "teacher", 'age': 35},
                {'name': 'Yuna', 'job': "teacher", 'age': 37},
                {'name': 'Abraham', 'job': "student", 'age': 10},
                {'name': 'Brian', 'job': "student", 'age': 12},
                {'name': 'Janny', 'job': "student", 'age': 11},
                {'name': 'Nate', 'job': "teacher", 'age': None},
                {'name': 'John', 'job': "student", 'age': None}
         ]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher NaN
7 John student NaN

Null 또는 NaN 확인하기

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
name    8 non-null object
job     8 non-null object
age     6 non-null float64
dtypes: float64(1), object(2)
memory usage: 272.0+ bytes
df.isna()

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name job age
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False True
7 False False True
df.isnull()

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name job age
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False True
7 False False True

Null 또는 NaN 값 변경하기

아래는 Null을 0으로 설정하는 예제입니다.

tmp = df
tmp["age"] = tmp["age"].fillna(0)
tmp

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher 0.0
7 John student 0.0

0으로 설정하기 보다는 선생님의 중간 나이, 학생의 중간 나이로, 각각의 직업군에 맞게 Null값을 변경해줍니다.

# fill missing age with median age for each group (teacher, student)
df["age"].fillna(df.groupby("job")["age"].transform("median"), inplace=True)
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

name job age
0 John teacher 40.0
1 Nate teacher 35.0
2 Yuna teacher 37.0
3 Abraham student 10.0
4 Brian student 12.0
5 Janny student 11.0
6 Nate teacher 0.0
7 John student 0.0
728x90

'개발공부 > 생성형 AI 기반 개발자 과정' 카테고리의 다른 글

크롤링 기초(api 활용편 - xml)  (0) 2025.04.16
pandas - 7  (0) 2025.04.06
pandas - 5  (0) 2025.04.06
pandas - 4  (0) 2025.04.06
pandas - 3  (0) 2025.04.06