pandas - 6

2025. 4. 6. 18:03ㆍ개발공부/생성형 AI 기반 개발자 과정

728x90

중복 데이터 드롭하기

중복된 데이터 드롭하는 방법에 대해 알아보겠습니다.

student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Abraham', 'major': "Physics", 'sex': "male"},
                {'name': 'Brian', 'major': "Psychology", 'sex': "male"},
                {'name': 'Janny', 'major': "Economics", 'sex': "female"},
                {'name': 'Yuna', 'major': "Economics", 'sex': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex': "female"},
                {'name': 'Wendy', 'major': "Economics", 'sex': "female"},
                {'name': 'Sera', 'major': "Psychology", 'sex': "female"},
                {'name': 'John', 'major': "Computer Science", 'sex': "male"},
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	major	sex
0	John	Computer Science	male
1	Nate	Computer Science	male
2	Abraham	Physics	male
3	Brian	Psychology	male
4	Janny	Economics	female
5	Yuna	Economics	female
6	Jeniffer	Computer Science	female
7	Edward	Computer Science	male
8	Zara	Psychology	female
9	Wendy	Economics	female
10	Sera	Psychology	female
11	John	Computer Science	male

중복된 데이터 확인 하기

df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
dtype: bool

drop_duplicates 함수로 중복 데이터를 삭제하는 예제입니다.

df = df.drop_duplicates()

df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	major	sex
0	John	Computer Science	male
1	Nate	Computer Science	male
2	Abraham	Physics	male
3	Brian	Psychology	male
4	Janny	Economics	female
5	Yuna	Economics	female
6	Jeniffer	Computer Science	female
7	Edward	Computer Science	male
8	Zara	Psychology	female
9	Wendy	Economics	female
10	Sera	Psychology	female

student_list = [{'name': 'John', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Abraham', 'major': "Physics", 'sex': "male"},
                {'name': 'Brian', 'major': "Psychology", 'sex': "male"},
                {'name': 'Janny', 'major': "Economics", 'sex': "female"},
                {'name': 'Yuna', 'major': "Economics", 'sex': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex': "female"},
                {'name': 'Wendy', 'major': "Economics", 'sex': "female"},
                {'name': 'Nate', 'major': None, 'sex': "male"},
                {'name': 'John', 'major': "Computer Science", 'sex': None},
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	major	sex
0	John	Computer Science	male
1	Nate	Computer Science	male
2	Abraham	Physics	male
3	Brian	Psychology	male
4	Janny	Economics	female
5	Yuna	Economics	female
6	Jeniffer	Computer Science	female
7	Edward	Computer Science	male
8	Zara	Psychology	female
9	Wendy	Economics	female
10	Nate	None	male
11	John	Computer Science	None

name 컬럼이 똑같을 경우, 중복된 데이터라고 표시하라는 예제입니다.

df.duplicated(['name'])

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
dtype: bool

keep 값을 first 또는 last라고 값을 줘서 중복된 값 중, 어느값을 살릴 지 결정하실 수 있습니다.
기본적으로 first로 설정되어 있습니다.

df.drop_duplicates(['name'], keep='last')

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	major	sex
2	Abraham	Physics	male
3	Brian	Psychology	male
4	Janny	Economics	female
5	Yuna	Economics	female
6	Jeniffer	Computer Science	female
7	Edward	Computer Science	male
8	Zara	Psychology	female
9	Wendy	Economics	female
10	Nate	None	male
11	John	Computer Science	None

df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	major	sex
0	John	Computer Science	male
1	Nate	Computer Science	male
2	Abraham	Physics	male
3	Brian	Psychology	male
4	Janny	Economics	female
5	Yuna	Economics	female
6	Jeniffer	Computer Science	female
7	Edward	Computer Science	male
8	Zara	Psychology	female
9	Wendy	Economics	female
10	Nate	None	male
11	John	Computer Science	None

None 처리 하기

school_id_list = [{'name': 'John', 'job': "teacher", 'age': 40},
                {'name': 'Nate', 'job': "teacher", 'age': 35},
                {'name': 'Yuna', 'job': "teacher", 'age': 37},
                {'name': 'Abraham', 'job': "student", 'age': 10},
                {'name': 'Brian', 'job': "student", 'age': 12},
                {'name': 'Janny', 'job': "student", 'age': 11},
                {'name': 'Nate', 'job': "teacher", 'age': None},
                {'name': 'John', 'job': "student", 'age': None}
         ]
df = pd.DataFrame(school_id_list, columns = ['name', 'job', 'age'])
df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	job	age
0	John	teacher	40.0
1	Nate	teacher	35.0
2	Yuna	teacher	37.0
3	Abraham	student	10.0
4	Brian	student	12.0
5	Janny	student	11.0
6	Nate	teacher	NaN
7	John	student	NaN

Null 또는 NaN 확인하기

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
name    8 non-null object
job     8 non-null object
age     6 non-null float64
dtypes: float64(1), object(2)
memory usage: 272.0+ bytes

df.isna()

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	job	age
0	False	False	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
5	False	False	False
6	False	False	True
7	False	False	True

df.isnull()

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	job	age
0	False	False	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
5	False	False	False
6	False	False	True
7	False	False	True

Null 또는 NaN 값 변경하기

아래는 Null을 0으로 설정하는 예제입니다.

tmp = df
tmp["age"] = tmp["age"].fillna(0)
tmp

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	job	age
0	John	teacher	40.0
1	Nate	teacher	35.0
2	Yuna	teacher	37.0
3	Abraham	student	10.0
4	Brian	student	12.0
5	Janny	student	11.0
6	Nate	teacher	0.0
7	John	student	0.0

0으로 설정하기 보다는 선생님의 중간 나이, 학생의 중간 나이로, 각각의 직업군에 맞게 Null값을 변경해줍니다.

# fill missing age with median age for each group (teacher, student)
df["age"].fillna(df.groupby("job")["age"].transform("median"), inplace=True)

df

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	name	job	age
0	John	teacher	40.0
1	Nate	teacher	35.0
2	Yuna	teacher	37.0
3	Abraham	student	10.0
4	Brian	student	12.0
5	Janny	student	11.0
6	Nate	teacher	0.0
7	John	student	0.0

728x90

'개발공부 > 생성형 AI 기반 개발자 과정' 카테고리의 다른 글

크롤링 기초(api 활용편 - xml) (0)	2025.04.16
pandas - 7 (0)	2025.04.06
pandas - 5 (0)	2025.04.06
pandas - 4 (0)	2025.04.06
pandas - 3 (0)	2025.04.06

pandas - 6

태그클라우드 이동

최근 글 👑

pandas - 6

2025. 4. 6. 18:03ㆍ개발공부/생성형 AI 기반 개발자 과정

중복 데이터 드롭하기

중복된 데이터 확인 하기

None 처리 하기

Null 또는 NaN 확인하기

Null 또는 NaN 값 변경하기

'개발공부 > 생성형 AI 기반 개발자 과정' 카테고리의 다른 글

관련글

티스토리툴바