Drop Duplicate Rows Across Multiple Columns in a DataFrame


Duplicate rows in a DataFrame can make the results of our analysis unreliable or simply wrong, and they waste memory and computation.

In this short how-to article, we will learn how to drop duplicate rows in Pandas and PySpark DataFrames.

How to Drop Duplicate Rows Across Multiple Columns in a DataFrame?

Pandas

We can use the drop_duplicates method for this task. By default, it drops rows that are identical, meaning the values in all columns are the same.

df = df.drop_duplicates()
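As a quick illustration, here is a minimal sketch on a small made-up DataFrame. The column names f1, f2, f3 and their values are assumptions chosen for demonstration. It counts fully duplicated rows with duplicated() and then drops them. Note that the surviving rows keep their original index labels unless ignore_index=True is passed.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical in every column.
df = pd.DataFrame({
    "f1": ["a", "b", "a", "a"],
    "f2": [1, 2, 1, 1],
    "f3": [10, 20, 10, 30],
})

# duplicated() marks every row that repeats an earlier row in all columns.
n_duplicates = int(df.duplicated().sum())

# Drop the rows that repeat an earlier row in all columns.
deduped = df.drop_duplicates()

print(n_duplicates)
print(deduped)
```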

In some cases, rows should count as duplicates when they share the same values in certain columns only. The subset parameter selects the columns to compare when detecting duplicates.

df = df.drop_duplicates(subset=["f1","f2"])
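To see how subset changes the outcome, the sketch below uses a small made-up DataFrame (the data is an assumption for illustration; only the column names f1 and f2 come from the snippet above). Three rows share the same f1 and f2 values but differ in f3, so they are only treated as duplicates when the comparison is limited to f1 and f2.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 2 and 3 share the same f1 and f2 values.
df = pd.DataFrame({
    "f1": ["a", "b", "a", "a"],
    "f2": [1, 2, 1, 1],
    "f3": [10, 20, 10, 30],
})

# Comparing all columns: only row 2 repeats an earlier row.
all_columns = df.drop_duplicates()

# Comparing only f1 and f2: rows 2 and 3 both repeat row 0.
by_subset = df.drop_duplicates(subset=["f1", "f2"])

print(len(all_columns), len(by_subset))
print(by_subset)
```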

By default, the first occurrence of each set of duplicate rows is kept in the DataFrame and the others are dropped. We also have the option to keep the last occurrence instead.

# keep the last occurrence
df = df.drop_duplicates(subset=["f1","f2"], keep="last")
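The effect of the keep parameter can be compared side by side on a small made-up DataFrame (the data is an assumption for illustration). Going slightly beyond the options mentioned above, pandas also accepts keep=False, which drops every row that has a duplicate instead of keeping one of them.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 2 and 3 share the same f1 and f2 values.
df = pd.DataFrame({
    "f1": ["a", "b", "a", "a"],
    "f2": [1, 2, 1, 1],
    "f3": [10, 20, 10, 30],
})

cols = ["f1", "f2"]

# Keep the first row of each duplicate group (the default behavior).
keep_first = df.drop_duplicates(subset=cols, keep="first")

# Keep the last row of each duplicate group.
keep_last = df.drop_duplicates(subset=cols, keep="last")

# Keep no row from any duplicate group; only unique rows remain.
keep_none = df.drop_duplicates(subset=cols, keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```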

PySpark

The dropDuplicates method can be used to remove duplicate rows.

df = df.dropDuplicates()

It can also check only some of the columns when determining which rows are duplicates.

df = df.dropDuplicates(["f1","f2"])

This question is also being asked as:

How to remove duplicate values using Pandas and keep any one
Checking for duplicate data in Pandas


Aporia Team
