Duplicate rows in a DataFrame are a problem: they make the results of our analysis unreliable or simply wrong, and they waste memory and computation.
In this short how-to article, we will learn how to drop duplicate rows in Pandas and PySpark DataFrames.
In Pandas, we can use the drop_duplicates function for this task. By default, it drops rows that are identical, meaning the values in all of the columns are the same.
df = df.drop_duplicates()
In some cases, having the same values in certain columns is enough for rows to be considered duplicates. The subset parameter can be used to select the columns to check when detecting duplicates.
df = df.drop_duplicates(subset=["f1","f2"])
By default, the first occurrence of each set of duplicate rows is kept in the DataFrame and the others are dropped. We also have the option to keep the last occurrence.
# keep the last occurrence
df = df.drop_duplicates(subset=["f1", "f2"], keep="last")
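Putting these options together, here is a minimal self-contained sketch; the column names f1, f2, f3 and the values are made up for illustration:

import pandas as pd

# toy data: rows 0 and 1 have the same f1 and f2 values
df = pd.DataFrame({
    "f1": [1, 1, 2],
    "f2": ["a", "a", "b"],
    "f3": [10, 20, 30],
})

# keep the last of each group of duplicates on f1 and f2
df = df.drop_duplicates(subset=["f1", "f2"], keep="last")
print(df)  # row 0 is dropped; rows 1 and 2 remain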
In PySpark, the dropDuplicates function can be used for removing duplicate rows.
df = df.dropDuplicates()
It also accepts a list of columns, so that only those columns are checked when determining which rows are duplicates.
df = df.dropDuplicates(["f1","f2"])
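For completeness, here is a minimal end-to-end sketch in PySpark, assuming a local SparkSession and the same toy columns as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data: the first two rows share the same f1 and f2 values
df = spark.createDataFrame(
    [(1, "a", 10), (1, "a", 20), (2, "b", 30)],
    ["f1", "f2", "f3"],
)

# keep one row per unique (f1, f2) combination
df = df.dropDuplicates(["f1", "f2"])
df.show()

Note that, unlike Pandas, dropDuplicates has no keep parameter; which of the duplicate rows survives is not guaranteed.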