Duplicate rows in a DataFrame can make the results of our analysis unreliable or simply wrong, and they also waste memory and computation.
In this short how-to article, we will learn how to drop duplicate rows in Pandas and PySpark DataFrames.
In Pandas, we can use the drop_duplicates function for this task. By default, it drops rows that are fully identical, which means the values in all the columns are the same.
df = df.drop_duplicates()
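As a minimal sketch of the default behavior, here is drop_duplicates applied to a tiny invented DataFrame in which the first two rows are fully identical:

```python
import pandas as pd

# Sample DataFrame: rows 0 and 1 are identical in every column.
df = pd.DataFrame({"f1": [1, 1, 2], "f2": ["a", "a", "b"]})

# Drop rows whose values match in all columns; only one copy survives.
df = df.drop_duplicates()

print(df["f1"].tolist())  # [1, 2]
```

Note that drop_duplicates returns a new DataFrame by default, which is why the result is assigned back to df.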
In some cases, having the same values in certain columns is enough for rows to be considered duplicates. The subset parameter can be used to select which columns to check when detecting duplicates.
df = df.drop_duplicates(subset=["f1","f2"])
By default, the first occurrence of each set of duplicate rows is kept in the DataFrame and the others are dropped. We can also choose to keep the last occurrence instead.
# keep the last occurrence
df = df.drop_duplicates(subset=["f1","f2"], keep="last")
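Putting the two options together, a short sketch (the data and the f3 column are invented for illustration; f1 and f2 match the column names used above):

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 1, 2],
    "f2": ["a", "a", "b"],
    "f3": ["x", "y", "z"],  # differs between rows 0 and 1
})

# Rows 0 and 1 match on f1 and f2 only; keep="last" retains row 1.
df = df.drop_duplicates(subset=["f1", "f2"], keep="last")

print(df["f3"].tolist())  # ['y', 'z']
```

Without keep="last", the surviving value would have been 'x' instead of 'y', since the first occurrence is kept by default.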
In PySpark, the dropDuplicates function can be used to remove duplicate rows.
df = df.dropDuplicates()
It also accepts a list of columns to check when determining which rows are duplicates.
df = df.dropDuplicates(["f1","f2"])