Aporia How to's

How to Select Rows From a DataFrame Based on a List of Values?

2 min read
select rows by condition list of values pandas pyspark dataframe

Selecting rows based on a condition is a common operation in data wrangling. In this short how-to article, we will learn how to use a list of values to select rows from Pandas and PySpark DataFrames.

select rows by condition list of values pandas pyspark dataframe

Pandas

The isin method of Pandas can be used for selecting rows based on a list of conditions. We just need to write the conditions inside a Python list.

df = df[df.group.isin(["A","B","D"])]

If you are interested in selecting rows that are not in this list, you can either add a tilde (~) operator at the beginning or set the condition as False.

# Tilde operator
df = df[~df.group.isin(["A","B","D"])]

# False condition
df = df[df.group.isin(["A","B","D"])==False]

PySpark

PySpark has an isin method which works similar to that of Pandas.

df = df.filter(df.group.isin(["A","B","D"]))

Letter cases cause strings to be different in PySpark too. We can use the lower or upper function to standardize letter cases before searching for a substring.

# Tilde operator
df = df.filter(~df.group.isin(["A","B","D"]))

# False condition
df = df.filter(df.group.isin(["A","B","D"])==False)

This question is also being asked as:

  • Filter DataFrame rows if value in column is in a set list of values
  • Efficiently select rows that match one of several values in Pandas DataFrame

People have also asked for:

Green Background

Control All your GenAI Apps in minutes