How to Filter a DataFrame by Substring Criteria?

Filter rows whose string values contain a given substring

One of the commonly used methods for filtering textual data is looking for a substring. In this how-to article, we will learn how to filter string columns in Pandas and PySpark by using a substring.



In Pandas, we can use the contains method, which is available through the str accessor.

    df = df[df["Fruit"].str.contains("Apple")]
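A minimal runnable sketch of this filter, using hypothetical sample data (the fruit values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Fruit": ["Apple", "Green Apple", "Banana", "apple pie"]})

# Keep only the rows whose Fruit value contains the substring "Apple".
filtered = df[df["Fruit"].str.contains("Apple")]

print(filtered["Fruit"].tolist())  # ['Apple', 'Green Apple']
```

Note that str.contains interprets its argument as a regular expression by default; a plain substring like "Apple" works as-is, but patterns containing regex metacharacters may need regex=False.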


Letter cases are important because "Apple" and "apple" are not the same strings. If we are not sure of the letter cases, the safe approach is to convert all the letters to uppercase or lowercase before filtering.

    df = df[df["Fruit"].str.lower().str.contains("apple")]
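Here is a short runnable sketch of case-insensitive matching, again with hypothetical data. It also shows two details worth knowing: str.contains accepts a case=False argument as an equivalent shortcut, and na=False prevents missing values from raising an error or appearing in the result:

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame({"Fruit": ["Apple", "apple pie", "Banana", None]})

# Lowercase before matching; na=False treats missing values as non-matches.
lowered = df[df["Fruit"].str.lower().str.contains("apple", na=False)]

# Equivalent shortcut: case-insensitive matching via case=False.
shortcut = df[df["Fruit"].str.contains("apple", case=False, na=False)]

print(lowered["Fruit"].tolist())   # ['Apple', 'apple pie']
print(shortcut["Fruit"].tolist())  # ['Apple', 'apple pie']
```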



PySpark also has a contains method that can be used as follows:

    from pyspark.sql import functions as F

    df = df.filter(F.col("Fruit").contains("Apple"))


Letter cases cause strings to be different in PySpark too. We can use the lower or upper function to standardize letter cases before searching for a substring.

    from pyspark.sql import functions as F

    df = df.filter(F.lower(F.col("Fruit")).contains("apple"))


This question is also being asked as:

  • How to filter rows containing a string pattern from a Pandas DataFrame?
  • How can I search for a string in a column?
