How to Filter a DataFrame by Substring Criteria?

filter a row or column containing multiple strings

One of the commonly used methods for filtering textual data is looking for a substring. In this how-to article, we will learn how to filter string columns in Pandas and PySpark by using a substring.

How to Filter a DataFrame by Substring Criteria?

Pandas

We can use the contains method, which is available through the str accessor.

				
					df = df[df["Fruit"].str.contains("Apple")]


				
			

Letter cases are important because โ€œAppleโ€ and โ€œappleโ€ are not the same strings. If we are not sure of the letter cases, the safe approach is to convert all the letters to uppercase or lowercase before filtering.

				
					df = df[df["Fruit"].str.lower().str.contains("apple")]


				
			

PySpark

PySpark also has a contains method that can be used as follows:

				
					from pyspark.sql import functions as F

df = df.filter(F.col("Fruit").contains("Apple"))


				
			

Letter cases cause strings to be different in PySpark too. We can use the lower or upper function to standardize letter cases before searching for a substring.

				
					from pyspark.sql import functions as F

df = df.filter(F.lower(F.col("Fruit")).contains("apple"))


				
			

This question is also being asked as:

  • How to filter rows containing a string pattern from a Pandas DataFrame?
  • How can I search for a string in a column?

People have also asked for:

You may also like

Start Monitoring Your Models in Minutes