The number of distinct values in an attribute (i.e. a column) often matters in data analytics, visualization, and modeling. In this short how-to article, we will learn how to find the distinct values in a column of a Pandas or PySpark DataFrame, and how to count them.
Pandas
The unique function returns an array that contains the distinct values in a column, whereas the nunique function returns the number of distinct values.
# distinct values
df["Brand"].unique()
# number of distinct values
df["Brand"].nunique()
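The two calls above treat missing values differently: unique keeps NaN in its result, while nunique skips it unless dropna=False is passed. The sketch below illustrates this on a small made-up DataFrame (the data and the "Year" column are assumptions for illustration only), and also shows that calling nunique on the whole DataFrame gives the count for every column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the "Brand" column name comes from the article.
df = pd.DataFrame(
    {
        "Brand": ["Ford", "Toyota", "Ford", np.nan, "BMW"],
        "Year": [2015, 2018, 2015, 2020, 2018],
    }
)

# unique() keeps the missing value and preserves order of first appearance
brands = df["Brand"].unique()

# nunique() ignores missing values by default
n_brands = df["Brand"].nunique()

# pass dropna=False to count the missing value as its own category
n_brands_with_nan = df["Brand"].nunique(dropna=False)

# DataFrame.nunique() returns a Series with one distinct count per column
per_column = df.nunique()

# value_counts() shows how often each distinct (non-missing) value occurs
counts = df["Brand"].value_counts()
```

If the frequencies matter as well as the distinct values, value_counts is usually the more convenient starting point.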
PySpark
We can see the distinct values in a column using the distinct function as follows:
df.select("name").distinct().show()
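To count the number of distinct values, PySpark provides a function called countDistinct. Since PySpark 3.2 it is also available under the name count_distinct, and countDistinct is kept as an alias.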
from pyspark.sql import functions as F
df.select(F.countDistinct("name")).show()