When a function is applied to a column of a DataFrame, it operates on the value in every row of that column. In this short how-to article, we will learn how to apply a function to two columns in Pandas and PySpark DataFrames.
Consider a user-defined function f that takes two input values. In Pandas, we can apply it to two columns of a DataFrame using the apply method with a lambda expression. If the function returns a single value, we can even store the results in a new column.
Let’s say we have first name and last name columns and want to create a new column containing the initials. The find_initials function defined below performs this operation. We can apply it to the first name and last name columns as follows:
# Defining the function
def find_initials(fname, lname):
    return fname[0] + lname[0]

# Applying it to two columns
df["initials"] = df.apply(lambda x: find_initials(x["fname"], x["lname"]), axis=1)
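As a quick sketch of what this looks like end to end (the sample data below is made up for illustration):

```python
import pandas as pd

# Hypothetical sample data with the fname and lname columns used above
df = pd.DataFrame({
    "fname": ["John", "Jane"],
    "lname": ["Doe", "Smith"],
})

def find_initials(fname, lname):
    return fname[0] + lname[0]

# apply with axis=1 passes each row to the lambda, which forwards
# the two column values to find_initials
df["initials"] = df.apply(lambda x: find_initials(x["fname"], x["lname"]), axis=1)

print(df["initials"].tolist())  # ['JD', 'JS']
```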
In PySpark, the operations are similar, but we need an additional step to create a user-defined function (UDF).
# Importing necessary modules
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Defining the function
def take_initials(fname, lname):
    return fname[0] + lname[0]

# Creating a UDF (named so it does not shadow F.udf)
initials_udf = F.udf(take_initials, StringType())

# Applying it to two columns
df = df.withColumn("initials", initials_udf(F.col("fname"), F.col("lname")))
It is important to note that row-wise operations are expensive in both Pandas and PySpark and should be avoided whenever an alternative exists. The preferred alternative is to use vectorized operations.
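For the initials example, a vectorized sketch in Pandas looks like this (same hypothetical column names as above); the .str accessor operates on the whole column at once instead of looping over rows in Python:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "fname": ["John", "Jane"],
    "lname": ["Doe", "Smith"],
})

# Vectorized: take the first character of each column and concatenate,
# with no Python-level row loop
df["initials"] = df["fname"].str[0] + df["lname"].str[0]

print(df["initials"].tolist())  # ['JD', 'JS']
```

In PySpark, the analogous approach is to prefer built-in column functions (e.g. F.concat with F.substring) over a Python UDF, since built-ins run inside the JVM and avoid per-row serialization.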