How to Build an End-To-End ML Pipeline With Databricks & Aporia
This tutorial will show you how to build a robust end-to-end ML pipeline with Databricks and Aporia. Here’s what you’ll...
When a function is applied to a column of a DataFrame, it operates on the value in every row of that column. In this short how-to article, we will learn how to apply a function to two columns in Pandas and PySpark DataFrames.
Suppose we have a user-defined function f that takes two input values. In Pandas, we can apply this function to two columns of a DataFrame using the apply function with axis=1 and a lambda expression. If the function returns a single value, we can store the results in a new column.
Let’s say we have first and last name columns and want to create a new column containing the initials. The find_initials function defined below does this operation. We can apply it to the first name and last name columns as follows:
# Defining the function
def find_initials(fname, lname):
    return fname[0] + lname[0]

# Applying it to two columns (axis=1 passes each row to the lambda)
df["initials"] = df.apply(lambda x: find_initials(x["fname"], x["lname"]), axis=1)
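To make the Pandas snippet above runnable end to end, here is a minimal sketch. The sample names are made up for illustration and are not from the original article; only the fname and lname column names and the find_initials function come from the text.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "fname": ["John", "Jane", "Alan"],
    "lname": ["Doe", "Smith", "Turing"],
})

def find_initials(fname, lname):
    # First character of each name, concatenated
    return fname[0] + lname[0]

# axis=1 hands each row to the lambda as a Series,
# so x["fname"] and x["lname"] are the values for that row
df["initials"] = df.apply(
    lambda x: find_initials(x["fname"], x["lname"]), axis=1
)

print(df["initials"].tolist())  # ['JD', 'JS', 'AT']
```

Note that this sketch assumes every name is a non-empty string. A missing value or an empty string in either column would raise an error inside find_initials.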
In PySpark, the operations are similar, but we need an additional step to wrap the Python function in a user-defined function (UDF).
# Importing necessary modules
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Defining the function
def take_initials(fname, lname):
    return fname[0] + lname[0]

# Creating a UDF that returns a string
initials_udf = F.udf(take_initials, StringType())

# Applying it to two columns
df = df.withColumn("initials", initials_udf(F.col("fname"), F.col("lname")))
It is important to note that row-wise operations are expensive in both Pandas and PySpark and should be avoided when an alternative exists. The alternative is to use vectorized operations, which work on whole columns at once instead of calling a Python function for every row.
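As a rough illustration of the vectorized alternative in Pandas, the sketch below computes the same initials with the .str accessor instead of apply. The sample data is hypothetical, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "fname": ["John", "Jane", "Alan"],
    "lname": ["Doe", "Smith", "Turing"],
})

# Row-wise version, kept only for comparison
row_wise = df.apply(lambda x: x["fname"][0] + x["lname"][0], axis=1)

# Vectorized version: .str[0] takes the first character of every
# value in the column, and + concatenates the two columns element-wise
df["initials"] = df["fname"].str[0] + df["lname"].str[0]

# Both approaches produce the same values
print(df["initials"].tolist() == row_wise.tolist())  # True
```

In PySpark, the comparable move is to prefer built-in column functions over a Python UDF where they can express the logic; for this example, F.substring and F.concat would likely be enough.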