We often need to create a new column as part of a data analysis process or a feature engineering process in machine learning. In this short how-to article, we will learn how to add a new column to an existing Pandas and PySpark DataFrame.
months = [1, 2, 6]
df["Month"] = months
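As a quick end-to-end sketch, here is the same assignment on a small example DataFrame. The data is made up for illustration; any DataFrame with three rows would work the same way:

```python
import pandas as pd

# Hypothetical example data for illustration
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-10", "2023-02-20", "2023-06-05"]),
    "Sales": [100, 150, 120],
})

months = [1, 2, 6]
df["Month"] = months  # appended as the last column

print(df.columns.tolist())  # → ['Date', 'Sales', 'Month']
```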
This method adds the new column at the end of the DataFrame. If you want to add the new column at a specific location, use the insert function.
months = [1, 4, 6]
df.insert(1, "Month", months)
The three parameters of the insert function are the location, the name, and the values of the new column. Therefore, the code block above adds a column named “Month” at index 1, making it the second column.
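To see the effect of the location parameter, here is a minimal sketch on an example DataFrame (the column names and values are assumptions for illustration). The inserted column lands between the two existing ones:

```python
import pandas as pd

# Hypothetical example DataFrame for illustration
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-10", "2023-04-20", "2023-06-05"]),
    "Sales": [100, 150, 120],
})

# insert(location, name, values): index 1 makes "Month" the second column
df.insert(1, "Month", [1, 4, 6])

print(df.columns.tolist())  # → ['Date', 'Month', 'Sales']
```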
Instead of writing the month values manually, we can extract this information from the date column, which is more practical when working with large datasets.
# Add at the end
df["Month"] = df["Date"].dt.month

# Insert as the second column
df.insert(1, "Month", df["Date"].dt.month)
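The dt.month extraction can be sketched as follows, assuming the Date column already holds datetime values (the sample dates are made up for illustration):

```python
import pandas as pd

# Hypothetical example dates for illustration
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-04-02", "2023-06-30"]),
})

# The dt accessor exposes datetime attributes of the column
df["Month"] = df["Date"].dt.month

print(df["Month"].tolist())  # → [1, 4, 6]
```

If the Date column is stored as strings, convert it first with pd.to_datetime before using the dt accessor.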
In PySpark, the new column can be added using the withColumn function. We cannot pass a list as the values of the new column, but we can extract the month information from the date column using the month and col functions.
from pyspark.sql import functions as F

df = df.withColumn("Month", F.month(F.col("Date")))