Aporia How to's

How to Add a New Column to an Existing DataFrame?

2 min read
pandas and pyspark dataframe new column

We often need to create a new column as part of a data analysis process or a feature engineering process in machine learning. In this short how-to article, we will learn how to add a new column to an existing Pandas and PySpark DataFrame.
add-new-column-to-existing-dataframe

Pandas

months = [1, 2, 6]
df["Month"] = months

This method adds the new column at the end of the DataFrame as you see in the drawing above. If you want to add the new at a specific location, use the insert function.

months = [1, 4, 6]
df.insert(1, "Month", months)

The 3 parameters inside the insert function are the location, name, and the values of the new column. Therefore, the code block above adds a column named “Month” at index 1 which means the second column.

add-new-column-to-existing-dataframe

Instead of writing the month values manually, we can extract this information from the date column which is more practical when working with large datasets.

# Add at the end
df["Month"] = df["Date"].dt.month

# Insert as the second column
df.insert(1, "Month", df["Date"].dt.month)

PySpark

The new column can be added using the withColumn function. In PySpark, we cannot pass a list as the values of the new column. However, we can extract the month information from the date using the month and col methods.

from pyspark.sql import functions as F

df = df.withColumn("Month", F.month(F.col("Date")))

This question is also being asked as:

  • Add a new column in Pandas DataFrame.
  • How can I add a new computed column in a DataFrame?

People have also asked for:

Green Background

Control All your GenAI Apps in minutes