We often need to create a new column as part of a data analysis process or a feature engineering process in machine learning. In this short how-to article, we will learn how to add a new column to an existing Pandas and PySpark DataFrame.
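The examples below assume a small DataFrame with a datetime "Date" column. A minimal sketch of such a setup (the column names and values here are hypothetical, chosen to match the examples that follow):

```python
import pandas as pd

# Hypothetical sample data; the article's original DataFrame is not shown
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-02-20", "2023-06-05"]),
    "Value": [10, 20, 30],
})

print(df)
```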
months = [1, 2, 6]
df["Month"] = months
This method adds the new column at the end of the DataFrame. If you want to add the new column at a specific location, use the insert function.
months = [1, 4, 6]
df.insert(1, "Month", months)
The three parameters of the insert function are the location, the name, and the values of the new column. Therefore, the code block above adds a column named “Month” at index 1, which makes it the second column.
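To make the positional behavior concrete, here is a small sketch (the DataFrame contents are hypothetical) that checks the column order after calling insert:

```python
import pandas as pd

# Hypothetical two-column DataFrame
df = pd.DataFrame({"Name": ["A", "B", "C"], "Value": [10, 20, 30]})

# Insert "Month" at index 1, i.e. between "Name" and "Value"
df.insert(1, "Month", [1, 4, 6])

print(list(df.columns))
```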
Instead of writing the month values manually, we can extract this information from the date column which is more practical when working with large datasets.
# Add at the end
df["Month"] = df["Date"].dt.month

# Insert as the second column
df.insert(1, "Month", df["Date"].dt.month)
In PySpark, the new column can be added using the withColumn function. Unlike in Pandas, we cannot pass a list as the values of the new column. However, we can extract the month information from the date column using the month and col functions.
from pyspark.sql import functions as F

df = df.withColumn("Month", F.month(F.col("Date")))