Each column in a DataFrame has a data type (dtype). Some functions and methods expect columns in a specific data type, and therefore it is a common operation to convert the data type of columns. In this short how-to article, we will learn how to change the data type of a column in Pandas and PySpark DataFrames.
In a Pandas DataFrame, we can check the data types of columns with the dtypes attribute.
df.dtypes

Name    string
City    string
Age     string
dtype: object
The astype method changes the data type of columns. Suppose we have a column that holds numerical values but is stored as strings. This is a problem because numerical operations such as sums and averages do not work on text.
df["Age"] = df["Age"].astype("int")
We just need to pass the desired data type to the astype method. Let’s confirm the change by checking the data types again.
df.dtypes

Name    string
City    string
Age     int64
dtype: object
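As a self-contained sketch of the conversion above, the snippet below builds a small DataFrame and converts Age from string to integer. The rows are made up for illustration and are not the article's dataset. The last part covers a case the article does not discuss: when some values are not valid numbers, astype raises an error, and pd.to_numeric with errors="coerce" is one possible alternative.

```python
import pandas as pd

# Hypothetical sample data; every column starts out as text.
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid"],
        "City": ["Rome", "Oslo", "Lima"],
        "Age": ["34", "27", "45"],
    }
)

# Convert the Age column from string to integer.
df["Age"] = df["Age"].astype("int")

# Numerical operations now work on the column.
mean_age = df["Age"].mean()

# astype raises a ValueError if a value cannot be parsed as an integer.
# pd.to_numeric with errors="coerce" turns such values into NaN instead.
raw_ages = pd.Series(["34", "unknown", "45"])
clean_ages = pd.to_numeric(raw_ages, errors="coerce")
```

Note that coercing to NaN yields a float column, so this route suits cases where missing values are acceptable.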
It is possible to change the data type of multiple columns in a single operation. The columns and their data types are written as key-value pairs in a dictionary.
df = df.astype({"Age": "int", "Score": "int"})
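The dictionary form can also assign a different type to each column. The following is a minimal sketch with made-up rows, assuming Age holds whole numbers and Score holds decimals; columns that are not listed in the dictionary keep their current type.

```python
import pandas as pd

# Hypothetical sample data; Age and Score are stored as text.
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob"],
        "Age": ["34", "27"],
        "Score": ["88.5", "92.0"],
    }
)

# Map each column to its target type; unlisted columns are left unchanged.
df = df.astype({"Age": "int64", "Score": "float64"})

# Numerical operations now work on the converted columns.
total_score = df["Score"].sum()
```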
In PySpark, we can use the cast method to change the data type.
from pyspark.sql.types import IntegerType
from pyspark.sql import functions as F

# first method: pass the type name as a string
df = df.withColumn("Age", df.Age.cast("int"))

# second method: pass a DataType instance
df = df.withColumn("Age", df.Age.cast(IntegerType()))

# third method: reference the column with F.col
df = df.withColumn("Age", F.col("Age").cast(IntegerType()))
To change the data type of multiple columns, we can combine operations by chaining them.
df = df.withColumn("Age", df.Age.cast("int")) \
       .withColumn("Score", df.Score.cast("int"))