Each column in a DataFrame has a data type (dtype). Some functions and methods expect columns in a specific data type, and therefore it is a common operation to convert the data type of columns. In this short how-to article, we will learn how to change the data type of a column in Pandas and PySpark DataFrames.
In a Pandas DataFrame, we can check the data types of columns with the dtypes attribute.
df.dtypes

Name    string
City    string
Age     string
dtype: object
The astype method changes the data type of a column. Suppose we have a column that holds numerical values but whose data type is string. This is a problem because numerical operations such as sums or averages do not work on text data.
df["Age"] = df["Age"].astype("int")
We just need to pass the desired data type to astype. Let’s confirm the changes by checking the data types again.
df.dtypes

Name    string
City    string
Age      int64
dtype: object
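The steps above can be put together in a small, self-contained sketch. The rows are hypothetical, since the article does not show the underlying data. The last part is an addition rather than something from the article: astype raises an error when a value cannot be parsed, so pd.to_numeric with errors="coerce" is shown as one possible alternative for messy columns.

```python
import pandas as pd

# Hypothetical sample rows; the article does not show the actual data.
df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Emily"],
        "City": ["Houston", "Dallas", "Austin"],
        "Age": ["23", "31", "27"],
    }
).astype("string")

# Convert the text column to integers.
df["Age"] = df["Age"].astype("int64")

# Numerical operations now work on the column.
mean_age = df["Age"].mean()

# Assumption: if a column may contain non-numeric text, astype would raise.
# pd.to_numeric with errors="coerce" turns unparseable values into NaN instead.
raw = pd.Series(["23", "unknown", "27"])
cleaned = pd.to_numeric(raw, errors="coerce")
```

Note that the coerced series ends up as a float column, because NaN cannot be stored in a plain integer column.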
It is possible to change the data type of multiple columns in a single operation. The columns and their data types are written as key-value pairs in a dictionary.
df = df.astype({"Age": "int", "Score": "int"})
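As a minimal sketch of the dictionary form, assuming a hypothetical DataFrame with two numeric columns stored as text: each key is a column name and each value is its target type, and the types do not have to be the same. Columns that are not listed in the dictionary keep their current type.

```python
import pandas as pd

# Hypothetical sample data with two numeric columns stored as text.
df = pd.DataFrame(
    {
        "Name": ["John", "Jane"],
        "Age": ["23", "31"],
        "Score": ["88", "92"],
    }
)

# Map each column to its target type; "Name" is not listed and stays as is.
df = df.astype({"Age": "int64", "Score": "float64"})

total_score = df["Score"].sum()
```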
In PySpark, we can use the cast method to change the data type.
from pyspark.sql.types import IntegerType
from pyspark.sql import functions as F

# first method: pass the type name as a string
df = df.withColumn("Age", df["Age"].cast("int"))

# second method: pass a DataType instance
df = df.withColumn("Age", df["Age"].cast(IntegerType()))

# third method: reference the column with F.col
df = df.withColumn("Age", F.col("Age").cast(IntegerType()))
To change the data type of multiple columns, we can combine operations by chaining them.
df = df.withColumn("Age", df["Age"].cast("int")) \
       .withColumn("Score", df["Score"].cast("int"))