65

We are reading data from a MongoDB collection. A single column can hold values of two different types (e.g. (bson.Int64, int) or (int, float)).

I am trying to get the data type of each column using PySpark.

My problem is that some columns contain a mix of data types.

Assume quantity and weight are the columns:

quantity           weight
---------          --------
12300              656
123566000000       789.6767
1238               56.22
345                23
345566677777789    21

We never actually defined a data type for any column of the Mongo collection.

When I query the count from the PySpark DataFrame:

dataframe.count()

I get an exception like this:

"Cannot cast STRING into a DoubleType (value: BsonString{value='200.0'})"
Alex Ott
Sreenuvasulu
  • What have you tried so far? Without showing what you have tried and what has not worked, it is highly doubtful that anyone here will be able to help you. Please check 'How to create a Minimal, Complete, and Verifiable example': https://stackoverflow.com/help/mcve – desertnaut Jul 11 '17 at 11:38

8 Answers

120

Your question is broad, thus my answer will also be broad.

To get the data types of your DataFrame columns, you can use dtypes, i.e.:

>>> df.dtypes
[('age', 'int'), ('name', 'string')]

This means your column age is of type int and name is of type string.
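
For illustration, a minimal runnable sketch of this (not part of the original answer; note that Python integers are inferred as bigint rather than int when a DataFrame is built this way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dtypes-example").getOrCreate()
# A tiny DataFrame with the same column names as the answer's example
df = spark.createDataFrame([(25, "Alice"), (30, "Bob")], ["age", "name"])
print(df.dtypes)  # [('age', 'bigint'), ('name', 'string')]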

eliasah
  • Would you care to update your question with this information? It's unclear what you are asking @Sreenuvasulu – eliasah Jul 12 '17 at 09:55
  • My columns have different data types here too. – eliasah Jul 12 '17 at 09:55
  • Please don't post information like this in the comment box. This is not readable. And take some time writing your question. – eliasah Jul 12 '17 at 10:16
  • This is a completely different issue from what you asked. Would you care to add a reproducible example? – eliasah Jul 12 '17 at 10:20
  • Can you look into the weight column? It has two different data types. – Sreenuvasulu Jul 12 '17 at 11:06
  • I'll not look into your question unless you provide the information @desertnaut asked for. You had a question which I have answered, and now the question has evolved into something completely different, yet you make no effort to write the question correctly. Please review your question according to the guidelines discussed in the comments. – eliasah Jul 12 '17 at 11:36
36

For anyone else who came here looking for an answer to the exact question in the post title (i.e. the data type of a single column, not multiple columns), I have been unable to find a simple way to do so.

Luckily it's trivial to get the type using dtypes:

def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

get_dtype(my_df, 'column_name')

(note that this will only return the first column's type if there are multiple columns with the same name)
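
An equally short variant (an editorial suggestion, not from the original answer) converts the dtypes list to a dict, which raises a KeyError instead of an IndexError for an unknown column name:

# dict(df.dtypes) maps each column name to its type string
dict(df.dtypes)['column_name']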

ropeladder
13
import pandas as pd
pd.set_option('display.max_colwidth', None)  # prevent truncating of columns in jupyter

def count_column_types(spark_df):
    """Count the number of columns per data type."""
    return (pd.DataFrame(spark_df.dtypes, columns=["name", "type"])
            .groupby("type", as_index=False)
            .agg(count=("name", "count"),
                 names=("name", lambda x: " | ".join(set(x)))))

Example output in jupyter notebook for a spark dataframe with 4 columns:

count_column_types(my_spark_df)

(The original answer included a screenshot of the resulting pandas DataFrame: one row per type, with the column count and the pipe-joined column names.)
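
Roughly, for a hypothetical Spark DataFrame with one bigint, two double and one string column, the output would look like:

   type    count  names
0  bigint  1      id
1  double  2      weight | price
2  string  1      name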

gench
8

I don't know how you are reading from MongoDB, but if you are using the MongoDB Spark connector, the data types are automatically converted to Spark types. To get the Spark SQL types, just use the schema attribute, like this:

df.schema
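
For context, a minimal sketch of such a read (assuming the MongoDB Spark connector 3.x and a hypothetical test.inventory database/collection; adjust the URI to your deployment):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-schema-example")
         # hypothetical database.collection
         .config("spark.mongodb.input.uri", "mongodb://localhost/test.inventory")
         .getOrCreate())

df = spark.read.format("mongo").load()  # the connector infers the schema by sampling documents
print(df.schema)  # StructType listing the inferred Spark SQL types
df.printSchema()  # the same schema as a tree view
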
Luis A.G.
3

Looks like your actual data and your metadata have different types: the actual data is of type string, while the metadata says double.

As a solution, I would recommend you recreate the table with the correct data types.
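
If recreating the collection is not an option, one workaround (an editorial sketch, not from this answer, assuming the SparkSession and connector setup from the earlier sketch) is to read the ambiguous field as a string and cast it in Spark, so a value like '200.0' loads cleanly and anything unparseable becomes null:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Force 'weight' (a column name from the question) to load as a string,
# then cast to double; values that cannot be parsed become null
schema = StructType([StructField("weight", StringType(), True)])
df = spark.read.format("mongo").schema(schema).load()
df = df.withColumn("weight", F.col("weight").cast("double"))
df.count()  # no longer fails on BsonString{value='200.0'}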

Henrique Florencio
3

df.dtypes to get a list of (colname, dtype) pairs, ex.

[('age', 'int'), ('name', 'string')]

df.schema to get a schema as StructType of StructField, ex.

StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

df.printSchema() to get a tree view of the schema, ex.

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
qwr
  • To get the datatype of a column in a dataframe, we can also use the apply function (https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/types/StructType.html#apply-java.lang.String-) of StructType object: `df.schema.apply("column-name-here").dataType`. This will give the data type as a DataType object (https://spark.apache.org/docs/3.2.1/api/java/org/apache/spark/sql/types/DataType.html). – Prabhatika Vij Apr 29 '23 at 06:06
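
(Expanding on the comment above: in PySpark the same lookup can be written by indexing the schema directly, since a StructType supports access by field name.)

df.schema["column-name-here"].dataType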
1
data = [('A+','good','Robert',550,3000),
  ('A+','good','Robert',450,4000),
  ('A+','bad','James',300,4000),
  ('A','bad','Mike',100,4000),
  ('B-','not bad','Jenney',250,-1)
]

columns = ["A","B","C","D","E"]
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Temp-Example').getOrCreate()
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()


# root
#  |-- A: string (nullable = true)
#  |-- B: string (nullable = true)
#  |-- C: string (nullable = true)
#  |-- D: long (nullable = true)
#  |-- E: long (nullable = true)

You can then get the data types of all columns with a few lines of code:

# get datatype 
from collections import defaultdict
import pandas as pd

data_types = defaultdict(list)
for entry in df.schema.fields:
    data_types[str(entry.dataType)].append(entry.name)

pd.DataFrame([(k, len(v)) for k, v in data_types.items()], columns=["datatype", "Nums"])

#   datatype    Nums
# 0 StringType()    3
# 1 LongType()  2
gorany
-13

I am assuming you are looking to get the data type of the data you read.

input_data = [Read from Mongo DB operation]

You can use

type(input_data) 

to inspect the type of the object you read.
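
For completeness, a tiny sketch of what this returns when the read yields a DataFrame (it reports the object's Python class, not the per-column types; the read itself is hypothetical, as in the earlier sketch):

df = spark.read.format("mongo").load()  # hypothetical read
print(type(df))  # <class 'pyspark.sql.dataframe.DataFrame'>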

ganeiy