
I have a pandas DataFrame my_df, and my_df.dtypes gives:

ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I am trying to convert the pandas DataFrame my_df to a Spark DataFrame as follows:

spark_my_df = sc.createDataFrame(my_df)

However, I got the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!

Edamame

6 Answers


In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.
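
For example, a single column that contains only None values reproduces the failure (a minimal sketch, assuming an active SparkSession named spark):

>>> spark.createDataFrame([[None]], ["foo"])  # raises ValueError: Some of types cannot be determined after inferring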

Manually defining a schema will resolve the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
Gregology

To fix this problem, you can provide your own schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
Akavall
  • If we have more than 2 columns, and only 1 column is fully null, is there a more elegant way to pass the schema without explicitly defining the schema for all the columns? – Mojgan Mazouchi Feb 15 '21 at 00:09
  • Why can't we simply convert to a Spark DF with all nulls? For me, it worked fine the other way around when converting from Spark - toPandas(). I am converting spark df toPandas() to use pandas functionality, but can't convert back now – Psychotechnopath Nov 15 '22 at 09:55

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
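
For reference, the underlying SparkSession.createDataFrame call exposes the same control through its samplingRatio parameter, so a roughly equivalent sketch (assuming my_rdd is an RDD of Row objects and spark is an active SparkSession; the ratio is an arbitrary illustration) would be:

# samplingRatio controls the fraction of records sampled for schema inference
my_df = spark.createDataFrame(my_rdd, samplingRatio=0.5)
my_df.show()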

rjurney

I've run into this same issue. If you do not need the columns that are entirely null, you can simply drop them from the pandas DataFrame before importing it into Spark:

my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
Aaron Robeson

This is probably because of columns that contain only null values. You should drop those columns before converting to a Spark DataFrame, for example as sketched below.
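
A minimal sketch of that approach, assuming a pandas DataFrame named my_df and an active SparkSession named spark (both names are placeholders):

# Identify columns in which every value is NA; Spark cannot infer a type for these
all_null_cols = my_df.columns[my_df.isna().all()]

# Drop them, then convert the remaining columns to a Spark DataFrame
spark_my_df = spark.createDataFrame(my_df.drop(columns=all_null_cols))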

Kamaldeep Singh

The reason for this error is that Spark is not able to determine the data types of your pandas DataFrame, so one way to solve this is to pass the schema separately to Spark's createDataFrame function.

For example, suppose your pandas DataFrame looks like this:

import pandas as pd

d = {
  'col1': [1, 2],
  'col2': ['A', 'B']
}
df = pd.DataFrame(data=d)
print(df)

   col1 col2
0     1    A
1     2    B

When you want to convert it into a Spark DataFrame, start by defining the schema and passing it to createDataFrame as follows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
  StructField("col1", LongType()),
  StructField("col2", StringType()),
])


spark_df = spark.createDataFrame(df, schema = schema)
Hadi Mir