
I have a pandas DataFrame my_df, and my_df.dtypes gives:

ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I am trying to convert the pandas DataFrame my_df to a Spark DataFrame as follows:

spark_my_df = sc.createDataFrame(my_df)

However, I got the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!

Edamame

6 Answers


In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.
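
For example, a single column that contains only None values reproduces the failure (a minimal sketch, assuming an active SparkSession named spark):

>>> spark.createDataFrame([[None]], ["foo"])  # raises ValueError: Some of types cannot be determined after inferring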

Manually defining a schema will resolve the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
Gregology

To fix this problem, you can provide your own schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
Akavall
  • If we have more than 2 columns, and only 1 column is fully null, is there a more elegant way to pass the schema without explicitly defining the schema for all the columns? – Mojgan Mazouchi Feb 15 '21 at 00:09
  • Why can't we simply convert to a Spark DF with all nulls? For me, it worked fine the other way around when converting from Spark - toPandas(). I am converting spark df toPandas() to use pandas functionality, but can't convert back now – Psychotechnopath Nov 15 '22 at 09:55

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
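
For reference, the underlying SparkSession.createDataFrame call exposes the same control through its samplingRatio parameter, so a roughly equivalent sketch (assuming my_rdd is an RDD of Row objects and spark is an active SparkSession; the ratio is an arbitrary illustration) would be:

# samplingRatio controls the fraction of records sampled for schema inference
my_df = spark.createDataFrame(my_rdd, samplingRatio=0.5)
my_df.show()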

rjurney

I've run into this same issue. If you do not need the columns that are entirely null, you can simply drop them from the pandas DataFrame before importing it into Spark:

my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
Aaron Robeson

This is probably because of columns that contain only null values. You should drop those columns before converting to a Spark DataFrame, for example as sketched below.
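
A minimal sketch of that approach, assuming a pandas DataFrame named my_df and an active SparkSession named spark (both names are placeholders):

# Identify columns in which every value is NA; Spark cannot infer a type for these
all_null_cols = my_df.columns[my_df.isna().all()]

# Drop them, then convert the remaining columns to a Spark DataFrame
spark_my_df = spark.createDataFrame(my_df.drop(columns=all_null_cols))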

Kamaldeep Singh

The reason for this error is that Spark is not able to determine the data types of your pandas DataFrame, so one way to solve this is to pass the schema separately to Spark's createDataFrame function.

For example, suppose your pandas DataFrame looks like this:

import pandas as pd

d = {
  'col1': [1, 2],
  'col2': ['A', 'B']
}
df = pd.DataFrame(data=d)
print(df)

   col1 col2
0     1    A
1     2    B

When you want to convert it into a Spark DataFrame, start by defining the schema and passing it to createDataFrame as follows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
  StructField("col1", LongType()),
  StructField("col2", StringType()),
])


spark_df = spark.createDataFrame(df, schema = schema)
Hadi Mir