
I read a CSV file into pandas, preprocessed it, and set the dtypes to the desired values of float, int, and category. However, when trying to import it into Spark I get the following error:

Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

After tracing it for a while, I found the source of my troubles in the CSV file:

"myColumns"
""
"A"

Read into pandas like: small = pd.read_csv(os.path.expanduser('myCsv.csv'))

And it fails to import into Spark with:

sparkDF = spark.createDataFrame(small)

Currently I use Spark 2.0.0

Possibly multiple columns are affected. How can I deal with this problem?
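Not part of the original question, but a hedged sketch of how one could check which columns are affected: after the pandas read, list the columns whose values mix more than one Python type (e.g. a float NaN alongside strings), since that is exactly what Spark's type merging trips over.

import os
import pandas as pd

small = pd.read_csv(os.path.expanduser('myCsv.csv'))

# Columns containing more than one Python type (e.g. float NaN plus str)
# are the ones Spark cannot merge into a single column type.
mixed = {
    col: small[col].map(type).unique().tolist()
    for col in small.columns
    if small[col].map(type).nunique() > 1
}
print(mixed)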



1 Answer


You'll need to define the Spark DataFrame schema explicitly and pass it to the createDataFrame function:

from pyspark.sql.types import *
import pandas as pd

small = pd.read_csv("data.csv")
small.head()
#  myColumns
# 0       NaN
# 1         A
sch = StructType([StructField("myColumns", StringType(), True)])

df = spark.createDataFrame(small, sch)
df.show()
# +---------+
# |myColumns|
# +---------+
# |      NaN|
# |        A|
# +---------+

df.printSchema()
# root
# |-- myColumns: string (nullable = true)
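If many columns are affected, one option is to derive the schema from the pandas dtypes instead of writing it out by hand. The schema_from_pandas helper below is a hypothetical sketch, not part of the original answer: it only maps float64 and int64 explicitly and lets everything else (including object and category columns) fall back to StringType.

from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, LongType)

# Hypothetical helper: build a Spark schema from the pandas dtypes.
def schema_from_pandas(pdf):
    mapping = {"float64": DoubleType(), "int64": LongType()}
    return StructType([
        StructField(col, mapping.get(str(dtype), StringType()), True)
        for col, dtype in pdf.dtypes.items()
    ])

sparkDF = spark.createDataFrame(small, schema_from_pandas(small))
sparkDF.printSchema()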
  • so it is not inferred from the pandas dtypes? :( I see. – Georg Heiler Oct 06 '16 at 07:21
  • could not finally confirm this yet --> there are several fields affected. But your code snippet works for the minimal example. – Georg Heiler Oct 06 '16 at 08:07
  • then why don't you read the csv using the spark-csv package ? – eliasah Oct 06 '16 at 08:08
  • Good question. I cleaned the raw data in Python and thought this would be easier. When I tried to read the data in Spark there were some problems initially (with the raw data). What was strange for me was that when I used Datasets in Scala to load the data, all the columns were loaded anyway, even when I tried to exclude some problematic ones by not specifying them as attributes in the case class. The fixed data loads fine into Spark, but trying to perform a conversion to parquet like `mynewDf.write.parquet("myDf.parquet")` errors. – Georg Heiler Oct 06 '16 at 08:18
  • I thought that this error was caused by a wrong dtype interpretation and wanted to read the data correctly via pandas. But in the long run everything (especially the pandas preprocessing) should move to scala / spark. – Georg Heiler Oct 06 '16 at 08:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/125073/discussion-between-eliasah-and-georg-heiler). – eliasah Oct 06 '16 at 08:19
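For reference, a minimal sketch of the spark-csv approach suggested in the comments above. Since Spark 2.0 the CSV reader is built in as spark.read.csv; the header and inferSchema options shown here are assumptions about the file layout, not something stated in the discussion.

# Read the CSV directly with Spark's built-in reader, skipping pandas entirely.
sparkDF = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("myCsv.csv"))
sparkDF.printSchema()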