
I am currently pulling data from SQL Server using PyODBC and trying to insert it into a Hive table in a near-real-time (NRT) manner.

I fetched a single row from the source, converted it into a list of strings, and am creating the schema programmatically, but while creating the DataFrame, Spark throws a StructType error.

>>> cnxn = pyodbc.connect(con_string)
>>> aj = cnxn.cursor()
>>>
>>> aj.execute("select * from tjob")
<pyodbc.Cursor object at 0x257b2d0>

>>> row = aj.fetchone()

>>> row
(1127, u'', u'8196660', u'', u'', 0, u'', u'', None, 35, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, u'', 0, None, None)
>>> rowstr = map(str,row)
>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaString = " ".join([row.column_name for row in aj.columns(table='tjob')])

>>> schemaString
u'ID ExternalID Name Description Notes Type Lot SubLot ParentJobID ProductID PlannedStartDateTime PlannedDurationSeconds Capture01 Capture02 Capture03 Capture04 Capture05 Capture06 Capture07 Capture08 Capture09 Capture10 Capture11 Capture12 Capture13 Capture14 Capture15 Capture16 Capture17 Capture18 Capture19 Capture20 User UserState ModifiedDateTime UploadedDateTime'

>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)

>>> [f.dataType for f in schema.fields]
[StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType]

>>> myrdd = sc.parallelize(rowstr)

>>> myrdd.collect()
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
    _verify_type(row, schema)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/types.py", line 1132, in _verify_type
    raise TypeError("StructType can not accept object in type %s" % type(obj))
TypeError: StructType can not accept object in type <type 'str'>

2 Answers


Here is the reason for the error message:

>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None' ... ]   
#rowstr is a list of str

>>> myrdd = sc.parallelize(rowstr)
#myrdd is a rdd of str

>>> schema = StructType(fields)
#schema is StructType([StringType, StringType, ....])

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
#myrdd should have been an RDD of rows (e.g. one list/tuple of str per row) but is an RDD of str

To fix that, make the RDD of the proper type:

>>> myrdd = sc.parallelize([rowstr])
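Putting it together, a minimal sketch of the working pattern, reusing sc, sqlContext, rowstr and schema from the question (only the row wrapping changes):

# rowstr is ONE row as a list of str; wrap it in an outer list so the RDD
# has one element per row, and that element lines up with the StructType
myrdd = sc.parallelize([rowstr])          # RDD of 1 row, not 36 separate strings
schemaPeople = sqlContext.createDataFrame(myrdd, schema)
schemaPeople.show(1)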
  • As to "proper type", what is that? A list of str? sc.parallelize()? – Geoffrey Anderson May 30 '18 at 18:28
  • This is a very helpful answer. I spent more than an hour stuck with the misleading error TypeError: StructType can not accept object 123 in type. Thanks to this post I understood that I need to create the data frame with the expression spark.createDataFrame([(123,)], my_schema) rather than spark.createDataFrame([123], my_schema) – adelinor Jun 23 '21 at 14:40
  • @adelinor do you know why the comma is necessary after 123 in spark.createDataFrame([(123,)], my_schema)? – hipokito Aug 03 '21 at 11:15
  • @hipokito: the comma seems to be important for inferring the type of the expression. – adelinor Aug 05 '21 at 11:50
  • @adelinor: the comma is particularly important when you deal with a singleton, a one-element tuple; a trailing comma is required to create a one-element tuple. – Andrea Baldino Aug 31 '21 at 15:16
  • In the case of a dictionary/list, the square brackets are mandatory: df = spark.createDataFrame(data=[dict_data], schema=col_schema) – Kumar Nov 11 '21 at 20:05
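To illustrate the one-element-tuple point from the comments above, a small self-contained sketch (my_schema and the SparkSession are illustrative, not from the original post; spark is the newer SparkSession entry point rather than the question's sqlContext):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
my_schema = StructType([StructField("id", IntegerType(), True)])

# (123,) is a one-element tuple (the trailing comma matters), so each list
# element is a row with one column and matches the one-field schema
df_ok = spark.createDataFrame([(123,)], my_schema)

# [123] is a list of bare ints, so Spark sees 123 where it expects a row and
# raises: TypeError: StructType can not accept object 123 in type <class 'int'>
# df_bad = spark.createDataFrame([123], my_schema)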

I recently received a similar error:

TypeError: StructType can not accept object '_id' in type <class 'str'>

and this is how I resolved it.

I am working with a heavily nested JSON file for scheduling; the JSON file is composed of lists of dictionaries of lists, etc.

e.g. ['1127', {time: '_id', '8196660', '', '', '0', '', '', 'None' ...}, {busstops: {_id, name} ]

For me, _id was repeated many times within other dictionaries, and I resolved it by specifying the dictionary key.

kl = spark.createDataFrame(obj_day, schema=test())  # this raises the error

but I resolved it with

kl = spark.createDataFrame(obj_day["busstops"], schema=test())
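A hedged sketch of that pattern; the file name, the busstop_schema, and the exact keys below are illustrative stand-ins, since the original test() schema is not shown:

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# illustrative schema for the nested "busstops" entries
busstop_schema = StructType([
    StructField("_id", StringType(), True),
    StructField("name", StringType(), True),
])

with open("schedule.json") as f:      # illustrative file name
    obj_day = json.load(f)

# pass only the list of dictionaries under the key you want,
# not the whole nested object
kl = spark.createDataFrame(obj_day["busstops"], schema=busstop_schema)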