
I've tried to convert a list of dicts into a Databricks Koalas DataFrame, but I keep getting the error message:

ArrowInvalid: cannot mix list and non-list, non-null values

Pandas works perfectly (with pd.DataFrame(list)) but because of company restrictions I must use PySpark/Koalas. I've also tried to convert the list into a dictionary and the error persists.

An example of the list:

[{'A': None,
  'B': None,
  'C': None,
  'D': None,
  'E': [],
  ...},
{'A': data,
  'B': data,
  'C': data,
  'D': data,
  'E': None,
  ...}
]

And the dict is like:

{'A': [None,  data,  [],  [],  data],
'B': [None, data, None, [], None],
'C': [None, data, None, [], None],
'D': [None, data, None, [], None],
'E': [[], None, data, [], None]}

Is it possible to get a DataFrame from this? Thanks

Alex M
  • It appears that the error occurs because your records contain both empty lists (`[]`) and `None` values. Are you allowed to modify the data? I was able to create a Koalas DataFrame with your data after replacing the `[]` elements with `None`. – colton Sep 29 '21 at 18:29

1 Answer


You can create a Spark DataFrame from your data without modifying it by passing an explicit schema to spark.createDataFrame().

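from pyspark.sql import types as T

# `spark` is the active SparkSession (predefined in Databricks notebooks).
# Every column is declared as a nullable array of integers, so None and []
# are both valid values.
sdf = spark.createDataFrame(
    data_list,
    T.StructType([
        T.StructField('A', T.ArrayType(T.IntegerType()), True),
        T.StructField('B', T.ArrayType(T.IntegerType()), True),
        T.StructField('C', T.ArrayType(T.IntegerType()), True),
        T.StructField('D', T.ArrayType(T.IntegerType()), True),
        T.StructField('E', T.ArrayType(T.IntegerType()), True),
    ])
)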

The result can then be converted to a Koalas DataFrame using to_koalas().

>>> sdf.to_koalas()
           A          B          C          D     E
0       None       None       None       None    []
1  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  None

Additionally, I was able to create a Koalas DataFrame without going through Spark by replacing the empty lists [] in your data with None.

data_list = [
    {
        'A': None,
        'B': None,
        'C': None,
        'D': None,
        'E': None,
    },
    {
        'A': [1, 2, 3],
        'B': [1, 2, 3],
        'C': [1, 2, 3],
        'D': [1, 2, 3],
        'E': None,
    }
]
>>> import databricks.koalas as ks
>>> ks.DataFrame(data_list)
           A          B          C          D     E
0       None       None       None       None  None
1  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  None
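
If the data is too large to edit by hand, the replacement of empty lists can be done programmatically. The following is only a minimal sketch in plain Python: the helper name replace_empty_lists, its fill parameter, and the sample records are made up for illustration and are not part of Koalas or Spark.

```python
from typing import Any, Dict, List


def replace_empty_lists(records: List[Dict[str, Any]], fill: Any = None) -> List[Dict[str, Any]]:
    """Return a copy of `records` in which every empty-list value is replaced by `fill`.

    Hypothetical helper for illustration; the input list is left unmodified.
    """
    return [
        {
            key: (fill if isinstance(value, list) and len(value) == 0 else value)
            for key, value in record.items()
        }
        for record in records
    ]


# Sample records shaped like the ones in the question (values are placeholders).
raw_records = [
    {'A': None, 'B': None, 'E': []},
    {'A': [1, 2, 3], 'B': [1, 2, 3], 'E': None},
]

cleaned = replace_empty_lists(raw_records)
# `cleaned` no longer mixes [] and None, so it should be usable
# with ks.DataFrame(cleaned) in the same way as data_list above.
```

The helper builds new dicts rather than mutating the originals, so the raw records stay available if the Spark schema approach is needed later.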
colton
  • I just had to replace the empty lists and all the None values with numpy.NaN, and Koalas was able to convert it. – Alex M Sep 30 '21 at 20:58