
When I try to convert some XML to a dataframe using xmltodict, it turns out that one particular column contains all the info I need, as a dict or a list of dicts. I'm able to split this column into multiple ones with pandas, but I'm not able to perform the same operation in dask.

It's not possible to use meta because I have no idea of all the possible fields that are available in the XML, and dask is necessary because the real XML files are bigger than 1 GB each.

example.xml:

<?xml version="1.0" encoding="UTF-8"?>
<itemList>
  <eventItem uid="1">
    <timestamp>2019-07-04T09:57:35.044Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail>
        <name>columnA</name>
        <value>AAA</value>
      </detail>
      <detail>
        <name>columnB</name>
        <value>BBB</value>
      </detail>
    </details>
  </eventItem>
  <eventItem uid="2">
    <timestamp>2019-07-04T09:57:52.188Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail>
        <name>columnC</name>
        <value>CCC</value>
      </detail>
    </details>
  </eventItem>
</itemList>

Working pandas code:

import xmltodict
import collections
import pandas as pd

def pd_output_dict(details):
    detail = details.get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        ret_value[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            ret_value[i["name"]] = i["value"]
    return pd.Series(ret_value)

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
    df = pd.DataFrame(df_dict_list)
    df = pd.concat([df, df.apply(lambda row: pd_output_dict(row.details), axis=1, result_type="expand")], axis=1)
    print(df.head())

Not working dask code:

import xmltodict
import collections
import dask
import dask.bag as db
import dask.dataframe as dd

def dd_output_dict(row):
    detail = row.get("details", {}).get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        row[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            row[i["name"]] = i["value"]
    return row

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
    df_bag = db.from_sequence(df_dict_list)
    df = df_bag.to_dataframe()
    df = df.apply(lambda row: dd_output_dict(row), axis=1)

The idea is to get in dask a result similar to what I have in pandas, but at the moment I'm receiving errors:

>>> df = df.apply(lambda row: output_dict(row), axis=1)
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
    yield
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
    return getattr(obj, self.method)(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
    return op.get_result()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
    return self.apply_standard()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "<stdin>", line 1, in <lambda>
  File "<stdin>", line 4, in output_dict
AttributeError: ("'str' object has no attribute 'get'", 'occurred at index 0')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3964, in apply
    M.apply, self._meta_nonempty, func, args=args, udf=True, **kwds
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "C:\Anaconda3\lib\contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 190, in raise_on_meta_error
    raise ValueError(msg)
ValueError: Metadata inference failed in `apply`.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
AttributeError("'str' object has no attribute 'get'", 'occurred at index 0')

Traceback:
---------
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
    yield
  File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
    return getattr(obj, self.method)(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
    return op.get_result()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
    return self.apply_standard()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "<stdin>", line 1, in <lambda>
  File "<stdin>", line 4, in output_dict

1 Answer


Right, so operations like map_partitions will need to know the column names and data types. As you've mentioned, you can specify this with the meta= keyword.

Perhaps you can run through your data once to compute what these will be, and then construct a proper meta object, and pass that in? This is inefficient, and requires reading through all of your data, but I'm not sure that there is another way.
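A minimal sketch of that two-pass approach, using a bag rather than a row-wise apply (the `flatten` helper and the inlined XML string are assumptions for a self-contained example, not from the original post): the first pass scans the data to collect every column name, then an empty pandas frame is built as meta and passed to `to_dataframe`.

```python
import xmltodict
import pandas as pd
import dask.bag as db

# the example.xml content from the question, inlined to keep the sketch self-contained
xml = """<?xml version="1.0" encoding="UTF-8"?>
<itemList>
  <eventItem uid="1">
    <timestamp>2019-07-04T09:57:35.044Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail><name>columnA</name><value>AAA</value></detail>
      <detail><name>columnB</name><value>BBB</value></detail>
    </details>
  </eventItem>
  <eventItem uid="2">
    <timestamp>2019-07-04T09:57:52.188Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail><name>columnC</name><value>CCC</value></detail>
    </details>
  </eventItem>
</itemList>"""

def flatten(item):
    # hypothetical helper: merge the nested detail name/value pairs
    # into top-level keys, dropping the original "details" entry
    row = {k: v for k, v in item.items() if k != "details"}
    detail = item.get("details", {}).get("detail", [])
    if isinstance(detail, dict):
        detail = [detail]
    for d in detail:
        row[d["name"]] = d["value"]
    return row

items = xmltodict.parse(xml).get("itemList", {}).get("eventItem", [])
bag = db.from_sequence(items).map(flatten)

# pass 1: read through the data once to discover the full set of column names
columns = sorted(set().union(*bag.map(lambda r: set(r)).compute()))

# build the meta object: an empty frame with every column typed as object
meta = pd.DataFrame({c: pd.Series(dtype="object") for c in columns})

# pass 2: build the dataframe with explicit meta; rows missing a column get NaN
df = bag.to_dataframe(meta=meta)
```

This is indeed inefficient for 1 GB files, since pass 1 parses everything just to learn the schema, but it avoids having to hand-write meta for an unknown set of fields.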

MRocklin
  • Thanks, I'll try to generate the meta. The biggest problem is that the values of the dict can be nested, but I don't need to manage those and only want to keep such columns as dict or list. Is there something like a generic object to use when I define meta for those objects, or do I need to write out all the structures precisely, including nested items? – dadokkio Aug 30 '19 at 06:15