
I'm trying to load JSON files into a Dask DataFrame.

import glob
import dask.dataframe as dd

files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)

There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a column list, so that all possible columns exist in the concatenated Dask DataFrame? And shouldn't it be able to handle missing values? I get the following error when trying to compute the DataFrame:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `DataFrame`
+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | float64  |
| x21             | -     | object   |
| x20             | -     | float64  |
| x19             | -     | float64  |
| x18             | -     | object   |
| x17             | -     | float64  |
| x16             | -     | object   |
| x15             | -     | object   |
| x14             | -     | object   |
| x13             | -     | object   |
| x12             | -     | object   |
| x11             | -     | object   |
| x10             | -     | object   |
| x9              | -     | float64  |
| x8              | -     | object   |
| x7              | -     | object   |
| x6              | -     | object   |
| x5              | -     | int64    |
| x4              | -     | object   |
| x3              | -     | float64  |
| x2              | -     | object   |
| x1              | -     | object   |
+-----------------+-------+----------+
Maria

3 Answers


read_json() is new and tested for the "common" case of homogeneous data. Like read_csv, it could fairly easily be extended to cope with column selection and data-type coercion. I note that the pandas function allows passing a dtype= parameter.

This is not an answer, but perhaps you would be interested in submitting a PR at the repo? The specific code lives in the module dask.dataframe.io.json.

mdurant
  • Thanks, I'll consider it. – Maria Jun 20 '18 at 07:10
  • I just bumped into this issue too and I'm not sure where I would implement column selection in the source. FWIW I think heterogeneous JSON data is as likely as any other data format. I wonder if I should be using Dask DataFrames for the task of reading a large JSON file (comment data from Pushshift) or not. – Rob Hawkins Dec 27 '18 at 13:16
  • 1
    For heterogeneous data, you would want to use dask.bag to read the text, parse the json and manipulate the resulting dictionaries. – mdurant Dec 27 '18 at 13:35
  • Oh, I see, I can load using dask.bag, do some manipulation, then convert that to a dataframe. Maybe I can add this example to dask documentation when done. Thanks @mdurant – Rob Hawkins Dec 28 '18 at 07:38
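Here is a minimal sketch of that dask.bag route, assuming line-delimited JSON (for whole-file documents you would map json.load over the file paths instead); the columns list and field names are illustrative placeholders, not from the question:

import json
import dask.bag as db

# Hypothetical: every field expected across all files.
columns = ['x1', 'x2', 'x3']

def normalize(record):
    # Give every record the same keys; missing fields become None.
    return {col: record.get(col) for col in columns}

records = db.read_text('**/*.json').map(json.loads)
df = records.map(normalize).to_dataframe()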

I bumped into a similar problem and came up with another solution:

import pandas as pd
import dask.dataframe as dd

def read_data(path, **kwargs):
    # meta is an empty pandas dataframe with the inferred columns and dtypes
    meta = dd.read_json(path, **kwargs).head(0)
    # edit the meta dataframe here to match what json_engine() returns

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # add or drop the necessary columns here so df matches meta
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)

The idea of this solution is to do two things:

  1. Edit meta as you see fit (for example, removing a column from it which you don't need).
  2. Wrap the JSON engine function and drop/add the necessary columns so that meta matches what the function returns.

Examples:

  1. You have one particular irrelevant column which causes your code to fail with an error like:

+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | object   |
+-----------------+-------+----------+

In this case you simply drop this column both from meta and in your json_engine() wrapper.

  2. You have some relevant columns which are reported missing for some partitions. In this case you get an error similar to the topic starter's.

In this case you add the necessary columns to meta with the required dtypes (meta is just an empty pandas DataFrame at this point), and you also add those columns as empty in your json_engine() wrapper if necessary, as in the sketch below.
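A hedged sketch of that second case, treating x22 from the error table above as a stand-in for a column that is missing from some files (the name and dtype are illustrative):

import pandas as pd
import dask.dataframe as dd

def read_data(path, **kwargs):
    meta = dd.read_json(path, **kwargs).head(0)
    # declare the column that some partitions lack, with its expected dtype
    meta['x22'] = pd.Series(dtype='float64')

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        if 'x22' not in df.columns:
            df['x22'] = float('nan')  # fill the missing column with NaN
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)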

Also look at the proposal in the comments to the https://stackoverflow.com/a/50929229/2727308 answer: to use dask.bag instead.

featuredpeow
  • This is great! Much more straightforward, and for many load-time tweaks probably more efficient, than starting with a Bag. – zgana Mar 16 '22 at 21:25

I passed the pandas read_json kwarg dtype=object, so all the columns are inferred as object:

df = dd.read_json(files, dtype=object)
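Note that this makes every column object; if you need the original dtypes afterwards, you can cast selected columns back. A minimal sketch, with column names taken from the error table above purely as illustrations:

df = dd.read_json(files, dtype=object)
df = df.astype({'x3': 'float64', 'x5': 'int64'})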
Ofer Helman