
I'm just starting my adventure with Dask and I'm learning on an example dataset in JSON format. I know that this is not the easiest data format in the world for a beginner :)

I loaded the data into a dataframe via dd.read_json and everything went well. The problem occurred when I called, for example, compute() or len().

I get this error:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `DataFrame`
+----------+-------+----------+
| Column   | Found | Expected |
+----------+-------+----------+
| column1  |   -   | object   |
| column2  |   -   | object   |
+----------+-------+----------+

I tried different things, but nothing helped. I don't know how to handle this error.

Please help, I will be very grateful!

AWL
  • The problem is that Dask is inferring the datatype for the given columns. You can set it using `dtype`. You can have a look at this [library](https://github.com/d6t/d6tstack) too – rpanai Mar 05 '19 at 12:35
  • @user32185 thanks a lot. I changed the type of these two columns to `float` using the `astype` function. But now I'm getting a different kind of error: `ValueError: could not convert string to float: y2122-9865-b432-986ty34924` – AWL Mar 05 '19 at 17:22
  • Is it happening in one column only? – rpanai Mar 05 '19 at 17:45
  • @user32185 no, it happens in two columns. These columns have "NaN" values, and the column type is object. – AWL Mar 05 '19 at 21:31
  • Ok, I somehow dealt with it, but in a different way. At the beginning I loaded the JSON data into a dask.bag using `json_data_bag = db.read_text(file_path).map(json.loads)` and later I transformed it into a dataframe using `json_data_df = json_data_bag.to_dataframe()`. For now it works :) – AWL Mar 06 '19 at 10:51
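Pulling the two suggestions from these comments together, here is a minimal sketch of both workarounds. The file path, `lines=True`, and the column names/dtypes are placeholders for illustration, and passing `dtype` through `dd.read_json` assumes the keyword is forwarded to `pandas.read_json`, so verify this against your Dask version:

    import json

    import dask.bag as db
    import dask.dataframe as dd

    # Workaround 1 (from rpanai's comment): declare the dtypes up front so
    # Dask does not have to infer them from the first chunk. The dtype dict
    # is forwarded to pandas.read_json; the column names are hypothetical.
    df = dd.read_json(
        "data/*.json",
        lines=True,  # assumes line-delimited JSON records
        dtype={"column1": "object", "column2": "object"},
    )

    # Workaround 2 (from AWL's comment): parse every record with json.loads
    # through a bag, then build the dataframe from the parsed records.
    json_data_bag = db.read_text("data/*.json").map(json.loads)
    json_data_df = json_data_bag.to_dataframe()

Note that `Bag.to_dataframe` also infers columns from its first partition unless you pass `meta`, so the dict keys should still be consistent across records.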

1 Answer


My guess is that your JSON data has different columns in different parts of the data. When Dask DataFrame loads your JSON data it looks at the first chunk of data to determine what the column names and datatypes are. It then assumes that all of your data looks like this.

This assumption turns out to be wrong in your case: probably there is some column that only appears later on in the file.
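As a quick sanity check, you can print the schema Dask inferred without triggering a compute (the path here is a placeholder):

    import dask.dataframe as dd

    df = dd.read_json("data/*.json", lines=True)  # placeholder path
    print(df.columns)  # the column names Dask expects in every partition
    print(df.dtypes)   # the dtypes inferred from the sampled first chunk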

You might consider increasing the size of the sample that Dask reads when determining metadata like column names.

# Sample the first 64 MB of the data for metadata inference instead of the default 1 MB
df = dd.read_json(..., sample=2**26)

The default sample size is 1 MB (2**20 bytes).
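Putting it together, a sketch assuming line-delimited JSON files under a placeholder path:

    import dask.dataframe as dd

    # Sample 64 MB (2**26 bytes) instead of the default 1 MB when inferring
    # column names and dtypes, so columns that first appear deep in the file
    # are still seen during metadata inference.
    df = dd.read_json("data/*.json", lines=True, sample=2**26)
    print(len(df))  # the call that previously raised the metadata mismatch

Recent Dask versions also accept an explicit `meta=` argument to `dd.read_json`, which skips inference entirely; worth checking for your version if a larger sample still misses a column.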

MRocklin