I'm trying to load JSON files into a Dask DataFrame.
import glob
import dask.dataframe as dd

files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)
There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a column list, so that all possible columns exist in the concatenated Dask DataFrame? And shouldn't Dask be able to handle the missing values on its own? I get the following error when I try to compute the DataFrame (a workaround I'm considering is sketched below the traceback):
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `DataFrame`
+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | float64  |
| x21             | -     | object   |
| x20             | -     | float64  |
| x19             | -     | float64  |
| x18             | -     | object   |
| x17             | -     | float64  |
| x16             | -     | object   |
| x15             | -     | object   |
| x14             | -     | object   |
| x13             | -     | object   |
| x12             | -     | object   |
| x11             | -     | object   |
| x10             | -     | object   |
| x9              | -     | float64  |
| x8              | -     | object   |
| x7              | -     | object   |
| x6              | -     | object   |
| x5              | -     | int64    |
| x4              | -     | object   |
| x3              | -     | float64  |
| x2              | -     | object   |
| x1              | -     | object   |
+-----------------+-------+----------+
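For reference, here's the workaround I'm considering, in case it clarifies what I'm after: load each file with pandas through dask.delayed, reindex every partition to a fixed column list (missing columns become NaN, unexpected extras are dropped), enforce the dtypes, and hand dd.from_delayed an explicit meta. The column names and dtypes below are just what I read off the error table; the nullable 'Int64' for x5 is my own guess for coping with missing values. A minimal sketch, not necessarily the idiomatic Dask way:

import glob

import dask
import dask.dataframe as dd
import pandas as pd

files = glob.glob('**/*.json', recursive=True)

# Expected schema, read off the "Expected" column of the error above.
# Nullable 'Int64' instead of 'int64' for x5 so NaNs don't break the cast.
dtypes = {
    'x1': 'object', 'x2': 'object', 'x3': 'float64', 'x4': 'object',
    'x5': 'Int64', 'x6': 'object', 'x7': 'object', 'x8': 'object',
    'x9': 'float64', 'x10': 'object', 'x11': 'object', 'x12': 'object',
    'x13': 'object', 'x14': 'object', 'x15': 'object', 'x16': 'object',
    'x17': 'float64', 'x18': 'object', 'x19': 'float64', 'x20': 'float64',
    'x21': 'object', 'x22': 'float64',
}

@dask.delayed
def load(path):
    pdf = pd.read_json(path)
    # Add missing columns as NaN, drop unexpected extras, and enforce
    # dtypes so every partition ends up with an identical schema.
    return pdf.reindex(columns=list(dtypes)).astype(dtypes)

# An empty frame with the target schema, so Dask doesn't have to guess.
meta = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in dtypes.items()})

df = dd.from_delayed([load(f) for f in files], meta=meta)

This does make every partition match meta exactly, which seems to be what from_delayed is complaining about, but it feels heavy-handed. Is there a built-in way to get the same effect?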