I have a dask dataframe containing a column of JSON strings, and I want to parse that column into dataframe format.

The JSON column looks like:

{"Name": {"id": 1000, "address": "ABC", ...}}, ...

So I want to extract only the value of "Name" and make each key inside it a column, with each value becoming a row value, like:

id    address ...
1000  ABC
2000  DEF
3000  GHA
...   ...

I think we can read a JSON file into a dask dataframe with read_json, but how could I do that here?

SayZ
  • How would you do this with Pandas? – quasiben May 14 '20 at 01:44
  • If it were a pandas dataframe, I would use json_normalize from pandas.io.json, like this (it doesn't work on a dask dataframe): `df_json = json_normalize(df['json_col'].apply(lambda x: json.loads(x))); df_json.head()` – SayZ May 14 '20 at 04:35
  • So you could do something similar with dask bag: `db.read_text('data.jsonl').map(json.loads)`, then convert to a dataframe with `.to_dataframe()` (see the sketch after these comments). Have you read over https://examples.dask.org/applications/json-data-on-the-web.html ? – quasiben May 14 '20 at 13:04
  • @quasiben, please submit this as an answer, so it doesn't look like the question is pending – mdurant May 14 '20 at 13:19
  • @quasiben Sorry, there's one thing I did not mention: I read the data from MySQL using the read_sql_table method, so I can't use other read methods like read_text. The dataframe returned by read_sql_table contains a column in JSON format, which is what I want to normalize. – SayZ May 15 '20 at 01:59
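
For reference, a minimal sketch of the dask bag route suggested above, assuming newline-delimited JSON in a hypothetical file data.jsonl (it doesn't cover the read_sql_table case, which the answer below addresses):

import json
import dask.bag as db

# Read newline-delimited JSON, parse each line into a dict,
# keep only the inner "Name" record, then build a dataframe.
bag = db.read_text('data.jsonl').map(json.loads)
ddf = bag.map(lambda d: d['Name']).to_dataframe()
print(ddf.head())  # columns: id, address, ...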

1 Answer

The operation that you're doing appears to be embarrassingly parallel. As a result, you can write a Pandas function and then apply that function across a dask dataframe in parallel.

import pandas

def f(df: pandas.DataFrame) -> pandas.DataFrame:
    ...  # however you would do this in Pandas

ddf = ddf.map_partitions(f)
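
For the concrete case from the comments (a JSON column coming out of read_sql_table), a sketch might look like the following; the column name json_col and the id/address schema in meta are assumptions about your data, and meta is needed so Dask knows the output columns without evaluating the function eagerly (see the comment below):

import json
import pandas as pd

def parse_json_column(df: pd.DataFrame) -> pd.DataFrame:
    # Parse each JSON string, keep the inner "Name" record,
    # and expand its keys into columns.
    records = df['json_col'].apply(lambda s: json.loads(s)['Name'])
    return pd.json_normalize(records.tolist())

# meta describes the output schema (assumed columns/dtypes here)
meta = pd.DataFrame({'id': pd.Series(dtype='int64'),
                     'address': pd.Series(dtype='object')})
ddf = ddf.map_partitions(parse_json_column, meta=meta)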
MRocklin
  • This won't work without the `meta` keyword. Please provide complete answers or avoid commenting at all! – Dzeri96 Aug 15 '21 at 13:30