I'm using elasticsearch_dsl to build queries and searches against an Elasticsearch DB.

One of the fields I'm querying is an address, which has a structure like so:

address.first_line
address.second_line
address.city
address.code

The returned documents hold this in JSON structures, such that the address is held in a dict with a field for each sub-field of address.
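
For example, a single returned document holds something like this (values omitted; the structure is as described above):

    {
        "address": {
            "first_line": "...",
            "second_line": "...",
            "city": "...",
            "code": "..."
        }
    }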

I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.

Directly putting address into the dataframe gives me a column of address dicts, and iterating the rows to manually unpack (pd.json_normalize()) each address dict takes a long time (4 days for ~200,000 rows).

From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?

1 Answer

Searching for a way to solve this problem, I came across my own answer and found it lacking, so I'm updating it with a better way.

Specifically: pd.json_normalize(df['json_column'])

In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)

Then drop the original column if required.
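
Put together, a minimal runnable sketch (the column name and address field values are assumptions, mirroring the structure in the question):

    import pandas as pd

    # Toy frame standing in for the Elasticsearch results: one dict per row.
    df = pd.DataFrame({
        "doc_id": [1, 2],
        "address": [
            {"first_line": "1 High St", "second_line": "", "city": "Leeds", "code": "LS1"},
            {"first_line": "2 Low Rd", "second_line": "Flat 3", "city": "York", "code": "YO1"},
        ],
    })

    # Flatten the dict column in one pass, attach the new columns, and drop
    # the original dict column. reset_index keeps the two frames aligned by
    # position, since json_normalize returns a frame with a fresh index.
    flat = pd.concat(
        [df.drop(columns="address").reset_index(drop=True),
         pd.json_normalize(df["address"])],
        axis=1,
    )
    # flat columns: doc_id, first_line, second_line, city, code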

Original answer from last year, which does the same thing much more slowly:

df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.

pd.concat([df, new_df], axis=1) gets the new columns onto the old dataframe.

Then delete the original column_of_dicts.

pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.
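
As far as I can tell, the speed difference comes from .apply(pd.Series) constructing a brand-new Series object for every row, whereas pd.json_normalize builds the flattened columns for all rows in a single pass.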