
I am trying to split a column containing an array of JSON objects into multiple columns and create multiple rows. Below is the sample data:

signalid    monthyear   readings
5135         201901    [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135         201901    [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135         201901    [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135         201901    [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]

Details about the data:

  1. The data is fetched from a Cassandra database.

  2. The column's data type in Cassandra is text.

  3. dataframe.readings[0]

    '[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'

  4. dataframe.readings.dtype

    dtype('O')

Normalize code:

normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
                            meta=["signalid", "monthyear"])

Expected output:

signalid    monthyear  v     q         t           g
5135         201901    90    1    1546444800000    0
5135         201901    50    1    1546444900000    0
5135         201901    40    1    1546445800000    0
5135         201901    30    1    1546446800000    0
5135         201901    35    1    1546446900000    0

Result:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
     60 #normalizeddataframe = pandas.read_json(dataframe.readings)
     61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
---> 62                             meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
     63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
     64 print("Processig completed");

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
    260                 records.extend(recs)
    261 
--> 262     _recursive_extract(data, record_path, {}, level=0)
    263 
    264     result = DataFrame(records)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
    236         else:
    237             for obj in data:
--> 238                 recs = _pull_field(obj, path[0])
    239 
    240                 # For repeating the metadata later

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
    183                 result = result[field]
    184         else:
--> 185             result = result[spec]
    186 
    187         return result

TypeError: string indices must be integers
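The error arises because each readings value is still a plain string (the Cassandra column is text), so json_normalize ends up indexing strings with string keys. A minimal reproduction of the root cause, using one of the sample values:

reading = '[{"v":"90","q":1,"t":1546444800000,"g":0}]'
reading['v']  # TypeError: string indices must be integers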
– Nithin B

3 Answers


I know this is a few years old at this point, but I figured I'd add my answer in case anyone else runs into this issue. It focuses on maintaining the structure of the input and output as specified by the author. I'm sure there's some room for optimization.

import pandas as pd
import json

signalid = [5135, 5135, 5135, 5135]
monthyear = [201901, 201901, 201901, 201901]
readings = ['[{"v":"90","q":1,"t":1546444800000,"g":0}]', '[{"v":"50","q":1,"t":1546444900000,"g":0}]', '[{"v":"40","q":1,"t":1546445800000,"g":0}]', '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]']

# Reconstruct sample DataFrame
df = pd.DataFrame({
        "signalid": signalid,
        "monthyear": monthyear,
        "readings": readings
    })

# Convert strings to JSON objects
df['readings'] = df['readings'].map(json.loads)

# Can't use nested lists of JSON objects in pd.json_normalize
df = df.explode(column='readings').reset_index(drop=True)

# pd.json_normalize expects a list of JSON objects not a DataFrame
df = pd.concat([df[['signalid', 'monthyear']], pd.json_normalize(df['readings'])], axis=1) 

print(df)

Output:

   signalid  monthyear   v  q              t  g
0      5135     201901  90  1  1546444800000  0
1      5135     201901  50  1  1546444900000  0
2      5135     201901  40  1  1546445800000  0
3      5135     201901  30  1  1546446800000  0
4      5135     201901  35  1  1546446900000  0
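Note: DataFrame.explode requires pandas 0.25 or newer, and json_normalize is only exposed as top-level pd.json_normalize from pandas 1.0; on older versions it is imported with from pandas.io.json import json_normalize.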
– Noah P.

Let's assume we have a DataFrame with a column named readings whose values are

[{"v":"90","q":1,"t":1546444800000,"g":0}]

[{"v":"50","q":1,"t":1546444900000,"g":0}]

[{"v":"40","q":1,"t":1546445800000,"g":0}]

for row in data['readings']:
    for value_dict in row:
        for key in value_dict:
            data[key] = value_dict[key]

In the case of multiple JSON objects in the list, you did not mention your expected output, so I assume this might help you. If you add the desired output, I can modify my code accordingly.

– bumblebee
  • Added the expected output to the question. – Nithin B Feb 06 '19 at 05:31
  • I think the problem is that the row variable is a string object and not a JSON object, so value_dict and key come out as single characters like [. – Nithin B Feb 06 '19 at 05:42
  • You cannot add multiple column values to the same row. It will overwrite the previous one. – bumblebee Feb 06 '19 at 06:59
  • But the v, q, t, g values are different within the array, so two rows need to be created, as I see it. This link shows the same thing that I am trying to do: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas . I think the problem is that it is a string instead of a JSON object. – Nithin B Feb 06 '19 at 08:37
  • The above code gives a TypeError: string indices must be integers error. – Nithin B Feb 06 '19 at 09:18
  • Use json_normalize as you mentioned in the above link and append the normalized data. Refer to this: https://stackoverflow.com/questions/49671693/pandas-dataframe-normalize-one-json-column-and-merge-with-other-columns – bumblebee Feb 06 '19 at 10:29
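A minimal sketch of what that last comment suggests, assuming the readings strings are parsed with json.loads first and the DataFrame is converted to records before normalizing (the variable names follow the question):

import json
import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

# Parse the text column into lists of dicts, then normalize the records
dataframe['readings'] = dataframe['readings'].map(json.loads)
normalizeddataframe = json_normalize(dataframe.to_dict('records'),
                                     record_path='readings',
                                     meta=['signalid', 'monthyear'])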

Maybe there is a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a DataFrame whose index is given by signalid and monthyear.

This function converts a dictionary into a DataFrame and sets the index:

import pandas as pd

def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(
        pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))

The next function triggers the call to to_df for each reading in a row:

def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

We now apply concat to each row. The result is a Series of DataFrames, which we can concatenate to obtain the final data:

s = df.apply(concat, axis=1)
pd.concat(s.tolist())
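Note that readings in the question is stored as JSON text, so this solution assumes each entry has already been parsed into a list of dictionaries; a minimal preparatory step, assuming the strings are valid JSON:

import json

df['readings'] = df['readings'].map(json.loads)  # parse the text column first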
– JoergVanAken