-1

I have a numpy.ndarray that looks like the sample below, and I want to flatten it so that I can manipulate it. Here is my sample data:

sample_data=[list([{'region': 'urn:li:region:9194', 'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}}, {'region': 'urn:li:region:7127', 'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}])]

I have tried to use the following code but no luck yet:

sample.flatter()

The desired output is as follows:

region                organicFollowerCount   paidFollowerCount
urn:li:region:9194    157                    0
urn:li:region:7127    17                     0

Can anyone help me achieve this, please?

user86907

2 Answers

1

Here is an approach that uses pd.json_normalize:

import pandas as pd

# note that `sample_data` has been modified into a flat list of dictionaries
sample_data = [
    {'region': 'urn:li:region:9194', 
     'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}}, 
    {'region': 'urn:li:region:7127', 
     'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}
]

Now, convert each item in the list to a data frame:

dfs = list()

# convert one dict at a time into a data frame, using json_normalize()
for sd in sample_data:
    t = pd.json_normalize(sd)
    dfs.append(t)

# convert list of dataframes into a single data frame, 
#   and change column labels
t = pd.concat(dfs).rename(columns={
    'followerCounts.organicFollowerCount': 'organicFollowerCount',
    'followerCounts.paidFollowerCount': 'paidFollowerCount'
}).set_index('region')

print(t)


                    organicFollowerCount  paidFollowerCount
region                                                     
urn:li:region:9194                   157                  0
urn:li:region:7127                    17                  0

As @thehumaneraser noted, this format is not ideal, but we can't always influence the format of the data we receive.
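If the data arrives with the extra level of nesting shown in the question, a shorter variant may also work. This is only a sketch: it assumes every inner element is a list of dicts with the same keys, and it relies on pd.json_normalize accepting a whole list of dicts at once, so the loop and pd.concat are not needed. The name `records` is introduced here for illustration.

```python
import pandas as pd

# Same records as in the question, wrapped in an outer list as originally posted.
sample_data = [[
    {'region': 'urn:li:region:9194',
     'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}},
    {'region': 'urn:li:region:7127',
     'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}},
]]

# Unwrap one level of nesting to get a flat list of dicts (assumed structure).
records = [record for inner in sample_data for record in inner]

# json_normalize flattens nested keys into "parent.child" column labels.
df = pd.json_normalize(records)

# Keep only the last part of each label, e.g. "followerCounts.paidFollowerCount"
# becomes "paidFollowerCount".
df.columns = [label.split('.')[-1] for label in df.columns]
df = df.set_index('region')

print(df)
```

Splitting the labels on "." avoids listing every column by hand in rename(), but it would produce duplicate names if two nested dicts shared a key, so the explicit rename above is safer for wider data.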

jsmart
0

You are not going to be able to flatten this data the way you want with NumPy's flatten method. That method simply takes a multi-dimensional ndarray and collapses it to one dimension; it does not unpack dictionaries stored inside the array. You can read the docs here.
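To illustrate the point, here is a small sketch of what flatten does and does not do. The arrays `grid` and `records` are made up for this example and are not the asker's data.

```python
import numpy as np

# flatten() turns a 2 x 3 array into a 1-D array of 6 elements.
grid = np.array([[1, 2, 3], [4, 5, 6]])
flat = grid.flatten()
print(flat.shape)

# An object array holding a dict: flatten() only changes the array's shape.
# The dict inside is returned untouched, with its nesting intact.
records = np.array(
    [[{'region': 'urn:li:region:9194',
       'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}}]],
    dtype=object,
)
flat_records = records.flatten()
print(flat_records.shape)
print(type(flat_records[0]))
```

So even with a correctly spelled call, the result would still be a 1-D array of dictionaries rather than a table.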

A couple of other things. First, your sample data above is not an ndarray; it is just a Python list. And because you call list() inside square brackets, it is actually a nested list of dictionaries. This is really not a good way to store this information, and this convoluted format leaves you very few options for nicely "flattening" it into the table you want.

If you have many rows like this I would do the following:

import numpy as np

headers = ["region", "organicFollowerCount", "paidFollowerCount"]
data = [headers]  # first row holds the column names
for row in sample_data[0]:  # Subindexing here because it is unwisely a nested list
    formatted_row = []
    formatted_row.append(row["region"])
    # the two counts live one level down, inside "followerCounts"
    formatted_row.append(row["followerCounts"]["organicFollowerCount"])
    formatted_row.append(row["followerCounts"]["paidFollowerCount"])
    data.append(formatted_row)
data = np.array(data)

This will give you an ndarray of the data as you have it here, but this is still an ugly solution. Really this is a highly impractical presentation of data and you should ditch it for a better one.

One last thing: don't use camel case. That is standard practice for some languages like Java, but not for Python. Instead of organicFollowerCount, use organic_follower_count, and so on.
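If the camel-case keys come from an external source and you cannot change them there, one option is to convert the names after loading. The helper below is a hypothetical sketch, not part of any library, and it assumes simple camelCase names without acronyms or digits.

```python
import re


def to_snake_case(name):
    """Convert a simple camelCase name to snake_case (hypothetical helper)."""
    # Insert "_" before every uppercase letter except at the start, then lowercase.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


columns = ['region', 'organicFollowerCount', 'paidFollowerCount']
snake_columns = [to_snake_case(column) for column in columns]
print(snake_columns)
```

The same function could be applied to the header row before building the array.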

thehumaneraser