Python: Group Unique ids and Merge Values into an Array

Question

I have a pandas DF that looks like this

DF:

       name ;time_cost
        x   ;28800000, 250 
        x   ;39600000, 300 
        x   ;61200000, 250 
        x   ;72000000, 0 
        y   ;86400000, 0 
        y   ;115200000, 250 
        y   ;126000000, 300 
        y   ;147600000, 250
        y   ;158400000, 0

df.head().to_dict()
{'name': {0: 'x',
          1: 'x',
          2: 'x',
          3: 'x'},
 'time_cost': {0: '28800000, 250',
                1: '39600000, 300',
                2: '61200000, 250',
                0: '72000000, 0'}}

I'm trying to put all the values from time_cost into an array like so:

[[[28800000, 250],
[39600000, 300],
[61200000, 250],
[72000000, 0 ],
[86400000, 0 ]],
[[115200000, 250],
[126000000, 300],
[147600000, 250],
[158400000, 0]]]

Here's what I have tried:

    import pandas as pd

        df = pd.read_csv('file.csv', sep=';')
        def f(df):
            return pd.Series(dict(timecost_range = "%s" % '| '.join(df['time_cost'])))

        result = df.groupby('name').apply(f)
        result
                timecost_range
        name    
        x       28800000, 250| 39600000, 300| 61200000, 250| 72000000, 0
        y       86400000, 0| 115200000, 250| 126000000, 300| 147600000, 250|...

This works somewhat, but isn't exactly what I am looking for. Any ideas or suggestions would be useful.

Can you add `df.head().to_dict()` to the question. I think it's ambiguous as to how `'time_cost'` is stored in your `DataFrame` — ALollz, Aug 29 '18 at 21:22
Also, are there only these two (three) columns in `df` or can there be more? — cs95, Aug 29 '18 at 21:25
Hi @ALollz I've edited my question. @coldspeed the CSV separator is `sep=';'`, and yes there are only two (three) columns. — mm_nieder, Aug 29 '18 at 21:34

Anna Iliukovich-Strakovskaia · Accepted Answer · 2018-08-30T14:41:10.593

In my example, data is:

df= pd.DataFrame({'name': {0: 'x',
          1: 'x',
          2: 'x',
          3: 'y'},
 'time_cost': {0: '28800000, 250',
                1: '39600000, 300',
                2: '61200000, 250',
                3: '72000000, 0'}})

Step 1. You can use something like this to get result:

def split_function(n):
    return n.split(',')

df['time_cost'] = df.time_cost.apply(split_function)

Output:

name         time_cost
0    x  [28800000,  250]
1    x  [39600000,  300]
2    x  [61200000,  250]
3    y    [72000000,  0]

Step 2. If you want two different columns in your DataFrame you can use:

df.time_cost.apply(pd.Series)

Output:

          0     1
0  28800000   250
1  39600000   300
2  61200000   250
3  72000000     0

Step 3. And then join them:

df = df.join(df.time_cost.apply(pd.Series))

Output:

name         time_cost         0     1
0    x  [28800000,  250]  28800000   250
1    x  [39600000,  300]  39600000   300
2    x  [61200000,  250]  61200000   250
3    y    [72000000,  0]  72000000     0

And then you can use drop to drop "time_cost" column and rename to rename new columns if you like.

Is it what you want? I hope it will be helpful.

UPD:

Step 4. If you want grouped by name, you can use this:

df[0] = df[0].astype(int)
df[1] = df[1].astype(int)

def concat_function_0(df):
    return np.array(df[0])
def concat_function_1(df):
    return np.array(df[1])

df = pd.DataFrame([df.groupby('name').apply(concat_function_1), df.groupby('name').apply(concat_function_0)]).T

It isn't pythonic, but it works = )

Output:

name               0                               1                                                 
x     [250, 300, 250]  [28800000, 39600000, 61200000]
y                 [0]                      [72000000]

UPD:

Step 5. For your result, after first step use this:

def df_to_array(df):
    return list(df.time_cost)

result = df.groupby('name').apply(df_to_array).values

Output:

[[['28800000', ' 250'], ['39600000', ' 300'], ['61200000', ' 250']]
 [['72000000', ' 0']]]

Hi @Anna Iliukovich-Strakovskaia, thank you for your response. This is close, but I need to `groupby` `name` and then some how merge the `time_cost` values together in an array like so: `[[[28800000, 250], [39600000, 300], [61200000, 250], [72000000, 0 ], [86400000, 0 ]], [[115200000, 250], [126000000, 300], [147600000, 250], [158400000, 0]]]` — mm_nieder, Aug 30 '18 at 14:24
Hi! @mm_nieder see the last update. I didn't understand what you want first time = ). Does last update help? — Anna Iliukovich-Strakovskaia, Aug 30 '18 at 14:42
@mm_nieder So you don't need step 2-4, but I think it can be helpful for other peoples. — Anna Iliukovich-Strakovskaia, Aug 30 '18 at 14:43
Thank you SO MUCH for your brilliant answer :) Very helpful! — mm_nieder, Aug 30 '18 at 14:57

Python: Group Unique ids and Merge Values into an Array

1 Answers1