
This is a question about how to do things properly with pandas (I use version 1.0). Let's say I have a DataFrame of missions, each of which has an origin and one or more destinations:

   mid from         to
0    0    A        [C]
1    1    A     [B, C]
2    2    B        [B]
3    3    C  [D, E, F]

E.g., for the mission with mid=1, people travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
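For illustration only (this is not part of the MCVE), the round trip described above can be sketched in plain Python for a single mission. The helper name `legs` is hypothetical:

```python
def legs(origin, destinations):
    """Return the (from, to) pairs of a round trip that starts and ends at origin."""
    # Full chain of stops: origin, every destination in order, then origin again.
    stops = [origin, *destinations, origin]
    # Pair each stop with the one that follows it.
    return list(zip(stops[:-1], stops[1:]))


print(legs("A", ["B", "C"]))  # [('A', 'B'), ('B', 'C'), ('C', 'A')]
```

A mission with n destinations therefore yields n + 1 travels.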

I would like to compute metrics for each travel of a mission. The expected output is exactly:

    tid  mid from to
0     0    0    A  C
1     1    0    C  A
2     2    1    A  B
3     3    1    B  C
4     4    1    C  A
5     5    2    B  B
6     6    2    B  B
7     7    3    C  D
8     8    3    D  E
9     9    3    E  F
10   10    3    F  C

I have found a way to achieve my goal. Please find the MCVE below:

import pandas as pd

# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns = ["from", "to"]
).reset_index().rename(columns={'index': 'mid'})

# Create chain:
# [x['from']] rather than list(x['from']), which would split a multi-character origin into letters
df['chain'] = df.apply(lambda x: [x['from']] + x['to'] + [x['from']], axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})

My question is: is there a better or easier way to do this with pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.

jlandercy

2 Answers


Your operation is basically explode and concat:

# turn the series of lists into a single series
tmp = df[['mid','to']].explode('to')

# new `from` is the concatenation of `from` and the list;
# a stable sort keeps rows that share an index in concat order
df1 = pd.concat((df[['mid','from']],
                 tmp.rename(columns={'to':'from'}))
               ).sort_index(kind='mergesort')

# new `to` is the concatenation of the list and `from`
df2 = pd.concat((tmp,
                 df[['mid','from']].rename(columns={'from':'to'}))
               ).sort_index(kind='mergesort')

df1['to'] = df2['to']

Output:

   mid from to
0    0    A  C
0    0    C  A
1    1    A  B
1    1    B  C
1    1    C  A
2    2    B  B
2    2    B  B
3    3    C  D
3    3    D  E
3    3    E  F
3    3    F  C
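
The output above keeps the duplicated mission index and has no tid column. As a minimal sketch, assuming the input from the question, the same explode-and-concat idea can be finished with a fresh index exposed as tid. The names `starts`, `ends` and `result` are only illustrative:

```python
import pandas as pd

# Input as given in the question.
df = pd.DataFrame({
    "mid": [0, 1, 2, 3],
    "from": ["A", "A", "B", "C"],
    "to": [["C"], ["B", "C"], ["B"], ["D", "E", "F"]],
})

# One row per destination, keeping the mission index.
tmp = df[["mid", "to"]].explode("to")

# Start of each travel: the origin, then every destination.
starts = pd.concat(
    [df[["mid", "from"]], tmp.rename(columns={"to": "from"})]
).sort_index(kind="mergesort")  # stable sort preserves concat order within a mission

# End of each travel: every destination, then back to the origin.
ends = pd.concat(
    [tmp, df[["mid", "from"]].rename(columns={"from": "to"})]
).sort_index(kind="mergesort")

# Assign by position to sidestep alignment on the duplicated index.
starts["to"] = ends["to"].to_numpy()

# Replace the duplicated index with a running travel id.
result = starts.reset_index(drop=True).rename_axis("tid").reset_index()
print(result)
```

Assigning through `to_numpy()` is a design choice here: both frames hold the same rows in the same order, so positional assignment avoids relying on index alignment with duplicate labels.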
Quang Hoang

If you don't mind reconstructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of stops, then assign mid by repeating each value once per trip (the length of each sublist in l):

import pandas as pd
import numpy as np
from itertools import chain

l = [[fr]+to for fr,to in zip(df['from'], df['to'])]

df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))

   from to  mid
0     A  C    0
1     C  A    0
2     A  B    1
3     B  C    1
4     C  A    1
5     B  B    2
6     B  B    2
7     C  D    3
8     D  E    3
9     E  F    3
10    F  C    3
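
To see the np.roll pairing in isolation, here is a small sketch for the last mission only. The variable names `stops`, `rolled` and `pairs` are just for illustration:

```python
import numpy as np

# Origin followed by the destinations of the mission with mid=3.
stops = ["C", "D", "E", "F"]

# Shifting by -1 moves every stop one place to the left and wraps the
# first stop around to the end, which closes the loop back to the origin.
rolled = np.roll(stops, -1)

pairs = list(zip(stops, rolled.tolist()))
print(pairs)  # [('C', 'D'), ('D', 'E'), ('E', 'F'), ('F', 'C')]
```

Because the roll wraps around, there is no need to append the origin to the end of each list, which is why l only prepends it.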
ALollz