2

In Pandas, is it possible to have a dataframe with a column that contains a varying number of subcolumns?

For example, suppose I have this CSV file:

transactionId, userName, date, itemList, totalCost

where the itemList contains a variable number of itemId;itemPrice pairs, with the pairs separated by a pipe (|). There is no upper bound on the number of itemId;itemPrice pairs in the list.

itemId ; itemPrice | itemId ; itemPrice

Here are some examples of rows:

transactionId, userName, date,       itemList,              totalCost
123,           Bob  ,    7/29/2017,  ABC;10|XYZ;20,         30
234,           Alice,    7/31/2017,  CDE;20|QRS;15|KLM;10,  45

The first row has two itemId;itemPrice pairs, while the second row has three pairs.

How can I create a dataframe to contain this information? Would I need a dataframe inside a dataframe?

There are other Stackoverflow posts on variable number of columns, but they assume a maximum number of columns.

stackoverflowuser2010
  • 38,621
  • 48
  • 169
  • 217
  • You could consider storing them as dictionaries? – cs95 Jul 31 '17 at 21:35
  • 2
    If you want one separate value per cell (rather than a dictionary), you should decide whether to make the table "wide" (one extra column per extra value) or "tall" (one extra row per extra value). The answer would differ based on your choice. – DYZ Jul 31 '17 at 21:36
  • @DYZ: I think I need them wide. For example, one of my goals is to audit whether the sum of the `itemPrice` values is equal to the `totalCost` of the entire row. – stackoverflowuser2010 Jul 31 '17 at 21:52

2 Answers2

6

I'd try to normalize your data as proposed by @DYZ in comments:

In [145]: df = df.join(df.pop('itemList')
     ...:                .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
     ...:                .reset_index(level=1, drop=True))
     ...:

In [146]: df
Out[146]:
   transactionId userName       date  totalCost item price
0            123      Bob  7/29/2017         30  ABC    10
0            123      Bob  7/29/2017         30  XYZ    20
1            234    Alice  7/31/2017         45  CDE    20
1            234    Alice  7/31/2017         45  QRS    15
1            234    Alice  7/31/2017         45  KLM    10

Normalized data allows us to apply Pandas/Numpy/SciPy/etc. ufunctions directly on columns containing scalar values.

Demo: checking totalCost

df.price = pd.to_numeric(df.price, errors='coerce')

In [151]: df.assign(tot2=df.groupby(level=0).price.transform('sum'))
Out[151]:
   transactionId userName       date  totalCost item  price  tot2
0            123      Bob  7/29/2017         30  ABC     10    30
0            123      Bob  7/29/2017         30  XYZ     20    30
1            234    Alice  7/31/2017         45  CDE     20    45
1            234    Alice  7/31/2017         45  QRS     15    45
1            234    Alice  7/31/2017         45  KLM     10    45

In [152]: df.assign(tot2=df.groupby(level=0).price.transform('sum')).query("totalCost != tot2")
Out[152]:
Empty DataFrame
Columns: [transactionId, userName, date, totalCost, item, price, tot2]
Index: []

PS last empty DF shows that we don't have any entries where totalCost != sum(price)

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
2

You parse them with a list comprehension

d1 = df.assign(
    itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)

d1

   transactionId userName       date                           itemList  totalCost
0            123      Bob  7/29/2017             [[ABC, 10], [XYZ, 20]]         30
1            234    Alice  7/31/2017  [[CDE, 20], [QRS, 15], [KLM, 10]]         45

Response to Comment

f = lambda x: np.array(x)[:, 1].astype(int).sum()

d1.assign(sumPrice=d1.itemList.apply(f))

   transactionId userName       date                           itemList  totalCost  sumPrice
0            123      Bob  7/29/2017             [[ABC, 10], [XYZ, 20]]         30        30
1            234    Alice  7/31/2017  [[CDE, 20], [QRS, 15], [KLM, 10]]         45        45
piRSquared
  • 285,575
  • 57
  • 475
  • 624