Pandas - read CSV into dataframe, where a column has a varying number of subcolumns

Question

In Pandas, is it possible to have a dataframe with a column that contains a varying number of subcolumns?

For example, suppose I have this CSV file:

transactionId, userName, date, itemList, totalCost

where the itemList contains a variable number of itemId;itemPrice pairs, with the pairs separated by a pipe (|). There is no upper bound on the number of itemId;itemPrice pairs in the list.

itemId ; itemPrice | itemId ; itemPrice

Here are some examples of rows:

transactionId, userName, date,       itemList,              totalCost
123,           Bob  ,    7/29/2017,  ABC;10|XYZ;20,         30
234,           Alice,    7/31/2017,  CDE;20|QRS;15|KLM;10,  45

The first row has two itemId;itemPrice pairs, while the second row has three pairs.

How can I create a dataframe to contain this information? Would I need a dataframe inside a dataframe?

There are other Stackoverflow posts on variable number of columns, but they assume a maximum number of columns.

If you want one separate value per cell (rather than a dictionary), you should decide whether to make the table "wide" (one extra column per extra value) or "tall" (one extra row per extra value). The answer would differ based on your choice. — DYZ, Jul 31 '17 at 21:36
@DYZ: I think I need them wide. For example, one of my goals is to audit whether the sum of the `itemPrice` values is equal to the `totalCost` of the entire row. — stackoverflowuser2010, Jul 31 '17 at 21:52

MaxU - stand with Ukraine · Answer 1 · 2017-07-31T21:57:41.950

I'd try to normalize your data as proposed by @DYZ in comments:

In [145]: df = df.join(df.pop('itemList')
     ...:                .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
     ...:                .reset_index(level=1, drop=True))
     ...:

In [146]: df
Out[146]:
   transactionId userName       date  totalCost item price
0            123      Bob  7/29/2017         30  ABC    10
0            123      Bob  7/29/2017         30  XYZ    20
1            234    Alice  7/31/2017         45  CDE    20
1            234    Alice  7/31/2017         45  QRS    15
1            234    Alice  7/31/2017         45  KLM    10

Normalized data allows us to apply Pandas/Numpy/SciPy/etc. ufunctions directly on columns containing scalar values.

Demo: checking totalCost

df.price = pd.to_numeric(df.price, errors='coerce')

In [151]: df.assign(tot2=df.groupby(level=0).price.transform('sum'))
Out[151]:
   transactionId userName       date  totalCost item  price  tot2
0            123      Bob  7/29/2017         30  ABC     10    30
0            123      Bob  7/29/2017         30  XYZ     20    30
1            234    Alice  7/31/2017         45  CDE     20    45
1            234    Alice  7/31/2017         45  QRS     15    45
1            234    Alice  7/31/2017         45  KLM     10    45

In [152]: df.assign(tot2=df.groupby(level=0).price.transform('sum')).query("totalCost != tot2")
Out[152]:
Empty DataFrame
Columns: [transactionId, userName, date, totalCost, item, price, tot2]
Index: []

PS last empty DF shows that we don't have any entries where totalCost != sum(price)

piRSquared · Answer 2 · 2017-07-31T22:16:39.230

2

You parse them with a list comprehension

d1 = df.assign(
    itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)

d1

   transactionId userName       date                           itemList  totalCost
0            123      Bob  7/29/2017             [[ABC, 10], [XYZ, 20]]         30
1            234    Alice  7/31/2017  [[CDE, 20], [QRS, 15], [KLM, 10]]         45

Response to Comment

f = lambda x: np.array(x)[:, 1].astype(int).sum()

d1.assign(sumPrice=d1.itemList.apply(f))

   transactionId userName       date                           itemList  totalCost  sumPrice
0            123      Bob  7/29/2017             [[ABC, 10], [XYZ, 20]]         30        30
1            234    Alice  7/31/2017  [[CDE, 20], [QRS, 15], [KLM, 10]]         45        45

edited Jul 31 '17 at 22:16

answered Jul 31 '17 at 21:42

piRSquared

285,575
57
475
624

This is an interesting representation. I created this dataframe, and `dataframe.info()` says that the `itemList` column is of type `Object`. How would I iterate over the items in `itemList` to sum up the `itemPrice` values? – stackoverflowuser2010 Jul 31 '17 at 22:04
I could do it but we have already gone down a dark path without a flashlight... You sure you want to keep going? I'd stick to the tall dataframe in @MaxU's answer. – piRSquared Jul 31 '17 at 22:06
If you could, a lot of readers would appreciate it, myself included. I'm trying to understand the tradeoffs involved, e.g. space vs coding complexity. – stackoverflowuser2010 Jul 31 '17 at 22:10
@stackoverflowuser2010 added example. – piRSquared Jul 31 '17 at 22:17
Thank you. Greatly appreciate it. – stackoverflowuser2010 Jul 31 '17 at 22:22
For completeness, this works as well: `d2 = d1.assign(itemPriceSum = d1.itemList.apply(lambda x: sum([int(item[1]) for item in x])))` – stackoverflowuser2010 Jul 31 '17 at 23:18

Pandas - read CSV into dataframe, where a column has a varying number of subcolumns

2 Answers2