
I am experimenting with the UCI Bag of Words dataset. I have read the document IDs, word IDs, and word counts into three separate lists. The first 10 items of each list look like this:

['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count

I can't figure out how to build a term-document matrix from these lists without any redundancy. I'd like the rows to be docIDs, the columns to be wordIDs, and the cell values to be the corresponding word counts. What is an efficient way to do this in Python with pandas?

1 Answer


I think this answers your question:

Lists:

docid = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
wordid = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
counted = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count

DataFrame with each list in a separate column:

import pandas as pd

# One row per list, then transpose so each list becomes a column.
df = pd.DataFrame([docid, wordid, counted],
                  index=["docIDs", "wordIDs", "count"]).T

Pivot this for index as "docIDs", columns as "wordIDs", values as "count":

df = df.pivot(index="docIDs", columns="wordIDs", values="count")

Output:

#wordIDs  118  129  168   20  285  529 6941    7  890
#docIDs                                              
#1          1    1    1    2    1  NaN  NaN  NaN  NaN
#2        NaN  NaN  NaN  NaN  NaN    1    1    5  NaN
#3        NaN  NaN  NaN  NaN    1  NaN  NaN  NaN    1
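Note that the cell values here are still strings, because the input lists contain strings. If you want a numeric matrix with zeros in place of NaN (an assumption about the desired output; the question doesn't specify), you can convert afterwards:

# Replace missing (doc, word) pairs with a count of 0 and
# cast the string counts to integers.
df = df.fillna(0).astype(int)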

Alternatively, you can use unstack(): set both "docIDs" and "wordIDs" as the index, then unstack the "wordIDs" level:

df.set_index(["docIDs", "wordIDs"])["count"].unstack("wordIDs")

This produces the same result and should use less memory.
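If even the unstack() version runs out of memory on the full dataset, one option is to skip the dense table entirely and build a sparse matrix. A minimal sketch, assuming SciPy is available (this uses scipy.sparse rather than pandas, so it goes beyond the approaches above):

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Map each docID/wordID string to a contiguous integer code.
docs = pd.Categorical(docid)
words = pd.Categorical(wordid)

# Build a sparse docs-by-words matrix straight from the triples;
# duplicate (doc, word) pairs, if any, are summed.
tdm = coo_matrix(
    (np.asarray(counted, dtype=np.int64), (docs.codes, words.codes)),
    shape=(len(docs.categories), len(words.categories)),
).tocsr()

# Row i corresponds to docs.categories[i], column j to words.categories[j].

This never materializes the full dense matrix, so it scales to the full UCI dataset as long as the three input lists fit in memory.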

Rawson
  • Thank you. It works for a portion of the dataset, but runs out of memory on the full data. Is there a more efficient way to do this, or is the only option to process in batches? – Find Mind May 21 '22 at 12:15
  • I have added a second option that should use less memory, but if this also doesn't work then it may be best to do it in batches. – Rawson May 21 '22 at 14:18