
I am trying to create a DataFrame whose first column ("Value") holds a multi-word string in each row, and whose remaining columns are labeled with the unique words found across all the strings in "Value". I want to populate this DataFrame with word frequencies: for each string (row), the count of every unique word (column). In effect, I want to build a simple term-document matrix (TDM).

rows = ['you want peace', 'we went home', 'our home is nice', 'we want peace at home']
col_list = [word.lower().split(" ") for word in rows]
set_col = set(list(itertools.chain.from_iterable(col_list)))

columns = set_col
ncols = len(set_col)

testDF = pd.DataFrame(columns = set_col)
testDF.insert(0, "Value", " ")

testDF["Value"] = rows
testDF.fillna(0, inplace=True)

irow = 0

for tweet in testDF["Value"]:

    for word in tweet.split(" "):
        for col in xrange(1, ncols):

            if word == testDF.columns[col]: testDF[irow, col] += 1

    irow += 1

testDF.head()

However, I am getting an error:

KeyError                                  Traceback (most recent call last)
<ipython-input-64-9a991295ccd9> in <module>()
     23         for col in xrange(1, ncols):
     24 
---> 25             if word == testDF.columns[col]: testDF[irow, col] += 1
     26 
     27     irow += 1

C:\Users\Tony\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3824)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item   (pandas\hashtable.c:12280)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12231)()

KeyError: (0, 9)

I am not sure what is wrong, so I would appreciate your help. Also, if there is a cleaner way to do this (without textmining, which I have problems installing), it would be great to learn!

Anand S Kumar
Toly

1 Answer


I am not 100% sure what your complete program is trying to do, but if by the following -

testDF[irow, col]

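you meant to index a single cell in the DataFrame, with irow as the row and col as the column, you cannot use plain subscripting for that. testDF[irow, col] looks for a column whose label is the tuple (irow, col), which is why you get KeyError: (0, 9). You should instead use .iloc or similar. Example -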

 if word == testDF.columns[col]: testDF.iloc[irow, col] += 1

Use .iloc if you intend irow to be the 0-based position of the row. If irow is the actual index label of the DataFrame, you can use .loc instead of .iloc (with .loc, the column must also be given by its label rather than its position).
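For reference, here is a minimal sketch of the whole thing using .iloc, written for Python 3 (so range/enumerate instead of xrange). It assumes the same rows as in the question. The sorted word list and the zero-filled constructor are my own choices, not something the question requires. Note also that in the original loop xrange(1, ncols) stops one column short: with "Value" at position 0, the word columns sit at positions 1 through ncols. Looking the position up with columns.get_loc avoids that. The last two lines show one possible loop-free alternative using collections.Counter.

```python
import itertools
from collections import Counter

import pandas as pd

rows = ['you want peace', 'we went home', 'our home is nice', 'we want peace at home']

# Sorted so the column order is deterministic (a plain set has no fixed order).
words = sorted(set(itertools.chain.from_iterable(r.lower().split() for r in rows)))

# Start from integer zeros instead of calling fillna on an empty frame.
testDF = pd.DataFrame(0, index=range(len(rows)), columns=words)
testDF.insert(0, "Value", rows)

# Positional access with .iloc: row number and column number, both 0-based.
for irow, tweet in enumerate(testDF["Value"]):
    for word in tweet.lower().split():
        col = testDF.columns.get_loc(word)  # position of this word's column
        testDF.iloc[irow, col] += 1

# Label-based access with .loc reaches the same cell by index label and column name.
same_cell = testDF.loc[3, "home"] == testDF.iloc[3, testDF.columns.get_loc("home")]

# Alternative sketch without a cell-by-cell loop: one Counter per row.
counts = pd.DataFrame([dict(Counter(r.lower().split())) for r in rows])
counts = counts.reindex(columns=words).fillna(0).astype(int)
```

Both testDF[words] and counts should hold the same numbers. The Counter version is usually the easier one to read once the frame gets larger.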

Anand S Kumar
  • .iloc works like a breeze! Still new to Python and keep forgetting that access to dataframe elements is different from that for pd.arrays:) – Toly Oct 23 '15 at 05:54