
I want to join all the text columns of my dataframe into a single column so that I can fit it with a CountVectorizer.

def populate_distance_metrics(in_df, col_list, prim_col):

    vect_data=in_df[col_list[0]].map(str)
    print (type(vect_data))
    for col,idx in enumerate(col_list):
        if idx==0:
            continue;
        vect_data = vect_data + " " + in_df[col]

    cv = CountVectorizer(stop_words='english', max_features=1000)
    # Learn a vocabulary dictionary of all tokens 
    cv.fit(vect_data)
    print ('cv fit')

in_df is the source dataframe and col_list is a list of column names such as ['a','b','c',...]; I want to keep this flexible. The type of

vect_data=in_df[col_list[0]].map(str)

is

<class 'pandas.core.series.Series'>

The above code fails at vect_data = vect_data + " " + in_df[col] with this traceback:

vect_data = vect_data + " " + in_df[col]
  File "asd/asd/dsfg/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "asd/asd/dsfg/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "asd/asd/dsfg/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "asd/asd/dsfg/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "asd/asd/dsfg/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)

  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)

  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)

  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)

KeyError: 0

However, it works when I name the columns explicitly:

cv.fit(in_df['a']+ ' '+ in_df['b']+ in_df['c'])

What am I doing wrong?
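
One likely cause, judging from the KeyError: 0 in the traceback: enumerate yields (index, value) pairs, so unpacking them as "col, idx" binds col to the integer position and idx to the column name. The check idx==0 then never matches, and the first pass looks up in_df[0]. The sketch below uses only the standard library, with a plain dict of lists (the name rows and its contents are made up for illustration) standing in for the dataframe, so it shows the unpacking order rather than the pandas behaviour itself.

```python
col_list = ['a', 'b', 'c']

# enumerate yields (index, value) pairs, in that order.
pairs = list(enumerate(col_list))

# Unpacking as "col, idx" binds col to the integer position ...
swapped = [col for col, idx in enumerate(col_list)]
# ... while "idx, col" binds col to the column name.
fixed = [col for idx, col in enumerate(col_list)]

# Hypothetical stand-in for the dataframe: column name -> list of values.
rows = {'a': ['x', 'y'], 'b': [1, 2], 'c': ['p', 'q']}

# Looking up the integer 0 fails the same way in_df[0] does.
try:
    rows[swapped[0]]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc.args[0]


def join_columns(data, cols):
    """Join the values of the given columns row by row, separated by spaces."""
    # Start from the first column, converted to strings.
    joined = [str(value) for value in data[cols[0]]]
    # Append every remaining column; slicing replaces the index check.
    for col in cols[1:]:
        joined = [left + " " + str(value)
                  for left, value in zip(joined, data[col])]
    return joined


joined_text = join_columns(rows, col_list)
```

If that reading is right, the smallest change to the function in the question would be to write the loop as for idx, col in enumerate(col_list) (or to loop over col_list[1:] directly), and to apply .map(str) to each added column as well, since adding a non-string column to a string Series can fail.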

AbtPst