I'd like to point out that the accepted answer from Christian Gomes will fail if one of your words is 'label', because you'll overwrite the classification label with the count of that word in the vector. Also, because the count wasn't converted to an int, you can't do any math on it.
Since you know for certain that each (feature, value) pair is separated by a colon (:), you could get around this by making your 'label' key something like ':label'. It's not ideal, but it will avoid the collision.
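For what it's worth, here is a rough sketch of that ':label' workaround. The two sample lines are made up for illustration (the second one deliberately contains a word called 'label'), and parse_svm_line is just a name I picked, not something from the accepted answer.

```python
import pandas as pd

# Hypothetical sample in "label feature:count ..." form; the second row
# deliberately contains a feature literally named "label".
sample_lines = [
    "talk.politics.guns a:12 about:1",
    "talk.politics.mideast label:2 run:10",
]


def parse_svm_line(line):
    """Parse one line into a dict, storing the class under ':label'."""
    items = line.split()
    # Feature names come from splitting on ':', so none of them can
    # contain a colon and the ':label' key cannot be overwritten.
    row = {":label": items[0]}
    for pair in items[1:]:
        feature_name, count = pair.split(":")
        row[feature_name] = int(count)
    return row


df = pd.DataFrame([parse_svm_line(line) for line in sample_lines])
```

Here df[':label'] holds the classes and df['label'] holds the counts of the word 'label', so the two no longer clash.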
Alternatively, you could store the labels in a separate dataframe. That is probably the better solution, since you are unlikely to want to do math on your classification labels.
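import os
import pandas as pd

# open() does not expand "~", so expand it explicitly
svmformat_file = os.path.expanduser("~/svmformat_file_sample")

# Read to list
with open(svmformat_file, mode="r") as fp:
    svmformat_list = fp.read().splitlines()

# For each line we save the label and the key:values to separate dicts
pandas_label_list = []
pandas_feature_list = []
for line in svmformat_list:
    feature_dict = {}
    items = line.split()
    # The first item is the classification label; keep it apart from the features
    pandas_label_list.append({'label': items[0]})
    for pair in items[1:]:
        feature_name, count = pair.split(':')
        # Convert the count to an int so you can do math on it later
        feature_dict[feature_name] = int(count)
    pandas_feature_list.append(feature_dict)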
Then, using the same data Christian used, you now have two dataframes:
>>> pd.DataFrame(pandas_label_list)
                   label
0     talk.politics.guns
1  talk.politics.mideast
>>> pd.DataFrame(pandas_feature_list)
     a  about  abrams  absolutely    I  run   go
0   12      1       1           1  NaN  NaN  NaN
1  NaN    NaN     NaN         NaN    4   10    3
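Because the counts are ints now, arithmetic works on the feature dataframe. A quick sketch, with the two rows retyped by hand from the output above rather than read from the file. The names labels, features, counts, totals and summary are mine, and treating a missing word as a count of 0 is an assumption about what you want.

```python
import pandas as pd

# Rows retyped from the example output above, not read from the file.
labels = pd.DataFrame([
    {"label": "talk.politics.guns"},
    {"label": "talk.politics.mideast"},
])
features = pd.DataFrame([
    {"a": 12, "about": 1, "abrams": 1, "absolutely": 1},
    {"I": 4, "run": 10, "go": 3},
])

# Assumption: a word that is absent from a document has a count of 0.
counts = features.fillna(0).astype(int)

# Total word count per document; rows still line up with labels by index.
totals = counts.sum(axis=1)
summary = labels.assign(total_words=totals)
```

The labels never take part in the arithmetic, but they stay aligned with the feature rows through the shared index.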