I have data in SVMlight format (label feature1:value1 feature2:value2 ...), for example:

talk.politics.guns a:12 about:1 abrams:1 absolutely:1
talk.politics.mideast I:4 run:10 go:3

I tried sklearn.datasets.load_svmlight_file, but it doesn't seem to work with categorical string features and labels. I am trying to load the data into a pandas DataFrame. Any pointers would be appreciated.

Hackore
  • Your SVMLight file does not have the correct format. Read more about the format here: https://www.cs.cornell.edu/people/tj/svm_light/ – ambodi Dec 30 '22 at 16:22

2 Answers

I'd like to point out that the accepted answer from Christian Gomes will fail if one of your words is 'label', because you'll overwrite the classification label with the count of the word in the vector. Also, because the count wasn't converted to an int, you can't do any math.

Since you know for certain that each (feature, value) pair is separated by a :, you could get around this by making your 'label' key something like ':label'. It's not ideal, but it will avoid the collision.
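To make the collision concrete, here is a minimal sketch. The helper parse_line and its label_key parameter are made up for illustration and are not part of either answer; it stores the class label in the same dict as the counts, the way the accepted answer does.

```python
def parse_line(line, label_key="label"):
    """Parse one 'label word:count ...' line into a single dict.

    Hypothetical helper: the class label is stored under label_key,
    next to the word counts.
    """
    items = line.split()
    row = {label_key: items[0]}
    for pair in items[1:]:
        # Splitting on ':' means a parsed feature name never contains ':'
        name, count = pair.split(":")
        row[name] = int(count)
    return row


line = "talk.politics.guns label:3 about:1"

# Plain 'label' key: the count of the word 'label' overwrites the class label
clobbered = parse_line(line)

# ':label' key: no parsed feature name can equal it, so both values survive
safe = parse_line(line, label_key=":label")

print(clobbered["label"])  # 3
print(safe[":label"])      # talk.politics.guns
print(safe["label"])       # 3
```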

Alternatively, you could store the labels in a separate DataFrame. This is probably the better solution, since you are unlikely to want to do math on your classification labels.

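import os

import pandas as pd

# open() does not expand "~", so expand it explicitly
svmformat_file = os.path.expanduser("~/svmformat_file_sample")

# Read the file into a list of lines
with open(svmformat_file, mode="r") as fp:
    svmformat_list = fp.read().splitlines()

# Labels and features go into separate lists, so a word named
# 'label' cannot overwrite the classification label
pandas_label_list = []
pandas_feature_list = []
for line in svmformat_list:
    items = line.split()
    if not items:  # skip blank lines
        continue
    pandas_label_list.append({'label': items[0]})

    # For each line, save the word:count pairs to a dict
    feature_dict = {}
    for pair in items[1:]:
        feature_name, count = pair.split(':')
        feature_dict[feature_name] = int(count)  # int, so math works

    pandas_feature_list.append(feature_dict)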

Then, using the same data Christian used, you now have two dataframes:

>>> pd.DataFrame(pandas_label_list)
                   label
0     talk.politics.guns
1  talk.politics.mideast
>>> pd.DataFrame(pandas_feature_list)
      a  about  abrams  absolutely    I   run   go
0  12.0    1.0     1.0         1.0  NaN   NaN  NaN
1   NaN    NaN     NaN         NaN  4.0  10.0  3.0
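Note that words missing from a document show up as NaN, which forces pandas to store those columns as floats. If you want integer counts for arithmetic, here is a minimal sketch, assuming a missing word should count as zero. The names labels_df, features_df and total_words are just for illustration, and the two lists are written out by hand so the snippet runs on its own.

```python
import pandas as pd

# Same content the parsing loop produces for the two sample lines
pandas_label_list = [
    {"label": "talk.politics.guns"},
    {"label": "talk.politics.mideast"},
]
pandas_feature_list = [
    {"a": 12, "about": 1, "abrams": 1, "absolutely": 1},
    {"I": 4, "run": 10, "go": 3},
]

labels_df = pd.DataFrame(pandas_label_list)

# Assumption: a word that does not appear in a document has count 0
features_df = pd.DataFrame(pandas_feature_list).fillna(0).astype(int)

# The counts are now integers, so arithmetic works, e.g. words per document
total_words = features_df.sum(axis=1)
print(total_words.tolist())  # [15, 17]
```

The labels stay in labels_df and share the row index with features_df, so the two can be lined up again whenever needed.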
You can do it by hand. Here is one way to convert the file into a DataFrame:

import os

import pandas as pd

# open() does not expand "~", so expand it explicitly
svmformat_file = os.path.expanduser("~/svmformat_file_sample")

# Read the file into a list of lines
with open(svmformat_file, mode="r") as fp:
    svmformat_list = fp.readlines()

# For each line, save the key:value pairs to a dict
pandas_list = []
for line in svmformat_list:
    line_dict = dict()

    line_split = line.split(' ')
    line_dict["label"] = line_split[0]  # first token is the class label

    for col in line_split[1:]:
        col = col.rstrip()  # Remove '\n'
        col_split = col.split(':')
        key, value = col_split[0], col_split[1]
        line_dict[key] = value  # value is kept as a string

    pandas_list.append(line_dict)

The resulting DataFrame for your example file:

pd.DataFrame(pandas_list)

Christian Gomes