
I want to use svmstruct for my named entity recognition task. Some of my per-token features are not numerical; they are mostly textual, such as n-character affixes or word shape. Since svmstruct's input format is the same as the SVMlight format, I would like to know how I should convert those textual features to numerical ones.

All the best

user1844172

1 Answer


Basically, you need to encode your text data as binary indicator features: one column per distinct (feature, value) pair, set to 1 when that value is present.

For example, let's say you have the data:

affix    shape
==============
ing      lower
         initcap
ed       allcaps

What you want to send to svmstruct is something like this:

affix_ing:1 shape_lower:1
shape_initcap:1
affix_ed:1 shape_allcaps:1

Now, you can't use words as column identifiers, because the format expects integer feature numbers. However, svmstruct uses a sparse format, so you can use widely separated column numbers as long as they are unique.
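As a rough illustration of "unique column numbers", here is a minimal sketch that simply hands out the next free integer the first time a name such as `affix_ing` is seen. The helper names `make_indexer` and `encode_token` are made up for this example, and only the feature part of each line is produced: the label (and any `qid` field your svmstruct variant expects) still has to be written in front of it. Ids start at 1 and are emitted in increasing order, which is what the SVMlight format asks for.

```python
def make_indexer():
    """Return (index_of, table); index_of gives unseen names the next free id."""
    table = {}

    def index_of(name):
        # Feature numbers start at 1, so the first name seen gets id 1.
        if name not in table:
            table[name] = len(table) + 1
        return table[name]

    return index_of, table


def encode_token(features, index_of):
    """features: dict of column name -> textual value ('' or None = missing)."""
    ids = {index_of(f"{col}_{val}") for col, val in features.items() if val}
    # Write the feature ids in increasing order, each with value 1.
    return " ".join(f"{i}:1" for i in sorted(ids))


index_of, table = make_indexer()
rows = [
    {"affix": "ing", "shape": "lower"},
    {"affix": "", "shape": "initcap"},
    {"affix": "ed", "shape": "allcaps"},
]
lines = [encode_token(row, index_of) for row in rows]
for line in lines:
    print(line)
```

The catch with this approach is that the table has to be saved and reused at test time, so that the same name always maps to the same number.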

Assigning those column numbers is a great application for a hash function: the technique is to make up column IDs on the fly and dummy-encode your discrete data.

hash(colName + colValue) => 1

Depending on your data, you might not need colName. Is a value in one column likely to collide with the same string appearing as a value in another column?

You can use a hash function like MurmurHash or CityHash to get a huge space with fast calculation and few collisions.
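To make the `hash(colName + colValue)` idea concrete, here is a small sketch. It is only an illustration under a few assumptions: it uses `hashlib.md5` from the Python standard library as a stand-in for MurmurHash or CityHash (which need third-party packages), the bucket count `NUM_BUCKETS` is an arbitrary choice, and `feature_id` and `encode_token` are hypothetical helper names. Python's built-in `hash()` is avoided because its result for strings can change between runs.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # assumed size of the hashed feature space


def feature_id(col_name, col_value, num_buckets=NUM_BUCKETS):
    """Map a (column, value) pair to a stable integer in [1, num_buckets]."""
    key = f"{col_name}={col_value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Use the first 8 bytes as an unsigned integer, then fold it into range.
    return int.from_bytes(digest[:8], "big") % num_buckets + 1


def encode_token(features, num_buckets=NUM_BUCKETS):
    """Build the feature part of one line: 'id:1 id:1 ...' in increasing id order."""
    ids = {feature_id(col, val, num_buckets) for col, val in features.items() if val}
    return " ".join(f"{i}:1" for i in sorted(ids))


line = encode_token({"affix": "ing", "shape": "lower"})
print(line)
```

Because the ids come straight from the hash, nothing needs to be stored between training and testing. The trade-off is that two different features can occasionally land in the same bucket, which a larger `NUM_BUCKETS` makes less likely.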

dwatson