4

Is it possible/good to add numerical features in crf models? e.g. position in the sequence.

I'm using CRFsuite. It seems all the features will be converted to string, e.g. 'pos=0', 'pos=1', which then lose it's meaning as euclidean distance.

Or should I use them to train another model, e.g. svm, then ensemble with crf models?

Lishu
  • 1,438
  • 1
  • 13
  • 14
  • It can be done with format like `LABEL f1:0.1 f2:0.4 f3:0.8 f4:0.2 f5:0.9`. see https://datascience.stackexchange.com/a/4886/94403 – Hai Feng Kao May 08 '20 at 07:11

3 Answers3

9

I figured out that CRFsuite does handle numerical features, at least according to this documentation:

  • {“string_key”: float_weight, ...} dict where keys are observed features and values are their weights;
  • {“string_key”: bool, ...} dict; True is converted to 1.0 weight, False - to 0.0;
  • {“string_key”: “string_value”, ...} dict; that’s the same as {“string_key=string_value”: 1.0, ...}
  • [“string_key1”, “string_key2”, ...] list; that’s the same as {“string_key1”: 1.0, “string_key2”: 1.0, ...}
  • {“string_prefix”: {...}} dicts: nested dict is processed and “string_prefix” s prepended to each key.
  • {“string_prefix”: [...]} dicts: nested list is processed and “string_prefix” s prepended to each key.
  • {“string_prefix”: set([...])} dicts: nested list is processed and “string_prefix” s prepended to each key.

As long as:

  1. I keep the input properly formatted;
  2. I use float vs string of float;
  3. I normalize it.
Community
  • 1
  • 1
Lishu
  • 1,438
  • 1
  • 13
  • 14
4

CRF itself can use numerical features, and you should use them, but if your implementations converts them to strings (encodes in the binary form by the "one hot spot encoding") then it might be of reduced significance. I suggest to look for more "pure" CRF which allows continuous variables.

A fun fact is that CRF in its core is just structured MaxEnt (LogisticRegression) which works in continuous domain, this string encoding is actually a way to go from categorical values into continuous domain so your problem is actually a result of "overdesigning" of CRFSuite which forgot about actual capabilities of CRF model.

lejlot
  • 64,777
  • 8
  • 131
  • 164
  • Got you. The reason I go with CRFsuite is that it comes with a nice [python wrapper](http://python-crfsuite.readthedocs.org/en/latest/) which is easy to use. Will it help to use those numerical features in another model and then ensemble with crf? – Lishu Oct 02 '14 at 16:25
  • It does not seem right, CRF is a sequence classifier. Ensembling it with non-sequential model is rather werid. It would be much more profitable to look for a way to actually include the numerical features inside CRF as -as said before- CRF is fully capable of such actions – lejlot Oct 02 '14 at 23:29
0

Just to clarify a bit the answer by Lishu (which is correct but might confuse other readers as it did to me until I tried it). This:

{“string_key”: float_weight, ...} dict where keys are observed features and values are their weights

could have been written as

{“feature_template_name”: feature_value, ...} dict where keys are feature names and values are their values

i.e. with this you're not setting the weight for the CRF corresponding to this feature_template, but the value of this feature. I prefer to refer to them feature templates that have feature values, so that everything is more clear than just "features". Then, the CRF will learn a weight associated to each of the possible feature_values for this feature_template