
Are the feature vectors generated by Featuretools/DFS dense or sparse, or does it depend on something?

Henry Thornton

1 Answer


The sparseness of the feature vectors generated by Featuretools will in general depend on

  1. the EntitySet in question and
  2. the primitives chosen.

Primitives are meant to give back dense information. While it's possible (though not useful) to construct example EntitySets that make the output of a primitive sparse, it's more common for a primitive to give back no information than sparse information.

However, certain primitives and workflows are more likely to produce sparse output than others. A big one to watch for is feature encoding, which uses one-hot encoding. Because that generates a column containing a 1 only where a particular value occurs, an infrequently occurring categorical value is immediately converted into a sparse vector. Using Where aggregation primitives can sometimes have similar results.
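As a rough illustration (not part of the original answer), the sketch below runs DFS on the Featuretools demo dataset and then one-hot encodes the categorical features with `encode_features`. The dataset, the primitive choices, and the 0.x-era API names (`target_entity`, `encode_features`) are assumptions made for the example, not something prescribed by the answer.

```python
import featuretools as ft

# Small demo EntitySet that ships with Featuretools.
es = ft.demo.load_mock_customer(return_entityset=True)

# Run DFS; categorical results (e.g. the mode of a session's device) come
# back as a single column per feature.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["mode", "count"],
    trans_primitives=[],
)

# encode_features one-hot encodes the categorical features: one column per
# observed value, so an infrequently occurring value yields a column that is
# almost entirely zeros.
fm_encoded, f_encoded = ft.encode_features(feature_matrix, feature_defs)

print(feature_matrix.shape, "->", fm_encoded.shape)        # column count grows
print((fm_encoded == 0).mean().sort_values(ascending=False))  # share of zeros per column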

Seth Rothschild
  • Thank you. That is useful to know. Maybe this should be a separate question, but what is the typical (or average) dimension of a feature vector? I realize it is a "how long is a piece of string" question. – Henry Thornton Mar 10 '18 at 08:47
  • It's probably fine in the same question, because it's a similar answer: it's going to be dependent on the dataset, which primitives you use and the _number_ of primitives you use. You'll get dramatically different answers for dimensionality depending on the input. – Seth Rothschild Mar 10 '18 at 16:15
  • Just trying to get a sense. Typically, folks create word-embedding dense vectors of size 300. Deep-learning is typically 1024 or 2048, audio 1024. How about a typical range for DFS? – Henry Thornton Mar 11 '18 at 09:13
  • Hi Henry- I'm one of the core developers of Featuretools. We like to target feature vectors around 100 dimensions, which typically work well with ensemble learning methods like Random Forest. For the typical dataset we see, we are able to achieve about that many features with pretty stock parameter settings – bschreck Mar 22 '18 at 00:02
  • However, some datasets are much more complex. It is not uncommon for Featuretools to generate thousands of features, especially after encoding categorical features. It's much harder to train Random Forests with that many features, so we'll try to apply some form of feature selection to bring it down to around 100 (just using the Random Forest's built-in `feature_importances_` attribute works nicely; a rough sketch follows these comments) – bschreck Mar 22 '18 at 00:04
  • @bschreck Hello - good to know you. Apologies for the tardy response, but for some reason SO isn't emailing me the latest responses. If thousands of features are reduced to around 100, then it must affect the accuracy of predictions? – Henry Thornton Apr 04 '18 at 15:02
  • Generally the reduced feature set will produce better accuracy. Unless you have massive amounts of data and a large number of estimators, the Random Forest won't be able to distinguish well between 1000+ features – bschreck Apr 05 '18 at 19:33
  • The goal with the reduced feature set is to pick the most predictive ones from the original set – bschreck Apr 05 '18 at 19:34
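A rough sketch of the feature-selection step described in the comments above, under assumed names: `feature_matrix` is the (encoded) DFS output and `labels` is the prediction target. It simply ranks columns by a Random Forest's `feature_importances_` and keeps the top ~100; this is an illustration, not the exact workflow the developers use.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(feature_matrix, labels, n_features=100):
    """Keep the n_features columns a Random Forest ranks as most important."""
    X = feature_matrix.fillna(0)  # scikit-learn forests can't handle NaNs
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X, labels)
    ranked = np.argsort(rf.feature_importances_)[::-1]  # most important first
    keep = X.columns[ranked[:n_features]]
    return feature_matrix[keep]
```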