
I'm using featuretools Deep Feature Synthesis to build features for a dataset of 40k rows and 200 columns. I chose about 40 transformation primitives, as you can see in the code below:

feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="df", n_jobs=6,
                                      trans_primitives=primitives.name.to_list(),
                                      verbose=True)

but when I run my code, it takes a long time to discover the features to build, and this process doesn't run on multiple cores of my CPU; not even a single core reaches 100% usage. In other words, I'm waiting hours for a process that is using only minimal resources on my machine (memory is not a problem either).

After featuretools discovers the features (and prints a log line like "Built n features"), it creates the cluster and uses all the cores specified in the "n_jobs" parameter at 100% capacity. This second phase is really fast, just a few seconds, since all my resources are being used.

My question is: why is this happening? Is it possible to discover the features faster to reduce this time? I just don't understand how a process that barely uses any resources can take so long.

  • I'm not familiar with the specific kind of problem you're trying to solve more quickly. I'll just say that if a single library utilizes multiple cores for one type of processing but only a single core for another, this is likely due to the fact that one type of processing lends itself to splitting up the task across multiple cores while the other does not. Utilizing multiple cores when the task at hand doesn't naturally break down into isolatable subtasks can be very difficult, and that difficulty can in some cases cancel out the benefit of utilizing the multiple cores. – CryptoFool Apr 09 '21 at 20:25
  • Which version of featuretools are you using? How many features are being created? – Roy Wedge Apr 09 '21 at 20:41
  • Yes, I understand that it's not possible to process everything on multiple cores, but not even the single core being used is at 100% capacity. To me this is really strange, since this phase takes far longer than the calculation itself. As I don't see anyone else talking about this, I wonder if this problem happens only to me! – Alvaro Leandro Cavalcante Apr 09 '21 at 20:42
  • featuretools-0.23.3 is my version – Alvaro Leandro Cavalcante Apr 09 '21 at 20:43
  • If your dataset only has a single entity, try using `max_depth=1` as a parameter to dfs and compare the features created. – Roy Wedge Apr 12 '21 at 15:52
  • Thank you, using max_depth=1 in this context solved my problem; now it's running faster and building the features in minutes! – Alvaro Leandro Cavalcante Apr 19 '21 at 18:41

0 Answers