How do I represent a set/list of items in the input data (data frame) for H2O?

I'm using Sparkling Water 1.6.5 with H2O Flow. My input data (the columns in the CSV file) looks like this:

age: numeric
gender: enum
hobbies: ?
sports: ?

hobbies and sports are lists/sets with a limited number of possible entries (~20 each). H2O does not seem to have a suitable data type for this. How do I export these into a CSV file that can be processed by H2O Flow?

Markus Kramer
  • No idea about h2o, but machine learning has a concept called `one hot encoding`. You can simply make every possible entry in your hobby and sports list a "csv column" itself that is binary like your gender attribute. – Thomas Jungblut Jun 25 '16 at 10:08
  • Sounds like a valid option, thanks. However, I hope there is an easier / more maintainable way than doing this manually. – Markus Kramer Jun 27 '16 at 20:20

1 Answer

If you were just recording their main hobby, or main sport, then it would be a single enum column, e.g. hobbies, with 20 levels. You would simply write it as a string field in your CSV file, and H2O would read it.

But I think what you are after is the case where each person has zero or more choices from 20 hobbies? In that case you need 20 columns in your CSV file, one per hobby; each will be a 2-value enum. It doesn't matter what the two values are: Y/N, T/F, Y/blank, hobby-name/blank, etc. Your CSV file might look like this:

name,gender,football?,running?,data mining?,sleeping?
Tom,M,Y,,,Y
Dick,M,,,Y,
Suzy,F,,Y,Y,

Tom likes football and sleeping, Dick lives for data mining and nothing else, and Suzy is into running and data mining.
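
If your source data has the hobbies in a single delimited field, a small script can generate these columns for you instead of doing it by hand. The following is only a minimal sketch using Python's standard library, not anything H2O-specific: the input layout, the `;` separator, and the helper name `expand_multi_valued` are assumptions made up for illustration.

```python
import csv
import io

def expand_multi_valued(rows, column, sep=";"):
    """Replace one multi-valued column with one Y/blank column per distinct value.

    Hypothetical helper: `rows` is a list of dicts, `column` holds values
    joined by `sep` (an assumed delimiter, not something H2O requires).
    """
    # Collect every distinct value so each one becomes its own column.
    levels = sorted(
        {v.strip() for r in rows for v in r[column].split(sep) if v.strip()}
    )
    out = []
    for r in rows:
        chosen = {v.strip() for v in r[column].split(sep) if v.strip()}
        # Copy all other columns unchanged, then add one indicator per level.
        new_row = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new_row[level] = "Y" if level in chosen else ""
        out.append(new_row)
    return out, levels

# Assumed raw input: hobbies packed into one field, separated by ';'.
raw = """name,gender,hobbies
Tom,M,football;sleeping
Dick,M,data mining
Suzy,F,running;data mining
"""

rows = list(csv.DictReader(io.StringIO(raw)))
expanded, levels = expand_multi_valued(rows, "hobbies")

# Write the widened table back out as CSV text.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["name", "gender"] + levels, lineterminator="\n"
)
writer.writeheader()
writer.writerows(expanded)
csv_text = buf.getvalue()
print(csv_text)
```

The same function could be applied a second time for the sports column. Columns come out in alphabetical order here, which differs from the hand-written example above but should make no difference to the model.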

By the way, if you are using deep learning, then both layouts end up with the same input layer size: a single 20-level enum input will be converted into 20 binary input nodes.
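
To make that concrete, here is a rough illustration in plain Python of why both layouts give the network the same number of inputs. It is not H2O's actual implementation, and the four-hobby level list and the function names are hypothetical, shortened from 20 levels for readability.

```python
# Hypothetical level list; a real data set would have ~20 entries.
LEVELS = ["football", "running", "data mining", "sleeping"]

def one_hot(value, levels=LEVELS):
    """Single enum value -> exactly one 1 among len(levels) binary inputs."""
    return [1 if value == level else 0 for level in levels]

def multi_hot(values, levels=LEVELS):
    """Set of values -> zero or more 1s among len(levels) binary inputs."""
    chosen = set(values)
    return [1 if level in chosen else 0 for level in levels]

main_hobby = one_hot("running")                  # single-enum layout
all_hobbies = multi_hot(["football", "sleeping"])  # one-column-per-hobby layout

print(main_hobby)
print(all_hobbies)
```

Both vectors have one position per hobby; the only difference is that the multi-valued version may contain several 1s, or none.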

Darren Cook
  • Thx. How do I write this into my CSV? I tried to separate the hobbies with commas (e.g. "singing,painting") but that didn't work. I don't have to use a CSV file if there is a better format. – Markus Kramer Jun 27 '16 at 20:16
  • Sorry, @MarkusKramer, I'd missed the point of your question. Just updated my answer. – Darren Cook Jun 28 '16 at 07:25
  • Thanks for the explanation. So the "one hot encoding" method suggested by Thomas is the answer for H2O as well. – Markus Kramer Jun 28 '16 at 18:53