
I have an svmlight-formatted file with values of the form:

92.91 18256731:1 71729421:1 72329637:1 83328561:1 118265976:1 134892759:1 198163358:1 352348616:1 526943048:1
5.30 102156934:1 134892759:1 198163358:1 254112843:1 262373758:1 512748316:1 526943048:1
22.00 32172600:1 72329637:1 118265976:1 134892759:1 198163358:1 411824213:1 443226486:1 445371412:1 526943048:1

I am trying to import this into h2o using h2o.import_file("fname.svmlight").

Does h2o support high-dimensional sparse binary features?

Do I need to convert the hashed values into some kind of index for this to work?
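For reference, a minimal sketch of the import attempt with the h2o Python client; "fname.svmlight" is a placeholder for the actual file:

    import h2o

    # Start, or connect to, a local H2O cluster.
    h2o.init()

    # Attempt to import the svmlight-formatted file.
    # "fname.svmlight" is a placeholder path.
    frame = h2o.import_file("fname.svmlight")

    # Inspect the parsed dimensions (rows, columns).
    print(frame.dim)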

user90772

1 Answer


Your three lines of svmlight are like a virus! According to top, the java process is as close to 800% CPU (8-core machine) as it can get. After 45 minutes of CPU effort (5-6 minutes of wall clock) I had to use kill -9 on it to get my machine back.

Even if your type of file is not officially supported, I think the fact that it brings down a machine makes it a serious bug, so I've reported it here: https://0xdata.atlassian.net/browse/PUBDEV-4798

BTW, you can find a unit test showing use of svmlight here: https://github.com/h2oai/h2o-3/blob/30f382efac687be3959a253d975cb48c341c92b4/h2o-r/tests/testdir_misc/runit_parser_type.R

Darren Cook
  • Thank you for reporting it. I think this is the point of sparse arrays: to save memory by using only index:value combinations. This is valid svmlight format; you can try to parse it with scikit-learn, for example. It should not matter whether the index is 1, 100 or 1,000,000. Thanks for the prompt reply again! – user90772 Aug 10 '17 at 16:42
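As a sanity check of the format itself (per the comment above), a minimal sketch of parsing the same data with scikit-learn; "fname.svmlight" is again a placeholder path:

    from sklearn.datasets import load_svmlight_file

    # Parse the svmlight file: X is a sparse CSR matrix of features,
    # y is the array of labels (92.91, 5.30, 22.00 in the example above).
    # "fname.svmlight" is a placeholder path.
    X, y = load_svmlight_file("fname.svmlight")

    print(X.shape)  # (number of rows, inferred number of feature columns)
    print(y)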