
I'm trying to load my CSV file into Datalab, but the file is too large to load. Even if I managed to load it, the preprocessing would take too long.

I'm thinking of using Keras to do ML on this dataset. The questions are:

  • How do I use a data generator to feed Keras my raw data?
  • What about the data preprocessing: should I do it in Dataprep or Dataflow, or is it fine to do it in Datalab?
  • Is there any way to speed up the training process? Right now I have to leave the Datalab window open for a long time for training to finish, and I don't feel comfortable keeping the webpage open for that long.

Thanks!

Soroush Sotoudeh
Elona Mishmika

2 Answers


I suggest loading your data with the pandas library and extracting the underlying numpy arrays. Then you can feed whatever input and output data you want to your model.
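A minimal sketch of that approach (the file name and the "label" column are placeholders for your own data):

```python
import pandas as pd

# Load the CSV into a DataFrame (this assumes it fits in memory).
df = pd.read_csv("data.csv")

# Extract the underlying numpy arrays; "label" stands in for
# whatever your target column is actually called.
X = df.drop("label", axis=1).values
y = df["label"].values

# model.fit(X, y, ...) can then consume these arrays directly.
```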

If your CSV is too big to fit in memory, the alternative is to implement a Python generator that yields one batch of data at a time.
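Here is a sketch of such a generator, reading the CSV in chunks with pandas; the file name, batch size, and "label" column are assumptions, and `model` stands for your already-compiled Keras model:

```python
import pandas as pd

def csv_batch_generator(path, batch_size=128):
    """Yield (features, labels) batches forever, as Keras expects."""
    while True:  # loop indefinitely so Keras can run multiple epochs
        for chunk in pd.read_csv(path, chunksize=batch_size):
            X = chunk.drop("label", axis=1).values
            y = chunk["label"].values
            yield X, y

# steps_per_epoch should be roughly n_rows // batch_size.
model.fit_generator(csv_batch_generator("data.csv", batch_size=128),
                    steps_per_epoch=1000,
                    epochs=5)
```

Because the generator only ever holds one chunk in memory, the full file never has to be loaded at once.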

There are many variables that determine the duration of your training process, and unfortunately it's hard to tell what would work best for you. You could increase the learning rate of your optimizer, build a smaller model (fewer weights to train), feed it less data, or train for fewer epochs/steps.
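As one concrete example, the learning rate is set on the optimizer when the model is compiled; this is only a sketch, and 0.01 is just an illustrative value:

```python
from keras.optimizers import Adam

# A higher learning rate can shorten training, at the cost of
# potentially less stable convergence; tune it for your problem.
model.compile(optimizer=Adam(lr=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```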

rickyalbert

It might be possible to go further with a larger, higher-memory VM, but of course that too will have limits, just larger ones.

Ultimately, you'll likely hit (and may already have hit) a threshold where you'll want to consider this approach:

  1. Build a sample of your data to use during development. That is what you'll use in Datalab (see the sampling sketch after this list).

  2. Build a distributed training program that can run against the full dataset. I'd suggest looking at Cloud ML Engine for its support for distributed training, and associated samples.
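For step 1, one way to build the development sample without ever loading the full file is to sample each chunk as it streams through pandas (a sketch; the file names, chunk size, and sampling fraction are placeholders):

```python
import pandas as pd

# Keep ~1% of each chunk so the full dataset never has to fit in memory.
pieces = []
for chunk in pd.read_csv("full_dataset.csv", chunksize=100000):
    pieces.append(chunk.sample(frac=0.01, random_state=42))

pd.concat(pieces).to_csv("dev_sample.csv", index=False)
```

For step 2, once your trainer is packaged, you submit it with the `gcloud ml-engine jobs submit training` command; the Cloud ML Engine samples show the expected package layout.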

Nikhil Kothari
  • Do you mean using Datalab to decide which algorithm/hyperparameters/model to use, and then using Cloud ML Engine to train the model in a distributed fashion? – Elona Mishmika Feb 04 '18 at 07:41
  • Good suggestion, but I don't know how to use Cloud ML Engine to do preprocessing :( And if I'm just using sklearn in Datalab, I don't think I can move it to ML Engine, as ML Engine is only compatible with TF, right? – Elona Mishmika Mar 12 '18 at 05:29