Python tools for out-of-core computation/data mining

Question

I am interested in using Python to mine data sets that are too big to fit in RAM but fit on a single hard drive.

I understand that I can export the data as HDF5 files using PyTables, and that numexpr allows some basic out-of-core computation.

What would come next? Mini-batching when possible, and relying on linear algebra results to decompose the computation when mini-batching cannot be used?

Or are there some higher level tools I have missed?

Thanks for any insights.

score 4 · Answer 1 · edited May 23 '17 at 12:08

What exactly do you want to do? Can you give an example or two, please?

numpy.memmap is easy:

Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. Numpy's memmap's are array-like objects ...

See also numpy+memmap on SO.
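As a rough illustration of the memory-mapping idea quoted above, here is a minimal sketch. The file name, array shape, and chunk size are hypothetical and chosen only for the example; the array here is small, but the same chunked loop applies when the file is larger than RAM.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location and shape, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "data.dat")
n_rows, n_cols = 10_000, 8
chunk = 1_000

# Create the on-disk array and fill it one chunk at a time.
data = np.memmap(path, dtype="float64", mode="w+", shape=(n_rows, n_cols))
for start in range(0, n_rows, chunk):
    stop = min(start + chunk, n_rows)
    # Row i holds the value i in every column.
    data[start:stop] = np.arange(start, stop, dtype="float64")[:, None]
data.flush()
del data

# Reopen read-only and compute column means chunk by chunk,
# so only one chunk needs to be resident in memory at a time.
readonly = np.memmap(path, dtype="float64", mode="r", shape=(n_rows, n_cols))
total = np.zeros(n_cols)
for start in range(0, n_rows, chunk):
    total += readonly[start:start + chunk].sum(axis=0)
col_means = total / n_rows
```

Note that the shape and dtype are not stored in the raw file, so they have to be supplied again when reopening it.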

The scikit-learn people are very knowledgeable, but prefer specific questions.

answered Jan 30 '13 at 14:36

denis

Thanks for the answer, Denis. It appears scikit-learn has mini-batching facilities. Actually, I am looking for the most rational way to deal with out-of-core learning at a sub-map-reduce scale. In particular, I am trying to understand the relative strengths of HDF5, SQL, and NoSQL. – user17375 Jan 31 '13 at 15:08
Zelazny7's large-data-work-flows question is better because it is concrete, and it gets better answers – denis Mar 18 '13 at 11:20

score 3 · Answer 2 · edited May 23 '17 at 12:15

I have a similar need to work on sub map-reduce sized datasets. I posed this question on SO when I started to investigate python pandas as a serious alternative to SAS: "Large data" work flows using pandas

The answer presented there suggests using the HDF5 interface from pandas to store pandas data structures directly on disk. Once stored, you could access the data in batches and train a model incrementally. For example, scikit-learn has several classes that can be trained on incremental pieces of a dataset. One such example is found here:

http://scikit-learn.org/0.13/modules/generated/sklearn.linear_model.SGDClassifier.html

Any class that implements the partial_fit method can be trained incrementally. I am still trying to get a viable workflow for these kinds of problems and would be interested in discussing possible solutions.
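A minimal sketch of incremental training with partial_fit follows. The `iter_batches` generator is a hypothetical stand-in for reading chunks from disk (for example from an HDF5 store); it produces synthetic data so the example stays self-contained.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def iter_batches(n_batches=20, batch_size=500, n_features=5):
    """Hypothetical batch source; a stand-in for reading chunks from disk."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        # Synthetic, linearly separable labels for illustration.
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
# partial_fit must be told the full set of classes on the first call,
# because a single batch may not contain all of them.
classes = np.array([0, 1])

for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a fresh batch drawn from the same synthetic source.
X_test, y_test = next(iter_batches(n_batches=1, batch_size=2000))
accuracy = clf.score(X_test, y_test)
```

Only one batch is held in memory at a time, so the total data size is limited by disk rather than by RAM.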

score 3 · Accepted Answer · answered Jul 31 '13 at 08:51

In sklearn 0.14 (to be released in the coming days) there is a full-fledged example of out-of-core classification of text documents.

I think it could be a great example to start with:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

In the next release we'll extend this example with more classifiers and add documentation in the user guide.

NB: you can reproduce this example with 0.13 too, all the building blocks were already there.
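For readers who want the general shape of such a pipeline before opening the link, here is a minimal sketch of one common pattern: a stateless HashingVectorizer combined with a classifier that supports partial_fit. This is not the linked example itself; the toy documents, labels, and the `iter_document_batches` helper are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_document_batches():
    """Hypothetical stream of (texts, labels); real data would be read from disk."""
    yield ["cheap pills buy now", "meeting at noon tomorrow"], [1, 0]
    yield ["buy cheap watches now", "lunch meeting moved to friday"], [1, 0]
    yield ["win money now", "notes from the project meeting"], [1, 0]


# The hashing vectorizer keeps no vocabulary, so it never needs to see
# the whole corpus and requires no fit step.
vectorizer = HashingVectorizer(n_features=2**16)
clf = SGDClassifier(random_state=0)

for texts, labels in iter_document_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=[0, 1])

prediction = clf.predict(vectorizer.transform(["buy cheap pills"]))
```

The design choice here is that both stages are incremental: the vectorizer is stateless and the classifier updates per batch, so neither needs the full corpus in memory.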
