
I tried to search for an answer to my question; one probably exists, but I couldn't find it, probably because I don't know what to call the problem.

So let's get to the point.

I am working on a statistical model. My dataset contains static information about 45 public transport stations (45 rows and around 1,000 feature columns) and regularly spaced temporal measurements (2,000,000 rows but only 10 columns). So the "real" amount of information is small enough (around 500 MB) to be processed easily.

The problem is that most statistical libraries in Python require simple 2D arrays, like NumPy arrays. So I have to combine my data: to each of the 2,000,000 measurement rows I have to attach the 1,000+ feature columns of the station where the measurement took place, and I end up with a 17 GB table... but most of it is redundant, and it feels like a waste of resources.
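
To make the blow-up concrete, here is a toy-sized sketch (with pandas) of the naive join described above; the `station_id` column name and frame layout are hypothetical, not taken from the actual dataset:

```python
import numpy as np
import pandas as pd

# Toy-sized illustration of the naive join; names are hypothetical.
rng = np.random.default_rng(0)
stations = pd.DataFrame(rng.normal(size=(45, 1000)),
                        columns=[f"feat_{k}" for k in range(1000)])
stations["station_id"] = range(45)
measurements = pd.DataFrame(rng.normal(size=(10_000, 9)),
                            columns=[f"m_{k}" for k in range(9)])
measurements["station_id"] = rng.integers(0, 45, size=len(measurements))

# Every measurement row gets a full copy of its station's ~1000 features;
# at 2,000,000 rows this is what inflates ~500 MB of data to ~17 GB.
merged = measurements.merge(stations, on="station_id", how="left")
print(merged.shape)   # (10000, 1010)
```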

I have an idea, but no idea how to implement it: ultimately an array is just a function that returns a value for a given i and j, so is it possible to "emulate" one, i.e. come up with a fake array, a pseudo-array, an array interface, that the libraries would accept like a NumPy array? I don't see why it couldn't work, because ultimately an array IS a function (i, j) -> a[i][j]. Access could be a little slower than the memory access of a classic array, but that is my problem, and it would still be fast. I just don't know how to do it...
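
As a rough illustration of that idea, here is a minimal sketch of such a pseudo-array: an object that computes `a[i, j]` on demand from the two small tables instead of storing the merged values. All names are illustrative, and whether a given library accepts such an object depends on how it consumes its input (many call `np.asarray`, which would still materialize the full array):

```python
import numpy as np

class VirtualMerged:
    """Sketch of an array-like object that computes a[i, j] on demand from
    the two small tables instead of storing the merged values.  Purely
    illustrative; assumes column 0 of `measurements` is the station index
    (a hypothetical layout, not the real dataset's)."""

    def __init__(self, measurements, station_features):
        self.measurements = np.asarray(measurements)
        self.station_features = np.asarray(station_features)
        self._n_meas = self.measurements.shape[1] - 1
        self.shape = (self.measurements.shape[0],
                      self._n_meas + self.station_features.shape[1])
        self.dtype = self.station_features.dtype

    def __getitem__(self, idx):
        i, j = idx                       # emulate a[i, j]
        if j < self._n_meas:
            return self.measurements[i, j + 1]
        station = int(self.measurements[i, 0])
        return self.station_features[station, j - self._n_meas]

# Tiny usage example
m = np.array([[0, 1.5], [1, 2.5]])       # station id + one measurement column
f = np.array([[10., 20.], [30., 40.]])   # two static features per station
a = VirtualMerged(m, f)
print(a.shape, a[0, 2])                  # (2, 3) 20.0 -- computed, not stored
```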

Thank you in advance, and tell me if I need to provide more information or change the way I ask my question!

EDIT:

OK, maybe I can clarify my problem a little bit:

My data can be represented in a relational-database or object-database fashion. It is pretty small (500 MB) in that format, but when I merge my tables so that scikit-learn can process them, it becomes way too big (17 GB+), and that seems a waste of memory! I could use big-data techniques, which may be a solution, but isn't it possible to avoid that? Can I use my data directly in scikit-learn without explicitly merging the tables? Can I do an implicit merge that emulates the data structure without taking additional memory?
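
One pragmatic workaround, if the estimator supports incremental learning, is to do the merge lazily in batches rather than all at once, so that only one block is ever materialized. This is just a sketch under the assumption that column 0 of the measurement array holds the station index:

```python
import numpy as np

def merged_batches(measurements, station_features, batch_size=100_000):
    """Yield dense blocks joining each slice of measurements with the static
    features of its station, so only one block is ever held in memory.
    Assumes column 0 of `measurements` is the station index (hypothetical)."""
    for start in range(0, len(measurements), batch_size):
        block = measurements[start:start + batch_size]
        ids = block[:, 0].astype(int)
        # Fancy indexing copies the 45 station rows only for this block.
        yield np.hstack([block[:, 1:], station_features[ids]])

# Toy usage; with scikit-learn, an estimator exposing partial_fit
# (e.g. sklearn.linear_model.SGDRegressor) could be trained block by block
# instead of on a single 17 GB array.
rng = np.random.default_rng(0)
station_features = rng.normal(size=(45, 1000))
measurements = np.column_stack([rng.integers(0, 45, size=10_000),
                                rng.normal(size=(10_000, 9))])
for block in merged_batches(measurements, station_features, batch_size=2_000):
    print(block.shape)   # (2000, 1009): 9 measurement columns + 1000 features
```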

  • I think there is no unified answer, because it depends on what you want to do with the data. I can recommend a book, ["Mining of Massive Datasets"](http://infolab.stanford.edu/~ullman/mmds/book.pdf), which contains a lot of techniques for mining very large amounts of data. Hopefully you can find what you need there. – Vadim Shkaberda Jul 04 '16 at 17:47
  • Thank you for the answer. But I do not need to mine massive amounts of data; I am trying to prevent the data from becoming massive, because the real amount of information is actually small and becomes big only when combined naively into an array. A similar problem would be if I had a gigantic but extremely sparse matrix, so its memory footprint is light, but a function needed a dense 2D array, which would dramatically increase the memory footprint of my data for nothing... – Janer Jul 04 '16 at 18:19
  • There is a `scipy.sparse` class that has been used with `scikit-learn`. Search for `[scikit-learn] sparse` to find questions like http://stackoverflow.com/questions/37953877/create-sparse-matrix-in-python and http://stackoverflow.com/questions/37866390/add-categorical-variablegender-to-sparse-matrix-for-multiclass-classification – hpaulj Jul 04 '16 at 21:22
  • That is only for sparse matrices; I brought up the sparse-matrix case to make a point and give an example of the problem of using a lot of space for data that could be stored compactly. My problem is more general. I know how, given i and j, to obtain the desired value from a memory-efficient container. It behaves like an array but is not one. – Janer Jul 04 '16 at 22:37
  • You will have to study the `sklearn` code to see what functionality it expects. Indexing with `x[i,j]` is just a start. – hpaulj Jul 04 '16 at 22:50
  • I can do this, but technically how do I make it accept my object? I have an idea: could creating a class that inherits from numpy's array do the trick? – Janer Jul 04 '16 at 23:38
  • `sparse` does not try to subclass `ndarray`. But it does use arrays to store values. One format subclasses `dict`. Look at that code. – hpaulj Jul 05 '16 at 12:11
  • I got a better understanding of my problem and edited the original post. – Janer Jul 08 '16 at 18:05
