
Let's say we have a long one-dimensional array with millions of elements, like this:

[0,1,1,1,1,2,1,1,1,1,1,1,1,...,1,2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,3,4,1,1,1,1]

If there were just one repeating element, we could use a sparse array, but since the values can be any integers (or a set of nominal elements), that does not do the trick, I suppose (or am I wrong there?).

I read that PyTables can compress data like this on the fly. It is based on HDF5 files and, as far as I can see, seems to be the go-to option for Python.

Does anyone have experience with it and can say whether that is an adequate route, or are there other approaches that are even more efficient in terms of CPU as well as memory usage (trading the fewest CPU cycles for the largest reduction in memory size)?
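
For reference, this is roughly what I imagine the PyTables route would look like (an untested sketch; the file name, the blosc compressor and the compression level are just assumptions on my part):

import numpy as np
import tables  # PyTables

data = np.array([0, 1, 1, 1, 1, 2] * 1_000_000, dtype=np.int64)  # dummy repetitive data

# compress transparently on write and read
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('compressed.h5', mode='w') as f:
    carray = f.create_carray(f.root, 'data',
                             atom=tables.Int64Atom(),
                             shape=data.shape,
                             filters=filters)
    carray[:] = data

# reading a slice only decompresses the chunks that are touched
with tables.open_file('compressed.h5', mode='r') as f:
    print(f.root.data[5])      # -> 2
    print(f.root.data[10:20])  # a small slice, read back on demand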

meow
    The lazy solution would be to save its string representation to a text file. If that's too big, you could zip that file. – Alex von Brandenfels Nov 17 '17 at 17:45
  • The element wise difference of your array ([numpy.diff](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.diff.html)) would have sparse properties, wouldn't it? But this would clearly trade RAM for CPU usage... – koffein Nov 17 '17 at 18:31
  • @koffein that's interesting, I'll have a look at that. – meow Nov 17 '17 at 19:56

2 Answers


I would try to use the repetitive nature of your data. Let's say you have the data stored in a long sequence, for example a list:

long_sequence = [0,1,1,1,1,2,1,1,1,1,1,1,1,3,1,2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,3,4,1,1,1,1]

Now I store only the positions where the value changes, as a list of (index, value) tuples:

compressed_list = []
last_element = None
for i, element in enumerate(long_sequence):
    if element == last_element:
        continue
    compressed_list.append((i, element))
    last_element = element

# compressed_list: 
[(0, 0),
 (1, 1),
 (5, 2),
 (6, 1),
 (13, 3),
 (14, 1),
 (15, 2),
 (22, 4),
 (31, 3),
 (32, 4),
 (33, 1)]

Now, this could solve the storage problem, but accessing the data might still be computationally expensive (in pure Python):

def element_from_compressed_list_by_index(lst, idx):
    # Walk the change points pairwise; idx belongs to the run that starts at i
    # and ends just before the next change point.
    for (i, element), (next_i, _) in zip(lst, lst[1:]):
        if i <= idx < next_i:
            return element
    # The last run has no following change point, so handle it separately
    # (indexes past the end of the original sequence cannot be detected here).
    if lst and lst[-1][0] <= idx:
        return lst[-1][1]
    raise KeyError("No element found at index {}".format(idx))

element_from_compressed_list_by_index(compressed_list, 3)
# Out: 1
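
If the lookups become a bottleneck, the standard library's bisect module can locate the right run in O(log n) instead of scanning the whole list; a minimal sketch on top of the same compressed_list (the helper name element_by_index is just for illustration):

import bisect

starts = [i for i, _ in compressed_list]   # change-point indices (already sorted)
values = [v for _, v in compressed_list]

def element_by_index(idx):
    # rightmost change point whose start index is <= idx
    pos = bisect.bisect_right(starts, idx) - 1
    if pos < 0:
        raise KeyError("No element found at index {}".format(idx))
    return values[pos]

element_by_index(3)
# Out: 1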

A better way to store and read the data could be an SQLite database.

import sqlite3 as sq
con = sq.connect(':memory:') # would be a file path in your case

# create a database table to store the compressed list
con.execute("CREATE TABLE comp_store (start_index int, element int);")
# add the data
con.executemany("INSERT INTO comp_store (start_index, element) VALUES (?,?)", compressed_list)

To fetch one element from the database by its index (7 in the example below), you could use the following query.

con.execute('SELECT element FROM comp_store WHERE start_index <= ? ORDER BY start_index DESC LIMIT 1', 
           (7,))
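
For completeness, the value itself can then be read with fetchone(), for example:

row = con.execute(
    'SELECT element FROM comp_store '
    'WHERE start_index <= ? ORDER BY start_index DESC LIMIT 1',
    (7,)
).fetchone()
row[0]
# Out: 1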

I think PyTables can still be the right answer, and as far as I know, the HDF5 format is broadly used by open-source and proprietary products. But if you want to stick to the Python standard library for some reason, this could also be a good way to go.

Hint: The functions zip and enumerate are more efficient (in fact: lazy) in Python 3 than in Python 2...

koffein

Quoting this wonderful answer, which helped me a few years ago:

Python-standard shelve module provides dict-like interface for persistent objects. It works with many database backends and is not limited by RAM. The advantage of using shelve over direct work with databases is that most of your existing code remains as it was. This comes at the cost of speed (compared to in-RAM dicts) and at the cost of flexibility (compared to working directly with databases).

If I understand your question correctly (storing is the concern, not reading), shelve is absolutely the way to go.
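
A minimal sketch of that (the file name 'data.shelf' is just a placeholder):

import shelve

# store the sequence on disk under a key, with a dict-like interface
with shelve.open('data.shelf') as db:
    db['long_sequence'] = long_sequence

# later (possibly in another process), read it back
with shelve.open('data.shelf') as db:
    long_sequence = db['long_sequence']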

I will be monitoring the thread for any other creative answers on the topic. :-)

Sipty
  • Thanks for your answer, I know shelve, it is really nice. The thing is that for really large objects, such as numerous arrays with millions and millions of entries (we are talking about >30 GB in size), it tends to be quite slow and won't save memory. The main thing I'm trying to achieve is leveraging the structure of the data (i.e. the many repetitive elements) to compress it (think of sparse data types or standard compression approaches like a Burrows-Wheeler transform followed by a compression algorithm); a rough sketch of what I mean is below. – meow Nov 17 '17 at 23:28
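
A rough sketch of that last idea, using the standard bz2 module (bzip2 is itself Burrows-Wheeler based); the array size and dtype are only an example:

import bz2
import numpy as np

arr = np.ones(10_000_000, dtype=np.int8)   # highly repetitive dummy data
raw = arr.tobytes()
packed = bz2.compress(raw)                 # BWT-based compression from the stdlib
print(len(raw), len(packed))               # the repetitive data shrinks dramatically

restored = np.frombuffer(bz2.decompress(packed), dtype=np.int8)
assert np.array_equal(arr, restored)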