I would try to exploit the repetitive nature of your data. Let's say you have the data stored in a long sequence, for example a list:
long_sequence = [0,1,1,1,1,2,1,1,1,1,1,1,1,3,1,2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,3,4,1,1,1,1]
Now I store the changes between two consecutive elements in a list of tuples:
compressed_list = []
last_element = None
for i, element in enumerate(long_sequence):
    # Only record positions where the value changes.
    if element == last_element:
        continue
    compressed_list.append((i, element))
    last_element = element
# compressed_list:
[(0, 0),
(1, 1),
(5, 2),
(6, 1),
(13, 3),
(14, 1),
(15, 2),
(22, 4),
(31, 3),
(32, 4),
(33, 1)]
Now, this could solve the storage problem, but accessing the data might still be computationally expensive (using pure Python):
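def element_from_compressed_list_by_index(lst, idx):
    # Linear scan over consecutive pairs of run starts.
    for (i, element), (next_i, _) in zip(lst, lst[1:]):
        if i <= idx < next_i:
            return element
    # The last run has no successor, so it is handled separately. The
    # compressed list does not store the total length, so any idx at or
    # past the last start index maps to the last element.
    if lst and idx >= lst[-1][0]:
        return lst[-1][1]
    raise KeyError("No element found at index {}".format(idx))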
element_from_compressed_list_by_index(compressed_list, 3)
# Out: 1
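If the linear scan turns out to be the bottleneck, a binary search over the start indices is one way to speed up lookups while staying in the standard library. This is only a sketch: the names lookup_by_bisect and start_indices are made up for illustration, and because the compressed list does not store the total length, an index past the end of the original sequence simply returns the last element.

```python
from bisect import bisect_right

compressed_list = [(0, 0), (1, 1), (5, 2), (6, 1), (13, 3), (14, 1),
                   (15, 2), (22, 4), (31, 3), (32, 4), (33, 1)]

# Precompute the (already sorted) start indices once, so that each
# lookup only needs a binary search instead of a full scan.
start_indices = [start for start, _ in compressed_list]


def lookup_by_bisect(lst, starts, idx):
    """Return the element of the run that contains idx (hypothetical helper)."""
    # bisect_right counts how many runs start at or before idx.
    pos = bisect_right(starts, idx)
    if pos == 0:
        # idx lies before the first run (for example a negative index).
        raise KeyError("No element found at index {}".format(idx))
    # Assumption: indices past the end are not detected, because the
    # total length of the original sequence is not stored.
    return lst[pos - 1][1]


print(lookup_by_bisect(compressed_list, start_indices, 3))
```

Each lookup then costs O(log n) in the number of runs rather than O(n).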
A better way for reading and storing the data could be a sqlite database.
import sqlite3 as sq
con = sq.connect(':memory:')  # in-memory database; would be a file path in your case
# create a database table to store the compressed list
con.execute("CREATE TABLE comp_store (start_index int, element int);")
# add the data
con.executemany("INSERT INTO comp_store (start_index, element) VALUES (?,?)", compressed_list)
To fetch one element from the database using its index (7 in the example below), you could use the following query:
con.execute('SELECT element FROM comp_store WHERE start_index <= ? ORDER BY start_index DESC LIMIT 1',
(7,))
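Note that execute only returns a cursor, so the row still has to be fetched. A minimal sketch that wraps the query in a helper could look like the following. The function name element_from_db is hypothetical, and declaring start_index as INTEGER PRIMARY KEY (so that SQLite can use it as an index for the lookup) is my assumption here, not something the table above requires.

```python
import sqlite3

compressed_list = [(0, 0), (1, 1), (5, 2), (6, 1), (13, 3), (14, 1),
                   (15, 2), (22, 4), (31, 3), (32, 4), (33, 1)]

con = sqlite3.connect(":memory:")  # would be a file path in a real setup
# Assumption: start_index is made the primary key so lookups can use an index.
con.execute(
    "CREATE TABLE comp_store (start_index INTEGER PRIMARY KEY, element INTEGER)"
)
con.executemany(
    "INSERT INTO comp_store (start_index, element) VALUES (?, ?)",
    compressed_list,
)
con.commit()


def element_from_db(connection, idx):
    """Return the element stored for position idx (hypothetical helper)."""
    row = connection.execute(
        "SELECT element FROM comp_store WHERE start_index <= ? "
        "ORDER BY start_index DESC LIMIT 1",
        (idx,),
    ).fetchone()
    if row is None:
        # No run starts at or before idx.
        raise KeyError("No element found at index {}".format(idx))
    return row[0]


print(element_from_db(con, 7))
```

As with the pure Python version, the table does not know the length of the original sequence, so an index past the end returns the element of the last run.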
I think PyTables can still be the right answer and, as far as I know, the HDF5 format is broadly used by open source and proprietary products. But if you want to stick to the Python standard library for some reason, this could also be a good way to go.
Hint: zip is lazy in Python 3 (it returns an iterator), whereas in Python 2 it builds a full list. enumerate is lazy in both versions.