I have 6000 log files (so a very large amount of data) in the form:

u1 1234
u3 231
u5 3456
u1 2367
u7 68968
u8 2112
u6 32423
u7 12345

where `ui` is the ith unit and the second column is a timestamp. Now I need to store all the data in an efficient data structure.

Language: Python

What I was doing was a dictionary of lists, where the key is the unit number and the value is a list of all the timestamps related to that unit. If there is a better option, please suggest it.
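
A minimal sketch of this dictionary-of-lists approach, assuming whitespace-separated files; the `logs/*.log` glob pattern is a hypothetical stand-in for the 6000 files:

```python
from collections import defaultdict
import glob

# unit -> list of timestamps for that unit
timestamps_by_unit = defaultdict(list)

# "logs/*.log" is a hypothetical pattern for the 6000 log files
for path in glob.glob("logs/*.log"):
    with open(path) as f:
        for line in f:
            unit, ts = line.split()
            timestamps_by_unit[unit].append(int(ts))
```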

  • How quickly do you need to be able to query this data structure? One speed->size trade-off comes to mind: sort the timestamps per list, and replace each value with the difference between consecutive timestamps (delta T), which will all be smaller numbers to store. You can then regenerate the original list by calculating the cumulative sum of the elements in the list. Of course, if all timestamps are as small as in your example this won't save much space, but I can imagine timestamps being larger in practice. – Dillon Davis Jul 30 '18 at 06:24 (sketched in the first example below)
  • Consider pandas for efficient indexed data structures like this. – juanpa.arrivillaga Jul 30 '18 at 06:35 (see the pandas sketch below)
  • @DillonDavis while Python uses arbitrary-sized `int` objects, the actual size of the bits representing the `int` will be much smaller than the overall overhead of a Python `object`, so really, you should consider `numpy`/`pandas` or even `array.array` or the `ctypes` module for memory-efficient storage – juanpa.arrivillaga Jul 30 '18 at 06:40
  • @DillonDavis although, once you switch over to something like `numpy`, that would be a clever trick to be able to use a small, unsigned integer type! – juanpa.arrivillaga Jul 30 '18 at 06:41
  • @juanpa.arrivillaga those are very good points. To counter the overhead of the objects, I have an alternative to numpy/pandas, under the assumption that the data structure may be immutable. First, produce a tuple containing the concatenation of all tuples from each would-be-dictionary list. Then produce two more tuples: one containing, at position `i`, the index into our first tuple for the would-be-dict entry for unit `i`, and a second tuple containing the lengths. So from the example: `elem=(1234,2367,231,3456,...);indices=(-1,0,-1,2,-1,3,...);length=(-1,2,-1,1,-1,1,...)`. – Dillon Davis Jul 30 '18 at 07:03 (sketched below)
  • @juanpa.arrivillaga of course numpy/pandas would be easier and should be the go-to solution here, but it's worth showing that it can also be done in native Python if need be. – Dillon Davis Jul 30 '18 at 07:04
  • @DillonDavis well, these are good ideas, but that doesn't address the object overhead. On my 64-bit Python 3.7, the *smallest* integer is 24 bytes (`0`), the next set of integers will be 28, then 32. The sparse data structure you described would still require Python `int` objects, so you should use `array.array` (if you want to avoid `numpy`) as the container for primitive integer types, where you could get away with 4/8 bytes per integer. Smaller, perhaps. – juanpa.arrivillaga Jul 30 '18 at 07:23 (demonstrated below)
  • @juanpa.arrivillaga you are correct. I quickly referenced [this answer](https://stackoverflow.com/a/30316760/6221024) for the size of an integer, and missed the "empty" column, instead reading an `int` as 4 bytes per 30 powers of 2. – Dillon Davis Jul 30 '18 at 07:30
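
A minimal sketch of the sort-and-delta trick from the first comment, using `itertools.accumulate` to restore the original values; the sample timestamps are unit `u7` from the question:

```python
from itertools import accumulate

timestamps = [68968, 12345]  # unit u7 from the question
timestamps.sort()            # [12345, 68968]

# store the first value plus successive differences, which are smaller numbers
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
# deltas == [12345, 56623]

# the cumulative sum restores the sorted originals
restored = list(accumulate(deltas))
assert restored == timestamps
```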
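
The pandas suggestion might look something like this; the file path is hypothetical, and whitespace-separated columns with no header row is an assumption about the log layout:

```python
import pandas as pd

# hypothetical path; columns assumed whitespace-separated with no header
df = pd.read_csv("logs/log0001.log", sep=r"\s+", header=None,
                 names=["unit", "timestamp"])

# group the timestamps by unit, the indexed analogue of the dict of lists
by_unit = df.groupby("unit")["timestamp"].apply(list)
print(by_unit["u1"])  # [1234, 2367] for the sample data
```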
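
A small demonstration of the per-object overhead and the `array.array` alternative discussed above; the exact `getsizeof` figures match 64-bit CPython 3.7 and may differ slightly on other versions:

```python
import sys
from array import array

# every Python int is a full object, ~24-28 bytes even for tiny values
print(sys.getsizeof(0))      # 24 on 64-bit CPython 3.7
print(sys.getsizeof(68968))  # 28

# an array of unsigned 8-byte integers stores exactly 8 bytes per element
timestamps = array("Q", [1234, 2367, 68968, 12345])
print(timestamps.itemsize)   # 8
```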
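
Finally, a sketch of the immutable flat-tuple layout described above, built from the question's sample data; `unit_timestamps` is an illustrative helper name, and `-1` marks units with no entries:

```python
# concatenated timestamps, grouped by unit (u1, u3, u5, u6, u7, u8)
elem = (1234, 2367, 231, 3456, 32423, 68968, 12345, 2112)

# indices[i] = start of unit i's slice in elem, -1 if unit i has no data
indices = (-1, 0, -1, 2, -1, 3, 4, 5, 7)
lengths = (-1, 2, -1, 1, -1, 1, 1, 2, 1)

def unit_timestamps(i):
    """Return unit i's timestamps as a slice of elem (illustrative helper)."""
    start = indices[i]
    if start == -1:
        return ()
    return elem[start:start + lengths[i]]

assert unit_timestamps(1) == (1234, 2367)    # u1
assert unit_timestamps(7) == (68968, 12345)  # u7
```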
