
I'm reading some .txt files as lists, using this method:

import csv

with open('../Results/DIMP_1120.txt', 'r') as f:
    DIMP_1120 = list(csv.reader(f, delimiter="|"))
with open('../Results/DIMP_1121.txt', 'r') as f:
    DIMP_1121 = list(csv.reader(f, delimiter="|"))
with open('../Results/DIMP_1122.txt', 'r') as f:
    DIMP_1122 = list(csv.reader(f, delimiter="|"))

But this is taking almost 10x the file's size in RAM.

Is there an efficient way to read it?

After that I'll concatenate those lists and sort them.

big_list = DIMP_1120 + DIMP_1121 + DIMP_1122

#Order all lists by *Sorter (Row_id2)
from operator import itemgetter
big_list = sorted(big_list, key=itemgetter(0))

So I guess I need to bring all lists to memory at once.

  • Do you need it as a `list` all at once? If you can process it row by row it will only need the memory for a row at a time (well, two rows at a time given how Python iteration works, but close enough). Otherwise, yeah, the overhead of Python `str` is ~49 bytes a piece, plus the overhead of each `list` wrapper, so if most of the fields are short, the overhead relative to the data will be quite high. – ShadowRanger Sep 10 '21 at 01:13
  • Hey, thanks for the reply. Yes, I need it all at once, because I'm appending to other lists and sorting all of them. Would reading as pickle instead of txt help? – Tanai Goncalves Sep 10 '21 at 01:19
  • Pickling won't help much unless the underlying data can be stored using more memory efficient types (e.g. storing and restoring raw `int`). I added a note on how you might take advantage of that while still using `csv`, simply by performing the type conversions as you load. – ShadowRanger Sep 10 '21 at 01:23
  • "appending to other lists and sorting all of them" doesn't necessarily require reading all at once. We could help you better if we knew more details. – no comment Sep 10 '21 at 01:29
  • Can you show a few lines of the data? How many files do you have? How large is each file? What are you doing with the final big sorted list? – no comment Sep 10 '21 at 17:59

2 Answers


If you can process the data a row at a time without storing each row, e.g.

for row in csv.reader(f, delimiter="|"):

do that; it's the only way to dramatically reduce peak memory usage.

Otherwise, the best you can do is convert the row storage format from list to tuple as you read, which should save at least a little memory (more if csv.reader isn't truncating the overallocation list does by default). Tuples don't overallocate, and they store their item pointers inline with the Python object header, while a list's header just adds a pointer to separately allocated memory, which overallocates and pays the allocator round-off overhead twice. For a dynamically allocated list of size 2 (e.g. [*(0, 1)] in CPython 3.9, where unpacking generalizations behave like sequential appends), the container overhead can drop from 120 bytes to 56 bytes just by converting to tuple (possibly more, since allocator round-off error isn't visible to sys.getsizeof, and list pays it twice while tuple pays it just once). That can add up across millions of such rows. The most efficient means of converting it would be to change:

DIMP_1120 = list(csv.reader(f, delimiter="|"))  

to:

DIMP_1120 = list(map(tuple, csv.reader(f, delimiter="|")))

map operates lazily on Python 3, so each row would be read as a list, converted to a tuple, and stored in the outer list before the next was read; it wouldn't involve storing the whole input as lists and tuples at the same time, even for a moment. If your underlying data has some fields that could be converted up-front to a more efficiently stored type (e.g. int), a list comprehension that both converts the fields and packs them as tuples instead of lists could gain more, e.g. for four fields per row, the last three of which are logically ints, you could do:

DIMP_1120 = [(a, int(b), int(c), int(d)) for a, b, c, d in csv.reader(f, delimiter="|")]
# If you might have some empty/missized rows you wish to ignore, an if check can discard
# wrong length lists; a nested "loop" over the single item can unpack after checking:
DIMP_1120 = [(a, int(b), int(c), int(d)) for lst in csv.reader(f, delimiter="|")
             if len(lst) == 4
             for a, b, c, d in (lst,)]

unpacking the lists from csv.reader, converting the relevant fields to int, and repacking as a tuple.
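As a quick sanity check on the overhead claims above, you can measure the two container types directly with sys.getsizeof (a minimal sketch; exact byte counts vary by CPython version and build, and getsizeof ignores allocator round-off, so treat the numbers as ballpark figures):

import csv
import io
import sys

# Build one sample row the same way csv.reader would, then repack it as a tuple.
sample = io.StringIO("abc|1|2|3\n")
row_as_list = next(csv.reader(sample, delimiter="|"))   # a list of str, as csv.reader yields
row_as_tuple = tuple(row_as_list)                       # same four fields, tuple container

# getsizeof counts only the container, not the field strings it points to.
print(sys.getsizeof(row_as_list))   # list header plus its separately allocated pointer array
print(sys.getsizeof(row_as_tuple))  # tuple header with the item pointers stored inline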

Side-note: Make sure to pass newline="" (the empty string) to your open call; the csv module requires this to properly handle newlines from different CSV dialects.
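For instance, applying that to the first file from the question (a minimal sketch; only the newline="" argument is new relative to the snippets above):

import csv

# Open with newline="" so the csv module handles line endings itself,
# then store each row as a tuple as suggested above.
with open('../Results/DIMP_1120.txt', 'r', newline='') as f:
    DIMP_1120 = list(map(tuple, csv.reader(f, delimiter="|")))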

Update: Reading into separate lists, then concatenating, then sorting boosts the peak outer-list overhead from roughly proportional to the number of rows to ~2.66x the number of rows (assuming all files are the same size). You can avoid that overhead by changing:

with open('../Results/DIMP_1120.txt', 'r') as f:
    DIMP_1120 = list(csv.reader(f, delimiter="|"))  
with open('../Results/DIMP_1121.txt', 'r') as f:
    DIMP_1121 = list(csv.reader(f, delimiter="|"))  
with open('../Results/DIMP_1122.txt', 'r') as f:
    DIMP_1122 = list(csv.reader(f, delimiter="|"))  

big_list = DIMP_1120 + DIMP_1121 + DIMP_1122

#Order all lists by *Sorter (Row_id2)
from operator import itemgetter
big_list = sorted(big_list, key=itemgetter(0))

to:

import csv
from itertools import chain
from operator import itemgetter

with open('../Results/DIMP_1120.txt', 'r') as f1, \
     open('../Results/DIMP_1121.txt', 'r') as f2, \
     open('../Results/DIMP_1122.txt', 'r') as f3:
    
    ALL_DIMP = chain.from_iterable(csv.reader(f, delimiter="|")
                                   for f in (f1, f2, f3))
    big_list = sorted(map(tuple, ALL_DIMP), key=itemgetter(0))

Only one list is ever made (your original code had six lists; one for each input file, one for the concatenation of the first two files, one for the concatenation of all three files, and a new one for the sorted concatenation of all three files), containing all the data, and it's created sorted from the get-go.

I'll note that this may be something better done at the command line, at least on *NIX-like systems, where the sort command line utility knows how to sort huge files by field, with automatic spilling to disk to avoid storing too much in memory at once. It could be done in Python, but it would be uglier (unless there's some PyPI module for doing this I don't know about).
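For illustration, one way that pure-Python route might look (a rough sketch, not part of the original answer: the .sorted suffix and the merged output path are made up, and it still assumes each individual file fits in memory, spilling only the sorted per-file results to disk before a lazy merge):

import csv
import heapq
from operator import itemgetter

paths = ['../Results/DIMP_1120.txt', '../Results/DIMP_1121.txt', '../Results/DIMP_1122.txt']
sorted_paths = []

# Sort each file on its own (only one file's rows in memory at a time)
# and spill the sorted rows back to disk.
for path in paths:
    with open(path, 'r', newline='') as f:
        rows = sorted(csv.reader(f, delimiter="|"), key=itemgetter(0))
    out_path = path + '.sorted'  # hypothetical temp-file naming
    with open(out_path, 'w', newline='') as out:
        csv.writer(out, delimiter="|").writerows(rows)
    sorted_paths.append(out_path)

# heapq.merge consumes the per-file readers lazily, so the merge step only
# keeps roughly one row per file in memory at a time.
files = [open(p, 'r', newline='') for p in sorted_paths]
try:
    readers = [csv.reader(f, delimiter="|") for f in files]
    with open('../Results/DIMP_merged.txt', 'w', newline='') as out:  # hypothetical output path
        csv.writer(out, delimiter="|").writerows(heapq.merge(*readers, key=itemgetter(0)))
finally:
    for f in files:
        f.close()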

ShadowRanger
  • [little experiment](https://tio.run/##hYwxDgIhEEV7TjEdkBAaGxtPYixYmFUShcnMRHe9PK6bWNu@//6jVW@9HY7EY9QHdVbI8gwgqxjD/QUnsAkmyFBsFLpXdd4Q16ZuW3cvXlGlvrHPX@T979dwUbfFImMqyO5sU5hCDsVe/L/GGB8) of the overhead. – no comment Sep 10 '21 at 03:23
  • 1
    @don'ttalkjustcode: There's still a non-linear growth pattern; try it with `'a,b,c,d,e'` and the memory usage for the `csv.reader`'s output jumps from 88 bytes of overhead to 120 (`tuple`ing then drops it back to 80). Looks like a difference in how `split` is overallocating (I accidentally chose an example that overallocates more than default `list` building; when no `maxsplit` provided, it reserves space for 12 items up front and doesn't trim at the end). So conversion to `tuple` can save memory, but not as much as my toy comparison indicated. – ShadowRanger Sep 10 '21 at 03:36
  • Thanks. This improved memory usage by almost 30%. – Tanai Goncalves Sep 10 '21 at 13:26
  • I updated my example to avoid the `.split` extra-overallocation, while also, intentionally, showing a best-case scenario where the `list` overhead is more than twice the `tuple` overhead. In practice, all `tuple` can *guarantee* is that `tuple(somelist)` will consume at least 16 fewer bytes than calling `list(somelist)`, which in CPython 3.9 will not include any overallocation overhead on the copy of `somelist` (it sizes precisely to the elements given). `tuple` still saves 16 bytes (plus allocator round-off loss) by avoiding storing a pointer and a separate `ssize_t` capacity field. – ShadowRanger Nov 10 '21 at 19:28

Reading the data into a list means you load and keep every line in memory. What you can do instead is iterate over the rows one at a time, by iterating over the csv.reader() object directly, which, as documented:

csv.reader(csvfile, dialect='excel', **fmtparams)

Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called... Each row read from the csv file is returned as a list of strings.

import csv

with open('../Results/DIMP_1120.txt', 'r') as f:
    for row in csv.reader(f, delimiter="|"):
        ...  # process the current row here

The disadvantage of this is that you only have access to one row at a time. But if you don't want to load everything into memory at once, I believe this is the only way. You just need to redesign your logic to process everything that needs to be done per row.
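For example, a minimal sketch of that per-row redesign (the aggregation here, counting rows and tracking the smallest first field, is just a stand-in for whatever the real per-row logic needs to be):

import csv

row_count = 0
smallest_key = None

with open('../Results/DIMP_1120.txt', 'r', newline='') as f:
    for row in csv.reader(f, delimiter="|"):
        # Do the per-row work here instead of storing the row.
        row_count += 1
        if smallest_key is None or row[0] < smallest_key:
            smallest_key = row[0]

print(row_count, smallest_key)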
