If you can process the data a row at a time without storing each row, e.g.
for row in csv.reader(f, delimiter="|"):
do that; it's the only way to dramatically reduce peak memory usage.
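For example, a minimal sketch of row-at-a-time processing (the per-row work here, counting rows and summing a field, is purely illustrative; substitute whatever processing you actually need) might look like:
import csv

row_count = 0
field_total = 0
# Stream the file one row at a time; only a single row is ever held in memory.
with open('../Results/DIMP_1120.txt', 'r', newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        row_count += 1
        field_total += int(row[1])  # hypothetical: assumes the second field is an integer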
Otherwise, the best you can do is convert the row storage format from list to tuple as you read, which should save at least a little memory (more if csv.reader isn't truncating the overallocation that list performs by default): tuples don't overallocate, and they store their data inline with the Python object header (without overallocation or an additional allocator round-off overhead), while a list's header just holds a pointer to separately allocated memory (which overallocates and pays the allocator round-off overhead twice). For a dynamically allocated list of size 2 (e.g. [*(0, 1)] on CPython 3.9, where unpacking generalizations behave like sequential appends), the container overhead can drop from 120 bytes to 56 bytes (possibly more, since allocator round-off error isn't visible to sys.getsizeof, and list pays it twice where tuple pays it just once) just by converting to tuple, which can make a difference for millions of such rows. The most efficient means of converting it would be to change:
DIMP_1120 = list(csv.reader(f, delimiter="|"))
to:
DIMP_1120 = list(map(tuple, csv.reader(f, delimiter="|")))
map operates lazily on Python 3, so each row would be read as a list, converted to a tuple, and stored in the outer list before the next was read; it wouldn't involve storing the whole input as lists and tuples at the same time, even for a moment.
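If you prefer comprehension syntax, an equivalent spelling with the same one-row-at-a-time conversion behavior would be:
DIMP_1120 = [tuple(row) for row in csv.reader(f, delimiter="|")]
Like the map-based version, it only ever holds the single most recently read list alongside the tuples already stored.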
If your underlying data has some fields that could be converted up-front to a more efficiently stored type (e.g. int), a list comprehension that both converts the fields and packs them as tuples instead of lists could gain more; e.g. for four fields per row, the last three of which are logically ints, you could do:
DIMP_1120 = [(a, int(b), int(c), int(d)) for a, b, c, d in csv.reader(f, delimiter="|")]
# If you might have some empty/missized rows you wish to ignore, an if check can discard
# wrong length lists; a nested "loop" over the single item can unpack after checking:
DIMP_1120 = [(a, int(b), int(c), int(d)) for lst in csv.reader(f, delimiter="|")
             if len(lst) == 4
             for a, b, c, d in (lst,)]
This unpacks the lists from csv.reader, converts the relevant fields to int, and repacks each row as a tuple.
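If you want to sanity-check the per-row container savings described above on your own interpreter (the exact byte counts vary with the CPython version and build; the numbers in the comments are what a 64-bit CPython 3.9 reports, and allocator round-off isn't included), you can compare the two storage forms directly:
import sys

row_as_list = [*(0, 1)]            # two-element list built via unpacking (overallocates on 3.9)
row_as_tuple = tuple(row_as_list)  # the same two elements stored as a tuple

print(sys.getsizeof(row_as_list))   # 120 on a 64-bit CPython 3.9 build
print(sys.getsizeof(row_as_tuple))  # 56 on the same build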
Side-note: Make sure to pass newline="" (the empty string) to your open call; the csv module requires it to handle newlines correctly across CSV dialects (including newlines embedded inside quoted fields).
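For example, for the first of your files the open call would look like:
with open('../Results/DIMP_1120.txt', 'r', newline="") as f:
    DIMP_1120 = list(map(tuple, csv.reader(f, delimiter="|")))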
Update: Reading into separate lists, then concatenating, then sorting boosts the peak outer-list overhead from being proportionate to the number of rows to roughly 2.66x the number of rows (assuming all files are the same size). You can avoid that overhead by changing:
with open('../Results/DIMP_1120.txt', 'r') as f:
    DIMP_1120 = list(csv.reader(f, delimiter="|"))
with open('../Results/DIMP_1121.txt', 'r') as f:
    DIMP_1121 = list(csv.reader(f, delimiter="|"))
with open('../Results/DIMP_1122.txt', 'r') as f:
    DIMP_1122 = list(csv.reader(f, delimiter="|"))
big_list = DIMP_1120 + DIMP_1121 + DIMP_1122
#Order all lists by *Sorter (Row_id2)
from operator import itemgetter
big_list = sorted(big_list, key=itemgetter(0))
to:
from itertools import chain
with open('../Results/DIMP_1120.txt', 'r', newline="") as f1, \
     open('../Results/DIMP_1121.txt', 'r', newline="") as f2, \
     open('../Results/DIMP_1122.txt', 'r', newline="") as f3:
    ALL_DIMP = chain.from_iterable(csv.reader(f, delimiter="|")
                                   for f in (f1, f2, f3))
    big_list = sorted(map(tuple, ALL_DIMP), key=itemgetter(0))
Only one list is ever made (your original code had six lists: one for each input file, one for the concatenation of the first two files, one for the concatenation of all three files, and a new one for the sorted concatenation of all three files), containing all the data, and it's created sorted from the get-go.
I'll note that this may be something better done at the command line, at least on *NIX-like systems, where the sort command-line utility knows how to sort huge files by field, with automatic spilling to disk to avoid storing too much in memory at once. It could be done in Python, but it would be uglier (unless there's some PyPI module for doing this that I don't know about).