So, I have a large 30k line program I've been writing for a year. It basically gathers non-normalized and non-standardized data from multiple sources and matches everything up after standardizing the sources.
I've written almost everything with ordered dictionaries. This let me keep the columns ordered, named, and mutable, which made processing easier since values could be assigned/fixed throughout the entire mess of code.
However, I'm currently running out of RAM from all these dictionaries. I've since learned that switching to namedtuples would shrink the memory footprint; the only problem is that they aren't mutable, which raises one issue in doing the conversion.
I believe I could use a class to restore mutability, but would my RAM savings be the same? Another option would be to keep namedtuples and rebind to a new namedtuple every time a value needs to change (i.e. new_tup = Tup(old_tup.obj1, old_tup.obj2, "something new")). But I think I'd need an explicit way to destroy the old one afterwards, or space could become an issue again.
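For instance, I believe namedtuple already ships a _replace method that covers the rebind-on-change pattern, and a class with __slots__ would stay mutable while skipping the per-instance __dict__. A rough sketch of both (the field names here are made up, not my real ones):

```python
from collections import namedtuple

# Hypothetical row type standing in for my real ones
TanRawProg = namedtuple("TanRawProg", ["tan_id", "name", "publisher", "version"])

row = TanRawProg("123", "ExampleApp", "ExampleCorp", "1.0")
# _replace builds a NEW namedtuple; rebinding the name drops the old one,
# which reference counting then reclaims
row = row._replace(version="2.0")

# Alternative: a __slots__ class is mutable and avoids a per-instance __dict__
class TanRawProgSlots:
    __slots__ = ("tan_id", "name", "publisher", "version")

    def __init__(self, tan_id, name, publisher, version):
        self.tan_id = tan_id
        self.name = name
        self.publisher = publisher
        self.version = version

r = TanRawProgSlots("123", "ExampleApp", "ExampleCorp", "1.0")
r.version = "2.0"  # mutated in place, no new object created
```

As I understand it, the __slots__ route keeps the mutable assignment style my code already uses, so it might mean the smallest rewrite.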
The bottom line: my input files are about 6 GB on disk (lots of data), and I'm forced to process them on a server with 16 GB RAM and 4 GB swap. I originally modeled all the rows of these various I/O data sets as dictionaries, which is eating too much RAM. The mutable nature and named referencing were a huge help in faster development, so how do I cut my addiction to dictionaries and capture the memory savings of other objects without rewriting the entire application due to the immutable nature of tuples?
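To gauge the per-row savings before committing to a rewrite, I compared the shallow sizes of a dict row and a namedtuple row (only a sanity check; sys.getsizeof ignores the keys and values themselves, which are shared either way):

```python
import sys
from collections import namedtuple

# Hypothetical three-column row, just for measurement
Rec = namedtuple("Rec", ["a", "b", "c"])

dict_row = {"a": 1, "b": 2, "c": 3}
tuple_row = Rec(1, 2, 3)

print(sys.getsizeof(dict_row))   # dict overhead: hash table plus key pointers
print(sys.getsizeof(tuple_row))  # tuple overhead: just the item pointers
```

On my machine the dict row is several times larger than the tuple row, and the gap should matter multiplied across millions of rows.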
SAMPLE CODE:
    for tan_programs_row in f_tan_programs:
        # stats not included due to urgent need
        tan_id = tan_programs_row["Computer ID"].strip()  # The Tanium ID by which to reference all other Tanium files (i.e. the primary key)
        if "NO RESULT" not in tan_id.upper():
            tan_programs_name = tan_programs_row["Name"].strip()  # The Program Name
            tan_programs_publisher = tan_programs_row["Publisher"].strip()  # The Program Vendor
            tan_programs_version = tan_programs_row["Version"].strip()  # The Program Version
            if tan_id not in unnorm_tan_dict:
                # form the item since it doesn't exist yet
                unnorm_tan_dict[tan_id] = {
                    "Tanium ID": tan_id,
                    "Computer Name": "INDETERMINATE",
                    "Operating System": "INDETERMINATE",
                    "Operating System Build Number": "INDETERMINATE",
                    "Service Pack": "INDETERMINATE",
                    "Country Code": "INDETERMINATE",
                    "Manufacturer": "INDETERMINATE",
                    "Model": "INDETERMINATE",
                    "Serial": "INDETERMINATE"
                }
            unnorm_tan_prog_list.append(rows.TanRawProg._make([tan_id, tan_programs_name, tan_programs_publisher, tan_programs_version]))

    for tan_processes_row in f_tan_processes:
        # stats not included due to urgent need
        tan_id = tan_processes_row["Computer ID"].strip()  # The Tanium ID by which to reference all other Tanium files (i.e. the primary key)
        if "NO RESULT" not in tan_id.upper():
            tan_process_name = tan_processes_row["Running Processes"].strip()  # The Process Name
            if tan_id not in unnorm_tan_dict:
                # form the item since it doesn't exist yet
                unnorm_tan_dict[tan_id] = {
                    "Tanium ID": tan_id,
                    "Computer Name": "INDETERMINATE",
                    "Operating System": "INDETERMINATE",
                    "Operating System Build Number": "INDETERMINATE",
                    "Service Pack": "INDETERMINATE",
                    "Country Code": "INDETERMINATE",
                    "Manufacturer": "INDETERMINATE",
                    "Model": "INDETERMINATE",
                    "Serial": "INDETERMINATE"
                }
            unnorm_tan_proc_list.append(rows.TanRawProc._make([tan_id, tan_process_name]))
* Later on, these values are often changed by bringing in other data sets.
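If I went the namedtuple route, I assume those later updates would turn from an item assignment into a rebind of the list slot, with the old tuple reclaimed by reference counting once nothing points at it (so no explicit destruction step, if I understand CPython correctly). A sketch, with TanRawProg as a hypothetical stand-in for my rows.TanRawProg:

```python
from collections import namedtuple

# Hypothetical stand-in for rows.TanRawProg
TanRawProg = namedtuple("TanRawProg", ["tan_id", "name", "publisher", "version"])

unnorm_tan_prog_list = [TanRawProg("123", "ExampleApp", "ExampleCorp", "1.0")]

# Dict style would have been:  unnorm_tan_prog_list[0]["Version"] = "2.0"
# Namedtuple style: rebuild only the changed field and rebind the slot;
# the previous tuple is freed once this reference is gone
unnorm_tan_prog_list[0] = unnorm_tan_prog_list[0]._replace(version="2.0")
```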