
So, I have a large 30k line program I've been writing for a year. It basically gathers non-normalized and non-standardized data from multiple sources and matches everything up after standardizing the sources.

I've written almost everything with ordered dictionaries. This let me keep the columns ordered, named, and mutable, which made processing easier since values can be assigned or fixed throughout the entire mess of code.

However, I'm currently running out of RAM because of all these dictionaries. I've since learned that switching to namedtuples would fix this; the only problem is that namedtuples aren't mutable, which raises an issue in doing the conversion.

I believe I could use a class to eliminate the immutability, but will my RAM savings be the same? Another option would be to use namedtuples and reassign them to new namedtuples every time a value needs to change (i.e. NewTup = Tup(oldTup.obj1, oldTup.obj2, "something new")). But I think I'd need an explicit way to destroy the old one afterwards, or space could become an issue again.
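Roughly what I mean, as a sketch (the field names here are made up, not from my real code); I gather namedtuple's _replace does this copy-with-changes in one call:

    from collections import namedtuple

    TanRawProg = namedtuple("TanRawProg", ["tan_id", "name", "publisher", "version"])

    row = TanRawProg("12345", "Some Program", "Some Vendor", "1.0")

    # "Changing" a value really means building a new tuple; _replace copies
    # the old one with the named fields swapped out.
    row = row._replace(version="2.0")

    # No explicit destroy step should be needed: once nothing references the
    # old tuple, CPython reclaims it via reference counting.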

The bottom line: my input files are about 6 GB on disk (lots of data), and I'm forced to process them on a server with 16 GB RAM and 4 GB swap. I originally represented all the rows of these various I/O data sets as dictionaries, which is eating too much RAM... but their mutability and named referencing were a huge help in developing faster. How do I cut my addiction to dictionaries so that I can get the memory savings of other objects without rewriting the entire application due to the immutable nature of tuples?

SAMPLE CODE:

    for tan_programs_row in f_tan_programs:
        # stats not included due to urgent need
        tan_id = tan_programs_row["Computer ID"].strip()  # the Tanium ID by which to reference all other Tanium files (i.e. primary key)
        if "NO RESULT" not in tan_id.upper():
            tan_programs_name = tan_programs_row["Name"].strip()  # the program name
            tan_programs_publisher = tan_programs_row["Publisher"].strip()  # the program vendor
            tan_programs_version = tan_programs_row["Version"].strip()  # the program version

            try:
                unnorm_tan_dict[tan_id]  # test the key; if it doesn't exist, fall through to the exception
            except KeyError:
                # form the item since it doesn't exist yet
                unnorm_tan_dict[tan_id] = {
                    "Tanium ID": tan_id,
                    "Computer Name": "INDETERMINATE",
                    "Operating System": "INDETERMINATE",
                    "Operating System Build Number": "INDETERMINATE",
                    "Service Pack": "INDETERMINATE",
                    "Country Code": "INDETERMINATE",
                    "Manufacturer": "INDETERMINATE",
                    "Model": "INDETERMINATE",
                    "Serial": "INDETERMINATE"
                }
            unnorm_tan_prog_list.append(rows.TanRawProg._make([tan_id, tan_programs_name, tan_programs_publisher, tan_programs_version]))

    for tan_processes_row in f_tan_processes:
        # stats not included due to urgent need
        tan_id = tan_processes_row["Computer ID"].strip()  # the Tanium ID by which to reference all other Tanium files (i.e. primary key)
        if "NO RESULT" not in tan_id.upper():
            tan_process_name = tan_processes_row["Running Processes"].strip()  # the running process name
            try:
                unnorm_tan_dict[tan_id]  # test the key; if it doesn't exist, fall through to the exception
            except KeyError:
                # form the item since it doesn't exist yet
                unnorm_tan_dict[tan_id] = {
                    "Tanium ID": tan_id,
                    "Computer Name": "INDETERMINATE",
                    "Operating System": "INDETERMINATE",
                    "Operating System Build Number": "INDETERMINATE",
                    "Service Pack": "INDETERMINATE",
                    "Country Code": "INDETERMINATE",
                    "Manufacturer": "INDETERMINATE",
                    "Model": "INDETERMINATE",
                    "Serial": "INDETERMINATE"
                }
            unnorm_tan_proc_list.append(rows.TanRawProc._make([tan_id, tan_process_name]))

Later on, these values are often changed by bringing in other data sets.

  • Reducing the dicts' memory use will only get you so far - add some more data, or have some other process eating memory on the same server, and you'll have memory issues again. If you really want to solve the problem, you will need a solution that does not require loading that much data into RAM (a proper database - whether relational or not - might be part of the solution). – bruno desthuilliers Dec 11 '17 at 14:01

2 Answers


Just write your own class, and use __slots__ to keep the memory footprint to a minimum:

class UnnormTan(object):
    __slots__ = ('tan_id', 'computer_name', ...)
    def __init__(self, tan_id, computer_name="INDETERMINATE", ...):
        self.tan_id = tan_id
        self.computer_name = computer_name
        # ...

This can get a little verbose perhaps, and if you need to use these as dictionary keys you'll have more typing to do.
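For example, a sketch of that extra typing (only two of the question's fields, with equality and hashing written by hand):

    class UnnormTan(object):
        __slots__ = ('tan_id', 'computer_name')

        def __init__(self, tan_id, computer_name="INDETERMINATE"):
            self.tan_id = tan_id
            self.computer_name = computer_name

        # Hand-written equality and hashing so instances can act as dict keys.
        def __eq__(self, other):
            if not isinstance(other, UnnormTan):
                return NotImplemented
            return (self.tan_id, self.computer_name) == (other.tan_id, other.computer_name)

        def __hash__(self):
            return hash((self.tan_id, self.computer_name))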

There is a project that makes creating such classes easier: attrs:

from attr import attrs, attrib

@attrs(slots=True)
class UnnormTan(object):
    tan_id = attrib()
    computer_name = attrib(default="INDETERMINATE")
    # ...

Classes created with the attrs library automatically take care of proper equality testing, representation and hashability.

Such objects are the most memory-efficient representation of data that Python can offer. If that is not enough (and it may well not be), you need to look at offloading your data to disk. The easiest way to do that is to use a SQL database, such as SQLite via the bundled sqlite3 module. Even a temporary on-disk database (created by passing an empty filename) will manage your memory load for you, keeping frequently used pages cached in RAM and leaving the rest on disk.
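As a rough illustration only (the table and column names are invented to mirror the question's data, not a drop-in for the asker's code), the sqlite3 route could look like this:

    import sqlite3

    # An empty filename asks SQLite for a temporary on-disk database; pass a
    # real path instead if the data should outlive the run.
    conn = sqlite3.connect("")
    conn.execute("""
        CREATE TABLE unnorm_tan (
            tan_id        TEXT PRIMARY KEY,
            computer_name TEXT DEFAULT 'INDETERMINATE'
        )
    """)

    # Batched inserts stay fast; executemany does the grouping.
    conn.executemany("INSERT OR IGNORE INTO unnorm_tan (tan_id) VALUES (?)",
                     [("12345",), ("67890",)])

    # Later corrections become UPDATEs instead of mutating dicts held in RAM.
    conn.execute("UPDATE unnorm_tan SET computer_name = ? WHERE tan_id = ?",
                 ("HOST-01", "12345"))
    conn.commit()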

Martijn Pieters

It seems to me that your main problem is that you are trying to create a database entirely in memory. You should use an actual database like MySQL or PostgreSQL, and you can use a nice ORM such as peewee or the Django ORM to interact with it.

On the other hand, if you can't handle the whole data set at once, you can split it into parts you can handle.
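For example, a minimal sketch of that chunked approach (the file name, chunk size, and process() function are placeholders):

    import csv

    # Yield lists of up to `size` rows so only one chunk sits in RAM at a time.
    def chunks(reader, size=50000):
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    with open("tan_programs.csv", newline="") as f:
        for part in chunks(csv.DictReader(f)):
            process(part)  # placeholder for whatever per-chunk work is needed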

Module "TinyDB" (http://tinydb.readthedocs.io/en/latest/) can help you to keep using dictionaries and don't run out of RAM.

  • SQL is on a separate server, and network speed is a big bottleneck, unfortunately. Additionally, if I put all the data in the database first and foremost and work on every row there, then I can only really do UPDATEs... but those are really slow since they require searching and can only be done one at a time, plus I know I will touch every row anyway... INSERTs can be grouped 500 at a time, which is about 100 times faster than 500 UPDATEs. – gunslingor Dec 11 '17 at 14:49
  • If you don't want to face any lack of RAM, you really have no choice but to use a database for your data storage and processing. You can easily set up a database on your server, and then there will be no network involved. You can process your rows any way you want, but the key point is storing everything you don't need right now somewhere other than memory. You should take a look at MongoDB. – Михаил Крюков Dec 11 '17 at 15:19
  • Unfortunately, certain compliance regulations prevent this... they actually require separate networks for DB and app servers. Companies don't always design things right, and I'm left in these crazy positions. – gunslingor Dec 11 '17 at 15:52
  • If you can't use anything besides raw Python for processing your data, the only choices you have left are to 1) split your data into processable parts and work through them one by one, or 2) use TinyDB (slow, but simple) or CodernityDB (more advanced and faster). – Михаил Крюков Dec 11 '17 at 16:02