I am working with a database that originally is in ‘.sav’ format of SPSS. This database has columns of information that do not optimize memory usage (i.e, strings variables have left blank spaces, also in the numeric there are decimals where there are supposed to be an integer). The process I do is to clean the database (correct the formats) and then I exported them to a flat text file to be processed in a text editor. The text editor is not that powerful, so the correct assignment of columns and formats is a very important process, as well as the reduction of memory usage. I have created a program to reduce memory usage using Python (pyreadstat library). The program is working with one limitation, it is only considering the values of the columns found in the database and no the metadata that is also imported from the SPSS file. My clients usually send two databases, one of the preliminary data and after a few weeks, I receive the final data that is going to be presented.
I use the preliminary data to continue programming in a text editor and save time until I get the final data. I found a potential problem in this process, which is that in the final data value may come that does not correspond to the format of the preliminary data. For example, with the variable age in the database. It can happen that in the preliminary data the oldest person is 99 years old, so the column format is a two-digit integer, however in the final data a 100-year-old subject could appear then the format of the column would change it would be an integer of three digits. This could change everything that I format in the text editor because the columns of the database are pulled or extracted in a wrong way. One way to overcome this problem would be to use the database metadata. In the metadata, there are all the values and labels of the variables of my database, and this information was double-checked by the client. Then if I use this the memory usage reduction would be optimal.
My question is if there is any way to use the metadata that I am importing from the SPSS file to format the columns of my database in a Python program? Also, Is this the most efficient way to solve this problem or is there any other more appropriate procedure to accomplish this task?
Thank you very much for your help, having this procedure working or having some ideas from more experienced people on the subject would save me a lot of time in my work.