4

I am working with a database that originally is in ‘.sav’ format of SPSS. This database has columns of information that do not optimize memory usage (i.e, strings variables have left blank spaces, also in the numeric there are decimals where there are supposed to be an integer). The process I do is to clean the database (correct the formats) and then I exported them to a flat text file to be processed in a text editor. The text editor is not that powerful, so the correct assignment of columns and formats is a very important process, as well as the reduction of memory usage. I have created a program to reduce memory usage using Python (pyreadstat library). The program is working with one limitation, it is only considering the values of the columns found in the database and no the metadata that is also imported from the SPSS file. My clients usually send two databases, one of the preliminary data and after a few weeks, I receive the final data that is going to be presented.

I use the preliminary data to continue programming in a text editor and save time until I get the final data. I found a potential problem in this process, which is that in the final data value may come that does not correspond to the format of the preliminary data. For example, with the variable age in the database. It can happen that in the preliminary data the oldest person is 99 years old, so the column format is a two-digit integer, however in the final data a 100-year-old subject could appear then the format of the column would change it would be an integer of three digits. This could change everything that I format in the text editor because the columns of the database are pulled or extracted in a wrong way. One way to overcome this problem would be to use the database metadata. In the metadata, there are all the values and labels of the variables of my database, and this information was double-checked by the client. Then if I use this the memory usage reduction would be optimal.

My question is if there is any way to use the metadata that I am importing from the SPSS file to format the columns of my database in a Python program? Also, Is this the most efficient way to solve this problem or is there any other more appropriate procedure to accomplish this task?

Thank you very much for your help, having this procedure working or having some ideas from more experienced people on the subject would save me a lot of time in my work.

ramiromp
  • 53
  • 4
  • 2
    your question is a bit broad, as the answer may lie in several places which are not described here; like why is your input always SPSS ? AFAIK there is no proper/maintained python module to read spss files; why not csv or other format ? what other formats does you "text editor" supports ? is that the only option ? And sice you asked in principle, you may not get a very specific and detailed answer; the more detailed the question, the more detailed (and useful) the answer may be ;) – horace_vr Sep 11 '19 at 15:32
  • Thank you for your reply. Sorry about that, I just need to know how to use the metadata of a database to format its columns to reduce the memory usage. – ramiromp Sep 11 '19 at 18:32
  • 1
    ive done this of thing before, but not on spss. but you have to meet us halfway. can you prettyprint or something the following: your spss metadata for 1 field, the current results for 1field or row, last: what you want instead. and maybe trim down your explanation (most of it is background) or you can put it later, after a concise intro. – JL Peyret Sep 11 '19 at 21:29
  • I am solving the problem. What I did is create a data frame that contains the length of the longest value label of my metadata for each one of the variables. Then, I would have the exactly format that I need to use to create a clean data frame. – ramiromp Sep 12 '19 at 13:30

0 Answers0