
I have a raw SAS file that is around 16 GB, and even after keeping only the columns relevant to my problem, the file is still around 8 GB. It looks something like this:

CUST_ID   FIELD_1   FIELD_2   FIELD_3 ... FIELD_7
1          65         786      ABC          Y
2          87         785      GHI          N
3          88         877      YUI          Y
...
9999999    92         767      XYS          Y

When I tried to import it into Python using df = pd.read_sas(path, format='SAS7BDAT'), my screen went black, and after multiple attempts I finally got a MemoryError. Since I need the entire set of CUST_ID values for my problem, selecting only a sample and deleting the other rows is out of the question.

I thought maybe I could split this file into multiple sub-files, carry out all the required calculations on each of them, and then finally recombine them into a single large file once the work is done.

Is there any way to solve this issue? I really appreciate all the help that I can get!

Edit:

I've tried this:

df_chunk = pd.read_sas(path, format='SAS7BDAT', chunksize=100000)

chunk_list = []
for chunk in df_chunk:
    chunk_list.append(chunk)

df_concat = pd.concat(chunk_list)

But I'm still getting a MemoryError. Any help?

  • Are you happy to use SQL? I'm tempted to say to clean the data in there before passing to python – Plato77 Feb 12 '20 at 16:25
  • How much memory is there available in your system? – user667489 Feb 12 '20 at 16:28
  • I'd prefer a solution that involves only python. Is there a way to clean it within python itself? – coffee-raid Feb 12 '20 at 16:29
  • Remember SAS saves datasets to hard disk while Python runs all operations in RAM, and for Pandas work you want RAM well above the data size. So even if you were to read the data, chunks or not, other steps may generate a MemoryError. Off topic here on Stack Overflow, but consider a larger machine, remote or cloud solutions, or a virtual env. – Parfait Feb 12 '20 at 16:50
  • @Parfait or use SAS since it won't crash or run out of memory ;) – Reeza Feb 12 '20 at 17:17
  • Indeed @Reeza. But it may well be that the OP only has the `.sas7bdat` file. – Parfait Feb 12 '20 at 17:42

2 Answers


There is a chunksize argument for read_sas which should allow you to break the big file into smaller parts so that you can read it in; chunksize is the number of records to read at a time.
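
For example, a minimal sketch along these lines (the path, chunk size, and per-chunk work below are placeholders, not from the question):

import pandas as pd

path = 'customers.sas7bdat'  # placeholder path to the SAS file

# with chunksize set, read_sas returns an iterator of DataFrames
# instead of loading the whole file into memory at once
reader = pd.read_sas(path, format='SAS7BDAT', chunksize=100000)

for chunk in reader:
    # each chunk is a DataFrame with up to 100,000 rows;
    # do the per-chunk work here rather than collecting every chunk in memory
    print(chunk['CUST_ID'].count())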

TaxpayersMoney
  • so, for ex, if I want to read ```100000 CUST_ID``` at a time, I only need to type in ```pd.read_sas(path, chunksize=100000)``` ? – coffee-raid Feb 12 '20 at 16:34
  • I'm going to try this out. Thank you for the help! – coffee-raid Feb 12 '20 at 17:04
  • So I've tried the following ```chunk_list=[] for chunk in df_chunk: chunk_list.append(chunk)``` and out of the loop, I wrote ```df_concat=pd.concat(chunk_list) ```. But I am still getting a ```Memory Error```. Any help? – coffee-raid Feb 13 '20 at 05:27

Set the iterator flag to True and split the file up in a loop before doing your processing.

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html

Or split the file in SAS before doing the output.


I think what you are trying to do is the following:

import pandas as pd

CHUNK = 10  # rows per chunk; for a file this size use something much larger, e.g. 100000
df = pd.read_sas(path, format='SAS7BDAT', chunksize=CHUNK)

first = True
for chunk in df:
    # perform the compression / reduction steps listed below on the chunk,
    # then write it out of your memory onto disk
    chunk.to_csv('new_file',
                 mode='a',            # append mode, so every chunk goes into the same file
                 header=first,        # write the header only once, with the first chunk
                 # compression='gzip' # mainly to save space on disk, maybe not needed
                 )
    first = False

df = pd.read_csv('new_file')

You could try to compress the data inside the loop, because otherwise it will fail again when merging (a short sketch follows the list below):

  1. Dropping columns
  2. Lower-range numerical dtype
  3. Categoricals
  4. Sparse columns

Ref: https://pythonspeed.com/articles/pandas-load-less-data/
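
For example, a minimal per-chunk sketch of points 1–3, using the column names from the question purely for illustration (which columns to keep and which dtypes fit depend on the actual data):

import pandas as pd

def shrink(chunk):
    # 1. keep only the columns you actually need (illustrative subset)
    chunk = chunk[['CUST_ID', 'FIELD_1', 'FIELD_2', 'FIELD_7']].copy()
    # 2. numeric columns come back from read_sas as float64; downcast where the values allow
    chunk['FIELD_1'] = pd.to_numeric(chunk['FIELD_1'], downcast='float')
    chunk['FIELD_2'] = pd.to_numeric(chunk['FIELD_2'], downcast='float')
    # 3. low-cardinality text columns (e.g. the Y/N flag) are much smaller as categoricals
    chunk['FIELD_7'] = chunk['FIELD_7'].astype('category')
    return chunk

Note that dtypes are not preserved through a CSV round trip, so after reading the combined file back with read_csv you would need to pass a dtype= mapping (or apply a step like this again) to keep the savings.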

Grom