
I have a 50 GB SAS dataset that I want to read into a pandas DataFrame. What is the fastest way to read it?

I used the code below, which is way too slow:

import pandas as pd

# Read the SAS file in chunks of 10 million rows, then concatenate the chunks
df = pd.read_sas("xxxx.sas7bdat", chunksize=10000000)
dfs = []
for chunk in df:
    dfs.append(chunk)
df_final = pd.concat(dfs)

Is there a faster way to read a large dataset in Python? Can this process be run in parallel?

Shanoo
  • SAS reading is always a performance bottleneck for me, with or without `chunksize`. I usually export the SAS data to Excel, or read it once with `pd.read_sas` and save it in another Python binary format. – Quang Hoang Oct 29 '19 at 17:11
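
A minimal sketch of that read-once, cache-as-a-Python-binary idea (hypothetical file names; it assumes the data fits in memory for the one slow initial read):

import pandas as pd

# Slow one-time parse of the SAS file
df = pd.read_sas("xxxx.sas7bdat")

# Cache it in a faster binary format; later runs load this instead
df.to_pickle("xxxx.pkl")

# Subsequent loads are much faster than re-parsing the SAS file
df = pd.read_pickle("xxxx.pkl")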

2 Answers


I know it's a very late response, but I think my answer will be useful for future readers. A few months back, when I had to read and process SAS data in either SAS7BDAT or XPT format, I looked into the different libraries and packages available for reading these datasets and shortlisted the following:

  1. pandas (high on the list due to community support and performance)
  2. SAS7BDAT (reads SAS7BDAT files only; last release July 2019)
  3. pyreadstat (promising performance per the documentation, plus the ability to read metadata)

Before picking a package, I did some performance benchmarking. Although I didn't have benchmark results at the time of posting this answer, I found pyreadstat to be faster than pandas (it seems to use multiprocessing while reading the data, as mentioned in the documentation, but I'm not entirely sure). Memory consumption and footprint were also much lower with pyreadstat than with pandas. On top of that, pyreadstat can read the metadata, and even allows reading the metadata only, so I finally ended up picking pyreadstat.

The data read using pyreadstat is already a pandas dataframe, so no manual conversion to a pandas dataframe is needed.
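
A minimal sketch of basic pyreadstat usage, assuming a hypothetical file foo.sas7bdat; the metadata-only read and the multiprocessing helper are documented pyreadstat features, but check that your installed version supports them:

import pyreadstat

# Read the whole file: returns a pandas dataframe plus a metadata object
df, meta = pyreadstat.read_sas7bdat('foo.sas7bdat')
print(meta.number_rows, meta.column_names)

# Read only the metadata (no data rows), useful for inspecting a huge file cheaply
_, meta_only = pyreadstat.read_sas7bdat('foo.sas7bdat', metadataonly=True)

# Read using multiple processes (available in recent pyreadstat versions)
df_mp, meta_mp = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_sas7bdat, 'foo.sas7bdat', num_processes=4)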

Talking about reading large SAS data: pyreadstat has row_limit and row_offset parameters which can be used to read the file in chunks, so memory is not going to be a bottleneck. Furthermore, while reading the SAS data in chunks, you can convert each chunk to categorical and append it to the resulting data before reading the next chunk; this compresses the data, so memory consumption is very low (it depends on the data: the fewer unique values in the dataframe, the lower the memory usage). The following code snippet might be useful for anyone who needs to read large SAS data:

import pandas as pd
import pyreadstat

filename = 'foo.SAS7BDAT'
CHUNKSIZE = 50000
offset = 0

# Pick the right reader once: read_sas7bdat for .sas7bdat files, read_xport for .xpt files
if filename.lower().endswith('sas7bdat'):
    getChunk = pyreadstat.read_sas7bdat
else:
    getChunk = pyreadstat.read_xport

# Read the first chunk and convert all of its columns to categorical
allChunk, _ = getChunk(filename, row_limit=CHUNKSIZE, row_offset=offset)
allChunk = allChunk.astype('category')

while True:
    offset += CHUNKSIZE
    chunk, _ = getChunk(filename, row_limit=CHUNKSIZE, row_offset=offset)
    if chunk.empty:
        break  # an empty chunk means the entire file has been read

    # Convert the new chunk to categorical and align its categories with the
    # accumulated dataframe so that concat keeps every column categorical
    chunk = chunk.astype('category')
    for eachCol in chunk:
        colUnion = pd.api.types.union_categoricals([allChunk[eachCol], chunk[eachCol]])
        allChunk[eachCol] = pd.Categorical(allChunk[eachCol], categories=colUnion.categories)
        chunk[eachCol] = pd.Categorical(chunk[eachCol], categories=colUnion.categories)

    allChunk = pd.concat([allChunk, chunk])  # append each chunk to the resulting dataframe

PS: Note that every column of the resulting dataframe allChunk will be of Categorical dtype.
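
To see how much the categorical conversion helps, you can compare in-memory sizes (a quick check, not part of the original snippet; only expand a small sample back to object, since that uses far more memory):

# Deep memory usage of the categorical dataframe, in MB
print(allChunk.memory_usage(deep=True).sum() / 1024**2)

# For comparison, the same rows with categoricals expanded back to plain objects
sample = allChunk.head(100000)
print(sample.astype('object').memory_usage(deep=True).sum() / 1024**2)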

Here are some benchmarks (time to read the file into a dataframe) performed on real CDISC data (raw and standardized). The file sizes range from a few KB to a few MB, and both xpt and sas7bdat file formats are included:

Reading ADAE.xpt 49.06 KB for 100 loops:
    Pandas Average time : 0.02232 seconds
    Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
    Pandas Average time : 0.01610 seconds
    Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
    Pandas Average time : 0.03248 seconds
    Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
    Pandas Average time : 5.30275 seconds
    Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
    Pandas Average time : 15.53942 seconds
    Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04017 seconds
    Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01052 seconds
    Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04446 seconds
    Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01273 seconds
    Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
    Pandas Average time : 7.93266 seconds
    Pyreadstat Average time : 0.92295 seconds

As you can see, pyreadstat is not faster for the xpt files, but for sas7bdat files it clearly outperforms pandas.

The above benchmark was performed on pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.

ThePyGuy

It sounds like your dataset may be larger than your working memory (RAM). For example, my laptop has 16 GB of RAM; if I tried to open your 50 GB dataset, it wouldn't fit in memory.

Your example contains half of one possible solution: you can work on part of the dataset at a time using the chunksize parameter. But you won't be able to concatenate the chunks into a single dataframe, because pandas requires the whole dataframe to fit in working memory. Again, I don't know how much RAM you have, but my laptop has 16 GB.
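
A minimal sketch of that chunk-at-a-time idea, computing an aggregate without ever holding the full dataset in memory (the column name "amount" is hypothetical):

import pandas as pd

total = 0.0
count = 0
# Only one chunk is ever in memory at a time
for chunk in pd.read_sas("xxxx.sas7bdat", chunksize=1_000_000):
    total += chunk["amount"].sum()     # hypothetical numeric column
    count += chunk["amount"].count()   # non-null values in this chunk

print("overall mean:", total / count)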

Historically, people have worked with large datasets using Hadoop/Spark, but it's a pain to set up, and you have to involve the IT department, infrastructure people, etc. So in the past few years the Python community has increasingly turned to Dask. I have never used Dask myself, but here is a talk from PyCon 2018 on the topic. (In the first few seconds of the video, you can see me at the podium introducing the speaker, Matthew Rocklin.)

https://www.youtube.com/watch?v=Iq72dt1gO9c

Here is some documentation for Dask:

https://docs.dask.org/en/latest/
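
Dask has no native SAS reader, so one common pattern (a sketch under that assumption, with hypothetical file and column names) is to convert the SAS file to partitioned Parquet once with pandas, and then work on it lazily, in parallel, with Dask:

import os
import pandas as pd
import dask.dataframe as dd

os.makedirs("sasdata", exist_ok=True)

# One-time conversion: stream the SAS file in chunks and write Parquet parts
for i, chunk in enumerate(pd.read_sas("xxxx.sas7bdat", chunksize=1_000_000)):
    chunk.to_parquet(f"sasdata/part{i:04d}.parquet", index=False)

# From then on, Dask reads the Parquet parts lazily and in parallel
ddf = dd.read_parquet("sasdata/part*.parquet")
print(ddf.groupby("some_column")["amount"].mean().compute())  # hypothetical columns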

  • Another suitable package for very large datasets may be vaex: https://vaex.readthedocs.io/en/latest/ – the_economist Oct 27 '20 at 13:10
  • As a follow-up, I have worked with Koalas/PySpark/Hadoop both from within AWS-EMR and within Azure-Databricks. The workflow is (1) set up a Spark cluster, (2) read in the data as a PySpark dataframe, (3) convert the PySpark dataframe to a Koalas dataframe, (4) use Koalas syntax, which tries to mirror Pandas syntax (although I found some idiosyncrasies, such as .agg().sum() when working with Booleans … 95% of what you want to do in Pandas can be done in Koalas); (5) the last step might be to write out to Redshift or some other data warehouse (using psycopg2). – user_stack_overflow Oct 30 '20 at 19:46