I know it's a very late response but I think my answer is going to be useful for future readers. Few months back when I had to read and process SAS
data either SAS7BDAT
or xpt
format SAS
data, I was looking for different libraries and packages available to read these datasets, among them, I shortlisted the libraries as follows:
pandas
(It was on high priority list due to community support and
performance)
SAS7BDAT
(Is able to read SAS7BDAT
files only, and last release July 2019)
pyreadstat
(Promising performance as per the documentation plus ability to read meta data)
Before picking up any package, I did some performance benchmarking, although I don't have benchmark result at the time of posting this answer, I found pyreadstat
to be faster than pandas
, (seems like it's using multiprocessing while reading the data as mentioned in the documentation but I'm not exactly sure), and also the memory consumption and the footprint was much lesser while using pyreadstat
in comparison to pandas
, plus it is able to read the metadata, and even allows to read the metadeta only, so I finally ended up picking pyreadstat
.
The data read using pyreadstat
is also in the form of dataframe, so it doesn't need some manual conversion to pandas dataframe.
Talking about reading large SAS
data, pyreadstat
has row_limit
and offset
parameters which can be used to read in chunk, so the Memory is not going to be a bottleneck, furthermore, while reading the SAS
data in chunk, you can convert each chunk to categorical and append it to the resulting data before reading another chunk; it will compress the data size so the Memory consumption is extremely low (depends on the data, the lesser the number of unique values in the dataframe, is lesser the memory usage). The following code snippet might be useful for someone who is willing to read large SAS
data:
import pandas as pd
import pyreadstat
filename = 'foo.SAS7BDAT'
CHUNKSIZE = 50000
offset = 0
# Get the function object in a variable getChunk
if filename.lower().endswith('sas7bdat'):
getChunk = pyreadstat.read_sas7bdat
else:
getChunk = pyreadstat.read_xport
allChunk,_ = getChunk(filename, row_limit=CHUNKSIZE, row_offset=offset)
allChunk = allChunk.astype('category')
while True:
offset += CHUNKSIZE
# for xpt data, use pyreadstat.read_xpt()
chunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
if chunk.empty: break # if chunk is empty, it means the entire data has been read, so break
for eachCol in chunk: #converting each column to categorical
colUnion = pd.api.types.union_categoricals([allChunk[eachCol], chunk[eachCol]])
allChunk[eachCol] = pd.Categorical(allChunk[eachCol], categories=colUnion.categories)
chunk[eachCol] = pd.Categorical(chunk[eachCol], categories=colUnion.categories)
allChunk = pd.concat([allChunk, chunk]) #Append each chunk to the resulting dataframe
PS: Please be noted that the resulting dataframe allChunk
is going to have all column as Categorical
data
Here is some benchmark (Time to read the file to a dataframe) performed on real data (Raw and Standardized) for CDISC, the file size ranges from some KB to some MB, and includes both xpt and sas7bdat file formats:
Reading ADAE.xpt 49.06 KB for 100 loops:
Pandas Average time : 0.02232 seconds
Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
Pandas Average time : 0.01610 seconds
Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
Pandas Average time : 0.03248 seconds
Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
Pandas Average time : 5.30275 seconds
Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
Pandas Average time : 15.53942 seconds
Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04017 seconds
Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01052 seconds
Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04446 seconds
Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01273 seconds
Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
Pandas Average time : 7.93266 seconds
Pyreadstat Average time : 0.92295 seconds
As you can see, for xpt files, the time to read the files isn't better, but for sas7bdat files, pyreadstat just outperforms pandas.
The above benchmark was performed on pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.