5

I'm working on a data set (PSID) that gives data in a SAS format (a .txt and another file containing instructions to interpret the data). I cannot find anything in Python to read this type of data.

Does anyone know of a pre-existing module/script to read SAS data?

Edit (added from a comment to an answer): The data is in ascii/text and the start of a row of data looks like this:

3 10 1015000 150013200 00 002500 00 00

JohnE
  • 29,156
  • 8
  • 79
  • 109
pdevar
  • 323
  • 2
  • 3
  • 11
  • 1
    A txt file is not a SAS data set. Open the file and examine it, it's most likely CSV or Tab delimited. – Reeza Jun 18 '15 at 11:18
  • @Reeza If I remember correctly, PSID data files are provided as fixed-format text files, with a SAS data step or Stata dct file being necessary to read the data and assign the variables and formats. Hence the apparent need to read data into SAS then import into Python. – DWal Jun 18 '15 at 12:29
  • @DWal Seems to me you could just read it into Python directly instead of going via SAS? SAS data step inputting a text file is extremely easy to parse. – Joe Jun 18 '15 at 16:42
  • @Joe Depending on how many of the thousands of variables OP selected to download, it's probably not something you'd want to do by hand. Not knowing the syntax to read fixed-format data in Python, I don't know how difficult it would be to convert the data step code programmatically. I'm sure it's much easier to go through SAS or Stata if they are available. – DWal Jun 18 '15 at 18:06
  • By the way, I checked the PSID website, and they also give the option of downloading a dBase data file if you've selected fewer than 1024 variables. I bet you could read a dBase file directly into Python, eliminating the need to use SAS or Stata to read the ASCII files. – DWal Jun 18 '15 at 18:12
  • @JohnE Unfortunately that is not true. PSID does not provide a Stata data file (.dta). It provides a fixed-format ASCII file with the Stata code (.do) to read the ASCII file into a .dta file. Same goes for SAS and SPSS. You would see this if you went all the way to the download page. – DWal Jun 19 '15 at 01:19
  • @DWal -- sorry, my bad. From my quick read I thought PSID was actually providing a variety of formats as other places often do. So if PSID is only provided in text/ascii, the best approach from python/pandas may be `read_csv` or `read_table` though it's hard to say without seeing exactly how PSID outputs the data. Also, it is not uncommon for 3rd parties to provide SAS or stata data sets of publicly available data, though I couldn't find that for PSID. I'd be surprised if if it weren't hosted somewhere though, unless there's a specific reason it can't be done (like the PSID folks prohibit it) – JohnE Jun 19 '15 at 01:31
  • @DWal Indeed, your first comment was right about the two files. I'm actually working on converting an R package, psidR which builds panels from the raw psid data sets, to Python. – pdevar Jun 19 '15 at 12:14
  • @pdevar I don't use R, but I believe R and pandas can share data via HDF, so you could also read into R, save as HDF, and then read that into pandas without any need for translating code from R to pandas. – JohnE Jun 19 '15 at 14:52
  • @JohnE I think that would be the most practicle solution, and I would prefer to use HDF, but I wanted to write a general tool that is entirely in Python. – pdevar Jun 19 '15 at 17:14

7 Answers7

6

As of version 17, Pandas now supports reading sas files with the .xpt file extension. See this link to pandas documentation for additional details.

df = pd.read_sas('sas_xport.xpt')
josiah
  • 139
  • 1
  • 8
  • Nice, but this doesn't solve the problem for the .sas and .txt data and key file type. Sadly, there are a lot of public institutions that still offer this data type (and nothing else). – pdevar Dec 03 '15 at 15:18
  • Thanks for the clarification. I wasn't aware of this nuance. Good to know. – josiah Aug 21 '17 at 19:54
1

When you have the option to download a SAS dataset you will often also have the option to download a Stata dataset (this is indeed the case for PSID btw). In that case, the easiest way will likely be to import with read_stata (this might change in the future, but I believe is a very accurate statement as of today).

Less convenient, but almost always an option, is to download a text file (usually referred to as text, ascii, or csv). Those tend to come in two flavors: delimited (with comma or tab), or space separated (columnar or tabulated). If the file is comma or tab delimited, use read_csv and set the delimiter as appropriate. If it's space delimited or tabular, you might have good luck with read_csv, or you might be better off with read_fwf or read_table. Depends a bit on the variable types and formatting.

From what I have read, sas7bdat mentioned by @hd1 seems to work well but is not part of pandas yet. For that reason, I tend to default to read_stata or read_csv but hopefully sas7bdat also works well and perhaps will be brought into pandas in the future. Also, I'm wondering about the speed of sas7bdat. read_csv has been pretty fast for a long time and read_stata is very fast in the latest versions (since 15.0, I believe). I'm not sure about the speed of sas7bdat?

JohnE
  • 29,156
  • 8
  • 79
  • 109
  • I'm not sure you can download stata datasets for the psid. The main interview files come in .zip and contain ascii and a .do file to interpret it, but no stata dataset. In terms of using the sas7bdat, as far as I know that works only for .sas7bdat, but maybe someone can correct me? – pdevar Jun 19 '15 at 12:19
1

So I've written a package that can import the data. It can be found here:

https://pypi.python.org/pypi/psid_py

It's my first package, so sorry for the sloppy work. Additionally, it's only been tested against the PSID data sets and I'm positive that there are mistakes for other SAS formats. However, something's better than nothing.

Beyond reading in sas data, it will also build a panel data for you, in case you're into that kind of thing.

pdevar
  • 323
  • 2
  • 3
  • 11
1

I know it's a very late response but I think my answer is going to be useful for future readers. Few months back when I had to read and process SAS data either SAS7BDAT or xpt format SAS data, I was looking for different libraries and packages available to read these datasets, among them, I shortlisted the libraries as follows:

  1. pandas (It was on high priority list due to community support and performance)
  2. SAS7BDAT (Is able to read SAS7BDAT files only, and last release July 2019)
  3. pyreadstat (Promising performance as per the documentation plus ability to read meta data)

Before picking up any package, I did some performance benchmarking, I found pyreadstat to be faster than pandas, (seems like it's using multiprocessing while reading the data as mentioned in the documentation but I'm not exactly sure), and also the memory consumption and the footprint was much lesser while using pyreadstat in comparison to pandas, plus it is able to read the metadata, and even allows to read the metadeta only, so I finally ended up picking pyreadstat.

The data read using pyreadstat is already in the form of dataframe, so it doesn't need some manual conversion to pandas dataframe.

import pyreadstat

#for SAS7BDAT files
df, _ = pyreadstat.read_sas7bdat('some_file.sas7bdat')

#for xpt files
df, _ = pyreadstat.read_xpt('some_file.xport')

Here is some benchmark (Time to read the file to a dataframe) performed on real data (Raw and Standardized) for CDISC, the file size ranges from some KB to some MB, and includes both xpt and sas7bdat file formats:

Reading ADAE.xpt 49.06 KB for 100 loops:
    Pandas Average time : 0.02232 seconds
    Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
    Pandas Average time : 0.01610 seconds
    Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
    Pandas Average time : 0.03248 seconds
    Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
    Pandas Average time : 5.30275 seconds
    Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
    Pandas Average time : 15.53942 seconds
    Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04017 seconds
    Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01052 seconds
    Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04446 seconds
    Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01273 seconds
    Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
    Pandas Average time : 7.93266 seconds
    Pyreadstat Average time : 0.92295 seconds

As you can see, for xpt files, the time to read the files isn't better, but for sas7bdat files, pyreadstat just outperforms pandas.

The above benchmark was performed on pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.

ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
  • Good comment, but a little off topic for this question. This question is about how to parse SAS code that reads a text file with fixed length fields and convert it to python code to read the same fixed length text file. – Tom Jun 10 '21 at 04:08
  • @Tom, yes, you are right, but since the title is **Import SAS data file into python data frame** many people will land here to know how to read SAS data in python, so I thought of adding this answer. – ThePyGuy Jun 10 '21 at 04:11
  • Pyreadstat does not use multiprocessing by default, you can take advantage of it by using pyreadstat.read_file_multiprocessing, it is explained in the Readme (and you can get extra 3x speedup). Multiprocessing works for sas7bdat but not for xpt. The better performance compared to pandas comes from the fact that it uses a very fast C library in the background. – Otto Fajardo Jun 22 '21 at 18:59
0

Look at the SAS code that you get. You should see that it is very consistently formatted so that you can parse out the variable names and the columns to read for those variables. For example in this paper https://psidonline.isr.umich.edu/Guide/FileStructure.pdf you can see that the INPUT statement is of the form:

INPUT
    ER30001 2 - 5
    ER30002 6 - 8
    ER30642 1528 - 1532
    ER30643 1533 - 1534
...
;

So just read the SAS program and generate the appropriate Python to read the text file using the same variable names.

Tom
  • 47,574
  • 2
  • 16
  • 29
0

The data is in a fixed with table. Fixed with means that for example the 3rd value starts in every row at the 15th letter and goes until the 114 letter.

1.Open your SAS or SPSS input statement. In case of SAS you will find something like this:

@1  ANO_CENSO   5.  /*Ano do Censo*/
@6  PK_COD_ENTIDADE 9.  /*Código da Escola*/
@15 NO_ENTIDADE $Char100.   /*Nome da Escola*/
@115    COD_ORGAO_REGIONAL_INEP $Char5. /*Código do Órgão Regional de Ensino*/

the left number is the column (in number of letters, where the according value)

2.In the script below, fill in the columns, first is the column name second the tuple is the first and the column of the first and the last letter/number of the variable. NOTE THAT SAS starts counting at 1 and python at 0.

import pandas as pd

columns=(
('ANO', (0, 5)),
('CODE', (5, 14)),
('DESC_SITUACAO_FUNCIONAMENTO', (119, 134)),
('FK_COD_ESTADO', (176, 178)),
('SIGLA', (178, 181)),
)

df = pd.read_fwf('TS_ESCOLA.TXT', names=zip(*columns)[0], colspecs=zip(*columns)[1], header=None)

// pd.read_fwf is the fixed with reader of pandas.

Davoud Taghawi-Nejad
  • 16,142
  • 12
  • 62
  • 82
  • That is great, but the problem is the number of variables. With hundreds of variables whose names change from year to year, it is impractical to do this. – pdevar Dec 03 '15 at 15:16
0

I know this is an old post however just wanted to provide an efficient method if someone reaches this page through google.

we can use pyreadstat to do the same (can get data as well as metadata).

import pyreadstat
df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')

or save the pandas frame using following code

pyreadstat.write_xport(df, "path/to/destination.xpt", file_label="test", column_labels=column_labels)
Arun
  • 421
  • 3
  • 6