I am trying to read a 34 GB Stata file but am getting an error. To make sure the code itself was correct, I first tried the same code on an 11 MB file.

The code is:

import pyreadstat

dtafile = 'E:/Work/test file.dta'
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_dta, dtafile, chunksize=5, limit=1)

for df, meta in reader:
    print(df)

And I got the correct output:

   app_id    inventor_id  ... lagged_generality_FYnormalized  _merge
0  101985                 ...                       1.038381       3
1  102019  SCHOTTEK 2827  ...                       0.830110       3
2  102019  KUELLMER 2827  ...                       0.830110       3
3  102019   DICKNER 2827  ...                       0.830110       3
4  102562    VINEGAR 986  ...                       0.825088       3

[5 rows x 1448 columns]

Process finished with exit code 0

But when I do the same thing with the 34 GB file, I get the following error:

Traceback (most recent call last):
  File "C:\Users\Gaju\PycharmProjects\first project\work.py", line 77, in <module>
    for df,meta in reader:
  File "pyreadstat\pyreadstat.pyx", line 661, in read_file_in_chunks
  File "pyreadstat\pyreadstat.pyx", line 276, in pyreadstat.pyreadstat.read_dta
  File "pyreadstat\_readstat_parser.pyx", line 1080, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat\_readstat_parser.pyx", line 864, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat\_readstat_parser.pyx", line 794, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Invalid file, or file has unsupported features

Process finished with exit code 1

I know that both files (the test file and the 34 GB file) are similar and were made in Stata, but I am still unable to understand what is going wrong.
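One diagnostic I can think of (a sketch based on pyreadstat's documented `metadataonly` option; I have not run it on the large file yet) is to parse only the file header, without loading any data rows. If even this fails, the header itself is what ReadStat cannot parse; if it succeeds, the problem is somewhere in the data section:

import pyreadstat

dtafile = 'E:/Work/test file.dta'  # replace with the path to the 34 GB file

# Parse only the header/metadata; no data rows are read,
# so this is fast even on a very large file.
df, meta = pyreadstat.read_dta(dtafile, metadataonly=True)
print(meta.number_rows, meta.number_columns)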

  • The large file you're trying to read is either corrupt, or it requires features `pyreadstat` doesn't support. – AKX Oct 12 '22 at 12:42
  • Is there any way to read the corrupt file? Maybe we can recover some of the data? – Gaju_masare Oct 12 '22 at 17:53
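On the question of partial recovery, here is a salvage sketch. It assumes the header parses and the corruption sits somewhere later in the file; `row_offset` and `row_limit` are documented parameters of `read_dta`, and `ReadstatError` is the exception class shown in the traceback above. It reads fixed-size slices of rows until the first unreadable region, keeping whatever was recovered:

import pyreadstat

dtafile = 'E:/Work/test file.dta'  # the problematic file
chunksize = 100000  # rows per attempt; tune to available memory
offset = 0
recovered = []

while True:
    try:
        # Read one slice of rows; raises ReadstatError when ReadStat
        # hits a region of the file it cannot parse.
        df, meta = pyreadstat.read_dta(dtafile, row_offset=offset, row_limit=chunksize)
    except pyreadstat._readstat_parser.ReadstatError:
        break  # stop at the first unreadable region
    if len(df) == 0:
        break  # reached the end of the file
    recovered.append(df)
    offset += chunksize

print(sum(len(df) for df in recovered), 'rows recovered')

Note that if the header itself is invalid, which the traceback above suggests (it fails on the very first chunk), nothing can be recovered this way.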

0 Answers