
I am sampling from a larger data set to fit a statsmodels GLM and then predict with it.

Depending on the sample, running model.predict omits a small number (<10) of records from the array that it returns. I assume it hits some error while processing those rows and drops them.

For instance, if I predict using rows 15000:20000 (5000 rows), the returned array has length 4994, or 4997, or something similar.

This is a pain because I can't tell which rows are omitted, and I would like to run the .predict function on the entire dataframe and then easily add the prediction values as a new column.

Does someone either (a) know what's going on and how to fix it, or (b) have a good method for adding the prediction values back to the dataframe based on index?

user1893148
  • Do you have NaNs in the data? Are you using formulas with patsy 0.2? There is not enough information in your question. It would be better to discuss this on the pystatsmodels mailing list or the statsmodels issue tracker, after you provide a lot more information or an example that replicates this. I have never heard of a case like this, and my only guess is that some missing value handling kicks in. – Josef Sep 28 '13 at 10:20
  • [I have a similar issue.](http://stackoverflow.com/q/22580477/656912) – orome Mar 22 '14 at 16:49

0 Answers