I have a dataframe with 7 variables:

         RACA   pca   pp  pcx  psc     lp     csc
0     BARBUDA  1915  470  150  140  87.65   91.41
1     BARBUDA  1345  305  100  110  79.32   98.28
2     BARBUDA  1185  295   80   85  62.19   83.12
3     BARBUDA  1755  385  120  130  80.65   90.01
4     BARBUDA  1570  325  120  120  77.96   87.99
5    CANELUDA  1640  365  110  115  81.38   87.26
6    CANELUDA  1960  525  135  145  89.21   99.37
7    CANELUDA  1715  410  100  120  79.35   99.84
8    CANELUDA  1615  380  100  110  76.32   99.27
9    CANELUDA  2230  500  165  160  90.22   99.56
10   CANELUDA  1570  400  105   95  85.24   83.95
11  COMERCIAL  1815  380  145   90  73.32   92.81
12  COMERCIAL  2475  345  180  140  71.77  105.64
13  COMERCIAL  1870  295  125  125  72.36   97.89
14  COMERCIAL  2435  565  185  160  73.24  107.39
15  COMERCIAL  1705  315  115  125  72.03   96.11
16  COMERCIAL  2220  495  165  150  87.63   96.89
17     PELOCO  1145  250   75   85  50.57   77.90
18     PELOCO   705   85   55   50  38.26   78.09
19     PELOCO  1140  195   80   75  66.15   96.35
20     PELOCO  1355  250   90   95  50.60   91.39
21     PELOCO  1095  220   80   80  53.03   84.57
22     PELOCO  1580  255  125  120  59.30   95.57

I want to fit a GLM for each dependent variable (columns pca through csc), with RACA as the predictor. In R this is quite simple, but I don't know how to get it working in Python. I tried writing a for loop that passes the column name into the formula, but so far it hasn't worked:

for column in df:
    col = str(column)
    model = sm.formula.glm(paste(col,"~ RACA"), data=df).fit()
    print(model.summary())

I am using pandas and statsmodels:

import pandas as pd
import statsmodels.api as sm

I imagine it must be simple, but I honestly haven't been able to figure it out yet.

Paulo Barros
  • Is there a particular reason you're fitting a GLM to each column? What are you trying to accomplish? – blacksite May 14 '20 at 21:08
  • In R, when dealing with unbalanced data, I tend to use GLM instead of ANOVA. I am trying to learn how to do statistics in Python, so this is more of an exercise than a practical application. – Paulo Barros May 15 '20 at 00:35

1 Answer

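I was able to figure out a solution. I don't know if it's the most efficient or elegant one, but it gives the results I wanted. The formula has to be built with ordinary Python string concatenation (paste is an R function), and the loop should cover only the response columns, since iterating over all of df also includes RACA itself: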

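# Loop over the response columns only (pca through csc), skipping RACA
for column in df.loc[:,'pca':'csc']:
    col = str(column)
    # Build the formula string, e.g. "pca~RACA"
    formula = col + "~RACA"
    # With no family given, glm defaults to Gaussian with an identity link
    model = sm.formula.glm(formula = formula, data=df).fit()
    print(model.summary())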

I am open to suggestions on how I could improve this. Thank you!
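One possible refinement, offered as a sketch rather than a definitive answer: keep the fitted models in a dictionary so they can be inspected after the loop instead of only printed. The helper name `fit_glm_per_column` and the variables `models` and `coef_table` are hypothetical names chosen for illustration; the data is copied from the question, and the default Gaussian family is assumed to be what is wanted.

```python
import io

import pandas as pd
import statsmodels.formula.api as smf

# Data copied from the question (whitespace-separated).
raw = """RACA pca pp pcx psc lp csc
BARBUDA 1915 470 150 140 87.65 91.41
BARBUDA 1345 305 100 110 79.32 98.28
BARBUDA 1185 295 80 85 62.19 83.12
BARBUDA 1755 385 120 130 80.65 90.01
BARBUDA 1570 325 120 120 77.96 87.99
CANELUDA 1640 365 110 115 81.38 87.26
CANELUDA 1960 525 135 145 89.21 99.37
CANELUDA 1715 410 100 120 79.35 99.84
CANELUDA 1615 380 100 110 76.32 99.27
CANELUDA 2230 500 165 160 90.22 99.56
CANELUDA 1570 400 105 95 85.24 83.95
COMERCIAL 1815 380 145 90 73.32 92.81
COMERCIAL 2475 345 180 140 71.77 105.64
COMERCIAL 1870 295 125 125 72.36 97.89
COMERCIAL 2435 565 185 160 73.24 107.39
COMERCIAL 1705 315 115 125 72.03 96.11
COMERCIAL 2220 495 165 150 87.63 96.89
PELOCO 1145 250 75 85 50.57 77.90
PELOCO 705 85 55 50 38.26 78.09
PELOCO 1140 195 80 75 66.15 96.35
PELOCO 1355 250 90 95 50.60 91.39
PELOCO 1095 220 80 80 53.03 84.57
PELOCO 1580 255 125 120 59.30 95.57
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")


def fit_glm_per_column(data, responses, predictor="RACA"):
    """Fit one GLM per response column and return the results keyed by column name.

    Hypothetical helper: no family is passed, so statsmodels uses its
    default (Gaussian with an identity link).
    """
    models = {}
    for response in responses:
        formula = f"{response} ~ {predictor}"  # e.g. "pca ~ RACA"
        models[response] = smf.glm(formula=formula, data=data).fit()
    return models


# Response columns are everything from pca through csc (label-based slice).
responses = df.loc[:, "pca":"csc"].columns
models = fit_glm_per_column(df, responses)

# One row per response, one column per coefficient, for a quick comparison.
coef_table = pd.DataFrame({name: m.params for name, m in models.items()}).T
print(coef_table.round(2))
```

A single model is still available for the full report, for example `print(models["pca"].summary())`. Because RACA is a string column, the formula interface treats it as categorical and uses the first level (BARBUDA) as the reference, so each intercept is the BARBUDA group mean for that response.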

Paulo Barros