from patsy import *
from pandas import *
dta =  DataFrame([["lo", 1],["hi", 2.4],["lo", 1.2],["lo", 1.4],["very_high",1.8]], columns=["carbs", "score"])
dmatrix("carbs + score", dta)
DesignMatrix with shape (5, 4)
Intercept  carbs[T.lo]  carbs[T.very_high]  score
        1            1                   0    1.0
        1            0                   0    2.4
        1            1                   0    1.2
        1            1                   0    1.4
        1            0                   1    1.8
Terms:
'Intercept' (column 0), 'carbs' (columns 1:3), 'score' (column 3)

Question: instead of specifying the column names by hand using DesignInfo (which makes my code less reusable), can I read the names given by this DesignMatrix so that I can feed them into a DataFrame later, without needing to know beforehand what the reference level (control group) was?

i.e., when I do dmatrix("C(carbs, Treatment(reference='lo')) + score", dta):

# How can I get something like this from dmatrix's output without hardcoding?
# names = <obtained from dmatrix's output above>
# This should give names = ['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score']
g = DataFrame(dmatrix("carbs + score", dta), columns=names)

   Intercept  carbs[T.lo]  carbs[T.very_high]  score
0          1            1                   0    1.0
1          1            0                   0    2.4
2          1            1                   0    1.2
3          1            1                   0    1.4
4          1            0                   1    1.8

type(g)=<class 'pandas.core.frame.DataFrame'>

So g would be the transformed DataFrame that I can run logistic modelling on, without having to keep track of (or hard-code) the column names and their reference levels.

DontDivideByZero
ekta

1 Answer


I think the information you're looking for is in design_info.column_names:

>>> dm = dmatrix("carbs + score", dta)
>>> dm.design_info
DesignInfo(['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score'],
           term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('carbs')]), slice(1, 3, None)), (Term([EvalFactor('score')]), slice(3, 4, None))]),
           builder=<patsy.build.DesignMatrixBuilder at 0xb03f8cc>)
>>> dm.design_info.column_names
['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score']

and so

>>> DataFrame(dm, columns=dm.design_info.column_names)
   Intercept  carbs[T.lo]  carbs[T.very_high]  score
0          1            1                   0    1.0
1          1            0                   0    2.4
2          1            1                   0    1.2
3          1            1                   0    1.4
4          1            0                   1    1.8

[5 rows x 4 columns]
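To make this reusable across formulas, the lookup can be wrapped in a small helper. This is only a sketch: `design_frame` is a made-up name, not part of patsy, and it assumes patsy, pandas, and numpy are installed. As far as I know, `dmatrix` also accepts `return_type="dataframe"`, which gives a labelled DataFrame directly; both routes are shown so they can be compared.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

dta = pd.DataFrame(
    [["lo", 1], ["hi", 2.4], ["lo", 1.2], ["lo", 1.4], ["very_high", 1.8]],
    columns=["carbs", "score"],
)


def design_frame(formula, data):
    """Hypothetical helper: build the design matrix, then label its
    columns with the names patsy itself generated."""
    dm = dmatrix(formula, data)
    return pd.DataFrame(np.asarray(dm), columns=dm.design_info.column_names)


# Default treatment coding: the reference level is picked by patsy.
g = design_frame("carbs + score", dta)
print(list(g.columns))

# Explicit reference level: the names change, but nothing is hard-coded.
g_lo = design_frame("C(carbs, Treatment(reference='lo')) + score", dta)
print(list(g_lo.columns))

# Alternative: ask patsy for a DataFrame directly.
g_direct = dmatrix("carbs + score", dta, return_type="dataframe")
print(list(g_direct.columns))
```

Either way the column names come from the design matrix itself, so the code does not need to know which level ended up as the reference.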
DSM
  • Exactly, yes. Just one more thing: is DesignInfo different from design_info? I have also seen the same with statsmodels' "Logit" and "logit". Does it depend on how we import, "from patsy import *" vs. "from patsy import DesignMatrix, DesignInfo"? – ekta May 09 '14 at 11:48
  • `DesignInfo` is the name of the *class*; here, `design_info` (`dm.design_info`) is the name of an *instance* of that class. Confusingly, it's also the name of a module, `patsy.design_info`. – DSM May 09 '14 at 12:00
  • Would it be the same for Logit and logit as well? With "import statsmodels.formula.api as sm" or "import statsmodels.api as sm", should one use sm.formula.logit(model, data=df).fit() or Logit? What is the best way to understand this? Also, dir(dm.design_info) gives ['__class__', ..., and so does dir(DesignInfo) ['__class__', ..., and so does type(dm.design_info). Maybe this is intuitive, but I don't get it very well. – ekta May 09 '14 at 12:14
  • Unfortunately comment sections aren't good for explaining language basics. You can find out what type an object is by using `type(name_of_object)`, and importing doesn't change the type of what's imported. – DSM May 09 '14 at 12:23
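
For anyone landing here with the same confusion, the class/instance distinction from the comments can be checked directly. A minimal sketch, assuming patsy and pandas are installed; the variable name `info` is just for illustration.

```python
import pandas as pd
import patsy
from patsy import DesignInfo, dmatrix

dta = pd.DataFrame(
    {
        "carbs": ["lo", "hi", "lo", "lo", "very_high"],
        "score": [1.0, 2.4, 1.2, 1.4, 1.8],
    }
)
dm = dmatrix("carbs + score", dta)

# dm.design_info is an attribute holding an *instance*;
# DesignInfo is the *class* that instance was created from.
info = dm.design_info
print(type(info))
print(isinstance(info, DesignInfo))

# How the class is imported does not change what it is:
# both names below refer to the same class object.
print(patsy.DesignInfo is DesignInfo)
```

The same `type(...)` check works on any object, which is what the last comment above is pointing at.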