Using patsy, I noticed that it sometimes names dummy variables with T and sometimes without. Today I realised that T is attached when the constant term is present in the regression equation, and omitted when the constant term is absent. For example, compare z[T.1] with z[0] and z[1], marked by OUTPUT in the following code.
import pandas as pd
import patsy
data = {'z': ['1', '0', '0'],
        'y': [150, 200, 50],
        'x': [200, 210, 90]}
df = pd.DataFrame(data)
# with constant -----------------------
form_const = 'y ~ x + z'
y_const, X_const = patsy.dmatrices(form_const, df, return_type='dataframe')
print(X_const.columns.tolist())
# ['Intercept', 'z[T.1]', 'x'] <- OUTPUT
# withOUT constant --------------------
form_no_const = 'y ~ -1 + x + z'
y_no_const, X_no_const = patsy.dmatrices(form_no_const, df, return_type='dataframe')
print(X_no_const.columns.tolist())
# ['z[0]', 'z[1]', 'x'] <- OUTPUT
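To compare what the columns actually contain, and not only their names, the two design matrices can be printed side by side. This is a minimal sketch using the same data as above; it uses patsy.dmatrix to build only the right-hand side, and the names X_with and X_without are mine, chosen for illustration.

```python
import pandas as pd
import patsy

# Same toy data as in the question
df = pd.DataFrame({'z': ['1', '0', '0'],
                   'y': [150, 200, 50],
                   'x': [200, 210, 90]})

# Right-hand-side design matrices, with and without the constant term
X_with = patsy.dmatrix('x + z', df, return_type='dataframe')
X_without = patsy.dmatrix('-1 + x + z', df, return_type='dataframe')

print(X_with)     # one column for z, plus Intercept and x
print(X_without)  # one column per level of z, plus x
```

With the constant there is a single column for z, while without it there is one column for each level of z.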
Questions
What is the role of T? Does it just indicate the presence of the constant term? If so, isn't it redundant, given that we can always see whether the constant term is present? Does it have any other role?
Any insight would be appreciated.