I'm upgrading a LogisticRegression text classifier from single-word features to bigrams (two-word features). However, when I include a two-word feature in the formula passed to patsy.dmatrices, I get the following error:

y, X = dmatrices("is_host ~ dedicated + hosting + dedicated hosting", df, return_type="dataframe")

  File "<string>", line 1
    dedicated hosting
                ^
SyntaxError: unexpected EOF while parsing

I've looked around online for examples of how to approach this and haven't found anything. I tried a few different syntax options in the formula, and none of them work:

"is_host ~ dedicated + hosting + {dedicated hosting}"
"is_host ~ dedicated + hosting + (dedicated hosting)"
"is_host ~ dedicated + hosting + [dedicated hosting]"

What is the proper way to include multi-word features in the formula passed to dmatrices?

digitaldavenyc

1 Answer

You want:

y, X = dmatrices("is_host ~ dedicated + hosting + Q('dedicated hosting')", df, return_type="dataframe")

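Q is short for "quote". It lets a formula refer to a column whose name is not a valid Python identifier, such as one containing a space.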
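For a quick check, here is a minimal, self-contained sketch of the same call. The DataFrame and its values are made up purely for illustration; only the column names come from the question.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data: per-document term counts plus the label.
# The values are invented; only the column names match the question.
df = pd.DataFrame({
    "is_host": [1, 0, 1, 0],
    "dedicated": [2, 0, 1, 1],
    "hosting": [3, 1, 1, 0],
    "dedicated hosting": [2, 0, 1, 0],  # bigram column, name contains a space
})

# Q('...') looks the column up by its literal name, so the space is no
# longer parsed as Python syntax inside the formula.
y, X = dmatrices(
    "is_host ~ dedicated + hosting + Q('dedicated hosting')",
    df,
    return_type="dataframe",
)

print(list(X.columns))
print(X.shape)
```

If you would rather avoid Q entirely, renaming the column first, for example with df.rename(columns={"dedicated hosting": "dedicated_hosting"}), also works, since the new name is a valid identifier.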

Nathaniel J. Smith