0

I'm trying to come up with an easy way for analysts at my office to invoke a stemmer from MSSQL. It would be used to generate stemmed notes fields for two purposes:

  1. to create training sets where the most common stemmed notes fields are counted and aggregated, then classified starting with the most frequently appearing first. (For our application we're going to have a lot of commonly repeating phrases and a lot of one off stuff).

  2. to generate derived stemmed versions of the notes fields to run classification models on.

I'm thinking it would be easy for everyone if I were able to create a SQL function called maybe StemText that took a text field and returned the stemmed version. Then I could create another function which invoked a specified classification model name in R (in this case on the stemmed notes field).

We currently have a bunch of stemmed training set data (and successful classification systems) built around the Snowball/Porter2 stemmer. I saw that there was a T-SQL port of the Porter1 stemmer, but not Porter2. In any case, I see that there are a number of packages out there which include the Snowball algorithm like SOLR/Lucene and so forth, but I'm not seeing much info on how to invoke them from T-SQL. I also see that there is a snowball add on package for R and there is a lot more info out there for interfacing with R from MSSQL.

Does anyone have any thoughts or advice on this approach?

user2027080
  • 115
  • 1
  • 10

1 Answers1

1

I ended up creating a CLR function using a Visual Studio SQL Project that wrapped and augmented one of the official CSharp ports of the Porter2 stemmer. I runs quite fast and was pretty easy to setup.

user2027080
  • 115
  • 1
  • 10