1

I have a SQL Server 2012 database with a table that contains questions and answers. Simplified structure is like this:

question_id  int 
question   varchar(500)
answer     varchar(50)

I'd like to find word pairs or clusters between the question and answer columns. I'm wondering if there is a way to do this in SQL as opposed to R, etc. I've seen a lot of stuff on document clustering, but I'm wondering where to start just in the db itself at the column level.

So, for example, I want to find the answer to this question: when the question column contains word X, how often is it paired with word Y in the answer column?

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 1
    Just extracting words from a string is a pain in SQL. I would suggest you use another tool for this. If you are intent on using SQL, it would help if you have a table of valid words in the `question` and `answer` columns. – Gordon Linoff Dec 30 '13 at 01:13
  • SQL Server comes with full text indexing. Great for finding a key word in a help desk ticket. However, what you asking is a correlation of words in the question column to the answer column. This sounds like a C# console program to store valid question/answer words being stored as keys (one table) and a key/pair of every combo (another table). Then some analysis after you have the data. – CRAFTY DBA Dec 30 '13 at 01:32
  • r u looking for this "where question like '%X%' and answer like '%Y%'" – KumarHarsh Dec 30 '13 at 03:54
  • @KumarHarsh No, the idea is that I don't know what X and Y are beforehand. I need to discover the relationships/patterns. So, it looks like there is no straightforward way to do this in SQL alone. I'm assuming this could be done in SSAS using MDX or in Python/R, but I don't really know where to begin as far as the model or algorithms. Thanks. – user3144970 Dec 30 '13 at 04:05
  • Does SQL server also support stemming and full-text *joins*? Usually, text search functionality in SQL databases is existant, but largely useless for data mining because it doesn't scale good enough. – Has QUIT--Anony-Mousse Dec 30 '13 at 11:01

1 Answers1

1

For this kind of problems, I recommend looking at the word count map-reduce example.

Some SQL servers by now allow you to execute map-reduce like functions on their tables, and can then optimize this reasonably well. Note that the iterim data can still become really large, so you may need to A) use clever hashing techniques (think of bloom filters) to avoid materializing the initial join at all, or B) maybe actually load the data into a Hadoop cluster and process it there. MapReduce is well capable of scaling this problem to huge data sets (think: all of Wikipedia)

Either way, I don't think SQL will make you happy. String manipulation in SQL is a pain. See e.g. the following related questions:

How do I split a string so I can access item x?

T-SQL: Opposite to string concatenation - how to split string into multiple records

You first need to implement tokenizing/splitting and stemming, stopword removal, and then perform a JOIN on the tokens from question and answer, and then do this for all questions and count the most common pairs.

Community
  • 1
  • 1
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194