
I am building something very similar to Google Alerts. If you don't know what that is, consider the following scenario:

  1. Thousands of new text articles and blog posts arrive every day.
  2. Each user has a list of favorite "keywords" that they'd like to subscribe to.
  3. There are a million users with a million keywords.
  4. We scan every article or blog post for every keyword.
  5. We notify each user whose keyword matches.

For one keyword, running a basic full-text search against thousands of articles is easy, but how do I run a full-text search efficiently with a million keywords?

Since I don't have a strong CS background, the only idea I came up with is compiling all the keywords into a single regex or automaton (with something like Google's RE2). Would this work well?
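To make the idea concrete, here is a minimal sketch of what I mean, assuming Python's built-in `re` module as a stand-in (it is a backtracking engine, not an automaton-based one like RE2, so it only illustrates the shape of the approach, not its performance). The function names `compile_keywords` and `find_keywords` are made up for this example.

```python
import re

def compile_keywords(keywords):
    """Combine all keywords into one alternation pattern (hypothetical helper)."""
    # Longest keywords first, so a longer phrase is tried before its prefix.
    ordered = sorted(set(keywords), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def find_keywords(matcher, text):
    """Return the set of keywords found in text, lowercased."""
    # finditer yields non-overlapping matches only, so a keyword that
    # overlaps another match in the text can be missed.
    return {m.group(0).lower() for m in matcher.finditer(text)}

matcher = compile_keywords(["python", "full text search", "regex"])
found = find_keywords(matcher, "Full text search with a Python regex.")
```

Each article is scanned once regardless of how many keywords there are, but I don't know how well a pattern like this behaves with a million alternatives.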

I think I am missing something important here, such as compiling those keywords into some more advanced data structure, since many keywords are alike (e.g. plural forms, simple AND and NOT logic, etc.). Is there any prior theory I need to know before heading into this?

All suggestions are welcome, thanks in advance!

est

1 Answer


I can think of the following: (1) Make sure each search query is really fast; millisecond-level performance is very important. (2) Group the queries that use the same keywords and run a single query for each group.

Since different queries use different keywords and AND/OR operations, I don't see other ways to group them.
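As a rough illustration of the grouping in (2), here is a minimal sketch in Python. It assumes single-keyword subscriptions and plain case-insensitive substring matching in place of a real full-text query; the names `group_by_keyword` and `users_to_notify` and the sample data are made up for this example.

```python
from collections import defaultdict

def group_by_keyword(subscriptions):
    """Map each distinct keyword to the set of users subscribed to it."""
    groups = defaultdict(set)
    for user, keywords in subscriptions.items():
        for kw in keywords:
            groups[kw.lower()].add(user)
    return groups

def users_to_notify(groups, article_text):
    """Run one check per distinct keyword and fan the hits out to users."""
    text = article_text.lower()
    notify = defaultdict(set)
    for kw, users in groups.items():
        # Substring test stands in for the real search query (assumption).
        if kw in text:
            for user in users:
                notify[user].add(kw)
    return notify

# Hypothetical sample data.
subscriptions = {
    "alice": ["Python", "regex"],
    "bob": ["python"],
    "carol": ["golang"],
}
groups = group_by_keyword(subscriptions)
result = users_to_notify(groups, "A new Python release")
```

Here "python" is searched for once even though two users subscribe to it, so the number of queries per article depends on the number of distinct keywords rather than the number of users.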

Chen Li