I'm designing some desktop software, and I've always wanted to implement an intuitive search function. For example, I need an algorithm that parses search queries like "next monday between 2 and 3pm" or "anytime after 2 on friday", or even "how do I use ". The wording can vary widely while asking for the same thing, which is what gets me.

Should I be tokenizing the query (which I'm doing so far), or should I treat the string as a whole pattern and compare to a library of some sort?

I'm not sure whether SO is the right place for this, so please point me elsewhere if necessary. Basically, I'd just like some advice on the approach I should take.

Thanks.

rtheunissen

2 Answers

The question "Temporal Extraction (i.e. Extract date/time entities from free form text) - How?" might give you some pointers.

"Entity extraction" is the process of extracting human-recognizable entities (names, places, dates, etc.) from unstructured text. That article deals specifically with temporal entities, but reading up on entity extraction in general is a good place to start.
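To make the idea concrete, here is a rough sketch of temporal entity extraction for the two example queries from the question. It is not a real extraction library, just a few regular expressions. The names `extract_temporal`, `DAY_RE`, `BETWEEN_RE` and `AFTER_RE` are made up for illustration, and the patterns assume English input only.

```python
import re

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")

# A time is an hour with an optional am/pm suffix, e.g. "2" or "3pm".
TIME = r"(\d{1,2})\s*(am|pm)?"
DAY_RE = re.compile(r"\b(next\s+)?(" + "|".join(WEEKDAYS) + r")\b")
BETWEEN_RE = re.compile(r"\bbetween\s+" + TIME + r"\s+and\s+" + TIME + r"\b")
AFTER_RE = re.compile(r"\bafter\s+" + TIME + r"\b")


def extract_temporal(query):
    """Return a dict of the temporal entities found in an English query."""
    text = query.lower()
    found = {}

    day = DAY_RE.search(text)
    if day:
        found["weekday"] = day.group(2)
        found["next"] = day.group(1) is not None

    between = BETWEEN_RE.search(text)
    if between:
        start_hour, start_suffix, end_hour, end_suffix = between.groups()
        # "between 2 and 3pm": a missing suffix is borrowed from the other bound.
        found["start"] = (int(start_hour), start_suffix or end_suffix)
        found["end"] = (int(end_hour), end_suffix or start_suffix)

    after = AFTER_RE.search(text)
    if after:
        # The suffix stays None when the query does not say am or pm.
        found["start"] = (int(after.group(1)), after.group(2))

    return found


print(extract_temporal("next monday between 2 and 3pm"))
print(extract_temporal("anytime after 2 on friday"))
```

Each entity is pulled out independently of where it sits in the sentence, which is why word order and filler words like "anytime" do not matter here. A query with no temporal entities simply yields an empty dict.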

Entity extraction has to be done per-language though, so expect difficulty when you're trying to internationalize your product to other locales. For Google Calendar, we spent a lot of time on temporal entity extraction and on expressing recurrence relations in human-readable form ("every last Friday in November"), and each of the 40 locales that we operate in has its own quirks.

Mike Samuel

If you are planning to use a predefined grammar, you should consider using a state machine. For example, the Ragel State Machine Compiler lets you define a state machine with simple regular expressions and generates the actual source code for various target languages.
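As a rough illustration of the state machine idea, here is a hand-written transition table in Python. This is not Ragel and not Ragel output; Ragel would generate comparable code from regular expression definitions. The tiny grammar (`between TIME and TIME` or `after TIME`, optionally followed by `on DAY`) and the names `classify`, `TRANSITIONS` and `accepts` are assumptions made up for this sketch.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}


def classify(token):
    """Map a raw token to the symbol class the state machine switches on."""
    if re.fullmatch(r"\d{1,2}(am|pm)?", token):
        return "TIME"
    if token in DAYS:
        return "DAY"
    return token  # keywords such as "between", "and", "after", "on"


# (state, symbol) -> next state. Any missing pair means the input is rejected.
TRANSITIONS = {
    ("start", "between"): "want_from",
    ("want_from", "TIME"): "want_and",
    ("want_and", "and"): "want_to",
    ("want_to", "TIME"): "done",
    ("start", "after"): "want_after",
    ("want_after", "TIME"): "done",
    ("done", "on"): "want_day",
    ("want_day", "DAY"): "done_day",
}
ACCEPTING = {"done", "done_day"}


def accepts(query):
    """Return True if the whole query matches the predefined grammar."""
    state = "start"
    for token in query.lower().split():
        state = TRANSITIONS.get((state, classify(token)))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("between 2 and 3pm"))
print(accepts("after 2 on friday"))
print(accepts("anytime after 2 on friday"))
```

Note that the last query is rejected because "anytime" is not part of the grammar. That is the trade-off of a strict predefined grammar: every accepted phrasing has to be spelled out in the transition table.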

Here is a simple parser that I wrote to get all table names from an SQL select query. You could do something similar (https://gist.github.com/1524986).

umjasnik