Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.

On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:

hashTags.value and comments.hashTags.value

Now, the interesting part comes when I want to search for something, say #architecture. I know where the hashtags are stored, so the simplest thing to do would be to convert a query like #architecture into one like hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I add yet another field that contains hashtags? I'd have to include that too.
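For reference, the rewrite described above can be sketched in plain Java. This is only an illustration of the manual approach, not part of Hibernate Search or Lucene: the class name HashtagQueryExpander and the HASHTAG_FIELDS list are hypothetical, and the output is a query string in Lucene query-parser syntax that would still have to be passed to a parser.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Expands a "#tag" token into an OR over every field assumed to hold hashtags. */
final class HashtagQueryExpander {

    // Hypothetical field list: a new hashtag-bearing field means one more entry here.
    static final List<String> HASHTAG_FIELDS =
            List.of("hashTags.value", "comments.hashTags.value");

    /** Returns a query string such as (f1:tag OR f2:tag); non-hashtags pass through. */
    static String expand(String token, List<String> fields) {
        if (!token.startsWith("#") || token.length() == 1) {
            return token; // not a hashtag: leave it for the normal parser
        }
        String term = token.substring(1); // drop the leading '#'
        return fields.stream()
                .map(field -> field + ":" + term)
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        // prints (hashTags.value:architecture OR comments.hashTags.value:architecture)
        System.out.println(expand("#architecture", HASHTAG_FIELDS));
    }
}
```

The sketch does not escape query-parser special characters in the tag, and the field list still has to be maintained by hand, which is exactly the inflexibility I would like to avoid.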

Is there a general way to do this?

P.S. Please keep in mind that the root type I am searching for is Post, because that is the kind of result I want to get back.

TheBlastOne
Preslav Rachev

2 Answers

Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.

You can do this very easily with Hibernate Search by indexing the same text property into two different fields: declare two @Field entries inside an @Fields annotation. You could have one field named comments and another named commentsHashtags.

You then apply a custom Analyzer to commentsHashtags that does some standard tokenization and discards any term not starting with #. You can define one easily by taking the standard tokenizer and applying a custom filter.
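To make the filtering rule concrete, here is a minimal plain-Java sketch of what such an analysis chain is meant to produce. It is a stand-in for the logic only and does not use the Lucene Analyzer or TokenFilter API; the class and method names are hypothetical, and it assumes whitespace tokenization so that the leading # survives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java stand-in for the analyzer chain: tokenize, then keep only hashtags. */
final class HashtagTokenSketch {

    static List<String> hashtagTerms(String text) {
        List<String> terms = new ArrayList<>();
        // Assumption: split on whitespace so the '#' prefix is still attached.
        for (String token : text.trim().split("\\s+")) {
            // Strip trailing punctuation such as "," or "." glued to the tag.
            String cleaned = token.replaceAll("[^\\p{L}\\p{N}_]+$", "");
            // Discard every term that is not a non-empty hashtag.
            if (cleaned.length() > 1 && cleaned.startsWith("#")) {
                terms.add(cleaned.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [#architecture, #design]
        System.out.println(hashtagTerms("Love this #Architecture, but not #design."));
    }
}
```

In a real setup the same rule would live in a token filter inside the analyzer definition, so that indexing and querying share it.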

When you run a query, you don't have to write custom code to look for hashtags in the query input. Let the input be processed by the same Analyzer (which is the default anyway) and target both fields. You can even boost the hashtags field if that makes sense.

With this solution you

  • take advantage of the high efficiency of Search's text analysis
  • avoid entities and database tables that exist only to hold hashtags, which are useless overhead
  • avoid messing with free text extraction

It also gives you another big win: you can open a raw IndexReader and load the term vectors from commentsHashtags to get both a list of all tags in use and metrics about them. That is handy for data mining, or just for visualizing a tag cloud.
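As a rough illustration of the kind of metric meant here, the sketch below counts tag occurrences across documents in plain Java. It does not read a Lucene index: the input is assumed to be the list of hashtag terms already extracted per document, and the class name TagCloudSketch is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Aggregates per-document hashtag terms into overall tag frequencies. */
final class TagCloudSketch {

    /** Counts how often each tag occurs across all documents, sorted by tag. */
    static Map<String, Integer> tagFrequencies(List<List<String>> termsPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> terms : termsPerDocument) {
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum); // add 1, or start at 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {#java=2, #lucene=1}
        System.out.println(tagFrequencies(
                List.of(List.of("#java", "#lucene"), List.of("#java"))));
    }
}
```

The resulting counts could drive the font sizes of a tag cloud; with the index-based approach the same numbers would come from the stored terms rather than from application code.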

Sanne
  • Not gonna work. Our framework has a little bit of legacy stuff there, so we have to keep it as it is. Any other idea how to treat the multiple fields as one? – Preslav Rachev Oct 09 '12 at 13:51
  • Now I'd like to see the points supporting the "keep the legacy stuff as it is" argument, defeating the points Sanne made...? You cannot have a logical and smooth transition to new features if you don't migrate -or kill- legacy stuff at the same time. – TheBlastOne Oct 15 '12 at 11:11

Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.

Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments. Or you can show them differently in your UI.
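A minimal in-memory sketch of that layout, assuming one flat document per post or comment with a shared "hashtags" field and a "type" discriminator. Nothing here is Lucene or Hibernate Search API; the Doc class, its field names, and the search method are hypothetical and only model the idea.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Models a flat index where posts and comments share one "hashtags" field. */
final class FlatIndexSketch {

    /** One indexed document: a post or a comment. */
    static final class Doc {
        final String id;
        final String type;          // discriminator, e.g. "post" or "comment"
        final Set<String> hashtags; // the single shared hashtag field

        Doc(String id, String type, Set<String> hashtags) {
            this.id = id;
            this.type = type;
            this.hashtags = hashtags;
        }
    }

    /** Finds documents carrying the tag; a null type means "any type". */
    static List<Doc> search(List<Doc> index, String tag, String type) {
        return index.stream()
                .filter(doc -> doc.hashtags.contains(tag))
                .filter(doc -> type == null || doc.type.equals(type))
                .collect(Collectors.toList());
    }
}
```

Calling search(index, "architecture", null) would return matching posts and comments alike, while passing "post" as the type restricts the results to posts.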

If you want to add another concept that also uses hashtags (like ... I dunno... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.

Christopher Schultz