
My model consists of online courses. Every course has an id number and a title, and can have a varying number of content files (large HTML files). I tried to represent them in Lucene using the following scheme (every line is a document):

  • course: "1", title: "Introduction to Java"
  • course: "1", content: "Chapter 1: basics..."
  • course: "1", content: "Chapter 2: collections..."
  • course: "2", title: "Java networking"
  • course: "2", content: "First part: sockets..."
  • course: "3", title: ...

But now suppose I need to ask Lucene for all the courses (just the id) with "Java" in the title and "collections" in at least one of their content files. A query such as title:java AND content:collections won't work, because the information is split across several documents and no single document has both a title and a content field.

Can somebody suggest an alternative representation or querying technique to address this problem? Note that I can't just join all the contents into a single file and index it in the same document as the title, because some chapters are added after the course has been created.

Thanks in advance.

grieih

1 Answer


I haven't tried it yet, but check out Lucene's index-time and query-time joins: http://lucene.apache.org/core/4_0_0/join/org/apache/lucene/search/join/package-summary.html

Here's a presentation on it: http://www.lucenerevolution.org/sites/default/files/grouping-and-joining_0.pdf.
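
To make the query-time join idea concrete, here is a rough sketch in plain Java. It makes no Lucene calls at all: the class and method names are made up for illustration, the "index" is just an array holding the documents from the question, and matching is a simple case-insensitive substring test rather than real analyzed term matching. The point is only the shape of the technique: run one query per field, collect the course ids each one hits, then keep the ids that appear in both sets. The join package linked above is where the real API for this lives.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: emulates a query-time join on the "course" field.
class CourseJoinSketch {

    // Each row is one indexed document: { course id, field name, field text }.
    // This mirrors the scheme from the question (one field per document).
    static final String[][] INDEX = {
        {"1", "title",   "Introduction to Java"},
        {"1", "content", "Chapter 1: basics..."},
        {"1", "content", "Chapter 2: collections..."},
        {"2", "title",   "Java networking"},
        {"2", "content", "First part: sockets..."},
    };

    // Pass 1 of the join: course ids of every document whose given field contains the term.
    // Substring matching is a simplification of what an analyzer and term query would do.
    static Set<String> courseIds(String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Set<String> ids = new TreeSet<>();
        for (String[] doc : INDEX) {
            if (doc[1].equals(field) && doc[2].toLowerCase(Locale.ROOT).contains(needle)) {
                ids.add(doc[0]);
            }
        }
        return ids;
    }

    // Pass 2: keep only the courses that matched on the title AND on some content document.
    static Set<String> join(String titleTerm, String contentTerm) {
        Set<String> result = courseIds("title", titleTerm);
        result.retainAll(courseIds("content", contentTerm));
        return result;
    }

    public static void main(String[] args) {
        // "Java" is in the titles of courses 1 and 2; "collections" is only in course 1's content.
        System.out.println(join("java", "collections")); // prints [1]
    }
}
```

Because the join happens at query time on the shared course id, chapters added later only need to be indexed as new content documents; nothing about the existing title document has to change.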

Jeff French