The data I need to index is structured like this:

{
  "EmailId": "1",  //should be stored
  "EmailText": "hello world",
  "Attachments": [
    {
      "AttachmentId": "1",  //should be stored
      "FileName": "hello.txt",  //should be stored
      "AttachmentText": "this is first attachment text"
    },
    {
      "AttachmentId": "2",
      "FileName": "welcome.xlsx",
      "AttachmentText": "this is second attachment text"
    }
  ]
}

I could maintain a separate index for email body and attachment text, but is there any way we could do a multilevel indexing like above to maintain a single index? I should be able to search a keyword in the AttachmentText and get back the AttachmentId and EmailId.

I am using Lucene.Net but if there is any solution in Lucene Java then it is absolutely fine.

Thank you in advance.

Mohit
  • What's stopping you from flattening the data to be indexed when you create your Lucene documents? `doc1` contains `EmailId` = `1`, `AttachmentId` = `1`, `AttachmentText` = `this is first attachment text`. And then `doc2` contains `EmailId` = `1`, `AttachmentId` = `2`, `AttachmentText` = `this is second attachment text`... and so on. Depending on _all_ of the types of searches you want to perform, there may be other ways to flatten the data, also. – andrewJames Apr 28 '23 at 12:07
  • Flattening the data solved my problem, but there is a chance that duplicate EmailId values will be returned when querying the index. Is there any way to avoid duplicate IDs within Lucene? – Mohit May 01 '23 at 07:13
  • De-duplicate the email ID after retrieving your results from Lucene. But really, it all depends on what you want to do with the data. – andrewJames May 01 '23 at 13:26

1 Answer

One approach:

You can flatten your source data:

doc1 contains:

EmailId = 1, AttachmentId = 1, AttachmentText = this is first attachment text.

doc2 contains:

EmailId = 1, AttachmentId = 2, AttachmentText = this is second attachment text

... and so on.

This is certainly not the only way to flatten your data. The best approach depends on all of the types of searches you want to perform, and there may be other suitable ways to flatten the data.
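
As a rough illustration of the flattening step (before any Lucene documents are built), here is a minimal sketch in plain Java. The class `EmailFlattener` and its `flatten` method are hypothetical names, and the maps stand in for whatever objects you parse your source data into. Each resulting map would become one Lucene document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns one email and its attachments into flat, per-attachment records.
class EmailFlattener {

    // Each attachment map is assumed to carry the keys AttachmentId, FileName and AttachmentText.
    static List<Map<String, String>> flatten(String emailId, List<Map<String, String>> attachments) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> attachment : attachments) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("EmailId", emailId); // repeated on every flat record
            doc.put("AttachmentId", attachment.get("AttachmentId"));
            doc.put("AttachmentText", attachment.get("AttachmentText"));
            docs.add(doc);
        }
        return docs;
    }
}
```

For the sample data in the question this yields two flat records, both carrying `EmailId` = `1`.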


Regarding the comment:

duplicate EmailId will be returned [w]hen querying...

Yes - I would say you can de-duplicate the results data (the Lucene doc hits) after running your query. It really depends on what you plan to do with your search results. If you want to display them to a user, then you can convert your "flat" results back into a hierarchy for that purpose.
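
A minimal sketch of that post-processing step, in plain Java: it assumes you have already read the stored `EmailId` and `AttachmentId` values out of each hit into a `String[]` pair. The `HitGrouper` class is a hypothetical name, not part of Lucene. Grouping by `EmailId` both removes the duplicates and rebuilds the email-to-attachments hierarchy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups flat search hits back under their parent email.
class HitGrouper {

    // Each hit is {emailId, attachmentId}, assumed to be in ranked order.
    static Map<String, List<String>> groupByEmail(List<String[]> hits) {
        // LinkedHashMap keeps emails in the order of their first (best-ranked) hit.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] hit : hits) {
            grouped.computeIfAbsent(hit[0], k -> new ArrayList<>()).add(hit[1]);
        }
        return grouped;
    }
}
```

The key set of the returned map is the de-duplicated list of email IDs. The values are the matching attachment IDs for each email.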


One extra point worth adding:

Some flattening approaches may cause you to have a lot of duplicate indexed data - for example, if you also want to search EmailText and you copy it into every attachment document, the same text is indexed once per attachment. I would try to avoid that by having two different document structures:

Document A: fields for searching attachment text:

  • AttachmentEmailId (this is your source data's EmailId field)
  • AttachmentId
  • AttachmentText

Document B: fields for searching email body text:

  • EmailId
  • EmailText

This way, the data in each EmailText is not indexed more than once.

A single Lucene index can hold documents with different sets of fields. And as above, you can rebuild the hierarchical structure of your original data when presenting the results (if you need/want to do that).
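
To make the two document shapes concrete, here is a small sketch in plain Java. It is not Lucene code: the maps stand in for Lucene documents, `search` is a naive substring match standing in for a real query, and `TwoShapeDocs` is a hypothetical name. It only illustrates that both shapes can live in one collection and that either kind of hit leads back to an email ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "Document A" and "Document B" living side by side.
class TwoShapeDocs {

    // "Document B": one per email, so EmailText is held exactly once.
    static Map<String, String> emailDoc(String emailId, String emailText) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("EmailId", emailId);
        doc.put("EmailText", emailText);
        return doc;
    }

    // "Document A": one per attachment, pointing back at its email.
    static Map<String, String> attachmentDoc(String emailId, String attachmentId, String text) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("AttachmentEmailId", emailId);
        doc.put("AttachmentId", attachmentId);
        doc.put("AttachmentText", text);
        return doc;
    }

    // Naive stand-in for a query over both text fields (substring match, no scoring).
    static List<Map<String, String>> search(List<Map<String, String>> docs, String keyword) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String text = doc.containsKey("EmailText") ? doc.get("EmailText") : doc.get("AttachmentText");
            if (text != null && text.contains(keyword)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Whichever shape matched, recover the owning email's ID.
    static String emailIdOf(Map<String, String> hit) {
        return hit.containsKey("EmailId") ? hit.get("EmailId") : hit.get("AttachmentEmailId");
    }
}
```

With the question's sample data, a hit on an attachment document exposes both `AttachmentId` and (via `AttachmentEmailId`) the email ID, while a hit on an email document exposes only `EmailId`.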

Another approach would be a more generic structure - something like:

Document fields:

  • Id (can be an EmailId value or an AttachmentId value)
  • Text
  • ParentId (null if the Id is an EmailId value)

Here, only one doc structure is needed.
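
A minimal sketch of how a hit in this generic structure could be mapped back to its email, in plain Java (the `GenericDoc` class and its methods are hypothetical names). Note that in the question's sample data an email and an attachment both have the ID `1`, so `Id` on its own is not unique. In this sketch it is `ParentId` that tells the two kinds of document apart.

```java
// Hypothetical value class mirroring the generic Id / Text / ParentId structure.
class GenericDoc {
    final String id;        // an EmailId value or an AttachmentId value
    final String text;      // email body or attachment text
    final String parentId;  // null for an email; the owning EmailId for an attachment

    GenericDoc(String id, String text, String parentId) {
        this.id = id;
        this.text = text;
        this.parentId = parentId;
    }

    boolean isEmail() {
        return parentId == null;
    }

    // The email this document belongs to, whichever kind of document it is.
    String emailId() {
        return isEmail() ? id : parentId;
    }
}
```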

andrewJames
  • I just want to get EmailId and/or AttachmentId searching only the EmailText or AttachmentText or both. Considering my example, do you mean to flatten the document like below: `Doc 1 - EmailId : 1; EmailText: "hello world"` `Doc 2 - EmailId : 1; AttachmentId: 1; AttachmentText: "this is first attachment text"` `Doc 3 - EmailId : 1; AttachmentId: 2; AttachmentText: "this is second attachment text"` This way we will not have duplicate index data and we will be storing EmailId and AttachmentId to grab them later after search. – Mohit May 02 '23 at 03:58
  • Your example in your comment is more-or-less the same as my "Document A" and "Document B" approach. In both cases, there are 2 separate document structures. In my case, I gave them different field names, to avoid confusion: `Doc 1 - EmailId : 1; EmailText: "hello world"` `Doc 2 - AttachmentEmailId : 1; AttachmentId: 1; AttachmentText: "this is first attachment text"` `Doc 3 - AttachmentEmailId : 1; AttachmentId: 2; AttachmentText: "this is second attachment text"` – andrewJames May 02 '23 at 12:37
  • Sounds good. I will follow this approach for now. Thank you so much for your help. – Mohit May 03 '23 at 03:35