
I'm trying to find out if there's a performant way to search through my current data structures, or if I have to restructure them.

I have the following structure for my indices:

  • Publication (attributes: id, title, keywords)
  • PublicationFile (attributes: id, publication_id, text, page_number)

A publication has many publication files; each publication file contains the contents of the file and the page it was found on (text and page_number).

`title`, `keywords`, and `text` are the searchable attributes, so if someone searches for 'economy', I want to search through both of my indices.

I would like to perform a search that searches through both indices and returns the results in a manner that allows me to do something like this:

Publication1
  keyword1, keyword2
  Found results in Publication1's file contents in: [file a (pages: 1, 2, 3), file b (pages: 5)]

So I essentially want the search results grouped by a publication's ID. The only way I can think of right now is to search both indices and then loop through the results, linking each file/page match to its publication.
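The client-side merge described above could be sketched roughly like this (the record shapes are taken from the question; the `groupByPublication` helper name is hypothetical):

```javascript
// Merge hits from the Publication index and the PublicationFile index,
// collecting each file hit's page number under its parent publication.
function groupByPublication(publicationHits, fileHits) {
  // Index the publication hits by id, each with an empty page list.
  const byId = new Map(
    publicationHits.map((pub) => [pub.id, { ...pub, pages: [] }])
  );
  // Attach each file hit's page to its publication, if that publication
  // is among the results.
  for (const file of fileHits) {
    const pub = byId.get(file.publication_id);
    if (pub) pub.pages.push(file.page_number);
  }
  return [...byId.values()];
}
```

This is the nested-loop post-processing the question is trying to avoid; it also breaks pagination, since two result sets paginated independently can't be merged page by page.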

In summary my questions are:

  1. Is there a way I can structure my data to avoid the nested loops to process it?
  2. Is there a way I can do this through Algolia without having to modify my structure? I would ideally want to re-use Algolia's frontend searching code and avoid processing this data on my backend.
Omar Bahareth

1 Answer


To answer your questions:

1) Yes; I'll get into more detail below.

2) No, unfortunately not; you'll have to modify your data structure.


Here is how I'd recommend you structure your data to achieve what you're trying to do.

{
  objectID: "publicationFileId",
  publicationId: "",
  title: "",
  keywords: ["", ""],
  text: "",
  page_number: 1,
  published_at: 1485892992 // timestamp
}

Essentially, you need to flatten your two indices into a single one to achieve what you're trying to do. Modifying the data structure will be less of a headache down the road than maintaining that client-side code, and it will perform better too.
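The flattening step could look something like this, assuming the attribute names from the question (the `flattenRecords` helper name is hypothetical; you'd push the resulting records to a single Algolia index):

```javascript
// Produce one flat record per publication file, duplicating the parent
// publication's searchable attributes (title, keywords) onto each record.
function flattenRecords(publication, publicationFiles) {
  return publicationFiles.map((file) => ({
    objectID: String(file.id),
    publicationId: publication.id,
    title: publication.title,
    keywords: publication.keywords,
    text: file.text,
    page_number: file.page_number,
  }));
}
```

Duplicating `title` and `keywords` onto every file record is intentional: it's what makes a single query match on either the publication-level or the file-level attributes.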

A few articles and documentation links that explain why:

https://blog.algolia.com/inside-the-engine-part-7-better-relevance-via-dedup-at-query-time/

https://www.algolia.com/doc/guides/search/distinct/
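With the flattened index, the distinct feature described in those links lets Algolia deduplicate hits that share a publication at query time. A possible index settings payload (using Algolia's `attributeForDistinct` and `distinct` settings) might be:

```json
{
  "attributeForDistinct": "publicationId",
  "distinct": true
}
```

With these settings, a query returns at most one hit per `publicationId`, which keeps pagination working on the Algolia side instead of in backend post-processing.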

Hope this helps!

Maxime

  • Thanks! That really helps; I just need to find a way to keep the data under 10 KB per record now. – Omar Bahareth Feb 01 '17 at 07:19
  • Actually, going over it again, I would still have to use my backend to process the data to display in the same structure as the example in my question, right? So there doesn't seem to be a way to avoid that part, but then how would I paginate through the data if I'm processing it? I want my search results to show publications, and the file names/page numbers that results were found in. The main item of the result is the publication, with the file matches treated as subitems. – Omar Bahareth Feb 01 '17 at 11:29
  • I didn't find any way to get exactly what I wanted, but deduplicating on the file ID let me show matches as either a publication or a file. I also put them into one index, similar to the blog post you linked, and used `record_type` and `record_priority` to get the results sorted in a manner that makes sense for my use case. I didn't get exactly what I needed from your answer (and as you said, it doesn't seem to be possible yet), but it was the best compromise in my case. Thanks a lot. – Omar Bahareth Feb 10 '17 at 09:59