
I have a use case where I want to search for a word near a date that falls within a specific range in the file content. For example, consider a document with the content "income tax for the period Jan 2011 to Jan 2012 amounted to $2,000". I have the query "tax [20110101 TO 20120201]"~4, for which I want the above document to be a hit. I'm using the complex phrase query parser to handle complex proximity queries.

Can anyone point me in the right direction on how to implement this in Solr?
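To make the intended matching semantics concrete, here is a minimal sketch in plain Python. It is not Solr code and does not use the complex phrase query parser. It assumes dates appear only as "Month YYYY", treats each date as a single token at one position, and measures proximity as a simple difference in token positions, which is a simplification of how Lucene computes slop. The names `tokens_with_dates` and `near_date_in_range` are made up for illustration.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
# Map both full names and three-letter abbreviations to month numbers.
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number


def tokens_with_dates(text):
    """Tokenize text, collapsing each 'Month YYYY' pair into one date token."""
    raw = re.findall(r"[A-Za-z]+|\d+", text.lower())
    out = []
    i = 0
    while i < len(raw):
        month = MONTHS.get(raw[i])
        if month and i + 1 < len(raw) and re.fullmatch(r"\d{4}", raw[i + 1]):
            # Assumption: a month with no day resolves to the 1st of that month.
            out.append(date(int(raw[i + 1]), month, 1))
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return out


def near_date_in_range(text, term, start, end, slop):
    """True if `term` is within `slop` positions of a date in [start, end]."""
    toks = tokens_with_dates(text)
    term_positions = [i for i, t in enumerate(toks) if t == term.lower()]
    date_positions = [i for i, t in enumerate(toks)
                      if isinstance(t, date) and start <= t <= end]
    return any(abs(i - j) <= slop
               for i in term_positions for j in date_positions)


doc = "income tax for the period Jan 2011 to Jan 2012 amounted to $2,000"
print(near_date_in_range(doc, "tax", date(2011, 1, 1), date(2012, 2, 1), 4))  # True
print(near_date_in_range(doc, "tax", date(2011, 1, 1), date(2012, 2, 1), 3))  # False
```

The point of the sketch is that the range check and the proximity check apply to the same date occurrence, which is what the query above is meant to express.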

Kaleem
  • Have you tried using the correct date range syntax in your query, and changing the dates in your documents to use that syntax at index time? – David Faber Mar 24 '12 at 18:23
  • Changing the format of the dates at index time will work if I index them to a separate date field, but here I want the dates to be indexed in the same field as the file content so that the proximity queries can work. The problem is that if I index the reformatted dates into the same text (file content) field, they would be sorted lexicographically. Moreover, there can be many dates in a document, and even more across the whole index, so this might blow up the memory when I want to search for a word near any date, as in the query "tax [16000101 TO 25000101]" (basically all the reasonable dates). – Kaleem Mar 24 '12 at 21:22
  • I also had the idea of indexing all the dates in the file content to a separate Solr.DateField while retaining their offsets, and using that field for proximity searches against the text field, as a sort of cross-field proximity query like "tax dateField:[16000101 TO 25000101]". But even that would drive up memory use, as it tries to load the whole index and compare positions. Is there a better way to achieve this with less memory consumption and less processing time? – Kaleem Mar 24 '12 at 21:36
  • Maybe what you could do is index the dates in a separate field, but keep a tag in the file where the date is/was. That is, when indexing the file, replace the date range with a tag, and index the start and end dates as separate fields. Then you can do a proximity search by querying `+"tax "~4 +startDateField:[* TO endDate] +endDateField:[startDate TO *]`. Hope that helps (and makes sense). – David Faber Mar 25 '12 at 01:25
  • That is a good way to do it, but consider this example document content: `"tax for period 2012 March is $2000. Compared to 2010 June .."` and the query `"tax [20100101 TO 20101231]"~3`, by which I want tax to be near a date in 2010. If I follow your approach and write the query as `+"tax "~3 +dateField:[20100101 TO 20101231]`, this document will qualify, but it shouldn't, because it does not satisfy the query I intended. So how can I be sure that the date matched by the `dateTag` is the one satisfying the `dateField:[startDate TO endDate]` part of the query? – Kaleem Mar 25 '12 at 08:35
  • I don't think you can be certain of it - and in any case, when you're indexing files and doing proximity searches, Solr is bound to return some results that are not necessarily relevant. You have to fully quantify your data if you don't want potentially bad results being returned. – David Faber Mar 25 '12 at 18:35
  • Thanks @DavidFaber for the help, but I didn't quite get the last part. Why is Solr bound to return some results that are not necessarily relevant when using proximity? – Kaleem Mar 25 '12 at 19:17
  • I shouldn't say not necessarily relevant, but not necessarily what you want. For example, if you have a medical journal article, searching for "malaria africa"~5 won't necessarily return results about malaria in Africa. It could return results like "Malaria, long thought confined to Africa, has been spreading to the United States." That might not be what the end-user was looking for. – David Faber Mar 25 '12 at 22:53
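
The trade-off discussed in the comments above can be sketched in plain Python. This is only an illustration, not Solr code. It assumes dates are written as "Month YYYY" or "YYYY Month", uses a hypothetical placeholder token `DATETAG`, and stores a single list of dates in place of the separate start and end date fields. `field_level_match` mimics the suggested query, where the tag proximity and the date range are checked independently. `position_aware_match` mimics the idea of retaining each date's offset so that both checks apply to the same occurrence. All function names are made up for the example.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

DATE_TAG = "DATETAG"  # hypothetical placeholder left where a date was

MONTH_ALT = "|".join(sorted(MONTHS, key=len, reverse=True))
DATE_RE = re.compile(
    rf"\b(?:(?P<m1>{MONTH_ALT})\s+(?P<y1>\d{{4}})"
    rf"|(?P<y2>\d{{4}})\s+(?P<m2>{MONTH_ALT}))\b",
    re.IGNORECASE,
)


def index_document(text):
    """Replace each date with DATE_TAG and collect the dates in order."""
    dates = []

    def replace(match):
        month = MONTHS[(match.group("m1") or match.group("m2")).lower()]
        year = int(match.group("y1") or match.group("y2"))
        dates.append(date(year, month, 1))  # assumption: day defaults to 1
        return DATE_TAG

    content = DATE_RE.sub(replace, text)
    tokens = [t if t == DATE_TAG else t.lower()
              for t in re.findall(r"[A-Za-z]+|\d+", content)]
    return {"tokens": tokens, "dates": dates}


def field_level_match(doc, term, start, end, slop):
    """Term near ANY tag, and ANY date in range (checked independently)."""
    toks = doc["tokens"]
    tags = [i for i, t in enumerate(toks) if t == DATE_TAG]
    terms = [i for i, t in enumerate(toks) if t == term]
    near_tag = any(abs(i - j) <= slop for i in terms for j in tags)
    return near_tag and any(start <= d <= end for d in doc["dates"])


def position_aware_match(doc, term, start, end, slop):
    """Term near a tag whose OWN date is in range (n-th tag = n-th date)."""
    toks = doc["tokens"]
    tags = [i for i, t in enumerate(toks) if t == DATE_TAG]
    in_range = [p for p, d in zip(tags, doc["dates"]) if start <= d <= end]
    terms = [i for i, t in enumerate(toks) if t == term]
    return any(abs(i - j) <= slop for i in terms for j in in_range)


doc = index_document(
    "tax for period 2012 March is $2000. Compared to 2010 June ..")
start, end = date(2010, 1, 1), date(2010, 12, 31)
print(field_level_match(doc, "tax", start, end, 3))     # True (unwanted hit)
print(position_aware_match(doc, "tax", start, end, 3))  # False
```

In this sketch the field-level check accepts the document because "tax" is near the 2012 tag while the 2010 date satisfies the range separately, which is the false positive described in the thread. Tying each tag to its own date avoids that, at the cost of keeping per-position date data.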

0 Answers