0

I want to run the following regex query in Solr: name:/.+\.m+d$/. I have documents in my index with the following names:

readme.md
2013.02.26.md
test.mmd

and none of them match. Removing the $ matches the readme.md entry. I believe the problem is that I need to specify a global pattern modifier, but I can't find the syntax to do this.

Jayendra
Zarnywoop

2 Answers

2

These are my observations based on experimenting with Solr regex matches:

  • Percent-encode (URL-encode) all the special characters in your regex when you put the query in a URL. This site has been helpful for doing the percent encoding manually.

  • Make sure you do regex matching on string fields if you want to match the entire value. Regex matching on text fields will involve tokenization and will work according to which tokens got produced during indexing.

  • For Solr regexes, don't specify the beginning anchor ^ or the end anchor $. Solr always matches the regex against the entire value, as if ^ were at the beginning and $ at the end. To match only part of the value, put .* or .+ (or some such regex) at the beginning or the end of the pattern.
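The implicit anchoring described in the last point can be approximated outside Solr. This is a minimal sketch in Python, assuming that `re.fullmatch` is a reasonable stand-in for how Solr matches a regex against a whole string field value; Python's `re` syntax is not identical to Lucene's, so treat it as an illustration only.

```python
import re

# The three document names from the question.
names = ["readme.md", "2013.02.26.md", "test.mmd"]

# The pattern without any ^ or $ anchors.
pattern = r".+\.m+d"

# re.fullmatch requires the pattern to cover the entire string,
# which approximates matching against a whole string field value.
whole = [n for n in names if re.fullmatch(pattern, n)]
print(whole)

# A value with trailing text matches only if the pattern allows for it.
assert re.fullmatch(pattern, "readme.md.bak") is None
assert re.fullmatch(pattern + ".*", "readme.md.bak") is not None
```

All three names match without any explicit anchor, and a longer value such as the hypothetical readme.md.bak matches only once a trailing .* is added.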

I just indexed the 3 values in your question in a string field and issued this query, which matches all 3 documents:

q=id:/.%2B%5C.m%2Bd/

Decoded, .%2B%5C.m%2Bd is .+\.m+d. Because Solr matches against the entire value, this behaves like the PCRE ^.+\.m+d$.
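The encoded form in the query above can be reproduced and checked with a short script. This is a sketch using only Python's standard `urllib.parse`; the field name `id` comes from the query above, and letting `urlencode` build the whole query string is just one way to avoid encoding by hand.

```python
from urllib.parse import quote, unquote, urlencode

regex = r".+\.m+d"

# Percent-encode only the regex body; safe="" means no reserved
# character is left unencoded.
encoded = quote(regex, safe="")
print(encoded)

# Decoding gives back the original regex.
decoded = unquote(encoded)
print(decoded)

# Alternatively, let the library encode the whole q parameter.
# Here the colon and slashes are encoded as well.
query_string = urlencode({"q": "id:/" + regex + "/"})
print(query_string)
```

When the query goes through an HTTP client or a library such as SolrJ, this encoding is normally done for you, so the manual form is mainly useful for quick tests against the URL.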

arun
  • thanks for your reply. I notice that my name field was a text_general, but changing to string doesn't seem to have any effect. Also searching against the id field I get the same results. BTW I'm using the Solr Admin query for my testing and the escaping is done by the form, so I don't think that is the issue here. – Zarnywoop Mar 04 '13 at 14:57
  • Actually changing the field to a string seems to have fixed this. It now works without needing a regex, i.e. name:*.m*d. When using the regex, as you pointed out, you don't need the trailing $. Thanks again for your help. – Zarnywoop Mar 04 '13 at 15:29
  • Thank you, your comment finally helped me after trying to find documentation on Solr regexps. After making a string index and putting .* at the start and end, my Solr regexps work as expected :) – Shinhan Aug 27 '13 at 10:51
  • @arun, you're in principle right about that percent-encoding thing, but unless you're doing some quick testing with `curl` or so your job as a programmer should be finished the moment you have built the query object. its serialization including the escaping should be done by your HTTP query tool (say, `jQuery.ajax`). if you do it manually / explicitly in production, you're almost certainly doing it wrong. – flow Oct 01 '13 at 15:15
  • @flow, right. I use SolrJ. This is only for testing the query directly via the REST interface – arun Oct 01 '13 at 17:15
0

I tried this in RegexBuddy. It matches your test data.

.+\.m+d

PHP (preg) syntax for iterating over all matches in a string:

preg_match_all('/.+\.m+d/', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
    # Matched text = $result[0][$i];
}

This is the version where ^ and $ match at line breaks, the dot matches newlines, and matching is case-insensitive:

preg_match_all('/.+\.m+d/sim', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
    # Matched text = $result[0][$i];
}
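The effect of these modifiers can be checked on a multi-line subject. This is a rough Python equivalent of the two snippets above, assuming the three names from the question are joined with newlines; `re.S`, `re.I` and `re.M` stand in for the s, i and m modifiers.

```python
import re

# Hypothetical subject: the three names, one per line.
subject = "readme.md\n2013.02.26.md\ntest.mmd"
pattern = r".+\.m+d"

# Without modifiers the dot stops at line breaks,
# so each line produces its own match.
per_line = re.findall(pattern, subject)
print(per_line)

# With dot-matches-newline, the greedy .+ runs across the line
# breaks and everything collapses into a single match.
with_flags = re.findall(pattern, subject, re.S | re.I | re.M)
print(with_flags)
```

With the dot matching newlines, the greedy .+ swallows the earlier lines, so the second call returns one match spanning the whole subject rather than three.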
Patrik Lindström
  • The reg exp syntax seems to be hard to find for Solr. Look at this stackoverflow question: http://stackoverflow.com/questions/9332343/what-regular-expression-features-are-supported-by-solr-edismax – Patrik Lindström Feb 27 '13 at 15:38
  • there is a syntax specification at https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/automaton/RegExp.html, but they won't tell what non-standard extensions like `/~/` and `/<2-4>/` will do. – flow Oct 01 '13 at 15:12