
I need to search texts that contain these (differing) values:

0000.html - 8675.html
and
H0000 - H8675

and include them in the overall search. The searches are failing because each following page differs slightly, and only in those spots.

I am told the answer is to replace the spots in the text where these climbing numbers appear with regular expressions. I have tried several examples, but I think the period (dot) may be causing them to fail. I may be completely off track, as I am unfamiliar with regular expressions.

Is there someone with experience in this who might lend a hand?


Thanks zx81. I have not yet been able to make any of those work. I pulled a section of the text and searched that same text, which contains one instance of the xxxx.html value, and the search reports it as not a match.

I would normally buy RegexMagic, but after hours of trying examples that should work, and then this one made specifically for my case, I have lost hope that this will ever work for what I am trying to do.

But thanks a whole lot for your help!

  • For number ranges, if you don't know regex, I recommend you use a regex range generator. There's a free one [here](http://utilitymill.com/utility/Regex_For_Range), but since you use JGSoft products, you may want to look at Jan's RegexMagic which does exactly what you want. – zx81 Jun 29 '15 at 01:46

1 Answer


In the third expression, we will match your two ranges in one go. First, here are some expressions for the individual ranges.

Here's one way to match the range from 0000.html to 8675.html:

\b(?=\d{4}\.)0*(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])\.html

Explanation

  • The pattern (?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9]) matches numbers from 0 to 8675
  • I added 0* in front to match optional zeroes
  • I added (?=\d{4}\.) in front to ensure we have exactly four digits before the dot
  • I added a word boundary \b in front to ensure that our string is not embedded in a longer string such as 18675.html or B8675.html.
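As a quick sanity check of the points above, here is a minimal sketch in Python. Python is only my choice for illustration (the question is about a text editor search), and the names `PAGE_RE`, `find_pages`, and `sample` are made up for this example. The pattern itself is unchanged.

```python
import re

# The pattern from the answer, split across lines for readability only.
PAGE_RE = re.compile(
    r"\b(?=\d{4}\.)0*"
    r"(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])"
    r"\.html"
)


def find_pages(text):
    """Return every in-range page name (0000.html to 8675.html) found in text."""
    return PAGE_RE.findall(text)


# Hypothetical sample text: three names in range, three that must be rejected.
sample = (
    "See 0000.html, 0042.html and 8675.html, "
    "but not 8676.html, 18675.html or B8675.html."
)
print(find_pages(sample))  # ['0000.html', '0042.html', '8675.html']
```

The rejected names show each piece at work: 8676.html is out of range, 18675.html fails the four-digit lookahead, and B8675.html has no word boundary before the digits.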

For the second one, add an H at the front:

\bH(?=\d{4}\.)0*(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])\.html

To kill two birds with one stone, make the H optional:

\bH?(?=\d{4}\.)0*(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])\.html
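To see the optional H in action, here is a small Python sketch. Again, Python and the names `RANGE`, `BOTH_RE`, and `text` are just assumptions for the demonstration. The sample text is invented.

```python
import re

# Number range 0 to 8675, exactly as in the answer.
RANGE = r"(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])"

# Combined pattern: an optional H, then exactly four digits in range, then ".html".
BOTH_RE = re.compile(r"\bH?(?=\d{4}\.)0*" + RANGE + r"\.html")

# Hypothetical sample: with and without H, one out of range, one embedded in a longer word.
text = "Links: 0007.html, H0007.html, H8675.html, H8676.html, XH0007.html"
print(BOTH_RE.findall(text))  # ['0007.html', 'H0007.html', 'H8675.html']
```

XH0007.html is skipped because there is no word boundary between the X and the H, so the match cannot start inside the longer name.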

In Practice

For these kinds of expressions, unless you are experienced in regex, I recommend you use a range generator. (And if you are experienced in regex, you already know these ranges are so error-prone that you are better off using a range generator.)
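Because these ranges are so easy to get wrong, whether written by hand or generated, it can be worth brute-force checking them. Below is a minimal sketch of that idea in Python. It is not part of any generator tool, and `check_range` is a hypothetical helper I wrote for this example. It compares the regex against a plain numeric test for every number up to a limit.

```python
import re

# The bare number-range part of the pattern (no leading zeros, no ".html").
RANGE_RE = re.compile(
    r"(?:867[0-5]|86[0-6][0-9]|8[0-5][0-9]{2}|[1-7][0-9]{3}|[1-9][0-9]{1,2}|[0-9])"
)


def check_range(pattern, low, high, limit):
    """Return the numbers in 0..limit where the regex and the numeric test disagree."""
    mismatches = []
    for n in range(limit + 1):
        expected = low <= n <= high                       # what we want
        actual = pattern.fullmatch(str(n)) is not None    # what the regex does
        if expected != actual:
            mismatches.append(n)
    return mismatches


# An empty list means the regex matches exactly the numbers 0 to 8675.
print(check_range(RANGE_RE, 0, 8675, 99999))  # []
```

If the list is not empty, the numbers in it tell you which alternative in the pattern needs fixing.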

There are some free ones online (which I don't fully trust), but since you use JGSoft's EditPad, you may want to look at its RegexMagic.

Even so, you'll probably have to tweak the generated expressions so they meet your specs.

zx81