Extract SKU values which may be numeric or alphanumeric and must be 4 to 20 characters long

Question

I am open to including more code than just a regular expression.

I am writing some code that takes a picture, runs a couple Imagick filters, then a tesseractOCR pass, to output text.

From that text, I am using a regex with PHP to extract a SKU (model number for a product) and output the results into an array, which is then inserted into a table.

All is well, except that with the expression I'm using now:

\w[^a-z\s\/?!@#-$%^&*():;.,œ∑´®†¥¨ˆøπåß∂ƒ©˙∆˚¬Ω≈ç√∫˜µ≤≥]{4,20}

I will still get back some strings which contain ONLY letters.

The ultimate goal:

-strings that may contain uppercase letters and numbers,
-strings that contain only numbers,
-strings that do not contain only letters,
-strings which do not contain any lowercase letters,
-these strings must be between 4-20 characters

as an example:

a SKU could be 5209, or it could also be WRE5472UFG5621.

Sample text that you're trying to match but that returns unwanted results, please. — Markus AO, Jan 11 '22 at 21:32
here are some results being returned after the regex I had above: "APPLIANCES", "GTS17DTNRWW", "6361278" I'm trying to eliminate strings with only letters, like "APPLIANCES" in this case. both "GTS17DTNRWW" and "6361278" are desired results. sometimes my statement will return several unwanted strings of all letter characters. maybe like: "ALSO" "AVAILABLE" "DISCOUNT" I hope I've explained that well — InvisibleHamSandwich, Jan 11 '22 at 21:52
here is an exact snippet of text I'm filtering with regex: -9 Cycles 3 Temperature Levels Steam Sanitizet+ -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W — InvisibleHamSandwich, Jan 11 '22 at 21:53
Can you please update them into your post so we have all the info in the same place. (Use code tags for clarity.) — Markus AO, Jan 11 '22 at 22:13
The regex maestros will ask that you [edit] your question to include at least one sample string and the exact desired output. Ideally, having 3 to 5 sample strings and their results should sufficiently present all edge cases. — mickmackusa, Jan 12 '22 at 11:25
Do you actually need unicode support? Does this pattern fail you? https://3v4l.org/IeJTT It might be doing more work than required. Please offer more test cases to reveal all known edge cases. Is this enough? https://3v4l.org/IRlTJ — mickmackusa, Jan 12 '22 at 11:41

mickmackusa · Answer 1 · 2022-01-15T13:22:59.013

Okay, you have accepted an indirect answer since I asked for question improvement in a comment under the question. I'll interpret this to mean that you have no intention of clarifying the question further and that the other answer works as desired. For this reason, I'll offer a single-regex solution so that you don't need iterated regex filtering after making an initial regex extraction.

For your limited sample data, your requirement boils down to:

Match whole "words" (visible characters separated by spaces) which:

-consist of numeric or alphanumeric strings and
-are between 4 and 20 characters long.

You can subsequently eliminate duplicated matched strings with array_unique() if desirable.

Code: (Demo)

$str = '-9 Cycles 3 Temperature Levels Steam Sanitizet+ -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W';

if (preg_match_all('~\b(?:[A-Z]{4,20}(*SKIP)(*FAIL)|[A-Z\d]{4,20})\b~', $str, $m)) {
    var_export(array_unique($m[0]));
}

Output:

array (
  0 => '1258843',
  1 => 'DVE45R6100W',
  2 => '1501',
  3 => 'DVE45R6100P',
)

Pattern Breakdown:

\b             #word boundary: the zero-width position between a \w character and a \W character (or a string edge)
(?:            #start non-capturing group
  [A-Z]{4,20}(*SKIP)(*FAIL) #match and disqualify all-letter words
  |                         #or
  [A-Z\d]{4,20}             #match between 4 and 20 digits or uppercase letters
)              #end non-capturing group
\b             #word boundary: the zero-width position between a \w character and a \W character (or a string edge)

Here are a couple of alternative regex patterns for comparison. The first two use a lookahead; the last one uses no lookarounds and instead relies on a "skip-fail" technique to disqualify purely alphabetical "words".

437 steps: \b(?=\S*\d)[A-Z\d]{4,20}\b
325 steps: \b(?=[A-Z]*\d)[A-Z\d]{4,20}\b
298 steps: \b(?:[A-Z]{4,20}(*SKIP)(*FAIL)|[A-Z\d]{4,20})\b
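
One edge case is worth checking before choosing between these patterns. In the skip-fail pattern as written, (*SKIP) is reached as soon as four uppercase letters have matched, before the trailing \b is tested, so a token that starts with four or more letters and then continues with digits appears to be skipped as well. The sketch below uses an invented SKU, ABCD1234 (not from the question's data), and a variant that adds \b before (*SKIP). The variant is a suggestion under that assumption, not one of the measured patterns, so the step counts above do not apply to it.

```php
<?php
// Hypothetical input: 'ABCD1234' is an invented SKU with 4+ leading letters.
$sample = 'ALSO ABCD1234 DVE45R6100W';

$patterns = [
    // The skip-fail pattern exactly as listed above.
    'asWritten'    => '~\b(?:[A-Z]{4,20}(*SKIP)(*FAIL)|[A-Z\d]{4,20})\b~',
    // Variant (an assumption, not from the answer): test the word boundary
    // before (*SKIP), so only whole all-letter words are disqualified.
    'withBoundary' => '~\b(?:[A-Z]{4,20}\b(*SKIP)(*FAIL)|[A-Z\d]{4,20})\b~',
    // Lookahead pattern from the list above, for comparison.
    'lookahead'    => '~\b(?=[A-Z]*\d)[A-Z\d]{4,20}\b~',
];

$found = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $sample, $m);
    $found[$label] = $m[0];
    echo $label, ': ', implode(', ', $m[0]), PHP_EOL;
}
```

If your SKUs never begin with four or more letters, as in the sample data, the original pattern is unaffected.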

The equivalent non-regex process (which I do not endorse) is: (Demo)

$result = [];                           // initialize so array_values() is safe when nothing matches
foreach (explode(' ', $str) as $word) {
    $length = strlen($word);
    if ($length >= 4                    // has 4 characters or more
        && $length <= 20                // has 20 characters or less
        && !isset($result[$word])       // not yet in result array
        && ctype_alnum($word)           // comprised of numbers and/or letters only
        && !ctype_alpha($word)          // is not comprised solely of letters
        && $word === strtoupper($word)  // has no lowercase letters
    ) {
        $result[$word] = $word;
    }
}
var_export(array_values($result));

Cheers for the more elegant version. I was curious over the cost of using a lookahead vs. a "blunt" two-step filtering, and crunched a test case (details in my answer). It appears that the two-step approach has a ~15% performance edge over the one-step lookahead regex. So it isn't suboptimal performance-wise, while surely the "lesser" of the two in terms of eloquence. — Markus AO, Jan 15 '22 at 10:04
I didn't play with your benchmark script, but I tuned up my recommended pattern in terms of step count on the OP's lone sample string. I have a hard time believing that a single pass over the input string takes more time than a pass over the input string followed by a regex pass over all matches found. I also assume that longer strings with more matches will experience a greater performance cost with the brute force technique. That said, I'm on vacation with my family and am not willing to put more effort in right now. — mickmackusa, Jan 15 '22 at 22:32

Markus AO · Accepted Answer · 2022-01-15T10:02:01.943

Until the regex maestros show up, a lazy person such as myself would just do two rounds on this and keep it simple. First, match all strings that are only A-Z, 0-9 (rather than crafting massive no-lists or look-abouts). Then, use preg_grep() with the PREG_GREP_INVERT flag to remove all strings that are A-Z only. Finally, filter for unique matches to eliminate repeat noise.

$str = '-9 Cycles 3 Temperature Levels Steam Sanitizet+ -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W';

$wanted = [];

// First round: Get all A-Z, 0-9 substrings (if any)
if(preg_match_all('~\b[A-Z0-9]{6,24}\b~', $str, $matches)) {

    // Second round: Filter all that are A-Z only
    $wanted = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);

    // And remove duplicates:
    $wanted = array_unique($wanted);
}

Result:

array(3) {
    [2] => string(7) "1258843"
    [3] => string(11) "DVE45R6100W"
    [4] => string(11) "DVE45R6100P"
}

Note that I've increased the match length to {6,24} even though you speak of a 4-character match, since your sample string has 4-digit substrings that were not in your "desirable" list.
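
If you'd rather switch between the two length ranges without editing the pattern, the two rounds can be wrapped in a small function. This is only a sketch: extractSkus() and its $min/$max parameters are names made up for illustration, and the input is a shortened excerpt of the sample string.

```php
<?php
// extractSkus() is a hypothetical helper; $min and $max are the length bounds.
function extractSkus(string $text, int $min = 4, int $max = 20): array
{
    // First round: whole "words" made of A-Z and 0-9 within the length bounds.
    if (!preg_match_all('~\b[A-Z0-9]{' . $min . ',' . $max . '}\b~', $text, $matches)) {
        return [];
    }
    // Second round: drop letter-only strings, then de-duplicate and re-index.
    $wanted = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);
    return array_values(array_unique($wanted));
}

// Shortened excerpt of the sample string.
$str = 'ALSO AVAILABLE - White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P';

var_export(extractSkus($str));        // the question's 4 to 20 range, which also keeps 1501
var_export(extractSkus($str, 6, 24)); // the stricter range used in this answer
```

Unlike the snippet above, this re-indexes the result with array_values(), so the keys start at 0.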

Edit: I've moved the preg_match_all() into a conditional construct containing the remaining ops, and set $wanted as an empty array by default. You can conveniently both capture matches and evaluate whether anything matched in one go (rather than, e.g., having if(!empty($matches))).

Update: Following @mickmackusa's answer with a more eloquent regex using a lookahead, I was curious about the performance of a "plain" regex with filtering vs. the use of a lookahead, so I put together a test case (only 1 iteration at 3v4l so as not to bomb them; use your own server for more!).

The test case used 100 generated strings with potential matches, run at 5000 iterations using both approaches. Matching results returned are identical. The single-step regex with lookahead took 0.83 sec on average, while the two-step "plain" regex took 0.69 sec on average. It appears that using a lookahead is marginally more costly than the more "blunt" approach.
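
For anyone who wants to repeat the comparison locally, a timing harness could look roughly like the following. It is not the original test script: the input strings, the iteration count, the exact lookahead pattern, and the timeIt() helper are all assumptions, and timings will vary by machine and PHP version.

```php
<?php
// Timing sketch only; inputs and iteration count are assumptions.
$line = 'ALSO AVAILABLE - White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P';
$inputs = array_fill(0, 100, $line);

// One-step approach: a single regex with a lookahead.
$oneStep = function (string $s): array {
    preg_match_all('~\b(?=[A-Z]*\d)[A-Z\d]{4,20}\b~', $s, $m);
    return array_values($m[0]);
};

// Two-step approach: plain match, then filter out letter-only strings.
$twoStep = function (string $s): array {
    preg_match_all('~\b[A-Z0-9]{4,20}\b~', $s, $m);
    return array_values(preg_grep('~^[A-Z]+$~', $m[0], PREG_GREP_INVERT));
};

// timeIt() is a hypothetical helper: runs $fn over all inputs, returns seconds.
function timeIt(callable $fn, array $inputs, int $iterations): float
{
    $start = hrtime(true); // nanoseconds, monotonic clock
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($inputs as $s) {
            $fn($s);
        }
    }
    return (hrtime(true) - $start) / 1e9;
}

// Check that both approaches agree before comparing their speed.
$same = $oneStep($line) === $twoStep($line);
printf("identical: %s\n", $same ? 'yes' : 'no');
printf("one-step: %.3f s\n", timeIt($oneStep, $inputs, 50));
printf("two-step: %.3f s\n", timeIt($twoStep, $inputs, 50));
```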

[Regex functions](https://www.php.net/manual/en/ref.pcre.php) and the syntax for more complex matching are a mouthful to tackle but they are well worth it. Say you would `explode()` by space or `,`, however perhaps there are multiple spaces or tabs, perhaps comma has surrounding spaces or not; use [preg_split()](https://www.php.net/preg_split), etc. Fortunately there are also excellent resources for reference, I find myself often at https://www.regular-expressions.info/ .... — Markus AO, Jan 12 '22 at 11:02
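
As a small illustration of the preg_split() point, here is a sketch that splits on any run of commas and whitespace. The input string is invented for the example.

```php
<?php
// Hypothetical input with mixed separators: a comma with a leading space,
// a bare comma, a double space, and a tab.
$list = "1258843 ,DVE45R6100W,  1501\tDVE45R6100P";

// Split on any run of commas and/or whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
$parts = preg_split('~[\s,]+~', $list, -1, PREG_SPLIT_NO_EMPTY);

var_export($parts);
```

A plain explode(' ', $list) on the same input would leave commas attached to values and produce empty elements for the double space.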
