I need to match the following line with multiple capturing groups:

0.625846        29Si    29      [4934.39        0]      [0.84   100000000000000.0]

I use the regex:

^(0+\.[0-9]?e?[+-]?[0-9]+)\s+([0-9]+\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\s+([0-9][0-9]?[0-9]?)\s+(\[.*\])\s+(\[.*\])$

See this link for a regex101 workspace. However, when I try the match using regex.h, it behaves differently on OSX and Linux. Specifically:

Fails on: OSX: 10.14.6 LLVM: 10.0.1 (clang-1001.0.46.4)

Works on: linux: Ubuntu 18.04 g++: 7.5.0

I worked up a brief program that reproduces the problem, compiled with g++ regex.cpp -o regex:

#include <cstdio>
#include <iostream>
#include <string>

//regex
#include <regex.h>

using namespace std;

int main(int argc, char** argv) {


  //define a buffer for keeping results of regex matching 
  char       buffer[100];

  //regex object to use
  regex_t regex;

  //*****regex match and input file line*******
  string iline = "0.625846        29Si    29      [4934.39        0]      [0.84   100000000000000.0]";
  string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)\\s+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\\s+([0-9][0-9]?[0-9]?)\\s+(\\[.*\\])\\s+(\\[.*\\])$";


  //compile the regex 
  int reti = regcomp(&regex,matchfile.c_str(),REG_EXTENDED);

  regerror(reti, &regex, buffer, 100);

  if(reti==0)
    printf("regex compile success!\n");
  else
    printf("regcomp() failed with '%s'\n", buffer);


  //match the input line
  regmatch_t input_matchptr[6];
  reti = regexec(&regex,iline.c_str(),6,input_matchptr,0);

  regerror(reti, &regex, buffer, 100);

  if(reti==0)
    printf("regex match success!\n");
  else
    printf("regexec() failed with '%s'\n", buffer);

  //******************************************

  return 0;
}

I have also modified my regex to comply with POSIX (I think?) by removing the previous use of the +? and *? operators as per this post, which I understand LLVM requires, but I may have missed something else that is incompatible with POSIX. The regex now compiles without error, which makes me think it is valid, but I still don't understand why no match is obtained.

How can I modify my regex to correctly match?

Anton Korobeynikov
villaa
  • If you are using C++, consider using the C++ [`<regex>`](https://en.cppreference.com/w/cpp/regex) facility. – n. m. could be an AI Dec 26 '21 at 18:59
  • That example is simple, *except for the regex,* which isn't simplified at all and *which is the part that matters most.* Are you even using the same version of the same regex library? Can you (or why didn't you) simplify the regex? – arnt Dec 26 '21 at 21:55
  • Try `string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)[[:space:]]+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)[[:space:]]+([0-9][0-9]?[0-9]?)[[:space:]]+(\\[.*\\])[[:space:]]+(\\[.*\\])$";` – Wiktor Stribiżew Dec 26 '21 at 22:23
  • @arnt because I don't have the ability to simplify the regex? – villaa Dec 27 '21 at 00:09
  • @WiktorStribiżew this *may* be working, thought I tried it before but I think I missed one [:space:] or incorrectly wrote it. will do more testing – villaa Dec 27 '21 at 01:11
  • @WiktorStribiżew that seems to work on both systems, thanks! How did you know to put the double brackets around `:space:` ? Based on [this](https://www.regular-expressions.info/posixbrackets.html) page I tried this before but only used single brackets. – villaa Dec 27 '21 at 01:26
  • One simple way to modify regexes is to start with the first element, then add parts of the original, one or a few characters at a time, until the bug appears. In this case, until the two implementations diverge. It won't work correctly, but that's okay, because that is already the case. You start the process precisely because it doesn't work as intended. – arnt Dec 27 '21 at 09:11
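
The incremental approach described in the last comment can be sketched in code. This is only an illustration, not something from the thread: the helper name `first_failing_piece` and the way the pattern is split into pieces are assumptions. It grows the pattern one piece at a time and reports the first piece at which the partial pattern stops matching, so running it on both systems shows where they diverge.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of the first piece whose addition
// makes the growing pattern fail to compile or fail to match `line`,
// or -1 if every prefix of the pattern matches.
int first_failing_piece(const std::vector<std::string>& pieces,
                        const std::string& line) {
  std::string pattern;
  for (std::size_t i = 0; i < pieces.size(); ++i) {
    pattern += pieces[i];
    regex_t re;
    // REG_NOSUB: only success or failure matters here, not the offsets.
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
      return static_cast<int>(i);
    int rc = regexec(&re, line.c_str(), 0, nullptr, 0);
    regfree(&re);
    if (rc != 0) return static_cast<int>(i);
  }
  return -1;
}
```

Each piece should leave the pattern valid on its own (for example a whole group or a whole separator such as `\\s+`), and the trailing `$` should only be added as the last piece.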

1 Answer

To answer the immediate question, you need to use

string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)[[:space:]]+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)[[:space:]]+([0-9][0-9]?[0-9]?)[[:space:]]+(\\[.*\\])[[:space:]]+(\\[.*\\])$";

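That is, instead of the Perl-like \s, use the POSIX character class [:space:] inside a bracket expression, i.e. [[:space:]]. \s is not part of POSIX extended regular expressions; glibc accepts it as an extension, which is why the original pattern matched on Linux.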

You mention that you tried [:space:] outside of a bracket expression, and it did not work - that is expected. As per Character Classes,

[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]].

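This means that POSIX character classes are only parsed as such when used inside bracket expressions. Written on its own, [:space:] is just a bracket expression that matches any one of the characters :, s, p, a, c, or e.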
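
As a minimal sketch of how the corrected pattern could be used with regex.h to pull out the individual fields: the names `match_groups` and `kPortablePattern` are made up for this example, and the helper simply returns an empty vector on any failure rather than reporting the error via regerror.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// The pattern from this answer, with [[:space:]] in place of \s.
const std::string kPortablePattern =
    "^(0+\\.[0-9]?e?[+-]?[0-9]+)[[:space:]]+"
    "([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)[[:space:]]+"
    "([0-9][0-9]?[0-9]?)[[:space:]]+(\\[.*\\])[[:space:]]+(\\[.*\\])$";

// Hypothetical helper: returns the whole match followed by each capture
// group, or an empty vector if the pattern does not compile or match.
std::vector<std::string> match_groups(const std::string& pattern,
                                      const std::string& line) {
  std::vector<std::string> groups;
  regex_t re;
  if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0) return groups;

  // re_nsub is the number of parenthesized groups; slot 0 is the whole match.
  std::vector<regmatch_t> m(re.re_nsub + 1);
  if (regexec(&re, line.c_str(), m.size(), m.data(), 0) == 0) {
    for (const regmatch_t& g : m) {
      if (g.rm_so < 0)
        groups.emplace_back();  // group did not participate in the match
      else
        groups.emplace_back(line, g.rm_so, g.rm_eo - g.rm_so);
    }
  }
  regfree(&re);
  return groups;
}
```

For the line in the question, a call such as `match_groups(kPortablePattern, iline)` is expected to return six strings: the whole line followed by the five captured fields. Sizing the match array from `re_nsub` avoids hard-coding the 6 used in the question's code.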

Wiktor Stribiżew
  • thx I didn't appreciate that feature of character class, i.e. that it _cannot_ be used outside of a bracket expression. – villaa Dec 27 '21 at 19:01