I need to parse a URI-like string. The URI format is specific to the project and has the form "scheme://path/to/file", where the path must be a syntactically correct file path from the filesystem's point of view. For this purpose I used std::regex with the pattern R"(^(r[o|w])\:\/\/(((?!\$|\~|\.{2,}|\/$).)+)$)".

It works fine, but the code analyzer complains that it is not compliant, because the $ character does not belong to the C++ Language Standard basic source character set:

AUTOSAR C++14 A2-3-1 (Required) Only those characters specified in the C++ Language Standard basic source character set shall be used in the source code.

Exception to this rule (according to Autosar Guidelines):

It is permitted to use other characters inside the text of a wide string and a UTF-8 encoded string literal.

wchar_t is prohibited by another rule. The pattern does work as a UTF-8 string literal, but it looks ugly and unreadable in the code, and I'm afraid it is not safe.

Could someone help me with a workaround? Or, if std::regex is not the best solution here, what would be better?

Are there any other drawbacks to using a UTF-8 string literal?

P.S. I need $ to make sure (at the parsing stage) that the path is not a directory and that it does not contain any of /../, ~, or $, so I can't just drop it.

ivoriik
  • "I'm afraid it is not safe" Can you elaborate? Unless you are talking about all the required escaping making the regex error-prone, if you only use characters under `128`, a utf-8 string literal is effectively indistinguishable from a regular string literal. –  Aug 04 '21 at 19:36
  • Looked at from the other direction, an ASCII string literal with values less than 128 is also a UTF-8 encoded string. – Dave S Aug 04 '21 at 20:14
  • Replacing `$` with `\x24` (its hex ASCII value) in the string looks valid according to the rule and most likely will trick the analyzer. But this is stupid and the rule is really stupid itself. Even in the example they have `// $Id: ` and the rule only states "it is also permitted to use a character @ inside comments" so it is illegal in comments as well. – dewaffled Aug 04 '21 at 21:27
  • I haven't worked with AUTOSAR, but when I was still in the automotive industry, MISRA also had similar silly rules – phuclv Aug 05 '21 at 02:11
  • @dewaffled how is op supposed to use an escape sequence in a raw string? –  Aug 05 '21 at 03:01
  • @Frank I've indeed missed that it is a raw string. One can make a concatenated string with something like `R"(...)" "\x24" R"(+++)"` which will evaluate to `"...$+++"` if they really want to hack it this way. – dewaffled Aug 05 '21 at 09:53
  • How about a u8R"" literal? Otherwise, as Frank explained in his Option A, I'm not even sure `std::regex` should be used at all, because the first thing is the qualification of the compiler AND the libraries. If they are not qualified, you should not use them, or you need a formal deviation process. You cannot just use any kind of library, not even the standard library (which, by the way, also applies to the C standard library). – kesselhaus Aug 06 '21 at 02:46
  • @kesselhaus This prompted me to go back to the doc and notice the *prefix(optional)* part of raw string literals for the first time. Thanks for pointing this out. –  Aug 06 '21 at 15:26
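
The literal-concatenation trick mentioned in the comments can be sketched as follows. This is only an illustration of how adjacent string literals merge, not a recommendation; the function name `pattern_with_escaped_dollar` is hypothetical and the pattern is the placeholder from the comment, not the real regex.

```cpp
#include <string>

// Hypothetical helper: yields "...$+++" without a literal dollar sign in the source.
// Escape sequences are not processed inside raw literals, so the hex escape
// sits in its own ordinary literal between the two raw ones.
// Keeping it separate also matters because a hex escape consumes every hex
// digit that follows it within the same literal.
inline std::string pattern_with_escaped_dollar() {
  return R"(...)" "\x24" R"(+++)";
}
```

Note that inside a regex this only helps where a literal dollar character is meant. An escaped code point is not an end-of-input anchor, so the anchors in the original pattern would still need separate treatment.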

1 Answer

I feel like making the code worse for the sake of satisfying an analyser is counterproductive and most likely violates the spirit of the guidelines, so I'm intentionally ignoring ways to address the problem that would involve building the regex string in a convoluted manner, since what you did is the best way to build such a regex string.

Could someone help me with a workaround? Or, if std::regex is not the best solution here, what would be better?

Option A: Write a simple validation function:

I'm actually surprised that such strict guidelines even allow regexes in the first place. They are notoriously hard to audit, debug, and maintain.

You could easily express the same logic with actual code, which would not only satisfy the analyser, but be more aligned with the spirit of the guidelines. On top of that it'll compile faster and probably run faster as well.

Something along these rough lines, based on a cursory reading of your regex. (please don't just use this without running it through a battery of tests, I sure didn't):

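#include <array>
#include <string_view>

// Note: std::string_view needs C++17 and starts_with needs C++20.
// Strips a recognised scheme prefix from `path` and reports whether one was found.
bool check_and_remove_path_prefix(std::string_view& path) {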
  constexpr std::array<std::string_view, 2> valid_prefixes = { 
    R"(ro://)", 
    R"(rw://)"
  };

  for(auto p: valid_prefixes) {
    if(path.starts_with(p)) {
      path.remove_prefix(p.size());
      return true;
    }
  }
  return false;
}

bool is_valid_path_elem_char(char c) {
  // This matches your regex, but is probably wrong, as it will accept a bunch of control characters.
  // N.B. \x24 is the dollar sign character
  return c != '~' && c != '\x24' && c != '\r' && c != '\n';
}
 
bool is_valid_path(std::string_view path) {
  if(!check_and_remove_path_prefix(path)) { return false; }

  char prev_c = '\0';
  bool current_segment_empty = true;
  for(char c : path) {
    // Disallow two or more consecutive periods
    if( c == '.' && prev_c == '.') { return false; }

    // Disallow empty segments
    if(c == '/') {
      if(current_segment_empty) { return false; }
      current_segment_empty = true;
    }
    else {
      if(!is_valid_path_elem_char(c)) { return false; }
      current_segment_empty = false;
    }
    
    prev_c = c;
  }

  return !current_segment_empty;
}
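
Since the question mentions AUTOSAR C++14, and `std::string_view` and `starts_with` are only available from C++17 and C++20, here is a minimal sketch of the same logic restricted to C++14 facilities. It assumes the project really is limited to C++14; the names `scheme_prefix_length` and `is_valid_uri_cxx14` are hypothetical, and like the version above it is untested against a real test battery.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length of the matched scheme prefix,
// or 0 if the input starts with neither "ro://" nor "rw://".
std::size_t scheme_prefix_length(const std::string& uri) {
  const char* const prefixes[] = { "ro://", "rw://" };
  for (const char* p : prefixes) {
    const std::string prefix(p);
    // compare() clamps the length, so inputs shorter than the prefix are safe.
    if (uri.compare(0, prefix.size(), prefix) == 0) {
      return prefix.size();
    }
  }
  return 0;
}

// Hypothetical C++14 counterpart of is_valid_path from the answer above.
bool is_valid_uri_cxx14(const std::string& uri) {
  const std::size_t start = scheme_prefix_length(uri);
  if (start == 0) { return false; }

  char prev_c = '\0';
  bool current_segment_empty = true;
  for (std::size_t i = start; i < uri.size(); ++i) {
    const char c = uri[i];

    // Disallow two or more consecutive periods
    if (c == '.' && prev_c == '.') { return false; }

    if (c == '/') {
      // Disallow empty segments (this also rejects a leading slash)
      if (current_segment_empty) { return false; }
      current_segment_empty = true;
    } else {
      // \x24 is the dollar sign character
      if (c == '~' || c == '\x24' || c == '\r' || c == '\n') { return false; }
      current_segment_empty = false;
    }

    prev_c = c;
  }

  // A trailing slash (or an empty path) leaves the last segment empty
  return !current_segment_empty;
}
```

With this sketch, "ro://path/to/file" is accepted, while "rw://a/../b", "ro://dir/", and "ro://" are rejected. As in the version above, a path beginning with a slash is treated as having an empty first segment, which is stricter than the original regex.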

Option B: Don't bother with the check

It's hard for us to tell whether this option is on the table for you, but for all intents and purposes, the distinction between a badly formed path and a well-formed path that does not point to a valid file is moot.

So just use the path as if it were valid. You should be handling the errors that would result from a badly formed path anyway.
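
A minimal sketch of what Option B could look like, assuming the scheme prefix has already been stripped and that the file is read through `std::ifstream`. The helper name `try_read_first_line` is hypothetical; the point is only that the open and read failures are handled where they occur, instead of being predicted up front.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: attempts to read the first line of the file at `path`.
// Returns false if the file cannot be opened or nothing can be read from it,
// which covers malformed paths and missing files alike.
bool try_read_first_line(const std::string& path, std::string& out) {
  std::ifstream in(path);
  if (!in.is_open()) { return false; }
  return static_cast<bool>(std::getline(in, out));
}
```

The caller then treats a `false` result the same way regardless of whether the path was malformed or simply did not exist.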