0

Below, I produce both broken code and a fixed version of the same. The problem is that I am not able to fully explain to myself why the former doesn't work but the latter does. I obviously need to review some very basic concept of the C++ language: could you provide pointers as to what I should review, and possibly also give an explanation as to why I get the results I get with the broken code.

In the '../docs/' directory referred in the code, I simply used the 'touch' command on linux to create a number of doc......html files of various length.

#include <iostream>
#include <regex>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

int main() {
        fs::path p("../docs/");

        for (auto& dir_it : fs::directory_iterator(p)) {
                std::regex re = std::regex("^(doc[a-z]+)\\.html$");
                std::smatch matches;
                // BROKEN HERE:
                if (std::regex_match(dir_it.path().filename().string(), matches, re)) {
                        std::cout << "\t\t" <<dir_it.path().filename().string();
                        std::cout << "\t\t\t" << matches[1] << std::endl;
                }
        }

        return 0;
}

Produces:

        documentati.html                        ati
        documentationt.html                     �}:ationt
        document.html                   document
        documenta.html                  documenta
        docume.html                     docume
        documentat.html                 documentat
        docum.html                      docum
        documentatio.html                       ��:atio
        documen.html                    documen
        docu.html                       docu
        documentation.html                      ��:ation
        documaeuaoeu.html                       ��:aoeu

Note 1: The bug above is triggered with filenames which are above a certain length. I only understand that's because the std::string object is resizing itself.

Note 2: The above code is very similar than the code used in the following question, but with boost::regex_match instead of std::regex_match: Can I use a mask to iterate files in a directory with Boost?
It used to work for me before as well, but now I use GCC 5.4 instead of GCC 4.6, std::regex instead of boost::regex, C++11, and a much newer version of boost::filesystem. Which change is relevant, that caused working code to get broken?

Fixed:

#include <iostream>
#include <regex>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

int main() {
        fs::path p("../docs/");

        for (auto& dir_it : fs::directory_iterator(p)) {
                std::regex re = std::regex("^(doc[a-z]+)\\.html$");
                std::smatch matches;
                std::string p = dir_it.path().filename().string();
                if (std::regex_match(p, matches, re)) {
                        std::cout << "\t\t" <<dir_it.path().filename().string();
                        std::cout << "\t\t\t" << matches[1] << std::endl;
                }
        }

        return 0;
}

produces:

        documentati.html                        documentati
        documentationt.html                     documentationt
        document.html                   document
        documenta.html                  documenta
        docume.html                     docume
        documentat.html                 documentat
        docum.html                      docum
        documentatio.html                       documentatio
        documen.html                    documen
        docu.html                       docu
        documentation.html                      documentation
        documaeuaoeu.html                       documaeuaoeu

Using boost 1.62.0-r1 and gcc (Gentoo 5.4.0-r3), the boost::filesystem documentation does not appear to provide any clear indication as to what path().filename().string() returns: http://www.boost.org/doc/libs/1_62_0/libs/filesystem/doc/reference.html#path-filename
It appears that it depends:
Why does boost::filesystem::path::string() return by value on Windows and by reference on POSIX?

augustin
  • 14,373
  • 13
  • 66
  • 79

1 Answers1

2

std::smatch doesn't store a copy of the characters comprising the match, but only pairs of iterators into the original string that was searched.

The first fragment passes a temporary string to std::regex_match. By the time you get around to examining matches, that string is gone and the iterators held by matches are all dangling. The program then exhibits undefined behavior by trying to use them.

In the second fragment, the string passed to std::regex_match is still alive by the time matches is used, so all is well.


Note 1: That the problem manifests differently depending on file name length is likely due to small string optimization. It's common for std::string implementations to reserve a small buffer in the class instance, and store short strings there; whereas longer strings are allocated on the heap. It's possible that the stack space previously occupied by the temp string is still intact, while the heap space has been reused for other allocations. The program exhibits undefined behavior either way - "seems to work" is one possible manifestation of UB.


Note 2: It doesn't matter much whether path::string() returns by value or by reference. path::filename() returns by value, so dir_it.path().filename().string() would get destroyed too early either way, whether it's a member of a temporary dir_it.path().filename() object, or manufactured on the fly by string().

Igor Tandetnik
  • 50,461
  • 4
  • 56
  • 85