Is there a way to determine whether a given commit, or just the current changes, consists of nothing more than formatting changes?

Such that

int main(){}
int main ( ) {
}

would be considered unchanged.

Would be nice for

int main(){}
intmain(){}

to be considered as changed, but not necessary.

I suppose it's not that easy, but maybe you know of some existing solutions written in Python or whatever.

Hrisip
  • C++'s grammar is horribly complicated (non-context-free, yey) - while whitespace in C/C++ is insignificant, it's very difficult to tell if a given whitespace character is inside a string or not without writing a full ISO-compliant parser - so what you're asking is actually very, very difficult. – Dai Dec 14 '20 at 21:56
  • @Dai that's only strings, not a full parser. – Hrisip Dec 14 '20 at 22:03
  • FWIW, before C++11, `std::vector<std::vector<int>>` and `std::vector< std::vector<int> >` are two different bits of code per the grammar, and the former will fail to compile because `>>` is treated as an operator. Not sure if there are other gotchas like that, but it's something to consider. – NathanOliver Dec 14 '20 at 22:03
  • You could reformat in a pre-commit hook and check if they are equal. This [answer](https://stackoverflow.com/a/841083/980129) and [this](https://gist.github.com/kblomqvist/bb59e781ce3e0006b644) may help you. – Manuel Dec 14 '20 at 22:29
  • @NathanOliver: There are at least also C-string prefixes, UDLs, and macro concatenation. – Jarod42 Dec 15 '20 at 03:46
  • @Hrisip But that's my point: Without using an existing C++ parser (with preprocessor macro engine) I feel it's impossible to determine if two C++ programs share identical syntax when disregarding extraneous whitespace - I mentioned strings because whitespace in strings is significant, which means you need to be able to accurately tell what regions of a file are in a string or not - but doing so is difficult considering preprocessor macros, "stringification" / "stringizing", and C++11's raw-string-literals (which are hell to deal with already) let alone everything else in C++ like `>>` vs `> >` – Dai Dec 16 '20 at 00:53

1 Answer

The usual way to solve this problem is to bring everything to a common format, then compare the result. That is, given two programs in source language X (for any arbitrary X) and a formatter or pretty-printer for language X, where you don't know if programs P1 and P2 are "the same program" but with different ideas of what a "pretty" layout looks like, we simply run both P1 and P2 through the pretty-printer and compare the outputs. If the outputs are identical, the two programs differ only in layout.

How well this works tends to depend in part on the pretty-printer, since some are more sensitive to initial conditions than others. For instance, some will reflow comments and some won't (or they may have control flags for this).
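As a minimal sketch of the normalize-then-compare idea, the Python below pipes two versions of a file through a formatter and compares the results. It assumes `clang-format` is installed and on `PATH`; the function names, the default `LLVM` style, and the `input.cpp` placeholder filename are all hypothetical choices made for illustration. Whether two layouts really normalize to the same text depends on the formatter and its settings, as noted above.

```python
import subprocess
from typing import Callable


def clang_format(source: str, filename: str = "input.cpp", style: str = "LLVM") -> str:
    """Format source text by piping it through clang-format (assumed to be installed)."""
    result = subprocess.run(
        # --assume-filename only tells clang-format which language to expect.
        ["clang-format", f"--style={style}", f"--assume-filename={filename}"],
        input=source,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def is_layout_only_change(
    old: str, new: str, formatter: Callable[[str], str] = clang_format
) -> bool:
    """Return True if both versions are identical after formatting."""
    return formatter(old) == formatter(new)
```

Calling `is_layout_only_change(old_text, new_text)` with the two versions of a file then answers the question for that file. The formatter is a parameter so that a different pretty-printer, or an AST dumper, can be swapped in.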

Note that detecting non-layout-only vs layout-only differences between any two particular snapshots of some program is not really a Git issue, but if you're looking for a way to extract two commits independently, consider these options:

  • Use `git worktree` to extract a given commit to a given work-tree.
  • Use `git archive` to turn a commit into a tarball or other archive, which you can un-archive anywhere you like.
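The `git archive` route can be sketched in Python as follows. This assumes `git` is on `PATH`; the function names and the `repo` parameter are hypothetical. The archive is held in memory, which is fine for small repositories but not for large ones.

```python
import io
import subprocess
import tarfile
from pathlib import Path


def extract_tar_bytes(data: bytes, dest: Path) -> list[str]:
    """Unpack an in-memory tar archive into dest and return the member names."""
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def export_commit(rev: str, dest: Path, repo: str = ".") -> list[str]:
    """Write the tree of commit `rev` into dest using `git archive`."""
    data = subprocess.run(
        ["git", "-C", repo, "archive", "--format=tar", rev],
        check=True,
        capture_output=True,
    ).stdout
    return extract_tar_bytes(data, dest)
```

Exporting two commits into two separate directories gives two independent snapshots that can each be run through the formatter and then compared file by file.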

If you wish to enforce a consistent style, consider a pre-commit hook (see Manuel's comment). Note, however, that writing a good pre-commit hook is pretty hard, because Git builds commits not from what is in a user's work-tree, but rather from what is in Git's index. For some useful tricks, see, e.g., the pre-commit hook here, which is not perfect, but does at least work for common cases.
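To illustrate the index-versus-work-tree point, here is a rough sketch of a hook body that checks the staged content rather than the files on disk. It is not the hook referred to above. It assumes `git` and `clang-format` are on `PATH`; the function names, the suffix list, and the `LLVM` style are hypothetical. `git diff --cached` lists what is staged, and `git show :path` prints the index copy of a file.

```python
import subprocess
from typing import Callable, Iterable


def staged_paths(suffixes=(".c", ".cc", ".cpp", ".h", ".hpp")) -> list[str]:
    """List added, copied, or modified files in the index that look like C/C++ sources."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.split("\0") if p.endswith(suffixes)]


def staged_content(path: str) -> str:
    """Read a file as it exists in the index, not in the work-tree."""
    return subprocess.run(
        ["git", "show", f":{path}"], check=True, capture_output=True, text=True
    ).stdout


def clang_format(source: str, filename: str) -> str:
    """Format source text with clang-format (style chosen arbitrarily here)."""
    return subprocess.run(
        ["clang-format", "--style=LLVM", f"--assume-filename={filename}"],
        input=source, check=True, capture_output=True, text=True,
    ).stdout


def badly_formatted(
    paths: Iterable[str],
    read: Callable[[str], str],
    fmt: Callable[[str, str], str],
) -> list[str]:
    """Return the paths whose content changes when run through the formatter."""
    bad = []
    for path in paths:
        text = read(path)
        if fmt(text, path) != text:
            bad.append(path)
    return bad


def run_hook() -> int:
    """Return a nonzero exit status if any staged file is not formatted."""
    bad = badly_formatted(staged_paths(), staged_content, clang_format)
    for path in bad:
        print(f"not formatted: {path}")
    return 1 if bad else 0
```

To try it, the script could be saved as `.git/hooks/pre-commit`, made executable, and ended with `raise SystemExit(run_hook())`. This sketch only reports problems; it does not rewrite the index, and it ignores renames and non-UTF-8 files.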

torek
  • The *"formatter"* can also be something that gives a different view of the code, such as an AST. (I might also say a compiler producing an executable, but comments are stripped out.) – Jarod42 Dec 15 '20 at 03:49
  • @Jarod42: annoyingly, compiler outputs these days are full of date-and-time-stamps and other such things that make for non-repeatable builds. Getting ASTs from the source files is good, if you can do that. – torek Dec 15 '20 at 07:00
  • Forgot about timestamps :/ but there are some "solutions" for getting [Deterministic-builds-with-C-C++](https://blog.conan.io/2019/09/02/Deterministic-builds-with-C-C++.html), and for the AST, see for example [how-to-view-clang-ast](https://stackoverflow.com/questions/18560019/how-to-view-clang-ast). – Jarod42 Dec 15 '20 at 09:48