Why are static analysis tools missing this seemingly obvious case?

Question

I have a very simple C program with a potential buffer overflow using strcpy:

#include <string.h>
#include <stdio.h>

void buffer_overflow(char* dst, const char* src)
{
        strcpy(dst, src);
}

int main(int argc, char** argv)
{
        if(argc == 2)
        {
                char buffer[16] = {0};
                buffer_overflow(buffer, argv[1]);
                printf("[%d]: %s", (int)strlen(buffer), buffer);
        }

        return 0;
}

Neither clang static analyzer (using scan-build gcc -O0 -g3 -gdwarf-2) nor cppcheck (using cppcheck --enable=warning,style) find this as an issue.

Am I just asking too much from my static analysis tools?

Not sure what the problem is, from a static analysis point of view, this program is 100% fine? It's the _dynamic_ parameters that are troublesome, but how is static analysis supposed to foresee what you will enter? Assume you did it "properly" and first called `strlen`, then reserved memory. What if I call your program with an argument that's two gigabytes long? — Damon, Mar 08 '19 at 21:04
`cppcheck` finds the `strcpy()` as problematic if I don't hide it away in the {{buffer_overflow()}} question. Clang static analyzer _specifically_ has an analyzer for the `strcpy` API. — Chad, Mar 08 '19 at 21:06
@Damon: the point of a static analysis is to tell you that there is a circumstance in which your application can go wrong. Clearly that can happen with large sizes of argv[1], so in theory the analyzers could detect this. (After all, you can detect it, why can't an algorithm?) — Ira Baxter, Mar 08 '19 at 22:18

Ira Baxter · Answer 1 · 2019-03-09T17:59:17.483

I can't speak for the quality of "your" static analysis tools.

Here's a dynamic analysis tool, CheckPointer, from my company that finds the problem(s) with your code (which I tested as "buggy.c"):

C:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example>runexample
RunExample.cmd 1.2: Batch file to execute C CheckPointer example
Copyright (C) 2011-2016 Semantic Designs; All Rights Reserved
c:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example\Source\buggy.c
*** Instrument source code for memory access checking
Copyright (C) 2011 Semantic Designs; All Rights Reserved
C~GCC4 CheckPointer Version 1.2.1001
Copyright (C) 2011-2016 Semantic Designs, Inc; All Rights Reserved; SD Confidential
Powered by DMS (R) Software Reengineering Toolkit
*** Unregistered CheckPointer Version 1.2
*** Operating with evaluation limits.
Parsing source file "C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Source/buggy.c" using encoding CP-1252 +CRLF $^J $^M $^e -1 +8 ...
Writing target file "C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Target/buggy.c" using encoding CP-1252 +CRLF $^J $^M $^e -1 +8 ...
*** Compiling sources with memory access checking code
gcc.exe -I"c:\DMS\Domains\C\GCC4\Tools\CheckPointer" -I.\Target -obuggy.exe Target\buggy.c ...


C:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example\Source>C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example\buggy.exe foo
[3]: foo

C:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example\Source>C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example\buggy.exe 0123456789ABCDE
[15]: 0123456789ABCDE

C:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example\Source>C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example\buggy.exe 0123456789ABCDEF
*** Error: CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
       Dereference of pointer is out of bounds.
  in wrapper function: strcpy
called in function: buffer_overflow, line: 6, file: C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Source/buggy.c
called in function: main, line: 14, file: C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Source/buggy.c
*** Error: CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
       Dereference of pointer is out of bounds.
  in wrapper function: strlen
called in function: main, line: 15, file: C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Source/buggy.c
*** Error: CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
       Dereference of pointer is out of bounds.
  in wrapper function: printf
called in function: main, line: 15, file: C:/Users/idbaxter/AppData/Local/Temp/DMS/Domains/C/GCC4/Tools/CheckPointer/Example/Source/buggy.c
[16]: 0123456789ABCDEF
C:\DMS\Domains\C\GCC4\Tools\CheckPointer\Example\Source>

is this from "Semantic Designs"? Does it support C++ as well? — Chad, Mar 08 '19 at 21:02
Yes, this is from Semantic Designs. When I say things like this I tend to get burned by SO commercial product haters. No, we don't have one for C++. Yet. — Ira Baxter, Mar 08 '19 at 22:19

Scott McPeak · Answer 2 · 2019-09-08T22:40:44.530

This is a fairly open-ended question that doesn't have a right answer. But as someone who has worked on several static analyzers, I'll bite and briefly explain what's hard about this example.

Problem 1: False positives

You and I know that argv[1] could be an arbitrarily long string. (In part that knowledge is derived from the context of being a Stack Overflow question, in which code is usually presented absent any context or a-prior assumptions.) But if a static analyzer reported every strcpy where the source string was not known (by the static analyzer!) to have a bounded length, it would in practice report a very large number of what most developers would consider false positives (FPs): incorrect reports that they don't want to see.

That's because there a very large number of cases where strcpy is used correctly, but the reason it is correct is beyond the reasoning capacity of the analyzer, and might not even be present in the source code at all (e.g., "program X is only ever invoked by program Y, which never passes arguments longer than 80 characters"). In comparison, only a very small fraction of strcpy calls (with arguments the analyzer can't bound) are wrong. Let's be generous and say 10% of them are wrong--that's still a 90% FP rate if we report them all, far beyond what most developers will tolerate.

When a tool reports too many FPs, most developers quickly stop using them, at which point it's providing no value. Consequently, most static analyses choose to limit what they report to cases where the tool is fairly confident, at least by default.

Problem 2: Interprocedural analysis

Even for a tool that wants to report this (as you say Cppcheck does, when the strcpy is directly in main), the second problem is understanding what buffer_overflow does. It is impractical for a static analyzer to simply "inline" the contents of callees into callers arbitrarily deeply because the resulting AST would be enormous and the number of paths astronomical. Consequently, analyzers generally summarize callee behavior. The exact form of those summaries, and the algorithms that compute them, are the subject of active academic research and closely guarded trade secrets.

The behavior of strcpy is actually fairly complex compared to what a typical function summary can express. There is a size involved, but that size is derived by examining the contents of the data one of the pointers points at, specifically the location of the first NUL byte. Then that size affects what happens to the things pointed at by both arguments. That's a lot to encode in a summary in a general and scalable way, so most tools don't. As a result, the tool has only a very crude understanding of what buffer_overflow does, typically too crude to allow it to confidently report a defect here.

Why are static analysis tools missing this seemingly obvious case?

2 Answers2

Problem 1: False positives

Problem 2: Interprocedural analysis

Linked