How much can we trust warnings generated by static analysis tools for vulnerability detection?

Question

I am running flawfinder on a set of libraries written in C/C++, and it generates a lot of warnings. How much can I rely on these warnings? For example, consider the following function from the numpy library (https://github.com/numpy/numpy/blob/4ada0641ed1a50a2473f8061f4808b4b0d68eff5/numpy/f2py/src/fortranobject.c):

static PyObject *
fortran_doc(FortranDataDef def)
{
    char *buf, *p;
    PyObject *s = NULL;
    Py_ssize_t n, origsize, size = 100;

    if (def.doc != NULL) {
        size += strlen(def.doc);
    }
    origsize = size;
    buf = p = (char *)PyMem_Malloc(size);
    if (buf == NULL) {
        return PyErr_NoMemory();
    }

    if (def.rank == -1) {
        if (def.doc) {
            n = strlen(def.doc);
            if (n > size) {
                goto fail;
            }
            memcpy(p, def.doc, n);
            p += n;
            size -= n;
        }
        else {
            n = PyOS_snprintf(p, size, "%s - no docs available", def.name);
            if (n < 0 || n >= size) {
                goto fail;
            }
            p += n;
            size -= n;
        }
    }
    else {
        PyArray_Descr *d = PyArray_DescrFromType(def.type);
        n = PyOS_snprintf(p, size, "'%c'-", d->type);
        Py_DECREF(d);
        if (n < 0 || n >= size) {
            goto fail;
        }
        p += n;
        size -= n;

        if (def.data == NULL) {
            n = format_def(p, size, def) == -1;
            if (n < 0) {
                goto fail;
            }
            p += n;
            size -= n;
        }
        else if (def.rank > 0) {
            n = format_def(p, size, def);
            if (n < 0) {
                goto fail;
            }
            p += n;
            size -= n;
        }
        else {
            n = strlen("scalar");
            if (size < n) {
                goto fail;
            }
            memcpy(p, "scalar", n);
            p += n;
            size -= n;
        }
    }
    if (size <= 1) {
        goto fail;
    }
    *p++ = '\n';
    size--;

    /* p now points one beyond the last character of the string in buf */
#if PY_VERSION_HEX >= 0x03000000
    s = PyUnicode_FromStringAndSize(buf, p - buf);
#else
    s = PyString_FromStringAndSize(buf, p - buf);
#endif

    PyMem_Free(buf);
    return s;

 fail:
    fprintf(stderr, "fortranobject.c: fortran_doc: len(p)=%zd>%zd=size:"
                    " too long docstring required, increase size\n",
            p - buf, origsize);
    PyMem_Free(buf);
    return NULL;
}

There are two memcpy() API calls, and flawfinder tells me that:

vul_fortranobject.c:216: [2] (buffer) memcpy:
  Does not check for buffer overflows when copying to destination (CWE-120).
  Make sure destination can always hold the source data.
  memcpy(p, "scalar", n);

I am not sure whether this report is a true positive.

Q: Which are you more worried about: false positives (the analyzer warning you about things that aren't necessarily bona fide issues), or false negatives (NOT reporting issues you SHOULD be warned about)? — paulsm4, Jul 23 '22 at 23:58
Thanks for the comment. I think the statement on line 216 is not vulnerable, since n is always equal to 6. But since there is no manual check, can we say that the warning generated by flawfinder is actually a true positive? If we go by that rule, the warning is a true positive. But when the program runs, n is always 6, so a buffer overflow never happens in this particular case. In other words, does the lack of a manual check make the warning a true positive? — Nima shiri, Jul 24 '22 at 00:06
@paulsm4 on line 153, p is declared as a char pointer. Then, on line 162, at least 100 bytes are dynamically allocated for it. So p can actually hold the source, and the call is not vulnerable by nature. But flawfinder still reports a possible buffer overflow. — Nima shiri, Jul 24 '22 at 00:10

paulsm4 · Accepted Answer · 2022-07-24T03:12:22.647

To answer your question: static analysis tools (like FlawFinder) can generate a LOT of "false positives".
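One reason for this: FlawFinder's own documentation describes it as an intentionally simple tool that matches the names of risky functions in the source text, without control-flow or data-flow analysis. A minimal sketch of what that means in practice is below. The two helpers, `copy_checked` and `copy_unchecked`, are hypothetical names made up for illustration; I'd expect a name-based scanner to report both `memcpy` calls alike, even though only the second one depends entirely on the caller for safety.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies only when the destination has room. */
int copy_checked(char *dst, size_t dst_size, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n > dst_size) {
        return -1;           /* refuse: the copy would overflow dst */
    }
    memcpy(dst, src, n);     /* safe here, yet still a call named "memcpy" */
    return 0;
}

/* Hypothetical helper: the same call with no bounds check at all. */
void copy_unchecked(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);     /* safety depends entirely on the caller */
}
```

With an 8-byte buffer, `copy_checked(buf, sizeof buf, "scalar", 6)` succeeds and `copy_checked(buf, sizeof buf, "0123456789", 10)` is refused. The warning text is the same for both functions, so it's up to the reviewer to look for the guard.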

I Googled to find some quantifiable information for you, and found an interesting article about "DeFP":

https://arxiv.org/pdf/2110.03296.pdf

Static analysis tools are frequently used to detect potential vulnerabilities in software systems. However, an inevitable problem of these tools is their large number of warnings with a high false positive rate, which consumes time and effort for investigating. In this paper, we present DeFP, a novel method for ranking static analysis warnings.

Based on the intuition that warnings which have similar contexts tend to have similar labels (true positive or false positive), DeFP is built with two BiLSTM models to capture the patterns associated with the contexts of labeled warnings. After that, for a set of new warnings, DeFP can calculate and rank them according to their likelihoods to be true positives (i.e., actual vulnerabilities).

Our experimental results on a dataset of 10 real-world projects show that using DeFP, by investigating only 60% of the warnings, developers can find +90% of actual vulnerabilities. Moreover, DeFP improves the state-of-the-art approach 30% in both Precision and Recall.

Apparently, the authors built a neural network to analyze FlawFinder results, and rank them.

I doubt DeFP is a practical "solution" for you. But yes: if you think that specific "memcpy()" warning is a "false positive" - then I'm inclined to agree. It very well could be :)
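To see why, it can help to pull the flagged branch out on its own. The sketch below is a hypothetical reduction of the "scalar" branch of `fortran_doc()`, not code from numpy: `append_scalar` is an invented name, and the `goto fail` is replaced by a `-1` return. The point is that the `size < n` check runs before the copy, so the `memcpy` can never write more than the remaining space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical reduction of the "scalar" branch in fortran_doc():
 * append the literal at p only if at least `size` bytes remain there.
 * Returns the number of bytes written, or -1 if it does not fit.
 */
ptrdiff_t append_scalar(char *p, ptrdiff_t size)
{
    ptrdiff_t n = (ptrdiff_t)strlen("scalar");   /* always 6 */

    assert(p != NULL);
    if (size < n) {
        return -1;                    /* stands in for `goto fail` */
    }
    memcpy(p, "scalar", (size_t)n);   /* Flawfinder: ignore */
    return n;
}
```

The trailing `Flawfinder: ignore` comment is the tool's documented way to suppress a hit on that line. I'd only add it after a manual review like this one, since it hides the warning for good.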
