I'm using pycparser to parse some C code that I cannot compile with cpp before parsing, so I'm manually stripping off all comments and preprocessor directives with the following function:

import re

def remove_comments(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/') or s.startswith('#'):
            return ""
        else:
            return s

    pattern = re.compile(
        r'#.*?$|//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

This is the output of this function on the memmgr.c file from examples:

typedef ulong Align;

union mem_header_union
{
    struct 
    {


        union mem_header_union* next;



        ulong size; 
    } s;



    Align align_dummy;
};

typedef union mem_header_union mem_header_t;



static mem_header_t base;



static mem_header_t* freep = 0;



static byte pool[POOL_SIZE] = {0};
static ulong pool_free_pos = 0;


void memmgr_init()
{
    base.s.next = 0;
    base.s.size = 0;
    freep = 0;
    pool_free_pos = 0;
}


static mem_header_t* get_mem_from_pool(ulong nquantas)
{
    ulong total_req_size;

    mem_header_t* h;

    if (nquantas < MIN_POOL_ALLOC_QUANTAS)
        nquantas = MIN_POOL_ALLOC_QUANTAS;

    total_req_size = nquantas * sizeof(mem_header_t);

    if (pool_free_pos + total_req_size <= POOL_SIZE)
    {
        h = (mem_header_t*) (pool + pool_free_pos);
        h->s.size = nquantas;
        memmgr_free((void*) (h + 1));
        pool_free_pos += total_req_size;
    }
    else
    {
        return 0;
    }

    return freep;
}










void* memmgr_alloc(ulong nbytes)
{
    mem_header_t* p;
    mem_header_t* prevp;





    ulong nquantas = (nbytes + sizeof(mem_header_t) - 1) / sizeof(mem_header_t) + 1;




    if ((prevp = freep) == 0)
    {
        base.s.next = freep = prevp = &base;
        base.s.size = 0;
    }

    for (p = prevp->s.next; ; prevp = p, p = p->s.next)
    {

        if (p->s.size >= nquantas) 
        {

            if (p->s.size == nquantas)
            {



                prevp->s.next = p->s.next;
            }
            else 
            {
                p->s.size -= nquantas;
                p += p->s.size;
                p->s.size = nquantas;
            }

            freep = prevp;
            return (void*) (p + 1);
        }



        else if (p == freep)
        {
            if ((p = get_mem_from_pool(nquantas)) == 0)
            {

                printf("!! Memory allocation failed !!\n");

                return 0;
            }
        }
    }
}

But I'm getting this ParseError:

pycparser.plyparser.ParseError: :1:15: before: Align

What is wrong with pycparser?

Vektor88

2 Answers

I suppose that there was a typedef unsigned long ulong; in some included file. Without that declaration, ulong cannot appear where the grammar requires a typename.

Try adding the declaration of ulong, somewhere before its first use.


To address the question "What is wrong with pycparser?" more specifically:

The goal of pycparser is to parse C programs. It is not an approximate parser; it actually aims to produce a complete, accurate parse of any valid C99 program.

Unfortunately, it is impossible to accurately parse a C program without knowing which identifiers are typenames. It is not necessary to know the precise type of an identifier, so pycparser doesn't require access to all prototypes and global definitions; it does, however, require access to all relevant typedefs.

This is documented in section 3.2 of pycparser's readme file, which points to a longer discussion at the author's website:

The key point to understand here is that pycparser doesn't really care about the semantics of types. It only needs to know whether some token encountered in the source is a previously defined type. This is essential in order to be able to parse C correctly.

As suggested by Eli, your best bet is to collect just the typedefs used by the code you want to analyze, and insert them at the beginning of the code. There probably are not too many of them.
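That suggestion can be sketched in a few lines of Python. Since only the fact that a name is a type matters, each missing name is declared with a placeholder `typedef int NAME;`. The helper name `prepend_placeholder_typedefs` and the choice of `int` as the stand-in type are assumptions made for illustration; neither is part of pycparser.

```python
def prepend_placeholder_typedefs(source, type_names):
    """Return `source` with one placeholder typedef per name put in front.

    The real underlying types do not matter to the parser, so `int` is
    used as a stand-in (an assumption made for this sketch).
    """
    header = "".join(f"typedef int {name};\n" for name in type_names)
    return header + source


# A two-line excerpt in the style of the stripped memmgr.c output.
stripped = "typedef ulong Align;\nstatic byte pool[4];\n"
patched = prepend_placeholder_typedefs(stripped, ["ulong", "byte"])
print(patched)
```

The patched text can then be passed to `pycparser.c_parser.CParser().parse(patched)`. Keep in mind that the added lines shift the line numbers reported in any later error messages.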

Eli Bendersky's essay is excellent, and well worth reading. Let me just provide a couple of examples of C code which cannot be parsed without knowing whether a name is a typedef or not.

The classic example is, I think, well-known:

(foo) - 3 * 4

This expression has two possible parses, only one of which can apply in any given program. On the left, the parse if foo is a variable; on the right, the parse if foo is a type:

    -                              *
   / \                            / \
  /   \                          /   \
foo    *                       cast   4
      / \                      /  \
     /   \                    /    \
    3     4                 foo     -   
                                    |
                                    |
                                    3

In other words, if foo is a variable, the expression subtracts 3*4 from foo. But if foo is a type, the expression casts -3 to type foo and then multiplies the result by 4.
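To make the dependence on the typedef table concrete, here is a toy recursive-descent parser in Python. It is only a sketch: it handles just numbers, identifiers, parentheses, `-`, `*` and casts, it is unrelated to pycparser's actual implementation, and all names in it (`tokenize`, `parse`, the tuple-shaped tree) are invented for this illustration. The only thing that changes between the two calls at the end is the set of known type names.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Split text into ("num", n), ("id", name) and ("op", char) tokens."""
    tokens = []
    for number, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("num", int(number)))
        elif name:
            tokens.append(("id", name))
        elif op.strip():
            tokens.append(("op", op))
    return tokens


def parse(text, typenames=frozenset()):
    """Parse a tiny C-like expression into nested tuples."""
    tokens = tokenize(text) + [("end", None)]
    pos = 0

    def peek(offset=0):
        return tokens[min(pos + offset, len(tokens) - 1)]

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():  # expr := term ('-' term)*
        node = term()
        while peek() == ("op", "-"):
            advance()
            node = ("-", node, term())
        return node

    def term():  # term := unary ('*' unary)*
        node = unary()
        while peek() == ("op", "*"):
            advance()
            node = ("*", node, unary())
        return node

    def unary():  # unary := '-' unary | '(' TYPENAME ')' unary | primary
        if peek() == ("op", "-"):
            advance()
            return ("neg", unary())
        # The decision below is where the typedef table is consulted.
        is_cast = (peek() == ("op", "(") and peek(1)[0] == "id"
                   and peek(1)[1] in typenames and peek(2) == ("op", ")"))
        if is_cast:
            advance()
            type_name = advance()[1]
            advance()
            return ("cast", type_name, unary())
        return primary()

    def primary():  # primary := NUMBER | IDENT | '(' expr ')'
        kind, value = advance()
        if kind in ("num", "id"):
            return value
        if (kind, value) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    tree = expr()
    if peek()[0] != "end":
        raise SyntaxError("trailing input")
    return tree


as_variable = parse("(foo) - 3 * 4")
as_type = parse("(foo) - 3 * 4", typenames={"foo"})
print(as_variable)
print(as_type)
```

With an empty set of type names the result is the subtraction tree on the left of the diagram; with `foo` registered as a type it is the multiplication-of-a-cast tree on the right.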

Apparently, the particular application from which this question was derived does not actually require detailed knowledge about the parse of every expression. But there is no way to communicate this fact to pycparser; pycparser is intended to provide a complete parse.

In any case, it is possible to construct a possibly more relevant example. Consider the two statements (which cannot appear in the same translation unit):

foo (*bar()) ();

and

foo (*bar()) ();

Despite their similarity ( :-) ), these two statements are completely different. The first one declares a function named bar. The second one calls a function named foo, after calling a function named bar to compute the argument to foo.

Even if you were just collecting declarations, it would be important to know whether that statement is a declaration or not. (Yes, that construct is very rare; it might not appear anywhere in the code-base being analyzed. But how does pycparser know that?)

Here are the complete contexts:

#include <stdio.h>                 | #include <stdio.h>
typedef int foo;                   |
int answer() { return 42; }        | int answer() { return 42; }
                                   |
                                   | int (*foo(int a)) () {
                                   |   printf("%d\n", a);
                                   |   return answer;
                                   | }
                                   |
                                   | static const int unused = 43;
int (*bar()) () { return answer; } | int const* bar() { return &unused; }
                                   |
int main() {                       | int main() {
  /* Declare bar */                |  /* Call foo */
  foo (*bar()) ();                 |  foo (*bar()) ();
  printf("%d\n", bar()());         |  return 0;
  return 0;                        | }
}                                  |

rici
  • what if I only care about syntax parsing of arbitrary code snippets in order to, let's say, list all methods? – Vektor88 Sep 11 '15 at 19:47
  • @Vektor88: It is not possible to parse C code without knowing which identifiers are typenames. (It's possible to get an approximate parse, but I think pycparser wants to correctly parse the input.) Why can't you preprocess the file? – rici Sep 11 '15 at 19:58
  • I'm only caring about grammatical aspects of the code, so I don't really need to satisfy dependencies and if I wanted to process the whole linux kernel code it would take ages and lots of efforts to satisfy all the dependencies. Apparently pycparser isn't for me. – Vektor88 Sep 12 '15 at 07:18
  • @Vektor88: Perhaps pycparser isn't the tool you are looking for, but any C parser is likely to produce similar results. I added some examples in an attempt to explain why it is impossible to parse C without knowing whether an identifier is the name of a type or the name of a variable/function. – rici Sep 12 '15 at 17:55

The C preprocessor doesn't only strip comments. It also handles all preprocessor directives, including macro expansion and #include, which pulls in header files. In your case, the source file has #include "memmgr.h", and that header contains this typedef:

typedef unsigned long ulong;

Along with some others.

The bottom line is - you should invoke the preprocessor before pycparser. If you have valid C code, there's no reason the preprocessor shouldn't work. This may be a topic for a separate SO question in the C tag.
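As a rough sketch of what that looks like from Python: pycparser's `parse_file` accepts `use_cpp`, `cpp_path` and `cpp_args`, and `cpp_args` may be a list of strings. The two helper functions below, `build_cpp_args` and `parse_preprocessed`, are hypothetical names for this example, and the sketch assumes a `cpp` executable is on the PATH. The include directory and the `POOL_SIZE` value are placeholders, not values taken from the project.

```python
def build_cpp_args(include_dirs=(), defines=None):
    """Build a list of cpp command-line arguments (-I and -D flags)."""
    args = [f"-I{path}" for path in include_dirs]
    for name, value in (defines or {}).items():
        args.append(f"-D{name}={value}")
    return args


def parse_preprocessed(filename, include_dirs=(), defines=None):
    """Run cpp on `filename` and return the AST built by pycparser."""
    # Imported lazily so build_cpp_args stays usable without pycparser.
    from pycparser import parse_file

    return parse_file(
        filename,
        use_cpp=True,
        cpp_path="cpp",  # assumes cpp is installed and on the PATH
        cpp_args=build_cpp_args(include_dirs, defines),
    )


# Placeholder values, for illustration only.
example_args = build_cpp_args(["."], {"POOL_SIZE": 8192})
print(example_args)
```

Calling `parse_preprocessed("memmgr.c", include_dirs=["."])` would then let the preprocessor pull in memmgr.h, typedefs included, before pycparser sees the text.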

Eli Bendersky
  • As already explained, I want to parse some arbitrary code and get a list of functions, variables and so on. Satisfying dependencies is ok when your purpose is checking if your code "will work", but time consuming and useless if you only want some approximate information about the structure of the code, and the code might even be a method declaration extracted from some more complex code. – Vektor88 Sep 12 '15 at 13:55