2

I'm reading lists of integers from a large file, which I eventually want to store as array.arrays. Because

map(int, line.split())

is very slow, I wrote a small C module which uses strtok and a faster version of atoi:

inline long
minhashTables_myatoi(const char* s)
{
    long r;
    for (r = 0; *s; r = r * 10 + *s++ - '0');
    return r;
}

static PyObject*
minhashTables_ints(PyObject *self, PyObject *args)
{
    char* s;
    Py_ssize_t slen;

    if(!PyArg_ParseTuple(args, "s#", &s, &slen))
        return NULL;

    long* buf = malloc(sizeof(long) * (slen+1)/2);

    const char* tok = strtok(s, " ");
    buf[0] = minhashTables_myatoi(tok);
    Py_ssize_t i;
    for(i = 1; (tok = strtok(NULL, " ")) != NULL; i++)
        buf[i] = minhashTables_myatoi(tok);

    Py_ssize_t buflen = i;
    PyObject* list = PyList_New(buflen);
    PyObject *o;
    for(i = 0; i < buflen; i++)
    {
        o = PyInt_FromLong(buf[i]);
        PyList_SET_ITEM(list, i, o);
    }
    free(buf);

    return list;
}

So my Python script calls ints() with a string, passes the resulting list to the array.array constructor, and saves the resulting array in a list.

My problem is that the script now leaks memory, which it did not do when using map instead of the ints() function, of course.

Also, using my own C-module version of Python's int() does not leak memory.
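
Such a wrapper looks roughly like this (a minimal sketch, not my exact code; the name minhashTables_myint is just illustrative):

static PyObject*
minhashTables_myint(PyObject *self, PyObject *args)
{
    const char* s;

    if(!PyArg_ParseTuple(args, "s", &s))
        return NULL;

    /* PyInt_FromLong returns a new reference owned by the caller,
       so there is nothing extra to free here. */
    return PyInt_FromLong(minhashTables_myatoi(s));
}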

Thanks for your help!

Edit: To valgrind the module I used this script:

import minhashTables

data = ' '.join(map(str, range(10)))
print 'start'
foo = minhashTables.ints(data)
del data
del foo
print 'stop'

And I run valgrind --tool=memcheck --leak-check=full --show-reachable=yes python test.py, but there is no output from valgrind between start and stop, though there are tons before and afterwards.

Edit: Code for confirming it's leaking:

import minhashTables

for i in xrange(1000000000):
    data = ' '.join(map(str, range(10, 10000)))
    foo = minhashTables.ints(data)

I have to recreate the string, because strtok changes it. By the way, copying the string into another memory location doesn't change the behavior.
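
For reference, here is a minimal sketch of how that copy can be made inside the extension itself, so strtok never touches the Python string's buffer (the function name and the omitted tokenizing part are placeholders):

static PyObject*
minhashTables_ints_copying(PyObject *self, PyObject *args)
{
    const char* s;
    Py_ssize_t slen;

    if(!PyArg_ParseTuple(args, "s#", &s, &slen))
        return NULL;

    /* Private, writable copy of the incoming string data. */
    char* copy = malloc(slen + 1);
    if(copy == NULL)
        return PyErr_NoMemory();
    memcpy(copy, s, slen);
    copy[slen] = '\0';

    /* ... run the same strtok/myatoi loop as above on `copy` ... */

    free(copy);
    return PyList_New(0);  /* placeholder; the real code returns the filled list */
}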

chuck
  • 493
  • 3
  • 9
  • 2
    You might want to have a look at the NumPy functions [loadtxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt) and [genfromtxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt). They are pretty fast. – Sven Marnach Feb 10 '11 at 09:43
  • I have to store different parts of the file into different arrays, so unfortunately I don't want to read the whole file. What's s#? – chuck Feb 10 '11 at 09:52
  • s# is a format specifier for PyArg_ParseTuple that gives you both a pointer to the string data, and the string's length. See http://docs.python.org/c-api/arg.html – jchl Feb 10 '11 at 09:58
  • 4
    I'm not sure if it's the cause of your memory leak, but calling `strtok(s)` is a bad idea. s is a pointer to the existing Python string data; you're not supposed to modify that data. – jchl Feb 10 '11 at 10:00
  • 1
    @chuck: You can also pass `StringIO` instances to `loadtxt()` and `genfromtxt()` (which basically allows to pass in strings). You don't need to write this yourself. – Sven Marnach Feb 10 '11 at 10:03
  • 1
    I can't see any problem with the refcounting in this code. Just a thought: if you implemented the same algorithm in Python, would it be fast enough? Is the problem that Python is intrinsically slow, or just that .split() performs badly on large strings? – jchl Feb 10 '11 at 10:07
  • @Sven: Good call, you're right I should try that. – chuck Feb 10 '11 at 10:19
  • @jchl: Thanks for the refcount check! The map code spent a lot of time in split(), int() and also map(), so I guess you might be right that large strings are the problem. – chuck Feb 10 '11 at 10:21
  • And how much memory does it leak? Maybe comparing it across your data and other data sets would help to find the root cause of the leak? Does the leak differ on different data sets? – Mikhail Churbanov Feb 10 '11 at 11:11
  • `numpy.fromstring()` works fine, but even though it's specialized, it's not as fast as the C code. – chuck Feb 11 '11 at 07:22
  • As much as I love Python, may I suggest that if you're dealing with an array with around 1 billion integers, you're using the wrong programming language. Python is perfectly fast enough for many applications, but with that amount of data you're going to want every ounce of speed you can get. – jchl Feb 11 '11 at 09:23
  • @jchl I understand your arguments and I share the same concern. But this is just a prototype where I put into C the things which are very unlikely to change. I love Python even more after seeing it handle that much data so rapidly. I'm not primarily after speed. – chuck Feb 11 '11 at 12:47

3 Answers

2

I suggest you take a look at Valgrind - it is a very useful tool for getting to the bottom of memory leaks in C.

GrahamS
  • 9,980
  • 9
  • 49
  • 63
  • Do you mean running valgrind on the python interpreter which is running my script? I'm familiar with valgrind but I'm having trouble digging through the output of that. – chuck Feb 10 '11 at 10:26
  • Can you limit the output by creating a little test script just containing your two functions and a test harness, then run valgrind on python interpreting that script? – GrahamS Feb 10 '11 at 10:41
  • Yes. I put it into the question above. – chuck Feb 10 '11 at 10:53
  • @chuck: so how do you know it leaks? And do you know how much it is leaking? – GrahamS Feb 10 '11 at 10:57
  • I notice it uses about 4 times the memory compared to using map(int, line.split()) (~40GB instead of ~10GB). – chuck Feb 10 '11 at 11:04
  • @chuck: 40GB?? Wow! I take it the file is very big then. Do you have evidence that it is actually leaking, or could it just be using more memory while it executes? – GrahamS Feb 10 '11 at 11:57
  • 1
    The file is about 9GB big. I have no hard evidence that it is actually leaking. It's more that both the Python and the C version create lists to construct arrays, and I expect the same memory usage patterns, as the lists and their content should be deallocated shortly after the array is created and before the next list is created. Each list might take up a few hundred MB, but nowhere near enough to explain the difference of 30GB. (After the loading is done and the actual calculation begins I expect all difference to be gone, because the loading scope is gone.) – chuck Feb 10 '11 at 12:40
  • @chuck: Okay, so *possibly* it isn't a leak? You could do a quick test by calling your test script in a loop to see if the memory used increases. – GrahamS Feb 10 '11 at 14:42
  • @GrahamS: Good call, Sir! Indeed, the memory does increase. I'll post code above. – chuck Feb 10 '11 at 16:14
  • 1
    @chuck I'm guessing this is not actually a leak, but just a case of there being freed memory that hasn't been returned to the operating system. How are you measuring the memory usage? I suggest you look at the data returned by `mallinfo()` (which you can call from Python using `ctypes`) to get an accurate picture of how much memory your application is actually using. – jchl Feb 11 '11 at 09:27
1

Do you really need to malloc off space for all those longs?

I'm not familiar with the Python/C API, so this might be terrible advice, but can't you just iterate over the string and append each long you find onto the list as you go?

i.e. take this code:

#include <stdio.h>

static const char* const testString = "12 345  67  8 910 11 1213 141516, 1718";

int main(void)
{
    const char* i = testString;
    long parseLong = 0;
    int gotLong = 0;

    for (;*i;++i)
    {
        if ('0' <= *i && *i <= '9')
        {
            parseLong = (parseLong * 10) + (*i - '0');
            gotLong = 1;
        }
        else if (gotLong)
        {
            printf("Got: %ld\n", parseLong);
            parseLong = 0;
            gotLong = 0;
        }
    }

    if (gotLong)
        printf("Got: %ld\n", parseLong);

    return 0;
}

And then replace the printf with some suitable pythony-goodness like PyList_Append().

As well as avoiding malloc, using less memory and being able to safely operate directly on the constant Python string, this code also handles corner cases like empty strings, multiple spaces and other separators between numbers.
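
Something like this might work as the extension function (a rough, untested sketch; as I said, I'm not familiar with the Python/C API, so double-check the refcounting. The function name is just illustrative, and I'm using the plain "s" format since the length isn't needed):

static PyObject*
minhashTables_ints_append(PyObject *self, PyObject *args)
{
    const char* s;
    const char* i;
    long parseLong = 0;
    int gotLong = 0;

    if(!PyArg_ParseTuple(args, "s", &s))
        return NULL;

    PyObject* list = PyList_New(0);
    if (list == NULL)
        return NULL;

    for (i = s; ; ++i)
    {
        if ('0' <= *i && *i <= '9')
        {
            parseLong = (parseLong * 10) + (*i - '0');
            gotLong = 1;
        }
        else if (gotLong)
        {
            /* PyList_Append adds its own reference, so drop ours afterwards. */
            PyObject* o = PyInt_FromLong(parseLong);
            if (o == NULL || PyList_Append(list, o) < 0)
            {
                Py_XDECREF(o);
                Py_DECREF(list);
                return NULL;
            }
            Py_DECREF(o);
            parseLong = 0;
            gotLong = 0;
        }

        if (!*i)   /* the terminating NUL also flushes a trailing number, then we stop */
            break;
    }

    return list;
}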


Edit: Counting the Longs

If you want to count the number of longs first, so that you can allocate a Python list of the correct length, then you could just add in something like this:

    /* needs #include <ctype.h>; reuses i and gotLong from the code above */
    long longCount = 0;
    gotLong = 0;

    for (i = testString;*i;++i)
    {
        const int isdigitoflong = isdigit(*i);

        if (!gotLong && isdigitoflong)
            longCount++;

        gotLong = isdigitoflong;
    }

Which should be relatively quick.


Edit 2: Better Parser
Here's a slightly better version of the parser above that's a bit more compact, doesn't need gotLong and doesn't have to repeat code to deal with the final long:

    for (i = testString; *i; ++i)
    {
        if (isdigit(*i))
        {
            do {
                parseLong = (parseLong * 10) + (*i - '0');
            } while (*++i && isdigit(*i));

            printf("Got: %ld\n", parseLong);
            parseLong = 0;

            if (!*i)    /* stop here so the outer ++i doesn't step past the NUL */
                break;
        }
    }
GrahamS
  • 9,980
  • 9
  • 49
  • 63
  • My intention was to avoid the reallocation overhead of `PyList_Append()` (saw that advice in another question). I like your solution and I think I will adopt the parsing part of it, although I don't really care about destroyed or empty strings. As you can imagine a few MB of memory usage are not my concern, if you consider the accumulated size of the leak. – chuck Feb 10 '11 at 16:26
  • @chuck: Yeah I wondered if you were trying to avoid that. I'm not sure how Python implements lists, so it might be worth benchmarking to see what the real-life difference is. Also you mentioned that the lists are taking up a few hundred MB. If each line is also huge then you might find it is actually faster to run this parser in two passes: first pass to just count the number of longs, then allocate your list and make a second pass to actually parse them and place them in the list. – GrahamS Feb 10 '11 at 17:28
  • @chuck: I've added code to my answer to show how you could quickly count the longs in the string for the first pass. – GrahamS Feb 10 '11 at 18:04
  • @chuck: And I've added a better parser too. :) – GrahamS Feb 10 '11 at 18:27
-1

Try this

inline long
minhashTables_myatoi(const char* s)
{
    long result = 0;
    while((*s) != '\0'){
        result = result * 10 + (*s - '0');
        s++;
    }
    return result;
}
Hugh Bothwell
  • 55,315
  • 8
  • 84
  • 99
Simonne
  • 1
  • 1
  • 1
    Can you elaborate as to why this would fix the memory leak? – jhwist Feb 10 '11 at 09:49
  • This clearly doesn't fix the memory leak, and what's more, isn't obviously better than the original version (especially since the formatting is messed up). Though I can't believe that either version is faster than `atoi`. – jchl Feb 10 '11 at 10:10
  • @jchl: At first I couldn't believe it either, but timing showed it is faster, so I kept it. – chuck Feb 10 '11 at 10:18