I need to run regularly scheduled tasks that fetch relatively large XML documents (~5 MB) and process them.
The problem I am currently experiencing is that I hit the memory limit of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
- The task is usually scheduled onto an instance that already uses 40-50 MB of memory.
- Fetching the 5 MB text file over URL Fetch raises instance memory usage to 65-75 MB.
- Decoding the fetched text to Unicode raises it to 95-105 MB.
- Passing the Unicode string to the lxml parser and accessing its root node raises it to roughly 120-150 MB.
- During the actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated; the whole flow is sketched below.
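For concreteness, here is roughly what the current pipeline looks like (the feed URL is a placeholder, and I am assuming UTF-8 content):

```python
from google.appengine.api import urlfetch
from lxml import etree

# Placeholder URL; the real feed is ~5 MB of XML.
result = urlfetch.fetch('http://example.com/feed.xml', deadline=60)

# Step that pushes usage to ~95-105 MB: a second, decoded copy of the data.
text = result.content.decode('utf-8')

# Step that pushes usage to ~120-150 MB: the whole tree lives in memory.
root = etree.fromstring(text)

for node in root:
    pass  # converting nodes to datastore models; the instance dies around here
```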
I could avoid the third step and save some memory by passing the encoded bytes directly to the lxml parser, but specifying the encoding explicitly for the lxml parser causes problems for me on GAE.
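For reference, this is the variant I mean; forcing the encoding on the parser (assumed to be UTF-8 here) is the part that misbehaves for me on GAE:

```python
from lxml import etree

def parse_bytes(xml_bytes):
    # Skip the separate Unicode copy and let lxml decode the raw bytes
    # itself. The explicit encoding (assumed UTF-8) is what causes
    # trouble for me on GAE.
    parser = etree.XMLParser(encoding='utf-8')
    return etree.fromstring(xml_bytes, parser)
```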
I could probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option would be to split the work into several smaller tasks, along the lines of the sketch below.
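A rough sketch of what I have in mind using the task queue API; the `/process_chunk` handler and the offset/limit scheme are hypothetical:

```python
from google.appengine.api import taskqueue

def enqueue_chunks(blob_key, total_items, chunk_size=500):
    # Fan the work out so no single task has to hold the whole document.
    # /process_chunk is a handler I would still have to write; it would
    # process only the given slice of the document.
    for offset in range(0, total_items, chunk_size):
        taskqueue.add(url='/process_chunk',
                      params={'blob_key': str(blob_key),
                              'offset': str(offset),
                              'limit': str(chunk_size)})
```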
Also, I could probably save the file to the Blobstore and then process it by reading it from there incrementally? As a side note, it would be convenient if the URL Fetch service allowed reading the response on demand, which would simplify processing of large documents.
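The streaming variant I am imagining would look something like this. `BlobReader` is file-like, so lxml's `iterparse` can read from it incrementally; the `'item'` tag and `handle()` are placeholders for my actual document structure:

```python
from google.appengine.ext import blobstore
from lxml import etree

def handle(elem):
    pass  # placeholder: convert the element into a datastore model

def process_blob(blob_key):
    # iterparse builds one element at a time instead of the whole tree,
    # so peak memory stays near the size of a single element.
    reader = blobstore.BlobReader(blob_key)
    for event, elem in etree.iterparse(reader, tag='item'):
        handle(elem)
        elem.clear()  # free the element we just processed
        while elem.getprevious() is not None:
            del elem.getparent()[0]  # drop already-processed siblings
```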
So, generally speaking, what is the most convenient way to perform this kind of work?
Thank you!