I need to run regularly scheduled tasks that fetch relatively large XML documents (~5 MB) and process them.
The problem I am currently experiencing is that I hit the memory limit of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
- The task is usually scheduled onto an instance that already uses 40-50 MB of memory.
- Fetching the 5 MB text file over URL Fetch raises instance memory usage to 65-75 MB.
- Decoding the fetched text to Unicode raises it to 95-105 MB.
- Passing the Unicode string to the lxml parser and accessing its root node raises it to roughly 120-150 MB.
- During the actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated; the whole flow is sketched below.
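For concreteness, here is roughly what the current pipeline looks like (the feed URL is a placeholder, and I am assuming UTF-8 content):

```python
from google.appengine.api import urlfetch
from lxml import etree

# Placeholder URL; the real feed is ~5 MB of XML.
result = urlfetch.fetch('http://example.com/feed.xml', deadline=60)

# Step that pushes usage to ~95-105 MB: a second, decoded copy of the data.
text = result.content.decode('utf-8')

# Step that pushes usage to ~120-150 MB: the whole tree lives in memory.
root = etree.fromstring(text)

for node in root:
    pass  # converting nodes to datastore models; the instance dies around here
```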
I could avoid the third step and save some memory by passing the encoded bytes directly to the lxml parser, but specifying the encoding explicitly for the lxml parser causes problems for me on GAE.
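For reference, this is the variant I mean; forcing the encoding on the parser (assumed to be UTF-8 here) is the part that misbehaves for me on GAE:

```python
from lxml import etree

def parse_bytes(xml_bytes):
    # Skip the separate Unicode copy and let lxml decode the raw bytes
    # itself. The explicit encoding (assumed UTF-8) is what causes
    # trouble for me on GAE.
    parser = etree.XMLParser(encoding='utf-8')
    return etree.fromstring(xml_bytes, parser)
```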
I could probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option would be to split the work into several smaller tasks, along the lines of the sketch below.
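A rough sketch of what I have in mind using the task queue API; the `/process_chunk` handler and the offset/limit scheme are hypothetical:

```python
from google.appengine.api import taskqueue

def enqueue_chunks(blob_key, total_items, chunk_size=500):
    # Fan the work out so no single task has to hold the whole document.
    # /process_chunk is a handler I would still have to write; it would
    # process only the given slice of the document.
    for offset in range(0, total_items, chunk_size):
        taskqueue.add(url='/process_chunk',
                      params={'blob_key': str(blob_key),
                              'offset': str(offset),
                              'limit': str(chunk_size)})
```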
Also, I could probably save the file to the Blobstore and then process it by reading it from there incrementally? As a side note, it would be convenient if the URL Fetch service allowed reading the response on demand, which would simplify processing of large documents.
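The streaming variant I am imagining would look something like this. `BlobReader` is file-like, so lxml's `iterparse` can read from it incrementally; the `'item'` tag and `handle()` are placeholders for my actual document structure:

```python
from google.appengine.ext import blobstore
from lxml import etree

def handle(elem):
    pass  # placeholder: convert the element into a datastore model

def process_blob(blob_key):
    # iterparse builds one element at a time instead of the whole tree,
    # so peak memory stays near the size of a single element.
    reader = blobstore.BlobReader(blob_key)
    for event, elem in etree.iterparse(reader, tag='item'):
        handle(elem)
        elem.clear()  # free the element we just processed
        while elem.getprevious() is not None:
            del elem.getparent()[0]  # drop already-processed siblings
```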
So, generally speaking, what is the most convenient way to perform this kind of work?
Thank you!