
Following up on this question here.

I finally wrote up a code generation tool to wrap all my database data into something like this:

Pdtfaamt(fano=212373,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='A3',itemamt=75,type=0).save()
Pdtfaamt(fano=212374,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='E1',itemamt=75,type=0).save()
Pdtfaamt(fano=212375,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='E6',itemamt=75,type=0).save()
Pdtfaamt(fano=212376,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='C',itemno='A3',itemamt=3,type=1).save()

Yes, that's right! I pulled the entire database out and transformed the data into population instructions so that I can migrate my database up to GAE.

So I deployed the django-nonrel project and used the django-nonrel remote API to trigger the data population process.
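For reference, django-nonrel's remote support sits on top of App Engine's remote_api; stripped of the django-nonrel wiring, binding a script to the production datastore looks roughly like this (the app hostname below is a placeholder):

import getpass
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    # Prompt for App Engine credentials.
    return raw_input('Email: '), getpass.getpass('Password: ')

# '/_ah/remote_api' is the default handler path; swap in your own hostname.
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app-id.appspot.com')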

It works, except for one problem: it's extremely slow. Could anyone tell me how I can improve the speed? By my calculations, it could take up to 30 days to get all my data up and running on GAE.

PS: I am using django-nonrel, with djangoappengine for the backend.

Winston Chen

2 Answers


Write your import script to take advantage of Python's multiprocessing.Pool:

import multiprocessing

def import_thing(data):
    # Build the entity from a dict of keyword arguments and save it.
    thing = ThingEntity(**data)
    thing.put()

def main():
    data = [{'fano': '212374', 'comsname': 'SMM'},
            {'fano': '212375', 'comsname': 'SMM'},
            # ...etc
           ]
    pool = multiprocessing.Pool(4)  # 4 worker processes; tune this number
    pool.map(import_thing, data)

Since the App Engine production servers handle many concurrent connections well, you should play around with the pool size to find the best number. This will not work for importing to the dev server, as it's single-threaded.
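If you'd rather measure than guess, a throwaway timing loop over candidate pool sizes is enough. This is only a sketch: it reuses import_thing from above on a small sample of your data, and note that each pass really does write the sample to the datastore.

import time

def find_pool_size(sample_data, sizes=(2, 4, 8, 16)):
    # Time the same sample with different pool sizes; the fastest wins.
    for size in sizes:
        pool = multiprocessing.Pool(size)
        start = time.time()
        pool.map(import_thing, sample_data)
        pool.close()
        pool.join()
        print size, 'workers:', time.time() - start, 'seconds'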

Also important: ensure you are putting entities in batches of, say, 10-20 rather than one at a time, or the round-trips will kill your performance. An improved script should work in chunks, like:

from google.appengine.ext import db  # assuming the plain App Engine db API

def import_batch(batch):
    # One datastore round-trip per batch: db.put accepts a list of entities.
    db.put([ThingEntity(**item) for item in batch])

data = [
    [item1, item2, item3],
    [item4, item5, item6],
    [item7, item8, item9],
]
pool.map(import_batch, data)
Chris Farmiloe
  • I am rewriting the code-gen according to your suggestions. However, I get something like this: `File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 422, in get raise self._value TypeError: __init__() keywords must be strings`. Are we allowed to use multiprocessing on GAE, or is it failing for some other reason? – Winston Chen Jun 30 '11 at 07:01
  • By the way, I am still using the remote console to trigger the code. – Winston Chen Jun 30 '11 at 07:03
  • Your error sounds like you have a dict with unicode-string keys. Ensure you pass your `**kwargs` as `{"prop": "value"}` and not `{u'prop': 'value'}` – Chris Farmiloe Jun 30 '11 at 09:11
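To make that last comment concrete: on Python 2, keyword-argument names must be byte strings, so if the generated dicts arrive with unicode keys (e.g. from JSON), a small conversion helper (hypothetical, not part of the original script) clears the error:

def str_keys(d):
    # **kwargs keys must be byte strings on Python 2, not unicode.
    return dict((str(k), v) for k, v in d.items())

thing = ThingEntity(**str_keys(data))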

You probably want to look into the Mapper API.
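For reference, a mapper there is just a function the framework calls once per datastore entity, yielding mutations instead of writing directly. A minimal sketch following the appengine-mapreduce docs (the per-entity tweak is purely hypothetical):

from mapreduce import operation as op

def process(entity):
    # Called once per entity by the mapper framework; yield the write
    # instead of calling put() so the framework can batch it.
    entity.fatype = entity.fatype.upper()  # hypothetical transformation
    yield op.db.Put(entity)

Note that a mapper operates on entities already in the datastore, so it is more useful for fixing up data after an import than for the initial upload itself.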

Daniel Roseman