
We would like to upload about 30k entities to the Datastore in one go, while also creating Search API documents from strings associated with these entities.

This is to allow partial search on those strings, something the Datastore itself is not suited for.

However, we haven't been able to find any resources or documentation on how to bulk-upload documents using the Search API.

How do we go about this?

We tried using the bulkloader, which keeps giving the following error:

google.appengine.ext.db.KindError: No implementation for kind 'Prototype'

This was because we were trying to upload ndb models, but the error suggests that the bulkloader was defaulting to db.

We tried to hack our way around it by defining the class as a db.Model and uploading it. This works and the data is uploaded to the datastore; however, the _post_put_hook is never called.

Here's the code:

#models.py

import logging

from google.appengine.api import search
from google.appengine.ext import db

# site_list and gen_list (the allowed choices) are defined elsewhere in the module


class PrototypeE(db.Model):
    # db model of the data
    p_id = db.StringProperty(indexed=True, required=True)
    p_name = db.StringProperty(required=True)
    p_val = db.IntegerProperty(required=True)
    p_lnk = db.StringProperty(required=True)
    p_src = db.StringProperty(choices=site_list)
    p_create_time = db.DateTimeProperty(auto_now_add=True)
    p_update_time = db.DateTimeProperty(auto_now=True)
    p_gen = db.StringProperty(choices=gen_list)
    p_img = db.StringProperty()
    p_cat = db.StringProperty()
    p_brd = db.StringProperty()
    p_keys = db.StringProperty()

    def _post_put_hook(self, future):
        # intended to index the entity in the Search API after each put
        doc_key = future.get_result()
        doc_id = doc_key.id()
        doc = search.Document(
            doc_id=unicode(doc_id),
            fields=[
                search.TextField(name="keywords", value=self.p_keys),
                search.NumberField(name="value", value=self.p_val),
            ])  # the document to index
        logging.info(doc)
        try:
            index = search.Index(name="Store_Doc")
            index.put(doc)  # put the document into the index
        except search.Error:
            logging.exception('Doc put failed')
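
For reference, _post_put_hook is an ndb feature; a plain db.Model never calls it. Below is a minimal sketch (not from the original post) of what the same model looks like written entirely with ndb, where the hook does fire after every put. site_list, gen_list and the omitted properties are assumed to be as above.

#ndb_sketch.py (hypothetical)

import logging

from google.appengine.api import search
from google.appengine.ext import ndb


class PrototypeE(ndb.Model):
    # ndb version of the model above; remaining properties omitted for brevity
    p_id = ndb.StringProperty(required=True)
    p_name = ndb.StringProperty(required=True)
    p_val = ndb.IntegerProperty(required=True)
    p_keys = ndb.StringProperty()

    def _post_put_hook(self, future):
        # future.get_result() is the ndb.Key of the entity that was just written
        doc_id = future.get_result().id()
        doc = search.Document(
            doc_id=unicode(doc_id),
            fields=[
                search.TextField(name="keywords", value=self.p_keys),
                search.NumberField(name="value", value=self.p_val),
            ])
        try:
            search.Index(name="Store_Doc").put(doc)
        except search.Error:
            logging.exception('Doc put failed')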

And the loader:

#proto_loader.py

from google.appengine.tools import bulkloader

import models  # registers the PrototypeE kind with db


class ProtoLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'PrototypeE',
                                   [('p_id', str),
                                    ('p_name', str),
                                    ('p_val', int),
                                    ('p_lnk', str),
                                    ('p_src', str),
                                    ('p_gen', str),
                                    ('p_img', str),
                                    ('p_cat', str),
                                    ('p_brd', str),
                                    ('p_keys', str)])

loaders = [ProtoLoader]

This succeeds in uploading the data to the datastore, but the hook is not called and no documents are created.

Do we need to edit the bulkloader file to get around this issue?

UPDATE: As mentioned earlier, the reason we attempted mixing ndb and db is that we get the following error when defining the class as an ndb.Model throughout:

Traceback (most recent call last):
  File "appcfg.py", line 126, in <module>
    run_file(__file__, globals())
  File "appcfg.py", line 122, in run_file
    execfile(_PATHS.script_file(script_name), globals_)
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 5220, in <module>
    main(sys.argv)
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 5211, in main
    result = AppCfgApp(argv).Run()
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 2886, in Run
    self.action(self)
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 4890, in __call__
    return method()
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 4693, in PerformUpload
    run_fn(args)
  File "/home/stw/Google/google_appengine/google/appengine/tools/appcfg.py", line 4574, in RunBulkloader
    sys.exit(bulkloader.Run(arg_dict))
  File "/home/stw/Google/google_appengine/google/appengine/tools/bulkloader.py", line 4408, in Run
    return _PerformBulkload(arg_dict)
  File "/home/stw/Google/google_appengine/google/appengine/tools/bulkloader.py", line 4219, in _PerformBulkload
    LoadConfig(config_file)
  File "/home/stw/Google/google_appengine/google/appengine/tools/bulkloader.py", line 3886, in LoadConfig
    Loader.RegisterLoader(cls())
  File "proto_loader.py", line 40, in __init__
    ('p_keys',str)
  File "/home/stw/Google/google_appengine/google/appengine/tools/bulkloader.py", line 2687, in __init__
    GetImplementationClass(kind)
  File "/home/stw/Google/google_appengine/google/appengine/tools/bulkloader.py", line 957, in GetImplementationClass
    implementation_class = db.class_for_kind(kind_or_class_key)
  File "/home/stw/Google/google_appengine/google/appengine/ext/db/__init__.py", line 296, in class_for_kind
    raise KindError('No implementation for kind \'%s\'' % kind)
google.appengine.ext.db.KindError: No implementation for kind 'PrototypeE'

As the traceback indicates, the bulkloader assumes a db class and looks it up with db.class_for_kind, which raises a KindError when the model is defined with ndb.
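
A small illustration of why (our sketch, not from the original post): db and ndb keep separate kind registries, so the db-level lookup the bulkloader performs cannot see a model defined with ndb.

from google.appengine.ext import db
from google.appengine.ext import ndb


class DbKind(db.Model):
    pass


class NdbKind(ndb.Model):
    pass


print db.class_for_kind('DbKind')   # found: db models register with db

try:
    db.class_for_kind('NdbKind')    # ndb models register with ndb only,
except db.KindError as e:           # so the db lookup raises KindError
    print e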

    You can write 10-20 lines of code to read your entities from their current source, insert them into the Datastore and index them using the Search API. What is the problem? – Andrei Volgin Jun 23 '14 at 18:00
  • or retrieve the files from a GCS bucket and index them from there. – Paul Collingwood Jun 23 '14 at 20:47
  • It is clear that we can use bulk_upload to fetch data from a CSV file and write it to the datastore, but how do we write documents in the same manner? Do we need to set up a cron job, or is there something as straightforward as using the bulk uploader? We are new to this, so any pointers would be helpful. Thanks! – Cygorger Jun 24 '14 at 02:45
  • @Cygorger write the Search API related code in the `_post_put_hook()` method. – Gianni Di Noia Jun 24 '14 at 09:47
  • @GianniDiNoia We have tried this and are facing issues with the bulkloader. I've updated the question above with more detail; it would be great if you could take a look and let us know if you spot an error on our part. – Cygorger Jun 24 '14 at 11:18
  • There is no reason why the hook should not be called: make sure the hook is called with simple logging, then try to create your index and documents. Ignore the async ops for now. – Gianni Di Noia Jun 24 '14 at 11:48
  • @GianniDiNoia It isn't being called; we've checked with logging. It seems to be a result of the ndb/db confusion (please see my edit above). – Cygorger Jun 24 '14 at 14:09
  • @AndreiVolgin It does seem like a problem with a simple fix, but this is bedeviling us. Can you please suggest what we might be doing wrong in the code we've pasted above? Would appreciate any pointers tremendously, thanks! – Cygorger Jun 24 '14 at 14:11
  • You are mixing `db` and `ndb`. Your model code is based on `db`, and it doesn't have `hooks`. If you are using `db`, then read http://blog.notdot.net/2010/04/Pre--and-post--put-hooks-for-Datastore-models . If you are using `ndb`, amend your model code to reflect reality, then we might be able to help further. – Tim Hoffman Jun 25 '14 at 01:11
  • @TimHoffman We tried using ndb throughout, which resulted in the error we posted in our update above. It is because bulkloader.py seems to expect a db.Model and uses db.class_for_kind to fetch the class. How do we get around this? Thanks – Cygorger Jun 25 '14 at 04:08
  • That's because the bulkloader isn't designed to use ndb, and the bulkloader isn't seeing any action from Google. I feel you should roll your own: upload the CSV file to GCS and read it via a task. – Tim Hoffman Jun 25 '14 at 08:05 (a rough sketch of this approach follows after the thread)
  • @TimHoffman Thanks Tim. That seems to be the case, and frankly we are quite puzzled by how confusingly muddled the documentation is, often mixing ndb, db, HRD and master-slave into the same articles. I am also puzzled by the other posters' insistence that it is a simple task. What are we possibly missing? Also, what are the advantages of uploading a CSV to GCS over using the remote_api_shell? Is it speed of datastore writes? Thanks – Cygorger Jun 25 '14 at 12:11
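
Following the suggestion in the comments (upload the CSV to GCS and process it from a task instead of using the bulkloader), here is a rough sketch of what that could look like. It is only an illustration: the bucket path, CSV column names and batch size are assumptions, and models.PrototypeE is assumed to be defined as an ndb.Model so that _post_put_hook runs and creates the search documents.

#gcs_csv_loader.py (hypothetical sketch)

import csv

import cloudstorage  # GoogleAppEngineCloudStorageClient library
from google.appengine.ext import ndb

import models  # assumed to define PrototypeE as an ndb.Model


def load_prototypes(gcs_path='/my-bucket/prototypes.csv', batch_size=500):
    # read the whole CSV from GCS; fine for ~30k small rows
    with cloudstorage.open(gcs_path) as f:
        rows = list(csv.DictReader(f.read().splitlines()))

    entities = [models.PrototypeE(p_id=row['p_id'],
                                  p_name=row['p_name'],
                                  p_val=int(row['p_val']),
                                  p_lnk=row['p_lnk'],
                                  p_keys=row['p_keys'])
                for row in rows]

    # put_multi calls _post_put_hook for every entity, so the Search API
    # documents are created as a side effect of each batch write
    for i in range(0, len(entities), batch_size):
        ndb.put_multi(entities[i:i + batch_size])


# run this outside the upload request, e.g. with the deferred library:
# from google.appengine.ext import deferred
# deferred.defer(load_prototypes, '/my-bucket/prototypes.csv')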

0 Answers