
I’m building a Django web application to store documents and their associated metadata.

The bulk of the metadata will be stored in the underlying MySQL database, with the OCR’d document text indexed in Elasticsearch to enable full-text search. I’ve incorporated django-elasticsearch-dsl to connect and synchronize my data models, as I’m also indexing (and thus, double-storing) a few other fields found in my models. I had considered using Haystack, but it lacks support for the latest Elasticsearch versions.

When a document is uploaded via the application’s admin interface, a post_save signal automatically triggers an asynchronous Celery background task that performs the OCR and ultimately indexes the extracted text into Elasticsearch.

Since I don’t have a full-text field defined in my model (and hope to avoid adding one, as I don’t want to store or search against CLOBs in the database), I’m looking for the best practice for updating my Elasticsearch documents from my tasks.py file. There doesn’t seem to be a way to do so using django-elasticsearch-dsl (but maybe I’m wrong?), so I’m wondering if I should either:

  1. Try to interface with Elasticsearch via REST using the sister django-elasticsearch-dsl-drf package.

  2. Integrate my application more loosely with Elasticsearch by using the more vanilla elasticsearch-dsl-py package (based on elasticsearch-py). I’d lose some “luxury” with this approach, as I’d have to write a bit more integration code, at least if I want to wire up my models with signals.

Is there a best practice? Or another approach I haven’t considered?

Update 1: In trying to implement the answer from @Nielk, I'm able to persist the OCR'd text (result = "test" in tasks.py below) into Elasticsearch, but it's also being persisted in the MySQL database. I'm still confused about how to configure Submission.rawtext as a pure pass-through to Elasticsearch.

models.py:

class Submission(models.Model):

  rawtext = models.TextField(null=True, blank=True)
  ...
  def type_to_string(self):
    return ""

documents.py:

@registry.register_document
class SubmissionDocument(Document):

  rawtext = fields.TextField(attr="type_to_string")

  def prepare_rawtext(self, instance):
    # self.rawtext = None
    # instance.rawtext = "test"

    return instance.rawtext

  ... 

tasks.py (called on Submission model post_save signal):

  @shared_task
  def process_ocr(my_uuid):

    result = "test" # will ultimately be OCR'd text

    instance = Submission.objects.get(my_uuid=my_uuid)
    instance.rawtext = result
    instance.save()

Update 2 (Working Solution):

models.py

class Submission(models.Model):

   @property
   def rawtext(self):
      if getattr(self, '_rawtext_local_change', False):
         return self._rawtext
      if not self.pk:
         return None
      from .documents import SubmissionDocument
      try:
         return SubmissionDocument.get(id=self.pk)._rawtext
      except Exception:  # e.g. the document has not been indexed yet
         return None

   @rawtext.setter
   def rawtext(self, value):
      self._rawtext_local_change = True
      self._rawtext = value

documents.py

   @registry.register_document
   class SubmissionDocument(Document):

      rawtext = fields.TextField()

      def prepare_rawtext(self, instance):
         return instance.rawtext

tasks.py

   @shared_task
   def process_ocr(my_uuid):

      result = "test" # will ultimately be OCR'd text

      # note that you must do a save on property fields, can't do an update
      instance = Submission.objects.get(my_uuid=my_uuid)
      instance.rawtext = result
      instance.save()
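
The note about `save()` in tasks.py deserves a word of explanation: django-elasticsearch-dsl re-indexes in response to Django's model signals, and `QuerySet.update()` neither calls `save()` nor sends `post_save`. The framework-free sketch below only illustrates that difference. Every name in it (`ToySubmission`, `post_save_handlers`, `bulk_update`, the `table` and `index` dicts) is a hypothetical stand-in, not part of Django or django-elasticsearch-dsl.

```python
post_save_handlers = []   # stand-in for Django's post_save signal
index = {}                # stand-in for the Elasticsearch index
table = {}                # stand-in for the MySQL table


def reindex(instance):
    # Plays the role of the document's prepare step: copies the
    # in-memory rawtext value into the index.
    index[instance.pk] = {"rawtext": instance.rawtext}


post_save_handlers.append(reindex)


class ToySubmission:
    def __init__(self, pk, title):
        self.pk = pk
        self.title = title
        self.rawtext = None   # lives in memory only, never written to `table`

    def save(self):
        # Only real columns are written to the table...
        table[self.pk] = {"title": self.title}
        # ...and then every registered handler runs, as with post_save.
        for handler in post_save_handlers:
            handler(self)


def bulk_update(pk, **columns):
    # Mimics QuerySet.update(): writes columns directly, runs no handlers.
    table[pk].update(columns)


doc = ToySubmission(pk=1, title="scan.pdf")
doc.save()                             # first save indexes rawtext=None

doc.rawtext = "test"                   # what the OCR task would set
doc.save()                             # handlers run, index is refreshed

bulk_update(1, title="renamed.pdf")    # table changes, index is untouched
```

After this runs, the toy index holds the OCR text while the toy table never receives a `rawtext` column, which is the behaviour Update 2 is after.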
littleK

1 Answer


You can add extra fields to the document definition linked to your model (see the 'type_to_string' attribute in the documentation: https://django-elasticsearch-dsl.readthedocs.io/en/latest/fields.html#using-different-attributes-for-model-fields ) and combine this with a 'prepare_xxx' method that initializes the field to an empty string when the instance is created and to its current value on update. Would that solve your problem?

Edit 1 - Here's what I meant:

models.py

class Submission(models.Model):
    @property
    def rawtext(self):
        if getattr(self, '_rawtext_local_change', False):
            return self._rawtext
        if not self.pk:
            return None
        from .documents import SubmissionDocument
        return SubmissionDocument.get(meta__id=self.pk).rawtext

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value

Edit 2 - fixed code typo
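
To make the lookup order concrete, here is a minimal, framework-free sketch of the same property pattern. `FakeIndex` and `FakeSubmission` are hypothetical stand-ins (a plain dict instead of Elasticsearch, a plain class instead of a Django model), so this only illustrates the getter/setter logic and not the real django-elasticsearch-dsl API.

```python
class FakeIndex:
    """Hypothetical stand-in for the Elasticsearch index: a dict keyed by pk."""

    def __init__(self):
        self._docs = {}

    def get(self, pk):
        # Raises KeyError when no document exists for this pk.
        return self._docs[pk]

    def index(self, pk, rawtext):
        self._docs[pk] = {"rawtext": rawtext}


INDEX = FakeIndex()


class FakeSubmission:
    """Hypothetical stand-in for the Submission model (no database involved)."""

    def __init__(self, pk=None):
        self.pk = pk

    @property
    def rawtext(self):
        # 1. A value assigned on this instance wins over the indexed one.
        if getattr(self, "_rawtext_local_change", False):
            return self._rawtext
        # 2. An unsaved instance cannot have an indexed document yet.
        if not self.pk:
            return None
        # 3. Otherwise read from the index, tolerating a missing document.
        try:
            return INDEX.get(self.pk)["rawtext"]
        except KeyError:
            return None

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value


INDEX.index(1, "text from an earlier OCR run")

stored = FakeSubmission(pk=1)
print(stored.rawtext)              # falls through to the index

stored.rawtext = "fresh OCR text"
print(stored.rawtext)              # the local change takes precedence
```

The getter never touches a database column, so nothing about `rawtext` needs to exist in the model's table for this pattern to work.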

Nielk
  • When you say “linked to your model”, does that mean the field value will persist in Elasticsearch but not in the MySQL database? Or will it ultimately be stored in the database as well? – littleK Apr 18 '20 at 13:12
  • It will persist in elasticsearch only, yes. django elasticsearch dsl is designed to have documents structure based on your django models, not the other way around. – Nielk Apr 18 '20 at 14:47
  • Perfect. Last question: can I then import an instance of my model into my tasks.py file and do instance.raw_text = “my ocr text”? – littleK Apr 18 '20 at 14:57
  • That should work if 'raw_text' is also the name of the index field. I think a nice solution would be to add a 'raw_text' property in the model class, that fetches the data from elasticsearch. Depending on your data access pattern, you could optimize and initialize the field for multiple instances with a single multiget call. – Nielk Apr 18 '20 at 18:25
  • My attempt to implement your suggestion has revealed some holes in my thinking. See the "Update 1" post (with code) above. I'm able to index the OCR'd text from tasks.py, but it's still persisting in the database. How do I ensure that doesn't happen, and how would I set-up the model to fetch the latest value from Elasticsearch? – littleK Apr 20 '20 at 18:38
  • @littleK I updated my reply with some code to illustrate what I mean. It looks a bit hacky, but it's convenient if you need to access `rawtext` seamlessly from other parts of your app. If you don't need that, then you can move all the logic for getting the current value of `rawtext` into the `prepare_rawtext` method of `SubmissionDocument`. – Nielk Apr 20 '20 at 19:21
  • Your example helps to explain things, thanks @Nielk. I was able to incorporate it successfully (had to tweak meta.id=self.pk to meta__id=self.pk to avoid a “keyword can’t be an expression” error), but I’m still seeing the value getting saved to the database, which I’m hoping to avoid... – littleK Apr 20 '20 at 23:24
  • did you run `makemigrations` / `migrate` after you updated the model ? – Nielk Apr 21 '20 at 06:05
  • I did, yes. Should I still need a type_to_string function on the model or a prepare_text function on the document that is explicitly setting the value to “”? – littleK Apr 21 '20 at 20:23
  • No, you don't need the `type_to_string` function in the model. It came from the django_elasticsearch_dsl (ded) example, which illustrates that ded can compute a field from a model's method, but you don't need it if you already have a `prepare_rawtext` method in the document. Just keep: `@registry.register_document class SubmissionDocument(Document): rawtext = fields.TextField() def prepare_rawtext(self, instance): return instance.rawtext ...` – Nielk Apr 23 '20 at 06:05
  • About the model field, how exactly do you know that it's still in the database? If the model no longer declares the field, a migration should have been generated. You can check that at the SQL level with a `python manage.py dbshell` session. But it may still appear in the admin view, since you have a property with the same name. Maybe Django got confused by that; you can try removing the `rawtext` property from the model, running `makemigrations` / `migrate`, and then adding the property getter / setter again. Hope that helps... – Nielk Apr 23 '20 at 06:08
  • I've updated the OP with the working solution (see Update 2). I had a few things wrong (i.e. had forgotten to comment out the rawtext field on the model) but, in the end, your solution was correct. Thank you so much! Note that I had to add a try/except around "return SubmissionDocument.get(id=self.pk)._rawtext", and had also changed "meta__id" to "id". – littleK Apr 24 '20 at 15:59