I’ve been working on a hobby project: a Django + React site that provides analytics and data viz for texts. I’ll most likely host it on AWS. The user uploads a CSV of texts. Currently the texts get stored in the db, and when the user calls the API it runs the analytics on them and returns the results. I’m trying to decide whether to keep storing the raw text data (what I have now) or to run the analytics once when the texts are uploaded and then discard them, storing only the analytics.
My thoughts are:
Raw data:
pros:
- changes to the analytics won’t require re-uploading
- probably simpler db schema
cons:
- more sensitive data (not sure how safe it is in a Django db on AWS, or what measures I could put in place to protect it further)
- more data to store (not sure what it would cost to store a lot of rows of texts)
Analytics:
pros:
- less sensitive, less space
cons:
- if something goes wrong with the analytics on the first run without throwing an error, the results could be inaccurate and would stay that way, since the raw texts are gone and can’t be re-analyzed
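
To make the analytics-only option concrete, here is a minimal sketch of what I have in mind, in plain Python with no Django involved. Everything in it is made up for illustration: the `analyze_upload` function, the `text` column name, the `ANALYTICS_VERSION` constant, and the particular stats, which are placeholders rather than my real analytics. The idea is that only the returned summary would be saved, tagged with a version so I can at least tell which uploads were processed by which version of the logic.

```python
import csv
import io
from collections import Counter

# Hypothetical: bump this whenever the analytics logic changes.
ANALYTICS_VERSION = 1


def analyze_upload(csv_text, text_column="text"):
    """Compute summary analytics from an uploaded CSV and return only the summary.

    The raw texts are not kept anywhere; they go out of scope on return.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    word_counts = Counter()
    n_texts = 0
    n_words = 0
    for row in reader:
        # Placeholder analytics: lowercase, split on whitespace, count words.
        words = row[text_column].lower().split()
        n_texts += 1
        n_words += len(words)
        word_counts.update(words)
    return {
        "analytics_version": ANALYTICS_VERSION,
        "n_texts": n_texts,
        "n_words": n_words,
        "avg_words_per_text": n_words / n_texts if n_texts else 0.0,
        "top_words": word_counts.most_common(3),
    }


if __name__ == "__main__":
    sample = "text\nhello world\nhello again there\n"
    print(analyze_upload(sample))
```

The version tag doesn’t fix the con above (a silently wrong result still can’t be recomputed without the raw texts), but it would let me flag the affected uploads and ask those users to re-upload.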