2

So I have a Django app that I'm currently working on; this app will compute the Euclidean distance for 2000+ data points.

I'm using this data to build a recommendation system with content-based filtering. Content-based filtering works like this: when you click an item, the system finds the other items whose features are closest to it. I have already figured out the features. What I need is: when a person clicks an item, I calculate the Euclidean distance between its features and those of the other items, and I get the result. So I will use the Euclidean distances of all possible combinations. Because I run the recommendation every X hours, I need to store all combinations of distances.
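To make what I mean concrete, here is a minimal sketch of the lookup (the array name, shape, and top_k value are placeholders for illustration, not my actual code):

    import numpy as np

    # One row per item, one column per feature, values normalized to 0.0-1.0.
    # Random data here is purely illustrative.
    features = np.random.rand(2000, 2)

    def recommend(clicked_index, features, top_k=10):
        # Euclidean distance from the clicked item to every other item.
        diffs = features - features[clicked_index]
        distances = np.sqrt((diffs ** 2).sum(axis=1))
        distances[clicked_index] = np.inf  # never recommend the clicked item itself
        return np.argsort(distances)[:top_k]

    print(recommend(0, features))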

Computing that much data while the site is under high demand could bring it down, so I have thought about a couple of solutions, but I don't know whether this behaves differently once the app is deployed.

  1. My first idea is to compute all the distances and put them in a hardcoded variable in some_file.py. The file would look like this:

    data = [[1,2,..],[3,4,..],[5,6,..],[7,8,..],...]

    and could be accessed like data[0][1], which is 2 in this example.

    This file is 60 MB.

  2. My second idea is the basic one: I create a table with 3 columns, A, B, and euclidean_distance(A, B), and store every pair. But this solution will create 4,000,000+ records (a rough model sketch is shown right after this list).
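As mentioned above, here is a rough sketch of what that table could look like as a Django model (names and field types are just illustrative, not my actual code):

    from django.db import models

    class ItemDistance(models.Model):
        # Precomputed Euclidean distance between a pair of items.
        item_a = models.IntegerField(db_index=True)
        item_b = models.IntegerField(db_index=True)
        distance = models.FloatField()

        class Meta:
            unique_together = ("item_a", "item_b")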

NOTES

I'm using PostgreSQL for my database. I'm just comparing 2 items, so it will be a 2D Euclidean distance. I have several features, but I only posted 1 feature here so that I can apply the same approach to the other features once it works.

My questions are:

  1. Which is the better solution for saving all the distances once the app is deployed?
  2. I'm planning to add more data in the future. My calculation is that it will take roughly (n^2 - n)/2 rows in the database if each pair is stored once (about 2 million rows for n = 2000, or around 4 million if both directions are stored). At what point does the database get so big that every query against that table becomes noticeably slower, say 10-20 seconds longer?

I'm open to solutions other than the 2 above.

Michael Halim
  • 262
  • 2
  • 20
  • 1
    I would definitely try to go the route of storing it in the Database. That way you can leverage the ORM to access the data and won't have to load it all into memory every time you want to access a subset of the data – Jordan Hyatt Sep 19 '22 at 17:14
  • If the database keeps getting bigger, would that slow down the application or any other queries besides that table? – Michael Halim Sep 20 '22 at 01:53
  • Nope, it should not have an effect on the performance of unrelated tables – Jordan Hyatt Sep 20 '22 at 17:24

3 Answers

2

I tend to prefer using a database, especially when envisioning that the system will grow, because that way one can scale the web side and the database independently. Overall, that seems to be the method people go with as well.
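As a rough sketch of what that access pattern could look like with the Django ORM (assuming a precomputed-distance model along the lines of the ItemDistance sketch in the question; names are illustrative only), fetching the nearest items for a clicked item touches only a handful of rows instead of the whole table:

    # Ten closest items to a clicked item, read straight from the database.
    nearest = (
        ItemDistance.objects
        .filter(item_a=clicked_item_id)
        .order_by("distance")
        .values_list("item_b", "distance")[:10]
    )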

Some other considerations

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
1

You should really take a look at this article about optimising pairwise Euclidean distance calculations from Towards Data Science, which is pretty much about your situation.
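For a feel of the kind of vectorized computation that article is concerned with, here is a minimal sketch using scipy (the array name, sizes, and random data are only illustrative): the full 2000 x 2000 matrix is computed in one vectorized call rather than a Python double loop.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # 2000 items with normalized 2D features; random data for illustration only.
    features = np.random.rand(2000, 2)

    # pdist computes each unordered pair once; squareform expands that into a
    # full symmetric matrix so dist[i, j] is the distance between items i and j.
    dist = squareform(pdist(features, metric="euclidean"))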

Roland
  • 247
  • 2
  • 8
  • The reason I'm considering a database solution is that I will probably access it often. If every time I need, let's say, the 2D Euclidean distance of item 1 vs items 2-2000+ and item 45 vs items 1-2000+, wouldn't it be more convenient if I had stored that beforehand? I'm only using 2D Euclidean distance, btw – Michael Halim Sep 22 '22 at 11:33
  • Depends on your definition of convenience and your exact needs. Not knowing anything else about those, I would not suggest a database because it is overkill for something like this. Performance at peak demand will definitely be slower with a database when compared to either a list lookup or a direct calculation for something as simple as the Euclidean distance though. The reason I gave you the timings above is because the operation, being very short already, offers next to no speed difference vs. list / db lookups, but runs at constant memory without maintenance and with little CPU use. – Roland Sep 22 '22 at 13:34
  • By the way, the example you have in your question is the single dimensional Euclidean distance calculation between two scalars, not the 2D calculation between two 2D points you just mentioned. Lookup / storage of any kind is needlessly complicated and more expensive than the calculation itself at that point, regardless of database or list lookup (CPU/RAM/storage usage & electricity) because even list lookup of a value in the 1D case results in more or less the same amount of work as the calculation itself, so any kind of database structure simply costs extra. – Roland Sep 22 '22 at 13:40
  • I have edited the post with extra information. What I mean by convenient is that I can get the distance without much effort; I want it to be fast, so that I can do the calculations for my recommendation system faster. – Michael Halim Sep 23 '22 at 03:11
  • I see now; thank you for clarifying. Since you wrote "I need to store all combinations of distances" and mentioned 4 million records for 2000 distances, I had assumed you had wanted single dimensional distances. Not knowing what you are recommending, I have to question whether all items in your comparison pool always have the same number of dimensions (attributes) and the same attributes, because if that is not the case, the Euclidean distance is not going to be applicable at all. – Roland Sep 23 '22 at 07:36
  • The cases I see: 1. Different number of or missing attributes: Regardless of whether some attributes are the same, the Euclidean distance cannot deal with this at all. A missing attribute is data which does not stem from user choice, so cannot be used. 2. Same number of, but different attributes: The Euclidean distance cannot deal with this correctly. 3. All items have all of the same attributes, with nothing missing: Use the Euclidean distance, don't precalculate all possible combinations since you have n>3 attributes, store only those distances which do correspond to existing pairings. – Roland Sep 23 '22 at 07:52
  • I can guarantee that none of those 2000+ data points has a missing value. I also normalized the numbers so that they only range between 0.0 and 1.0 – Michael Halim Sep 23 '22 at 12:05
  • It's still taking a lot of time to compute the 2000 x 2000 matrix. I used this algorithm and got a 20-30 minute runtime – Michael Halim Sep 28 '22 at 12:42
0

If you're familiar with Pandas, you can do all of this with Vaex.io. It's exactly what it's designed for. By building out the index correctly you can easily slice and dice the data and retrieve what you're seeking. https://vaex.io/
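A minimal sketch of the idea, assuming the precomputed pairwise distances are already held as arrays (names, columns, and data here are illustrative only, and this shows basic Vaex usage rather than a full solution):

    import numpy as np
    import vaex

    # Illustrative pairwise data: item ids and their precomputed distances.
    item_a = np.array([0, 0, 1])
    item_b = np.array([1, 2, 2])
    distance = np.array([0.3, 0.7, 0.5])

    df = vaex.from_arrays(item_a=item_a, item_b=item_b, distance=distance)

    # Lazily filter the rows for one item, then sort by distance.
    nearest = df[df.item_a == 0].sort("distance")
    print(nearest)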

user1470034
  • 671
  • 2
  • 8
  • 23