I have a Django app that I'm currently working on. This app computes the Euclidean distance for 2000+ items.
I'm using this data to build a recommendation system with content-based filtering. Content-based filtering works like this: if you click an item, the system finds other items with the closest features. I have already figured out the features. What I need is: when a person clicks an item, I calculate the Euclidean distance between its features and the features of every other item, and use that result. So I will use the Euclidean distances of all possible combinations. Because I refresh the recommendations every X hours, I need to store the distances for all combinations. A rough sketch of the idea is shown below.
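To illustrate what I mean, here is a minimal sketch of the content-based step (the feature array is random placeholder data, and the function name is just an example, not my real code):

    import numpy as np

    # placeholder feature vectors: 2000+ items, 2 features each
    features = np.random.rand(2000, 2)

    def recommend(clicked_index, top_n=10):
        # Euclidean distance from the clicked item to every other item
        diffs = features - features[clicked_index]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        # smallest distances = most similar items; the clicked item itself
        # has distance 0 and sorts first, so it is skipped
        return np.argsort(dists)[1:top_n + 1]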
Recomputing that much data while the site is under heavy load will bring it down, so I've thought about several solutions, but I don't know whether things behave differently once the app is deployed.
My first idea is to compute all the distances and put them in a hardcoded variable in some_file.py. The file would look like this:
data = [[1,2,..],[3,4,..],[5,6,..],[7,8,..],...]
and would be accessed like this:
data[0][2]   # the distance between item 0 and item 2
This file is 60 MB.
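A variation of this idea I've considered is saving the precomputed matrix to a binary file instead of a 60 MB Python literal, and loading it once per process. Just a sketch, with placeholder features and an example file name:

    # precompute.py - run every X hours (e.g. via cron or a management command)
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    features = np.random.rand(2000, 2)   # placeholder; load the real features here
    # full symmetric 2000x2000 distance matrix
    distances = squareform(pdist(features, metric="euclidean"))
    np.save("distances.npy", distances)

    # somewhere in the Django app, e.g. recommendations.py
    import numpy as np

    data = np.load("distances.npy")      # loaded once per process, kept in memory
    # data[0][2] is then the distance between item 0 and item 2, same as before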
My second idea is the basic one: I create a table with three columns, A, B, and euclidean_distance(A, B). But this solution will create 4,000,000+ records. A rough model sketch follows below.
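For reference, the table would look roughly like this as a Django model (the model and field names are just examples, and "Item" stands in for my real item model):

    from django.db import models

    class ItemDistance(models.Model):
        # hypothetical model; "Item" would be my actual item model
        item_a = models.ForeignKey("Item", on_delete=models.CASCADE, related_name="+")
        item_b = models.ForeignKey("Item", on_delete=models.CASCADE, related_name="+")
        euclidean_distance = models.FloatField()

        class Meta:
            unique_together = ("item_a", "item_b")
            indexes = [models.Index(fields=["item_a", "euclidean_distance"])]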
*NOTES
I'm using PostgreSQL for my database. I'm only comparing 2 items at a time, so it's a 2D Euclidean distance. I have several features, but I've only posted one feature here so that I can apply the same approach to the other features once it works.
My questions are:
- Which one is the better solution for storing all the distances once the app is deployed?
- I'm planning to add more data in the future. My calculation is that it will take roughly (n^2 - n)/2 rows in the database if each pair is stored once, or about n^2 - n if both (A, B) and (B, A) are stored. At what point does the database get so big that every access to it becomes slow, say 10-20 seconds slower?
I'm open to solutions other than the two above.