Context
As part of my studies, I am creating a bot in Python 3 that detects scam messages. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known domains considered safe (discord.com, google.com, etc.) and known fraudulent domains (free-nitro.ru, etc.).
To share this list between my personal computer and my server, I regularly "deploy" it over FTP. But since my bot also uses GitHub and a MySQL database, I'm looking for a better way to synchronize this list of domain names without making it publicly accessible.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge so I'm coming to you for advice, thanks in advance!
Solutions I have considered:
Put the domain names in a MySQL table
Advantages: no public access, live synchronization
Disadvantages: requires a database connection, but my scam detection script should be able to work offline
Hash the domain names before putting them on Git
Advantages: no public access, easy to do, supports equality comparison
Disadvantages: does not support similarity comparison, which is an important part of the program
Hash the domain names with locality-sensitive hashing (LSH)
Advantages: no easy public access, supports equality and similarity comparison
Disadvantages: similarity scores are less precise than on the plain text, and the server cannot hash a new string without knowing at least the random seed, so the seed has to be shared as well, which brings back the public access problem
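To illustrate the second option, here is a minimal sketch of what hashing the list could look like, assuming SHA-256 from the standard library and a simple lowercase normalization. The function names and the sample domain set are hypothetical and only there for illustration; the real list would come from the CSV file.

```python
import hashlib


def hash_domain(domain: str) -> str:
    """Return the SHA-256 hex digest of a normalized domain name."""
    normalized = domain.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical sample data standing in for the fraudulent part of the CSV list.
known_scam_hashes = {hash_domain(d) for d in ["free-nitro.ru"]}


def is_known_scam(domain: str) -> bool:
    """Equality check only: the domain must match a listed one exactly."""
    return hash_domain(domain) in known_scam_hashes


print(is_known_scam("Free-Nitro.ru"))  # True: same domain after normalization
print(is_known_scam("free-nitr0.ru"))  # False: one changed character gives an unrelated digest
```

The second call shows the disadvantage listed above: a near-identical domain produces a completely different hash, so nothing about similarity can be recovered from the digests.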
My opinion
It seems to me that the last solution, LSH, is the one that causes the fewest problems, but it is far from satisfactory and I hope to find something better. I have reproduced the LSH algorithm here (from this notebook). The similarity coefficients I get are between 10% and 40% lower than those obtained with the current plain-text method.
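For readers unfamiliar with the idea, here is a rough sketch of one common LSH scheme, MinHash over character trigrams, next to the exact Jaccard similarity it approximates. This is not the code from the notebook mentioned above; the function names, the trigram size, the number of hash functions and the seed value are all assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Split a string into its set of character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, which needs the plain-text n-grams."""
    return len(a & b) / len(a | b)


def minhash_signature(text: str, num_hashes: int = 64,
                      seed: str = "hypothetical-seed") -> list:
    """Build a MinHash signature: for each seeded hash function,
    keep the smallest hash value over all n-grams of the text."""
    grams = shingles(text)
    signature = []
    for i in range(num_hashes):
        salt = f"{seed}:{i}:".encode("utf-8")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + g.encode("utf-8")).digest()[:8], "big")
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a, b = "discord.com", "discorcl.com"
print("exact:    ", jaccard(shingles(a), shingles(b)))
print("estimated:", estimated_similarity(minhash_signature(a), minhash_signature(b)))
```

Two things from the list above show up here. The estimate only approaches the exact value as `num_hashes` grows, so some loss of precision is expected. And any instance that needs to hash a new string has to use the same `seed`, so the seed must be distributed along with the signatures.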
EDIT: To clarify, since my intentions may not have been clear enough (I’m sorry, English is not my native language and I’m bad at explaining things lol): the database and GitHub are just convenient ways to share information between my different bot instances. I could have one running locally on my PC, one on my VPS, another God knows where… and this is why I don’t want FTP or any kind of synchronisation process involving an IP address and/or a fixed destination folder. Ideally, I’d like to be able to take my program at any time, download it wherever I want (with git clone), and just run it.
Please tell me if this still isn’t clear enough, thanks :)