
Context

As part of my studies, I am creating a bot in Python 3 capable of detecting scam messages. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known safe domains (discord.com, google.com, etc.) and known fraudulent domains (free-nitro.ru, etc.).

To share this list between my personal computer and my server, I currently "deploy" it over FTP. But since my bot already uses GitHub and a MySQL database, I'm looking for a better way to synchronize this list of domain names without letting anyone else access it.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge, so I'm coming to you for advice. Thanks in advance!

My considered solutions:

  • Put the domain names in a MySQL table
    Advantages: no public access, live synchronization
    Disadvantages: my scam detection script has to be able to work offline, so it cannot depend on a live connection

  • Hash the domain names before putting them on git
    Advantages: no public access, easy to do, supports equality comparison (see the first sketch after this list)
    Disadvantages: does not support similarity comparison, which is an important part of the program

  • Hash the domain names with locality-sensitive hashing (LSH)
    Advantages: no easy public access, supports both equality and similarity comparison
    Disadvantages: similarity scores are less precise than on the plaintext domains, and it is impossible to hash a new string on the server without knowing at least the seed of the random generator, so the public-access problem comes back
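
For reference, here is what the second option looks like in practice; a minimal sketch (the second domain in the set is made up for the example):

```python
# Sketch of option 2: commit only SHA-256 digests of the domains to git.
import hashlib

def digest(domain):
    return hashlib.sha256(domain.strip().lower().encode()).hexdigest()

# The public file would contain digests like these (example domains only).
known_scams = {digest(d) for d in ["free-nitro.ru", "discord-gifts.example"]}

# Exact lookups still work:
print(digest("free-nitro.ru") in known_scams)    # True
# ...but one changed character yields an unrelated digest, so there is
# no way to measure similarity between hashed domains:
print(digest("free-nitro.com") in known_scams)   # False
```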

My opinion

It seems to me that the last solution, with LSH, is the one that causes the fewest problems. But it is far from satisfying me, and I hope to find something better. For the LSH algorithm, I have reproduced it here (from this notebook). I get similarity coefficients between 10% and 40% lower than those obtained with the current plaintext method.
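
To illustrate the kind of scheme I mean, here is a simplified MinHash-style sketch (not my actual code from the notebook; the parameters and helper names are just for illustration), with a plaintext difflib comparison next to it:

```python
# Simplified MinHash over character 3-grams (illustrative only).
import difflib
import hashlib
import random

NUM_PERM = 64          # signature length (number of hash permutations)
SEED = 42              # every instance must share this seed to get
                       # comparable signatures: this is the weak point

PRIME = (1 << 61) - 1  # prime larger than the 56-bit shingle hashes below
rng = random.Random(SEED)
PERMS = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
         for _ in range(NUM_PERM)]

def shingles(domain, n=3):
    """Map every character n-gram of the domain to a 56-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(domain[i:i + n].encode()).digest()[:7], "big")
        for i in range(len(domain) - n + 1)
    }

def minhash(domain):
    """Signature = minimum of each permutation over the domain's shingles."""
    s = shingles(domain)
    return [min((a * x + b) % PRIME for x in s) for a, b in PERMS]

def lsh_similarity(d1, d2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sig1, sig2 = minhash(d1), minhash(d2)
    return sum(x == y for x, y in zip(sig1, sig2)) / NUM_PERM

def plain_similarity(d1, d2):
    """Plaintext similarity for comparison (difflib ratio)."""
    return difflib.SequenceMatcher(None, d1, d2).ratio()

pair = ("free-nitro.ru", "free-nitro.com")
print(plain_similarity(*pair))  # score on the plaintext domains
print(lsh_similarity(*pair))    # estimate from the signatures only
```

With a shared SEED every instance produces comparable signatures, which is exactly the weakness mentioned above: anyone who obtains the seed can hash new strings too.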

EDIT: for clarification purposes, maybe my intentions weren't clear enough (I'm sorry, English is not my native language and I'm bad at explaining things lol). The database and GitHub are just convenient ways to share information between my different bot instances. I could have one running locally on my PC, one on my VPS, another one God knows where... and this is why I don't want FTP or any kind of synchronization process involving an IP and/or a fixed destination folder. Ideally, I'd like to be able to take my program at any time, download it wherever I want (with git clone) and just run it.
Please tell me if this isn't clear enough, thanks :)

  • Do the MySQL DB and GitHub components also need to access the list? Is the FTP owned by you, or can you set permissions on the file? Couldn't you just upload an encrypted version to the FTP? – root Oct 30 '22 at 02:12
  • Yeah, automating the FTP upload sounds like an idea I didn't see. But *in my mind* I'd find it cleaner to avoid this kind of process and instead use something more general like the DB or GitHub (so I wouldn't have any issue using it on another server, for example). In theory the DB and GitHub don't need access to the list; they're only a means of storing/passing the data. – Z_runner Oct 30 '22 at 03:38
  • Then why FTP? Why not upload it directly to the filesystem where your python script runs? – root Oct 30 '22 at 04:12
  • I'm not sure what you mean by filesystem; how is that different from any FTP-like file transfer? Can you elaborate please, maybe I'm missing something there. (Btw, I edited my post with more info) – Z_runner Oct 30 '22 at 13:47
  • Your python script is running on some operating system. That operating system has files. One of those files is your script. Can't another file be that domain list? – root Oct 31 '22 at 00:36
  • This is already what I have: a CSV file containing plaintext domain names, currently on my computer. But I need to "push" this file somehow to my remote server and to any location where my bot may run, and this is where my question stands (git, database, etc.) – Z_runner Oct 31 '22 at 11:46

1 Answer


In the end, I think I'll use yet another solution: storing the domain names in the MySQL database, but only using it in my script as a synchronization source, while keeping a local CSV copy.

In short, the workflow I'm imagining (a rough sketch in code follows the list):

  • I edit my SQL table when I want to add/remove items to it
  • When the bot is launched, the script connects to the DB and retrieves all the information from the table
  • Once the information is retrieved, the script saves it to a CSV file and then runs the rest of the program
  • If no internet connection is available at launch, the synchronization with the DB is skipped and only the existing CSV file is used.
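
A rough sketch of that startup logic (I'm using pymysql as an example client; the host, credentials, table and column names are all placeholders):

```python
# Startup sync: refresh the local CSV from MySQL when possible,
# otherwise fall back to the last synced copy.
import csv
import pymysql

CSV_PATH = "domains.csv"

def sync_from_db():
    """Fetch the full domain list from MySQL and overwrite the local CSV."""
    conn = pymysql.connect(host="db.example.com", user="bot",
                           password="secret", database="scamdetect")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT domain, category FROM domains")
            rows = cur.fetchall()
    finally:
        conn.close()
    with open(CSV_PATH, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return [tuple(r) for r in rows]

def load_domains():
    """Prefer a fresh sync; fall back to the local CSV when offline."""
    try:
        return sync_from_db()
    except (pymysql.MySQLError, OSError):
        with open(CSV_PATH, newline="", encoding="utf-8") as f:
            return [tuple(row) for row in csv.reader(f)]

domains = load_domains()
```

Catching both pymysql.MySQLError and OSError keeps the fallback working whether the failure happens at the network level or inside the MySQL handshake.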

This way I get the advantages of no public access, automatic synchronization, and offline access after the first start, and I keep support for comparison by similarity since no hashing is done.

If you think you can improve my idea, I'm interested!
