
So I'm running web scrapers from time to time. Sometimes a database is unavoidable, and I prefer sqlite because of its ease of use. However, I fear my latest project is putting strain on my SSD; take a look.

I have several separate processes; each queries the database for a URL to scrape, scrapes it, and then writes the result and commits to disk. I want to, if possible, avoid unnecessary scraping, so I commit after each result so the other processes know not to scrape that one.
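
Roughly, each process does something like this (a simplified sketch; the table and column names are made up for illustration, not my real schema):

```python
import sqlite3

def scrape(url):
    # placeholder for the actual scraping code
    return "<html>...</html>"

conn = sqlite3.connect("results.db")
while True:
    row = conn.execute("SELECT url FROM urls WHERE done = 0 LIMIT 1").fetchone()
    if row is None:
        break                        # nothing left to scrape
    content = scrape(row[0])
    conn.execute("UPDATE urls SET done = 1, content = ? WHERE url = ?",
                 (content, row[0]))
    conn.commit()                    # commit after every result so other processes see it
```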

What worries me is that each commit is an actual write to disk. With something like a million queries, my SSD will age prematurely :/ So basically, two questions:

  1. Does this really mean 1 commit = 1 physical disk write? Or is there buffering within SQLite, in Linux, or maybe in the SSD controller itself, something that keeps a copy of the database in RAM (or controller memory) and only syncs it to the drive infrequently? The database itself is less than 20 MB.

  2. If it is as I fear, are there any tweaks I could make to sqlite3, or is a client-server database unavoidable? My understanding is that a database server would have exclusive access to the database, so it could keep the working version in RAM and only sync to disk periodically.

I'm sorry the question is more of a discussion :/

Liudvikas Akelis
  • *"I want to, if possible, avoid unnecessary scraping, so I commit after each result so the other processes know not to scrape that one"* . save the urls with a flag indicates if they already scraped (`url, scrapped`) and lock them for writing, when you select them like this `1- lock the url rows, 2-select the url rows 3-scrape them 4-set the flag for them 5-release the lock` – Accountant م Jan 18 '19 at 15:23

3 Answers

3

One commit indeed results in at least one disk write, and usually several, because the table structure, the database header, and the file system's metadata are likely to be located in other pages.

Then look up your SSD's write endurance in its datasheet, and realize that the piddling million writes you do "from time to time" are completely harmless.
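
As a rough back-of-the-envelope check (exact figures depend on your page size and drive): if a commit rewrites, say, four 4 KB pages, that is about 16 KB, so a million commits amount to roughly 16 GB of writes. Typical consumer SSDs are rated for hundreds of terabytes written, so that is a tiny fraction of a percent of the drive's endurance, even before any buffering by the OS.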

To make your database writes more SSD-friendly, you can enable WAL mode. But unless you're continuously running the database 24/7, you won't notice any difference.
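
For example, from Python's sqlite3 it is a one-line pragma (a minimal sketch; lowering `synchronous` to NORMAL is an optional extra that, in WAL mode, syncs only at checkpoints rather than on every commit, at a small durability cost):

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("PRAGMA journal_mode=WAL")    # persistent: stays set on the database file
conn.execute("PRAGMA synchronous=NORMAL")  # optional: fsync at checkpoints, not at every commit
```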

CL.
2

I've since figured out a workaround that works for me: put the database in RAM (on Linux, /dev/shm is one such place, and I believe Windows has equivalent facilities as well).

Obviously, I still want the data written to disk at some point, so what I do is run a bash loop that periodically copies the in-memory database file to disk, like this:

```bash
while true; do sleep 20; cp /dev/shm/results.db ~/project/results.db; done
```

This is not a perfect long-term solution, but it works without any modifications to the scraper and can be a great speed boost.

Liudvikas Akelis
  • Copying the database file without also copying the journal or WAL file can result in corruption. The backup API avoids this risk: https://sqlite.org/backup.html – Ben Page Sep 14 '20 at 23:14
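
To keep the periodic-copy idea while avoiding that risk, something along these lines should work, using the backup API through Python's sqlite3 (3.7+); this is only a sketch reusing the same paths and 20-second interval as the bash loop above:

```python
import os
import sqlite3
import time

SRC = "/dev/shm/results.db"
DST = os.path.expanduser("~/project/results.db")

src = sqlite3.connect(SRC)
while True:
    time.sleep(20)
    dst = sqlite3.connect(DST)
    with dst:
        src.backup(dst)   # copies a consistent snapshot even while other processes keep writing
    dst.close()
```
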
0

This is a bit long for a comment. I'm not intimately familiar with the most recent advances in SSD technology, but you have a valid concern.

I don't see how you can avoid the problem. You want to scrape web sites and then store the data "permanently". That process requires a write to some form of long-term storage. If the only storage is SSD, then you will need to write to the disk. That has nothing to do with the use of databases.

Databases may incur extra writes, depending on how they are designed -- additional writes for index pages, for log records, for page splits, and so on.

What can you do? My advice is to back up the database religiously, monitor your hardware, and update the hardware when necessary. By the way, that advice is rather independent of using databases or SSDs.

Gordon Linoff