
I am running a couple of crawlers that produce millions of datasets per day. The bottleneck is the latency between the spiders and the remote database. If the distance between the spider server and the database is too large, the latency slows the crawlers down to the point where they can no longer complete the datasets needed for a day.

In search of a solution I came across Redis, with the idea of installing Redis on the spider server, where it would temporarily store the collected data with low latency, and then somehow transfer that data from Redis to MySQL.

The setup so far is like this:

  • About 40 spiders running on multiple instances feed one central remote MySQL 8 server on a dedicated machine over TCP/IP.
  • Each spider writes different datasets. One kind of spider collects positions and prices of search results; a page holds about 100 results and produces around 200-300 inserts. The delay between requests/pages is about 2-10 s.

The latter is the problem, as the spider yields every position on that page individually and creates a remote insert within its own transaction, maybe even its own connection (not sure at the moment).

This currently only works because the spiders and the remote MySQL server are close to each other (same data center) with ping times of 0.0x ms; it does not work with ping times of 50 ms, as the spiders cannot write fast enough.
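For context, the per-yield write currently looks roughly like the sketch below (simplified; this assumes a Scrapy-style item pipeline, and the table/column names are only illustrative):

```
import pymysql

class RemoteMySQLPipeline:
    """Current pattern: one round trip to the remote server per yielded item."""

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host="db.example.com", user="crawler",
            password="secret", database="crawl", autocommit=False
        )

    def process_item(self, item, spider):
        # Each yielded position becomes its own INSERT + COMMIT,
        # i.e. at least one network round trip per dataset.
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT IGNORE INTO positions (keyword, position, price) "
                "VALUES (%s, %s, %s)",
                (item["keyword"], item["position"], item["price"]),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

With around 200-300 of those writes per page and a 50 ms round trip, one page already costs roughly 10-15 s in network round trips alone, which matches the slowdown I am seeing.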

Is Redis, or maybe DataMQ, a valid approach to solving the problem, or are there other recommended ways of doing this?

merlin
  • It depends. Please give more info on what the processing is. What is in one of those "millions of datasets"? Is there more than one `INSERT` per crawled server? How many crawlers are there? How many MySQL servers are there? Clustered (Galera / InnoDB Cluster)? – Rick James Mar 08 '22 at 01:54
  • And do you take a checksum of each page to avoid storing dups? If so, when (in the processing) do you notice the dup and avoid further processing? – Rick James Mar 08 '22 at 01:55
  • I added info about the setup. I am not taking checksums, but we do use INSERT IGNORE / ON DUPLICATE KEY UPDATE where needed. The problem is that we do not write all the data together at the end of each page, but send each yield (dataset) to the remote server individually. I am not aware of other possibilities at the moment. – merlin Mar 08 '22 at 06:42

2 Answers


Do you mean that you have installed a Redis server on each spider server?

Actually, that is not a good solution for your case. But if you have already done this and still want to use MySQL to persist your data, a cron job on each server is an option.

You can create a cron job on each spider server (depending on your dataset and your needs, you can choose a daily or hourly sync job) and write a data transfer script that scans Redis and transfers the data to your MySQL tables.
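A rough sketch of such a transfer script, assuming the spiders RPUSH each dataset onto a Redis list as JSON (the key, table, and column names are just placeholders, not something from your actual setup):

```
import json

import pymysql
import redis

BATCH_SIZE = 1000  # rows per bulk INSERT; tune to your dataset

r = redis.Redis(host="localhost", port=6379, db=0)
conn = pymysql.connect(host="db.example.com", user="crawler",
                       password="secret", database="crawl")

def flush_batch(rows):
    """Send one multi-row INSERT instead of one round trip per row."""
    if not rows:
        return
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT IGNORE INTO positions (keyword, position, price) "
            "VALUES (%s, %s, %s)",
            rows,
        )
    conn.commit()

rows = []
while True:
    raw = r.lpop("crawl:positions")   # the queue the spiders push to
    if raw is None:
        break                         # queue drained, exit until the next cron run
    item = json.loads(raw)
    rows.append((item["keyword"], item["position"], item["price"]))
    if len(rows) >= BATCH_SIZE:
        flush_batch(rows)
        rows = []

flush_batch(rows)                     # flush the remaining partial batch
conn.close()
```

Because the script pops rows before inserting them, a crash between the LPOP and the INSERT loses those rows; if that matters, you can move items to a processing list first (RPOPLPUSH / LMOVE) instead of dropping them right away, but for a simple hourly sync the version above is usually enough.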

Gawain
  • Currently there is no Redis setup, I am just thinking about one. I updated the question with further clarification. – merlin Mar 08 '22 at 06:43

I recommend using MongoDB instead of MySQL to store the data.