
I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the scrapers.

This sounds like a great use case for a forward proxy like Squid or Apache Traffic Server. However, I couldn't find a way in their docs to both:

  • Keep a permanent history of the cached pages
  • Access old versions of the cached pages (think Wayback Machine)

Does anyone know if this is possible? I could mirror the pages with wget or httrack instead, but a forward cache is a better fit because the caching is driven by the scraper itself.

Thanks!

Vítor Baptista

1 Answer

  • If the site is available over plain HTTP, this can be done quite simply with Squid plus a small script that follows the Squid access log and stores the corresponding content somewhere, using plain old wget for example (see the sketch after this list).
  • If the site is available only via HTTPS, it is much trickier:
    • In the simple case it's impossible to see what is being accessed, because the proxy only knows the domain it is asked to CONNECT to, not the full URL.
    • You can build a so-called transparent (intercepting) proxy setup, but that requires DNS configuration and TLS certificates that every client browser has to trust (or one common CA).
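For the plain-HTTP case, here is a minimal sketch of such a log-following script in Python. The log path, the archive directory, and the assumption that Squid writes its default "native" log format (fields: timestamp, elapsed, client, action/code, bytes, method, URL, ...) are all placeholders you would adjust for your setup; it shells out to wget to re-fetch each successfully proxied URL into a timestamped directory, Wayback-style.

```python
#!/usr/bin/env python3
"""Follow Squid's access log and archive every fetched page with wget.

A minimal sketch, not a drop-in solution. Assumptions:
  - Squid logs in its default "native" format to /var/log/squid/access.log
  - wget is installed and ARCHIVE_DIR is writable
"""
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

ACCESS_LOG = Path("/var/log/squid/access.log")   # assumed default location
ARCHIVE_DIR = Path("/var/archive/pages")         # hypothetical archive root


def follow(path):
    """Yield new lines appended to the log file, like `tail -f`."""
    with path.open() as fh:
        fh.seek(0, 2)  # start at the end of the file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line


def archive(url):
    """Fetch `url` with wget into a per-host, timestamped directory."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = ARCHIVE_DIR / urlparse(url).netloc / stamp
    target.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["wget", "--quiet", "--directory-prefix", str(target), url],
        check=False,
    )


def main():
    for line in follow(ACCESS_LOG):
        fields = line.split()
        if len(fields) < 7:
            continue
        # Native format: fields[3] is action/code, [5] is method, [6] is URL.
        action, method, url = fields[3], fields[5], fields[6]
        # Only archive successful GETs of plain-HTTP pages.
        if method == "GET" and url.startswith("http://") and "/200" in action:
            archive(url)


if __name__ == "__main__":
    main()
```

You would run this alongside Squid and point your scraper at the proxy; each page the scraper requests then gets re-fetched once by wget and filed under its hostname and a UTC timestamp, giving you browsable historical snapshots.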
madman_xxx