I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the scrapers.
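For context, this is roughly what I'd have to hand-roll inside the scraper if a proxy can't do it for me (just a minimal sketch; the archive directory and file-naming scheme are placeholders):

```python
import datetime
import pathlib
import requests

ARCHIVE_DIR = pathlib.Path("raw_html")  # placeholder local archive directory

def fetch_and_archive(url: str) -> str:
    """Fetch a page and keep a timestamped copy of the raw HTML."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # One file per fetch, so old versions are never overwritten
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe_name = url.replace("://", "_").replace("/", "_")
    ARCHIVE_DIR.mkdir(exist_ok=True)
    (ARCHIVE_DIR / f"{safe_name}.{stamp}.html").write_text(resp.text, encoding="utf-8")
    return resp.text
```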
This sounds like a great use case for a forward proxy like Squid or Apache Traffic Server. However, I couldn't find anything in their docs that covers both:
- Keep a permanent history of the cached pages
- Access old versions of the cached pages (think Wayback Machine)
Does anyone know if this is possible? I could potentially mirror the pages using `wget` or `httrack`, but a forward cache is a better solution as the caching process is driven by the scraper itself.
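To illustrate what I mean by the scraper driving the cache: ideally I'd just point my HTTP client at the proxy and the archiving would happen transparently. A sketch, assuming Squid on its default port 3128 (the URL is made up):

```python
import requests

# Assumption: Squid (or ATS) running locally on Squid's default port, 3128
PROXIES = {"http": "http://localhost:3128", "https": "http://localhost:3128"}

# Every page the scraper requests passes through (and would be archived by) the proxy
html = requests.get("http://example.gov/calendar", proxies=PROXIES, timeout=30).text
```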
Thanks!