Questions tagged [scraping]
28 questions
1
vote
2 answers
Website being mirrored by another domain
My website is being mirrored by another domain name. I tried many ways to block access from that specific domain, without success. I am using the Cloudflare CDN, and the website mirroring my site uses it too. I tried to get the remote address of the…

Allae Eddine
- 13
- 2
0
votes
1 answer
Block website scraper in HAProxy
I am using HAProxy and want to block scrapers from my website. In haproxy.cfg, I have created a rule:
acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/badbots.lst
http-request deny if blockedagent
The file /etc/haproxy/badbots.lst…

Cyberzinga
- 35
- 2
- 6
0
votes
0 answers
How to capture tables with different structures from the web
I have thousands of web pages (which require a login with username and password) like https://XXX.incometax.XXX/Preview/ViewDetail?TIN_INFO_NO=11935# where only the last digits (11935 in this example) change for each URL. Each URL retrieves tax information…

Learner
- 101
- 4
0
votes
1 answer
Can OpsView or Nagios be set up to report on a device based on status emails it sends out?
I'm looking at setting up a Nagios (or perhaps OpsView) server for monitoring our network.
I have a few peripheral devices whose OID schema doesn't include nodes for a metric I want to monitor. Currently I monitor the metric based on status…

JoelAZ
- 130
- 7
0
votes
1 answer
Cannot find source of traffic spike
I noticed in my Munin graphs for Apache that there was a large spike in traffic yesterday. However, I have been unable to correlate this with anything on the site.
Google Analytics does not show any traffic increase. It essentially only counts…

DisgruntledGoat
- 2,629
- 4
- 28
- 36
0
votes
1 answer
Grab js+flv video without embed option
I'm running a website for a political organization and was asked to post this article to the blog along with the embedded video: http://weareaustin.com/fulltext/?nxd_id=135746
I couldn't figure out a way to get the video from the news page to the…

Jesse Aldridge
- 103
- 3
0
votes
0 answers
favicon.ico in referer field in access.log
There is this line in my nginx access.log:
54.201.239.190 - - [18/Dec/2022:22:34:56 +0100] "GET / HTTP/1.1" 200 64
"http://example.com/favicon.ico" "Mozilla/5.0 (X11; Linux x86_64) ..."
Simple question: Can anybody think of a way that a…

archygriswald
- 143
- 1
- 11
0
votes
1 answer
How to configure a forward proxy to keep a historical mirror of the websites accessed?
I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the…

Vítor Baptista
- 221
- 2
- 5
0
votes
1 answer
How can a URL param '?i=1' detect a browser?
profreehost claims that a '?i=1' URL GET parameter can protect their servers. I wondered how.
I did search Google before asking this question, but all the results were about these parameters being for security and how to remove them (if you have SSH access).
I wanted to know…

Sam
- 25
- 1
- 10
0
votes
1 answer
How to identify who is scraping my website?
I have an e-commerce website, hosted on AWS.
I understand there are tools that prevent/block the scraping bots. But is it possible to detect who is scraping my website? I mean, would I be able to detect the requests are coming from a bot, then find…

Hooman Bahreini
- 518
- 6
- 17
0
votes
1 answer
How to reduce "RAM Cache + Buffer" size due to Scrapy
I am running a couple of spiders in parallel with scrapyd 1.2. Each process raises the buffer significantly during the crawl, as seen in the chart. What is this value and how can I reduce the footprint?

merlin
- 2,093
- 11
- 39
- 78
-1
votes
0 answers
Is offering the contents of a third-party website offline violating the law?
I have developed a nice little app that crawls a bunch of newspaper web sites and makes their latest content available on my phone offline. It's basically a Pocket app that saves contents automatically, once a day. I am wondering: if I ever wanted…

user221200
- 101
-2
votes
3 answers
Stop server crash
I run my Python scripts and the Scrapy framework for a web scraping project on my Ubuntu 12.04 (Precise) server. These scripts run all day.
This project is in the development/testing stage, so I don't know what the system requirements of this…

Binit Singh
- 101
- 2