Highest Voted 'scraping' Questions - Server Fault Stack Exchange

1

vote

2 answers

website being mirrored by another domain

My website is being mirrored by another domain name. I tried many ways to block access from that specific domain, without success. I am using the Cloudflare CDN, and the site mirroring mine uses it too. I tried to get the remote address of the…

asked Jul 06 '20 at 16:56

Allae Eddine

13
2

0

votes

1 answer

Block website scraper in Haproxy

I am using HAProxy. I want to block scrapers from my website. In haproxy.cfg, I have created a rule:

    acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/badbots.lst
    http-request deny if blockedagent

The file /etc/haproxy/badbots.lst…

security haproxy python useragent scraping

asked Apr 17 '18 at 22:38

Cyberzinga

35
2
6
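The rule quoted in the HAProxy question above denies a request when its User-Agent header contains any entry from a pattern file, compared case-insensitively. The following is a minimal Python sketch of that matching logic, useful for checking a list offline before deploying it. It is not HAProxy's implementation; the function names and the sample patterns are hypothetical, and skipping blank lines and `#` comments is an assumption about how the pattern file is read.

```python
def load_patterns(lines):
    """Return lowercase patterns, skipping blank lines and '#' comments (assumed file convention)."""
    patterns = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            patterns.append(entry.lower())
    return patterns


def is_blocked(user_agent, patterns):
    """Case-insensitive substring match of the User-Agent against every pattern."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in patterns)


# Hypothetical list contents; the real badbots.lst is not shown in the question.
SAMPLE_PATTERNS = load_patterns(["# hypothetical entries", "python-requests", "Scrapy", ""])

if __name__ == "__main__":
    for agent in ("python-requests/2.31.0", "Mozilla/5.0 (X11; Linux x86_64)"):
        print(agent, "->", "deny" if is_blocked(agent, SAMPLE_PATTERNS) else "allow")
```

Because the match is a plain substring test, a short pattern can block more clients than intended, so it may be worth running real User-Agent strings from the access log through a check like this first.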

0

votes

0 answers

How to capture tables with different structure from web

I have thousands of web pages (which require a login with username and password) like https://XXX.incometax.XXX/Preview/ViewDetail?TIN_INFO_NO=11935# where only the last four digits (11935 in this example) change for each URL. Each URL retrieves tax information…

automation csv scraping

asked Oct 16 '17 at 09:51

Learner

101
4

0

votes

1 answer

Can OpsView or Nagios be set up to report on a device based on status emails it sends out?

I'm looking at setting up a Nagios (or perhaps OpsView) server for monitoring our network. I have a few peripheral devices whose OID schema doesn't include nodes for some metric I want to monitor. Currently I monitor the metric based on status…

nagios system-monitoring scraping opsview

asked Nov 28 '14 at 11:47

JoelAZ

130
7

0

votes

1 answer

Cannot find source of traffic spike

I noticed in my Munin graphs for Apache that there was a large spike in traffic yesterday. However, I have been unable to correlate this with anything on the site. Google Analytics does not show any traffic increase. It essentially only counts…

traffic scraping

asked Jul 18 '11 at 16:03

DisgruntledGoat

2,629
4
28
36

0

votes

1 answer

Grab js+flv video without embed option

I'm running a website for a political organization and was asked to post this article to the blog along with the embedded video: http://weareaustin.com/fulltext/?nxd_id=135746 I couldn't figure out a way to get the video from the news page to the…

video javascript flash scraping

asked Apr 08 '11 at 07:54

Jesse Aldridge

103
3

0

votes

0 answers

favicon.ico in referer field in access.log

There is this line in my nginx access.log: 54.201.239.190 - - [18/Dec/2022:22:34:56 +0100] "GET / HTTP/1.1" 200 64 "http://example.com/favicon.ico" "Mozilla/5.0 (X11; Linux x86_64) ..." Simple question: Can anybody think of a way that a…

log-files attacks scraping

asked Dec 19 '22 at 15:42

archygriswald

143
1
11
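The access.log line quoted in the favicon.ico question above looks like nginx's default "combined" log format, in which the quoted field after the byte count is the Referer header. The following is a minimal Python sketch that splits such a line into named fields so referer values can be inspected or counted. The regular expression and the function name are illustrative and assume the default combined format, not a custom `log_format`.

```python
import re

# Assumes nginx's default "combined" format:
# addr - user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<addr>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The sample line from the question, including its elided User-Agent.
SAMPLE = (
    '54.201.239.190 - - [18/Dec/2022:22:34:56 +0100] "GET / HTTP/1.1" '
    '200 64 "http://example.com/favicon.ico" "Mozilla/5.0 (X11; Linux x86_64) ..."'
)

if __name__ == "__main__":
    entry = parse_line(SAMPLE)
    print(entry["request"], "referer:", entry["referer"])
```

The Referer header is supplied by the client, so the parsed value only shows what the client sent, not where the request actually originated.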

0

votes

1 answer

How to configure a forward proxy to keep a historical mirror of the websites accessed?

I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the…

squid mirror scraping apache-traffic-server

asked Nov 11 '22 at 16:37

Vítor Baptista

221
2
5

0

votes

1 answer

How can a url param '?i=1' detect a browser?

ProFreeHost claims that a '?i=1' URL GET parameter can protect their servers. I wondered how. I searched Google before asking this question, but all the results were about these parameters being for security and how to remove them (if you have SSH access). I wanted to know…

security http scraping

asked Oct 04 '21 at 15:51

Sam

25
1
10

0

votes

1 answer

How to identify who is scraping my website?

I have an e-commerce website, hosted on AWS. I understand there are tools that prevent or block scraping bots. But is it possible to detect who is scraping my website? I mean, would I be able to detect that the requests are coming from a bot, then find…

amazon-web-services ip hostname whois scraping

asked Dec 09 '20 at 04:03

Hooman Bahreini

518
6
17

0

votes

1 answer

How to reduce "RAM Cache + Buffer" size due to scrapy

I am running a couple of spiders in parallel with scrapyd 1.2. Each process raises the Buffer value significantly during the crawl, as seen in the chart. What is this value and how can I reduce the footprint?

linux memory cache scraping devops

asked May 26 '20 at 06:08

merlin

2,093
11
39
78

-1

votes

0 answers

Is offering the contents of a third party web site offline violating the law?

I have developed a nice little app that crawls a bunch of newspaper web sites and makes their latest content available on my phone offline. It's basically a Pocket app that saves contents automatically, once a day. I am wondering: if I ever wanted…

web-crawler offline-files scraping

asked Jun 21 '16 at 13:47

user221200

101

-2

votes

3 answers

Stop server crash

I run my Python scripts and the Scrapy framework for a web scraping project on my Ubuntu 12.04 (Precise) server. These scripts run all day. The project is in the development/testing stage, so I don't know what the system requirements of this…

python ubuntu-12.04 server-crashes scraping

asked Aug 21 '13 at 05:37

Binit Singh

101
2
