Questions tagged [scraping]

28 questions
1
vote
2 answers

website being mirrored by another domain

My website is being mirrored by another domain name. I tried many ways to block access from that specific domain, without success. I am using the Cloudflare CDN, and the website mirroring my site uses it too. I tried to get the remote address of the…
0
votes
1 answer

Block website scraper in HAProxy

I am using HAProxy. I want to block scrapers from my website. In haproxy.cfg, I have created a rule: acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/badbots.lst http-request deny if blockedagent The file /etc/haproxy/badbots.lst…
Cyberzinga
  • 35
  • 2
  • 6
0
votes
0 answers

How to capture tables with different structure from web

I have thousands of web pages (login with username and password required) like https://XXX.incometax.XXX/Preview/ViewDetail?TIN_INFO_NO=11935# where only the last four digits (11935 in this example) change for each URL. Each URL retrieves tax information…
Learner
  • 101
  • 4
0
votes
1 answer

Can OpsView or Nagios be set up to report on a device based on status emails it sends out?

I'm looking at setting up a Nagios (or perhaps OpsView) server for monitoring our network. I have a few peripheral devices whose OID schema doesn't include nodes for some metrics I want to monitor. Currently I monitor the metric based on status…
JoelAZ
  • 130
  • 7
0
votes
1 answer

Cannot find source of traffic spike

I noticed in my Munin graphs for Apache that there was a large spike in traffic yesterday. However, I have been unable to correlate this with anything on the site. Google Analytics does not show any traffic increase. It essentially only counts…
DisgruntledGoat
  • 2,629
  • 4
  • 28
  • 36
0
votes
1 answer

Grab js+flv video without embed option

I'm running a website for a political organization and was asked to post this article to the blog along with the embedded video: http://weareaustin.com/fulltext/?nxd_id=135746 I couldn't figure out a way to get the video from the news page to the…
0
votes
0 answers

favicon.ico in referer field in access.log

There is this line in my nginx access.log: 54.201.239.190 - - [18/Dec/2022:22:34:56 +0100] "GET / HTTP/1.1" 200 64 "http://example.com/favicon.ico" "Mozilla/5.0 (X11; Linux x86_64) ..." Simple question: Can anybody think of a way that a…
archygriswald
  • 143
  • 1
  • 11
0
votes
1 answer

How to configure a forward proxy to keep a historical mirror of the websites accessed?

I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the…
Vítor Baptista
  • 221
  • 2
  • 5
0
votes
1 answer

How can a url param '?i=1' detect a browser?

profreehost claims that a '?i=1' URL GET parameter can protect their servers. I wondered how. I searched Google before asking this question, but all the results were about the parameter being for security and how to remove it (if you have SSH access). I wanted to know…
Sam
  • 25
  • 1
  • 10
0
votes
1 answer

How to identify who is scraping my website?

I have an e-commerce website, hosted on AWS. I understand there are tools that prevent/block scraping bots. But is it possible to detect who is scraping my website? I mean, would I be able to detect that the requests are coming from a bot, then find…
Hooman Bahreini
  • 518
  • 6
  • 17
0
votes
1 answer

How to reduce "RAM Cache + Buffer" size due to Scrapy

I am running a couple of spiders in parallel with scrapyd 1.2. Each process raises the buffer significantly during the crawl, as seen in the chart. What is this value and how can I reduce the footprint?
merlin
  • 2,093
  • 11
  • 39
  • 78
-1
votes
0 answers

Is offering the contents of a third party web site offline violating the law?

I have developed a nice little app that crawls a bunch of newspaper web sites and makes their latest content available on my phone offline. It's basically a Pocket app that saves contents automatically, once a day. I am wondering: if I ever wanted…
-2
votes
3 answers

Stop server crash

I run my Python scripts and the Scrapy framework for a web scraping project on my Ubuntu 12.04 (Precise) server. These scripts run all day. The project is in the development/testing stage, so I don't know what the system requirements of this…
Binit Singh
  • 101
  • 2