
I need to scrape data from this website, which is protected by Incapsula. I have already tried two approaches, including techniques suggested by other Stack Overflow users.

APPROACH 1:

from incapsula import IncapSession

headers = {'Host': 'www.vignanam.org',
           'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/7.0.540.0 Safari/534.10',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Accept-Encoding': 'gzip, deflate',
           'Connection': 'keep-alive',
           'Cookie': 'visid_incap_1642409=B+YoelHCSKKN5z/Phs0zXCsF9VsAAAAAQUIPAAAAAACXaWvcNDXdMzcOky/SvffB; incap_ses'
                     '_715_1642409=kyFvSyJuuBVpNuh+aTHsCSsF9VsAAAAAKV6TIWTPSZmb+mOZWeuNHA==',
           'Upgrade-Insecure-Requests': '1'}

session = IncapSession()
response = session.get('http://www.vignanam.org/index.htm#&panel1-1', headers=headers, bypass_crack=True)

print(response.text)

APPROACH 2:

from mechanize import Browser
from bs4 import BeautifulSoup

browser = Browser()

browser.open('https://www.incapsula.com/blog/how-incapsula-protects-against-data-leaks.html')

print(browser.response())

soup = BeautifulSoup(browser.response().read(), features='html5lib')

print(soup)

Both approaches produce the same result.

RESULT/OUTPUT:

<html> 
<head> 
<META NAME="robots" CONTENT="noindex,nofollow"> 
<script src="/_Incapsula_Resource SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3"> </script> 
<body> 
</body>
</html>

How can I get past this and scrape the data from the site? Is there another programming language that would handle this better?

halfer
  • Approach 2 sends the request to the wrong URL. – Corentin Limier Nov 22 '18 at 10:21
  • Please read [Under what circumstances may I add “urgent” or other similar phrases to my question, in order to obtain faster answers?](//meta.stackoverflow.com/q/326569) - the summary is that this is not an ideal way to address volunteers, and is probably counterproductive to obtaining answers. Please refrain from adding this to your questions. – halfer Nov 22 '18 at 22:26
  • Sure. I will improve how I ask questions on Stack Overflow. – Aravindh Thirumaran Nov 23 '18 at 07:24

1 Answer


This:

import requests

requests.get('http://www.vignanam.org/index.htm#&panel1-1').text

worked fine for me.

I didn't see any kind of Incapsula protection, and my request was not blocked.

(curl 'http://www.vignanam.org/index.htm#&panel1-1' worked too in bash; the URL is quoted so that the shell does not interpret the &.)

Returns:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" itemscope itemtype="http://schema.org/Organization">\n<head>\n<title>Vaidika Vignanam - Vedic Chants, Siva, Vishnu, Devi\nStotrams, Annamayya, Tyagaraja, Ramadasa Keerthanas in Sanskrit, Hindi,\nTelugu, Tamil, Kannada, Malayalam, Gujarati, Bengali and Oriya</title>\n<link rel=stylesheet type="text/css" href="css/vignanam.css"/>\n<link rel=stylesheet href="css/anythingslider.css" type="text/css" media=screen />\n<link type="text/css" href="css/jquery-ui-1.8.12.custom.css" rel=Stylesheet />\n<link rel=stylesheet href=aqtree3clickable.css />\n<link rel=stylesheet href="css/glowtabs.css" type="text/css" media=screen />\n<meta content="text/html; charset=utf-8" http-equiv=Content-Type />\n<meta name=keywords content="vedas, vedic chants, shiva stotrams, vishnu stotrams, \n\t\tdevi ......
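To tell the two kinds of response apart in a script, a minimal sketch follows. It assumes the challenge stub is always short and references a "_Incapsula_Resource" script, as in the output shown in the question. The function name and the 1000-character threshold are hypothetical choices for illustration, not part of any library. Real pages served through Incapsula may also embed that script, which is why the sketch checks the length as well.

```python
import re

# Assumption: the stub returned to blocked clients references this script name,
# as in the output quoted in the question.
INCAPSULA_MARKER = re.compile(r"_Incapsula_Resource", re.IGNORECASE)


def looks_like_incapsula_challenge(html, max_len=1000):
    """Heuristic: a short document that references the Incapsula script.

    max_len is a hypothetical threshold; tune it for the site in question.
    """
    return len(html) <= max_len and bool(INCAPSULA_MARKER.search(html))


# The stub from the question and the start of the page from this answer.
challenge_stub = (
    '<html> <head> <META NAME="robots" CONTENT="noindex,nofollow"> '
    '<script src="/_Incapsula_Resource SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,'
    '719d34d31c8e3a6e6fffd425f7e032f3"> </script> <body> </body></html>'
)
real_page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<title>Vaidika Vignanam - Vedic Chants</title></head></html>'
)

print(looks_like_incapsula_challenge(challenge_stub))
print(looks_like_incapsula_challenge(real_page))
```

Running the check on response.text before parsing makes it obvious whether the scraper received real content or the stub, without guessing from an empty BeautifulSoup result.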
Corentin Limier
  • Did you get the complete HTML of the page as output? Can you please show your output? I am still getting the same output that I mentioned in my question. – Aravindh Thirumaran Nov 22 '18 at 10:28
  • Are you using Linux? – Aravindh Thirumaran Nov 22 '18 at 10:29
  • I cannot show you the entire output, as it doesn't fit in a Stack Overflow message – Corentin Limier Nov 22 '18 at 10:29
  • @AravindhThirumaran I gave you the beginning of the output, and it doesn't look like what you got. I don't think that Linux made any difference here. – Corentin Limier Nov 22 '18 at 10:30
  • Do you have any idea why I am only getting this output? Everyone in my office gets the same output that I got. We tried various IPs too, and I still cannot get the real page. – Aravindh Thirumaran Nov 22 '18 at 11:11
  • If you are able to scrape it, could you send me all the mantras in English as a CSV file? – Aravindh Thirumaran Nov 23 '18 at 07:29
  • Which country are you from? – Aravindh Thirumaran Nov 23 '18 at 07:38
  • @AravindhThirumaran I cannot scrape it for you. I'm from France. I cannot help you further; I think that when a website protects itself from scraping, it means that you shouldn't scrape it. I was just mentioning that maybe there was an error, because I could launch the request fine. – Corentin Limier Nov 23 '18 at 09:10
  • @CorentinLimier It implements a dynamic blocking algorithm, which means it does not block every request outright; it watches you and then starts blocking. The current answer is wrong. – Reishin Feb 03 '21 at 19:13
  • @Reishin Is the current answer wrong, or is the current question incomplete? :) – Corentin Limier Feb 04 '21 at 16:32
  • The OP's question is pretty clear: how to get the content of a site behind Incapsula "reliably". However, SO is not a "help bypass or crack things" site, and the answer is not as simple as curl http://example.com. It would be better to close the question completely. – Reishin Feb 05 '21 at 18:07