Webscraping Scopus with lxml.html

Question

I'm trying to scrape Scopus with lxml.html (ultimately to build a list of document titles), but no data seems to come through from page.content: the resulting list, tr_elements, ends up empty.

import requests
import lxml.html as lh

url = 'https://www.scopus.com/results/citedbyresults.uri?sort=plf-f&cite=2-s2.0-84939544008&src=s&nlo=&nlr=&nls=&imp=t&sid=fdbfeac69ab848bdff16425dc6937ffc&sot=cite&sdt=a&sl=0&origin=resultslist&offset=1&txGid=b63ddae0b71deb5a4615640f49db9904'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

Since Inspect Element shows that the rows have varying classes (https://i.stack.imgur.com/6QUvw.png), I've also tried tr_elements = doc.xpath("//tr[contains(@class, 'searchArea')]") to specify which rows to parse, but this also returns an empty list. Any ideas?

score 0 · Answer 1 · answered Oct 27 '20 at 16:52


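I figured it out. The request doesn't return the results page at all: the page that comes back is titled "Access denied | www.scopus.com used Cloudflare to restrict access". Cloudflare is blocking the request, so there are no table rows in the response to parse.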
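For anyone hitting the same empty list, here is a rough sketch of how the block could be spotted before parsing. The helper name describe_response and the "looks blocked" rule are my own assumptions, not anything Scopus or Cloudflare documents; it only uses the standard library's html.parser to read the page title.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_response(status_code, html_text):
    """Return (status_code, title, looks_blocked) for a fetched page.

    Hypothetical helper: the blocking rule below is a heuristic.
    """
    parser = _TitleParser()
    parser.feed(html_text)
    parser.close()
    title = parser.title.strip()
    # Assumption: an error status or an "Access denied" title means we got
    # a block page instead of the real results page.
    looks_blocked = status_code >= 400 or "access denied" in title.lower()
    return status_code, title, looks_blocked
```

With the code from the question, this would be called as describe_response(page.status_code, page.text) right after requests.get(url). If looks_blocked comes back True, the empty tr_elements list is explained by the response itself rather than by the XPath expression.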

answered Oct 27 '20 at 16:52

Alex Yepes
