
I am using BeautifulSoup (bs4) to extract data from an SSRN paper page; for reference, the URL is https://papers.ssrn.com/sol3/papers.cfm?abstract_id=962461. The data I want is in the PlumX metrics widget on the right of the page. If you hover over it you will see 'Citations: 95', and I would like to extract the 95. In the HTML it appears as:

`<li class="plx-citation">
       <span class="ppp-label">Citation Indexes: </span>
       <span class="ppp-count">95</span>
</li>`

I have tried many approaches in Python but none of them seem to work:

1) Extracting the information by class

`soup.find("li", {"class": "ppp-count"})`

The output is None

2) Extracting the information by xpath by using lxml instead of Soup:

`tree = html.fromstring(paper_url.content)
 r = tree.xpath('//*[@id="maincontent"]/div[2]/div[2]/div/div[2]/div/div[2]/div/div/div/ul/li[1]/ul/li/span[2]')`

The output is []

3) I printed the whole parsed document from both BeautifulSoup and lxml, and the PlumX data is simply missing: those branches of the HTML are not there, and the citations section has no HTML either.

The markup is there on the page when I check it with Inspect Element in a browser, but it is never in the HTML my code retrieves. I even tried a different parser, html5lib, but that did not fix the problem. Could someone kindly tell me what to do?

0m3r
Afr0

1 Answer


The main reason you can't extract the desired value is that the widget is loaded via JavaScript, which fetches the data from an API. The numbers are therefore never part of the HTML that `requests` downloads, whichever parser you use.
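A quick way to confirm this kind of problem is to look for the class name in the raw response text before any parsing. The sketch below is only an illustration: `appears_in_static_html` is a hypothetical helper, and the `static` string is a made-up stand-in for a server response, not SSRN's actual markup.

```python
def appears_in_static_html(html_text: str, marker: str) -> bool:
    """Return True if `marker` occurs in the raw HTML text."""
    return marker in html_text


# Markup as Inspect Element shows it, after JavaScript has run.
rendered = '<li class="plx-citation"><span class="ppp-count">95</span></li>'

# Hypothetical stand-in for the server's response: an empty placeholder
# that a script fills in later. The real markup will differ.
static = '<div class="widget-placeholder"></div><script src="widget.js"></script>'

print(appears_in_static_html(rendered, "plx-citation"))  # True
print(appears_in_static_html(static, "plx-citation"))    # False
```

If the same check on the text of your own response comes back `False`, no selector or XPath can find the element, and the data has to come from wherever the script gets it.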

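```python
import json  # only needed for the optional pretty-print below

import requests

# Query parameters copied from the widget's XHR request in the browser.
params = {
    'type': 'ssrn_id',
    'id': '962461',
    'site': 'ssrn',
    'href': 'https://plu.mx/ssrn/a/?ssrn_id=962461',
    'ref': '',
    'pageToken': 'f0399e1a-c031-0c64-6619-423f-7ebf45fa0416',
    'isElsWidget': 'false'
}


def main(url):
    # Call the PlumX widget API directly instead of parsing the rendered page.
    r = requests.get(url, params=params).json()
    print(r['statistics']['Citations'][0]['count'])
    # print(json.dumps(r, indent=4))  # uncomment to pretty-print the full response


main("https://api.plu.mx/widget/other/artifact")
```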

Output:

95
  • Do you have a suggestion for retrieving this information in real time? I'm trying to do this over a lot of papers, so I can get the `id` and the `href`, but I'm not sure about the `pageToken` – Afr0 Apr 11 '20 at 08:36
  • @Afr0 check [my previous answer](https://stackoverflow.com/a/61045691/7658985), which explains how to find the `XHR` request. – αԋɱҽԃ αмєяιcαη Apr 11 '20 at 08:38
  • Hi, I read through it. If I understand correctly, I can use the `requests_html` library to do it? I don't want to use selenium because it would destroy my computation time (around 10k papers). I'm also not sure how you got the token from the website. I clicked on `Inspect Element > Network > XHR` and I don't see the `pageToken` – Afr0 Apr 11 '20 at 08:53
  • @Afr0 once you have located the `XHR` request, click the `parameters` [tab](https://imgur.com/wvwcRdu). In this case you won't even need `requests_html` – αԋɱҽԃ αмєяιcαη Apr 11 '20 at 08:56
  • Thanks! I see it there now! Is there a way to access the XHR data directly in Python? – Afr0 Apr 11 '20 at 09:02
  • @Afr0 I can only answer the question that was actually asked; going further would be against the community rules, which aim to keep each question to one specific problem. That said, you don't need the `pageToken` parameter at all. It is an automatically generated token used by a `JavaScript` function for other purposes that your question doesn't depend on. – αԋɱҽԃ αмєяιcαη Apr 11 '20 at 09:09
  • I understand! thank you very much, ignoring it does the trick! – Afr0 Apr 11 '20 at 09:18
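
Following up on the thread above, here is a minimal sketch of how the same request might be repeated over many papers without `pageToken`. The helper names (`build_params`, `extract_citation_count`, `fetch_counts`) are hypothetical. `session` is assumed to be a `requests.Session` or anything with a compatible `get` method. Returning `None` for a missing `Citations` entry is a guess about what the API does for papers without citations, not something the answer confirms.

```python
from typing import Any, Dict, Iterable, Optional

API_URL = "https://api.plu.mx/widget/other/artifact"


def build_params(ssrn_id: str) -> Dict[str, str]:
    """Build the query parameters for one paper, leaving out pageToken."""
    return {
        "type": "ssrn_id",
        "id": ssrn_id,
        "site": "ssrn",
        "href": f"https://plu.mx/ssrn/a/?ssrn_id={ssrn_id}",
        "ref": "",
        "isElsWidget": "false",
    }


def extract_citation_count(payload: Dict[str, Any]) -> Optional[int]:
    """Return the first citation count (assumed to be an integer), or None if absent."""
    citations = (payload.get("statistics") or {}).get("Citations") or []
    if not citations:
        return None
    return citations[0].get("count")


def fetch_counts(session: Any, ssrn_ids: Iterable[str]) -> Dict[str, Optional[int]]:
    """Query the API once per id and map each id to its citation count."""
    counts: Dict[str, Optional[int]] = {}
    for ssrn_id in ssrn_ids:
        response = session.get(API_URL, params=build_params(ssrn_id), timeout=10)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        counts[ssrn_id] = extract_citation_count(response.json())
    return counts
```

With `requests` installed, the real call would look like `fetch_counts(requests.Session(), ["962461"])`. Reusing one session keeps the connection open between requests, which should help with a batch of around 10k papers, and adding a short pause between requests would be a reasonable courtesy to the API.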