
I'm trying to extract some tags with Beautiful Soup so that I can generate a BibTeX entry from the data.

The Brazilian ISBN site shows the information for a given ISBN when accessed from a browser. But when I try urlopen or requests, I get an HTTPError with code 500. The same error also happened in the browser, and it only went away after I closed the tab and opened the same link in a new one.

The website asks for a captcha. I think the captcha only has to be answered for the first search; after that, just changing the ISBN in the URL works.

After that, opening 'link+isbn' shows the information about the book. I'm trying to scrape this 'link+isbn' page with Beautiful Soup.

Link that works: http://www.isbn.bn.br/website/consulta/cadastro/isbn/9788521208037 -- (do a first search in 'www.isbn. ... /cadastro' first, because of the captcha)

I have tried several pieces of code, and for now I'm just trying to get the HTML of the page without the 500 error.

import sys
from urllib.error import HTTPError
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

BRbase = 'http://www.isbn.bn.br/website/consulta/cadastro/isbn/'

Lista_ISBN = ['9788542209402',
              '9788542206937',
              '9788521208037']

for isbn in Lista_ISBN:
    page = BRbase + isbn
    url = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # urlopen is the call that raises HTTPError, so it has to sit inside the try
        html = urlopen(url).read()
        #code to beautiful soup and generate bibtex
        print(page)
        print(html)
    except HTTPError as err:
        print('ISBN {} not found (HTTP {})'.format(isbn, err.code))
sys.exit(1)
Jean Pimenta

1 Answer

import requests
from bs4 import BeautifulSoup

# JSESSIONID of a session in which the captcha has already been solved
headers = {"Cookie": 'JSESSIONID=60F8CDFBD408299B40C7E7C2459DC624'}

isbns = ['9788542209402', '9788542206937', '9788521208037']

for isbn in isbns:
    print(f"{'*'*20}Extracting ISBN# {isbn}{'*'*20}")
    r = requests.get(
        f"http://www.isbn.bn.br/website/consulta/cadastro/isbn/{isbn}", headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # print the text of the element that wraps each <strong> label (label + value)
    for label in soup.find_all('strong')[2:10]:
        print(label.parent.get_text(strip=True))

Output:

********************Extracting ISBN# 9788542209402********************
ISBN978-85-422-0940-2
TítuloSPQR
Edição1
Ano Edição2017
Tipo de SuportePapel
Páginas448
Editor(a)Planeta
ParticipaçõesMary Beard ( Autor)Luiz Gil Reyes (Tradutor)
********************Extracting ISBN# 9788542206937********************
ISBN978-85-422-0693-7
TítuloEm nome de Roma
Edição1
Ano Edição2016
Tipo de SuportePapel
Páginas560
Editor(a)Planeta
ParticipaçõesAdrian Goldsworthy ( Autor)Claudio Blanc (Tradutor)
********************Extracting ISBN# 9788521208037********************
ISBN978-85-212-0803-7
TítuloCurso de física básica: ótica, relatividade e física quântica
Edição2
Ano Edição2014
Tipo de SuportePapel
Páginas0
Editor(a)Blucher
ParticipaçõesH. Moysés Nussenzveig ( Autor)
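
Since the goal in the question is a BibTeX entry, here is a minimal sketch of how the printed lines could be turned into one. It is not part of the original answer. It assumes the lines look exactly like the output above (label fused with the value), and the helper names `split_line`, `parse_participants` and `to_bibtex`, the choice of BibTeX fields, and the use of the ISBN digits as citation key are all illustrative assumptions.

```python
import re

# Labels exactly as they appear in the output above
LABELS = ["ISBN", "Título", "Edição", "Ano Edição", "Tipo de Suporte",
          "Páginas", "Editor(a)", "Participações"]


def split_line(line):
    """Split a fused line such as 'TítuloSPQR' into ('Título', 'SPQR')."""
    for label in LABELS:
        if line.startswith(label):
            return label, line[len(label):].strip()
    raise ValueError(f"unknown label in line: {line!r}")


def parse_participants(raw):
    """Split 'Mary Beard ( Autor)Luiz Gil Reyes (Tradutor)' into (name, role) pairs."""
    pairs = re.findall(r"([^()]+?)\s*\(\s*([^()]+?)\s*\)", raw)
    return [(name.strip(), role.strip()) for name, role in pairs]


def to_bibtex(fields):
    """Build a @book entry from a dict keyed by the site's Portuguese labels."""
    people = parse_participants(fields.get("Participações", ""))
    authors = [name for name, role in people if role == "Autor"]
    key = re.sub(r"\D", "", fields["ISBN"])  # assumed citation key: ISBN digits only
    entry = {
        "author": " and ".join(authors),
        "title": fields.get("Título", ""),
        "publisher": fields.get("Editor(a)", ""),
        "year": fields.get("Ano Edição", ""),
        "edition": fields.get("Edição", ""),
        "isbn": fields["ISBN"],
    }
    # Skip empty fields so the entry only contains data that was actually scraped
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in entry.items() if v)
    return f"@book{{{key},\n{body}\n}}"


if __name__ == "__main__":
    lines = [
        "ISBN978-85-422-0940-2",
        "TítuloSPQR",
        "Edição1",
        "Ano Edição2017",
        "Tipo de SuportePapel",
        "Páginas448",
        "Editor(a)Planeta",
        "ParticipaçõesMary Beard ( Autor)Luiz Gil Reyes (Tradutor)",
    ]
    fields = dict(split_line(line) for line in lines)
    print(to_bibtex(fields))
```

In the scraping loop, the lines would be collected into a list instead of printed and then passed through `split_line`. Only participants with the role `Autor` are used as authors; translators are ignored in this sketch.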
  • For me, the output is just the ***Extracting ISBN#... *** lines. I think it's because of that JSESSIONID. – Jean Pimenta Dec 15 '19 at 20:07
  • @JeanPimenta you need to solve the captcha once, then pass the `JSESSIONID` while it's still active to be able to scrape. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 20:17
  • That I understand, but how do I get this JSESSIONID? After I solve the captcha, how do I get access to this value? – Jean Pimenta Dec 15 '19 at 20:40
  • @JeanPimenta have you run the repl.it? – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 20:40
  • Yes... it works well. But how can I get this? I have Python code that looks up ISBN and DOI codes in other APIs. But some ISBNs, like the ones in the example, from Brazil, do not appear in those databases. They do appear on this site. My idea is to incorporate the search on this website into the Python code... but I don't know how to get the value of the JSESSIONID. – Jean Pimenta Dec 15 '19 at 20:45
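
The thread does not say where the JSESSIONID value comes from. One plausible route, offered here only as a sketch, is to solve the captcha in a browser, copy the site's cookie string from the browser's developer tools, and hand it to the script. The code below assumes that workflow and uses the standard library to pick the JSESSIONID out of the copied string. The helper name `session_headers` and the extra `other=1` cookie are hypothetical, and the User-Agent value is the one from the question's code.

```python
from http.cookies import SimpleCookie


def session_headers(raw_cookie_header):
    """Return request headers carrying only the JSESSIONID from a copied Cookie string."""
    jar = SimpleCookie()
    jar.load(raw_cookie_header)  # parses 'name=value; name2=value2'
    if "JSESSIONID" not in jar:
        raise ValueError("no JSESSIONID in the copied cookie string")
    return {
        "Cookie": f"JSESSIONID={jar['JSESSIONID'].value}",
        "User-Agent": "Mozilla/5.0",
    }


if __name__ == "__main__":
    # Placeholder string in the same shape as the value used in the answer;
    # a real value has to come from a session in which the captcha was solved.
    copied = "JSESSIONID=60F8CDFBD408299B40C7E7C2459DC624; other=1"
    print(session_headers(copied))
```

The returned dictionary can be passed as `headers=` to `requests.get`, as in the answer. The session presumably expires after some time, so the value would need to be refreshed by solving the captcha again.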