
I am new to Python and I'm trying to learn web scraping.

I have the following HTML and would like to know how to get/print the href (the link):

<h1><a href="https://www.nytimes.com/tips"> Got a confidential news tip?</a></h1>

  • similar to http://stackoverflow.com/questions/42173719/how-to-use-regular-expression-to-retrieve-data-in-python/42173798#42173798 – GoingMyWay Feb 25 '17 at 09:23
  • another one similar https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python – Tudor Feb 25 '17 at 09:24

1 Answer


You can use BeautifulSoup to do this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Download the page and parse it with Python's built-in HTML parser
response = urlopen("http://someurl.com")
page_source = response.read()
soup = BeautifulSoup(page_source, 'html.parser')

# Collect every <h1> element on the page
x = soup.find_all('h1')
print(x)

Then all you have to do is use the re module to extract the href value from that output.
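A minimal sketch of that last step, assuming the printed output looks like the heading from the question. The `h1_output` string, the `HREF_PATTERN` name and the `extract_hrefs` helper are made up for illustration; in real use you would pass `str(x)` instead of the hard-coded string.

```python
import re

# Hypothetical stand-in for str(x): the printed form of the <h1> tags
# returned by soup.find_all('h1') for the heading in the question.
h1_output = (
    '[<h1><a href="https://www.nytimes.com/tips">'
    ' Got a confidential news tip?</a></h1>]'
)

# Capture whatever sits between the quotes that follow href=
HREF_PATTERN = re.compile(r'href=["\'](.*?)["\']')


def extract_hrefs(html_text):
    """Return every href value found in html_text, in order of appearance."""
    return HREF_PATTERN.findall(html_text)


for link in extract_hrefs(h1_output):
    print(link)
```

A regular expression like this is fine for a quick one-off, but it is fragile on messy HTML. BeautifulSoup tags also expose their attributes directly, so calling `.get('href')` on an `<a>` tag is usually the more robust way to read the link.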

cookiedough