2

I need some help with a code that I am working on for my coursera class, the objective is as follow: Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: J from link:(http://py4e-data.dr-chuck.net/known_by_Shannon.html)

I have written a code for this task, but it seems like it only worked for the first item, and every site since that first one, the code's list malfunctions. My idea is to get the Html code and append the url into a list, then find the 18th item from the list, then redirect the entire loop with the new url and delete the old list. Repeating the process for 7 times. I am seriously confused with whether where exactly the code went wrong. Thanks in advance.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
term_counter = (0)
file = list()
regex = list()
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for I in range(7) :
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    del file[:]
    file = list()
    for tag in tags :
        file.append(tag)
        print(tag.contents[0])
        url = tag.get('href')
        print (url)
    for items in range(17,18) :
        print(file[items])

I am new to the site, I am not sure if this is the correct place to ask python questions, if not please notify me and I will re-post this to the correct location.

Mr. T
  • 11,960
  • 10
  • 32
  • 54
py va
  • 31
  • 4
  • 1
    `range(17,18)` looks strange, it is the same as just `print(file[17])` as `range(17, 18)` yields a single number only – awesoon Sep 02 '18 at 06:10
  • `del file[:]; file = list()` is the same as `file[:]` and the same as `file = list()`. Python simply attributes the new object to the name `file`, no need to delete the values before. Also, since you don't actually use `I` in your loop, it is a convention to indicate this with an underscore `_I`. – Mr. T Sep 02 '18 at 09:02

3 Answers3

0

You can try this:

import bs4
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Shannon.html'

for I in range(7) :
    session = requests.get(url)
    soup = bs4.BeautifulSoup(session.text, "html.parser")
    target_list = soup.findAll("li")[17]
    href = target_list.find("a").get("href")
    link_name = target_list.find("a").text

    if(I == 6):
        print('Solution is ' + link_name)
    else:
        print('Name is: ' + link_name)
        print('Going to web is: ' + href)
        url = href

The answer will be the last link_name printed.

EDIT:

enter image description here

Juan Bermudez
  • 430
  • 3
  • 6
0

The problem is that although you print out the 18th tags entry, you don't set url to this value. Unfortunately, you use url also within your tags loop, so you don't notice this mistake. In your code, url is still set to the last entry of tags. If you print out I with the actual url used in your loop (uncomment the appropriate lines), you will see this:

0 http://py4e-data.dr-chuck.net/known_by_Shannon.html
['Riya', 'Terri', 'Coban', 'Oswald', 'Codie', 'Arshjoyat', 'Carli', 'Aieecia', 'Ronnie', 'Yelena', 'Abid', 'Prithvi', 'Ellenor', 'Shayla', 'Chala', 'Nelson', 'Chaitanya', 'Stacey', 'Karis', 'Mariyah', 'Jamie', 'Breeanna', 'Kendall', 'Adelaide', 'Aimiee', 'Manwen', 'Dennys', 'Benjamyn', 'Reynelle', 'Jesuseun', 'Malik', 'Brigitte', 'Farah', 'Youcef', 'Ruqayah', 'Mili', 'Caitaidh', 'Raul', 'Katelyn', 'Yakup', 'Cohan', 'Lylakay', 'Dougray', 'Silvana', 'Roxanne', 'Tanchoma', 'Andie', 'Aarman', 'Kyalah', 'Tayyab', 'Malikah', 'Bo', 'Oona', 'Daniil', 'Wardah', 'Jessamy', 'Karly', 'Tala', 'Ilyaas', 'Maram', 'Ruaidhri', 'Donna', 'Liza', 'Aileigh', 'Muzzammil', 'Chi', 'Serafina', 'Abbas', 'Rhythm', 'Jonny', 'Niraj', 'Ciara', 'Kylen', 'Demmi', 'Christianna', 'Tanzina', 'Brianna', 'Kevyn', 'Hariot', 'Maisie', 'Naideen', 'Nicolas', 'Suvi', 'Areeb', 'Kiranpreet', 'Rachna', 'Umme', 'Caela', 'Miao', 'Tansy', 'Miah', 'Luciano', 'Karolina', 'Rivan', 'Cavan', 'Benn', 'Haydn', 'Zaina', 'Rafi', 'Ahmad']
<a href="http://py4e-data.dr-chuck.net/known_by_Stacey.html">Stacey</a>
1 http://py4e-data.dr-chuck.net/known_by_Ahmad.html
['Tilhi', 'Rachel', 'Latif', 'Deryn', 'Pawel', 'Anna', 'Blake', 'Brehme', 'Jo', 'Laurajane', 'Khayla', 'Declyan', 'Graidi', 'Foosiya', 'Nabeeha', 'Otilija', 'Dougal', 'Adeena', 'Alfie', 'Angali', 'Lilah', 'Saadah', 'Kelam', 'Kensey', 'Tabitha', 'Peregrine', 'Abdisalam', 'Presley', 'Allegria', 'Harish', 'Arshjoyat', 'Hussan', 'Sammy', 'Ama', 'Leydon', 'Anndra', 'Anselm', 'Logyne', 'Fion', 'Jacqui', 'Reggie', 'Mounia', 'Pedro', 'Hussain', 'Raina', 'Inka', 'Shaylee', 'Riya', 'Phebe', 'Uzayr', 'Isobella', 'Abdulkadir', 'Johndean', 'Charlotte', 'Moray', 'Saraah', 'Liana', 'Keane', 'Maros', 'Robi', 'Rowanna', 'Wesley', 'Maddox', 'Annica', 'Oluwabukunmi', 'Jiao', 'Nyomi', 'Hamish', 'Bushra', 'Marcia', 'Rimal', 'Kaceylee', 'Limo', 'Dela', 'Cal', 'Rhudi', 'Komal', 'Stevey', 'Amara', 'Nate', 'Roma', 'Fatou', 'Marykate', 'Abiya', 'Bay', 'Kati', 'Carter', 'Niraj', 'Maisum', 'Jaz', 'Coban', 'Harikrishna', 'Armani', 'Muir', 'Ilsa', 'Benjamyn', 'Russel', 'Emerson', 'Rehaan', 'Veronica']
<a href="http://py4e-data.dr-chuck.net/known_by_Adeena.html">Adeena</a>

To avoid this problem, you have to set url for the next loop to the 18th entry:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
#import re
term_counter = (0)
file = list()
#regex = list()
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for _I in range(7) :
    #print(_I, url) <- this prints out the _I value of the loop and the url used in this round
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    #print([item.contents[0] for item in tags]) <- this prints out a list of all names on this page
    file = list()
    for tag in tags :
        file.append(tag)
        #this is the last url you used in your code for the next _I loop
        url = tag.get('href') 
    #so we have to redefine url as the 18th entry in your list for the next _I loop round
    url = file[17].get("href")
    print("The next url we will use is {}".format(url))
Mr. T
  • 11,960
  • 10
  • 32
  • 54
0

Seems to me, the class's question begs for a recursive approach:

import bs4
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Shannon.html'

def get_url(url):
     page_response = requests.get(url)
     soup = BeautifulSoup(page_response.content, "html.parser")
     return soup.find_all('a')[17].get('href')

def routine(url, counter):
    if counter > 0:
    url = get_url(url)
    print(url)
    counter -= 1
    routine(url, counter)

routine(url, 7)

output:

http://py4e-data.dr-chuck.net/known_by_Stacey.html
http://py4e-data.dr-chuck.net/known_by_Zoya.html
http://py4e-data.dr-chuck.net/known_by_Haiden.html
http://py4e-data.dr-chuck.net/known_by_Ayrton.html
http://py4e-data.dr-chuck.net/known_by_Leilan.html
http://py4e-data.dr-chuck.net/known_by_Thorben.html
http://py4e-data.dr-chuck.net/known_by_Jahy.html

The first character of the last url's name is confirmed to begin with a 'J'.

gregory
  • 10,969
  • 2
  • 30
  • 42