Extracting links and titles only

Question

I am trying to extract the links and their titles from an anime website. However, I am only able to extract the whole tag; I just want the href and the title.

Here's the code I am using:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all('div', class_='list_episode'):
    href = link.get('href')
    print(href)

And here's the website HTML:

<a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
                    Phi Brain: Kami no Puzzle 3 episode 25                  <span> 26-03-2014</span>
        </a>

And this is the output:

C:\Python34\python.exe C:/Users/M.Murad/PycharmProjects/untitled/Webcrawler.py
None

Process finished with exit code 0

All I want is every link and title in that class (the episodes and their links).

Thanks.

score 1 · Answer 1 · answered Sep 09 '16 at 05:34

The entire page has only one element with class 'list_episode', so you can filter out the 'a' tags and then fetch the value for attribute 'href':

In [127]: import requests
     ...: from bs4 import BeautifulSoup
     ...: 
     ...: r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
     ...: soup = BeautifulSoup(r.content, "html.parser")
     ...: 

In [128]: [x.get('href') for x in soup.find('div', class_='list_episode').find_all('a')]
Out[128]: 
[u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-24',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-23',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-22',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-21',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-20',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-19',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-18',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-17',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-16',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-15',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-14',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-13',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-12',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-11',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-10',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-9',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-8',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-7',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-6',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-5',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-4',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-3',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-2',
 u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-1']
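
Since the question also asks for the titles, here is a minimal offline sketch of the same idea that collects (href, title) pairs. It swaps find/find_all for a CSS selector via select(), which is a different technique from the one above. The SAMPLE_HTML string is an assumption modeled on the anchor shown in the question (the second anchor's title is inferred, not copied from the site), so the real page may differ.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the single anchor quoted in the question.
# The real page layout may differ.
SAMPLE_HTML = """
<div class="list_episode">
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25"
     title="Phi Brain: Kami no Puzzle 3 episode 25">
    Phi Brain: Kami no Puzzle 3 episode 25 <span> 26-03-2014</span>
  </a>
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-24"
     title="Phi Brain: Kami no Puzzle 3 episode 24">
    Phi Brain: Kami no Puzzle 3 episode 24
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.list_episode a[href]" matches only anchors that sit inside the div
# and actually carry an href, so a["href"] cannot raise KeyError here.
pairs = [(a["href"], a.get("title"))
         for a in soup.select("div.list_episode a[href]")]

for href, title in pairs:
    print(title, "->", href)
```

To run this against the live site, replace SAMPLE_HTML with r.content from the requests call above.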

score -1 · Accepted Answer · edited Sep 13 '16 at 17:16

What is happening is that the element you select is the <div> with class "list_episode", which has no "href" of its own, so link.get('href') returns None. That div contains many anchors (<a>), and each of them holds the link in "href" and the title in "title".

Just modify the code a little and you will have what you want.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all('div', class_='list_episode'):
    href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
    print(href_and_title)

The output will be a list of the form [(href, title), (href, title), ..., (href, title)].

Edit(Explanation):

So what is happening is when you do

soup.find_all('div', class_='list_episode')

It gives you every <div> in the HTML page that has class "list_episode". Each such div holds a large set of anchors with different "href" and "title" values, so to get those we loop over the anchors (there can be multiple <a> tags) and call ".get()" on each one.

 href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]

I hope it's clearer this time.
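
To make the explanation concrete, here is a small sketch that reproduces the None from the question and then the fix, without hitting the network. The one-anchor html string and its "/episode-1" link are made up for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical one-anchor page, just enough to show the behaviour.
html = ('<div class="list_episode">'
        '<a href="/episode-1" title="Episode 1">Episode 1</a>'
        '</div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="list_episode")

# The div itself has no href attribute, so .get() returns None.
# This is what the original loop was printing.
div_href = div.get("href")

# The attributes live on the <a> tags nested inside the div.
anchor_pairs = [(a.get("href"), a.get("title")) for a in div.find_all("a")]

print(div_href)
print(anchor_pairs)
```

Tag.get() returns None for a missing attribute instead of raising, which is why the original script ran without an error but printed nothing useful.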

answered Sep 09 '16 at 07:01 by Neeraj Komuravalli · edited Sep 13 '16 at 17:16 by Mr Lister

Hey, thanks for the answer, the code is working. Can you please explain how the code works and why the for loop is at the end? And is it hard to print them on separate lines? Sorry, I'd like to edit your code but I can't do so if I don't understand it :) – AbdulAziz Sep 09 '16 at 07:49
@AbdulAziz I made the requested changes and tried to explain myself; I hope you understood it. – Neeraj Komuravalli Sep 09 '16 at 07:58
Thanks again, guess I have a better understanding now, but still how do I print the output in lines instead of just one line? – AbdulAziz Sep 11 '16 at 04:50
I solved it by changing your code a little. Here's what I did:

    for link in soup.find_all('div', class_='list_episode'):
        for a in link.find_all('a'):
            hat = a.get("href"), a.get("title")
            print(hat)

– AbdulAziz Sep 11 '16 at 07:37
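
For the "one per line" follow-up, a possible variation on the nested loop is sketched below. It formats each episode as "title<TAB>href" instead of printing a raw tuple. The episode_lines helper and the sample markup are hypothetical names introduced for this illustration only.

```python
from bs4 import BeautifulSoup


def episode_lines(html):
    """Return one 'title<TAB>href' string per anchor inside div.list_episode."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for div in soup.find_all("div", class_="list_episode"):
        for a in div.find_all("a"):
            lines.append("{}\t{}".format(a.get("title"), a.get("href")))
    return lines


# Hypothetical markup for illustration; the real page may differ.
sample = """
<div class="list_episode">
  <a href="/episode-2" title="Episode 2">Episode 2</a>
  <a href="/episode-1" title="Episode 1">Episode 1</a>
</div>
"""

# Joining with newlines prints each episode on its own line.
print("\n".join(episode_lines(sample)))
```

Returning a list of strings rather than printing inside the loop keeps the formatting separate from the scraping, so the same lines could be written to a file instead.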
