
I've created a script to parse two fields from every movie container from a webpage. The script is doing fine.

I'm trying to use getattr() to scrape the text and the src for two fields, movie_name and image_link. It works for movie_name but fails for image_link.

The commented-out function below works when I uncomment it. However, my goal here is to use getattr() to parse src.

import requests
from bs4 import BeautifulSoup

url = "https://yts.am/browse-movies"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

# def get_information(url):
#     res = requests.get(url,headers=headers)
#     soup = BeautifulSoup(res.text,'lxml')
#     for row in soup.select(".browse-movie-wrap"):
#         movie_name = row.select_one("a.browse-movie-title").text
#         image_link = row.select_one("img.img-responsive").get("src")
#         yield movie_name,image_link

def get_information(url):
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
        image_link = getattr(row.select_one("img.img-responsive"),"src",None)
        yield movie_name,image_link

if __name__ == '__main__':
    for items in get_information(url):
        print(items)

How can I scrape src using the getattr() function?

baduker
MITHU
    "src" is not an attribute of the `select_one` output in the Python sense of the word. You're expected to use the `get` method to fetch the tag's attributes (in the HTML sense of the word). – Tim Roberts Apr 06 '21 at 18:58

1 Answer

The reason this works:

movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)

But this doesn't:

image_link = getattr(row.select_one("img.img-responsive"),"src",None)

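is because text is a Python attribute of the bs4 Tag object (a property that returns the element's text), so getattr() finds it. src, on the other hand, is an HTML attribute of the <img> element, not a Python attribute of the Tag object. bs4 treats an unknown attribute name as a search for a child tag with that name, so getattr(tag, "src", None) looks for a <src> child, finds none, and returns None.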
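
To see the difference in isolation, here is a minimal sketch. The HTML is made up to mimic one movie container (it is not the real page), and it uses the built-in "html.parser" instead of lxml so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics one movie container; not the real page.
html = '''
<div class="browse-movie-wrap">
  <a class="browse-movie-title" href="/movies/example">Example Movie</a>
  <img class="img-responsive" src="/covers/example.jpg" alt="Example Movie">
</div>
'''

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("a.browse-movie-title")
image = soup.select_one("img.img-responsive")

# "text" is a Python attribute (a property) of the Tag object.
title_text = getattr(title, "text", None)

# "src" is not a Python attribute of the Tag, so this does not read the
# HTML attribute; bs4 looks for a child <src> tag and finds nothing.
src_via_getattr = getattr(image, "src", None)

# The HTML attribute lives in the tag's .attrs dictionary.
src_via_attrs = image.attrs["src"]

print(title_text)
print(src_via_getattr)
print(src_via_attrs)
```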

If you look at attributes of:

row.select_one("a.browse-movie-title").attrs

You'll get:

{'href': 'https://yts.mx/movies/imperial-blue-2019', 'class': ['browse-movie-title']}

Likewise, for

row.select_one(".img-responsive").attrs

The output is:

{'class': ['img-responsive'], 'src': 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg', 'alt': 'Imperial Blue (2019) download', 'width': '170', 'height': '255'}

So, if we experiment and do this:

getattr(row.select_one(".img-responsive"), "attrs", None).src

We'll end up with:

AttributeError: 'dict' object has no attribute 'src'

Therefore, as mentioned in the comments, getattr() is the wrong tool for reading HTML attributes from bs4 objects. Use either the .get() method or the [key] syntax on the tag instead.

For example:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}


def get_information(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = row.select_one("a.browse-movie-title").getText()
        image_link = row.select_one("img.img-responsive").get("src")
        yield movie_name, image_link


if __name__ == '__main__':
    for items in get_information("https://yts.am/browse-movies"):
        print(items)

This produces:

('Imperial Blue', 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg')
('Ablaze', 'https://img.yts.mx/assets/images/movies/ablaze_2001/medium-cover.jpg')
('[CN] Long feng zhi duo xing', 'https://img.yts.mx/assets/images/movies/long_feng_zhi_duo_xing_1984/medium-cover.jpg')
('Bobbie Jo and the Outlaw', 'https://img.yts.mx/assets/images/movies/bobbie_jo_and_the_outlaw_1976/medium-cover.jpg')
('Adam Resurrected', 'https://img.yts.mx/assets/images/movies/adam_resurrected_2008/medium-cover.jpg')
('[ZH] The Wasted Times', 'https://img.yts.mx/assets/images/movies/the_wasted_times_2016/medium-cover.jpg')
('Promise', 'https://img.yts.mx/assets/images/movies/promise_2021/medium-cover.jpg')

and so on ...
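
The choice between .get() and the [key] syntax matters when an attribute is missing. A small sketch, using made-up HTML in which the second <img> deliberately has no src:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <img> deliberately has no src attribute.
html = '<img class="img-responsive" src="/covers/a.jpg"><img class="img-responsive">'
soup = BeautifulSoup(html, "html.parser")
with_src, without_src = soup.select("img.img-responsive")

# .get() returns None (or a default you supply) for a missing attribute.
present = with_src.get("src")
missing = without_src.get("src")
fallback = without_src.get("src", "no-image")

# The [key] syntax raises KeyError for a missing attribute instead.
try:
    without_src["src"]
    raised = False
except KeyError:
    raised = True

print(present, missing, fallback, raised)
```

So .get() is the more forgiving option when some images may lack a src, while [key] makes a missing attribute fail loudly.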

Finally, if you really want to parse this with getattr() you can try this:

movie_name = getattr(row.select_one("a.browse-movie-title"), "getText", None)()
image_link = getattr(row.select_one("img.img-responsive"), "attrs", None)["src"]

You'll get the same results but, IMHO, this is needlessly complicated and much less readable than plain .getText() and .get("src"). It is also no safer: both lines raise a TypeError if select_one() returns None.
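
If the point of getattr() is to avoid errors when select_one() finds nothing, one possible approach is to fetch the attrs dictionary with an empty-dict default and then call .get() on it. The helper names safe_text and safe_attr below are hypothetical, and the HTML is made up for illustration; treat this as a sketch, not the one right way:

```python
from bs4 import BeautifulSoup


def safe_text(tag, default=None):
    """Return tag.text, or default when tag is None."""
    return getattr(tag, "text", default)


def safe_attr(tag, name, default=None):
    """Return an HTML attribute, or default when the tag or attribute is missing."""
    # getattr() falls back to an empty dict when tag is None,
    # so the .get() call is always made on a dictionary.
    return getattr(tag, "attrs", {}).get(name, default)


# Made-up markup: the second container is empty on purpose.
html = '''
<div class="browse-movie-wrap">
  <a class="browse-movie-title">Example Movie</a>
  <img class="img-responsive" src="/covers/example.jpg">
</div>
<div class="browse-movie-wrap"></div>
'''

soup = BeautifulSoup(html, "html.parser")
results = [
    (
        safe_text(row.select_one("a.browse-movie-title")),
        safe_attr(row.select_one("img.img-responsive"), "src"),
    )
    for row in soup.select(".browse-movie-wrap")
]
print(results)
```

The empty container yields (None, None) rather than raising an error.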

baduker