
The project: collecting metadata for WordPress plugins. Approximately 50 plugins are of interest, but the challenge is that I want to fetch the metadata of all the existing plugins first and only then filter out those plugins that have the newest timestamp, i.e. that were updated most recently. It is all about recency. So the base URL to start with is this:

url = "https://wordpress.org/plugins/browse/popular/

Aim: I want to fetch all the metadata of the plugins that we find on the first 50 pages of the popular plugins, for example:

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
...and so on and so forth.

Here is what I have so far:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    # page 1 is the bare listing URL; later pages use the /page/<n>/ suffix
    futures = [executor.submit(main, url, num)
               for num in [""] + [f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())

That runs, but it gives back some errors (see below).

Extracting https://wordpress.org/plugins/use-google-libraries/
Extracting https://wordpress.org/plugins/blocksy-companion/
Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/accesspress-social-share/Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/wp-whatsapp/

Here is the traceback of the errors:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Traceback (most recent call last):

  File "C:\Users\rob\.spyder-py3\dev\untitled0.py", line 51, in <module>
    print(future.result())

  File "C:\Users\rob\devel\IDE\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()

  File "C:\Users\rob\devel\IDE\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception

  File "C:\Users\rob\devel\IDE\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)

  File "C:\Users\rob\.spyder-py3\dev\untitled0.py", line 39, in parser
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(

AttributeError: 'NoneType' object has no attribute 'find_next'

Update: as mentioned above, I am getting an AttributeError which says that NoneType has no attribute find_next. Below is the line that's causing the problem.

target = [item.get_text(strip=True, separator=" ") for item in soup.find("h3", class_="screen-reader-text").find_next("ul").findAll("li")]

Specifically, the issue is with soup.find(), which can return either a Tag (when it finds something), which has a .find_next() method, or None (when it finds nothing), which does not. We can extract that whole call into its own variable, which we can then test:

tag = soup.find("h3", class_="screen-reader-text")
target = []
if tag:
    lis = tag.find_next("ul").findAll("li")
    target = [item.get_text(strip=True, separator=" ") for item in lis[:8]]

BTW, we can use CSS selectors instead to get this running:

target = [item.get_text(strip=True, separator=" ") for item in soup.select("h3.screen-reader-text + ul li")[:8]]

This gets "all li anywhere under ul that's right next to h3 with the screen-reader-text class". If we want li directly under ul (which they would usually be anyway, but that's not always the case for other elements), we could use ul > li instead (the > means "direct child").

Note: the best thing would be to dump all the results into a CSV file, or to print them on screen.
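For example, a minimal sketch of such a CSV dump (assuming each future returns a flat list of strings, as parser() above does; the file name plugins.csv is just a placeholder):

import csv

rows = [future.result() for future in futures1]

with open("plugins.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # rows can differ in length because of the startswith() filter,
    # so no fixed header row is written here
    writer.writerows(rows)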

Looking forward to hearing from you.

zero
  • Why say you have a UTF-8 error while you have `AttributeError: 'NoneType' object has no attribute 'find_next'`, which basically means you're not getting the HTML you *think* you're getting? – baduker Jun 07 '21 at 14:44
  • Right - you're right - after reworking I saw that I have some issues in the code: I am getting an AttributeError which says that NoneType has no attribute find_next. This is the line that's giving the issues: `target = [item.get_text(strip=True, separator=" ") for item in soup.find("h3", class_="screen-reader-text").find_next("ul").findAll("li")` – zero Jun 07 '21 at 16:56

2 Answers


The page is rather well organized, so scraping it should be pretty straightforward. All you need to do is get the plugin card and then simply extract the necessary parts.

Here's my take on it.

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

main_url = "https://wordpress.org/plugins/browse/popular"
headers = [
    "Title", "Rating", "Rating Count", "Excerpt", "URL",
    "Author", "Active installs", "Tested with", "Last Updated",
]


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


def parse_plugin_card(card) -> list:
    title = card.select_one("h3").getText()
    rating = card.select_one(
        ".plugin-rating .wporg-ratings"
    )["data-rating"]
    rating_count = card.select_one(
        ".plugin-rating .rating-count a"
    ).getText().replace(" total ratings", "")
    excerpt = card.select_one(
        ".plugin-card .entry-excerpt p"
    ).getText()
    plugin_author = card.select_one(
        ".plugin-card footer span.plugin-author"
    ).getText(strip=True)
    active_installs = card.select_one(
        ".plugin-card footer span.active-installs"
    ).getText(strip=True)
    tested_with = card.select_one(
        ".plugin-card footer span.tested-with"
    ).getText(strip=True)
    last_updated = card.select_one(
        ".plugin-card footer span.last-updated"
    ).getText(strip=True)
    plugin_url = card.select_one(
        ".plugin-card .entry-title a"
    )["href"]
    return [
        title, rating, rating_count, excerpt, plugin_url,
        plugin_author, active_installs, tested_with, last_updated,
    ]


with requests.Session() as connection:
    pages = (
        BeautifulSoup(
            connection.get(main_url).text,
            "lxml",
        ).select(".pagination .nav-links .page-numbers")
    )[-2].getText(strip=True)

    all_cards = []
    for page in range(1, int(pages) + 1):
        print(f"Scraping page {page} out of {pages}...")
        # deal with the first page
        page_link = f"{main_url}" if page == 1 else f"{main_url}/page/{page}"
        plugin_cards = BeautifulSoup(
            connection.get(page_link).text,
            "lxml",
        ).select(".plugin-card")
        for plugin_card in plugin_cards:
            all_cards.append(parse_plugin_card(plugin_card))
        # pause briefly between page requests
        wait_a_bit()

df = pd.DataFrame(all_cards, columns=headers)
df.to_csv("all_plugins.csv", index=False)

It scrapes all the pages (currently 49 of them) and dumps everything to a .csv file with 980 rows (as of now) that looks like this:

[screenshot of the resulting CSV]

You don't even have to run the code, the entire dump is here.
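If you want to slice the dump afterwards, here's a quick sketch of reading it back in (the column names are the ones defined in headers above; note that "Last Updated" holds whatever text the card footer shows, so sorting by exact date would need extra parsing):

import pandas as pd

# reload the dump and take a quick look at the most relevant columns
df = pd.read_csv("all_plugins.csv")
print(df.shape)
print(df[["Title", "Last Updated", "Active installs"]].head(10))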

baduker

Baduker's solution is great, but I just wanted to add to it.

We could slightly modify the parsing of the plugin card, as there is an API that returns all that data. It would still require a small amount of processing (e.g. pulling out the content for the author, and the rating is stored out of 100, I believe, so a rating of 82 is really 82/100*5 = 4.1 -> "4 stars"), and things like that.

But I thought I would share it.
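To illustrate the rating conversion mentioned above, here is a small helper sketch (my own addition, not part of the API):

def rating_to_stars(rating_out_of_100):
    # the API appears to store ratings on a 0-100 scale,
    # so 82 becomes 82 / 100 * 5 = 4.1, i.e. roughly "4 stars"
    return round(rating_out_of_100 / 100 * 5, 1)


print(rating_to_stars(82))  # prints 4.1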

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

main_url = "https://wordpress.org/plugins/browse/popular"


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


# MODIFICATION MADE HERE
def parse_plugin_card(card):
    plugin_slug = card.select_one('a')['href'].split('/')[-2]
    url = f'https://api.wordpress.org/plugins/info/1.0/{plugin_slug}.json'
    jsonData = requests.get(url).json()
    sections = jsonData.pop('sections')
    for k, v in sections.items():
        sections[k] = BeautifulSoup(v).text
    jsonData.update(sections)
    return jsonData


with requests.Session() as connection:
    pages = (
        BeautifulSoup(
            connection.get(main_url).text,
            "lxml",
        ).select(".pagination .nav-links .page-numbers")
    )[-2].getText(strip=True)

    all_cards = []
    for page in range(1, int(pages) + 1):
        print(f"Scraping page {page} out of {pages}...")
        # deal with the first page
        page_link = f"{main_url}" if page == 1 else f"{main_url}/page/{page}"
        plugin_cards = BeautifulSoup(
            connection.get(page_link).text,
            "lxml",
        ).select(".plugin-card")
        for plugin_card in plugin_cards:
            all_cards.append(parse_plugin_card(plugin_card))
        # pause briefly between page requests
        wait_a_bit()

df = pd.DataFrame(all_cards)
df.to_csv("all_plugins.csv", index=False)

Here's just a sample showing you the columns:

[screenshot of a sample of the columns]
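Coming back to the original goal of surfacing the most recently updated plugins: the info API also returns a last_updated field, so something along these lines should work on the resulting frame (a rough sketch; the exact field names and date format in the JSON are assumptions on my part):

# parse the API's last_updated strings and sort, newest first
df["last_updated_parsed"] = pd.to_datetime(df["last_updated"], errors="coerce", utc=True)
newest = df.sort_values("last_updated_parsed", ascending=False)
print(newest[["name", "last_updated"]].head(20))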

chitown88
  • chitown88 - very cool. This is a great solution! Many thanks for providing this. It is so great to see this great and globally active community. I never stop learning here. You deserve great honour. Have a great day! Greetings, zero – zero Jun 10 '21 at 16:55
  • Dear chitown88, many thanks for your solution - I just saw that you provide extra info like the (very) concrete date of release, which is great. BTW, is an excerpt of the story of the plugin also included in your results? Your solution is just awesome. – zero Jun 12 '21 at 19:50
  • Sure. I'll add that in there tomorrow morning. – chitown88 Jun 13 '21 at 06:38
  • @zero, what do you mean by "story" of the plugin? Do you mean the details/description? – chitown88 Jun 14 '21 at 08:58
  • Hello dear chitown88, exactly - that is what I meant by "story": the description. This would be awesome to have. BTW, baduker called this text "excerpt" - it would be great if we have this part too. – zero Jun 14 '21 at 11:54
  • @zero, OK, give it a shot now. You may need to do a little string manipulation, as the description may contain more text/info than needed. – chitown88 Jun 14 '21 at 12:04
  • Awesome: it runs like a charm and gives back 21 MB. When I try to open it in Calc, I get a message that this causes some trouble since the maximum number of characters per cell is exceeded. But this I may solve with some Calc friends. I am happy about this great scraper - it is very, very useful. Many thanks. – zero Jun 14 '21 at 20:12
  • By the way: the scraper runs very, very well, but on the very first line it gives back: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 28 of the file C:\Users\Kasper\.spyder-py3\dev\wp_pluginsdata__chitown88.py. To get rid of this warning, pass the additional argument features="lxml" to the BeautifulSoup constructor. – zero Jun 14 '21 at 20:13
  • Ah ya. Just add it in there: BeautifulSoup(v, 'lxml').text – chitown88 Jun 15 '21 at 06:10
  • Hello dear @chitown88, again many, many thanks. One question regarding the API: it is a great thing that such an API exists. One little, simple question: is there a way - a method - to publish the dataset on the net? Can we use the API for this idea of publication, e.g. if we want to publish the first (newest) results of the plugins? – zero Jul 21 '21 at 20:24