
I want to extract the names of all the items on this Wikipedia page and then process them to get their scientific names, which I would store in a Python list: https://en.wikipedia.org/wiki/Lists_of_cultivars

Which API would be the best and most efficient one for this task?

  • "*Which API would be the best and most efficient one for this task?*" Questions seeking the opinions of the community are explicitly off-topic here per the scope of the site as defined in the [help/on-topic]. See also: [ask] – esqew Nov 18 '21 at 13:56
  • The *requests* module is probably the most popular starting point followed by *bs4* to isolate the data you're interested in –  Nov 18 '21 at 13:59
  • You already added plenty of tags - why not research in that direction yourself? There are tons of posts about web scraping with python on SO, there are bound to be examples how to use the wikipedia-api in its documentation etc. Research is step 0 before asking. – Patrick Artner Nov 18 '21 at 14:02
  • As far as I know, Wikipedia has only one API. So the question is whether to use the API or not. Or rather: use a dedicated module for the Wikipedia API, use requests directly with the API, or use requests + BeautifulSoup without the API. There should be tutorials about modules for the Wikipedia API, and on Stack Overflow there should be many questions about scraping Wikipedia; you can see some links on this page under `Related` in the right column. – furas Nov 18 '21 at 17:27

1 Answer


As far as I know, Wikipedia has only one API: https://www.mediawiki.org/wiki/API:Main_page

So the question is whether to use the API or not.

Or rather, the question is whether to use a dedicated module for the Wikipedia API (like WikipediaAPI), use requests directly with the API, or use requests + BeautifulSoup without the API.
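
As a rough illustration of the "directly with the API" option, here is a minimal sketch that asks the MediaWiki Action API for the rendered HTML of the page. It uses only the standard library (urllib) so it runs without extra installs. The helper names `build_parse_url` and `fetch_page_html` and the User-Agent string are placeholders made up for this example, not part of any library.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page_title):
    """Build an action=parse URL that returns the page's rendered HTML as JSON."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # with version 2, parse.text is a plain string
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_page_html(page_title):
    """Call the API and return the page HTML (needs network access)."""
    request = urllib.request.Request(
        build_parse_url(page_title),
        # Placeholder User-Agent; replace it with your own contact details.
        headers={"User-Agent": "cultivar-list-example/0.1 (you@example.com)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]


# Example call (not executed here because it needs network access):
# html = fetch_page_html("Lists_of_cultivars")
print(build_parse_url("Lists_of_cultivars"))
```

The HTML returned by `fetch_page_html` can be parsed like any other HTML document, so the choice between API and plain scraping mostly affects how the page is downloaded.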

Because the data is in a standard <table>, the simplest option may be to use pandas and read_html(url).

It finds every <table> on the page, converts each one to a DataFrame, and returns a list of all the DataFrames.

But this gives only the text, without the links.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Lists_of_cultivars'

all_tables = pd.read_html(url)

#for df in all_tables:
#    print(df)

df = all_tables[0]
print(df.to_string())

Result:

               Common name                                    Taxon Woody / Herbaceous                                                                        List of cultivars
0                    Apple                          Malus domestica              Woody                                                                          Apple cultivars
1       Banana / Plaintain                                     Musa         Herbaceous                                                                         Banana cultivars
2                 Cannabis                                 Cannabis         Herbaceous                                                                         Cannabis strains
3                Cherimoya                         Annona cherimola              Woody                                                                      Cherimoya cultivars
4                   Citrus                                   Citrus              Woody                                                             Citrus hybrids and cultivars
5                   Coffee                                   Coffea              Woody                                                                         Coffee varieties
6               Sweet corn  Zea mays convar. saccharata var. rugosa         Herbaceous                                                                      Sweetcorn varieties
7                    Basil                                   Ocimum         Herbaceous                                                                          Basil cultivars
8              Bottlebrush                              Callistemon              Woody                                                                    Callistemon cultivars
9               Canna lily                                    Canna         Herbaceous                                                                          Canna cultivars
10            Chili pepper                                 Capsicum         Herbaceous                                                                       Capsicum cultivars
11                Cucumber                          Cucumis sativus         Herbaceous                                                                       Cucumber varieties
12                     Elm                                    Ulmus              Woody                                              Elm cultivars, hybrids and hybrid cultivars
13                 Gazania                                  Gazania         Herbaceous                                                                        Gazania cultivars
14                   Grape                                    Vitis              Woody                                                                          Grape varieties
15               Grevillea                                Grevillea              Woody                                                                      Grevillea cultivars
16                    Hops                          Humulus lupulus         Herbaceous                                                                            Hop varieties
17                   Maize                     Zea mays subsp. mays         Herbaceous                                                      Italian traditional maize varieties
18                   Mango                                Mangifera              Woody                                                                          Mango cultivars
19                 Nemesia                                  Nemesia         Herbaceous                                                                        Nemesia cultivars
20                   Olive                            Olea europaea              Woody                                                                          Olive cultivars
21                   Onion                              Allium cepa         Herbaceous                                                                          Onion cultivars
22                    Pear                                    Pyrus              Woody                                                                           Pear cultivars
23  Tropical pitcher plant                                Nepenthes         Herbaceous                                                                      Nepenthes cultivars
24                 Pumpkin                           Cucurbita pepo         Herbaceous                                             Pumpkin varieties grown in the United States
25              Asian rice                             Oryza sativa         Herbaceous                                                                           Rice varieties
26                    Rose                                     Rosa              Woody  All-America Rose SelectionsAward of Garden Merit rosesRose cultivars named after people
27              Strawberry                        Fragaria ananassa         Herbaceous                                                                     Strawberry cultivars
28            Sweet potato                          Ipomoea batatas         Herbaceous                                                                   Sweet potato cultivars
29                  Tomato                     Solanum lycopersicum         Herbaceous                                                                         Tomato cultivars
30           Venus flytrap                        Dionaea muscipula         Herbaceous                                                                  Venus flytrap cultivars
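
Since the goal in the question is a Python list of scientific names, the `Taxon` column of this table can be converted directly. A minimal sketch, assuming the column is still named `Taxon` as in the output above. The small hand-built DataFrame only stands in for `all_tables[0]` so that the example runs offline, and `taxa_from_frame` is a made-up helper name.

```python
import pandas as pd


def taxa_from_frame(df, column="Taxon"):
    """Return the unique, non-empty values of one column as a plain list."""
    return df[column].dropna().drop_duplicates().tolist()


# Stand-in for all_tables[0]; rows copied from the output above.
sample = pd.DataFrame(
    {
        "Common name": ["Apple", "Banana / Plaintain", "Cannabis"],
        "Taxon": ["Malus domestica", "Musa", "Cannabis"],
    }
)

scientific_names = taxa_from_frame(sample)
print(scientific_names)
```

With the real table, the call would be `taxa_from_frame(all_tables[0])`.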

To get the text together with the links, you can use requests (or urllib) with BeautifulSoup (or lxml).

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Lists_of_cultivars'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

table = soup.find('table')

for row in table.find_all('tr'):
    print('=== row ===')
    for cell in row.find_all('td'):
        a = cell.find('a')  # first link in the cell, or None
        if a is None:
            continue  # skip cells that contain no link
        print(a.text)
        print(a['href'])
        print('---')

Result (first rows only):

=== row ===
=== row ===
Apple
/wiki/Apple
---
Malus domestica
/wiki/Malus_domestica
---
Woody
/wiki/Woody_plant
---
Apple cultivars
/wiki/List_of_apple_cultivars
---
=== row ===
Banana
/wiki/Banana
---
Musa
/wiki/Musa_(genus)
---
Herbaceous
/wiki/Herbaceous_plant
---
Banana cultivars
/wiki/List_of_banana_cultivars
---
=== row ===
Cannabis
/wiki/Cannabis
---
Cannabis
/wiki/Cannabis
---
Herbaceous
/wiki/Herbaceous_plant
---
Cannabis strains
/wiki/Cannabis_strains
---
=== row ===
Cherimoya
/wiki/Cherimoya
---
Annona cherimola
/wiki/Annona_cherimola
---
Woody
/wiki/Woody_plant
---
Cherimoya cultivars
/wiki/List_of_cherimoya_cultivars
---
=== row ===
Citrus
/wiki/Citrus
---
Citrus
/wiki/Citrus
---
Woody
/wiki/Woody_plant
---
Citrus hybrids and cultivars
/wiki/List_of_citrus_hybrids_and_cultivars
---
=== row ===
Coffee
/wiki/Coffee_plant
---
Coffea
/wiki/Coffea
---
Woody
/wiki/Woody_plant
---
Coffee varieties
/wiki/List_of_coffee_varieties
---
=== row ===
Sweet corn
/wiki/Sweet_corn
---
Zea mays convar. saccharata var. rugosa
/wiki/Sweet_corn
---
Herbaceous
/wiki/Herbaceous_plant
---
Sweetcorn varieties
/wiki/List_of_sweetcorn_varieties
---
=== row ===
Basil
/wiki/Basil
---
Ocimum
/wiki/Ocimum
---
Herbaceous
/wiki/Herbaceous_plant
---
Basil cultivars
/wiki/List_of_basil_cultivars
---
=== row ===
Bottlebrush
/wiki/Callistemon
---
Callistemon
/wiki/Callistemon
---
Woody
/wiki/Woody_plant
---
Callistemon cultivars
/wiki/List_of_Callistemon_cultivars
---
=== row ===
Canna lily
/wiki/Canna_lily
---
Canna
/wiki/Canna_(plant)
---
Herbaceous
/wiki/Herbaceous_plant
---
Canna cultivars
/wiki/List_of_Canna_cultivars
---
=== row ===
Chili pepper
/wiki/Chili_pepper
---
Capsicum
/wiki/Capsicum
---
Herbaceous
/wiki/Herbaceous_plant
---
Capsicum cultivars
/wiki/List_of_Capsicum_cultivars
---
=== row ===
Cucumber
/wiki/Cucumber
---
Cucumis sativus
/wiki/Cucumis_sativus
---
Herbaceous
/wiki/Herbaceous_plant
---
Cucumber varieties
/wiki/List_of_cucumber_varieties
---
=== row ===
Elm
/wiki/Elm
---
Ulmus
/wiki/Ulmus
---
Woody
/wiki/Woody_plant
---
Elm cultivars, hybrids and hybrid cultivars
/wiki/List_of_Elm_cultivars,_hybrids_and_hybrid_cultivars
---
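
The href values printed above are relative to the site root. If absolute URLs are needed (for example, to request each cultivar list later), the standard library's `urllib.parse.urljoin` can resolve them against the page URL. A small sketch; `absolute_link` is a made-up helper name.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Lists_of_cultivars'


def absolute_link(href, base=PAGE_URL):
    """Resolve a relative href such as '/wiki/Apple' against the page URL."""
    return urljoin(base, href)


print(absolute_link('/wiki/List_of_apple_cultivars'))
# https://en.wikipedia.org/wiki/List_of_apple_cultivars
```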

You may then want to collect the results in a list, dictionary, or DataFrame.
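
One possible way to collect the cells is a list of dictionaries, from which the scientific names can be pulled out as a plain list. This is only a sketch: it assumes the first two columns are the common name and the taxon, as in the table on the page, and it parses a tiny inline HTML sample instead of the live page so that it runs offline. `rows_to_records` and the dictionary keys are made-up names.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page, with the same column order.
SAMPLE_HTML = """
<table>
  <tr><th>Common name</th><th>Taxon</th></tr>
  <tr><td><a href="/wiki/Apple">Apple</a></td>
      <td><a href="/wiki/Malus_domestica">Malus domestica</a></td></tr>
  <tr><td><a href="/wiki/Banana">Banana</a></td>
      <td><a href="/wiki/Musa_(genus)">Musa</a></td></tr>
</table>
"""


def rows_to_records(table):
    """Turn each data row into a dict with the text and first link of two cells."""
    records = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) < 2:
            continue  # header rows contain only <th> cells
        name_link = cells[0].find('a')
        taxon_link = cells[1].find('a')
        records.append({
            'common_name': cells[0].get_text(' ', strip=True),
            'taxon': cells[1].get_text(' ', strip=True),
            'common_href': name_link['href'] if name_link else None,
            'taxon_href': taxon_link['href'] if taxon_link else None,
        })
    return records


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = rows_to_records(soup.find('table'))
scientific_names = [record['taxon'] for record in records]
print(scientific_names)
```

For the live page, `soup` would come from `BeautifulSoup(r.text, 'html.parser')` as in the code above, and `pd.DataFrame(records)` would turn the same records into a DataFrame.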

furas
  • Thank you @furas, it was very helpful. I am thinking of using the _beautifulsoup (bs4)_, _html5lib_ and _requests_ libraries. Thank you again. – SandyShiva Nov 21 '21 at 11:42