As far as I know, Wikipedia has only one API: https://www.mediawiki.org/wiki/API:Main_page
So the question is whether to use the API or not.
Or rather, the question is whether to use a dedicated module for the Wikipedia API (like WikipediaAPI), use requests directly with the API, or use requests + BeautifulSoup without the API.
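For the "directly with the API" option, here is a minimal sketch, assuming the standard MediaWiki action=parse endpoint. It uses only the standard library (urllib) so it runs without extra installs; requests.get(API_URL, params=...) should work the same way. The helper names build_parse_url and fetch_page_html and the User-Agent string are my own placeholders, not part of any library.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page):
    """Build an action=parse URL that asks for the rendered HTML of a page."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # with version 2 the HTML is a plain string
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_page_html(page):
    """Download the rendered HTML of a page through the API (needs network)."""
    request = urllib.request.Request(
        build_parse_url(page),
        # placeholder User-Agent; replace it with something that identifies you
        headers={"User-Agent": "cultivar-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]


# Usage (performs a real HTTP request):
# html = fetch_page_html("Lists_of_cultivars")
# print(html[:200])
```

The API still returns HTML for the table, so you need an HTML parser either way; the API mainly gives you a stable, documented way to request the page.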
Because the data is in a standard <table>, the simplest approach may be to use pandas and read_html(url).
It finds every <table> on the page, converts each one to a DataFrame, and returns a list of all the DataFrames.
But this gives only the text, without the links.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Lists_of_cultivars'
all_tables = pd.read_html(url)
#for df in all_tables:
# print(df)
df = all_tables[0]
print(df.to_string())
Result:
Common name Taxon Woody / Herbaceous List of cultivars
0 Apple Malus domestica Woody Apple cultivars
1 Banana / Plaintain Musa Herbaceous Banana cultivars
2 Cannabis Cannabis Herbaceous Cannabis strains
3 Cherimoya Annona cherimola Woody Cherimoya cultivars
4 Citrus Citrus Woody Citrus hybrids and cultivars
5 Coffee Coffea Woody Coffee varieties
6 Sweet corn Zea mays convar. saccharata var. rugosa Herbaceous Sweetcorn varieties
7 Basil Ocimum Herbaceous Basil cultivars
8 Bottlebrush Callistemon Woody Callistemon cultivars
9 Canna lily Canna Herbaceous Canna cultivars
10 Chili pepper Capsicum Herbaceous Capsicum cultivars
11 Cucumber Cucumis sativus Herbaceous Cucumber varieties
12 Elm Ulmus Woody Elm cultivars, hybrids and hybrid cultivars
13 Gazania Gazania Herbaceous Gazania cultivars
14 Grape Vitis Woody Grape varieties
15 Grevillea Grevillea Woody Grevillea cultivars
16 Hops Humulus lupulus Herbaceous Hop varieties
17 Maize Zea mays subsp. mays Herbaceous Italian traditional maize varieties
18 Mango Mangifera Woody Mango cultivars
19 Nemesia Nemesia Herbaceous Nemesia cultivars
20 Olive Olea europaea Woody Olive cultivars
21 Onion Allium cepa Herbaceous Onion cultivars
22 Pear Pyrus Woody Pear cultivars
23 Tropical pitcher plant Nepenthes Herbaceous Nepenthes cultivars
24 Pumpkin Cucurbita pepo Herbaceous Pumpkin varieties grown in the United States
25 Asian rice Oryza sativa Herbaceous Rice varieties
26 Rose Rosa Woody All-America Rose SelectionsAward of Garden Merit rosesRose cultivars named after people
27 Strawberry Fragaria ananassa Herbaceous Strawberry cultivars
28 Sweet potato Ipomoea batatas Herbaceous Sweet potato cultivars
29 Tomato Solanum lycopersicum Herbaceous Tomato cultivars
30 Venus flytrap Dionaea muscipula Herbaceous Venus flytrap cultivars
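If you want to stay in pandas, newer versions (1.5 and later, as far as I know) accept extract_links in read_html, which turns each cell into a (text, href) tuple. The sketch below runs on a small hand-written HTML sample instead of the live page, so sample_html and the "Cultivar link" column name are only illustrative. It assumes an HTML parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the Wikipedia table (hypothetical sample, not the real page).
sample_html = """
<table>
  <tr><th>Common name</th><th>List of cultivars</th></tr>
  <tr>
    <td><a href="/wiki/Apple">Apple</a></td>
    <td><a href="/wiki/List_of_apple_cultivars">Apple cultivars</a></td>
  </tr>
</table>
"""

# extract_links="body" makes every body cell a (text, href) tuple;
# href is None when the cell has no link.
tables = pd.read_html(StringIO(sample_html), extract_links="body")
df = tables[0]

# Split one column into separate text and link columns.
df["Cultivar link"] = df["List of cultivars"].map(lambda cell: cell[1])
df["List of cultivars"] = df["List of cultivars"].map(lambda cell: cell[0])

print(df.to_string())
```

Like the BeautifulSoup loop, this keeps only the first link in a cell, so a cell with several links (such as the Rose row) would still need a manual parser.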
To get the text together with the links, you can use requests (or urllib) with BeautifulSoup (or lxml).
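import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Lists_of_cultivars'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# the first <table> on the page is the list of cultivars
table = soup.find('table')

for row in table.find_all('tr'):
    print('=== row ===')
    # the header row has only <th> cells, so nothing is printed for it
    for cell in row.find_all('td'):
        a = cell.find('a')  # only the first link in the cell
        if a:  # skip cells without any link
            print(a.text)
            print(a['href'])
        print('---')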
Result:
=== row ===
=== row ===
Apple
/wiki/Apple
---
Malus domestica
/wiki/Malus_domestica
---
Woody
/wiki/Woody_plant
---
Apple cultivars
/wiki/List_of_apple_cultivars
---
=== row ===
Banana
/wiki/Banana
---
Musa
/wiki/Musa_(genus)
---
Herbaceous
/wiki/Herbaceous_plant
---
Banana cultivars
/wiki/List_of_banana_cultivars
---
=== row ===
Cannabis
/wiki/Cannabis
---
Cannabis
/wiki/Cannabis
---
Herbaceous
/wiki/Herbaceous_plant
---
Cannabis strains
/wiki/Cannabis_strains
---
=== row ===
Cherimoya
/wiki/Cherimoya
---
Annona cherimola
/wiki/Annona_cherimola
---
Woody
/wiki/Woody_plant
---
Cherimoya cultivars
/wiki/List_of_cherimoya_cultivars
---
=== row ===
Citrus
/wiki/Citrus
---
Citrus
/wiki/Citrus
---
Woody
/wiki/Woody_plant
---
Citrus hybrids and cultivars
/wiki/List_of_citrus_hybrids_and_cultivars
---
=== row ===
Coffee
/wiki/Coffee_plant
---
Coffea
/wiki/Coffea
---
Woody
/wiki/Woody_plant
---
Coffee varieties
/wiki/List_of_coffee_varieties
---
=== row ===
Sweet corn
/wiki/Sweet_corn
---
Zea mays convar. saccharata var. rugosa
/wiki/Sweet_corn
---
Herbaceous
/wiki/Herbaceous_plant
---
Sweetcorn varieties
/wiki/List_of_sweetcorn_varieties
---
=== row ===
Basil
/wiki/Basil
---
Ocimum
/wiki/Ocimum
---
Herbaceous
/wiki/Herbaceous_plant
---
Basil cultivars
/wiki/List_of_basil_cultivars
---
=== row ===
Bottlebrush
/wiki/Callistemon
---
Callistemon
/wiki/Callistemon
---
Woody
/wiki/Woody_plant
---
Callistemon cultivars
/wiki/List_of_Callistemon_cultivars
---
=== row ===
Canna lily
/wiki/Canna_lily
---
Canna
/wiki/Canna_(plant)
---
Herbaceous
/wiki/Herbaceous_plant
---
Canna cultivars
/wiki/List_of_Canna_cultivars
---
=== row ===
Chili pepper
/wiki/Chili_pepper
---
Capsicum
/wiki/Capsicum
---
Herbaceous
/wiki/Herbaceous_plant
---
Capsicum cultivars
/wiki/List_of_Capsicum_cultivars
---
=== row ===
Cucumber
/wiki/Cucumber
---
Cucumis sativus
/wiki/Cucumis_sativus
---
Herbaceous
/wiki/Herbaceous_plant
---
Cucumber varieties
/wiki/List_of_cucumber_varieties
---
=== row ===
Elm
/wiki/Elm
---
Ulmus
/wiki/Ulmus
---
Woody
/wiki/Woody_plant
---
Elm cultivars, hybrids and hybrid cultivars
/wiki/List_of_Elm_cultivars,_hybrids_and_hybrid_cultivars
---
You may then need to put the results in a list, a dictionary, or a DataFrame.
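As a rough sketch of that last step: the (text, href) pairs printed above can be collected per row into dictionaries, and the relative /wiki/... links can be turned into full URLs with urljoin. The names COLUMNS, row_to_record and apple_cells are my own, and the sample row is copied from the output above rather than scraped live.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Lists_of_cultivars"
COLUMNS = ["Common name", "Taxon", "Woody / Herbaceous", "List of cultivars"]


def row_to_record(cells, base_url=BASE_URL):
    """Turn one row of (text, href) pairs into a flat dictionary."""
    record = {}
    for column, (text, href) in zip(COLUMNS, cells):
        record[column] = text
        # hrefs on the page are relative, so make them absolute
        record[column + " URL"] = urljoin(base_url, href) if href else None
    return record


# One row as printed by the BeautifulSoup loop.
apple_cells = [
    ("Apple", "/wiki/Apple"),
    ("Malus domestica", "/wiki/Malus_domestica"),
    ("Woody", "/wiki/Woody_plant"),
    ("Apple cultivars", "/wiki/List_of_apple_cultivars"),
]

records = [row_to_record(apple_cells)]
print(records[0]["List of cultivars URL"])
```

A list of such dictionaries can be passed straight to pd.DataFrame(records) if you want a DataFrame in the end.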