
When I import a table from a webpage into Python, the column Population.1 shows as NaN, even though it is not empty on the original webpage.


import pandas as pd
import requests


pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)

r = requests.get(pop_url)

wiki_tables = pd.read_html(r.text, header=0)

len(wiki_tables)

cont_pop = wiki_tables[1]

cont_pop.head()
Laurent

1 Answer

Here is one way to do it with Beautiful Soup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)

# Import table and remove first row (duplicated header)
cont_pop = pd.read_html(r.text, header=0)[1].drop(index=0)
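As a small illustration of what drop(index=0) does here, the sketch below uses a hypothetical two-column frame (not the real Wikipedia data) whose first row repeats the header text. It shows that the remaining rows keep their original index labels, which is why the printed result further down starts at index 1.

```python
import pandas as pd

# Hypothetical miniature of the parsed table: row 0 repeats the header text.
df = pd.DataFrame(
    {
        "Country": ["Country", "China", "India"],
        "Population": ["Population", "1411750000", "1392329000"],
    }
)

# Drop the row whose index label is 0; the other labels are left unchanged.
cleaned = df.drop(index=0)
print(cleaned.index.tolist())
```

If a zero-based index is preferred, calling reset_index(drop=True) afterwards renumbers the rows.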

Then:

# Find and add missing values
raw = BeautifulSoup(r.content, "html.parser").find_all("table")
rows = raw[1].text.split("\n")[14:]
rows = [rows[i : i + 11][1:] for i in range(0, len(rows), 11)][:-1]
cont_pop["Population.1"] = [value for row in rows for value in row if "%" in value]

Finally:

print(cont_pop)
# Output

    Rank                 Country / Dependency  Population Population.1   
1      –                                World  8035105000         100%  \
2      1                                China  1411750000        17.6%   
3      2                                India  1392329000        17.3%   
4      3                        United States   334869000        4.17%   
5      4                            Indonesia   277749853        3.46%   
..   ...                                  ...         ...          ...   
238    –                Tokelau (New Zealand)        1647           0%   
239    –                                 Niue        1549           0%   
240  195                         Vatican City         825           0%   
241    –  Cocos (Keeling) Islands (Australia)         593           0%   
242    –    Pitcairn Islands (United Kingdom)          47           0%   

            Date Source (official or from the United Nations) Notes  
1    10 Jun 2023                             UN projection[3]   NaN  
2    31 Dec 2022                         Official estimate[4]   [b]  
3     1 Mar 2023                       Official projection[5]   [c]  
4    10 Jun 2023                 National population clock[7]   [d]  
5    31 Dec 2022                         Official estimate[8]   NaN  
..           ...                                          ...   ...  
238   1 Jan 2019                            2019 Census [211]   NaN  
239   1 Jul 2021               National annual projection[96]   NaN  
240   1 Feb 2019               Monthly national estimate[212]  [af]  
241  30 Jun 2020                             2021 Census[213]   NaN  
242   1 Jul 2021                       Official estimate[214]   NaN  

[242 rows x 7 columns]
Laurent