I want to convert a table on a website into a pandas DataFrame, but BeautifulSoup doesn't recognize the table. Below is the code I tried, with no luck.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', rules = 'all')
#tables =soup.find_all("table",{"style":"color:#333399;"}) #instead of above line to specify table with no luck!
df = pd.read_html(table, skiprows=2, flavor='bs4')
df.head()

I also tried the code below, with no luck:

df = pd.read_html('https://www.ndbc.noaa.gov/ship_obs.php')
print(df)
VirtualScooter
user2031063

2 Answers

Your data is not in a <table> tag; it sits in multiple <span> tags inside a <pre> block.
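A quick way to confirm this kind of diagnosis is to count the tags the parser actually sees. The sketch below runs on a small made-up HTML snippet (not the real NOAA page; the markup and values are invented) that mimics the layout described above: rows as <span> elements inside a <pre> block, with no <table> at all.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the markup and values are invented
html = """
<html><body>
<pre>
<span>SHIP  HOUR  LAT   LON</span>
<span>SHIP  19    46.5  -72.3</span>
<span>SHIP  19    46.8  -71.2</span>
</pre>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# No <table> elements exist, so table-based lookups come back empty
tables = soup.find_all("table")
# The rows are <span> elements nested in the <pre> block
spans = soup.find("pre").find_all("span")

print(len(tables), len(spans))
```

If the table count is zero on the real page as well, both soup.find_all('table', ...) and pd.read_html have nothing to work with, which would match the behaviour in the question.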

You can parse these <span> tags into a DataFrame like so:

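import pandas as pd
import requests
import bs4

url = "https://www.ndbc.noaa.gov/ship_obs.php"
# The observations sit in <span> elements inside the <pre> block
spans = bs4.BeautifulSoup(requests.get(url).text, 'html.parser').find('pre').find_all("span")
# Split each span's text on whitespace to get one row per span
print(pd.DataFrame([s.getText().split() for s in spans]))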

Output:

      0     1     2      3     4     5   ...    40    41    42    43    44    45
0    SHIP  HOUR   LAT    LON  WDIR  WSPD  ...    °T    ft   sec    °T   Acc   Ice
1    SHIP    19  46.5  -72.3   260   5.1  ...  None  None  None  None  None  None
2    SHIP    19  46.8  -71.2   110   2.9  ...  None  None  None  None  None  None
3    SHIP    19  47.4  -61.8    40  18.1  ...  None  None  None  None  None  None
4    SHIP    19  47.7  -53.2    40   8.0  ...  None  None  None  None  None  None
..    ...   ...   ...    ...   ...   ...  ...   ...   ...   ...   ...   ...   ...
170  SHIP    19  17.6  -62.4   100  20.0  ...  None  None  None  None  None  None
171  SHIP    19  25.8  -78.0    40  24.1  ...  None  None  None  None  None  None
172  SHIP    19   1.5  104.8    20  22.0  ...  None  None  None  None  None  None
173  SHIP    19  57.9    1.2   180     -  ...  None  None  None  None  None  None
174  SHIP    19  35.1  -10.0   310  24.1  ...  None  None  None  None  None  None

[175 rows x 46 columns]
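In the output above, row 0 is much wider than the data rows, which suggests the first <span> holds both column names and unit labels. One possible cleanup, sketched on invented token lists rather than the live page, is to keep only as many header tokens as there are data columns and then convert the values to numbers. The assumption that the names come first in the header is mine and should be checked against the real page; the unit tokens in the sample are made up.

```python
import pandas as pd

# Hypothetical token lists, shaped like r.getText().split() for each <span>.
# Assumed layout: the first list holds column names followed by unit labels.
rows = [
    ["SHIP", "HOUR", "LAT", "LON", "WDIR", "WSPD", "°T", "kts"],
    ["SHIP", "19", "46.5", "-72.3", "260", "5.1"],
    ["SHIP", "19", "57.9", "1.2", "180", "-"],
]

header, data = rows[0], rows[1:]
width = max(len(r) for r in data)  # number of real data columns

# Assumption: the first `width` header tokens are the column names
df = pd.DataFrame(data, columns=header[:width])

# Convert everything except the SHIP label to numbers;
# placeholders such as "-" cannot be parsed and become NaN
numeric_cols = [c for c in df.columns if c != "SHIP"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df)
```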
baduker

A slightly different approach; take a look at the column counts, too. I skipped the lines at the top, so you'll have to build the column headers yourself and clean up the last row.

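import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
tablecontent = soup.find('pre')  # the observations are text inside <pre>
data = BeautifulSoup(tablecontent.text, "html.parser")
s = io.StringIO(data.text)
# Whitespace-delimited text; skip the first three lines at the top
df = pd.read_csv(s, sep=r'\s+', engine='python', skiprows=3, header=None)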

Output:

        0     1             2         3    4     5    6     7     8      9  ...    14    15    16    17    18    19    20    21    22     23
0    SHIP    19          47.4     -61.8   40  18.1    -     -     -  29.82  ...     -     -     -     -     -     -     -     -  ----  -----
1    SHIP    19          47.7     -53.2   40   8.0    -     -     -  29.76  ...     -     -     -     -     -     -     -     -  ----  -----
2    SHIP    19          47.8     -54.1   50  13.0    -     -     -  29.75  ...     -     -     -     -     -     -     -     -  ----  -----
3    SHIP    19          48.2     -53.4   50  13.0    -     -     -  29.78  ...     -     -     -     -     -     -     -     -  ----  -----
4    SHIP    19          46.8     -71.2  110   2.9    -     -     -  30.03  ...     -     -     -     -     -     -     -     -  ----  -----
...   ...   ...           ...       ...  ...   ...  ...   ...   ...    ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...    ...
178  SHIP    19          25.8     -78.0   40  24.1    -   4.9   4.0  30.08  ...    11     5     -     -     -     -     -     -  ----  -----
179  SHIP    19           1.5     104.8   20  22.0    -     -     -  29.87  ...    11     5     -     -     -     -     -     -  ----  -----
180  SHIP    19          57.9       1.2  180     -    -     -     -  29.35  ...     5     -     -     -     -     -     -     -  ----  -----
181  SHIP    19          35.1     -10.0  310  24.1    -   6.6   6.0  29.68  ...     5     8  14.8  10.0   310     -     -     -  ----  -----
182   182  ship  observations  reported  for  1900  GMT  None  None   None  ...  None  None  None  None  None  None  None  None  None   None
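The two leftover chores mentioned in this answer (building the column headers and cleaning up the last row) could be folded into the read_csv call itself. The sketch below is one way to do that under stated assumptions: the sample text is invented, not copied from the page, and the column names are my guesses that would need to be replaced with the real ones. It relies on the names, skipfooter and na_values parameters of pandas.read_csv; skipfooter requires the python engine.

```python
import io
import pandas as pd

# Invented sample shaped like the <pre> text: three lines to skip,
# whitespace-separated rows, and a trailing summary line
text = """\
header line 1
header line 2
header line 3
SHIP 19 47.4 -61.8 40 18.1 - 29.82
SHIP 19 57.9 1.2 180 - - 29.35
2 ship observations reported for 1900 GMT
"""

# Hypothetical column names; replace them with the real headers
names = ["SHIP", "HOUR", "LAT", "LON", "WDIR", "WSPD", "GST", "PRES"]

df = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    engine="python",   # needed for skipfooter
    skiprows=3,        # drop the lines at the top
    skipfooter=1,      # drop the trailing summary line
    header=None,
    names=names,
    na_values="-",     # a lone "-" marks a missing value
)

print(df)
```

Because na_values matches whole fields only, negative numbers such as -61.8 are left alone while a lone "-" becomes NaN.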
Jonathan Leon