
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code that I have written follows below. It works and gives me my intended results.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

But the result contains some unwanted data, and I want only the data in the table. Can someone please help me with this?

Here I have added an image of the output, with the unwanted data circled in red.

Daniel Walker
  • You can always slice the DataFrame to get rid of the unwanted data. Alternatively, use the BeautifulSoup library to parse the HTML before using pandas. – Bernad Peter Jun 14 '20 at 05:13
  • `read_html` returns a list of DataFrames, one for each table in the HTML source; use a list index to access the required DataFrame (see the sketch after these comments): https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe – sushanth Jun 14 '20 at 05:46
  • You were correct to use `pd.read_html`. Just select the correct index where the data is: [3]. See my answer below. – Prayson W. Daniel Jun 14 '20 at 07:45
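
As the comments note, `pd.read_html` returns one DataFrame per `<table>` element on the page. A minimal sketch for spotting the right one by inspecting each table's shape (the actual output depends on the live page):

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
tables = pd.read_html(url)  # one DataFrame per <table> element

# Print each table's index and shape to spot the one holding the price history
for i, t in enumerate(tables):
    print(i, t.shape)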

3 Answers

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)  # this will give you a list of DataFrames, one per table in the HTML

print(df[3])
Bernad Peter
  • Thanks mate, it works fine. One small question: what does df[3] do? – Thejitha Anjana Jun 14 '20 at 06:17
  • Using `urllib.request` actually just performed the process twice, as `.read_html` already does that :) so there is no need for that step – Prayson W. Daniel Jun 14 '20 at 07:50
  • Explanation of why I downvoted: I rarely downvote answers, and I dislike downvotes without an explanation of where the answer could improve, so here is mine. You added extra, unused code: `from urllib.request import urlopen, Request url = "http://goldpricez.com/gold/history/lkr/years-3" req = Request(url=url) html = urlopen(req).read()`. None of it is used; df[3] would work if all of that were deleted. ;) That is why. Hope you understand :) – Prayson W. Daniel Jun 14 '20 at 08:27
  • @ThejithaAnjana df[3] selects the fourth DataFrame from the list of DataFrames. – Bernad Peter Jun 20 '20 at 05:30
  • As of now, it's the `df[1]` element; this is what worked for me (a more robust approach is sketched below). – Ibrahim.H Sep 25 '22 at 23:09
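
As the last comment shows, the table's position in the list can drift as the page changes. A more robust sketch uses `read_html`'s `match` parameter to select the table by its text rather than by index; the string "Date" is an assumption about the target table's header:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
# match= keeps only tables whose text matches the given string or regex,
# so the code no longer depends on the table's position in the list.
# "Date" is an assumed header cell in the price table.
df = pd.read_html(url, match="Date")[0]
print(df)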

Use BeautifulSoup for this; the code below works perfectly.

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
data = s.find_all("td")  # every table cell on the page
data = data[11:]  # skip the cells that come before the price table
for i in range(0, len(data), 2):  # cells come in (date, price) pairs
    print(data[i].text.strip(), "      ", data[i+1].text.strip())

Another advantage of using BeautifulSoup this way is that it is much faster than your code.
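
The speed claim is easy to check yourself. A rough timing sketch (network latency dominates, so expect noisy numbers):

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"

t0 = time.perf_counter()
pd.read_html(url)  # fetch + parse every table into DataFrames
t1 = time.perf_counter()

r = requests.get(url)  # fetch once, parse the cells only
BeautifulSoup(r.text, "html.parser").find_all("td")
t2 = time.perf_counter()

print("pandas.read_html:", round(t1 - t0, 2), "s")
print("requests + bs4:  ", round(t2 - t1, 2), "s")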

yashetty29
  • `.read_html` uses bs4 under the hood ;) `flavor : str or None, container of strings The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.` – Prayson W. Daniel Jun 14 '20 at 07:55

The way you used `.read_html`, it returns a list of all the tables on the page. Your table is at index 3:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]  # select the fourth table, which holds the price history

print(df)

.read_html makes a call to the URL and uses BeautifulSoup to parse the response under the hood. You can change the parser, match tables by their text, and pass a header row as you would in .read_csv. Check the .read_html documentation for more details.
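
A minimal sketch of those options; the match string and header row here are assumptions about this particular page:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(
    url,
    flavor="bs4",  # choose the parsing engine explicitly
    match="Date",  # assumed text that appears in the target table
    header=0,      # treat the first row as column headers
)[0]
print(df.head())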

For speed, you can use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. By default, pandas tries lxml first and, only if that fails, falls back on bs4 + html5lib, which is slower.

Prayson W. Daniel