My URL is "https://www.tokopedia.com/sitemap/product/1.xml.gz". It contains a list of product URLs, but it is gzipped. I don't know how to unzip it and get the data out of it. How can I unzip it using Scrapy, Beautiful Soup, or some other scraping library?
1 Answer
Take a look at Python's built-in gzip module. You can download the file with requests and decompress it in memory:
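import gzip
from io import BytesIO

import requests

# Download the gzipped sitemap; r.content holds the raw compressed bytes.
r = requests.get('https://www.tokopedia.com/sitemap/product/1.xml.gz')
r.raise_for_status()
# Wrap the bytes in a file-like object so GzipFile can decompress them.
g = gzip.GzipFile(fileobj=BytesIO(r.content))
content = g.read()
print(content)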
The full output is too long to paste here, so this is the output of g.read(1000):
b'<?xml version="1.0" encoding="UTF-8"?>\n\t<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n\txmlns:xhtml="http://www.w3.org/1999/xhtml">\n\t <url>\n\t <loc>https://www.tokopedia.com/tokoshishaonline/shisha-medium</loc>\n <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/tokoshishaonline/shisha-medium" />\n\t </url>\n\t <url>\n\t <loc>https://www.tokopedia.com/lighting/lampu-sorot-philips-hnf-207-flood-light-lampu-tembak-lampu-stadion</loc>\n <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/lighting/lampu-sorot-philips-hnf-207-flood-light-lampu-tembak-lampu-stadion" />\n\t </url>\n\t <url>\n\t <loc>https://www.tokopedia.com/agromedia/pop-supernasa</loc>\n <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/agromedia/pop-supernasa" />\n\t </url>\n\t <url>\n\t <loc>https://www.tokopedia.com/agromedia/aero-810</loc>\n <xhtml:l'

Bitto
- I am new to scraping, so I need some help: how do I parse the content and get the URLs from it? Can I use XPath here? – selva kumar Jan 30 '19 at 19:35
- @selvakumar You can use BeautifulSoup as well. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-xml – Bitto Jan 30 '19 at 19:38
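
To follow up on the parsing question in the comments, here is a minimal sketch of pulling the product URLs out of the decompressed XML. It uses the standard library's xml.etree.ElementTree with a namespace-aware path (a limited XPath subset) rather than BeautifulSoup, which is a swapped-in choice and not the method suggested above. The function name extract_locs and the SAMPLE_XML data are hypothetical; the sample is copied from the output shown in the answer and compressed in memory so the sketch runs without a network request.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace declared on the <urlset> root in the output shown in the answer.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(gz_bytes):
    """Decompress gzipped sitemap bytes and return the text of every <loc>."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    # <loc> lives in the sitemap namespace, so the path must be namespace-aware.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


# Offline stand-in for r.content: two entries taken from the output above.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.tokopedia.com/tokoshishaonline/shisha-medium</loc>
    <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/tokoshishaonline/shisha-medium" />
  </url>
  <url>
    <loc>https://www.tokopedia.com/agromedia/pop-supernasa</loc>
  </url>
</urlset>
"""

urls = extract_locs(gzip.compress(SAMPLE_XML))
for url in urls:
    print(url)
```

With the live sitemap, the same function should work on the bytes downloaded in the answer, for example extract_locs(r.content), assuming the file keeps the structure shown in the output.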