My URL is "https://www.tokopedia.com/sitemap/product/1.xml.gz". It contains a number of product URLs, but it is gzip-compressed. I don't know how to unzip it or how to get the data out of it. How can I unzip it using Scrapy, Beautiful Soup, or some other scraping library?

Bitto
selva kumar

1 Answer

Take a look at Python's gzip module:

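import gzip
from io import BytesIO

import requests

# Download the compressed sitemap; r.content holds the raw gzip bytes.
# stream=True is not needed because the whole body is read at once.
r = requests.get('https://www.tokopedia.com/sitemap/product/1.xml.gz')
# Wrap the bytes in a file-like object so GzipFile can decompress them.
g = gzip.GzipFile(fileobj=BytesIO(r.content))
content = g.read()  # decompressed XML, as bytes
print(content)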

The full output is too long to paste here, so this is the output of g.read(1000) instead:

Output:

b'<?xml version="1.0" encoding="UTF-8"?>\n\t<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n\txmlns:xhtml="http://www.w3.org/1999/xhtml">\n\t <url>\n\t   <loc>https://www.tokopedia.com/tokoshishaonline/shisha-medium</loc>\n       <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/tokoshishaonline/shisha-medium" />\n\t </url>\n\t <url>\n\t   <loc>https://www.tokopedia.com/lighting/lampu-sorot-philips-hnf-207-flood-light-lampu-tembak-lampu-stadion</loc>\n       <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/lighting/lampu-sorot-philips-hnf-207-flood-light-lampu-tembak-lampu-stadion" />\n\t </url>\n\t <url>\n\t   <loc>https://www.tokopedia.com/agromedia/pop-supernasa</loc>\n       <xhtml:link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.tokopedia.com/agromedia/pop-supernasa" />\n\t </url>\n\t <url>\n\t   <loc>https://www.tokopedia.com/agromedia/aero-810</loc>\n       <xhtml:l'
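To illustrate the g.read(1000) call used for the preview above, here is a minimal sketch that runs without network access. The `compressed` bytes are a locally built stand-in for `r.content`, not the real sitemap, and the `preview` name is only illustrative.

```python
import gzip
from io import BytesIO

# Stand-in for r.content: gzip-compressed bytes built locally (an assumption,
# used so the example does not depend on the remote server).
compressed = gzip.compress(b"<urlset>" + b"<url/>" * 1000 + b"</urlset>")

# read(n) decompresses and returns at most n bytes, which is handy for peeking
# at a large file before processing all of it.
with gzip.GzipFile(fileobj=BytesIO(compressed)) as g:
    preview = g.read(1000)

print(len(preview))
print(preview[:40])
```

Calling read() again on the same object would continue from where the previous call stopped.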
Bitto
  • I am new to scraping, so I need some help: how do I parse the content and get the URLs from it? Can I use XPath here? – selva kumar Jan 30 '19 at 19:35
  • @selvakumar You can use BeautifulSoup as well. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-xml – Bitto Jan 30 '19 at 19:38
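As a follow-up to the comments on parsing, here is one possible sketch that pulls the `<loc>` URLs out of the decompressed XML. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, which is a swapped-in choice and not what the answer suggests. The helper name `extract_locs` and the small `sample_xml` document are hypothetical; the sample mimics the structure shown in the output and is compressed locally so the example runs offline.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the sitemap output shown in the answer.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(gz_bytes):
    """Decompress gzip bytes and return the text of every <url>/<loc> element."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    # ElementTree supports a limited XPath subset; the prefix maps to the namespace.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


# Hypothetical stand-in for the downloaded sitemap (two entries only).
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.tokopedia.com/agromedia/pop-supernasa</loc></url>
  <url><loc>https://www.tokopedia.com/agromedia/aero-810</loc></url>
</urlset>"""

urls = extract_locs(gzip.compress(sample_xml))
print(urls)
```

With the real sitemap, the same helper could presumably be called as extract_locs(r.content), where r is the response from the answer's requests.get call.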