I am scraping a website using BeautifulSoup:

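from requests import get
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0"}  # assumed; the original headers were not shown
CHN = "https://ncov.dxy.cn/ncovh5/view/pneumonia?scene=2&clicktime=1579582238&enterid=1579582238&from=singlemessage&isappinstalled=0"
response3 = get(CHN, headers=headers)
response3.encoding = 'utf-8'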

I parse the full page content:

html_soup3 = BeautifulSoup(response3.text, 'html.parser')

Then I look for the script tag with the ID getAreaStat:

scripts = html_soup3.find_all('script', id='getAreaStat')
print(scripts)


Out[64]: [<script id="getAreaStat">try { window.getAreaStat = [{"provinceName":"湖北省","provinceShortName":"湖北","currentConfirmedCount":2895,"confirmedCount":67801,"suspectedCount":0,"curedCount":61732,"deadCount":3174,"comment":"","locationId":420000,"statisticsData":"https://file1.dxycdn.com/2020/0223/618/3398299751673487511-135.json","cities":[{"cityName":"武汉","currentConfirmedCount":2880,"confirmedCount":50006,"suspectedCount":0,"curedCount":44591,"deadCount":2535,"locationId":420100},{"cityName":"孝感","currentConfirmedCount":4,"confirmedCount":3518,"suspectedCount":0,"curedCount":3386,"deadCount":128,"locationId":420900},

How can I get a dictionary keyed by provinceName, with each province's cities as its children?

Rob
bruvio

1 Answer

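You could take the response text, extract the relevant string with a regex, and use the ast library to convert it to Python objects. Because the captured text is a comma-separated run of province objects, ast.literal_eval returns a tuple of dicts (or a single dict if only one province is captured).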

import ast, re

# r = response text, decoded with the appropriate encoding
p = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
data = p.findall(r)[0]
print(ast.literal_eval(data))
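
To get from the parsed data to the dictionary asked for in the question, something like the following might work. It is only a sketch: page_text is an abbreviated stand-in built from the output shown in the question, the "}catch(e){}" tail is an assumption inferred from the regex, and the name cities_by_province is made up for illustration. Re-wrapping the captured text in brackets is a small design choice so the result is always a list, whether one province or many were captured.

```python
import ast
import re

# Abbreviated stand-in for the page source, built from the output shown in the
# question; the "}catch(e){}" tail is an assumption based on the regex.
page_text = (
    '<script id="getAreaStat">try { window.getAreaStat = ['
    '{"provinceName":"湖北省","confirmedCount":67801,"cities":['
    '{"cityName":"武汉","confirmedCount":50006},'
    '{"cityName":"孝感","confirmedCount":3518}]}'
    ']}catch(e){}</script>'
)

pattern = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
captured = pattern.findall(page_text)[0]

# Re-wrap in brackets so the result is always a list, even for one province.
provinces = ast.literal_eval('[' + captured + ']')

# Map each province name to the list of its city dicts.
cities_by_province = {p['provinceName']: p['cities'] for p in provinces}

print(cities_by_province['湖北省'][0]['cityName'])  # prints 武汉
```

Each value is the province's list of city dicts, so fields such as confirmedCount stay available per city.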

See the regex here

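Explanation: the pattern matches the literal text window.getAreaStat = [ (the dot and the bracket are escaped), then lazily captures everything up to the first ]}catch. The capture group therefore holds the contents of the array without its outer brackets.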

Fuller example (the encoding part taken from @宏杰李 here):

import requests, re, ast

res = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia?scene=2&clicktime=1579582238&enterid=1579582238&from=singlemessage&isappinstalled=0')
res.encoding = "GBK"
r = res.text
p = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
data = p.findall(r)[0]
print(ast.literal_eval(data))
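
Since the embedded data looks like JSON, the json module may be an alternative to ast. The sketch below is not part of the tested answer: page_text is an abbreviated stand-in for res.text, the "}catch(e){}" tail is assumed, and city_names is a hypothetical name. The regex is adjusted to capture the array including its brackets, so json.loads returns a list directly. Unlike ast.literal_eval, json.loads also accepts the JSON literals true, false and null, should the page ever contain them.

```python
import json
import re

# Abbreviated stand-in for res.text; the "}catch(e){}" tail is assumed.
page_text = (
    'try { window.getAreaStat = ['
    '{"provinceName":"湖北省","provinceShortName":"湖北","deadCount":3174,'
    '"cities":[{"cityName":"武汉","deadCount":2535},'
    '{"cityName":"孝感","deadCount":128}]}'
    ']}catch(e){}'
)

# Capture the array including its brackets so json.loads returns a list.
match = re.search(r'window\.getAreaStat = (\[.*?\])}catch', page_text, re.S)
provinces = json.loads(match.group(1))

# Province name -> list of city names.
city_names = {
    p['provinceName']: [c['cityName'] for c in p['cities']]
    for p in provinces
}
print(city_names)
```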
QHarr
  • Dear bruvio, dear QHarr: for the sake of learning, it would be good to provide a combined solution with the complete code. This would give insight into the solution and would be great for all visitors to this thread. Many thanks, dear QHarr, for your work here and for your support of SO. Keep up the great work. – zero Mar 27 '20 at 13:29
  • @bruvio Please define "does not work". The regex pattern correctly picks up all the provinces. – QHarr Mar 27 '20 at 13:37
  • @zero Full example added. Let me know if you want more info provided :-) – QHarr Mar 27 '20 at 13:47
  • @zero I think the code I provided contains all the info needed to reproduce the issue. I have been trying to extract the information I wrote about (the provinceName values). It would also help me to understand how to parse the string I get, or whether there is a different way to extract the information, for example using json (which I tried). Please keep up the great work, thanks – bruvio Mar 27 '20 at 13:48