How to delete everything but numbers (web crawling the market cap of a Korean company)

Question

Please help this poor, struggling philosophy and economics major.

I'm trying to get the market cap of Samsung Electronics from the Korean website 'finance.naver.com'.

(It doesn't need to be Samsung; I just need to crawl market caps for my quant investing.)

The website is https://finance.naver.com/item/main.nhn?code=005930

This is the image of the web page; the target number is in the red box.

This is my code:

from bs4 import BeautifulSoup
import requests
mkc_url = 'https://finance.naver.com/item/main.nhn?code=005930'
mkc_result = requests.get(mkc_url)
mkc_obj = BeautifulSoup(mkc_result.content, "html.parser")

I found that the target number is in the 'div' tag with class 'first':

mkc = mkc_obj.find("div",{"class": "first"})

Inside that 'div' tag, I found the number is in the 'em' tag with id '_market_sum':

em_id = mkc.find("em", {"id":"_market_sum"})

Finally, I got a result like this:

'조' is a Korean number unit (one trillion) used here for the won amount, so I wanted to delete everything but the numbers, but I didn't know how.

What I did was put that result in a DataFrame and try to delete everything but the numbers using '.str.strip':

import pandas as pd

df_mkc = pd.DataFrame(em_id)
df_mkc[0] = df_mkc[0].str.strip('\n')
df_mkc[0] = df_mkc[0].str.strip('\t')
df_mkc[0] = df_mkc[0].str.strip()
df_mkc = df_mkc.replace({'\$': '', ',': ''}, regex=True)

And it gets uglier and uglier.

I tapped out at this point.

Please help!!!

Thanks for all your kindness, wisdom and generosity.

Sileo · Accepted Answer · 2020-10-15T15:13:11.337

2

After you have defined em_id, get rid of its tags with:

em_txt = em_id.get_text()

Then you can get rid of the whitespace with (thanks to this answer):

clean_em = "".join(line.strip() for line in em_txt.split("\n"))

Finally, if the currency unit will always be the same, you can create a list with the two number values by splitting on it:

mcap_list = clean_em.split('조')

You may want to get rid of the comma in 4,299 with:

mcap_list[1] = mcap_list[1].replace(",","")

And convert the values to integers with:

for i in range(len(mcap_list)):
    mcap_list[i] = int(mcap_list[i])

You now have mcap_list equal to [290,4299]
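
Put together, the steps above can be sketched as a single function. The sample string is only an assumption about what em_id.get_text() returns (the real whitespace may differ), and split_market_cap is a name made up for this illustration.

```python
def split_market_cap(em_txt, unit="조"):
    """Split text like '290조 4,299' into a list of integers."""
    # Strip each line and join them, removing the tabs and newlines around the numbers
    clean_em = "".join(line.strip() for line in em_txt.split("\n"))
    # Split on the unit character and drop thousands separators
    parts = [part.replace(",", "") for part in clean_em.split(unit)]
    # Ignore empty pieces (e.g. when nothing follows the unit)
    return [int(part) for part in parts if part]


# Assumed sample of the raw text inside the <em> tag
sample = "\n\t\t\t\t290조\n\t\t\t\t4,299\n\t\t\t"
print(split_market_cap(sample))  # [290, 4299]
```

Because it works on a plain string, the same function can be tried without touching the network or pandas.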

edited Oct 15 '20 at 15:13

answered Mar 25 '20 at 14:16

Sileo


1

It worked! But, the result was a list of [290, 4299]. So I put those two numbers together first when they were still str, and then made them into int! Perfectly solved my problem! Thank you very much! – Rock Lee Mar 26 '20 at 10:01
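
Joining the two strings gives the right value when the second number has four digits. A more defensive sketch combines the parts arithmetically instead. It assumes the second number counts units of 억 (10^8 won, so 1조 = 10,000억), which this thread does not state, and combine_parts is a hypothetical helper.

```python
def combine_parts(mcap_list):
    """Combine [jo, eok] into one integer counted in units of 억 (assumed)."""
    if len(mcap_list) == 1:
        # Only one number was present, so there is nothing to combine
        return mcap_list[0]
    jo, eok = mcap_list
    # 1조 = 10,000억, so scale the first part before adding the second
    return jo * 10_000 + eok


print(combine_parts([290, 4299]))  # 2904299, same as joining "290" + "4299"
print(combine_parts([290, 299]))   # 2900299, whereas joining the strings gives 290299
```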

score 1 · Answer 2 · answered Mar 25 '20 at 14:28

1

Another solution is to use a regular expression with pandas' str.findall (the column-wise counterpart of re.findall). Consider the dummy DataFrame below:

import pandas as pd

df = pd.DataFrame({'Extract' : ['Total revenue for this year is $10,000, for last year it was $8000',
                                'and profit in USD is $2000.00','it is 20.00%',
                                'This is in Korean currency 500조']})

# Collect every number (digits with an optional '.' or ',' separator) in each row
df['Numbers'] = df['Extract'].str.findall(r'(\d+[.,]?\d*)')

print(df['Numbers'])

0    [10,000, 8000]
1         [2000.00]
2           [20.00]
3             [500]
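
Applied to the scraped text itself rather than a DataFrame, the same regex idea might look like the sketch below. The sample string is an assumed shape of the text inside the 'em' tag, and digits_only and find_numbers are illustrative names, not part of any library.

```python
import re


def digits_only(text):
    """Remove every character that is not a digit."""
    return re.sub(r"\D", "", text)


def find_numbers(text):
    """Return each run of digits (commas removed first) as an integer."""
    return [int(n) for n in re.findall(r"\d+", text.replace(",", ""))]


sample = "\n\t290조\n\t4,299\n"  # assumed shape of the scraped text
print(digits_only(sample))   # 2904299
print(find_numbers(sample))  # [290, 4299]
```

digits_only is the literal "delete everything but numbers" from the title; find_numbers keeps the two values separate in case they need to be handled individually.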

answered Mar 25 '20 at 14:28

ManojK


1

Thank you for your help! In my case, It still had ', ' between two numbers, so I had to find out the other way, but it helped me a lot! – Rock Lee Mar 26 '20 at 09:58
Great, I would suggest reading more about regular expressions; they are very handy in such situations. Cheers!!! – ManojK Mar 26 '20 at 10:09
