
Please help this poor, struggling philosophy and economics major.

I'm trying to get the market cap of Samsung Electronics from the Korean website 'finance.naver.com'.

(It doesn't need to be Samsung; I just need to crawl market caps for my quant investing.)

The website is https://finance.naver.com/item/main.nhn?code=005930

Here is a screenshot of the web page; the target number is in the red box.

This is my code:

from bs4 import BeautifulSoup
import requests
mkc_url = 'https://finance.naver.com/item/main.nhn?code=005930'
mkc_result = requests.get(mkc_url)
mkc_obj = BeautifulSoup(mkc_result.content, "html.parser")

I found that the target number is inside a 'div' tag with the class 'first':

mkc = mkc_obj.find("div",{"class": "first"})

Inside that 'div' tag, the number is in an 'em' tag with the id '_market_sum':

em_id = mkc.find("em", {"id":"_market_sum"})

Finally, I got the result shown in this screenshot.

'조' is the Korean numeral unit for a trillion (10^12), so I wanted to delete everything but the numbers, but I didn't know how.

What I did was put that result in a DataFrame and try to delete everything but the numbers using '.str.strip':

import pandas as pd

df_mkc = pd.DataFrame(em_id)
df_mkc[0] = df_mkc[0].str.strip('\n')
df_mkc[0] = df_mkc[0].str.strip('\t')
df_mkc[0] = df_mkc[0].str.strip()
df_mkc = df_mkc.replace({r'\$': '', ',': ''}, regex=True)

And it just gets uglier and uglier, as this screenshot shows.

I tapped out at this point.

Please help!!!

Thanks for all your kindness, wisdom, and generosity.

Sileo
Rock Lee

2 Answers


After you have defined em_id, get rid of its tags with:

em_txt = em_id.get_text()

Then you can get rid of the whitespace with (thanks to this answer):

clean_em = "".join(line.strip() for line in em_txt.split("\n"))

Finally, if the unit is always '조', you can create a list with the two number values by splitting on it:

mcap_list = clean_em.split('조')

You may want to get rid of the comma in 4,299 with:

mcap_list[1] = mcap_list[1].replace(",","")

And convert the values to integers with:

for i in range(len(mcap_list)):
    mcap_list[i] = int(mcap_list[i])

You now have mcap_list equal to [290, 4299].
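If a single number is needed rather than a two-element list, the two parts can be combined arithmetically. The sketch below is only an illustration under stated assumptions: it assumes the figure after '조' is expressed in 억 (10^8 won), so that 1조 equals 10,000억; the sample string `em_txt` is made up to resemble what `em_id.get_text()` might return; and `market_cap_in_eok` is a hypothetical helper name. Multiplying instead of gluing the two strings together avoids a wrong result when the second part has fewer than four digits.

```python
import re

# Assumed sample of what em_id.get_text() might return; the real page text may differ.
em_txt = "\n\t\t290조\n\t\t4,299\n\t"


def market_cap_in_eok(text):
    """Convert text such as '290조 4,299' to an integer count of 억 (10**8 won)."""
    # Drop all whitespace and thousands separators first.
    compact = re.sub(r"[\s,]", "", text)
    if "조" in compact:
        jo_part, eok_part = compact.split("조", 1)
    else:
        # Assumption: values below 1조 show only the 억 figure.
        jo_part, eok_part = "0", compact
    # 1조 = 10,000억; an empty 억 part counts as zero.
    return int(jo_part) * 10_000 + int(eok_part or 0)


print(market_cap_in_eok(em_txt))       # 2904299
print(market_cap_in_eok("290조 299"))  # 2900299 (string concatenation would give 290299)
```

The result is in units of 억; multiply by 10**8 if the value is needed in won.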

Sileo
  • It worked! But the result was a list of [290, 4299]. So I put those two numbers together first while they were still str, and then made them into an int! Perfectly solved my problem! Thank you very much! – Rock Lee Mar 26 '20 at 10:01

Another solution is to use a regular expression with pandas' str.findall, which applies re.findall to each element. Consider the dummy DataFrame below:

import pandas as pd

df = pd.DataFrame({'Extract' : ['Total revenue for this year is $10,000, for last year it was $8000',
                                'and profit in USD is $2000.00','it is 20.00%',
                                'This is in Korean currency 500조']})

# Capture digit runs that may contain one '.' or ',' separator
df['Numbers'] = df['Extract'].str.findall(r'(\d+[.,]?\d*)')

print(df['Numbers'])

0    [10,000, 8000]
1         [2000.00]
2           [20.00]
3             [500]
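The matches above are still strings, and values such as '10,000' keep their comma. As a rough sketch of one way to go from text to actual numbers using only the standard library, the hypothetical helper `extract_numbers` below matches either comma-grouped digits or plain digit runs, strips the commas, and converts each match to a float. The pattern is an assumption about the input format (commas only as thousands separators, '.' as the decimal point), not a general-purpose number parser.

```python
import re

# Comma-grouped digits (e.g. 10,000) or a plain digit run, with an optional decimal part.
NUMBER_PATTERN = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in text as a float, ignoring thousands separators."""
    return [float(match.replace(",", "")) for match in NUMBER_PATTERN.findall(text)]


print(extract_numbers("Total revenue for this year is $10,000, for last year it was $8000"))
# [10000.0, 8000.0]
print(extract_numbers("290조 4,299"))
# [290.0, 4299.0]
```

With pandas, the same function could be applied per row, for example via df['Extract'].apply(extract_numbers).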
ManojK
  • Thank you for your help! In my case, it still had ', ' between the two numbers, so I had to find another way, but it helped me a lot! – Rock Lee Mar 26 '20 at 09:58
  • Great, I would suggest reading more about regular expressions; they are very handy in such situations. Cheers!!! – ManojK Mar 26 '20 at 10:09