I am trying to scrape the publication date of a news article from https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya. Here is my code:

news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')

but I got this error:

ValueError: time data '06 Mei 2021, 11:32  ' does not match format '%d %B %Y, %H:%M'

My assumption is that this fails because Mei is Indonesian, whereas the format expects the English May. How can I change Mei to May? I have tried dates = dates.replace('Mei', 'May'), but it doesn't work for me: I then get ValueError: unconverted data remains: . The type of dates is string. Thanks

winnie

3 Answers

Your assumption regarding the Mei -> May change is correct. The likely reason you still get an error after the replacement is the trailing spaces in your string, which your format does not account for. You can use str.rstrip() to remove them.

import datetime as dt

dates = "06 Mei 2021, 11:32  "
dates = dates.replace("Mei", "May") # The replacement will have to be handled for all months, this is only an example
dates = dates.rstrip()
date = dt.datetime.strptime(dates, "%d %B %Y, %H:%M")
print(date) # 2021-05-06 11:32:00

Although this fixes the problem here, slicing with dates = dates[0:20] and then stripping the leftover whitespace is messy. Consider using a regular expression to extract the date in the expected format in one step.
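As a minimal sketch of the regex idea, assuming the scraped text always contains a date shaped like the one in your error message (day, month name, year, comma, HH:MM): the helper name extract_date_text and the pattern are my own illustration, not something taken from your code.

```python
import datetime as dt
import re

# Assumed shape: "06 Mei 2021, 11:32" (day, month name, 4-digit year, HH:MM)
DATE_PATTERN = re.compile(r"\d{1,2} [A-Za-z]+ \d{4}, \d{2}:\d{2}")


def extract_date_text(raw):
    """Return the first date-like substring of raw, or None if there is none."""
    match = DATE_PATTERN.search(raw)
    return match.group(0) if match else None


text = extract_date_text("06 Mei 2021, 11:32  ")
if text is not None:
    # Only May is translated here; the other months still need handling
    date = dt.datetime.strptime(text.replace("Mei", "May"), "%d %B %Y, %H:%M")
    print(date)  # 2021-05-06 11:32:00
```

Because the pattern matches only the date itself, no slicing or stripping is needed afterwards, and a None result replaces the len(dates) == 0 check.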

ankurbohra04
  • Is there another way if I want to convert the other months as well? Replacing all 12 months by hand takes effort. – winnie May 06 '21 at 06:23
  • You could use some sort of translation library, but that would be heavily overdoing it. First, write a dictionary of the form {"Mei": "May", ...}. If you know the month will always be in this language, you can pick the month off as month = dates.split()[1], look it up in the dictionary, and perform the replacement, i.e. dates.replace(month, translations[month]). If not, you can simply loop through the dictionary and apply replace(key, value) for each pair. Whichever way you choose, a hard-coded translation is required. – ankurbohra04 May 06 '21 at 06:29

The problem seems to be just the trailing whitespace in your string, which explains the error ValueError: unconverted data remains: . strptime is complaining that it cannot convert the remaining data (the whitespace).

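from datetime import datetime

# Translate the month name, then drop leading and trailing whitespace
s = '06 Mei 2021, 11:32  '.replace('Mei', 'May').strip()
datetime.strptime(s, '%d %B %Y, %H:%M')
# Returns datetime.datetime(2021, 5, 6, 11, 32)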

Also, to convert all the Indonesian months to English, you can use a dictionary:

id_en_dict = {
    ...,
    'Mei': 'May',
    ...
}
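
A fuller sketch of that dictionary approach, assuming the standard Indonesian month names and a string shaped like '06 Mei 2021, 11:32'. The names ID_TO_EN_MONTHS and parse_indonesian_date are only illustrative, and the code relies on strptime running under the default English month names.

```python
import datetime as dt

# Indonesian month names mapped to English ones. April, September and
# November are spelled the same in both languages, so they are omitted.
ID_TO_EN_MONTHS = {
    "Januari": "January",
    "Februari": "February",
    "Maret": "March",
    "Mei": "May",
    "Juni": "June",
    "Juli": "July",
    "Agustus": "August",
    "Oktober": "October",
    "Desember": "December",
}


def parse_indonesian_date(text, fmt="%d %B %Y, %H:%M"):
    """Translate the month name to English, then parse with strptime."""
    parts = text.strip().split()
    # Assumes the month is the second word, as in '06 Mei 2021, 11:32'
    parts[1] = ID_TO_EN_MONTHS.get(parts[1], parts[1])
    return dt.datetime.strptime(" ".join(parts), fmt)


print(parse_indonesian_date("06 Mei 2021, 11:32  "))  # 2021-05-06 11:32:00
```

Looking the month up by position means each string needs one dictionary lookup instead of twelve replace calls, and month names that are already valid in English pass through unchanged.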
Daren

You can try the following:

import datetime as dt

import requests
from bs4 import BeautifulSoup

url = "https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
# verify=False disables TLS certificate verification; drop it unless the
# site's certificate fails to validate
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
info_soup = soup.find(class_="new-description")
# get_text(strip=True) already removes leading and trailing whitespace
x = info_soup.find('span').get_text(strip=True)
x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
# Keep the first 20 characters, then drop the trailing spaces that remain
x = x[0:20].rstrip()
date = dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
print(date)

result:

2021-05-06 11:45:00
Renaud