I am building a database that collects the news published on a newspaper website, following this code from John Watson Rooney's GitHub repository: https://github.com/jhnwr/webscrapenewsarticles/blob/master/newscraper.py. But when I scrape the information, the extracted values come wrapped in brackets "[]", and I can't remove them to clean the data and build a news DataFrame.

'''

import datetime as dt

import pandas as pd

# `articles` holds the article elements found with inspect element (not shown here)
n = 0
newslist = []
# Loop through each article to find the title, subtitle, link, date and author.
# try/except because repeated articles from other sources have different h tags.
for item in articles:
    try:
        newsitem = item.find('h3', first=True)
        title = newsitem.text
        link = newsitem.absolute_links
        subtitle = item.xpath('//a[@class="epigraph page-link"]//text()')
        author = item.xpath('//span[@class="oculto"]/span//text()')
        date = item.xpath('//meta[@itemprop="datePublished"]/@content')
        date_scrap = dt.datetime.utcnow().strftime("%d/%b/%Y")
        hour_scrap = dt.datetime.utcnow().strftime("%H:%M:%S")
        print(n, '\n', title, '\n', subtitle, '\n', link, '\n', author, '\n', date, '\n', date_scrap, '\n', hour_scrap)
        newsarticle = {
            'title': title,
            'subtitle': subtitle,
            'link': link,
            'autor': author,
            'fecha': date,
            'date_scrap': date_scrap,
            'hour_scrap': hour_scrap,
        }
        newslist.append(newsarticle)
        n += 1
    except Exception:
        # e.g. no <h3> in this article, so newsitem is None; skip it
        pass

news_db = pd.DataFrame(newslist)
news_db.to_excel(r'db_article.xlsx', index=False, header=True)
news_db.head(10)

'''

I'm not allowed to embed images, but the printed output looks like this:

En Vivo Procuraduría y Fiscalía investigan caso de joven que se suicidó tras detención
['Una joven de 17 años denunció que 4 policías la agredieron sexualmente durante las protestas']
{'https://www.eltiempo.com/justicia/investigacion/investigan-denuncia-de-agresion-sexual-de-policias-a-menor-en-popayan-588429'}
['Here_Author_name']
['2021-05-14']
15/May/2021
18:14:48

I would like to remove both types of brackets, "[]" and "{}". I have tried the following commands, but they convert the values to NaN:

     news_db['subtitle'] = news_db['subtitle'].str.strip(']')
     news_db['subtitle'] = news_db['subtitle'].str.replace(r"\[.*\]", "")
lcricaurte
  • You could slice each parsed item to remove the first and last chars. E.g. item[1:-1] – Lucas Ng May 15 '21 at 18:40
  • I tried that before, but the same thing happens: the item values are converted to NaN – lcricaurte May 15 '21 at 19:41
  • The code you posted cannot be run, because of incorrect indentation and missing details; you should also simplify the code to do as little as possible and still show that an error exists ([see this](https://stackoverflow.com/help/minimal-reproducible-example)) – Jakub Dąbek May 15 '21 at 19:56
  • Brackets `[]` mean you got a list, so you can use `[0]` to get the first element of the list. Or you may have to use a `for` loop to process every element separately. – furas May 15 '21 at 20:02
  • @furas thanks for helping me understand the scraping process and the output better. Now it all makes sense – lcricaurte May 15 '21 at 23:35

1 Answer

The item.xpath method returns a list of the items found, e.g. ['Author'] instead of 'Author', just like item.find does. That is useful when searching for multiple elements (e.g. ['Author1', 'Author2']). To get just one value, pass first=True:

subtitle = item.xpath('//a[@class="epigraph page-link"]//text()', first=True)
author = item.xpath('//span[@class="oculto"]/span//text()', first=True)
date = item.xpath('//meta[@itemprop="datePublished"]/@content', first=True)
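
If the values have already been collected as lists, the same unwrapping can be done in plain Python. Below is a minimal sketch, not part of the original code: `first_or_none` is a hypothetical helper, the sample values are copied from the printed output in the question, and the empty `author` list is an assumed case of an article without an author element.

```python
def first_or_none(values):
    """Return the first element of a list, or None when the list is empty."""
    return values[0] if values else None


# Values shaped like the printed output in the question
subtitle = ['Una joven de 17 años denunció que 4 policías la agredieron sexualmente durante las protestas']
date = ['2021-05-14']
author = []  # hypothetical: an article where the xpath matched nothing

subtitle_text = first_or_none(subtitle)  # plain string, no brackets
date_text = first_or_none(date)
author_text = first_or_none(author)      # None instead of an IndexError

# When several matches are expected, join them instead of keeping only the first
authors_joined = ", ".join(['Author1', 'Author2'])

print(date_text, author_text, authors_joined)
```

Indexing with `[0]` directly, as suggested in the comments, also works, but it raises IndexError on an empty list, which is why the sketch checks for emptiness first.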

absolute_links appears to be a set (hence the {} in the printed output). A set has no order, so you can take an arbitrary element from it with:

link = next(iter(newsitem.absolute_links))
# or
link = newsitem.absolute_links.pop()
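
The pandas attempts in the question most likely return NaN because the `.str` methods yield NaN for cells that are not strings: each cell holds a list or set object, and the brackets are only how Python prints it. One option is to clean each record before building the DataFrame. This is a rough sketch under that assumption; `unwrap` is a hypothetical helper, and the record reuses the values printed in the question.

```python
def unwrap(value):
    """Turn a list or set into a plain value; leave other values unchanged."""
    if isinstance(value, (list, set, tuple)):
        # Sets have no order, so sort them to get a repeatable result
        items = sorted(value) if isinstance(value, set) else list(value)
        if not items:
            return None
        # One element: return it as is; several: join them into one string
        return items[0] if len(items) == 1 else ", ".join(map(str, items))
    return value


url = 'https://www.eltiempo.com/justicia/investigacion/investigan-denuncia-de-agresion-sexual-de-policias-a-menor-en-popayan-588429'
newsarticle = {
    'title': 'En Vivo Procuraduría y Fiscalía investigan caso de joven que se suicidó tras detención',
    'subtitle': ['Una joven de 17 años denunció que 4 policías la agredieron sexualmente durante las protestas'],
    'link': {url},
    'autor': ['Here_Author_name'],
    'fecha': ['2021-05-14'],
}

# Unwrap every field; strings such as the title pass through untouched
clean = {key: unwrap(value) for key, value in newsarticle.items()}

# next(iter(...)) reads an element without changing the set, whereas pop() removes it
links = {url}
link = next(iter(links))
print(clean['link'], clean['fecha'], len(links))
```

Applying `unwrap` to every dictionary in `newslist` before calling `pd.DataFrame(newslist)` should give columns of plain strings, so no bracket stripping is needed afterwards.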
Jakub Dąbek