I am writing a web scraper.
It runs a Google search, takes the link of each result page, and then reads the contents of that page's <title> tag.
The problem is that, for example, the string "P\xe1gina N\xe3o Encontrada!" should be "Página Não Encontrada!". I tried to decode it as latin-1 and then encode it to utf-8, but that did not work.

    r2 = requests.get(item_str)
    texto_pagina = r2.text
    soup_item = BeautifulSoup(texto_pagina,"html.parser")
    empresa = soup_item.find_all("title")
    print(empresa_str.decode('latin1').encode('utf8'))

Can you help me, please? Thanks!
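For readers who want to reproduce the situation without network access, here is a minimal sketch of the title-extraction step using only the standard library, assuming Python 3. The `raw_html` bytes, the `TitleParser` class, and the `extract_title` helper are made up for illustration, and the page is assumed to be served as ISO-8859-1 (latin-1); a real page may declare a different charset.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(raw_html, encoding):
    """Decode the raw bytes with the given encoding, then return the title text."""
    parser = TitleParser()
    parser.feed(raw_html.decode(encoding))
    parser.close()
    return parser.title.strip()


# Hypothetical response body: latin-1 bytes, where 0xE1 is "á" and 0xE3 is "ã".
raw_html = (b"<html><head><title>Ops... P\xe1gina N\xe3o Encontrada!</title>"
            b"</head><body></body></html>")

title = extract_title(raw_html, "latin-1")
print(title)
```

Once the bytes are decoded with the right charset, the title is ordinary text and needs no further decode or encode call before printing.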

Mr Lister
Daniel Castro
  • Maybe some answer here [link](http://stackoverflow.com/questions/5498371/how-can-i-get-portuguese-characters-in-python) – Mike Mar 16 '16 at 00:17
  • Didn't work. I already tried that...thanks – Daniel Castro Mar 16 '16 at 00:33
  • can you show us result of print([empresa])? so we can exactly see what is current encoding. and is that python3? – YOU Mar 16 '16 at 01:18
  • print(empresa_str) : [Ops... P\xe1gina N\xe3o Encontrada!] [ANADI Consultoria ERP Totvs] [Experfite | Consultoria Microsiga Protheus homologada e certificada Totvs - Home] [Consultoria TOTVS\xae | ALFA Sistemas de Gest\xe3o] [.: TOTVS IV2 - Tecnologia e Sistemas :.] [Consultoria TOTVS Protheus] [CONSULTORIA TOTVS PROTHEUS | Systh] – Daniel Castro Mar 16 '16 at 01:24
  • Instead of `print(empresa_str)`, can you do what @YOU suggested above, which is: `print([empresa])`? – Saroekin Mar 16 '16 at 01:27
  • Sorry, this is the correct print(empresa) : [Interativa] [Ops... P\xe1gina N\xe3o Encontrada!] [ANADI Consultoria ERP Totvs] – Daniel Castro Mar 16 '16 at 01:30
  • that square brackets are needed or python will do something behind the scene, print([empresa]) or print([empresa_str]). and looks like those are not string but beautifulsoup objects. – YOU Mar 16 '16 at 01:32
  • They are not needed... Each [] represents one empresa, then I extract the content of the tag. – Daniel Castro Mar 16 '16 at 01:36
  • Did you try putting # -*- coding: latin-1 -*- at the top of the python file? – Paul Chris Jones Jul 05 '19 at 19:27
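The exchange in the comments turns on the difference between a string and its escaped representation. The sketch below assumes Python 3 and uses the built-in `ascii()` to reproduce the escaped view; `escaped_view` is a hypothetical helper name. It is meant only to illustrate that `\xe1` in the printed list may be a display form of `á`, not proof of what the scraper actually returned.

```python
# Python 3 assumed. The title is written with escapes, but it is ordinary text.
title = "P\xe1gina N\xe3o Encontrada!"


def escaped_view(text):
    """Return the ASCII-only escaped form of text, quotes included."""
    return ascii(text)


# The escaped spelling and the accented spelling are the same string.
print(title == "Página Não Encontrada!")

# The escaped form resembles what appears when a list of strings is printed
# in Python 2, because a list prints the repr() of each element.
print(escaped_view(title))
```

If the escapes show up only when a list or other container is printed, the underlying text is probably already correct, and printing each element on its own should show the accents.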

2 Answers

4

You can change the retrieved text variable to something like:

    string = u'P\xe1gina N\xe3o Encontrada!'.encode('utf-8')

After printing string it seemed to work just fine for me.


Edit

Instead of adding .encode('utf8'), have you tried just using empresa_str.decode('latin1')?

As in:

    string = empresa_str.decode('latin_1')
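
As a rough illustration of the decode and encode steps discussed in this answer, here is a sketch for Python 3, where bytes and text are separate types and `str` has no `decode` method. The helper names `latin1_to_text` and `text_to_utf8` are invented for the example, and the input is assumed to be latin-1 bytes.

```python
def latin1_to_text(raw):
    """Turn latin-1 encoded bytes into text (str)."""
    return raw.decode("latin-1")


def text_to_utf8(text):
    """Turn text into UTF-8 encoded bytes, e.g. for writing to a file."""
    return text.encode("utf-8")


latin1_bytes = b"P\xe1gina N\xe3o Encontrada!"   # one byte per accented letter

text = latin1_to_text(latin1_bytes)
utf8_bytes = text_to_utf8(text)

print(text)         # text, ready to display
print(utf8_bytes)   # bytes, each accented letter now takes two bytes
```

Decoding is what produces printable text. Encoding back to UTF-8 yields bytes again, which is useful for storage or transmission but not for display.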
Saroekin
-1

Not the most elegant solution, but it worked for me:

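def remove_all(substr, text):
    # Repeatedly delete substr from text until no occurrence remains.
    length = len(substr)
    index = text.find(substr)
    while index != -1:
        text = text[:index] + text[index + length:]
        index = text.find(substr)
    return text

def latin1_to_ascii(unicrap):
    # Keys are escape sequences with the backslashes already stripped;
    # the two-byte (UTF-8) sequences come first so they are replaced first.
    xlate={ 'xc3xb3':'o' , 'xc3xa7':'c','xc3xb5':'o',  'xc3xa3':'a',  'xc3xa9':'e',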
    'xc0':'A', 'xc1':'A', 'xc2':'A', 'xc3':'A', 'xc4':'A', 'xc5':'A',
    'xc6':'Ae', 'xc7':'C',
    'xc8':'E', 'xc9':'E', 'xca':'E', 'xcb':'E',
    'xcc':'I', 'xcd':'I', 'xce':'I', 'xcf':'I',
    'xd0':'Th', 'xd1':'N',
    'xd2':'O', 'xd3':'O', 'xd4':'O', 'xd5':'O', 'xd6':'O', 'xd8':'O',
    'xd9':'U', 'xda':'U', 'xdb':'U', 'xdc':'U',
    'xdd':'Y', 'xde':'th', 'xdf':'ss',
    'xe0':'a', 'xe1':'a', 'xe2':'a', 'xe3':'a', 'xe4':'a', 'xe5':'a',
    'xe6':'ae', 'xe7':'c',
    'xe8':'e', 'xe9':'e', 'xea':'e', 'xeb':'e',
    'xec':'i', 'xed':'i', 'xee':'i', 'xef':'i',
    'xf0':'th', 'xf1':'n',
    'xf2':'o', 'xf3':'o', 'xf4':'o', 'xf5':'o', 'xf6':'o', 'xf8':'o',
    'xf9':'u', 'xfa':'u', 'xfb':'u', 'xfc':'u',
    'xfd':'y', 'xfe':'th', 'xff':'y',
    'xa1':'!', 'xa2':'{cent}', 'xa3':'{pound}', 'xa4':'{currency}',
    'xa5':'{yen}', 'xa6':'|', 'xa7':'{section}', 'xa8':'{umlaut}',
    'xa9':'{C}', 'xaa':'{^a}', 'xab':'<<', 'xac':'{not}',
    'xad':'-', 'xae':'{R}', 'xaf':'_', 'xb0':'{degrees}',
    'xb1':'{+/-}', 'xb2':'{^2}', 'xb3':'{^3}', 'xb4':'',
    'xb5':'{micro}', 'xb6':'{paragraph}', 'xb7':'*', 'xb8':'{cedilla}',
    'xb9':'{^1}', 'xba':'{^o}', 'xbb':'>>', 
    'xbc':'{1/4}', 'xbd':'{1/2}', 'xbe':'{3/4}', 'xbf':'?',
    'xd7':'*', 'xf7':'/'
    }
    unicrap = remove_all ('\\', unicrap)
    unicrap = remove_all('&amp;', unicrap)
    unicrap = remove_all('u2013', unicrap)

    r = unicrap
    for item,valor in xlate.items():
        #print item, unicrap.find(item)
        r = r.replace(item,valor)
    return r
Daniel Castro
  • That's stripping the accents and other diacritics completely, not displaying the original values. Also, don't name variables `str`, and you don't need the `string` module to perform `find`; Python's `str` has featured a `find` method for decades, so `string.find(str, substr)` is just a verbose/slow way to say `str.find(substr)`. – ShadowRanger Mar 16 '16 at 02:48
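
If ASCII-only output really is the goal, the standard library can do the stripping without a hand-written table. The following is a sketch assuming Python 3 and text that has already been decoded; `strip_accents` is a hypothetical helper name. Like the answer above, it discards the diacritics rather than displaying them, and characters that have no decomposition (such as `ß` or `æ`) pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


title = "Página Não Encontrada!"
print(strip_accents(title))

# Built-in str methods cover what the string module was used for.
print(title.find("Encontrada"))   # index of the first match, or -1
```

This keeps the original text intact in `title`, so the accented form is still available for display.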