Not A Duplicate

I don't think this is a duplicate of the linked question. The answer there explains how to fix the problem in Python 2 and says that it should not occur in Python 3. Also, the answer provided does not work for me:

>>"ć́".decode()
AttributeError: 'str' object has no attribute 'decode'

>>len(u"ć́")
2

Original Question:

I am importing book data from a website and then processing it. One of the first steps is to do some work with the length of a certain string. Unfortunately, the len() function sometimes returns a wrong value when "abnormal" characters are included:

>>len("Krste Asanović́ ... [et al.].")
29
>>ord("ć́")
TypeError: ord() expected a character, but string of length 2 found

Here the "ć́" is not a standard character; if I replace it with a normal "c", I get a different result.

>>len("Krste Asanovic ... [et al.].")
28

I can, of course, solve the problem using replace():

>>"Krste Asanović́ ... [et al.].".replace("ć́","c")
'Krste Asanovic ... [et al.].'

But is there a way to "forbid" weird letters in the first place?

EDIT

>>list("ć́")
['ć', '́']

I'm using Python 3.6.
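
To see what the two elements in that list actually are, the standard `unicodedata` module can name each code point. This is a minimal sketch; it assumes the string is U+0107 followed by U+0301 (written with escapes so nothing is lost in copy and paste), and `describe` is a hypothetical helper name, not something from the question.

```python
import unicodedata

# Assumed content of the two-character string, written with escapes
# so the code points are unambiguous.
s = "\u0107\u0301"

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

for row in describe(s):
    print(row)
# ('U+0107', 'LATIN SMALL LETTER C WITH ACUTE', 0)
# ('U+0301', 'COMBINING ACUTE ACCENT', 230)
```

A non-zero combining class on the second entry would indicate a combining mark that attaches to the preceding letter rather than a separate visible character.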

EDIT 2

this...

>>"ć́".replace("´","")
"ć́"

does nothing.
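
A possible explanation is that the accent typed in the replace() call is the standalone acute accent (U+00B4), while the string contains the combining acute accent (U+0301). The sketch below compares the two, assuming the string is U+0107 followed by U+0301; the variable names are only illustrative.

```python
# Assumed content of the problematic string: precomposed letter + combining accent.
s = "\u0107\u0301"

spacing_acute = "\u00b4"    # ACUTE ACCENT (standalone character)
combining_acute = "\u0301"  # COMBINING ACUTE ACCENT (attaches to the previous letter)

unchanged = s.replace(spacing_acute, "")  # no match, so the string is returned as-is
fixed = s.replace(combining_acute, "")    # removes the extra accent

print(len(unchanged), len(fixed))  # 2 1
```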

NewNewton
  • Please show (1) which version of Python you are using, (2) what `list("ć")` returns. – lenz Feb 18 '18 at 19:35
  • Having a second, closer look at the example, I see that the 2-character string that is looked up here is actually the letter "ć", followed by a combining acute accent (so there are actually two acute accents, conceptually). So I think this is **not a duplicate** of the linked question, which is an encoding thing. Here we have a data problem. – lenz Feb 18 '18 at 19:46
  • To me, it's clear that this isn't a duplicate of the mentioned question. However, it's unclear to me what you want: The given example is most probably corrupt (an accented letter with the same accent again), but you talk about "weird" and "abnormal" characters, which could be anything. Do you want ASCII only? Or just no combining characters? For example, it would be perfectly acceptable to use "c" plus the combining acute (ie. two characters), which is equivalent (but not equal) to the single character "ć". – lenz Feb 18 '18 at 20:36
  • Ah OK, I see. I want to convert the string so that len() returns a result as if it were only one character, because it only "needs space" for one character. I thought it was an abnormal character that Python interprets as two characters. But it seems to be a rare exception, so maybe the replace() solution is fine? – NewNewton Feb 18 '18 at 20:48
  • If this is a rare case, the replace solution will do. Your last edit doesn't work because you are using the wrong character (the spacing acute accent `\xB4` instead of the combining acute `\u0301`). – lenz Feb 18 '18 at 21:31
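
If the goal is a length that counts how much "space" the text needs, one option is to skip combining marks when counting. The sketch below is an approximation, not full grapheme-cluster handling: it treats every code point whose Unicode category starts with "M" as taking no space. `visible_length` is a hypothetical helper name, and the title is assumed to contain U+0107 followed by U+0301.

```python
import unicodedata

def visible_length(text):
    """Count code points that are not marks (Unicode categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

# The title from the question, with the assumed code points written as escapes.
title = "Krste Asanovi\u0107\u0301 ... [et al.]."

print(len(title), visible_length(title))  # 29 28
```

Note that `unicodedata.normalize("NFC", ...)` alone would not shorten this particular string, because no single precomposed character exists for a "c" carrying two acute accents. It would only help where a base letter plus one accent has a precomposed form.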
