9

I have a problem.

For example, I have a sentence:

s = "AAA? BBB. CCC!" 

So, I do:

import string
table = str.maketrans('', '', string.punctuation)
s = s.translate(table)

And it's all right. My new sentence will be:

s = "AAA BBB CCC"

But, if I have input sentence like:

s = "AAA? BBB. CCC! DDD.EEE"

after removing punctuation with the same method as above, I'll have

s = "AAA BBB CCC DDDEEE"

but I need:

s = "AAA BBB CCC DDD EEE"

Are there any ideas/methods for how to solve this problem?

ctrlaltdel

8 Answers

8

string.punctuation contains the following characters:

'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'

You can use the translate and maketrans functions to map punctuation characters to empty values (i.e., remove them):

import string

'AAA? BBB. CCC! DDD.EEE'.translate(str.maketrans('', '', string.punctuation))

Output:

'AAA BBB CCC DDDEEE'
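
A variation on the same maketrans/translate idea (not part of the original answer) maps each punctuation character to a space instead of removing it, then splits and rejoins to collapse the resulting double spaces, which also handles the 'DDD.EEE' case:

import string

# Assumption: every punctuation character can simply become a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
' '.join('AAA? BBB. CCC! DDD.EEE'.translate(table).split())

Output:

'AAA BBB CCC DDD EEE'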
Vlad Bezden
5

Try this code:

import re

input_str = "AAA? BBB. CCC! DDD.EEE"
output_str = re.sub('[^A-Za-z0-9]+', ' ', input_str)
print(output_str)

AAA BBB CCC DDD EEE

Bharat Jogdand
  • Same problem as the one @casualcoder pointed out above: multiple spaces between words. – Luv Dec 07 '18 at 07:20
  • Use this to cover some edge cases `r'[^\w]+(\s+|$)'`, and add `.strip()` otherwise you'll have an extra space if the last word ends with a punctuation character. – Burhan Khalid Dec 07 '18 at 07:31
  • Note: this way you also remove special characters like ß and characters with diacritics, like ü, ú, ..., so depending on your language this is not a good option. – Laura Corssac May 15 '20 at 13:55
4

You can also do it like this:

punctuation = "!@#$%^&*()_+<>?:.,;"  # add whatever you want

s = "AAA? BBB. CCC!" 
for c in s:
    if c in punctuation:
        s = s.replace(c, "")

print(s)

>>> "AAA BBB CCC"
alpharoz
2

Use:

import re

" ".join(re.split('\W+', s))

That splits the string on all non-word characters, then joins the individual substrings by single spaces.

9769953
  • Actually, in Python, \W includes word forming characters as well as non-word characters. – Andj Mar 18 '23 at 17:42
  • @Andj Interesting. But what do you mean by "word forming characters" here? – 9769953 Mar 18 '23 at 19:48
  • the Unicode definition of `\w` is `[\p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}\p{Join_Control}]`. The re module defines it as most of \p{alpha}, all of \p{digit} and one of \p{gc=Connector_Punctuation}. So all Marks and Join Controls are stripped. – Andj Mar 18 '23 at 23:40
  • Take pattern `pattern = re.compile(r'[^\w]', re.U)` and then `re.sub(pattern, "", text)` using `text='မြန်မာစကား'` you would get the result `'မနမစက'`. All medial consonants and dependent vowels are stripped out. Personally, I would consider medial consonants and dependent vowels as word forming characters. Also try `text = unicodedata.normalize("NFD", "français")`; this results in `'francais'`. So, combining diacritics are treated by re as non-word forming characters. – Andj Mar 18 '23 at 23:47
  • Obviously, you could have had `pattern = re.compile(r'[\W]')` instead of `pattern = re.compile(r'[^\w]')`. The results are the same. Also, the regex flag `re.U` doesn't really add anything on my platform. – Andj Mar 19 '23 at 00:03
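
The NFD example from these comments can be reproduced in a couple of lines (a sketch added here, not part of the original thread); it shows that re's \w does not treat combining marks as word characters:

import re
import unicodedata

# NFD decomposes "ç" into "c" plus a combining cedilla (a Mark character).
text = unicodedata.normalize("NFD", "français")
# The combining cedilla is matched by [^\w] and stripped out.
print(re.sub(r'[^\w]', '', text))  # francais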
1

This is one approach using str.strip and a simple iteration.

Ex:

from string import punctuation

s = "AAA? BBB. CCC! DDD.EEE"

def cleanString(strval):
    # Strip leading/trailing punctuation, then replace any remaining
    # punctuation character (e.g. the "." in "DDD.EEE") with a space.
    return "".join(" " if i in punctuation else i for i in strval.strip(punctuation))

s = " ".join(cleanString(i) for i in s.split())
print(s)

Output:

AAA BBB CCC DDD EEE
Rakesh
0

Check this out:

if __name__ == "__main__":
    test_string = "AAA? BBB. CCC! DDD.EEE"
    result = "".join((char if char.isalpha() else " ") for char in test_string)
    print(result)


Result: AAA  BBB  CCC  DDD EEE
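
The double spaces appear because both the punctuation mark and the neighbouring space are turned into spaces; if single spacing is needed, splitting and rejoining (an addition, not part of the original answer) normalizes it:

print(" ".join(result.split()))  # AAA BBB CCC DDD EEE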
Optimus
0

Try this:

import string
# Remove all punctuation except ".", which is later replaced by a space
# so that "DDD.EEE" becomes "DDD EEE".
exclude = set(string.punctuation)
exclude.remove(".")
doc = "AAA? BBB. CCC! DDD.EEE"
for punctuation in exclude:
    doc = doc.replace(punctuation,"")
doc = doc.replace("."," ")
doc = doc.split()
print(" ".join(doc))
0

I know not everyone has this situation, but I am writing an internationalized app and it's a bit of a heavier lift. This is what I have come up with:

[Edited to add 'import regex'] - Thanks Andj

import regex

random_string = "~`!ќ®†њѓѕў‘“ъйжюёф №%:,)( ЛПМКё…∆≤≥“™ƒђ≈≠»"

clean_string = regex.sub( r'[^\w\s]', '', random_string )

print( clean_string )

Result is:

ќњѓѕўъйжюёф  ЛПМКёƒђ

This works with a wide range of alphabets and special characters in many languages. I've tested it on several languages with every special character and a few regular characters on that keyboard. I still need to strip out a few special markers this won't detect.

Straightforward but powerful. Hope that helps someone.

horace
  • if you are using `r'[^\w\s]'` as a pattern use the regex module, rather than the re module. Your regex substitution will strip out important word forming characters for many languages and scripts. `\w` and `\W` with the re module are quite dangerous if your code needs to be able to support any language. – Andj Mar 18 '23 at 17:48
  • Please elaborate. One of the features of the app I'm writing is to reduce errors in file name generation. – horace Mar 19 '23 at 05:59
  • it is a complex topic, and it really depends on the languages and writing systems/scripts you need to support. For instance, re's \w metacharacter supports most (but not all) of \p{alpha}, all of \p{digit}, only one of \p{gc=Connector_Punctuation}, none of \p{gc=Mark}, and none of \p{Join_Control}. The core issue is around Marks: all marks are treated as non-word forming characters. So combining diacritics, dependent vowels in South Asian and South-East Asian scripts, Arabic marks, etc. are all not matched by \w. NFD data has big problems with \w. – Andj Mar 19 '23 at 06:53
  • Take a file name like kɔ̈ɔ̈r.png (I chose it because I do have a file of that name on my computer). It uses two combining diacritics. It is the same in NFC, NFKC, NFKC_CF, NFD, and NFKD. Your regex pattern above would make the following transform: kɔ̈ɔ̈r -> kɔɔr. More problematically, for a file မြန်မာစာ.md, မြန်မာစာ would transform to မနမစ, i.e. completely unrecognisable given the original filename. If you are restricted to LCG life is somewhat easier, but NFD or NFKD will cause grief. – Andj Mar 19 '23 at 07:25
  • Thanks @Andj. It's not at all certain what languages this app will be used with. The regex module is a simple fix. There is probably still something lurking under the covers. Only time will tell. – horace Mar 20 '23 at 01:39
  • if you want to future proof yourself, try the regex module's implementation of \w and \W. Better for your purposes than the re module. With Python the devil is in the detail. Unfortunately the detail doesn't end up in the documentation. – Andj Mar 20 '23 at 04:14
  • That is precisely what I have done. If I understand you correctly, I have simply replaced 're' with 'regex'. I have looked a little into regex and it seems a better implementation. – horace Mar 20 '23 at 16:36
  • sorry for the confusion, not all of my comment was submitted. The missing piece was that you also need to be aware that regex is likely to be using a different version of Unicode to re, and to consider the difference between the re and regex implementations of \w and \W: whether your usage scenario would benefit from the regex version of \w, a tailoring somewhere between re and regex, or a superset containing the regex \w and some additional characters. – Andj Mar 22 '23 at 00:08
  • I'd be inclined to use regex with \w and add hyphen and period to it. – Andj Mar 22 '23 at 00:28
  • I work with multilingual data, potentially in any language. I tend to use virtual environments targeted at a specific version of Python. That means I am also targeting a specific version of Unicode, since each version of Python is using a different version of Unicode. To ensure the widest compatibility and handle edge cases, I install the most recent version of regex built against the version of Unicode I am using. If the version of icu4c is more recent, I use UnicodeSet notations to allow me to target the character repertoire of a specific version of Unicode. – Andj Mar 22 '23 at 00:35
  • Thank you for your expertise. I have just started working with multilingual text. The text will be added into a database, but is primarily used for naming files instead of users being forced to use ASCII (not even unicode) text. My filename "cleanser" has to remove any punctuation and leave plausible text (Unicode) to be used in the file name. Training people to use ASCII became increasingly unrealistic as we delve deeper into non-latin scripts so I was tasked with writing a full-blown video file management system for our workflow. – horace Mar 22 '23 at 15:29
  • I have also backburnered creating a venv for my app distro so I can depend on availability of third-party software. Eventually, that beast will rear its ugly head and I will have to deal with it. – horace Mar 22 '23 at 15:30
  • hope your project goes well. I know for Catalan and Valencian support I had to add punt volant as a word forming character rather than as punctuation. Over time, you build up more and more exceptions as new languages are added. But often I am on the edge: some of the code relates to Unicode proposals, so I am working with characters we are still trying to get added to Unicode. – Andj Mar 23 '23 at 00:38
  • Thanks! So far so good. Tested it on a low-hanging-fruit project for Russian. Went pretty well except they kept switching keyboards from Cyrillic to Latin. Just remember C != C – horace Mar 23 '23 at 21:17
  • test against \p{Cyrillic} if you expect everything in Cyrillic. You can either just throw an error, issue a warning, or transliterate Latin to Cyrillic. – Andj Mar 23 '23 at 23:09
  • If I understand your comment, \p{Cyrillic} is a literal expression for the regex module. I'll have a look. But this gets sticky as we have several languages now. – horace Mar 24 '23 at 16:57
  • Yes, it is an expression for regex; some engines also use the Posix notation [:Cyrillic:]. It can also be used in Unicode Set notation. If I understand your use case, some filenames will have characters from the incorrect script due to a keyboard change. It's difficult to address. The trick is detecting the ones in this situation. Then determine the dominant script (the most frequently occurring) and use that as your default. In the example you gave, you could have had three scripts in the filename: Latin, Cyrillic and Common (numbers, punctuation, etc.) – Andj Mar 24 '23 at 21:38
  • I've started a q/a thread here: https://stackoverflow.com/questions/75847941/regex-for-unicode-text-to-filter-characters-out-for-file-names – horace Mar 27 '23 at 04:29
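
A rough sketch of the approach suggested in these comments (the regex module's Unicode-aware \w, with hyphen and period added for file names); the function name and sample input are illustrative only, not part of the original discussion:

import regex

def clean_filename(name):
    # Keep word characters (regex's Unicode-aware \w), whitespace, hyphens and periods;
    # drop everything else, then collapse runs of whitespace.
    kept = regex.sub(r'[^\w\s.-]', '', name)
    return ' '.join(kept.split())

print(clean_filename('report: draft*1 (final).txt'))  # report draft1 final.txt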