Unicode category for commas and quotation marks

Question

I have this helper function that gets rid of control characters in XML text:

import unicodedata

def remove_control_characters(s):  # Remove control characters in XML text
    t = ""
    for ch in s:
        if unicodedata.category(ch)[0] == "C":
            t += " "  # replace control characters with a space
        elif ch == "," or ch == "\"":
            continue  # drop commas and double quotes
        else:
            t += ch
    return t

I would like to know whether there is a Unicode category that covers quotation marks and commas, so that I can exclude them the same way.

This question is incomplete. Generally speaking, you can have quotation marks and commas in JSON data. I regularly pass XML documents as part of JSON data structures. So you should show the input to your function, and show how you use the output in a way that produces invalid JSON. – Louis, Jun 29 '16 at 14:12
The function takes a string, and the output is what I expect it to be. But what I want to know is whether there is a Unicode category for commas and quotation marks. – SANBI samples, Jul 04 '16 at 07:14

Xander · Answer 1 · 2016-07-04T19:23:14.603

1

In Unicode, the general category of control characters is 'Cc', even if they have no name. unicodedata.category() returns the general category, as you can test for yourself in the Python console:

>>> unicodedata.category(unicode('\00'))
'Cc'

For commas and quotation marks, the categories are Pi and Pf. You only test the first character of the returned code in your example, so try instead:

 cat = unicodedata.category(ch)
 if cat == "Cc" or cat == "Pi" or cat == "Pf":

edited Jul 04 '16 at 19:23

answered Jul 04 '16 at 19:10

Xander


This only works if I use `cat == "P"`. It seems like Python does not recognize the second subcategory letter. – SANBI samples Jul 06 '16 at 07:01
Comma is in Punctuation Other category: Po `002C;COMMA;Po;0;CS;;;;;N;;;;;` – Dmitry Jul 06 '16 at 08:00
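
The two comments above can be checked directly in Python 3. This is a minimal sketch, not part of the original thread: `unicodedata.category()` returns the full two-letter code, so a comparison against "P" can only match the first letter of that code. The sample characters and the helper name `is_punctuation` are chosen here for illustration.

```python
import unicodedata

# Sample characters picked for illustration: ASCII comma, ASCII double quote,
# and the curly left/right double quotation marks.
samples = [",", '"', "\u201c", "\u201d"]

categories = {ch: unicodedata.category(ch) for ch in samples}
# ',' and '"' report 'Po'; the curly quotes report 'Pi' and 'Pf'.

def is_punctuation(ch):
    """True for any punctuation subcategory (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")
```

So a test for `"Pi"` or `"Pf"` should match curly quotes but not the ASCII comma or double quote, while a first-letter test matches every kind of punctuation.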

Dmitry · Answer 2 · 2016-07-06T08:45:06.643

Based on the latest Unicode data file (UnicodeData.txt):

Comma and quotation mark are both in the Punctuation, Other category (Po):

002C;COMMA;Po;0;CS;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;

So, based on your question, your code should be something like this:

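import unicodedata

def remove_control_characters(xml):
    # Drop every 'Po' character; replace control characters ('Cc') with a space.
    o = [c if unicodedata.category(c) != 'Cc' else ' '
         for c in xml if unicodedata.category(c) != 'Po']
    return "".join(o)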
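
One caveat with filtering on 'Po': that category also contains characters such as `.`, `!`, `?` and `:`, so they would be removed too. If only commas and double quotes should go, an explicit set may be safer. The following is a sketch under that assumption; the function name `strip_commas_and_quotes` and the `REMOVE` set are hypothetical and not taken from the question.

```python
import unicodedata

REMOVE = {",", '"'}  # hypothetical: exactly the characters the question names

def strip_commas_and_quotes(text):
    out = []
    for ch in text:
        if ch in REMOVE:
            continue                      # drop commas and double quotes
        if unicodedata.category(ch) == "Cc":
            out.append(" ")               # replace control characters with a space
        else:
            out.append(ch)
    return "".join(out)
```

With this version a full stop in the input is kept, whereas the 'Po' filter would drop it.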

If you want to find out the category of any other Unicode character and do not want to deal with the UnicodeData.txt file, you can just print it: `print(c, unicodedata.category(c))`
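
A slightly fuller version of that idea, as a sketch for Python 3: the helper name `describe` is made up here, and `unicodedata.name()` is given a default value because control characters have no name and the call would otherwise raise `ValueError`.

```python
import unicodedata

def describe(text):
    """Return (character, category, name) for each character in text."""
    return [(c, unicodedata.category(c), unicodedata.name(c, "<no name>"))
            for c in text]

# Inspect a comma, a double quote and a tab (a control character).
for c, cat, name in describe(',"\t'):
    print(repr(c), cat, name)
```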
