The control characters I'm talking about are listed here: http://ascii.cl/control-characters.htm

I need each control character as a single-character entity, not as its ASCII code or the plain-text name of its symbol.

See below: [screenshot a] [screenshot b]

As shown above in both the Sublime and Notepad text editors, I need the actual symbols, not their ASCII codes. So I need the characters as shown in the second invalid_chrs_list.

Is there a way to get these symbols, a file somewhere online, or a site that I can copy paste them from?

Edit:

#Invalid characters ascii codes here (http://ascii.cl/control-characters.htm)
#invalid_chrs_list = [0,1,2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26,27] # ascii
#invalid_chrs_list = ['', ''] # real characters for ASCII codes 3 and 17 - NEED THE REST - Can't post these characters into stackoverflow so just pretend they're there like in my screenshot.
invalid_chrs_list = ['\x00','\x01','\x02','\x03','\x04','\x05','\x06','\x07','\x08','\x10','\x11','\x12','\x13','\x14','\x15','\x16','\x17','\x18','\x19','\x1a','\x1b'] # escaped

with open(file, 'rb') as f:
    # Iterate through the rows
    for row in f:
        # Catch invalid characters
        for char in row:
            if char in invalid_chrs_list: # <--- MAKE THIS FASTER
                print ('found')
                break

An alternate for loop, which would be faster if the check worked:

for char in invalid_chrs_list:
    if char in row:
        print('found')
        break

I've tried using ord(char) and chr(char) in the `if char in invalid_chrs_list:` check against each of the lists, but I'm not sure how to compare them to each other to verify a match.

Edit - Solution: The list in the code below is the correct list; it is not necessary to use the literals I showed in my images.

I was looking in the wrong place for the answer. Thank you to @Peteris for pointing me in the right direction.

I needed either to switch the file mode to text ('r'), or to encode the character I'm checking with char.encode() so that it is compared as bytes against the bytes that binary mode returns. In my case I need to open the file in binary mode, so I went with char.encode().

    invalid_chrs_list = ['\x00','\x01','\x02','\x03','\x04','\x05','\x06','\x07','\x08','\x10','\x11','\x12','\x13','\x14','\x15','\x16','\x17','\x18','\x19','\x1a','\x1b']

    with open('test.txt', 'rb') as f:
        # Iterate through the rows
        for row in f:
            for char in invalid_chrs_list:
                if char.encode() in row:
                    print('found')
                    break
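
Since the original goal was speed, one possible variation on the loop above is to let a precompiled bytes pattern scan each row once instead of testing 21 characters one at a time. This is only a sketch, assuming Python 3; the names `INVALID_BYTES`, `row_has_invalid` and `file_has_invalid` are made up for illustration and are not part of the original code.

```python
import re

# Same byte values as invalid_chrs_list: 0-8 and 16-27 (0x10-0x1b).
# The pattern is a bytes pattern, so it matches rows read in 'rb' mode.
INVALID_BYTES = re.compile(rb'[\x00-\x08\x10-\x1b]')

def row_has_invalid(row):
    """Return True if the bytes object `row` contains any listed control byte."""
    return INVALID_BYTES.search(row) is not None

def file_has_invalid(path):
    """Return True as soon as any row of the file contains a listed control byte."""
    with open(path, 'rb') as f:
        for row in f:
            if row_has_invalid(row):
                return True
    return False
```

Calling `file_has_invalid('test.txt')` would stop at the first offending row. Whether it is actually faster than the nested loop depends on the data, so it is worth timing on a real file.
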
Wafer
  • Can't you just copy the "Symbols" column shown at the linked website? Here they all are, if not: `NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, TAB, LF, VT, FF, CR, SO, SI, DLE, DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EM, SUB, ESC, FS, GS, RS, US`. However, I know of no way you could make them into single-character entities/symbols that could be pasted into a text document. You might be able to find a font somewhere that has them in it, but again, it's unclear how that could be used the way you want. – martineau Feb 24 '17 at 19:08
  • Those are not symbols, those are 2-3 characters long. I need instances of these characters that are 1 character in length. – Wafer Feb 24 '17 at 20:06
  • Unicode has code points for these characters, which I believe implies they could be put into Python Unicode strings, but they would likely require more than one byte each in the string. If strings containing them were displayed on a device that used a font containing glyphs for them, they would appear as you describe. See the answer to [**_Font for representing Unicode non‐printable characters_**](http://graphicdesign.stackexchange.com/a/57822/42230). – martineau Feb 24 '17 at 22:42
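
The code points mentioned in the last comment live in Unicode's Control Pictures block, where U+2400 through U+241F are visible stand-ins for control codes 0 through 31. A small sketch of how they could be produced in Python 3 follows; `control_picture` is a hypothetical helper name. Note that these glyphs are display symbols only: they are different characters from the control characters themselves, so they would not match anything when searching a file.

```python
import unicodedata

def control_picture(code):
    """Return the Control Pictures glyph that depicts control code `code` (0-31)."""
    if not 0 <= code <= 0x1F:
        raise ValueError("expected a control code in the range 0-31")
    return chr(0x2400 + code)

etx_symbol = control_picture(3)
# One visible character named SYMBOL FOR END OF TEXT
print(etx_symbol, len(etx_symbol), unicodedata.name(etx_symbol))
# It is a different character from the real ETX control character
print(etx_symbol == '\x03')  # False
```
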

1 Answer

Make a tiny program that simply converts the ASCII codes you want into bytes and writes those bytes to a file?

But I'd bet that you don't really want to copy/paste them as literal characters into your code. It can't work that way for, e.g., the newline character and others; ASCII codes or escaped representations are the proper way to go.
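
A minimal sketch of such a tiny program, assuming Python 3 and the same codes as the question's list (0-8 and 16-27). The function name `write_control_bytes` and the output filename in the usage note are hypothetical.

```python
# ASCII codes from the question's list: 0-8 and 16-27.
CODES = list(range(0, 9)) + list(range(16, 28))

def write_control_bytes(path):
    """Write one raw byte per code in CODES to `path`; return the byte count."""
    # bytes() turns an iterable of ints (0-255) into the raw bytes themselves.
    data = bytes(CODES)
    with open(path, 'wb') as f:
        return f.write(data)
```

For example, `write_control_bytes('control_chars.bin')` would produce a 21-byte file holding the raw control characters, which an editor may or may not display as symbols.
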

Peteris
  • I do; as you can see, I have started the list of characters already. And I do not have newline in my list of ASCII codes because I don't need that specific one. I need to check whether a file contains any of these characters, and it takes a lot longer to convert every character in the file into its ASCII code and compare it against my list than it would to compare the single-character entities against every character in the file without any converting. – Wafer Feb 24 '17 at 20:08
  • @Wafer you don't need to convert every character in the file into its ascii code, but the reasonable way would be to have a list of ascii codes and then have your program first convert it to a list of the appropriate characters and then do everything as if you had a "manually written" list. And why wouldn't '\x03' work for you, as you already have that representation but commented it out? The result of '\x03' *is* a single character with ascii code 3. – Peteris Feb 24 '17 at 20:34
  • I added the code I'm trying to get working faster. I can't match \x03 to the symbol for ETX, at least not with any method that I've found yet. – Wafer Feb 24 '17 at 21:17
  • checking `if '\x03' in the row` does not work because it isn't `\x03` it's a single character `ETX` - with a length of one. – Wafer Feb 24 '17 at 22:30
  • 1
    @Wafer '\x03' *is* a string literal that (after compilation) is a single character ETX. Try print(len('\x03')) or print('\x03' == chr(3)) or refer to python spec on string literals at https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals – Peteris Feb 24 '17 at 23:15
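
The points made in these comments can be checked directly. The sketch below assumes Python 3; `codes` and `invalid_chars` are illustrative names. It builds the character list from the ASCII codes with `chr()`, confirms that an escape such as `'\x03'` is a single character, and shows why the comparison failed in binary mode: a file opened with `'rb'` yields `bytes`, and iterating over `bytes` yields integers rather than one-character strings.

```python
# Build the list of characters from the ASCII codes instead of typing literals.
codes = list(range(0, 9)) + list(range(16, 28))
invalid_chars = [chr(c) for c in codes]

# The escape '\x03' is already the single ETX character.
assert len('\x03') == 1
assert '\x03' == chr(3)
assert invalid_chars[3] == '\x03'

# A row read in binary mode is bytes, and iterating over it gives ints,
# so comparing its items against one-character strings never matches.
row = b'abc\x03def'
assert list(row)[3] == 3
assert 3 in row              # an int can be tested against bytes
assert b'\x03' in row        # so can a bytes literal
assert '\x03'.encode() in row  # or an encoded str, as in the accepted fix
```
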