
I am trying to grep for the hexadecimal values of a range of UTF-8 encoded characters, and I want only that specific range of characters to be returned. I currently have this:

grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt

But this returns every character that has any of those hex values anywhere in its hex representation, i.e. it returns everything from 00B9 to FFB9 as long as B9 is present.

Is there a way to tell grep that I want only the exact hex value range I search for?

Sample Input:

STRING_OPEN
Open
æ–­å¼€
Ouvert
Abierto
Открыто
Abrir

With my grep statement, it should return the 3rd and 6th lines, but it also returns some Russian and Chinese text in my file, because the encoded ranges for those languages include the hex values I'm searching for, like these:

断开
Открыто

Unfortunately, I can't give out more sample input, as it's work-related.

EDIT: Actually the below code snippet worked!

grep -P  -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt

It found all the corrupted characters, and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–­å¼€, and in the text file it is shown as 断开.

user2056389
  • Add a sample of your input and expected output to your question, then it'll be easier for us to help you. – Tom Fenech Jun 30 '15 at 14:39
  • Maybe you can use `tr` to delete all characters outside your desired range. Use `tr` with `-c` to get complement of characters in your range and `-d` to delete them. – Mark Setchell Jul 01 '15 at 22:32

1 Answer


Since you're using -P, you're probably using GNU grep, because that option is a GNU grep extension. Your command works with GNU grep 2.21, PCRE 8.37, and a UTF-8 locale; however, there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or your locale may be set to one that uses single-byte characters.

If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.

LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
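
To see why the C2 lead byte matters, here is a minimal sketch that compares the loose byte class with the lead-byte pattern on a small made-up file. The temp file and its three lines are assumptions for illustration, not the asker's data. It uses LC_ALL=C instead of LC_CTYPE=C because LC_ALL, if it happens to be set in the environment, takes precedence over LC_CTYPE; the -a flag is an extra precaution in case a given grep build would treat high bytes as binary data.

```shell
#!/bin/sh
# Hypothetical sample: a temp file created only for this demo.
sample=$(mktemp)
printf 'Open\n' > "$sample"
# Line 2 contains U+00BC, which UTF-8 encodes as the bytes C2 BC.
printf 'one quarter: \302\274\n' >> "$sample"
# Line 3 is the Cyrillic word from the question; its UTF-8 bytes include BA and BE.
printf '\320\236\321\202\320\272\321\200\321\213\321\202\320\276\n' >> "$sample"

# Loose byte class: matches any byte B9..BF, so the Cyrillic line is a false positive.
loose=$(LC_ALL=C grep -a -P -n '[\xB9-\xBF]' "$sample" | cut -d: -f1 | paste -sd, -)

# Lead byte C2 followed by B9..BF: matches only U+00B9..U+00BF.
strict=$(LC_ALL=C grep -a -P -n '\xC2[\xB9-\xBF]' "$sample" | cut -d: -f1 | paste -sd, -)

echo "loose pattern matched lines:  $loose"
echo "strict pattern matched lines: $strict"
rm -f "$sample"
```

With a GNU grep built with PCRE support, the loose pattern should report lines 2 and 3, while the strict pattern should report only line 2.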
mark4o
  • I tried this and set the LC_CTYPE=C at the top of my script and it didn't return anything. – user2056389 Jul 01 '15 at 13:42
  • If you are setting `LC_CTYPE=C` earlier in the script (not on the same line) then you'll also need `export LC_CTYPE`. – mark4o Jul 03 '15 at 05:17
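
The difference between the two ways of setting the variable can be checked with a short sketch. The child `sh -c` process stands in for grep here, and the variable names seen_plain, seen_prefix and seen_exported are made up for the demo. Note also that if LC_ALL is set in the environment it overrides LC_CTYPE, which is another possible reason for the locale not taking effect.

```shell
#!/bin/sh
# Start from a clean state so the variable is neither set nor exported.
unset LC_CTYPE

# Plain assignment on its own line: a shell variable only, not passed to children.
LC_CTYPE=C
seen_plain=$(sh -c 'printf %s "${LC_CTYPE:-unset}"')

# Assignment as a prefix on the same line: placed in that one command's environment.
unset LC_CTYPE
seen_prefix=$(LC_CTYPE=C sh -c 'printf %s "${LC_CTYPE:-unset}"')

# Assignment followed by export: inherited by every later child process.
LC_CTYPE=C
export LC_CTYPE
seen_exported=$(sh -c 'printf %s "${LC_CTYPE:-unset}"')

echo "plain assignment:    child sees $seen_plain"
echo "same-line prefix:    child sees $seen_prefix"
echo "assignment + export: child sees $seen_exported"
```

Only the plain assignment leaves the child process without the setting, which matches the behaviour described in the comment above.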