3

I want to replace the ASCII/English characters in a file and keep the Unicode characters, in a Linux environment.

INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[चित्र:Youth-soccer-indiana.jpg|thumb|300px|right|बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है।]]\n\n\'\'\'खेल\'\'\', कई [[नियमों]] एवं [[रिवाजों]] द्वारा संचालित होने वाली एक [[प्रतियोगी]] गतिविधि है। \'\'खेल\'\' 

I have tried

~$ sed 's/[^\u0900-\u097F]/ /g' hi.text

but I get

sed: -e expression #1, char 23: Invalid range end

I also tried this, and it seems to work, but not fully:

sed 's/[a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' enwiki-latest-pages-articles-multistream_3.sql  >result.txt

Can anyone tell me how to get sed working with a Unicode range regex?

Thomas Dickey
gaurus
  • what do you mean by *seems to work but not fully*? – umläute Nov 12 '15 at 11:29
  • please simplify the problem. Consider posting 20 chars of mixed ASCII and Unicode and the required output from those chars. Do you want to delete the ASCII or, as your title says, "replace" it? One line of code shows a space char, the second shows no replacement char. Good luck. – shellter Nov 12 '15 at 11:54
  • Yes, I want to delete (replace with nothing) all the ASCII characters and retain only the Unicode Hindi words. The second regex I tried retains some special characters, which is not wanted. – gaurus Nov 12 '15 at 15:34
  • we already have your verbal description. We need to see samples! Help us visualize your problem by including sample inputs (well designed), the required output and your current code, as well as problems with your current output and any error messages. See http://stackoverflow.com/questions/33023436/awk-array-to-output-the-line-count-as-well-as-average for a good example (not quite your area of interest, but a very well organized question). Good luck. – shellter Nov 12 '15 at 16:00
  • Input :INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[चित्र:Youth-soccer-indiana.jpg|thumb|300px|right|बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है।]]\n\n\'\'\'खेल\'\'\', कत्पत्ति ==\n\"खेल\" (\"स्पोर्ट\") शब्द की [[पुराने फ्रेंच]] शब्द \'\'देस्पोर्ट (desport)\'\' से उत्पत्ति हुई है, जिसका अर्थ \"अवकाश\" है।\n\n== इतिहास ==\n\n[[चित्र:Greek statue discus thrower 2 century aC.jpg|thumb|150px|right|2 expected output चित्र बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है – gaurus Nov 13 '15 at 17:05
  • @user1516947: I've updated my answer with a perl implementation that does what you need. In the expected output I think you forgot to remove some symbols such as ``.`` and ``,``, and left out the Hindi words extracted from the ending part of the query (``खेल कत्पत्ति खेल स्पोर्ट शब्द की पुराने फ्रेंच शब्द देस्पोर्ट से उत्पत्ति हुई है जिसका अर्थ अवकाश है इतिहास चित्र``) – Giuseppe Ricupero Nov 15 '15 at 18:30

3 Answers

4

ASCII codes are in the range 0 to 127 inclusive. From that range, 0-31 and 127 are control characters. Non-ASCII characters encoded as UTF-8 use only bytes from the range 128 to 255 inclusive, so they never collide with ASCII.

Because sed is line-oriented, newline (code 10, control/J) is treated specially. Your file may include tab (code 9) and carriage return (code 13). But in practice you likely only care about tabs and printable ASCII.

Tilde (~) is code 126 (something handy to know).

So:

sed -e 's/[ -~\t]/ /g'

where \t is ASCII tab (and depending on implementation you may need a literal tab) will replace all of the printable ASCII with spaces, leaving newline and UTF-8 untouched.
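To make the byte-range argument concrete, here is a minimal sketch, assuming a POSIX `od` and `tr` are available. It first dumps the UTF-8 bytes of a single Devanagari letter, then deletes printable ASCII and tab byte by byte. Note that `tr` is swapped in here in place of sed, that it deletes the characters rather than replacing them with spaces, and that the helper name `strip_printable_ascii` is hypothetical.

```shell
# Dump the UTF-8 bytes of one Devanagari letter (U+0915).
# All three bytes are >= 0x80, i.e. outside the ASCII range.
printf 'क' | od -An -tx1

# Hypothetical helper: delete printable ASCII (octal 040-176) and tab (011)
# byte by byte, leaving newlines and UTF-8 sequences untouched.
strip_printable_ascii() {
    LC_ALL=C tr -d '\040-\176\011'
}

printf 'VALUES (2815829,खेल)\n' | strip_printable_ascii
```

Because `tr` works on bytes in the C locale, this should not depend on how a particular sed handles ranges in a UTF-8 locale.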

Thomas Dickey
2

PERL

If you don't mind using perl, try a mnemonic character class:

# this version replaces every ASCII character, newlines included, with a space
perl -pe 's/[[:ascii:]]/ /g;' filename

UPDATE: Using @user1516947's example, I've slightly modified the perl solution to collapse each run of ASCII chars into one space (and remove unwanted leading and trailing spaces):

perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g' filename

Command line usage example based on sample input:

echo "INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[चित्र:Youth-soccer-indiana.jpg|thumb|300px|right|बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है।]]\n\n\'\'\'खेल\'\'\', कत्पत्ति ==\n\"खेल\" (\"स्पोर्ट\") शब्द की [[पुराने फ्रेंच]] शब्द \'\'देस्पोर्ट (desport)\'\' से उत्पत्ति हुई है, जिसका अर्थ \"अवकाश\" है।\n\n== इतिहास ==\n\n[[चित्र:Greek statue discus thrower 2 century aC.jpg|thumb|150px|right|2" | perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g'

Output:

 चित्र बचपन का खेल एसोसिएशन फुटबॉल ऊपर दिखाया गया है एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है। खेल कत्पत्ति खेल स्पोर्ट शब्द की पुराने फ्रेंच शब्द देस्पोर्ट से उत्पत्ति हुई है जिसका अर्थ अवकाश है। इतिहास चित्र

(GNU) SED

Or in sed (in a Linux environment you have to set the LANG environment variable to C to make the sed range valid):

# this version does not replace newlines
LANG=C sed 's/[\d0-\d127]/ /g' filename

A less readable sed version that also replaces all newlines (but the last one):

LANG=C sed ':a;N;$!ba;s/[\d0-\d127]/ /g' filename
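One caveat worth sketching: if `LC_ALL` is already set in the environment it takes precedence over `LANG`, so `LANG=C` alone may have no effect. The variant below sets `LC_ALL` instead. It assumes GNU sed (the `\xHH` escapes are a GNU extension), and the helper name `ascii_to_spaces` is hypothetical.

```shell
# Hypothetical helper: force byte semantics even when LC_ALL is set in the
# environment, since LC_ALL takes precedence over LANG.
# Replaces every ASCII byte from 0x01 to 0x7F with a space; sed never sees
# the newline that ends each line, so line breaks are kept.
ascii_to_spaces() {
    LC_ALL=C sed 's/[\x01-\x7F]/ /g' "$@"
}

printf 'hi खेल\n' | ascii_to_spaces
```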
Giuseppe Ricupero
  • Making broad sweeping statements about `sed` is precarious because there are multiple incompatible versions, even just on Linux alone. I would stick to Perl for portability. – tripleee Nov 15 '15 at 17:15
  • @tripleee: you're right, I've edited the response to specify the sed implementation (GNU). In your experience, is that enough? – Giuseppe Ricupero Nov 15 '15 at 17:27
  • Yeah, definitely an improvement, though my vote goes to [Thomas' answer](http://stackoverflow.com/a/33670413/874188). – tripleee Nov 15 '15 at 18:04
  • @tripleee Thomas shows a deep knowledge of ASCII codes, but his solution does not work as is on Linux (the requested environment); it also does not remove the newlines. – Giuseppe Ricupero Nov 16 '15 at 10:56
  • Fair point, though I am not at all convinced that the OP *wants* the newlines squished. – tripleee Nov 16 '15 at 11:03
1

To get rid of the ASCII characters you can run sed over the byte range. sed works line by line and never sees the newline that ends each line, though, so if you want those gone too you need to pipe the result through tr afterward.

echo -e "hi ☠ \nthere ☠" | LANG=C sed "s/[\x01-\x7F]//g" | tr -d '\n'
☠☠

Conversely, if you want to get rid of the non-ASCII characters, you can specify the high-byte range (0x80-0xFF) instead:

echo -e "hi ☠ \nthere ☠" | LANG=C sed "s/[\x80-\xFF]//g"
hi
there