
I have a list of files offloaded from oceanographic instruments. For some reason, a non-ASCII character is occasionally inserted where an ASCII character should be. I have found an E with a grave accent (È) where there should be a W denoting the western hemisphere in the longitude records.

Here's what the data looks like:

CUMSECS Date UTC    Time UTC    Date Local  Time local  Z (m)   Target Z    Z Bot   Temp    PAR Salin   Ang VelX    Ang VelY    Ang VelZ    Pump +  Pump -  Gctr    Fix secs    Date UTC    Time UTC    Date Local  Time Local  Lat LatD    Latm        Lon LonD    Lonm        DOP Temp    PAR Salin   Batt V      CMD secs    Date Local  Time Local  No. Cmds
526068034   09/01/16    18:00:34    09/01/16    11:00:34     3.75    2.69    3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068039   09/01/16    18:00:39    09/01/16    11:00:39     3.75    2.69    3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068044   09/01/16    18:00:44    09/01/16    11:00:44     3.74    2.69    3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068049   09/01/16    18:00:49    09/01/16    11:00:49     3.73    2.69    3.75     0.29    0.000000    0.00   -30732  13588   31909   60399   7538    -82
543622771   03/23/17    22:19:31    03/23/17    15:19:31    38.31877    38  19.1262 N   123.07136   123  4.2812 È   23.6    115.06     0.0000   96.00   121.718
547764151   05/10/17    20:42:31    05/10/17    13:42:31     0.03   16.00   127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0
547764152   05/10/17    20:42:32    05/10/17    13:42:32     0.00   16.00   127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0

I can find the non-ASCII characters using the following Bash line:

pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt

I would like to loop through a series of files, find these characters, and replace them with a 'W' so that I can subsequently read the files into R and process them en masse. Alternatively, a workaround for the error R returns when trying to read these files ("multibyte string in location...") would be equally effective for my purposes. Any help much appreciated.
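For reference, the check I can already run over a batch of files looks roughly like this (the *.txt glob is only a stand-in for how my files are named); it's the replacement step that I'm missing:

# Report every line containing a non-ASCII character, file by file
for f in *.txt; do
    echo "== $f =="
    pcregrep -n '[^\x00-\x7F]' "$f"
done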

Connor Dibble
  • I tried `pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt | sed 's/[^\x00-\x7F]/W/g'` but that returns an error on the sed call for an illegal byte sequence – Connor Dibble Jun 20 '17 at 22:48
  • Have you tried to change the `fileEncoding` argument of `read.table`? – Scarabee Jun 20 '17 at 22:48
  • I have tried the fileEncoding and Encoding routes in R (explicitly calling it latin1 or utf8), but to no avail. My understanding of the encoding issues may be limited, but as far as I can tell it's not really an encoding problem. Perhaps I'm wrong; any ideas? – Connor Dibble Jun 20 '17 at 22:54
  • `cat | tr 'È' 'W'` – Jack Jun 20 '17 at 23:28
  • So I never could get the tr method to work; it always returns an "illegal byte sequence" error. But I used iconv in the fashion suggested by Kind Stranger, which was successful. In the end, I did not replace the characters, but was able to get the encoding recognizable by R so that I can batch process files where those little multibyte characters are hidden. If anyone has any ideas on how to actually replace the characters (or why I am getting such an error in a macOS Bash terminal session), that would help me make my code more robust. For now, my research remains in one hemisphere. – Connor Dibble Jul 24 '17 at 22:51

2 Answers


I think the problem is that È in UTF-8 is a multibyte character consisting of the two bytes \xc3 and \x88, and sed can't seem to deal with that for whatever reason. As @Jack suggested, tr might be a better tool for the job (tested in Bash on Windows, which doesn't have pcregrep, hence the plain grep -P below):

user@PC:~$ grep -P '[^\x00-\x7f]' 170510_ocean_Copepod.txt | tr 'È' 'W'
543622771   03/23/17    22:19:31    03/23/17    15:19:31    38.31877    38  19.1262 N   123.07136   123  4.2812 WW   23.6    115.06     0.0000   96.00   121.718

Notice that it converts each of the two bytes to a separate W, so you get WW rather than a single W.
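If you want a single W instead, and tr or sed complain about an illegal byte sequence (which the BSD versions on macOS tend to do with multibyte input), a byte-oriented perl substitution is one possible alternative. This is only a sketch, not tested on your files; it edits in place, keeps a .bak backup, and collapses any run of non-ASCII bytes into one W:

# Replace each run of non-ASCII bytes with a single 'W', editing in place
LC_ALL=C perl -i.bak -pe 's/[^\x00-\x7F]+/W/g' 170510_ocean_Copepod.txt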

Another method could be to convert the whole file using iconv. iso-8859-15 (Latin-9) is one example of a single-byte character encoding. The command to convert the file using iconv would be:

iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
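To batch-convert a whole directory before reading the files into R, a loop along these lines should do it (again, just a sketch: the *.txt glob and the _latin9 suffix are placeholders):

# Convert each file from UTF-8 to single-byte Latin-9, writing a new copy
for f in *.txt; do
    iconv -f utf-8 -t iso-8859-15 -o "${f%.txt}_latin9.txt" "$f"
done

R's read.table should then be able to read the converted copies, possibly with fileEncoding = "latin1" set.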
Kind Stranger
  • Another option could be to use `iconv` to convert the file encoding before reading it in R – Kind Stranger Jun 21 '17 at 00:00
  • Looks like the shell approach with tr will work, but I'm curious about the encoding as well. Do you know what encoding I could convert to that would not contain any multibyte characters and can be subsequently read into R? Thanks for your useful suggestions. – Connor Dibble Jun 21 '17 at 23:42
  • It looks like the tr approach is getting hung up as well. I get an error: `tr: Illegal byte sequence` whether using `cat | tr 'È' 'W'` or `pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt | tr 'È' 'W'`. If I use the cat approach it prints out to the line where the È is before returning the error. – Connor Dibble Jun 21 '17 at 23:46
  • @SeaSpider, have added detail regarding `iconv` – Kind Stranger Jun 22 '17 at 08:46

You can use sed to replace È with W:

sed 's/È/W/g' 170510_ocean_Copepod.txt
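To apply this across a batch of files in place, a loop like the following should work with GNU sed (only a sketch; the *.txt glob is a placeholder, and on macOS the BSD sed needs an explicit backup suffix after -i and may raise the "illegal byte sequence" error mentioned in the comments unless the locale is forced to C or a tool such as perl is used instead):

# Replace every È with W in place across a set of files (GNU sed syntax)
for f in *.txt; do
    sed -i 's/È/W/g' "$f"
done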
zombic