Remove lines containing non-ASCII characters from a file in Perl

Question

I have a file of approximately 12,000 lines that is generated every 6 hours. Some of these lines contain non-ASCII characters.

I would like to run a Perl script that removes every line containing non-ASCII characters.

@tchrist - Not sure about OP's context, but one example is when the file needs to be loaded into software that barfs at non-ASCII, and the business requirements don't mind losing lines (e.g. loading a partial file is better than none) but do mind the mangled lines that would result from deleting or encoding non-ASCII characters (e.g. the file format is position based). This is a VERY realistic scenario; I have had to do it in my job. – DVK Dec 04 '10 at 21:07

score 6 · Answer 1 · edited Jan 29 '16 at 16:21

You can do:

perl -i.bak -ne 'print unless(/[^[:ascii:]]/)' file

Regex explanation for /[^[:ascii:]]/:

/ start of regular expression
[ start of character class
^ make this a negative character class (a class that matches anything besides what is listed)
[:ascii:] any ASCII character
] end of character class
/ end of regular expression
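
For anyone who prefers a script file to a one-liner, the same filter can be sketched as below. This is a minimal sketch, not part of the original answer: the helper name `ascii_only_lines` is hypothetical, and it assumes the input is read as raw bytes (Perl's default when no I/O layer, `-C` switch, or `PERL_UNICODE` setting is in effect), so every byte of a multi-byte UTF-8 character counts as non-ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the answer):
# returns only those lines in which every character is ASCII (0x00-0x7F).
sub ascii_only_lines {
    return grep { !/[^[:ascii:]]/ } @_;
}

# When file names are given on the command line, read all of their
# lines and print the ASCII-only ones to STDOUT. The whole input is
# slurped, which should be fine for a file of roughly 12,000 lines.
if (@ARGV) {
    binmode STDOUT;                 # write the bytes through unchanged
    print ascii_only_lines(<>);
}
```

Unlike the `-i.bak` one-liner, this sketch does not edit the file in place; it could be run as `perl filter_ascii.pl input.txt > output.txt`, where both file names are placeholders.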

answered Dec 04 '10 at 18:46 by codaddict · edited Jan 29 '16 at 16:21 by Christopher Bottoms

Wouldn't it be `-i'*.bak'`? EDIT: Never mind, they're equivalent – Cameron Dec 04 '10 at 18:56
Just tried both, they create a backup file named .bak but the original file is now populated with ⨪⨪⨪扭牥› instead of readable data. EDIT: If I convert the original file from .txt to .html and run the code, it appears to work well. Why would this not work on a .txt file? – John Simpleton Dec 04 '10 at 18:59

score 1 · Answer 2 · answered Dec 04 '10 at 19:05

#!/usr/bin/perl -p
END {close STDOUT}
use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw<FATAL all>;
use open qw<IN :bytes OUT :encoding(US-ASCII) :std>;
BEGIN {$SIG{__WARN__}=sub{confess}}
use sigtrap qw<stack-trace normal-signals error-signals>;
use Carp;
"disconcertingly";
