4

I have a file of approximately 12,000 lines that is generated every 6 hours. Some of these lines contain non-ASCII characters.

I would like to be able to run a Perl script to remove all lines that contain non-ASCII characters.

Cœur
  • 2
    Why is it appropriate to destroy data? – tchrist Dec 04 '10 at 18:47
  • 4
    @tchrist - Not sure about OP's context, but for example when the file needs to be loaded into a software which barfs at non-ascii and the business requirements don't mind losing lines (e.g. loading partial file is better than none) but do mind mangled lines that would result from deleting or encoding non-ascii characters (e.g. file format is position based). This is a VERY realistic scenario, I have had to do it in my job. – DVK Dec 04 '10 at 21:07

2 Answers

6

You can do:

perl -i.bak -ne 'print unless(/[^[:ascii:]]/)' file

Regex explanation for /[^[:ascii:]]/:

/ start of regular expression
  [ start of character class
  ^ make this a negative character class (a class that matches anything besides what is listed)
    [:ascii:] any ASCII character
  ] end of character class
/ end of regular expression

Christopher Bottoms
codaddict
  • Wouldn't it be `-i'*.bak'`? EDIT: Never mind, they're equivalent – Cameron Dec 04 '10 at 18:56
  • Just tried both, they create a backup file named .bak but the original file is now populated with ⨪⨪⨪扭牥› instead of readable data. EDIT: If I convert the original file from .txt to .html and run the code, it appears to work well. Why would this not work on a .txt file? – John Simpleton Dec 04 '10 at 18:59
1
#!/usr/bin/perl -p
END {close STDOUT}
use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw<FATAL all>;
use open qw<IN :bytes OUT :encoding(US-ASCII) :std>;
BEGIN {$SIG{__WARN__}=sub{confess}}
use sigtrap qw<stack-trace normal-signals error-signals>;
use Carp;
"disconcertingly";
tchrist