After searching SO and Google at length without success, I am posting a new question. I am working in TextWrangler, trying to compose a regular expression that returns the shortest matches of a multi-line pattern.
Basically,
ہے\tVM
is the string I am looking for (an Arabic-script word separated by a tab character from its part-of-speech tag). The difficulty is that I want to find every individual sentence containing that string, without a match running over into neighbouring sentences. Here is what I have so far:
/(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*ہے\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/
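To make the intent concrete, here is a minimal sketch of how I expect that pattern to behave, run against an in-memory sample rather than a real file. The sample sentences, the tags PRP and NN, and the variable names are made up for illustration and are not taken from my data; the only assumption is that each sentence is wrapped in <Sentence id='N'> ... </Sentence> with one word, a tab and a tag per line.

```perl
use strict;
use warnings;
use utf8;                                  # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: three sentences, two of which contain the target token.
my $sample = join "\n",
    "<Sentence id='1'>", "یہ\tPRP",   "ہے\tVM",  "</Sentence>",
    "<Sentence id='2'>", "وہ\tPRP",   "گیا\tVM", "</Sentence>",
    "<Sentence id='3'>", "کتاب\tNN", "ہے\tVM",  "</Sentence>";

my $word = "ہے";

# Consumes any text except a closing </Sentence> tag, so a match
# cannot extend past the end of the sentence in which it started.
my $inside = qr{(?:[^<]|<(?!/Sentence>))*};
my $re     = qr{(<Sentence id='(\d+)'>$inside\Q$word\E\tVM$inside</Sentence>)};

my @ids;
while ($sample =~ /$re/g) {                # while + /g walks through every match
    push @ids, $2;
    print "$1\n\n";
}
print "matched sentence ids: @ids\n";
```

With this sample I would expect sentences 1 and 3 to be reported and sentence 2 to be skipped, because its VM token is a different word.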
The files I am looking at are encoded in XML, so part of my question is whether any of you knows of an XML parser for the Mac.
Another obvious alternative is to write a Perl script; here too, I would be grateful for any advice pointing to a simple solution.
My current script is:
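use strict;
use warnings;
use open ':encoding(utf8)';   # files opened below are read and written as UTF-8
use Encode;

binmode(STDOUT, ":utf8");

# There is no "use utf8" here, so the literal is a byte string and must be decoded.
my $word = Encode::decode_utf8("ہے");
my @matches;

foreach my $file (glob("*.posn")) {
    open(my $in, '<', $file) or die "Error opening file $file ($!)";
    my $content = do { local $/; <$in> };   # slurp the whole file into one string
    close($in) or die $!;

    # while + /g visits every matching sentence; "if" would stop after the first one
    while ($content =~ /(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*\Q$word\E\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/g) {
        print STDOUT "$1\n\n\n\n";
        push(@matches, "$1\n\n");
    }
}

open(my $out, '>', 'matches.txt') or die "Error opening matches.txt ($!)";
print {$out} @matches;   # print the list directly; "@matches" would insert spaces between entries
close($out) or die $!;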