1

I am writing a Perl script that will be run inside an Automator app to process documents that were previously processed by hand. I need to do this weekly, always removing the same junk data. These are RTF files, converted from HTML files on Mac OS X using another Automator script in order to maintain formatting. I have created a new droplet script to process the RTF files and remove the unnecessary junk data.

My shell script is:

#!/bin/bash
# 
#    replace CR with CRLF
#     
/usr/bin/perl -CSDA -pi <<'EOF' - "$@"
s/dateformat//og;
s/text1//og;
s/text2//og;
s/text3//og;
s///og;

EOF

This takes care of 99% of what needs to be done. However, the final file comes out with excess line breaks. Is there a way to make the substitution of text1, text2, etc. also remove the line break that follows? My only restriction is that this has to run in an Automator shell script window.

Input sample data is formatted as such:

Text1 Dateformat 
[Content1] 

Text2 Dateformat
[Content2]

Text3 Dateformat
[Content3]

The script above produces output:

[Content1]


[Content2]


[Content3]

Desired output should be formatted as:

[Content1]

[Content2]

[Content3]

In the original document, there is a single line break after a content block, then the Text1 and Dateformat.

The original document

After processing, Text1 and Dateformat are removed, but as you can see there are now two line breaks between content blocks.

The document after being processed with the Automator droplet above

podel
  • Add a __DATA__ section with some sample rows; it would help people understand what you are trying to do. – Dragos Trif May 17 '20 at 18:48
  • @DragosTrif thank you, i've put in a sample. – podel May 17 '20 at 19:02
  • Perl one liner `perl -0777 -pe "s/Text\d Dateformat\s*\n//g" input_file.txt` -- hope that the problem was understood properly. – Polar Bear May 17 '20 at 19:10
  • @PolarBear I tried this out. The addition of `\s*\n` after `s/`ing out `dateformat` made no difference to the end result. I'm still left with two line breaks between the end of `Content 1` and the beginning of `Content 2`. I'm wondering if this is a text edit issue instead of a perl one. – podel May 17 '20 at 20:16

3 Answers

1

You can match and remove the whitespace as part of your pattern. The \R is the generic line ending, which matches any of the Unicode line endings, including a bare newline or a carriage return/newline pair. Also, take a look at a hexdump of the data to see what the real line endings are. Old Mac Classic line endings seem to show up in odd places (but \R should handle that).

The \h is horizontal whitespace:

#!/bin/bash
#
#    remove the marker text and the line ending that follows it
#
/usr/bin/perl -CSDA -pi <<'EOF' - "$@"
s/dateformat\R//ig;
s/text1\h+//ig;
s/text2\h+//ig;
s/text3\h+//ig;
EOF

Note that I've added the /i flag for case insensitivity since your patterns are all lowercase but the data have mixed case.

I've also removed the /o switch, which no longer does anything.

If there's some reason you're removing DateFormat by itself, you can just remove all trailing whitespace after Textn. The \s matches both vertical and horizontal whitespace:

#!/bin/bash
#
#    remove the marker text and any whitespace after Textn
#
/usr/bin/perl -CSDA -pi <<'EOF' - "$@"
s/dateformat//ig;
s/text1\s+//ig;
s/text2\s+//ig;
s/text3\s+//ig;
EOF

If you just want to skip those lines, you don't even need to do a substitution. You can skip them whether or not they have the DateFormat bit. This uses -n instead of -p so I can control when it outputs. I've added the \A beginning-of-string anchor for good measure:

#!/bin/sh
/usr/bin/perl -CSDA -ni -e 'print unless /\AText[123]\s+/i' "$@"
brian d foy
  • This works to remove the text, however I'm still left with two lines between `[Content]`. I've added images to the original post to illustrate what is happening on both my original code and on your update. – podel May 19 '20 at 12:28
  • I don’t get that same output, and the images you supply do not help me understand your issue. As I noted in the first paragraph, a hex dump would be useful. – brian d foy May 19 '20 at 12:38
  • The real line-endings appear to be /n. https://pastebin.com/g4da4Riz – podel May 19 '20 at 12:59
  • and here is the same file after processing https://pastebin.com/xm8Mr01M – podel May 19 '20 at 13:16
  • `\n` is a "logical" character. Mac Classic used `\r` to mean the same thing, which is why I want to see the original octets. Neither of those pastebins help me (and are basically the same as just showing me the original file). – brian d foy May 19 '20 at 14:12
  • I'm sorry, I really am new to this. What can I supply that would be of help? I used hexdump -c -n1048 to get the pastebin data. – podel May 19 '20 at 14:33
  • The original file would be much better. – brian d foy May 19 '20 at 14:51
  • [original file](https://www.dropbox.com/s/jz3ak7yd7k3j9tv/dummy%20original.rtf?dl=0) and [processed file](https://www.dropbox.com/s/wibboqsa141oycd/dummy%20processed.rtf?dl=0) – podel May 19 '20 at 15:28
  • Neither of those files looks like the data you are trying to process. Honestly, I don't think there's much more we can do to help you. Your best bet now is to break up your Automator process into steps and closely examine each step. I'm confident the solution I've offered for the question you posed is good, but I think that there's something else in the pipeline that is affecting your output. Good luck! – brian d foy May 19 '20 at 16:33
  • the files I supplied are identical from start to finish to the ones I'm processing, from export in original format to processing, except for the fact that they are dummy files created to protect privacy so I'm not sure where the confusion is. Thank you for your help, and hopefully I can find a solution that works. Thank you! – podel May 19 '20 at 16:55
0

This script does the same thing as the one-liner:

use strict;
use warnings;
use feature 'say';

my $data = do { local $/; <DATA> };

$data =~ s/Text\d+\s+Dateformat\s*//g;
say $data;

__DATA__
Text1 Dateformat 
[Content1] 

Text2 Dateformat
[Content2]

Text3 Dateformat
[Content3]

Output

[Content1]

[Content2]

[Content3]

NOTE: replace <DATA> with <> to read from a pipe or from a file given on the command line.

Polar Bear
  • This changes the program. The original program would remove the Text lines even if they didn't have the Dateformat parts. – brian d foy May 18 '20 at 14:38
  • Because I am running this script inside an Automator window, it won't allow me to use `say`. With the rest, the result is still two line breaks between `[Content]` blocks. – podel May 19 '20 at 12:13
  • @podel - use `print` instead of `say`, and use `s/\n\n/\n/` if you see no other way. – Polar Bear May 19 '20 at 18:09
-1
use strict;
use warnings;

use Data::Dumper;

my $record = {};
my ( $key, $val );

while ( my $row = <DATA> ) {
    chomp( $row );
    next if !$row;
    if ( $row =~ /Dateformat/ ) {
        ( $key, undef ) = split /\s+/, $row;
        print "$key\n";
    } elsif ( $row =~ /\[/ ) {
        $record->{$key} = $row;
    }  
}   
print Dumper($record);





__DATA__
Text1 Dateformat 
[Content1] 

Text2 Dateformat
[Content2]

Text3 Dateformat
[Content3]
Dragos Trif
  • Putting these in a hash means you don't get them in the order they were in the original (except by accident). You don't need to build up a data structure for line-oriented tasks. Also, you output the parts they want to remove. – brian d foy May 18 '20 at 14:45