
This is only the second Perl script I have written, so any constructive help or advice would be greatly appreciated. Note that I am working on a Windows machine, using Strawberry Perl. I am aware that a Tidy module exists for Perl, but (for reasons that aren't worth explaining here) I would prefer to call tidy.exe from the script rather than use the module.

What I want my Perl script to do:

  1. Take an HTML file, copy it, and give the copy an .xml extension.

  2. Run tidy.exe on the newly created .xml file to make it well-formed XML.

  3. Strip the XHTML namespace from the resulting well-formed .xml file.

When I run it from the command line with `G:\TestFolder>perl tidy_cleanup.pl`, it produces the desired result. However, when I launch the script by double-clicking its icon, it skips step 2 above. Based on the code below, do you have any idea why it behaves this way?

Here is my code:

#!/usr/bin/perl

use strict;
use warnings;

use File::Basename;
use FileHandle;

my $basename;
my @files = glob("*.html");

foreach my $file (@files) {

  my $oldext   = ".html";
  my $newext   = ".xml";
  my $newerext = "v2.xml";

  my $newfile  = $file;
  $newfile     =~ s/$oldext/$newext/;

  my $newerfile = $newfile;
  $newerfile    =~ s/$newext/$newerext/;

  open IN, $file or die "Can't read source file $file: $\n";
  open OUT, ">$newfile" or die "Can't write on file $newfile: $!\n";

  print "Copying $file to $newfile\n";


{while(<IN>)

{  
print OUT $_;  

close(IN);
close(OUT);


}

my $xmltidy = "for \%i in ($newfile) do c:\\Tidy\\tidy.exe --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m \"\%i\"";
system($xmltidy);


print "\nfinished running tidy \n\n";
}

  {
    open NEWIN,  "$newfile"    or die "Can't read source file $newfile: $!\n";
    open NEWOUT, ">$newerfile" or die "Can't write on file $newerfile: $!\n";

    print "Copying $newfile to $newerfile\n";
    {
      while (<NEWIN>) {
        if ( /(\<html)( xmlns="http:\/\/www.w3.org\/1999\/xhtml" xml:lang="en-GB")(.*)/ ) {
          print NEWOUT "<html$3";
        }
        else {
          print NEWOUT $_;
        }
      }

      close(NEWIN);
      close(NEWOUT);
    }
  }
}
1723842
  • It is hard to believe that this program does anything useful however you run it. You close both the input and output files inside the first `while` loop, so only a single line will ever be copied to `$newfile`. You would have seen error messages like `readline() on closed filehandle`, so why didn't you tell us about them? I suggest that you explain exactly what the program is supposed to do so that we can help you fix it. There seems to be more to it than you have described, as the first `if` statement must have a purpose, although all it seems to do is remove everything before an `<html>` tag – Borodin Jul 21 '14 at 14:55
  • You're right, I see readline() on closed filehandle IN at line 42. – 1723842 Jul 21 '14 at 15:11
  • new code for the deleted if statement – 1723842 Jul 21 '14 at 15:11
  • As for what I want it to do, that is explained in steps 1, 2, and 3 above. – 1723842 Jul 21 '14 at 15:12
  • I'm a bit curious: it looks like this works for one file, but you're calling tidy with what I think is a batch-language loop. Is there a reason for that? – dsolimano Jul 21 '14 at 15:17
  • Ah, and what are the command, arguments, and working directory for the icon? – dsolimano Jul 21 '14 at 15:18
  • Are you talking about calling it through `system($xmltidy);`? I did that after googling "How can I call an .exe file from a perl script?" – 1723842 Jul 21 '14 at 15:21
  • I think the reason for the batch-language loop is that I eventually want to run this program on multiple HTML files, not just one. – 1723842 Jul 21 '14 at 15:23
  • So, before I wrote the Perl script, I would run tidy with `for %i in (*.html) do G:\Folderdirectory\path\tidy.exe --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m "%i"` – 1723842 Jul 21 '14 at 15:24

2 Answers


The reason your program isn't working from a shortcut may be that it is looking for HTML files in the wrong directory. When you run `perl tidy_cleanup.pl` from the command line, `glob` looks in your current working directory. When you set up a shortcut, you need to specify that directory in the shortcut's "Start in:" field.
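If you would rather not depend on the shortcut's settings, one option is to have the script switch to its own directory before globbing. The following is a minimal sketch, assuming the HTML files live in the same folder as the script (an assumption, not something stated in the question). It uses the core FindBin module.

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # $Bin holds the directory containing this script

# Assumption: the HTML files sit next to the script itself.
chdir $Bin or die "Can't change directory to '$Bin': $!";

# glob now resolves '*.html' relative to the script's folder,
# however the script was launched.
my @html_files = glob '*.html';
print "Working in $Bin, found ", scalar @html_files, " HTML file(s)\n";
```

With this in place, launching from the icon and launching from the command line should see the same set of files.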

However, as I said in my comment, you copy only a single line from the HTML file to the XML file, because you close the file handles inside the while loop.
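To illustrate the point about the file handles, here is a sketch of a line-by-line copy in which both handles are closed only after the loop has finished. The helper name `copy_lines` is hypothetical and used only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $source to $target one line at a time.
sub copy_lines {
    my ($source, $target) = @_;

    open my $in,  '<', $source or die "Can't read '$source': $!";
    open my $out, '>', $target or die "Can't write '$target': $!";

    # The loop body only copies; nothing is closed in here.
    while (my $line = <$in>) {
        print {$out} $line;
    }

    # Close the handles once, after every line has been copied.
    close $in;
    close $out or die "Can't close '$target': $!";
    return;
}

# Example call (file names are placeholders):
# copy_lines('page.html', 'page.xml');
```

Closing inside the loop is what produces the `readline() on closed filehandle` warning on the second iteration.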

This is how I would write what I think you want.

use strict;
use warnings;
use autodie;

use File::Copy 'copy';

my $tidy = 'C:\Tidy\tidy.exe';
die "'tidy.exe' not found" unless -f $tidy;

for my $html_file (glob '*.html') {

  (my $xml_file = $html_file) =~ s/\.html\z/.xml/;
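  # autodie does not cover File::Copy, so check the result explicitly
  copy $html_file, $xml_file
    or die qq{Can't copy "$html_file" to "$xml_file": $!};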

  print qq{Tidying "$xml_file"\n};

  qx{"$tidy" --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m "$xml_file"};

  print "Finished running tidy\n\n";

  (my $v2_file = $xml_file) =~ s/\.xml\z/_v2.xml/;
  open my $xml_fh,  '<', $xml_file;
  open my $v2_fh,   '>', $v2_file;

  print qq{Copying "$xml_file" to "$v2_file"\n};

  while (<$xml_fh>) {
    s/\s*xmlns="[^"]+"//;
    s/\s*xml:lang="[^"]+"//;
    print $v2_fh $_;
  }

  print "Copy complete\n\n";
}
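To see what the two substitutions in the `while` loop do, they can be tried on a single line in isolation. This is a sketch only: the helper name `strip_namespace` is hypothetical, and the sample `<html>` tag is an assumption based on the attributes matched in the question's regex.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions from the loop above.
sub strip_namespace {
    my ($line) = @_;
    $line =~ s/\s*xmlns="[^"]+"//;       # drop the namespace declaration
    $line =~ s/\s*xml:lang="[^"]+"//;    # drop the language attribute
    return $line;
}

# Assumed sample of the opening tag found in the tidied file.
my $sample = qq{<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB">\n};

print strip_namespace($sample);                 # prints <html>
print strip_namespace("<p>unchanged</p>\n");    # lines without the attributes pass through
```

Unlike the regex in the question, this does not depend on the exact namespace URI, the language code, or the order of the two attributes.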
Borodin
  • So when I run this I get: Can't open '*.xml' for reading: 'Invalid argument' at tidy_cleanup.pl line 19. – 1723842 Jul 21 '14 at 16:28
  • @xslt_user: Line 19 is the `print` statement. If you have added to the program so that line 19 is the `qx` then please say what you have done. I have made a few changes since my very first post, and you may have a bugged version if you were quick to pick it up. Please take another copy and try again. – Borodin Jul 22 '14 at 02:07
  • The new edits you made worked. I also got it to work using the code below. – 1723842 Jul 22 '14 at 13:44
  • @xslt_user: Well done getting your program working, but I encourage you to use something more like my solution. The layout of your code is very unusual and difficult to read, and some of the techniques are very out of date. I also see that you are still copying only the first line of the file and getting a warning message. It is always best to `use warnings` and `use strict`, but there is little point if you ignore the messages they produce. – Borodin Jul 22 '14 at 16:07
use strict;
use warnings;
use File::Basename;
use FileHandle;

my @files = glob("*.html");
foreach my $file (@files) {

my $oldext = ".html";
my $newext = ".xml";
my $newerext = "v2.xml";
my $newfile = $file;
$newfile =~ s/$oldext/$newext/;

my $newerfile = $newfile;
$newerfile =~ s/$newext/$newerext/;

open IN, $file or die "Can't read source file $file: $\n";
open OUT, ">$newfile" or die "Can't write on file $newfile: $!\n";
print "Copying $file to $newfile\n";
{while(<IN>)

{  
print OUT $_;    
close(OUT);
my $xmltidy = "c:\\Tidy\\tidy.exe --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m \"$newfile\"";
system($xmltidy);
print "\nfinished running tidy \n\n";
{
open NEWIN, "$newfile" or die "Can't read source file $newfile: $!\n";
open NEWOUT, ">$newerfile" or die "Can't write on file $newerfile: $!\n";
print "Copying $newfile to $newerfile\n";

{while(<NEWIN>)
{
  if(/(\<html)( xmlns="http:\/\/www.w3.org\/1999\/xhtml" xml:lang="en-GB")(.*)/) {      
        print NEWOUT "<html$3";             
     }         
   else {           
           print NEWOUT $_;
           }     
}
close(NEWIN);
close(NEWOUT);
}
}    
}
close(IN);
}
}
1723842