
I have XML files that look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- some comment here -->
<rsccat version="1.0" locale="en_US" product="some_prouduct" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../../../../product/resources/schema/msgcat.xsd">
  <message>

    <entry key="entry1" lol="false">
        <![CDATA[
            <actions>
                <action id="hmm" type="nothing">
                    <cmd>456</cmd>
                    <msg id="123"></msg>
                </action>
            </actions>
        ]]>
    </entry>

<entry key="entry2">message2 </entry>
<entry key="entry3">message3 </entry>

<entry key="entry4">
    <actions hello="yes">
    <action type="lol">
    <cmd>rolf</cmd>
    <txt>omg</txt>
    </action>
    </actions> </entry>



</message>
</rsccat>

I would like to write a Perl function that takes the path of an XML file and a list of keys, and removes the entries associated with those keys entirely, without leaving any whitespace or blank lines behind. Moreover, the blank lines already present in the original XML file should be preserved, for instance the three blank lines after the entry with key entry4.

I have written a function which removes the entries without leaving any blank lines, but it also removes the existing blank lines in the XML file.

use File::Slurp;  
sub findReplaceFile
{
    my ($filename, @keys) = @_;  

    my $filetext = read_file($filename);

    foreach my $key (@keys) 
    {
        chomp($key);  # remove newline characters
        my $regex = qr/<entry\s+key\s*=\s*"${key}".*?>.+?<\/entry>/s;
        $filetext =~ s/$regex//gs;  # replacing with empty string
        $filetext =~ s/\n\s*\n/\n/g;  # removing extra line
    }
}
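
The effect described above can be reproduced in isolation. The following is a minimal sketch on a hypothetical miniature input (the string in $text is made up for illustration, not taken from the real file): the cleanup substitution cannot tell the blank line left by the deletion from blank lines that were already there, so it collapses both.

```perl
use strict;
use warnings;

# Hypothetical miniature input: one entry to delete, one entry to keep,
# then two pre-existing blank lines that should survive.
my $text = "<message>\n<entry key=\"a\">x</entry>\n<entry key=\"b\">y</entry>\n\n\n</message>\n";

# Step 1: delete the entry, as the function above does.
(my $removed = $text) =~ s/<entry\s+key\s*=\s*"a".*?>.+?<\/entry>//s;

# Step 2: the same cleanup as in the function above. It collapses every
# run of blank lines, including the ones that were in the file before.
(my $collapsed = $removed) =~ s/\n\s*\n/\n/g;

print "after removal only:\n$removed";
print "after cleanup:\n$collapsed";
```

After step 1 the two blank lines before the closing tag are still present; after step 2 they are gone.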

Please help me achieve this. I am fine with either an XML parser module in Perl or plain old regex.

Kautuk Raj
  • (1) It is better to use XSLT for your task. (2) You can pass the keys to XSLT as parameters. (3) Spaces are irrelevant for XML files, which is why it is not clear why you need to preserve the existing blank lines. – Yitzhak Khabinsky Jul 07 '23 at 15:04
  • If blank lines are important, shouldn't they be in a CDATA section? – Shawn Jul 07 '23 at 16:37
  • The blank lines are important because in something like a `git diff`, their removal would show up as additional changes. @Shawn – Kautuk Raj Jul 08 '23 at 05:47

2 Answers


I wrote an example that does not use any modules. I have never used File::Slurp, so this is only my assumption and not the ultimate truth, but most likely the module applies chomp when reading the file, which removes line breaks. File app.pl:

#!/usr/bin/perl -w
use strict;

my $path = "data.xml";
findReplaceFile($path, "entry2", "entry4");


sub findReplaceFile {
    my ($filename, @keys) = @_;
    my $data = readData($filename);
    foreach my $key (@keys) {
        # \Q...\E quotes any regex metacharacters that the key may contain
        $data =~ s/<entry[^>]+key=(.?)\Q$key\E\1[^>]*?>.*?<\/entry>\n?//mis;
    }
    writeData($filename, $data);
}

sub writeData {
    my $path = shift || "data.txt";
    my $data = shift || die "No data was given to write to the file";
    if (-e $path) {
        open my $fh, ">$path.dat" or die "Can't open file '$path.dat' for write: $!";
        print $fh $data;
        close $fh;
    }
}

sub readData {
    my $path = shift || "data.txt";
    my $data = "";
    if (-e $path and -T $path and -r $path) {
        open my $fh, "<$path" or die "Can't open file '$path' for read: $!";
        $data = join("", <$fh>);
        close $fh;
    } else {
        die "File '$path' doesn't exist, is not readable, or is not a text file";
    }
    return $data;
}

This code does not modify your original XML. It saves the result in a separate file, appending ".dat" to the file name, in this line:

open my $fh, ">$path.dat" or die;

It should also be noted that this code reads the whole file into memory. If your file grows to a huge size, you will need to rewrite the algorithm to read the file line by line and to check and replace on the fly.
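
A line-by-line version could look roughly like the sketch below. It is not part of the code above: the helper name remove_entries_streaming is hypothetical, and it assumes that every entry to be removed starts on a line of its own, that its opening tag fits on one line, and that nothing else follows its closing </entry> on the same line (as in the sample file).

```perl
use strict;
use warnings;

# Hypothetical helper: copies $in to $out, dropping every line that belongs
# to an <entry> whose key is listed in @keys. Only one line is held in
# memory at a time.
sub remove_entries_streaming {
    my ($in, $out, @keys) = @_;

    # With no keys there is nothing to remove; copy the input through.
    unless (@keys) {
        print {$out} $_ while <$in>;
        return;
    }

    my $alt  = join '|', map { quotemeta($_) } @keys;
    my $open = qr/<entry\b[^>]*\bkey=(["'])(?:$alt)\1[^>]*>/;
    my $skipping = 0;    # true while inside an entry that is being removed

    while (my $line = <$in>) {
        $skipping = 1 if !$skipping && $line =~ $open;
        if ($skipping) {
            # The line holding </entry> is the last one to drop.
            $skipping = 0 if $line =~ m{</entry>};
            next;
        }
        print {$out} $line;    # everything else, blank lines included
    }
    return;
}

# Usage on in-memory filehandles with made-up sample data.
my $xml = join "\n",
    '<message>',
    '',
    '<entry key="entry2">message2 </entry>',
    '<entry key="entry3">message3 </entry>',
    '',
    '<entry key="entry4">',
    '    <cmd>rolf</cmd>',
    '    <txt>omg</txt> </entry>',
    '',
    '',
    '</message>',
    '';

open my $in,  '<', \$xml        or die "Can't open input string: $!";
open my $out, '>', \my $result  or die "Can't open output string: $!";
remove_entries_streaming($in, $out, 'entry2', 'entry4');
close $out;
print $result;
```

Because lines outside a removed entry are copied unchanged, the blank lines that were already in the input stay where they were.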

The following one-liner performs the same replacement as the code above. Run it in a terminal; the -0400 switch makes perl read the whole file as a single string. The key numbers are specified in this part: (?:1|3) means the first and third, (?:1|3|2) means the first, third and second, and so on.

perl -i.dat -ps0400e "s/<entry[^>]+key=(.?)entry(?:1|3)\1[^>]*?>.*?<\/entry>\n?//gmis" data.xml

The difference is that now the original file is kept under the .dat extension, and the result is saved to the file with the original name.
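
Instead of typing the alternation by hand, it can be built from a list of keys. This is only a sketch: the helper name remove_entries is hypothetical, it uses full key names rather than the numeric suffixes, and it restricts the quote character to " or ', which is slightly stricter than the (.?) used above.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the entries for all @keys from $text
# in a single pass and returns the new text.
sub remove_entries {
    my ($text, @keys) = @_;
    return $text unless @keys;    # an empty alternation would match every key

    # quotemeta protects keys that contain regex metacharacters
    my $alt = join '|', map { quotemeta($_) } @keys;
    $text =~ s/<entry[^>]+key=(["'])(?:$alt)\1[^>]*>.*?<\/entry>\n?//gs;
    return $text;
}

# Made-up sample data for illustration.
my $sample = "<entry key=\"entry1\">a</entry>\n"
           . "<entry key=\"entry2\">b</entry>\n"
           . "\n"
           . "<entry key=\"entry3\">c</entry>\n";

my $cleaned = remove_entries($sample, 'entry1', 'entry3');
print $cleaned;
```

The closing quote after the alternation keeps a key such as entry1 from also matching entry10.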

e1st0rm
  • Why go to the effort of writing your own broken parser when it would be shorter and more reliable to use an existing one?! Your code doesn't even handle CDATA sections which are in use in the document! – ikegami Jul 08 '23 at 02:43
  • Just wanted to point out that using a read function from scratch is not required. The regex you suggested solves the purpose. – Kautuk Raj Jul 08 '23 at 05:52
  • I wrote this code in less than 5 minutes; I think you would need much more time to find the module and study its documentation. That would probably be justified if such a module were used constantly. But if you need to perform a one-time task and you know for sure that CDATA does not contain tags, then this code will do it just fine. The following one-liner does the same work: perl -i.dat -ps0400e "s/<entry[^>]+key=(.?)entry(?:1|3)\1[^>]*?>.*?<\/entry>\n?//gmis" data.xml – e1st0rm Jul 08 '23 at 12:52
  • Re: "Does it really matter how fast you can write broken code? Why didn't you spend those five minutes doing it right instead?" Where did you get the idea that the code is broken? I have also parsed XML more than once, only here it was not required. Read the terms of the assignment. – e1st0rm Jul 11 '23 at 03:19

Answering my own question, for completeness.

Thanks to @e1st0rm for suggesting the regex.

use File::Slurp;  
sub findReplaceFile
{
    my ($filename, @keys) = @_;  

    my $filetext = read_file($filename);

    foreach my $key (@keys) 
    {
        # \Q...\E quotes any regex metacharacters that the key may contain
        $filetext =~ s/<entry[^>]+key=(.?)\Q$key\E\1[^>]*?>.*?<\/entry>\n?//mis;
    }
    # Now, just write the data in variable filetext into the same or different file
}
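
For the write-back step, one possible complete version is sketched below using only core Perl instead of File::Slurp. The name find_replace_file and the extra $outfile parameter are assumptions made for this sketch. The pattern is also anchored at the start of the line so that the indentation of a removed entry does not remain as a whitespace-only line; this assumes each entry to be removed starts on its own line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical variant of findReplaceFile. $outfile is an assumed extra
# parameter; pass the input path again to overwrite the file in place.
sub find_replace_file {
    my ($filename, $outfile, @keys) = @_;

    open my $in, '<', $filename or die "Can't read '$filename': $!";
    my $filetext = do { local $/; <$in> };    # slurp the whole file
    close $in;

    foreach my $key (@keys) {
        # ^[ \t]* swallows the indentation in front of the entry and
        # [ \t]*\n? swallows the rest of its last line, so existing
        # blank lines around the entry are left exactly as they were.
        $filetext =~ s/^[ \t]*<entry[^>]+key=(["'])\Q$key\E\1[^>]*>.*?<\/entry>[ \t]*\n?//ms;
    }

    open my $out, '>', $outfile or die "Can't write '$outfile': $!";
    print {$out} $filetext;
    close $out or die "Can't close '$outfile': $!";
    return $filetext;
}

# Usage on a temporary file with made-up sample data.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "<message>\n\n    <entry key=\"entry1\">\n        text\n    </entry>\n\n"
              . "<entry key=\"entry2\">message2 </entry>\n\n\n</message>\n";
close $tmp_fh;

my $new_text = find_replace_file($tmp_path, $tmp_path, 'entry1');
print $new_text;
```

In this sample the blank line before entry1 and the blank line after it both remain, while the three lines of the entry itself disappear together with their indentation.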
Kautuk Raj