How do I remove duplicate nodes from an XML document in Perl?

Question

I have a video sitemap XML file with duplicated nodes:

<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"> 
<url>
<loc>http://www.tubtun.com/video/Samsung_42Channel_Wireless_SoundStand</loc>
<video:video>
    <video:title>Samsung 42Channel Wireless SoundStand</video:title>
    <video:description>Samsung 4.2Channel Wireless SoundStand</video:description>
    <video:thumbnail_loc>http://www.tubtun.com/media/files_thumbnail/user91/pl_5364844b0dc.jpg</video:thumbnail_loc>
    <video:player_loc>http://www.tubtun.com/modules/vPlayer/vPlayer.swf?f=http://www.tubtun.com/modules/vPlayer/vPlayercfg.php?fid=844b0dc2c7258f4de11</video:player_loc>
    <video:publication_date>2015-01-27</video:publication_date>
</video:video>
</url>
<url>
<loc>http://www.tubtun.com/video/Samsung_42Channel_Wireless_SoundStand</loc>
<video:video>
    <video:title>Samsung 42Channel Wireless SoundStand</video:title>
    <video:description>Samsung 4.2Channel Wireless SoundStand</video:description>
    <video:thumbnail_loc>http://www.tubtun.com/media/files_thumbnail/user91/pl_5364844b0dc.jpg</video:thumbnail_loc>
    <video:player_loc>http://www.tubtun.com/modules/vPlayer/vPlayer.swf?f=http://www.tubtun.com/modules/vPlayer/vPlayercfg.php?fid=844b0dc2c7258f4de11</video:player_loc>
    <video:publication_date>2015-01-27</video:publication_date>
</video:video>
</url>
.....

I have written a Perl script to remove the duplicated data:

use strict;
use warnings;
use XML::LibXML;

my $file = 'sitemap.xml';
my $doc = XML::LibXML->load_xml( location => $file );

my %seen;
foreach my $uni ( $doc->findnodes('//url') ) {  # 'university' nodes only

    my $name = $uni->find('video:title');

    print "'$name' duplicated\n",
      $uni->unbindNode() if $seen{$name}++;  # Remove if seen before
}

$doc->toFile('clarified.xml'); # Print to file

Unfortunately, the file "clarified.xml" is the same as sitemap.xml.

I don't know what is wrong with my script.

Have you checked what's inside `$name`? Does your script print the `foo duplicated` output? – simbabque, Oct 08 '15 at 11:46
Well then you found the problem. The `$uni->find('video:title')` call does not work properly. You might want to check how to work with namespace prefixes in XML::LibXML, and how to get the text node out of an element. – simbabque, Oct 08 '15 at 12:31

score 1 · Answer 1 · answered Oct 08 '15 at 12:36

I'm not quite sure why your XML::LibXML code isn't working, although, as mentioned in the comments, if the `find` call isn't returning the title, that will be the root of it.

I'll offer an alternative that does work using XML::Twig.

#!/usr/bin/env perl 
use strict;
use warnings;
use XML::Twig; 

my $file = 'test3.xml';

my %seen;

sub delete_url_if_seen {
   my ( $twig, $url ) = @_; 
   my $name = $url -> get_xpath('./video:video/video:title',0) -> trimmed_text;
   if ( $seen{$name}++ ) { $url -> delete; };
}

my $twig = XML::Twig -> new ( 'pretty_print' => 'indented_a', 
                   'twig_handlers' => { 'url' => \&delete_url_if_seen } );
$twig -> parsefile_inplace ( $file );

nwellnhof · Answer 2 · 2020-05-25T13:32:02.733

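You should use an XPathContext and register both the video namespace and the default namespace under prefixes of your choice, because an unprefixed name in XPath never matches elements that are in a default namespace. You should also call findvalue to get the title as a string.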

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(sitemap => 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpc->registerNs(video   => 'http://www.google.com/schemas/sitemap-video/1.1');
for my $node ($xpc->findnodes('//sitemap:url', $doc)) {
    # video:title is nested inside video:video, not a direct child of url
    my $name = $xpc->findvalue('video:video/video:title', $node);
    ...
}
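
For completeness, here is a minimal end-to-end sketch of that approach. It is an illustration under assumptions rather than a drop-in script: the helper name `dedupe_urls` and the inline sample document (with example.com URLs) are made up for the example, and duplicates are keyed on the video title only, as in the question.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: unbind every <url> whose video title was already seen.
# Returns the number of nodes removed.
sub dedupe_urls {
    my ($doc) = @_;

    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( sitemap => 'http://www.sitemaps.org/schemas/sitemap/0.9' );
    $xpc->registerNs( video   => 'http://www.google.com/schemas/sitemap-video/1.1' );

    my %seen;
    my $removed = 0;
    for my $url ( $xpc->findnodes('//sitemap:url') ) {
        # The title is a grandchild of <url>, so go through <video:video>.
        my $title = $xpc->findvalue( 'video:video/video:title', $url );
        next unless $seen{$title}++;    # first occurrence is kept
        $url->unbindNode;
        $removed++;
    }
    return $removed;
}

# Assumed sample input with the same shape as the sitemap in the question.
my $xml = <<'XML';
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url><loc>http://example.com/a</loc><video:video><video:title>A</video:title></video:video></url>
<url><loc>http://example.com/a</loc><video:video><video:title>A</video:title></video:video></url>
<url><loc>http://example.com/b</loc><video:video><video:title>B</video:title></video:video></url>
</urlset>
XML

my $doc     = XML::LibXML->load_xml( string => $xml );
my $removed = dedupe_urls($doc);
print "removed $removed duplicate(s)\n";
```

With a real file you would load it with `load_xml( location => $file )` and write the result with `$doc->toFile(...)`, as in the question.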

score 0 · Accepted Answer · edited May 23 '17 at 11:58

I have it working. Here is the code, based on the solution in https://stackoverflow.com/a/4817929/235961:

use strict;
use warnings;
use XML::LibXML;

my $file = 'sitemap.xml';
my $doc = XML::LibXML->load_xml( location => $file );

my %seen;
foreach my $url ( $doc->findnodes("//*[name() ='url']") ) {  # matches <url> despite the default namespace

    # Relative path, so only the title inside this <url> is read
    my $name = $url->findvalue('.//video:title');
    if ( $seen{$name}++ ) {    # Remove if seen before
        print "'$name' duplicated\n";
        $url->unbindNode();
    }
}

$doc->toFile('clarified.xml'); # Print to file

answered Oct 08 '15 at 12:51

Pradeep

I'm struggling to believe that your code works any differently from the OP's original. All you have done is change the XPath `//url` to `//*[name()='url']` which is identical when the node has no namespace, as here. You are also trying to use the `video` namespace which LibXML knows nothing about. Please show your sample input data and the resulting output – Borodin Oct 08 '15 at 14:10
The find is slightly different too. – Sobrique Oct 08 '15 at 14:14
@Borodin I am not quite sure but I think it has to do with XML namespaces, I read a similar problem here http://stackoverflow.com/a/4817929/235961 – Pradeep Oct 08 '15 at 14:55
@Borodin video has a ns mentioned & I am unable to run my code on any online IDEs for you to see, 'cause none have XML::LibXML installed – Pradeep Oct 08 '15 at 14:57
