
I would like to write a command-line script that combines multiple <author> elements in an Atom feed entry into one. For example, an entry like:

<entry>
    <id>someid</id>
    <published>somedate</published>
    <title>Title</title>
    <summary>Summary</summary>
    <author>
      <name>Author One</name>
    </author>
    <author>
      <name>Author Two</name>
    </author>
    <author>
      <name>Author Three</name>
    </author>
  </entry>

should become:

<entry>
    <id>someid</id>
    <published>somedate</published>
    <title>Title</title>
    <summary>Summary</summary>
    <author>
      <name>Author One, Author Two, Author Three</name>
    </author>
  </entry>

I think I could do it myself using Perl and regexes but, since parsing XML with regexes is not a good idea, I would be thankful for a more elegant solution that uses a proper XML parser.

Adrian Mole
n_flanders

2 Answers


Ted has the right idea, but his answer does a few things in a more complicated way than needed, and it doesn't account for some properties of the Atom format (e.g. its use of namespaces).

use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(a => 'http://www.w3.org/2005/Atom');

# See XML::LibXML::Parser for more ways to create the document object.
my $doc = XML::LibXML->load_xml( location => 'atom.xml' );

for my $entry_node ($xpc->findnodes('/a:feed/a:entry', $doc)) {
   my @author_names;
   for my $author_node ($xpc->findnodes('a:author', $entry_node)) {
      push @author_names, $xpc->findvalue('a:name', $author_node);
      $author_node->unbindNode();
   }

   my $author_node = XML::LibXML::Element->new('author');
   $author_node->appendTextChild('name', join(", ", @author_names));
   $entry_node->appendChild($author_node);
}

$doc->toFile('atom.new.xml');
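One possible refinement, as a minimal sketch rather than a definitive version: `XML::LibXML::Element->new('author')` creates an element with no namespace, so if you want to keep querying the modified document in memory with the `a:` prefix, it may be safer to create the replacement nodes in the Atom namespace explicitly. The sketch below uses `createElementNS`, `addNewChild` and `appendText` for that. The inline sample feed and the `$ATOM_NS` variable are assumptions made for illustration; they are not part of the original code.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical helper variable: the Atom namespace URI used throughout.
my $ATOM_NS = 'http://www.w3.org/2005/Atom';

# Hypothetical inline sample feed so the sketch is self-contained.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Title</title>
    <author><name>Author One</name></author>
    <author><name>Author Two</name></author>
  </entry>
</feed>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(a => $ATOM_NS);

for my $entry_node ($xpc->findnodes('/a:feed/a:entry', $doc)) {
   my @author_names;
   for my $author_node ($xpc->findnodes('a:author', $entry_node)) {
      push @author_names, $xpc->findvalue('a:name', $author_node);
      $author_node->unbindNode();
   }

   # Skip entries without authors so no empty <author> is appended.
   next if !@author_names;

   # Create the replacement nodes in the Atom namespace.
   my $author_node = $doc->createElementNS($ATOM_NS, 'author');
   my $name_node   = $author_node->addNewChild($ATOM_NS, 'name');
   $name_node->appendText(join(", ", @author_names));
   $entry_node->appendChild($author_node);
}

# The merged node can still be found with the namespaced query.
my $merged = $xpc->findvalue('/a:feed/a:entry/a:author/a:name', $doc);
print "$merged\n";
```

The `next if !@author_names;` line is an extra guard I added; drop it if every entry is known to have at least one author.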
ikegami
  • Cool, yeah, I just read about it an hour ago :-) – Ted Lyngmo Nov 22 '20 at 19:59
  • @Ted Lyngmo Note the use of `unbindNode` to delete the node. And how it's simpler to get the author nodes rather than the name nodes – ikegami Nov 22 '20 at 20:01
  • I'll take a look for sure. I'm new to `XML::LibXML` and `Xpath` (I've only used those once before and that was for an answer here). – Ted Lyngmo Nov 22 '20 at 20:04

In Perl I suggest using XML::LibXML.

Here I've used an XPath query to find the name nodes, pushing each name into an array and removing its author node as I go. Finally, I create a new author node and append it to the root element.

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

# example loading the xml from a file
my $dom = XML::LibXML->load_xml(location => 'atom.xml', no_blanks => 1);
my $root = $dom->documentElement();

# the Xpath query
my $query = q{
    /entry/author/name
};

my @authornames;

foreach my $namenode ($dom->findnodes($query)) {
    # save the name
    push @authornames, $namenode->to_literal();

    # remove the author node
    $namenode->getParentNode->getParentNode->removeChild($namenode->getParentNode);

    #or:
    # $root->removeChild($namenode->getParentNode);
}

# build a new author node
my $author = XML::LibXML::Element->new('author');
$author->appendTextChild('name', join(", ",@authornames));

# and add it
$root->appendChild($author);

# print the result
print $dom->serialize(1);

#or, if you don't want the <?xml...> header:
# print $root->serialize(1) . "\n";

Output:

<?xml version="1.0"?>
<entry>
  <id>someid</id>
  <published>somedate</published>
  <title>Title</title>
  <summary>Summary</summary>
  <author>
    <name>Author One, Author Two, Author Three</name>
  </author>
</entry>
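One caveat, shown as a minimal sketch: the query `/entry/author/name` matches the sample above because that sample declares no namespace. In XPath 1.0 an unprefixed name only matches elements in no namespace, so on an entry that declares the Atom namespace the same query finds nothing, and a prefix has to be registered through `XML::LibXML::XPathContext`. The inline XML and the `atom` prefix below are assumptions chosen for illustration.

```perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample: the same kind of entry, but in the Atom namespace.
my $atom_xml = <<'XML';
<entry xmlns="http://www.w3.org/2005/Atom">
  <author><name>Author One</name></author>
  <author><name>Author Two</name></author>
</entry>
XML

my $dom = XML::LibXML->load_xml(string => $atom_xml);

# Unprefixed query: looks for elements in no namespace.
my @plain = $dom->findnodes('/entry/author/name');

# Prefixed query: 'atom' is an arbitrary prefix bound to the Atom namespace URI.
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(atom => 'http://www.w3.org/2005/Atom');
my @namespaced = $xpc->findnodes('/atom:entry/atom:author/atom:name');

printf "unprefixed query: %d match(es)\n", scalar @plain;
printf "prefixed query:   %d match(es)\n", scalar @namespaced;
```

On this input the unprefixed query matches no nodes, while the prefixed one matches both name nodes.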
Ted Lyngmo