I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.

So my problem is: using Linux command line tools (e.g. Perl is acceptable, but Javascript is not), how can I sort a source file like this:

<div id="references">
    <h2>References</h2>

    <ul>
        <li id="reference-to-book-1">
            <span class="ref-author">Sample, Peter</span>
            <cite><a href="http://example.org/">Online Book 1</a></cite>
            <span class="ref-year">2011</span>
        </li>
        <li id="reference-to-book-2">
            <cite>Physical Book 2</cite>
            <span class="ref-year">2012</span>
            <span class="ref-author">Example, Sandy</span>
        </li>
    </ul>
</div><!-- references -->

to look like this:

<div id="references">
    <h2>References</h2>

    <ul>
        <li id="reference-to-book-2">
            <span class="ref-author">Example, Sandy</span>
            <cite>Physical Book 2</cite>
            <span class="ref-year">2012</span>
        </li>
        <li id="reference-to-book-1">
            <span class="ref-author">Sample, Peter</span>
            <cite><a href="http://example.org/">Online Book 1</a></cite>
            <span class="ref-year">2011</span>
        </li>
    </ul>
</div><!-- references -->

The criteria being:

  1. The <li> elements containing the entries are sorted alphabetically according to author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
  2. Within each entry, the elements are in the following order:
    1. line matches class="ref-author"
    2. line matches <cite>
    3. line matches class="ref-year"
    4. There are more elements (e.g. class="publisher") I omitted from the example for purposes of clarity; also, I run across this sorting problem very often. So it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
  3. The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
  4. The result file should have each line unchanged except for its position in the file (this point added because the XML parsers I tried broke my indentation).

I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.

Borodin
user66554
  • Your sample looks like XHTML. Is that always the case? It would be best to process this data using an XML parser if possible – Borodin May 21 '15 at 13:03
  • If it is XHTML, I would write a Perl script using [XML::LibXML](http://search.cpan.org/dist/XML-LibXML/) to read and write the document. Start out with something like [html2html](http://www.win.tue.nl/~rp/bin/html2html) and insert code that manipulates the DOM tree using the XML::LibXML API. Any self-respecting language has a mature XML library, so you don't need to use Perl, but it's what I'm most familiar with for this task. – reinierpost May 21 '15 at 13:14
  • @Borodin it's XHTML in this particular case, but I have had this sorting problem with different formats, too (and they aren't necessarily XML). – user66554 May 21 '15 at 15:33
  • If you really want help with this then please show the work you have done to solve the problem yourself, and describe the problems that you are having with doing the job yourself. You must also show representative samples of all the different types of data that you must handle, and ***explain clearly all restrictions*** that a useful solution must follow – Borodin May 21 '15 at 16:00
  • @user66554: Your description *appeared* clear, and we had no reason to guess that what you had written was wildly incomplete. For instance, *“Oh, and by the way, this data may not be HTML at all”* would have been useful. But *still* all you have said is that the data file may contain anything, and you expect help on that basis? – Borodin May 21 '15 at 16:16
  • If you want to delete the question then delete it, no need for permission from us – EdChum May 22 '15 at 08:25
  • @EdChum I can't delete the question because it says "there already are answers." – user66554 May 22 '15 at 09:07
  • Ah, yes I'd leave it as it is then – EdChum May 22 '15 at 09:12

2 Answers

I'd use Mojo::DOM (part of Mojolicious) for this. You might need to tidy up the serialized markup afterwards.

use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::File 'path';    # replaces Mojo::Util::slurp, which newer Mojolicious releases no longer provide

my $file = shift or die "I need a file\n";
my $xml  = path($file)->slurp;

my $dom = Mojo::DOM->new($xml);

my $list = $dom->at('#references ul');

# Search below the reference list only, so other <li> elements stay untouched
my $refs = $list->find('li');

$refs->each('remove');

$refs = $refs->sort( sub { $a->at('.ref-author')->text cmp $b->at('.ref-author')->text } );

for my $ref ( @{ $refs } ){

    my $new = Mojo::DOM->new('<li></li>')->at('li');
    $new->attr(id => $ref->attr('id'));    # keep the entry's id
    $new->append_content($ref->at('.ref-author'));
    $new->append_content($ref->at('cite'));
    $new->append_content($ref->at('.ref-year'));

    #KEEP APPENDING IN THE ORDER YOU WANT THEM

    $list->append_content($new);

}

say $dom;
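If every line has to come out untouched apart from its position (criterion 4 in the question), a line-based filter is an alternative to going through a DOM. What follows is only a minimal sketch using core Perl, under several assumptions that are not stated in the question: each child element sits on exactly one line, `<li id="` and `</li>` each sit on a line of their own, and the entries follow one another without other lines in between. The names `@order`, `rank`, `reorder`, `author`, `emit_sorted` and `sort_references` are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns in the order their lines should appear inside each entry.
# Extend as needed, e.g. with qr/class="publisher"/.
my @order = (qr/class="ref-author"/, qr/<cite>/, qr/class="ref-year"/);

# Position of a line in @order; lines matching no pattern sort last.
sub rank {
    my ($line) = @_;
    for my $i (0 .. $#order) {
        return $i if $line =~ $order[$i];
    }
    return scalar @order;
}

# Reorder the lines of one entry by rank, keeping ties in original order.
sub reorder {
    my @body = @_;
    my @rank = map { rank($_) } @body;
    return @body[ sort { $rank[$a] <=> $rank[$b] || $a <=> $b } 0 .. $#body ];
}

# Sort key of an entry: its author line without tags, trimmed, lower-cased.
sub author {
    my ($entry) = @_;
    my ($line) = grep { /class="ref-author"/ } @{ $entry->{body} };
    return '' unless defined $line;
    (my $text = $line) =~ s/<[^>]*>//g;
    $text =~ s/^\s+|\s+$//g;
    return lc $text;
}

# Emit the collected entries sorted by author, each with reordered lines.
sub emit_sorted {
    return map { ($_->{open}, reorder(@{ $_->{body} }), $_->{close}) }
           sort { author($a) cmp author($b) } @_;
}

sub sort_references {
    my (@out, @entries, $entry);
    my $in_refs = 0;
    for my $line (@_) {
        $in_refs = 1 if $line =~ /id="references"/;
        if ($entry) {
            if ($line =~ m{</li>}) {                 # entry ends: store it
                $entry->{close} = $line;
                push @entries, $entry;
                undef $entry;
            }
            else {
                push @{ $entry->{body} }, $line;
            }
        }
        elsif ($in_refs && $line =~ /<li id="/) {    # entry starts
            $entry = { open => $line, body => [] };
        }
        else {    # other lines pass through, after any pending entries
            push @out, emit_sorted(splice @entries), $line;
        }
        $in_refs = 0 if $line =~ /-- references --/;
    }
    push @out, emit_sorted(@entries);
    push @out, $entry->{open}, @{ $entry->{body} } if $entry;    # unterminated entry
    return @out;
}

my $sample = <<'HTML';
<div id="references">
    <h2>References</h2>

    <ul>
        <li id="reference-to-book-1">
            <span class="ref-author">Sample, Peter</span>
            <cite><a href="http://example.org/">Online Book 1</a></cite>
            <span class="ref-year">2011</span>
        </li>
        <li id="reference-to-book-2">
            <cite>Physical Book 2</cite>
            <span class="ref-year">2012</span>
            <span class="ref-author">Example, Sandy</span>
        </li>
    </ul>
</div><!-- references -->
HTML

# With file arguments, process those files; otherwise run on the sample.
my @input  = @ARGV ? (<>) : split(/^/, $sample);
my $output = join '', sort_references(@input);
print $output;
```

Run without arguments it prints the sorted sample from the question; run with a file name it prints the processed file to standard output. Lines that match none of the patterns in `@order` stay inside their entry and end up after the matched ones, in their original relative order.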
LLFourn
  • You missed the part where the OP said that he didn't want to use an HTML parser, and that the data isn't always HTML anyway. Yeah, I know right? – Borodin May 21 '15 at 16:10
I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out.

Here's an example of how it might work:

use strict;
use warnings;

use XML::LibXML;

my $dom = XML::LibXML->load_html(IO  => \*DATA);

my ($refs) = $dom->findnodes('/html/body//div[@id="references"]/ul');

my @refs = $refs->findnodes('li');

$refs->removeChild($_) for @refs;

$refs->appendChild($_) for sort {
  my ($aa, $bb) = map { $_->findvalue('span[@class="ref-author"]') } $a, $b;
  $aa cmp $bb;
} @refs;

print $dom, "\n";


__DATA__
<html>
  <head>
  <title>Title</title>
  </head>
  <body>
    <div id="references">
        <h2>References</h2>

        <ul>
            <li id="reference-to-book-1">
                <span class="ref-author">Sample, Peter</span>
                <cite><a href="http://example.org/">Online Book 1</a></cite>
                <span class="ref-year">2011</span>
            </li>
            <li id="reference-to-book-2">
                <cite>Physical Book 2</cite>
                <span class="ref-year">2012</span>
                <span class="ref-author">Example, Sandy</span>
            </li>
        </ul>
    </div><!-- references -->
  </body>
</html>

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Title</title></head><body>
    <div id="references">
        <h2>References</h2>

        <ul>

        <li id="reference-to-book-2">
                <cite>Physical Book 2</cite>
                <span class="ref-year">2012</span>
                <span class="ref-author">Example, Sandy</span>
            </li><li id="reference-to-book-1">
                <span class="ref-author">Sample, Peter</span>
                <cite><a href="http://example.org/">Online Book 1</a></cite>
                <span class="ref-year">2011</span>
            </li></ul></div><!-- references -->
  </body></html>
Borodin
  • @user66554: If you had mentioned that in your original question I wouldn't have wasted time creating a solution that was of no use to you. I really should have insisted in the first place that you at least make an attempt at your own solution, but my guess was that you were well-intentioned. Your arrogance that is surfacing only now makes me wish I hadn't even bothered – Borodin May 21 '15 at 15:55
  • @user66554: No, it is you behaving as if you had contracted me to do the work that reveals your arrogance. *“No parsers, please”* would be rude even to an employee. Your problem description is shoddy, and if the problem is as compound as it now appears then you are never likely to get a working solution on the basis of what you wrote in the question. – Borodin May 21 '15 at 16:08
  • @user66554: I have said a lot more than that, yet you pick on the one comment that doesn't go into specifics. You are alarmingly entitled and I want to have no more to do with you – Borodin May 21 '15 at 16:37