
I'm a weak Perl user (and manipulator of arrays), and this problem is stumping me. Hope someone can help!

I have a source file with the following type of data (greatly simplified):

URL: 22489196
Keywords: Ball, Harga, Call, Dall, Eall, Jarga, Fall

URL: 22493265
Keywords: Hall, Iall, Yarga, Jall, Zarga, Kall

The words interrupting the alpha order (Harga, etc.) are "qualifiers". The end result I need is:

22489196

Ball--Harga
Call
Dall
Eall--Jarga
Fall

22493265

Hall
Iall--Yarga
Jall--Zarga
Kall

I've tried various "for" loops, pushing the terms into a second array and shifting the original array on conditional concatenation of its terms, but I still end up with missing or extra terms. Can anyone suggest how this might be done? MANY THANKS in advance!

ADDED: here's one iteration of part of my messy code:

while (<FILE>) {

    if (/URL\:/) {

        print "$_\n";
    }

    if (/Keywords\: /) {

        s/Keywords\: //;
        chomp();

        my @terms    = split ', ', $_;
        my @bakterms = reverse @terms;
        my $noTerms  = @terms;
        my $IzItOdd  = $noTerms%2;
        #my $ctr = $noTerms++;

        for ($i = 0; $i <= $#bakterms; $i++){

            my $j = $i+1;

            if ($j <= $#bakterms) {

                my $one = $bakterms[$i];
                my $two = $bakterms[$j];

                if ($two gt $one) { # i.e., if $two is alphabetically AFTER $one

                    push @ary3, $bakterms[$i];
                    $disarry = 1;
                    my $interloper = $bakterms[$j+1].= "--" . $two;
                    push @ary3, $interloper;
                    shift @bakterms;
                    #$ctr--;
                    shift(@bakterms);
                    #$ctr--;
                }
                else {

                    push @ary3, $bakterms[$i];
                    #shift(@bakterms);
                    shift @bakterms;
                    $disarry = 0;
                }
            }
        }
        @ary3 = sort @ary3;

        foreach my $term (@ary3) {

            print "** $term\n";
        }

        @ary3 = ();
        print"\n";
    }
}
exit 0;
Pavel Vlasov
  • Please show some code that you have tried. Why are there dashes between some of the words in the output? – simbabque Sep 21 '12 at 15:31
  • Can't you define "qualifiers" in a better way (list, pattern, ..) than 'out of order'? How would you deal with "..., Hall, Harga, Iall, ..."? – Ekkehard.Horner Sep 21 '12 at 15:36
  • @Ekkehard.Horner: there are 2.286 possible qualifiers, all of which are simply "extensions" to the "base term" (e.g. Jall can occur alone or "extended" by Zarga). The "Hall, Harga, Iall" pattern would statistically be rare, so I could manually inspect the results for "false positives". Does this clarify a bit? – user1689248 Sep 21 '12 at 15:49
  • @simbabque: the dashes in the output separate the "base term" from the "qualifier"--does that help? – user1689248 Sep 21 '12 at 16:16
  • Can qualifiers test positive for `$qualifier=~/[A-Z]arga/` ? – Jean Sep 21 '12 at 16:16
  • @Jean: I'm afraid I don't understand your question. Here's a sample of actual qualifiers: Leucine, Leukotriene B4, Life, Light, Listeria, Long-Term Care, Lung – user1689248 Sep 21 '12 at 16:20
  • Out of curiosity: How did you come up with your example data? – simbabque Sep 21 '12 at 17:03
  • @simbabque, it was a VAST simplification of real data. Here's part of some real data: "Adaptor Proteins, Signal Transducing, Adolescent, Adult, Aged, Aged, 80 and over, ...". Here, the output should have "Adaptor Proteins--Signal Transducing" and (if at all possible, later in the output) "Aged" followed by "Aged--80 and over". -- THANKS! – user1689248 Sep 21 '12 at 17:16

1 Answer


Well, "Harga" doesn't interrupt alphabetical order, "Call" does. So the qualifier is actually the word before the one that interrupts alphabetical order.

# Sample keyword line, with the "Keywords: " prefix already stripped.
my $keywords = 'Ball, Harga, Call, Dall, Eall, Jarga, Fall';
my @keywords = split /\s*,\s*/, $keywords;
my $prev_keyword = '';
while (@keywords) {
    my $keyword = shift(@keywords);

    my $qualifier;
    if (@keywords >= 1 && $keyword eq $prev_keyword) {
       $qualifier = shift(@keywords);
    }
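    # Look-ahead: the next word is a qualifier if the word after it falls back
    # in alphabetical order. Skip this when the next word just repeats $keyword
    # (e.g. "Mice, Mice, Inbred"), so the bare term is printed first.
    elsif (@keywords >= 2 && $keywords[0] ne $keyword && $keywords[0] gt $keywords[1]) {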
       $qualifier = shift(@keywords);
    }

    if (defined($qualifier)) {
       print("$keyword--$qualifier\n");
    } else {
       print("$keyword\n");
    }

    $prev_keyword = $keyword;
}
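
For completeness, here is a rough sketch of how the loop above might be wrapped up to process the whole source format from the question. The sub name `pair_keywords` and the inline `$input` sample are just placeholders for illustration (a real script would read from a file handle), and the `ne $keyword` guard is an assumption about how repeated terms such as "Mice, Mice, Inbred" should be treated.

```perl
use strict;
use warnings;

# Hypothetical helper: turns one comma-separated keyword string into
# a list of "Base" or "Base--Qualifier" entries.
sub pair_keywords {
    my ($line) = @_;
    my @keywords = split /\s*,\s*/, $line;
    my @out;
    my $prev = '';
    while (@keywords) {
        my $keyword = shift(@keywords);
        my $qualifier;
        if (@keywords >= 1 && $keyword eq $prev) {
            # Repeated base term: whatever follows is its qualifier.
            $qualifier = shift(@keywords);
        }
        elsif (@keywords >= 2
            && $keywords[0] ne $keyword
            && $keywords[0] gt $keywords[1]) {
            # The word after the next one falls back in alphabetical
            # order, so the next word is treated as a qualifier.
            $qualifier = shift(@keywords);
        }
        push @out, defined($qualifier) ? "$keyword--$qualifier" : $keyword;
        $prev = $keyword;
    }
    return @out;
}

# Sample input in the question's format (stand-in for the real file).
my $input = <<'END';
URL: 22489196
Keywords: Ball, Harga, Call, Dall, Eall, Jarga, Fall

URL: 22493265
Keywords: Hall, Iall, Yarga, Jall, Zarga, Kall
END

my $report = '';
for my $line (split /\n/, $input) {
    if ($line =~ /^URL:\s*(\S+)/) {
        $report .= "$1\n\n";                     # ID, then a blank line
    }
    elsif ($line =~ /^Keywords:\s*(.*\S)/) {
        $report .= join('', map { "$_\n" } pair_keywords($1)) . "\n";
    }
}
print $report;
```

Note that a qualifier on the very last keyword of a line cannot be detected this way, since there is no following word to compare against.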
ikegami
  • @ikegami, that gets MUCH closer to what I need--THANK YOU! Is there a way to deal with instances like "..., Mice, Mice, Inbred, ..." where "Mice" is one base term and "Mice, Inbred" is another qualified term that follows immediately in the list? – user1689248 Sep 21 '12 at 16:57
  • @ikegami: sorry, should have clarified: output should be "Mice", then "Mice--Inbred". THANK YOU! – user1689248 Sep 21 '12 at 17:07
  • Updated. Could have used a look-ahead for this too, but I thought a look-behind would be more appropriate. – ikegami Sep 21 '12 at 17:20
  • @ikegami: Thank you VERY MUCH for this. I'll still have to examine the data more closely, but this code is a HUGE help! – user1689248 Sep 21 '12 at 17:41
  • @ikegami: It's much simpler than the mess I'd come up with, and gives me output much closer to what I need, so I'll happily give it a check. Unfortunately, I've got to examine the frequency of the qualifiers in my data before proceeding. You've been a HUGE help!
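
The look-ahead alternative mentioned in the comments could be sketched roughly as follows. This is only one possible reading of that remark, not the answerer's code: the sub name `pair_keywords_lookahead` is made up for illustration, and it assumes a repeated term is always followed by its qualifier when at least one more word remains.

```perl
use strict;
use warnings;

# Hypothetical look-ahead variant: instead of remembering the previous
# keyword, peek at the next one to spot "Mice, Mice, Inbred" patterns.
sub pair_keywords_lookahead {
    my ($line) = @_;
    my @keywords = split /\s*,\s*/, $line;
    my @out;
    while (@keywords) {
        my $keyword = shift(@keywords);
        if (@keywords >= 2 && $keywords[0] eq $keyword) {
            # Next word repeats this one: emit the bare term, drop the
            # repeat, and attach the word after it as the qualifier.
            shift(@keywords);
            my $qualifier = shift(@keywords);
            push @out, $keyword, "$keyword--$qualifier";
        }
        elsif (@keywords >= 2 && $keywords[0] gt $keywords[1]) {
            # The word after the next one falls back in alphabetical
            # order, so the next word is treated as a qualifier.
            my $qualifier = shift(@keywords);
            push @out, "$keyword--$qualifier";
        }
        else {
            push @out, $keyword;
        }
    }
    return @out;
}

# Real-data fragment quoted in the comments on the question.
print "$_\n" for pair_keywords_lookahead(
    'Adaptor Proteins, Signal Transducing, Adolescent, Adult, Aged, Aged, 80 and over'
);
```

On that fragment this yields "Adaptor Proteins--Signal Transducing", "Adolescent", "Adult", "Aged" and "Aged--80 and over", which is the output asked for in the comments.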