
I am using the uniq function exported by the List::MoreUtils module to find the unique elements in an array. However, I want it to find the unique elements in a case-insensitive way. How can I do that?

I have dumped the resulting array using Data::Dumper:

#! /usr/bin/perl

use strict;
use warnings;
use Data::Dumper qw(Dumper);
use List::MoreUtils qw(uniq);
use feature "say";

my @elements=<array is formed here>;

my @words=uniq @elements;

say Dumper \@words;

Output:

$VAR1 = [
          'John',
          'john',
          'JohN',
          'JOHN',
          'JoHn',
          'john john'
        ];

Expected output should be: john, john john

Only two elements should remain; the rest should be filtered out because they are the same word and differ only in case.

How can I remove the duplicate elements ignoring the case?

Neon Flash

2 Answers


Lowercase the elements with lc in a map statement before passing them to uniq:

my @uniq_no_case = uniq map lc, @elements;

List::MoreUtils' uniq is case sensitive because it relies on hash keys to dedupe, and hash keys are case sensitive too. The code for uniq looks like this:

sub uniq {
    my %seen = ();
    grep { not $seen{$_}++ } @_;
}

If you want to use this sub directly in your own code, you could incorporate lc in there:

sub uniq_no_case {
    my %seen = ();
    grep { not $seen{$_}++ } map lc, @_;
}

Explanation of how this works:

@_ contains the arguments to the subroutine, and they are fed to a grep statement. grep returns every element for which the code block evaluates to true. The code block has a few finer points:

  • $seen{$_}++ returns 0 the first time an element is seen. The value is still incremented to 1, but only after the old value is returned (as opposed to ++$seen{$_}, which would increment first and then return).
  • By negating the result of the increment, we get true for the first occurrence of a key and false for every later occurrence of the same key. Hence, the list is deduped.
  • grep as the last statement in the sub will return a list, which in turn is returned by the sub.
  • map lc, @_ simply applies the lc function to all elements in @_.
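
The points above can be checked with a minimal sketch run against the sample data from the question. It defines uniq_no_case locally so that List::MoreUtils does not need to be installed; the variable names $first and $second are only for illustration.

```perl
use strict;
use warnings;

# Same idea as the sub above: lowercase first, then keep first occurrences.
sub uniq_no_case {
    my %seen;
    return grep { not $seen{$_}++ } map { lc $_ } @_;
}

my @elements = ('John', 'john', 'JohN', 'JOHN', 'JoHn', 'john john');
my @words    = uniq_no_case(@elements);

print join(', ', @words), "\n";    # john, john john

# Post-increment returns the old value, so the first lookup is false (0).
my %seen;
my $first  = $seen{john}++;    # 0: key not seen before
my $second = $seen{john}++;    # 1: key already seen
print "$first $second\n";      # 0 1
```

Note that the result is all lowercase, because the elements are lowercased before they reach grep.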
TLP
  • And this is the same uniq function exported by List::MoreUtils module? – Neon Flash Oct 25 '12 at 17:13
  • Indeed it is. Although since the sub is so simple and short, you can just copy paste it, and save yourself loading the module. – TLP Oct 25 '12 at 17:15
  • Thanks. I will understand the subroutine and then use it directly :) Can you explain the grep syntax a little? The hash, %seen is using the elements of the array as a key and checking for their occurrence. But, I am not sure, how this entire syntax works. – Neon Flash Oct 25 '12 at 17:22
  • @NeonFlash Added an explanation in my answer. It is a fairly cleverly written sub, I think. – TLP Oct 25 '12 at 17:30
  • @NeonFlash If this answer solves your problem to your satisfaction, don't forget to accept it by clicking the checkmark. – TLP Oct 25 '12 at 17:57
  • This version of the syntax is slightly more malleable: my @uniq_no_case = uniq map {lc $_} @elements; – HoldOffHunger Jun 08 '17 at 20:45
  • Having this line instead will preserve the case of the array: `grep { not $seen{lc $_}++ } @_;` – CJ7 Aug 11 '22 at 02:10

Use a hash to keep track of the words you have already seen, but also normalize them for upper/lower case:

my %seen;
my @unique;
for my $w (@words) {
  next if $seen{lc($w)}++;
  push(@unique, $w);
}
# @unique has the unique words

Note that this will preserve the case of the original words.
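
Run against the question's sample data, the loop keeps the first spelling it encounters. This is a minimal sketch; the second half writes the same filter as a single grep for comparison, and the names %seen_grep and @unique_grep are illustrative.

```perl
use strict;
use warnings;

my @words = ('John', 'john', 'JohN', 'JOHN', 'JoHn', 'john john');

# Loop form: the hash key is lowercased, the stored word is not.
my %seen;
my @unique;
for my $w (@words) {
    next if $seen{ lc $w }++;    # skip if the lowercase form was seen before
    push @unique, $w;
}
print join(', ', @unique), "\n";         # John, john john

# Equivalent grep form: same lowercase key, original elements returned.
my %seen_grep;
my @unique_grep = grep { not $seen_grep{ lc $_ }++ } @words;
print join(', ', @unique_grep), "\n";    # John, john john
```

If all-lowercase output is wanted instead, as in the question's expected output, apply lc to the result or to the input.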

UPDATE: As noted in the comments, it's not clear exactly what the OP needs, but I wrote the solution this way to illustrate a general technique for selecting unique representatives from a list under some "equivalence relation." In this case the relation is: word $a is equivalent to word $b if and only if lc($a) eq lc($b).

Most equivalence relations can be expressed this way: the relation is defined by a classifier function f() such that $a is equivalent to $b if and only if f($a) eq f($b). For instance, if we want to say that two words are equivalent if they have the same length, then f() would be length().

So now you might see why I wrote the algorithm this way - the classifier function may not produce values that are part of the original list. In the case of f = length, we want to select words, but f of a word is a number.
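
The classifier idea can be sketched as a small generic helper. The name uniq_by_classifier and the sample word list are hypothetical and exist only for this illustration; the helper takes the classifier as a code reference and keeps the first element of each equivalence class.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first element of each equivalence class,
# where the class of an element is the string returned by $f->($element).
sub uniq_by_classifier {
    my ($f, @items) = @_;
    my %seen;
    return grep { not $seen{ $f->($_) }++ } @items;
}

my @words = ('John', 'john', 'apple', 'JOHN', 'pear', 'kiwi', 'Apple');

# f = lc: words are equivalent if they match ignoring case.
my @by_case = uniq_by_classifier(sub { lc $_[0] }, @words);

# f = length: words are equivalent if they have the same length.
my @by_length = uniq_by_classifier(sub { length $_[0] }, @words);

print join(', ', @by_case), "\n";      # John, apple, pear, kiwi
print join(', ', @by_length), "\n";    # John, apple
```

With f = length the hash keys are numbers, yet the returned elements are still the original words, which is why the classifier is applied only inside the hash lookup.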

ErikR
  • Using `lc` inside the hash access is much nicer than the other solution given, as it preserves the (first matching) case from the input. – LeoNerd Oct 26 '12 at 11:54
  • @LeoNerd What on earth are you talking about? There is no difference between using lc before and inside the hash. – TLP Oct 26 '12 at 13:08
  • I meant, as opposed to the map lc ... solution given in the other answer. This one is nicer as it returns values in their original case, not in forced-lower case. – LeoNerd Oct 26 '12 at 13:36
  • Aha, I see now. However, that's not what the OP requested. Besides, who's to say that the original case is desirable? Usually, names are ucfirst(lc). – TLP Oct 26 '12 at 14:14
  • I'm sure that the uniq() library has more support and efficiency than this version. – HoldOffHunger Jun 08 '17 at 20:46