Case Insensitive Unique Array Elements in Perl

Question

I am using the uniq function exported by the List::MoreUtils module to find the unique elements in an array. However, I want it to find the unique elements in a case-insensitive way. How can I do that?

I have dumped the contents of the array using Data::Dumper:

#! /usr/bin/perl

use strict;
use warnings;
use Data::Dumper qw(Dumper);
use List::MoreUtils qw(uniq);
use feature "say";

my @elements=<array is formed here>;

my @words=uniq @elements;

say Dumper \@words;

Output:

$VAR1 = [
          'John',
          'john',
          'JohN',
          'JOHN',
          'JoHn',
          'john john'
        ];

Expected output should be: john, john john

Only two elements should remain; the rest should be filtered out because they are the same word and differ only in case.

How can I remove the duplicate elements ignoring the case?

TLP · Accepted Answer · 2012-10-25T17:29:47.577

11

Lowercase the elements with lc in a map statement before passing them to uniq:

my @uniq_no_case = uniq map lc, @elements;

The reason List::MoreUtils' uniq is case sensitive is that it relies on the deduplicating behavior of hash keys, which are also case sensitive. The code for uniq looks like this:

sub uniq {
    my %seen = ();
    grep { not $seen{$_}++ } @_;
}

If you want to use this sub directly in your own code, you could incorporate lc in there:

sub uniq_no_case {
    my %seen = ();
    grep { not $seen{$_}++ } map lc, @_;
}

Explanation of how this works:

@_ contains the arguments to the subroutine, and they are fed to a grep statement. grep returns every element for which the code block evaluates to true. The code block has a few finer points:

$seen{$_}++ returns 0 the first time an element is seen. The value is still incremented to 1, but only after the old value has been returned (as opposed to ++$seen{$_}, which would increment first and then return).
By negating the result of the post-increment, we get true for the first occurrence of a key and false for every later occurrence of that key. Hence, the list is deduped.
grep, as the last statement in the sub, returns a list, which in turn is returned by the sub.
map lc, @_ simply applies the lc function to all elements in @_.
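
To make the points above concrete, here is a minimal sketch that traces what the post-increment returns for each element and then runs uniq_no_case on the sample data from the question. The @trace array and the $before variable are illustrative names that do not appear in the original code.

```perl
use strict;
use warnings;

# Sample data taken from the question.
my @elements = ('John', 'john', 'JohN', 'JOHN', 'JoHn', 'john john');

# Trace the value that the post-increment hands back for each lowercased word.
my %seen;
my @trace;
for my $word (map { lc $_ } @elements) {
    my $before = $seen{$word}++;    # old value is returned, then incremented
    push @trace, "$word: returned $before, now $seen{$word}";
}
print "$_\n" for @trace;

# The sub from the answer: grep keeps an element only when the
# post-increment returned 0, i.e. on the first sighting of that key.
sub uniq_no_case {
    my %seen;
    return grep { not $seen{$_}++ } map { lc $_ } @_;
}

my @unique = uniq_no_case(@elements);
print join(', ', @unique), "\n";    # john, john john
```

Only the lines of the trace that report "returned 0" correspond to elements that survive the grep.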

edited Oct 25 '12 at 17:29

answered Oct 25 '12 at 17:09

TLP


And this is the same uniq function exported by List::MoreUtils module? – Neon Flash Oct 25 '12 at 17:13
Indeed it is. Although since the sub is so simple and short, you can just copy paste it, and save yourself loading the module. – TLP Oct 25 '12 at 17:15
Thanks. I will understand the subroutine and then use it directly :) Can you explain the grep syntax a little? The hash, %seen is using the elements of the array as a key and checking for their occurrence. But, I am not sure, how this entire syntax works. – Neon Flash Oct 25 '12 at 17:22
@NeonFlash Added an explanation in my answer. It is a fairly cleverly written sub, I think. – TLP Oct 25 '12 at 17:30
@NeonFlash If this answer solves your problem to your satisfaction, don't forget to accept it by clicking the checkmark. – TLP Oct 25 '12 at 17:57
This version of the syntax is slightly more malleable: my @uniq_no_case = uniq map {lc $_} @elements; – HoldOffHunger Jun 08 '17 at 20:45
Having this line instead will preserve the case of the array: `grep { not $seen{lc $_}++ } @_;` – CJ7 Aug 11 '22 at 02:10

ErikR · Answer 2 · 2012-10-26T20:25:22.193

6

Use a hash to keep track of the words you have already seen, but also normalize them for upper/lower case:

my %seen;
my @unique;
for my $w (@words) {
  next if $seen{lc($w)}++;
  push(@unique, $w);
}
# @unique has the unique words

Note that this will preserve the case of the original words.

UPDATE: As noted in the comments, it's not clear exactly what the OP needs, but I wrote the solution this way to illustrate a general technique for selecting unique representatives from a list under some "equivalence relation." In this case the equivalence relation is: word $a is equivalent to word $b if and only if lc($a) eq lc($b).

Most equivalence relations can be expressed in this way; that is, the relation is defined by a classifier function f() such that $a is equivalent to $b if and only if f($a) eq f($b). For instance, if we want to say that two words are equivalent if they have the same length, then f() would be length().

So now you might see why I wrote the algorithm this way - the classifier function may not produce values that are part of the original list. In the case of f = length, we want to select words, but f of a word is a number.
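
The classifier idea can be sketched as a small helper that takes f() as a code reference. The name uniq_by_key is hypothetical (it is not part of List::MoreUtils), and the 'Anna' entry is added to the sample data purely for illustration.

```perl
use strict;
use warnings;

# uniq_by_key is a hypothetical helper, not a List::MoreUtils function.
# It keeps the first element of each equivalence class, where the class
# of an element is whatever the classifier $f returns for it.
sub uniq_by_key {
    my ($f, @items) = @_;
    my %seen;
    return grep { not $seen{ $f->($_) }++ } @items;
}

my @words = ('John', 'john', 'JOHN', 'john john', 'Anna');

# Classifier lc: words are equivalent if they match ignoring case.
my @by_case = uniq_by_key(sub { lc $_[0] }, @words);

# Classifier length: words are equivalent if they have the same length.
my @by_length = uniq_by_key(sub { length $_[0] }, @words);

print join(', ', @by_case), "\n";      # John, john john, Anna
print join(', ', @by_length), "\n";    # John, john john
```

Because the hash key is f() of the element while grep returns the element itself, the results keep their original case even when the classifier returns something that never appears in the list, such as a number.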

edited Oct 26 '12 at 20:25

answered Oct 25 '12 at 17:11

ErikR


Using `lc` inside the hash access is much nicer than the other solution given, as it preserves the (first matching) case from the input. – LeoNerd Oct 26 '12 at 11:54
@LeoNerd What on earth are you talking about? There is no difference between using lc before and inside the hash. – TLP Oct 26 '12 at 13:08
I meant, as opposed to the map lc ... solution given in the other answer. This one is nicer as it returns values in their original case, not in forced-lower case. – LeoNerd Oct 26 '12 at 13:36
Aha, I see now. However, that's not what the OP requested. Besides, who's to say that the original case is desirable? Usually, names are ucfirst(lc). – TLP Oct 26 '12 at 14:14
I'm sure that the uniq() library has more support and efficiency than this version. – HoldOffHunger Jun 08 '17 at 20:46
