What do you call the property of a list that describes the degree to which it contains duplicates?

Question

I have a function that selects Cartesian products of lists such that the number of duplicate elements is highest:

import Data.List (nub)

f :: Eq a => [[a]] -> [[a]]
f xss = filter ((==) minLength . length . nub) cartProd
  where
    minLength = minimum (map (length . nub) cartProd)
    cartProd = sequence xss

For example:

*Main> f [[1,2,3],[4],[1,5]]
[[1,4,1]]

But:

*Main> sequence [[1,2,3],[4],[1,5]]
[[1,4,1],[1,4,5],[2,4,1],[2,4,5],[3,4,1],[3,4,5]]

Is there a name for the property that the result of my function f has?

I'm not sure, but in this case you could also probably say something about "cardinality", e.g. "the sequence whose alphabet has lowest cardinality". But I guess that's worse than "contains the most duplicates"... — jberryman, Feb 13 '17 at 19:07
I wrote an answer proposing "multiplicity", but then realized I don't really understand the question. (e.g. does `[1,4,1,4]` have more "duplicates" than `[1,1,1,4]`? Why (not)?) Even so you might enjoy reading about that word on e.g. Wikipedia to see if it sounds relevant to you. — Daniel Wagner, Feb 13 '17 at 19:10
@jberryman In a way that's actually better, since it positively emphasizes the property I am after. When you said "alphabet" I also realized that one could speak in terms of operations on the list. I might want to use the term "projection". — marc_r, Feb 13 '17 at 20:51
@DanielWagner "Multiplicity" sounds catchy. If I interpret my lists as multisets, it's a very close fit I think. `[1,4,1,4]` is as good as `[1,1,1,4]` (their `nub`s have the same length), so I would be after something like "total multiplicity". More specifically about the why: I am using this in the generation of certain graphs in which I want to merge as many identical nodes as possible. — marc_r, Feb 13 '17 at 21:01
"homogeneity" ? Or entropy, like defined by Shannon, if you consider a list as the possible values of a variable. — Jean-Baptiste Potonnier, Feb 14 '17 at 00:30
@Jean-BaptistePotonnier `[1,4,1,4]` and `[1,1,1,4]` would have different entropies, although I consider them equal. — marc_r, Feb 14 '17 at 14:14

Daniel Wagner · Accepted Answer · 2017-02-13T23:05:04.453

I believe your function is computing a minimum set cover:

Given a set of elements { 1 , 2 , . . . , n } (called the universe) and a collection S of sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe.

In your case, n is length xss. There is one set in S for each distinct element x of concat xss, namely the set { i | x `elem` (xss !! i) } of all indices that x occurs in. The minimum set cover then tells you which x to choose from each list in xss (sometimes giving you multiple choices; any choice will produce the same final nubbed length).

Here is a worked example for your [[1,2,3],[4],[1,5]]:

The universe is {1,2,3}.

There are five sets in the collection S; I'll name them S_1 through S_5:

S_1 = {1,3} because the first and third lists contain 1.
S_2 = {1} because the first list contains 2.
S_3 = {1} because the first list contains 3.
S_4 = {2} because the second list contains 4.
S_5 = {3} because the third list contains 5.

A minimum set cover for this is {S_1, S_4}. Because this is a set cover, this means every list contains either 1 or 4. Because it is minimal, no other choice of sets produces a smaller final collection of values. So, we can choose either 1 or 4 from each list to produce a final answer. As it happens, no list contains both 1 and 4 so there is only one choice, namely [1,4,1].

Brilliant. From the Wikipedia link I understand that my function is actually solving the **hitting set problem** which is achieved by interchanging sets and universe from the minimum-set-cover problem. Which is exactly what you're doing here. — marc_r, Feb 13 '17 at 23:51

What do you call the property of a list that describes the degree to which it contains duplicates?

1 Answers1