11

I have a situation whereby I need to create tens of thousands of unique numbers. However, these numbers must be 9 digits long and cannot contain any 0s. My current approach is to generate 9 digits (1-9), concatenate them, and add the result to the list if it is not already there. E.g.

public void generateIdentifiers(int quantity)
{
    uniqueIdentifiers = new List<string>(quantity);
    while (this.uniqueIdentifiers.Count < quantity)
    {
        string id = string.Empty;
        id += random.Next(1,10);
        id += random.Next(1,10);
        id += random.Next(1,10);
        id += " ";
        id += random.Next(1,10);
        id += random.Next(1,10);
        id += random.Next(1,10);
        id += " ";
        id += random.Next(1,10);
        id += random.Next(1,10);
        id += random.Next(1,10);
        if (!this.uniqueIdentifiers.Contains(id))
        {
            this.uniqueIdentifiers.Add(id);
        }
    }
}

However, at about 400,000 the process really slows down as more and more of the generated numbers are duplicates. I am looking for a more efficient way to perform this process; any help would be really appreciated.

Edit: - I'm generating these - http://www.nhs.uk/NHSEngland/thenhs/records/Pages/thenhsnumber.aspx

Daniel Hilgarth
Eddie
  • 3
    You can use HashTable instead of a List – Haris Hasan Sep 15 '11 at 07:55
  • 4
    "Random" and "unique" in the same sentence triggers the "you are probably doing something wrong" alert. Would you mind explaining what you're after? – Jon Sep 15 '11 at 07:56
  • Why can't they contain zeroes? – Scott Sep 15 '11 at 07:56
  • 2
    [Describe the goal](http://catb.org/~esr/faqs/smart-questions.html#goal), not just the step. Don't succumb to [the XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – outis Sep 15 '11 at 08:11
  • 2
    The way you are doing it is rather inefficient. You are better off storing them as actual numbers, then just format them with the spaces when you need to display them. – slugster Sep 15 '11 at 11:38
  • 6
    An interesting thing about your question is that you have misidentified the cause of the problem. It does not slow down *as there are more collisions*. It slows down *as it gets more and more expensive to detect collisions*. The fact that there are more collisions is irrelevant; that's not at all what is causing the slowdown. What is causing the slowdown is you've chosen a completely unsuitable data structure to detect collisions. – Eric Lippert Sep 15 '11 at 15:07
  • 1
    To look at a different aspect of the problem: just how random do these numbers have to be? If an attacker can get their hands on a few of the numbers generated by this algorithm, is it a problem if they can deduce all the rest of the generated numbers? "Random" is only pseudo-random; you can easily deduce the full set of numbers generated by knowing a few of them. Is it a problem if one NHS number holder can deduce the other hundred-thousand-or-so numbers you generated that day? – Eric Lippert Sep 15 '11 at 15:11
  • 1
    You need to generate tens of thousands, yet you generate 400,000, at which point the process slows down. With only 9 digits and no 0, that means that you have only 900k unique numbers (unless my math is off and it's even less) - at that point you might as well generate all possible numbers into a list and just pick one. – Michael Stum Sep 15 '11 at 16:01
  • 1
    @Michael: There are 387 million nine-digit numbers containing only digits 1 through 9. How do you figure that there are only 900 thousand? – Eric Lippert Sep 15 '11 at 16:11
  • @Eric because I had a brain fart and had "1 Million" as a 10 digit number. – Michael Stum Sep 15 '11 at 16:25
  • @Eric: How can you programmatically deduce the full set by knowing a few of them? Can't there be more than 1 implementation that could generate the same shown numbers but different for numbers that haven't been shown yet? – Joan Venge Sep 15 '11 at 20:06
  • 5
    @Joan: You have to know the implementation details of Random, but that is easily reverse-engineered from the IL. Every number you get in sequence from the output gives you about 3 bits of information about what the internal state of Random was at the time that sequence was generated. If Random has, say, 64 bits of internal state, then you can determine the internal state with high probability with only around two dozen output numbers. Once you know the internal state, you know the rest of the numbers that were generated. – Eric Lippert Sep 15 '11 at 20:18
  • @Eric: Thanks Eric. I thought you meant, given any random number from any implementation without knowing the implementation, one could find out the rest of the numbers being generated after seeing a handful of them. I see what you mean now. – Joan Venge Sep 15 '11 at 21:23

10 Answers

16

As others have mentioned, use a HashSet<T> instead of a List<T>.
Furthermore, using StringBuilder instead of simple string operations will gain you another 25%. If you can use numbers instead of strings, you win, because it takes only a third to a quarter of the time.

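var random = new Random();
var quantity = 400000;
// HashSet<int>.Add ignores values that are already present, so no Contains check is needed.
var uniqueIdentifiers = new HashSet<int>();
while (uniqueIdentifiers.Count < quantity)
{
    // Build a nine-digit number one digit at a time; each digit is in 1..9, so no zeros occur.
    int i=0;
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    i = i*10 + random.Next(1,10);
    uniqueIdentifiers.Add(i);
}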

It takes about 270 ms on my machine for 400,000 numbers and about 700 ms for 1,000,000, and that is without any parallelism. Because it uses a HashSet<T> instead of a List<T>, this algorithm runs in O(n), i.e. the duration grows linearly, as long as the quantity stays small compared to the 387 million possible values so that collisions remain rare. 10,000,000 values therefore take about 7 seconds.

Daniel Hilgarth
  • This has cut ~60 minutes off the execution time of the program. Thank you. – Eddie Sep 15 '11 at 08:58
  • 1
    You are welcome. If you still need to have the numbers as strings in the end, put a `var result = uniqueIdentifiers.Select (i => i.ToString("### ### ###")).ToList();` after the while loop. This will double the execution time, but it avoids creating strings for duplicates that would be thrown away anyway. – Daniel Hilgarth Sep 15 '11 at 09:03
  • 2
    BTW: If you need more than 9 digits, you need to use `Int64` instead of `int`. – Daniel Hilgarth Sep 15 '11 at 09:09
  • 1
    I have a question: why is HashTable faster than List? – vietean Sep 15 '11 at 16:11
  • 1
    @vietean: Because in a HashSet/HashTable, the items are indexed, it is an O(1) operation to look up an item. In a list, you need to search through the whole list, i.e. it is a O(n) operation. In other words: It takes one logical operation to get an item out of a HashSet/HashTable, no matter how many items there are. – Daniel Hilgarth Sep 15 '11 at 16:58
  • You might be able to shave off an extra couple of ms by skipping the `Contains` check and just calling `Add`. – LukeH Sep 16 '11 at 11:48
  • Another (very) slight gain can be made by using base 9. Generate a number in the range [0, 9 ** 9) (`i = random.Next(0, 387420489)`), then use the base conversion algorithm, but stick to base 10 for the representation (decimal-coded nonary) and replace '0' digits with '9'. Basically, define ints d,f,j and replace each `i=...` line with `d=i%9; j+=f*(d==0?9:d); i/=9; f*=10;`. The result is held in `j`. Depending on the implementation of `Random`, this can reduce execution time by a few percent. – outis Sep 16 '11 at 20:42
  • There are implementations that use base 9, see below. They seemed to be slower. – Daniel Hilgarth Sep 17 '11 at 06:14
  • @Daniel: their slowness is likely due to their use of strings. Combining the base 9 conversion with your approach of using ints results in a slight speed increase, likely due to a slight reduction in the number of arithmetic operations (2 divisions, 2 multiplications, 1 addition and 1 comparison per digit (d, f and j will likely be register variables with optimization, cutting down on memory operations) for the base 9 conversion, compared to 1 multiplication, 1 addition, 1 method call and whatever `Random.Next` needs per digit). – outis Sep 20 '11 at 09:12
  • 1
    Just out of curiosity, is there any reason why you didn't do something like `Random.Next(100000000, 999999999)` to generate a 9-digit number? This fits perfectly within an integer, and I'd imagine it would be faster than doing the nine calls. – Mike Bailey Oct 04 '11 at 15:48
  • 1
    Because this would also potentially generate numbers with zeros in it. – Daniel Hilgarth Oct 04 '11 at 20:03
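
The single-draw base-9 idea from the comments above can be sketched roughly as follows. This is only an illustration under assumptions, not code posted in the thread: the class name `Base9Ids` and its method names are made up here. It draws one value in [0, 9^9) per identifier and writes each base-9 digit as a decimal digit, replacing 0 with 9, so the result never contains a zero.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative only.
public static class Base9Ids
{
    // 9^9: how many nine-digit numbers can be written with the digits 1..9.
    public const int Count = 387420489;

    // Maps a value in [0, 9^9) to a nine-digit int without any zero digit.
    // Each base-9 digit is emitted as a decimal digit, with 0 replaced by 9.
    public static int ToZeroFreeNumber(int value)
    {
        if (value < 0 || value >= Count)
            throw new ArgumentOutOfRangeException(nameof(value));

        int result = 0;
        int factor = 1;
        for (int position = 0; position < 9; position++)
        {
            int digit = value % 9;
            result += factor * (digit == 0 ? 9 : digit);
            value /= 9;
            factor *= 10;
        }
        return result;
    }

    // Draws random values until the set holds the requested number of ids.
    public static HashSet<int> Generate(int quantity, Random random)
    {
        if (quantity < 0 || quantity > Count)
            throw new ArgumentOutOfRangeException(nameof(quantity));

        var ids = new HashSet<int>();
        while (ids.Count < quantity)
        {
            // Add silently ignores duplicates.
            ids.Add(ToZeroFreeNumber(random.Next(0, Count)));
        }
        return ids;
    }
}
```

Calling `Base9Ids.Generate(400000, new Random())` would return 400,000 distinct nine-digit numbers. The mapping is one-to-one, so duplicates can only come from the random draw itself.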
4

This suggestion may or may not be popular.... it depends on people's perspective. Because you haven't been too specific about what you need them for, how often, or the exact number, I will suggest a brute force approach.

I would generate a hundred thousand numbers - shouldn't take very long at all, maybe a few seconds? Then use Parallel LINQ to do a Distinct() on them to eliminate duplicates. Then use another PLINQ query to run a regex against the remainder to eliminate any with zeroes in them. Then take the top x thousand. (PLINQ is brilliant for ripping through large tasks like this). If needed, rinse and repeat until you have enough for your needs.

On a decent machine it will just about take you longer to write this simple function than it will take to run it. I would also query why you have 400K entries to test when you state you actually need "tens of thousands"?
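
A rough sketch of that brute-force idea, under some assumptions: the class name `BruteForceIds` is made up, an arithmetic digit check stands in for the regex, and the candidates are drawn sequentially because `System.Random` is not thread-safe. Only the filtering and de-duplication run through PLINQ. The batch size is sized for quantities in the tens of thousands.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the names are illustrative only.
public static class BruteForceIds
{
    // True when none of the decimal digits of value is 0.
    public static bool HasNoZeroDigit(int value)
    {
        for (; value > 0; value /= 10)
        {
            if (value % 10 == 0) return false;
        }
        return true;
    }

    public static List<int> Generate(int quantity, Random random)
    {
        var accepted = new HashSet<int>();
        while (accepted.Count < quantity)
        {
            // Draw a batch of nine-digit candidates on a single thread.
            var candidates = new int[Math.Max(quantity, 1000) * 3];
            for (int i = 0; i < candidates.Length; i++)
            {
                candidates[i] = random.Next(111111111, 1000000000);
            }

            // Filter out zeros and duplicates within the batch in parallel.
            var batch = candidates.AsParallel()
                                  .Where(v => HasNoZeroDigit(v))
                                  .Distinct();

            // Merge into the accepted set; loop again if still short.
            accepted.UnionWith(batch);
        }
        return accepted.Take(quantity).ToList();
    }
}
```

Roughly 44% of the candidates in that range contain no zero, so one batch of three times the requested quantity is usually enough.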

slugster
  • Although I wouldn't agree with your approach completely, your suggestion of parallelism is the key point that everybody forgot about, as this is a perfect candidate for parallelisation. – Hassan Sep 15 '11 at 08:10
4

The trick here is that you only need ten thousand unique numbers. Theoretically you could have almost 9,0E+08 possibilities, but why care if you need far fewer?

Once you realize that you can cut down on the combinations that much then creating enough unique numbers is easy:

long[] numbers = { 1, 3, 5, 7 }; //note that we just take a few numbers, enough to create the number of combinations we might need
var list = (from i0 in numbers
            from i1 in numbers
            from i2 in numbers
            from i3 in numbers
            from i4 in numbers
            from i5 in numbers
            from i6 in numbers
            from i7 in numbers
            from i8 in numbers
            from i9 in numbers
            select i0 + i1 * 10 + i2 * 100 + i3 * 1000 + i4 * 10000 + i5 * 100000 + i6 * 1000000 + i7 * 10000000 + i8 * 100000000 + i9 * 1000000000).ToList();

This snippet creates a list of more than 1,000,000 valid unique numbers pretty much instantly.
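
A nine-digit variant of the same enumeration idea could look like the sketch below. It is an illustration under assumptions rather than the code above: the class name `DigitSubsetIds` is made up, and a loop replaces the nested `from` clauses. With the four digits 1, 3, 5 and 7 it yields 4^9 = 262,144 numbers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative only.
public static class DigitSubsetIds
{
    // Enumerates every nine-digit number whose digits all come from the given set.
    // The digits are assumed to be distinct values in 1..9, which makes every result unique.
    public static List<int> Generate(int[] digits)
    {
        var results = new List<int> { 0 };
        for (int position = 0; position < 9; position++)
        {
            // Append each allowed digit to every prefix built so far.
            var next = new List<int>(results.Count * digits.Length);
            foreach (int prefix in results)
            {
                foreach (int digit in digits)
                {
                    next.Add(prefix * 10 + digit);
                }
            }
            results = next;
        }
        return results;
    }
}
```

`DigitSubsetIds.Generate(new[] { 1, 3, 5, 7 })` returns the numbers in ascending order, so shuffle the list if the ids should not look sequential. Passing all nine digits would materialise 387 million ints, which is probably more memory than is worth spending here.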

InBetween
  • There are 387 million possibilities. Where is this figure of 900 million coming from? – Eric Lippert Sep 15 '11 at 16:14
  • @Eric Lippert: A fast and obviously wrong hack (999999999-111111111). I underestimated the amount of numbers with zeros :p. It's really 9^9 but I didn't have a calculator nearby. My point is still valid though. – InBetween Sep 15 '11 at 16:22
  • @InBetween: Can you please explain "you only need ten thousand unique numbers"? – Daniel Hilgarth Sep 15 '11 at 18:16
  • @Daniel Hilgarth: I quote: "I have a situation where by I need to create **tens of thousands** of unique numbers..." From that I am inclined to believe that 1,000,000+ unique Ids suits his needs. – InBetween Sep 15 '11 at 18:22
  • BTW: You ARE realizing that you are creating numbers with 10 digits? – Daniel Hilgarth Sep 15 '11 at 18:23
  • @InBetween: Thanks for the clarification. *tens of thousands* and *ten thousand* are quite different, that's why I was asking. :-) – Daniel Hilgarth Sep 15 '11 at 18:24
  • @Daniel Hilgarth: lol you are right :) duh! Anyway I think the intention of what I meant is clear, I didn't really test or recheck the code at all. Take out one `from` clause and you have 200,000+ unique ids. If that isn't enough slap 1 more number in the array and you get 2,000,000+ in reasonable time. – InBetween Sep 15 '11 at 18:28
  • @InBetween: I know :) I was just re-reading it, because I really like the simplicity and then I saw it... – Daniel Hilgarth Sep 15 '11 at 18:30
3

Try to avoid the checks altogether by making sure that you always pick a unique number:

static char[] base9 = "123456789".ToCharArray();

static string ConvertToBase9(int value) {
    int num = 9;
    char[] result = new char[9];
    for (int i = 8; i >= 0; --i) { 
        result[i] = base9[value % num];
        value = value / num;
    }
    return new string(result);
}

public static void generateIdentifiers(int quantity) {
    var uniqueIdentifiers = new List<string>(quantity);
    // There are 387420489 (9^9) possible 9-digit numbers in base 9.
    // If we step by an increment that is coprime to that count, every value
    // is visited exactly once before the sequence repeats.
    Random random = new Random();
    int inc = 386000000;
    int seed = random.Next(0, 387420489);
    while (uniqueIdentifiers.Count < quantity) {
        uniqueIdentifiers.Add(ConvertToBase9(seed));
        seed += inc;
        seed %= 387420489;
    }
}

I'll try to explain the idea behind it with small numbers...

Suppose you have at most 7 possible combinations. We choose a number that is coprime to 7, e.g. 3, and a random starting number, e.g. 4.

At each round, we add 3 to our current number, and then we take the result modulo 7, so we get this sequence:

4 -> (4 + 3) % 7 = 0
0 -> (0 + 3) % 7 = 3
3 -> (3 + 3) % 7 = 6
6 -> (6 + 3) % 7 = 2

In this way, we generate all the values from 0 to 6 in a non-consecutive way. In my example, we are doing the same, but we have 9^9 possible combinations, and as a number coprime to that I chose 386000000 (since 9^9 = 3^18, you just have to avoid multiples of 3).

Then I take each number in the sequence and convert it to base 9.

I hope this is clear :)

I tested it on my machine, and generating 400k unique values took ~ 1 second.
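
The stepping argument can be checked with a small sketch. The class name `CoprimeStepper` and its methods are made up for illustration, and the start value is assumed to be non-negative. It walks a full cycle modulo a given count and refuses increments that share a factor with that count.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative only.
public static class CoprimeStepper
{
    // Greatest common divisor via Euclid's algorithm.
    public static long Gcd(long a, long b)
    {
        while (b != 0)
        {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Yields modulus values, starting at start and adding step each time.
    // When step is coprime to modulus, every value in [0, modulus) appears exactly once.
    // Because this is an iterator, the coprimality check runs when enumeration begins.
    public static IEnumerable<long> Walk(long modulus, long step, long start)
    {
        if (Gcd(modulus, step) != 1)
            throw new ArgumentException("step must be coprime to modulus");

        long current = start % modulus;
        for (long i = 0; i < modulus; i++)
        {
            yield return current;
            current = (current + step) % modulus;
        }
    }
}
```

`string.Join(",", CoprimeStepper.Walk(7, 3, 4))` gives `4,0,3,6,2,5,1`, which covers all seven values, and `CoprimeStepper.Gcd(387420489, 386000000)` is 1, so the increment used above is a valid choice.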

Paolo Tedesco
  • Very clever approach. Although you have no collisions and only get unique values, it takes longer than a simple optimized brute force algorithm (see my answer), because of the calculations needed for that clever approach. It takes 2 seconds for 1,000,000 values as compared to only 700 ms with the optimized brute force. – Daniel Hilgarth Sep 15 '11 at 08:55
  • @Daniel Hilgarth: thanks for the "clever approach" :) I have to say that your version has also the added advantage of being truly random, while mine is not. By the way, I modified my version slightly, and now it runs in ~1 sec for 1M values. – Paolo Tedesco Sep 15 '11 at 10:53
  • You are welcome. :-) I would have never thought of something like this and I didn't even try to understand it, hehe... – Daniel Hilgarth Sep 15 '11 at 10:58
2

Looking at the solutions already posted, mine seems fairly basic. But it works, and generates 1 million values in approximately 1 s (10 million in 11 s).

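public static void generateIdentifiers(int quantity)
{
    HashSet<int> uniqueIdentifiers = new HashSet<int>();

    while (uniqueIdentifiers.Count < quantity)
    {
        // The upper bound of Random.Next is exclusive, so use 1000000000 to include 999999999.
        int value = random.Next(111111111, 1000000000);
        // Reject candidates that contain a zero digit or were already generated.
        if (!value.ToString().Contains('0') && !uniqueIdentifiers.Contains(value))
            uniqueIdentifiers.Add(value);
    }
}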
Daniel Hilgarth
Daniel Becroft
2

Maybe this will be faster:

        // First generate a number that, written in base 9, lies between 88888888 and 888888888.
        // We can't start from zero because that would produce a large number of 1 digits at the beginning.

        int randNumber = random.Next((int)Math.Pow(9, 8) - 1, (int)Math.Pow(9, 9));

        // Now convert the number to base 9, adding 1 to each digit.
        StringBuilder builder = new StringBuilder();

        for (int i=(int)Math.Pow(9,8); i>0;i= i/9)
        {
            builder.Append(randNumber / i +1);
            randNumber = randNumber % i;
        }

        id = builder.ToString();
matmot
1

Use a string array or a StringBuilder when working with string additions.

Moreover, your code is not efficient because after generating many ids your list may already hold the newly generated id, so the while loop will run more often than you need.

Use for loops and generate your ids from the loop without randomizing. If random ids are required, again use for loops to generate more than you need over a given interval, and then randomly select as many as you need from that list.

Use the code below to keep a static list and fill it when your program starts. I will add a second piece of code later to generate a random id list. [I'm a little busy]

    public static Random RANDOM = new Random();
    public static List<int> randomNumbers = new List<int>();
    public static List<string> randomStrings = new List<string>();

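    private void fillRandomNumbers()
    {
        // Collect every three-digit number that contains no zero digit.
        int i = 100;
        while (i < 1000)
        {
            if (i.ToString().Contains('0') == false)
            {
                randomNumbers.Add(i);
            }
            i++; // advance to the next candidate; without this the loop never ends
        }
    }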
icaptan
0

I think the first thing would be to use StringBuilder instead of concatenation - you'll be pleasantly surprised. Another thing - use a more efficient data structure, for example HashSet<> or HashTable.

If you could drop the rather odd requirement not to have zeros, then you could of course use just one random operation and then format the resulting number the way you want.

Hassan
0

I think @slugster is broadly right - although you could run two parallel processes, one to generate numbers, the other to verify them and add them to the list of accepted numbers when verified. Once you have enough, signal the original process to stop.

Combine this with other suggestions - using more efficient and appropriate data structures - and you should have something that works acceptably.

However the question of why you need such numbers is also significant - this requirement seems like one that should be analysed.

Schroedingers Cat
0

Something like this?

public List<string> generateIdentifiers2(int quantity)
{
    var uniqueIdentifiers = new List<string>(quantity);
    while (uniqueIdentifiers.Count < quantity)
    {
        // Three groups of three digits each, separated by spaces.
        var sb = new StringBuilder();
        sb.Append(random.Next(111, 1000));
        sb.Append(" ");
        sb.Append(random.Next(111, 1000));
        sb.Append(" ");
        sb.Append(random.Next(111, 1000));

        // Replace any zero digit with a random digit from 1 to 9.
        var id = sb.ToString();
        id = new string(id.ToList().ConvertAll(x => x == '0' ? char.Parse(random.Next(1, 10).ToString()) : x).ToArray());

        if (!uniqueIdentifiers.Contains(id))
        {
            uniqueIdentifiers.Add(id);
        }
    }
    return uniqueIdentifiers;
}
iDevForFun