0

Is there a built in function, or a simple UDF that can identify the pattern in the information below and remove the duplicates?

Assume the following is all within a single excel cell:

80154, 80299, 80299, 82055, 82145, 82205, 82520, 82570, 83840, 83925, 83925, 83986, 83992, 84315, 80154, 80299, 80299, 82055, 82145, 82205, 82520, 82570, 83840, 83925, 83925, 83986, 83992, 84315

There are two sets of data (starts with 80154 ends with 84315). I want to end up with only one set, but I want to do it to 50,000 lines. The final output should be just the BOLD text. Also, sometimes the data repeats itself 3 times, again, I just want the unique set of data.

NOTE: I can't just remove duplicates, because sometimes there will be duplicates in the set that I need to capture in the final output. For example, (A,A,B,C,A,A,B,C) needs to be reduced to (A,A,B,C).

Devil07
  • 141
  • 1
  • 8
  • Are all the cells duplicated inside themselves? And is it only twice? – Scott Craner Aug 11 '17 at 19:52
  • @ScottCraner I'm not sure what you mean by duplicated inside themselves. There are 50,000 lines that each has one cell that is similar to the example. The data is either a duplicate, or a triplicate. I want to strip it down to the original unique series. Let me know if that answers your question. – Devil07 Aug 11 '17 at 19:57
  • If it where always the same it would be easy, but if some are duplicate and others triplicate, you are probably not going to get this done with a formula. For example if always duplicate: `=LEFT(A1,LEN(A1)/2-1)` would do it. But finding patterns is not what Excel does easily. – Scott Craner Aug 11 '17 at 19:59
  • I was afraid of that. Can that function be used to fix the duplicates, and provide and error for triplicates, where I can go back and do the triplicates manually? – Devil07 Aug 11 '17 at 20:01
  • Nope, that is the problem. – Scott Craner Aug 11 '17 at 20:02
  • here try this: `=LEFT(A1,FIND(LEFT(A1,LEN(A1)/5),A1,2)-3)` if it works I will post as an answer. If finds where the first 20% repeats and sets that as the cutoff point. – Scott Craner Aug 11 '17 at 20:05
  • I just tried that function, its did the whole spreadsheet in like 2 seconds. However, like you said, the triplicates were an issue. Also, I didn't realize that some of the cells were correct, and it produced a funny output when the data was correct. – Devil07 Aug 11 '17 at 20:06
  • It should not matter about the triplicates. You may want to try changing the `5` to `10`. Play with that number as it will increase or decrease the amount of text it is trying to find. But triplicates would not matter, as soon as the formula finds the repeat it cuts the string at that point. Also yes it would error when no duplicate is found: `IFERROR(LEFT(A1,FIND(LEFT(A1,LEN(A1)/5),A1,2)-3),A1)` – Scott Craner Aug 11 '17 at 20:09
  • The second one works better, and it returns an error of "#VALUE!" for the cells that are not duplicates. – Devil07 Aug 11 '17 at 20:09

1 Answers1

1

This finds where the first 20% is repeated and cuts the string at that point.

IF it does not find a duplicate it will return the whole string.

=IFERROR(LEFT(A1,FIND(LEFT(A1,LEN(A1)/5),A1,2)-3),A1)

Play with the 5 till you find the proper length of string that will get you the correct answer on all your strings. The higher the number the smaller the string it compares.

Also if it is cutting off too much or not enough, like leaving the , at the end adjust the -3 up and down.

Scott Craner
  • 148,073
  • 10
  • 49
  • 81