1

After searching the forum, I did not find a good solution for this question. If I missed it, please tell me.

I need to count the unique values in one column in EXCEL 2010.

The worksheet has 1 million rows and 10 columns. All cell values are string or numbers.

I used the solution at Count unique values in a column in Excel

 =SUMPRODUCT((A2:A1000000<>"")/COUNTIF(A2:A100000,A2:A1000000&""))

But, it runs so long time that the EXCEL is almost frozen. And, it generates 25 processes in Win 7.

Are there more efficient ways to do it?

Also, in the column, all values have for format of

  AX_Y

  here, A is a character, X is an integer, Y is an integer from 1 to 10. 

  For example, A5389579_10

I need to cut off the part after (including) undersocre. for the example,

  A5389579

This is what I need to count as unique values in all cells in one column.

  For example, A5389579_10
               A1543848_6
               A5389579_8

Here, the unique value has 2 after removing the part after underscore.

How to do it in EXCEL VBA and R (if no efficient solution for EXCEL)?

Community
  • 1
  • 1
user3440244
  • 371
  • 1
  • 3
  • 15
  • Read the values to a variant array, then loop through the array and split each value on the underscore: take the first part and use a Scripting Dictionary to keep count of how many times it occurs. Should be pretty fast (at least relative to your current approach). Maybe run it one column at a time, depending on how many unique values you expect to find over the whole range. – Tim Williams Apr 12 '14 at 20:15
  • 1
    this is quick work for `data.table` or `dplyr` – hrbrmstr Apr 12 '14 at 20:16
  • And/or `length(unique(...))`, depending on the class of the column. Need to see some data to go any further. – Rich Scriven Apr 12 '14 at 20:21
  • it is so slow because it is running the COUNTIF function within the formula a million times. This is 999,999 times more passes over the data necessary. It is an O(N^2) solution. Instead, you can use a keyed collection to make it a single pass of the data (O(N)). There are several answers with this already on Stackoverflow - look in the "Related" section of this question. – Cor_Blimey Apr 12 '14 at 20:22
  • @hrbrmstr, would you please tell me how to do this in R ? Thanks ! – user3440244 Apr 12 '14 at 20:52

4 Answers4

2

If you want to do this by VBA, you can take advantage of the Collection object. Since collections can only contain unique values, trying to add all of your input data to a collection will result in an array of unique values. The code below takes all the variables in a selected range and then outputs an array with distinct values to an other sheet (in this case a sheet named Output).

Sub ReturnDistinct()
    Dim Cell As Range
    Dim i As Integer
    Dim DistCol As New Collection
    Dim DistArr()
    Dim OutSht As Worksheet
    Dim LookupVal As String

    Set OutSht = ActiveWorkbook.Sheets("Output") '<~~ Define sheet to putput array

    If TypeName(Selection) <> "Range" Then Exit Sub

    'Add all distinct values to collection
    For Each Cell In Selection
        If InStr(Cell.Value, "_") > 0 Then
            LookupVal = Mid(Cell.Value, 1, InStr(Cell.Value, "_") - 1)
        Else
            LookupVal = Cell.Value
        End If
        On Error Resume Next
        DistCol.Add LookupVal, CStr(LookupVal)
        On Error GoTo 0
    Next Cell

    'Write collection to array
    ReDim DistArr(1 To DistCol.Count, 1 To 1)
    For i = 1 To DistCol.Count Step 1
        DistArr(i, 1) = DistCol.Item(i)
    Next i

    'Outputs distinct values
    OutSht.Range("A1:A" & UBound(DistArr)).Value = DistArr
End Sub

Note that since this code writes all the distinct values to a single column in the OutSht-sheet, this will return an error if there are more than 1,048,576 distinct values in your dataset. In that case you would have to split the data to be filled into multiple output columns.

Netloh
  • 4,338
  • 4
  • 25
  • 38
  • running your VBA code, I got run time error "9", subscript out of range. Thanks ! – user3440244 Apr 12 '14 at 20:51
  • I initialy forgot to include the fact that you only wanted to consider values in front of the "underscore". I have updated the code now. Have you included an empty sheet named "Output" in your workbook. I think that is what is causing the run time error. – Netloh Apr 12 '14 at 20:54
  • The same error at : Set OutSht = ActiveWorkbook.Sheets("Output"). thanks ! – user3440244 Apr 12 '14 at 21:11
  • Then again, it must be because you haven't created an empty sheet named `Output` in the workbook. The code needs that sheet to output the unique values. – Netloh Apr 12 '14 at 21:13
  • I think there's a 65k limit on using Transpose() - you can instead use a 2d array and drop that on the sheet without using Transpose. – Tim Williams Apr 12 '14 at 21:42
  • @TimWilliams That could potentially cause a problem. I have edited the code to use a 2D array (with only a single column). – Netloh Apr 12 '14 at 22:59
  • @Søren Holten Hansen, It works well, thanks ! Would you please recommand some books or websites or online docs about EXCEL 2010 VBA programming ? – user3440244 Apr 15 '14 at 19:03
  • I have found "Power Programming with VBA" by John Walkenbach to be a quite good book on VBA programming. – Netloh Apr 16 '14 at 07:42
1

For your specific request to count, use the below in a formula like =COUNTA(GetUniques(LEFT("A1:A100000",FIND("_","A1:A100000")-1)) entered as an array formula with Ctrl+Shift+Enter.

It also accepts multiple ranges / values (e.g. GetUniques("A1:A10","B2:E4"))

Function GetUniques(ParamArray args())
    Dim arg, ele, arr, i As Long
    Dim c As Collection
    Set c = New Collection
    For Each arg In args
        If TypeOf arg Is Range Then
            If arg.Count = 1 Then
                arr = array(arg.value)
            Else
                arr = arg.Value
            End If
        ElseIf VarType(arg) > vbArray Then
            arr = arg
        Else
            arr = Array(arg)
        End If
        For Each ele In arr
            On Error Resume Next
            c.Add ele, VarType(ele) & "|" & CStr(ele)
            On Error GoTo 0
        Next ele
    Next arg
    If c.Count > 0 Then
        ReDim arr(0 To c.Count - 1)
        For i = 0 To UBound(arr)
            arr(i) = c(i + 1)
        Next i
        Set c = Nothing
        GetUniques = arr
    End If
End Function

edit: added a performance optimisation for ranges (loads them at once into an array - much faster than enumerating through a range)

Cor_Blimey
  • 3,260
  • 1
  • 14
  • 20
1

In R:

# sample data
df <- data.frame(x=1:1000000,
                 y=sample(1e6:(1e7-1),1e6,replace=T))
df$y <- paste0("A",df$y,"_",sample(1:10,1e6,replace=T))

# this does the work...
length(unique(sub("_[0-9]+","",df$y)))
# [1] 946442
# and it's fast...
system.time(length(unique(sub("_[0-9]+","",df$y))))
#    user  system elapsed 
#    2.01    0.00    2.02 
jlhoward
  • 58,004
  • 7
  • 97
  • 140
0

In excel 2010... in the next column add (if original data was in A:A add in B1) = 1/COUNTIF(A:A,A1) and copy down col B to the bottom of your data. Depending on your PC it may chug away calculating for a long time, but it will work. Then copy col B & paste values over itself.

Then SUM col B

Johnny C
  • 100
  • 1
  • 8