Finding the exact match in the values in the categorical variables

Question

I wanted to find an exact match in the values between all three columns (rg1,rg2,rg3).Below is my dataframe.

For instance - first row has a combination of (70,71,72) , if this same combination appears in the remaining rows for the rest of the user ids , then, keep only those users and delete rest.

To describe it further - first row has (70,71,72) and say , if row 10 had the same values in B,C,D column, then I just want to display row 1 and row 10.(using R)

I tried doing clustering on this - kmodes. But I'm not getting the expected results.The current code is grouping all the rgs but it's kind of validating only a single Rg that has appeared most frequently in the data frame(above is my dataframe) and ranking them accordingly.

Can someone please guide me on this?Is there any better way to do this?

kmodes <- klaR::kmodes(mapped_df, modes= 5, iter.max = 10, weighted = FALSE)
 #Add these clusters to the main dataframe
final <- mapped_df %>%
  mutate(cluster = kmodes$cluster)

1) It is not clear (at least for me) what are you looking for (rows with consecutive values?). 2) It would be better if you include a reproducible example (including your data frame). Edit your question copying-pasting the output of `dput(mapped_df)`. — Wencheng Lau-Medrano, Jul 12 '22 at 08:07

score 0 · Accepted Answer · answered Jul 12 '22 at 14:02

0

You can sort across the columns, then look for duplicates.

set.seed(1234)

df <- tibble(Userids = 1:20,
             rg_1 = sample(1:20, 20, TRUE),
             rg_2 = sample(1:20, 20, TRUE),
             rg_3 = sample(1:20, 20, TRUE)) 

df[4, -1] <- rev(df[15, -1])

# sort across the columns
df_sorted <- t(apply(df[-1], 1, sort))

# return the duplicated rows
df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]

This will give you a data frame with all the duplicated values. Once you have the sorted data frame, it should be easy enough to find what you need.

  Userids  rg_1  rg_2  rg_3
    <int> <int> <int> <int>
1       4    16    17     6
2      15     6    17    16

answered Jul 12 '22 at 14:02

Thank you so much. This was helpful. Just a quick question - Could you explain these two line of codes? reversing dataframe ? df[4, -1] <- rev(df[15, -1]) # sort across the columns df_sorted <- t(apply(df[-1], 1, sort)) (Are we sorting only -1 ie. last column? Could you please explain apply(df[-1],1) ? – Anonymous Jul 12 '22 at 14:57
Ah, that line `df[4, -1] <- rev(df[15, -1])` is just me creating some fake data. It takes row 4 and makes it equal to row 15, but just in reverse. That way we can test it. And then `df[-1]` removes the first column from the data frame, to not include the Userids. So `apply(df[-1], 1, sort)` says to iterate through the rows of the df, excluding the first column, and then sort. The "1" says to iterate over rows, as opposed to a "2" which would iterate over columns. – Jul 12 '22 at 15:01
To see what apply does, check out `apply(matrix(1:4, nrow = 2), 1, sum)`. You'll see it sums across the rows. Play around with that. Change the 1 to a 2, and sum to mean. It just operates on a margin of the matrix/data frame, and *applies* that formula. – Jul 12 '22 at 15:04

score 0 · Answer 2 · answered Jul 12 '22 at 14:04

I still do not understand what are you precisely looking for. Besides, it is always recomended to include the data frame you are refering.

I could suggest a solution, which implies the use of a threshold value. So, for each row, if some of the differences (between rg1-rg2, rg1-rg3 and rg2-rg3) is higher than the threshold, it will not be consider.

threshold <- 5

index <- mapped_df %>% 
  
  tibble(g1_g2 = abs(rg1 - rg2),
         g1_g3 = abs(rg1 - rg3),
         g2_g3 = abs(rg2 - rg3)) %>% 
  
  apply(1, function(x, threshold) all(x <= threshold), 
        threshold = threshold)

mapped_df[index]

harre · Answer 3 · 2022-07-12T14:21:56.373

Maybe you're (just) after some filtering?

library(tidyverse)

data <- tibble(Userids = 1:10,
               rg1 = c(70,1:8,70),
               rg2 = c(71,11:18,71),
               rg3 = c(72,21:28,72))

data |>
    filter(rg1 == 70,
           rg2 == 71,
           rg3 == 72)

data |> 
    filter(rg1 == rg1[row_number()==1],
           rg2 == rg2[row_number()==1],
           rg3 == rg3[row_number()==1])

Output:

# A tibble: 2 × 4
  Userids   rg1   rg2   rg3
    <int> <dbl> <dbl> <dbl>
1       1    70    71    72
2      10    70    71    72

Or combine them for ease:

data |>
    unite(rg, starts_with("rg")) |>
    filter(rg == rg[row_number()==1])

Output:

# A tibble: 2 × 2
  Userids rg      
    <int> <chr>   
1       1 70_71_72
2      10 70_71_72

Finding the exact match in the values in the categorical variables

3 Answers3