
I got some data from two experiments where participants listened to pairs of audio segments, and now I'm trying to get a smaller list of pairs in which each segment appears only once. Here is a sample of my data, where each row represents a pair:

data <- structure(c("38", "39", "48", "50", "55", "68", "143", "'00123_16_02 Firestarter_timbre.txt'", 
"'00123_16_02 Firestarter_timbre.txt'", "'00123_16_02 Firestarter_timbre.txt'", 
"'00123_16_02 Firestarter_timbre.txt'", "'00133_10_02 Loner_timbre.txt'", 
"'00133_10_02 Loner_timbre.txt'", "'00371_17_05 - Original_timbre.txt'", 
"'00133_10_02 Loner_timbre.txt'", "'00030_11_01 Get Your Snack On_timbre.txt'", 
"'00845_03_11 - Flying Lotus - Parisian Goldfish_timbre.txt'", 
"'01249_17_UMEK - Efortil_timbre.txt'", "'00030_11_01 Get Your Snack On_timbre.txt'", 
"'01300_08_02 - Clipper_timbre.txt'", "'01300_08_02 - Clipper_timbre.txt'", 
"MRHT", "MRHT", "MRHT", "MRHT", "MRHT", "MRHT", "MRHT", "12", 
"9", "14", "11", "14", "15", "12", "11", "12", "14", "15", "14", 
"14", "11", "2.75", "2.22222222222222", "2.21428571428571", "2.54545454545455", 
"2.28571428571429", "2.53333333333333", "2.25", "2.81818181818182", 
"3.25", "3.14285714285714", "2.93333333333333", "3.14285714285714", 
"3.07142857142857", "2.90909090909091", "0.621581560508061", 
"0.97182531580755", "1.25137287246211", "1.21355975243384", "0.994490316197694", 
"0.743223352957207", "1.05528970602217", "0.873862897505303", 
"0.753778361444409", "0.662993544131796", "1.03279555898864", 
"0.662993544131796", "0.997248963150875", "1.04446593573419"), .Dim = c(7L, 
10L), .Dimnames = list(NULL, c("pair.number", "Segment1", "Segment2", 
"category", "Rhythm.n", "Timbre.n", "Rhythm.mean", "Timbre.mean", 
"Rhythm.sd", "Timbre.sd")))

Is there a way to get a set of pairs where no segment repeats across both the "Segment1" and "Segment2" columns? Here's what it might look like:

structure(c("48", "55", "143", "'00123_16_02 Firestarter_timbre.txt'", 
"'00133_10_02 Loner_timbre.txt'", "'00371_17_05 - Original_timbre.txt'", 
"'00845_03_11 - Flying Lotus - Parisian Goldfish_timbre.txt'", 
"'00030_11_01 Get Your Snack On_timbre.txt'", "'01300_08_02 - Clipper_timbre.txt'", 
"MRHT", "MRHT", "MRHT", "14", "14", "12", "14", "14", "11", "2.21428571428571", 
"2.28571428571429", "2.25", "3.14285714285714", "3.14285714285714", 
"2.90909090909091", "1.25137287246211", "0.994490316197694", 
"1.05528970602217", "0.662993544131796", "0.662993544131796", 
"1.04446593573419"), .Dim = c(3L, 10L), .Dimnames = list(NULL, 
    c("pair.number", "Segment1", "Segment2", "category", "Rhythm.n", 
    "Timbre.n", "Rhythm.mean", "Timbre.mean", "Rhythm.sd", "Timbre.sd"
    )))

Thanks!

DavidLopezM
  • For these, I manually selected the ones that contain unique segments to make the pairs. The repetition has to be avoided regardless of the column the segment is listed in. – DavidLopezM Apr 24 '14 at 09:48

2 Answers


Edit: The second line of code now ensures that nothing in the Segment1 column appears in the Segment2 column. Note that this solution is likely to return fewer than the maximum possible number of rows.

This ensures that the values of Segment1 are unique:

data <- data[!duplicated(data[, "Segment1"]),]

You can then run this to remove duplicates in the Segment2 column; this will also remove any rows in which Segment2 appears anywhere in the Segment1 column:

data <- data[!duplicated(data[, "Segment2"]) & !(data[, "Segment2"] %in% data[, "Segment1"]),]
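Not from the original answer, but the same greedy rule can also be written as a single pass over the rows; `pick_unique_pairs` is a made-up name for illustration, and it assumes a character matrix with `Segment1` and `Segment2` columns, like `data` in the question:

```r
# Keep a row only if neither of its segments has already appeared,
# in either column, in a row that was kept earlier.
pick_unique_pairs <- function(m) {
  seen <- character(0)
  keep <- logical(nrow(m))
  for (i in seq_len(nrow(m))) {
    segs <- m[i, c("Segment1", "Segment2")]
    if (!any(segs %in% seen)) {
      keep[i] <- TRUE
      seen <- c(seen, segs)
    }
  }
  m[keep, , drop = FALSE]
}
```

You would call it as `pick_unique_pairs(data)`. Like the two-line version above, this is greedy: it keeps the first eligible rows it meets, so it can also return fewer rows than the maximum possible.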
James Trimble
  • I played with `duplicated` for a while, but it still gives me repeated values across columns Segment1 and Segment2. Do you know of a way to prevent this? – DavidLopezM Apr 23 '14 at 12:09

It sounds like you want what's called a matching in a graph: your vertices are tracks, and an edge connects two tracks if they were listened to as a pair. You then need to find a set of edges that share no vertices (a matching) - and ideally the largest such set (a maximum matching).

There's a function in R's igraph package that should help with this, called maximum.bipartite.matching - you'll need to get Segment1 and Segment2 into a graph representation to call it. Something along the lines of:

seg1 <- df$Segment1
seg2 <- df$Segment2
# map both columns onto a shared set of integer vertex ids
levs <- unique(c(seg1, seg2))
seg1 <- as.integer(factor(seg1, levels = levs))
seg2 <- as.integer(factor(seg2, levels = levs))
library(igraph)
# interleave the columns so consecutive entries form the edge list
reord <- order(c(seq_along(seg1), seq_along(seg2)))
gr <- graph(c(seg1, seg2)[reord])
maximum.bipartite.matching(gr)

Most of this is to get the vertices into the correct format: we cast both columns as factors with common levels, then turn them into integers. We interleave them to form (seg1_1, seg2_1, seg1_2, seg2_2, seg1_3, seg2_3, ...), so that consecutive entries give the pairs of vertices, and then create a graph object from them. The final line finds the largest number of pairs of audio tracks such that none of them overlap. You'll then need to extract these and map them back to the original data set.
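A rough sketch of that extraction step (not part of the original answer): `matched_rows` is a made-up helper, and it assumes the `matching` component of the value returned by maximum.bipartite.matching gives, for each vertex, the id of its matched partner, with `NA` for unmatched vertices, as described in igraph's documentation:

```r
# Hypothetical helper: given the integer endpoint vectors and the
# matching vector, keep the rows whose two endpoints are matched
# to each other.
matched_rows <- function(df, seg1, seg2, matching) {
  keep <- !is.na(matching[seg1]) & matching[seg1] == seg2
  df[keep, , drop = FALSE]
}
```

With the objects above, something like `matched_rows(df, seg1, seg2, maximum.bipartite.matching(gr)$matching)` should recover the selected pairs; as the comments note, the graph also needs a vertex `type` attribute before the matching function will accept it.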

Gavin Kelly
  • Think that's probably because I didn't set it up as a _bipartite_ graph, sorry. If you set `V(gr)$type <- FALSE` the function should work, though I would say there should be three edges found, whereas the function only seems to find two - maybe a result of forcing this graph rather artificially to be bipartite. Maybe someone else knows better graph-matching algorithms in R – Gavin Kelly Apr 23 '14 at 13:15
  • Then I get the following error: `Error in .Call("R_igraph_create", as.numeric(edges) - 1, as.numeric(n), : At type_indexededgelist.c:117 : cannot create empty graph with negative number of vertices, Invalid value` – DavidLopezM Apr 23 '14 at 13:19
  • Check that `seg1` and `seg2` are getting correct values - by the fifth line, they should be integers without any missing values. Depending on the exact class of Segment1, you may first need to cast them with `as.character(df$Segment1)` etc – Gavin Kelly Apr 23 '14 at 13:22
  • And I think the matching it provides is maximal in the sense that one can't add any more edges without introducing overlaps - but it doesn't necessarily find the largest such set (called, rather confusingly, a maxim_um_ matching) - I can't find an R implementation of such an algorithm – Gavin Kelly Apr 23 '14 at 13:25
  • Thank you for taking the time to answer all my replies! After running this code and indexing back to the main database, I still get repeated values for some strange reason. I guess I'll look for a solution tomorrow, and wait to see if someone else has some input on how to do this. – DavidLopezM Apr 23 '14 at 14:38