0

I have a data frame. Small sample of the data frame can be thought of like this:

    V1        V2        V3        V4..........
0  a2er56    b34er12   aer234     NA
1  2fr34rt   a2er56    aer234        NA
2  b34er12   2fr34t     NA        NA
.
.
.

There are total 1193 rows and 173 columns.

I want to find the common elements i.e. the columns for a pair of rows and that should be run for all the pairs. For rows starting from 0 to all the rows (1193) except 0 it needs to find what are the common entries that they have under the columns. This should be done for all the pairs possible.

By pairs I mean (0,1), (0,2), (0,3)....(1,2), (1,3),.... i.e all possible pairs and for common elements I meant :

If we see the pair (0,2) in the small sample data set we can see b34er12 is present in both, thus I want a data frame having three columns which will have: 0, 2, b34er12.

Again for the pair (1,2) the common element is 2fr34t.

I want a data frame like this:

name1    name2   common_element1  common_element2......
 0        1       a2er56               aer234
 0        2       b34er12
 1        2       2fr34t

dput of my data frame of two rows (2, 19) which have one common element (20a8ed50-e592-70e4-aecd-ec17f1723495)

 structure(list(V1 = list(`2` = "20a8ed50-e592-70e4-aecd-ec17f1723495", 
`19` = "5e2406fe-ddb4-a797-508e-2aff5baae0fc"), V2 = list(
`2` = NA_character_, `19` = "f8bd691e-ab7f-3718-c1d2-1361dab00bcd"), 
V3 = list(`2` = NA_character_, `19` = "b76c2ff6-87f0-2cfb-ac8c-e4adcb580c9e"), 
V4 = list(`2` = NA_character_, `19` = "ce09b782-92f6-245c-70dd-886bdd1d5590"), 
V5 = list(`2` = NA_character_, `19` = "8f305f42-6ad3-2a5d-8910-e2c9415a8a7f"), 
V6 = list(`2` = NA_character_, `19` = "409d5592-8bac-5814-6e41-6b239fac65b8"), 
V7 = list(`2` = NA_character_, `19` = "3a12ffe6-69a8-9bf6-0ed6-bd9587f4ae2f"), 
V8 = list(`2` = NA_character_, `19` = "a1b39b52-e345-0fbe-3beb-156a7ffb9894"), 
V9 = list(`2` = NA_character_, `19` = "a0d0c78b-2a73-aff5-582e-7e8dcaa96058"), 
V10 = list(`2` = NA_character_, `19` = "74b70d92-29cc-7b6e-58e9-515998b434f4"), 
V11 = list(`2` = NA_character_, `19` = "72cd80a5-ae3d-496b-f1fd-e3722c2b504e"), 
V12 = list(`2` = NA_character_, `19` = "eac9b054-e49d-109c-7c70-37473b2c998b"), 
V13 = list(`2` = NA_character_, `19` = "331c1a13-1723-d521-42b9-dc4b4f3ece45"), 
V14 = list(`2` = NA_character_, `19` = "45f73feb-6f8c-2ecb-2fab-cd38362e4b7a"), 
V15 = list(`2` = NA_character_, `19` = "f10dc385-9f8e-a85e-1962-5b63976f40fa"), 
V16 = list(`2` = NA_character_, `19` = "54dfb458-669b-49e9-1f09-4543bd2f49b7"), 
V17 = list(`2` = NA_character_, `19` = "a04a9177-e309-b0b0-cc60-109c61a4caf1"), 
V18 = list(`2` = NA_character_, `19` = "a5551ec1-1c06-8727-2046-f3c51854decf"), 
V19 = list(`2` = NA_character_, `19` = "acffb1a5-ddda-460d-a6fb-99d3cdb67aca"), 
V20 = list(`2` = NA_character_, `19` = "16063f02-0f0c-e2f7-a433-1fb062229801"), 
V21 = list(`2` = NA_character_, `19` = "1ec467d7-320a-09dd-6fe2-6cb42023d4e9"), 
V22 = list(`2` = NA_character_, `19` = "2882a616-e376-e72b-e10f-6adc937bfb17"), 
V23 = list(`2` = NA_character_, `19` = "53380cb1-297a-eaa3-f319-9b5184d5cad2"), 
V24 = list(`2` = NA_character_, `19` = "20a8ed50-e592-70e4-aecd-ec17f1723495"), 
V25 = list(`2` = NA_character_, `19` = "23437f56-d286-bba3-a7ed-e5173acb785c"), 
V26 = list(`2` = NA_character_, `19` = "24221c40-7a63-e932-3a82-6daa305ff7a4"), 
V27 = list(`2` = NA_character_, `19` = "7bc9216e-a9a5-3c43-bd3d-3bf9680a0799"), 
V28 = list(`2` = NA_character_, `19` = "40880c67-e4b4-95eb-bf7b-58f07dcd93ca"), 
V29 = list(`2` = NA_character_, `19` = "03c5eef8-5442-2f5d-7c33-44c2d59aeb4e"), 
V30 = list(`2` = NA_character_, `19` = "24d95b8a-42ed-fa15-13b1-53892d0339fe"), 
V31 = list(`2` = NA_character_, `19` = "a08a6b2b-577f-9cdd-3fec-992b54f4a0bb"), 
V32 = list(`2` = NA_character_, `19` = "6d65b9cc-fe48-fc94-404b-6cbc300f044e"), 
V33 = list(`2` = NA_character_, `19` = "fd5a9a7a-e07b-6b0a-a01c-c71b35940330"), 
V34 = list(`2` = NA_character_, `19` = "d2e09e88-5ff4-753c-7554-28fad0fcd63f"), 
V35 = list(`2` = NA_character_, `19` = "c0645d9e-eb24-eeff-e7ff-28f839b90c37"), 
V36 = list(`2` = NA_character_, `19` = "a093a047-9d04-0a32-5c42-bb2b5b58b1c5"), 
V37 = list(`2` = NA_character_, `19` = "09199dbf-1f96-5cc0-6cdb-96f2802e7487"), 
V38 = list(`2` = NA_character_, `19` = "338ed72b-5b6a-ecbe-7bf4-8449d452cefc"), 
V39 = list(`2` = NA_character_, `19` = "deb26b9c-b8a2-9a02-d8d3-1805e522e4d5"), 
V40 = list(`2` = NA_character_, `19` = "9d99226f-e484-dddd-7d1c-a9f1803fc21d"), 
V41 = list(`2` = NA_character_, `19` = "a340124f-a029-ee62-683a-152bef28db6b"), 
V42 = list(`2` = NA_character_, `19` = "2c147f88-a297-4932-4b1e-9caffb1982d4"), 
V43 = list(`2` = NA_character_, `19` = "03f230ba-ed9d-db27-71ba-b0ebe57a7827"), 
V44 = list(`2` = NA_character_, `19` = "a8c7b55d-431f-9876-e839-623912250da6"), 
V45 = list(`2` = NA_character_, `19` = "7c7e6f90-ad68-8fca-cbad-91cd2ba6110d"), 
V46 = list(`2` = NA_character_, `19` = "577b02ae-7b6f-676b-5e03-aabb11e8bf04"), 
V47 = list(`2` = NA_character_, `19` = "bf2f74db-223a-380b-531e-2810df300b15"), 
V48 = list(`2` = NA_character_, `19` = "a92967ec-2219-3a0e-df0a-419ce07e5bd8"), 
V49 = list(`2` = NA_character_, `19` = "9f1eb31e-efc0-6b92-e2f1-8df231329752"), 
V50 = list(`2` = NA_character_, `19` = "7a909135-ca3f-c392-8aa3-693382647029"), 
V51 = list(`2` = NA_character_, `19` = "869fd4d7-1670-62b5-dbf3-2f8ea99a52dc"), 
V52 = list(`2` = NA_character_, `19` = "718a995a-5281-9d08-a916-fa37b541cbd1"), 
V53 = list(`2` = NA_character_, `19` = "be9dec6a-d2fb-60dc-a7d3-9013b4fe92b7"), 
V54 = list(`2` = NA_character_, `19` = "1304f539-4a82-0e17-0623-aa84a5fea370"), 
V55 = list(`2` = NA_character_, `19` = "f8bfb612-9b95-df4e-7f89-022528e43f5f"), 
V56 = list(`2` = NA_character_, `19` = "11a3e1e9-de5e-ac6e-d198-a7cceb5cac3c"), 
V57 = list(`2` = NA_character_, `19` = "dccd545b-5e83-bd67-158e-1939bae072f5"), 
V58 = list(`2` = NA_character_, `19` = "fd6e2a94-5a3b-2969-744d-efa09d5257a8"), 
V59 = list(`2` = NA_character_, `19` = "ff437f85-db94-12d0-7621-e0a1fe217eb6"), 
V60 = list(`2` = NA_character_, `19` = "2107b5c3-194b-c3ec-8ae2-93c7214e90da"), 
V61 = list(`2` = NA_character_, `19` = "acf5a19e-7292-dc1e-8b68-d4d629843935"), 
V62 = list(`2` = NA_character_, `19` = "a14c6399-a7bb-00e8-043a-a529104dd866"), 
V63 = list(`2` = NA_character_, `19` = "fbce294b-4dbc-fbd3-5caf-7e66a25d2a2c"), 
V64 = list(`2` = NA_character_, `19` = "29e32b77-49be-a503-96d3-ce32b04f5d25"), 
V65 = list(`2` = NA_character_, `19` = "1218d219-7778-e30d-1e05-95406ac7520f"), 
V66 = list(`2` = NA_character_, `19` = "8f4819b8-1ac0-4d47-62d2-a10f9eeac5af"), 
V67 = list(`2` = NA_character_, `19` = "0c37cbe5-6922-bb53-1611-2926ceccb776"), 
V68 = list(`2` = NA_character_, `19` = "d26fa4f0-1918-097c-6176-e4fe4ee2e7cb"), 
V69 = list(`2` = NA_character_, `19` = "4f2d0023-2a04-9fa5-49ec-27da08a4567a"), 
V70 = list(`2` = NA_character_, `19` = "f7cb205c-1885-3bb6-5963-0d4eb90cf763"), 
V71 = list(`2` = NA_character_, `19` = "825843e2-8536-41c3-5df2-0363c8c87dfa"), 
V72 = list(`2` = NA_character_, `19` = "c596f761-a75b-10e1-1bfc-fda0b247fc05"), 
V73 = list(`2` = NA_character_, `19` = "73f09340-4b13-5716-8496-d412796ec7fd"), 
V74 = list(`2` = NA_character_, `19` = "7028ac54-d368-28d1-26bd-7abe3ac0c656"), 
V75 = list(`2` = NA_character_, `19` = "5cddb8c8-c5b5-af9d-f503-e0a8f0cf9c3f"), 
V76 = list(`2` = NA_character_, `19` = "71df6470-5429-6f55-ba12-cfff63849cce"), 
V77 = list(`2` = NA_character_, `19` = "870aa18e-f4c5-9a90-afe1-99b3c2fe8534"), 
V78 = list(`2` = NA_character_, `19` = "52c163c5-ac34-4c93-4aba-368fd00808b0"), 
V79 = list(`2` = NA_character_, `19` = "313f157b-60d6-b82c-c6fa-028d4a226e94"), 
V80 = list(`2` = NA_character_, `19` = "ce5c86af-8e37-c2cb-9d44-db1e3687b6b4"), 
V81 = list(`2` = NA_character_, `19` = "516494df-f44b-a4bd-ce35-cae9f2102ca6"), 
V82 = list(`2` = NA_character_, `19` = "fb423fb0-bc21-3556-386f-9ff05722166c"), 
V83 = list(`2` = NA_character_, `19` = "0e36a888-38a0-54be-6a27-366a5e17fc9a"), 
V84 = list(`2` = NA_character_, `19` = "8d2f177e-a485-d865-a220-09d42aa4afc4"), 
V85 = list(`2` = NA_character_, `19` = "605924f4-aa86-54af-ae01-782c1fd9eb45"), 
V86 = list(`2` = NA_character_, `19` = "ddd7cf47-66cd-4706-9df1-b0aa05ec594d"), 
V87 = list(`2` = NA_character_, `19` = "cfd1d1dc-2fa4-c7a3-f6fa-e93083d33386"), 
V88 = list(`2` = NA_character_, `19` = "4a1626e6-5137-87f4-8c23-4ec92c29252d"), 
V89 = list(`2` = NA_character_, `19` = "c46e93c7-f9fc-220b-705b-e9adcc95c3ac"), 
V90 = list(`2` = NA_character_, `19` = "b4cdd55e-b56a-5b38-8539-3f09857d242b"), 
V91 = list(`2` = NA_character_, `19` = "c9953591-82e2-461d-275b-6bf5e84c902c"), 
V92 = list(`2` = NA_character_, `19` = "03392f1f-93dd-7684-67a4-821d875b209e"), 
V93 = list(`2` = NA_character_, `19` = "b11f4d5e-1af0-b0b7-c2de-e71afe1ba4f7"), 
V94 = list(`2` = NA_character_, `19` = "7c06ac0b-8662-595d-b05c-e8a6d14ae347"), 
V95 = list(`2` = NA_character_, `19` = "0bb57f8b-6d9b-9c8a-c8ad-bf1ee2e64843"), 
V96 = list(`2` = NA_character_, `19` = "907b69bc-fd0b-4ec5-e912-b24219bb45a2"), 
V97 = list(`2` = NA_character_, `19` = "2373dc5d-c24d-be1a-dba4-04de497bef48"), 
V98 = list(`2` = NA_character_, `19` = "e3829233-bf93-691e-910e-5251876b63ff"), 
V99 = list(`2` = NA_character_, `19` = "81bf5b0f-d69f-3e3f-189a-4b6b5fa508c8"), 
V100 = list(`2` = NA_character_, `19` = "1843bd54-78d5-6768-7bc4-c83dc51a3c33"), 
V101 = list(`2` = NA_character_, `19` = "b2e019ff-6364-9de8-a9d5-2fb683393f66"), 
V102 = list(`2` = NA_character_, `19` = "f84bc0b5-dba4-3208-2eaf-2f2c0e3c5207"), 
V103 = list(`2` = NA_character_, `19` = "bd8245a7-28d3-b2f1-ad24-433c224147a6"), 
V104 = list(`2` = NA_character_, `19` = "d06e992f-4212-b12e-58a7-d5cc8cbf1433"), 
V105 = list(`2` = NA_character_, `19` = NA_character_), V106 = list(
    `2` = NA_character_, `19` = NA_character_), V107 = list(
    `2` = NA_character_, `19` = NA_character_), V108 = list(
    `2` = NA_character_, `19` = NA_character_), V109 = list(
    `2` = NA_character_, `19` = NA_character_), V110 = list(
    `2` = NA_character_, `19` = NA_character_), V111 = list(
    `2` = NA_character_, `19` = NA_character_), V112 = list(
    `2` = NA_character_, `19` = NA_character_), V113 = list(
    `2` = NA_character_, `19` = NA_character_), V114 = list(
    `2` = NA_character_, `19` = NA_character_), V115 = list(
    `2` = NA_character_, `19` = NA_character_), V116 = list(
    `2` = NA_character_, `19` = NA_character_), V117 = list(
    `2` = NA_character_, `19` = NA_character_), V118 = list(
    `2` = NA_character_, `19` = NA_character_), V119 = list(
    `2` = NA_character_, `19` = NA_character_), V120 = list(
    `2` = NA_character_, `19` = NA_character_), V121 = list(
    `2` = NA_character_, `19` = NA_character_), V122 = list(
    `2` = NA_character_, `19` = NA_character_), V123 = list(
    `2` = NA_character_, `19` = NA_character_), V124 = list(
    `2` = NA_character_, `19` = NA_character_), V125 = list(
    `2` = NA_character_, `19` = NA_character_), V126 = list(
    `2` = NA_character_, `19` = NA_character_), V127 = list(
    `2` = NA_character_, `19` = NA_character_), V128 = list(
    `2` = NA_character_, `19` = NA_character_), V129 = list(
    `2` = NA_character_, `19` = NA_character_), V130 = list(
    `2` = NA_character_, `19` = NA_character_), V131 = list(
    `2` = NA_character_, `19` = NA_character_), V132 = list(
    `2` = NA_character_, `19` = NA_character_), V133 = list(
    `2` = NA_character_, `19` = NA_character_), V134 = list(
    `2` = NA_character_, `19` = NA_character_), V135 = list(
    `2` = NA_character_, `19` = NA_character_), V136 = list(
    `2` = NA_character_, `19` = NA_character_), V137 = list(
    `2` = NA_character_, `19` = NA_character_), V138 = list(
    `2` = NA_character_, `19` = NA_character_), V139 = list(
    `2` = NA_character_, `19` = NA_character_), V140 = list(
    `2` = NA_character_, `19` = NA_character_), V141 = list(
    `2` = NA_character_, `19` = NA_character_), V142 = list(
    `2` = NA_character_, `19` = NA_character_), V143 = list(
    `2` = NA_character_, `19` = NA_character_), V144 = list(
    `2` = NA_character_, `19` = NA_character_), V145 = list(
    `2` = NA_character_, `19` = NA_character_), V146 = list(
    `2` = NA_character_, `19` = NA_character_), V147 = list(
    `2` = NA_character_, `19` = NA_character_), V148 = list(
    `2` = NA_character_, `19` = NA_character_), V149 = list(
    `2` = NA_character_, `19` = NA_character_), V150 = list(
    `2` = NA_character_, `19` = NA_character_), V151 = list(
    `2` = NA_character_, `19` = NA_character_), V152 = list(
    `2` = NA_character_, `19` = NA_character_), V153 = list(
    `2` = NA_character_, `19` = NA_character_), V154 = list(
    `2` = NA_character_, `19` = NA_character_), V155 = list(
    `2` = NA_character_, `19` = NA_character_), V156 = list(
    `2` = NA_character_, `19` = NA_character_), V157 = list(
    `2` = NA_character_, `19` = NA_character_), V158 = list(
    `2` = NA_character_, `19` = NA_character_), V159 = list(
    `2` = NA_character_, `19` = NA_character_), V160 = list(
    `2` = NA_character_, `19` = NA_character_), V161 = list(
    `2` = NA_character_, `19` = NA_character_), V162 = list(
    `2` = NA_character_, `19` = NA_character_), V163 = list(
    `2` = NA_character_, `19` = NA_character_), V164 = list(
    `2` = NA_character_, `19` = NA_character_), V165 = list(
    `2` = NA_character_, `19` = NA_character_), V166 = list(
    `2` = NA_character_, `19` = NA_character_), V167 = list(
    `2` = NA_character_, `19` = NA_character_), V168 = list(
    `2` = NA_character_, `19` = NA_character_), V169 = list(
    `2` = NA_character_, `19` = NA_character_), V170 = list(
    `2` = NA_character_, `19` = NA_character_), V171 = list(
    `2` = NA_character_, `19` = NA_character_), V172 = list(
    `2` = NA_character_, `19` = NA_character_), V173 = list(
    `2` = NA_character_, `19` = NA_character_)), row.names = c("2", 
    "19"), class = "data.frame")
Ashi
  • 61
  • 5
  • Your description is a little unclear to me. What would the intended result look like if "b34er12" was also present in row 1, column V4? – jdobres Aug 21 '21 at 18:04
  • @jdobres sorry for that. So, "b34er12" cannot be present in row 1 and column V4 as it is already present in row 1 and column V2. For one row the values present under the columns do not repeat. – Ashi Aug 21 '21 at 18:07
  • Okay, then what would the result look like if "b34er12" was also present in row 1, V2 (note, I mean your labeled row 1, not row 0)? – jdobres Aug 21 '21 at 18:09
  • @jdobres okay I got it. I have updated my question adding a few more examples. Hope that will help. To answer your question, if "b34er12" is present in row 1, then 0 and 1 will have that as a common element and 1 and 2 will have 2 elements as their common elements. – Ashi Aug 21 '21 at 18:13

1 Answers1

1

The trick is to convert to a long table that contains all name and element pairs:

Convert to long table containing columns name, variable, and element:

library(dplyr)
library(tidyr)

# Note that you also use the native pipe |> on R >= 4.1
data = data %>% 
  # Add index to data
  mutate(name = row_number()) %>% 
  # Convert from wide to long
  pivot_longer(!name, names_to = 'variable', values_to = 'element') %>% 
  # Remove any NAs
  drop_na()

This give an output that looks like:

  name variable element
1    1     col1      a1
2    2     col1      c2
3    3     col1      a3
4    4     col1      d4
5    1     col2      c2
6    2     col2      v5
7    3     col2      d4
8    4     col2      x6

We can then find all duplicates by grouping by element and filtering for duplicates:

dups = longdata %>% 
  select(-variable) %>% 
  group_by(element) %>% 
  mutate(numdups = n()) %>% 
  filter(numdups > 1) %>% 
  select(-numdups)

All pairs can then be found with an inner join by element. This matches all entries, so we have to filter out joins to the same name (e.g. (name1, name2) = (1,1):

dups %>% inner_join(dups, by = 'element') %>% 
  filter(name.x < name.y) %>% 
  select (name1 = name.x, name2 = name.y, element)
  • Thank you for your help. I have added a name column to the data frame. But when I am trying to convert to long table, it is showing me an error like this: Error in melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Can't melt data.frames with non-atomic 'measure' columns – Ashi Aug 21 '21 at 20:49
  • If I use longdata <- reshape2::melt(data, measure.vars = "name", value.name = "element") will that be okay? As this is giving me no error. But when I am doing dups = longdata %>% select(-variable) %>% group_by(element) %>% mutate(numdups = n()) %>% filter(numdups >1) %>% select(-numdups) , the dups is coming as 0 observations with the 174 columns. Can you kindly help? – Ashi Aug 22 '21 at 02:33
  • No, that doesn't do the trick. What you are looking for is an output that contains three columns, and which described your data in long form. I've edited the answer to provide an example. Can you provide a data sample using `dput`? – All Downhill From Here Aug 22 '21 at 08:24
  • Since the data is so big dput is giving me a very result. Is there any way I can show you a sample of the data? – Ashi Aug 22 '21 at 08:41
  • Just provide the output of `dput(data[ 1:100, ])` (or any other range other than `1:100` that contains at least one duplicate. Alternatively, you can just select rows, e.g. `c(1,4,7)` so that you have at least one duplicate in there. – All Downhill From Here Aug 22 '21 at 08:55
  • Thank you for the help. I am pasting the rows 2 and 19 which have one common element. I am editing my question and putting the dput there. – Ashi Aug 22 '21 at 09:16
  • Thank you for your time, but I am still getting the error Error in `[<-.data.frame`(`*tmp*``, col, value = c(``0` `= "e958e6a0-4546-6861-9a40-c3f267675b7e", : replacement has 1149 rows, data has 1193` when I am trying to do the for loop. I read some posts on similar errors. It is because of the unequal column lengths. Do you have any idea how to get around this issue? – Ashi Aug 22 '21 at 18:11
  • I have demonstrated a working solution for your minimum reproducible example. You seem to have a separate issue converting your dataframe to long form with the data structure you’ve selected. I suggest to open a separate question on this topic, where you provide more detail on how the data was constructed. – All Downhill From Here Aug 22 '21 at 18:21
  • Sure. Thanks for the help. I will add a new question asking about converting my data to a long form. – Ashi Aug 24 '21 at 22:57
  • 1
    Turn out `reshape2` isn't maintained anymore, and `tidyr` is now the recommended package. I've updated the code above. Give it a spin and see if it works for you. – All Downhill From Here Aug 25 '21 at 07:50