
I am trying to make a combination analysis that shows the results in a plot. I have a data frame with 9 columns and each column consists of different percentages or NA's if a value was not present in the sample.

The example code I have used for this can be found here: https://epirhandbook.com/en/combinations-analysis.html

The issue is that one line changes the 1s to 0s and vice versa. The line is:

data <- data %>%
  mutate(across(all_of(columns), ~ as.integer(. %in% c("yes", NA))))

The full code that I have used is:

library(tidyverse)
library(UpSetR)
library(ggupset)

data <- META_new[c("lengthpergram","countpergram","acrylrel",
                   "cottonrel","polyestrel","polyamiderel",
                   "elastaanrel","lyocellrel","viscoserel",
                   "nylonrel","wolrel")]

columns <- c("acrylrel", "cottonrel", "polyestrel", "polyamiderel",
             "elastaanrel", "lyocellrel", "viscoserel", "nylonrel", "wolrel")

for (col in columns) {
  data[[col]][data[[col]] > 0] <- "yes"
  data[[col]][data[[col]] == 0] <- NA
}

data <- data %>%
  mutate(acrylrel = ifelse(acrylrel == "yes", 1, 0),
         cottonrel = ifelse(cottonrel == "yes", 1, 0),
         polyestrel = ifelse(polyestrel == "yes", 1, 0),
         polyamiderel = ifelse(polyamiderel == "yes", 1, 0),
         elastaanrel = ifelse(elastaanrel == "yes", 1, 0),
         lyocellrel = ifelse(lyocellrel == "yes", 1, 0),
         viscoserel = ifelse(viscoserel == "yes", 1, 0),
         nylonrel = ifelse(nylonrel == "yes", 1, 0),
         wolrel = ifelse(wolrel== "yes", 1, 0),)

data <- data %>%
  mutate(across(all_of(columns), ~ as.integer(. %in% c("yes", NA))))

data %>%
  UpSetR::upset(
    sets = columns,
    order.by = "freq",
    sets.bar.color = c("red", "orange", "yellow", "green", "cyan", "blue", "purple", "pink", "salmon"),
    empty.intersections = "on",
    number.angles = 0,
    point.size = 2,
    line.size = 1, 
    mainbar.y.label = "Fabric combinations by frequency",
    sets.x.label = "Types of fabric present in samples")

The code produces a plot, but it assigns the wrong column names to the values. For example, polyestrel should be the most frequent, but the plot puts lyocellrel in that position, even though lyocellrel is the least frequent.

Unfortunately I cannot add the df, as it is too big, but I hope someone has suggestions on how to fix this (if this line is even the problem).

I changed some of the original code from the website. The original was:

 mutate(across(c(fever, chills, cough, aches, vomit), .fns = ~+(.x == "yes")))

Because when I tried it I got this error:

Error in start_col:end_col : argument of length 0

First 5 rows

data <- data.frame(
  acrylrel = c(0.00000, 0.00000, 0.00000, 36.61972, 0.00000),
  cottonrel = c(9.089974, 65.000000, 0.000000, 19.014085, 8.500000),
  polyestrel = c(83.72237, 35.00000, 42.81081, 44.36620, 15.00000),
  polyamiderel = c(5.583548, 0.000000, 53.594595, 0.000000, 40.000000),
  elastaanrel = c(1.604113, 0.000000, 3.594595, 0.000000, 1.500000),
  lyocellrel = c(0, 0, 0, 0, 0),
  viscoserel = c(0, 0, 0, 0, 0),
  nylonrel = c(0, 0, 0, 0, 0),
  wolrel = c(0, 0, 0, 0, 0)
)
  • Could you make this reproducible by editing your question to include the output of `dput(data)` – jpsmith Jul 29 '23 at 19:08
  • I have included the data, but left out the final column of wolrel as the data would otherwise be too large – Melanie A Kool Jul 29 '23 at 19:55
  • Hi Melanie! Welcome to StackOverflow! What is META_new? – Mark Jul 30 '23 at 06:28
  • it might be good to upload the data somewhere else (e.g. Github) or just include the head of the data instead, whichever is easiest/reproduces the problems – Mark Jul 30 '23 at 06:32
  • Hi! I have added the first five rows instead. I think this would work to see why it messes the variables up in the plot, hopefully it will – Melanie A Kool Jul 30 '23 at 11:18
  • the issue is the line `data <- data %>% mutate(across(all_of(columns), ~ as.integer(. %in% c("yes", NA))))` look at `data` before it, and then run it, and then look at data after it. NAs become 1s and 1s become 0s – Mark Jul 30 '23 at 12:27
  • What are you actually trying to do in that part? It seems like you are trying to assign every value larger than 0 to be 1, and the 0s stay 0 – Mark Jul 30 '23 at 12:28
  • Hey @mark - fyi it’s generally not recommended to suggest providing links to data, but instead encourage folks to provide smaller samples of their data (i.e., `dput(head(df, 10))` for the first 10 rows). This is because it helps the long-term posterity of the site since links may go bad over time, etc. – jpsmith Jul 30 '23 at 12:38
  • @jpsmith you're right, point taken. it was more a fallback if for whatever reason they aren't able to do that. I'm more than happy to download a csv and then create a dput of the head for someone if they can't do that. I did offer both – Mark Jul 30 '23 at 12:45
  • @MelanieAKool I commented the code in my answer below. Take a look and let me know if you understand where things went wrong! – Mark Jul 30 '23 at 12:45
  • Hi @Mark! I applied it to the data and it showed the graph correctly, thank you so much! It was indeed in the NA's – Melanie A Kool Jul 30 '23 at 19:58
  • glad I could help :-) – Mark Jul 31 '23 at 02:13

1 Answer

This appears to be what you want:

data %>%
  mutate(across(everything(), ~ as.integer(. > 0))) %>%
  UpSetR::upset(
    sets = columns,
    order.by = "freq",
    sets.bar.color = c("red", "orange", "yellow", "green", "cyan", "blue", "purple", "pink", "salmon"),
    empty.intersections = "on",
    number.angles = 0,
    point.size = 2,
    line.size = 1, 
    mainbar.y.label = "Fabric combinations by frequency",
    sets.x.label = "Types of fabric present in samples")

Output: plot
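
If `data` also holds non-fabric columns (the version in the question keeps `lengthpergram` and `countpergram`), a variant that only touches the fabric columns may be safer than `everything()`. Below is a minimal base-R sketch using the five sample rows from the question. Treating `NA` as "absent" is an assumption, and `to_indicator` is a hypothetical helper name, not part of any package:

```r
# Five sample rows from the question (fabric percentages per sample)
data <- data.frame(
  acrylrel     = c(0, 0, 0, 36.61972, 0),
  cottonrel    = c(9.089974, 65, 0, 19.014085, 8.5),
  polyestrel   = c(83.72237, 35, 42.81081, 44.3662, 15),
  polyamiderel = c(5.583548, 0, 53.594595, 0, 40),
  elastaanrel  = c(1.604113, 0, 3.594595, 0, 1.5),
  lyocellrel   = c(0, 0, 0, 0, 0),
  viscoserel   = c(0, 0, 0, 0, 0),
  nylonrel     = c(0, 0, 0, 0, 0),
  wolrel       = c(0, 0, 0, 0, 0)
)

# In this sample every column is a fabric column; with the full data,
# use the explicit `columns` vector from the question instead
columns <- names(data)

# Hypothetical helper: 1 if the fabric is present (> 0),
# 0 if it is absent or missing (assumption: NA means "not present")
to_indicator <- function(x) as.integer(!is.na(x) & x > 0)

# Convert only the fabric columns, leaving any other columns untouched
data[columns] <- lapply(data[columns], to_indicator)

# Per-fabric counts; these should match the set-size bars in the UpSet plot
set_sizes <- colSums(data[columns])
print(set_sizes)
```

Comparing `set_sizes` with the bars in the plot is a quick way to confirm that no column has been flipped. In a dplyr pipeline the same idea could be written as `mutate(across(all_of(columns), ~ as.integer(!is.na(.x) & .x > 0)))`.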

Going through your code part by part:

# this turns every value into "yes" if positive, or NA if 0
for (col in columns) {
  data[[col]][data[[col]] > 0] <- "yes"
  data[[col]][data[[col]] == 0] <- NA
}

# this is the same as above, but all of the "yes" values have been turned into 1s.
# Note that (frustratingly!) NA == "yes" is NA, not FALSE, as you would think, so the
# NA values stay NA. The way to check for NA values is with the function is.na()
data <- data %>%
  mutate(acrylrel = ifelse(acrylrel == "yes", 1, 0),
         cottonrel = ifelse(cottonrel == "yes", 1, 0),
         polyestrel = ifelse(polyestrel == "yes", 1, 0),
         polyamiderel = ifelse(polyamiderel == "yes", 1, 0),
         elastaanrel = ifelse(elastaanrel == "yes", 1, 0),
         lyocellrel = ifelse(lyocellrel == "yes", 1, 0),
         viscoserel = ifelse(viscoserel == "yes", 1, 0),
         nylonrel = ifelse(nylonrel == "yes", 1, 0),
         wolrel = ifelse(wolrel== "yes", 1, 0),)

# with this line, because you've already turned the "yes" values into 1s,
# `. %in% c("yes", NA)` evaluates to FALSE for the 1s and TRUE for the NA values
# (unlike ==, %in% treats NA as matching NA), so the 1s become 0s and the NAs become 1s
data <- data %>%
  mutate(across(all_of(columns), ~ as.integer(. %in% c("yes", NA))))
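
The flip can be reproduced on a single vector. This is a small base-R sketch with made-up values (`x` and the `step*` names are hypothetical, not taken from the question's data), following the three steps above:

```r
x <- c(0, 36.6, 0, 44.4)              # hypothetical percentages

# Step 1: positive -> "yes", zero -> NA (as in the for loop)
step1 <- x
step1[step1 > 0] <- "yes"             # the vector becomes character here
step1[step1 == 0] <- NA               # "0" == 0 is TRUE after coercion
print(step1)                          # NA "yes" NA "yes"

# Step 2: "yes" -> 1; NA == "yes" is NA, so the NAs stay NA
step2 <- ifelse(step1 == "yes", 1, 0)
print(step2)                          # NA 1 NA 1

# Step 3: the problematic line; NA matches NA inside %in%, 1 matches nothing
step3 <- as.integer(step2 %in% c("yes", NA))
print(step3)                          # 1 0 1 0  (flipped)

# Direct conversion from the original values
fixed <- as.integer(x > 0)
print(fixed)                          # 0 1 0 1
```

Every element of `step3` is the opposite of `fixed`, which is why the least frequent fabric shows up as the most frequent one in the plot.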
Mark