How to unite multiple columns (character data) without concatenating?

Question

Within my data I have a subset of data that look like this:

Incident | Year | Person1  |Person2|
  :----  |:---: |:------:  | -----:|
        1| 2014 | A        | B     |
        2| 2014 | A        |       |
        3| 2016 | B        | C     |
        4| 2018 | A        | C     |
        5| 2020 | C        |       |

My goal is to count frequencies for each person and graph the data in a barchart (x axis=year, y=frequency, persons grouped). I could do count(data$Person1) and count(data$Person2) and sum them "by hand" to count frequencies, but I will need the data to be combined for the bar chart. I don't think I would use concatenate because I don't want AB, AC, etc.

I tried

group_data<-group_by(.data=data,Incident,Person1,Person2)
sum_counts<-dplyr::summarise(group_data,total.count=n())
View(sum_counts)

But the output is essentially the table above.

Is there an elegant way to group columns Person1 and Person2 without muddling the number of incidents (each row = a distinct incident)? Or in order to make my desired barchart, do I need to totally restructure the data?

Thanks in advance for wisdom.

Data:

dput(srkw.dat) structure(list(incident = 1:78, year = c(1962L, 1976L, 1981L, 1981L, 1982L, 1987L, 1989L, 1990L, 1992L, 1992L, 1992L, 1994L, 1998L, 2003L, 2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2017L, 2018L, 2018L, 2018L, 2019L, 2020L, 2020L, 2020L, 2020L), pod1 = c("L", "L", "L", "L", "L", "L", "L", "K", "L", "L", "L", "L", "L", "J", "L", "L", "J", "L", "L", "L", "J", "J", "J", "J", "J", "L", "L", "J", "L", "L", "J", "J", "K", "L", "L", "J", "L", "L", "K", "L", "L", "L", "J", "J", "J", "L", "L", "K", "J", "L", "K", "", "K", "J", "J", "L", "J", "L", "K", "K", "L", "J", "J", "J", "L", "J", "J", "L", "L", "J", "J", "J", "J", "L", "J", "", "J", "L"), pod2 = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "K", "", "K", "", "", "", "", "K", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "L", "", "", "", "", "", "", "", "", "", "", ""), class = "data.frame", row.names = c(NA, -78L))

What are you trying to get exactly? Are you trying to fill the missing values in Person2, reshape this, something else? — camille, Feb 25 '22 at 04:03
Not fill in the values - trying to get a barplot that will show both person1 and person2 frequencies. Wasn't sure if I had to aggregate person1&2 beforehand. — burphound, Feb 25 '22 at 19:29
Still unclear—you've got a bunch of blanks in one of your person columns, but haven't said what you want to do with those, or how exactly you're trying to aggregate and/or chart this data — camille, Feb 25 '22 at 19:38

zephryl · Answer 1 · 2022-02-25T00:23:02.670

1

You do have to restructure your data, but it's easy -- just a quick pivot_longer().

library(tidyverse)

data %>%
  pivot_longer(Person1:Person2, values_to = "Person", values_drop_na = TRUE) %>%
  ggplot() +
  geom_bar(aes(Year, fill = Person))

edited Feb 25 '22 at 00:23

answered Feb 24 '22 at 23:58

zephryl

14,633
3
11
30

Thanks Zephryl, that is great. My real data have about 80 rows however - is there a way to do this without manually inputting the data for the tribble? – burphound Feb 25 '22 at 00:07
@SarahTeman Oh yes, the tribble is just me importing the sample data from your answer. You should be able to just plug your data in, starting from the `incident_persons <- incidents %>%` line. You would just want to replace `incidents` with whatever your dataframe is called. – zephryl Feb 25 '22 at 00:18
I think my example was unnecessarily detailed, so I’ve given it a bit of a haircut. – zephryl Feb 25 '22 at 00:24
@zephryl what do you mean by example being more detailed? Doesnt this answer your question? – Onyambu Feb 25 '22 at 00:26
It worked - thank you! However the NAs still showed even though i included values_drop_na=TRUE. Do you know why that might be? – burphound Feb 25 '22 at 00:27
@SarahTeman are the NAs showing up in `Person`, or another column? `values_drop_na` only removes rows where the `values_to` column is NA, but leaves NAs elsewhere in the dataset intact. Unrelated, an important caution when using `values_drop_na`: if both `Person1` *and* `Person2` are NA, then that observation will be entirely removed from the pivoted data, which may or may not be what you want. – zephryl Feb 25 '22 at 00:56
The NAs are showing up in Person. What it looks in my bar plot is that there is an undefined value that is equal to the length of the existing bar (eg, if for year YY person 1 had a frequency of 5 and was red, there's an unknown value above that that also has a frequency of 5 and is blue) - and it's like that for each bar. – burphound Feb 25 '22 at 02:25
@SarahTerman can you edit your question to include the data that produces the issue? nb, it’s easiest if you run `dput(data)` and just paste the result into your question. – zephryl Feb 25 '22 at 02:37
@zephryl just edited to include the data! fwiw i removed a lot of extra columns that weren't relevant to the question. – burphound Feb 25 '22 at 19:09
@SarahTeman Looks like the issue is that your missing values aren't actually [`NA`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/NA.html)s -- they're empty strings (`""`). Try converting them to true `NA`s with something like: `data <- data %>% mutate(pod1 = na_if(pod1, ""), pod2 = na_if(pod2, ""))`. You may want to check if this is affecting any other character columns, too. – zephryl Feb 25 '22 at 20:15

langtang · Answer 2 · 2022-02-25T00:22:09.930

0

I think you are trying to do something like this (edited based on OP's clarification that column names start with pod, rather than Person:

dat %>% 
  pivot_longer(cols=starts_with("pod"),values_to = "pod") %>% 
  filter(!is.na(pod)) %>% 
  count(pod,Year)

Output:

# A tibble: 7 x 3
  pod    Year     n
  <chr> <dbl> <int>
1 A      2014     2
2 A      2018     1
3 B      2014     1
4 B      2016     1
5 C      2016     1
6 C      2018     1
7 C      2020     1

Input:

dat = tibble(
    Incident=c(1,2,3,4,5),
    Year =c(2014,2014,2016,2018,2020),
    pod1 = c("A","A","B","A","C"),
    pod2 = c("B",NA,"C","C",NA)
)

edited Feb 25 '22 at 00:22

answered Feb 24 '22 at 23:59

langtang

22,248
1
12
27

Thanks for replying langtang! I did that and got: "Error in count(.,Person,Year) : object 'Person' not found" Do you have any suggestions for fixing that? – burphound Feb 25 '22 at 00:04
I've added my input, since you didn't provide it. Perhaps your table structure if different than mine? – langtang Feb 25 '22 at 00:07
Sorry - I substituted "person" for "pod" in the column names for my question - not sure if that might have affected my syntax? I did: dat %>% pivot_longer(cols=starts_with("pod"),values_to = "pod") %>% filter(!is.na(pod)) %>% count(pod,year) And got the error above. – burphound Feb 25 '22 at 00:12
should work fine if the only difference is pod1 instead of Person1 and pod2 instead of Person2 , see edit – langtang Feb 25 '22 at 00:20
do you have it working now? – langtang Feb 25 '22 at 01:13
I still don't have it working - gah! Not sure what's up since I inputted exactly your code - I'll keep trying to figure it out. – burphound Feb 25 '22 at 02:29

How to unite multiple columns (character data) without concatenating?

2 Answers2