Split a data frame based on an ordered multi-level factor column

Question

I would like to split a data frame into a list of data frames. The rule for splitting is that a father is always followed by a mother, who in turn is followed by offspring. However, each family member may occupy more than one row, and those rows are always consecutive (e.g. father number 1 is in rows 1 and 2). In the example below I have two families, so I am trying to get a list with two data frames.

My input:

df <- 'Chr  Start   End Family
1   187546286   187552094   father
3   108028534   108032021   father
1   4864403 4878685 mother
1   18898657    18904908    mother
2   460238  461771  offspring
3   108028534   108032021   offspring
1   71481449    71532983    father
2   74507242    74511395    father
2   181864092   181864690   mother
1   71481449    71532983    offspring
2   181864092   181864690   offspring
3   160057791   160113642   offspring'

df <- read.table(text=df, header=TRUE)

Thus, my expected output dfout[[1]] would look like:

dfout <- 'Chr   Start   End Family
1   187546286   187552094   father
3   108028534   108032021   father
1   4864403 4878685 mother
1   18898657    18904908    mother
2   460238  461771  offspring
3   108028534   108032021   offspring'

dfout <- read.table(text=dfout, header=TRUE)

I'm not understanding what logic you have that dictates when you go from one family to the next down the rows of your data frame. — Phil, Oct 31 '16 at 16:26

Pierre L · Accepted Answer · 2016-10-31T16:32:27.040

To split each family into a separate data frame, you need an index indicating where one family ends and the next begins. For the index, I used "father" as the change-point. We cannot simply use indx <- df$Family == "father", since there can be several consecutive 'father' rows. Instead, we take diff() of that logical vector and look for where it equals 1, which marks the switch from 'offspring' to 'father'. The leading 1L makes the first row count as a start, and cumsum() turns those starts into a running family number.

indx <- cumsum(c(1L, diff(df$Family == "father")) == 1L)
split(df, indx)
# $`1`
#   Chr     Start       End    Family
# 1   1 187546286 187552094    father
# 2   3 108028534 108032021    father
# 3   1   4864403   4878685    mother
# 4   1  18898657  18904908    mother
# 5   2    460238    461771 offspring
# 6   3 108028534 108032021 offspring
# 
# $`2`
#    Chr     Start       End    Family
# 7    1  71481449  71532983    father
# 8    2  74507242  74511395    father
# 9    2 181864092 181864690    mother
# 10   1  71481449  71532983 offspring
# 11   2 181864092 181864690 offspring
# 12   3 160057791 160113642 offspring
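
To make the index easier to follow, here is a minimal sketch that breaks the one-liner into its intermediate steps. It uses only the Family column from the question; the variable names family, is_father and step are introduced here for illustration and are not part of the original answer.

```r
# Family column from the question (coordinates omitted for brevity)
family <- c("father", "father", "mother", "mother", "offspring", "offspring",
            "father", "father", "mother", "offspring", "offspring", "offspring")

# TRUE on every 'father' row, FALSE elsewhere
is_father <- family == "father"

# diff() gives 1 where a run of fathers starts, -1 where it ends, 0 otherwise;
# the leading 1L marks the first row as the start of the first family
step <- c(1L, diff(is_father))

# Counting the starts gives a family number for every row
indx <- cumsum(step == 1L)
print(indx)
```

Each row of the first family gets index 1 and each row of the second gets index 2, which is what split() then groups on. This relies on the assumption stated in the question that every family begins with a father.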

score 0 · Answer 2 · answered Nov 09 '16 at 14:03

It would be more helpful if you posted the code you use to produce your actual data frame. I don't have time to redo everything, but I'll show you how it works, in a generic sense.

gender <- c("M","M","F","F","F","F","M","M","M","M","F","F")
values <- c(20,22,24,19,9,17,18,22,12,14,7,8)
fruit <- c("apple","pear","mango","mango","mango","apple","banana","banana","banana","mango","apple","apple")
df <- data.frame(gender, values, fruit)


> df
   gender values  fruit
1       M     20  apple
2       M     22   pear
3       F     24  mango
4       F     19  mango
5       F      9  mango
6       F     17  apple
7       M     18 banana
8       M     22 banana
9       M     12 banana
10      M     14  mango
11      F      7  apple
12      F      8  apple

split(df, df$gender)

$F
   gender values fruit
3       F     24 mango
4       F     19 mango
5       F      9 mango
6       F     17 apple
11      F      7 apple
12      F      8 apple

$M
   gender values  fruit
1       M     20  apple
2       M     22   pear
7       M     18 banana
8       M     22 banana
9       M     12 banana
10      M     14  mango
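
Note that split(df, df$gender) groups by value, so rows 1-2 and 7-10 end up in the same piece. If consecutive runs should stay separate, as in the question, one possible approach is to build a run index first. The following is only a sketch on the same toy data; df2, new_run, run_id and runs are names made up for this example.

```r
gender <- c("M","M","F","F","F","F","M","M","M","M","F","F")
values <- c(20,22,24,19,9,17,18,22,12,14,7,8)
df2 <- data.frame(gender, values)

# A new run starts on the first row and wherever the value
# differs from the value on the previous row
new_run <- c(TRUE, gender[-1] != gender[-length(gender)])

# Running count of run starts: one id per block of identical neighbours
run_id <- cumsum(new_run)

# One data frame per consecutive run instead of one per distinct value
runs <- split(df2, run_id)
print(names(runs))
```

With this data the result has four pieces (M, F, M, F) rather than two.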
