How to arrange nested data (i.e., data with parenting) in R?

Question

I have a dataset with multiple levels:

Categories (e.g., "Countries")
Countries (e.g., "USA")
Cities (e.g., "New York")
Counties (e.g., "Manhattan")
Places (e.g., "Times Square")

Each row (except for level 1 entries) is linked to a parent one level above.

For example: Times Square -> Manhattan -> New York -> USA -> Countries

My question: how to sort this dataset:

df2 <- structure(list(ID = c(3,6,9,11,12,19,41,50,77,83,105),
                      Parent = c(12,12,77,105,19,NA,3,41,19,77,19),
                      Level = c(3,3,3,3,2,1,4,5,2,3,2),
                      Name = c("New York","Boston","Oxford","Vancouver","USA","Countries",
                               "Manhattan","Times Square","UK","London","Canada")),
                 class = "data.frame",
                 row.names = c(NA, -11L))

into this:

df2 <- structure(list(ID = c(19,12,3,41,50,6,77,83,9,105,11),
                      Parent = c(NA,19,12,3,41,12,19,77,77,19,105),
                      Level = c(1,2,3,4,5,3,2,3,3,2,3),
                      Name = c("Countries","USA","New York","Manhattan","Times Square",
                               "Boston","UK","London","Oxford","Canada","Vancouver")),
                 class = "data.frame",
                 row.names = c(NA, -11L))

In the desired result, the root comes first and every entry is followed directly by its linked sub-levels, rather than all entries of one level being grouped together.

I have tried several dplyr::arrange() variants (e.g., arrange(Level, Parent)), but all of them fail to account for the nesting. I think the solution might be a combination of group_by() and arrange(..., .by_group = TRUE), as done here (R, dplyr - combination of group_by() and arrange() does not produce expected result?). Unfortunately, I couldn't solve it by myself.

Can anyone help? A tidyverse/dplyr solution would be preferred :-)

Have you considered using a nested list for your data? I find it easier to manage structured data/data with varying levels. Just a thought. Good luck! — jpsmith, Mar 29 '22 at 11:39

Stefano Barbi · Accepted Answer · 2022-03-29T12:58:45.147

5

Here is a solution using igraph::dfs(), which runs a depth-first search starting from the root:

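library(igraph)
library(dplyr)  # for left_join()

# One directed Parent -> ID edge per row; na.omit() drops the root, which has no parent
g <- with(na.omit(df2), graph.data.frame(cbind(Parent, ID), directed = TRUE))

# Depth-first search from the root (ID 19) visits each parent right before its descendants
data.frame(ID = as.integer(names(dfs(g, root = "19")$order))) |>
  left_join(df2)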
           
##> + Joining, by = "ID"
##>     ID Parent Level         Name
##> 1   19     NA     1    Countries
##> 2   12     19     2          USA
##> 3    3     12     3     New York
##> 4   41      3     4    Manhattan
##> 5   50     41     5 Times Square
##> 6    6     12     3       Boston
##> 7   77     19     2           UK
##> 8    9     77     3       Oxford
##> 9   83     77     3       London
##> 10 105     19     2       Canada
##> 11  11    105     3    Vancouver
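
Since the question asked for a tidyverse/dplyr approach, here is a minimal sketch of the same depth-first ordering without igraph. It is not part of the original answer. The idea is to give every row a sort key made of the IDs on its path from the root, then arrange() by that key. The helper path_key() and the key column are hypothetical names introduced for illustration. The sketch assumes the IDs are non-negative whole numbers with at most five digits and that the Parent links contain no cycles. df2 is redefined here so the block runs on its own.

```r
library(dplyr)

df2 <- data.frame(
  ID     = c(3, 6, 9, 11, 12, 19, 41, 50, 77, 83, 105),
  Parent = c(12, 12, 77, 105, 19, NA, 3, 41, 19, 77, 19),
  Level  = c(3, 3, 3, 3, 2, 1, 4, 5, 2, 3, 2),
  Name   = c("New York", "Boston", "Oxford", "Vancouver", "USA", "Countries",
             "Manhattan", "Times Square", "UK", "London", "Canada")
)

# Hypothetical helper: walk from `id` up to the root and return the
# root-to-node chain of IDs as one zero-padded, "/"-separated string.
# Assumes no cycles in the Parent links (a cycle would loop forever).
path_key <- function(id, data) {
  chain <- c()
  while (!is.na(id)) {
    chain <- c(id, chain)                      # prepend, so the root ends up first
    id <- data$Parent[match(id, data$ID)]      # move one level up (NA at the root)
  }
  paste(sprintf("%05d", as.integer(chain)), collapse = "/")
}

sorted <- df2 |>
  mutate(key = vapply(ID, path_key, character(1), data = df2)) |>
  arrange(key) |>       # a parent's key is a prefix of its children's keys
  select(-key)

sorted
```

Siblings end up ordered by ID, so Oxford (9) comes before London (83), just as in the igraph output above. If a different sibling order is needed, the key would have to be built from another column.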

edited Mar 29 '22 at 12:58

answered Mar 29 '22 at 12:24

Stefano Barbi


Thank you so much Stefano! When copying your code, R is complaining about an unexpected token (">"). Is there a typo in your code or am I missing something? – diggi2395 Mar 29 '22 at 12:50

@diggi2395 No, it's not a typo. `|>` is the new pipe operator in R-4.1. You can safely substitute with `%>%`. – Stefano Barbi Mar 29 '22 at 12:56
Ahh, I see! Unfortunately, I got another error as the object "out" in left_join() couldn't be found? – diggi2395 Mar 29 '22 at 12:58
@diggi2395 eheh that was a typo. It should be fixed now! – Stefano Barbi Mar 29 '22 at 12:59
Perfect, now it worked! Thanks again :-) – diggi2395 Mar 29 '22 at 13:02
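
For readers who hit the same "unexpected token" error: a small sketch of how the two pipes relate, assuming the magrittr package is installed (dplyr re-exports its %>% as well). The vector x and the variable names are made up for illustration.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)

with_native   <- x |> sqrt() |> sum()    # native pipe, needs R >= 4.1
with_magrittr <- x %>% sqrt() %>% sum()  # magrittr pipe, works on older R too
nested        <- sum(sqrt(x))            # the plain nested call both pipes expand to

stopifnot(with_native == nested, with_magrittr == nested)
```

On R versions before 4.1 the first pipeline is a syntax error, which is what the error message in the comment above points to.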
