I have a data frame which contains households' information on their trips in a day.

df <- data.frame(
hid=c("10001","10001","10001","10001"),
mid=c(1,2,3,4),
thc=c("010","01010","0","02030"),
mdc=c("000","01010","0","02020"),
thc1=c(0,0,0,0),
thc2=c(1,1,NA,2),
thc3=c(0,0,NA,0),
thc4=c(NA,1,NA,3),
thc5=c(NA,0,NA,0),
mdc1=c(0,0,0,0),
mdc2=c(0,1,NA,2),
mdc3=c(0,0,NA,0),
mdc4=c(NA,1,NA,2),
mdc5=c(NA,0,NA,0)
)

hid: household id (actual data frame has further households)
mid: household member id
thc: string indicating the sequence of a member's daily movement;
0=inside house, any other digit=unique ID of a place s/he visited

Thus, if it's coded as 01020, it means that s/he visited place 1 from home (0), returned home (0), then visited another place 2 from home (0) and returned home (0) again, all in one day.
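To make the coding concrete, here is a small base R sketch of decoding one string by hand. The helper name `decode_trips` is made up for this illustration, and it assumes every place ID is a single digit, as in the example data.

```r
# Hypothetical helper: turn a movement code such as "01020" into an edge list.
# Assumes every place ID is a single digit (0 = home).
decode_trips <- function(code) {
  stops <- as.integer(strsplit(code, "")[[1]])
  n <- length(stops)
  if (n < 2) {
    # A member who stayed home ("0") made no trips
    return(data.frame(from = integer(0), to = integer(0)))
  }
  # Pair each stop with the stop that follows it
  data.frame(from = stops[-n], to = stops[-1])
}

trips <- decode_trips("01020")
print(trips)
```

Each row is one trip, so "01020" gives four trips: 0 to 1, 1 to 0, 0 to 2 and 2 to 0.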

The IDs in thc are split into separate columns, thc1, thc2, thc3, thc4 and thc5. The number of thc columns is set by the longest movement sequence in a household.
If one member's code has length 5 and the others' have length 3, thc4 and thc5 of the other members are padded with NA.
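As an illustration of that padding rule, the split columns can be rebuilt from the code strings roughly like this. It is only a sketch: it assumes single-digit place IDs, and `codes` and `split_codes` are names chosen here, not part of the data.

```r
# Illustrative only: rebuild the padded columns from the code strings.
# Assumes single-digit place IDs.
codes <- c("010", "01010", "0", "02030")
max_len <- max(nchar(codes))

split_codes <- t(sapply(strsplit(codes, ""), function(x) {
  # Pad shorter sequences with NA up to the longest one in the household
  as.integer(c(x, rep(NA, max_len - length(x))))
}))
colnames(split_codes) <- paste0("thc", seq_len(max_len))
print(split_codes)
```

The result has one row per member and one column per position in the longest sequence.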

mdc: variable indicating the attribute of the activity done at each place. For instance, 1=work and 2=school. It is also split into the mdc1 to mdc5 columns.

Now, what I am trying to obtain is a list containing an adjacency matrix and a node list for network analysis (for use in, e.g., igraph), built from the information in the df.

This is the desired outcome:

# Desired list
[1] # It represents first element grouped by `hid`.
    # In the actual data frame, there are around 40,000
    # households which contains different `hid`.

$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001

$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4

$trip # `adjacency matrix` of each `mid`
      # head of line indicates destination area id
      # leftmost column indicates origin area id
      # for example of [1], 'mid'=1 took 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2 
  0 1
0 0 2
1 2 0
[3]
  0
0 0
[4]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0

$node # Attribute of each area defined in `mdc`
      # for instance, the mdc of `mid`=4 is `02020`: s/he did activity `2`
      # in area ids `2` and `3`, as indicated in `thc` and `thc1-4`.
      # The number does not indicate "how many times s/he did an activity in the area"
      # but indicates "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
   0   0    0    0     0
   1   0    1   NA    NA
   2  NA   NA   NA     2
   3  NA   NA   NA     2

[2] # Next element continues same information of other hid
    # Thus, from `hid` to `mdc` are one set of attributes of one element

It is quite complicated to convert df to the desired list with my current knowledge of lists and data conversion. For instance, to create the adjacency matrix, I need to look at consecutive entries (the previous and the next stop) in thc or thc1-5. For node, I also need to obtain the maximum area id and store the information from mdc or mdc1-5.
I would highly appreciate any suggestions on how to start this work.

I prefer to use tidyverse, purrr and their families, but I have not used purrr for list operations. I have used the former for data manipulation but am not familiar with list operations.

After this operation, I will visualize the movement and activity pattern of each household (not each member) in igraph or other packages such as ggnetwork or networkD3, to find the dominant patterns from the distribution of all patterns.

HSJ

1 Answer

Here are two helper functions that build the adjacency matrix and the activity matrix:

## Build the adjacency matrices (details in comments)

build_adj_mat <- function(thc_) {
  # Split the code string (factor or character) into single-digit area ids
  if (is.factor(thc_) || is.character(thc_)) {
    thc_ <- as.numeric(unlist(strsplit(as.character(thc_), "")))
  }

  # Create a matrix with the correct dimensions, and give names
  mat <- matrix(0, nrow = max(thc_) + 1, ncol = max(thc_) + 1)
  rownames(mat) <- colnames(mat) <- seq(0, max(thc_))

  # Count one trip for each pair of consecutive areas
  # (seq_len() yields no iterations for a single-stop code such as "0")
  for (i in seq_len(length(thc_) - 1)) {
    from <- thc_[i] + 1
    to <- thc_[i + 1] + 1
    mat[from, to] <- mat[from, to] + 1
  }
  return(mat)
}
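If you would rather avoid the explicit loop, the same counts can probably be obtained by cross-tabulating each stop against the next one. This is just a sketch under the same single-digit assumption; `build_adj_mat_tab` is a name invented here, not an existing function.

```r
# Sketch of a loop-free variant (hypothetical name, single-digit area ids assumed).
build_adj_mat_tab <- function(code) {
  stops <- as.integer(strsplit(as.character(code), "")[[1]])
  lv <- 0:max(stops)
  n <- length(stops)
  # Fix the factor levels so that unvisited areas still get a row and a column
  from <- factor(stops[-n], levels = lv)
  to <- factor(stops[-1], levels = lv)
  tab <- table(from, to)
  # Convert the table to a plain numeric matrix with area ids as dimnames
  matrix(as.numeric(tab), nrow = length(lv),
         dimnames = list(as.character(lv), as.character(lv)))
}

m <- build_adj_mat_tab("02030")
print(m)
```

Rows are origins and columns are destinations, matching the layout asked for in the question.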


## Build the activity matrix / node

build_node_df <- function(df_) {
  # get the maximum area id in the household
  max_len <-
    max(as.numeric(unlist(strsplit(
      as.character(df_$thc), ""
    ))))
  # Build one member's activity vector, indexed by area id
  build_act_mat <- function(loc_, act_, max_area = max_len) {
    if (is.factor(loc_) || is.character(loc_)) {
      loc_ <- as.numeric(unlist(strsplit(as.character(loc_), "")))
    }
    if (is.factor(act_) || is.character(act_)) {
      act_ <- as.numeric(unlist(strsplit(as.character(act_), "")))
    }
    area <- rep(NA, max_area + 1)
    for (i in seq_along(loc_)) {
      area[loc_[i] + 1] <- act_[i]
    }
    return(area)
  }
  # Call the function for every member; matrix() keeps one column per member
  # even when mapply() simplifies its result to a plain vector
  out <- matrix(mapply(build_act_mat, df_$thc, df_$mdc), nrow = max_len + 1)
  # cbind the output with the areas
  out <- data.frame(cbind(0:max_len, out))
  # Assign proper column names
  colnames(out) <- c("area", paste("mid_", df_$mid, sep = ""))
  return(out)
}

Then a function that applies those functions to the df, with some additions for your hid and mid output:

build_list <- function(dfo) {
  hid_ <- as.numeric(as.character(dfo$hid))
  mid_ <- as.numeric(as.character(dfo$mid))
  trip_ <- lapply(dfo$thc, build_adj_mat)
  node_ <- build_node_df(dfo)

  return(list(
    hid = hid_,
    mid = mid_,
    trip = trip_,
    node = node_)
    )
}

Output:

> build_list(df)
$hid
[1] 10001 10001 10001 10001

$mid
[1] 1 2 3 4

$trip
$trip[[1]]
  0 1
0 0 1
1 1 0

$trip[[2]]
  0 1
0 0 2
1 2 0

$trip[[3]]
  0
0 0

$trip[[4]]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0


$node
  area mid_1 mid_2 mid_3 mid_4
1    0     0     0     0     0
2    1     0     1    NA    NA
3    2    NA    NA    NA     2
4    3    NA    NA    NA     2

I'm sure there's a way to get this to work with dplyr, but it's probably easier to just use split from base R. With this slightly modified dataframe:

df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020")
)

Now split the new dataframe into a list and use lapply to apply the build_list function to each piece:

split_df2 <- split(df2, df2$hid)
names(split_df2) <- paste("hid_", names(split_df2), sep = "")
lapply(split_df2, build_list)

Output:

$hid_10001
$hid_10001$hid
[1] 10001

$hid_10001$mid
[1] 1

$hid_10001$trip
$hid_10001$trip[[1]]
  0 1
0 0 1
1 1 0


$hid_10001$node
  area mid_1
1    0     0
2    1     0


$hid_10002
$hid_10002$hid
[1] 10002 10002

$hid_10002$mid
[1] 2 3

$hid_10002$trip
$hid_10002$trip[[1]]
  0 1
0 0 2
1 2 0
...
...
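Since the question mentions visualizing each household rather than each member, one possible next step is to add up the member matrices within a household. The following is only a rough sketch: `sum_trip_mats` is a name made up here, and it assumes each matrix has rows and columns named by single-digit area id, as in the output above.

```r
# Hypothetical helper: sum member-level adjacency matrices into one
# household-level matrix. Assumes dimnames are area ids such as "0", "1", ...
sum_trip_mats <- function(mats) {
  areas <- sort(unique(unlist(lapply(mats, rownames))))
  total <- matrix(0, nrow = length(areas), ncol = length(areas),
                  dimnames = list(areas, areas))
  for (m in mats) {
    # Add each member's counts into the matching rows and columns
    total[rownames(m), colnames(m)] <- total[rownames(m), colnames(m)] + m
  }
  total
}

# Two example member matrices, built by hand for codes "010" and "02030"
m_a <- matrix(c(0, 1, 1, 0), nrow = 2,
              dimnames = list(c("0", "1"), c("0", "1")))
m_b <- matrix(0, nrow = 4, ncol = 4,
              dimnames = list(as.character(0:3), as.character(0:3)))
m_b["0", "2"] <- 1; m_b["2", "0"] <- 1
m_b["0", "3"] <- 1; m_b["3", "0"] <- 1

household <- sum_trip_mats(list(m_a, m_b))
print(household)
```

Indexing by name rather than by position is what lets matrices of different sizes be combined.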

Hope that gets you pointed in the right direction!

Luke C
  • Thank you for your perfect idea! I will try to understand and apply it for my actual data frame. I will confirm how it works for multiple households data frame. – HSJ Jul 16 '18 at 11:58