I have a data frame that contains information on the trips made by household members in one day.
df <- data.frame(
hid=c("10001","10001","10001","10001"),
mid=c(1,2,3,4),
thc=c("010","01010","0","02030"),
mdc=c("000","01010","0","02020"),
thc1=c(0,0,0,0),
thc2=c(1,1,NA,2),
thc3=c(0,0,NA,0),
thc4=c(NA,1,NA,3),
thc5=c(NA,0,NA,0),
mdc1=c(0,0,0,0),
mdc2=c(0,1,NA,2),
mdc3=c(0,0,NA,0),
mdc4=c(NA,1,NA,2),
mdc5=c(NA,0,NA,0)
)
`hid`: household id (the actual data frame has further households)
`mid`: household member id
`thc`: string indicating the sequence of the member's movements in a day; 0 = at home, and any other digit is the unique ID of a place s/he visited. Thus, `01020` means that s/he went from home (0) to place 1, returned home (0), went to another place 2, and returned home (0) again, all in one day.
The IDs in `thc` are split into the columns `thc1`, `thc2`, `thc3`, `thc4` and `thc5`. The number of `thc` columns is set by the longest movement sequence in the household: if one member's sequence has 5 codes and the others' have 3, `thc4` and `thc5` of the other members are padded with NA.
`mdc`: string indicating the attribute of the activity carried out at each place, for instance 1 = work and 2 = school. It is split into the columns `mdc1` to `mdc5` in the same way.
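To make the relation between the strings and the split columns concrete, here is a minimal base R sketch. It assumes every place or activity id is a single digit, and `split_codes` is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: split code strings such as "02030" into one
# integer per step, padded with NA up to `width` steps.
# Assumption: every id is a single digit.
split_codes <- function(x, width = max(nchar(x))) {
  steps <- strsplit(x, "")
  # vapply returns a width x length(x) matrix; transpose to one row per member
  t(vapply(steps, function(s) {
    as.integer(c(s, rep(NA, width - length(s))))
  }, integer(width)))
}

thc <- c("010", "01010", "0", "02030")
split_codes(thc)  # one row per member, one column per step
```

The same call on `mdc` would give the activity codes per step.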
Now, what I am trying to obtain is a list containing an adjacency matrix and a node list for network analysis (for use in, e.g., `igraph`), built from the information in `df`.
This is the desired outcome:
# Desired list
[1] # First element, grouped by `hid`.
# The actual data frame has around 40,000
# households, each with a different `hid`.
$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001
$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4
$trip # `adjacency matrix` of each `mid`
# column headers indicate the destination area id
# row labels (leftmost column) indicate the origin area id
# for example, in [1], `mid`=1 made 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2
  0 1
0 0 2
1 2 0
[3] # It represents `mid`=3
  0
0 0
[4] # It represents `mid`=4
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0
$node # Attribute of each area as defined in `mdc`
# for instance, the mdc of `mid`=4 is `02020`: s/he did activity `2` twice,
# in area ids `2` and `3`, as indicated in `thc` and `thc1-5`.
# The number does not indicate "how many times s/he did an activity in the area"
# but "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
0    0    0    0    0
1    0    1    NA   NA
2    NA   NA   NA   2
3    NA   NA   NA   2
[2] # The next element holds the same information for another `hid`
# Thus, `hid` to `node` form one set of attributes per element
It is quite complicated to convert `df` into the desired list with my current knowledge of lists and data conversion. For instance, to create the adjacency matrix, I need to look at each pair of consecutive entries in `thc` (or `thc1`-`thc5`). For the node list, I also need to obtain the maximum area id and store the information from `mdc` (or `mdc1`-`mdc5`).
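As an illustration of the pairwise lookup described above, the following is a minimal base R sketch, not a full solution. It assumes single-digit area ids and includes every id from 0 up to the member's maximum, as in the desired output; `trip_matrix` is a hypothetical helper name.

```r
# Hypothetical helper: count trips between consecutive places
# in one member's movement string (e.g. "02030").
# Assumption: every area id is a single digit.
trip_matrix <- function(code) {
  places <- as.integer(strsplit(code, "")[[1]])
  ids <- 0:max(places)                            # all area ids up to the maximum
  from <- factor(head(places, -1), levels = ids)  # origin of each trip
  to   <- factor(tail(places, -1), levels = ids)  # destination of each trip
  unclass(table(from, to))                        # rows = origin, columns = destination
}

thc <- c("010", "01010", "0", "02030")
trips <- lapply(thc, trip_matrix)  # one matrix per member
trips[[4]]
```

Base R is used here only to keep the sketch free of dependencies; `purrr::map()` could take the place of `lapply()`, and grouping by `hid` would still have to be added around it.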
I would highly appreciate any suggestions on how to start this work.
I prefer to use `tidyverse`, `purrr` and related packages, but I have not used `purrr` for list operations. I have used the former for data manipulation but am not familiar with list operations.
After this operation, I will visualize the movement and activity patterns of each household (not each member) in `igraph` or other packages such as `ggnetwork` or `networkD3`, to find the dominant patterns from the distribution of patterns.