I have a wide dataset containing psychometric measures taken from participants at various timepoints.
Time-varying labels within the psychometric measures take the form QuestionnaireTime_Item#.
An example is dass1_1, where dass = Questionnaire, 1_ = Time_ at which the questionnaire was administered, and 1 = Item# of the relevant questionnaire.
This is mostly consistent across the questionnaires; however, one psychometric variable does not follow this nomenclature: siss1. Its naming is nevertheless consistent with the variables denoting the date and session number of data collection, i.e., date1 and session1.
As can be seen, the time labels for these variables are at the end of the variable names.
However, a number of variables contain a numeral in the name that should not be changed, specifically cff1, cff2, etc. Here the numeral denotes the item number on this measure and not time (these items are asked only once, during the datefinal collection period [see below]).
Time in the variable names is denoted by numerals in most cases (1 to 14), with the exception of the word 'final' (e.g., datefinal, sessionfinal, dassfinal_1, sissfinal) for the last session.
Additionally, there are data collection periods that took place 6 and 12 months after the final-session (datefinal) data collection period.
These are denoted with 6fup or 12fup, e.g., date_6fup and dass6fup_2.
I would like to change the string denoting the time so that it is consistent and sits at the start of each variable name. Additionally, I would like to have an underscore between the name of the questionnaire and the relevant item number. For example:
date1 -> T1.date
session1 -> T1.session
siss2 -> T2.siss
dass1_1 -> T1.dass_1
datefinal -> T15.date
dass6fup_2 -> T16.dass_2
date_12fup -> T17.date
What is the best way to do this given that the numerical value denoting the time changes and is inconsistent?
Currently, I have the code below, which was provided here:
names(old_sp_wide) <- sub(
  "([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
  sub("final", "15", names(old_sp_wide)),
  ignore.case = TRUE
)
However, this also renames the variables with the cff prefix, and does not work as expected on the variables with the time labels 6fup and 12fup.
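To make the problem concrete, here is a small reproduction of the call above on four names taken from the example data. The vector names `old` and `res` are only for illustration and are not part of my real script.

```r
# Four representative names from the example data
old <- c("cff1", "date_6fup", "dass6fup_2", "dassfinal_1")

# Same two-step substitution as above, applied to the small vector
res <- sub(
  "([a-z]+)(\\d+)(_\\d+)?", "T\\2.\\1\\3",
  sub("final", "15", old),
  ignore.case = TRUE
)

print(res)
# "T1.cff"       cff item number is wrongly treated as a time label
# "date_6fup"    no match, because "_" separates the letters from the digit
# "T6.dassfup_2" only the "6" is moved; "fup" is left behind
# "T15.dass_1"   the 'final' case works as intended
```

So only the 'final' case behaves as I want; the cff and follow-up names need special handling.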
Is there a way to do this with stringr or stringi?
Please see below for a reproducible example.
structure(list(uci = 12345L, dob = structure(1L, .Label = "1988_01_26", class = "factor"),
sex = 2L, sp_episode = 1L, staff = structure(1L, .Label = "aj", class = "factor"),
YP_consent = 1L, date1 = structure(1L, .Label = "2016_10_03", class = "factor"),
session1 = 1L, dass1_1 = 3L, dass1_2 = 0L, dass1_3 = 2L,
siss1 = 1L, diag1 = NA, diag2 = NA, diag3 = NA, pastpsyc = NA,
pastmed = NA, date2 = structure(1L, .Label = "2016_10_15", class = "factor"),
session2 = 3L, dass2_1 = 3L, dass2_2 = 0L, dass2_3 = 2L,
siss2 = NA, datefinal = structure(1L, .Label = "2016_11_12", class = "factor"),
sessionfinal = 8L, dassfinal_1 = 2L, dassfinal_2 = 1L, dassfinal_3 = 2L,
dassfinal_4 = 3L, sissfinal = NA, cff1 = NA, cff2 = NA, cff3 = NA,
date_6fup = structure(1L, .Label = "2014_06_30", class = "factor"),
dass6fup_2 = 3L, dass6fup_3 = 1L, dass6fup_4 = 1L, siss6fup = 2L,
date_12fup = NA), class = "data.frame", row.names = c(NA,
-1L))
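One possible direction, as a minimal sketch with stringr rather than a definitive solution: first normalise the irregular time labels to plain numbers, then move the number to the front, while leaving protected names untouched. The function name `rename_time` and its `protected` argument are hypothetical. The mapping final = 15, 6fup = 16, 12fup = 17 follows the desired output listed earlier. It also assumes that only cff items need protecting; other item-numbered variables (for example diag1, if it is not time-varying) would have to be added to the `protected` pattern.

```r
library(stringr)

# Hypothetical helper: rewrite "<name><time>[_item]" as "T<time>.<name>[_item]".
# Assumptions: final = 15, 6fup = 16, 12fup = 17; names matching `protected`
# carry item numbers rather than time labels and are returned unchanged.
rename_time <- function(x, protected = "^cff\\d+$") {
  keep <- str_detect(x, protected)

  out <- x
  # Normalise the irregular time labels, dropping an optional leading "_"
  out <- str_replace(out, "_?12fup", "17")
  out <- str_replace(out, "_?6fup", "16")
  out <- str_replace(out, "final", "15")

  # Move the time label to the front. The third group always participates
  # (it may be empty), so the back-reference is never undefined.
  out <- str_replace(out, "^([a-z]+)(\\d+)((?:_\\d+)?)$", "T\\2.\\1\\3")

  # Restore the protected names
  out[keep] <- x[keep]
  out
}

old <- c("date1", "session1", "siss2", "dass1_1", "datefinal",
         "dass6fup_2", "date_12fup", "cff1", "uci")
new <- rename_time(old)
print(new)
```

Applied to the real data this would be `names(old_sp_wide) <- rename_time(names(old_sp_wide))`. Names without a trailing time label, such as uci or sp_episode, do not match the anchored pattern and are left as they are.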