I am trying to extract some variable names and numbers from the following vector and store them into two new variables:
unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg",
"PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg",
"PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg",
"PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg",
"PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg",
"PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg",
"PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg",
"PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)
I would like to extract every character before the PMS for the first variable. This includes the strings that begin with PM or PNC, as well as the underscores and digits. I would like to store these results in a variable called pollutant.
Desired output:
unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"
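To make the goal concrete, here is a minimal sketch of the kind of extraction I have in mind, using base R's sub() to drop "_PMS" and everything after it. The vector sample_strings is just a short illustrative subset of unique_strings, and the pattern is my own guess rather than something I know to be the right approach. Note that this keeps the full prefix, so the last case comes out as "PNC_10_0" rather than the "PNC_10" shown above.

```r
# Short illustrative subset of unique_strings (name is hypothetical)
sample_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg",
                    "PNC_0_3_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

# Remove "_PMS" and everything that follows it,
# leaving only the leading pollutant part of each string
pollutant <- sub("_PMS.*$", "", sample_strings)

unique(pollutant)
# [1] "PM_1"     "PM_2_5"   "PNC_0_3"  "PNC_10_0"
```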
I would like to extract everything after the PMS for the second variable.
For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string; however, it would be useful to include the A_Avg or S_Avg in the extraction as well.
Here's my first attempt:
library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")
unique(model_id)
[1] "5003" "7003"
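What I think I want instead is something along these lines: a rough sketch, assuming that "everything after the PMS" can be obtained by deleting the text up to and including "PMS". It uses base R's sub() rather than str_extract(), and sample_strings and model_num are names I made up for illustration.

```r
# Short illustrative subset of unique_strings (name is hypothetical)
sample_strings <- c("PM_1_PMS5003_S_Avg", "PM_10_PMS5003_A_Avg",
                    "PNC_0_3_PMS7003_Avg")

# Delete everything up to and including "PMS",
# keeping the model number together with its suffix
model_id <- sub("^.*PMS", "", sample_strings)
model_id
# [1] "5003_S_Avg" "5003_A_Avg" "7003_Avg"

# If only the model number is needed, drop the first underscore and what follows
model_num <- sub("_.*$", "", model_id)
model_num
# [1] "5003" "5003" "7003"
```

This relies on "PMS" appearing only once in each string, which holds for the vector above.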
I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!