I need to scale the "Age" attribute of a data set, which is in the following format. How can I scale a text-based variable in R?
age_upon_outcome
2 weeks
1 month
3 months
1 year
3 weeks
2 months
8 months
The usual way to handle text data like this is to convert it into a numerical format.
In your case, since the values are expressed in weeks, months or years, one option is to convert everything to a single unit, either weeks or days.
If you go by days, you would typically have (taking a week as 7 days and a month as 30 days):
14, 30, 90, ....
If you go by weeks, you would typically have (taking a month as roughly 4 weeks and a year as 52 weeks):
2, 4, 12, ...
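The conversion to days described above could be sketched in base R roughly as follows. The helper name age_to_days is hypothetical, and the 7/30/365 day counts are the same approximations used above; strings with an unrecognised unit come back as NA.

```r
# Hypothetical helper: convert strings such as "2 weeks" into a number of days.
# Assumes 7 days per week, 30 days per month and 365 days per year.
age_to_days <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  # First token is the count, second token is the unit
  n <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit <- tolower(vapply(parts, `[`, character(1), 2))
  # Drop a trailing "s" so that "month" and "months" map to the same entry
  unit <- sub("s$", "", unit)
  days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)
  unname(n * days_per_unit[unit])
}

age_upon_outcome <- c("2 weeks", "1 month", "3 months", "1 year",
                      "3 weeks", "2 months", "8 months")
age_in_days <- age_to_days(age_upon_outcome)
print(age_in_days)
# 14 30 90 365 21 60 240
```

A named lookup vector keeps the unit assumptions in one place, so switching to weeks only means changing the values in days_per_unit.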
Once the values are numeric, scaling them is straightforward, for example with the popular min-max scaling:
MinMaxScaleFeature <- function(x)
{
  # Maps the smallest value to 0 and the largest to 1
  return((x - min(x)) / (max(x) - min(x)))
}
This is what a typical scaling function looks like.
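As a quick check, here is the function applied to the day counts from the question. The function is repeated so the snippet runs on its own, and the day values assume the 7/30/365 conversion mentioned earlier.

```r
# Min-max scaling: the smallest value maps to 0, the largest to 1.
# Note: this returns NaN if all values are equal (max(x) == min(x)).
MinMaxScaleFeature <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Ages from the question, already converted to days (assumed conversion)
age_in_days <- c(14, 30, 90, 365, 21, 60, 240)

scaled_age <- MinMaxScaleFeature(age_in_days)
print(round(scaled_age, 3))
# approximately: 0.000 0.046 0.217 1.000 0.020 0.131 0.644
```

If the column may contain missing values, passing na.rm = TRUE to min() and max() is one way to keep the result from becoming all NA.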
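library(dplyr)
library(tidyr)

# Example data: one age string per line
age_upon_outcome <-
'2 weeks
1 month
3 months
1 year
3 weeks
2 months
8 months'
age_upon_outcome <- strsplit(age_upon_outcome, '\n') %>% unlist
my_df <- as.data.frame(age_upon_outcome, stringsAsFactors = FALSE) %>% as_tibble()

my_df %>%
  # Split "2 weeks" into a count ("2") and a unit ("weeks")
  separate(age_upon_outcome, into = c('age', 'unit'), sep = ' ') %>%
  # Days per unit, assuming 30-day months and 365-day years
  mutate(unit_in_days = case_when(unit %in% c('week', 'weeks') ~ 7,
                                  unit %in% c('month', 'months') ~ 30,
                                  unit %in% c('year', 'years') ~ 365)) %>%
  # Convert every age to days
  mutate(age = as.numeric(age) * unit_in_days) %>%
  # Standardize (z-score): subtract the mean, divide by the standard deviation
  mutate(scaled_age = (age - mean(age)) / sd(age))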
The output:
    age unit   unit_in_days scaled_age
  <dbl> <chr>         <dbl>      <dbl>
1    14 weeks             7     -0.769
2    30 month            30     -0.650
3    90 months           30     -0.202
4   365 year            365      1.85
5    21 weeks             7     -0.717
6    60 months           30     -0.426
7   240 months           30      0.916