0

I need to scale the "Age" attribute of a data set, which is stored as text in the format shown below. How do I scale a text-based variable in R?

age_upon_outcome
2 weeks
1 month
3 months
1 year
3 weeks
2 months
8 months
Has QUIT--Anony-Mousse
Anamika Chavan
  • Can you be a bit more clear as to what the format of the age is? And before any scaling you are going to have to convert the text field to a numeric one. – A. K. Jul 31 '19 at 17:32
  • @A.K. Age is in the format of 2 week, 1 month, 3 month, 1 year – Anamika Chavan Jul 31 '19 at 17:35

2 Answers

1

The usual approach to text data is to convert it into a numeric format first.

In your case the values are expressed in weeks, months, or years, so one option is to convert everything to a single unit, either days or weeks.

If you go by days, you would get (assuming a week has 7 days and a month has 30 days):

14, 30, 90, .... 

If you go by weeks, you would get (assuming a month has 4 weeks and a year has 52 weeks):

2, 4, 12, ... 
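The conversion to days described above could be sketched in base R roughly as follows. The helper name `age_to_days` and the lookup table `days_per_unit` are made up for illustration, and the 7/30/365 day counts are the same approximations as above, so treat this as a starting point rather than a definitive implementation.

```r
# Hypothetical helper: convert strings such as "2 weeks" or "1 year" to days.
# Assumes 7 days per week, 30 days per month, 365 days per year.
age_to_days <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  # First token is the count, second token is the unit
  n    <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit <- vapply(parts, `[`, character(1), 2)
  # Normalise "Weeks"/"weeks"/"week" to "week" so singular and plural both match
  unit <- sub("s$", "", tolower(unit))
  days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)
  # Unknown units give NA rather than a silent wrong value
  unname(n * days_per_unit[unit])
}

ages <- c("2 weeks", "1 month", "3 months", "1 year",
          "3 weeks", "2 months", "8 months")
age_days <- age_to_days(ages)
age_days  # 14 30 90 365 21 60 240
```

Any value with an unrecognised unit comes back as `NA`, which makes it easy to spot formats you have not handled yet.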

Now that you have them in numbers, it should be easy to scale them, for example, the popular MinMaxScaling:

MinMaxScaleFeature <- function(x)
{
    return((x - min(x)) /(max(x) - min(x)))
}

This is what a typical scaling function looks like.
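As a quick usage sketch, here is the same min-max formula applied to the ages after converting them to days. The function is repeated under the made-up name `min_max_scale` so the snippet runs on its own, and the day values assume 7 days per week, 30 per month and 365 per year.

```r
# Same min-max formula as above, repeated so this snippet is self-contained.
min_max_scale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Ages from the question, already converted to days (assumed conversion factors)
age_days <- c(14, 30, 90, 365, 21, 60, 240)

scaled <- min_max_scale(age_days)
range(scaled)  # 0 1: the youngest age maps to 0, the oldest to 1
```

Note that if every value in `x` is identical, the denominator is zero and the result is `NaN`, so you may want to guard against that case.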


You can also use other scaling methods such as standard or robust scaling. This article gives an overview (its examples are in Python, but the ideas carry over to R): https://medium.com/@ian.dzindo01/feature-scaling-in-python-a59cc72147c1
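For reference, a minimal R sketch of those two alternatives, assuming the ages have already been converted to days. Standard scaling uses base R's `scale()`. For robust scaling I am assuming the common definition of centring on the median and dividing by the interquartile range; other definitions exist.

```r
# Ages in days (assuming 7 days per week, 30 per month, 365 per year)
age_days <- c(14, 30, 90, 365, 21, 60, 240)

# Standard scaling (z-score): subtract the mean, divide by the standard deviation.
# scale() returns a one-column matrix, so as.numeric() turns it back into a vector.
standard_scaled <- as.numeric(scale(age_days))

# Robust scaling (assumed definition): subtract the median, divide by the IQR.
# This is less sensitive to outliers such as the single 365-day value.
robust_scaled <- (age_days - median(age_days)) / IQR(age_days)
```

After standard scaling the values have mean 0 and standard deviation 1; after robust scaling the median value maps to 0.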
Ankur Sinha
1
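library(dplyr)
library(tidyr)

# Sample data: one age string per line
age_upon_outcome <- 
'2 weeks
1 month
3 months
1 year
3 weeks
2 months
8 months'

# Split the block into a character vector, one element per age
age_upon_outcome <- strsplit(age_upon_outcome, '\n') %>% unlist()

my_df <- tibble(age_upon_outcome = age_upon_outcome)

my_df %>%
  # Split "2 weeks" into a count ("2") and a unit ("weeks")
  separate(age_upon_outcome, into = c('age', 'unit'), sep = ' ') %>%
  # Days per unit, assuming 30-day months and 365-day years;
  # singular and plural forms are both accepted, anything else becomes NA
  mutate(unit_in_days = case_when(unit %in% c('week', 'weeks')   ~ 7,
                                  unit %in% c('month', 'months') ~ 30,
                                  unit %in% c('year', 'years')   ~ 365)) %>%
  # Convert the age to days, then standardise it (z-score)
  mutate(age = as.numeric(age) * unit_in_days) %>%
  mutate(scaled_age = (age - mean(age)) / sd(age))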

The output:

    age unit   unit_in_days scaled_age
  <dbl> <chr>         <dbl>      <dbl>
1    14 weeks             7     -0.769
2    30 month            30     -0.650
3    90 months           30     -0.202
4   365 year            365      1.85 
5    21 weeks             7     -0.717
6    60 months           30     -0.426
7   240 months           30      0.916
Mouad_Seridi