I have a set of data including many height measurements as character variables. Some are written as "5ft 7", some are "170cm", some are "1.7m" and some are simply "170".

I would like to change them so that they all are displayed as a numeric variable with no unit of measurement (just 170, for example).

Vinícius Félix
  • it's easy to just remove non-numeric characters with `as.numeric(gsub("\\D","",x))`. But how would you present "5ft 7" as numeric only? Using gsub, that will give you 57... – iod Oct 06 '21 at 12:23
  • As you're going to want these in one measure, perhaps look at [measurements](https://stackoverflow.com/questions/59052801/how-to-convert-feet-to-cm-in-r). – Chris Oct 06 '21 at 13:18

1 Answer

Data wrangling is marvelous fun, involving a fair bit of stumbling around and tweaking for edge cases:

heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
heights
[1] "5ft 7"  "170cm"  "1.7m"   "6' 7"   "150"    "5' 2\"" "5ft8"

But it gives an opportunity to explore many tools. To move to a uniform measure, say centimeters, first index the notations we've got:

b4meas <-gsub('[0-9\\. ]', '', heights)
b4meas  
[1] "ft"  "cm"  "m"   "'"   ""    "'\"" "ft"

The pattern `[0-9\\. ]` passed to `gsub` matches digits, dots, and spaces; replacing them with an empty string leaves only the unit notation. We'll probably want to index these different cases for conversion:

which(b4meas== 'ft')
[1] 1 7
which(b4meas== '')
[1] 5
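
Rather than querying one notation at a time, the positions can also be grouped by notation in a single step. This is a minimal sketch using base R's `split`; `unit_label` and `idx_by_unit` are names made up for this illustration, and `"(none)"` is an arbitrary stand-in for the empty string, which cannot be used to look up a list element by name:

```r
heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
b4meas <- gsub('[0-9\\. ]', '', heights)

# Give the unit-less case a usable label ("(none)" is an arbitrary choice)
unit_label <- ifelse(b4meas == "", "(none)", b4meas)

# One list element per notation, holding the positions where it occurs
idx_by_unit <- split(seq_along(heights), unit_label)

idx_by_unit[["ft"]]      # positions 1 and 7
idx_by_unit[["(none)"]]  # position 5
```

Each element of `idx_by_unit` can then be used directly as an index into `heights`.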

And then exploring the numbers:

char_num <- gsub('[a-z\']','', heights, perl=TRUE)
char_num
[1] "5 7"   "170"   "1.7"   "6 7"   "150"   "5 2\"" "58"
which(nchar(char_num) == 2 & b4meas == 'ft')
[1] 7
which(nchar(char_num) == 3 & b4meas == 'ft')
[1] 1
which(nchar(char_num) == 3 & b4meas == "'")
[1] 4
which(b4meas == "'\"")
[1] 6

So our heterogeneous foot notations can each be turned into an index as well. And then there are the cm-based measures that don't need conversion (treating a bare number as centimeters):

which(b4meas == 'cm' | b4meas == '')
[1] 2 5

So, let's see what we got going here:

split_char <- strsplit(char_num, ' ')
split_char
[[1]]
[1] "5" "7"

[[2]]
[1] "170"

[[3]]
[1] "1.7"

[[4]]
[1] "6" "7"

[[5]]
[1] "150"

[[6]]
[1] "5"   "2\""

[[7]]
[1] "58"

So [[2]] and [[5]] can be left alone or written directly to another column without conversion; [[3]] needs multiplying by 100; [[1]] and [[4]] can be calculated as they are; [[6]] needs further cleaning; and [[7]] needs additional splitting.
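
The same triage can be read off programmatically by counting the pieces in each element. A small sketch, assuming R 3.2.0 or later for `lengths`; `n_parts` is a name introduced here for illustration:

```r
heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
char_num <- gsub("[a-z']", "", heights)
split_char <- strsplit(char_num, " ")

# Number of pieces per element: 2 means a separate feet and inches part
n_parts <- lengths(split_char)

which(n_parts == 2)  # elements 1, 4 and 6
```

Element 7 ("58") still shows up as a single piece here, which is why it needs the extra split.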

# for [[1]]: feet * 12 * 2.54 plus inches * 2.54
sum(as.numeric(split_char[[1]][1]) * 12 * 2.54, as.numeric(split_char[[1]][2]) * 2.54)
[1] 170.18
# for [[6]]: strip the trailing inch mark before converting
sum(as.numeric(split_char[[6]][1]) * 12 * 2.54, as.numeric(gsub('\\"', '', split_char[[6]][2])) * 2.54)
[1] 157.48
# as.numeric() has to wrap gsub() before the multiplication, to avoid
# Error in gsub( non-numeric argument to binary operator
# for [[7]]: split "58" into single characters first
sum(as.numeric(strsplit(split_char[[7]], '')[[1]][1]) * 12 * 2.54, as.numeric(strsplit(split_char[[7]], '')[[1]][2]) * 2.54)
[1] 172.72
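
All three calculations repeat the same arithmetic, so it can be wrapped in a small helper. This is only a sketch; `ft_in_to_cm` is a hypothetical name, not something from a package:

```r
# Convert feet and inches to centimeters (1 ft = 12 in, 1 in = 2.54 cm)
ft_in_to_cm <- function(feet, inches = 0) {
  (feet * 12 + inches) * 2.54
}

ft_in_to_cm(5, 7)               # the "5ft 7" case
ft_in_to_cm(c(5, 6), c(7, 7))   # vectorized over several heights
```

Because the arithmetic is vectorized, the helper works on whole columns as well as single values.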

OK, we can convert, but wait, we've got a data.frame! So we'll use our indices and conversions to do it... one hopes...

physio_df <- data.frame(heights)
physio_df[['heights_cm']] <- NA_real_ # add column to convert to
physio_df
  heights heights_cm
1   5ft 7         NA
2   170cm         NA
3    1.7m         NA
4    6' 7         NA
5     150         NA
6   5' 2"         NA
7    5ft8         NA

It's a miracle: some of our cases look simplified just by moving to a data.frame (row 6 prints as 5' 2" with no backslash). But that would also mean it is useful to recalculate b4meas to reflect this (if your data are already in a data.frame, you don't need to do this).

# [[5]]: no unit, so just convert to numeric
physio_df$heights_cm[b4meas == ''] <- as.numeric(char_num[b4meas == ''])
# [[3]]: meters to centimeters
physio_df$heights_cm[b4meas == 'm'] <- as.numeric(char_num[b4meas == 'm']) * 100
b4meas2 <- gsub('[0-9\\. ]', '', physio_df$heights)
b4meas2
[1] "ft"  "cm"  "m"   "'"   ""    "'\"" "ft"
physio_df$heights[[6]]
[1] "5' 2\""

Oh, so it wasn't actually a miracle: the data.frame only prints the string without its escape, and b4meas is still a valid index. The great thing about indices is that if multiple cases fit the criterion, all of them can be addressed at once.

# let's make an index for [[1]] & [[4]] but not [[6]]
one_four_type <- setdiff(which(sapply(split_char, function(x) length(x) == 2)), which(b4meas == "'\""))
# and use it in a `for` loop (should be `sapply`, but data has killed my brain)
for (i in seq_along(one_four_type)) {
  physio_df$heights_cm[one_four_type[i]] <-
    sum(as.numeric(split_char[[one_four_type[i]]][1]) * 12 * 2.54,
        as.numeric(split_char[[one_four_type[i]]][2]) * 2.54)
}
physio_df
  heights heights_cm
1   5ft 7     170.18
2   170cm         NA
3    1.7m     170.00
4    6' 7     200.66
5     150     150.00
6   5' 2"         NA
7    5ft8         NA
# physio_df$heights_cm[2]
physio_df$heights_cm[which(b4meas == 'cm')] <- as.numeric(char_num[b4meas == 'cm'])
# physio_df$heights_cm[6]
physio_df$heights_cm[which(b4meas == "'\"")] <-
  sum(as.numeric(split_char[[6]][1]) * 12 * 2.54, as.numeric(gsub('\\"', '', split_char[[6]][2])) * 2.54)
# physio_df$heights_cm[7]
physio_df$heights_cm[7] <- sum(as.numeric(strsplit(split_char[[7]], '')[[1]][1]) * 12 * 2.54, as.numeric(strsplit(split_char[[7]], '')[[1]][2]) * 2.54)
physio_df
  heights heights_cm
1   5ft 7     170.18
2   170cm     170.00
3    1.7m     170.00
4    6' 7     200.66
5     150     150.00
6   5' 2"     157.48
7    5ft8     172.72
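
Once the cases are understood, the whole walk-through can be folded into one function. The following is a sketch under stated assumptions rather than a general parser: `to_cm` is a hypothetical name, only the notations seen above are handled, a value with no unit is assumed to already be in centimeters, and anything unrecognized becomes `NA`.

```r
to_cm <- function(x) {
  # Every run of digits (with an optional decimal part) in each element
  nums <- regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))
  # What is left after dropping digits, dots and spaces is the notation
  unit <- gsub("[0-9. ]", "", x)
  vapply(seq_along(x), function(i) {
    n <- as.numeric(nums[[i]])
    if (is.na(unit[i])) return(NA_real_)
    if (unit[i] %in% c("ft", "'", "'\"")) {
      # Feet plus optional inches; "5ft8" yields the two numbers 5 and 8
      inches <- if (length(n) >= 2) n[2] else 0
      (n[1] * 12 + inches) * 2.54
    } else if (unit[i] == "m") {
      n[1] * 100
    } else if (unit[i] %in% c("cm", "")) {
      n[1]  # assumption: unit-less values are already centimeters
    } else {
      NA_real_  # notation not covered by this sketch
    }
  }, numeric(1))
}

heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
heights_cm <- to_cm(heights)
heights_cm
```

Extracting the numbers from the original strings, rather than from `char_num`, means "5ft8" splits into 5 and 8 without a separate step. A unit-less height written in meters (a bare "1.7") would be misread as centimeters, so real data may need a range check afterwards.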
Dharman
Chris