-2

I have one column in a CSV of salary data with the following types of data:

  • Nothing in cell
  • "£2.5-£3.5 an hour"
  • "£4.8 an hour"
  • "£40,000-£60,000 a year"
  • £60,000 a year

My objective is to:

  1. Split this df$salary column into as many columns as possible (via delimiter " " )
  2. For the rows where the column 5 is blank (the salary is fixed), take the first value of the split data frame and put that as a new column as a currency, multiplying by a set number based on if the 3rd column is "hour" (multiply by 8*262 to get annual salary) etc.
  3. For the rows where the column 5 is not blank (the salary is a range), take the mid range of numbers of the 2nd and 4th columns of the split data frame, add that to the same new column for salary, and annualise based on what column 5 of the split data frame is (year, month etc.)
  4. Append this new salary column to the old data frame with the full data.

So far I have data_split<- strsplit(as.character(data$salary), split=" ") to split the column I was looking for, but I get this output:

> tail(data_split)

[[1]]
[1] "£26,000" "a"       "year"   

[[2]]
character(0)

[[3]]
[1] "£100" "a"    "day" 

[[4]]
[1] "£16,107" "a"       "year"   

[[5]]
[1] "£15,747" "a"       "year"   

[[6]]
[1] "£9.00"  "-"      "£15.50" "an"     "hour" 

Any idea what to do next? The difficulty vs. other answers I found on stack overflow is that I am manipulating based on an if condition on a column 5 of a data frame. I would have loved if I could access my R materials from this online course but they are blocked!!

Jaap
  • 81,064
  • 34
  • 182
  • 193
Dhruv Ghulati
  • 2,976
  • 3
  • 35
  • 51

1 Answers1

0

I'm guessing you want the values, and the units of time, and an indicator as to whether the values specify a range. This is one method for extracting those data:

valueList <- lapply(data_split,
                 function(x){
                     # isolate the strings starting with £
                     x = grep("^£",x,value=T)
                     # convert to numeric
                     x = as.numeric(substr(x,2,nchar(x)))})

# extract vectors with the minimum and maximum values
minValue <- lapply(valueList,min)
maxValue <- lapply(valueList,max)

# identify values specified by a range
isRange <- minValue == maxValue)

# identify the unit of time
timeUnit <- character(length(data_split))
timeUnit[grepl('year', data)] <- 'year'
timeUnit[grepl('day', data)] <- 'day'
timeUnit[grepl('month', data)] <- 'month'
timeUnit[grepl('hour,% data)] <- 'hour'
Jthorpe
  • 9,756
  • 2
  • 49
  • 64
  • Thank you! How can I ignore the blanks? I am creating a new list with `lapply` right? At some point, I need to get a column with the same rows as `data_split`, not deleting blank rows, and add it back as the new **annualised salary** column. Can I ignore n/a? I got this error when running valueList `Warning messages: 1: In FUN(X[[1000L]], ...) : NAs introduced by coercion` and for the minValue part got `1: In FUN(X[[1000L]], ...) : no non-missing arguments to max; returning -Inf` – Dhruv Ghulati Feb 15 '15 at 22:48
  • regarding the first warning (`NAs introduced by coercion`) this is likely because there were some invalid strings identified by `grep("^£",...)`. You can inspect those by looking at `data_split[is.na(valueList)]`. Regarding the second error (`no non-missing arguments to max`), this is caused by some elements in 'data_split' don't have a matching string you can inspect that one similarly. Together these cause missing values and/or infinite values, but the length should be fine. You probably want to set the infinite values to missing via `minValue[!is.finite(minValue)]<- NA` – Jthorpe Feb 16 '15 at 03:50
  • When I used your debug, all the values within the new valueList found (double [[ ) are actually values that clearly contain `£`. For example: `data_split[is.na(valueList)] [[1]] [1] "£19,805" "a" "year" [[2]] [1] "£23,000" "a" "year" [[3]] [1] "£18,000" "a" "year" ` .\ Strangely , I get this error with your second piece of code: `minValue <- lapply(valueList,min) There were 50 or more warnings (use warnings() to see the first 50) > minValue[!is.finite(minValue)]<- NA Error in is.finite(minValue) : default method not implemented for type 'list'` – Dhruv Ghulati Feb 16 '15 at 23:08