-1

I am trying to split the first column range into two separate columns as so

enter image description here

I have tried the following code:

testDF$IncomeLowerRange <- strsplit(gsub("[^-]+-", "", testDF$IncomeRange), ";")[1]
testDF$IncomeUpperRange <- strsplit(gsub("[^-]+-", "", testDF$IncomeRange), ";")[2]
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
CRich
  • 118
  • 1
  • 1
  • 7
  • 1
    What is the logic by which `$100,000+` ends up having an upper range of `250,000` ? – Tim Biegeleisen Jun 12 '17 at 02:52
  • No specific logic, there is just no range assigned to anyone with a salary of $100,000+. – CRich Jun 12 '17 at 03:03
  • Paste the input as text please. `dput(head(testDF))` – zx8754 Jun 12 '17 at 06:43
  • Maybe it would be easier to relevel the first factor. `testDF$IncomeLowerRange<-factor(testDF$IncomeRange,labels =c(25000,50000,....))` – Niek Jun 12 '17 at 06:57
  • @OmegaSupreme: Glad it worked for you. Please also consider upvoting if my answer proved helpful to you (see [How to upvote on Stack Overflow?](http://meta.stackexchange.com/questions/173399/how-to-upvote-on-stack-overflow)). – Wiktor Stribiżew Jun 20 '17 at 06:45

2 Answers2

1

I do not think there is a one-liner for this as you need to extract some of the data, replace fully or partially in other cases.

I suggest using a simple logic for this:

> test <- c("$25,000-49,000","Not displayed", "$100,000+")
> df <- data.frame(test)
> df$col1 <- sub("^\\$?([0-9,]+).*", "\\1", df$test)
> df$col1[df$col1=="Not displayed"] <- "NA"
> df$col2 <- sub("^[^-]+-(.*)", "\\1", df$test)
> df$col2[df$col2=="Not displayed"] <- "NA"
> df$col2[df$col2=="$100,000+"] <- "250,000"
> 
> df
            test    col1    col2
1 $25,000-49,000  25,000  49,000
2  Not displayed      NA      NA
3      $100,000+ 100,000 250,000

where col1 will be your IncomeLowerRange and col2 will be your IncomeUpperRange.

The ^\\$?([0-9,]+).* regex is used to extract the first number from a range, and also 100,000 into the first column (^ matches the start of string, \$? matches 1 or 0 $ symbols, ([0-9,]+) captures digits and commas into Group 1 that is kept, and the rest (also, all after the number - .*) is removed.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
0

Assuming that the number of levels is limited (as in your example) a very easy solution might be the following

IncomeRange <- c("$25,000-49,000","$50,000-74,999",
          "Not displayed","$25,000-49,000", "$100,000+",
          "$100,000+","$75,000-99,999")
df <- data.frame(IncomeRange)

# IncomeRange only has 5 levels
levels(df$IncomeRange)
[1] "$100,000+"      "$25,000-49,000" "$50,000-74,999"
[4] "$75,000-99,999" "Not displayed" 

df$IncomeLowerRange<-factor(df$IncomeRange,
                                labels =c(100000,25000,50000,75000,NA))
df$IncomeUpperRange<-factor(df$IncomeRange,
                            labels =c(250000,49000,74999,99999,NA))

Output

     IncomeRange IncomeLowerRange IncomeUpperRange
1 $25,000-49,000            25000            49000
2 $50,000-74,999            50000            74999
3  Not displayed             <NA>             <NA>
4 $25,000-49,000            25000            49000
5      $100,000+            1e+05           250000
6      $100,000+            1e+05           250000
7 $75,000-99,999            75000            99999
Niek
  • 1,594
  • 10
  • 20