-2

I want to add a new column to a data.frame based on the last entry in a string so that I can do regression analysis. In particular, I am trying to extract information from the Fertiliser variable which contains information on the amount of three different types of fertiliser, separated by hyphens. For instance: 0-0-0 or 30-10-2700 are viable. The last portion of the string is what I need to capture e.g. x-y-z I need z.

I tried but was not able to adapt the answer here How to create new column in dataframe based on partial string matching other column in R to this case.

More complete example of the data:

    Fertiliser millet_biomass millet_yield
 1:      0-0-0       2659.608     710.6942
 2:    0-0-100       2701.044     718.1154
 3:   0-0-2700       3415.879     804.0360
 4:    0-0-300       2781.639     730.5943
 5:    0-0-900       2997.173     760.0136
 6:     12-4-0       3703.255     772.1719
 7:   12-4-100       3720.247     773.1759
 8:  12-4-2700       3950.189     788.6133
 9:   12-4-300       3751.400     775.1368
10:   12-4-900       3826.693     780.2623
11:    30-10-0       4180.323     798.2134
12:  30-10-100       4184.229     798.4918
13: 30-10-2700       4217.044     800.9312
14:  30-10-300       4187.014     798.6570
15:  30-10-900       4194.873     799.2085
16:      6-2-0       3296.274     765.8496
17:    6-2-100       3326.844     767.6693
18:   6-2-2700       3772.058     785.4535
19:    6-2-300       3381.152     760.7330
20:    6-2-900       3517.515     768.3018
21:    90-30-0       4542.924     831.2832
22:  90-30-100       4543.036     831.3983
23: 90-30-2700       4545.037     831.3227
24:  90-30-300       4543.240     831.3921
25:  90-30-900       4543.733     831.3727

Thus, there are five patterns -0$, -100$, -300$, -900$, 2700$, which need to be replaced by 0, 100, 300, 900, 2700

oguz ismail
  • 1
  • 16
  • 47
  • 69
M. Jimenez
  • 11
  • 1
  • 4

3 Answers3

3

Is this what you want to do? Let's take a snippet of data:

Fertiliser <- c("0-0-0", "0-0-100", "0-0-2700", "0-0-300")
millet_yield <- c(710, 718, 804, 730)
df <- data.frame(Fertiliser, millet_yield)

df looks like this:

   Fertiliser millet_yield
#1      0-0-0          710
#2    0-0-100          718
#3   0-0-2700          804
#4    0-0-300          730

Using separate() from the tidyr package:

library(tidyr)
df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T)

   F1 F2 Manure millet_yield
#1  0  0      0          710
#2  0  0    100          718
#3  0  0   2700          804
#4  0  0    300          730

convert=T makes sure the character strings become numeric. Now you can run a regression on your data.

Joe
  • 8,073
  • 1
  • 52
  • 58
  • Yes, this is exactly what I need. But the columns are not added to the data.frame. How can I do it? I am not used to the operator %>% – M. Jimenez Apr 18 '16 at 17:56
  • Oh, you mean you wanted to keep the original column? Then it's `df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T, remove=F)`. All other columns will be kept. The operator `%>%` is just for chaining functions together in `tidyr` and `dplyr`. – Joe Apr 18 '16 at 18:21
  • No no, the original column is not necessary. The result shows perfect in the console, but not in the data.frame (Environment->Data). It still displays the same variables as before running the line :S – M. Jimenez Apr 18 '16 at 21:34
  • `df2 <- df %>% separate(Fertiliser, into=(c("F1", "F2", "Manure")), sep="-", convert=T)` – Joe Apr 18 '16 at 21:51
0

We can extract the last numbers with stri_extract_last from stringi. As the dataset is a already a data.table, can use the data.table methods to assign (:=) a new column.

library(data.table)
library(stringi)
setDT(df1)[, Manure := as.numeric(stri_extract_last_regex(Fertiliser, "\\d+"))]
head(df1)
#   Fertiliser millet_biomass millet_yield Manure
#1:      0-0-0       2659.608     710.6942      0
#2:    0-0-100       2701.044     718.1154    100
#3:   0-0-2700       3415.879     804.0360   2700
#4:    0-0-300       2781.639     730.5943    300
#5:    0-0-900       2997.173     760.0136    900
#6:     12-4-0       3703.255     772.1719      0

Or a base R option is

df1$Manure <- as.numeric(sub(".*-", "", df1$Fertiliser))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • It works. I really like this style. Can we also use stringi to create a new column with only certain words within a character? i.e, from a column with values "Millet high SOC"; "Millet medium SOC" and "Millet low SOC" I want to create a new column which either contain "high", medium" or "low" – M. Jimenez Apr 18 '16 at 18:41
  • @M.Jimenez You could use `setDT(df1)[, NewCol := stri_extract(Col, regex="(?<=\\s)\\w+")]` – akrun Apr 19 '16 at 02:25
  • 1
    Concise and effective. Thank you – M. Jimenez Apr 19 '16 at 11:23
  • Oh thank you. I didn't know I could choose only one. Indeed all have being good and I've tried to vote for all :-). I've used your solution in combination with another one. – M. Jimenez Apr 21 '16 at 10:58
0

You can easily do this with sub, removing everything up to and including the last hyphen character:

transform(x, Fertiliser = sub('.*-', '', Fertiliser))
##     Fertiliser millet_biomass millet_yield
## 1:           0       2659.608     710.6942
## 2:         100       2701.044     718.1154
## 3:        2700       3415.879     804.0360
## 4:         300       2781.639     730.5943

...

Here, the .* is greedy, so it matches as much as possible before matching the final - character.

You can also rename the resulting column, rather than replacing Fertiliser:

 transform(x, Quantity = sub('.*-', '', Fertiliser))
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112