Separate a string by a number

Question

I am trying to separate my column VEHICLE_TYPE by Model and Engine. The code can be a normal SQL or R code.

My data looks like this:

       MODEL           VEHICLE_TYPE
77        Bora               Bora 1.6
79      Ducato Ducato 15 120 Multijet
80      Ducato Ducato 15 120 Multijet
87       Astra         Astra 1.7 CDTI
88         406            406 2.0 HDi
89         406            406 2.0 HDi
90 Focus C-MAX   Focus C-MAX 1.6 TDCi
91 Focus C-MAX   Focus C-MAX 1.6 TDCi
92 Focus C-MAX   Focus C-MAX 1.6 TDCi
93 Focus C-MAX   Focus C-MAX 1.6 TDCi
94 Focus C-MAX   Focus C-MAX 1.6 TDCi
97    S-Klasse              S 320 CDI
98    S-Klasse              S 320 CDI
99    S-Klasse              S 320 CDI

I would like to receive something like this:

MODEL         VEHICLE TYPE
Bora          1.6
Ducato 15     120 Multijet
...           ...
Focus C-Max   1.6 TDCi

The problem is, that VEHICLE_TYPE can have different lengths and different number of spaces, which I could separate by.

I tried it with gsub and regex, didn't work, but strsplit worked. Far away from what I actually wanted, I ran out of ideas and need some help now.

> strsplit(as.character(test$VEHICLE_TYPE)," ")

[[1]]
[1] "Bora" "1.6"

[[2]]
[1] "Ducato"   "15"       "120"      "Multijet"

[[3]]
[1] "Ducato"   "15"       "120"      "Multijet"

[[4]]
[1] "Astra" "1.7"   "CDTI" 

[[5]]
[1] "406" "2.0" "HDi"

[[6]]
[1] "406" "2.0" "HDi"

[[7]]
[1] "Focus" "C-MAX" "1.6"   "TDCi" 

[[8]]
[1] "Focus" "C-MAX" "1.6"   "TDCi" 

[[9]]
[1] "Focus" "C-MAX" "1.6"   "TDCi" 

[[10]]
[1] "Focus" "C-MAX" "1.6"   "TDCi" 

[[11]]
[1] "Focus" "C-MAX" "1.6"   "TDCi" 

[[12]]
[1] "S"   "320" "CDI"

It isn't clear how you want to modify the names. Can you update your question to clearly show how the strings are being transformed? — Tim Biegeleisen, Jan 26 '17 at 04:49
So what's the `model` for Ducato - is it just Ducato or is it Ducato 15? Your "required output" is inconsistent with the input. Or is that part of the assignment - to override the `model` column when there is a mismatch like that? (In which case, why do you/we even need the `model` column in the first place?) — , Jan 26 '17 at 05:31

rosscova · Answer 1 · 2017-01-26T09:49:21.973

I'm guessing someone will know a way to do this with regex in a simpler way than this, but since I'm a regex dunce, here's my attempt. Split by space, then collapse everything from and after the first "numeric" value.

library( magrittr )
df[['VEHICLE_TYPE']] %<>%
    strsplit( " " ) %>%
    sapply( function(x) paste(
        x[ grep( "[[:digit:]]", x )[1] : length(x) ],
        collapse = " " )
    )

Result

> df
# # A tibble: 14 × 2
# MODEL    VEHICLE_TYPE
# <chr>           <chr>
# 1         Bora             1.6
# 2       Ducato 15 120 Multijet
# 3       Ducato 15 120 Multijet
# 4        Astra        1.7 CDTI
# 5          406     406 2.0 HDi
# 6          406     406 2.0 HDi
# 7  Focus C-MAX        1.6 TDCi
# 8  Focus C-MAX        1.6 TDCi
# 9  Focus C-MAX        1.6 TDCi
# 10 Focus C-MAX        1.6 TDCi
# 11 Focus C-MAX        1.6 TDCi
# 12    S-Klasse         320 CDI
# 13    S-Klasse         320 CDI
# 14    S-Klasse         320 CDI

Or, if you prefer to split by the last numeric value, instead of the first:

df[['VEHICLE_TYPE']] %<>%
    strsplit( " " ) %>%
    sapply( function(x) paste(
        x[ tail( grep( "[[:digit:]]", x ), 1 ) : length(x) ],
        collapse = " " )
    )

> df
# # A tibble: 14 × 2
# MODEL VEHICLE_TYPE
# <chr>        <chr>
# 1         Bora          1.6
# 2       Ducato 120 Multijet
# 3       Ducato 120 Multijet
# 4        Astra     1.7 CDTI
# 5          406      2.0 HDi
# 6          406      2.0 HDi
# 7  Focus C-MAX     1.6 TDCi
# 8  Focus C-MAX     1.6 TDCi
# 9  Focus C-MAX     1.6 TDCi
# 10 Focus C-MAX     1.6 TDCi
# 11 Focus C-MAX     1.6 TDCi
# 12    S-Klasse      320 CDI
# 13    S-Klasse      320 CDI
# 14    S-Klasse      320 CDI

EDIT: If you have some rows without any numeric values, you might need a little extra tinkering:

df[['VEHICLE_TYPE']] %<>%
    strsplit( " " ) %>%
    sapply( function(x) paste(
        if( length( grep( "[[:digit:]]", x ) ) > 1L ) {
            x[ tail( grep( "[[:digit:]]", x ), 1 ) : length(x) ]
        } else { x },
        collapse = " " )
    )

I tried your code, but then this error code appears: Error in tail(grep("[[:digit:]]", x), 1):length(x) : argument of length 0 — rayray, Jan 26 '17 at 08:41
That will happen if there are rows without any numeric values. In the subset you've provided, there are no examples of that, are there some rows like that in your complete data set? — rosscova, Jan 26 '17 at 09:21
The data set I have contains about 4 million rows, so I am not able to find those rows without any values :/ — rayray, Jan 26 '17 at 09:35
See my edit. I just tested it with a small subset, including a row without numeric values, and it works. — rosscova, Jan 26 '17 at 09:38

score 0 · Answer 2 · answered Jan 26 '17 at 06:26

Example with regex

with s(id,model,type) as (
select 77,'Bora','Bora 1.6' from dual union all
select 79,'Ducato','Ducato 15 120 Multijet' from dual union all
select 80 ,'Ducato','Ducato 15 120 Multijet' from dual union all
select 87 ,'Astra','Astra 1.7 CDTI' from dual union all
select 88 ,'406','406 2.0 HDi' from dual union all
select 89 ,'406','406 2.0 HDi' from dual union all
select 90 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 91 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 92 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 93 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 94 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 97  ,'S-Klasse','S 320 CDI' from dual union all
select 98  ,'S-Klasse','S 320 CDI' from dual union all
select 99  ,'S-Klasse','S 320 CDI' from dual 
)
select regexp_substr(type,'\d+(\.\d+)?\s*\w*$') /*cut part with model*/
from s

nurandi · Accepted Answer · 2017-01-26T09:29:37.057

Here is an alternative solution using gsub

df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+(?: [^ ]+)?)$", "\\1", df$VEHICLE_TYPE)

> df

#           MODEL VEHICLE_TYPE
#  1         Bora          1.6
#  2       Ducato 120 Multijet
#  3       Ducato 120 Multijet
#  4        Astra     1.7 CDTI
#  5          406      2.0 HDi
#  6          406      2.0 HDi
#  7  Focus C-MAX     1.6 TDCi
#  8  Focus C-MAX     1.6 TDCi
#  9  Focus C-MAX     1.6 TDCi
#  10 Focus C-MAX     1.6 TDCi
#  11 Focus C-MAX     1.6 TDCi
#  12    S-Klasse      320 CDI
#  13    S-Klasse      320 CDI
#  14    S-Klasse      320 CDI

I assume that vehicle type always at the end and follows this pattern: (1) group of numeric characters (0 to 9 and dot), e.g. 1.6 OR (2) combination of group numeric character and group any other characters, separated by space (e.g. 120 Multijet, 2.0 HDi)

UPDATE : deal with 308 1.6i Flex and Cherokee 2.8 CRD 4x4

df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+[a-z]?(?: [^ ]+)?(?: [^ ]+)?)$", "\\1", df$VEHICLE_TYPE)

# OR, simply grep "number" and everything after
# df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+[a-z]? .+)$", "\\1", df$VEHICLE_TYPE)


> df

#          MODEL VEHICLE_TYPE
# 1         Bora          1.6
# 2       Ducato 120 Multijet
# 3       Ducato 120 Multijet
# 4        Astra     1.7 CDTI
# 5          406      2.0 HDi
# 6          406      2.0 HDi
# 7  Focus C-MAX     1.6 TDCi
# 8  Focus C-MAX     1.6 TDCi
# 9  Focus C-MAX     1.6 TDCi
# 10 Focus C-MAX     1.6 TDCi
# 11 Focus C-MAX     1.6 TDCi
# 12    S-Klasse      320 CDI
# 13    S-Klasse      320 CDI
# 14    S-Klasse      320 CDI
# 15         308    1.6i Flex
# 16    Cherokee  2.8 CRD 4x4

Hey, thanks for your answer! Your code works good, but somehow it just leaves some rows out like these two: 308 308 1.6i Flex, Cherokee Cherokee 2.8 CRD 4x4 — rayray, Jan 26 '17 at 08:46
So, what do you expect if your input is `308 1.6i Flex` or `Cherokee 2.8 CRD 4x4` ? — nurandi, Jan 26 '17 at 09:04
I would expect something like this: `308` and `1.6i Flex` and for the other `Cherokee` and `2.8 CDR 4x4` — rayray, Jan 26 '17 at 09:30

MT0 · Answer 4 · 2017-01-26T10:15:11.643

0

In Oracle, you can use the first and second matching groups from the regular expression ^(.*?)\s+(\d.*)$:

SELECT REGEXP_SUBSTR( vehicle_type, '^(.*?)\s+(\d.*)$', 1, 1, NULL, 1 )
         AS model,
       REGEXP_SUBSTR( vehicle_type, '^(.*?)\s+(\d.*)$', 1, 1, NULL, 2 )
         AS vehicle_type
FROM   your_table;

edited Jan 26 '17 at 10:15

answered Jan 26 '17 at 10:08

MT0

143,790
11
59
117

Separate a string by a number

4 Answers4