I have a data set:

crimes<-data.frame(x=c("Smith", "Jones"), charges=c("murder, first degree-G, manslaughter-NG", "assault-NG, larceny, second degree-G"))

I'm using tidyr::separate to split the charges column on a match with "G,":

crimes<-separate(crimes, charges, into=c("v1","v2"), sep="G,")

This splits my columns, but removes the separator "G,". I want to retain the "G," in the resulting column split.

My desired output is:

 x         v1                       v2
 Smith     murder, first degree-G   manslaughter-NG
 Jones     assault-NG               larceny, second degree-G

Any suggestions welcome.

TDog

2 Answers

11

Replace <yourRegexPattern> with your own regex.

If you want the separator kept in the left column, use a lookbehind:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?<=<yourRegexPattern>)")

If you want the separator kept in the right column, use a lookahead:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=<yourRegexPattern>)")

Also note that when you are separating a word from a group of digits (e.g. August1990 into August and 1990), the lookahead matches before every digit, not just the first one. Add extra="merge" so the column is split only once and the remaining digits stay together.

Example:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=[[:digit:]])", extra="merge")
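
As a concrete sketch of the pattern above, assuming a hypothetical data frame `dates` with a `stamp` column (neither name comes from the question), the split could look like this:

```r
library(tidyr)

# Hypothetical data: a month name fused with a year
dates <- data.frame(stamp = c("August1990", "May2016"),
                    stringsAsFactors = FALSE)

# The lookahead matches before every digit; extra = "merge" limits the
# split to the first match, so the remaining digits stay in "year"
res <- separate(dates, stamp, into = c("month", "year"),
                sep = "(?=[[:digit:]])", extra = "merge")
```

Without extra = "merge", separate() would find more pieces than columns and, by default, warn and drop the extras.
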
Cameron
7

UPDATE

This is what you asked for. Keep in mind that your data is not tidy (both V1 and V2 hold more than one variable each).

A<-separate(crimes,charges,into=c("V1","V2"),sep = "(?<=G,)")
A
      x                      V1                        V2
1 Smith murder, first degree-G,           manslaughter-NG
2 Jones             assault-NG,  larceny, second degree-G
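
Note that the lookbehind above keeps the trailing comma in V1 and leaves the space after it at the start of V2. A possible variant, not part of the original answer, is to put only the "G" in the lookbehind and let the ", " be consumed as the separator. A minimal sketch, assuming the data from the question:

```r
library(tidyr)

crimes <- data.frame(
  x = c("Smith", "Jones"),
  charges = c("murder, first degree-G, manslaughter-NG",
              "assault-NG, larceny, second degree-G"),
  stringsAsFactors = FALSE
)

# Split on ", " only where it directly follows a "G": the lookbehind keeps
# the "G" in the left column, while the ", " itself is removed
res <- separate(crimes, charges, into = c("v1", "v2"), sep = "(?<=G), ")
```

Commas inside a charge (such as "murder, first degree") are not preceded by "G", so they should be left alone.
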

If the charges themselves contain no commas (as in the original example), an easier way to keep the "G" or "NG" is to use sep=", ", as suggested by alistaire.

A<-separate(crimes, charges, into=c("v1","v2"), sep = ', ')

This gives

      x         v1              v2
1 Smith   murder-G manslaughter-NG
2 Jones assault-NG       larceny-G

If you want to keep separating your data.frame (on the "-"):

separate(A, v1, into = c("v3","v4"), sep = "-")

that gives

      x      v3 v4              v2
1 Smith  murder  G manslaughter-NG
2 Jones assault NG       larceny-G

You'll need to do the same for the v2 column. I don't know whether you want to keep separating; please post your expected output so I can make my answer more specific.

Matias Andina
  • Sorry, my example didn't include the real world case of my data, which has commas mixed in with the charges. So the "G, " is necessary as the extractor string to differentiate from ", " which exist. – TDog Apr 13 '16 at 03:54
  • And my desired output is: x v1 v2 1 Smith murder-G manslaughter-NG – TDog Apr 13 '16 at 03:55
  • Huge props @Matias Andina. That worked great. Now on to further cleaning. As you noted, my data is not tidy. Not yet anyway. – TDog Apr 13 '16 at 17:08