1

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.

90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.

Here is an example:

Num Last-Name First-Name Cat. DOB Location

11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA


df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
    "2 Pearl, Sam R 1986-11-04 UK",
    "5 Livingston, Steph LL 1983-12-12 USA",
    "7 Thornton, Mark LR 1982-03-26 USA",
    "10 Silver, John RED LL 1983-09-14 USA")

You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.

I started to run code that used to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is is 1-2 characters), I created a new column at the end of the data frame, assigned the value to it, and if there was no value (i.e. it evaluates to FALSE), input NA. I'm sure I could likely create a massive nested ifelse statement in a mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:

Num Last-Name First-Name Cat. DOB Location Color

11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED

I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word or value to a new column at the end of the data frame, and shift the corresponding values in the row to the left to properly align with the others rows of data.

alistaire
  • 42,459
  • 4
  • 77
  • 117
wetcoaster
  • 367
  • 3
  • 15
  • Do you have any criteria that it is all upper case? – akrun May 17 '16 at 04:15
  • @akrun no personal criteria that it must be uppercase, but when it comes in from the source, any value that is in the 4th spot from the start of the string (whether it's the classifying category, or the color in certain instances), they will both be uppercase values. – wetcoaster May 17 '16 at 04:23

3 Answers3

2

here's a simpler way:

input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)

the first gsub transposes colors to the end of the string. the second gsub makes use of the fact that unchanged lines will now end with a date and country-code (not a country-code and a color), and simply adds an "NA" to them.

IDEone demo

Scott Weaver
  • 7,192
  • 2
  • 31
  • 43
1

We could use gsub to remove the extra substrings

v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
            "\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE, 
 col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <-  trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
                       "", trimws(df1)))
d1
#  Num   LastName FirstName Cat        DOB Location Color
#1  11    Jackson      Adam   L 1982-06-15      USA      
#2   2      Pearl       Sam   R 1986-11-04       UK      
#3   5 Livingston     Steph  LL 1983-12-12      USA      
#4   7   Thornton      Mark  LR 1982-03-26      USA      
#5  10     Silver      John  LL 1983-09-14      USA   RED
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Really well done, I got to it before your edit, but I was able to make use of the first lines of code you provided. Even better now - thank you! – wetcoaster May 17 '16 at 05:05
0

Using strsplit instead of regex:

# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = T)

# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
    # if number of items in row is 6, insert NA, else rearrange
    if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))

# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")

df2
#   Num  Last-Name First-Name Cat.        DOB Location Color
# 1  11    Jackson       Adam    L 1982-06-15      USA  <NA>
# 2   2      Pearl        Sam    R 1986-11-04       UK  <NA>
# 3   5 Livingston      Steph   LL 1983-12-12      USA  <NA>
# 4   7   Thornton       Mark   LR 1982-03-26      USA  <NA>
# 5  10     Silver       John   LL 1983-09-14      USA   RED
alistaire
  • 42,459
  • 4
  • 77
  • 117