I am trying to process some character strings for an input file. First I convert the strings from a vector to a list, then I reduce to only unique values.

Next I would like to convert the words in each list element into a string with a separator of ':1 '.

I can get the function to work on a single list element but when I try to use ldply from plyr to do it for the whole list, I only get the last word in each list element.

Here's the code:

library(plyr)

df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."

df1$string1 <- tolower(as.character(df1$string1))
df1$string1 <- gsub('[[:punct:]]',' ',df1$string1)
df1$string1 <- gsub('[[:digit:]]',' ',df1$string1)
df1$string1 <- gsub("\\s+"," ",df1$string1)

fdList1 <- strsplit(df1$string1, " ", df1$string1)
fdList2 <- lapply(fdList1, unique)

toString1 <- function(x) {
    string2 <- c()
    #print(length(x[1][1]))
    #print(x)
    #print(class(x))
    for(i in length(x)) {
        string2 <- paste0(string2, x[[i]], ":1 ", collapse="")
    }
    string2
}

df2 <- ldply(fdList2, toString1)
df2 

v1 <- toString1(fdList2[2])
v1

df2 is wrong. I would like a vector similar to v1 for each list element.

Any suggestions?

screechOwl
    Try this: `ldply(seq(length(fdList2)), function(x) toString1(fdList2[x]))`. Your function seems to be passing `[[.]]` instead of `[.]`. – Arun Mar 04 '13 at 18:52

2 Answers

To explain why this is happening:

Your function toString1 is the issue:

toString1 <- function(x) {
    string2 <- c()
    for(i in length(x)) { 
        string2 <- paste0(string2, x[[i]], ":1 ", collapse="")
    }
    string2
}

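In the case of toString1(fdList2[1]), you're passing a list of length 1. length(x) is then 1, so the loop body runs exactly once, with x[[1]] being the whole word vector. The for-loop therefore does nothing useful here, and the function can be reduced to: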

toString1 <- function(x) {
    # x is a list of length 1; x[[1]] is the character vector of words
    paste0(x[[1]], ":1 ", collapse="")
}
o <- toString1(fdList2[2])
o

# [1] "this:1 string:1 is:1 a:1 slightly:1 longer:1 "
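The difference between single and double brackets that matters here can be checked directly. This is only a minimal sketch; `words` is a hypothetical stand-in for the question's fdList2, not the original data.

```r
# Hypothetical stand-in for fdList2: a list of character vectors
words <- list(c("this", "string", "is", "a"),
              c("this", "string", "is", "a", "slightly", "longer"))

sub_list <- words[2]    # single bracket: still a list, of length 1
sub_vec  <- words[[2]]  # double bracket: the character vector itself

class(sub_list)   # "list"
length(sub_list)  # 1
class(sub_vec)    # "character"
length(sub_vec)   # 6
```

So a function that indexes with x[[1]] sees all six words when given `words[2]`, but only the first word when given `words[[2]]`.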

But ldply calls your function on each element of the list, so what gets passed is not the one-element list (fdList2[2]) but the character vector itself (fdList2[[2]]). So, in this case, your function should be:

toString1 <- function(x) {
    string2 <- c()
    # x is a character vector here, so visit every index from 1 to length(x)
    for(i in 1:length(x)) { 
        string2 <- paste0(string2, x[i], ":1 ", collapse="")
    }
    string2
}
ldply(fdList2, toString1)

#                                                                   V1
# 1                                          this:1 string:1 is:1 a:1 
# 2                      this:1 string:1 is:1 a:1 slightly:1 longer:1 
# 3                         this:1 string:1 is:1 an:1 even:1 longer:1 
# 4                     this:1 string:1 is:1 a:1 slightly:1 shorter:1 
# 5 this:1 string:1 is:1 the:1 longest:1 of:1 all:1 other:1 strings:1 

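Note the two changes. First, length(x) in the for-loop becomes 1:length(x), because the loop has to cycle through ALL elements; for(i in length(x)) visits only the last index, which is why you only got the last word. Second, x[[i]] becomes x[i], because x is now a vector.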
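To see the loop-index problem in isolation, here is a small sketch that records which indices each form of the loop visits. The vector `x` and the `visited_*` names are made up for illustration; seq_along(x) is used as an equivalent of 1:length(x) that also behaves sensibly when x is empty.

```r
x <- c("this", "string", "is", "a")

# length(x) is the single number 4, so this loop runs once, with i = 4
visited_wrong <- integer(0)
for (i in length(x)) visited_wrong <- c(visited_wrong, i)

# seq_along(x) is 1, 2, 3, 4, so this loop visits every element
visited_right <- integer(0)
for (i in seq_along(x)) visited_right <- c(visited_right, i)

visited_wrong  # 4
visited_right  # 1 2 3 4
```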

Hope this helps.

Arun

Why not just use sapply on "fdList2"?

> sapply(fdList2, paste0, ":1 ", collapse = "")
[1] "this:1 string:1 is:1 a:1 "                                         
[2] "this:1 string:1 is:1 a:1 slightly:1 longer:1 "                     
[3] "this:1 string:1 is:1 an:1 even:1 longer:1 "                        
[4] "this:1 string:1 is:1 a:1 slightly:1 shorter:1 "                    
[5] "this:1 string:1 is:1 the:1 longest:1 of:1 all:1 other:1 strings:1 "
> ## If you need a single column data.frame
> data.frame(V1 = sapply(fdList2, paste0, ":1 ", collapse = ""))
                                                                  V1
1                                          this:1 string:1 is:1 a:1 
2                      this:1 string:1 is:1 a:1 slightly:1 longer:1 
3                         this:1 string:1 is:1 an:1 even:1 longer:1 
4                     this:1 string:1 is:1 a:1 slightly:1 shorter:1 
5 this:1 string:1 is:1 the:1 longest:1 of:1 all:1 other:1 strings:1 

For that matter, if this is really your target, you can simplify your intermediate steps even further. Skip the creation of "fdList1" and "fdList2" and just use:

sapply(strsplit(df1$string1, " "), 
       function(x) paste0(unique(x), ":1 ", collapse = ""))
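As a possible variation, not part of the approach above, vapply can be swapped in for sapply so that the result is guaranteed to be a character vector with one string per input. The sketch below is self-contained and uses a made-up `strings` vector in place of df1$string1.

```r
# Made-up input standing in for the cleaned df1$string1
strings <- c("this string is a string",
             "this string is a slightly longer string")

# Split on single spaces, drop repeated words, append ":1 " to each word,
# and collapse into one string per input element
result <- vapply(strsplit(strings, " ", fixed = TRUE),
                 function(x) paste0(unique(x), ":1 ", collapse = ""),
                 character(1))
result
# [1] "this:1 string:1 is:1 a:1 "
# [2] "this:1 string:1 is:1 a:1 slightly:1 longer:1 "
```

The character(1) template makes vapply raise an error if the function ever returns something other than a single string, whereas sapply would silently change the shape of its result.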
A5C1D2H2I1M1N2O1R2T1