
Having a strange issue here with apply and R 3.0.1.

I have a huge dataframe with text, numbers and logical values. The logical values are converted to chr when I use apply, but because R allows something like TRUE == "TRUE" that isn't a problem.

But apply seems to prepend a space to some of the logical values, and TRUE == " TRUE" is no longer TRUE (as.logical(" TRUE") returns NA). Of course, I could do

sapply(cuelist[,4],FUN=function(logicalvalue) as.logical(sub("^ +", "", logicalvalue)))

but that isn't nice and I still don't know why R does that.

df <- data.frame(test=c("a","b","<",">"),logi=c(TRUE,FALSE,FALSE,TRUE))
apply(df, MARGIN=1, function(listelement) print(listelement) )

Interestingly, in this example the spaces appear only at [2,1] and [2,4].

> version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          0.1
year           2013
month          05
day            16
svn rev        62743
language       R
version.string R version 3.0.1 (2013-05-16)
nickname       Good Sport

Edit: same behaviour on R version 2.15.0 (2012-03-30)

Edit2: My dataframe looks like this

> df
  test  logi
1    a FALSE
2    b FALSE
3    <  TRUE
4    >  TRUE

> str(df)
'data.frame':   4 obs. of  2 variables:
 $ test: Factor w/ 4 levels "<",">","a","b": 3 4 1 2
 $ logi: logi  FALSE FALSE TRUE TRUE
niton
Marc

2 Answers

3

In a way, the problem is with apply, but more appropriately, the problem is with as.matrix, and how it is handling logical values.

Here are a few examples to help elaborate on the query I had for Karl.

First, let's create four data.frames to do some tests on.

  1. Your original data.frame to demonstrate the behavior.
  2. A data.frame with varying number of characters in the "test" column to look into Karl's explanation of what's going on.
  3. A data.frame with some numbers to help us start to understand what actually seems to be going on.
  4. A data.frame where your "logi" column is explicitly created as.character.
df1 <- data.frame(test = c("a","b","<",">"),
                  logi = c(TRUE,FALSE,FALSE,TRUE))
df2 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = c(TRUE,FALSE,FALSE,TRUE))
df3 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = c(TRUE,FALSE,FALSE,TRUE),
                  num = c(1, 12, 123, 2))
df4 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = as.character(c(TRUE,FALSE,FALSE,TRUE)))

Now, let's use as.matrix on each of them.

This has a space before TRUE.

as.matrix(df1)
#      test logi   
# [1,] "a"  " TRUE"
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  " TRUE"

This has a space before TRUE, but the "test" column remains unaffected. Hmm.

as.matrix(df2)
#      test logi   
# [1,] "aa" " TRUE"
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">>" " TRUE"

Ahh... This has a space before TRUE and spaces before the shorter numbers. So it seems that R is treating the logical column much like a numeric one: every value is padded to a common width, and for the logicals that width is the number of characters in the longest printed value ("FALSE"). Again, the first "test" column remains unaffected.

as.matrix(df3)
#      test logi    num  
# [1,] "aa" " TRUE" "  1"
# [2,] "b"  "FALSE" " 12"
# [3,] "<"  "FALSE" "123"
# [4,] ">>" " TRUE" "  2"

Things seem fine here, if you tell R that the logi column is a character column.

as.matrix(df4)
#      test logi   
# [1,] "aa" "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">>" "TRUE" 

For what it's worth, sapply doesn't seem to have that problem.

sapply(df1, as.matrix)
#      test logi   
# [1,] "a"  "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  "TRUE" 
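
Presumably that is because sapply combines the columns through ordinary coercion rather than through formatting. A minimal sketch of that coercion on its own (the variable names are mine, purely for illustration):

```r
# Ordinary coercion from logical to character does not pad anything
mixed    <- c("a", TRUE)                      # c() turns TRUE into "TRUE"
combined <- unlist(list("b", FALSE, TRUE))    # unlist() coerces the same way

# cbind() on plain vectors also coerces without padding
m <- cbind(test = c("a", "b"), logi = c(TRUE, FALSE))
m[, "logi"]
```

So the leading space is not a property of logical-to-character conversion in general; it only appears on the data.frame-to-matrix route.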

Update

In the R Public chat room, Joshua Ulrich points to format as the culprit. as.matrix uses as.vector for factors, which converts them to character (try str(as.vector(df1$test)) to see what I mean). For everything else it uses format, but unfortunately it offers no way to pass along any of format's arguments, one of which is trim (FALSE by default).

Compare the following:

A <- c(TRUE, FALSE)

format(A)
# [1] " TRUE" "FALSE"
format(A, trim = TRUE)
# [1] "TRUE"  "FALSE"
format(as.character(A))
# [1] "TRUE " "FALSE"
format(as.factor(A))
# [1] "TRUE " "FALSE"
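
To make the mechanism concrete, here is a rough sketch of the per-column conversion described above, with trim exposed as an argument. This is not the actual as.matrix source: to_char_column and char_matrix are hypothetical helpers, and leaving character columns untouched is my assumption.

```r
# Hypothetical helper, NOT the real as.matrix code: converts one column
# to character in the way described above, but lets the caller set trim.
to_char_column <- function(x, trim = FALSE) {
  if (is.factor(x)) {
    as.vector(x)             # factors become plain character vectors
  } else if (is.character(x)) {
    x                        # assumption: character columns are left alone
  } else {
    format(x, trim = trim)   # everything else goes through format()
  }
}

# Hypothetical wrapper: build a character matrix column by column
char_matrix <- function(d, trim = FALSE) {
  do.call(cbind, lapply(d, to_char_column, trim = trim))
}

df3 <- data.frame(test = c("aa", "b", "<", ">>"),
                  logi = c(TRUE, FALSE, FALSE, TRUE),
                  num  = c(1, 12, 123, 2))

padded  <- char_matrix(df3)                # trim = FALSE: padded to common width
trimmed <- char_matrix(df3, trim = TRUE)   # trim = TRUE: no leading spaces
```

With trim = FALSE the logi and num columns come out padded, just like in the df3 example; with trim = TRUE the padding is gone.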

So, how can you easily convert logical columns to character? Maybe something like this (though I would suggest creating a backup of your data first):

df1[sapply(df1, is.logical)] <- lapply(df1[sapply(df1, is.logical)], as.character)
df1
#   test  logi
# 1    a  TRUE
# 2    b FALSE
# 3    < FALSE
# 4    >  TRUE
as.matrix(df1)
#      test logi   
# [1,] "a"  "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  "TRUE" 
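
If you would rather not overwrite the original, the same idea can be wrapped in a small function that returns a converted copy. A minimal sketch; logicals_to_character is a made-up name, and the apply call is only there to show that as.logical now works row by row:

```r
# Hypothetical helper: return a copy with logical columns as character
logicals_to_character <- function(d) {
  is_lgl <- vapply(d, is.logical, logical(1))
  if (any(is_lgl)) {
    d[is_lgl] <- lapply(d[is_lgl], as.character)
  }
  d
}

df1 <- data.frame(test = c("a", "b", "<", ">"),
                  logi = c(TRUE, FALSE, FALSE, TRUE))
df1_chr <- logicals_to_character(df1)   # df1 itself is left unchanged

# Row-wise apply now sees unpadded strings, so as.logical() works
flags <- apply(df1_chr, 1, function(row) as.logical(row[["logi"]]))
```

The conversion happens once per column up front, so nothing has to be trimmed inside the function passed to apply.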
A5C1D2H2I1M1N2O1R2T1
  • This is very nice! With that I can pre-edit the dataframe before it is handled by apply/parApply. Still, I don't think this is appropriate behaviour for a dataframe in combination with apply functions. Maybe this really is bug-territory? Should I or someone post that to the R-bugtracker? Maybe I'm not experienced enough with R to realize the "positive" or "useful" underlying concept of this. – Marc Sep 05 '13 at 07:05
  • Yeah, thank you for your effort, very analytical :) I changed my whole codebase now, so that all logical lists given are strings in the first place. Doesn't feel comfortable or right, but it works. I don't have to slow down my calculating processes by trimming each and every logical column I have. – Marc Sep 05 '13 at 07:47
1

It is definitely due to apply, which converts the data frame to a matrix so that all elements have the same type, here character, and the logicals are converted to it. TRUE gets converted to " TRUE" to match the number of characters of "FALSE":

"FALSE"
" TRUE"

To convince yourself:

as.matrix(df)

Instead, you could use the a*ply functions from the plyr package, e.g.

a_ply(df, 1, print)
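
If pulling in plyr is not an option, one base R possibility is to walk over the columns in step with mapply, which hands each value to the function in its original type and never builds a character matrix. A minimal sketch: the function body is only a placeholder, and stringsAsFactors = FALSE is my addition so that test is a plain character column.

```r
df <- data.frame(test = c("a", "b", "<", ">"),
                 logi = c(TRUE, FALSE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# Each call receives one element from every column, with its type intact
res <- mapply(function(test, logi) {
  if (logi) toupper(test) else test   # logi is still a real logical here
}, df$test, df$logi, USE.NAMES = FALSE)
```

Because logi arrives as a logical, it can be used directly in if() without any string comparison or trimming.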
Karl Forner
  • Ah, I understand now. @AnandaMahto was right, print(df, quote = TRUE) really adds the spaces, too, but I was still able to do df[1,1]==TRUE. Is there another way than plyr? I need parApply from the parallel package, which has the same behaviour. – Marc Sep 04 '13 at 13:12
  • Did you actually test this with some other examples? I don't think this answer is quite right. – A5C1D2H2I1M1N2O1R2T1 Sep 04 '13 at 13:12
  • Try your explanation on `df1 <- data.frame(test=c("a","bb","<<",">"),logi=c(TRUE,FALSE,TRUE,TRUE))`. Why doesn't the first column get extra spaces? – A5C1D2H2I1M1N2O1R2T1 Sep 04 '13 at 13:15
  • Yes, that is right, but you're still able to do something like df1[3,2]==TRUE, while this doesn't work in the apply function. – Marc Sep 04 '13 at 13:21
  • plyr also offers parallelization with the .parallel option – Karl Forner Sep 04 '13 at 14:36
  • If you really want, you can still use lapply over the row indices: res <- lapply(seq_len(nrow(df)), function(ri) { row <- df[ri, ]; lapply(row, print) }) – Karl Forner Sep 04 '13 at 14:42
  • @KarlForner, I've [tried to elaborate](http://stackoverflow.com/a/18619447/1270695) on what I was observing. This feels somewhat like bug territory to me though. I understand what is happening with numeric when character conversion occurs, but not why it also applies the way it does to logical. – A5C1D2H2I1M1N2O1R2T1 Sep 04 '13 at 16:43