4

I'm fairly new to R. I have a character vector containing the following:

> head(sampleVector)

[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

I want to split each line into separate pieces, with one data value per piece. I want to get a list resultList that would eventually print out the following:

> head(resultList)

[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345" 

[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"

I am struggling with the regex notation for strsplit(). This is the code I have so far:

resultList  <- strsplit(sampleVector,"\\s+[|] | [|]\\s+ | [\\s+]")
# gives me the following output

# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |" 

Is there any way I can get the desired output with a single strsplit call? I am guessing my notation for matching the delimiter plus whitespace is wrong. Any help on this would be appreciated.

hrbrmstr
12341234

3 Answers

4

Here's one way. It first removes the | characters from the vector with gsub (fixed=TRUE makes the pattern a literal string rather than a regex). Then it uses strsplit to split on runs of one or more whitespace characters. That's probably a bit easier.

strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
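
For anyone who wants to run this from scratch, here is a minimal sketch of the same two steps written out separately. It assumes sampleVector holds exactly the two lines shown in the question; the names noPipes and trimmedList are only illustrative.

```r
# Assumption: sampleVector is rebuilt from the two lines printed in the question
sampleVector <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Step 1: delete every literal pipe (fixed = TRUE turns off regex interpretation)
noPipes <- gsub("|", "", sampleVector, fixed = TRUE)

# Step 2: split each string on runs of one or more whitespace characters
resultList <- strsplit(noPipes, "\\s+")

# Optional: drop the leading "" that comes from the whitespace at the start
trimmedList <- lapply(resultList, "[", -1)

print(resultList)
```

The leading "" appears because each string still starts with whitespace once the pipes are gone; strsplit does not emit a matching empty string at the end.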

Here's an interesting alternative using scan that might be useful, and will probably be quite fast.

lapply(sampleVector, function(y) {
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    (g <- gsub("\\s+", "", s))[-length(g)]
})
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
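
The one-liner `(g <- gsub(...))[-length(g)]` is compact but easy to misread, so here is a hedged, more explicit sketch of the same idea. The helper name splitLine is hypothetical, trimws assumes R 3.2.0 or later, and the trailing field is dropped only when it is actually empty rather than unconditionally.

```r
# Assumption: sampleVector is rebuilt from the two lines printed in the question
sampleVector <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Hypothetical helper: split one line on "|" with scan, then tidy the fields
splitLine <- function(line) {
  # Read the line as "|"-separated character fields
  fields <- scan(text = line, what = character(), sep = "|", quiet = TRUE)
  # Strip the padding around each field
  fields <- trimws(fields)
  # Drop the final field if it is the empty piece after the closing pipe
  n <- length(fields)
  if (n > 0 && fields[n] == "") fields <- fields[-n]
  fields
}

resultList <- lapply(sampleVector, splitLine)
print(resultList)
```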
Rich Scriven
  • Wow thanks mate. I was looking into `gsub` but I didn't think of putting it within the `strsplit` function. Thanks. Will accept as answer (in around 6 minutes when SO allows me to) ahahahah thanks again for the help P.S. I wish I could upvote this ahahhaha – 12341234 Oct 21 '14 at 01:26
  • No worries. I really like these types of problems :). Also, not sure if you want to keep the first `""` element. It can be removed easily with `lapply(resultList, "[", -1)` – Rich Scriven Oct 21 '14 at 01:27
  • ahahahahha i'm getting the hang of it. I am unfamiliar with notation like `\\s`. Thanks again. Wish of luck for the SF Giants :) – 12341234 Oct 21 '14 at 01:29
  • Ah I would need it when I do my matrix manipulations later on but I'll keep that in mind. – 12341234 Oct 21 '14 at 01:29
4

Another strsplit option which I nearly missed:

strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

...and my original answer because regmatches is my favourite function of late:

regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"

To break it down as requested:

`[| ]+` is a regex matching one or more consecutive instances (`+`) of a space or a pipe `|`.
`[^| ]+` is a regex matching one or more consecutive instances (`+`) of any character that is not (`^`) a space or a pipe `|`.
`gregexpr` finds all matches of this pattern and returns the start position and length of each match.
`regmatches` extracts from `test` all the substrings matched by `gregexpr`.
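
To make the last two points concrete, here is a small sketch that looks inside the gregexpr result and reproduces by hand what regmatches does. It assumes `test` holds the two lines from the question; the names m, starts, lens and manual are only illustrative.

```r
# Assumption: `test` holds the two lines printed in the question
test <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# gregexpr returns one list element per input string: the match start
# positions, with the match lengths stored in the "match.length" attribute
m <- gregexpr("[^| ]+", test)
starts <- as.integer(m[[1]])
lens <- attr(m[[1]], "match.length")

# The same substrings can be cut out of the first string by hand
manual <- substring(test[1], starts, starts + lens - 1)

# regmatches performs that extraction for every element at once
tokens <- regmatches(test, m)

print(manual)
print(tokens)
```

Because this approach collects the matches rather than splitting on delimiters, no leading "" element is produced.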

thelatemail
0

You could try strsplit first and then gsub:

sapply(strsplit(xx, '\\|'), function (x) gsub("^\\s+|\\s+$", "", x))
     [,1]     
[1,] ""       
[2,] "txt01"  
[3,] "100"    
[4,] "200"    
[5,] "123.456"
[6,] "0.12345"
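
sapply simplifies its result to a matrix here. Since the question asks for a list, a possible variant is to swap in lapply, which keeps one character vector per input line. This is a sketch rather than the code above: it assumes xx holds the two lines from the question and uses trimws (R 3.2.0 or later) in place of the gsub trim.

```r
# Assumption: xx holds the two lines printed in the question
xx <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Split on the literal pipe, then trim the padding from every piece;
# lapply returns a list instead of simplifying to a matrix
resultList <- lapply(strsplit(xx, "|", fixed = TRUE), trimws)

print(resultList)
```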
rnso
  • i wanted to return a list with each list component containing the split characters. but thanks anyway! – 12341234 Oct 23 '14 at 21:49