
I'm trying to split a large number of strings like the one below:

x = "�\001�\001�\001�\001�\001\002CN�\001\bShandong�\001\004Zibo�\002$ABCDEFGHIJK�\002\aIMG_HAS�\002�\002�\002�\002�\002�\002�\002�\002\02413165537405763268743�\002\001�\002�\002�\002�\003�\003�\003����\005�\003�\003�\003�\003"

into four pieces

'CN', 'Shandong', 'Zibo', 'ABCDEFGHIJK'

I've tried

stringr::str_split(x, '\\00.')

which returns the original x unchanged. I've also tried

trimws(gsub("�\\00?", "", x, perl = T))

which only removes the unknown character.

Could someone help me with this? Thanks for doing so.

jay.sf
Zhenyu Wu

2 Answers

You can try str_extract_all:

stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
[1] "CN"          "Shandong"    "Zibo"        "ABCDEFGHIJK" "IMG_HAS"

With base R:

regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]

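Here we extract every run of uppercase letters, lowercase letters, or underscores. Everything else is ignored, so � and the control characters (\001, \002, and so on) do not appear in the final output. Note that the result also contains "IMG_HAS", which is not one of the four requested pieces.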
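If only the first four pieces are wanted, one option is to keep the first four matches. This is a minimal sketch, assuming the wanted fields always come first in every string. `x_demo` is a hypothetical, shortened stand-in for `x` (control characters only, � omitted), and `pieces` and `first_four` are illustrative names.

```r
# Hypothetical, shortened stand-in for x (assumption: same field order as x)
x_demo <- "\001\002CN\001\bShandong\001\004Zibo\002$ABCDEFGHIJK\002\aIMG_HAS\002\02413165537405763268743"

# Extract every run of letters or underscores
pieces <- regmatches(x_demo, gregexpr("[A-Za-z_]+", x_demo))[[1]]

# Keep only the first four matches (assumes the wanted fields always come first)
first_four <- head(pieces, 4)
print(first_four)
```

If the field order varies between strings, taking the first four matches will not be reliable and the pattern would need to be tightened instead.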

Ronak Shah

We can use strsplit from base R, splitting on every run of non-letters and dropping the empty strings:

setdiff(strsplit(x, "[^A-Za-z]+")[[1]], "")
#[1] "CN"          "Shandong"    "Zibo"        "ABCDEFGHIJK" "IMG"         "HAS"  
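Because the pattern `[^A-Za-z]+` treats the underscore as a separator, "IMG_HAS" is split into "IMG" and "HAS". Also, `setdiff` returns unique values, so a repeated word would be dropped. A possible variant is sketched below, assuming the underscore should stay part of a word and duplicates should be kept. `x_demo` is a hypothetical, shortened stand-in for `x` (control characters only, � omitted), and `parts` and `kept` are illustrative names.

```r
# Hypothetical, shortened stand-in for x
x_demo <- "\001\002CN\001\bShandong\001\004Zibo\002$ABCDEFGHIJK\002\aIMG_HAS\002\02413165537405763268743"

# Split on runs of anything that is not a letter or an underscore
parts <- strsplit(x_demo, "[^A-Za-z_]+")[[1]]

# Drop empty strings without removing duplicates (unlike setdiff)
kept <- parts[nzchar(parts)]
print(kept)
```

The leading empty string arises because the input starts with a separator; `nzchar` filters it out.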
akrun