
I've been struggling with a regex for the following string:

"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
  1. I am unable to remove the entire \x...\x.. sequence from the above string.
  2. I am unable to remove the https URL from the above string.

My regex calls are:

gsub('http.* *', '', twts_array)
gsub("\\x.*\\x..","",twts_array)

My output is:

"Just beautiful let’s see how the next few days go \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... httpstcohUradDaNVX"

My expected output is:

Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner 

P.S.: As you can see, neither problem got solved. I also wrote "dot" in place of the . in https://t dot co/hUradDaNVX, as Stack Overflow does not allow me to post shortened URLs. Can someone help me tackle this problem?

m0nhawk
user2582651
  • Try `iconv(x,"latin1","ASCII",sub="")` (this removes any non-ASCII character). – nicola Feb 11 '18 at 09:01
  • Tried this solution, I'm still unable to get rid of those special characters. I get something like '������' on printing them – user2582651 Feb 11 '18 at 09:10
  • Please clarify: are you on Windows or Linux or Mac? On Windows, if I assign the string literal you provided at the beginning of the question to `x` and use `iconv(x,"latin1","ASCII",sub="")`, I get `"Just beautiful, lets see how the next few days go. \n\nLong term buying opportunities could be around the corner ... https://t dot co/hUradDaNVX"`, and [look, it works in Linux, too](https://ideone.com/eDelSR). – Wiktor Stribiżew Feb 11 '18 at 11:34
  • I'm on a Mac and I still get this output: `"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8… https://t dot co/hUradDaNVX"`. I'll check on Windows if that's the case. Thanks – user2582651 Feb 11 '18 at 12:32

1 Answer


On Linux you can do the following:

twts_array <- "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"

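# Re-encode to UTF-8; bytes that cannot be interpreted show up as <..> escapes
twts_array_str <- enc2utf8(twts_array)
# Remove every <..> escape
twts_array_str <- gsub('<..>', '', twts_array_str)
# Remove the URL and everything after it
twts_array_str <- gsub('http.*', '', twts_array_str)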

twts_array_str
# "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner ... "

`enc2utf8` converts any byte sequences it cannot interpret to the `<..>` form. The first `gsub` call then removes those escapes, and the second removes the URL.
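If the `<..>` conversion does not happen on your platform (the comments suggest the behaviour differs between systems), a byte-level regex is another option. The sketch below is not the method above but a swapped-in alternative, and it rests on one assumption: that the `\xed\xa0\xbd...` runs are emoji stored as UTF-8-encoded surrogate halves, which is what those byte values look like. Note that `\xed` in the printed string is R's display of a single raw byte, not a literal backslash followed by `x`, which is why a pattern such as `"\\x.*\\x.."` has nothing to match. The variable names `tweet`, `no_emoji` and `clean` are made up for illustration.

```r
# Build the tweet from byte escapes so the example does not depend on the
# locale or on the encoding of this source file.
# "\xe2\x80\x99" is the UTF-8 encoding of the curly apostrophe; the "\xed..."
# run holds the invalid bytes shown in the question.
tweet <- paste0(
  "Just beautiful, let\xe2\x80\x99s see how the next few days go. \n\n",
  "Long term buying opportunities could be around the corner ",
  "\xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8",
  "... https://t dot co/hUradDaNVX"
)

# A surrogate half written as UTF-8 is always three bytes:
# 0xED, then 0xA0-0xBF, then 0x80-0xBF.
# useBytes = TRUE makes the regex engine match bytes instead of characters,
# so the invalid input is not rejected.
no_emoji <- gsub("\\xed[\\xa0-\\xbf][\\x80-\\xbf]", "", tweet,
                 perl = TRUE, useBytes = TRUE)

# Drop the URL and everything after it (same pattern as in the answer).
clean <- gsub("http.*", "", no_emoji, perl = TRUE, useBytes = TRUE)

# The remaining bytes are valid UTF-8, so declare them as such.
Encoding(clean) <- "UTF-8"
```

Unlike converting to ASCII with `iconv`, this keeps legitimate non-ASCII characters such as the curly apostrophe. The `...` and the `\n\n` are left in place; remove them with further `gsub` calls if you need the exact expected output from the question. Both calls are vectorised, so `tweet` can be a character vector of tweets.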

m0nhawk