
I have a text file that is several hundred rows long. I am trying to remove all of the punctuation characters from it except the "/" characters. I am currently using the strip function in the qdap package.

Here is a sample data set:

htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/", 
        "{fonttblf0fswissfcharset0 helvetica",
        "margl1440margr1440vieww9000viewh8400viewkind0")

Here is the code:

strip(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

The only problem with this beautiful function is that it removes the "/" characters. If I try to remove all characters except the "{" character it works:

strip(htxt, char.keep = "{", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

Has anyone experienced the same problem?

Tyler Rinker
user1738753
  • It is customary to attribute bug finds in the NEWS file of a package. I'd appreciate your name so as to attribute credit. You can view the package maintainer email for qdap and send me an email. Thanks for finding and reporting this. In the future feel free to use the GitHub repo directly to report issues or feature requests: https://github.com/trinker/qdap/issues – Tyler Rinker Jun 19 '13 at 00:29

2 Answers


For whatever reason, it seems that qdap:::strip always strips "/" out of character vectors. This is in the source code towards the end of the function:

x <- clean(gsub("/", " ", gsub("-", " ", x)))

This line runs before the inner helper that does the actual stripping (it is defined in the body of strip), so the "/" is already gone by the time char.keep is applied.
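To see the effect of that substitution in isolation, here is a minimal base-R sketch. It reproduces only the two nested gsub calls from the quoted line and leaves out qdap's clean() wrapper, so it illustrates the mechanism and does not reproduce strip itself. The names x and pre are illustrative.

```r
# One element of the sample data from the question
x <- "{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"

# Same substitutions as the quoted source line, without qdap's clean()
pre <- gsub("/", " ", gsub("-", " ", x))

# The trailing "/" has become a space, so nothing downstream can keep it
grepl("/", pre, fixed = TRUE)
# FALSE
```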

So just replace the function with your own version:

strip.new <- function (x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, 
    lower.case = TRUE) 
{
    strp <- function(x, digit.remove, apostrophe.remove, char.keep, 
        lower.case) {
        if (!is.null(char.keep)) {
            x2 <- Trim(gsub(paste0(".*?($|'|", paste(paste0("\\", 
                char.keep), collapse = "|"), "|[^[:punct:]]).*?"), 
                "\\1", as.character(x)))
        }
        else {
            x2 <- Trim(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", 
                as.character(x)))
        }
        if (lower.case) {
            x2 <- tolower(x2)
        }
        if (apostrophe.remove) {
            x2 <- gsub("'", "", x2)
        }
        ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", 
            x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove, char.keep = char.keep, 
        lower.case = lower.case))))
}

strip.new(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

#[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
#[2] "fonttblf0fswissfcharset0 helvetica"            
#[3] "margl1440margr1440vieww9000viewh8400viewkind0" 

The package author is pretty active on this site so he can probably clear up why strip does this by default.

Simon O'Hanlon

Why not:

> gsub("[^/]", "", htxt)
[1] "/" ""  "" 

Given the clarification by @SimonO101, the regex approach might be:

gsub("[]!\"#$%&'()*+,.:;<=>?@[^_`{|}~-]", "", htxt)

Note that the first item in that sequence is "]" and the last item is "-", and that the double quote needed to be escaped. This is the set targeted by [:punct:] with the "/" left out (the backslash is also omitted from the string above). To do it programmatically you might use:

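rem.some.punct <- function(txt, notpunct = NULL) {
    # Punctuation to remove, as a bracket expression ("]" first, "-" last).
    punctstr <- "[]!\"#$%&'()*/+,.:;<=>?@[^_`{|}~-]"
    # Drop the character to keep; notpunct is used as a regex, so pass only
    # characters without special meaning there, such as "/".
    rempunct <- gsub(paste0("", notpunct), "", punctstr)
    gsub(rempunct, "", txt)
}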
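An alternative sketch, not taken from qdap or from the answers here, is to leave [[:punct:]] intact and exclude the characters to keep with a negative lookahead under perl = TRUE. The helper name strip_punct_except and its keep argument are hypothetical, and the sketch assumes keep is a single non-empty string made up only of punctuation characters.

```r
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
          "{fonttblf0fswissfcharset0 helvetica",
          "margl1440margr1440vieww9000viewh8400viewkind0")

# Hypothetical helper (not part of qdap): remove punctuation except the
# characters listed in `keep`.
strip_punct_except <- function(txt, keep = "/") {
  # Assumption: keep is one non-empty string of punctuation characters only
  stopifnot(is.character(keep), length(keep) == 1L, nzchar(keep))
  stopifnot(!grepl("[^[:punct:]]", keep))
  # Backslash-escape each kept character so "]", "^", "-" and "\" stay
  # literal inside the character class
  keep_class <- paste0("\\", strsplit(keep, "")[[1]], collapse = "")
  # Match a punctuation character only when it is not one of the kept ones
  pattern <- paste0("(?![", keep_class, "])[[:punct:]]")
  gsub(pattern, "", txt, perl = TRUE)
}

res <- strip_punct_except(htxt, keep = "/")
res
```

On the sample data this drops the leading "{" and keeps the trailing "/". Unlike strip, it does not lower-case the text or trim whitespace.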
IRTFM
  • Despite the slightly ambiguous wording in the post, this isn't what the OP wants. They want the text returned, without unwanted characters (as best I can tell strip is basically to remove punctuation and non-alpha-numeric characters apart from ones you want to keep). – Simon O'Hanlon Jun 12 '13 at 08:25
  • Ok, ok, the plain incorrect phrasing of the question! I base my assumptions on the last line of code, which apparently contains the correct result of applying the function. – Simon O'Hanlon Jun 12 '13 at 15:26