I am trying to import interview transcriptions with textreadr, which splits each line into two columns by locating a separator character (usually a colon). In my transcriptions a colon occasionally appears in the body of the response text, which causes an error. I was hoping to replace these extra colons with something else (e.g. a dash or underscore), but I am not sure how to go about doing that.

I can find the location of all the colons through gregexpr(), but then how can I replace them? Would I be able to use grep or sub somehow through an if statement?

EDIT

OK, I found an inelegant solution using the stringr package:

First I replace all the colons through

dat = str_replace_all(text,":","_")

Then I reinsert only the first colon that I wanted to keep through

dat = str_replace(dat,"_",":")

Not great, but it worked....
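
One caveat with the two-step approach above: if a line already contains an underscore before its first colon, the second step would turn that underscore into a colon instead. A minimal base R sketch that avoids this by only touching the text after the first colon might look like the following. The function name replace_later_colons is hypothetical (it is not part of textreadr or stringr), and the sketch assumes the input is a character vector without NA values.

```r
# Hypothetical helper: replace every colon except the first one on each line.
# Assumes `x` is a character vector with no NA values.
replace_later_colons <- function(x, replacement = "_") {
  # Position of the first literal colon in each element (-1 if there is none)
  first <- as.integer(regexpr(":", x, fixed = TRUE))
  # Everything up to and including the first colon ("" when there is no colon)
  head_part <- substr(x, 1, first)
  # Everything after the first colon (the whole string when there is no colon)
  tail_part <- substring(x, first + 1)
  # Replace the remaining colons only in the tail, then glue the parts together
  paste0(head_part, gsub(":", replacement, tail_part, fixed = TRUE))
}

example_lines <- c("Int1: This is some text.",
                   "Int2: As I speak I take a long pause: for effect",
                   "A line without any separator")
cleaned <- replace_later_colons(example_lines)
print(cleaned)
```

Lines without a colon pass through unchanged, so the result can be handed to the import step as before.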

Gerard

1 Answer

You can use strsplit and then combine all elements after the first. Something like:

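# Example transcript lines; the second has a colon inside the response text
txn <- c("Int1: This is some text.",
         "Int2: As I speak I take a long pause: for effect",
         "Int1: This interview is over.")

# Split every line on literal colons
transcripts <- strsplit(txn, ":", fixed = TRUE)
# The first piece is the speaker label
interviewer <- sapply(transcripts, "[", 1)
# Rejoin the remaining pieces with ":" so colons in the response are kept,
# then trim the space left after the first colon
scripts <- trimws(sapply(transcripts, function(x) paste(x[-1], collapse = ":")))
dat <- data.frame(interviewer, scripts)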
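
If you would rather not split and re-paste, a possible alternative is to cut each line at its first colon with sub(). This is only a sketch: the names lines_in, speaker, response and dat_alt are made up for illustration, and a line with no colon at all would end up with the full line in both columns.

```r
# Illustrative input; the second line has a colon inside the response text
lines_in <- c("Int1: This is some text.",
              "Int2: As I speak I take a long pause: for effect",
              "Int1: This interview is over.")

# Drop everything from the first colon onwards to get the speaker label
speaker <- sub(":.*$", "", lines_in)
# Drop everything up to the first colon, plus any whitespace right after it
response <- sub("^[^:]*:[[:space:]]*", "", lines_in)

dat_alt <- data.frame(speaker, response, stringsAsFactors = FALSE)
print(dat_alt)
```

Because sub() only replaces the first match and the patterns are anchored around the first colon, any later colons in the response are left alone.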
emilliman5