What is a fast way of doing multiple string.replace calls? I'm trying to add a space before the contracted part of shortened English words, like

he'll -> he 'll
he's -> he 's
we're -> we 're
we've -> we 've

I'm also adding spaces before and after punctuation, like so:

"his majesty" -> " his majesty "
his; majesty -> his ; majesty

Is there a faster and cleaner way to do it? It's a little too slow for the purpose but I've been doing it this way:

def removeDoubleSpace(sentence):
  # str.replace returns a new string, so reassign until no double spaces remain
  while "  " in sentence:
    sentence = sentence.replace("  ", " ")
  return sentence

def prepro(sentence):
  sentence = sentence.replace(",", " ,")
  sentence = sentence.replace(";", " ; ")
  sentence = sentence.replace(":", " : ")
  sentence = sentence.replace("(", " ( ")
  sentence = sentence.replace(")", " ) ")
  sentence = sentence.replace("‘"," ‘ ")
  sentence = sentence.replace('"',' " ')
  sentence = sentence.replace("'re", " 're")
  sentence = sentence.replace("'s", " 's")
  sentence = sentence.replace("'ll", " 'll")
  sentence = removeDoubleSpace(sentence)
  return sentence
alvas

1 Answer

You could use a few regular expressions to accomplish the same task:

import re

def prepro(sentence):
  # Surround each instance of ; : ( ) ‘ and " with spaces
  # Example: '"Hello;(w)o:r‘ld"' -> ' " Hello ;  ( w ) o : r ‘ ld " '
  sentence = re.sub('([;:()‘"])', ' \\1 ', sentence)

  # Insert a space before each instance of , 's 're and 'll
  # Example: "you'll they're, we're" -> "you 'll they 're , we 're"
  sentence = re.sub("(,|'s|'re|'ll)", ' \\1', sentence)

  # Replace multiple consecutive spaces with a single space. This runs last
  # because the substitutions above can introduce double spaces.
  # Example: "One Two  Three    Four!" -> "One Two Three Four!"
  sentence = re.sub(' +', ' ', sentence)

  return sentence
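
Whether the regular expressions beat chained replace calls depends on the input, so it is worth measuring on your own data. Here is a minimal sketch of such a comparison using timeit. The function names `prepro_replace` and `prepro_regex`, the precompiled pattern names, and the sample sentence are all made up for illustration, and the printed timings will vary by machine.

```python
import re
import timeit

# Hypothetical helper names; neither appears in the original post.
REPLACEMENTS = [(",", " ,"), (";", " ; "), (":", " : "), ("(", " ( "),
                (")", " ) "), ("‘", " ‘ "), ('"', ' " '),
                ("'re", " 're"), ("'s", " 's"), ("'ll", " 'll")]

def prepro_replace(sentence):
    # Chained str.replace calls, then collapse double spaces in a loop.
    for old, new in REPLACEMENTS:
        sentence = sentence.replace(old, new)
    while "  " in sentence:
        sentence = sentence.replace("  ", " ")
    return sentence

# Patterns compiled once at import time and reused on every call.
SURROUND = re.compile('([;:()‘"])')
BEFORE = re.compile("(,|'s|'re|'ll)")
SPACES = re.compile(' +')

def prepro_regex(sentence):
    sentence = SURROUND.sub(' \\1 ', sentence)
    sentence = BEFORE.sub(' \\1', sentence)
    return SPACES.sub(' ', sentence)

# Made-up test sentence; substitute text that resembles your real input.
sample = 'He said: "we\'re late; they\'ll wait (maybe), it\'s fine"'

# Both versions should agree before their speed is compared.
assert prepro_replace(sample) == prepro_regex(sample)

for func in (prepro_replace, prepro_regex):
    seconds = timeit.timeit(lambda: func(sample), number=2000)
    print(func.__name__, round(seconds, 4))
```

For short sentences the difference may be small either way; timing on a realistic batch of text gives a more useful answer than a single sample.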
verdesmarald
  • What do `' +', ' '` and `\\1` mean? Would this perform faster than the replace? – alvas Oct 03 '12 at 03:43
  • @2er0 I added some comments. `' +'` matches one or more consecutive spaces. `\\1` in the replacement string inserts the value that was matched between the parentheses (`()`) in the pattern. You would have to test it to see if it's faster, as I don't have access to your test data, but my instinct is yes. – verdesmarald Oct 03 '12 at 03:45
  • @2ero If this still isn't good enough, you could also loop through the characters in the string and build up the output sequentially in a list, then convert it to a string. However, that approach is painful to code and I would only recommend it if all else fails. I'm also not sure if the performance gain would be significant. – verdesmarald Oct 03 '12 at 04:07
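
For reference, the character-by-character approach from the last comment might look roughly like the sketch below. The name `prepro_chars` and the two constants are hypothetical, and this is only one way to write it. It has not been benchmarked, so it may well be slower than the regex version in CPython.

```python
SURROUND = set(';:()‘"')          # characters padded with a space on both sides
SUFFIXES = ("'re", "'ll", "'s")   # contractions that get a space in front

def prepro_chars(sentence):
    out = []  # output characters, joined once at the end

    def emit(text):
        # Append text, skipping a space that would directly follow another space.
        for c in text:
            if c == " " and out and out[-1] == " ":
                continue
            out.append(c)

    i = 0
    while i < len(sentence):
        ch = sentence[i]
        suffix = None
        if ch == "'":
            # str.startswith accepts a start offset, so no slicing is needed.
            suffix = next((s for s in SUFFIXES if sentence.startswith(s, i)), None)
        if ch in SURROUND:
            emit(" " + ch + " ")
            i += 1
        elif ch == ",":
            emit(" ,")
            i += 1
        elif suffix is not None:
            emit(" " + suffix)
            i += len(suffix)
        else:
            emit(ch)
            i += 1
    return "".join(out)

print(prepro_chars("you'll they're, we're"))
```

Because `emit` never writes two spaces in a row, no separate pass is needed to collapse double spaces.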