
I am writing a function that finds the index of the first trailing whitespace character in a string, but I am unsure how to do so. Can someone please teach me?

For example, in "i am here.   " there are three spaces after the sentence. The function would give me '10'.

The input is meant to be a Python text file that has been split into sentences (a list of strings).

This is what I have tried:

alplist = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"] 
space = [' ', ',', '.', '(', ')', ':', ':']

def TRAIL_WHITESPACE(python_filename): 
    whitespace = [] 
    LINE_NUMBER = 1 
    index = 0 
    for item in lines: 
        for index in range(len(item)): 
            if len(item) > 0: 
                if item[index] + item[index + 1] in alplist + space: 
                    index = index 
                    if item[index:] in " ": 
                        whitespace.append({'ERROR_TYPE':'TRAIL_WHITESPACE', 'LINE_NUMBER': str(LINE_NUMBER),'COLUMN': str(index),'INFO': '','SOURCE_LINE': str(lines[ len(item) - 1])}) 
                        LINE_NUMBER += 1 
                    else: 
                        LINE_NUMBER += 1 
                else: 
                    LINE_NUMBER += 1 
            else: 
                LINE_NUMBER += 1 
    return whitespace

Thank you

kylieCatt
Joyce Lam

3 Answers


This can easily be done using the str.rstrip() method:

#! /usr/bin/env python

#Find index of any trailing whitespace of string s
def trail(s):
    return len(s.rstrip())

for s in ("i am here. ", "nospace", "   no  trail", "All sorts of spaces \t \n", ""):
    i = trail(s)
    print(repr(s), i, repr(s[:i]))

output

'i am here. ' 10 'i am here.'
'nospace' 7 'nospace'
'   no  trail' 12 '   no  trail'
'All sorts of spaces \t \n' 19 'All sorts of spaces'
'' 0 ''
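
Applied to the question's setup (a list of lines, reporting only the lines that actually end in whitespace), the same idea might look like the sketch below. The function name `find_trailing_whitespace` is hypothetical, and treating the newline as a line terminator rather than as trailing whitespace is an assumption; the dictionary keys are copied from the question's attempt.

```python
def find_trailing_whitespace(lines):
    """Return one report dict per line that ends in spaces or tabs.

    Assumption: a trailing newline is the line terminator, not an error,
    so it is removed before the check.
    """
    reports = []
    # enumerate(..., start=1) gives 1-based line numbers
    for line_number, line in enumerate(lines, start=1):
        body = line.rstrip("\r\n")
        column = len(body.rstrip())  # index of the first trailing whitespace
        if column < len(body):
            reports.append({
                "ERROR_TYPE": "TRAIL_WHITESPACE",
                "LINE_NUMBER": str(line_number),
                "COLUMN": str(column),
                "INFO": "",
                "SOURCE_LINE": line,
            })
    return reports
```

Lines with no trailing whitespace produce no report, so an empty result means the input is clean.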
PM 2Ring
  • This approach has the advantage that it can be collapsed down to a one-liner: `[len(s.rstrip()) for s in open("file.txt").readlines()]`. But a disadvantage is that it doesn't signal an error if there is no whitespace on a line, such as when there is no newline at the end of a file. If the format of the input is known, this may not matter. – Tony Oct 26 '14 at 07:18
  • @Tony That one-liner can be condensed even further: `[len(s.rstrip()) for s in open("file.txt")]`. I'm not sure what you mean about not allowing for the possibility of no whitespace on a line: 'nospace' has no space... But I suppose I should also include an example of what happens when the function is passed an empty string. – PM 2Ring Oct 26 '14 at 07:18
  • That's a handy tip. I was unaware that readlines() could be dispensed with. – Tony Oct 26 '14 at 07:20
  • 1
    Iterating directly over the file object is often superior to using `readlines()` since it means you don't need to read the whole file into memory and you can start processing it straight away. – PM 2Ring Oct 26 '14 at 07:23
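
A minimal sketch of the lazy iteration described in the comments above, assuming the input is an ordinary text file. The generator name `trailing_indexes` is hypothetical.

```python
def trailing_indexes(path):
    """Yield len(line.rstrip()) for each line, reading the file lazily."""
    # The with block closes the file once the generator is exhausted
    with open(path) as handle:
        for line in handle:  # the file object yields one line at a time
            yield len(line.rstrip())
```

Because it is a generator, results can be consumed one at a time, or collected with `list(trailing_indexes("file.txt"))` when the whole list is needed.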

You can try using regular expressions. Something like this:

import re

my_re = re.compile(r'\S\s')

res = my_re.search("some long string")

if res:
    print("start: {}, end: {}".format(res.start(0), res.end(0)))
Alexey
  • For me this returns start: 3, end: 5. Isn't that the index of the first whitespace character and not the first trailing whitespace character? – Paul Rooney Oct 26 '14 at 07:14
  • @PaulRooney, the approach is sensible but I agree the regex is wrong. `'^.*?\S*?(\s+?)\Z'` works for me. – Tony Oct 26 '14 at 07:19
  • I was only suggesting the use of a regexp. I gave the regexp itself only as a simple example. This particular example can be modified like so to work from the end of the line: r'\S\s+$' – Alexey Oct 26 '14 at 07:30
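
One way to anchor the search at the end of the string, as the comments above suggest, is sketched below. The pattern `\s+\Z` and the name `first_trailing_ws` are assumptions rather than anything posted in the thread. Unlike `\S\s+$`, this pattern's match begins at the whitespace itself, so `m.start()` needs no offset.

```python
import re

# One or more whitespace characters running up to the very end of the string
TRAILING_WS = re.compile(r"\s+\Z")

def first_trailing_ws(s):
    """Return the index of the first trailing whitespace character, or -1."""
    m = TRAILING_WS.search(s)
    return m.start() if m else -1
```

Returning -1 for strings without trailing whitespace is a design choice made here so that callers can tell a clean line from one whose whitespace starts at index 0.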

As @Alexey said, a regular expression seems to be a way to proceed.

The following should do what you want. Note that 'whitespace' includes new line characters.

Call it like this: list_of_indexes = find_ws("/path/to/file.txt")

import re

def find_ws(filename):
    """Return a list of indexes, each indicating the location of the 
    first trailing whitespace character on a line.  Return an index of 
    -1 if there is no trailing whitespace character (at the end of a file)"""

    # Read all lines, closing the file when done
    with open(filename) as f:
        text = f.readlines()

    # Any characters, then whitespace, then end of line
    # Use a non-"greedy" match
    # Make the whitespace before the end of the line a group
    match_space = re.compile(r'^.*?\S*?(\s+?)\Z') 

    indexes = []

    for s in text:
        m = match_space.match(s)
        if m is None:
            indexes.append(-1)
        else:
            # find the start of the matching group
            indexes.append(m.start(1)) 

    return indexes

Documentation on regular expressions in Python is available in the standard library reference for the re module.

Tony