There is a similar question already on Stack Overflow (see link), but I am having problems referencing the item before i. I'm working with a list of strings and only need to combine neighboring strings when a certain string starts with specific characters, because that string is erroneously dividing its neighbors. For example:

list = ["a","b","<s16","c","d"]

In this case I want to combine the two elements neighboring the string that starts with "<s16" ("starts with" because each occurrence includes a different number). So the correct list would look like this: list = ["a","bc","d"]

I have tried several methods and the recurring problems are:

  1. i.startswith does not work with integer objects (for example, when I use range(len(list)) to get the string index)
  2. trying to reference the object before i (such as with list.pop(i-1)) results in a TypeError for an unsupported operand type, I guess because it thinks I'm trying to subtract 1 from a string rather than reference the element before the one that starts with "<s16"

I have tried using re.match and re.findall to resolve the first issue, but they do not seem to accurately find the right list items: if any(re.match('<s16') for i in list):
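
For reference, a minimal sketch of the two errors described above, using the example list (renamed `items`, a hypothetical name, to avoid the built-in). Looping over range(len(...)) yields integers, so the string to test is items[i]; looping over the list itself yields strings, so i - 1 becomes string arithmetic. re.match also needs the string to test as its second argument.

```python
import re

items = ["a", "b", "<s16", "c", "d"]

# range(len(items)) yields integers, so call startswith on items[i], not on i.
for i in range(len(items)):
    if items[i].startswith("<s16"):
        marker_index = i

# Iterating the list yields strings; enumerate() supplies the index as well,
# so the previous element can be reached with items[index - 1].
for index, text in enumerate(items):
    if text.startswith("<s16"):
        previous_text = items[index - 1]

# re.match takes the pattern first and the string to test second.
has_marker = any(re.match(r"<s16", text) for text in items)

print(marker_index, previous_text, has_marker)
# 2 b True
```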

Thank you in advance for any help, and I apologize in advance for my ignorance; I'm new.

New Dev
Erik
  • Can the `"s16"` occur multiple times or is there just one? What happens if there are two next to each other? – tzaman Nov 20 '20 at 11:05
  • @tzaman, yes, the "s16" occurs multiple times but with different endings each time, and there are never two next to each other. To give more background, I tagged a PDF based on different font types, sizes, colors, etc., so that I could parse out only the paragraphs. The problem is that the footnote number superscripts (not the actual footnotes themselves) are recognized and tagged differently, so they artificially divide the paragraphs in two. The same goes for some page breaks: I will need to combine paragraphs that start with a lowercase letter with the previous paragraph. – Erik Nov 20 '20 at 11:16

2 Answers

The best approach is to use the re module:

import re

mylist = ["<s1", "a","b","<s16", "<s18", "c", "d", "e", "f", "<s16", "g", "h", "i", "j", "<s135"]

# Match strings that start with "<s" followed by one or more digits,
# and then zero or more of any character.
r = r"^<s\d+.*"
i = 0
while i < len(mylist):
    item = mylist[i]
    
    # If you are at the start of the list just pop the first item
    if (i == 0) and re.search(r, item):
        mylist.pop(i)
    
    # If you are at the end of the list just pop the last item
    elif (i == len(mylist) - 1) and re.search(r, item):
        mylist.pop(i)
    
    # If you have found a wrong item inside the list,
    # delete it along with all consecutive wrong entries
    elif re.search(r, item):
        mylist.pop(i)
        while i < len(mylist) and re.search(r, mylist[i]):
            mylist.pop(i)

        # Join the next valid item, if any, onto the previous one
        if i < len(mylist):
            mylist[i-1] += mylist.pop(i)
    
    else:
        i += 1

print(mylist)

# ['a', 'bc', 'd', 'e', 'fg', 'h', 'i', 'j']

PS: You can handle more cases by adding more regular expressions.
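
As a rough sketch of adding a second pattern, here is one way to also cover the case from the asker's comment, where a string starting with a lowercase letter continues the previous one. This is not the loop above: it builds a new list instead of popping in place. The names `MARKER`, `CONTINUATION` and `merge` are hypothetical, and the lowercase rule assumes ASCII letters only.

```python
import re

MARKER = re.compile(r"^<s\d+")         # tag such as "<s16...", always dropped
CONTINUATION = re.compile(r"^[a-z]")   # assumption: lowercase start continues the previous item

def merge(items):
    """Return a new list with markers removed and split items rejoined."""
    result = []
    join_next = False
    for text in items:
        if MARKER.search(text):
            # Drop the marker; join the next item only if something precedes it.
            join_next = bool(result)
            continue
        if result and (join_next or CONTINUATION.search(text)):
            result[-1] += text
        else:
            result.append(text)
        join_next = False
    return result

print(merge(["Alpha", "<s16", "Beta", "Gamma", "delta", "<s2"]))
# ['AlphaBeta', 'Gammadelta']
```

Leading and trailing markers are simply dropped, and a run of consecutive markers is treated as a single one.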

Roman Zh.

Easiest to use a while loop here:

def join(l, sep="<s16"):
  i = 1
  while i < len(l) - 1:
    if l[i].startswith(sep):
      l.pop(i)  # remove the separator (at current index)
      l[i-1] += l.pop(i)  # join next element to previous
    else:
      i += 1

l = ["a","b","<s16abc","c","d", "<s16def", "e", "f"]
join(l)
print(l) 
# ['a', 'bc', 'de', 'f']

Also, don't name your lists list: doing so shadows the built-in of that name, which is not a good idea.
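
A small illustration of why the shadowing matters. `shadowing_demo` is a hypothetical helper, and the name is shadowed inside a function so the effect stays local:

```python
def shadowing_demo():
    list = ["a", "b"]  # the local name now hides the built-in list type
    try:
        list("cd")     # calls the list object, not the type
    except TypeError as exc:
        return str(exc)

message = shadowing_demo()
print(message)
# 'list' object is not callable
```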

tzaman
  • Thank you @tzaman, it works. I don't suppose I can ask you if it is possible to add an "unless" condition, i.e. join unless the string ends in "|", which indicates a footnote at the end of a block – Erik Nov 20 '20 at 13:58
  • @Erik you can change the condition in the `if` statement to be whatever you want. – tzaman Nov 23 '20 at 09:11
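
One possible reading of that "unless" condition, sketched as a variant of the loop above. It assumes the string to check is the one before the marker, and that the marker should still be removed even when nothing is joined. `join_unless` and its `stop` parameter are hypothetical names.

```python
def join_unless(items, sep="<s16", stop="|"):
    """Merge the neighbours of each marker in place, unless the item
    before the marker ends with `stop`."""
    i = 1
    while i < len(items) - 1:
        if items[i].startswith(sep):
            items.pop(i)  # always remove the marker itself
            if not items[i - 1].endswith(stop):
                items[i - 1] += items.pop(i)  # join next element to previous
        else:
            i += 1

blocks = ["a", "b", "<s16abc", "c", "d|", "<s16def", "e", "f"]
join_unless(blocks)
print(blocks)
# ['a', 'bc', 'd|', 'e', 'f']
```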