
I wrote a little program to turn pages from book scans into a .txt file. On some lines, a word is hyphenated and continues on the next line. I wonder if there is any way to remove the dashes and merge the split word with the syllables on the line below?

E.g.:

effects on the skin is fully under-
stood one fights

to:

 effects on the skin is fully understood
 one fights

or:

effects on the skin is fully 
understood one fights

Or something like that, as long as the word is joined back together. Python is my third language and so far I can't think of anything, so maybe someone can give me a hint.

Edit: The point is that if the last character of a line is a dash, it should be removed and the word merged with the rest of the word on the line below.

VLAZ
Student111
  • I don't understand how the line with the dash and the one after it get merged from your example. Do you have example code to show us what you've done so far? – rayryeng Feb 07 '22 at 21:09
  • Thanks - your desired result contradicts the expected output you've given. The second line with the dash and the third line, when you "merge with the rest of the word below" doesn't make sense to me. Could you revise your example to use actual words in the English language? – rayryeng Feb 07 '22 at 21:12
  • I don't care how it is merged; e.g. if the last word in line n is "remov-" and line n+1 starts with "ing", I'd like to merge them – Student111 Feb 07 '22 at 21:13
  • The reason why I'm asking all of these questions is that it'll be easier for the community to actually give you an answer. Right now it's unclear. – rayryeng Feb 07 '22 at 21:13
  • While reading the lines in, you could check whether there is a hyphen at the end of the line and combine those two strings. – Andrew Ryan Feb 07 '22 at 21:14
  • we're not here to write your code for you, what have you tried? – notacorn Feb 07 '22 at 21:17
  • No, I don't have any yet. I'd like a hint. I thought about working with strings, but my txt files are huge, over 2k lines, so I don't think that solution will be fast – Student111 Feb 07 '22 at 21:18
  • Don't worry about speed at first. Check if the trivial solution works before delving into performance requirements. – blackbrandt Feb 07 '22 at 21:23
  • @Student111, 2000 lines is far from huge. 2 billion lines maybe, but for 2000 lines there's really no point in trying to optimize it. You could easily load it into memory and deal with string operations, unless your computer is 20 years old or you have to do it as part of a web server dealing with tens of requests per second. Just give it a try. – wovano Feb 07 '22 at 21:39
  • Do you have a line length limit? When should the word stay at the end of the current line, and when should it move to the next line? – Yuri Ginsburg Feb 07 '22 at 21:45

4 Answers

3

This is a generator which takes the input line by line. If a line ends with a -, it extracts the last word and holds it over for the next line. It then yields the current line, prefixed with any word held over from the previous line.

To combine the results back into a single block of text, you can join it against the line separator of your choice:

source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
le "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""

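def reflow(text):
    holdover = ""  # word fragment carried over from the previous line
    for line in text.splitlines():
        if line.endswith("-"):
            # split off the last word; "e" still has its trailing hyphen
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        holdover = e[:-1]  # drop the hyphen
    if holdover:
        yield holdover  # the final line ended with a hyphen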

print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""

To read one file line-by-line and write directly to a new file:

def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source:  # iterate lazily instead of readlines()
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]
        if holdover:
            dest.write(f"{holdover}\n")  # the final line ended with a hyphen

if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
Jack Deeth
  • slightly proud of `lin, _, e = line.rpartition`. feels like I've made a weak pun. – Jack Deeth Feb 07 '22 at 22:02
  • That's a very nice function and it works perfectly. However, I have a question about loading the file. When I do: with open("test.txt") as f: contents = f.readlines() print("\n".join(reflow(contents))) I get the file as a list of lines. Is there any way to load a file so that I can use your function? – Student111 Feb 07 '22 at 22:26
  • @Student111 I've now shown how to read from one file and write to another :) – Jack Deeth Feb 07 '22 at 22:42
  • That's perfect! – Student111 Feb 07 '22 at 22:44
2

Here is one way to do it

with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.rstrip('\n')  # remove newline character at end of line
        ends_with_hyphen = item.endswith('-')  # safe on empty lines, unlike item[-1]
        if ends_with_hyphen:
            item = item[:-1]  # drop the trailing hyphen
        if merge_line:
            combined_strings[-1] += item  # join onto the previous line
        else:
            combined_strings.append(item)
        merge_line = ends_with_hyphen
Andrew Ryan
1

If you treat the text as a single string, you can use the .split() function to break it apart and move the pieces around

words = "effects on the skin is fully under-\nstood one fights"
# split on the newlines
wordsSplit = words.split("\n")
# split each line on the spaces between words
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")
# check the last word of each line (except the final line) for a trailing hyphen
for i in range(len(wordsSplit) - 1):
    if wordsSplit[i][-1].endswith("-"):
        # join with the first word of the next line and remove the hyphen
        wordsSplit[i][-1] = wordsSplit[i][-1][:-1] + wordsSplit[i+1][0]
        wordsSplit[i+1][0] = ""
# recreate the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "

This splits the text on newlines, which is where the hyphens usually occur, and then splits each line into a list of words. It then checks for hyphens; when it finds one, it joins that word with the first word of the next line and sets that first word to an empty string. Finally, it rebuilds the string in a variable called msg, skipping the empty strings so that no extra space is added.
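
Note that msg ends up as a single line, because the original line breaks are not put back. If the line breaks should be kept, the same split-and-rejoin idea could be sketched roughly as follows. The function name join_split_words is hypothetical, and the sketch assumes that only a hyphen at the very end of a line marks a split word:

```python
def join_split_words(text):
    # one list of words per line
    lines = [line.split(" ") for line in text.split("\n")]
    for i in range(len(lines) - 1):
        # a line can become empty if its only word was moved up
        if lines[i] and lines[i][-1].endswith("-"):
            # drop the hyphen and pull the first word of the next line up
            lines[i][-1] = lines[i][-1][:-1] + lines[i + 1].pop(0)
    # rebuild the text, keeping one output line per input line
    return "\n".join(" ".join(words) for words in lines)


sample = "effects on the skin is fully under-\nstood one fights"
print(join_split_words(sample))
```

Hyphens in the middle of a line (such as "check-out") are left alone, since only the last word of each line is inspected.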

Jerry Spice
1

What about

import re

a = '''effects on the skin is fully under-
stood one fights'''

re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')

Explanation

a.replace('\n', '~') joins the input into a single line, using ~ in place of \n. (Choose a different placeholder character if ~ can appear in your text.)

The regex -~([a-zA-Z0-9]*) followed by a space then matches each hyphen, the placeholder, and the rest of the split word. The parentheses form a capturing group, and '\1\n' in the replacement re-inserts the captured text followed by a newline.

.replace('~','\n') finally replaces all remaining ~ chars to newlines.
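
As a possible variation, the pattern can match the newline directly, which avoids the placeholder character altogether. This is only a sketch under a few assumptions: the function name dehyphenate is hypothetical, the rest of the split word is taken to be everything up to the next whitespace (so trailing punctuation stays attached), and a continuation that itself ends in a hyphen is not merged further.

```python
import re


def dehyphenate(text):
    # "-\n", then the rest of the split word (non-whitespace characters),
    # then any spaces/tabs and at most one newline following that word
    return re.sub(r"-\n(\S+)[ \t]*\n?", r"\1\n", text)


sample = "effects on the skin is fully under-\nstood one fights"
print(dehyphenate(sample))
```

If the split word is the very last thing in the text, this version leaves a trailing newline after it.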

Přemysl Šťastný