Hi, I have a script that removes subheadings and the paragraphs under unwanted sections, but it does not remove paragraphs that sit under non-English subheadings or that contain non-English words.
For example (original text):
=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
== External links ==
Business acronyms and abbreviations
Business acronyms
== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.
The result I get from my code is:
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.
This is what I hope to achieve (desired result):
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
The script is as follows:
import re

section_title_re = re.compile(r"^=+\s+.*\s+=+$")  # matches headings such as "== Title =="
content = []
skip = False
# 'asd.text' contains the original text; the result is written to 'NoRef.text'
with open('asd.text', 'r') as f1, open('NoRef.text', 'w') as f2:
    for l in f1.read().splitlines():
        line = l.strip()
        if "== external links ==" in line.lower():
            skip = True  # drop everything until the next heading
            continue
        if section_title_re.match(line):
            skip = False  # a new section starts; the heading itself is dropped
            continue
        if skip:
            continue
        content.append(line)
    content = '\n'.join(content) + '\n'
    f2.write(content + "\n")
Problem: So far my code can only remove paragraphs under subheadings with known names, such as "External links".
How do I also remove the subheadings and paragraphs that are not in English?
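A minimal sketch of one possible direction, under a loud assumption: a heading is treated as non-English if it contains any non-ASCII character. That is only a heuristic (an English heading containing a word like "Café" would be dropped, and a non-English heading written purely in ASCII would be kept). The names `UNWANTED_TITLES`, `is_unwanted` and `strip_sections` are made up for illustration, and `str.isascii()` needs Python 3.7 or later.

```python
import re

# Same heading pattern as the script above, with the title captured in group 1.
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")

# Hypothetical set of section names to drop by name (lower-cased).
UNWANTED_TITLES = {"external links"}


def is_unwanted(title):
    """Return True if the section under this heading should be dropped.

    Assumption: any non-ASCII character in the heading means "non-English".
    """
    return title.lower() in UNWANTED_TITLES or not title.isascii()


def strip_sections(text):
    """Remove all headings, plus the paragraphs under unwanted headings."""
    kept = []
    skip = False
    for raw in text.splitlines():
        line = raw.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            # Every heading decides afresh whether its section is skipped.
            skip = is_unwanted(match.group(1))
            continue
        if not skip:
            kept.append(line)
    return "\n".join(kept) + "\n"
```

With this sketch, the "Kūrybinės Industrijos" section in the example would be skipped because its heading contains "ū" and "ė". Checking the heading rather than each paragraph is a deliberate choice here, since English paragraphs can legitimately contain accented names.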
Thank you.