Removing Non-English Subheadings and Paragraphs

Question

Hi, I have a script that can remove subheadings and paragraphs, but I am not able to remove paragraphs with non-English subheadings and words.

For example (original text):

=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

== External links ==
Business acronyms and abbreviations
Business acronyms

== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

The result I get from my code is:

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

This is what I hope to achieve (desired result):

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

The script is as follows:

import re

# Matches wiki-style section headings such as "== Title ==" or "=== Title ===".
section_title_re = re.compile(r"^=+\s+.*\s+=+$")

content = []
skip = False
# Read the file that contains the original text.
with open('asd.text', 'r') as f1:
    for l in f1.read().splitlines():
        line = l.strip()

        # Start skipping at the "External links" heading.
        if "== external links ==" in line.lower():
            skip = True
            continue

        # Any other heading ends the skip; the heading line itself is dropped.
        if section_title_re.match(line):
            skip = False
            continue
        if skip:
            continue
        content.append(line)

# Write the filtered text to a new file.
with open('NoRef.text', 'w') as f2:
    f2.write('\n'.join(content) + '\n\n')

Problem: So far my code can remove paragraphs under subheadings with known names like "External links".

But how do I remove the subheadings and paragraphs that are non-English?

Thank you.

Did you try googling for libraries that detect languages? A cursory search brought up this: https://pypi.python.org/pypi/langdetect – juanpa.arrivillaga Jun 09 '16 at 06:27
If you know in advance all possible (English) headings you may encounter, just check whether the heading is in your list (better to use a `set`, actually), and skip the whole paragraph if it's not. – Julien Jun 09 '16 at 06:30
Hi Julien, I don't know all the possible English headings in advance; that is where my problem lies. – windboy Jun 09 '16 at 06:32
Then I think you need to find a good library, as suggested by juanpa.arrivillaga. – Julien Jun 09 '16 at 06:33
The [tag:wiki] tag is generic for any wiki platform. Your question looks like it is about scraping [tag:wikipedia] and doesn't seem to care what their content management and development model is. If this is a correct observation, please [edit] your question to use the correct tag. – tripleee Jun 09 '16 at 06:44

score 1 · Accepted Answer · answered Jun 09 '16 at 07:30

If you only want to detect whether a string contains non-English characters, that's easy: just try to decode it as ASCII. If that fails, it contains a character with a code above 127:

try:
    # txt must be a byte string (Python 2 str, or bytes in Python 3)
    utxt = txt.decode('ascii')
except UnicodeError:
    # txt contains non-"English" characters
    ...

If you want to detect whether it contains non-English words, that is a much more complex question, and you should consider whether you want to accept badly written English words, such as englich woerds badli writed. Good luck if you want to go that way...
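
To tie the ASCII check back to the question, here is a minimal Python 3 sketch of how it could be applied per section rather than per string. It is only an illustration under a few assumptions: the names `is_ascii` and `drop_non_ascii_sections` are made up for this example, Python 3 `str` is checked with `encode` instead of `decode`, and a section is dropped when its heading is in a skip list or when the heading or any body line contains a non-ASCII character.

```python
import re

# Same heading pattern as in the question, with a group capturing the title.
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")


def is_ascii(text):
    """Return True if every character in text has a code point below 128."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def drop_non_ascii_sections(text, skip_titles=("external links",)):
    """Strip headings and drop sections that are skipped or contain non-ASCII."""
    # Each entry is [keep_flag, body_lines]; the first holds text before any heading.
    sections = [[True, []]]
    for raw in text.splitlines():
        line = raw.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            title = match.group(1)
            keep = title.lower() not in skip_titles and is_ascii(title)
            sections.append([keep, []])
        else:
            sections[-1][1].append(line)

    kept = []
    for keep, body in sections:
        # A single non-ASCII character anywhere in the body drops the section.
        if keep and all(is_ascii(line) for line in body):
            kept.extend(body)
    return "\n".join(kept).strip() + "\n"


if __name__ == "__main__":
    sample = (
        "=== Corporate finance ===\n"
        "Corporate finance deals with the sources of funding.\n"
        "\n"
        "== Kūrybinės Industrijos ==\n"
        "Kūrybinės industrijos apima sritį ekonominių veiksnių.\n"
    )
    print(drop_non_ascii_sections(sample))
```

Note that this is deliberately strict: an otherwise English section containing a single accented character would be dropped as well, so it only fits if "plain ASCII" really is the requirement.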

Serge Ballesta

English orthography permits diacritics in loan words like *zoölogy* and *résumé* so this is an approximate approach at best. In a resource like Wikipedia, words like these are somewhat likely to be corrected into the proper form by meticulous editors, even if beginners are likely to originally type them without diacritics. – tripleee Jun 09 '16 at 08:22
I will give that a try. Thank you very much. – windboy Jun 09 '16 at 08:58
The package [`langdetect`](https://pypi.python.org/pypi/langdetect) proposed by @juanpa.arrivillaga behaves quite well with those words: `detect_langs("englich woerds badli writed")` returns `[en:0.999994655212]`. So as a heuristic tool, it seems to do a good job :) (in addition to behaving correctly with loan words). – MariusSiuram Jun 09 '16 at 09:19
@tripleee: I assumed that what was required was *only plain ASCII characters*. But of course for true language detection, existing and well-tested heuristic tools are the way to go. – Serge Ballesta Jun 09 '16 at 10:14
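
For anyone who wants to try the `langdetect` route from the comments, a small wrapper along these lines might serve as the per-heading or per-paragraph test. This is a sketch, not a tested recipe: `looks_english` is a hypothetical helper name, the package has to be installed separately (`pip install langdetect`), and treating blank text as English is just a convenience chosen for this example.

```python
def looks_english(text, seed=0):
    """Best-effort check that langdetect's top guess for text is English."""
    if not text.strip():
        return True  # nothing to classify; blank text is treated as harmless

    # Imported lazily so this module still loads where langdetect is absent.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic unless a seed is set before detection.
    DetectorFactory.seed = seed
    try:
        return detect(text) == "en"
    except LangDetectException:
        # Raised when the text has no usable features, e.g. only digits.
        return False
```

Detection on very short strings such as headings tends to be less reliable than on full paragraphs, so it is probably safer to classify a section by its body text, or by heading and body joined together, than by the heading alone.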
