-1

I am looking for ways to extract specific paragraphs from strings. I have many documents that I want to use for topic modeling, but they contain tables, figures, headers, etc. I only want to use the summary that each document usually contains, but the summaries aren't clearly marked.

I converted the PDFs to text and tried something like this, but it did not work well because the summary is marked differently from document to document:

def get_summary(text):
    subject = ""
    copy = False

    for line in text.splitlines():
        stripped = line.strip()
        if stripped == 'SUMMARY_BEGIN':
            copy = True
        elif stripped == 'SUMMARY_END':
            copy = False
        elif copy:
            # splitlines() drops the line endings, so add one back
            subject += line + "\n"

    return subject

I don't want to search for a summary between 100 different possible substrings.

Edit: an example that resembles my documents:

Date
21 Jun 2017

name name [abc]
name name [abc]
name name [cbd]
name name
name name
name name
name name
name name

nonsense-word1

nonsense-word1
nonsense-word1

12354
37264324

Summary:
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 

Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 

Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 


32 463264 
324324
324432
32424

nonsense-word2

nonsense-word2
nonsense-word2
nonsense-word2

nonsense-word2

nonsense-word2

324
24442

name name
name name
name name
name name

3244324324

Date
21 Jun 2017

Date
21 Jun 2017

Date
21 Jun 2017

electronically validated

electronically validated

electronically validated

electronically validated
electronically validated


763254 3276 4276457234
nlp_noob
  • This obviously depends on what your input data looks like. It's not really possible to suggest an answer without specific details about these documents you want to process. – Håken Lid Jun 27 '18 at 12:24
  • @HåkenLid These are like industrial documents. The part I am looking for is usually the only part of the document which is "only text" (alphanumeric). The other parts of the documents are usually just numbers, single words or small sentences. – nlp_noob Jun 27 '18 at 12:29
  • You could simply use a regular expression. Somewhere between 90% and 100% accuracy might be possible, but with unstructured input, you probably have to expect some false negatives / positives. – Håken Lid Jun 27 '18 at 12:36
  • I am not sure what you mean. I can't just look for every alphanumeric word in the text, because there are frequent words that make no sense due to my pdf to text conversion. These words dominate my topic models. That is basically the main problem. – nlp_noob Jun 27 '18 at 12:44
  • As I said, you have to provide specific input data samples to us if you want specific suggestions on how to solve it. – Håken Lid Jun 27 '18 at 12:48
  • I added an example which should show what my documents look like. – nlp_noob Jun 27 '18 at 13:00

3 Answers

0

You can write a regular expression that only catches sentences. This one matches the first run of at least two consecutive sentences, each starting with an upper-case character and ending with a dot.

(?:[A-Z][^\n.]+\.\s*){2,}

https://regex101.com/r/blK6sf/1
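A minimal Python sketch of applying such a pattern, assuming the sentence-ending dot is escaped as `\.` and that the summary is the first run of two or more sentences in the text. The helper name `first_sentence_run` and the sample string are made up for illustration.

```python
import re

# Two or more consecutive "sentences": an upper-case letter, a run of
# characters other than newline or dot, a literal dot, optional whitespace.
SENTENCE_RUN = re.compile(r'(?:[A-Z][^\n.]+\.\s*){2,}')

def first_sentence_run(text):
    """Return the first run of two or more sentences, or None if there is none."""
    match = SENTENCE_RUN.search(text)
    return match.group(0).strip() if match else None

# Hypothetical sample shaped like the example in the question.
sample = (
    "Date\n21 Jun 2017\n\nname name [abc]\n\nSummary:\n"
    "First sentence here. Second sentence here.\nThird one.\n\n12354\n"
)
print(first_sentence_run(sample))
```

Abbreviations such as "e.g." and decimal numbers contain dots too, so a pattern like this will probably cut some real summaries short.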

Håken Lid
  • Thanks. This obviously works for the example, but is it possible to write a regular expression that only catches sentences that start with an upper case character and end with a dot followed by a new line? If I change the names to all upper case, this regex does not work anymore. I am sorry, I should have included all upper case nonsense words in my example. – nlp_noob Jun 28 '18 at 07:45
  • You obviously have to use an expression that matches your actual data. I can't suggest a better expression, since I don't know what your input looks like. You should include relevant input data when you ask a question like this. – Håken Lid Jun 28 '18 at 08:36
0

Why not look for the sentences in the document with more than N words? These are probably true sentences rather than useless lines.
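A rough sketch of the word-count idea, working line by line instead of on real sentence boundaries. The function name `long_lines` and the threshold of 8 words are assumptions to tune against real documents.

```python
def long_lines(text, min_words=8):
    """Keep only lines with more than `min_words` whitespace-separated tokens."""
    kept = [
        line.strip()
        for line in text.splitlines()
        if len(line.split()) > min_words
    ]
    return "\n".join(kept)

# Hypothetical sample shaped like the example in the question.
sample = (
    "name name [abc]\n"
    "12354\n"
    "Here is the only part I want to extract out of my document.\n"
    "electronically validated\n"
)
print(long_lines(sample))
```

A short last line of a paragraph would be dropped by this filter, which may or may not matter for topic modeling.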

Another way is to find out which words appear only in true sentences. Some simple words, such as articles or prepositions, may only appear in the real paragraphs, and you can find them with a simple grep.
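The same idea can be sketched in Python instead of grep. The word list below is a deliberately tiny, made-up set of English function words, and `min_hits=2` is an arbitrary threshold; both are assumptions, not a tested recipe.

```python
import re

# Hypothetical, minimal list of function words; extend it for real data.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "and", "for", "with"}

def lines_with_function_words(text, min_hits=2):
    """Keep lines that contain at least `min_hits` function words."""
    kept = []
    for line in text.splitlines():
        tokens = re.findall(r"[a-z]+", line.lower())
        hits = sum(token in FUNCTION_WORDS for token in tokens)
        if hits >= min_hits:
            kept.append(line.strip())
    return "\n".join(kept)

# Hypothetical sample shaped like the example in the question.
sample = (
    "nonsense-word1\n"
    "name name [abc]\n"
    "Here is the only part I want to extract out of my document.\n"
    "21 Jun 2017\n"
)
print(lines_with_function_words(sample))
```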

Gabriel M
0

Just get re to do the heavy lifting ;)

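import re

def get_summary(text):
    # Capture everything after a "Summary:" line, up to the first run of
    # six or more digit/whitespace characters.
    match = re.search(
        r'\nSummary:\n(?P<content>.*?)[\d\s]{6,}',
        text,
        flags=re.DOTALL,
    )
    # Return None instead of raising when no summary is found
    return match.group('content') if match else None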
FraggaMuffin