I am looking for a way to extract specific paragraphs from strings. I have many documents that I want to use for topic modeling, but they contain tables, figures, headers, etc. I only want to use the summary that each document usually contains, but the summaries are not marked consistently.
I converted the PDFs to text and tried something like this, but it did not work well because the summary is introduced differently in every document:
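def get_summary(text):
    """Return the lines between the SUMMARY_BEGIN and SUMMARY_END markers."""
    summary_lines = []
    copy = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == 'SUMMARY_BEGIN':
            copy = True
        elif stripped == 'SUMMARY_END':
            copy = False
        elif copy:
            summary_lines.append(line)
    # Join with newlines so consecutive lines do not run together.
    return "\n".join(summary_lines)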
I don't want to search for a summary between 100 different possible substrings.
Edit: an example that resembles my documents:
Date
21 Jun 2017
name name [abc]
name name [abc]
name name [cbd]
name name
name name
name name
name name
name name
nonsense-word1
nonsense-word1
nonsense-word1
12354
37264324
Summary:
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
32 463264
324324
324432
32424
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
324
24442
name name
name name
name name
name name
3244324324
Date
21 Jun 2017
Date
21 Jun 2017
Date
21 Jun 2017
electronically validated
electronically validated
electronically validated
electronically validated
electronically validated
763254 3276 4276457234
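One possible direction that avoids marker strings altogether, given only as a rough sketch: assume the summary is the longest run of consecutive lines that look like sentences (several words, mostly letters), while the surrounding noise consists of short lines, names, and numbers. The function names `looks_like_prose` and `longest_prose_block` and the thresholds `min_words` and `min_alpha_ratio` are hypothetical and would need tuning on the real documents.

```python
def looks_like_prose(line, min_words=8, min_alpha_ratio=0.7):
    """Heuristic (assumption): prose lines have several words and are mostly letters."""
    stripped = line.strip()
    if len(stripped.split()) < min_words:
        return False
    # Count letters and spaces; digits and punctuation lower the ratio.
    letters = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return letters / len(stripped) >= min_alpha_ratio


def longest_prose_block(text):
    """Return the longest run of consecutive prose-like lines, joined by newlines."""
    best, current = [], []
    for line in text.splitlines():
        if looks_like_prose(line):
            current.append(line.strip())
        else:
            # A non-prose line ends the current run; keep it if it is the longest so far.
            if len(current) > len(best):
                best = current
            current = []
    if len(current) > len(best):
        best = current
    return "\n".join(best)
```

On a document shaped like the example above, `longest_prose_block(text)` would be expected to return only the repeated "Here is the only part..." lines, since lines such as `Summary:`, `name name`, and `electronically validated` are too short to count as prose. Documents where a table caption or footer is longer than the summary would defeat this heuristic.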