I have a string that is a mix of a title and a regular sentence, with no separator between the two.

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

The title actually ends at the word Vaccines; "Before the pandemic" starts another sentence that is completely separate from the title.

How do I remove the substring up to and including the word Vaccines? My idea was to remove everything from "Read more:" through the run of capitalized words that follows, stopping one word before the end of that run (at Before). But I don't know what to do when the run contains a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.

I know there is a function title() to convert a string into a title format in Python, but is there any function that can detect if a substring is a title?

I have tried the following using regular expression.

import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res

But it just removed the capital letters themselves (replacing each with a space) instead of removing the title.

catris25
  • Do you need to split it only once for a given string? What would you do if there would be another string with no separator? – ktv6 Mar 16 '21 at 10:10
  • @ktv6 yes, it would be only one for each, just one title and one regular sentence. There can't be two titles or two regular sentences. I have tokenized them all to be this way. – catris25 Mar 16 '21 at 10:11
  • `text = text[text.index("Vaccines")+8:]`? See [demo](https://ideone.com/3d8iEF). Well, you may also use `re.sub(r'(?i).*?vaccines\s*', '', text)`... – Wiktor Stribiżew Mar 16 '21 at 10:11
  • What if the word isn't Vaccines? There are a lot of texts here, and it can be any word you would usually see in a news piece. @WiktorStribiżew – catris25 Mar 16 '21 at 10:12
  • @catris25 this is exactly what I've asked you before. You can not split any random string without separator or any other a priori data (i.e. if the title always has constant number of words) – ktv6 Mar 16 '21 at 10:16
  • @ktv6 okay, but is there a way to get words starting with capital letters in python consecutively? – catris25 Mar 16 '21 at 10:22
  • Try `^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|to)\s+)*(?=[A-Z]\S*)`, see [this regex demo](https://regex101.com/r/XncKtl/1). There may be more exceptions added in the group after `to` using the alternation operator `|`. – Wiktor Stribiżew Mar 16 '21 at 10:25
  • @catris25 Look at this post https://stackoverflow.com/questions/9525993/get-consecutive-capitalized-words-using-regex However are you sure that it will work for you? There are 3 consecutive words that start with the capital letter in your example – ktv6 Mar 16 '21 at 10:25
  • Perhaps start the match with an uppercase char until Vaccines? `\b[A-Z].*?Vaccines\b\s*` https://regex101.com/r/xQWLzM/1 – The fourth bird Mar 16 '21 at 10:26
  • @ktv6 The actual problem is not matching consecutive capitalized words, but rather a *title* that may contain non-capitalized words. – Wiktor Stribiżew Mar 16 '21 at 10:26
  • @Thefourthbird `Vaccines` is an unknown word. – Wiktor Stribiżew Mar 16 '21 at 10:29
  • @WiktorStribiżew yeah, I get it now. One way I can think of is to check the first letter of every word in the string until you meet N consecutive uncapitalized first letters, where N is a number of uncapitalized words that you are sure never occur consecutively in a title. When you meet those N words, you know that the last capitalized word you iterated over is the first word of the actual text. However, the actual text itself can start with consecutive capitalized words (e.g. when its second word is a name). – ktv6 Mar 16 '21 at 10:33
  • @catris25 I believe we can match the title by making sure we match a sequence of capitalized words or [words that can be non-capitalized in titles](https://stackoverflow.com/a/34785551/3832970). The pattern will [look like this here](https://regex101.com/r/XncKtl/3), `^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])`. – Wiktor Stribiżew Mar 16 '21 at 10:36
  • @WiktorStribiżew yeah, your example on regex101.com works, but if I tried it on data with the word `the` in it, it just stops replacing the words there. – catris25 Mar 16 '21 at 10:36
  • @WiktorStribiżew that's great. Now I just need to collect all words that don't need to be capitalized in sentences. (My data are not in English). Thanks – catris25 Mar 16 '21 at 10:39
  • @catris25 Please check my answer with some more considerations. – Wiktor Stribiżew Mar 16 '21 at 10:46

2 Answers

2

You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.

^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])

See the regex demo.

Details:

  • ^ - start of string
  • (?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
  • \s* - zero or more whitespaces
  • (?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
    • (?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars, or one of the words that can stay non-capitalized in an English title
    • \s+ - one or more whitespaces
  • (?=[A-Z]) - a positive lookahead that requires the next char to be an uppercase letter (the first letter of the sentence after the title).
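To make the breakdown above concrete, here is a minimal sketch that applies the pattern with Python's built-in re module to the sample string from the question. The names TITLE_RE, title and body are only illustrative, and the duplicate `from` alternative is dropped because it does not change what is matched.

```python
import re

# The pattern from above, split over several lines for readability.
TITLE_RE = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|of)\s+)*"
    r"(?=[A-Z])"
)

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

# match() returns None when the text does not start with a title-like run.
match = TITLE_RE.match(text)
title = match.group().strip() if match else ""

# Remove the matched title (and the whitespace after it) from the text.
body = TITLE_RE.sub("", text, count=1)

print(title)  # Read more: Indonesia to Get Moderna Vaccines
print(body)   # Before the pandemic began, a lot of people were....
```

Note the limitation raised in the comments: if the sentence after the title starts with two capitalized words in a row, the first of them is treated as part of the title.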

NOTE: You mentioned your language is not English, so

  1. You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the English alternatives the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of
  2. You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, but make sure you then use the PyPI regex library, as Python's built-in re does not support Unicode category classes.
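If installing a third-party library is not an option, the same idea can be approximated without a regex at all. The sketch below is a different, token-based technique, not the pattern above: it relies on str.isupper(), which is Unicode-aware, and on a caller-supplied set of words that may stay lowercase in a title. The function name split_title, its parameters and the second sample string are made up for illustration.

```python
def split_title(text, lowercase_ok, prefix="Read more:"):
    """Split text into (title, body) using capitalization only.

    lowercase_ok is the set of words allowed to stay lowercase in a title;
    prefix is an optional literal lead-in that is dropped if present.
    """
    rest = text[len(prefix):] if text.startswith(prefix) else text
    words = rest.split()

    # Extend the run while words are capitalized or allowed lowercase words.
    end = 0
    while end < len(words) and (words[end][0].isupper() or words[end] in lowercase_ok):
        end += 1

    # Step back to the last capitalized word in the run: it starts the body.
    k = end - 1
    while k >= 0 and not words[k][0].isupper():
        k -= 1
    k = max(k, 0)

    # Joining with single spaces normalizes the original whitespace.
    return " ".join(words[:k]), " ".join(words[k:])


lowercase_ok = {"the", "a", "an", "to", "of", "in", "on", "and"}

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(split_title(text, lowercase_ok))
# ('Indonesia to Get Moderna Vaccines', 'Before the pandemic began, a lot of people were....')

# Made-up sample with a non-ASCII capital letter.
print(split_title("Élysée to Host Summit Ministers met on Monday.", lowercase_ok))
# ('Élysée to Host Summit', 'Ministers met on Monday.')
```

This keeps the same assumption as the regex (the body starts at the last capitalized word of the title-like run), so it shares the same failure mode when the sentence begins with two capitalized words. A word that begins with a quote or other punctuation is also not seen as capitalized.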
Wiktor Stribiżew
0

Why don't you just use slicing?

title = text[:44]
print(title)

Read more: Indonesia to Get Moderna Vaccines

Patrick