I have been searching the web for hours and have tried several alternatives, but couldn't find a satisfying solution. I have a string called tmp_text containing several articles, each of which starts with
"Newspaper.com \tTopic \tXX.XX.2015\r\n\t\r\n\r\nhere_goes_the_title\r\n\r\ntext_containing_\r\n\r\nsometimes"
where XX.XX.2015 is a changing date (but always in 2015).
I want to find all the dates (XX.XX.2015) and all the titles (here_goes_the_title) and write them into a data frame (each date and its title in one row, but in different columns).
So far, my best attempt finds all the dates, but also some of the text around them, e.g.:
dates <- str_match_all(tmp_text, "\t(.*?).2015")
leads to
"\tTopic \t15.09.2015"
etc.
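One possible way to get just the date without its surroundings is to match the digits explicitly. In the attempt above, the lazy .*? still starts at the first tab it finds, and the unescaped . before 2015 matches any character. The sketch below uses base R (regmatches and gregexpr) rather than stringr so that it runs without extra packages; the sample string is hypothetical and only follows the format described above.

```r
# Hypothetical sample with two articles in the format described above
tmp_text <- paste0(
  "Newspaper.com \tTopic \t15.09.2015\r\n\t\r\n\r\nhere_goes_the_title\r\n\r\ntext\r\n\r\n",
  "Newspaper.com \tTopic \t16.09.2015\r\n\t\r\n\r\nanother_title\r\n\r\nmore text"
)

# Two digits, a literal dot, two digits, a literal dot, four digits
date_pattern <- "[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}"

# gregexpr() locates every match; regmatches() extracts the matched text
dates <- regmatches(tmp_text, gregexpr(date_pattern, tmp_text))[[1]]
print(dates)
```

Note that this pattern would also pick up a date that appears inside the article text, so it may need anchoring to the preceding tab if that happens in the real data.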
Finding the titles is much tougher, because they can only be found after the first \r\n\t\r\n\r\n sequence in every article and before the \r\n\r\n sequence, which occurs multiple times in an article.
Do you have any solutions?
Thanks in advance, Hanno
1st edit
Okay, as suggested by r2evans, here are some examples:
Süddeutsche.de \tPolitik \t15.09.2013\r\n\t\r\n\r\nSyrien-Konflikt\r\n\r\nHollande dämpft Erwartungen an Chemiewaffen-Plan\r\n\r\n
date should be
15.09.2013
title should be
Syrien-Konflikt
It would be nice if there were also a solution for grabbing the second title:
Hollande dämpft Erwartungen an Chemiewaffen-Plan
However, there are a few cases where the title is preceded by irrelevant information:
\r\nSüddeutsche.de \tComputer \t07.09.2013\r\n\t\r\n\r\nhttp://www.sueddeutsche.de/digital/syrische-elektronische-armee-wie-syrische-hacker-im-netz-fuer-assad-kaempfen-1.1764980\r\n\r\nSyrische Elektronische Armee\r\n\r\nWie syrische Hacker im Netz für Assad kämpfen\r\n\r\n
date should be:
07.09.2013
title should be:
Syrische Elektronische Armee
second title should be
Wie syrische Hacker im Netz für Assad kämpfen
Sometimes, however, the irrelevant information consists of two lines, as here:
Süddeutsche.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/politik/syrisch-tuerkische-grenze-mindestens-sechs-menschen-sterben-bei-explosion-1.1761804\r\n\r\nSyrisch-türkische Grenze\r\n\r\nMindestens sechs Menschen sterben bei Explosion\r\n\r\nBei einer Explosion von Munition sind an der syrisch-türkischen Grenze...
date:
03.09.2013
title:
Syrisch-türkische Grenze
second title:
Mindestens sechs Menschen sterben bei Explosion
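The three layouts above could perhaps be covered by one pattern that treats the label line (such as Kurz) and the URL line as optional. This is only a sketch: it assumes that these three layouts are the only ones, that a URL line always starts with http:// or https://, and that the label sits directly after the tab-only line. It uses base R instead of stringr, and the article strings are hypothetical placeholders that mirror the examples.

```r
# Hypothetical articles mirroring the three layouts shown above
articles <- c(
  "Example.de \tPolitik \t15.09.2013\r\n\t\r\n\r\ntitle_1\r\n\r\nsubtitle_1\r\n\r\n",
  "Example.de \tComputer \t07.09.2013\r\n\t\r\n\r\nhttp://example.com/a\r\n\r\ntitle_2\r\n\r\nsubtitle_2\r\n\r\n",
  "Example.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://example.com/b\r\n\r\ntitle_3\r\n\r\nsubtitle_3\r\n\r\nbody text"
)
tmp_text <- paste(articles, collapse = "")

article_pattern <- paste0(
  "\t([0-9]{2}\\.[0-9]{2}\\.[0-9]{4})\r\n\t\r\n",  # group 1: date, then the tab-only line
  "(?:[^\r\n]+\r\n)?\r\n",                          # optional label line such as "Kurz"
  "(?:https?://[^\r\n]+\r\n\r\n)?",                 # optional URL line
  "([^\r\n]+)\r\n\r\n",                             # group 2: title
  "([^\r\n]+)"                                      # group 3: second title
)

# Find every full match, then re-run the pattern on each match to get the groups
hits <- regmatches(tmp_text, gregexpr(article_pattern, tmp_text, perl = TRUE))[[1]]
groups <- regmatches(hits, regexec(article_pattern, hits, perl = TRUE))

# Element 1 of each entry is the full match, elements 2 to 4 are the groups
result <- data.frame(
  date         = vapply(groups, `[`, character(1), 2),
  title        = vapply(groups, `[`, character(1), 3),
  second_title = vapply(groups, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
print(result)
```

Because date and titles come from the same match, each row stays aligned; an article with a layout the pattern does not cover is skipped entirely rather than shifting the columns.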
The first solution suggested by r2evans works well. However, I know that there are X articles, and the function currently returns X dates (which is correct), but only X-2 titles!
I don't know which titles aren't found properly. So I would like to use a function that shows me the first 50 characters after each date, which would help me find the problematic cases by manual search, e.g.
Süddeutsche.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/politik/syrisch-tuerkische-grenze-mindestens-sechs-menschen-sterben-bei-explosion-1.1761804\r\n\r\nSyrisch-türkische Grenze\r\n\r\nMindestens sechs Menschen sterben bei Explosion\r\n\r\nBei einer Explosion von Munition sind an der syrisch-türkischen Grenze...
return should be:
03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/p
If you have a better solution, I would be glad to hear it.
If any questions remain, feel free to ask. Also let me know if you need the .txt file to be uploaded.
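For the manual check, something like the following might work: locate each date and cut out the date plus the next 50 characters. This is a rough sketch in base R; the sample text and the name n_after are made up for illustration, and it assumes at least one date is found.

```r
# Hypothetical sample: one regular article followed by one with a "Kurz" label
tmp_text <- paste0(
  "Example.de \tPolitik \t15.09.2013\r\n\t\r\n\r\ntitle_1\r\n\r\nsubtitle_1\r\n\r\n",
  "Example.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://example.com/b\r\n\r\n",
  "title_3\r\n\r\nsubtitle_3\r\n\r\nbody text"
)

date_pattern <- "[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}"
n_after <- 50  # number of characters to keep after each date

# Start positions and lengths of all date matches
m <- gregexpr(date_pattern, tmp_text)[[1]]
starts <- as.integer(m)
ends <- starts + attr(m, "match.length") + n_after - 1

# substring() is vectorised over start and end positions
contexts <- substring(tmp_text, starts, ends)

# print() shows \r, \n and \t as escapes, which makes the layout visible
print(contexts)
```

Scanning the printed vector for entries that do not follow the usual layout should point to the articles whose titles are missed.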
Cheers, Hanno