
I have searched the web for hours and tried several alternatives, but couldn't find a satisfying solution. I have a string called tmp_txt containing several articles, each of which starts with

"Newspaper.com \tTopic \tXX.XX.2015\r\n\t\r\n\r\nher_goes_the_title\r\n\r\ntext_containing_\r\n\r\nsometimes"

where XX.XX.2015 is a changing date (but always in 2015).

I want to find all the dates (XX.XX.2015) and all the titles (here_goes_the_title) and write them into a data frame (each date and its title in one row, but in separate columns).

So far, my best attempt finds all dates, but also some of the surrounding text, e.g.:

dates <- str_match_all(tmp_txt, "\t(.*?).2015")

leads to

"\tTopic \t15.09.2015"

etc.

Finding the titles is much harder, because a title appears only after the first \r\n\t\r\n\r\n sequence in each article and before a \r\n\r\n sequence, which occurs multiple times per article.

Do you have any solutions?

Thanks in advance, Hanno

1st edit

Okay, as suggested by r2evans, here are some examples:

Süddeutsche.de \tPolitik \t15.09.2013\r\n\t\r\n\r\nSyrien-Konflikt\r\n\r\nHollande dämpft Erwartungen an Chemiewaffen-Plan\r\n\r\n

date should be

15.09.2013

title should be

Syrien-Konflikt

It would be nice if there were also a solution for grabbing the second title:

Hollande dämpft Erwartungen an Chemiewaffen-Plan

However, there are a few cases where the title is preceded by irrelevant information:

\r\nSüddeutsche.de \tComputer \t07.09.2013\r\n\t\r\n\r\nhttp://www.sueddeutsche.de/digital/syrische-elektronische-armee-wie-syrische-hacker-im-netz-fuer-assad-kaempfen-1.1764980\r\n\r\nSyrische Elektronische Armee\r\n\r\nWie syrische Hacker im Netz für Assad kämpfen\r\n\r\n

date should be:

07.09.2013

title should be:

Syrische Elektronische Armee

second title should be

Wie syrische Hacker im Netz für Assad kämpfen

Sometimes the irrelevant information consists of two lines, as here:

Süddeutsche.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/politik/syrisch-tuerkische-grenze-mindestens-sechs-menschen-sterben-bei-explosion-1.1761804\r\n\r\nSyrisch-türkische Grenze\r\n\r\nMindestens sechs Menschen sterben bei Explosion\r\n\r\nBei einer Explosion von Munition sind an der syrisch-türkischen Grenze...

date:

03.09.2013

title:

Syrisch-türkische Grenze

second title:

Mindestens sechs Menschen sterben bei Explosion

The first solution suggested by r2evans works well. However, I know that there are X articles, and the function currently returns X dates (which is correct), but only X-2 titles!

I don't know which titles aren't found properly, so I would like a function that shows me the first 50 characters after each date. That would help me find the problematic cases by manual search, e.g.

Süddeutsche.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/politik/syrisch-tuerkische-grenze-mindestens-sechs-menschen-sterben-bei-explosion-1.1761804\r\n\r\nSyrisch-türkische Grenze\r\n\r\nMindestens sechs Menschen sterben bei Explosion\r\n\r\nBei einer Explosion von Munition sind an der syrisch-türkischen Grenze...

return should be:

03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/p

If you have a better solution, I would be glad to hear it.

If any questions remain, feel free to ask. Let me also know if you need the .txt file to be uploaded.

Cheers, Hanno

hyhno01
  • For titles, are those "sequences" literal? That is, if you look for the literal `\r\n\t\r\n\r\n` and `\r\n\r\n` and take all text in between them, you'd have the title? – r2evans Feb 28 '19 at 16:36
  • The titles are mostly literal but sometimes contain symbols like " or numbers at the beginning – hyhno01 Feb 28 '19 at 19:37

2 Answers


A base R solution. Using Jonny's txt,

txt <- "Newspaper.com \tTopic \t12.02.2015\r\n\t\r\n\r\nher_goes_the_title\r\n\r\ntext_containing_\r\n\r\nsometimes"

regmatches(txt, gregexpr("\\b[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}\\b", txt))
# [[1]]
# [1] "12.02.2015"
regmatches(txt, gregexpr("(?<=\r\n\t\r\n\r\n)[^\r\n]+(?=\r\n\r\n)", txt, perl = TRUE))
# [[1]]
# [1] "her_goes_the_title"

Using gregexpr is good for finding multiple matches. It might find more than one date in a string, though, so use caution if you start seeing that pattern. (There are easy ways to handle it if you expect it, such as lapply(x, `[[`, 1), where x is the return value from above.) You can cheat and use just regexpr if you're only working on one string at a time, but vectorizing it is likely a good thing in the long run.
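As a rough sketch of that vectorized use, the following keeps only the first date and first title per article and puts them into a data frame. The `articles` vector and the helper `first_or_na` are hypothetical names made up for this illustration, and the sketch assumes the text has already been split into one string per article.

```r
# Hypothetical input: one element per article (not the asker's real data).
articles <- c(
  "Newspaper.com \tTopic \t12.02.2015\r\n\t\r\n\r\nfirst_title\r\n\r\nbody mentioning 01.01.2015\r\n\r\nmore",
  "Newspaper.com \tTopic \t15.09.2015\r\n\t\r\n\r\nsecond_title\r\n\r\nbody\r\n\r\nmore"
)

date_re  <- "\\b[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}\\b"
title_re <- "(?<=\r\n\t\r\n\r\n)[^\r\n]+(?=\r\n\r\n)"

# Hypothetical helper: first match of each list element, or NA if there is none.
first_or_na <- function(m) {
  vapply(m, function(x) if (length(x)) x[[1]] else NA_character_, character(1))
}

dates  <- first_or_na(regmatches(articles, gregexpr(date_re, articles)))
titles <- first_or_na(regmatches(articles, gregexpr(title_re, articles, perl = TRUE)))

# One row per article, date and title in separate columns.
result <- data.frame(date = dates, title = titles, stringsAsFactors = FALSE)
result
```

Returning NA instead of dropping non-matches keeps the two columns the same length, so an article whose title is not found shows up as a row with a missing title.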

Explanation:

"\\b[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}\\b"
 ^^^                              ^^^  word boundaries before/after
    ^^^^^      ^^^^^      ^^^^^        character range, just digits here
         ^^^        ^^^        ^^^     number of characters in preceding match
            ^^^        ^^^             the literal dot "."

and

"(?<=\r\n\t\r\n\r\n)[^\r\n]+(?=\r\n\r\n)"
 ^^^^^^^^^^^^^^^^^                       must have this pattern before,
                                            but does not consume it
                          ^^^^^^^^^^^^   must have the pattern after, no consume
                  ^^^^^^^                any character not one of \r \n
                         ^               one or more of preceding match

The use of (?<= and (?= requires perl=TRUE.
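For the irregular examples in the question's edit (a URL line before the title, or an extra label such as "Kurz" directly after the tab line), the fixed lookbehind cannot match. One possible workaround is to cut everything up to and including the date, strip the header, and then filter the remaining lines. This is only a sketch: the function name `extract_article` and the column name `second_title` are made up here, and the rules "skip one optional label line after the tab line" and "skip lines starting with http" are assumptions drawn from the three examples, not from the full data.

```r
# The three example articles from the question, one string per article.
articles <- c(
  "Süddeutsche.de \tPolitik \t15.09.2013\r\n\t\r\n\r\nSyrien-Konflikt\r\n\r\nHollande dämpft Erwartungen an Chemiewaffen-Plan\r\n\r\n",
  "\r\nSüddeutsche.de \tComputer \t07.09.2013\r\n\t\r\n\r\nhttp://www.sueddeutsche.de/digital/syrische-elektronische-armee-wie-syrische-hacker-im-netz-fuer-assad-kaempfen-1.1764980\r\n\r\nSyrische Elektronische Armee\r\n\r\nWie syrische Hacker im Netz für Assad kämpfen\r\n\r\n",
  "Süddeutsche.de \tPolitik \t03.09.2013\r\n\t\r\nKurz\r\n\r\nhttp://www.sueddeutsche.de/politik/syrisch-tuerkische-grenze-mindestens-sechs-menschen-sterben-bei-explosion-1.1761804\r\n\r\nSyrisch-türkische Grenze\r\n\r\nMindestens sechs Menschen sterben bei Explosion\r\n\r\nBei einer Explosion von Munition sind an der syrisch-türkischen Grenze..."
)

# Hypothetical helper: date, title and second title of a single article.
extract_article <- function(x) {
  m <- regexpr("\\b[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}\\b", x)
  if (m == -1) {
    return(data.frame(date = NA_character_, title = NA_character_,
                      second_title = NA_character_, stringsAsFactors = FALSE))
  }
  date <- regmatches(x, m)
  # invert = TRUE returns the text before and after the matched date.
  rest <- regmatches(x, m, invert = TRUE)[[1]][2]
  # Assumption: the header is "\r\n\t\r\n", an optional label line, then "\r\n".
  rest <- sub("^\r\n\t\r\n[^\r\n]*\r\n", "", rest)
  lines <- trimws(strsplit(rest, "\r\n", fixed = TRUE)[[1]])
  # Assumption: empty lines and URL lines are never titles.
  lines <- lines[nzchar(lines) & !grepl("^https?://", lines)]
  data.frame(date = date, title = lines[1], second_title = lines[2],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(articles, extract_article))
result
```

Because every article yields exactly one row, a missing title appears as NA rather than silently shortening the result, which should make the two "lost" titles easier to locate.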

r2evans
  • Thanks for this very fast answer and the well-working solution! Unfortunately, I must have discovered some irregularity now, since my script finds two titles fewer than expected. So I want to find, say, 40 characters after the date to locate the error. I added \\s+((?:\\w+(?:\\s+|$)){40}) after the pattern for finding the dates, but it does not work properly. Any suggestions? – hyhno01 Feb 28 '19 at 19:25
  • I cannot even begin to help without seeing the strings that produced the problems! This is a good example of when unit-test philosophy might be appropriate: provide a few examples to include things like (1) should match exactly one; (2) looks similar but should not match; (3) might match more than one; etc. Please edit your question and include more examples. – r2evans Feb 28 '19 at 19:42
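Regarding the error location asked about in the comments: one way to look at the text after each date might be to extend the date pattern with `[\\s\\S]{0,50}`, which matches up to 50 characters of any kind, including \r, \n and \t (the `\\w+` attempt only allows word characters and whitespace). This is a sketch on made-up data; `tmp_txt` here is a hypothetical two-article string, and the check against the regular header is an assumption about what counts as "irregular".

```r
# Hypothetical stand-in for the real tmp_txt: one regular and one irregular article.
tmp_txt <- paste0(
  "Newspaper.com \tTopic \t12.02.2015\r\n\t\r\n\r\nregular_title\r\n\r\nbody text\r\n\r\n",
  "Newspaper.com \tTopic \t03.09.2015\r\n\t\r\nKurz\r\n\r\nhttp://example.com/some-article\r\n\r\nirregular_title\r\n\r\nbody text\r\n\r\n"
)

# The date followed by up to 50 characters of anything.
context_re <- "\\b[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}\\b[\\s\\S]{0,50}"
snippets <- regmatches(tmp_txt, gregexpr(context_re, tmp_txt, perl = TRUE))[[1]]

# Flag snippets where the date (10 characters) is not directly followed
# by the regular header sequence.
irregular <- !startsWith(substring(snippets, 11), "\r\n\t\r\n\r\n")

# Printing shows the control characters in escaped form.
print(snippets[irregular])
```

Note that matches cannot overlap, so a second date that falls within 50 characters of the previous one would be swallowed by the first snippet.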

It depends on how rigid the structure is before the date and title. You mention it's different for the title, so it would be great if you could supply us with some more strings in a vector, together with the desired output titles.

If it is consistent, you can use lookbehind groups (which must match but are not part of the result) to leave out the parts you are not interested in, e.g.

txt <- "Newspaper.com \tTopic \t12.02.2015\r\n\t\r\n\r\nher_goes_the_title\r\n\r\ntext_containing_\r\n\r\nsometimes"

library(stringi)

before_date <- "Newspaper.com \tTopic \t"
# lookbehind for the text before the date (not returned); the date has the format nn.nn.nnnn
date <- stringi::stri_extract_first_regex(txt, 
                                          sprintf("(?<=%s)\\d{2}\\.\\d{2}\\.\\d{4}",
                                                  before_date))
date

before_title <- sprintf("%s%s\r\n\t\r\n\r\n", before_date, date)
# find all characters not \r or \n and return, after the initial sequence
title <- stringi::stri_extract_first_regex(txt,
                                           sprintf("(?<=%s)[^\\r\\n]*",
                                                   before_title))
title

Here, (?<=News)paper would return just paper when extracting with this regex pattern; see, e.g., "Regex with non-capturing group using stringr in R".
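If the newspaper and topic vary between articles, hardcoding `before_date` becomes awkward. A possible alternative, sketched here with hypothetical example strings, is to use capture groups with `stri_match_first_regex`, which returns the whole match and each group as matrix columns. It assumes the regular header (tab, date, then \r\n\t\r\n\r\n) and will return NA for articles that deviate from it.

```r
library(stringi)

# Hypothetical input: one element per article; the third has an irregular header.
articles <- c(
  "Newspaper.com \tTopic \t12.02.2015\r\n\t\r\n\r\nfirst_title\r\n\r\ntext",
  "Other.org \tSports \t15.09.2015\r\n\t\r\n\r\nsecond_title\r\n\r\ntext",
  "Other.org \tSports \t16.09.2015\r\n\t\r\nKurz\r\n\r\nthird_title\r\n\r\ntext"
)

# Group 1: the date; group 2: the first line after the regular header.
pattern <- "\\t(\\d{2}\\.\\d{2}\\.\\d{4})\\r\\n\\t\\r\\n\\r\\n([^\\r\\n]*)"
m <- stringi::stri_match_first_regex(articles, pattern)

# Column 1 of m is the full match, columns 2 and 3 are the capture groups.
result <- data.frame(date = m[, 2], title = m[, 3], stringsAsFactors = FALSE)
result
```

Rows with NA then point directly at the articles whose header does not follow the regular layout.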

Jonny Phelps