How to read a PDF up to a certain end line?

Question

I am looping over many research papers with a for loop, and I want to extract part of the content of each document I read.

How can I make R read only up to the last line that contains a run of many dots, and mark that line as the end line? The lines in question follow this pattern:

[Numbers] [Letter][Dots][Number]

If a document has no line with many dots, R should stop there and mark that as the end line.

For example, I have the following code, but it does not work for other documents, because they sometimes have different endings.

if (nrow(pdf[pdf$text == "References ." & pdf$element_id == "2", ]) == 1 && !exists("endline")) {
  endline <- pdf$line_id[pdf$text == "References ." & pdf$element_id == "2"]
}

R should read the whole document but keep only the part up to the last line that contains many dots.

I also tried grep(): if(nrow(pdf[grep("\\......... ([0-9]{3})$",pdf$text, perl = TRUE),])){ endline <-grep("\\......... ([0-9]{3})$",pdf$text, perl = TRUE) } - Bakai Baiazbekov, Apr 11 '19 at 10:21
Hi @Bakai. Can you please edit your question to add these elements instead of adding a comment? That way, all relevant elements will be directly found in your question. - PJProudhon, Apr 11 '19 at 10:28

Ildar Akhmetov · Answer 1 · 2019-04-11T12:45:04.530

This regex should help:

(\.+\s*\d+\n)(?!\d)

Explanation:

(\.+\s*\d+\n) - one or more dots and a page number (with optional spaces between them), followed by an end-of-line character

(?!\d) - a negative lookahead, which means that there are NO digits at the beginning of the next line.

The negative lookahead is what finds the last occurrence of the pattern: every earlier dotted line is followed by a line that starts with a number, so only the final one matches.

Working example: https://regex101.com/r/gIrhxf/2
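
A minimal sketch in R of the same idea, applied line by line rather than to the whole text. It assumes, as in the question, a data frame pdf with one row per extracted line and columns text and line_id. The sample rows, the helper name find_endline, and the threshold of three or more dots are illustrative assumptions, not part of the original code. Because the rows are already split into lines, there is no \n to match, so taking the last matching row stands in for the negative lookahead.

```r
# Hypothetical stand-in for the asker's data frame: one row per extracted line.
pdf <- data.frame(
  line_id = 1:6,
  text = c(
    "Contents",
    "1 Introduction ........ 1",
    "2 Methods ............. 12",
    "3 References .......... 101",
    "Abstract",
    "This paper studies ..."
  ),
  stringsAsFactors = FALSE
)

# find_endline() is a hypothetical helper, not an existing library function.
# It returns the line_id of the last line that ends with a run of dots
# (three or more, an assumed threshold), optional spaces, and a page number.
# If no line matches, it returns NA so the caller can decide what to do.
find_endline <- function(df, pattern = "\\.{3,}\\s*\\d+\\s*$") {
  hits <- grep(pattern, df$text, perl = TRUE)
  if (length(hits) == 0) {
    return(NA_integer_)
  }
  df$line_id[max(hits)]
}

endline <- find_endline(pdf)

# Keep only the rows up to and including the end line, when one was found.
if (!is.na(endline)) {
  pdf_head <- pdf[pdf$line_id <= endline, ]
}
```

On the sample rows, endline is the line_id of the "3 References" row. The last sample row ends in dots but has no page number, so it is not matched. Inside a loop over many papers, the NA case would be the place to apply whatever fallback suits documents without dotted lines.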

edited Apr 11 '19 at 12:45

answered Apr 11 '19 at 12:22
