
I need to extract the body text from each article in my corpus for text mining, because at the moment the reference sections are included and bias my results. All coding is done in R using RStudio. I have tried several techniques, described below.

I have text mining code (only the first part is included below), but recently found out that simply text mining a whole corpus of research articles is not enough, because the reference sections bias the results; the reference sections on their own might support a separate analysis, which would be a bonus.

EDIT: perhaps there is an R package that I am not aware of

My first approach was to convert the PDFs to text and then clean them with regex within quanteda. As a reference I intended to follow Westergaard et al.: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005962&rev=1 . Their method confuses me, not only in writing an equivalent regex, but in how to detect the *last* "References" heading, so that earlier occurrences of the word "reference" in the body do not cut off portions of the text. I have been in contact with their team, but am waiting to learn more about their code since they appear to use a more streamlined pipeline now.
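Something along these lines is what I have in mind, though it is only a rough sketch: the heading pattern is a guess and will not cover every journal's layout.

    # Sketch: truncate each document at its LAST "References"/"Bibliography" heading.
    # Assumes `files` is a character vector of PDF paths (as below) and that the
    # heading sits on its own line; the pattern is an assumption, not a general rule.
    library(pdftools)
    library(stringr)

    drop_references <- function(path) {
      txt  <- paste(pdf_text(path), collapse = '\n')
      hits <- str_locate_all(txt, regex('^\\s*(references|bibliography)\\s*$',
                                        ignore_case = TRUE, multiline = TRUE))[[1]]
      if (nrow(hits) > 0) {
        txt <- substr(txt, 1, hits[nrow(hits), 'start'] - 1)  # keep text before last hit
      }
      txt
    }

    bodies <- vapply(files, drop_references, character(1))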

PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. To use the pubchunks package I need to convert all of my PDFs (already converted to text) into XML. This should be straightforward, but the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R concern. (Code relating to trickypdf is included below.)
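If no R package pans out, my fallback would be to drive an external command-line converter from R, roughly like this (sketch only; `pdf-to-xml-tool` is a placeholder name, not the real invocation of any particular tool):

    # Sketch: batch-convert PDFs to XML by shelling out to an external converter.
    # `pdf-to-xml-tool` is a placeholder; substitute the real executable and its
    # actual arguments for whichever tool ends up being used.
    pdf_files <- list.files(pattern = 'pdf$', full.names = TRUE)
    for (f in pdf_files) {
      out <- sub('\\.pdf$', '.xml', f)
      system2('pdf-to-xml-tool', args = c(shQuote(f), shQuote(out)))
    }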

For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources (guides, etc.) for this package, and the developers have shifted their focus to a larger package, written in a different language, that happens to include LAPDF-text.

EDIT: I installed Java 1.6 (SE 6) and Maven 2.0, then ran the LAPDF-text installer, which seemed to work. That said, I am still having issues with this process and with mvn commands recognizing folders, though I am continuing to work through it.

I am guessing someone else out there has done this before and got their hands dirty, as there are related research papers with similarly vague descriptions of the process. Any recommendations are greatly appreciated.

Cheers

    library(quanteda)
    library(pdftools)
    library(tm)
    library(methods)
    library(stringi) # regex pattern
    library(stringr) # simpler than stringi ; uses stringi on backend

    setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
    files <- list.files(pattern = 'pdf$')
    summary(files)
    files

    # Length 63 
    corpus_tm <- Corpus(URISource(files),
                 readerControl = list(reader = readPDF()))
    corpus_tm
    # documents 63
    inspect(corpus_tm)
    meta(corpus_tm[[1]])
    # convert tm::Corpus to quanteda::corpus
    corpus_q <- corpus(corpus_tm)
    summary(corpus_q, n = 2)
    # Add Doc-level Variables here *by folder and meta-variable year
    corpus_q
    head(docvars(corpus_q))
    metacorpus(corpus_q)
    #_________

    # extract segments ~ later to remove segments
    # corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
    corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
    corpus_q_refA
    # Based upon Westergaard et al (15 Million texts; removing references)
    corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'),
                                 exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
    corpus_q_refB # earlier ERROR came from single backslashes ('\[' is not a valid escape in an R string)
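
    # Sketch of an alternative I have not fully verified: reshape to sentences,
    # then keep only sentences that do NOT look like a numbered reference entry
    # (e.g. "[12] Smith ..."), per the Westergaard et al. heuristic.
    corpus_q_sent <- corpus_reshape(corpus_q, to = 'sentences')
    keep_sent <- !str_detect(texts(corpus_q_sent), '^\\[\\d+\\]\\s[A-Za-z]')
    corpus_q_body <- corpus_subset(corpus_q_sent, keep_sent)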

    corpus_tm[1]

    sum(str_detect(corpus_q, '^Referen'))

    corpus_qB <- corpus_q
    RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
    cbind(texts(RemoveRef_B), docvars(corpus_qB))

    # -------------------------
    # Idea taken from guide (must reference guide)
    # register the S3 classes so S4 dispatch on PlainTextDocument works
    # (may be a no-op if tm/NLP has already registered them)
    setOldClass(c('PlainTextDocument', 'TextDocument'))
    setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))
    setMethod('removeCitations', signature(object = 'PlainTextDocument'),
      function(object, ...) {
        txt <- content(object)  # tm accessor is content(), not Content()
        # EG for '>'  : citations <- grep('^[[:blank:]]*>.*', txt); if (length(citations) > 0) txt <- txt[-citations]
        # EG for '--' : signatureStart <- grep('^-- $', txt); if (length(signatureStart) > 0) txt <- txt[-(signatureStart:length(txt))]
        # using the 15-million-article removal guideline: drop lines that look
        # like numbered reference entries, e.g. "[12] Smith ..."
        citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', txt)
        if (length(citations) > 0) txt <- txt[-citations]
        content(object) <- txt
        object
      })
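
    # Hypothetical usage (untested here): apply the method to every document in
    # the tm corpus; tm_map() expects a function that takes and returns a
    # TextDocument, which removeCitations() above now does.
    corpus_tm_noRefs <- tm_map(corpus_tm, removeCitations)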


    # TRICKY PDF download from github
    library(pubchunks)
    library(polmineR)
    library(githubinstall)
    library(devtools)
    library(tm)
    githubinstall('trickypdf') # input Y then 1 if want all related packages
    # library(trickypdf)
    # This time suggested I install via 'PolMine/trickypdf'
    # Second attempt issue with RPoppler
    install_github('PolMine/trickypdf')
    library(trickypdf) # Not working
    # Failed to install: package 'Rpoppler' is not available (for R version 3.6.0)
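
    # Since Rpoppler will not install here, a fallback I am considering (sketch
    # only, not yet validated) is to skip trickypdf and build the quanteda
    # corpus directly from pdftools, which is already loaded above.
    pdf_texts <- sapply(files, function(f) paste(pdf_text(f), collapse = '\n'))
    corpus_q2 <- corpus(pdf_texts)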

Short of the Rpoppler issue above, the initial description should be sufficient.

UPDATE: Having reached out to several research groups, the TALN-UPF researchers got back to me and provided a pdfx Java program that has allowed me to convert my PDFs into XML easily. Of course, I have now learned that PubChunks is built around its sister program, which fetches XML from search engines, and is therefore of little use to me. That said, the TALN-UPF group will hopefully advise on whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If that is possible then everything will be accomplished; if not, I will be back at regex.
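In case it helps, this is the kind of post-processing I have in mind for the converted XML files. It is only a sketch: the XPath assumes the converter wraps the article body in a `<body>` element containing `<p>` paragraphs, which will differ between pdfx, Grobid/TEI and other tools, so the element names would need adjusting.

    # Sketch: pull body paragraphs out of the converted XML files.
    # ASSUMPTION: the body is wrapped in <body> with <p> children; the real
    # element names depend on the converter's output schema.
    library(xml2)

    xml_files <- list.files(pattern = 'xml$', full.names = TRUE)
    bodies <- sapply(xml_files, function(f) {
      doc <- read_xml(f)
      xml_ns_strip(doc)                         # drop default namespaces (e.g. TEI)
      paras <- xml_find_all(doc, './/body//p')  # adjust XPath to the real schema
      paste(xml_text(paras), collapse = '\n')
    })
    corpus_xml_bodies <- corpus(bodies)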

Hunter
  • Restating your question: you need help removing (or isolating) the reference sections of research articles. This is a very broad question and difficult to answer, as general rules might not cover all cases. 1. Could you at least point to (or add) 2 research papers you are trying to use, or some articles that could be used as an example? 2. How about the abstract and/or author summary parts? They might also influence or bias the result. – phiver Oct 23 '19 at 09:42
  • Hello, yes that is correct. Here are two articles that I used in earlier code: https://journals.sagepub.com/doi/pdf/10.1177/1010539510391644 https://www.tandfonline.com/doi/pdf/10.1080/23328940.2016.1216256?needAccess=true You are certainly correct that the authors and abstract would have an impact on the results. So too would the headers and footers, which could perhaps be removed via regex (assuming \f is used for page breaks). I am essentially looking for a way to text mine the body of each article as part of a corpus, but in order to do so I need to isolate the body of the text. – Hunter Oct 23 '19 at 10:06
  • `corpus_segment(corp, pattern = "\\s*references\\n", valuetype = "regex", pattern_position = "after")` works to remove references that come after the text while keeping the text. The option `pattern_position = "before"` would instead keep everything after the point where the references start. But this requires the pdf to have the text laid out in a normal way; two-column texts, as in the "Occupational heat stress" article, would have more removed than is warranted, as pdftools reads in the text line by line. – phiver Oct 26 '19 at 18:19
  • I believe that the authors of the text mining article above avoided this issue by removing sentences that started with typical reference format such as [1] etc. and avoided issues with columns by considering all sentences not starting with a lowercase and preceded by specific grammar as continuations of the preceding paragraph or line. Other sources remove specific lengths of white spaces and consider that a new sentence to avoid issues with columns. One way to avoid cutting off relevant sentences after "References" would be to cut all sentences starting with: ^\[\d+\]\s[A-Za-z]. Thoughts? – Hunter Oct 29 '19 at 01:13
  • Yes that is also a way to go about it. But another problem with the texts that are in 2 columns is that they tend to split words to continue on the next line e.g. ex- at the end of one and ample on the next. That should be 1 word. Rpoppler doesn't work on windows machines. pdftools does work, but finding the correct start and end points of the columns is an issue. The data is (sort of) available in the `pdf_data` function. But even the developer of pdftools says it is difficult to figure out correctly. Might be worth a SO bounty question. I might create one tomorrow. – phiver Oct 29 '19 at 09:21
  • That is a good point. One example by Charles Bordet uses str_split to split rows wherever two (or more) spaces appear, which would help with the column issue, but perhaps not with words that run past the end of a line. Above I made an update stating that I have been able to get the documents into XML form, but I am still interested in the regex method, since that conversion step may not necessarily allow me to extract the body of the text. THANK YOU SO MUCH for your help thus far; a successful bounty would be very much appreciated. – Hunter Oct 30 '19 at 03:03
  • Hello, to clear all this up: I ultimately decided to manually remove all parts of the texts that were not part of the body text. This was doable because I had ~2200 papers. I will note (A) I used CERMINE to convert all PDF files to .txt, and (B) I then used OCR software (easy to find, e.g. Adobe Pro) to convert PDFs that were in image form. Cheers all – Hunter Mar 31 '20 at 07:10
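
For anyone landing here later, a minimal sketch of the line-level clean-up discussed in the comments above (rejoining words hyphenated across line breaks and dropping lines that look like numbered reference entries). It assumes reasonably clean single-column text from `pdftools::pdf_text()` and is not a general solution:

    # Sketch of the clean-up discussed in the comments: de-hyphenate line breaks
    # and drop lines that look like numbered reference entries ("[12] Smith ...").
    # Assumes reasonably clean single-column text from pdftools::pdf_text().
    clean_pages <- function(path) {
      txt   <- paste(pdf_text(path), collapse = '\n')
      txt   <- gsub('-\n\\s*', '', txt)   # rejoin words split across lines
      lines <- unlist(strsplit(txt, '\n'))
      lines <- lines[!grepl('^\\s*\\[\\d+\\]\\s[A-Za-z]', lines)]
      paste(lines, collapse = '\n')
    }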

0 Answers