Remove a section from Corpus

Question

I have a quanteda corpus of hundreds of documents. How do I remove specific sections, such as the abstract and footnotes? Otherwise, I am faced with doing it manually. Thanks

As requested, here is a text example. It is from a regular journal article. It shows the metadata, then the abstract, then keywords, then introduction, then author contact details, then body of article, then Note, then Disclosure statement, then Notes on contributors, then references. I would like to remove everything apart from the introduction and body of the article. I would also like to remove the author name and journal title, which are repeated throughout.

" Behavioral Sciences of Terrorism and Political Aggression

    ISSN: 1943-4472 (Print) 1943-4480 (Online) Journal homepage: http://www.tandfonline.com/loi/rirt20

Sometimes they come back: responding to

American foreign fighter returnees and other

Elusive threats

Christopher J. Wright

To cite this article: Christopher J. Wright (2018): Sometimes they come back: responding to

American foreign fighter returnees and other Elusive threats, Behavioral Sciences of Terrorism and

Political Aggression, DOI: 10.1080/19434472.2018.1464493

To link to this article: https://doi.org/10.1080/19434472.2018.1464493

     Published online: 23 Apr 2018.

     Submit your article to this journal

     Article views: 57

     View related articles

     View Crossmark data

                     Full Terms & Conditions of access and use can be found at

             http://www.tandfonline.com/action/journalInformation?journalCode=rirt20

" "BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION, 2018

https://doi.org/10.1080/19434472.2018.1464493

Sometimes they come back: responding to American foreign

fighter returnees and other Elusive threats

Christopher J. Wright

Department of Criminal Justice, Austin Peay State University, Clarksville, TN, USA

ABSTRACT                                                                          ARTICLE HISTORY

Much has been made of the threat of battle hardened jihadis from                  Received 8 January 2018

Islamist insurgencies, especially Syria. But do Americans who                     Accepted 10 April 2018

return home after gaining experience fighting abroad pose a

                                                                                  KEYWORDS

greater risk than homegrown jihadi militants with no such                         Terrorism; foreign fighters;

experience? Using updated data covering 1990–2017, this study                     domestic terrorism;

shows that the presence of a returnee decreases the likelihood                    homegrown terrorism;

that an executed plot will cause mass casualties. Plots carried out               lone-wolf; homeland security

Introduction: being afraid. Being a little afraid

How great of a threat do would-be jihadis pose to their home country? And do those who

return home after gaining experience fighting abroad in Islamist insurgencies or attending

terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined

by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might

CONTACT Christopher J. Wright wrightc@apsu.edu Department of Criminal Justice, Austin Peay State University,

Clarksville, TN 37043, USA

" "2 C. J. WRIGHT

Many of the earliest studies on Western foreign fighters suggested that those who

returned were in fact more deadly than those with no experience fighting in Islamist insur-

gencies. Hegghammer’s (2013) analysis suggested that these foreign fighter returnees

were a greater danger than when they left. Likewise, Byman (2015), Nilson (2015),

Kenney (2015), and Vidno (2011) came to similar conclusions while offering key insights

into the various mechanisms linking foreign fighting with successful plot execution and

greater mass casualties.

Other studies came to either mixed conclusions or directly contradicted the earlier find-

ings. Adding several years of data to Hegghammer’s (2013) earlier study, Hegghammer

" " BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3

for them to form the types of large, local networks that would be necessary to carry out a

large-scale attack without attracting the attention of security services’ (p. 92).

"

Note

1. Charges were brought against Noor Zahi Salman, the widow of the Omar Mateen who carried

   out the June, 2016 attack against the Pulse Nightclub in Orlando, Florida (US Department of

   Justice., 2017a, January 17). However, in March of 2018 a jury acquitted her of the charges that

   she had foreknowledge of the attack.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes on contributors

Christopher J. Wright, Ph.D., is an Assistant Professor at Austin Peay State University where he

teaches in the Homeland Security Concentration.

ORCID

Christopher J. Wright http://orcid.org/0000-0003-0043-6616

References

Byman, D. (2015). The homecomings: What happens when Arab foreign fighters in Iraq and Syria

return? Studies in Conflict & Terrorism, 38(8), 581–602.

Byman, D. (2016). The Jihadist returnee threat: Just how dangerous? Political Science Quarterly, 131(1),

69–99.

Byman, D., & Shapiro, J. (2014). Be afraid. Be a little afraid: The threat of terrorism from Western foreign

fighters in Syria and Iraq. Foreign Policy at Brookings. Washington, DC: Brookings. Retrieved from

https://www.brookings.edu/wp-content/uploads/2016/06/Be-Afraid-web.pdf

Without an example of your text, this question cannot be answered precisely. But the solution lies in `corpus_segment()`. Take a look at that function. If you want a full answer, please post a more detailed question complete with an example of your text and your expected output. — Ken Benoit, Jun 22 '18 at 09:48
Hi. I am still struggling with this. I have looked at various help guides, but to no avail. I have a corpus with thousands of documents. Each document has various sections that I would like to remove, including abstract, bibliographic notes, author contact details etc. How can I avoid having to go through each document manually? Thanks — Nicholas Bradley, Jul 06 '18 at 10:30
OK - just to be clear, the example text above is what you want to _remove_? So what you want is just the body of the text? What if you include a toy version of an actual document, where instead of removing the part you want kept, you include it as a single short paragraph. You can shorten the other sections too. Once I have that example that includes what you want removed AND kept, I can answer this for you. — Ken Benoit, Jul 06 '18 at 10:34
I added such a text example in an edit to the original question. If you need further information, I will happily supply it. Thanks — Nicholas Bradley, Jul 06 '18 at 10:50
The key to this really lies in the details, so how about you put the text of the document as a code block, as it appears in your file. Even better would be to include a link to a text file. State what you want to extract, for example: "All text from the 'ABSTRACT' through 'Introduction', then all text after 'Introduction' until 'CONTACT' ". — Ken Benoit, Jul 06 '18 at 10:58
Sorry, I copied it straight from a text file, so not sure why it turned out like that. Could I email to your LSE email please? — Nicholas Bradley, Jul 06 '18 at 11:40
But basically, I want to remove: All text from metadata through to Introduction. Keeping the introduction, I then want to remove contact details. I want to keep the body of the text. All remaining text at the end can be removed. So: Note through to References. I also want to remove the author name and journal title which get repeated every page. Thanks — Nicholas Bradley, Jul 06 '18 at 12:09

score 0 · Answer 1 · answered Jul 06 '18 at 18:51

The approach

The key here is to determine the regular markers that precede each section, and then to use them as tags in a call to corpus_segment(). It's the tags that will need tweaking, based on their degree of regularity across documents.

Based on what you supplied above, I pasted that into a plain text file that I named example.txt. This code extracted the Introduction and what I think is the body of the article, but for that I had to decide on a tag that marked its ending. Below, I used "Disclosure statement". So:

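library("quanteda")

# Read the example file and convert it to a corpus
crp <- readtext::readtext("~/tmp/example.txt") %>% 
    corpus()
# Section markers used as tags; with the default glob matching, "?" matches
# any single character (here the ":" after "Introduction")
pat <- c("\nIntroduction?", "\nCONTACT", "©", "\nDisclosure statement")

# Split the document into one new document per tagged section
crpextracted <- corpus_segment(crp, pattern = pat)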

summary(crpextracted)
## Corpus consisting of 4 documents:
##     
##          Text Types Tokens Sentences              pattern
## example.txt.1    62     74         5        Introduction:
## example.txt.2    18     21         2              CONTACT
## example.txt.3   156    253        11                    ©
## example.txt.4   101    180        19 Disclosure statement
## 
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/quanteda/* on x86_64 by kbenoit
## Created: Fri Jul  6 19:51:01 2018
## Notes: corpus_segment.corpus(crp, pattern = pat)

When you examine the text in the "Introduction:" tagged segment, you can see that everything from that string until the next tag was extracted as a new document:

corpus_subset(crpextracted, pattern == "\nIntroduction:") %>%
    texts() %>% cat()
## being afraid. Being a little afraid
## 
## How great of a threat do would-be jihadis pose to their home country? And do those who
## 
## return home after gaining experience fighting abroad in Islamist insurgencies or attending
## 
## terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined
## 
## by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might
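
To make the tag idea concrete, here is a minimal base R sketch of the same kind of splitting on a toy string. This is not how corpus_segment() is implemented; the toy text, the tags vector, and the assumption that each tag occurs exactly once and in this order are all mine. Unlike the corpus_segment() output above, this sketch keeps the tag at the start of each segment.

```r
# Toy document; the section markers mirror those in the example above
txt <- paste(
  "ABSTRACT", "Summary text.",
  "Introduction: being afraid", "Intro paragraph.",
  "CONTACT someone@example.org",
  "Disclosure statement", "No conflict.",
  sep = "\n"
)

# Hypothetical tags, assumed to occur once each and in this order
tags <- c("\nIntroduction", "\nCONTACT", "\nDisclosure statement")

# Character position where each tag starts
starts <- vapply(tags, function(tag) regexpr(tag, txt, fixed = TRUE)[1],
                 integer(1))

# Each segment runs from its tag to just before the next tag (or end of text)
ends <- c(starts[-1] - 1L, nchar(txt))
segments <- trimws(substring(txt, starts, ends))
names(segments) <- trimws(tags)

cat(segments[["Introduction"]])
```

Once the text is split like this, keeping only the wanted sections is a matter of selecting segments by name, which is what the corpus_subset() call above does with the pattern variable.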

How to remove pdf junk

PDF conversions typically produce unwanted junk such as running headers and footers. Here's how to remove them. (Note: you will want to do this before the step above.) Constructing the toreplace pattern requires some familiarity with regular expressions and some experimentation.

library("stringr")
toreplace <- '\\n*\" \" BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION,{0,1} \\d+\\n*'
texts(crp) <- str_replace_all(texts(crp), regex(toreplace), "")
cat(texts(crp))

To demonstrate this on a section from your example:

# demonstration
x <- '
" " BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3

'
str_replace_all(x, regex(toreplace), "")
## [1] ""
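
The author name in the running head (for example `" "2 C. J. WRIGHT`) could be handled the same way. The following is only a sketch: the author_header pattern is my guess based on the single occurrence in your example and would need adjusting for other authors or layouts. It uses base R's gsub() with perl = TRUE so that it runs without extra packages; str_replace_all() should work equally well.

```r
# Hypothetical pattern for the author running head, e.g. '" "2 C. J. WRIGHT':
# optional newlines, the two quote marks, a page number, then the name
author_header <- '\\n*" ?"\\d+ C\\. J\\. WRIGHT\\n*'

# A short excerpt shaped like the page break in the example text
y <- 'Clarksville, TN 37043, USA\n\n" "2 C. J. WRIGHT\n\nMany of the earliest studies'

# Replace the header and its surrounding newlines with one paragraph break
cleaned <- gsub(author_header, "\n\n", y, perl = TRUE)
cat(cleaned)
```

Replacing with a paragraph break rather than an empty string keeps the text before and after the page break from being glued together.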
