
My goal is to pull out a specific section from a set of Word documents based on key words. I'm having trouble parsing out specific sections of text from a larger data set of text files. The data set originally looked like this, with "title one" and "title two" marking the start and end of the text I am interested in, and unimportant words marking the part of each text file I am not interested in:

**Text**           **Text File** 
title one           Text file 1
sentence one        Text file 1
sentence two        Text file 1
title two           Text file 1
unimportant words   Text file 1
title one           Text file 2
sentence one        Text file 2

Then I used as.character() to convert the data to character vectors and unnest_tokens() to tidy the data:

df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")

I would now like to look only at the sentences in my data set and exclude the unimportant words. "title one" and "title two" are the same in every text file, but the sentences between them differ. I've tried the code below, but it does not seem to work:

filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))

2 Answers


If you'd like a tidyverse option that involves very few lines of code, give this a look. You can use case_when() and str_detect() to find the rows in your data frame that contain the signals for important/not important.

library(tidyverse)

df1 <- df %>%
  mutate(important = case_when(str_detect(Text, "title one") ~ TRUE,
                               str_detect(Text, "title two") ~ FALSE))
df1 
#> # A tibble: 11 x 3
#>    Text              File        important
#>    <chr>             <chr>       <lgl>    
#>  1 title one         Text file 1 TRUE     
#>  2 sentence one      Text file 1 NA       
#>  3 sentence two      Text file 1 NA       
#>  4 title two         Text file 1 FALSE    
#>  5 unimportant words Text file 1 NA       
#>  6 title one         Text file 2 TRUE     
#>  7 sentence one      Text file 2 NA       
#>  8 sentence two      Text file 2 NA       
#>  9 sentence three    Text file 2 NA       
#> 10 title two         Text file 2 FALSE    
#> 11 unimportant words Text file 2 NA

Now you can use fill() from tidyr to fill those values down.

df1 %>%
  fill(important, .direction = "down")
#> # A tibble: 11 x 3
#>    Text              File        important
#>    <chr>             <chr>       <lgl>    
#>  1 title one         Text file 1 TRUE     
#>  2 sentence one      Text file 1 TRUE     
#>  3 sentence two      Text file 1 TRUE     
#>  4 title two         Text file 1 FALSE    
#>  5 unimportant words Text file 1 FALSE    
#>  6 title one         Text file 2 TRUE     
#>  7 sentence one      Text file 2 TRUE     
#>  8 sentence two      Text file 2 TRUE     
#>  9 sentence three    Text file 2 TRUE     
#> 10 title two         Text file 2 FALSE    
#> 11 unimportant words Text file 2 FALSE

Created on 2018-08-14 by the reprex package (v0.2.0).

At this point, you can filter(important) to keep only the text that you want, and then you can use functions from tidytext to do text mining on the important text you have left.
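Putting those steps together, here is a minimal self-contained sketch of that final filtering step (using a small example data frame standing in for your data; the extra `!str_detect(Text, "title")` condition, an assumption on my part, also drops the title rows themselves so only the sentences remain):

```r
library(tidyverse)

# Small stand-in for the example data above
df <- tibble(
  Text = c("title one", "sentence one", "sentence two",
           "title two", "unimportant words"),
  File = "Text file 1"
)

# Flag the boundary rows, fill the flag downward, then keep
# only the important rows, excluding the title rows themselves
important_text <- df %>%
  mutate(important = case_when(
    str_detect(Text, "title one") ~ TRUE,
    str_detect(Text, "title two") ~ FALSE
  )) %>%
  fill(important, .direction = "down") %>%
  filter(important, !str_detect(Text, "title")) %>%
  select(-important)

important_text
# "sentence one" and "sentence two" remain
```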

Julia Silge

I'm not familiar with the tidytext package, so here's an alternative base R solution, using this expanded example data (creation code included at the bottom):

> df
                Text        File
1          title one Text file 1
2       sentence one Text file 1
3       sentence two Text file 1
4          title two Text file 1
5  unimportant words Text file 1
6          title one Text file 2
7       sentence one Text file 2
8       sentence two Text file 2
9     sentence three Text file 2
10         title two Text file 2
11 unimportant words Text file 2

Write a function that builds a separate column indicating whether each row should be kept or dropped, based on the value in the Text column. Details in comments:

get_important_sentences <- function(df_) {
  # Create some variables for filtering
  val = 1
  keep = c()

  # For every text row
  for (x in df_$Text) {
    # Multiply the current val by 2
    val = val * 2

    # If the current text includes "title",
    # reset val: 1 for 'title one', 0 for
    # 'title two'
    if (grepl("title", x)) {
      val = ifelse(grepl("one", x), 1, 0)
    }

    # append val to keep each time
    keep = c(keep, val)
  }

  # keep is now a numeric vector- add it to
  # the data frame
  df_$keep = keep

  # exclude any rows where 'keep' is 1 (for
  # 'title one') or 0 (for 'title two' or any
  # unimportant words), and drop the 'keep'
  # column by selecting only Text and File
  return(df_[df_$keep > 1, c("Text", "File")])
}

Then you can call that either on the whole data frame:

> get_important_sentences(df)
            Text        File
2   sentence one Text file 1
3   sentence two Text file 1
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2

Or on a per-file-source basis with lapply:

> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
          Text        File
2 sentence one Text file 1
3 sentence two Text file 1

$`Text file 2`
            Text        File
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2

Data:

df <-
  data.frame(
    Text = c(
      "title one",
      "sentence one",
      "sentence two",
      "title two",
      "unimportant words",
      "title one",
      "sentence one",
      "sentence two",
      "sentence three",
      "title two",
      "unimportant words"
    ),
    File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
    stringsAsFactors = FALSE
  )
Luke C