Simple section labeling with tidytext for plain text input

Question

I'm using tidytext (and the tidyverse) to analyze some text data (as in Tidy Text Mining with R).

My input text file, myfile.txt, looks like this:

# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>

with 60 or so sections.

I would like to generate a column section_name with the strings "Category 1 Name" or "Category 2 Name" as values for the corresponding lines. For instance, I have

library(tidyverse)
library(tidytext)
library(stringr)

fname <- "myfile.txt"
all_text <- readLines(fname)
all_lines <- tibble(text = all_text)
tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>%
  filter(!str_detect(text, regex("^#"))) %>%
  ungroup()

which adds a column in tidiedtext for the corresponding section number for each line.

Is it possible to add a single line to the call to mutate() to add such a column? Or is there another approach I ought to be using?

score 1 · Answer 1 · answered Feb 23 '17 at 22:17

I don't want to make you rewrite your entire script, but I found the question interesting and thought I'd add a base R attempt:

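parse_data <- function(file_name) {
  all_rows <- readLines(file_name)
  # Header lines start with '#'; assumes the file begins with a header line
  indices <- which(grepl('^#', all_rows))
  # Repeat each header index once per line of its section (header included)
  splitter <- rep(indices, diff(c(indices, length(all_rows) + 1)))
  lst <- split(all_rows, splitter)
  # The first element of each chunk is the header; the rest are its text lines
  lst <- lapply(lst, function(x) {
    data.frame(text = x[-1], section = x[1], stringsAsFactors = FALSE)
  })
  # Original line numbers of the non-header lines
  line_nums <- seq_along(all_rows)[-indices]
  df <- do.call(rbind.data.frame, lst)
  cbind.data.frame(df, linenumber = line_nums)
}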

Testing with a file named ipsum_data.txt:

parse_data('ipsum_data.txt')

yields:

 text                        section          linenumber
 Lorem ipsum dolor           # Section 1 Name 2         
 sit amet ... (et cetera)    # Section 1 Name 3         
 <multiple lines here again> # Section 2 Name 5

The file ipsum_data.txt contains:

# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>

I hope this proves useful.

Thanks for the response. This is very helpful. It's not a big deal for me to rewrite the script, but I think the other solution is more what I was looking for in terms of conciseness. — weinerjm, Feb 24 '17 at 06:37

score 0 · Accepted Answer · answered Feb 23 '17 at 21:34

Here's an approach that uses grepl (for simplicity) with if_else and tidyr::fill, but there's nothing wrong with the original approach; it's quite similar to one used in the tidytext book. Also note that filtering after adding line numbers leaves gaps in the numbering where the header lines were removed. If that matters, add the line numbers after the filter.

library(tidyverse)

text <- '# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>'

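# tibble() replaces the deprecated data_frame()
all_lines <- tibble(text = read_lines(text))

tidied <- all_lines %>%
  mutate(line = row_number(),
         # Header lines carry their own text; all other lines start as NA
         section = if_else(grepl('^#', text), text, NA_character_)) %>%
  fill(section) %>%               # carry each header down to the lines below it
  filter(!grepl('^#', text))      # drop the header lines themselves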

tidied
#> # A tibble: 3 × 3
#>                          text  line          section
#>                         <chr> <int>            <chr>
#> 1           Lorem ipsum dolor     2 # Section 1 Name
#> 2    sit amet ... (et cetera)     3 # Section 1 Name
#> 3 <multiple lines here again>     5 # Section 2 Name
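
If the gaps in the line numbers matter, one option is to number the rows only after the header lines are gone. This is a minimal sketch under the same assumptions as the code above; the object name renumbered and the inline sample vector are illustrative only.

```r
library(dplyr)
library(tidyr)

# Inline stand-in for the sample text used above
all_lines <- tibble(text = c("# Section 1 Name",
                             "Lorem ipsum dolor",
                             "sit amet ... (et cetera)",
                             "# Section 2 Name",
                             "<multiple lines here again>"))

renumbered <- all_lines %>%
  mutate(section = if_else(grepl("^#", text), text, NA_character_)) %>%
  fill(section) %>%                 # carry each header down to its lines
  filter(!grepl("^#", text)) %>%    # drop the header rows first ...
  mutate(line = row_number())       # ... then number what remains

renumbered
```

With this ordering the line column runs consecutively over the kept rows instead of reflecting positions in the original file.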

Or, if you only want to format the section numbers you already have, add section_name = paste('Category', section_id, 'Name') to your mutate call.
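
As a rough sketch, that extra argument could sit in the question's pipeline as shown here. The inline all_text vector is a stand-in for readLines("myfile.txt") and is assumed for illustration only; the column names follow the question.

```r
library(dplyr)
library(stringr)

# Stand-in for readLines("myfile.txt"); same layout as the question's file
all_text <- c("# Section 1 Name",
              "Lorem ipsum dolor",
              "sit amet ... (et cetera)",
              "# Section 2 Name",
              "<multiple lines here again>")

tidiedtext <- tibble(text = all_text) %>%
  mutate(linenumber = row_number(),
         # Running count of header lines gives each line its section number
         section_id = cumsum(str_detect(text, "^#")),
         # mutate() evaluates in order, so section_id is already available here
         section_name = paste("Category", section_id, "Name")) %>%
  filter(!str_detect(text, "^#"))

tidiedtext
```

This only reformats the running section number; it does not pull the real header text out of the file, which is what the fill() approach above does.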

Thanks! This is pretty much what I was looking for. — weinerjm, Feb 24 '17 at 06:41
