
I have a bunch of .txt files of job descriptions and I want to import them to do text mining analyses.

Please find some sample text files here: https://sample-videos.com/download-sample-text-file.php. Please use the 10kb and 20kb versions, because the job descriptions are of different lengths.

After combining them, I would like to do tidy text analyses and create document-term matrices.

What I have done thus far:

file_list <- list.files(pattern="*.txt")
list_of_files <- lapply(file_list, read.delim)
mm <- merge_all(list_of_files) # this line doesn't work because the column headers of the lists are different
## Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I would appreciate an answer that either helps me merge these lists into a data frame OR tells me a better way to import these text files OR sheds light on how to do tidy text analysis on lists rather than data frames.

Thanks!

  • You might want to `rbind` instead of `merge`. Those just look like unlabeled paragraphs. `merge`-ing would involve matching by identical values. Those were NOT job descriptions, just junk text. – IRTFM Dec 06 '18 at 00:37
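
As a minimal sketch of that suggestion (assuming each file really is just unlabeled paragraphs of text, with no header row), the files can be stacked row-wise instead of merged:

# read each file as a one-column data frame of paragraphs ("text"),
# then stack the pieces; rbind works because every piece has the same column
file_list <- list.files(pattern = "\\.txt$")
list_of_files <- lapply(file_list, read.delim,
                        header = FALSE, col.names = "text",
                        stringsAsFactors = FALSE)
mm <- do.call(rbind, list_of_files)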

1 Answer


One approach could be to use the dplyr package and a for loop to import each file and combine them into a single data frame, with the filename and paragraph number used as an index, then use tidytext to tidy up:

#install.packages(c("dplyr", "tidytext"))
library(dplyr)
library(tidytext)

file_list <- list.files(pattern="\\.txt$")

texts <- data.frame(file=character(),
                    paragraph=numeric(),
                    text=character(),
                    stringsAsFactors = FALSE) # creates empty dataframe

for (i in seq_along(file_list)) {
  p <- read.delim(file_list[i],
                  header=FALSE,
                  col.names = "text",
                  stringsAsFactors = FALSE) # read.delim here is automatically splitting by paragraph
  p <- p %>% mutate(file=sub("\\.txt$", "", file_list[i]), # add filename (minus .txt) as label
                    paragraph=row_number()) # add paragraph number
  texts <- bind_rows(texts, p) # adds to existing dataframe
}

words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed

Your final output would then be:

head(words)
                   file paragraph        word
1   SampleTextFile_10kb         1       lorem
1.1 SampleTextFile_10kb         1       ipsum
1.2 SampleTextFile_10kb         1       dolor
1.3 SampleTextFile_10kb         1         sit
1.4 SampleTextFile_10kb         1        amet
1.5 SampleTextFile_10kb         1 consectetur
...
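
Since the question also mentions creating document-term matrices, here is a minimal sketch of one way to get there from `words` (assuming the tm package is installed, which `cast_dtm()` relies on):

library(dplyr)
library(tidytext)

# count how often each word appears in each file, then cast the counts
# into a tm-style DocumentTermMatrix (one row per file, one column per word)
dtm <- words %>%
  count(file, word) %>%
  cast_dtm(document = file, term = word, value = n)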

Is this what you're looking for, for your next stages of analysis?

Andy Baxter
  • Thank you for your quick reply and this is very helpful; however, I was wondering if there was a way to put each whole job description (so each .txt file) into its own row OR have the .txt files be split into columns by key paragraphs in the job description (for instance, most job descriptions have a section talking about "position responsibilities" or "duties" and then a section about "Qualifications"). OR if you know of any easy way to convert character strings into numerics that would be applicable across many job descriptions (so different language and formatting etc.). Thank you! – Reuben Sarwal Dec 06 '18 at 09:10
  • Ah, that's pretty complex! I think this would depend on the format of the txt files you're using - R would need to know how to identify a header and associate it with the following text. How many text files are you using, how are they laid out, and is it feasible to edit each one? Perhaps there's a simpler way though. – Andy Baxter Dec 06 '18 at 12:23
  • I have about 500 job description text files where each one has ROUGHLY, a position summary section, a responsibilities section, and a qualifications section. The problem, of course, is that the language is not exactly similar across the job descriptions (duties vs. responsibilities) and the formatting is different (some put paragraphs between sections some do not). Thus, going through all of them to make them uniform would be quite time consuming I feel--but perhaps the only way? Thanks for all your help! – Reuben Sarwal Dec 06 '18 at 21:34
  • My best suggestion would be to edit the files to separate headings from the blocks of text using a Tab, then only having paragraph markers at the end of each section, maybe editing the text files using find/replace. Then `read.delim` could import them as a two-column table using `sep="\t"` - the first column being the headings and the second column the full text of each section. You could then use `unite` to bring bunches of similar columns together (combining 'duties' and 'responsibilities', for example). Sorry, that does seem tedious; maybe there's another way of doing it in R but I can't think of how – Andy Baxter Dec 07 '18 at 23:27
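
Following up on the comment thread, a minimal sketch of the simpler option raised above - putting each whole job description into its own row - assuming the readr package is available (`unnest_tokens()` can then tokenize exactly as in the answer):

library(dplyr)
library(readr)
library(tidytext)

file_list <- list.files(pattern = "\\.txt$")

# one row per file: read_file() returns each whole document as a single string
docs <- tibble(
  file = sub("\\.txt$", "", file_list),
  text = vapply(file_list, read_file, character(1))
)

words_by_doc <- docs %>% unnest_tokens(word, text)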