I have 584 .txt files that I would like to merge into one 584 x 4 tibble.

Important Background Info:

The files can be divided into three categories according to the labels embedded in the file names. Thus:

A_1_COD.txt, A_23_COD.txt, A_235_COD.txt, ..., A_457_COD.txt -> belong in Category A;

B_3_COD.txt, B_19_COD.txt, B_189_COD.txt, ..., B_355_COD.txt -> belong in Category B;

C_5_COD.txt, C_11_COD.txt, C_196_COD.txt, ..., C_513_COD.txt -> belong in Category C.

The file names shown in this section have been simplified for ease of comprehension. Examples of the real file names are: ENTITY_117_MOR.txt; INCREMENTAL_208_MOR.txt; MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, and MODERATE.

What the resulting tibble structure should be like:

A tibble: 584 x 4

row   filename    category   text
      <?>         <fct>      <chr>
1     A_1_COD     A          "Lorem ipsu-
2     B_2_COD     B          "Lorem ipsu-
3     C_3_COD     C          "Lorem ipsu-
.     .           .          .
.     .           .          .
.     .           .          .
584   A_584_COD   A          "Lorem ipsu-

What I have managed to do so far: Thanks to @awaji98, I managed to get three of the four columns I intend to have by using the following code:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$", full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% tibble()

The result was shown in a screenshot: the resulting table with all the data except for category.

Remaining problem to be solved: I need to get R to extract the categories embedded in the file names (i.e., ENTITY, INCREMENTAL, MODERATE) to fill the category column with the respective values.

@awaji98 suggested two possible paths. Here's the first one:

dat <- folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = T) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

This resulted in a category column filled with red "NAs."

Here's the second one:

## Use tidyr::extract to create two new columns from doc_id
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = T) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^([A-Z]+)_(.*).txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

This produced a tibble with two columns filled with red "NAs" (originally shown in a screenshot), which was not the expected output.

Final Solution

@awaji98 realized that the problem was with the regex. As it turned out, the file names began with a stray whitespace character, so a regex anchored with ^ directly in front of the capital letters never matched. The solution was to add a space to the front of each regex in the answer. Thus, the code that delivered the expected result was:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$", full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # the space after ^ matches the whitespace at the start of each file name
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

The final result was shown in a screenshot of the successful output.

Kind regards,
Á_C


1 Answer

You can use a combination of some common tidyverse functions and the useful readtext() from the package with the same name:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$", full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% tibble()

UPDATED:

Perhaps one of the following will get what you need. The first example keeps the filename column with the category at the front of each value:

folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$", full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

The second one uses tidyr::extract to create two columns from the doc_id, so filename drops the category part:

## Use tidyr::extract to create two new columns from doc_id
folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$", full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^ ([A-Z]+)_(.*)\\.txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
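
Both variants above hard-code exactly one leading space, so they would return NA again for any file name without it. A whitespace-tolerant regex may be safer. The following is only a minimal sketch: the tibble `docs` is a made-up stand-in for the readtext output, and it assumes the stray whitespace sits at the start of doc_id.

```r
library(tibble)
library(tidyr)

# Made-up stand-in for the readtext output (an assumption for illustration);
# two of the doc_id values carry a stray leading space, one does not.
docs <- tibble(
  doc_id = c(" ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt", " MODERATE_173_MOR.txt"),
  text   = c("Lorem ipsum", "Lorem ipsum", "Lorem ipsum")
)

# "^\\s*" allows zero or more whitespace characters before the category,
# so the same regex handles names with and without the leading space.
result <- tidyr::extract(docs, doc_id, into = "category",
                         regex = "^\\s*([A-Z]+)_", remove = FALSE)
result$category <- factor(result$category)

print(result)
```

The same regex can be dropped into either pipeline in place of "^ ([A-Z]+)_".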
awaji98
  • Hello awaji98, First and foremost, thanks so much for your help. I've tried what you suggested and I keep getting the following error message: Error in as_mapper(.f, ...) : object 'readtext' not found – Á_C Aug 05 '21 at 17:05
  • My guess would be that the package readtext didn't load properly. Could you load it okay? What was the message after the library(readtext) line? – awaji98 Aug 06 '21 at 05:09
  • Good call!!! I had to install readtext because it was not there. The only issue now is that the category column is empty. This is probably my fault. Maybe I should have given you examples of the real file names: ENTITY_117_MOR.txt; INCREMENTAL_208_MOR.txt; MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, & MODERATE. How can I get R to fill the empty column? Thank you so much, Á_C – Á_C Aug 06 '21 at 13:06
  • Unfortunately, neither does. In the first case, the entire category column is filled with red "NAs." In the second case, the same thing happens and all the filenames disappear from the second column and are replaced by red "NAs." Might there be a way to tell R something along the lines of "if filename contains "ENTITY," category= ENTITY; if filename contains "INCREMENTAL," category= INCREMENTAL; if filename contains "MODERATE," category= MODERATE." Apologies if this is what you're already trying to do and my scant knowledge of R won't let me see it. – Á_C Aug 06 '21 at 15:39
  • That doesn't sound good! My guess now is that an extract function from another package is causing a conflict. Try replacing extract with tidyr::extract (which explicitly tells R to use extract from the tidyr package). If that doesn't work, there's always a plan B. – awaji98 Aug 06 '21 at 16:00
  • It might be easier to see what is going on if you edit your question above, and add the results of the following, which will list the full text file paths: dir(folder) %>% head() – awaji98 Aug 06 '21 at 16:17
  • So, I tried using tidyr::extract instead and nothing changed. I also modified my question above. Is it more or less what you suggested I do? Thanks. – Á_C Aug 06 '21 at 17:57
  • As I expected, the problem was with the regex. Your filenames have a leading whitespace. I've added a space to the front of each regex in the answer, and it should work now. – awaji98 Aug 07 '21 at 02:43
  • **SUCCESS!!!** The first option is the answer. It delivers exactly what is expected. The second one works, but it removes the category name from the file name. **Thanks ever so much!!!** – Á_C Aug 07 '21 at 11:06
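
For reference, the keyword-matching idea raised in the comments ("if filename contains ENTITY, category = ENTITY") can also be written directly, without anchoring the regex to the start of the name. This is a minimal sketch rather than the approach used in the answer: the `filenames` vector is made up, and `labels` simply lists the three categories named in the question.

```r
library(stringr)

# Hypothetical file names (an assumption for illustration), with and without stray whitespace.
filenames <- c(" ENTITY_117_MOR", "INCREMENTAL_208_MOR", " MODERATE_173_MOR")

# The three category labels named in the question.
labels <- c("ENTITY", "INCREMENTAL", "MODERATE")

# Build the alternation "ENTITY|INCREMENTAL|MODERATE" and pull out whichever
# label occurs in each name; names containing none of the labels become NA.
category <- factor(
  str_extract(filenames, str_c(labels, collapse = "|")),
  levels = labels
)

print(category)
```

Because the pattern is not anchored with ^, whitespace at either end of the file name does not affect the match. The trade-off is that the category list must be maintained by hand.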