I am new to coding and am currently working on a project that requires me to parse NDJSON strings stored in .txt files. I have hundreds of .txt files, each containing up to 1 million NDJSON strings. The code below successfully parses one individual file, provided I explicitly state the name of the .txt input file and the name of the .csv output file:

library('ndjson')
library('tidyverse')

parsed_df <- tbl_df(ndjson::stream_in("test.txt"))
selected_df <- parsed_df[,c(3,26,30,51,54,57,76,93,99,125,143,169,173,246,
                            250,251,253,254,267,269,370,431,432,450)]

write.csv(selected_df, 'test_reduced.csv')

In the example above, I simply set the working directory to the folder that contains the files.

I now want to repeat this process for all of the files in the folder, rather than manually typing in the name of each file and adjusting the output file name. Each file contains tweet information relating to a specific disaster, so I'd like to give the files logical names, such as Nepal01.txt, Nepal02.txt, HurricaneSandy01.txt, etc. I mention this because the current file names are long, and if I rename them, I'd like the process to still work while keeping the names logical. For this reason, I need a dynamic way of selecting all files that end in .txt and writing output files with relevant names in .csv format, e.g. Nepal_reduced01.csv, Nepal_reduced02.csv, HurricaneSandy_reduced01.csv, etc.
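For illustration, the naming scheme described above could be sketched with a regular-expression substitution. This is only a minimal sketch, assuming every input name has the form `<label><digits>.txt`; `reduced_name` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: map "Nepal01.txt" to "Nepal_reduced01.csv".
# Assumes names look like <label><optional trailing digits>.txt.
reduced_name <- function(txt_name) {
  sub("^(.*?)(\\d*)\\.txt$", "\\1_reduced\\2.csv", txt_name, perl = TRUE)
}

# Select only files whose names end in ".txt"
# (the pattern argument of list.files is a regular expression).
txt_files <- list.files(".", pattern = "\\.txt$")
csv_files <- reduced_name(txt_files)
```

Because `sub` is vectorised, `csv_files` holds one output name per input file, in the same order as `txt_files`.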

Below is my failed attempt so far:

library('ndjson')
library('tidyverse')

filenames= list.files(".", ".txt")
for( i in 1:length(filenames) )

  parsed_df <- tbl_df(ndjson::stream_in(filenames[1])) 
  selected_df <- parsed_df[,c(3,26,30,51,54,57,76,93,99,125,143,169,173,246,
                              250,251,253,254,267,269,370,431,432,450)]

  write.csv(selected_df, cbind(i,'.csv'))
})

Below is an image of the error message:

[image: screenshot of the error message]

  • I suspect this is something Apache Drill (via [sergeant](https://github.com/hrbrmstr/sergeant), if you like) would be really good at. In this case, you're just missing a `{` at the start of your loop, though. There's an extra `)` at the end, too. To make your loop do anything, change that `1` to `i`. – alistaire Feb 09 '18 at 23:00
  • Sorry alistaire - can you be more specific please? I am new to coding and not familiar with Apache Drill (or sergeant). I'm also not sure where I should change from 1 to i. I also suspect that your suggestion will not resolve the naming convention issue for output files. – Christopher Loynes Feb 09 '18 at 23:12
  • Fix what you've got first. Fix the typos, then change the `1` in `parsed_df <- tbl_df(ndjson::stream_in(filenames[1]))` to `i`, or you'll just read the same file over and over. For output filenames, you don't want `cbind`, which returns a matrix; more likely you need `paste0`, e.g. maybe `paste0(filenames[i], '.csv')`. – alistaire Feb 09 '18 at 23:16
  • Did you check `filenames`? I think you should remove `"."` in `list.files(".", ".txt")` and use `paste0(tools::file_path_sans_ext(filenames[i]), '.csv')` for the output as suggested by alistaire. – DJack Feb 10 '18 at 00:14
  • As the author of both `ndjson` and `sergeant` I heartily agree with the Drill recommendation. "new to coding" is not really an excuse for, well, anything at this level. Learn coding, then progress to (what this is) fairly advanced coding. – hrbrmstr Feb 10 '18 at 02:54
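
Putting the comments above together, a corrected loop might look like the following. This is a sketch, not a tested solution: it assumes the `ndjson` and `tibble` packages are installed and that the 24 column positions from the question are valid for every file (the columns come from the fields present in each file, so positions may differ between files). `reduce_all` and `keep_cols` are hypothetical names, and `tibble::as_tibble` stands in for the deprecated `tbl_df`.

```r
# Column positions taken from the question.
keep_cols <- c(3, 26, 30, 51, 54, 57, 76, 93, 99, 125, 143, 169, 173, 246,
               250, 251, 253, 254, 267, 269, 370, 431, 432, 450)

# Hypothetical wrapper: reduce every .txt file in `dir` to a CSV beside it.
reduce_all <- function(dir = ".", cols = keep_cols) {
  # "\\.txt$" matches only names that end in .txt
  filenames <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  outputs <- character(0)
  for (f in filenames) {  # braces enclose the whole loop body
    # Read the current file (f), not always the first one.
    parsed_df <- tibble::as_tibble(ndjson::stream_in(f))
    selected_df <- parsed_df[, cols]
    # paste0 builds the name, e.g. "Nepal01.txt" -> "Nepal01_reduced.csv"
    out <- paste0(tools::file_path_sans_ext(f), "_reduced.csv")
    write.csv(selected_df, out)
    outputs <- c(outputs, out)
  }
  invisible(outputs)  # paths of the files written
}
```

Calling `reduce_all(".")` would process the working directory. Iterating over the file names directly (`for (f in filenames)`) also avoids the `1:length(filenames)` pitfall when the folder contains no matching files.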
