I have a small problem on which I need expert advice. I need to split texts into pieces of different sizes. For example, one of the texts consists of 19578 words. What I want is to put the first 1000 words in the first piece, the first 2000 words in the second piece, the first 3000 words in the third, and so on, up to the first 19000 words in the nineteenth piece. So each chunk grows by 1000 words and contains all the words of the previous chunks. (In terms of word positions, the desired list would look like: [1,1000], [1,2000], [1,3000], [1,4000], ..., [1,19000].)

Using the stringr package, I've put the text into a list:

words <- str_split(as.character(text), pattern = boundary(type = "word"))

and tried to split the list with the split function:

split.words <- split(unlist(words), cut(seq_along(unlist(words)), 19, labels = FALSE))

However, the result is very different from what I want: it produces equally sized, non-overlapping chunks. (In terms of word positions, the chunks look like: [1,1000], [1001,2000], ..., [18001,19000].)

I also tried to combine the elements of the split.words list with the c() function:

combined <- c(split.words[[1]][["1"]], split.words[[1]][["2"]], split.words[[1]][["3"]], split.words[[1]][["4"]], ...)

Yet again, the outcome is a single character vector that still falls into sections of 1000 words each. Basically, I've just changed the type from a list to a character vector with c().

Now my question is: how can I split my texts into unequally sized chunks that grow by 1000 words each? Note that all the chunks must start from the first word.

  • If you want the words as character vectors, you could do `lapply(seq(1000, length(words[[1]]), 1000), function(x) words[[1]][1:x])`, or if you want it as text with spaces, you could do `lapply(seq(1000, length(words[[1]]), 1000), function(x) paste(words[[1]][1:x], collapse = " "))` – user12728748 Aug 03 '20 at 13:49
  • @user12728748 thanks a million! it seems to work properly. – Mohammad Farsadnia Aug 03 '20 at 15:00
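
The approach suggested in the comment above can be tried on toy data first. This is only a minimal sketch in base R: the vector `words`, the variable names `step`, `ends` and `chunks`, and the step of 10 (instead of 1000) are assumptions made for illustration, and `words` here is a plain character vector rather than the list returned by `str_split`.

```r
# Toy stand-in for the real text: 25 made-up "words" (w1 ... w25).
words <- paste0("w", 1:25)
step <- 10  # chunk increment (hypothetical; the question uses 1000)

# End position of each chunk: 10, 20. The 5 leftover words are dropped,
# just as the 578 leftover words are in the question.
ends <- seq(step, length(words), by = step)

# Every chunk starts at the first word and runs to its end position.
chunks <- lapply(ends, function(n) words[seq_len(n)])

# Word count of each chunk.
lengths(chunks)
```

Because every chunk is built directly from the full vector, no intermediate equal-sized split is needed. Note that `seq()` would likely raise an error if the text were shorter than one step, so a real script may want a guard for that case.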

2 Answers

Maybe you can try Reduce with the option accumulate = TRUE, which keeps every intermediate result of combining the pieces:

Reduce(c, split.words, accumulate = TRUE)
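
To see what the accumulation does, here is a small self-contained sketch on toy data. The names `pieces` and `cumulative`, the 20 made-up words, and the piece size of 5 are assumptions for illustration only; the pieces are built with `ceiling()` rather than `cut()` so that each one has exactly 5 words.

```r
# Toy stand-in for the real text: 20 made-up "words" (w1 ... w20).
words <- paste0("w", 1:20)

# Disjoint pieces of exactly 5 words each: group ids 1,1,1,1,1,2,2,...
pieces <- split(words, ceiling(seq_along(words) / 5))

# accumulate = TRUE keeps each intermediate result of c(),
# so element k is pieces 1..k concatenated.
cumulative <- Reduce(c, pieces, accumulate = TRUE)

# Word count of each cumulative chunk.
lengths(cumulative)
```

One thing to watch with the original `split.words`: `cut(..., 19)` divides the positions into 19 equal-width intervals, so with 19578 words the pieces hold roughly 1030 words each rather than exactly 1000, and the accumulated chunks inherit those sizes.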
ThomasIsCoding
mkdir ./d.output

cat giant-humungous-file.txt | cut -d' ' -f1-1000 > ./d.output/file1000.txt

cat giant-humungous-file.txt | cut -d' ' -f1-2000 > ./d.output/file2000.txt

Etc.

Then you can do this:

find ./d.output/*.txt -type f >> stack
cat stack | tr '\n' ' ' | sed s'@^@cat @'g | sed s'@$@ > newfile.txt@' > stack2
mv stack2 stack
chmod +x ./stack
./stack
petrus4