
I need to count the lines of 221 poems and tried counting the line breaks \n.

However, some lines are followed by a double line break \n\n to start a new verse. I want each of these counted only once. The number and position of the double line breaks vary from poem to poem.

Minimal working example:

library("quanteda")

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

poems <- quanteda::corpus(c(poem1, poem2))

The resulting line count should be 5 lines for poem1 and 4 lines for poem2.

I tried stringi::stri_count_fixed(texts(poems), pattern = "\n"), but a fixed pattern counts every single line break, so it cannot account for the double line breaks.

John

1 Answer


You can use stringr::str_count with the \R+ pattern to find the number of consecutive line break sequences in the string:

> poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
> poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
> library(stringr)
> str_count(poem1, "\\R+")
[1] 4
> str_count(poem2, "\\R+")
[1] 3

So the line count is str_count(x, "\\R+") + 1.
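
Since the question involves many poems, it may help to note that str_count() is vectorised over its input. The following is a minimal sketch, assuming the poems are held in a plain character vector; the name poems_vec is made up for illustration and does not come from the question.

```r
library(stringr)

# Hypothetical character vector holding one poem per element
poems_vec <- c(
  "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one",
  "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
)

# One call counts the runs of line breaks in every poem;
# adding 1 turns the number of runs into the number of lines
line_counts <- str_count(poems_vec, "\\R+") + 1
line_counts
# => [1] 5 4
```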

The \R pattern matches any line break sequence, including CRLF, LF and CR. \R+ matches a run of one or more such line break sequences, so a double line break counts as a single match.
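
To illustrate why \R is handy compared with a literal \n, here is a sketch using Windows-style CRLF line endings. It also trims the string first, because a trailing line break would otherwise inflate the "+ 1" count by one. The helper name count_lines and the trimming step are assumptions added for illustration, not part of the approach above.

```r
library(stringr)

# Hypothetical helper: strip leading/trailing whitespace (including line
# breaks), then count runs of line breaks and add 1. Empty input gives 0.
count_lines <- function(x) {
  trimmed <- str_trim(x)
  ifelse(trimmed == "", 0L, str_count(trimmed, "\\R+") + 1L)
}

# CRLF line endings, one blank line between verses, and a trailing line break
crlf_poem <- "First line\r\nSecond line\r\n\r\nThird line\r\n"
count_lines(crlf_poem)
# => [1] 3
```

Without the trimming step, the trailing \r\n would be counted as a third run and the result would be 4.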

The complete R code:

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
library(stringr)
str_count(poem1, "\\R+")
# => [1] 4
str_count(poem2, "\\R+")
# => [1] 3
## Line counts:
str_count(poem1, "\\R+") + 1
# => [1] 5
str_count(poem2, "\\R+") + 1
# => [1] 4
Wiktor Stribiżew
    Thank you for explaining the rationale behind your answer, I was happy to learn about the line break matching with \R. – John Dec 15 '20 at 11:30