
I have this output from pdftools' pdf_data() for a page of a town's financial statements. Unfortunately, in rare cases, the captured y value of a line is slightly off, as shown below. I would like to be able to group on y, including cases where y differs by ±1.

  library(data.table)
  data <- 
    read.csv(
      text =
     "x, y, text\n43, 391, Total\n66, 391, Expenditures\n260, 390, 6476803\n542, 390, 6773717"
     )
  data <- setDT(data)
  
  # View data
  print(data)
#>      x   y          text
#> 1:  43 391         Total
#> 2:  66 391  Expenditures
#> 3: 260 390       6476803
#> 4: 542 390       6773717
  
  # The problem
  data[, paste(text, collapse = ""), y]
#>      y                  V1
#> 1: 391  Total Expenditures
#> 2: 390     6476803 6773717

Desired output would be something like this, where rows whose y values are within 1 of each other (y - 1 <= y' <= y + 1) fall into the same group:

#>       y                  V1
#> 1: c(391, 390)  Total Expenditures 6476803 6773717

Most answers about grouping on a range suggest creating new columns for the high and low bounds, or creating a new variable with cut() to group on, but I was unsure where to begin implementing this. I also have thousands of pages where the y values vary constantly.

I generally use data.table, so a data.table solution is much preferred.

Created on 2021-05-20 by the reprex package (v2.0.0)

David Lucey
  • It's not really clear what you're aiming for here. Could you provide the output you expect? – Dean MacGregor May 20 '21 at 19:27
  • I edited as best I could to better explain. Basically, how do you group by a specified range of values around the grouped integer. If an operation like this has a name, just knowing that would help. – David Lucey May 20 '21 at 21:19
  • I'm not sure why you'd want output like that, so it's hard to extrapolate what you'd want to happen if there were more values of y. Is each row just supposed to have 2 values of `y`? If your source data has 390, 391, 392, 393, do you return a row for each of c(390, 391), c(391, 392), and c(392, 393)? Further, your output in V1 has the text of y=390 after the output of y=391. Is that intended, or should that always have the characters first? – Dean MacGregor May 20 '21 at 22:19
  • These are scanned rows of a pdf in which it seems that a small number have slightly varying vertical alignment. Each grouped row should consist of the actual value of y itself, y - 1 and y + 1, so there would be 3 possible values of y in each. The difference between adjacent levels of y is generally about 12, as would be natural on a page, so there is little risk of an unintended row of y winding up in two groups. The reversed output in V1 is an error I will correct. – David Lucey May 21 '21 at 12:08
  • Is your `V1` in order according to `x`, or just in the order the rows happen to be in, with `x` playing no role? – Dean MacGregor May 21 '21 at 12:16
  • V1 is in order according to x, but not because I have specified it. The data.table is arranged according to y (line position), then x (row position), which in this case, happens to be in the same order as text as it appears on the page. – David Lucey May 21 '21 at 12:22
  • It looks like this is not possible, though it seems like it should be, so here is the workaround for others who I'm sure will have this problem with the outstanding pdftools pdf_data function. I decided to modify y prior to grouping, when it was 1 or 2 integers away from the others on that row ("y_new"), and then group on "y_new": `data[data[, .SD[order(y), .(y, y - shift(y, type = "lag"))]][, .(y_new = ifelse(max(V2, na.rm=TRUE) %in% c(-2, -1, 1, 2), y - max(V2, na.rm = TRUE), y)), y], on = "y"][order(x), paste(text, collapse = " "), y_new]`. This gave the desired result. – David Lucey May 22 '21 at 19:14
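
The idea in the workaround above (treat y values that are only 1 apart as the same line) could perhaps also be written as a gap-based grouping: sort the distinct y values and start a new line whenever the gap to the previous value exceeds a tolerance. The following is only a minimal sketch under that assumption; `line_id()` and its `tol` argument are hypothetical names introduced here for illustration and are not part of pdftools or data.table.

```r
library(data.table)

# Sample rows from the question, entered directly (no stray spaces in text).
data <- data.table(
  x    = c(43L, 66L, 260L, 542L),
  y    = c(391L, 391L, 390L, 390L),
  text = c("Total", "Expenditures", "6476803", "6773717")
)

# Hypothetical helper: returns a line id for each y. A new id starts
# whenever the gap between consecutive sorted distinct y values is > tol.
line_id <- function(y, tol = 1L) {
  u  <- sort(unique(y))
  id <- cumsum(c(TRUE, diff(u) > tol))
  id[match(y, u)]
}

# Tag each word with its line, then collapse the text left to right by x.
data[, line := line_id(y, tol = 1L)]
result <- data[order(x),
               .(y = min(y), text = paste(text, collapse = " ")),
               by = line]

print(result)
# The single resulting row has text "Total Expenditures 6476803 6773717".
```

Note that this chains neighbours together: y values of 390, 391 and 392 would all land in one line. With adjacent lines roughly 12 units apart, as described in the comments, that seems unlikely to merge separate lines, but it is worth checking on pages with tighter spacing.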
