I need to read a df from a pdf file and here is an example table
So far I was able to read the data as raw lines with the following chunk
library(pdftools)
library(tidyverse)
pdf_file <- pdf_text("exm.pdf")
raw_df <- pdf_file %>%
read_lines() %>%
data.frame() %>%
rename(rawline = 1)
raw_df <- raw_df %>%
mutate(
rawline = str_replace(string = rawline,
pattern = "^ \\s*",
replacement = "")
)
here is the structure of raw df
> raw_df
rawline
1 Id Name Address Mobile
2 1 Kiran Bengaluru, 99999 99999
3 Mysore Road
4 6th Lane
5 2 John Mandya 77777 77777
6 Taluka Junction
7 3 Ravi Mysore 88888 88888
How can i convert this into a proper df? I tried filtering out the lines that start with a digit by using regex but after that I got stuck. I need to gather address lines (that have no number at the beginning) and attach them to the previous address text and then split lines into columns. I tried splitting based the space between id, name, address and Mobile but it is not constant across all lines. How can I resolve this issue? Thanks in advance.
Edit
As suggested, I tried pdf_data and i got a table (head(15)) like this with x and y positions of the text
# A tibble: 15 x 6
width height x y space text
<int> <int> <int> <int> <lgl> <chr>
1 8 11 77 74 FALSE Id
2 5 11 77 88 FALSE 1
3 26 11 181 74 FALSE Name
4 23 11 181 88 FALSE Kiran
5 5 11 77 129 FALSE 2
6 20 11 181 129 FALSE John
7 5 11 77 156 FALSE 3
8 18 11 181 156 FALSE Ravi
9 35 11 294 74 FALSE Address
10 48 11 294 88 FALSE Bengaluru,
11 33 11 294 102 TRUE Mysore
12 22 11 330 102 FALSE Road
13 5 11 294 115 FALSE 6
14 5 6 299 114 TRUE th
15 21 11 308 115 FALSE Lane
based on this table I can filter out the x values and get the columns as vectors. but if there are spaces in the values( like address) this filtering will not work. Is there a way to gather address column based on x and y values?
Basically I need to gather rows based on a value (ex: x == 294) until the same value appears then I can use str_c to merge those cells to a single string.