
This question is a follow-up to an earlier question.

The following metadata.txt has been generated by: pdftk sample.pdf dump_data > metadata.txt

metadata.txt:

InfoBegin
InfoKey: ModDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: CreationDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: Creator
InfoValue: Adobe Acrobat 7.0
InfoBegin
InfoKey: Producer
InfoValue: Mac OS X 10.9.5 Quartz PDFContext
PdfID0: 76cf9fd41f0778314abfec8b34d8388d
PdfID1: 76cf9fd41f0778314abfec8b34d8388d
NumberOfPages: 612
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 1
BookmarkPageNumber: 11
BookmarkBegin
BookmarkTitle: Preface 
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Explanatory Note and Abbreviations Used 
BookmarkLevel: 1
BookmarkPageNumber: 7
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 405 616
PageMediaDimensions: 405 616

I would like R to read the Table-of-Contents (TOC) information from metadata.txt into a data.frame, starting from the first BookmarkBegin to the BookmarkPageNumber immediately before PageMediaBegin.

The area of interest can be filtered out with the following code:

library(stringi)

# readLines() opens and closes the file itself, so no connection is left open
metadata=readLines('metadata.txt')

existing_toc=c(min(grep('BookmarkBegin', metadata)),max(grep('BookmarkPageNumber', metadata)))
metadata_toc=metadata[existing_toc[1]:existing_toc[2]]

Removing the BookmarkBegin lines and splitting each remaining line at its first occurrence of ": " via:

toc_data=metadata_toc[-grep('BookmarkBegin', metadata_toc)]
toc_data_split=stri_split_fixed(toc_data, ": ", n=2)

lands me with the following list:

[[1]]
[1] "BookmarkTitle" "Contents"     

[[2]]
[1] "BookmarkLevel" "1"            

[[3]]
[1] "BookmarkPageNumber" "11"                

[[4]]
[1] "BookmarkTitle" "Preface "     

[[5]]
[1] "BookmarkLevel" "1"            

[[6]]
[1] "BookmarkPageNumber" "5"                 

[[7]]
[1] "BookmarkTitle"                           
[2] "Explanatory Note and Abbreviations Used "

[[8]]
[1] "BookmarkLevel" "1"            

[[9]]
[1] "BookmarkPageNumber" "7"

How should I continue from here to get a data.frame like so:

structure(list(BookmarkTitle = structure(c(1L, 3L, 2L), .Label = c("Contents", 
"Explanatory Note and Abbreviations Used", "Preface"), class = "factor"), 
    BookmarkLevel = c(1, 1, 1), BookMarkPageNumber = c(11, 5, 
    7)), .Names = c("BookmarkTitle", "BookmarkLevel", "BookMarkPageNumber"
), row.names = c(NA, -3L), class = "data.frame")

                            BookmarkTitle BookmarkLevel BookMarkPageNumber
1                                Contents             1                 11
2                                 Preface             1                  5
3 Explanatory Note and Abbreviations Used             1                  7
Sati
  • I considered using the `read_yaml()` function from package `yaml` before but it runs into problems when strings that follow `BookmarkTitle` also contain `:`s. – Sati May 11 '18 at 04:05
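
One possible way to continue from the split step, sketched in base R only. It assumes every bookmark contributes exactly three lines (title, level, page number) in the same order, which holds for the sample but is not guaranteed for every PDF. The names `keys`, `values` and `toc` are made up for this illustration, and the colon inside the third title is added on purpose to show that only the first ": " is used as the separator.

```r
# Lines left after dropping the "BookmarkBegin" markers (sample data).
toc_data <- c("BookmarkTitle: Contents", "BookmarkLevel: 1",
              "BookmarkPageNumber: 11",
              "BookmarkTitle: Preface ", "BookmarkLevel: 1",
              "BookmarkPageNumber: 5",
              "BookmarkTitle: Explanatory Note: Abbreviations Used ",
              "BookmarkLevel: 1", "BookmarkPageNumber: 7")

# Split at the first ": " only, so colons inside a title survive.
keys   <- sub(": .*$", "", toc_data)              # text before the first ": "
values <- trimws(sub("^[^:]*: ", "", toc_data))   # text after it, trimmed

# Assumption: three fields per bookmark, always in the same order.
stopifnot(length(values) %% 3 == 0)
toc <- as.data.frame(
  matrix(values, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, keys[1:3])),
  stringsAsFactors = FALSE
)

# Turn the columns that hold only numbers into numeric columns.
toc[] <- lapply(toc, type.convert, as.is = TRUE)
toc
```

If some bookmarks carry extra or missing fields, the fixed three-column reshape no longer applies and a record-based reader is the safer route.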

2 Answers


This code should convert metadata_toc into the desired data frame format.

(Edit: updated the code to handle the case where the BookmarkTitle value itself contains a ":".)

library(tidyverse)
library(stringi)

df <- data.frame(txt = metadata_toc) %>%
  filter(txt != 'BookmarkBegin') %>%   #filter unwanted text - 'BookmarkBegin'

  #based on first occurrence of ':' split 'txt' column into two new columns 
  rowwise() %>%
  mutate(txt_1 = stri_split_fixed(txt, ': ', n=2)[[1]][1],
         txt_2 = stri_split_fixed(txt, ': ', n=2)[[1]][2]) %>%
  select(-txt) %>%
  ungroup() %>%

  #new column 'row_num' helps 'spread' (i.e. next line) know that every 3 subsequent rows are to be spread into 3 columns in a single row.
  mutate(row_num = rep(1:(n()/3), each = 3)) %>%    
  #rep(...) means that 9 (=n() i.e. number of total rows) rows in this sample data is divided into 3 groups as we want to finally convert it into 3 rows.
  #rep(1:3, each=3)
  #[1] 1 1 1 2 2 2 3 3 3

  spread(txt_1, txt_2) %>%             #convert data to wide format 
  select(c("BookmarkTitle", "BookmarkLevel", "BookmarkPageNumber"))
df

Output is:

  BookmarkTitle                           BookmarkLevel BookmarkPageNumber
1 Contents                                1             11                
2 "Preface "                              1             5                 
3 "Explanatory Note: Abbreviations Used " 1             7 

Sample data:

metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1", 
"BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ", 
"BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin", 
"BookmarkTitle: Explanatory Note: Abbreviations Used ", "BookmarkLevel: 1", 
"BookmarkPageNumber: 7")
Prem
  • `?dplyr::mutate` adds a new column to dataframe, `?tidyr::spread` converts data from long to wide format (in order to see it in action you can run above code before & after `spread` line) & `?dplyr::select` selects only the desired columns from dataframe. – Prem May 11 '18 at 09:24
  • What does `rep(1:(n()/3), each = 3)` do? I can see that it fills up the data in the newly-added `row_num` column. But I cannot figure out what `1:(n()/3)` means... – Sati May 11 '18 at 09:29
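
A small base R illustration of the `rep(1:(n()/3), each = 3)` question above. Inside `mutate()`, `n()` is the number of rows in the data frame; here the hypothetical variable `n_rows` stands in for it so the snippet runs without dplyr.

```r
# `n_rows` plays the role of dplyr's n(): 3 bookmarks x 3 fields = 9 rows.
n_rows <- 9
stopifnot(n_rows %% 3 == 0)      # the trick needs a multiple of 3

n_groups <- n_rows / 3           # number of bookmarks: 3
seq_groups <- 1:n_groups         # 1 2 3

# each = 3 repeats every group id three times in a row, so the three
# lines belonging to one bookmark share the same row_num.
row_num <- rep(seq_groups, each = 3)
row_num
#> [1] 1 1 1 2 2 2 3 3 3
```

`spread()` then uses `row_num` as the row identifier, which is why the three lines of each bookmark end up in a single row.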

This base R solution converts metadata_toc to a data frame. First, replace each line that has no colon with an empty line. The result is in Debian Control File (DCF) format, so it can be read with read.dcf. Then convert the resulting character matrix m to a data frame DF, and convert each column to character or numeric as appropriate.

metadata_toc[grep(":", metadata_toc, invert = TRUE)] <- ""
m <- read.dcf(textConnection(metadata_toc))
DF <- as.data.frame(m, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)

giving:

> DF
                            BookmarkTitle BookmarkLevel BookmarkPageNumber
1                                Contents             1                 11
2                                 Preface             1                  5
3 Explanatory Note and Abbreviations Used             1                  7

Note

metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1", 
"BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ", 
"BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin", 
"BookmarkTitle: Explanatory Note and Abbreviations Used ", "BookmarkLevel: 1", 
"BookmarkPageNumber: 7")
G. Grothendieck
  • When running the final code `lapply(DF, type.convert, as.is = TRUE)`, I ran into this error: `Error in FUN(X[[i]], ...) : the first argument must be of mode character.` – Sati May 11 '18 at 15:07
  • I am running 3.4.4 on OSX. Just realized it could be easily fixed with `as.data.frame(m, stringsAsFactors = FALSE)`. Only problem is, the integers would end up as characters. – Sati May 11 '18 at 16:46
  • The modified code shown in the answer should work. `type.convert` will ensure that the final result does not give characters for numbers. – G. Grothendieck May 11 '18 at 16:56
  • How does a DCF work and how is it different from / related to other data structures? – Sati May 11 '18 at 22:27
  • Just googled `R DCF` and [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/dcf.html)'s what I've found. An immediate question that comes to mind would be: What are the advantages/disadvantages of defining a colon-separated plain text file as DCF as opposed to YAML? – Sati May 11 '18 at 22:45
  • Why does `read.dcf` return a matrix rather than just another data.frame? – Sati May 11 '18 at 23:00
  • Presumably because it only returns character data. – G. Grothendieck May 11 '18 at 23:17
  • Simple & great answer (+1)! – Prem May 12 '18 at 06:54
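
To make the DCF discussion in the comments concrete, here is a sketch of the intermediate objects, using base R only. In DCF, each record is a run of `tag: value` lines and records are separated by blank lines. The shortened sample and the name `dcf_lines` are assumptions made for this illustration, and the second title contains a colon on purpose.

```r
metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents",
                  "BookmarkLevel: 1", "BookmarkPageNumber: 11",
                  "BookmarkBegin",
                  "BookmarkTitle: Explanatory Note: Abbreviations Used ",
                  "BookmarkLevel: 1", "BookmarkPageNumber: 7")

# Lines without a colon (the BookmarkBegin markers) become blank lines,
# which DCF treats as record separators.
dcf_lines <- metadata_toc
dcf_lines[!grepl(":", dcf_lines, fixed = TRUE)] <- ""

con <- textConnection(dcf_lines)
m <- read.dcf(con)       # one row per record, one column per tag
close(con)

is.matrix(m)             # TRUE
typeof(m)                # "character": every field is read as text
m[2, "BookmarkTitle"]    # the colon inside the title is kept in the value

# Convert to a data frame, then let type.convert() pick column types.
DF <- as.data.frame(m, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
str(DF)
```

Passing `stringsAsFactors = FALSE` keeps the columns as character vectors before `type.convert` runs, which matches the fix discussed in the comments for the error seen on R 3.4.4.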