I have several large PDF docs (70-200 pages each). The PDFs themselves are generated from HTML pages (I can't get the HTML source, which is why I am working with the PDFs). What I want to do is split each PDF into separate pages based on the attributes of the converted H1 tag. When I print out the attributed string extracted from the PDF, I get this:
Seller Tag (AST)
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 8.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f4339680, spc=2.22\"";
}Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}...
which looks like a bunch of attributes contained in a Dictionary. But when I run this code:
let strContent = myAppManager.pdfToText(fromPDF:pdfDirPath.absoluteString + "/" + thisFile)
let strPDF:NSAttributedString = strContent
let strNSPDF = strPDF.string as NSString
let rangeOfString = NSMakeRange(0, strNSPDF.length)
let arrAttributes = strPDF.attributes(at: 0, longestEffectiveRange: nil, in: rangeOfString)
print(arrAttributes)
I get this output:
[__C.NSAttributedStringKey(_rawValue: NSColor): Device RGB colorspace 0.94118 0.32549 0.29804 1, __C.NSAttributedStringKey(_rawValue: NSBaselineOffset): 0, __C.NSAttributedStringKey(_rawValue: NSFont): "Helvetica 8.00 pt. P [] (0x7ff0f441d490) fobj=0x7ff0f4339680, spc=2.22"]
I was expecting a much larger number of entries, something like 1000 or more, not just one.
After some snooping around, I found that the H1 HTML tag gets converted to this:
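A likely explanation, as far as I can tell: `attributes(at:longestEffectiveRange:in:)` only returns the attributes in effect at the single index passed in (0 here), so one dictionary is the expected result rather than one per run. Below is a minimal sketch of visiting every run instead with `enumerateAttributes(in:options:using:)`. The helper name `countAttributeRuns` is made up for illustration.

```swift
import Foundation

/// Counts the attribute runs in an attributed string.
/// (Hypothetical helper, not part of any framework.)
func countAttributeRuns(in text: NSAttributedString) -> Int {
    var runs = 0
    let fullRange = NSRange(location: 0, length: text.length)
    // The block is called once per run of identical attributes,
    // not just for the run that contains index 0.
    text.enumerateAttributes(in: fullRange, options: []) { _, _, _ in
        runs += 1
    }
    return runs
}
```

Calling `countAttributeRuns(in: strPDF)` on the string from the question should then report how many distinct runs the PDF text actually contains.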
Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}
So what I am looking to do is use the converted H1s as delimiters, so I can grab the content between them as a page and do stuff with it. Any ideas or suggestions would be appreciated.
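One possible direction, as a rough sketch rather than a tested solution: walk the font runs and start a new section whenever a run's point size reaches the heading size. It assumes the `.font` attribute holds an `NSFont` (macOS/AppKit, as the dump suggests) and that 34 pt reliably identifies a converted H1, which is only inferred from the output above. `Section`, `splitIntoSections` and `headingPointSize` are hypothetical names.

```swift
import AppKit

/// One chunk of the document: a heading and the text that follows it.
struct Section {
    var title: String
    var body: String
}

/// Splits an attributed string into sections, treating any run whose font
/// size is at least `headingPointSize` as a heading.
/// Assumption: 34 pt marks a converted H1, as in the dump above.
func splitIntoSections(_ text: NSAttributedString,
                       headingPointSize: CGFloat = 34) -> [Section] {
    var sections: [Section] = []
    var current = Section(title: "", body: "")
    var lastRunWasHeading = false
    let fullRange = NSRange(location: 0, length: text.length)

    text.enumerateAttribute(.font, in: fullRange, options: []) { value, range, _ in
        let run = text.attributedSubstring(from: range).string
        let isHeading = (value as? NSFont).map { $0.pointSize >= headingPointSize } ?? false

        if isHeading {
            if lastRunWasHeading {
                // Heading spread over adjacent runs: keep it as one title.
                current.title += run
            } else {
                // A new heading begins: close the previous section first.
                if !current.title.isEmpty || !current.body.isEmpty {
                    sections.append(current)
                }
                current = Section(title: run, body: "")
            }
        } else {
            current.body += run
        }
        lastRunWasHeading = isHeading
    }

    // Keep whatever follows the last heading.
    if !current.title.isEmpty || !current.body.isEmpty {
        sections.append(current)
    }
    return sections
}
```

Any text before the first heading (such as the 8 pt "Seller Tag (AST)" run) ends up in a section with an empty title, which can be dropped or kept as a preamble.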