
I have several large PDF documents (70-200 pages each). The PDFs are generated from HTML pages (I can't get the HTML source, which is why I am working with the PDFs). What I want to do is split each PDF into separate pages based on the attributes of the converted H1 tag. When I print out the PDF content I get this:

Seller Tag (AST)
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 8.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f4339680, spc=2.22\"";
}Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}...

which looks like a bunch of attributes contained in a Dictionary. But when I run this code:

 let strContent = myAppManager.pdfToText(fromPDF:pdfDirPath.absoluteString + "/" + thisFile)
 let strPDF:NSAttributedString = strContent
 let strNSPDF = strPDF.string as NSString
 let rangeOfString = NSMakeRange(0, strNSPDF.length)
 let arrAttributes = strPDF.attributes(at: 0, longestEffectiveRange: nil, in: rangeOfString)
 print(arrAttributes)

I get this output:

[__C.NSAttributedStringKey(_rawValue: NSColor): Device RGB colorspace 0.94118 0.32549 0.29804 1, __C.NSAttributedStringKey(_rawValue: NSBaselineOffset): 0, __C.NSAttributedStringKey(_rawValue: NSFont): "Helvetica 8.00 pt. P [] (0x7ff0f441d490) fobj=0x7ff0f4339680, spc=2.22"]

I was kind of expecting a high number, like 1000 or more entries, not 1.

So snooping around, I know the H1 HTML tag gets converted to this:

Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}

So what I am looking to do is delimit the converted H1s so I can get the content between as a page and do stuff with it. Any ideas or suggestions would be appreciated.

PruitIgoe
  • That's because you printed the attributes at index `0`, i.e. for the first character only; that argument is the index of the character you want. You can enumerate the attributes until you find the ones you want, keeping each matching range. From that array of ranges you can then create attributed substrings, one per page. You just need to be clear on the "separator", i.e. the attributes that delimit a page. – Larme Oct 13 '21 at 16:44
  • Ooooohhhhh...that makes sense. – PruitIgoe Oct 13 '21 at 20:18
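
To illustrate the comment above: `attributes(at:effectiveRange:)` only describes the run containing the given index, while `enumerateAttributes(in:options:using:)` visits every run. The following is a minimal sketch on a hand-built string, not on a real PDF; the attribute key `demoLevel` and the sample text are made up for the illustration and stand in for `.font` and the PDF content.

```swift
import Foundation

// Hypothetical attribute key, standing in for .font in a real PDF string.
let levelKey = NSAttributedString.Key("demoLevel")

// Build a string made of three differently attributed runs.
let sample = NSMutableAttributedString()
sample.append(NSAttributedString(string: "Intro ", attributes: [levelKey: "body"]))
sample.append(NSAttributedString(string: "Title", attributes: [levelKey: "h1"]))
sample.append(NSAttributedString(string: " more text", attributes: [levelKey: "body"]))

// Index 0 only describes the run that contains the first character.
let firstRun = sample.attributes(at: 0, effectiveRange: nil)

// Enumerating visits every run and reports the range it covers.
var runs: [(level: String, range: NSRange)] = []
sample.enumerateAttributes(in: NSRange(location: 0, length: sample.length), options: []) { attributes, range, _ in
    if let level = attributes[levelKey] as? String {
        runs.append((level: level, range: range))
    }
}

print("attributes at index 0: \(firstRun.count), runs found: \(runs.count)")
```

The single dictionary printed in the question corresponds to `firstRun` here; the per-heading ranges come from the enumeration.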

2 Answers


Quickly done, assuming you have:

someText[HEADER1]someText1[HEADER2]someText2[HEADER3]someText3...

where every [HEADERN] has the same attributes (which you know), and those attributes differ from the ones on someTextN.

In the end, we want an array of:

struct Page: CustomStringConvertible {
    let title: NSAttributedString? // That would be the h1 tag content
    let content: NSAttributedString?

    var description: String {
        return "Title: \(title?.string ?? "") - content: \(content?.string ?? "")"
    }
}

Initial sample:

let htmlString = "<b>Title 1</b> Text for part one.\n <b>Title 2</b> Text for part two<b>Title 3</b>Text for part three"
let attributedString = try! NSAttributedString(data: Data(htmlString.utf8),
                                               options: [.documentType : NSAttributedString.DocumentType.html],
                                               documentAttributes: nil)

With:

let headerAttributes: [NSAttributedString.Key: Any] = [.font: NSFont.boldSystemFont(ofSize: 12)]
print("headerAttributes: \(headerAttributes)")

func headerOneAttributes(_ headerAttributes: [NSAttributedString.Key: Any], matches attributes: [NSAttributedString.Key: Any]?) -> Bool {
    guard let attributes = attributes else { return false }

    guard let attributesFont = attributes[.font] as? NSFont, let headerFont = headerAttributes[.font] as? NSFont else {
        return false
    }
    // The two fonts aren't strictly equal here, so this only compares symbolic traits.
    // More work is needed: check other attributes too, and perhaps the font size.
    return attributesFont.fontDescriptor.symbolicTraits == NSFontDescriptor.SymbolicTraits(rawValue: 268435458)
    // Do your own check
    // return false
}
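
As one way to "do your own check", here is a sketch that matches on font size instead of a raw symbolic-traits value. It assumes macOS (AppKit) and that headings can be told apart by point size alone, as in the question's dump (34 pt headings vs. 8 pt text). The helper name `isHeaderRun`, its parameters, and the tolerance value are made up for illustration.

```swift
import AppKit

// Hypothetical helper: treats a run as a header when its font size is
// approximately the expected header size. The 34 pt default comes from the
// dump in the question; the 0.5 pt tolerance is an assumption.
func isHeaderRun(_ attributes: [NSAttributedString.Key: Any],
                 headerPointSize: CGFloat = 34,
                 tolerance: CGFloat = 0.5) -> Bool {
    // Runs without a font attribute can never be headers.
    guard let font = attributes[.font] as? NSFont else { return false }
    // Compare with a tolerance rather than ==, since sizes are floating point.
    return abs(font.pointSize - headerPointSize) <= tolerance
}

// Sample attribute dictionaries mimicking a heading run and a body run.
let headerRun: [NSAttributedString.Key: Any] = [.font: NSFont.systemFont(ofSize: 34)]
let bodyRun: [NSAttributedString.Key: Any] = [.font: NSFont.systemFont(ofSize: 8)]
print(isHeaderRun(headerRun), isHeaderRun(bodyRun))
```

If several heading levels share a size, the check would also have to look at other attributes (color, traits), which this sketch does not attempt.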

We can iterate over the attributes to get all the header ranges:

var headerRanges: [NSRange] = []
attributedString.enumerateAttributes(in: NSRange(location: 0, length: attributedString.length), options: []) { attributes, range, stop in
    if headerOneAttributes(headerAttributes, matches: attributes) {
        headerRanges.append(range)
    }
}

With an iteration on the ranges:

var pages: [Page] = []
guard !headerRanges.isEmpty else { return }

// If the first title doesn't start at the very beginning, the text before it is "content" with no title
if let first = headerRanges.first, first.location > 0 {
    pages.append(Page(title: nil, content: attributedString.attributedSubstring(from: NSRange(location: 0, length: first.location))))
}

// Then we iterate
for (anIndex, aRange) in headerRanges.enumerated() {
    let title = attributedString.attributedSubstring(from: aRange)
    let subtext: NSAttributedString?
    // If there is a "nextRange", then we get the end of subtext from it
    if anIndex + 1 <= headerRanges.count - 1 {
        let next = headerRanges[anIndex + 1]
        let location = aRange.location + aRange.length
        let length = next.location - location
        subtext = attributedString.attributedSubstring(from: NSRange(location: location, length: length))
    } else {
        //There is no next => Until the end
        let location = aRange.location + aRange.length
        let length = attributedString.length - location
        subtext = attributedString.attributedSubstring(from: NSRange(location: location, length: length))
    }
    pages.append(Page(title:title, content: subtext))
}
print(pages)

PS: UIFont and NSFont are roughly the same here. I tested this in a macOS app, not an iOS one, which is why the code uses NSFont.

Larme
  • thanks for this. Am going to do some test running with it today but it makes sense (at least in my head - which is kind of a low bar). : ) – PruitIgoe Oct 18 '21 at 16:42

Okay, so @Larme put me on the right track for what I was looking for. Posting the code in the hope it helps someone else. I've tested this on a 77-page document and it worked. I should have noted in the question that I am working on macOS.

func parsePDF(_ strPDFContent:NSMutableAttributedString) -> Array<Dictionary<String, Any>> {
    
    //some initial setup
    let strNSPDF = strPDFContent.string as NSString
    var arrDocSet:Array<Dictionary<String, Any>> = []
    
    //get all the page headers
    var arrRanges = [NSRange]()
    strPDFContent.enumerateAttribute(NSAttributedString.Key.font, in: NSRange(0..<strPDFContent.length), options: .longestEffectiveRangeNotRequired) {
        value, range, stop in
        if let thisFont = value as? NSFont {
            if thisFont.pointSize == 34 {
                arrRanges.append(range)
            }
        }
    }
    
    //get the content and store data
    for (idx, range) in arrRanges.enumerated() {
        
        //get title
        let strTitle = String(strNSPDF.substring(with: range))
        var textRange = NSRange(location:0, length:0)
        
        //skip opening junk
        if !strTitle.contains("Table of Contents\n") {
            
            if idx < arrRanges.count-1 {
                textRange = NSRange(location: range.upperBound, length: arrRanges[idx+1].lowerBound - range.upperBound)
            } else if idx == arrRanges.count-1 {
                textRange = NSRange(location: range.upperBound, length: strNSPDF.length - range.upperBound)
            }
        
            let strContent = String(strNSPDF.substring(with: textRange))

            arrDocSet.append(["title":strTitle, "content":strContent, "contentRange":textRange, "titleRange":range])
            
        }
            
    }
    
    print(arrDocSet)
    
    return arrDocSet
    
}

This will output:

["titleRange": {10001, 27}, "title": "Set up Placements with AST\n", "content": "This page contains a sample web page showing how Xandr\'s seller tag (AST) functions can be implemented in the header and body of a sample client page.\nSee AST API Reference for more details on using ...
...
ready.\nExample\n$sf.ext.status();\n", "title": " SafeFrame API Reference\n", "contentRange": {16930, 9841}

Let me know if there are places where I could be more efficient.
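
One possible refinement, offered as a sketch and not something tested on these PDFs: with `.longestEffectiveRangeNotRequired`, the enumeration may report a single heading as several consecutive ranges, which would then show up as extra entries with empty content. Merging ranges that touch before building the pages avoids that. The helper name `mergeAdjacent` and the sample ranges are made up for illustration.

```swift
import Foundation

// Hypothetical helper: merges ranges that touch or overlap, so a heading
// reported as several consecutive runs becomes a single range.
// Assumes `ranges` is sorted by location, which is the order the
// enumeration produces them in.
func mergeAdjacent(_ ranges: [NSRange]) -> [NSRange] {
    var merged: [NSRange] = []
    for range in ranges {
        if let last = merged.last, last.upperBound >= range.lowerBound {
            // Extend the previous range to cover this one as well.
            let end = max(last.upperBound, range.upperBound)
            merged[merged.count - 1] = NSRange(location: last.location, length: end - last.location)
        } else {
            merged.append(range)
        }
    }
    return merged
}

// Sample input: the first two ranges touch, the third stands alone.
let splitRanges = [NSRange(location: 10, length: 4),
                   NSRange(location: 14, length: 6),
                   NSRange(location: 40, length: 5)]
let mergedRanges = mergeAdjacent(splitRanges)
print(mergedRanges)
```

In `parsePDF` this would amount to running `arrRanges` through the helper once, right after the enumeration and before the loop that extracts titles and content.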

PruitIgoe