
I'm looping through all pages in a PDFDocument (200+ pages), but the app crashes with:

Message from debugger: Terminated due to memory issue

The PDF is approximately 4 MB in size, yet each iteration of the loop raises memory usage by about 30 MB, which doesn't seem right to me. I have located where in my code the memory is being used, but I'm not sure how to reclaim it. Setting variables to nil had no effect. Wrapping the code inside the for loop in an autoreleasepool {} had no effect either.

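@objc func scrapePDF() {

    // Bail out early instead of force-unwrapping the URL and the document.
    guard let documentURL = self.documentDisplayWebView?.url,
          let document = PDFDocument(url: documentURL) else { return }
    let numberOfPages = document.pageCount

    DispatchQueue.global().async {

        // page(at:) is zero-based, so valid indices are 0..<numberOfPages.
        for pageNumber in 0..<numberOfPages {

            // Reading the page text is where memory climbs on each iteration.
            print(document.page(at: pageNumber)?.string ?? "")

        }
    }
}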

UPDATE: solved ..... kind of

Playing around a bit, I found that if I create a new PDFDocument instance on each iteration, rather than referencing the one created outside the loop, the memory issue strangely goes away. I don't quite understand why, though. PDFDocument is a class, not a struct, so it is passed by reference: it is created once and then only referenced inside my loop. So why would it cause a memory issue?

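@objc func scrapePDF() {

    // Bail out early instead of force-unwrapping the URL and the document.
    guard let documentURL = self.documentDisplayWebView?.url,
          let document = PDFDocument(url: documentURL) else { return }
    let numberOfPages = document.pageCount

    DispatchQueue.global().async {

        // page(at:) is zero-based, so valid indices are 0..<numberOfPages.
        for pageNumber in 0..<numberOfPages {
            // A fresh PDFDocument per iteration keeps memory flat but is slow.
            let doc = PDFDocument(url: documentURL)
            print(doc?.page(at: pageNumber)?.string ?? "")

        }
    }
}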

Though the above code clears the memory issue, the problem is that it's too slow. Each iteration takes 0.5 seconds, and with 300+ pages I can't accept that. Any tips on speeding it up? Or on why the memory isn't given back when the PDFDocument is referenced from outside the loop?

Further UPDATE: it seems that calling the .string property of PDFPage is what increases the memory to the point of crashing.
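
One hypothesis raised in the comments is that a long-lived PDFDocument keeps per-page data around after each page is read. The toy model below sketches why reference semantics alone would not prevent growth in that case. `CachingDocument` is a hypothetical stand-in invented for illustration; it is not PDFKit, and nothing here confirms how PDFKit actually behaves.

```swift
// Toy model only: CachingDocument is a hypothetical stand-in, not PDFKit.
final class CachingDocument {
    // Every page that has been read is retained here for the instance's lifetime.
    private(set) var cachedPages: [Int: String] = [:]

    func text(at index: Int) -> String {
        if let hit = cachedPages[index] { return hit }
        let text = "text of page \(index)" // stands in for expensive extraction
        cachedPages[index] = text
        return text
    }
}

// One shared instance: the cache keeps growing because the same object
// (passed by reference) accumulates state across iterations.
let shared = CachingDocument()
for index in 0..<5 { _ = shared.text(at: index) }
let sharedCacheSize = shared.cachedPages.count

// A fresh instance per iteration: each cache holds a single page and is
// released together with its instance at the end of the iteration.
var largestFreshCache = 0
for index in 0..<5 {
    let fresh = CachingDocument()
    _ = fresh.text(at: index)
    largestFreshCache = max(largestFreshCache, fresh.cachedPages.count)
}
```

In this model the shared instance ends up holding all five pages, while no fresh instance ever holds more than one. Passing a class by reference avoids copying the object, but it also means any state the object accumulates lives as long as the object does.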

RyanTCB
  • Hi - this question refers to creating PDFs, not reading them. However, the solution may be relevant: https://stackoverflow.com/questions/14699194/memory-warning-and-crash-when-creating-pdf It refers to running one page at a time. – benjiiiii Dec 08 '17 at 12:46
  • Have you tried fetching the first 20 pages to see if the memory is released when the loop completes? – Laffen Dec 08 '17 at 14:19
  • As per the Apple docs https://developer.apple.com/documentation/pdfkit/pdfdocument/1436036-string: `This is a convenience method, equivalent to creating a selection object for the entire document and then invoking the PDFSelection class’s string method.` It looks like it creates a String representation of the entire document and then uses a PDFSelection convenience init to get that one page, so that is what affects memory here. – Prashant Tukadiya Dec 08 '17 at 14:25
  • @Laffen yeah, if I do that I get significant portions of the memory back. I could write some logic to read n pages at a time, but I’d rather not if avoidable. – RyanTCB Dec 08 '17 at 14:30
  • @PrashantTukadiya but that’s if I call string on the PDFDocument. When called on the page, it should return just the text of that page. https://developer.apple.com/documentation/pdfkit/pdfpage/1503949-string – RyanTCB Dec 08 '17 at 14:37
  • @RyanTCB oh, I missed that. Did you try with different PDFs? – Prashant Tukadiya Dec 08 '17 at 14:41
  • @PrashantTukadiya I’ve tried numerous PDFs, and as long as the document is less than 200 pages it can complete the loop and return the memory. However, I can’t be sure of the size of the PDF, so I need a solution to reclaim memory. My option so far is to follow Laffen's suggestion and fetch parts at a time. I’d rather understand why Swift keeps it all in memory. – RyanTCB Dec 08 '17 at 14:45
  • what happens if you remove the line `DispatchQueue.global().async {` – meggar Dec 08 '17 at 15:12
  • @meggar it freezes the UI. Didn’t think it was good practice to do that – RyanTCB Dec 08 '17 at 15:13
  • right but does it use the same memory? – meggar Dec 08 '17 at 15:15
  • @meggar yes. Sorry – RyanTCB Dec 08 '17 at 15:31
  • It almost looks like the `PDFDocument` caches the fetched pages, resulting in a memory warning when the cache gets too big. This would explain why instantiating a new `PDFDocument` on every iteration works: only one page is cached at any given time. I'm curious why you're scraping the PDF in the first place? – Laffen Dec 11 '17 at 08:15
  • I'm scraping so I can enter details into a calendar rather than having the user enter them manually. I also conclude that it's caching the fetched pages, but why? Why is it not just using the instance passed in? If `PDFDocument` were a struct, I'd get that it sends a copy, but it's a class, so it's passed by reference. – RyanTCB Dec 11 '17 at 08:19
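
The "fetch parts at a time" idea from the comments can be combined with the finding that a fresh `PDFDocument` gives memory back. Below is a minimal sketch, assuming memory really is released when a document instance goes away (as the question reports): reopen the document once per batch instead of once per page, so peak memory is bounded by the batch size and the reopen cost is paid far less often. The names `pageBatches`, `scrapeInBatches`, `handlePage`, and the default batch size of 20 are all illustrative assumptions, not part of PDFKit.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Splits the zero-based page indices 0..<pageCount into consecutive ranges
/// of at most `batchSize` pages. (Hypothetical helper, not a PDFKit API.)
func pageBatches(pageCount: Int, batchSize: Int) -> [Range<Int>] {
    guard pageCount > 0, batchSize > 0 else { return [] }
    return stride(from: 0, to: pageCount, by: batchSize).map { start in
        start..<min(start + batchSize, pageCount)
    }
}

#if canImport(PDFKit)
/// Extracts page text batch by batch, opening a fresh PDFDocument for each
/// batch so that whatever the previous instance retained can be released.
func scrapeInBatches(url: URL,
                     batchSize: Int = 20, // assumed value; tune by measuring
                     handlePage: (Int, String) -> Void) {
    // Open the document once just to learn the page count.
    guard let pageCount = PDFDocument(url: url)?.pageCount else { return }

    for batch in pageBatches(pageCount: pageCount, batchSize: batchSize) {
        // Drain autoreleased objects at the end of every batch.
        autoreleasepool {
            guard let document = PDFDocument(url: url) else { return }
            for index in batch {
                if let text = document.page(at: index)?.string {
                    handlePage(index, text)
                }
            }
        }
    }
}
#endif
```

For a 45-page document and a batch size of 20, `pageBatches` yields `0..<20`, `20..<40`, and `40..<45`, so the document is opened three times for reading rather than 45. Call `scrapeInBatches` from a background queue, as in the question, to keep the UI responsive. Whether this keeps memory low enough depends on the PDF, so the batch size is worth checking against real documents.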
