exist-db how to access a pdf

Question

I am sure this is very simple, but I just cannot get my head around it. The eXist-db documentation is a bit fuzzy on content extraction: http://exist-db.org/exist/apps/doc/contentextraction

I have a PDF file containing about 162 high-res images (the PDF is quite big ...) and I do not know how to access any of the resources that are presumably created from it.

Please do not destroy me! I am just starting to build a database (for an edition at uni). I'd love to have a facsimile edition (one tab with the image file and one tab with the transcribed text).

I am aiming for something similar to what Heidelberg University did with the "Welsche Gast Digital": http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image (the chosen image is just an example!). When you click on "Faksimile" the scan opens, and when you click on "Transkription" the transcribed text opens.

I am quite new to XQuery, XPath and most X-related stuff. I have a "working design" put together in eXist-db and am looking at TEI for marking up the transcription etc. I fear I'll have to spend quite some time on this issue ... (this is not about getting someone to do my job for me, just about being pointed in the right direction).

duncdrum · Accepted Answer · 2018-07-24T23:38:54.887

I'm afraid the short answer is simply: don't.

Storing a pdf in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the pdf), and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.
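To illustrate storing the images individually, here is a minimal sketch in Python that pushes a folder of page scans into a collection through eXist-db's REST interface with HTTP PUT. The server URL, the app collection `/db/apps/my-edition/resources/img`, the `*.jpg` naming and the credentials are all assumptions for illustration; adjust them to your own setup.

```python
import base64
import mimetypes
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Assumptions: a local eXist-db with its REST interface at this URL,
# and a hypothetical app collection; change both to match your setup.
EXIST_REST = "http://localhost:8080/exist/rest"
IMG_COLLECTION = "/db/apps/my-edition/resources/img"


def resource_url(filename: str) -> str:
    """Build the REST URL under which one image will be stored."""
    return f"{EXIST_REST}{IMG_COLLECTION}/{quote(filename)}"


def build_put_request(image: Path, user: str, password: str) -> urllib.request.Request:
    """Prepare an HTTP PUT that stores one image as a binary resource."""
    mime = mimetypes.guess_type(image.name)[0] or "application/octet-stream"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        resource_url(image.name),
        data=image.read_bytes(),
        method="PUT",
        headers={"Content-Type": mime, "Authorization": f"Basic {token}"},
    )


def upload_images(folder: Path, user: str, password: str) -> None:
    """Upload every JPEG in `folder`, one resource per page image."""
    for image in sorted(folder.glob("*.jpg")):
        with urllib.request.urlopen(build_put_request(image, user, password)) as resp:
            print(image.name, resp.status)
```

Something like `upload_images(Path("scans"), "admin", "your-password")` would then store each file in a (hypothetical) `scans` folder as its own resource. For a one-off upload, dragging the files into the collection with eXide or the Java admin client works just as well.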

You might want to take a look at tei-publisher for creating digital editions in eXist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I'm afraid it's all a bit more involved than just opening a PDF in a browser, but so is the Welsche Gast Digital.

edited Jul 24 '18 at 23:38

answered Jul 24 '18 at 23:30

Thanks for your time and answer! Thought so, but I thought there was an easy way I had not stumbled upon :-) As I want the facsimile and the transcription to correspond in their respective views, I think it is best to use separate XML files that are joined by a facsimile which I then use to open the facsimile in return .... (which is not my initial question but me processing what you wrote). (I have to transcribe one book and haven't decided yet whether I want to use one giant XML file, which I find messy, or several smaller ones, which I think is more elegant.) – Jul 25 '18 at 06:51
How to organize your TEI files and how to link from TEI to image files are separate questions. Once you have some basic working code, feel free to open another question here. The pages I linked to show examples of how to achieve what you want; they also show how to create PDFs from your TEI + image files. If my response helped you lay the "upload PDF and extract images" approach to rest, please mark it as the accepted answer in the Stack Overflow UI (only you can do that). – duncdrum Jul 25 '18 at 10:24
As Duncan suggested, always use the original texts and images if you have access to them. If they are not available, you will have to use the content extraction module as a last resort. – adamretter Jul 29 '18 at 06:13
