1

I am using Camelot to extract tables from PDF files. While this works very well, it extracts the text only, it does not extract the hyperlinks that are embedded in the tables.

Is there a way of using Camelot or a similar package to extract table text and hyperlinks embedded within tables?

Thanks!

Amy D
  • 55
  • 1
  • 6

1 Answers1

0

most applications such as tablular text extractors simply scrape the visible surface as plain text and actually hyperlinks are often stored elsewhere in the pdf which is NOT a WTSIWYG word processor file.

So, if you're lucky you can extract the co-ordinates (without their page allocation like this)

C:\Users\lz02\Downloads>type "7 - 20 November 2022 (003).pdf" |findstr /i "(http"
<</Subtype/Link/Rect[ 69.75 299.75 280.63 313.18] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/complaint/) >>/StructParent 5>>
<</Subtype/Link/Rect[ 219.37 120.85 402.47 133.06] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/handle-complaint/) >>/StructParent 1>>
<</Subtype/Link/Rect[ 146.23 108.64 329.33 120.85] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/handle-complaint/) >>/StructParent 2>>
<</Subtype/Link/Rect[ 412.48 108.64 525.55 120.85] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code) >>/StructParent 3>>
<</Subtype/Link/Rect[ 69.75 96.434 95.085 108.64] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code) >>/StructParent 4>>
<</Subtype/Link/Rect[ 69.75 683.75 317.08 697.18] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/comp-reports/ecu/) >>/StructParent 7>>
<</Subtype/Link/Rect[ 463.35 604.46 500.24 617.89] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/reporting-scotland-bbc-one-scotland-20-december-2021) >>/StructParent 8>>
<</Subtype/Link/Rect[ 463.35 577.11 500.24 590.54] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/book-of-the-week-preventable-radio-4-19-april-2022) >>/StructParent 9>>
<</Subtype/Link/Rect[ 463.35 522.4 521.41 535.83] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/the-one-show-bbc-one-6-october-2022) >>/StructParent 10>>
<</Subtype/Link/Rect[ 463.35 495.04 518.04 508.47] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/news-6pm-bbc-one-22-september-2022) >>/StructParent 11>>
<</Subtype/Link/Rect[ 463.35 469.04 518.04 482.47] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/news-1030am-bbc-news-channel-20-september-2022) >>/StructParent 12>>

NOTE, the random order, to find which page they belong to you need to traceback to their /StructParent ##

K J
  • 8,045
  • 3
  • 14
  • 36