A good friend of mine is currently writing a book for his PhD and asked me whether I could help him automate checking all of his cited sources (hyperlinks). I've searched all over the internet but could not find any helpful tips. I came across iText7 and iTextSharp, but could not get either of them to work. As a first step it would already be a huge help if I could just find the links (by parsing the whole PDF into a string and searching for the //.../ tags?) and show them in a ListBox. I was not able to find any links using the iTextSharp ANNOT function, so I guess the PDF he gave me does not contain link annotations. I should still be able to parse the text and search for links with a regex, right? Does anyone have a hint on how I could make this work? Thanks in advance!
-
It would help us help you if you shared what you tried with us. – May 02 '19 at 14:04
-
Sorry, I only tried things out yesterday evening and threw the code away since it didn't work. Using iTextSharp, it kept failing because there are no annotations in the PDF, so the example I found on the internet, which uses the ANNOT function, did not work. – beadrex May 02 '19 at 14:11
-
How is your friend creating this PDF? I imagine it might be easier to extract the URIs from the original source rather than the finished product. – May 02 '19 at 14:13
-
He used Word to export it as a PDF, and I have the original Word file as well. Do you think the annotations would be set in the Word file? – beadrex May 02 '19 at 18:16
-
No idea, but it would be simple enough to scan it for every instance of `http` and go from there. I recently used a `DocX` library to extract all of the text from a document and convert it to LaTeX. It's much easier to extract from Word than from a PDF. I would definitely go that route if you can. https://github.com/xceedsoftware/DocX – May 02 '19 at 18:17
-
I'm still unable to find any hyperlinks. `var hyperlink = doc.Hyperlinks.FirstOrDefault(); if (hyperlink != null) { // Code in here does not get executed, probably because there is no hyperlink in the document? hyperlink.Text = "xceed"; hyperlink.Uri = new Uri("http://www.xceed.com/"); }` – beadrex May 02 '19 at 18:54
-
Then they aren't formatted as hyperlinks and you'll need to find them by searching for `http`, like I suggested. – May 02 '19 at 19:16
-
Is there a built-in function in DocX that does this already? How do I scan all of the text for `http`? Sorry for bothering you, and thank you so much for your help so far. I'm just lost :/ – beadrex May 02 '19 at 19:19
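The "scan the text for `http`" idea from the comments could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it expects the document's plain text to already be in a string (how that string is obtained from the Word or PDF file is not shown here), the names `LinkFinder` and `FindLinks` are made up for illustration, and the pattern is deliberately simple rather than a complete URL grammar. It uses only the standard .NET `Regex` class, not any DocX or iTextSharp API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of DocX or iTextSharp.
public static class LinkFinder
{
    // Matches "http://" or "https://" followed by a run of characters
    // that are not whitespace, double quotes, or angle brackets.
    private static readonly Regex UrlPattern = new Regex(
        @"https?://[^\s""<>]+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns each distinct URL-like substring, in order of first appearance.
    public static List<string> FindLinks(string text)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var links = new List<string>();

        foreach (Match match in UrlPattern.Matches(text))
        {
            // Assumption: trailing punctuation belongs to the sentence, not the URL.
            string url = match.Value.TrimEnd('.', ',', ';', ':', ')', ']');

            // HashSet.Add returns false for duplicates, so each link is listed once.
            if (seen.Add(url))
            {
                links.Add(url);
            }
        }

        return links;
    }
}
```

Each returned string could then be added to the ListBox, for example in a `foreach` loop over `LinkFinder.FindLinks(documentText)`, where `documentText` stands for whatever text was extracted. One caveat of the trimming step: a URL that legitimately ends in `)` or `]` would be cut short, so the results are worth a quick manual check.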