My organization has a huge number of PDF files (100,000+) that must be remediated to comply with WCAG 2.0. We cannot remediate all of them in a short time because we lack the resources and budget, so we are looking for tools, techniques, or best practices to get the job done. As a developer familiar with agile software development, my approach is to start by fixing issues programmatically wherever possible. For example, we could probably develop and run a tool that adds an appropriate Author to all PDF files. I have no experience in accessibility remediation, so I'm not sure whether my approach is correct or whether sophisticated tools already exist to partially remediate PDF files in bulk.

Any suggestion or guidance would be much appreciated.

Fred
  • Can you please provide some info on what type of PDF documents we're talking about here? Are these documents that are created by hand from platforms like MS Office, InDesign, etc.? Or are they dynamically generated from a database using some other software? – Josh Sep 17 '19 at 14:58
  • It's a combination of all of the above. We have a repository of PDF files that are generated in many different ways. Some of them are scanned documents, some are exported from MS Office tools such as PowerPoint and Word, and some are generated by other customized applications. – Fred Sep 17 '19 at 16:38
  • My best advice is to Google "PDF remediation services" rather than try to automate it yourself. They have the tools to automate the parts that **can** be automated, but they also have the people and processes in place to add the semantic information that (currently) only a human can glean from the layout. Even well-tagged PDFs created with the best tools available often don't include enough metadata to pass the WCAG 2.0 requirements because the source document was poorly authored. See https://www.w3.org/TR/WCAG20-TECHS/pdf to get an understanding of the enormity of the problem. – joelgeraci Sep 17 '19 at 19:45
  • There are tools that can get you X% of the way there, with the remaining (100 - X)% left to services that spot-check, fix up, and augment. The challenge is finding the tools that make "X" as large as possible, but that depends heavily on your input documents and how repeatable they are. – Kevin Brown Sep 18 '19 at 20:16
  • The scanned documents will be your biggest challenge (most likely bitmaps). Microsoft seems to have ambitions to output Word documents as accessible PDF or some such, but I am not sure how far they have got. It probably requires careful use of the 'styles' feature in the original content. – brennanyoung Sep 19 '19 at 13:06
  • Just to throw an idea in the pot: have you thought about converting the PDFs to basic web pages? We used a PDF library (can't remember which, but this was in PHP) to extract the raw content and images and converted them to simple web pages. Out of 850 PDFs there were only 2 we couldn't convert automatically, due to curved text and strange formatting. Without seeing what you are working with I obviously can't say whether this is the right solution, but it is another route you can look at. – GrahamTheDev Sep 21 '19 at 07:58
  • If you have the sources, it's better to fix them and regenerate the PDF files. In the case of MS Word, by setting the correct options, you can make those PDFs at least readable. Sadly, setting the correct styling (for example, so that headers appear as such) can't be entirely automated; at best a tool can make suggestions based on text/font/color/alignment information, that's all. – QuentinC Sep 21 '19 at 14:45
  • Hi everyone, thank you for your helpful comments. Now I understand that I should not look for a full-fledged application to automate the entire remediation process. But, as you suggested in your comments, there should be tools, engines, and SDKs that we can use for the time-consuming parts of the process. I was wondering if you could help me identify some tools that I can use in my remediation process, such as for converting images to text, converting PDFs to HTML, etc. – Fred Sep 24 '19 at 15:12
  • CommonLook has a product called CommonLook Dynamic that's intended to improve the accessibility of dynamically generated PDFs. I haven't used it myself, but their other tools are excellent. This might help to get you part of the way to where you need to be. – Josh Sep 24 '19 at 20:00
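
Since the repository mixes scanned, exported, and generated PDFs, a rough inventory may help before choosing tools or a service. The sketch below is only a heuristic, not a compliance check: it scans raw bytes for the PDF names `/StructTreeRoot` (present in tagged PDFs) and `/Font` (usually absent from image-only scans). The function names, the category labels, and the command-line usage are all hypothetical, and files that keep these objects in compressed object streams can be misclassified.

```python
import sys
from collections import Counter
from pathlib import Path

# Byte markers searched for in the raw file (heuristic only: PDFs that keep
# their catalog in compressed object streams can hide these names).
MARKERS = {
    "tagged": b"/StructTreeRoot",  # structure tree, i.e. a tagged PDF
    "has_fonts": b"/Font",         # font resources suggest a real text layer
}


def inspect_pdf_bytes(data: bytes) -> dict:
    """Report which markers occur anywhere in the raw PDF bytes."""
    return {name: marker in data for name, marker in MARKERS.items()}


def classify(flags: dict) -> str:
    """Map marker flags to a rough remediation bucket (labels are made up)."""
    if flags["tagged"]:
        return "tagged"         # still needs a human check of tag quality
    if not flags["has_fonts"]:
        return "needs-ocr"      # probably an image-only scan
    return "needs-tagging"      # has text, but no structure tree


def survey(root: Path) -> Counter:
    """Count PDFs per bucket under root (matches lowercase .pdf only)."""
    counts = Counter()
    for path in root.rglob("*.pdf"):
        counts[classify(inspect_pdf_bytes(path.read_bytes()))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    for bucket, count in sorted(survey(Path(sys.argv[1])).items()):
        print(f"{bucket}: {count}")
```

Run against a sample of the repository, the counts give a first estimate of how much of the work is OCR, how much is tagging, and how much is review of existing tags. A "tagged" result says nothing about whether the tags are correct, so it does not replace the manual checks described in the comments above.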

0 Answers