docsplit gem pdf to text

Question

I have the same problem as the one discussed here: http://blog.joshsoftware.com/2014/08/13/pdf-to-plain-text-processing-using-docsplit/ but the solution proposed there for docsplit doesn't work for me.

 Docsplit.extract_text(filepath, {:pdf_opts => '-layout', output: 'tmp_text_file'})

The :pdf_opts => '-layout' option doesn't seem to do anything, and I can't find any documentation for options like it, so I still get a single word per line in the output text file.

Does anyone know how to get an accurate text file?

Thank you

Can you post a sample pdf and output so we can try to reproduce the problem? — Joe Martinez, Apr 28 '15 at 16:19

Shweta · Accepted Answer · 2015-04-28T16:56:07.323

1

If you read the blog post carefully, you'll see that the option

 :pdf_opts => '-layout'

is not yet supported by the master branch of the docsplit gem. To get it, you need the changes from https://github.com/documentcloud/docsplit/pull/114. So, in your Gemfile, use

gem 'docsplit', git: 'git://github.com/narutosanjiv/docsplit.git'

Hope this helps. Let me know if you still face any issues.
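If switching to the fork is not an option, a possible workaround is to call pdftotext yourself with the -layout flag. The following is only a minimal sketch: it bypasses docsplit entirely, it assumes the pdftotext binary (from Poppler or Xpdf) is installed and on the PATH, and the helper names pdftotext_args and extract_text_with_layout are made up for illustration, not part of docsplit.

```ruby
require 'open3'

# Build the argument list for pdftotext (hypothetical helper).
# -layout asks pdftotext to keep the physical page layout
# instead of reflowing the text.
def pdftotext_args(pdf_path, txt_path, layout: true)
  args = ['pdftotext', '-enc', 'UTF-8']
  args << '-layout' if layout
  args + [pdf_path, txt_path]
end

# Run pdftotext and return the extracted text (hypothetical helper).
# Raises if pdftotext exits with a non-zero status.
def extract_text_with_layout(pdf_path, txt_path)
  _out, err, status = Open3.capture3(*pdftotext_args(pdf_path, txt_path))
  raise "pdftotext failed: #{err}" unless status.success?

  File.read(txt_path)
end
```

Calling extract_text_with_layout('input.pdf', 'tmp_text_file.txt') should then write and return the layout-preserving text, provided pdftotext itself handles the PDF correctly.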

edited Apr 28 '15 at 16:56

answered Apr 28 '15 at 16:50


It worked, looking good so far, still have to do a little bit more testing. – Richardlonesteen Apr 28 '15 at 19:08
