
With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:

 Docsplit.extract_text('doc.pdf')

I can get the text content of a PDF file.

I'm currently using Rails, and the PDF arrives in a request and lives in memory. Looking through the API documentation and the source code, I couldn't find a way to extract the text from memory, only from a file.

Is there a way to get the text of this PDF without creating a temporary file?

I'm using attachment_fu, if it matters.

the Tin Man
fotanus

2 Answers


Use a temporary directory:

require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

def pdf_to_text(pdf_filename)
  # Docsplit writes "<basename>.txt" into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)

  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)

  # Read the result, then remove the intermediate file.
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)

  extracted_text
end

pdf_to_text('doc.pdf')
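
In a Rails app, two requests that upload files with the same name would write the same `.txt` path inside the shared `Dir.tmpdir`. A possible variation, sketched here under assumptions, gives each call its own directory with `Dir.mktmpdir`, which removes the directory when its block returns. The helper name `extract_with_private_dir` and the block contract are hypothetical; the Docsplit call is shown only as a comment.

```ruby
require 'tmpdir'

# Hypothetical helper: runs the given block with a fresh, private directory,
# reads "<basename>.txt" from it, and lets Dir.mktmpdir delete the directory.
def extract_with_private_dir(pdf_filename)
  Dir.mktmpdir('pdf_text') do |dir|
    # The block is expected to write "<basename>.txt" into dir.
    yield pdf_filename, dir

    txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
    File.read(File.join(dir, txt_file))
  end
end

# Intended use with Docsplit (not executed here):
#
#   text = extract_with_private_dir('doc.pdf') do |pdf, dir|
#     Docsplit.extract_text([pdf], ocr: false, output: dir)
#   end
```

Because the directory is private to the call, there is no need for an explicit `File.delete`.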
barbolo

If you have the content in a string, use StringIO to wrap it in a File-like object that code expecting an IO can read. StringIO doesn't care whether the content is text or binary; it treats both the same.

Look at either of:

StringIO.new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.

StringIO.open(string=""[, mode]) {|strio| ...}
Equivalent to ::new, except that when called with a block it yields the new instance, closes it afterwards, and returns the block's result.
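
As a minimal sketch of both forms, the following reads from an in-memory string the way one would read from a File. The helper names `peek_and_read` and `in_memory_size` are made up for the illustration.

```ruby
require 'stringio'

# Hypothetical helper: reads the first n bytes, rewinds, then reads
# everything, using the same calls one would use on a File.
def peek_and_read(bytes, n)
  io = StringIO.new(bytes)
  header = io.read(n)
  io.rewind
  [header, io.read]
end

# Hypothetical helper: the block form yields the instance and returns
# whatever the block returns.
def in_memory_size(bytes)
  StringIO.open(bytes) { |strio| strio.read.bytesize }
end
```

This only helps when the consumer accepts an IO-like object; a library that insists on a file path still needs something on disk.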
the Tin Man
  • Actually, this is not what I was looking for. Docsplit needs a file path as input, and I can't get one from a StringIO. The same goes for the output. – fotanus Apr 30 '13 at 18:54
  • If you need a filepath you're going to have to write it out to disk. Tempfile would work, or a normal `File.write` followed by a `File.delete`. – the Tin Man Apr 30 '13 at 19:05
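
The Tempfile route mentioned in the last comment could look roughly like this. It is a sketch, not a definitive implementation: `with_pdf_on_disk` is a hypothetical helper name, and the bytes are assumed to already be in a Ruby string.

```ruby
require 'tempfile'

# Hypothetical helper: writes in-memory bytes to a temporary file, yields
# its path, and removes the file when the block finishes.
def with_pdf_on_disk(pdf_bytes)
  Tempfile.create(['upload', '.pdf']) do |file|
    file.binmode           # PDFs are binary; skip newline/encoding translation
    file.write(pdf_bytes)
    file.flush             # make sure the bytes are on disk before the path is used
    yield file.path
  end
end
```

Combined with the `pdf_to_text` method from the first answer, usage would be along the lines of `with_pdf_on_disk(bytes) { |path| pdf_to_text(path) }`. A file is still created, but its lifetime is limited to the block.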