Extract text from document in memory using docsplit

Question

With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:

 Docsplit.extract_text('doc.pdf')

I can get the text content of a PDF file.

I'm currently using Rails, and the PDF is sent through a request and lives in memory. Looking in the API and in the source code I couldn't find a way to extract the text from memory, only from a file.

Is there a way to get the text of this PDF avoiding the creation of a temporary file?

I'm using attachment_fu if it matters.

score 3 · Answer 1 · answered Jan 06 '15 at 12:08

Use a temporary directory:

require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)

  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)

  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)

  extracted_text
end

pdf_to_text('doc.pdf')

score 0 · Answer 2 · answered Apr 29 '13 at 22:54

If you have the content in a string, use StringIO to create a File-like object that can be read like any other IO. With StringIO it doesn't matter whether the content is true text or binary; it's all the same.

Look at either of:

new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.

open(string=""[, mode]) {|strio| ...}
Equivalent to ::new, except that when called with a block it yields the new instance, closes it afterwards, and returns the block's result.
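
As a rough illustration of the StringIO approach, the sketch below wraps a binary string and reads it back like a file. The `pdf_bytes` value is a made-up stand-in, not a real PDF. Note that a StringIO has no file-system path, so this only helps with APIs that accept an IO object.

```ruby
require 'stringio'

# Hypothetical stand-in for uploaded PDF bytes (not a valid PDF).
pdf_bytes = "%PDF-1.4\n\xFF\xFE binary payload".b

io = StringIO.new(pdf_bytes)

header = io.read(8)    # read the first 8 bytes, as with File#read
io.rewind              # go back to the start of the buffer
everything = io.read   # read the whole content, binary bytes included
```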

the Tin Man

Actually, this is not what I was looking for. Docsplit needs a file path as input, and I can't get one from a StringIO. The same goes for the output. – fotanus Apr 30 '13 at 18:54
If you need a filepath you're going to have to write it out to disk. Tempfile would work, or a normal `File.write` followed by a `File.delete`. – the Tin Man Apr 30 '13 at 19:05
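
A minimal sketch of the write-to-disk route suggested in the comment above, assuming the PDF arrives as a binary string. The names `pdf_bytes_to_text`, `DOCSPLIT_EXTRACTOR`, and `upload.pdf` are hypothetical. The sketch also assumes Docsplit writes `<basename>.txt` into the output directory. `Dir.mktmpdir` with a block removes the directory and everything in it afterwards, so nothing is left on disk.

```ruby
require 'tmpdir'

# Default extractor (assumption: the docsplit gem is loaded before it is
# called). The lambda body is only evaluated on call.
DOCSPLIT_EXTRACTOR = lambda do |pdf_path, output_dir|
  Docsplit.extract_text([pdf_path], ocr: false, output: output_dir)
end

# Hypothetical helper: writes the in-memory bytes into a throwaway directory,
# runs the extractor on that path, and returns the extracted text.
def pdf_bytes_to_text(pdf_bytes, extractor: DOCSPLIT_EXTRACTOR)
  Dir.mktmpdir do |dir|
    pdf_path = File.join(dir, 'upload.pdf')
    File.binwrite(pdf_path, pdf_bytes)

    extractor.call(pdf_path, dir)

    # Assumed output name: same basename as the input, with a .txt extension.
    File.read(File.join(dir, 'upload.txt'))
  end # the temporary directory is deleted here
end
```

In a Rails action this might be called as `pdf_bytes_to_text(uploaded_file.read)`, where `uploaded_file` stands for whatever object holds the request body. The extractor is passed in as a parameter so the file handling can be exercised without Docsplit installed.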
