I have been trying to write a simple Ruby program that parses a simple PDF file and extracts the text I am interested in. I found that pdf-reader is quite a good gem for PDF parsing. I have read through the examples that ship with the gem and some tutorials about it.

I tried the callback approach and was able to get all the text from my PDF file, but I did not understand the concept behind the arguments of some of the callbacks.

For example, suppose my PDF has a simple table with 3 columns and 2 rows. The header row values are (Name, Address, Age) and the first row values are (Arun, Hoskote, 22). When I run the following Ruby script:

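require 'pdf-reader'

# RegisterReceiver records every callback it is sent while a page is walked
receiver = PDF::Reader::RegisterReceiver.new
reader = PDF::Reader.new("Arun.pdf")
reader.pages.each do |page|
  page.walk(receiver)
  # Dump each recorded callback as a {:name=>..., :args=>...} hash
  receiver.callbacks.each do |cb|
    puts cb.inspect
  end
end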

It prints a series of callbacks, among which the interesting show_text_with_positioning callbacks look like the following:

{:name=>:show_text_with_positioning, :args=>[["N", 5, "am", -4, "e"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ad", 6, "d", 3, "ress"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Age"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ar", 4, "u", 3, "n"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["H", 3, "o", -5, "sk", 9, "o", -5, "te"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["22"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}

From the above callbacks, what does args represent with respect to the PDF file? If I want to extract only the name value, 'Arun' in this example (any value can appear here), or the age value, i.e. '22' (again, any value can appear here), how can I do that in a Ruby program? Is there any PDF parser API or Ruby API to get only the specific value(s) I am interested in from a PDF file?

How can I write a Ruby program to access the particular callback that gives me the text I want?

Raghavendra Nilekani

1 Answer

If you only want the text, you can do something like this (though you would probably use a different stream as the destination for the text):

receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.file("Arun.pdf", receiver)

Once you have the text, you could use regular expressions or any other string processing to pull the specific value you want out of it.
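As for what args means: show_text_with_positioning corresponds to the PDF TJ operator, whose operand is an array mixing strings (pieces of text) with numbers (small horizontal spacing adjustments between those pieces, in thousandths of a text space unit). Dropping the numbers and joining the strings gives the cell text. Below is a minimal sketch of a custom receiver built on that idea. The CellCollector class and its row_as_hash helper are hypothetical names, not part of pdf-reader, and the sketch assumes the table layout from the question (headers first, then one data row) and a font whose strings are already readable, as in the output shown. The callbacks are replayed by hand here so the snippet runs without the gem.

```ruby
# Hypothetical receiver (not part of pdf-reader): collects the text of every
# show_text_with_positioning callback it is sent.
class CellCollector
  attr_reader :cells

  def initialize
    @cells = []
  end

  # args is the TJ operand array: strings are text fragments, numbers are
  # spacing adjustments. Keep the strings, drop the numbers, join the rest.
  def show_text_with_positioning(args)
    text = args.grep(String).join
    @cells << text unless text.strip.empty? # skip the " " separator cells
  end

  # Assumption: the first `columns` cells are headers and the next `columns`
  # cells are a single data row, as in the question's table.
  def row_as_hash(columns)
    headers = @cells.first(columns)
    values  = @cells[columns, columns] || []
    headers.zip(values).to_h
  end
end

collector = CellCollector.new

# With the gem, the collector would be passed to page.walk, for example:
#   require 'pdf-reader'
#   PDF::Reader.new("Arun.pdf").pages.each { |page| page.walk(collector) }
# Here the callbacks from the question are replayed by hand instead.
[
  ["N", 5, "am", -4, "e"], [" "],
  ["Ad", 6, "d", 3, "ress"], [" "],
  ["Age"], [" "],
  ["Ar", 4, "u", 3, "n"], [" "],
  ["H", 3, "o", -5, "sk", 9, "o", -5, "te"], [" "],
  ["22"], [" "]
].each { |args| collector.show_text_with_positioning(args) }

row = collector.row_as_hash(3)
puts row["Name"] # prints Arun
puts row["Age"]  # prints 22
```

Since page.walk only calls the callback methods a receiver actually defines, a class like this ignores everything except the text-showing operator. For real documents the column count and the order of cells would need checking against the actual callback stream, because PDF files do not store table structure.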

Hakanai