I am trying to convert a PDF to plain text using the pdf-reader Ruby gem (https://github.com/yob/pdf-reader/).

It works fine for pages in portrait orientation, but not for pages in landscape orientation.

When I convert landscape pages to plain text, the contents are read in a seemingly random order, and some of the data is missing from the output.

The attributes of the landscape page are as follows:

{:Parent=>#<PDF::Reader::Reference:0x000000062d4e60 @id=11481, @gen=0>, :Type=>:Page, :Resources=>{:Font=>{:Fcpdf0=>#<PDF::Reader::Reference:0x000000062cfc80 @id=8585, @gen=0>, :Fcpdf2=>#<PDF::Reader::Reference:0x000000062cef10 @id=8588, @gen=0>, :Fcpdf3=>#<PDF::Reader::Reference:0x000000062cec18 @id=8590, @gen=0>}, :ProcSet=>#<PDF::Reader::Reference:0x000000062cdca0 @id=4, @gen=0>}, :MediaBox=>[0, 0, 595.276, 841.89], :CropBox=>nil, :Rotate=>90, :Contents=>[#<PDF::Reader::Reference:0x000000062c6c70 @id=15, @gen=0>, #<PDF::Reader::Reference:0x000000062c6a18 @id=16, @gen=0>]} 

And the attributes of the portrait page are as follows:

{:Parent=>#<PDF::Reader::Reference:0x000000062fadb8 @id=11481, @gen=0>, :Type=>:Page, :Resources=>{:Font=>{:Fcpdf0=>#<PDF::Reader::Reference:0x000000062f9be8 @id=8585, @gen=0>, :Fcpdf2=>#<PDF::Reader::Reference:0x000000062f8c48 @id=8588, @gen=0>, :Fcpdf1=>#<PDF::Reader::Reference:0x000000062f8748 @id=8587, @gen=0>, :Fcpdf4=>#<PDF::Reader::Reference:0x000000062f3b30 @id=8592, @gen=0>}, :ProcSet=>#<PDF::Reader::Reference:0x000000062f3630 @id=4, @gen=0>}, :MediaBox=>[0, 0, 594, 792], :CropBox=>[0, 0, 594, 792], :Rotate=>0, :Contents=>[#<PDF::Reader::Reference:0x000000062f05e8 @id=9, @gen=0>, #<PDF::Reader::Reference:0x000000062f02c8 @id=10, @gen=0>]} 
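Note that the landscape page still has a portrait-shaped MediaBox and only differs by :Rotate=>90. A minimal sketch of how the displayed orientation could be worked out from such an attribute hash is below. The helper name `displayed_orientation` is hypothetical and not part of pdf-reader, and the hashes are trimmed copies of the ones above.

```ruby
# Hypothetical helper (not part of pdf-reader): report how a page is
# displayed, given an attribute hash like the ones printed above.
def displayed_orientation(attrs)
  x0, y0, x1, y1 = attrs.fetch(:MediaBox)
  width  = (x1 - x0).abs
  height = (y1 - y0).abs

  # :Rotate may be missing or nil; treat that as no rotation.
  rotate = (attrs[:Rotate] || 0).to_i % 360

  # A quarter turn swaps the displayed width and height.
  width, height = height, width if [90, 270].include?(rotate)

  width > height ? :landscape : :portrait
end

# Trimmed copies of the two attribute hashes from the question.
landscape_page = { :MediaBox => [0, 0, 595.276, 841.89], :Rotate => 90 }
portrait_page  = { :MediaBox => [0, 0, 594, 792], :Rotate => 0 }

puts displayed_orientation(landscape_page)
puts displayed_orientation(portrait_page)
```

A check like this only identifies which pages are rotated; it does not by itself change how the text is extracted.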

I am reading the PDF as follows:

reader = PDF::Reader.new("sample.pdf")

page = reader.pages[page_no]

puts page.text

Can anyone help me convert landscape pages to plain text?

Uri Agassi
Shweta
  • I haven't used Pdf-reader but have spent a bit of time using a Python tool called [PDFMiner](http://www.unixuser.org/~euske/python/pdfminer/). So speaking generally, I've had issues where data has come out in layout order, not the visual order you see on screen. This can be seemingly random until you look at the X,Y coordinates and bounding boxes associated with the objects. Your issue might be to do with how the landscape PDF is being authored. Are you able to post a PDF sample? – Matt Jul 11 '14 at 07:52
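To illustrate the point about coordinates, here is a rough sketch of how text positions on a page with /Rotate 90 could be mapped to displayed positions and then sorted into visual reading order. It is plain Ruby and does not call pdf-reader: the `Run` struct, the helper names, and the sample coordinates are all made up for illustration, and it assumes the rotation is 90 degrees clockwise, which is how the PDF specification defines /Rotate.

```ruby
# Hypothetical stand-in for a positioned piece of text; pdf-reader's own
# classes are not used here.
Run = Struct.new(:text, :x, :y)

# Map a point from unrotated page space to the page as displayed with
# /Rotate 90 (clockwise). page_width is the unrotated MediaBox width.
def displayed_position(run, page_width)
  [run.y, page_width - run.x]
end

# Sort runs top-to-bottom, then left-to-right, in displayed space.
def visual_order(runs, page_width)
  runs.sort_by do |run|
    dx, dy = displayed_position(run, page_width)
    [-dy, dx]
  end
end

# Made-up sample data, deliberately out of reading order.
runs = [
  Run.new("world",  100, 300),
  Run.new("line",   120, 300),
  Run.new("Hello",  100, 100),
  Run.new("Second", 120, 100),
]

puts visual_order(runs, 595.276).map(&:text).join(" ")
# prints "Hello world Second line"
```

If the extracted order looks random, comparing it against an ordering like this one can show whether the rotation is the cause.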

1 Answer

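Try setting the orientation when creating the reader (untested; I have not confirmed that PDF::Reader.new accepts an :orientation option):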

reader = PDF::Reader.new("sample.pdf",{:orientation => :landscape}) 
Gagan Gami
  • @user3210186: I haven't used it before, so I don't know much about it. I hope someone solves your issue soon. Sorry about that. – Gagan Gami Jul 11 '14 at 06:17