I have this PDF file which is arranged in 5 columns.

I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself).

However, for some reason I cannot get those 5 columns into CSV/XLS format arranged the way I need them. When I export them, the formatting is badly broken: all the entries come out line by line, with some data loss.

http://www.2shared.com/document/PagE4A1T/ex1.html

Here is a link to an excerpt of the file above, but I am really getting frustrated and am running out of options.

Brian Tompsett - 汤莱恩
econclicks
  • Welcome to Stack Overflow. What language are you trying to do this in? – Daniel A. White Mar 21 '11 at 12:25
  • Have you found a solution yet? Would it be possible to provide a link to the whole file? I have written a tool that should be able to process it, and I am interested in using it as a test for my software. I am happy to send you the resulting CSV file. – Andrew Cash Mar 23 '11 at 05:14

1 Answer

iText (or iTextSharp) could do this, if you can give it the boundaries of those 5 columns and are willing to accept some overhead, namely reparsing the page's text once for each column:

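// In iText 5 the parser classes live in com.itextpdf.text.pdf.parser; Rectangle2D is java.awt.geom.Rectangle2D.
// buildColumnBoxes() is your own helper that returns one bounding box per column.
Rectangle2D[] columnBoxArray = buildColumnBoxes();
ArrayList<String> columnTexts = new ArrayList<String>(columnBoxArray.length);
for (Rectangle2D columnBBox : columnBoxArray) {

  // Keep only the text that falls inside this column's box.
  FilteredTextRenderListener textInRectStrategy =
    new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
      new RegionTextRenderFilter(columnBBox));

  // This reparses the page once per column.
  columnTexts.add(PdfTextExtractor.getTextFromPage(reader, pageNum, textInRectStrategy));
}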

Each line of text should be separated by \n, so it becomes a simple matter of string parsing.
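As a rough illustration of that string parsing, here is a minimal sketch in plain Java (no iText needed) that splits each column's text on \n and zips the columns into CSV rows. The class and method names (ColumnsToCsv, toCsv, escape) are made up for this example, and it assumes every column yields the same number of lines; if a cell is empty in the PDF, the extracted column will simply be one line shorter and the rows after it will be misaligned.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnsToCsv {

    // Quote a field only when it contains a comma or a double quote.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // columnTexts holds one string per column, with rows separated by '\n'.
    static String toCsv(List<String> columnTexts) {
        List<String[]> columns = new ArrayList<>();
        int rowCount = 0;
        for (String text : columnTexts) {
            String[] lines = text.split("\n");
            columns.add(lines);
            rowCount = Math.max(rowCount, lines.length);
        }
        StringBuilder csv = new StringBuilder();
        for (int row = 0; row < rowCount; row++) {
            for (int col = 0; col < columns.size(); col++) {
                if (col > 0) {
                    csv.append(',');
                }
                String[] lines = columns.get(col);
                // Columns with fewer lines are padded with empty cells.
                csv.append(row < lines.length ? escape(lines[row].trim()) : "");
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in practice this would be the columnTexts list built above.
        List<String> columnTexts = List.of("Name\nAlice\nBob", "City\nParis, FR\nOslo");
        System.out.print(toCsv(columnTexts));
    }
}
```

Padding short columns with empty cells keeps the CSV rectangular, but it does not repair misalignment; for that you would need the vertical position of each line, which this sketch does not use.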

If you wanted to not reparse the whole page for each column, you could probably come up with a custom implementation of FilteredTextRenderListener that would take multiple listener/filter pairs. You could then parse the whole thing once rather than once for each column.
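The single-pass idea can be sketched without iText at all. In the example below, Chunk and ColumnListener are hypothetical stand-ins (not iText classes) for a positioned text fragment and a filter/listener pair; a real implementation would receive iText's render callbacks instead. It only shows the dispatch: walk the page's text once and hand each fragment to the first column whose filter accepts it.

```java
import java.util.ArrayList;
import java.util.List;

class SinglePassColumns {

    // Hypothetical stand-in for one positioned text fragment reported by a PDF parser.
    static final class Chunk {
        final String text;
        final float x;

        Chunk(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    // Hypothetical stand-in for a filter/listener pair covering the x range [left, right).
    static final class ColumnListener {
        final float left;
        final float right;
        final StringBuilder collected = new StringBuilder();

        ColumnListener(float left, float right) {
            this.left = left;
            this.right = right;
        }

        // The "filter" half: does this chunk fall inside the column?
        boolean accepts(Chunk chunk) {
            return chunk.x >= left && chunk.x < right;
        }

        // The "listener" half: record the chunk as one line of column text.
        void render(Chunk chunk) {
            collected.append(chunk.text).append('\n');
        }
    }

    // Walks the chunks once and gives each to the first listener that accepts it.
    static List<String> extract(List<Chunk> chunks, List<ColumnListener> listeners) {
        for (Chunk chunk : chunks) {
            for (ColumnListener listener : listeners) {
                if (listener.accepts(chunk)) {
                    listener.render(chunk);
                    break;
                }
            }
        }
        List<String> columnTexts = new ArrayList<>();
        for (ColumnListener listener : listeners) {
            columnTexts.add(listener.collected.toString());
        }
        return columnTexts;
    }
}
```

Chunks that fall outside every column are silently dropped here, which mirrors what a region filter does; whether that is acceptable depends on how tightly the column boxes are drawn.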

Mark Storer