
I have a problem with reading binary file in C++. Currently my code is like this:

FILE *s=fopen(source, "rb");
fseek(s,0,SEEK_END);
long size=ftell(s);
rewind(s);

char *sbuffer=(char *) malloc(sizeof(char) * size);
if(sbuffer==NULL){
    fputs("Memory error", stderr);
    exit(2);
}
size_t result=fread(sbuffer,1,size,s);
if(result != size){
    fputs("Reading error",stderr);
    exit(3);
}
fclose(s);
cout<<sbuffer<<endl;

However, the characters printed on the terminal are all random characters instead of the text I wrote in the PDF file. They look like:

% P D F - 1 . 3 
 % ? ? ? ? ? ? ? ? ? ? ? 
 4   0   o b j 
 < <   / L e n g t h   5   0   R   / F i l t e r   / F l a t e D e c o d e   > > 
 s t r e a m 
 x  ? ? ? j ? 0  E ? ? ? k ?  y Q E # ? ? ? m ? & ? ? @  % + ? .     ? ?  ? ? A i  ?     4 z \ 1 G W ? ?  - , ? ? ? (  ? ? ?  9 ? ? ? ? ?  \ ? } ? ? ? e ? ? ? ? 0 ? ? ? ~ ? , ? ? & 8 ? ? x e 4 ? r 
 | ? ? ? 
          ? ? ? ? E  > a ? ? z & ? Z ? < ?  }  '  ? ? ? j p ? ? Q 7 0 ? ? ? S %  - p ? ? ? 7 D  ?  ? ? ' Q z Q ?  ? ? ? ? ? ? ? ? ? \ 2 ? ? 7 ? ? ? < ? ? D ~  ? ? ? 

 e n d s t r e a m 
 e n d o b j 
 5   0   o b j 
 2 2 8 
 e n d o b j 
 2   0   o b j

And many other characters like the above. I searched for a long time but cannot find out how to get the actual characters out for later processing. By the way, I'm trying to write a compressor that takes a binary file as input and produces a binary file as output. Any help here is highly appreciated!

quetzalcoatl
Iam619
  • Your code doesn't *do any printing*! What's going on?! – Kerrek SB Feb 23 '13 at 16:44
  • I print the 'sbuffer' by a 'cout<<sbuffer' – Iam619 Feb 23 '13 at 16:45
  • If you're printing arbitrary data with formatted output, you're gonna have a bad time. – Kerrek SB Feb 23 '13 at 16:46
  • Vote to close as "working as intended". – Kerrek SB Feb 23 '13 at 16:46
  • If you are processing a PDF file, this is not a text file! To get the text from a PDF, you need an additional library. – donald Feb 23 '13 at 16:46
  • You're joking right? You expect you can just read in a binary PDF file and C++ will somehow magically decode it for you? Pop your PDF open in a hex editor. I'm sure you'll see that your program is printing out the right thing. – Chris Eberle Feb 23 '13 at 16:46
  • @Chris: With fonts and kerning please. And Javascript. – Kerrek SB Feb 23 '13 at 16:47
  • @donald ok...I just try to process the context from a binary file for the compressing purpose... – Iam619 Feb 23 '13 at 16:47
  • @Chris ok...I'm just trying to learn things...then I guess my compressor will have to process these characters.. – Iam619 Feb 23 '13 at 16:49
  • There is no problem. Your data is all there. You just can't *print* it naively the way you do. – Kerrek SB Feb 23 '13 at 16:53
  • Relax friends.. everyone had that problem when they tried to use binary files for the first time. There's a time in a programmer's life, marked with a huge red line on the timeline: to the left you have no idea what binary files are (they are just text, but in binary, right?) and to the right side, well, now you know. – quetzalcoatl Feb 23 '13 at 16:58
  • @Kerrek I'm just trying to debug why my compressor doesn't work since the compressor assumes the characters in the sbuffer are the same as in the PDF file, by printing out the sbuffer to have a look.. – Iam619 Feb 23 '13 at 16:59
  • Your code looks very much like plain C... If you code in C++, you should prefer C++ streams for file access, and avoid malloc, and also use smart pointers for memory management instead of doing it manually. Or you could use plain C of course, if you prefer that, just use printf instead of cout and your code becomes C... – hyde Feb 23 '13 at 17:03
  • @hyde Yeah..I should use printf instead of cout.. It's a C code instead of C++ code I admit... – Iam619 Feb 23 '13 at 17:05
  • You probably want to print arbitrary bytes as something recognizable, e.g. pairs of hex digits... but you can use existing tools for that. Just make sure you write the raw data to the standard out (e.g. `std::cout.write(sbuffer, size)`). – Kerrek SB Feb 23 '13 at 17:14

2 Answers


Only a few file formats, like plain raw .TXT text files, can be "read" and "understood" directly. Most file formats, including almost any binary format, are exactly that: a format. This implies a certain structure held inside the file, completely contrary to a .TXT text file, which is structure-less, or rather, one huge block of pure data.

Open WordPad, Word, or any other at least somewhat intelligent text editor, write some text there, and save it as RTF, DOC, ODT or any other non-TXT file. Then save it as a TXT file too.

Download a HEX VIEWER/HEX EDITOR. Any one will do; take one of the free ones, you don't need many features, just one that displays raw binary values in one column and ASCII text in another column. Almost any free hex viewer/editor can do that.

Open and compare those two files. You will immediately see the difference.

Back to the PDF:

A PDF can even contain graphics interleaved with the text. How would you expect to store that if the text were "just sitting in the file" like in a TXT? How would the image position/description/data be embedded? A PDF can even contain scripts, if I remember well, similar to JavaScript. Executable. In a PDF-type document you can have buttons that do something. That's much more complicated than just text-in-a-file.

Binary files usually do not contain any plain-readable text for your eyes. They have that text structured in blocks, wrapped in metadata about colors, text layout, paging and such, or even special structures for document versioning, authoring, classification, (...). All of this has to be stored somewhere.

Usually, binary files have sections. The first section is usually called the HEADER. Inside, there will be information about format type, format version, file/block/data length, image resolution, and similar. Most of those will be kept in binary form: no "800x600" text, just "|00|00|03|20|00|00|02|58|", assuming 32-bit BE. After you have read, decoded and understood the header, you will know where the actual data starts, how the data blocks are laid out, and how to decode and understand them.

edit:

After you understand the difference between text files and binary files, check out the absolute basics at http://en.wikipedia.org/wiki/Entropy_(information_theory). Then try playing with RLE (http://www.daniweb.com/software-development/cpp/code/216388/basic-rle-file-compression-routine) or Huffman (http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html) just to start on something relatively simple. Then start reading more about Huffman codes, and then, well, you will be reasonably prepared for a task like ZIP or LZH.

quetzalcoatl
  • If you are really into compression subject, I've included links to some starting points. Have fun! – quetzalcoatl Feb 23 '13 at 17:05
  • I'm trying to implement a LZ77 algorithm so I tried to look at each individual characters....Thanks a lot! – Iam619 Feb 23 '13 at 17:06
  • To make it simple, instead of looking at characters, just look at the raw bytes of the data. There is very little difference whether you analyze a stream of text that consists of characters 0..9a..zA..Z!@#$%^&* or whether you analyze a stream of bytes 00/01/02/03/.../FE/FF. It's just not ~80 semirandom symbols but 256 semirandom symbols at the input :) – quetzalcoatl Feb 23 '13 at 17:12
  • Yeah..currently I use a for loop to get sbuffer[i] out and try to find if there is any repeat of that in the previous context, as the LZ77 algorithm describes. Is this looking at the raw bytes of data? If not, then to implement the compressor, do I still need to parse the PDF into text characters? Thanks so much... – Iam619 Feb 23 '13 at 17:20
  • Yes, when using `fopen` with the `rb` flags, `fread` reads raw bytes from the file, without any translations. Your buffer will contain exactly the same as the file on disk. To compress the file, you do not need to parse it. That would usually be too complex and/or risky if a new version of the format is released. >90% of compressors just compress the file as one big block of bytes, ignoring the file format. There are some cases that actually analyze the file more thoroughly, but they are rare, and usually it is then called "a compressed format" rather than "compressing a file", i.e. Flash FWS versus CWS. – quetzalcoatl Feb 23 '13 at 23:48
  • Yeah, I've done the compressor :) I just process the block of bytes. Thanks! – Iam619 Feb 26 '13 at 20:30

To parse PDF as text, use some PDF library, such as gnupdf or poppler.

Marcel Gosselin
hyde