
Assuming I have a file with a .doc extension on the Windows platform, how can I open the file and output its contents on the screen using the ofstream object in C++? I am aware that the object can be used to open files in text and binary modes, but I would like to know whether a .doc (or even .pdf) file can be opened and its contents read.

user1832196
  • Sure, they can be opened and read. But perhaps you're interested in parsing a `doc` file? You can read the bits, but it's up to you, the programmer, to understand the bits (or use a library that will understand the bits for you). – Cornstalks Nov 17 '12 at 16:55
  • For starters, you would need to use `ifstream`, not `ofstream`... – Yakov Galka Nov 17 '12 at 16:56
  • When you want to output a binary file to stdout, you need to convert it to Base64, because it could contain NUL bytes, which would terminate the output string. – Saddam Abu Ghaida Nov 17 '12 at 16:57

2 Answers


I've never actually done this before, but after reading up on it, I think I might have a suggestion. The .docx format is actually just XML that is zipped up (the older .doc format is a different, binary format, so this applies to .docx only). After unzipping, the main document body is located at word/document.xml. Doing this in a program is where it gets fun.

Two options: If you're using C++/CLI (.NET), then Microsoft has an SDK for you (the Open XML SDK). It should make it pretty easy to open Office documents.

Otherwise if you're just using regular C++, you might have to do some extra work.

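  1. Open the file and unzip it using a ZIP library such as minizip (built on zlib) or libzip; zlib on its own only handles the compression, not the ZIP archive container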
  2. Find the document.xml file inside
  3. Parse the XML document. You'll probably want to use some kind of XML parsing library for this. You'll have to look up the specs for the XML to figure out how to get the text you want.
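
To give an idea of step 3, here is a rough sketch, assuming steps 1 and 2 are already done and the contents of word/document.xml are sitting in a string. In WordprocessingML the visible text lives in `<w:t>` elements, so the sketch simply collects those. The function name `extract_text` is made up for illustration, and this is a naive string scan rather than a real XML parser: entities such as `&amp;` are not decoded and paragraphs are not separated.

```cpp
#include <cstddef>
#include <string>

// Collect the character data of every <w:t> element found in the text of an
// already unzipped word/document.xml. Hypothetical helper for illustration:
// a naive scan, not a substitute for a proper XML parsing library.
std::string extract_text(const std::string& xml) {
    std::string out;
    std::size_t pos = 0;
    while ((pos = xml.find("<w:t", pos)) != std::string::npos) {
        const std::size_t after = pos + 4;  // index just past "<w:t"
        if (after >= xml.size()) break;
        const char c = xml[after];
        // Accept only <w:t> and <w:t ...>; skip <w:tab/>, <w:tbl>, <w:tc>, <w:t/>.
        if (c != '>' && c != ' ') { pos = after; continue; }
        const std::size_t open_end = xml.find('>', after);
        if (open_end == std::string::npos) break;
        // A self-closing <w:t .../> carries no text.
        if (xml[open_end - 1] == '/') { pos = open_end + 1; continue; }
        const std::size_t close = xml.find("</w:t>", open_end + 1);
        if (close == std::string::npos) break;
        out.append(xml, open_end + 1, close - open_end - 1);
        pos = close + 6;  // continue after "</w:t>"
    }
    return out;
}
```

You would call it as `extract_text(xml_contents)` and print the result with `std::cout`. For anything beyond a quick experiment, a real XML library is the safer choice.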
austin

The C++ standard library has the ifstream class, which can be used to read simple text files and binary files as well.

It is up to you to interpret the bytes in the file. To properly interpret a binary file, you need to know its format.
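
As a small illustration of reading raw bytes and then interpreting them, the sketch below opens a file with `std::ifstream` in binary mode and guesses the container type from its first bytes. The helper names `read_prefix` and `guess_format` are invented for this example. The signatures checked are the well-known ones: legacy .doc files are OLE compound files starting with D0 CF 11 E0 A1 B1 1A E1, .docx files are ZIP archives starting with "PK\x03\x04", and PDF files start with "%PDF-". A match only identifies the container, not that the file really is a Word document.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read up to max_bytes from the start of a file. Binary mode avoids the
// newline translation that text mode performs on Windows.
// Hypothetical helper; returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_prefix(const std::string& path, std::size_t max_bytes) {
    std::vector<unsigned char> buf(max_bytes);
    std::ifstream in(path, std::ios::binary);
    if (!in) { buf.clear(); return buf; }
    in.read(reinterpret_cast<char*>(buf.data()), static_cast<std::streamsize>(max_bytes));
    buf.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return buf;
}

// Guess the container format from the leading signature bytes.
std::string guess_format(const std::vector<unsigned char>& head) {
    static const unsigned char ole[] = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    static const unsigned char zip[] = {'P', 'K', 0x03, 0x04};
    static const unsigned char pdf[] = {'%', 'P', 'D', 'F', '-'};
    auto starts_with = [&head](const unsigned char* sig, std::size_t n) {
        return head.size() >= n && std::equal(sig, sig + n, head.begin());
    };
    if (starts_with(ole, sizeof ole)) return "ole";  // e.g. legacy .doc
    if (starts_with(zip, sizeof zip)) return "zip";  // e.g. .docx
    if (starts_with(pdf, sizeof pdf)) return "pdf";
    return "unknown";
}
```

For example, `guess_format(read_prefix("report.doc", 8))` (with `report.doc` standing in for your own file) would tell you whether you are dealing with the old binary format or with a renamed .docx.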

If you have MS Word files in mind, I would start here: http://en.wikipedia.org/wiki/Office_Open_XML to understand the MS Word 2007 format.

You might find the Boost Iostreams library ( http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/home.html ) somewhat useful if you want to write a filter yourself.

PiotrNycz