I just put together a small script for a team of users that collects all PDF and DOC* files in a directory and parses them for hyperlinks. The PDF section works as intended; however, the Word doc I was given for design (plain text) differs from the actual Word documents they are using (the text is in a TextBox element).
I noticed that when I tried to gather sentences/words from these new files, all I received was the text for the background image of the file (normally a special character).
I have browsed through the API and tried quite a few methods listed in ole_methods, but have not yet found a way to access the TextBox to pull the required text out of it.
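For anyone following along, here is a minimal sketch of how the ole_methods list can be narrowed down while exploring. The helper name ole_method_names and the filter pattern are made up for illustration; only ole_methods and the name of each returned method object come from win32ole itself.

```ruby
# Hypothetical helper: list the names of the OLE methods an object exposes,
# optionally filtered by a pattern. `obj` is assumed to be a WIN32OLE object,
# for example the document returned by word.Documents.Open(file).
def ole_method_names(obj, pattern = //)
  obj.ole_methods.map { |m| m.name }.grep(pattern).uniq.sort
end

# Example use against a real document (requires Windows and Word):
#   puts ole_method_names(doc, /text|shape/i)
```

Filtering by a keyword keeps the output manageable, since the Document object exposes several hundred members.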
I know that I can convert the Word files to PDF and shortcut it that way (tested and proven), but that entails quite a bit of file management that I'd like to avoid in favor of the simpler solution: accessing the text directly.
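To show what the conversion route involves, here is a rough sketch rather than the exact code I tested. It assumes Word can save as PDF (as far as I know, Word 2007 needs SP2 or the Save as PDF add-in for this); the helper names pdf_path_for and convert_to_pdf are hypothetical, and 17 is the numeric value of Word's wdFormatPDF constant.

```ruby
WD_FORMAT_PDF = 17 # numeric value of Word's wdFormatPDF constant

# Hypothetical helper: swap the file extension for .pdf.
def pdf_path_for(doc_path)
  doc_path.sub(/\.[^.\\\/]+\z/, '') + '.pdf'
end

# Hypothetical helper: open a document in a running Word instance, save a
# PDF copy next to it, and return the PDF path. `word` is assumed to be
# WIN32OLE.new('Word.Application').
def convert_to_pdf(word, doc_path)
  doc = word.Documents.Open(doc_path)
  pdf_path = pdf_path_for(doc_path)
  doc.SaveAs(pdf_path, WD_FORMAT_PDF)
  pdf_path
ensure
  doc.Close(false) if doc # close without saving changes to the original
end
```

Every converted file then has to be tracked and cleaned up afterwards, which is the file management I'd rather avoid.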
You can replicate the element in a document using the Draw Text Box function (Word 2007+).
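If you would rather build a test file from a script than by hand, something like the sketch below should produce the same kind of element. The helper name add_text_box and the default position and size (in points) are my own placeholders; Shapes.AddTextbox and TextFrame.TextRange.Text are part of the Word object model, and 1 is the value of msoTextOrientationHorizontal.

```ruby
MSO_TEXT_ORIENTATION_HORIZONTAL = 1 # value of msoTextOrientationHorizontal

# Hypothetical helper: add a text box containing `text` to an open Word
# document and return the new shape. `doc` is assumed to be a document
# object obtained through WIN32OLE.
def add_text_box(doc, text, left: 72, top: 72, width: 200, height: 50)
  shape = doc.Shapes.AddTextbox(MSO_TEXT_ORIENTATION_HORIZONTAL,
                                left, top, width, height)
  shape.TextFrame.TextRange.Text = text
  shape
end

# Example use (requires Windows and Word):
#   word = WIN32OLE.new('Word.Application')
#   doc  = word.Documents.Add
#   add_text_box(doc, 'See http://example.com for details.')
```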
Does anyone know how to access this element, or better yet find ALL text in the document regardless of what element it is located in?
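require 'win32ole'

file = 'C:\path\to\document.docx' # placeholder; Word expects an absolute path
word = WIN32OLE.new('Word.Application')
doc = word.Documents.Open(file)
doc.Sentences.each { |x| puts x.Text } # prints body text only; TextBox contents are missing
doc.Close(false) # close without saving
word.Quit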
- Adam