I just put together a small script for a team of users that collects all PDF and DOC* files in a directory and parses them for hyperlinks. The PDF section works as intended. However, the Word doc I was given for design (plain text) differs from the actual Word documents they are using (the text is in a TextBox element).

I noticed that when I tried to gather sentences/words from these new files, all I received was the text for the background image of the file (normally a special character).

I have browsed through the API and tried quite a few methods listed in ole_methods, but have not yet found a way to access the TextBox to pull the required text out of it.

I know that I can convert the Word files to PDF and shortcut it that way (tested and proven), but that entails quite a bit of file management that I'd like to avoid in favor of the simpler solution: accessing the text directly.

You can replicate the element in a document using the Draw Text Box function (Word 2007+).

Does anyone know how to access this element, or better yet find ALL text in the document regardless of what element it is located in?

require 'win32ole'
word = WIN32OLE.new('Word.Application')
doc = word.Documents.Open(file)
doc.Sentences.each { |x| puts x.text }
adam reed

1 Answer

Assuming that something equivalent to doc.Sentences.each { |x| puts x.text } but for textboxes will suffice, then this should work for you:

doc.Shapes.each do |x|
  puts x.TextFrame.TextRange.text
end

It looks quite a bit messier than how you went through the sentences, but the x.TextFrame.TextRange.text will return the actual text contained in the text boxes.
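
If the document also contains shapes that hold no text (pictures, lines, empty boxes), reading TextRange on them may raise an error. A minimal sketch of a guarded version follows, assuming Word's TextFrame.HasText property reports whether a frame contains text. The helper name `textbox_texts` is hypothetical and not part of WIN32OLE or Word; it only needs an object that responds to `Shapes`, so it can be tried without Word by passing a stand-in object.

```ruby
# Sketch: collect the text of every shape that actually holds text.
# `textbox_texts` is a hypothetical helper name, not a WIN32OLE or Word API.
def textbox_texts(doc)
  texts = []
  doc.Shapes.each do |shape|
    frame = shape.TextFrame
    # HasText is expected to be non-zero (e.g. -1 or true) only when the
    # frame contains text; shapes without text are skipped.
    next if [0, false, nil].include?(frame.HasText)
    texts << frame.TextRange.Text
  end
  texts
end

# Usage on Windows with Word installed (assumes `file` is an absolute path):
#   require 'win32ole'
#   word = WIN32OLE.new('Word.Application')
#   doc  = word.Documents.Open(file)
#   textbox_texts(doc).each { |t| puts t }
#   doc.Close(false)   # close without saving
#   word.Quit
```

Returning an array instead of printing directly keeps the text available for the hyperlink parsing step that the rest of the script performs.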

Paul Hoffer