The response you get when detecting text with the Vision API (TextAnnotation) is structured as TextAnnotation -> Page -> Block (text block, table block, etc.) -> Paragraph -> Word -> Symbol. The only additional text properties (TextProperty) attached to these elements are the detected languages and the detected break (space, hyphen, line break). The Vision API therefore cannot predict anything as specific as the "Title" of a document. See the TextAnnotation reference.
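To make the hierarchy concrete, here is a minimal sketch that walks a `fullTextAnnotation` object as it appears in the JSON (REST) response and rebuilds the text of each paragraph from its symbols and detected breaks. The function name `paragraphs_from_annotation`, the `sample` payload, and the choice to render a `HYPHEN` break as a hyphen plus newline are my own assumptions for illustration; only the field names (`pages`, `blocks`, `paragraphs`, `words`, `symbols`, `property.detectedBreak.type`) come from the API.

```python
from typing import List

# How each detected break type is rendered when rebuilding the text.
# The break type names are the API's; the rendering is an illustrative choice.
_BREAK_TO_TEXT = {
    "SPACE": " ",
    "SURE_SPACE": " ",
    "EOL_SURE_SPACE": "\n",
    "LINE_BREAK": "\n",
    "HYPHEN": "-\n",
}


def paragraphs_from_annotation(full_text_annotation: dict) -> List[str]:
    """Return the text of every paragraph, in reading order."""
    paragraphs = []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                parts = []
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        parts.append(symbol.get("text", ""))
                        # The break, if any, follows the symbol it is attached to.
                        break_type = (
                            symbol.get("property", {})
                            .get("detectedBreak", {})
                            .get("type")
                        )
                        parts.append(_BREAK_TO_TEXT.get(break_type, ""))
                paragraphs.append("".join(parts).strip())
    return paragraphs


# Hand-written stand-in for a real response (not actual API output).
sample = {
    "pages": [{
        "blocks": [{
            "blockType": "TEXT",
            "paragraphs": [{
                "words": [
                    {"symbols": [
                        {"text": "H"},
                        {"text": "i", "property": {"detectedBreak": {"type": "SPACE"}}},
                    ]},
                    {"symbols": [
                        {"text": "a"},
                        {"text": "l"},
                        {"text": "l", "property": {"detectedBreak": {"type": "LINE_BREAK"}}},
                    ]},
                ]
            }]
        }]
    }]
}

print(paragraphs_from_annotation(sample))
```

Nothing in this structure marks a paragraph as a heading or title; all you get back is text, layout and break information.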
If you want to predict something as specific as the "Title" of a document/image, I suggest using AutoML Vision, where you can train a model to predict the "Title" from a set of properly labeled documents/images. Once training is done, you can send a prediction request to the model to get the "Title".
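As a rough sketch of the prediction step, assuming the `google-cloud-automl` Python client library is installed, credentials are configured, and a model has already been trained and deployed in `us-central1`: the helper names `model_resource_name` and `predict_labels`, and the default threshold, are hypothetical and only for illustration.

```python
def model_resource_name(project_id: str, model_id: str,
                        location: str = "us-central1") -> str:
    """Build the full resource name that the predict call expects."""
    return f"projects/{project_id}/locations/{location}/models/{model_id}"


def predict_labels(project_id: str, model_id: str, image_path: str,
                   score_threshold: float = 0.5) -> list:
    """Send one image to a trained model and return the predicted label names."""
    # Imported here so the module loads even without the client library installed.
    from google.cloud import automl

    client = automl.PredictionServiceClient()

    with open(image_path, "rb") as image_file:
        content = image_file.read()

    payload = automl.ExamplePayload(image=automl.Image(image_bytes=content))
    request = automl.PredictRequest(
        name=model_resource_name(project_id, model_id),
        payload=payload,
        # Only return predictions scoring at least this value.
        params={"score_threshold": str(score_threshold)},
    )
    response = client.predict(request=request)

    # Each result carries the label it was trained with, e.g. "Title".
    return [result.display_name for result in response.payload]
```

You would call it as `predict_labels("my-project", "my-model-id", "page.png")`, where all three arguments are placeholders for your own values.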
You can refer to this document for an example of how to prepare a dataset, train a model, and make predictions.