I am fairly new to AWS Comprehend. I know that Comprehend can run custom classification on documents (text files). Can it also classify image files? And when training the model, is it necessary to put the entire document text in the CSV, or will just keywords do?

I ask because I want to build a custom classifier that can classify invoices, pay stubs, and a few other such document types, all of which are in image formats. Can Comprehend do this? If so, how?

I googled quite a lot but couldn't find anything relevant. I'd really appreciate your help with this.

Thank you!

John Rotenstein
mounika
  • Hi! I think you will need to use Amazon SageMaker and either build and train your own model or use an existing model from the marketplace to achieve something like this: https://aws.amazon.com/marketplace/pp/prodview-pdnyjzczg6nvm – Jaime S Apr 06 '20 at 16:02
  • Thanks, @JaimeS. Apart from SageMaker, does Comprehend have the capability to classify images, or is it limited to text? – mounika Apr 06 '20 at 16:55
  • https://aws.amazon.com/blogs/machine-learning/building-a-custom-classifier-using-amazon-comprehend/ – Channa Feb 06 '21 at 16:34

3 Answers


Comprehend doesn't do this natively, so you would have to build a solution. One approach is to use Amazon Textract to extract the text from the document images and then pass that text to Comprehend for classification.

The Textract FAQ calls this out as a common use case. I couldn't find a complete example of someone doing it, but the documentation mentions it directly.
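A minimal sketch of that pipeline with boto3 could look like the following. It assumes a custom classifier has already been trained in multi-class mode and deployed to a real-time endpoint, and that AWS credentials are configured. The function names and the `endpoint_arn` argument are illustrative, not part of any AWS sample.

```python
def extract_text(textract, image_bytes):
    """Return the text of every LINE block Textract finds in an image."""
    response = textract.detect_document_text(Document={"Bytes": image_bytes})
    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)


def classify_text(comprehend, text, endpoint_arn):
    """Return (label, score) for the highest-scoring class."""
    response = comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)
    # Multi-class classifiers return their results under "Classes".
    best = max(response["Classes"], key=lambda c: c["Score"])
    return best["Name"], best["Score"]


def classify_image(path, endpoint_arn):
    """Run an image file (for example JPEG or PNG) through both services."""
    # Imported here so the helpers above can be tried without AWS access.
    import boto3

    with open(path, "rb") as f:
        image_bytes = f.read()
    text = extract_text(boto3.client("textract"), image_bytes)
    return classify_text(boto3.client("comprehend"), text, endpoint_arn)
```

Passing the clients in as arguments keeps the two steps separate, so the extraction step can be swapped for another OCR tool without touching the classification call.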

jgray
  • I used a combination of Apache Tika and Amazon Textract to extract data from Image files and created a CSV file from the extracted content of all the files. I then used the CSV file as an input to the classifier. It worked as expected. Thank you! – mounika Apr 20 '20 at 14:29

Amazon Comprehend only works on text.

Amazon Rekognition works on images.

John Rotenstein

AWS has all the building blocks to accomplish this, but you will have to configure and build the solution yourself. You can use Amazon Textract to extract all the text from a document, and then pass that text to Amazon Comprehend to classify the document type.

Before you can do this, you need to train a custom classifier in Comprehend so that it can identify your document types. You supply a CSV file in which each row holds a classification (for example, the document type) followed by text that would appear in that type of document. If the documents are just forms, you can use Textract's Forms feature to get only the key-value pairs, and then use the keys (the labels in the form) as the text for the custom classifier.
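As a rough sketch of the training-data step, the helpers below pull the form keys out of the `Blocks` list returned by Textract's `analyze_document` call (made with `FeatureTypes=["FORMS"]`) and write `label,text` rows with no header row. The names `form_keys` and `write_training_csv` are hypothetical, and collapsing whitespace so each document sits on one line is an assumption made here for readability.

```python
import csv


def form_keys(blocks):
    """Return the text of each form key (field label) in Textract blocks."""
    by_id = {block["Id"]: block for block in blocks}
    keys = []
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        words = []
        for rel in block.get("Relationships", []):
            # CHILD relationships point at the WORD blocks that make up the key.
            if rel["Type"] == "CHILD":
                words += [
                    by_id[i]["Text"]
                    for i in rel["Ids"]
                    if by_id[i].get("BlockType") == "WORD"
                ]
        keys.append(" ".join(words))
    return keys


def write_training_csv(examples, out):
    """Write (label, text) pairs as CSV rows: label first, then the text."""
    writer = csv.writer(out, lineterminator="\n")
    for label, text in examples:
        # Collapse newlines and repeated spaces so each document is one line.
        writer.writerow([label, " ".join(text.split())])
```

For a form, one training row could then be built as `("INVOICE", " ".join(form_keys(blocks)))`, where `INVOICE` is whatever label you choose for that document type.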