I am new to using AWS Comprehend to search for PII. I can get the job to run against an S3 bucket, but I can't read the output. The output goes to another bucket that I specified, and all of the output files have .out as the extension. I was expecting output in report form, or at least the ability to open the output files and verify the PII. One example of the output is a PNG file that has the extension .png.out.
I do not want to redact the PII at this point. I just want to identify it. Any help would be appreciated.

Lele
  • Have you tried opening the `.out` file in a text editor? I think you'll find that it contains the results in JSON format. – John Rotenstein Aug 12 '22 at 23:54
  • Yes. I did open one file in notepad. I will go back and review. Thank you! – Lele Aug 13 '22 at 00:45
  • Thanks again John. I looked at the output files again. I thought they didn’t make sense at first because there are several rows with no score. Do you know how to filter out rows/files with no score? – Lele Aug 14 '22 at 21:54
  • Please show the output (or a relevant sample of it) in your Question and let us know what you are seeking (eg what rows/information you want out of it). – John Rotenstein Aug 14 '22 at 22:27
  • Hi John. I now realize that each output file contains a line-by-line analysis of the input file, with a score for each line. This is not what I want. I have an S3 bucket containing several folders and files of varying types. I would like to search for PII in all files in my bucket and have one report generated with file path/name, score, and PII type. Is that possible with Comprehend? – Lele Aug 15 '22 at 23:56
  • I believe that [Amazon Macie](https://docs.aws.amazon.com/macie/latest/user/managed-data-identifiers.html) can do that, but it can be a particularly expensive service since it works across entire buckets (not sub-paths). – John Rotenstein Aug 16 '22 at 02:56
  • I read about Macie. I do not think I have access to it. Thank you for your help! – Lele Aug 16 '22 at 11:57
  • I found some information that indicates I might be able to get one report from multiple input files. It involves using an API to do the processing in batch. I think I can use one called StartEntitiesDetection but do not know what service to select from the main console to modify it and run it. Would it be the API Gateway? – Lele Aug 18 '22 at 13:28
  • It is a function of Amazon Comprehend: [StartEntitiesDetectionJob - Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html) "When the topic detection job is finished, the service creates an output file in a directory specific to the job. The S3Uri field contains the location of the output file, called `output.tar.gz` . It is a compressed archive that contains the output of the operation." – John Rotenstein Aug 18 '22 at 22:52
  • Thanks again John. Right now, I am looking at creating a Lambda function to call the Comprehend detection job in batch. – Lele Aug 19 '22 at 11:27
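
For the single report asked about in the comments (file name, score, PII type), one possible approach is to download the `.out` files and merge them locally. The sketch below is only an illustration and assumes each `.out` file holds one JSON object per line, with an `Entities` list (each entry carrying `Type` and `Score`) plus `File` and `Line` keys; check a real output file before relying on that layout. The function names `pii_rows` and `build_report` are hypothetical, not part of any AWS SDK. Lines with no entities, which look like "rows with no score", are simply skipped.

```python
import csv
import io
import json


def pii_rows(source_name, out_text, min_score=0.0):
    """Yield (file, line, pii_type, score) for every entity in one .out file.

    Assumed layout: one JSON object per line, each with an "Entities" list.
    """
    for raw in out_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON (for example, output produced from a binary input): skip it.
            continue
        if not isinstance(record, dict):
            continue
        for entity in record.get("Entities", []):
            score = entity.get("Score")
            if score is None or score < min_score:
                continue  # drop entities with no score or a low score
            yield (
                record.get("File", source_name),  # fall back to the .out file name
                record.get("Line"),
                entity.get("Type"),
                score,
            )


def build_report(outputs, min_score=0.0):
    """Merge several .out files (name -> text) into one CSV report string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "line", "pii_type", "score"])
    for name, text in outputs.items():
        writer.writerows(pii_rows(name, text, min_score))
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample data in the assumed layout, not real job output.
    sample = {
        "notes.txt.out": (
            '{"Entities": [{"BeginOffset": 0, "EndOffset": 8, '
            '"Type": "NAME", "Score": 0.99}], "File": "notes.txt", "Line": 0}\n'
            '{"Entities": [], "File": "notes.txt", "Line": 1}\n'
        )
    }
    print(build_report(sample, min_score=0.5))
```

In a Lambda function, the `outputs` dictionary could be filled by reading each `.out` object from the output bucket; that step is left out here so the sketch runs without AWS credentials.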

0 Answers