I am using GCP's DLP API in Python to redact images as follows, and it works fine:
def redact_image_all_text(
    project,
    filename,
    output_filename,
):
    """Uses the Data Loss Prevention API to redact all text in an image.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        filename: The path to the file to inspect.
        output_filename: The path to which the redacted image will be written.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the image_redaction_configs, indicating to DLP that all text in
    # the input image should be redacted.
    image_redaction_configs = [{"redact_all_text": True}]

    # Construct the byte_item, containing the file's byte data.
    with open(filename, mode="rb") as f:
        byte_item = {"type_": google.cloud.dlp_v2.FileType.IMAGE, "data": f.read()}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.redact_image(
        request={
            "parent": parent,
            "image_redaction_configs": image_redaction_configs,
            "byte_item": byte_item,
        }
    )

    # Write out the results.
    with open(output_filename, mode="wb") as f:
        f.write(response.redacted_image)
    print(
        "Wrote {byte_count} to {filename}".format(
            byte_count=len(response.redacted_image), filename=output_filename
        )
    )
Now I want to apply the same kind of redaction to Word documents (.docx files). I have seen a few examples that use dlp.deidentify_content, but it seems to accept only text input:
# Call the API
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "item": contentItem,
    }
)
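For context, the snippet above leaves deidentify_config and contentItem undefined. A minimal sketch of how the text-only call is typically assembled is shown below. The helper names build_deidentify_request and deidentify_text, and the choice of the EMAIL_ADDRESS and PHONE_NUMBER infoTypes, are assumptions made for illustration and are not taken from the examples referred to above. The sketch replaces each finding with the name of its infoType.

```python
def build_deidentify_request(
    project, text, info_types=("EMAIL_ADDRESS", "PHONE_NUMBER")
):
    """Build the request dict for dlp.deidentify_content on plain text.

    The function name and default info_types are illustrative assumptions.
    """
    # Tell DLP which kinds of sensitive data to look for.
    inspect_config = {"info_types": [{"name": name} for name in info_types]}

    # Replace every finding with the name of its infoType,
    # e.g. an email address becomes "[EMAIL_ADDRESS]".
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }

    return {
        "parent": f"projects/{project}",
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        # A plain-text content item; this is the "text input" case.
        "item": {"value": text},
    }


def deidentify_text(project, text):
    """Send the text to the DLP API and return the de-identified string."""
    # Imported here so that building the request does not need the client library.
    import google.cloud.dlp

    dlp = google.cloud.dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request=build_deidentify_request(project, text)
    )
    return response.item.value
```

Calling deidentify_text("my-project", "some text") requires valid Google Cloud credentials and the google-cloud-dlp package; "my-project" is a placeholder project id. Note that the item here carries a string value, which is why this path does not directly accept a .docx file.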
So, I want to know whether Cloud DLP natively supports redaction or de-identification of Word documents. If so, how do I do it? If not, is there an elegant way to implement DLP redaction on Word documents?