How to de-identify/redact Word files using GCP's DLP API in Python

Question

I am using GCP's DLP API in Python to redact images as follows, and it works fine:

def redact_image_all_text(
    project,
    filename,
    output_filename,
):
    """Uses the Data Loss Prevention API to redact all text in an image.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        filename: The path to the file to inspect.
        output_filename: The path to which the redacted image will be written.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the image_redaction_configs, indicating to DLP that all text in
    # the input image should be redacted.
    image_redaction_configs = [{"redact_all_text": True}]

    # Construct the byte_item, containing the file's byte data.
    with open(filename, mode="rb") as f:
        byte_item = {"type_": google.cloud.dlp_v2.FileType.IMAGE, "data": f.read()}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.redact_image(
        request={
            "parent": parent,
            "image_redaction_configs": image_redaction_configs,
            "byte_item": byte_item,
        }
    )

    # Write out the results.
    with open(output_filename, mode="wb") as f:
        f.write(response.redacted_image)

    print(
        "Wrote {byte_count} to {filename}".format(
            byte_count=len(response.redacted_image), filename=output_filename
        )
    )

Now I want to apply this to Word document files. I have seen a few examples using dlp.deidentify_content, but it seems to accept only text input.

    # Call the API.
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "item": contentItem,
        }
    )

So, I want to know whether Cloud DLP natively supports redaction/de-identification of Word documents. If so, how do I do it? If not, is there an elegant way to implement DLP redaction on Word documents?

Microsoft Word documents aren't natively supported by DLP. The solution here is to read the content in Python and submit it to DLP. I'm not a DOCX expert, but I think that if you read the file with its formatting tags, you can submit it to DLP and DLP won't change the formatting, only the text to redact. You would then be able to save the result as-is and keep your document formatted. Otherwise, call DLP for each document block (that solution could be less expensive, because fewer bytes are submitted for analysis once the tags are left out). – guillaume blaquiere, Aug 19 '22 at 07:48
Calling the API for each block will increase the number of API calls drastically, right? For example, calling the API for each paragraph in a long Word document or each cell in a table. How do you read a whole document with its formatting tags? I could not find any library that does that. Please let me know if you know of one. Thanks. – Akhil Kv, Aug 19 '22 at 14:29
Wouldn't I lose the formatting that way when I save it back as DOCX? – Akhil Kv, Aug 19 '22 at 17:22

score 0 · Answer 1 · answered Aug 19 '22 at 16:14

The others are right: although inspect_content does support inspecting .docx files (not .doc), de-identification does not.
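
As a rough illustration of the inspection side, a .docx file can be passed to inspect_content as a byte item. This is a minimal sketch, not a tested implementation: the helper names (build_inspect_config, inspect_docx) and the default info types are assumptions made for the example. It only reports findings; the document itself is not modified.

```python
def build_inspect_config(info_types):
    """Build an inspect_config dict for the given info type names."""
    return {
        "info_types": [{"name": name} for name in info_types],
        "include_quote": True,  # return the matched text with each finding
    }


def inspect_docx(project, filename, info_types=("PERSON_NAME", "EMAIL_ADDRESS")):
    """Inspect a .docx file and return (info type name, quote) pairs.

    Hypothetical helper: it lists findings and does not redact anything.
    """
    # Imported lazily so the config helper above works without the library.
    import google.cloud.dlp_v2

    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Send the raw file bytes and declare them as a Word document.
    with open(filename, mode="rb") as f:
        byte_item = {
            "type_": google.cloud.dlp_v2.ByteContentItem.BytesType.WORD_DOCUMENT,
            "data": f.read(),
        }

    response = dlp.inspect_content(
        request={
            "parent": f"projects/{project}",
            "inspect_config": build_inspect_config(info_types),
            "item": {"byte_item": byte_item},
        }
    )
    return [
        (finding.info_type.name, finding.quote)
        for finding in response.result.findings
    ]
```

The returned pairs could be used to decide which documents need redaction at all before choosing a heavier workflow.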

If you want to split up the document by paragraph, using the Record object and passing in each paragraph as a row will allow you to reduce your traffic.
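
One possible shape for the paragraph-per-row idea is sketched below. It makes several assumptions: the python-docx package is used to read and write the file, the column name "paragraph" and all helper names are invented for the example, and findings are replaced with their info type name. Assigning to paragraph.text in python-docx keeps paragraph-level style but collapses the paragraph into a single run, so character-level formatting (bold, italics) inside a paragraph is lost. Text inside tables is not covered by document.paragraphs, and the API enforces per-request size limits, so long documents may need to be sent in batches.

```python
def paragraphs_to_table_item(paragraphs, column="paragraph"):
    """Pack paragraph strings into a DLP table item, one row per paragraph."""
    return {
        "table": {
            "headers": [{"name": column}],
            "rows": [{"values": [{"string_value": text}]} for text in paragraphs],
        }
    }


def build_deidentify_config(column="paragraph"):
    """Replace every finding in the given column with its info type name."""
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    "fields": [{"name": column}],
                    "info_type_transformations": {
                        "transformations": [
                            {
                                "primitive_transformation": {
                                    "replace_with_info_type_config": {}
                                }
                            }
                        ]
                    },
                }
            ]
        }
    }


def redact_docx_paragraphs(
    project, filename, output_filename, info_types=("PERSON_NAME", "EMAIL_ADDRESS")
):
    """Hypothetical helper: de-identify all body paragraphs in one API call."""
    import docx  # provided by the python-docx package (assumed installed)
    import google.cloud.dlp_v2

    document = docx.Document(filename)
    # Skip empty paragraphs; they carry nothing to inspect.
    targets = [p for p in document.paragraphs if p.text.strip()]

    dlp = google.cloud.dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{project}",
            "deidentify_config": build_deidentify_config(),
            "inspect_config": {"info_types": [{"name": n} for n in info_types]},
            "item": paragraphs_to_table_item([p.text for p in targets]),
        }
    )

    # Rows come back in the order they were sent, so zip them with the
    # original paragraphs and write the redacted text back.
    for paragraph, row in zip(targets, response.item.table.rows):
        paragraph.text = row.values[0].string_value
    document.save(output_filename)
```

Batching all paragraphs into one table keeps the number of requests low, which addresses the concern about one call per paragraph.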

Jordanna Chord

Thanks for your answer. My documents contain many forms, so I think this method would not be very effective. I decided to convert the Word docs -> PDF -> images. This way I can still use the redact_image method, which is faster, and I do not have to worry about the formatting. – Akhil Kv Aug 24 '22 at 05:43
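
The Word -> PDF -> images route from the comment above could look roughly like the following. This is a sketch under stated assumptions, not the commenter's actual code: it assumes LibreOffice's soffice binary is on the PATH for the PDF conversion, that the pdf2image package (which needs poppler) renders the pages, and that only the listed info types should be blacked out. The function names and the output file naming are invented for the example.

```python
import io
import subprocess
from pathlib import Path


def soffice_convert_command(docx_path, out_dir):
    """Command line asking LibreOffice to convert a document to PDF."""
    return [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(docx_path),
    ]


def docx_to_redacted_pngs(
    project, docx_path, out_dir, info_types=("PERSON_NAME", "EMAIL_ADDRESS")
):
    """Hypothetical helper: render each page to PNG and redact it with DLP."""
    import google.cloud.dlp_v2
    from pdf2image import convert_from_path  # assumed installed, needs poppler

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: Word -> PDF via LibreOffice (assumed to be available).
    subprocess.run(soffice_convert_command(docx_path, out_dir), check=True)
    pdf_path = out_dir / (Path(docx_path).stem + ".pdf")

    dlp = google.cloud.dlp_v2.DlpServiceClient()
    names = [{"name": n} for n in info_types]
    written = []

    # Step 2: PDF -> one image per page; Step 3: redact each page image.
    for number, page in enumerate(convert_from_path(str(pdf_path)), start=1):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        response = dlp.redact_image(
            request={
                "parent": f"projects/{project}",
                "inspect_config": {"info_types": names},
                "image_redaction_configs": [{"info_type": n} for n in names],
                "byte_item": {
                    "type_": google.cloud.dlp_v2.ByteContentItem.BytesType.IMAGE_PNG,
                    "data": buffer.getvalue(),
                },
            }
        )
        target = out_dir / f"page-{number:03d}.png"
        target.write_bytes(response.redacted_image)
        written.append(target)
    return written
```

The result is a set of page images, not an editable document, which is the trade-off this approach accepts in exchange for keeping the visual layout intact.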
