I've ported my Java app from the V2Beta version of the DLP API to V2, and the results coming back seem less "accurate" than with V2Beta.

Names, addresses, zip codes, ages, etc. don't get de-identified at all. The results I'm seeing with the V2 API are very different from what I was getting with the V2Beta API. Maybe I'm doing something wrong? Given the input "Hello Mr. John S. Smith! This is Mr. Jones writing back with my SSN: 911-87-9111", the only thing that gets de-identified is the SSN digits. I would have expected the names to be de-identified as well.

I'm using Spring to inject things like the credentials, and there are some Lombok annotations to simplify my life, but the bulk of the code should be pretty straightforward:

import com.google.api.gax.core.CredentialsProvider;
import com.google.cloud.ProjectName;
import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.cloud.dlp.v2.DlpServiceSettings;
import com.google.privacy.dlp.v2.CharacterMaskConfig;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InfoTypeTransformations.InfoTypeTransformation;
import com.google.privacy.dlp.v2.PrimitiveTransformation;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Table.Row;
import com.google.privacy.dlp.v2.Value;
import lombok.AccessLevel;
import lombok.Setter;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Collection;
import java.util.LinkedList;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import static org.apache.commons.lang3.StringUtils.isNotBlank;
import static org.springframework.util.CollectionUtils.isEmpty;

@Service("DeIdentifyTest")
@FieldDefaults(level = AccessLevel.PRIVATE)
@Setter
@Slf4j
public class DeIdentifyTest {
    final DlpServiceSettings settings;
    final String projectId;

    @SneakyThrows
    public DeIdentifyTest(CredentialsProvider credentialsProvider, String projectId) {
        this.settings = DlpServiceSettings.newBuilder().setCredentialsProvider(credentialsProvider).build();
        this.projectId = projectId;
    }

    public CompletableFuture<Collection<String>> redact(final Collection<String> input,
                                                            final String mask) {
        return CompletableFuture.supplyAsync(() -> redactContent(input, mask));
    }

    @SneakyThrows
    private Collection<String> redactContent(Collection<String> input, String mask) {
        log.debug("Input: {}", input);

        if (isEmpty(input)) {
            return input;
        }

        CharacterMaskConfig characterMaskConfig =
                CharacterMaskConfig.newBuilder().setMaskingCharacter(mask).build();

        PrimitiveTransformation primitiveTransformation =
                PrimitiveTransformation.newBuilder().setCharacterMaskConfig(characterMaskConfig).build();

        InfoTypeTransformation infoTypeTransformationObject =
                InfoTypeTransformation.newBuilder().setPrimitiveTransformation(primitiveTransformation).build();

        InfoTypeTransformations infoTypeTransformationArray =
                InfoTypeTransformations.newBuilder().addTransformations(infoTypeTransformationObject).build();

        DeidentifyConfig deidentifyConfig =
                DeidentifyConfig.newBuilder().setInfoTypeTransformations(infoTypeTransformationArray).build();

        try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
            // Create the deidentification request object
            DeidentifyContentRequest request =
                    DeidentifyContentRequest.newBuilder()
                            .setParent(ProjectName.of(projectId).toString())
                            .setDeidentifyConfig(deidentifyConfig)
                            .setItem(createContentItemWithTable(input))
                            .build();

            // Execute the deidentification request
            DeidentifyContentResponse response = dlpClient.deidentifyContent(request);
            Table table = response.getItem().getTable();

            // Flatten every cell of every returned row into one list of strings
            return table.getRowsList().stream()
                            .flatMap(row -> row.getValuesList().stream())
                            .map(val -> val.getStringValue())
                            .collect(Collectors.toCollection(LinkedList::new));
        }
    }

    private ContentItem createContentItemWithTable(Collection<String> input) {
        Table.Builder tableBuilder = Table.newBuilder().addHeaders(FieldId.newBuilder().setName("unused").build());

        // Add one single-column row per non-blank input string.
        // Building the table directly avoids Optional.get() failing when every item is blank.
        input.stream()
                .filter(item -> isNotBlank(item))
                .map(item -> Value.newBuilder().setStringValue(item).build())
                .map(value -> Row.newBuilder().addValues(value).build())
                .forEach(row -> tableBuilder.addRows(row));

        return ContentItem.newBuilder().setTable(tableBuilder.build()).build();
    }
}
user2337270

1 Answer

Your example doesn't show which InfoTypes you are asking the API to detect. The main thing that changed in V2 is that there is no longer a default list of detectors: you must state explicitly what you are looking for.

See https://cloud.google.com/dlp/docs/infotypes-reference for the entire list.
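
In the Java client, that means adding an InspectConfig to the DeidentifyContentRequest. Here is a minimal sketch, assuming the google-cloud-dlp V2 client library is on the classpath. The class name DeidentifyRequestSketch and the helper buildRequest are made up for illustration, and the infoType names are supplied by the caller. The code only builds the request and does not send it.

```java
import com.google.privacy.dlp.v2.CharacterMaskConfig;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InfoTypeTransformations.InfoTypeTransformation;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.PrimitiveTransformation;

import java.util.List;

class DeidentifyRequestSketch {

    // Hypothetical helper: builds (but does not send) a de-identify request.
    static DeidentifyContentRequest buildRequest(String projectId, String text,
                                                 String mask, List<String> infoTypeNames) {
        // What to look for: V2 has no default detector list, so name each infoType.
        InspectConfig.Builder inspectConfig = InspectConfig.newBuilder();
        for (String name : infoTypeNames) {
            inspectConfig.addInfoTypes(InfoType.newBuilder().setName(name).build());
        }

        // What to do with the findings: replace matched characters with the mask.
        PrimitiveTransformation maskTransformation = PrimitiveTransformation.newBuilder()
                .setCharacterMaskConfig(
                        CharacterMaskConfig.newBuilder().setMaskingCharacter(mask).build())
                .build();

        // No infoTypes on the transformation, so it applies to every finding.
        DeidentifyConfig deidentifyConfig = DeidentifyConfig.newBuilder()
                .setInfoTypeTransformations(InfoTypeTransformations.newBuilder()
                        .addTransformations(InfoTypeTransformation.newBuilder()
                                .setPrimitiveTransformation(maskTransformation)
                                .build())
                        .build())
                .build();

        return DeidentifyContentRequest.newBuilder()
                .setParent("projects/" + projectId)
                .setInspectConfig(inspectConfig.build())
                .setDeidentifyConfig(deidentifyConfig)
                .setItem(ContentItem.newBuilder().setValue(text).build())
                .build();
    }
}
```

The resulting request can be passed to dlpClient.deidentifyContent(request) exactly as in the question. The only structural difference from the question's code is the setInspectConfig call.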

If I send this

 {
 "item": {
  "value": "Hello Mr. John S. Smith! This is Mr. Jones writing back with my SSN: 509-03-2530"
 },
 "inspectConfig": {
  "includeQuote": true,
  "infoTypes": [
   {
    "name": "PERSON_NAME"
   },
   {
    "name": "US_SOCIAL_SECURITY_NUMBER"
   }
  ]
 }
}

I get

{
 "result": {
  "findings": [
   {
    "quote": "Mr. John S. Smith",
    "infoType": {
     "name": "PERSON_NAME"
    },
    "likelihood": "LIKELY",
    "location": {
     "byteRange": {
      "start": "6",
      "end": "23"
     },
     "codepointRange": {
      "start": "6",
      "end": "23"
     }
    },
    "createTime": "2018-05-21T16:11:54.449Z"
   },
   {
    "quote": "Jones",
    "infoType": {
     "name": "PERSON_NAME"
    },
    "likelihood": "POSSIBLE",
    "location": {
     "byteRange": {
      "start": "37",
      "end": "42"
     },
     "codepointRange": {
      "start": "37",
      "end": "42"
     }
    },
    "createTime": "2018-05-21T16:11:54.449Z"
   },
   {
    "quote": "509-03-2530",
    "infoType": {
     "name": "US_SOCIAL_SECURITY_NUMBER"
    },
    "likelihood": "LIKELY",
    "location": {
     "byteRange": {
      "start": "69",
      "end": "80"
     },
     "codepointRange": {
      "start": "69",
      "end": "80"
     }
    },
    "createTime": "2018-05-21T16:11:54.425Z"
   }
  ]
 }
}
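
To make the findings above concrete: each codepointRange is a start offset plus an exclusive end offset into the input (6 to 23 covers the 17 characters of "Mr. John S. Smith"). The sketch below applies a character mask to those ranges locally, using only the JDK. It is only an illustration of what the ranges mean, not a reproduction of the service's masking. The class MaskFindingsSketch and its Range holder are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class MaskFindingsSketch {

    // Hypothetical holder for one finding's codepointRange (end is exclusive).
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Replaces every code point inside each range with the mask character.
    static String mask(String text, List<Range> ranges, char maskChar) {
        int[] codePoints = text.codePoints().toArray();
        for (Range r : ranges) {
            for (int i = r.start; i < r.end && i < codePoints.length; i++) {
                codePoints[i] = maskChar;
            }
        }
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String input = "Hello Mr. John S. Smith! This is Mr. Jones writing back "
                + "with my SSN: 509-03-2530";
        // Offsets copied from the three findings in the response above.
        List<Range> findings = Arrays.asList(
                new Range(6, 23), new Range(37, 42), new Range(69, 80));
        System.out.println(mask(input, findings, '#'));
    }
}
```

Only the spans the inspection step reported are touched, which is why nothing is masked when the request carries no InspectConfig.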
Jordanna Chord
  • If I understand the documentation correctly, omitting the InfoType list in the Java API call should default to using all available InfoTypes. I also tried specifically downloading all InfoTypes from [here](https://cloud.google.com/dlp/docs/reference/rest/v2/infoTypes/list) and adding them, but then I got exceptions left and right from the API about InfoTypes not being supported. I had to remove stuff like FIRST_NAME, LAST_NAME, etc before the call succeeded. The result was essentially the same as when I omitted the InfoTypes. – user2337270 May 21 '18 at 18:12
  • Omitting the types does not include all of them. I just noticed there is a bug in the Javadoc (https://github.com/GoogleCloudPlatform/google-cloud-java/issues/3293), so use the docs on cloud.google.com/dlp as the source of truth. – Jordanna Chord May 21 '18 at 18:16
  • Basically I want to redact as much as possible. Names, street names, addresses, zip codes, phone numbers, age, etc. A small Java example for how to achieve this would go a long way. – user2337270 May 21 '18 at 18:23
  • Am I misunderstanding this? "A primitive transformation (the PrimitiveTransformation object). Note that specifying an infoType is optional, and if not specified, the API will match all available infoTypes." – user2337270 May 21 '18 at 18:25
  • There are two parts to the API call: what to inspect (InspectConfig) and what to do with those findings (DeidentifyConfig). If you don't specify an InspectConfig, there will be no findings for DeidentifyConfig to work on. There are cases where you already know where the data is and so don't need an InspectConfig (say, if you know column D holds phone numbers, why pay for inspection?), but in your case you do need to supply one. If you don't specify an InfoType in DeidentifyConfig, the transformation applies to all the types that were found. – Jordanna Chord May 21 '18 at 20:48
  • I'm more confused than ever. I'm following the Java example for de-identifying data [here](https://cloud.google.com/dlp/docs/deidentify-sensitive-data), and I don't see an InspectConfig in the example for de-identification. My case is that I have completely random, unstructured data, and I need to de-identify as much of it as possible. – user2337270 May 21 '18 at 21:19
  • The example there needs some work; I'll talk to the team about changing it. You need to add an InspectConfig to your request and all will work. In my answer I gave an example of what you need. – Jordanna Chord May 21 '18 at 21:38
  • I understand that you're building a service with client APIs in multiple languages, and maybe Java isn't one of the most widely used. But it would be a tremendous time saver for DLP customers that need to integrate with the API if 1) The Javadoc pages were up to date, and 2) there were more detailed examples. I've spent many hours trying to follow examples, and figuring out how to properly use the API, and I'm getting frustrated at my lack of progress. Should I file a bug on my finding that the API explorer returns InfoTypes that the Java API rejects? – user2337270 May 21 '18 at 22:45
  • The missing Javadoc is just a bug in doc generation. It certainly needs fixing, I don't disagree. I'll also look at adding an example that uses InspectConfig alongside a basic DeidentifyConfig to make it more obvious that you need to do it. – Jordanna Chord May 21 '18 at 23:14