I'm seeing a behavior in the Google DLP library that puzzles me, and I'm hoping for some clarification. I'm using the Java wrapper library, google-cloud-dlp version 0.34.0-beta. Given the input:
Collection<String> input = Lists.newArrayList("Jenny Tutone 2665 Agua Vista Dr Los Gatos CA 95030 (408) 867-5309 or 408.867.5309x100");
I'm seeing the output:
███ █ ████ or █
If I pass in the same string as a collection of substrings:
Collection<String> input = Lists.newArrayList("Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");
I see very different results:
███, 2665 █, █ Gatos, █ 95030, █, or, █
I'm using every InfoType that I could find, which amounts to 67 of them. Am I doing something wrong here?
This is the meat of the code that invokes the Google DLP library:
private Collection<String> redactContent(Collection<String> input,
                                         String replacement,
                                         Likelihood minLikelihood,
                                         List<InfoType> infoTypes) {
    // Replace select info types with chosen replacement string
    final Collection<RedactContentRequest.ReplaceConfig> replaceConfigs = infoTypes.stream()
            .map(it -> RedactContentRequest.ReplaceConfig.newBuilder().setInfoType(it).setReplaceWith(replacement).build())
            .collect(Collectors.toCollection(LinkedList::new));
    final InspectConfig inspectConfig =
            InspectConfig.newBuilder()
                    .addAllInfoTypes(infoTypes)
                    .setMinLikelihood(minLikelihood)
                    .build();
    long itemCount = 0;
    try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
        // Google's DLP library is limited to 100 items per request, so the requests need to be chunked if the
        // number of input items is greater.
        Stream.Builder<Stream<ContentItem>> streamBuilder = Stream.builder();
        for (long processed = 0; processed < input.size(); processed += maxItemsPerRequest) {
            Collection<ContentItem> items =
                    input.stream()
                            .skip(processed)
                            .limit(maxItemsPerRequest)
                            .filter(item -> item != null && !item.isEmpty())
                            .map(item ->
                                    ContentItem.newBuilder()
                                            .setType(MediaType.PLAIN_TEXT_UTF_8.toString())
                                            .setData(ByteString.copyFrom(item.getBytes(Charset.forName("UTF-8"))))
                                            .build()
                            )
                            .collect(Collectors.toCollection(LinkedList::new));
            RedactContentRequest request = RedactContentRequest.newBuilder()
                    .setInspectConfig(inspectConfig)
                    .addAllItems(Collections.unmodifiableCollection(items))
                    .addAllReplaceConfigs(replaceConfigs)
                    .build();
            RedactContentResponse contentResponse = dlpClient.redactContent(request);
            itemCount += contentResponse.getItemsCount();
            streamBuilder.add(contentResponse.getItemsList().stream());
        }
        return streamBuilder.build()
                .flatMap(stream -> stream.map(item -> item.getData().toStringUtf8()))
                .collect(Collectors.toCollection(LinkedList::new));
    }
}
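For anyone trying to reproduce this without the DLP client, the chunking step in the method above can be looked at in isolation. What follows is only a minimal sketch of that batching logic, not part of my actual code: `BatchSplitter` and `partition` are hypothetical names, and the batch size of 3 is chosen just to keep the example small (the real limit I'm working around is 100 items per request).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSplitter {

    // Hypothetical helper: splits the input into consecutive batches,
    // each holding at most batchSize elements, preserving order.
    static <T> List<List<T>> partition(List<T> input, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < input.size(); start += batchSize) {
            int end = Math.min(start + batchSize, input.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(input.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030",
                "(408) 867-5309", "or", "408.867.5309x100");
        // Print each batch on its own line; the batch size here is illustrative only.
        for (List<String> batch : partition(input, 3)) {
            System.out.println(batch);
        }
    }
}
```

In this sketch each batch would become one request. Unlike my method above, it does not drop null or empty items; that filtering would presumably happen before partitioning so that every batch is as full as possible.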