How to determine text patterns in list of text values to validate new inputs?

Asked Aug 17 '23 at 03:32

Active Aug 17 '23 at 03:32

How do I detect text patterns in a list of text values so that I can test against that pattern to validate a new value?

For example,

Given a list of text values like this:

SKU-1242
SKU-5450
SKU-6532
SKU-2395
SKU-2393
SKU-9310
234321

I would like to be given this regex: [A-Z]{3}\-[0-9]{4,5}. Ideally, I would also like to know what percentage of the existing values match this pattern.
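To make the second half of that concrete, here is a minimal sketch of the validation step, assuming the regex is already known. It uses only Python's standard `re` module; the helper name `match_rate` is hypothetical and just for illustration.

```python
import re

# Sample values copied from the list above.
values = ["SKU-1242", "SKU-5450", "SKU-6532", "SKU-2395",
          "SKU-2393", "SKU-9310", "234321"]

# The pattern I would like a tool to infer for me.
pattern = re.compile(r"[A-Z]{3}\-[0-9]{4,5}")


def match_rate(values, pattern):
    """Return the fraction of values that match the pattern in full."""
    if not values:
        return 0.0
    # fullmatch requires the whole string to match, not just a prefix.
    matched = sum(1 for v in values if pattern.fullmatch(v))
    return matched / len(values)


rate = match_rate(values, pattern)
print(f"{rate:.1%} of existing values match")  # 6 of 7 values -> 85.7%

# Validating a new input against the same pattern:
print(bool(pattern.fullmatch("SKU-12345")))  # True
print(bool(pattern.fullmatch("sku-1234")))   # False
```

This covers only the checking side; the open question is how to obtain the pattern automatically in the first place.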

This example is very similar to the one the AWS documentation uses to demonstrate how Amazon SageMaker Data Wrangler provides this as part of its Data Quality and Insights Report (see https://aws.amazon.com/blogs/machine-learning/detect-patterns-in-text-data-with-amazon-sagemaker-data-wrangler/).

Is there a library or tool that can detect these sorts of patterns in lists of text values? Any language will work.
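For reference, the kind of detection I have in mind can be roughly approximated by hand. The following is a naive sketch, not how Data Wrangler or any particular library does it: it maps each character to a character class, collapses runs into counted tokens, and reports the most common resulting shape. The function names are hypothetical, and this simple version only produces exact counts such as {4}; it does not merge similar shapes into a range such as {4,5}.

```python
import re
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map one character to a regex token (assumption: ASCII classes only)."""
    if ch.isascii() and ch.isupper():
        return "[A-Z]"
    if ch.isascii() and ch.islower():
        return "[a-z]"
    if ch.isascii() and ch.isdigit():
        return "[0-9]"
    # Anything else is kept as an escaped literal, e.g. "-" becomes "\-".
    return re.escape(ch)


def infer_regex(value):
    """Build a regex describing the shape of a single value."""
    parts = []
    for token, run in groupby(char_class(ch) for ch in value):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def dominant_pattern(values):
    """Return the most common inferred regex and the share of values with it."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(infer_regex(v) for v in values)
    regex, hits = counts.most_common(1)[0]
    return regex, hits / len(values)


values = ["SKU-1242", "SKU-5450", "SKU-6532", "SKU-2395",
          "SKU-2393", "SKU-9310", "234321"]
regex, share = dominant_pattern(values)
print(regex)           # [A-Z]{3}\-[0-9]{4}
print(f"{share:.1%}")  # 85.7%
```

I am looking for something more robust than this, ideally an existing library that also handles variable-length runs and mixed patterns.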

Hey! I'm working on this library: https://github.com/SuperiorityComplex/data_checks that lets you do that. Specifically, you can just write plain Python to detect these patterns and do alerting + scheduling if desired – josh Aug 30 '23 at 17:16

0 Answers