How can I extrapolate a pattern from a collection of filenames?

Question

I want to know if there is a well-known algorithm for extrapolating a filename pattern, given a collection of sample filenames as input. Take the following example filenames:

  ABC_348093423.csv
i.ABC_348097340.csv
  ABC_348099322.csv
i.GHI_348099324.csv
p.ABC_348101632.csv
  DEF_348101736.csv
p.ABC_348101633.csv
  ABC_348102548.csv

Ideally, the patterns that I would want to end up with in the result set would be something like:

*.ABC_*.csv
*.DEF_*.csv
*.GHI_*.csv

Even result values like the following would still be a good starting point:

i.ABC_348*.csv
p.ABC_348*.csv
...

Why do I need this?

I have an existing application where users can input a "file mask" to define a bucket for incoming input files to be grouped into. Incoming files are evaluated against each file mask (in order), and if the file matches a mask, the file goes into the bucket for that file mask... the end.

What I'd like to implement is, given the last X filenames that were processed, present the user with suggestions for new file masks. It does not have to be perfect. This will just be a user-assist feature.

What language am I using?

My application is written in Java, so any third-party Java library that can perform this kind of function would be an ideal solution. Otherwise, if there is a well-known algorithm for this problem, then I could implement it myself.

The analysis of strings to discern patterns is almost a mathematical field of its own. There would be no automatic library to compute this. — ErstwhileIII, Jul 28 '14 at 16:33
off topic : **too broad** and **recommendations for offsite resource** — , Jul 28 '14 at 16:33
Not looking for a REGEX pattern, exactly the opposite. Given a set of inputs, the output would BE a pattern that matches the inputs. — Jim Tough, Jul 28 '14 at 16:36
So you basically want to group the files into clusters and then find a regex pattern that would differentiate the clusters. — vandale, Jul 28 '14 at 16:49
This problem is fundamentally unsolvable unless you have access to the entire set of filenames. Given a sample of possible filenames you can write a pattern that matches them. However, there's always a probability that the next filename does not match the pattern. This sounds like an XY problem. Assuming you were to find such an algorithm, what would you use it for? — Jim Garrison, Jul 28 '14 at 16:54
To fully realize what you are hoping for, consider two trivial results of such an algorithm: `.+` and `ABC_348093423.csv|...|ABC_348102548.csv`. One is too wide (accepts everything), the other one (presumably) too narrow. You can refine this, looking at varieties of substrings, and, finally, for individual character positions. (Is it `[A-Z]` or `\p{Lu}`?) — laune, Jul 28 '14 at 17:08
Is the scope of your problem to find several common substrings from a set of filenames, so that you can construct a simple wildcard pattern from them? — Justin Kaeser, Jul 28 '14 at 18:28
Are the filenames always of this format: <3-letter-abbreviation>_.csv? If so, you just need to get a list of the most recent n 3-letter abbreviations and 'distinct' them. If not... I'm doubtful you can truly do this. — Randyaa, Jul 28 '14 at 18:43
You might think of other options, such as presenting the user with a large number of recent files and allowing them to create a mask in-line that immediately shows them what their results would be (displaying the match count and the total count would be fairly helpful, I think). — Randyaa, Jul 28 '14 at 18:45
@Randyaa - Interesting idea. I'll run that by the UI developer as an alternative to what I was considering. — Jim Tough, Jul 29 '14 at 13:11

Justin Kaeser · Answer 1 · 2014-07-28T19:38:29.570

Assuming you just want to suggest wildcard patterns based on common substrings, you could use a longest common substring algorithm to calculate all the common substrings, then choose a few based on their length and number of occurrences. This can be done recursively to find even more common substrings.

This example runs two iterations of the longest-common-substring step and prints the results:

import java.util.*;

public class Main {

    // Brute-force longest common substring: try every pair of start offsets
    // and extend the match for as long as both strings agree.
    private static String longestCommonSubstring(String s1, String s2) {
        int start = 0;
        int max = 0;
        for (int i = 0; i < s1.length(); i++) {
            for (int j = 0; j < s2.length(); j++) {
                int x = 0;
                while (i + x < s1.length() && j + x < s2.length()
                        && s1.charAt(i + x) == s2.charAt(j + x)) {
                    x++;
                }
                // Keep the longest run seen so far (first one wins on ties).
                if (x > max) {
                    max = x;
                    start = i;
                }
            }
        }
        return s1.substring(start, start + max);
    }



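    // Counts how often each pairwise longest common substring occurs.
    // Every unordered pair is visited twice, once in each order.
    public static SortedMap<String,Integer> commonSubstrings(List<String> strings) {
        SortedMap<String,Integer> subs = new TreeMap<>();
        for (int i = 0; i < strings.size(); i++) {
            for (int j = 0; j < strings.size(); j++) {
                // Skip self-pairs by index; != on Strings only compares references.
                if (i != j) {
                    String sub = longestCommonSubstring(strings.get(i), strings.get(j));
                    subs.merge(sub, 1, Integer::sum);
                }
            }
        }

        return subs;
    }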


    public static void main(String[] args) {
        List<String> filenames = Arrays.asList(
                "ABC_348093423.csv",
                "i.ABC_348097340.csv",
                "ABC_348099322.csv",
                "i.GHI_348099324.csv",
                "p.ABC_348101632.csv",
                "DEF_348101736.csv",
                "p.ABC_348101633.csv",
                "ABC_348102548.csv");

        Map<String,Integer> substrings = commonSubstrings(filenames);

        Map<String,Integer> subsubstrings = commonSubstrings(new ArrayList<>(substrings.keySet()));

        List<Map.Entry<String,Integer>> results = new ArrayList<>(subsubstrings.entrySet());
        Collections.sort(results, (a,b) -> a.getValue().compareTo(b.getValue()));

        for ( Map.Entry<String,Integer> s: results ) {
            System.out.println(s.getKey() + "\t" + s.getValue());
        }
    }
}

Of course, this misses shorter substrings common to all the filenames, such as `.csv`.
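
One possible way to build on this, sketched under assumptions: compute the suffix shared by every filename separately (which picks up something like `.csv`), then wrap each common substring in wildcards and count how many sample filenames the resulting mask matches. The class `MaskSuggester`, its method names, and the `*`-only mask syntax are hypothetical and chosen for illustration; the application's real file-mask syntax may differ.

```java
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the code above.
class MaskSuggester {

    // Longest suffix shared by every name, e.g. ".csv"; empty if there is none.
    static String commonSuffix(List<String> names) {
        if (names.isEmpty()) return "";
        String suffix = names.get(0);
        for (String name : names) {
            int k = 0;
            while (k < suffix.length() && k < name.length()
                    && suffix.charAt(suffix.length() - 1 - k)
                       == name.charAt(name.length() - 1 - k)) {
                k++;
            }
            suffix = suffix.substring(suffix.length() - k);
        }
        return suffix;
    }

    // Assumed mask syntax: '*' matches any run of characters, the rest is literal.
    static boolean matches(String mask, String name) {
        String regex = Arrays.stream(mask.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return name.matches(regex);
    }

    // Builds "*<substring>*<suffix>" per candidate and counts matching samples.
    static Map<String, Integer> suggest(List<String> names, Collection<String> candidates) {
        String suffix = commonSuffix(names);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sub : candidates) {
            // Skip candidates that add nothing beyond the shared suffix.
            if (sub.isEmpty() || suffix.contains(sub)) continue;
            String mask = "*" + sub + "*" + suffix;
            int hits = 0;
            for (String name : names) {
                if (matches(mask, name)) hits++;
            }
            if (hits > 0) counts.put(mask, hits);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "ABC_348093423.csv", "i.ABC_348097340.csv", "ABC_348099322.csv",
                "i.GHI_348099324.csv", "p.ABC_348101632.csv", "DEF_348101736.csv",
                "p.ABC_348101633.csv", "ABC_348102548.csv");
        // Candidate substrings picked by hand here; in practice they could come
        // from the keys returned by commonSubstrings.
        List<String> candidates = Arrays.asList("ABC_", "DEF_", "GHI_");
        for (Map.Entry<String, Integer> e : suggest(names, candidates).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With the sample filenames from the question, this prints `*ABC_*.csv` with 6 matches and `*DEF_*.csv` and `*GHI_*.csv` with 1 match each. Showing the match count next to each suggested mask lets the user judge whether a mask is too narrow or too broad before accepting it.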
