Find arbitrary patterns common to a group of strings

Question

Background:

I am developing a program that iterates over all the movies and TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.

I extract the movie name by removing all the unnecessary text such as '.avi', '720p' etc. from the file name.

I am using Java.

Problem:

Some folders contain movie files such as:

Episode 301 Rainforest Schmainforest.avi

Episode 302 Spontaneous Combustion.avi

The word 'Episode' and numbers are valid, common words in movie titles, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.

Another folder might be:

720p.S5.E1.cripple fight.avi

720p.S5.E2.towelie.avi

Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these arbitrary patterns so I can extract the keywords. It would be infeasible to write a regex for each case.

Summary:

Is there a tool or API that I can use to find complex repetitive patterns (must be able to match sequences of numbers)? [something like a longest common sequence library]

Yep, I am currently using this. However, there are many many different types of patterns that currently exist (and that may exist in the future), and using regex requires me to know and code for them all. — Kevin, Apr 14 '12 at 07:17

score 2 · Accepted Answer · answered Apr 14 '12 at 07:25

Well, you could simply take all the filtered names in your dir, and do a simple word-count. You could give extra weight to words that occur in (roughly) the same spot every time.

In the end you'd end up with a count and a weight, and you need to decide where to draw the line. It's probably not every file in the dir (because of maybe images or samples), but if most files have a certain word, it's not "the" or something like that, and maybe it always appears "at the start" or "on the second spot", you can filter it.
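A minimal sketch of that position-aware count, assuming the names have already had the extension and known junk removed. The class name `PositionalTokenFilter`, the separator set, and the `minShare` threshold are all hypothetical choices for illustration, not part of any library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PositionalTokenFilter {

    /** Splits a pre-filtered name into lower-case tokens on spaces, dots and underscores. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[\\s._]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts how many names have a given token at a given index; keys look like "0:episode". */
    static Map<String, Integer> countByPosition(List<String> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            List<String> tokens = tokenize(name);
            for (int i = 0; i < tokens.size(); i++) {
                counts.merge(i + ":" + tokens.get(i), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Drops tokens that sit at the same index in at least minShare of the names. */
    static List<String> strip(String name, Map<String, Integer> counts, int total, double minShare) {
        List<String> tokens = tokenize(name);
        if (total < 2) {
            return tokens; // a single file gives no evidence of repetition
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int seen = counts.getOrDefault(i + ":" + tokens.get(i), 0);
            if (seen < minShare * total) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}
```

For the two 'Episode 3XX' names from the question with `minShare` set to 0.8, this drops 'episode' but keeps '301' and '302', because each number only occurs once. Where exactly to set the threshold is the judgment call mentioned above.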

But this wouldn't work for, to pick a random example, Friends episodes. They're all called "The one where.....". That would be filtered in every sane version of your sought-after algorithm.

The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.

The only thing you can do is make a blacklist of stuff you want to filter, like you already seem to do with the avi / 720 thing.

answered Apr 14 '12 at 07:25

Nanne


Yep, this was my first idea. But then it wouldn't recognise and remove sequences of numbers (e.g., 01, 02, 03 in the same index should be considered 3 counts of 2-digit numbers, and thus considered common). – Kevin Apr 14 '12 at 07:29
I'm not after a perfect solution either. So the friends episodes problem might be a compromise. – Kevin Apr 14 '12 at 07:29
+1 for "There's just not enough distinction between wanted repetition and unwanted repetition"... – thkala Apr 14 '12 at 07:31
Well, obviously you could filter numbers that add up, but would you want to remove all the sequences? What about the Rocky movies? The numbers are essential to the name, but they form a sequence! The bottom line is still that I can only imagine a method for blacklisting some stuff, because the data is not clear enough: some repetitions should stay, some shouldn't. Any algorithm that would remove all sequences and repetitions would, I think, be too much of a compromise. – Nanne Apr 14 '12 at 07:31
@Mowgli: in the same vein: have you been *that* attentive in naming your files? In my own TV series folders, I often have mixed S03E02 and 3x02 episode notations, not to mention that most of them lack the actual episode title! There is no distinction between wanted and unwanted variation either... – thkala Apr 14 '12 at 07:34
Hmm... the Rocky comment has ramifications... I can't think of how I would sort out sequences without making all the Die Hards the same keyword... good point – Kevin Apr 14 '12 at 07:35
@Mowgli: not to mention the case of multi-part episodes. Part 1, 2 and 3 are not all that uncommon... – thkala Apr 14 '12 at 07:51
Dammit. This seemed so easy when I started planning. Oh, well. Thanks for sharing your thoughts. The conclusion being that what I asked is not possible. – Kevin Apr 14 '12 at 09:25

score 1 · Answer 2 · answered Apr 14 '12 at 07:25

I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.

Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.

Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately, quite a few of the algorithms developed in active research areas are encumbered by patents and such...

score 0 · Answer 3 · answered Sep 15 '13 at 09:38

I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd

I wanted to remove non-unique parts of file names such as '720dpi', 'Episode', 'xvid', 'ac3' without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept on changing.

The code linked above uses Python (not Java) to remove all non-unique words in a file name. Basically it creates a list of all the words used in the file names, and puts any word that comes up in most of the files into a dictionary. Then it iterates through the files and deletes all these dictionary words from them.

The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.
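Since the question is about Java, here is a rough Java sketch of the same idea. It is not a port of the linked script: the class name `CommonWordStripper` is made up for illustration, and "most of the files" is assumed to mean "more than half", which may differ from what the Python code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class CommonWordStripper {

    /** Converts underscores and periods to spaces, then splits on whitespace. */
    static List<String> words(String name) {
        String cleaned = name.replace('_', ' ').replace('.', ' ').trim();
        if (cleaned.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    /** Collects the words (lower-cased) that occur in more than half of the names. */
    static Set<String> commonWords(List<String> names) {
        Map<String, Integer> fileCounts = new HashMap<>();
        for (String name : names) {
            // Count each word once per file, however often it repeats inside it.
            Set<String> seen = new HashSet<>();
            for (String w : words(name)) {
                seen.add(w.toLowerCase(Locale.ROOT));
            }
            for (String w : seen) {
                fileCounts.merge(w, 1, Integer::sum);
            }
        }
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> e : fileCounts.entrySet()) {
            if (e.getValue() * 2 > names.size()) { // assumed meaning of "most"
                common.add(e.getKey());
            }
        }
        return common;
    }

    /** Removes every common word from one name and rejoins the rest with spaces. */
    static String strip(String name, Set<String> common) {
        List<String> kept = new ArrayList<>();
        for (String w : words(name)) {
            if (!common.contains(w.toLowerCase(Locale.ROOT))) {
                kept.add(w);
            }
        }
        return String.join(" ", kept);
    }
}
```

On the two '720p.S5...' names from the question (extension already removed), this treats '720p' and 's5' as common and leaves 'E1 cripple fight' and 'E2 towelie'. Note that a folder with a single file would have every word stripped, so a real version would want to skip such folders.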

I have used it a lot recently and it works well.
