I have a batch job defined in XML that is scheduled by a job scheduling engine. The engine can watch directories for changes to their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files that we need to process.

We need to know about the arrival of new files as soon as possible.

I have to put a regular expression into that XML job so that subdirectories and temporary files are not matched.

In most cases, customers and clients upload text, CSV, or PDF files, which cause no problems. Some upload MS Office files, however, and these become a problem when someone opens them in the directory: Office then creates a hidden temporary file whose name begins with ~$.

According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.

The regular expressions I have tried so far are:

First try before even noticing temporary office files:

^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Second try, intention was excluding a leading ~:

^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Third try, intention was excluding a leading ~ by its character code:

^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Fourth try, intention was excluding a leading ~ by its character code with a capital E:

^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

None of these stops notifications from being sent when a file is opened…

Does anyone have any idea what to do? All suggestions and alternatives are welcome.

I even checked them at regex101, regexplanet.com, regexr.com and regextester.com, where the second try matched exactly as desired. Where a site offered it (not all do), I also selected POSIX compilation.

How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?

Short version:

How can I create a regular expression that matches any file with any extension apart from .part, but matches neither the file thumbs.db nor any file whose name begins with a ~?

Requirements: What should not be matched:

Subfolders (my approach was files without a .),

Thumbs.db (Windows thumbnails db),

*.part (filezilla partial uploads),

~$. (temporary files starting with ~ or ~$, MS Office tmp files)

The following list provides some files and folders that must be matched or not matched by the regex:

  • Ablage (subfolder, should not be matched)

  • Abrechnungen (subfolder, should not be matched)

  • eine_testdatei.csv

  • TEST-WORKBOOK.xlsx

  • TEST-WORKBOOK_äöüß.xlsx

  • Test-2018-08-08.txt

  • ~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)

  • TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)

  • TEST-WORKBOOK.part (partial upload, should not be matched)

New Problems occurred while trying to find the regex

A few problems came up after I posted this question, when I tried to apply the actually correct regex from the answer given by @Bohemian. I wasn't aware of those problems before, so I am adding them here for completeness.

The first one occurred because certain characters in the regex are not allowed in XML. The XML file is parsed by a Java class that throws an exception when it encounters a literal < inside an attribute value; < may only appear as part of markup (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").

This can be avoided by using the predefined entity references &lt; instead of < and &gt; instead of >.
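The escaping step can also be done programmatically instead of by hand. The following is only a sketch in Python (not the scheduler's own tooling): it uses the standard library's `quoteattr` to escape the regex for use as an attribute value, then parses the element back to confirm that the original regex survives. It only addresses XML well-formedness; whether the engine accepts the regex syntax is a separate matter.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# The regex from the answer, exactly as it should reach the regex engine.
regex = r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$"

# quoteattr escapes &, < and > and wraps the value in quotes.
attr = quoteattr(regex)

# Element and attribute names are taken from the job file shown in the question.
element_text = (
    f'<start_when_directory_changed directory="F:\\someDirectory" regex={attr} />'
)

# Parsing the element back yields the unescaped regex again.
parsed = ET.fromstring(element_text)
print(attr)
print(parsed.get("regex") == regex)
```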

The second (and currently unresolved) issue is that the engine rejects an operand in the otherwise correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$. The engine says:

Error: 2018-08-17T06:05:46Z REGEX-13

[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$]

The corresponding line in the xml file looks like this:

<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$" />

Now I am stuck again, because I know very little about regular expressions; I cannot even tell which character in the regex is the operand being rejected.

Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…

deHaar
  • It's already wrong at the first try — `[^.part]*` doesn't do what you want it to. This, plus a lack of clarity in the question, makes it hard to figure out how to move forward. – hobbs Aug 15 '18 at 13:03
  • `[^.part]` is to exclude files ending with `.part`, doesn't it do that? My question is about the first characters of arbitrary files, which should not be regarded if they begin with a `~`. What exactly is unclear? – deHaar Aug 15 '18 at 13:06
  • No, it doesn't do that. – hobbs Aug 15 '18 at 13:07
  • Thanks, good point! What does it do then? I am a real regex noob… Better asking, how can I achieve exclusion of `.part` extended files? – deHaar Aug 15 '18 at 13:07
  • Your original try matches a file which has an extension that is: one letter, number, underscore, or dash, followed by one of anything except tilde, followed by zero or more of any character other than ".", "p", "a", "r", or "t". So for instance it accepts `.doc~` ("d" is a letter, "o" is not tilde, and "c~" doesn't contain any of the characters from `[.part]`) but rejects `.bat` (the last character is "t", which is forbidden). – hobbs Aug 15 '18 at 13:12
  • As for what you want to do... it's very nearly impossible with POSIX regex. Does your XML format allow "not" rules? Because it's very easily done if you can specify a pattern for things that *don't* match. – hobbs Aug 15 '18 at 13:15
  • I don't really know if it accepts *not rules* and I don't know how they are done. Can you provide a small example that enables me to try out if those rules are valid in my xml file? – deHaar Aug 15 '18 at 13:17
  • A pattern of `~$` would exclude ending tilde, `\.part$` would exclude .part files, and `\.~[^.]*$` would exclude "starting" tilde. They could be combined into a single pattern as simply as `~$|\.~[^.]*$|\.part$`. – hobbs Aug 15 '18 at 13:20
  • Unfortunately, the pattern does not work, notifications get sent on every opening and closing of an `.xlsx` file due to the temporary file getting created. Thank you anyway! Perhaps, it is just not possible to exclude it... – deHaar Aug 15 '18 at 13:37
  • @hobbs I have to correct myself: With your pattern, no notifications are sent at all, no matter what file I put into the test folder. Sorry for the comment above, that was due to a mistake by me concerning xml formatting which caused the regex to not be regarded at all (every file caused a notification). – deHaar Aug 15 '18 at 13:49
  • to clarify, that pattern is only useful if it can be put into a rule that's logically negated. i.e. notify for any file that *doesn't* match the pattern. – hobbs Aug 15 '18 at 13:50
  • Ok, thanks for clarification… I am now on my way to find out how a pattern can be logically negated. Thank you! – deHaar Aug 15 '18 at 13:53
  • POSIX 1003.2 defines several regex dialects, which one is this, BRE or ERE? If you can't figure that out yourself, it would help if you specified which precise tool you are using. – tripleee Aug 16 '18 at 06:45
  • You should probably read [the Stack Overflow `regex` tag info page](/tags/regex/info) which has both posting guidance and troubleshooting tips. – tripleee Aug 16 '18 at 06:46
  • Your first try should already do what you ask; `^[a-zA-Z0-9_\-]` cannot match a tilde at beginning of line. – tripleee Aug 16 '18 at 06:47
  • @tripleee I cannot find any information about regex dialects, the documentation (online/pdf) just says POSIX 1003.2 standard. The tool is the [Job Scheduler by SOS Berlin](http://www.sos-berlin.com/jobscheduler), it has an option *Directory Monitoring* that I am trying to get to meet our requirements. – deHaar Aug 16 '18 at 06:52
  • Does `nonesvch\|.` trigger a notification? If not, does `nonesvch\|.*` trigger a notification? If not, does `nonesvch|.` trigger a notification? If not, does `nonesvch|.+` trigger one? – tripleee Aug 16 '18 at 06:56
  • @tripleee The point about my first try is the one that made me mad! It should not match a `~` at the beginning, but opening an `XLSX` file sends an email about new files arrived (but does not recognize the hidden temporary file starting with `~$`). I cannot explain that to the users... – deHaar Aug 16 '18 at 06:57
  • @tripleee `nonesvch\|.` does not cause any notification at all, not even the desired ones. `nonesvch\|.*` neither does. `nonesvch|.`, on the other hand, triggers notification about everything (even subfolders and temporary files, interesting!). – deHaar Aug 16 '18 at 07:02
  • So then it's POSIX ERE. Thanks for investigating. – tripleee Aug 16 '18 at 07:04
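To make hobbs's explanation of `[^.part]` concrete, here is a small sketch that runs the first try against a few names. It uses Python's `re` module rather than the scheduler's POSIX engine, so it is only an approximation; for these inputs the bracket expressions behave the same way. The helper name `accepted` is made up for the illustration.

```python
import re

# The first try from the question. [^.part] is a bracket expression: it matches
# ONE character that is none of ".", "p", "a", "r", "t"; it does not mean
# "not the string .part".
FIRST_TRY = re.compile(r"^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$")

def accepted(name):
    """Return True if the first-try pattern matches the file name."""
    return FIRST_TRY.search(name) is not None

for name in ["report.doc~", "run.bat", "Test-2018-08-08.txt",
             "eine_testdatei.csv", "TEST-WORKBOOK.part"]:
    print(f"{name!r}: {accepted(name)}")
```

In this sketch `report.doc~` and `eine_testdatei.csv` are accepted, while `run.bat` and `Test-2018-08-08.txt` are rejected only because their extension ends in `t`. `TEST-WORKBOOK.part` is rejected too, but for the same accidental reason (the `r` after `.pa`).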

3 Answers

1

POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.

Here's how to do it, but as you can see, it's not very readable.

^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$

... and it still probably doesn't do exactly what you want.
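The construction behind that pattern may be easier to see when it is generated: every proper prefix of a forbidden string is allowed only if the name ends there or continues with a different character. The sketch below rebuilds the same pattern and tries it on some of the names from the question. It runs in Python's `re`, not in the scheduler's POSIX engine, and the helper names `prefix_alternatives` and `matches` are invented for this illustration.

```python
import re

def prefix_alternatives(word):
    """For each proper prefix of `word`, build an alternative that accepts the
    prefix only if the name ends there or the next character differs."""
    alternatives = []
    for i in range(1, len(word)):
        prefix = word[:i].replace(".", r"\.")
        alternatives.append(f"{prefix}($|[^{word[i]}])")
    return alternatives

# [^~t.] covers every character that cannot start a forbidden string;
# the guarded prefixes cover "t..." and "." when they do not spell one out.
PATTERN = "^(" + "|".join(
    ["[^~t.]"] + prefix_alternatives("thumbs.db") + prefix_alternatives(".part")
) + ")+$"

def matches(name):
    return re.search(PATTERN, name) is not None

for name in ["Ablage", "eine_testdatei.csv", "Test-2018-08-08.txt",
             "~$TEST-WORKBOOK.xlsx", "TEST-WORKBOOK.xlsx.part",
             "thumbs.db", "Thumbs.db"]:
    print(f"{name!r}: {matches(name)}")
```

Under these assumptions the pattern rejects `~$TEST-WORKBOOK.xlsx`, `TEST-WORKBOOK.xlsx.part` and the lowercase `thumbs.db`, but it still matches the subfolder name `Ablage` (nothing requires a dot) and the capitalized `Thumbs.db` (the comparison is case-sensitive).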

tripleee
  • Thanks a lot, I will check that out! – deHaar Aug 16 '18 at 07:12
  • Unfortunately, it doesn't do what I want. It keeps triggering notification on opening and closing an Excel workbook and it notifies about subfolders. Maybe I have to handle the filtering in the logic that gets triggered (that is in java, there is a list of trigger files) like filtering out everything that is unwanted and send the notification only if files are left in that list. – deHaar Aug 16 '18 at 07:17
  • Thanks a lot anyway, I think I have to come to terms with technical limitations and find another way to handle the issue. – deHaar Aug 16 '18 at 07:19
  • Best answer, accepted although the desired result cannot be achieved. The reason is a technical limitation of the software I use. – deHaar Feb 25 '19 at 10:30
0

Try this:

^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$

See live demo.

There is nothing special about the tilde character in regex.
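For anyone who wants to check the pattern outside the demo, here is a sketch that runs it over the names listed in the question. It uses Python's `re`, which supports lookaheads and lookbehinds; a POSIX ERE engine does not. The `re.IGNORECASE` flag is an addition made here so that the capitalized `Thumbs.db` is excluded as well; the case-sensitive variant is kept for comparison.

```python
import re

PATTERN_TEXT = r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$"

# Assumption: case-insensitive matching, so "Thumbs.db" counts as thumbs.db.
PATTERN = re.compile(PATTERN_TEXT, re.IGNORECASE)
CASE_SENSITIVE = re.compile(PATTERN_TEXT)

# Names from the question, with the result the asker wants for each.
EXPECTED = {
    "Ablage": False,
    "Abrechnungen": False,
    "eine_testdatei.csv": True,
    "TEST-WORKBOOK.xlsx": True,
    "TEST-WORKBOOK_äöüß.xlsx": True,
    "Test-2018-08-08.txt": True,
    "~$TEST-WORKBOOK.xlsx": False,
    "TEST-WORKBOOK.xlsx.part": False,
    "TEST-WORKBOOK.part": False,
    "Thumbs.db": False,
}

results = {name: PATTERN.search(name) is not None for name in EXPECTED}
for name, matched in results.items():
    print(f"{name!r}: matched={matched}, wanted={EXPECTED[name]}")
```

Without the flag, `Thumbs.db` with a capital T would still match, because the negative lookahead compares against the lowercase spelling only.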

Bohemian
  • Thank you, but this still triggers notifications on opening a workbook. – deHaar Aug 16 '18 at 07:29
  • @deHaar can you please provide the file name(s) it should not be matching? – Bohemian Aug 16 '18 at 07:31
  • Ah, there is an interesting error message when using your suggestion: It says "The value of attribute "regex" associated with an element type "start_when_directory_changed" must not contain the '<' character." – deHaar Aug 16 '18 at 07:31
  • I have edited my question adding the main requirements concerning not to be matched items. – deHaar Aug 16 '18 at 07:37
  • @deHaar the only hit getting through was the (unmentioned) subdirs. I have added a look ahead to require at least one dot. If the current version doesn't work, rather than describing what should match, provide a list of actual filenames that should match and a list that shouldn't match. That way we have something to test our regexes with. – Bohemian Aug 16 '18 at 14:17
  • Thanks a lot, but I think inside the xml file (where I have to put the regex) I will get a parsing exception if I put a regex containing `<` or `>`. I will try anyway... – deHaar Aug 16 '18 at 14:35
  • @deHaar that's a separate issue, not stated in your question. However, try coding `&gt;` instead of `>` and `&lt;` instead of `<`. – Bohemian Aug 16 '18 at 14:39
  • I came across this different issue when I tried to apply your last edit, sorry. I will update my question soon, but I have little time at the moment. Tomorrow (in about 16 hours) I will provide more (and hopefully suitable) information including a list of filenames. – deHaar Aug 16 '18 at 14:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178149/discussion-between-dehaar-and-bohemian). – deHaar Aug 16 '18 at 14:48
  • @deHaar the current regex passes all of your recently updated test cases. I've updated the demo link with these cases. – Bohemian Aug 16 '18 at 14:53
  • Thanks for the update! I could resolve the parsing exception by using the substitute code. But now there is another error message, obviously concerning regex grammar: *repetition-operator operand invalid* followed by the regex, without highlighting the criticized operand. Which one could it be? – deHaar Aug 17 '18 at 06:10
  • Because this is Perl regex, not POSIX regex. This syntax is simply not supported by the tool (or your question contains incorrect information; but the error message certainly seems to corroborate this conclusion). – tripleee Aug 17 '18 at 06:51
  • @tripleee is it possible to somehow *translate* or *transform* this Perl regex into POSIX? – deHaar Aug 17 '18 at 08:58
  • My answer contains a humble attempt but no, in the general case you cannot. – tripleee Aug 17 '18 at 08:59
  • @tripleee Thanks for that important piece of information! I will then have to handle the decision about sending a notification or not in java. – deHaar Aug 17 '18 at 13:16
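If the filtering moves into application code, as suggested in the last comment, no regex is needed at all. The sketch below shows one way the four exclusion rules from the question could look; it is written in Python for brevity and would have to be ported to Java for the setup described here. The function name `should_notify` and the `is_dir` flag are hypothetical; how the triggered job learns whether an entry is a directory is not covered by the question.

```python
def should_notify(name, is_dir=False):
    """Return True if a new directory entry should trigger a notification.

    `name` is the bare file name (no path); `is_dir` is supplied by the caller.
    """
    lowered = name.lower()
    if is_dir:
        return False              # subfolders
    if lowered == "thumbs.db":
        return False              # Windows thumbnail cache
    if name.startswith("~"):
        return False              # temporary files such as MS Office ~$ lock files
    if lowered.endswith(".part"):
        return False              # partial uploads
    return True

# Entries from the question as (name, is_dir) pairs.
entries = [
    ("Ablage", True),
    ("eine_testdatei.csv", False),
    ("TEST-WORKBOOK_äöüß.xlsx", False),
    ("~$TEST-WORKBOOK.xlsx", False),
    ("TEST-WORKBOOK.xlsx.part", False),
    ("Thumbs.db", False),
]
to_report = [name for name, is_dir in entries if should_notify(name, is_dir)]
print(to_report)
```

A notification would then be sent only if `to_report` is not empty.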
-1

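I am very late to this, but the comments above were helpful to me. It may not work for you, but my solution (in R) is:

# Keep only entries with no "~" anywhere in the name; use "^~" on basename() to test for a leading tilde only.
file_list <- file_list[!grepl("~", file_list)]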
AsthaUndefined