6

I'm new to regular expressions. I need to extract the path from the following lines:

XXXX       c:\mypath1\test
YYYYYYY             c:\this is other path\longer
ZZ        c:\mypath3\file.txt

I need to implement a method that returns the path from a given line. The first column is a word of one or more characters and is never empty; the second column is the path. The separator can be one or more spaces, one or more tabs, or a mix of both.

halfer
Daniel Peñalba

3 Answers

7

It sounds to me like you just want

string[] bits = line.Split(new char[] { '\t', ' ' }, 2,
                           StringSplitOptions.RemoveEmptyEntries);
// TODO: Check that bits really has two entries
string path = bits[1];

(This is assuming that the first column never contains spaces or tabs.)
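To fill in the TODO above, here is a minimal sketch of the split approach with the entry-count check added. The class name `SplitSketch` and the method name `ExtractPathBySplit` are hypothetical, chosen only for illustration. Because the count argument is 2, everything after the first run of separators stays in one piece, so spaces inside the path should survive.

```csharp
using System;

static class SplitSketch
{
    // Hypothetical helper: returns the second column (the path) of a line.
    public static string ExtractPathBySplit(string line)
    {
        // Split into at most two pieces: the first column and the rest of the line.
        string[] bits = line.Split(new char[] { '\t', ' ' }, 2,
                                   StringSplitOptions.RemoveEmptyEntries);
        if (bits.Length != 2)
        {
            // Either the first column or the path is missing.
            throw new ArgumentException("Invalid line");
        }
        return bits[1];
    }

    static void Main()
    {
        Console.WriteLine(ExtractPathBySplit(@"YYYYYYY             c:\this is other path\longer"));
    }
}
```

Note that this sketch does not trim trailing whitespace from the path; add a `TrimEnd()` call if your input may contain it.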

EDIT: As a regular expression you can probably just do:

Regex regex = new Regex(@"^[^ \t]+[ \t]+(.*)$");

Sample code:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string[] lines = 
        {
            @"XXXX       c:\mypath1\test",
            @"YYYYYYY             c:\this is other path\longer",
            @"ZZ        c:\mypath3\file.txt"
        };

        foreach (string line in lines)
        {
            Console.WriteLine(ExtractPathFromLine(line));
        }
    }

    static readonly Regex PathRegex = new Regex(@"^[^ \t]+[ \t]+(.*)$");

    static string ExtractPathFromLine(string line)
    {
        Match match = PathRegex.Match(line);
        if (!match.Success)
        {
            throw new ArgumentException("Invalid line");
        }
        return match.Groups[1].Value;
    }    
}
Jon Skeet
  • Paths can have spaces, so the second one is quite bad. – xanatos Oct 18 '11 at 09:16
  • @Jon: Sorry, I need a regular expression since I'm using .NET 1.1 and I have no access to the StringSplitOptions.RemoveEmptyEntries overload. Thanks anyway! – Daniel Peñalba Oct 18 '11 at 09:16
  • @DanielPeñalba: It would have been useful to say so to start with - requiring .NET 1.1 is very rare these days. Will edit. – Jon Skeet Oct 18 '11 at 09:18
5
StringCollection resultList = new StringCollection();
try {
    Regex regexObj = new Regex(@"(([a-z]:|\\\\[a-z0-9_.$]+\\[a-z0-9_.$]+)?(\\?(?:[^\\/:*?""<>|\r\n]+\\)+)[^\\/:*?""<>|\r\n]+)");
    Match matchResult = regexObj.Match(subjectString);
    while (matchResult.Success) {
        resultList.Add(matchResult.Groups[1].Value);
        matchResult = matchResult.NextMatch();
    } 
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Breakdown:

@"
(                             # Match the regular expression below and capture its match into backreference number 1
   (                             # Match the regular expression below and capture its match into backreference number 2
                                    # Match either the regular expression below (attempting the next alternative only if this one fails)
         [a-z]                         # Match a single character in the range between “a” and “z”
         :                             # Match the character “:” literally
      |                             # Or match regular expression number 2 below (the entire group fails if this one fails to match)
         \\                            # Match the character “\” literally
         \\                            # Match the character “\” literally
         [a-z0-9_.$]                   # Match a single character present in the list below
                                          # A character in the range between “a” and “z”
                                          # A character in the range between “0” and “9”
                                          # One of the characters “_.$”
            +                             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         \\                            # Match the character “\” literally
         [a-z0-9_.$]                   # Match a single character present in the list below
                                          # A character in the range between “a” and “z”
                                          # A character in the range between “0” and “9”
                                          # One of the characters “_.$”
            +                             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   )?                            # Between zero and one times, as many times as possible, giving back as needed (greedy)
   (                             # Match the regular expression below and capture its match into backreference number 3
      \\                            # Match the character “\” literally
         ?                             # Between zero and one times, as many times as possible, giving back as needed (greedy)
      (?:                           # Match the regular expression below
         [^\\/:*?""<>|\r\n]             # Match a single character NOT present in the list below
                                          # A \ character
                                          # One of the characters “/:*?""<>|”
                                          # A carriage return character
                                          # A line feed character
            +                             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         \\                            # Match the character “\” literally
      )+                            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   )
   [^\\/:*?""<>|\r\n]             # Match a single character NOT present in the list below
                                    # A \ character
                                    # One of the characters “/:*?""<>|”
                                    # A carriage return character
                                    # A line feed character
      +                             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
FailedDev
  • 1
    This looks very complicated to basically get everything after the first set of spaces/tabs. – Jon Skeet Oct 18 '11 at 09:14
  • @JonSkeet I agree. That's a more general regex for Windows paths. – FailedDev Oct 18 '11 at 09:21
  • @FailedDev It doesn't work, for example, for "k:\test\test", and a path like `\\test\t><*st` is accepted as valid. I found this regex: `^(?:[c-zC-Z]\:|\\)(\\[a-zA-Z_\-\s0-9\.]+)+`. In my opinion it validates paths correctly. Found it here: https://www.codeproject.com/Tips/216238/Regular-Expression-to-Validate-File-Path-and-Exten – Potato Dec 08 '17 at 06:21
0

Regex Tester is a good website for testing a regex quickly.

Regex.Matches(input, @"([a-zA-Z]:[\\a-zA-Z0-9 .]*)");
San