-1

I need a help in writing in U-SQL to output records to two different files based on a regular expression output. Let me explain my scenario in detail.

Let us assume my input file has two columns, "Name" and person identification number ("PIN"):

Name , PIN
John ,12345
Harry ,01234
Tom, 24659

My condition for PIN is it should start with either 1 or 2. In the above case records 1 & 3 are valid and record 2 is invalid.

I need to output record 1 & 3 to my output processed file and 2 to my error file

How can I do this and also can I use Regex.Match to validate the regular expression?

//posting my code

@person =
EXTRACT UserId          int,
         PNR     string,
         UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();

@rs1=select UserId,PNR,UID,Regex.match(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.match(UID,'^(19|20)[0-9]{2}$') as uidval
from @person

@rs2 = select UserId,PNR,UID from @rs1 where pnrval=true or uidval=true
@rs3 = select UserId,PNR,UID from @rs1 where uidval=false or uidval= false

OUTPUT @rs2   
    TO "/output/sl.csv"
      USING Outputters.Csv();

OUTPUT @rs3   
    TO "/output/error.csv"
      USING Outputters.Csv();

But I'm receiving this error:

Severity Code Description Project File Line Suppression State Error E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match' cannot be used as column type.

Francesco B.
  • 2,729
  • 4
  • 25
  • 37
Sreejith
  • 19
  • 3
  • Can you use C# to process your files? – Francesco B. Mar 25 '18 at 11:38
  • Yes I can use C#. I thought of writing a function to validate this record and create some thing like a 3 column result set where the 3rd column will store the validation result and filter records based on that column and write to output files But this logic seems to a little complex when i have to validate multiple columns. – Sreejith Mar 25 '18 at 13:59
  • Ok, then I'll write C# code in a couple of hours to use regex.match – Francesco B. Mar 25 '18 at 16:02
  • Can you post code if you already have it? – Francesco B. Mar 25 '18 at 16:14
  • I will wait for your code. I have predefined regex patterns defined for each field in each file. What i mentioned in my question is a sample . The problem for me is writing to error and processed file when a rule for a field fails – Sreejith Mar 25 '18 at 16:43
  • It's not a problem, just the rules for a question here actually ask to show some code (even if it doesn't work) to better focus on what's important rather then writing things the asker already knows / has done. See https://stackoverflow.com/help/how-to-ask – Francesco B. Mar 25 '18 at 16:53
  • 1
    I have added the code to my question – Sreejith Mar 25 '18 at 17:29
  • Great, I updated my answer. Let me know if it's clear and works for you. – Francesco B. Mar 25 '18 at 18:13
  • Thank you. In USQl i will be writing the output to Azure data laje .In C# do i need to write code to connect to ADL – Sreejith Mar 26 '18 at 02:14
  • I don't think so, at least... not necessarily. If you process in C# your CSV (either locally or via web service), then later you can import it in U-SQL just like you were doing before. Let me know if everything is clear and if my answer satisfies you. – Francesco B. Mar 26 '18 at 07:08

3 Answers3

1
@person =
EXTRACT UserId          int,
         PNR     string,
         UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();

@rs1=select UserId,PNR,UID,Regex.Ismatch(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.Ismatch(UID,'^(19|20)[0-9]{2}$') as uidval
from @person

@rs2 = select UserId,PNR,UID from @rs1 where pnrval=true or uidval=true


@rs3 = select UserId,PNR,UID from @rs1 where pnrval=false or uidval= false

OUTPUT @rs2   
    TO "/output/sl.csv"
      USING Outputters.Csv();

OUTPUT @rs3   
    TO "/output/error.csv"
      USING Outputters.Csv();

This worked for my requirement. Thanks for the support and suggestions

Francesco B.
  • 2,729
  • 4
  • 25
  • 37
Sreejith
  • 19
  • 3
1
@someData =
SELECT * FROM
    ( VALUES
    ("John", "12345"),
    ("Harry", "01234"),
    ("Tom", "24659")
    ) AS T(Name, pin);

@result1 =
    SELECT Name,
        pin
    FROM @someData
    WHERE pin.StartsWith("1") OR pin.StartsWith("2");

@result2 =
    SELECT Name,
        pin
    FROM @someData
    WHERE !pin.StartsWith("1") AND !pin.StartsWith("2");
0

Considering your input, I would use

.*\s*,\s*[12]\d+
  • .* matches any amount of characters and is needed to match everything before the comma
  • \s*,\s* matches a comma optionally preceded and or followed by any amount of blanks (\s matches a blank)
  • [12] matches a single digit, equal to 1 or 2; this satisfies your requirement about PINs
  • \d+ matches one or more digits

Live demo here.

As far as using Regex.Match, I'll quote this answer on StackOverflow:

System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.

So what I would do here is pre-parsing your CSV in C#; something like:

        Regex CurrentRegex = new Regex(@".*\s*,\s*[12]\d+", RegexOptions.IgnoreCase);

        foreach (var LineOfText in File.ReadAllLines(InputFilePath))
        {
            Match CurrentMatch = CurrentRegex.Match(LineOfText);
            if (CurrentMatch.Success)
            {
                // Append line to success file
            }
            else
            {
                // Append line to error file
            }
            CurrentMatch = CurrentMatch.NextMatch();
        }
Francesco B.
  • 2,729
  • 4
  • 25
  • 37
  • 1
    `[\s]` is better written just `\s`. Answering questions where the OP demonstrates no attempt of their own is slightly dubious. – tripleee Mar 25 '18 at 16:09
  • Dubious meaning...? however yes, you are right about those brackets – Francesco B. Mar 25 '18 at 16:13
  • 1
    See e.g. https://meta.stackexchange.com/questions/69798/is-stack-overflow-a-code-writing-service – tripleee Mar 25 '18 at 16:22
  • @tripleee, thanks, I updated my answer and the OP posted his/her code. I'm actively using StackOverflow since last October, can I ask why you are looking for "feedback on the tail end of my answers on Stack Overflow"? Is there a reason to delete someone's own answers? Thank you, best regards – Francesco B. Mar 25 '18 at 18:20
  • I have deleted some old answers which had a downvote where ultimately I agreed with the downvoter's (apparent) feedback that they did not provide a lot of value. Optimistically, I can always hope to receive a balancing upvote on the remaining ones. Removing less stellar content of your own is a good way to help keep the site's quality up and your own record neat. – tripleee Mar 26 '18 at 03:32