1

I am new to U-SQL and have started querying files. Based on instructions that I have seen on documentation and here on stack overflow, I have a written a query for extracting metadata from a set of json files as below.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @InputFile string = "/test/{*}.json";
DECLARE @OutputFile string = "/metadata.csv";

@json =
EXTRACT
        id string,
        date DateTime,
        type string,
        uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
FROM
    @InputFile
USING new JsonExtractor();

@json2 =
    SELECT 
    uri
    modified_date,
    created_date,
    file_sz
FROM @json;

@json3 =
    SELECT DISTINCT uri,
    modified_date,
    created_date,
    file_sz

    FROM @json2;
OUTPUT @json3
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
DROP ASSEMBLY [Newtonsoft.Json];
DROP ASSEMBLY [Microsoft.Analytics.Samples.Formats];

This generates the required metadata( I run this on Azure portal even though here the locations are relative)

My questions are:

1) How can we use values from a column(a column of filenames) from an internal table / file as a list of file paths for input?

2) How can we append new data to an existing file and update the metadata file with list of new files.

My metadata looks like this:

uri           created_date       modified_date   file_sz
/…/abc.json      09-22-2018        09-23-2018       250
/…/del.json      09-24-2018        09-24-2018       126

EXPECTED OUTPUT (if possible)

@filespresent =
SELECT uri
FROM @metadata

DECLARE @Inputfile string = @filespresent

The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today.

The filename do not contain date and the only way I can extract the date information is from inside the file. I extract metadata in a separate query and then run the main job on the files selected from the metadata file.

If there are other workaround to this, they are also most welcome.

Any help is appreciated.

2 Answers2

1

That is a limitation for the moment. You can vote here for that feature. Also stated here the same issue.

The only way to get around that is to run the u-sql script with a powershell that injects an external variable like this:

DECLARE EXTERNAL @InputFile string = "/Input/sample{n}.json";

If you try to run this you get an error of "Rowset variable is not a scalar variable"...

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @InputFile string = "/Input/sample4.json";
DECLARE @OutputFile1 string = "/Output/metadata1.csv";
DECLARE @OutputFile2 string = "/Output/metadata2.csv";
DECLARE @OutputFile3 string = "/Output/metadata3.csv";

@json1 =
EXTRACT
        // Json fields
        id string,   
        file string,
        // Virtual columns
        n string,
        uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
FROM
    @InputFile // This file has a file set within
USING new JsonExtractor();

OUTPUT @json1
TO @OutputFile1
USING Outputters.Csv(outputHeader:true,quoting:true);

@json2 =
    SELECT 
    file
FROM @json1;

OUTPUT @json2
TO @OutputFile2
USING Outputters.Csv(outputHeader:true,quoting:true);

@json3 =
EXTRACT
        // Json fields
        id string,   
        file string,
        // Virtual columns
        n string,
        uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
FROM
    @json2 // This is a rowset variable (with our fileset)
USING new JsonExtractor();

OUTPUT @json3
TO @OutputFile3
USING Outputters.Csv(outputHeader:true,quoting:true);
Miguel Domingues
  • 440
  • 3
  • 11
1

The features you would like (e.g., reading the paths from a file, and appending to an existing file) is not available.

Appending to a file you could do by reading the file, union the new data and then writing the result into the same file.

However, by looking at your scenario of **The main purpose of doing this is I get a new batch of files everyday and I want to read only the files that were uploaded today. **

You can do your EXTRACT as above and then put a filter onto the created_date or modified_date column to only select the files that are created or modified for the specific date. E.g. (typed into stackoverflow),

DECLARE EXTERNAL @last_processed_modified_date = DateTime.Now();

@json = EXTRACT
    id string,
    date DateTime,
    type string,
    uri = FILE.URI()
  , modified_date = FILE.MODIFIED()
  , created_date = FILE.CREATED()
  , file_sz = FILE.LENGTH()
FROM @InputFile
USING new JsonExtractor();

@json = SELECT * FROM @json WHERE modified_date > @last_processed_modified_date;

…
Michael Rys
  • 6,684
  • 15
  • 23
  • Thanks for the answer Michael. This is what I have implemented but does this read (as in completely go through) all the files and then choose the selected files? The folder may have 1000+ files, so my concern for data units used is, does it completely read all files or only looks at created date attribute? – Interested_Programmer Sep 25 '18 at 02:14
  • DateTime.Now() should be DateTime.Now. It was throwing a syntax error. – Interested_Programmer Sep 25 '18 at 11:40