
I'm using a custom IOutputter to write the results of my U-SQL script to a local database:

OUTPUT @dataset
TO "/path/somefilename_{*}.file"
USING new CustomOutputter();

public class CustomOutputter : IOutputter
{
    public CustomOutputter()
    {
        myCustomDatabase.Open("databasefile.database");
    }

    public override void Output(IRow input, IUnstructuredWriter output)
    {
    }
}

Is there any way to replace "databasefile.database" with the output file path specified in the script, "/path/somefilename_{*}.file"?

Since I can't pass output.BaseStream to the database, I can't find a way to write to the correct file name.

UPDATE: How I copy the local DB file to the output stream provided by ADLA:

public override void Close()
{
    using (var fs = File.Open("databasefile.database", FileMode.Open))
    {
        byte[] buffer = new byte[65536];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            this.output.BaseStream.Write(buffer, 0, read);
            this.output.BaseStream.Flush();
        }
    }
}
coalmee

1 Answer


I am not sure what you are trying to achieve.

  1. Outputters (and UDOs in general) cannot leave their containers (VMs) when executed in ADLA (local execution has no such limit at this point). So connecting to a database outside the container will be blocked, and I am not sure how it helps to write data into a database inside a transient VM/container.

  2. The UDO model has a well-defined way to write to files that live in either ADLS or WASB: you write the data from the input row(set) into the output's stream (see the sketch below). You can write into local files, but again, those files cease to exist after the vertex finishes execution.
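
A minimal sketch of that standard pattern, with hypothetical column names ("id" and "payload" are placeholders, not from the question):

using System.Text;
using Microsoft.Analytics.Interfaces;

// Minimal sketch: each row is serialized straight into the stream that ADLA
// hands to the outputter, so the data ends up in the ADLS/WASB file named in
// the OUTPUT statement. The column names "id" and "payload" are hypothetical.
public class RowToStreamOutputter : IOutputter
{
    public override void Output(IRow input, IUnstructuredWriter output)
    {
        var line = string.Format("{0},{1}\r\n",
            input.Get<string>("id"),
            input.Get<string>("payload"));

        var bytes = Encoding.UTF8.GetBytes(line);
        output.BaseStream.Write(bytes, 0, bytes.Length);
    }
}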

Given this information, could you please rephrase?

Update based on clarifying comment

You have two options to generate a database file from a rowset:

  1. You use ADF (Azure Data Factory) to do the data movement. This is the most commonly used approach and probably the easiest.
  2. If you use a custom outputter, you could try the following (see the sketch after this list):
    1. Write the output rowset into the database that is local to your vertex, using the database interface (you have to deploy the database as a resource, so you probably need a small-footprint version to fit within the resource size limit).
    2. Then read the database file from the vertex-local directory into the output stream, so the file gets copied into ADLS.
    3. Note that you need atomic file processing on the outputter to avoid writing many database files that then get stitched together.
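
A rough sketch of what option 2 could look like, assuming the hypothetical myCustomDatabase API from the question; the AtomicFileProcessing flag is what keeps the output from being split across vertices:

using System.IO;
using Microsoft.Analytics.Interfaces;

// Sketch only: myCustomDatabase is the hypothetical local-database API from
// the question. AtomicFileProcessing = true forces the whole output file
// through a single vertex, so exactly one database file is produced.
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class DatabaseFileOutputter : IOutputter
{
    private const string LocalDbPath = "databasefile.database";
    private IUnstructuredWriter output;

    public override void Output(IRow input, IUnstructuredWriter output)
    {
        this.output = output;
        // Insert the row into the vertex-local database file here,
        // e.g. via the (hypothetical) myCustomDatabase interface.
    }

    public override void Close()
    {
        // Copy the finished local database file into the stream that ADLA
        // writes to the ADLS path named in the OUTPUT statement.
        using (var fs = File.OpenRead(LocalDbPath))
        {
            var buffer = new byte[64 * 1024];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                this.output.BaseStream.Write(buffer, 0, read);
            }
            this.output.BaseStream.Flush();
        }
    }
}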
Michael Rys
  • Thanks for the explanation. My intent was to let ADLA create a database file that can then be consumed by other services without further processing. So, because of these ADLA limitations, it seems I have to use e.g. ADF and a custom activity to convert the ADLA output into my database file format, right? – coalmee Jan 10 '17 at 13:46
  • Actually you have two options: 1. You use ADF to do the data movement. 2. If you use a custom outputter, you could try the following: write the output rowset into the database that is local to your vertex, using the database interface (you may have to deploy the database as a resource, so you probably need a small-footprint version to fit within the resource size limit), then read the database file from the vertex-local directory into the output stream so the file gets copied into ADLS. Note that you need atomic file processing and to deploy the database as a resource to the vertex. – Michael Rys Jan 10 '17 at 20:11
  • I would prefer the second approach. I already tried to copy the DB file to the output stream, but it fails with the 4MB row-size limit while writing the file to the output stream. See also: http://stackoverflow.com/questions/41533328/azure-data-lake-analytics-ioutputter-e-runtime-user-rowtoobig – coalmee Jan 11 '17 at 07:57
  • The original processing paradigms for extractors and outputters focused on row-oriented data formats (e.g. CSV), so outputters check that they do not write rows that cannot be read back. In your case, however, you are not writing rows but a byte stream. The FileCopy outputter shows how you can chunk your writing to produce blobs without running into the size limit (chunk your writes into blocks of less than 4MB). – Michael Rys Jan 11 '17 at 22:23
  • The DB accesses the local file in a random read/write manner, so I have to do the copying in the public override void Close() of the IOutputter. I updated my question with the code I use to copy the DB file. Even if I use a block size smaller than 4MB, I get the ROWTOOBIG error. That is the part I don't get. – coalmee Jan 12 '17 at 07:59
  • I checked with my dev team, and at this point they are still checking for the 4MB at a very low level. In my example, the row that I write is already chunked. I filed an internal request to review our design for cases where data is not written in row formats, to see if we can limit the check to row writes. – Michael Rys Jan 12 '17 at 23:45
  • Thank you very much! That would be awesome. It would save us the extra complexity of the transformation step through ADF and Azure Batch. – coalmee Jan 13 '17 at 08:23