Stream Analytics job reference data join creating duplicates

Question

I am using Stream Analytics to join streaming data (from IoT Hub) with reference data (from blob storage). A new reference blob file is generated every minute with the latest data, named in the format "filename-{date} {time}.csv". The reference data supplies the parameters for an Azure Machine Learning function called in the SA job. The output of the Stream Analytics job (to Azure SQL or Power BI) contains multiple rows for each Azure Machine Learning function result instead of one: one row for each set of parameter values from the previous blob files. My understanding is that the job should use only the latest blob file's content, but it looks like it is using all the blob files and generating multiple rows of AML output. Here is the query I am using:

SELECT AMLFunction(Ref.Input1, Ref.Input2), *
FROM IoTInput Stream
LEFT JOIN RefBlobInput Ref
ON Stream.DeviceId = Ref.[DeviceID]

Could you please advise whether the query or the file path needs changing to avoid duplicate records? Thanks

score 0 · Answer 1 · answered Jan 25 '18 at 09:32

For the job to use only the latest file, you need to store your files in a particular folder structure.

As you may have noticed, when you select a reference data file as an input, the input dialog asks for the folder structure (path pattern) along with the date and time formats.

Stream Analytics always looks for the reference file in the latest {date}/{time} folder, i.e. you need to store your file like this:

2018-01-25/07-30/filename.json (YYYY-MM-DD/HH-mm/filename.json)

NOTE: The time folder needs to be unique for each minute, and likewise the date folder needs to be unique for each date. Whenever you create a new file, put it under a new timestamp folder inside the current date folder.

You can use any date and time format that the reference data input supports.
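As a rough illustration of the folder layout described above, here is a minimal Python sketch of how the process that writes the reference file each minute might build the blob path. The function name `reference_blob_path` and the default file name are hypothetical, and the sketch assumes the input uses the path pattern {date}/{time}/filename.json with date format YYYY-MM-DD and time format HH-mm, and that the timestamps are expressed in UTC. Uploading the blob is left out.

```python
from datetime import datetime, timezone


def reference_blob_path(ts: datetime, filename: str = "filename.json") -> str:
    """Build a blob path of the form YYYY-MM-DD/HH-mm/filename.

    Hypothetical helper: assumes the reference data input is configured
    with the path pattern {date}/{time}/filename.json, date format
    YYYY-MM-DD and time format HH-mm.
    """
    # Assumption: folder timestamps are in UTC, so normalise first.
    ts_utc = ts.astimezone(timezone.utc)
    return f"{ts_utc:%Y-%m-%d}/{ts_utc:%H-%M}/{filename}"


# Each minute maps to its own folder, so a new file never shares a folder
# with an older one.
print(reference_blob_path(datetime(2018, 1, 25, 7, 30, tzinfo=timezone.utc)))
# -> 2018-01-25/07-30/filename.json
```

Calling the helper with the current time, e.g. `reference_blob_path(datetime.now(timezone.utc))`, gives the path to write the newest file to.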
