0

In Azure Data Factory, how do I check whether an array of strings (file names) contains a value?

I'm getting file names from a Get Metadata activity, and I need to check that all 4 file names I expect are present in the storage account before proceeding.

I must check explicitly against the file names, not the number of files; this is a requirement.

When I try to validate using the childItems output from Get Metadata, I get the error "array elements can only be selected using an integer index." The problem is that a given file could be at any index in the next load.

Is there a better way to validate the filename?

Appreciate your help, thanks in advance

ADFDev

3 Answers

1

My Get Metadata output looks like this:

"childItems": [
    {
        "name": "1.py",
        "type": "File"
    },
    {
        "name": "SalesData.numbers",
        "type": "File"
    },
    {
        "name": "file1.txt",
        "type": "File"
    }
]

I used the expression below in a Set Variable activity to check for the file names:

@if(
    contains(activity('Get Metadata1').output.childItems,
        json(concat('{"name":"file1.txt"',',','"type":"File"}'))),
    if(
        contains(activity('Get Metadata1').output.childItems,
            json(concat('{"name":"file2.txt"',',','"type":"File"}'))),
        if(
            contains(activity('Get Metadata1').output.childItems,
                json(concat('{"name":"2.py"',',','"type":"File"}'))),
            'yes', 'no'),
        'no'),
    'no')

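This checks whether my blob container has file1.txt, file2.txt and 2.py.

If all three are present, I assign 'yes' to the variable; otherwise 'no'. With the sample output above, the result would be 'no', since file2.txt and 2.py are missing.

You can use an If Condition activity as well.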
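For intuition only, here is the same membership test sketched in Python rather than ADF expression syntax. It assumes childItems behaves like a list of objects with "name" and "type" keys, as in the output above; `required` and `all_present` are names made up for this sketch.

```python
# Sample shaped like the Get Metadata output shown above.
child_items = [
    {"name": "1.py", "type": "File"},
    {"name": "SalesData.numbers", "type": "File"},
    {"name": "file1.txt", "type": "File"},
]

# Hypothetical list of required files (names taken from the expression above).
required = ["file1.txt", "file2.txt", "2.py"]


def all_present(items, names):
    """Return 'yes' only if every name appears in items as a File entry."""
    # Each check compares a whole {"name", "type"} object against the list,
    # which is why the expression builds the full object and not just the name.
    found = all({"name": n, "type": "File"} in items for n in names)
    return "yes" if found else "no"


result = all_present(child_items, required)
print(result)  # 'no' for this sample: file2.txt and 2.py are missing
```

Because the whole object is compared, a bare file name string would not match any element; the "type" key has to be part of the lookup value.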

All About BI
1

It is possible to check whether multiple files exist using arrays, but it is a bit fiddly. I often pass this off to another activity in the pipeline, e.g. a Stored Procedure or Notebook activity, depending on what compute is available to the pipeline (e.g. a SQL database or Spark cluster). However, if you do need to do this in the pipeline itself, this may work for you.

To start off, I have an array parameter with the following value:

| Parameter Name | Parameter Type | Parameter Value |
|---|---|---|
| pFilesToCheck | Array | ["json1.json","json2.json","json3.json","json4.json"] |

These are the files that must exist. Next I have a Get Metadata activity pointing at a data lake folder with the Child Items argument set in the Field List:

Get Metadata activity

This will return some output in this format, listing all the files in the given directory, with some additional information on the execution:

{
    "childItems": [
        {
            "name": "json1.json",
            "type": "File"
        },
        {
            "name": "json2.json",
            "type": "File"
        },
        {
            "name": "json3.json",
            "type": "File"
        },
        {
            "name": "json4.json",
            "type": "File"
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Some Region)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 1
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

In order to compare the input array pFilesToCheck (the files which must exist) with the results from the Get Metadata activity (the files which do exist), we must put them in a comparable format. I use an Array variable to do this:

| Variable Name | Variable Type |
|---|---|
| arrFilenames | Array |

Next is a For Each activity running in Sequential mode, using the range function to loop from 0 to 3, i.e. the array index of each item in the childItems array. The expression uses length to count the items in the Get Metadata output; because the array is 0-based, starting the range at 0 yields exactly the valid indexes. The Items property is set to the following expression:

@range(0,length(activity('Get Metadata File List').output.childItems))

Inside the For Each activity is an Append Variable activity, which appends the name of the current item in the loop to the array variable arrFilenames. It uses this expression in the Value property:

@activity('Get Metadata File List').output.childItems[item()].name

'@item()' in this case will be a number between 0 and 3 being generated by the range function mentioned above. Once the loop is complete the array arrFilenames will now look like this (ie in the same format as the input array):

["json1.json","json2.json","json3.json","json4.json"]
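As a rough Python analogy of the For Each plus Append Variable steps (not ADF syntax), the loop walks the integer indexes and collects each name. The sample `child_items` list mirrors the output above; `arr_filenames` stands in for the arrFilenames variable.

```python
# Sample shaped like the childItems array in the Get Metadata output above.
child_items = [
    {"name": "json1.json", "type": "File"},
    {"name": "json2.json", "type": "File"},
    {"name": "json3.json", "type": "File"},
    {"name": "json4.json", "type": "File"},
]

# Stand-in for the arrFilenames array variable, initially empty.
arr_filenames = []

# range(0, len(...)) yields the indexes 0..3, like the For Each Items expression.
for i in range(0, len(child_items)):
    # Index with an integer first, then read the name property.
    arr_filenames.append(child_items[i]["name"])

print(arr_filenames)
```

Indexing by integer first and reading "name" second is the step that avoids the "array elements can only be selected using an integer index" error from the question.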

The input array and actual file list can now be compared using the intersection function. I use a Set Variable activity with a boolean variable to record the result:

@equals(
    length(variables('arrFilenames')),
    length(intersection(variables('arrFilenames'), pipeline().parameters.pFilesToCheck)))

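This expression compares the number of files that actually exist with the number of those files that also appear in the input array (via the intersection function). If the numbers match, every file found in the folder is in the expected list. Note that this alone does not prove that every expected file is present: a folder holding only json1.json would also pass. To test that all expected files exist, compare the intersection length with the length of the input array instead.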
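To make explicit what the length comparison does and does not test, here is a small Python analogy. The `intersection` helper is a hand-written stand-in for the ADF function (assumed here to return the common items without duplicates), and the two check functions are hypothetical names for the two possible comparisons.

```python
def intersection(first, second):
    """Items of `first` that also occur in `second`, de-duplicated, in order."""
    common = []
    for item in first:
        if item in second and item not in common:
            common.append(item)
    return common


def found_files_all_expected(found, expected):
    # Compares against the length of the files that actually exist.
    return len(found) == len(intersection(found, expected))


def expected_files_all_found(found, expected):
    # Compares against the length of the files that must exist.
    return len(expected) == len(intersection(found, expected))


expected = ["json1.json", "json2.json", "json3.json", "json4.json"]
only_one = ["json1.json"]               # three expected files missing
with_extra = expected + ["other.json"]  # all expected files plus one more

print(found_files_all_expected(only_one, expected))    # True, despite missing files
print(expected_files_all_found(only_one, expected))    # False
print(found_files_all_expected(with_extra, expected))  # False, because of the extra file
print(expected_files_all_found(with_extra, expected))  # True
```

When the folder contains exactly the expected files, both comparisons agree; they differ only when files are missing or extra files are present.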

wBob
    This is a great answer. For my use case -- determining if a container contained a specified file list -- I had to adjust the final `Set Variable` to: `@equals( length(pipeline().parameters.pFilesMustExist), length(intersection(variables('arrFilenames'),pipeline().parameters.pFilesMustExist)))` – Nick May 08 '22 at 16:51
-1

Can you try this (Python)?

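import fnmatch
import os

# Directory to start from and the filename pattern to match.
rootPath = '/'
pattern = '*.mp3'

# Walk the whole tree under rootPath and print each matching file's full path.
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))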
ASH