Iterate over a directory and extract only file names without reading the payload

Question

I am using Mule 4.4 Community Edition on premise. Thanks to earlier help (here), I have been able to read and process a large file without consuming excessive memory, which is all good.

Building on this, my use case is to read all .csv files in a directory and then process them one by one:

/opt/out/
         students.csv
         teachers.csv
         collesges.csv
         ....

So my plan was to list the files in the directory:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <non-repeatable-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

I then wanted to read only the file names from the directory, without reading the payload.

As per this early access article, we are advised to use <non-repeatable-iterable />. However, when I try to extract the attributes after the List operation, as the article describes:

<set-payload doc:name="Set Payload"  value="#[output application/json --- payload map $.attributes]"/>

No attributes are available. (My plan is to extract the file names, run a for-each loop over them, and use a choice condition inside the loop: if the file name contains student, use the student transformer; if it contains teacher, use the teacher transformer; and so on.)

However, as attributes are not available, I am not able to pass file names to the for loop (yet to be written).

So I changed from <non-repeatable-iterable /> to <repeatable-in-memory-iterable />

Code below:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <repeatable-in-memory-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

Using the above, I can extract the attributes of file names.

I am confused about the following:

The files to be processed in the above directory will be large (each file 700 MB), so while iterating the directory by using repeatable-in-memory-iterable, will it cause any memory issues? (I do not want to read file content, simply get file names at this stage)

Here is the complete flow so far (note that it does not yet contain the logic to process each file inside the for-each loop, which I will plug in):

<flow name="employee-process-flow">
    <http:listener doc:name="Listener"  config-ref="HTTP_Listener_config" path="/processFiles"/>
    <set-variable value='#[now() as String { format: "ddMMuu" }]' doc:name="Set todays date as ddmmyy" doc:id="c6a91a41-65b1-46df-a720-9c13fe360b6b" variableName="today"/>

    <sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
        <repeatable-in-memory-iterable />
        <sftp:matcher filenamePattern="#['*.csv' ]"
                      directories="EXCLUDE" symLinks="EXCLUDE" />
    </sftp:list>

    <set-payload doc:name="Set Payload" value="#[output application/json --- payload map $.attributes]"/>
    <foreach doc:name="For Each" >
        <logger level="INFO" doc:name="Logger"  message="we are here"/>
    </foreach>

</flow>

I just tried your code with `` and it worked without any problem for me. Can you tell the SFTP connector version in your project? — Harshank Bansal, May 25 '22 at 11:31
It is actually *"collesges.csv"*? Not *"colleagues.csv"* or *"colleges.csv"*? — Peter Mortensen, May 27 '22 at 07:39
Due to [8.3 filename](https://en.wikipedia.org/wiki/8.3_filename) constraints (but it violates that, with 9 characters)? In any case, what is the intended word? — Peter Mortensen, May 27 '22 at 08:43

score 1 · Answer 1 · edited May 27 '22 at 07:16


The List operation returns a list of messages, and each has a payload and attributes. The content of the files is returned as the payload, in a lazy way, meaning that the file's content is read only if you try to access that element's payload.

It makes sense that if you use a non-repeatable-iterable and don't access the payload of each item in the <foreach>, then you should not have any memory issues, because the contents are not read.
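
A minimal sketch of that idea, assuming the `SFTP_Config` and the directory from the question: the `<foreach>` iterates over the messages returned by the List operation and routes on `attributes.path` only, never touching `payload`. The flow names `process-students` and `process-teachers` are hypothetical placeholders, and whether the connector really leaves the file content unread in this setup is worth verifying with a large file.

```xml
<flow name="list-and-route-by-name">
    <sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
        <non-repeatable-iterable />
        <sftp:matcher filenamePattern="*.csv" directories="EXCLUDE" symLinks="EXCLUDE" />
    </sftp:list>

    <!-- Each element is a message; inside the loop its metadata is exposed as `attributes` -->
    <foreach doc:name="For Each">
        <choice doc:name="Route by file name">
            <when expression="#[attributes.path contains 'student']">
                <!-- hypothetical flow that would read and transform the student file -->
                <flow-ref name="process-students" />
            </when>
            <when expression="#[attributes.path contains 'teacher']">
                <!-- hypothetical flow that would read and transform the teacher file -->
                <flow-ref name="process-teachers" />
            </when>
            <otherwise>
                <logger level="WARN" message="#['No transformer for ' ++ attributes.path]" />
            </otherwise>
        </choice>
    </foreach>
</flow>
```

Routing on the attributes inside the loop avoids the intermediate `payload map $.attributes` step, which has to walk the whole iterable before the loop starts.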

With in-memory repeatable streaming, it is possible that the entire payload is read into memory. Try reading a file a few gigabytes in size and see what happens.

I'm not sure what the problem is with the attributes. It should work the same in any streaming mode.

Note that if you plan on doing something with the attributes—other than printing them—then you should output to application/java instead of JSON, to avoid unneeded conversions to and from JSON. For example, in your flow the output is used as input for the <foreach>, so it would be better for it to be Java.

Example: output application/java --- payload map $.attributes
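
In the flow from the question, that change could look like the following sketch. After the mapping, each element of the loop is an attributes object, so the path is read from `payload`; the logger message is only illustrative.

```xml
<!-- Keep the mapped attributes as Java objects instead of serializing them to JSON -->
<set-payload doc:name="Set Payload" value="#[output application/java --- payload map $.attributes]" />
<foreach doc:name="For Each">
    <!-- Here `payload` is one file's attributes, not the file content -->
    <logger level="INFO" doc:name="Logger" message="#['Found file: ' ++ payload.path]" />
</foreach>
```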

answered May 24 '22 at 17:42 by aled
edited May 27 '22 at 07:16 by Peter Mortensen

Thanks @aled for your feedback. I definitely cannot read the attributes of the files when I use `non-repeatable-iterable`, while I can read them when using `repeatable-in-memory-iterable`. Since a non-repeatable iterable will not read the payload unless I access `payload`, I tried introducing a for loop after the list operation. Once control enters the for loop, I can see Mule loading the payload (in debug mode) and I can also access its attributes. It also takes Mule some time to enter the for loop for each file (tried with a 9 MB file; will ramp it up). – GettingStarted With123 May 25 '22 at 00:27
I also get the following warning in the logs: `org.mule.extension.file.common.api.AbstractFileInputStreamSupplier: With the purpouse of performing a size check on the file /opt/students.csv, this thread will sleep. The connector has no control of which type of thread the sleep will take place on, this can lead to running out of thread if the time for 'timeBetweenSizeCheck' is big or a lot of files are being read concurrently. This warning will only be shown once.` – GettingStarted With123 May 25 '22 at 00:39
Also thanks @aled for your comment about using `application/java` rather than `application/json`. I will ask a separate question, thanks. – GettingStarted With123 May 26 '22 at 02:44
I tried listing files of different sizes in a directory (1 KB, 1.5 MB, 19 MB, 670 MB) using `repeatable-in-memory-iterable` and, surprisingly, no issues were reported while listing the files. – GettingStarted With123 May 26 '22 at 06:27
