How to parse a string from the input file name in streamsets

Question

I need to extract a string from the input file and add it as a field in the record.

For example, if my file has a date in the filename, only the date needs to be extracted and added as an additional column in the record. If the file name is like xyzYYYYMMDD.txt, only the YYYYMMDD part should be extracted.

What is stopping you from doing this? Have you written any code yet? where does it fail? Please learn about https://stackoverflow.com/help/minimal-reproducible-example — whiplash, Dec 08 '21 at 01:10

score 0 · Answer 1 · answered Dec 14 '21 at 03:13

I was able to accomplish this, assuming you are talking about StreamSets Data Collector. The idea is to copy the file name from the record header into a field, then parse that string programmatically in the Jython Evaluator to keep only the part you need.

Set up a Pipeline: (Directory Origin) -> (Expression Evaluator) -> (Jython Evaluator) -> (Trash)

==== Configuration:

Directory Origin:

File Name Pattern: ddsample_*
First File to Process: ddsample_20211203

Expression Evaluator:

Field Expressions
Output Field: /filename_from_header
Field Expression: ${record:attribute('filename')}

Jython Evaluator : Script

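for record in sdc.records:
  try:
    # Field populated by the Expression Evaluator, e.g. 'ddsample_20211203'
    txt = record.value['filename_from_header']
    # Drop the 9-character 'ddsample_' prefix, keeping only the date part
    record.value['filename_from_header'] = txt[9:]
    sdc.output.write(record)
  except Exception as e:
    # Send records that fail to the stage's error stream
    sdc.error.write(record, str(e))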

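Then click Preview and select the Jython Evaluator to inspect the output record. The filename_from_header field should now hold only the date portion of the file name (for example, 20211203).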
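
The fixed slice txt[9:] in the script only works while every file name starts with a 9-character prefix such as ddsample_. For names like xyzYYYYMMDD.txt from the question, a regular expression may be more robust. Below is a minimal sketch in plain Python, assuming the date is always a standalone run of exactly eight digits; extract_date and DATE_PATTERN are hypothetical names used for illustration, not part of the StreamSets API.

```python
import re

# Hypothetical helper for illustration; not a StreamSets function.
# Matches exactly eight digits that are not part of a longer digit run.
DATE_PATTERN = re.compile(r"(?<!\d)(\d{8})(?!\d)")

def extract_date(filename):
    """Return the first standalone 8-digit run in filename, or None."""
    match = DATE_PATTERN.search(filename)
    return match.group(1) if match else None

print(extract_date("ddsample_20211203"))
print(extract_date("xyz20211203.txt"))
```

Inside the Jython Evaluator, the same function could stand in for the slice, for example record.value['filename_from_header'] = extract_date(txt). Note that it returns None when no date is found, so you may want to route such records to sdc.error instead.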
