If you're going straight into Code Repositories or Code Workbooks, you can use the `input_file_name()` function (see proggeo's answer below). That is likely easier and simpler than the schema method described here, but it won't work if you're going to do something else with the data.
Schema Method
If you open your dataset and go to Details -> Schema, you can edit the schema to add a file path column; for each row, this column will hold the path of the file that the row came from.
The key parts are the `_filePath` member of `fieldSchemaList` and `"addFilePath": true` under `customMetadata`. The first is a special column that `TextDataFrameReader` populates with the file path; the second tells the reader to populate that column. The other column in the example below (`content`) just contains the full contents of each file.
For more details, see the Metadata section under Foundry core backend in the platform documentation. The same approach also works for CSVs and other, more structured data via different reader classes.
Full schema example
{
  "fieldSchemaList": [
    {
      "type": "STRING",
      "name": "content",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "STRING",
      "name": "_filePath",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    }
  ],
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "textParserParams": {
      "parser": "SINGLE_COLUMN_PARSER",
      "nullValues": null,
      "nullValuesPerColumn": null,
      "charsetName": "UTF-8",
      "addFilePath": true,
      "addByteOffset": false,
      "addImportedAt": false
    }
  }
}
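If you edit the schema JSON by hand, a quick sanity check with plain Python can confirm the two parts that matter are in place. The `schema_json` string below is a shortened copy of the example above (the `null` bookkeeping fields are omitted for brevity):

```python
import json

# Shortened copy of the schema example above, just for the check.
schema_json = """
{
  "fieldSchemaList": [
    {"type": "STRING", "name": "content"},
    {"type": "STRING", "name": "_filePath"}
  ],
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "textParserParams": {"parser": "SINGLE_COLUMN_PARSER", "addFilePath": true}
  }
}
"""

schema = json.loads(schema_json)

# The two things that matter: a _filePath column, and addFilePath switched on.
names = [field["name"] for field in schema["fieldSchemaList"]]
add_file_path = schema["customMetadata"]["textParserParams"]["addFilePath"]
```

If `"_filePath"` is missing from `names` or `add_file_path` is not `true`, the reader won't fill in the path column.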