How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?

Question

I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).

The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.

The issues are:

By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that every newly uploaded file appears in the same folder as an additional file, and the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, which takes up unnecessary storage unless I frequently go in and delete old files.

All of this adds unnecessary complexity to my downstream system. I just want to be able to pull the latest version of the data in a single step.

score 2 · Answer 1 · edited Jan 13 '22 at 15:27

This is possible by renaming the single parquet file in the dataset so that it always has the same file name. That way the export task will overwrite the previous file in the external system.

This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.

Notes

The build will fail if the input contains more than one parquet file. As pointed out in the question, calling .coalesce(1) (or .repartition(1)) in the upstream transform is necessary.
If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate: only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects, not paths on a local disk.
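
As a minimal sketch of the last note, the snippet below copies bytes between two in-memory file objects with shutil.copyfileobj. The io.BytesIO objects only stand in for the handles returned by filesystem().open(); the helper name copy_stream and the chunk size are illustrative assumptions, not part of the answer's code.

```python
import io
import shutil


def copy_stream(src, dst, chunk_size=1024 * 1024):
    """Copy all bytes from one binary file object to another, chunk by chunk.

    Hypothetical helper for illustration: it only needs objects with
    read() and write(), which is why it also works for handles that are
    not real files on local disk.
    """
    shutil.copyfileobj(src, dst, chunk_size)


# In-memory stand-ins for the input and output dataset file handles.
source = io.BytesIO(b"PAR1 example bytes PAR1")
target = io.BytesIO()
copy_stream(source, target)
```

Because the copy is done in chunks, the whole file does not have to be held in memory at once.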

Full code snippet

example_transform.py

from transforms.api import transform, Input, Output
from . import utils  # relative import of utils.py from the same package


@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")

utils.py

from transforms.api import Input, Output
import shutil
import logging

log = logging.getLogger(__name__)


def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
    single ".snappy.parquet" file in the transforms input.  This is useful when you need to export the data using
    Magritte and want a human readable name in the output.  When not using separate transaction folders, this should
    cause the previous output to be automatically overwritten.

    The input to this function must contain a single ".snappy.parquet" file, this can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.

    This function should not be used for large dataframes (e.g. those greater than 512 mb in size), instead
    transaction folders should be enabled in the export.  This function can work for larger sizes, but you may find you
    need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.

    This produces a dataset without a schema, so features like expectations can't be used.

    Parameters:
        output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output, this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name: The name of the file to be written; ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"

    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.

    Returns:
        void: writes the response to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting

    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]

    log.info("Initial output file name: " + file_name)
    # check for snappy.parquet and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)

    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
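
The suffix handling in write_single_named_parquet_file can be checked on its own, outside Foundry. The sketch below pulls the same four cases into a standalone function. The name normalize_parquet_file_name is hypothetical, and slicing is used instead of str.removesuffix on the assumption that the code may need to run on Python versions older than 3.9.

```python
def normalize_parquet_file_name(file_name: str) -> str:
    """Return file_name with exactly one ".snappy.parquet" suffix.

    Hypothetical helper that mirrors the if/elif chain in the function above.
    """
    suffix = ".snappy.parquet"
    if file_name.endswith(suffix):
        return file_name  # already correct
    if file_name.endswith(".parquet"):
        # swap a bare ".parquet" for ".snappy.parquet"
        return file_name[: -len(".parquet")] + suffix
    if file_name.endswith(".snappy"):
        # a bare ".snappy" only needs ".parquet" appended
        return file_name + ".parquet"
    return file_name + suffix  # no recognised suffix at all


names = ["readable_file_name", "data.parquet", "data.snappy", "data.snappy.parquet"]
normalized = [normalize_parquet_file_name(n) for n in names]
```

Every entry in normalized ends with a single ".snappy.parquet", whichever form was passed in.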

Thanks, I appreciate it. Is it possible for us to write the 'parquet' format into .JSON? We have a very similar situation where we need to rename the file with 'current_date' as a suffix, but writing to JSON. — mari, Jul 06 '22 at 09:53
@mari json files should be in json format, mismatching file extensions and file formats is unlikely to ever be intentional. You can use the above code to transform an input parquet dataset into an output json dataset. — ollie299792458, Jul 07 '22 at 10:20

score 1 · Accepted Answer · answered Jan 13 '22 at 17:32

You can also use the rewritePaths functionality of the export plugin to rename the spark/*.snappy.parquet file to "export.parquet" while exporting. This of course only works if there is a single file, so .coalesce(1) in the transform is a must:

excludePaths:
  - ^_.*
  - ^spark/_.*
rewritePaths:
  '^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true

nicornk

This is quite nice for a single file. What would be the approach if the dataset that we are trying to save needs to be partitioned? – dry Feb 08 '22 at 06:04
@dry this feature works with multiple files! If you search for `rewritePaths` in the Foundry documentation it'll show you more details. But, if you're partitioning your data, I'd recommend using `createTransactionFolders: true` in your abfs config, in case the number of partitions changes. – ollie299792458 Feb 09 '22 at 15:35
@ollie299792458 thanks for the tip. Looking at the docs available, it was not obvious to me how it behaves. – dry Feb 10 '22 at 05:27
@dry the explanation above the example in the docs seems pretty clear: it is a yaml map from keys to values, where the keys are regex matches, and the values are strings with optional regex (`$1`, `$2` etc) or timestamp (`${dt:yyyy-MM-dd}` - using Java DateTimeFormatter) groups. Is it the description of what the keys and values actually do that is unclear? – ollie299792458 Feb 11 '22 at 13:36
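
To make the key/value description in the comment above more concrete, here is a rough Python emulation of a regex-to-template map applied to several file paths. It is only a sketch and not the export plugin's implementation: the function rewrite_path, the example rule, and the "first matching key wins" behaviour are all assumptions, Python's re module stands in for Java regex, and the `${dt:...}` timestamp groups are not handled.

```python
import re


def rewrite_path(path, rules):
    """Apply the first rule whose regex key matches path (assumed semantics).

    rules maps a regex pattern to a template that may contain $1, $2, ...
    Paths that match no rule are returned unchanged.
    """
    for pattern, template in rules.items():
        if re.search(pattern, path):
            # Translate Java-style $1 into Python's \g<1> group reference.
            py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
            return re.sub(pattern, py_template, path, count=1)
    return path


# Hypothetical rule: keep the part number, drop the per-build rid.
rules = {r"^spark/part-(\d+)-.*\.snappy\.parquet$": "export_part_$1.parquet"}
paths = [
    "spark/part-00000-abc.snappy.parquet",
    "spark/part-00001-def.snappy.parquet",
    "spark/_SUCCESS",
]
rewritten = [rewrite_path(p, rules) for p in paths]
```

With a rule like this, each part file maps to a name that does not change between builds, which is what allows multiple files to be overwritten in place.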

score 1 · Answer 3 · answered Feb 10 '22 at 05:12

I ran into the same requirement. The only difference was that the dataset had to be split into multiple parts due to its size. I'm posting the code here, updated to handle this use case.

import shutil

from transforms.api import Input, Output


def rename_multiple_parquet_outputs(output: Output, input: Input, file_name_prefix: str):
    """
        Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting

    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    # filter non parquet files, sorted so the part numbering follows the input file names
    input_files = sorted(f for f in input_files if f.endswith(".snappy.parquet"))
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    print(f'input files {input_files}')
    print("prefix for target name: " + file_name_prefix)

    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file

To use this in a code workbook, the input needs to be persisted, and the output parameter can be retrieved as shown below.

def rename_outputs(persisted_input):
    output = Transforms.get_output()
    # call the function defined above (the original snippet used a mismatched name)
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")

Do you know if there is a way to parallelize the file copy activities? — nicornk, Feb 10 '22 at 15:15
If you're splitting into multiple parts you'll want some check to ensure there are always the same number of parts, otherwise you won't get nice overwrite semantics in the external system (if I have 7 files one time then 6 files next time, then the output will still have 7 files in, the 7th file will be stale data). For abfs syncs using `createTransactionFolders: true` is a better solution with multiple files. — ollie299792458, Feb 11 '22 at 13:31
Sure, I agree that the checks are required. But it would be nice to run the copy activities in parallel afterwards (think of Hadoop style distcp). — nicornk, Feb 12 '22 at 18:16
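
The part-count check suggested in the comments could look roughly like the sketch below. The function name check_part_count and the expected_parts parameter are hypothetical. The idea is to fail the build before copying anything if the number of parquet files differs from the number the external system already holds, so that no stale part is left behind.

```python
def check_part_count(parquet_paths, expected_parts):
    """Return the paths sorted, or raise if the number of parts is unexpected.

    Hypothetical guard: expected_parts is whatever fixed number of files the
    upstream transform is meant to produce, e.g. via .repartition(n).
    """
    actual = len(parquet_paths)
    if actual != expected_parts:
        raise RuntimeError(
            f"Expected {expected_parts} parquet files but found {actual}; "
            "a different part count would leave stale files in the external system."
        )
    return sorted(parquet_paths)


checked = check_part_count(["spark/part-00001-b", "spark/part-00000-a"], expected_parts=2)
```

In rename_multiple_parquet_outputs this would be called on the filtered input_files list before the copy loop.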
