How do I add a directory of .wav files to the Kedro data catalogue?

Question

This is my first time trying to use the Kedro package.

I have a set of .wav files in an S3 bucket, and I'd like to know how I can make them available in the Kedro data catalog.

Any thoughts?

Zain Patel · Accepted Answer · 2021-01-26T11:50:25.370

I don't believe there's currently a built-in dataset type that handles .wav files. You'll need to build a custom dataset that uses something like Python's `wave` module, which is not as much work as it sounds!

This will enable you to do something like this in your catalog:

dataset:
  type: my_custom_path.WaveDataSet
  filepath: path/to/individual/wav_file.wav # this can be an s3:// URL

and you can then access your WAV data natively within your Kedro pipeline. You can do this for each .wav file you have.
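As a rough sketch of the read/write logic such a custom dataset could wrap in its `_load` and `_save` methods, here is a round trip using only the standard-library `wave` module. The helper names `read_wav` and `write_wav` and the demo file are made up for illustration, and this version only handles local paths, not s3:// URLs.

```python
import os
import tempfile
import wave


def read_wav(filepath):
    """Read a WAV file and return (params, raw_frames). Hypothetical helper."""
    with wave.open(str(filepath), "rb") as wav_file:
        params = wav_file.getparams()
        frames = wav_file.readframes(wav_file.getnframes())
    return params, frames


def write_wav(filepath, params, frames):
    """Write raw frames back out using the given params. Hypothetical helper."""
    with wave.open(str(filepath), "wb") as wav_file:
        wav_file.setparams(params)
        wav_file.writeframes(frames)


# Build a tiny mono, 16-bit, 8 kHz file of silence to exercise the helpers.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 80)

params, frames = read_wav(demo_path)

# Round trip: write the same audio to a second file and read it back.
copy_path = os.path.join(demo_dir, "copy.wav")
write_wav(copy_path, params, frames)
copy_params, copy_frames = read_wav(copy_path)
```

A custom dataset class would call something like `read_wav` from `_load` and `write_wav` from `_save`, with the file path taken from the catalog entry.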

If you wanted to be able to access a whole folder's worth of .wav files, you might want to explore the notion of a "wrapper" dataset like the PartitionedDataSet, whose usage guide can be found in the documentation.
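As far as I understand the documentation, loading a PartitionedDataSet does not return the data itself but a dictionary that maps each partition ID to a function that loads that partition on demand. The sketch below imitates that shape with a plain dictionary so it runs without Kedro installed; `make_loader`, `load_all` and the partition IDs are stand-ins invented for the illustration, not Kedro APIs.

```python
def make_loader(partition_id):
    """Return a zero-argument function that 'loads' one partition (stand-in)."""
    def load():
        # A real loader would read the .wav file behind this partition.
        return f"contents of {partition_id}"
    return load


# Stand-in for the dictionary a partitioned load hands back:
# partition ID -> callable that loads that partition lazily.
partitions = {pid: make_loader(pid) for pid in ["clip_01.wav", "clip_02.wav"]}


def load_all(partitions):
    """Call every load function and collect the results by partition ID."""
    return {pid: load_func() for pid, load_func in sorted(partitions.items())}


loaded = load_all(partitions)
```

Nothing is read until a load function is called, so a node can pick out only the partitions it needs.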

score 0 · Answer 2 · edited Feb 15 '22 at 01:59

This worked:

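import pandas as pd

from pathlib import Path, PurePosixPath
from kedro.io import AbstractDataSet, PartitionedDataSet

# Note: load_wav is not defined in this answer; it is assumed to be a
# user-supplied helper that reads a single .wav file and returns its data.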


class WavFile(AbstractDataSet):
    '''Used to load a .wav file'''
    
    def __init__(self, filepath):
        self._filepath = PurePosixPath(filepath)

    def _load(self) -> pd.DataFrame:
        df = pd.DataFrame({'file': [self._filepath],
                           'data': [load_wav(self._filepath)]})     
        return df
    

    def _save(self, df: pd.DataFrame) -> None:
        # Note: this writes the DataFrame as CSV text to the configured path,
        # not as WAV audio.
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(filepath=self._filepath)
    
    
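class WavFiles(PartitionedDataSet):
    '''Replaces the PartitionedDataSet.load() method to return a DataFrame.'''

    def load(self) -> pd.DataFrame:
        '''Loads every partition and concatenates the results into one DataFrame.'''
        # The parent load() returns a dict mapping each partition ID to a
        # callable that loads that partition lazily.
        dict_of_data = super().load()

        df = pd.concat(
            [delayed() for delayed in dict_of_data.values()]
        )

        return df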
    
    
my_partitioned_dataset = WavFiles(
    path="path/to/folder/of/wav/files/",
    dataset=WavFile,
)
     
my_partitioned_dataset.load()

This broadly works, but why not just drop `PartitionedDataSet` altogether and let `WavFile.load` accept a directory (either exclusively, or you can condition on directory vs. file)? — Zain Patel, Jan 28 '21 at 02:44
Yes, I did that in the end + some logic to deal with the s3 connection. — Myccha, Jan 29 '21 at 02:09
Did you consider using `fsspec` like the rest of the Kedro datasets, which takes care of the S3 connection automatically for you in the background (and any other remote filesystems!)? — Zain Patel, Jan 29 '21 at 13:36
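The "condition on directory vs. file" idea from the comments could look roughly like the sketch below. It is not the code the asker ended up with (that was never posted); the function names are hypothetical, and it only covers local paths with the standard library, leaving out the S3 handling discussed above.

```python
import tempfile
import wave
from pathlib import Path


def read_wav_frames(path):
    """Return the raw audio frames of one WAV file as bytes."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.readframes(wav_file.getnframes())


def load_wav_path(path):
    """Return {file name: raw frames} for one .wav file or a directory of them."""
    path = Path(path)
    if path.is_dir():
        # Directory: pick up every .wav file directly inside it.
        wav_paths = sorted(path.glob("*.wav"))
    else:
        # Single file: treat it as a one-element collection.
        wav_paths = [path]
    return {wav_path.name: read_wav_frames(wav_path) for wav_path in wav_paths}


def write_silence(path, n_frames):
    """Write a mono, 16-bit, 8 kHz WAV file of silence (demo data only)."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(8000)
        wav_file.writeframes(b"\x00\x00" * n_frames)


demo_dir = Path(tempfile.mkdtemp())
write_silence(demo_dir / "a.wav", 10)
write_silence(demo_dir / "b.wav", 20)
(demo_dir / "notes.txt").write_text("not audio, should be skipped")

from_dir = load_wav_path(demo_dir)
from_file = load_wav_path(demo_dir / "a.wav")
```

Returning the same dictionary shape in both cases means downstream nodes do not need to care whether the catalog entry pointed at a file or a folder.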
