This is my first time trying to use the Kedro package.
I have a list of .wav files in an s3 bucket, and I'm keen to know how I can have them available within the Kedro data catalog.
Any thoughts?
I don't believe there's currently a dataset format that handles .wav
files. You'll need to build a custom dataset that uses something like Wave - not as much work as it sounds!
This will enable you to do something like this in your catalog:
dataset:
  type: my_custom_path.WaveDataSet
  filepath: path/to/individual/wav_file.wav  # this can be an s3:// URL
and you can then access your WAV data natively within your Kedro pipeline. You can do this for each .wav
file you have.
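For instance, a node could then consume the loaded data like any other catalog entry. This is only a minimal sketch: the function, column names, and output name are assumptions layered on whatever your custom dataset's _load() actually returns.

from kedro.pipeline import Pipeline, node

def summarise_wav(wav_df):
    # wav_df is whatever the custom dataset's _load() returns,
    # e.g. a DataFrame with 'file' and 'data' columns
    return wav_df.assign(n_samples=wav_df["data"].map(len))

# "dataset" matches the catalog entry name above; "wav_summary" is hypothetical
wav_pipeline = Pipeline([node(summarise_wav, inputs="dataset", outputs="wav_summary")])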
If you wanted to be able to access a whole folder's worth of .wav files, you might want to explore the notion of a "wrapper" dataset like the PartitionedDataSet, whose usage guide can be found in the documentation.
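For reference, a partitioned catalog entry along those lines could look something like this (a sketch only; the entry name, bucket path, and filename_suffix value are assumptions):

my_wav_files:
  type: PartitionedDataSet
  path: s3://my-bucket/path/to/wav/folder/  # hypothetical bucket/prefix
  dataset: my_custom_path.WaveDataSet
  filename_suffix: ".wav"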
This worked:
import pandas as pd
from pathlib import Path, PurePosixPath

from kedro.io import AbstractDataSet, PartitionedDataSet


class WavFile(AbstractDataSet):
    '''Used to load a single .wav file'''

    def __init__(self, filepath):
        self._filepath = PurePosixPath(filepath)

    def _load(self) -> pd.DataFrame:
        # load_wav is a separate helper (not shown here) that reads the audio samples
        df = pd.DataFrame({'file': [self._filepath],
                           'data': [load_wav(self._filepath)]})
        return df

    def _save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(filepath=self._filepath)


class WavFiles(PartitionedDataSet):
    '''Replaces the PartitionedDataSet.load() method to return a DataFrame.'''

    def load(self) -> pd.DataFrame:
        '''Materialises every partition and concatenates the results into one DataFrame.'''
        dict_of_data = super().load()
        df = pd.concat(
            [delayed() for delayed in dict_of_data.values()]
        )
        return df


my_partitioned_dataset = WavFiles(
    path="path/to/folder/of/wav/files/",
    dataset=WavFile,
)

my_partitioned_dataset.load()
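The load_wav helper called inside WavFile._load() isn't defined above. A minimal sketch using the standard-library wave module might look like the following; the function name and the 16-bit PCM assumption are mine, and reading straight from an s3:// path would additionally need something like fsspec rather than a plain local filepath.

import wave

import numpy as np

def load_wav(filepath):
    '''Hypothetical helper: read a .wav file into a numpy array of samples.'''
    with wave.open(str(filepath), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # assumes 16-bit PCM audio; adjust the dtype for other sample widths
    return np.frombuffer(frames, dtype=np.int16)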