
I'm training an NLP model using spaCy. I have the preprocessing steps all written as a pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:

import subprocess
import sys


def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
):
    # Run spaCy's CLI with the same interpreter executing this script;
    # passing the arguments as a list avoids shell quoting issues.
    cmd = [
        sys.executable, "-m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]

    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("spaCy training failed")

But I have no idea how to retrieve the file path information from the items in my data catalog. Is there a way of passing this information to my nodes when creating the pipeline?


3 Answers


The variables you are using as input are plain strings, while the entries in the data catalog are Kedro Datasets; the two are not interchangeable. Store the paths as part of your config instead and you should be able to get your project started, for example as sketched below.
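A minimal sketch of this approach, assuming the paths live under a hypothetical spacy_train key in conf/base/parameters.yml and a Kedro version that supports dotted params: references:

# conf/base/parameters.yml (hypothetical keys):
#
# spacy_train:
#   config_filepath: data/01_raw/config.cfg
#   train_filepath: data/05_model_input/train.spacy
#   dev_filepath: data/05_model_input/dev.spacy
#   output_dir: data/06_models

from kedro.pipeline import Pipeline, node

from .nodes import train_spacy_nlp_model  # the function from the question


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=train_spacy_nlp_model,
                inputs=[
                    "params:spacy_train.config_filepath",
                    "params:spacy_train.train_filepath",
                    "params:spacy_train.dev_filepath",
                    "params:spacy_train.output_dir",
                ],
                outputs=None,
                name="train_spacy_model",
            )
        ]
    )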

  • I'm only giving the code as an example of the behavior I'm expecting; the code is not set in stone. I really want to avoid storing the path in both the config and the data catalog, as having it in both places would be an anti-pattern, and I need the files in the catalog for the previous preprocessing steps – João Areias Nov 11 '22 at 09:31
  • Is there a way of making them share this information from a single source? – João Areias Nov 11 '22 at 09:32
  • I know it's stored in the _filepath attribute of the Dataset, but if I could get access to this attribute in the pipeline, that would be great – João Areias Nov 11 '22 at 09:59

This is probably not the most elegant solution, but it works for me, so I'll use it until I find a better one. The trick is to return the file path together with the object from my DataSet implementation. I doubt this would generalize to other datasets (SQL queries, for example), but since I know I'm dealing with a file here, it works fine. Here is my implementation:

from kedro.io import AbstractDataSet
from spacy.tokens import DocBin
from dataclasses import dataclass
from typing import Union
from pathlib import Path


@dataclass
class DocBinModel:
    filepath: Path
    docbin: DocBin


class SpacyDocBinDataSet(AbstractDataSet):
    def __init__(self, filepath, save_args=None, load_args=None):
        self._filepath = filepath
        self._save_args = save_args or {}
        self._load_args = load_args or {}

    def _describe(self):
        return dict(
            filepath=self._filepath,
            save_args=self._save_args,
            load_args=self._load_args,
        )

    def _load(self):
        # Load the DocBin and return it together with the path it came
        # from, so downstream nodes can reuse the location.
        with open(self._filepath, "rb") as f:
            docbin = DocBin().from_bytes(f.read())
        return DocBinModel(Path(self._filepath), docbin)

    def _save(self, data: Union[DocBin, DocBinModel]):
        # Accept either a bare DocBin or the wrapper dataclass.
        if isinstance(data, DocBinModel):
            data = data.docbin
        data.to_disk(self._filepath)

    def _exists(self):
        return Path(self._filepath).exists()
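
For completeness, here is roughly how this could be wired up; the module path and catalog entry names below are hypothetical:

# conf/base/catalog.yml (hypothetical names; assumes the class above
# lives in src/my_project/datasets.py):
#
# train_docbin:
#   type: my_project.datasets.SpacyDocBinDataSet
#   filepath: data/05_model_input/train.spacy
#
# dev_docbin:
#   type: my_project.datasets.SpacyDocBinDataSet
#   filepath: data/05_model_input/dev.spacy

import subprocess
import sys


def train_spacy_nlp_model(train: DocBinModel, dev: DocBinModel) -> None:
    # The loaded objects carry their own paths, so the node no longer
    # needs to ask the catalog where the files live. (config.cfg and the
    # output directory are hardcoded to keep the sketch short.)
    cmd = [
        sys.executable, "-m", "spacy", "train", "config.cfg",
        "--output", "./output",
        "--paths.train", str(train.filepath),
        "--paths.dev", str(dev.filepath),
    ]
    if subprocess.run(cmd).returncode != 0:
        raise RuntimeError("spaCy training failed")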

You can use _get_load_path():

catalog.datasets.mydataset._get_load_path()
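
For example, from a script or an interactive session (a sketch assuming a recent Kedro version; mydataset is a placeholder name, and since this is a private method it may change between releases):

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create() as session:
    catalog = session.load_context().catalog
    # Returns the resolved path the dataset will load from; available on
    # datasets that implement it, e.g. versioned file-based datasets.
    print(catalog.datasets.mydataset._get_load_path())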