I'm training an NLP model using spaCy. The preprocessing steps are already written as a pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:
import subprocess
import sys

def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
):
    # Pass the arguments as a list (no shell) so paths containing spaces
    # are not split; sys.executable is the interpreter running this node.
    cmd = [
        sys.executable, "-m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")
But I have no idea how to retrieve the file path information from the items in my data catalog. Is there a way to pass this information to my nodes when creating the pipeline?
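For reference, here is a sketch of one possible workaround, assuming the project uses Kedro (the question does not name the framework). The paths are supplied through `params:` entries instead of the catalog, so they would have to be duplicated in the parameters file. The parameter names and the `build_train_command` helper are hypothetical, introduced only for illustration:

```python
import subprocess
import sys


def build_train_command(config_filepath, train_filepath, dev_filepath, output_dir):
    """Hypothetical helper: build the spaCy CLI call as an argument list."""
    return [
        sys.executable, "-m", "spacy", "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]


def train_spacy_nlp_model(config_filepath, train_filepath, dev_filepath, output_dir):
    """Node function: run the training and fail loudly on a non-zero exit."""
    cmd = build_train_command(config_filepath, train_filepath, dev_filepath, output_dir)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")


def create_pipeline():
    # Assumption: Kedro is installed. Imported here so the rest of the
    # sketch can be loaded without it.
    from kedro.pipeline import Pipeline, node

    return Pipeline(
        [
            node(
                func=train_spacy_nlp_model,
                # Hypothetical parameter names; each would need a matching
                # entry in the project's parameters file.
                inputs={
                    "config_filepath": "params:spacy_config_filepath",
                    "train_filepath": "params:spacy_train_filepath",
                    "dev_filepath": "params:spacy_dev_filepath",
                    "output_dir": "params:spacy_output_dir",
                },
                outputs=None,
                name="train_spacy_nlp_model",
            )
        ]
    )
```

This works around the problem without solving it: the same paths end up in both the catalog and the parameters, which is what I would like to avoid.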