I have several thousand files of different types to process, and I am creating catalog entries dynamically with hooks. I first used the after_catalog_created hook, but it runs too early, and I need those entries only for specific nodes. My current attempt uses before_node_run, filtered by node tags, and returns a dictionary containing just the dynamically created entries. The node function accepts **kwargs only. This partly works, since I can see that the node receives the updated inputs, but the problem is that the node specification must reference a catalog entry that already exists. So I created a fake placeholder entry, and I use it to build an inputs dictionary with the same length as the dictionary returned by the hook.
Pipeline code
for doc in docs["Type1_documents"]:
    item = doc["name"]
    item_name, _ = os.path.splitext(item)
    type1_datasets_dict[item_name] = "brace_dictionary"
return Pipeline(
    [
        node(
            func=func1,
            inputs=type1_datasets_dict,
            outputs=[
                "output1",
                "output2",
            ],
            name="type1_eta",
            tags=["dynamic-catalog", "type1", "data-engineering"],
        )
    ]
)
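To make the shape of that inputs mapping concrete, here is a small plain-Python sketch with no Kedro involved. The contents of `docs` are made-up sample values and `declared_entries` is a name introduced only for illustration; the `brace_dictionary` placeholder is the one from the code above.

```python
import os

# Hypothetical sample of the docs parameter; the real file names differ.
docs = {"Type1_documents": [{"name": "file1.txt"}, {"name": "file2.txt"}]}

type1_datasets_dict = {}
for doc in docs["Type1_documents"]:
    item_name, _ = os.path.splitext(doc["name"])
    # Every keyword argument points at the same placeholder catalog entry.
    type1_datasets_dict[item_name] = "brace_dictionary"

# The keys become the keyword argument names of the node function,
# while the values are the catalog entries the node declares as inputs.
declared_entries = set(type1_datasets_dict.values())
print(type1_datasets_dict)
print(declared_entries)
```

However many files there are, the mapping refers to exactly one distinct catalog entry, which is consistent with the `expected 1 input(s) ['brace_dictionary']` wording in the error I get.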
Hook code
@hook_impl
def before_node_run(
    self, node: Node, catalog: DataCatalog
) -> Optional[Dict[str, Any]]:
    self.node = node
    self.catalog = catalog
    # Only nodes tagged "dynamic-catalog" get dynamically created inputs.
    if "dynamic-catalog" in node.tags:
        input_catalog_name = node.name
        catalog_string = f"params:{input_catalog_name}.full_name"
        if self.catalog.exists(catalog_string):
            true_datasets_dict = {}
            catalog_properties = self.catalog.load(f"params:{input_catalog_name}")
            catalog_name = catalog_properties["full_name"]
            # Renamed from `type` to avoid shadowing the builtin.
            doc_type = catalog_properties["type"]
            subtype = catalog_properties["subtype"]
            datasets_dict = self.catalog.load(f"params:{catalog_name}")
            for dataset in datasets_dict:
                doc_name, _ = os.path.splitext(dataset["name"])
                self.add_text_dataset(
                    name=doc_name,
                    folder=f"parsed/{doc_type}/{subtype}",
                )
                true_datasets_dict[doc_name] = doc_name
            return true_datasets_dict
    # Leave the inputs of every other node untouched.
    return None
But this raises a ValueError:
line 487, in _run_with_dict
raise ValueError(
ValueError: Node type1_eta: func1([brace_dictionary,brace_dictionary,brace_dictionary,..,brace_dictionary]) -> [output1, output2] expected 1 input(s) ['brace_dictionary'], but got the following 1497 input(s) instead: ['file1', 'file2', ...].
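As far as I can tell from the message, the node compares the set of catalog entry names it declares with the keys of the inputs it actually receives. Below is a simplified plain-Python model of that comparison. It is not Kedro's actual code: `check_inputs` and the sample values are hypothetical, written only to reproduce the mismatch.

```python
def check_inputs(node_inputs: dict, run_inputs: dict) -> dict:
    """Simplified model: declared entry names must equal the received keys."""
    expected = set(node_inputs.values())
    if expected != set(run_inputs):
        raise ValueError(
            f"expected {len(expected)} input(s) {sorted(expected)}, "
            f"but got {len(run_inputs)} input(s) instead: {sorted(run_inputs)}"
        )
    # Resolve each keyword argument to the value of its catalog entry.
    return {arg: run_inputs[entry] for arg, entry in node_inputs.items()}


# Node specification: every keyword argument aliases the placeholder entry.
node_inputs = {"file1": "brace_dictionary", "file2": "brace_dictionary"}

# What my hook returns: keys are the new dataset names, not the placeholder.
hook_result = {"file1": "file1", "file2": "file2"}

try:
    check_inputs(node_inputs, hook_result)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# Keys that match the declared entry name pass the same comparison.
resolved = check_inputs(node_inputs, {"brace_dictionary": hook_result})
print(resolved)
```

In this model the mismatch is between the keys my hook returns and the single declared name. Keying the returned dictionary by the placeholder passes the comparison, but then every keyword argument resolves to the same object, which is not what I want either.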
Is there another way to do this conditionally?