
I have several thousand files of different types to process. I am creating catalog entries dynamically with hooks. I first tried the `after_catalog_created` hook, but it fires too early, and I need those entries only for specific nodes. My current attempt uses `before_node_run` for nodes with a specific tag, returning a dictionary with just the dynamically created entries. The node function takes `**kwargs` only. This partly works: I can see the node receives the updated inputs. The problem is that the node specification still needs an already existing catalog entry, so I declare a fake one (`brace_dictionary`) and use it to build a dictionary of the same length as the dictionary returned by the hook.

Pipeline code

    type1_datasets_dict = {}
    for doc in docs["Type1_documents"]:
        item = doc["name"]
        item_name, _ = os.path.splitext(item)
        # Map every document to the same fake catalog entry so the node
        # specification can be declared against an existing dataset.
        type1_datasets_dict[item_name] = "brace_dictionary"

    return Pipeline(
        [
            node(
                func=func1,
                inputs=type1_datasets_dict,
                outputs=["output1", "output2"],
                name="type1_eta",
                tags=["dynamic-catalog", "type1", "data-engineering"],
            )
        ]
    )

Hook code

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog
    ) -> Optional[Dict[str, Any]]:
        self.node = node
        self.catalog = catalog
        if "dynamic-catalog" in node.tags:
            input_catalog_name = node.name
            catalog_string = f"params:{input_catalog_name}.full_name"
            if self.catalog.exists(catalog_string):
                true_datasets_dict = {}
                catalog_properties = self.catalog.load(f"params:{input_catalog_name}")
                catalog_name = catalog_properties["full_name"]
                doc_type = catalog_properties["type"]  # renamed: "type" shadows the builtin
                subtype = catalog_properties["subtype"]

                datasets_dict = self.catalog.load(f"params:{catalog_name}")

                for dataset in datasets_dict:
                    doc_name, _ = os.path.splitext(dataset["name"])
                    self.add_text_dataset(
                        name=doc_name,
                        folder=f"parsed/{doc_type}/{subtype}",
                    )
                    true_datasets_dict[doc_name] = doc_name

                return true_datasets_dict
        # No dynamic entries to inject; None leaves the node's inputs unchanged.
        return None
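
For context, `add_text_dataset` is my own small helper, not a Kedro API; it registers a text dataset in the catalog under the document's base name. A simplified, library-free sketch of the idea (the `FakeCatalog` stand-in and paths are illustrative, not my real code):

```python
import os

class FakeCatalog:
    """Minimal stand-in for kedro.io.DataCatalog, just enough for this sketch."""
    def __init__(self):
        self._datasets = {}

    def add(self, name, dataset):
        self._datasets[name] = dataset

    def exists(self, name):
        return name in self._datasets

def add_text_dataset(catalog, name, folder):
    # Register a filepath-backed text entry under the derived document name.
    catalog.add(name, {"type": "text.TextDataset",
                       "filepath": os.path.join(folder, f"{name}.txt")})

catalog = FakeCatalog()
add_text_dataset(catalog, "file1", "parsed/Type1/subA")
print(catalog.exists("file1"))  # True
```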

But I am getting a ValueError for this:

 line 487, in _run_with_dict
    raise ValueError(
ValueError: Node type1_eta: func1([brace_dictionary,brace_dictionary,brace_dictionary,..,brace_dictionary]) -> [output1, output2] expected 1 input(s) ['brace_dictionary'], but got the following 1497 input(s) instead: ['file1', 'file2', ...].
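
As I read the traceback, the check that fails compares the node's declared input set (after deduplication, just the one fake entry) against the keys the hook returns. A minimal reproduction of that mismatch outside Kedro (two files instead of 1497; names illustrative):

```python
# The node spec maps every document to the same fake dataset, so after
# deduplication the node declares exactly one input ...
inputs_spec = {"file1": "brace_dictionary", "file2": "brace_dictionary"}
declared_inputs = set(inputs_spec.values())  # {"brace_dictionary"}

# ... while the hook returns one real catalog entry per document.
hook_inputs = {"file1": "file1", "file2": "file2"}

# Roughly the condition Kedro's _run_with_dict trips over:
mismatch = set(hook_inputs) != declared_inputs
print(mismatch)  # True -> the ValueError above
```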

Is there another way to do this conditionally?
