I'm working through the Dagster tutorial and got stuck at the Multiple and Conditional Outputs step.

In the solid definition, the tutorial asks you to declare (among other things):

output_defs=[
    OutputDefinition(
        name="hot_cereals", dagster_type=DataFrame, is_required=False
    ),
    OutputDefinition(
        name="cold_cereals", dagster_type=DataFrame, is_required=False
    ),
],

But there is no information about where DataFrame comes from. First I tried pandas.DataFrame, but I got the error: {dagster_type} is not a valid dagster type. It happens when I try to submit it via $ dagit -f multiple_outputs.py. Then I installed dagster_pyspark and tried dagster_pyspark.DataFrame. This time I managed to submit the DAG to the UI. However, when I run it from the UI, I get the following error:

dagster.core.errors.DagsterTypeCheckDidNotPass: Type check failed for step output hot_cereals of type PySparkDataFrame.
  File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_plan.py", line 210, in _dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 273, in core_dagster_event_sequence_for_step
    for evt in _create_step_events_for_output(step_context, user_event):
  File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 298, in _create_step_events_for_output
    for output_event in _type_checked_step_output_event_sequence(step_context, output):
  File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 221, in _type_checked_step_output_event_sequence
    dagster_type=step_output.dagster_type,

Does anyone know how to fix it? Thanks for the help!

Bruno Ambrozio

3 Answers

As Arthur pointed out, the full tutorial code is available on Dagster's github.

However, you do not need dagster_pandas. The key lines missing from your code are:

import typing

from dagster import PythonObjectDagsterType

if typing.TYPE_CHECKING:
    DataFrame = list
else:
    DataFrame = PythonObjectDagsterType(list, name="DataFrame")  # type: Any

The reason for this structure is MyPy compliance: under static type checking, DataFrame is just list, while at run time it is a Dagster type wrapping list. See the Types & Expectations section of the tutorial.

See also the documentation on Dagster types.
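
To see why the attempt with dagster_pyspark.DataFrame loaded fine but failed at run time, it may help to think of a Python-object type as little more than an isinstance check on whatever the solid yields. The sketch below is plain Python, not Dagster's implementation; InstanceCheckType and FakeSparkDataFrame are hypothetical names used only for illustration.

```python
class InstanceCheckType:
    """Hypothetical stand-in for a type that validates values with isinstance."""

    def __init__(self, python_type, name):
        self.python_type = python_type
        self.name = name

    def type_check(self, value):
        # Passes only when the value is an instance of the wrapped Python type.
        return isinstance(value, self.python_type)


class FakeSparkDataFrame:
    """Hypothetical placeholder for a Spark DataFrame class."""


# The tutorial's solids yield plain lists of dict rows.
hot_cereals = [{"name": "Maypo", "type": "H", "calories": 100}]

list_backed = InstanceCheckType(list, name="DataFrame")
spark_backed = InstanceCheckType(FakeSparkDataFrame, name="PySparkDataFrame")

list_check = list_backed.type_check(hot_cereals)    # a list satisfies a list-backed type
spark_check = spark_backed.type_check(hot_cereals)  # a list is not a Spark DataFrame
```

Under that assumption, a solid that yields a list can only pass a type check whose underlying Python type is list (or no declared type at all), which is what the tutorial's DataFrame alias provides.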

turnerm

I was stuck here, too, but luckily I found the updated source code. They have updated the docs so that the OutputDefinition is defined beforehand.

Update the part of your code that comes before the sorting solids and the pipeline definition, as below:

import csv
import os

from dagster import (
    Bool,
    Field,
    Output,
    OutputDefinition,
    execute_pipeline,
    pipeline,
    solid,
)


@solid
def read_csv(context, csv_path):
    lines = []
    csv_path = os.path.join(os.path.dirname(__file__), csv_path)
    with open(csv_path, "r") as fd:
        for row in csv.DictReader(fd):
            row["calories"] = int(row["calories"])
            lines.append(row)

    context.log.info("Read {n_lines} lines".format(n_lines=len(lines)))
    return lines


@solid(
    config_schema={
        "process_hot": Field(Bool, is_required=False, default_value=True),
        "process_cold": Field(Bool, is_required=False, default_value=True),
    },
    output_defs=[
        OutputDefinition(name="hot_cereals", is_required=False),
        OutputDefinition(name="cold_cereals", is_required=False),
    ],
)
def split_cereals(context, cereals):
    if context.solid_config["process_hot"]:
        hot_cereals = [cereal for cereal in cereals if cereal["type"] == "H"]
        yield Output(hot_cereals, "hot_cereals")
    if context.solid_config["process_cold"]:
        cold_cereals = [cereal for cereal in cereals if cereal["type"] == "C"]
        yield Output(cold_cereals, "cold_cereals")

You can also find the complete code here.
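
As a rough mental model of what is_required=False allows, the generator below yields only the outputs whose flag is on, and the consumer sees only the names that were actually produced. This is plain Python under simplified assumptions, not Dagster's execution engine; split_rows and run_downstream are hypothetical helpers, and the sample rows are made up.

```python
def split_rows(rows, process_hot=True, process_cold=True):
    """Yield (output_name, rows) pairs only for the enabled branches."""
    if process_hot:
        yield "hot_cereals", [r for r in rows if r["type"] == "H"]
    if process_cold:
        yield "cold_cereals", [r for r in rows if r["type"] == "C"]


def run_downstream(outputs):
    """Count rows per output name; branches never yielded are simply absent."""
    return {name: len(rows) for name, rows in outputs}


# Made-up sample data: one hot cereal and two cold ones.
rows = [
    {"name": "A", "type": "H"},
    {"name": "B", "type": "C"},
    {"name": "C", "type": "C"},
]

both = run_downstream(split_rows(rows))
hot_only = run_downstream(split_rows(rows, process_cold=False))
```

With process_cold switched off, only the hot_cereals branch appears in the result, which mirrors how an optional output that is never yielded leaves its downstream work undone.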

First, try installing the dagster pandas integration:

pip install dagster_pandas

Then do:

from dagster_pandas import DataFrame

You can find the code from the tutorial here.

0x26res