NetworkX find_cliques error using PySpark

Question

I'm trying to use NetworkX's find_cliques to locate the maximal cliques in each subgroup.

I'm running the following implementation as a pandas_udf, grouped by connected component:

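import networkx as nx
import pandas as pd

def pd_create_subgroups(pdf):

    # every row in the group belongs to the same connected component
    index = pdf.component.unique()[0]

    try:
        # building the graph
        gnx = nx.from_pandas_edgelist(pdf, "src", "dst")

        # enumerate all maximal cliques of the component
        bic = list(nx.find_cliques(gnx))

        if len(bic) <= 1:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})

        # sort for deterministic output, then keep cliques with at least 3 nodes
        bic_sorted = sorted(map(sorted, bic))
        bic_sorted = [b for b in bic_sorted if len(b) >= 3]

        if len(bic_sorted) == 0:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})

        # one row per clique
        return pd.DataFrame([bic_sorted]).transpose().rename(columns={0: "cliques"})
    except Exception:
        # any failure marks the whole component as an issue
        return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})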

pdf is a pandas DataFrame containing the fields src, dst, and component. It has around 200M-300M undirected edges.
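For reference, the clique logic can be exercised locally, without Spark, on a toy edge list. This is a minimal sketch, assuming a small made-up component; the node names and edges are hypothetical and not taken from the real data.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy component: a triangle a-b-c with a tail c-d-e.
pdf = pd.DataFrame({
    "src": ["a", "a", "b", "c", "d"],
    "dst": ["b", "c", "c", "d", "e"],
    "component": [0, 0, 0, 0, 0],
})

# Same steps as in the UDF: build the graph, list maximal cliques,
# sort them, and keep only those with at least 3 nodes.
gnx = nx.from_pandas_edgelist(pdf, "src", "dst")
cliques = sorted(sorted(c) for c in nx.find_cliques(gnx))
large = [c for c in cliques if len(c) >= 3]

print(cliques)
print(large)
```

Here the only maximal clique with three or more nodes is the triangle; the two tail edges are maximal cliques of size 2 and are filtered out.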

It fails with the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 331) (executor 9): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
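The numbers in the exception can be checked directly. The sketch below uses only the values from the message: a write of 36 bytes starting at the reported index would end past the upper bound of the allowed range, and that bound equals 2**31. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the data for a single group exceeded a 2 GiB buffer limit while being passed to the Python worker.

```python
# Values copied from the exception message above.
index = 2147483628
length = 36
limit = 2147483648  # upper bound of the reported range

end = index + length          # where the write would finish
overflow = end - limit        # how far past the bound it lands
is_two_gib = limit == 2 ** 31

print(end, overflow, is_two_gib)
```

If that reading is right, it would be consistent with the job succeeding on smaller graphs, where no single component comes close to that size.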

When running on smaller graphs it works properly.
