I'm trying to use networkx's find_cliques to locate the maximal cliques in each subgroup.
My implementation is a pandas_udf applied per connected component (the data is grouped by component):
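
    import networkx as nx
    import pandas as pd
    from networkx.algorithms.clique import find_cliques

    def pd_create_subgroups(pdf):
        # Each group holds a single connected component, so take its id.
        index = pdf.component.unique()[0]
        try:
            # building the graph
            gnx = nx.from_pandas_edgelist(pdf, "src", "dst")
            bic = list(find_cliques(gnx))
            if len(bic) <= 1:
                return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
            # Sort for deterministic output, then keep cliques of 3+ nodes.
            bic_sorted = sorted(map(sorted, bic))
            bic_sorted = [b for b in bic_sorted if len(b) >= 3]
            if len(bic_sorted) == 0:
                return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
            return pd.DataFrame([bic_sorted]).transpose().rename(columns={0: "cliques"})
        except Exception:
            # Any failure is reported as a placeholder row for this component.
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
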
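To illustrate what the sort-and-filter step in the function keeps, here is a minimal sketch on a toy graph. The four-node edge list and the variable names are made up for illustration and are not from my data.

```python
import networkx as nx

# Hypothetical toy component: a triangle a-b-c plus a pendant edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_graph = nx.Graph(toy_edges)

# find_cliques yields every maximal clique: here {a, b, c} and {c, d}.
toy_cliques = sorted(map(sorted, nx.find_cliques(toy_graph)))

# Same filter as in the UDF: keep only cliques with at least 3 nodes.
kept = [c for c in toy_cliques if len(c) >= 3]
print(kept)
```

So for this toy component only the triangle survives, and the two-node clique is dropped.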
pdf is a pandas DataFrame with the fields src, dst, and component. It has around 200M-300M undirected edges, and the job fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 331) (executor 9): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
When running on smaller graphs it works properly.
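Looking at the numbers in the exception, the failed access seems to end just past 2^31 bytes, which makes me suspect a single buffer hitting a 2 GiB limit rather than a bug in the clique logic. This is only my reading of the message. The sketch below uses the values copied from the error, and the variable names are mine.

```python
# Values copied from the IndexOutOfBoundsException message.
index = 2147483628
length = 36
upper_bound = 2147483648  # upper end of "range(0, 2147483648)"

# The access covers [index, index + length), so compute where it would end.
end_of_access = index + length

print(upper_bound == 2**31)         # is the bound exactly 2 GiB?
print(end_of_access - upper_bound)  # how many bytes past the bound
```

If that reading is right, the access overshoots the bound by only a few bytes, which would fit a buffer that grew until it reached the limit.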