Although GraphFrames alone may not provide the required functionality for your task out of the box, combining it with NetworkX and a pandas UDF makes for an effective solution. First, let's look at the relevant NetworkX capabilities using your example.
Plot of the example graph:

Johnson's algorithm, which underpins the simple_cycles function in NetworkX, has better time complexity for this kind of task than other algorithms based on modified DFS (source).
Code to find the cycles in NetworkX:
import pandas as pd
import networkx as nx
df_edges = pd.DataFrame({
    'src': [1, 2, 3, 3, 4],
    'dst': [2, 3, 4, 1, 1]
})

# Create a directed graph from the dataframe
G = nx.from_pandas_edgelist(df_edges, source='src', target='dst', create_using=nx.DiGraph())

# Find cycles
cycles = list(nx.simple_cycles(G))
print(cycles)  # output: [[1, 2, 3], [1, 2, 3, 4]]
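(As an aside, a plot like the one above can be reproduced with NetworkX's drawing helpers and matplotlib; this is only an illustrative sketch and not part of the cycle-finding logic.)
# Optional: draw the example graph (requires matplotlib)
import matplotlib.pyplot as plt
nx.draw_networkx(G, with_labels=True, node_color='lightblue', arrows=True)
plt.show()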
The NetworkX function simple_cycles clearly provides the desired functionality. However, given the potential scalability concerns and the need to work within the Spark ecosystem, it is worth building a solution that runs in a parallelized manner. This is where a pandas UDF (vectorized UDF) comes in. The first step towards a scalable, generalizable solution is a connected components operation: every cycle lies entirely within a single connected component, so each component can be processed independently and in parallel. GraphFrames conveniently provides this capability, as demonstrated below:
from graphframes import GraphFrame

# GraphFrame expects Spark DataFrames: vertices (with an id column) and edges (with src and dst columns)
g = GraphFrame(vertices, edges)
result = g.connectedComponents()
The connected components function returns a DataFrame with columns [id, component]; you can join this component id back onto your original edge DataFrame, which gives a Spark DataFrame structured as [src, dst, component].
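Here is a minimal sketch of that step, assuming a SparkSession named spark and a Spark edge DataFrame named spark_edges with src and dst columns (these names are illustrative):
# Illustrative sketch: build the vertex DataFrame, run connected components,
# and attach the component id to every edge.
spark.sparkContext.setCheckpointDir('/tmp/graphframes_cps')  # required by connectedComponents

vertices = (spark_edges.selectExpr('src as id')
            .union(spark_edges.selectExpr('dst as id'))
            .distinct())
g = GraphFrame(vertices, spark_edges)
components = g.connectedComponents()  # columns: id, component

# Both endpoints of an edge belong to the same component, so joining on src is enough
edges_with_component = spark_edges.join(
    components.withColumnRenamed('id', 'src'), on='src')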
For the sake of brevity, I'll generate such a Spark DataFrame manually in the subsequent steps of the example. To illustrate the parallelization capabilities of the cycle finding function across distinct connected components, I'll also incorporate the edges of an additional subgraph into the edgelist.
Assuming this is the extended edgelist:
df_edges = pd.DataFrame({
    'src': [1, 2, 3, 3, 4, 5, 6, 7],
    'dst': [2, 3, 4, 1, 1, 6, 7, 5],
    'component': [1, 1, 1, 1, 1, 2, 2, 2]
})
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert the pandas example DataFrame to a Spark DataFrame;
# this stands in for the [src, dst, component] DataFrame you would get
# by joining the original edges with the connected components output.
spark_df_edges = spark.createDataFrame(df_edges)
Here's a visualization of the expanded graph, composed of two distinct connected components:

This is how the expanded edgelist, now combined with the component id, looks with .show():
+---+---+---------+
|src|dst|component|
+---+---+---------+
| 1| 2| 1|
| 2| 3| 1|
| 3| 4| 1|
| 3| 1| 1|
| 4| 1| 1|
| 5| 6| 2|
| 6| 7| 2|
| 7| 5| 2|
+---+---+---------+
Next, we define a pandas UDF that can be applied to each connected component group. In addition to finding cycles, this function returns useful per-component information such as the number of cycles found and a list of the nodes making up each cycle:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import json

schema = StructType([
    StructField('component', IntegerType()),
    StructField('no_of_cycles', IntegerType()),
    StructField('cyclelist', StringType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def find_cycles(pdf):
    # pdf holds the edges of one component; build a directed graph and enumerate its simple cycles
    G = nx.from_pandas_edgelist(pdf, source='src', target='dst', create_using=nx.DiGraph())
    cycles = list(nx.simple_cycles(G))
    cyclelist = json.dumps(cycles)
    num_cycles = len(cycles)
    return pd.DataFrame({'component': [pdf['component'].iloc[0]],
                         'no_of_cycles': [num_cycles],
                         'cyclelist': [cyclelist]})
With the pandas UDF defined, we apply it to each connected component:
cycles = spark_df_edges.groupby('component').apply(find_cycles)
cycles.show(truncate=False)
The resulting cycles DataFrame looks like this:
+---------+------------+-------------------------+
|component|no_of_cycles|cyclelist |
+---------+------------+-------------------------+
|1 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|2 |1 |[[5, 6, 7]] |
+---------+------------+-------------------------+
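As a side note, the GROUPED_MAP pandas_udf style used above is deprecated in Spark 3.x; the same grouped computation can be expressed with applyInPandas. Here is a minimal sketch of the same logic as a plain function (the name find_cycles_plain is illustrative):
# Spark 3.x style: a plain function plus groupby(...).applyInPandas(...)
def find_cycles_plain(pdf):
    G = nx.from_pandas_edgelist(pdf, source='src', target='dst', create_using=nx.DiGraph())
    cycles = list(nx.simple_cycles(G))
    return pd.DataFrame({'component': [pdf['component'].iloc[0]],
                         'no_of_cycles': [len(cycles)],
                         'cyclelist': [json.dumps(cycles)]})

cycles = spark_df_edges.groupby('component').applyInPandas(
    find_cycles_plain, schema='component int, no_of_cycles int, cyclelist string')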
Finally, we can join the two DataFrames:
from pyspark.sql.functions import broadcast
joined_df = spark_df_edges.join(broadcast(cycles), on='component', how='inner')
joined_df.show(truncate=False)
The result is:
+---------+---+---+------------+-------------------------+
|component|src|dst|no_of_cycles|cyclelist |
+---------+---+---+------------+-------------------------+
|1 |1 |2 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|1 |2 |3 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|1 |3 |4 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|1 |3 |1 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|1 |4 |1 |2 |[[1, 2, 3], [1, 2, 3, 4]]|
|2 |5 |6 |1 |[[5, 6, 7]] |
|2 |6 |7 |1 |[[5, 6, 7]] |
|2 |7 |5 |1 |[[5, 6, 7]] |
+---------+---+---+------------+-------------------------+
Note that we can use broadcast here because the number of rows in the cycles DataFrame equals the number of connected components, which is usually much smaller than the number of rows in the edgelist. The broadcast function tells Spark to send the smaller DataFrame to every worker node, which can speed up the join when one DataFrame is much smaller than the other.
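The cyclelist column is kept as a JSON string because the UDF's return schema needs a fixed column type; if you later want it as a proper nested array column, it can be parsed back with from_json. A small sketch, assuming the joined_df from above:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType

joined_df = joined_df.withColumn(
    'cycles_array', from_json('cyclelist', ArrayType(ArrayType(IntegerType()))))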