
Here is a Spark GraphFrames DataFrame representing a directed graph; there may be some cycles in this graph. How can I detect the cycles in a GraphFrame?

For example, here is a graph

| src | dst |
| --- | --- |
| 1   | 2   |
| 2   | 3   |
| 3   | 4   |
| 3   | 1   |
| 4   | 1   |

the cycles in this graph should be {1,2,3} and {1,2,3,4}.

– mamonu

3 Answers


You could use the BFS algorithm to find cycles in your graph.
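For reference, this is roughly what a plain GraphFrames BFS call looks like (a sketch assuming `g` is a GraphFrame built from the question's graph; note that bfs only finds shortest paths between vertices matching two predicates, so by itself it does not return cycles, and using the same id for fromExpr and toExpr just matches a zero-length path):

# `g` is assumed to be a GraphFrame built from the question's edge list
paths = g.bfs(fromExpr="id = 1", toExpr="id = 4")
paths.show()  # one row per shortest path: from, e0, v1, ..., to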

– Hossein Torabi
  • Thank you for your answer. Can you give more details about how to use BFS to find cycles? I have tried the BFS method, but what are the "from_expr" and "to_expr"? I tried using the same id for both, meaning I want to find the paths from a node back to itself, i.e. a cycle, but the resulting DataFrame just has the from and to columns without the paths. How can I solve this? Thank you for your help. – guangjun Jul 06 '23 at 07:09

I know nothing about Spark GraphFrames.

In the hope you will find it useful, here is how to modify BFS to find cycles:

In the standard BFS algorithm, the code keeps track of which vertices have been previously visited. When searching the reachable neighbors of the current vertex, a previously visited vertex is skipped.

In BFS modified to find cycles, encountering a previously visited vertex may mean that a cycle exists.

To check for this, Dijkstra's algorithm is applied to find the shortest path from the current vertex, through the rest of the graph, and back to the previously visited vertex. If such a path exists, then it is a cycle.

Here is an illustration of the situation

[illustration: BFS reaching a previously visited vertex]
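Here is a minimal Python sketch of the idea (my own illustration, not the answerer's code: it uses a plain BFS for the back-path search since the example edges are unweighted, and it searches from the revisited vertex back to the current one so that the edge just followed closes a directed cycle; this naive form can report the same cycle more than once):

from collections import deque

def shortest_path(adj, src, dst):
    # plain BFS shortest path src -> dst over adjacency dict `adj`;
    # returns the vertex list of the path, or None if dst is unreachable
    prev = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in adj.get(v, []):
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None

def bfs_cycles(adj, start):
    # standard BFS from `start`; when an edge leads to an already-visited
    # vertex, look for a path from that vertex back to the current one --
    # if it exists, that path plus the edge just followed is a directed cycle
    visited = {start}
    queue = deque([start])
    cycles = []
    while queue:
        cur = queue.popleft()
        for nxt in adj.get(cur, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
            else:
                back = shortest_path(adj, nxt, cur)
                if back is not None:
                    cycles.append(back)  # cycle closed by the edge cur -> nxt
    return cycles

adj = {1: [2], 2: [3], 3: [4, 1], 4: [1]}
print(bfs_cycles(adj, 1))  # [[1, 2, 3], [1, 2, 3, 4]]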

This is the basic essence of the algorithm, but you need to handle some significant details:

  • The same cycle may be detected multiple times. Code is needed to check whether a newly found cycle is novel (see the sketch after this list)

  • Multigraphs (multiple edges between node pairs)

  • Graphs with more than one component

  • Undirected graphs. (Your question specifies directed graphs, so you can get away without handling undirected ones. However, another modification to the algorithm will also handle these)
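For the first point, one simple trick (my own sketch, not part of the answer above) is to reduce each cycle to a canonical rotation before checking whether it has been seen:

def canonical(cycle):
    # rotate the vertex list so the smallest vertex comes first;
    # two reports of the same directed cycle then compare equal
    i = cycle.index(min(cycle))
    return tuple(cycle[i:] + cycle[:i])

seen = set()
for cycle in [[2, 3, 1], [1, 2, 3], [1, 2, 3, 4]]:
    seen.add(canonical(cycle))
print(sorted(seen))  # [(1, 2, 3), (1, 2, 3, 4)]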

Maybe a link to C++ code implementing this will be helpful?

– ravenspoint

Although GraphFrames alone may not provide the requisite functionality for your task out of the box, combining it with NetworkX and a pandas UDF proves to be an effective solution. First, let's explore the capabilities of NetworkX on your example.

[plot of the example graph]

Johnson's algorithm, which underpins the simple_cycles function in NetworkX, has better time complexity for tasks of this nature than other algorithms based on modified DFS (source).

Code to find the cycles in NetworkX:

import pandas as pd
import networkx as nx

df_edges = pd.DataFrame({
    'src': [1, 2, 3, 3, 4],
    'dst': [2, 3, 4, 1, 1]
})
# Create a directed graph from the dataframe
G = nx.from_pandas_edgelist(df_edges, source='src', target='dst', create_using=nx.DiGraph())
# Find cycles
cycles = list(nx.simple_cycles(G))
print(cycles) #output: [[1, 2, 3], [1, 2, 3, 4]]

The NetworkX function simple_cycles evidently provides the desired functionality. However, considering the potential scalability concerns and the need to function within the Spark ecosystem, it is beneficial to seek a solution that operates in a parallelized manner. This is where the utility of PandasUDF (vectorized UDF) shines. To formulate a scalable and generalizable solution, our first step is to execute a connected components operation. GraphFrames conveniently provides this capability, as demonstrated below:

from graphframes import GraphFrame

# GraphFrame expects two Spark DataFrames: `vertices_df` with an `id` column
# and `edges_df` with `src`/`dst` columns; connectedComponents also needs a
# checkpoint directory (spark.sparkContext.setCheckpointDir(...))
g = GraphFrame(vertices_df, edges_df)
result = g.connectedComponents()

After obtaining the output from the connected components function, which is in the format [id, component], you can extend your original edge DataFrame with this component id. This results in a Spark DataFrame structured as [src, dst, component].
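A join along these lines attaches the component id to each edge (my sketch; `edges` is assumed to be the Spark [src, dst] edge DataFrame, and `result` is the connectedComponents output with `id` and `component` columns):

from pyspark.sql.functions import col

# src and dst always land in the same component, so joining on src suffices
edges_with_component = edges.join(
    result.select(col('id').alias('src'), 'component'),
    on='src', how='left'
)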

For the sake of brevity, I'll generate such a Spark DataFrame manually in the subsequent steps of the example. To illustrate the parallelization capabilities of the cycle finding function across distinct connected components, I'll also incorporate the edges of an additional subgraph into the edgelist.

Assuming this is the extended edge list:

df_edges = pd.DataFrame({
    'src': [1, 2, 3, 3, 4, 5, 6, 7],
    'dst': [2, 3, 4, 1, 1, 6, 7, 5],
    'component': [1, 1, 1, 1, 1, 2, 2, 2]
})

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Convert pandas example DataFrame to Spark DataFrame
# this is in place of the processed output 
# derived from both the original DataFrame 
# and the connected components output.
spark_df_edges = spark.createDataFrame(df_edges)

Here's a visualization of the expanded graph, composed of two distinct connected components:

[plot of the expanded graph with its two connected components]

This is how the expanded edge list, now integrated with the component id, appears when using .show():

+---+---+---------+
|src|dst|component|
+---+---+---------+
|  1|  2|        1|
|  2|  3|        1|
|  3|  4|        1|
|  3|  1|        1|
|  4|  1|        1|
|  5|  6|        2|
|  6|  7|        2|
|  7|  5|        2|
+---+---+---------+

Next, we define a pandas UDF that can be applied to each group of connected components. In addition to finding cycles, this function returns useful information such as the count of cycles found and a JSON list of the cycles themselves, on a per-component basis:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import json

schema = StructType([
    StructField('component', IntegerType()),
    StructField('no_of_cycles', IntegerType()),
    StructField('cyclelist', StringType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def find_cycles(pdf):
    # build a directed graph from this component's edges
    G = nx.from_pandas_edgelist(pdf, source='src', target='dst', create_using=nx.DiGraph())
    cycles = list(nx.simple_cycles(G))
    cyclelist = json.dumps(cycles)
    num_cycles = len(cycles)
    # one output row per component: its id, the cycle count, and the cycles as JSON
    return pd.DataFrame({'component': [pdf['component'].iloc[0]],
                         'no_of_cycles': [num_cycles],
                         'cyclelist': [cyclelist]})

With the Pandas UDF now defined, we proceed to apply this function to each individual connected component like this:

cycles = spark_df_edges.groupby('component').apply(find_cycles)
cycles.show(truncate=False)
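As a side note, the GROUPED_MAP pandas UDF style is deprecated on Spark 3.x; the equivalent call there (assuming the same find_cycles body without the decorator) would be:

cycles = spark_df_edges.groupby('component').applyInPandas(find_cycles, schema=schema)
cycles.show(truncate=False)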

The resulting cycles DataFrame looks like this:

+---------+------------+-------------------------+
|component|no_of_cycles|cyclelist                |
+---------+------------+-------------------------+
|1        |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|2        |1           |[[5, 6, 7]]              |
+---------+------------+-------------------------+

Finally, we can join the two DataFrames:

from pyspark.sql.functions import broadcast
joined_df = spark_df_edges.join(broadcast(cycles), on='component', how='inner')
joined_df.show(truncate=False)

The result is:

+---------+---+---+------------+-------------------------+
|component|src|dst|no_of_cycles|cyclelist                |
+---------+---+---+------------+-------------------------+
|1        |1  |2  |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|1        |2  |3  |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|1        |3  |4  |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|1        |3  |1  |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|1        |4  |1  |2           |[[1, 2, 3], [1, 2, 3, 4]]|
|2        |5  |6  |1           |[[5, 6, 7]]              |
|2        |6  |7  |1           |[[5, 6, 7]]              |
|2        |7  |5  |1           |[[5, 6, 7]]              |
+---------+---+---+------------+-------------------------+

Note that we can use broadcast here because the number of rows in the cycles DataFrame equals the number of connected components, which is usually much smaller than the number of rows in an edge list. The broadcast function tells Spark to send the smaller DataFrame to all worker nodes, which can speed up the join when one DataFrame is much smaller than the other.

– mamonu