As background: I'm a Python developer using GraphFrames and PySpark through Databricks. I've been using GraphFrames to deduplicate records in a record-linkage context. Below is pseudo-code depicting the scenario I've come across:
```python
from graphframes import GraphFrame

...
# The "graphframes" algorithm requires a Spark checkpoint directory
# (example path; any writable location works)
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
# Define our GraphFrame from vertex and edge DataFrames
outputGraphframe = GraphFrame(vertices, edges)
# Get a PySpark DataFrame with connected components using the graphx algorithm
dfGraphX = outputGraphframe.connectedComponents(algorithm="graphx")
# Get a PySpark DataFrame with connected components using the graphframes algorithm
dfGraphframes = outputGraphframe.connectedComponents(algorithm="graphframes")
```
The connected components in `dfGraphX` and `dfGraphframes` can look wildly different.
For one instance of ~20,000 vertices and ~400,000 edges, the "graphframes" algorithm returned an "empty graph": every component consisted of exactly one record. On the same input, the "graphx" algorithm was far from an "empty graph", grouping as many as 11 records into a single component. Under manual inspection, the "graphx" result was correct and the "graphframes" result was not.
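For reference, a quick way to quantify the difference is to compare the component-size distributions of the two outputs. A minimal sketch, assuming the `dfGraphX` and `dfGraphframes` DataFrames from the snippet above and the `component` column that `connectedComponents()` adds:

```python
from pyspark.sql import functions as F

# Compare the number of components and the largest component in each result
for name, df in [("graphx", dfGraphX), ("graphframes", dfGraphframes)]:
    sizes = df.groupBy("component").count()
    print(name,
          "| components:", sizes.count(),
          "| largest component:", sizes.agg(F.max("count")).first()[0])
```

In the "empty graph" case described above, the "graphframes" line reports as many components as there are vertices, with a largest component of 1.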
When I tried to research how these two algorithms differ, I quickly hit a dead end. Does anyone know:
- How the "graphx" and "graphframes" algorithms for the GraphFrames `connectedComponents()` function differ.
- The best use cases for the "graphx" and "graphframes" algorithms.
- Why the "graphframes" algorithm would return an "empty graph" when the vertex/edge arguments directly imply that some records belong to the same component (see the sanity-check sketch below).
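On the last point, one thing worth ruling out is malformed input, e.g. edges whose endpoints match no vertex `id` (which, as far as I can tell, contribute nothing to the result). A minimal sketch, assuming only the standard GraphFrames schema: an `id` column on `vertices` and `src`/`dst` columns on `edges`:

```python
# Basic input sanity checks before comparing the two algorithms
print("vertices:", vertices.count(), "edges:", edges.count())

# Edges pointing at ids missing from `vertices` add no connectivity;
# a large count here might explain single-record components
ids = vertices.select("id")
dangling_src = edges.join(ids.withColumnRenamed("id", "src"), "src", "left_anti").count()
dangling_dst = edges.join(ids.withColumnRenamed("id", "dst"), "dst", "left_anti").count()
print("dangling src:", dangling_src, "| dangling dst:", dangling_dst)
```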