I'm using the Spark GraphFrames library to create an identity resolution system. I have been able to use Spark to find matches. My plan was to use a graph to find transitive links between people and assign a single id to each group for further analysis.

I used the following data (from the public febrl database):

vertex data sample:

+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number|          address_1|           address_2|          suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|  michaela| neumann|            8|     stanley street|               miami|   winston hills|    4223|  nsw|     19151111|   5304218|  0| mneu|
|  courtney| painter|           12|  pinkerton circuit|          bega flats|       richlands|    4560|  vic|     19161214|   4066625|  1| cpai|
|   charles|   green|           38|salkauskas crescent|                kela|           dapto|    4566|  nsw|     19480930|   4365168|  2| cgre|
|   vanessa|    parr|          905|     macquoid place|   broadbridge manor|   south grafton|    2135|   sa|     19951119|   9239102|  3| vpar|
|   mikayla|malloney|           37|      randwick road|             avalind|hoppers crossing|    4552|  vic|     19860208|   7207688|  4| mmal|
|     blake|   howie|            1|     cutlack street|belmont park belt...|        budgewoi|    6017|  vic|     19250301|   5180548|  5| bhow|
| blakeston| broadby|           53|     traeger street|   valley of springs|      north ward|    3083|  qld|     19120907|   4308555|  7| bbro|
|    edward| denholm|           10|        corin place|           gold tyne|       clayfield|    4221|  vic|     19660306|   7119771|  9| eden|
|   charlie|alderson|          266|hawkesbury crescent|deergarden caravn...|           cooma|    4128|  vic|     19440908|   1256748| 10| cald|
|     molly|   roche|           59|willoughby crescent|        donna valley|         carrara|    4825|  nsw|     19200712|   1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+

Edge data sample:

+---+-----+-----+
|src|  dst|match|
+---+-----+-----+
|  0|10000|    1|
|  1|17750|    1|
|  1|10001|    1|
|  1| 7750|    1|
|  2|19656|    1|
|  2|10002|    1|
|  2| 9656|    1|
|  3|19119|    1|
|  3|10003|    1|
|  3| 9119|    1|
+---+-----+-----+

created graph:

g = GraphFrame(vertix_data, edge_data)

used connected components:

connected = g.connectedComponents(algorithm='graphframes')

which results in:

+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number|          address_1|           address_2|          suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|  michaela| neumann|            8|     stanley street|               miami|   winston hills|    4223|  nsw|     19151111|   5304218|  0| mneu|        0|
|  courtney| painter|           12|  pinkerton circuit|          bega flats|       richlands|    4560|  vic|     19161214|   4066625|  1| cpai|        1|
|   charles|   green|           38|salkauskas crescent|                kela|           dapto|    4566|  nsw|     19480930|   4365168|  2| cgre|        2|
|   vanessa|    parr|          905|     macquoid place|   broadbridge manor|   south grafton|    2135|   sa|     19951119|   9239102|  3| vpar|        3|
|   mikayla|malloney|           37|      randwick road|             avalind|hoppers crossing|    4552|  vic|     19860208|   7207688|  4| mmal|        4|
|     blake|   howie|            1|     cutlack street|belmont park belt...|        budgewoi|    6017|  vic|     19250301|   5180548|  5| bhow|        5|
| blakeston| broadby|           53|     traeger street|   valley of springs|      north ward|    3083|  qld|     19120907|   4308555|  7| bbro|        7|
|    edward| denholm|           10|        corin place|           gold tyne|       clayfield|    4221|  vic|     19660306|   7119771|  9| eden|        9|
|   charlie|alderson|          266|hawkesbury crescent|deergarden caravn...|           cooma|    4128|  vic|     19440908|   1256748| 10| cald|       10|
|     molly|   roche|           59|willoughby crescent|        donna valley|         carrara|    4825|  nsw|     19200712|   1847058| 11| mroc|       11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+

The component column doesn't always increase in increments of 1 but seems to skip numbers at random. I would like it to increase in increments of one, as I'm using this number to assign each person an id. Does anybody know why GraphFrames does this?

When I look further into this, for the approx. 20,000 rows in my development dataframe, approx. 17% of entries have a skip in them. In extreme cases the gap can be around 20-30, i.e. one row's id is 5846 and the next one is 5868. My worry is that when I scale to millions and hundreds of millions of rows, the gaps between ids will get very large, which could create problems down the line.

TL;DR: Why does Spark's connected components seem to randomly skip values and not always increment by 1?

Auren Ferguson

2 Answers

The GraphFrames documentation never promises consecutive ids. The only guarantee it provides is:

The resulting DataFrame contains all the vertex information and one additional column:

component (LongType): unique ID for this component

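In practice, the GraphX implementation uses the smallest vertex id in the component ("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex"), and GraphFrames seems to do the same thing. That would explain the gaps: any vertex id that is not the smallest in its component never appears as a component value.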
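To see why labelling each component by its lowest vertex id leaves gaps, here is a minimal plain-Python sketch. It is not the GraphX or GraphFrames code; the `min_id_components` helper and the toy vertices and edges are made up purely for illustration.

```python
def min_id_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its connected component."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # Keep the smaller id as the root, so the root is the component minimum.
            parent[max(a, b)] = min(a, b)

    return {v: find(v) for v in vertices}


# Toy data (hypothetical): vertices 0..5, with 1-2 and 3-4 linked.
labels = min_id_components(range(6), [(1, 2), (3, 4)])
component_ids = sorted(set(labels.values()))
print(component_ids)  # [0, 1, 3, 5]
```

Ids 2 and 4 never show up as component values because they were absorbed into components 1 and 3, which is the same kind of skipping described in the question.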

Community

As @user10802135 said, the component values are not guaranteed to be sequential. If you want them to be sequential, you'll need to do some post-processing on the component field. A PySpark solution would look something like this:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Define a window covering all rows, ordered by component - required by dense_rank()
windowSpec = Window.partitionBy(F.lit(1)).orderBy('component')

# Redefine the component field, now in sequential order
# (connected is the DataFrame returned by g.connectedComponents() in the question)
connected = connected.withColumn('component', F.dense_rank().over(windowSpec))

Partitioning by the literal value 1 puts every row into a single window partition, so dense_rank() considers all rows at once, with the ranking order determined by the .orderBy() argument. Here that argument is 'component', which sorts in ascending order by default. dense_rank() gives every row with the same component the same value and, unlike rank(), leaves no gaps after ties, so the new ids run 1, 2, 3 and so on. Note that a single-partition window moves all rows to one task, which may become a bottleneck on very large datasets.
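The difference between the two ranking functions is easy to check outside Spark. The sketch below reimplements SQL-style rank and dense_rank semantics in plain Python; the `rank` and `dense_rank` helpers and the sample component values are illustrative assumptions, not PySpark calls.

```python
def rank(values):
    """SQL-style RANK: ties share a rank, and the next rank skips ahead."""
    ordered = sorted(values)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        # Remember only the first (lowest) position at which each value occurs.
        first_pos.setdefault(v, pos)
    return [first_pos[v] for v in values]


def dense_rank(values):
    """SQL-style DENSE_RANK: ties share a rank, and ranks have no gaps."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values]


# Hypothetical component column with gaps between the ids.
components = [0, 0, 3, 3, 3, 7]
rank_result = rank(components)
dense_result = dense_rank(components)
print(rank_result)   # [1, 1, 3, 3, 3, 6]
print(dense_result)  # [1, 1, 2, 2, 2, 3]
```

Both functions keep rows of the same component together, but only the dense variant produces consecutive ids.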

There are some great examples and explanations of .dense_rank() and other window functions here.

PJ Gibson