I am using graphframes to represent a graph in pyspark from a similar dataframe:
data = [
("1990", "1995"),
("1980", "1996"),
("1993", "1994"),
("1990", "2002"),
("1996", "2002"),
("1999", "2008"),
("2003", "2014"),
]
Nodes are represented as a union of date strings, ie:
nodes = [
("1990_1995"),
("1980_1996"),
("1993_1994"),
("1990_2002"),
("1996_2002"),
("1999_2008"),
("2003_2014"),
]
My problem is that I know how to define the edges. In theory I know how to do it, but in the implementation I have some difficulties.
The edge definition is based on this rule: if a date range (e.g. "1990-12-10" - "1995-12-10") overlaps with another date range (e.g. "1980-12-10" - "1996-12-10"), then the two date ranges represent an edge. How can I implement the creation of edge according to this rule in pyspark graphframes?
The edges should be:
node_and_edges = [
("1990_1995", ["1980_1996", "1993_1994", "1990_2002"]),
("1980_1996", ["1990_1995", "1993_1994", "1990_2002"]),
("1993_1994", ["1990_1995", "1980_1996", "1990_2002"]),
("1990_2002", ["1990_1995", "1980_1996", "1993_1994", "1996_2002", "1999_2008"]),
("1996_2002", ["1993_1994", "1990_2002", "1999_2008"]),
("1999_2008", ["1990_2002", "1996_2002", "2003_2014"]),
("2003_2014", ["1999_2008"]),
]
Code below:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession as ss
spark = ss.builder.appName("test").getOrCreate()
data = [
("1990", "1995"),
("1980", "1996"),
("1993", "1994"),
("1990", "2002"),
("1996", "2002"),
("1999", "2008"),
("2003", "2014"),
]
rdd = spark.sparkContext.parallelize(data)
columns = ["start_date", "end_date"]
df = rdd.toDF(columns)
# Creating nodes
df = df.withColumn("node", F.concat_ws("_", F.col("start_date"), F.col("end_date")))
# Creating edges
# how?
df.printSchema()
df.show()
Execution output:
root
|-- start_date: string (nullable = true)
|-- end_date: string (nullable = true)
|-- node: string (nullable = false)
+----------+--------+---------+
|start_date|end_date| node|
+----------+--------+---------+
| 1990| 1995|1990_1995|
| 1980| 1996|1980_1996|
| 1993| 1994|1993_1994|
| 1990| 2002|1990_2002|
| 1996| 2002|1996_2002|
| 1999| 2008|1999_2008|
| 2003| 2014|2003_2014|
+----------+--------+---------+
Overlap edges rule: