Questions tagged [graphframes]

DataFrame-based graph library for Apache Spark

GraphFrames is a DataFrame-based alternative to core GraphX with cross-language support.

186 questions
1 vote, 1 answer

Cannot set checkpoint dir when running Connected Component example

This is the Connected Components example by graphframe: from graphframes.examples import Graphs g = Graphs(sqlContext).friends() # Get example graph result = g.connectedComponents() result.select("id", "component").orderBy("component").show() In…
huy
1 vote, 1 answer

Pyspark + Graphframes: "recursive" message aggregation

I've created the following graph: spark = SparkSession.builder.appName('aggregate').getOrCreate() vertices = spark.createDataFrame([('1', 'foo', 99), ('2', 'bar', 10), ('3', 'baz',…
Julio
1 vote, 1 answer

Pyspark and Graphframes: Aggregate messages power mean

Given the following graph: Where A has a value of 20, B has a value of 5 and C has a value of 10, I would like to use pyspark/graphframes to compute the power mean. That is, In this case n is the number of items (3 in our case, for three vertices…
Julio
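A rough, library-free sketch of the arithmetic behind the power mean question above. It assumes the standard definition M_p = ((x_1^p + ... + x_n^p) / n)^(1/p); the exponent p is not stated in the excerpt, so p=2 below is only an illustrative choice, and the helper names (`to_message`, `merge`, `power_mean`) are hypothetical. The point of the (sum, count) pair is that it can be combined in any order, which is the shape a message-aggregation step would need.

```python
from typing import Iterable, Tuple

def to_message(value: float, p: float) -> Tuple[float, int]:
    """Turn one vertex value into a partial aggregate: (value**p, count of 1)."""
    return (value ** p, 1)

def merge(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Combine two partial aggregates; order does not matter."""
    return (a[0] + b[0], a[1] + b[1])

def power_mean(values: Iterable[float], p: float) -> float:
    """Power mean with exponent p (assumed non-zero) over non-negative values."""
    if p == 0:
        raise ValueError("p must be non-zero in this sketch")
    total, n = 0.0, 0
    for v in values:
        total, n = merge((total, n), to_message(v, p))
    if n == 0:
        raise ValueError("need at least one value")
    return (total / n) ** (1.0 / p)

# Vertex values taken from the question: A=20, B=5, C=10.
vertex_values = {"A": 20.0, "B": 5.0, "C": 10.0}
print(power_mean(vertex_values.values(), p=2))  # p=2 is an assumed exponent
```

With p=1 this reduces to the ordinary arithmetic mean, which is a quick way to sanity-check an aggregation pipeline before trying other exponents.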
1 vote, 0 answers

Iterative GraphFrames AggregateMessages hitting memory limits

I'm using GraphFrame's aggregateMessages capability to build a custom clustering algorithm. I tested this algorithm on a small sample dataset (~100 items) and verified that it works. But when I run this on my real dataset of 50k items, I am getting…
webber
1 vote, 1 answer

How to Get Connected Component with Graphframes in Pyspark and Raw Data in Spark Dataframe?

I have a Spark DataFrame which looks like below:

+--+-----+---------+
|id|phone|  address|
+--+-----+---------+
| 0|  123| james st|
| 1|  177|avenue st|
| 2|  123|spring st|
| 3|  999|avenue st|
| 4|  678|  5th ave|
+--+-----+---------+

I am…
MAMS
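To make the expected result of the question above concrete, here is a small plain-Python sketch (no Spark involved) that groups the five sample rows into components. It assumes two rows belong together when they share a phone or an address; the excerpt is cut off before stating the rule, so that linking rule is a guess, and the function name `connected_components` is only illustrative. It uses a union-find structure rather than the GraphFrames algorithm.

```python
from typing import Dict, Hashable, List, Tuple

# Sample rows copied from the question: (id, phone, address).
rows: List[Tuple[int, str, str]] = [
    (0, "123", "james st"),
    (1, "177", "avenue st"),
    (2, "123", "spring st"),
    (3, "999", "avenue st"),
    (4, "678", "5th ave"),
]

def connected_components(rows: List[Tuple[int, str, str]]) -> Dict[int, int]:
    """Map each row id to the smallest id in its component (union-find)."""
    parent: Dict[int, int] = {row_id: row_id for row_id, _, _ in rows}

    def find(x: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    # Assumed linking rule: a shared phone or a shared address joins two rows.
    first_seen: Dict[Tuple[str, Hashable], int] = {}
    for row_id, phone, address in rows:
        for key in (("phone", phone), ("address", address)):
            if key in first_seen:
                union(row_id, first_seen[key])
            else:
                first_seen[key] = row_id

    return {row_id: find(row_id) for row_id in parent}

print(connected_components(rows))
```

Under that assumed rule, rows 0 and 2 end up together (same phone), rows 1 and 3 end up together (same address), and row 4 stays on its own.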
1 vote, 1 answer

RDD Warning: Not enough space to cache rdd in memory

I am trying to run the PageRank algorithm on a GraphFrame using pyspark. However, when I execute it, the program keeps running endlessly and I get the following warnings: The code is as follows: vertices = sc.createDataFrame(lst_sent,['id',…
1 vote, 1 answer

Convert GraphFrame output to a pandas DataFrame

I checked multiple sources but couldn't pinpoint this particular problem although it probably has a very easy fix. Let's say I have some graph, g. I am able to print the vertices using g.vertices.show() But I'm having a lot of trouble figuring out…
Jonathan
1 vote, 1 answer

Spark GraphFrames High Shuffle read/write

I have created a graph from vertex and edge files. The graph is 600 GB in size. I am querying it using the motif feature of Spark GraphFrames. I have set up an AWS EMR cluster for querying the graph. Cluster details: 1 master and 8 slaves. Master Node: …
AbhiK
1 vote, 1 answer

Spark graphx issue

I am trying to follow the example in https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html However, when I change some criteria, the result is not what I expect. Please see the steps below: from functools…
Pratik Rudra
1 vote, 0 answers

Why is there no GraphFrames release for Spark 2.4.x and scala 2.12?

I'm looking at the GraphFrames releases available here: https://spark-packages.org/package/graphframes/graphframes. The only GraphFrames release available for Scala 2.12 as of April 22, 2020 is for Spark 3.0, but Spark 3.0 isn't production-ready yet. Is…
1 vote, 1 answer

GraphFrames Shortest Paths gives distance and not the actual path

I'm new to GraphFrames and trying to implement edge betweenness. I tried using the built-in shortestPaths function. It returns the distance from the source to the destination vertex but not the actual path between them. The output is: | id | …
Shubham Yadav
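For the distance-versus-path distinction raised in the question above, a minimal plain-Python sketch of recovering the path itself: a breadth-first search that records each vertex's predecessor and walks those links back from the target. It assumes a directed, unweighted graph; the edge list and the function name `shortest_path` are made up for illustration and do not come from the question or from the GraphFrames API.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Tuple

def shortest_path(
    edges: List[Tuple[Hashable, Hashable]],
    source: Hashable,
    target: Hashable,
) -> Optional[List[Hashable]]:
    """Return one shortest path (as a list of vertices) or None if unreachable."""
    # Build a directed adjacency list from (src, dst) pairs.
    adjacency: Dict[Hashable, List[Hashable]] = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    # parent[v] is the vertex from which v was first reached.
    parent: Dict[Hashable, Optional[Hashable]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk predecessor links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical example graph, not taken from the question.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
print(shortest_path(edges, "a", "e"))
```

The length of the returned list minus one is the hop count, so a distance-only result can be cross-checked against it.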
1 vote, 1 answer

Getting Size Exceeded Exception while storing Dataframe into MongoDB

I am trying to store an Apache Spark DataFrame into MongoDB using Scala, but I get the following exception: Caused by: org.bson.BsonMaximumSizeExceededException: Payload document size is larger than maximum of 16777216. Code…
ameen
1 vote, 0 answers

Depth First Search Algorithm in Dataframe(GraphFrame) in spark

I have two DataFrames, one containing vertices: val v = sqlContext.createDataFrame(scala.List( ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30), ("d", "David", 29), ("e", "Esther", 1), ("f",…
1 vote, 0 answers

What is the most efficient 'sparky' way to build a graph from raw data?

I have a dataset containing mentions of various topics across reddit which looks like: +------------+-------+-----------+---------------+ | Year_month | Topic | Subreddit | Mention_count | …
1 vote, 1 answer

How to add graphframes to Apache Zeppelin

I am trying to use the graphframes library on Apache Zeppelin with the Spark (pyspark) interpreter, however, I keep on getting the error: ModuleNotFoundError: No module named 'graphframes' whenever I try to import the graphframes module using from…
Marxley