
I have a large dataset (several million rows) that I want to use for graph analysis. After data preparation and cleaning, the data now lives in Python as a pandas DataFrame.

For the graph analysis I am using the Stanford Network Analysis Project (SNAP). The reason I chose SNAP, even though other frameworks such as networkx or GraphLab are available, is that SNAP can handle very large graphs.

But SNAP uses different data structures from the ones we are used to with pandas: it works with Vectors, Hash Tables, and Pairs.

https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html
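To illustrate, those container types look roughly like this (type names taken from the SNAP.py tutorial; treat this as a sketch and check it against your installed version):

import snap

# A growable vector of integers.
v = snap.TIntV()
v.Add(1)
v.Add(2)

# A hash table mapping int keys to string values.
h = snap.TIntStrH()
h[5] = "five"

# A pair holding an int and a string.
p = snap.TIntStrPr(1, "one")

print(v.Len())
print(h[5])
print(p.GetVal2())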

I find it difficult to convert from the DataFrame format to any of these. What I am doing currently is converting the DataFrame to a text file, saving it on the hard disk, and reading it back into SNAP with snap.LoadEdgeListStr, roughly as in the sketch after the link below:

https://snap.stanford.edu/snappy/doc/reference/LoadEdgeListStr1.html?highlight=loadedgeliststr
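Roughly, my current round trip looks like this (the column names 's'/'t' and the file name are just placeholders for my real data):

import pandas as pd
import snap

# Placeholder dataframe with source ('s') and target ('t') node ids.
df = pd.DataFrame({'s': [0, 0, 1], 't': [1, 2, 0]})

# Dump the edge list to a tab-separated text file on disk.
df.to_csv('edges.txt', sep='\t', header=False, index=False)

# Read it back into SNAP as a directed graph; columns 0 and 1 hold
# the source and target node ids.
G = snap.LoadEdgeListStr(snap.PNGraph, 'edges.txt', 0, 1)
print(G.GetNodes())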

Is there a way to convert directly between the two formats, so I don't need to go through this process every time?


1 Answer


If you wish to convert a pandas dataframe to a SNAP graph in memory, you can create a new graph and fill it with nodes and edges as follows:

import pandas as pd
import snap

# Create a sample pandas dataframe:
data = {
    's': [0, 0, 1],
    't': [1, 2, 0]
}
df = pd.DataFrame(data)

# Create SNAP directed graph:
G1 = snap.TNGraph.New()
# Add nodes:
nodes = set(df['s'].tolist() + df['t'].tolist())
for node in nodes:
    G1.AddNode(int(node))
# Add edges:
for index, row in df.iterrows():
    G1.AddEdge(int(row['s']), int(row['t']))
# Print result:
G1.Dump()

If you still wish to save / load your graphs after creating them the first time, consider saving them in binary format instead of using text files (via the Save() and Load() methods, as sketched below). That should be much more efficient.
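A minimal sketch of that, continuing from the snippet above and using a placeholder file name:

# Save the graph in SNAP's binary format.
FOut = snap.TFOut("mygraph.graph")
G1.Save(FOut)
FOut.Flush()

# Later, load it back without rebuilding it from the dataframe.
FIn = snap.TFIn("mygraph.graph")
G2 = snap.TNGraph.Load(FIn)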

SNAP also provides Tables:

Tables in SNAP are designed to provide fast performance at scale, and to effortlessly handle datasets containing hundreds of millions of rows. They can be saved and loaded to disk in a binary format using the provided methods.

These provide a convenient API for transforming tables into graphs (a rough example follows); however, I don't think I would use them instead of a pandas dataframe.
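For completeness, building a graph from a Table looks roughly like this (the schema, column names, and file name are assumptions based on the SNAP tutorial, so double-check them against your installed version):

import snap

context = snap.TTableContext()

# Describe the two integer columns of a tab-separated edge file
# (column names and file name are placeholders).
schema = snap.Schema()
schema.Add(snap.TStrTAttrPr("srcID", snap.atInt))
schema.Add(snap.TStrTAttrPr("dstID", snap.atInt))
table = snap.TTable.LoadSS(schema, "edges.txt", context, "\t", snap.TBool(False))

# Convert the table into a directed graph using the two id columns.
graph = snap.ToGraph(snap.PNGraph, table, "srcID", "dstID", snap.aaFirst)
graph.Dump()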

  • Running this raises TypeError: in method 'PNGraph_AddEdge', argument 2 of type 'int', with the traceback pointing at the line G1.AddEdge(row['s'], row['t']) in the edge-adding loop. – Taie Aug 16 '18 at 08:05
  • On what dataframe are you running? Can you post a sample of the data? – zohar.kom Aug 16 '18 at 08:18
  • I haven't used my data yet; I am still using the sample data that you gave. – Taie Aug 16 '18 at 08:29
  • Both 'AddNode' and 'AddEdge' expect node ids, which should be of type 'int'. So if the columns 's' and 't' in your dataframe contain int values (like in the example) it should work – zohar.kom Aug 16 '18 at 08:35
  • Oh, that's strange. What SNAP version are you using? I run 4.1.0-dev-macosx10.12.6-x64-py2.7 with pandas 0.19.1, and the program outputs: Directed Node Graph: nodes: 3, edges: 3 / 0] in [1] 1 out[2] 1 2 / 1] in [1] 0 out[1] 0 / 2] in [1] 0 out[0] – zohar.kom Aug 16 '18 at 08:38
  • I am using Anaconda 2.5 on Windows 10, 64-bit, so Python and pandas are included in that distribution. – Taie Aug 16 '18 at 08:49
  • I edited the original answer to cast values to int when adding nodes and edges, please try the current version – zohar.kom Aug 16 '18 at 08:55
  • Now it works, thanks. Please give me some time to try the code on my data; I will get back to you soon. By the way, where should I see the result of G1.Dump()? – Taie Aug 16 '18 at 09:09
  • Great! Dump prints to sys.stdout by default; you can write to any output stream. Take a look at the documentation here - https://snap.stanford.edu/snappy/doc/reference/graphs.html – zohar.kom Aug 16 '18 at 09:17