Graph Algorithms for Data Scientists

Photo by Alina Grubnyak on Unsplash

Graphs provide us with a very useful data structure. They can help us to find structure within our data. With the advent of Machine learning and big data, we need to get as much information as possible about our data. Learning a little bit of graph theory can certainly help us with that.

Here is a Graph Analytics for Big Data course on Coursera by UCSanDiego which I highly recommend to learn the basics of graph theory.

One of the algorithms I am going to focus on the current post is called Connected Components. Why it is important. We all know clustering.

You can think of Connected Components in very layman’s terms as a sort of a hard clustering algorithm which finds clusters/islands in related/connected data. As a concrete example: Say you have data about roads joining any two cities in the world. And you need to find out all the continents in the world and which city they contain.

How will you achieve that? Come on give some thought.

To put a Retail Perspective: Let us say, we have a lot of customers using a lot of accounts. One way in which we can use the Connected components algorithm is to find out distinct families in our dataset. We can assume edges(roads) between CustomerIDs based on same credit card usage, or same address or same mobile number etc. Once we have those connections, we can then run the connected component algorithm on the same to create individual clusters to which we can then assign a family ID. We can use these family IDs to provide personalized recommendations based on family needs. We can also use this family ID to fuel our classification algorithms by creating grouped features based on family.

In Finance Perspective: Another use case would be to capture fraud using these family IDs. If an account has done fraud in past, it is highly probable that the connected accounts are also susceptible to fraud.

So enough use cases. Let us start with a simple graph class written in Python to start up our exploits with code.

This post will revolve more around code from here onwards.

You can certainly play with our new graph class. Here we try to build some graphs.

Vertices of graph:
['a', 'c', 'b']
Edges of graph:
[['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2]]
Add vertex:
Vertices of graph:
['a', 'c', 'b', 'z']
Add an edge:
Vertices of graph:
['a', 'c', 'b', 'z']
Edges of graph:
[['a', 'z', 1], ['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2], ['z', 'a', 1]]
Adding an edge {"x","y"} with new vertices:
Vertices of graph:
['a', 'c', 'b', 'y', 'x', 'z']
Edges of graph:
[['a', 'z', 1], ['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2], ['y', 'x', 1], ['x', 'y', 1], ['z', 'a', 1]]

Let’s do something interesting now.

We will use the above graph class for our understanding purpose. There are many modules in python which we can use to do whatever I am going to do next but to understand the methods we will write everything from scratch.
Let us start with an example graph which we can use for our purpose.

Vertices of graph:
['Mannheim', 'Erfurt', 'Munchen', 'Numberg', 'Stuttgart', 'Augsburg', 'Kassel', 'Frankfurt', 'Wurzburg', 'Karlsruhe']
Edges of graph:
[['Mannheim', 'Frankfurt', 85], ['Mannheim', 'Karlsruhe', 80], ['Erfurt', 'Wurzburg', 186], ['Munchen', 'Numberg', 167], ['Munchen', 'Augsburg', 84], ['Munchen', 'Kassel', 502], ['Numberg', 'Stuttgart', 183], ['Numberg', 'Wurzburg', 103], ['Numberg', 'Munchen', 167], ['Stuttgart', 'Numberg', 183], ['Augsburg', 'Munchen', 84], ['Augsburg', 'Karlsruhe', 250], ['Kassel', 'Munchen', 502], ['Kassel', 'Frankfurt', 173], ['Frankfurt', 'Mannheim', 85], ['Frankfurt', 'Wurzburg', 217], ['Frankfurt', 'Kassel', 173], ['Wurzburg', 'Numberg', 103], ['Wurzburg', 'Erfurt', 186], ['Wurzburg', 'Frankfurt', 217], ['Karlsruhe', 'Mannheim', 80], ['Karlsruhe', 'Augsburg', 250]]

Let’s say we are given a graph with the cities of Germany and the respective distance between them. You want to find out how to go from Frankfurt (The starting node) to Munchen. There might be many ways in which you can traverse the graph but you need to find how many cities you will need to visit on a minimum to go from Frankfurt to Munchen) This problem is analogous to finding out the distance between nodes in an unweighted graph.

The algorithm that we use here is called Breadth First Search.

What we do in the above piece of code is to create a queue and traverse it based on levels. We start with Frankfurt as the starting node. We loop through its neighboring cities(Mannheim, Wurzburg, and Kassel) and push them into the queue. We keep track of what level they are at and also the path through which we reached them. Since we are popping the first element of a queue we are sure we will visit cities in the order of their level.

Check out this good post about BFS to understand more about queues and BFS.

({'Augsburg': 3,
'Erfurt': 2,
'Frankfurt': 0,
'Karlsruhe': 2,
'Kassel': 1,
'Mannheim': 1,
'Munchen': 2,
'Numberg': 2,
'Stuttgart': 3,
'Wurzburg': 1},
{'Augsburg': ':->Frankfurt->Mannheim->Karlsruhe',
'Erfurt': ':->Frankfurt->Wurzburg',
'Frankfurt': ':',
'Karlsruhe': ':->Frankfurt->Mannheim',
'Kassel': ':->Frankfurt',
'Mannheim': ':->Frankfurt',
'Munchen': ':->Frankfurt->Kassel',
'Numberg': ':->Frankfurt->Wurzburg',
'Stuttgart': ':->Frankfurt->Wurzburg->Numberg',
'Wurzburg': ':->Frankfurt'})

I did this example to show how BFS algorithm works.
We can extend this algorithm to find out connected components in an unconnected graph. 
Let us say we need to find groups of unconnected vertices in the graph.

For example, the below graph has 3 unconnected sub-graphs. Can we find what nodes belong to a particular subgraph?

['Bangalore', 'Kolkata', 'Delhi', 'Mumbai'],
['NY', 'ALB', 'TX']]

The above code is similar to the previous BFS code. We keep all the vertices of the graph in the nodes list. We take a node from the nodes list and start BFS on it. as we visit a node we remove that node from the nodes list. Whenever the BFS completes we start again with another node in the nodes list until the nodes list is empty.

As you can see we are able to find distinct components in our data. Just by using Edges and Vertices. This algorithm could be run on different data to satisfy any use case I presented above.

But Normally using Connected Components for a retail case will involve a lot of data and you will need to scale this algorithm.

Connected Components in PySpark

Below is an implementation from this paper on Connected Components in MapReduce and Beyond from Google Research. Read the PPT to understand the implementation better. Some ready to use code for you.

Or Use GraphFrames in PySpark:

To Install GraphFrames:

I ran on command line: pyspark — packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 which opened up my notebook and installed graphframes after i try to import in my notebook.

The string to be formatted as : graphframes:(latest version)-spark(your spark version)-s_(your scala version).

Checkout this guide on how to use GraphFrames for more information.

The GraphFrames library implements the CC algorithm as well as a variety of other graph algorithms.

The above post was a lot of code but hope it was helpful. It took me a lot of time to implement the algorithm so wanted to make it easy for the folks.

If you want to read up more on Graph Algorithms here is a Graph Analytics for Big Data course on Coursera by UCSanDiego which I highly recommend to learn the basics of graph theory.

First Published on my blog here.


  1. Graphs in Python
  2. A Gentle Intoduction to Graph Theory Blog
  3. Graph Analytics for Big Data course on Coursera by UCSanDiego

Products You May Like

Leave a Reply

Your email address will not be published. Required fields are marked *