Investigating a social network using Graph Theory
This project will investigate a social network using graph theory. The first part of the project will find a dominating set of users in a network (i.e. a subset of users such that every other user in the network is connected to at least one user in the subset). The second part of the project will find sub-communities, i.e. tightly connected groups of users, within a network.
The data used will be the provided Facebook datasets: facebook_1000.txt, facebook_2000.txt and facebook_ucsd.txt.
facebook_1000.txt has 783 vertices (persons) and 3,784 edges (friendships).
facebook_2000.txt has 1,780 vertices (persons) and 15,650 edges (friendships).
facebook_ucsd.txt has 14,948 vertices (persons) and 886,442 edges (friendships).
Note: The above datasets are slightly different from those given in the initial proposal.
For answering the easier question (dominating set), the facebook_1000 and facebook_ucsd datasets will be used. For answering the harder question (finding sub-communities), the facebook_1000 and facebook_2000 datasets will be used.
Easier: To find a dominating set of nodes in a Facebook network. A dominating set is a subset of users such that every user in the network is either in the subset or connected to at least one member of it. This is useful for broadcasting marketing messages or for targeting “super” users or influencers in a network.
Harder: To find sub-communities in a Facebook network. Sub-communities are closely connected users who interact with each other more and share certain commonalities, such as attending the same high school or having the same subject interests.
The graphs are visualized using the GraphStream package (http://graphstream-project.org).
The network was laid out as a classic graph data structure using an adjacency list. Each person is denoted as a vertex and a friendship is denoted as an edge in the graph.
A HashMap was used, with an Integer key denoting a vertex and a HashSet of Integers as its value denoting that vertex's edges.
Although this data structure had its limitations (for example, finding all edges that have a given vertex as an end node is awkward), it was sufficient to answer this question because the algorithm was less complicated. A minimal sketch is shown below; see further down for the data structures used to answer the harder question.
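A minimal sketch of this representation is shown below. The class and method names (SimpleGraph, addVertex, addEdge) are illustrative rather than the project's actual identifiers; since friendships are undirected, each edge is stored in both directions.

```java
import java.util.HashMap;
import java.util.HashSet;

// Minimal adjacency-list graph: each vertex id maps to the set of its neighbours.
public class SimpleGraph {
    private final HashMap<Integer, HashSet<Integer>> adjList = new HashMap<>();

    // Add a person (vertex) if not already present.
    public void addVertex(int id) {
        adjList.putIfAbsent(id, new HashSet<>());
    }

    // Add a friendship (edge); stored in both directions because the graph is undirected.
    public void addEdge(int from, int to) {
        addVertex(from);
        addVertex(to);
        adjList.get(from).add(to);
        adjList.get(to).add(from);
    }

    public HashMap<Integer, HashSet<Integer>> getAdjList() {
        return adjList;
    }
}
```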
The algorithm used was the one laid out in the project idea lecture 'Broadcasting to a Twitter network'. The outline of the algorithm is as follows:
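In outline, the algorithm is a greedy approximation: repeatedly pick the vertex that covers the largest number of still-uncovered vertices (the vertex itself plus its neighbours), add it to the dominating set, mark those vertices as covered, and repeat until every vertex is covered. Below is a hedged sketch of this greedy approach, assuming the adjacency-list structure above; the class and method names are illustrative.

```java
import java.util.*;

public class DominatingSet {

    // Greedy approximation: repeatedly pick the vertex that covers the most
    // still-uncovered vertices (itself plus its neighbours) until all are covered.
    public static Set<Integer> getGreedyDominatingSet(Map<Integer, HashSet<Integer>> adjList) {
        Set<Integer> dominatingSet = new HashSet<>();
        Set<Integer> covered = new HashSet<>();

        while (covered.size() < adjList.size()) {
            int bestVertex = -1;
            int bestGain = -1;
            for (int v : adjList.keySet()) {
                if (dominatingSet.contains(v)) continue;
                // Count how many new vertices choosing v would cover.
                int gain = covered.contains(v) ? 0 : 1;
                for (int nbr : adjList.get(v)) {
                    if (!covered.contains(nbr)) gain++;
                }
                if (gain > bestGain) {
                    bestGain = gain;
                    bestVertex = v;
                }
            }
            dominatingSet.add(bestVertex);
            covered.add(bestVertex);
            covered.addAll(adjList.get(bestVertex));
        }
        return dominatingSet;
    }
}
```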
The dominating set for the final run using the facebook_1000 dataset is shown below. For a network of 783 vertices and 3,784 edges, a dominating set of 266 vertices was found. Dominating set vertices are shown in red and all other vertices in black.
For the facebook_ucsd dataset (14,948 vertices and 886,442 edges), a dominating set of 2,546 vertices was found. That graph is not displayed here, as the image was too cluttered and not visually intuitive.
As before, each person is denoted as a vertex and each friendship as an edge, but the graph was laid out with a richer structure than a plain adjacency list.
A HashMap with the user id as an Integer key and a Node object as the value was created.
A HashSet of edges (denoting friendship connections) was also maintained in the program.
Each Node object had a HashSet of edges within it, in addition to the actual user id (stored as an integer). Each Edge object pointed to the two nodes that it connected.
This data structure made more sense than the simple HashMap<Integer, HashSet<Integer>> structure used to answer the easier question above. The addVertex and addEdge methods were modified to populate both types of data structures so that the same code could answer both questions; a sketch is shown below.
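A hedged sketch of this layout is shown below; the class and field names (SocialGraph, Node, Edge) are illustrative, not the project's actual identifiers.

```java
import java.util.*;

// Sketch of the richer representation: a Node object per user id, explicit Edge
// objects, and the plain adjacency list from the easier question maintained
// alongside so that the same addVertex/addEdge calls serve both questions.
class Node {
    final int userId;
    final Set<Edge> edges = new HashSet<>();
    Node(int userId) { this.userId = userId; }
}

class Edge {
    final Node from, to;
    Edge(Node from, Node to) { this.from = from; this.to = to; }
}

class SocialGraph {
    private final Map<Integer, Node> vertices = new HashMap<>();
    private final Set<Edge> edges = new HashSet<>();
    private final Map<Integer, HashSet<Integer>> adjList = new HashMap<>();

    public void addVertex(int id) {
        vertices.putIfAbsent(id, new Node(id));
        adjList.putIfAbsent(id, new HashSet<>());
    }

    public void addEdge(int from, int to) {
        addVertex(from);
        addVertex(to);
        // Node/Edge objects used by the community-detection algorithm.
        Edge e = new Edge(vertices.get(from), vertices.get(to));
        edges.add(e);
        vertices.get(from).edges.add(e);
        vertices.get(to).edges.add(e);
        // Plain adjacency list reused by the dominating-set algorithm.
        adjList.get(from).add(to);
        adjList.get(to).add(from);
    }
}
```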
The algorithm used was the one described in the project ideas lecture 'Detecting communities'. This algorithm is formally known as the Girvan-Newman algorithm.
The steps in the algorithm are as follows:
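In short: compute the betweenness of every edge, remove the edge(s) with the highest score, recompute, and repeat until the highest score drops below a chosen threshold; the connected components that remain are the sub-communities. A minimal sketch of this loop is given below. It is a reconstruction rather than the project's actual code, and it delegates the betweenness computation to an EdgeBetweenness.compute method like the Brandes sketch shown a little further below.

```java
import java.util.*;

// Minimal sketch of the Girvan-Newman loop over an adjacency-list graph.
public class GirvanNewman {

    public static int detect(Map<Integer, Set<Integer>> adjacency, double threshold) {
        while (true) {
            // Step 1: compute the betweenness of every edge in the current graph.
            Map<String, Double> scores = EdgeBetweenness.compute(adjacency);

            // Step 2: find the edge with the highest betweenness score.
            String top = null;
            double best = 0.0;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                if (e.getValue() > best) { best = e.getValue(); top = e.getKey(); }
            }

            // Step 3: stop once the highest score falls below the chosen threshold.
            if (top == null || best < threshold) break;

            // Step 4: remove that edge in both directions and repeat from Step 1.
            // (This sketch removes one edge per iteration; ties are handled over
            // successive iterations after recomputing the scores.)
            String[] ends = top.split("-");
            int u = Integer.parseInt(ends[0]);
            int v = Integer.parseInt(ends[1]);
            adjacency.get(u).remove(v);
            adjacency.get(v).remove(u);
        }
        // The connected components that remain are the detected sub-communities.
        return countComponents(adjacency);
    }

    private static int countComponents(Map<Integer, Set<Integer>> adjacency) {
        Set<Integer> visited = new HashSet<>();
        int components = 0;
        for (int start : adjacency.keySet()) {
            if (visited.contains(start)) continue;
            components++;
            Deque<Integer> queue = new ArrayDeque<>();   // BFS over one component
            queue.add(start);
            visited.add(start);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                for (int w : adjacency.get(v)) {
                    if (visited.add(w)) queue.add(w);
                }
            }
        }
        return components;
    }
}
```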
To measure the betweenness of each edge, a modified version of Brandes' algorithm for finding the betweenness centrality of each vertex was used.
The Brandes algorithm is as follows:
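Below is a hedged sketch of Brandes' approach adapted to edge betweenness on an unweighted, undirected graph; the class and variable names (EdgeBetweenness, sigma, delta, ...) are illustrative, not the project's actual code. The STEP 1 and STEP 2 comments correspond to the two phases discussed in the runtime analysis below.

```java
import java.util.*;

// Sketch of Brandes' algorithm adapted for edge betweenness on an unweighted,
// undirected graph stored as an adjacency list. Scores are halved at the end so
// that each unordered pair of vertices is counted only once.
public class EdgeBetweenness {

    // Canonical key for an undirected edge (u, v).
    private static String edgeKey(int u, int v) {
        return Math.min(u, v) + "-" + Math.max(u, v);
    }

    public static Map<String, Double> compute(Map<Integer, Set<Integer>> adjacency) {
        Map<String, Double> betweenness = new HashMap<>();

        for (int s : adjacency.keySet()) {                  // one BFS per source vertex
            Deque<Integer> stack = new ArrayDeque<>();
            Map<Integer, List<Integer>> pred = new HashMap<>();
            Map<Integer, Integer> sigma = new HashMap<>();  // number of shortest paths from s
            Map<Integer, Integer> dist = new HashMap<>();
            for (int v : adjacency.keySet()) {
                pred.put(v, new ArrayList<>());
                sigma.put(v, 0);
                dist.put(v, -1);
            }
            sigma.put(s, 1);
            dist.put(s, 0);

            // STEP 1: a single BFS counts the shortest paths from s to every vertex.
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                stack.push(v);
                for (int w : adjacency.get(v)) {
                    if (dist.get(w) < 0) {                  // w discovered for the first time
                        dist.put(w, dist.get(v) + 1);
                        queue.add(w);
                    }
                    if (dist.get(w) == dist.get(v) + 1) {   // a shortest path to w runs through v
                        sigma.put(w, sigma.get(w) + sigma.get(v));
                        pred.get(w).add(v);
                    }
                }
            }

            // STEP 2: pop vertices in reverse BFS order and accumulate the
            // dependency of s on every edge lying on a shortest path.
            Map<Integer, Double> delta = new HashMap<>();
            for (int v : adjacency.keySet()) delta.put(v, 0.0);
            while (!stack.isEmpty()) {
                int w = stack.pop();
                for (int v : pred.get(w)) {
                    double credit = sigma.get(v).doubleValue() / sigma.get(w) * (1.0 + delta.get(w));
                    betweenness.merge(edgeKey(v, w), credit, Double::sum);
                    delta.put(v, delta.get(v) + credit);
                }
            }
        }

        // Each unordered pair of vertices has been counted from both endpoints; halve the totals.
        for (Map.Entry<String, Double> e : betweenness.entrySet()) {
            e.setValue(e.getValue() / 2.0);
        }
        return betweenness;
    }
}
```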
For the facebook_1000 dataset (783 vertices and 3,784 edges), stopping once the maximum edge betweenness score fell below 500 gives the output shown below.
Number of sub-communities detected = 35
For the facebook_2000 dataset (1,780 vertices and 15,650 edges), stopping at a maximum edge betweenness score of 500 or less gives the following output:
Number of sub-communities detected = 245
Naively, finding the edge betweenness would require O(V^2*E) time, since shortest paths have to be found from each node to every other node in the graph and each breadth-first search (BFS) takes O(E) time. The beauty of Brandes' algorithm is that it can do this with a single BFS from each vertex.
Thus STEP 1 in the code below, which finds the number of shortest paths, takes O(E) time for an unweighted graph (refer to Corollary 4 in the Brandes paper).
STEP 2 in the Brandes algorithm, which accumulates the dependencies of the starting node on all the other nodes, also takes O(E) time for an unweighted graph (refer to Corollary 7 in the Brandes paper).
Tracing through the actual code for Brandes' algorithm to find the shortest-path edge betweenness, the asymptotic cost of each group of steps is shown to the right.
Adding up the runtimes we get: O(E) + O(V*(V + 3E)) = O(E) + O(V^2 + 3VE) = O(V^2 + 3VE + E).
Since VE is the dominating term in the above expression (assuming the number of edges is at least as large as the number of vertices, so that V^2 <= VE), the runtime of the function is O(V*E).
Since the betweenness function has to be re-run for up to every edge of the graph (a pair of directed edges with the highest betweenness is removed in each iteration), the total run time is O(V*E^2).
The above runtime is also explained in this scholarly article on edge betweenness.
Testing was done using the dataset described in the lecture 'Project Idea: Broadcasting to a Twitter network'. The dominating set vertices are shown in red and all other vertices in black.
Input: small_test2_graph.txt
size 8+19 Graph nodes and edges
dominating set size = 3
dominating set node 1
dominating set node 4
dominating set node 8
The above test actually shows a limitation of the algorithm: the optimal dominating set would have been vertices 2 and 7. This is a known limitation and was discussed in the lecture (it is an approximation algorithm).
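For reference, a test along these lines could be driven as follows, reusing the sketches above; the file format (one friendship per line, as a pair of whitespace-separated user ids) and the class names are assumptions, not taken from the project code.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;

public class DominatingSetTest {
    public static void main(String[] args) throws Exception {
        SimpleGraph graph = new SimpleGraph();
        // Assumed format: one undirected friendship per line, "fromId toId".
        for (String line : Files.readAllLines(Paths.get("small_test2_graph.txt"))) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.trim().split("\\s+");
            graph.addEdge(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
        }
        Set<Integer> domSet = DominatingSet.getGreedyDominatingSet(graph.getAdjList());
        System.out.println("dominating set size = " + domSet.size());
        for (int v : domSet) {
            System.out.println("dominating set node " + v);
        }
    }
}
```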
Several tests were run to verify the correctness of the code, both for step 1 alone and for steps 1 and 2 together. Examples were pulled from multiple lectures found on the internet. Here are some of them:
http://www.jsums.edu/nmeghanathan/files/2017/01/CSC641-Mod-4.pdf?x61976
Below is an illustration of the test done with Example 1 in the first lecture above.
The highest betweenness score is for edge 7-8 (and 8-7, since this is an undirected graph).
Recalculating the betweenness after iteration 1 (after removing the edges 7-8 and 8-7)
Removing the edges with the highest betweenness score (12): 3-7, 7-6, 8-9 and 8-12.
We end up with 4 sub-communities: vertices 1-2-3, 4-5-6, 9-10-11 and 12-13-14.
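For completeness, the example graph can be rebuilt from the description above: four triangles {1,2,3}, {4,5,6}, {9,10,11} and {12,13,14} joined through hub vertices 7 and 8 by the edges 3-7, 6-7, 8-9, 8-12 and the bridge 7-8. The exact edge list is inferred from the illustration rather than copied from the lecture, so treat it as an assumption; running the Brandes sketch on it should report the bridge 7-8 as the highest-betweenness edge, matching the first iteration shown above.

```java
import java.util.*;

public class BetweennessExampleTest {
    public static void main(String[] args) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        int[][] edges = {
            {1, 2}, {2, 3}, {1, 3},        // community 1-2-3
            {4, 5}, {5, 6}, {4, 6},        // community 4-5-6
            {9, 10}, {10, 11}, {9, 11},    // community 9-10-11
            {12, 13}, {13, 14}, {12, 14},  // community 12-13-14
            {3, 7}, {6, 7},                // hub vertex 7
            {8, 9}, {8, 12},               // hub vertex 8
            {7, 8}                         // bridge removed in iteration 1
        };
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        Map<String, Double> scores = EdgeBetweenness.compute(adj);
        // Expect the bridge edge "7-8" to score highest, as in the illustration above.
        String top = Collections.max(scores.entrySet(), Map.Entry.comparingByValue()).getKey();
        System.out.println("highest betweenness edge: " + top);
    }
}
```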
For the harder question, the runtime increased quite a bit as the number of edges increased. For the facebook_1000 dataset (783 vertices and 3,784 edges), O(|V|*|E|^2) ~ 11.2 billion operations.
To detect 35 sub-communities (betweenness score of less than 500), the process had to go through 585 iterations. The actual run time was about 10 minutes. The number of iterations was high because of the multiple edges connecting the sub-communities.
For the facebook_2000 dataset (1,780 vertices and 15,650 edges), O(|V|*|E|^2) ~ 436 billion operations (about 40 times more than the previous dataset). The actual run time was 6 hours to detect 245 sub-communities (betweenness score of less than 500).
Therefore it was not practical to run this process for the facebook_ucsd dataset, which has 14,948 vertices and 886,442 edges: the O(|V|*|E|^2) estimate comes to roughly 1.2 x 10^16, i.e. about 11.75 quadrillion operations!
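As a quick sanity check on these estimates, the arithmetic can be reproduced directly:

```java
// Quick arithmetic check of the O(|V|*|E|^2) estimates quoted above.
public class RuntimeEstimate {
    public static void main(String[] args) {
        System.out.printf("facebook_1000: %.2e%n", 783.0 * 3784.0 * 3784.0);        // ~1.1e10 (11.2 billion)
        System.out.printf("facebook_2000: %.2e%n", 1780.0 * 15650.0 * 15650.0);     // ~4.4e11 (436 billion)
        System.out.printf("facebook_ucsd: %.2e%n", 14948.0 * 886442.0 * 886442.0);  // ~1.2e16
    }
}
```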