Tuesday, January 31, 2012

The Friendship Paradox


It is the phenomenon first  observed by the sociologist Scott L. Feld in 1991 and he explains the same in his paper "Why your friends have more friends than you do" that,

 Most people have fewer friends than their friends have, on average. 

It’s not specifically about friendship, but a mathematical fact about any relation which is symmetrical and which varies across a population.
It can be explained as a form of sampling bias in which people with greater numbers of friends have an increased likelihood of being observed among one's own friends.

Assuming that a social network is represented by an undirected graph , where the set V of vertices corresponds to the people in the social network, and the set E of edges corresponds to the friendship relation between pairs of people. That is, friendship is a symmetric relation: if X is a friend of Y, then Y is a friend of X. the average number of friends of a person in the social network can be modeled as the average of the degrees of the vertices in the graph.Then the average number μ of friends of a random person in the graph is 

   μ =  2|E| / V
  
The average number of friends that a typical friend has can be modeled by choosing, uniformly at random, an edge of the graph (representing a pair of friends) and an endpoint of that edge (one of the friends), and again calculating the degree of the selected endpoint. That is, mathematically, it is

sum of d(v) for every uv in E / 2|E|  =    μ + σ2/ μ

where σ2 is the variance of the degrees in the graph. For a graph that has vertices of varying degrees (as is typical for social networks), both μ and σ2 are positive, which implies that the average degree of a friend is strictly greater than the average degree of a random node.
After this analysis, Feld goes on to make some more qualitative assumptions about the statistical correlation between the number of friends that two friends have, based on theories of social networks such as assortative mixing, and he analyzes what these assumptions imply about the number of people whose friends have more friends than they do. Based on this analysis, he concludes that in real social networks, most people are likely to have fewer friends than the average of their friends' numbers of friends.

However, this conclusion is not a mathematical certainty; there exist undirected graphs (such as the graph formed by removing a single edge from a large complete graph) that are unlikely to arise as social networks but in which most vertices have higher degree than the average of their neighbors' degrees.

'Friendship Paradox' May Help Predict Spread of Infectious Disease

Nicholas Christakis, professor of medicine, at Harvard University, and James Fowler, professor of medical genetics and political science at the University of California, San Diego, used the paradox to study the 2009 flu epidemic among 744 students. The findings, the researchers say, point to a novel method for early detection of contagious outbreaks.


References
[1] Zuckerman, Ezra W.; John T. Jost (2001). “What Makes You Think You’re So Popular? Self Evaluation Maintenance and the Subjective Side of the “Friendship Paradox”“. Social Psychology Quarterly 64(3): 207–223. doi:10.2307/3090112.

[2] Nicholas A. Christakis, James H. Fowler. Social Network Sensors for Early Detection of Contagious Outbreaks.PLoS ONE, 5(9): e12948 DOI:





Monday, January 30, 2012

An Analysis of Indian Railways

Graph theory is an important tool while studying several practical problems.
Recently, I read a paper which tries to analyze the problems tormenting Indian
Railways, by mapping the railway network to network flow problem of graphs. A
simple simulation and study of the network highlighted inherent problems with
the prevailing system of Indian Railways.

The authors considered only express trains for the study. The graph had the main
stations as vertices, and an edge exists between two stations if there exists
a trunk route (high capacity route used by express trains) between those two
stations. It was assumed that all trunk routes have the same capacity (that is,
same number of parallel tracks), the trains travel with uniform speed, and there
is no delay (that is, everything runs perfectly according to the time-table).

The results showed that perfection is impossible, given the current time-table
followed and distribution of railway traffic over the country. Even when running
perfectly, trains come within critical distance of each other. Hence, some has
to be delayed. This also increases the risk of accidents in the event of signalling
equipment failure. The increase of railway traffic has also been very skewed, with a
relatively high increase in the number of trains in the Indo-Gangetic Plain ( which
is not a surprise since the Railway Ministers were from UP, Bihar or West Bengal since
1996). However infrastructure improvement has been minimal. As a result, around 20%
of the trains of the region get delayed by more than an hour, and the number of accidents
have also been maximum in the Indo-Gangetic Plain.

The study has stressed the need for improvement of infrastructure (increasing tracks,
restructuring of time-table, etc.). Merely increasing number of trains in every
Railway Budget is only going to worsen the situation.

Link for the paper: http://www.sciencedirect.com/science/article/pii/S0378437112000027?v=s5

Network Metrics


In this Blog , i would like to mention network metrics such as Eigenvector centrality, Katz centrality, Page Rank  and hubs and authorities.
EIGENVECTOR CENTRALITY


In case of eigenvector centrality , the basic idea is to compute the centrality of vertices in terms of centrality
of its neighbours and then  applying this idea to recursively to the neighbours.

Suppose the vertex i has an initial centrality of xi(0) .
Improvement is xi(1)=Sum j(Aij*xj(0)). i.e the summation of centrality of its neighbours.
So, x(t)=Ax(t-1).
Now, expanding it we get , x(t) = A^t*x(0).
Now, we express x(0) as linear combination of eigenvectors vi of adjacency matrix A.
so,x(0)=Sum i(ci*vi)=A^t Sum i (ci*vi)=Sum i(L^t ci*vi ). where Li is the leading eigenvalue.
Dividing the expression by the leading eigenvalue (i.e one whose eigenvalue is the largest) and taking the
limit t-> infinity , we get X(limiting value)=c1v1 where v1 is the leading eigenvector. 
     Possible Problems
In a directed acyclic network , where centralities are zero.

                                                              KATZ CENTRALITY
Here , we give every node small amount of centrality for free( a,b >0)
so xi=a*Sum j (Aij*xij) + b.
This avoids problem of zero centrality.
Writing the exoression in matrix terms we get, x=aAx+bwhere =( 1 1 1 1 ... )T.
so, x=b(I- aA)^-1 1
For katz centrality , we put b=1 and thus we get, x=(I - a A)^-1.


   PAGE RANK
Page rank is the variant of Katz centrality named after Larry Page.It is a link analysis algorithm.
A PageRank results from a mathematical algorithm based on the graph, the webgraph , created by all World Wide web pages as nodes and links as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates an importance of a particular page.A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively  and depends on the number and PageRank metric of all pages that link to it. A Page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page there is no support for that page.


   in mathematical terms, the Page Rank calculation is as follows:
     xi=a Sum j(Aij*xij / kj(out) ) +b.
   where if the out degree of a vertex is zero, the kj(out)=1.


HUBS AND AUTHORITIES
Each node has two types of centralities :
-> Hub centrality
-> authority Centrality
   Authorities -> nodes with useful information
   Hubs -> nodes that tell where good authorites are
 => Authority centrality of a node ( denoted by xi) is proportional to the sum of hub centralities
 pointing to it.
 -> xi=a*Sum j (Aji*yj).
  =>Hub centrality of a node is proportional to the authorities centralities of nodes pointing to it.
 -> yi-b*Sum j (Aij*xj).
 In matrix terms, x=aA^t y , y= bAx. 
substituting the value of y in x and vice versa we get,
=> x= abA^tA x( converges to the principal eigenvector of A^tA)
=>y =abAA^t y ( converges to the principal eigenvector of AA^t).

Saturday, January 28, 2012

Clique Relaxations in Social Network Analysis

          Given large amount of data provided by the Web 2.0, there is a pressing need to obtain new measures to better understand the networks’ structure, how their components are organized and the way they evolve over time. The representation of a social network was quite influenced by graph theory. In the social networks the set of vertices (or nodes) correspond to the “actors” (i.e.people, companies, social actors) and the set of edges corresponds to the “ties” (i.e.relationships, associations, links). 
          When the number of vertices and edges increases the visualization becomes incomprehensible. However, the visualization of a small number of vertices can be easily performed in a graph. For this a new graph mining approach based on k-cliques can be used. The concept of relaxed clique can extended to the whole graph, to achieve a general view, by covering the network with k-cliques.

Clique
          To understand a clique, we need to first learn about the subgraph. Given an undirected graph G= (V, E) where V denotes the set of vertices and E the set of edges, the graph G= (V1, E1) is called a sub-graph of G if V1 is a subset of V, E1 is a subset of E and for every edge (vi, vj) from E1 the vertices vi, vj belong to V1. A sub-graph G1 is said to be complete if there is an edge for each pair of vertices. A complete sub-graph is also called a clique.        
                                       

Cliques with 1, 2, 3, 4, 5 and 6 vertices


          The clique structure, where there must be an edge for each pair of vertices, shows many restrictions in real life modeling. So, alternative approaches were suggested in order to relax the clique concept, such as k-clique, k-clan/k-club and k-plex. 

k-clique
          A k-clique is a subset of vertices C such that, for every i, j belonging to C, the distance d(i, j) <= k. The 1-clique is identical to a clique, because the distance between the vertices is one edge. The 2-clique is the maximal complete sub-graph with a path length of one or two edges. The path distance two can be exemplified by the “friend of a friend” connection in social
relationships. In social websites like the LinkedIn, each member can reach his own connections and also the two and three degrees away. The increase of the value k corresponds to a gradual relaxation of the criterion of clique membership.



k- clique examples   


           A limitation of the k-clique model is that some vertices may be distant from the group, i.e. the distance, between two nodes, may correspond to a path involving nodes that do not belong to the k-clique. To overcome this handicap diameter based models called k-club and k-clan were introduced.

k-clan and k-club
         Let's first learn few definitions,
                 d(u,v)        :  The length of the shortest path between vertices u and v in G 
                 diam(G)   :  The diameter of G max d(u, v) 
         
         To find all k-clan, firstly all the k-cliques Si must be found and then the restriction diam(G[S]) <= k applied to remove the undesired k-cliques. 

          In the figure to the right, the 2-clique {1,2,3,4,5} was removed because  d(4,5)=3.

         
          Another approach to the diameter models is the k-club which is defined as a subset of vertices S such that diam(G[S]) <= k. In the following figure two 2-cliques: {1,2,3,4,5} and {2,3,4,5,6}, one 2-clan: {2,3,4,5,6} and three 2-clubs: {1,2,3,4}, {1,2,3,5} and {2,3,4,5,6} can be found.


                



k-plex
          An alternative way of relaxing a clique is the k-plex concept which takes into account the vertices degree. The degree of a vertex of a graph is the number of edges incident on the vertex, and is denoted by deg(v). The maximum degree of a graph G is the maximum degree of its vertices and is denoted by. Similarly, minimum degree is the minimum degree of its vertices. Asubset of vertices S is said to be a k-plex if the minimum degree in the induced subgraph >= |S| -k.
           In figure given below, the graph has 6 vertices, |S|=6, and the degree of vertices 1, 3, 4 and 5 does not exceed the value 3. So, the minimum degree in the induced sub-graph is 3. Thus, for |S|=6, k=3 is obtained.
3-plex