Saturday, March 31, 2012
Wednesday, March 28, 2012
Trawling the Web for Cyber-Communities
Algorithms such as the HITS algorithm, essentially are search algorithms designed to find high-quality pages (authorities) about a fixed topic. How should one extend such algorithms if the need is to enumerate all topics (under a certain definition)?
We first need to carefully define what is meant by a topic in the web graph. One can easily see that each topic present in the web graph structure can be detected by looking for complete bipartite graphs Ki,j in the graph, as shown in the figure given below.
It is important to note that the hub pages may not co-cite all the authority pages in such cyber-communities, but, every such community will contain a bipartite core Ci,j for non-trivial value of I and j. This process of enumerating all possible bipartite cores in the web for some fixed value of i and j is called trawling.
A natural question that arises is that for small values of i and j, the cores that are found, do they actually represent cyber communities or are they present to co-incidental citations? Experiments on the web graph suggest that only a miniscule number of such bipartite cores are coincidental. But how does one actually run such experiments on a web sized graph? The naïve search algorithms does not scale up to this size.
An algorithmic paradigm called the elimination-generation paradigm is used. The algorithm makes a number of passes over the Web graph. The paradigm is explained below:
Elimination: There are often conditions that are necessary that must be satisfied for a node to participate in a subgraph which can form a bipartite core. For example, consider the case of C4,4, any node with in-degree less than or equal to 3 cannot participate on the right side of such a core. Thus, edges that are directed into such nodes can be pruned from the graph. Similarly, nodes with out-degree less than or equal to 3 cannot participate on the left side. Such necessary conditions form elimination filters.
Generation Here we consider nodes that barely qualify for potential membership in an interesting subgraph, as it can easily be verified if they belong to the subgraph or not. Consider the example of C3,3. Let u be any node with degree exactly 3. Then, such a node can belong to the subgraph if and only if the 3 nodes it points to have a neighborhood intersection of size atleast 3. This property forms what we call a generation filter. After applying such a filer, regardless of whether we find a core or not, we can prune the node, since all potential interesting cores containing it have already been enumerated.
The above mentioned algorithm makes several passes through the web graph, eliminating nodes in each pass. In most of the experiments, the benefits of elimination/generation tail off as fewer and fewer nodes are eliminated at each phase, and the algorithm stops after some time.
It must be noted that bipartite cores are not the only interesting subgraphs which constitute enumeration problems, we can also look at bidirectional stars (webrings), cliques, directed trees etc. Experiments run on the web graph help us study some of the local properties of the web.
It was observed that both the in-degree and out-degree of the nodes follow a Zipfian distribution, with the probability that a node has degree i being proportional to 1/ia. A question that arises is that, with the random graph model that is used to model the web graph, can such properties be effectively introduced in the model?
Number of Bipartite Cores
Trawling experiments run on the a crawl of the web which contains about 100 million webpages brings up the following relation between the number of bipartite cores found with the size of the left side of the core (i), and with the size of the right side of the core (j). Again, a question arises, can such properties be effectively modeled by a random graph model for the web?
1. The Web as a Graph: Measurements, Models and Methods, Kleinberg et. al
Saturday, March 17, 2012
Sunday, March 11, 2012
Wednesday, March 7, 2012
The important point to notice about Eq. (5) is that the vertex-vertex distance increases logarithmically with the graph size n, i.e., it grows rather slowly. Even for very large networks we expect the typical distance through the network from one vertex to another to be quite small. In social networks this effect is known as the small-world effect, and was famously observed by the experimental psychologist Stanley Milgram in the letter-passing experiments he conducted in the 1960s (Milgram, 1967; Travers and Milgram, 1969; Kleinfeld, 2000).
 M.E.J Newman, Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, U.S.A.
"Random graphs as models of networks"
Monday, March 5, 2012
It means there is certain correlation between different frequencies produced from any source. In "Collective behavior of stock price movements in an emerging market" by Pan and Sinha (PRE 76, 046116, 2007) they have analyzed the time series data of different stock prices to find out correlation between different stocks. We can find the the exact correlation between these frequencies by doing similar network analysis of .wav file in following steps -
- Read wave file in matlab using wavread and generate spectrogram using Short Term Fourier Transform.
- We will get a matrix P containing Power Spectral Density(PSD) of each segment. We can find correlation matrix for different frequencies usinf corr() function in matlab on P.
- We can find eigen values and eigen vectors for this correlation matrix using eig() function.
- Let's us say N is the no. of frequencies and T is length of time series. If the N return time series of length T are mutually uncorrelated, then the resulting random correlation matrix is called a Wishmart matrix, whose statistical properties are well known . In the limit , such that Q=T/N is greater than or equal to 1, the eigenvalue distribution of this random correlation matrix is given by for and 0 other wise. The bounds of this equation are given by
- In absence of any correlation among stocks, the distribution should be bounded between lambda_min and lambda_max. Since, we are expecting correlation here, we will ignore all eigenvalues between lambda_min and lambda_max and principal eigenvalue as well.
- From the filtered eigenvalues, we construct the correlation matrix again, C_new.
- The adjacency A matrix of C_new is generated by choosing a threshold correlation c_th such that, A_ij =1 if C_new_ij > c_th otherwise 0.
- Using A, we can construct network of frequencies. This network can be used to determine which frequencies occur together by community discovery methods.
Sunday, March 4, 2012
Figure 1 shows how having multiple species present in a plant community can stabilize ecosystem processes if species vary in their responses to environmental fluctuations such that an increased abundance of one species can compensate for the decreased abundance of another. Biologically diverse communities are also more likely to contain species that confer resilience to that ecosystem because as a community accumulates species, there is a higher chance of any one of them having traits that enable them to adapt to a changing environment. Such species could buffer the system against the loss of other species.
In contrast, if stability is defined at the species level, then more diverse assemblages can actually have lower species-level stability. This is because there is a limit to the number of individuals that can be packed into a particular community, such that as the number of species in the community goes up, the average population sizes of the species in the community goes down. For example, in Figure 2, each of the simple communities can only contain three individuals, so as the number of species in the community goes up, the probability of having a large number of individuals of any given species goes down. The smaller the population size of a particular species, the more likely it is to go extinct locally, due to random — stochastic — fluctuations, so at higher species richness levels there should be a greater risk of local extinctions. Thus, if stability is defined in terms of maintaining specific populations or species in a community, then increasing diversity in randomly assembled communities should confer a greater chance of destabilizing the system.
Cedar Creek Biodiversity Experiment
This experiment determines effects of plant species numbers and functional traits on community and ecosystem dynamics and functioning. It manipulates the number of plant species in 168 plots, each 9 m x 9 m, by imposing plant species numbers of 1, 2, 4, 8, or 16 perennial grassland species. The species planted in a plot were randomly chosen from a pool of 18 species (4 species, each, of C4 grasses, C3 grasses, legumes, non-legume forbs; 2 species of woody plants). Its high replication (about 35 plots at each level of diversity) and large plots allow observation of responses of herbivorous, parasitoid and predator insects and allow additional treatments to be nested within plots. Planted in 1994, it has been annually sampled since 1996 for plant aboveground biomass and plant species abundances and for insect diversity and species abundances. Root mass, soil nitrate, light interception, biomass of invading plant species, and C and N levels in soils, roots, and aboveground biomass have been determined periodically. In addition, soil microbial processes and abundances of mycorrhizal fungi, soil bacteria and other fungi, N mineralization rates, patterns of N uptake by various species, and invading plant species, have been periodically measured in subprojects in the Biodiversity Experiment.
Plant biomass production increased with diversity because of complementary interactions among species and not because of selection (sampling) effects .
Foliar fungal disease incidence decreased at higher diversity because of greater distance between individuals of a species, and resultant lower rates of disease spread.
Greater plant diversity led to greater diversity of herbivorous insects, and this effect continued up the food web to predator and parasitoid insects.
Fewer novel plant species invaded higher diversity treatments because of their lower soil NO3 levels, greater neighborhood crowding and competition, and greater chance that functionally similar species would occur in a given neighborhood .
Greater plant species numbers led to greater ecosystem stability (lower year-to-year variation in total plant biomass) but to lower species stability (greater year-to-year variation in abundances of individual species), with the stabilizing effect of diversity mainly attributable to statistical averaging effects and overyielding effects .
The growing popularity of handheld devices, equipped with wireless interfaces like cellular network, Wifi and Bluetooth, along with the augment in their storage capacity, has resulted in a new communication paradigm - Delay Tolerant Content Distribution, where each individual carrying the device (node), can receive some content at a certain time, store it and then forward it to others at a later time.This store and forward mechanism can be made more efficient by using the concept of Egocentric Networks.
The nodes are ranked mainly based on their Egocentric Betweenness. Other parameters like pattern of meeting between the nodes, connection quality, content accuracy (identifying the node which provides interesting data) are also taken into consideration. The Egocentric Betweenness Bi of a node i is defined as the number of pairs of neighbors of i that are not directly connected to each other. Due to the dynamic nature of wireless Delay Tolerant Networks, Bi is averaged along time to give the Average Egocentric Betweenness. Nodes with higher value of the average egocentric betweenness are given higher ranks, as they have larger inﬂuence in the network because many nodes depend on them to make connections with other nodes.
The content exchange process works as follows - each node sends a content request to the highest rank node in its locality having content which is of interest to it. Locality here can be defined as the N-step neighborhood in the egocentric network of the node, for some empirical value of N. Let, for a particular case, the node requesting the content be A and the node receiving the request be B. B may receive multiple requests. If that happens, it accepts the request of the highest rank requesting node and rejects the others. Thus, the request from A can be accepted or rejected by B. If B accepts the request, A starts the download from B. The content is thus made available to A, which stores it and can forward it to other nodes on being requested. However, if B rejects the request, A sends the request to the next highest rank node and so on. Thus, the content is efficiently distributed among all nodes that are interested in it.
This type of distribution mechanism can be particularly useful for advertising, where the final objective is to achieve maximum dissemination, while taking the interest of the users into consideration.
 R. Cueves, E. Jaho, C. Guerrero, I. Stavrakakis: OnMove: A protocol for Content Distribution in Wireless Delay Tolerant Networks based on Social Information
Friday, March 2, 2012
If there are 'n' random indipendent variables with same success probability 'p', then binomial distribution will be:
and binomial(1,p) will be bernoulli distribution.
Probability mass function(PMF) is just a function which gives probability that a discrete random variable is equal to some value. This function gives probability of exactly k successes in the experiment B(n,p). Bernoulli experiment can lead us to find negative binomial geometric, many more other distribution.
Bernoulli probability mass function:
can also be like:
Binomial probability mass function:
Poisson probability mass function:
Poisson distribution is actually binomial distribution under the limiting condition that , & np = and is a constant. Under these restrictions, the limiting form of the Binomial distribution will be poisson distribution.
Some properties of network can be analyse through these distribution. poisson distribution shows it's use in fields of counting, for example vehicle movement on roads, accedents resord of city and where should we open new medicle centre for faster result.