## rankalgorithm Last Updated 2020-10-26 07:39:18

Created At

#pagerank

Calculating the PageRank of nodes in a directed graph.

##Description A PHP implementation of the PageRank algorithm described in The PageRank Citation Ranking: Bringing Order to the Web (PDF). The project was created to support my session "PageRank und TrustRank" at the AFS - Akademie für Fortbildung in SEO and gives some basic examples as well as something to get interested participants started quickly.

The PageRank itself is an indicator of the importance of a node in a linked graph. The basic idea behind the algorithm is as follows:

• an edge ("link") between two nodes in a (directed) graph can be regarded as an endorsement from the originating node to the target node
• hence, the more nodes link to "me" the more important "I" am
• moreover, the more important "I" am, the more weight carries my endorsement

In this implementation an iterative approach is used to calculate the PageRank (PR) (source: Wikipedia): where p_1, p_2, ..., p_N are the pages under consideration, d is a damping factor to avoid "rank sinks", M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages. The PR values change on every iteration step until a predefined threshold for the difference between old and new PR value is reached.

Probably the most important application of the PageRank algorithm is being a part of the ranking algorithm of (web) search engines. The PageRank is commonly known as the most important piece that set Google apart from other search engines back in 1998. See The Anatomy of a Large-Scale Hypertextual Web Search Engine for reference.

##Basic Usage

```// define the nodes
\$a = new Node("a");
\$b = new Node("b");
\$c = new Node("c");
\$d = new Node("d");
\$e = new Node("e");
\$f = new Node("f");
\$x1 = new Node("x1");
\$x2 = new Node("x2");
\$x3 = new Node("x3");
\$x4 = new Node("x4");
\$x5 = new Node("x5");

// define the links between the nodes
\$graph = new Graph();

// calculate the PageRank
\$pageRank = new PageRank();
\$result = \$pageRank->calculatePagerank(\$graph);

// print the result
\$formatter = new ResultFormatter(4);
echo \$formatter->toString(\$result);```

Output

``````93. Round
Node	OldPr	NewPr	Difference
b	    0.3242	0.3242	-0
c	    0.2892	0.2892	0
d	    0.033	0.033	0
a	    0.0276	0.0276	0
e	    0.0682	0.0682	0
f	    0.033	0.033	0
x1	    0.0136	0.0136	0
x2	    0.0136	0.0136	0
x3	    0.0136	0.0136	0
x4	    0.0136	0.0136	0
x5	    0.0136	0.0136	0
``````

The `Graph` in the example corresponds to the following graphic, taken from the Wikipedia article on PageRank: ###Examples

See `example` folder.

##Requirements

• PHP >= 5.5

##Installation

The recommended way to install pagerank is through Composer.

``````curl -sS https://getcomposer.org/installer | php
``````

Next, update your project's composer.json file to include pagerank:

``````{
"repositories": [ { "type": "composer", "url": "http://packages.myseosolution.de/"} ],
"minimum-stability": "dev",
"require": {
"paslandau/pagerank": "dev-master"
}
"config": {
"secure-http": false
}
}
``````

Caution: You need to explicitly set `"secure-http": false` in order to access http://packages.myseosolution.de/ as repository. This change is required because composer changed the default setting for `secure-http` to true at the end of february 2016.

After installing, you need to require Composer's autoloader:

`require 'vendor/autoload.php';`

#General workflow and customization options

Firstly, we need to define a set of connected `Node`s that form a `Graph`. Usually, that data will be obtained from a web crawl that saves the outgoing links of each crawled web page e.g. as CSV file. But for now let's assume we're building the `Graph` manually from scratch:

```// define two Nodes
\$a = new Node("a");
\$b = new Node("b");

// create the Graph object
\$graph = new Graph();

// connect the two nodes by linking ("creating an Edge) from \$a to \$b

In this scenario, `Node` `\$a` links to `Node` `\$b`. Next, we need an instance of the `PageRank` class to perform the PageRank calculation on the `Graph` object:

```// create the PageRank object
// providing only null values yields the default settings (see doc comments of the constructor in the PageRank class)
\$dampingFactor = null;
\$maxRounds = null;
\$maxDistance = null;

// calculate the PageRank
\$keepAllRoundData = null;
\$result = \$pageRank->calculatePagerank(\$graph, \$keepAllRoundData);```

The different customization options will be explained subsequently.

`PageRank::calculatePagerank()` returns an object of type `PageRankResult` that holds a reference to the original `Graph` object as well as an array of `PageRankNode`s. A `PageRankNode` has a reference to the original `Node` object as well as a value for it's current PageRank and it's PageRank in the former round of calculation. Why is that?

This implementation calculates the PageRank iteratively. In each iteration, the PageRank values will vary - heavily in the beginning, less and less towards the end. This becomes clear once you realize that the PageRank of page B depends on the PageRank of page A (assuming page A links to page B) but the PageRank of page A itself relies on other pages linking to A. In fact it's not uncommon that one of those pages linking to A will be linked to from page B - creating a nice little circular interdepency between those pages. To solve this problem, we'll fixate the current PageRank of each node during an interation step and use those fixated values as base for the calculation. Once we finished the calculation for all nodes, we can set those "new" values as the current PageRank and start the next iteration. The difference between "old" and "new" PageRank will decrease over time (i.e. number of iterations steps) and the calculation terminates once a certain threshold is reached. So basically:

```    /* Pseudocode. Well, kinda.. */

\$round = 1;
do{
\$newPrs = [];
// first, calculate the PR for all nodes
foreach(\$nodes as \$key => \$node){
\$newPr = \$node->calculatePagerank(); // get linking nodes and their current PR values and calculate the PR
\$newPrs[\$key] = \$newPr; // cache the new PR
}
// second
foreach(\$nodes as \$key => \$node){
\$node->setOldPr(\$node->getCurrentPr()); // set current PR as "old"
\$node->setCurrentPr(\$newPrs[\$key]); // get newly calculated PR and set as "current"
}
// yey, next round
\$round++;
}while(\$difference > \$threshold);```

To access the final PageRank values of each original node, use the `PageRankResult::getNodes()` method like so:

```    \$pagerankNodes = \$result->getNodes();
foreach(\$pagerankNodes as \$node){
echo \$node->getName()." has a final PageRank of ".\$node->getNewPr()."\n";
}```

##Customization options There is a number of options to calibrate the PageRank calculation.

###Damping factor The damping factor is a value between 0 and 1 and is used to avoid the accumulation of PageRank in so called "rank sinks". Such a rank sink emerges when two nodes link to each other in a circular manner. In the original paper, a value of `0.85` is proposed.

```// set on object instantiation...
\$dampingFactor = 0.55;
\$pageRank = new PageRank(\$dampingFactor);

// or via setter
\$pageRank->setDampingFactor(\$dampingFactor);```

###Limit runtime Usually the PageRank calculation terminates as soon as the difference between old and new PageRank values of two subsequent iterations exceeds a certain threshold. But you can also choose to let the calculation run for a fixed number of iterations. Whatever limit is reached first terminates the calculation.

```// set on object instantiation...
\$maxRounds = 100;
\$maxDistance = 0.00001;
\$pageRank = new PageRank(null,\$maxRounds,\$maxDistance);

// or via setter
\$pageRank->setMaxRounds(\$maxRounds);
\$pageRank->setMaxDistance(\$maxDistance);```

###Collapse links To my current knowledge there is no clear statement on how to handle multiple identical links (e.g. web page A has three outgoing links to web page B). Since usually the term set is used when talking about the incoming links, I tend to think that mutiple identical links should be considered as only one link. But one might also argue (from a web perspektive) that multiple links to the same page increase the likelyhood of at least one of those links to get followed.

So I decided to leave that decision to the user by providing a `\$collapseLinks` flag. If `true`, multiple links to the same `Node` will only count as 1. Otherwise (on `false`; the default), multiple links to the same page are not treated differently than "ordinary" links.

```// set on object instantiation...

// or via setter

##Keeping historic calculation data As mentioned in the introduction, this project was created with an educational purpose in mind: Give the audience a better understanding of the concept of PageRank and the way it is calculated iteratively. So it was important to me to visualize the changing PageRank values after each iteration. To capture those temporary results, the `PageRank::calculatePagerank()` method takes the second argument `\$keepAllRoundData`. If `\$keepAllRoundData` is `true` (default is `false`), the `PageRankResult` will have an array of `PageRankHistory` objects (one for each round; otherwise only the final round is stored).

Each `PageRankHistory` object will have an array of `PageRankHistoryEntry`s, whereas each `PageRankHistoryEntry` has a reference to the corresponding `PageRankNode` as well as the old and new PageRank values of that iteration. Example:

```// create the PageRank object
\$pageRank = new PageRank();

// calculate the PageRank
\$keepAllRoundData = true; // keep history
\$result = \$pageRank->calculatePagerank(\$graph, \$keepAllRoundData);

// get the history
\$history = \$result->getHistory();

//iterate over all histories
foreach(\$history as \$h){
//iterate over each entry
echo "Round \$h->getId()."\n";
foreach(\$h->getEntries() as \$entry){
echo "Node {\$entry->getNode()->getName()} had an old PageRank (before the calculation in this iteration) of {\$entry->getOldPr()} and {\$entry->getNewPr()} afterwards."\n";
}
}```

##Import link data from CSV It is cumbersome to define a graph manually. Using a CSV format (one column contains the originating node, another one the target node) feels more natural. There are two importers ready to go: `CsvImporter` - a generic importer for CSV files and `SEOSpider` - a subclass of `CsvImporter` that is adjusted to the output format of the desktop based crawling software Teknosys SEO Tool.

Example generic CSV

``````linkFrom,linkTo
example.com, example.com/foo
example.com, example.com/bar
``````

Usage

```\$hasHeader = true;
\$encoding = "utf-8";
\$delimiter = ",";
\$pathToFile = "...";

\$graph = \$csvImporter->import(\$pathToFile);```

Example Screaming Frog CSV

``````"All Inlinks"
"Type","Source","Destination","Alt Text","Anchor","Status Code","Status","Follow"
"HREF","example.com","example.com/foo","","Home","200","OK","true"
"HREF","example.com","example.com/bar","","Home","200","OK","true"
``````

Usage

```\$csvImporter = new ScreamingFrogCsvImporter();
\$pathToFile = "...";

\$graph = \$csvImporter->import(\$pathToFile);```