How ranking algorithms work, using the examples of Reddit and HackerNews
This article reviews how the simplest algorithms used to rank articles are computed.
To understand how ranking works, first look at what the site is for and how it orders articles on its main page. If the purpose is to provide information (Quora, Stack Overflow), the first priority is to give users the data they are looking for, regardless of how old the content is. In such systems, ranking is based on the most popular answers, and promoting «hot posts» is not helpful.
If the system is a social site whose main content is entertainment or news, users should see only current and relevant information. So you need to choose which kind of algorithm to apply when sorting the content. This aspect is covered in more detail below.
The calculation should be guided by the following indicators:
- The number of likes or votes. This is the most reliable indicator, because a user must be logged in and must take a deliberate action. Each user can vote only once, by pressing «Like» or «+», depending on whether they liked the post. Accordingly, the more likes a post has, the higher its rating.
- Publication time. This indicator identifies recent posts so that they can be shown to users. To reach the same score, an older article needs more votes and a newer one needs fewer. For example, on Reddit the publication time directly affects an article's rating.
- The number of comments and commenters. This indicator is less significant. Whatever the purpose of a comment (praise, criticism, resentment, etc.), it helps when calculating the ranking, because a lot of comments indicates audience interest in the information.
- Views and their duration. This is considered a questionable indicator, so its use in ranking should be limited. Nevertheless, it is sometimes used: Google, for instance, reportedly tracks when users pause or stop viewing a page when rating web pages in search.
A brief overview of the formula
A simple ranking rests on one basic idea: an article earns points from upvotes, and those points are reduced as time passes.
The score is therefore not fixed: each upvote is worth a certain number of points, and that value shrinks over time. Often an article becomes popular and then keeps receiving 6 or more upvotes every day. This is called the «snowball effect», and a high upvote count is common for such articles. Below we look at a few popular websites that use their own ranking formulas and analyze how they overcome the snowball effect.
The old Reddit algorithm
Reddit used to be open source, but its development later moved to a closed format. The last open version of the Reddit source code is still widely available.
One part of that code computes the score used to surface relevant posts.
The algorithm can be explained as follows. It starts from a constant:
- 1134028003 – a fixed reference time in seconds (Dec. 8, 2005, 7:46 am UTC) that marks the point from which Reddit counts time.
The result is the following: seconds = date – 1134028003, where date is the publication time as a Unix timestamp.
Let s be the difference between the article's upvotes and downvotes: s = ups – downs.
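To take the logarithm of s, its absolute value is used, and any value below 1 is replaced by 1: n = max(|s|, 1).
The formula used in Reddit then combines the parts described above, where sign is 1, 0, or -1 depending on whether s is positive, zero, or negative: score = log10(n) + sign * seconds / 45000. Its two terms are: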
- log10(n) – the points earned from voting.
- sign * seconds / 45000 – the points earned from the time of publication.
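The steps above can be sketched in Python. This is a minimal sketch built only from the terms described in this section, not a copy of Reddit's source. The names `hot` and `REDDIT_EPOCH` are illustrative, and the sketch assumes `date` is a timezone-aware datetime.

```python
from datetime import datetime, timezone
from math import log10

REDDIT_EPOCH = 1134028003  # fixed reference time in seconds, as given in the text


def hot(ups: int, downs: int, date: datetime) -> float:
    """Score a post from its votes and its publication time."""
    s = ups - downs
    # Clamp to at least 1 so the logarithm is defined when s is 0 or negative.
    n = max(abs(s), 1)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    # Seconds between the fixed reference time and the publication time.
    seconds = date.timestamp() - REDDIT_EPOCH
    return round(log10(n) + sign * seconds / 45000, 7)


published = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)
print(hot(ups=120, downs=20, date=published))
```

Note that in this form the time term is multiplied by sign, so a post with more downvotes than upvotes is pushed further down the newer it is.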
One question remains open: why divide by 45000? This is explained below.
Calculating points from the time of publication
This part determines how recently published posts get a higher score than previously published ones. Instead of deducting points as a post ages, Reddit adds points to newer posts. Dividing by 45000 means that a post published 45000 seconds (12.5 hours) later starts with one extra point.
An article with 10 votes is rated log10(10), that is, 1 point. A post with twice as many votes gets log10(20) ≈ 1.3 points. An article with 100 votes gets log10(100), which is 2 points.
Another example makes this clearer. Suppose an article was published 3 days ago and should still rank higher than one published right now. The new article has a head start of 259200 / 45000 = 5.76 points, so the older article needs about 10^5.76 ≈ 575,000 votes to make up the difference.
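The arithmetic behind this example can be checked with a short sketch. It assumes the vote term is log10 of the net votes and the time term is seconds / 45000, as described above. The helper names `head_start` and `votes_to_offset` are hypothetical.

```python
SECONDS_PER_POINT = 45000  # divisor from the formula in the text


def head_start(age_gap_seconds: float) -> float:
    """Points a newer post gains over a post that is older by the given gap."""
    return age_gap_seconds / SECONDS_PER_POINT


def votes_to_offset(age_gap_seconds: float) -> float:
    """Net votes whose log10 equals that head start."""
    return 10 ** head_start(age_gap_seconds)


three_days = 3 * 24 * 60 * 60  # 259200 seconds
print(f"head start: {head_start(three_days):.2f} points")
print(f"votes needed: {votes_to_offset(three_days):,.0f}")
```

For a three-day gap the head start is 5.76 points, which corresponds to several hundred thousand net votes. This is why old posts rarely stay on the front page.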
The advantages of using a logarithmic function are:
- The first votes carry the most weight.
- Later votes carry progressively less weight, although they still count toward the total.
- The vote count still determines the popularity of a post (the more votes, the better).
The logarithm allows Reddit to deal effectively with the snowball effect and to show relevant, up-to-date content on its main page.
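The diminishing weight of later votes can be seen in a small sketch. It assumes the vote term is log10 of the net vote count, clamped to at least 1. The function name `vote_points` is illustrative.

```python
from math import log10


def vote_points(net_votes: int) -> float:
    """Points from votes: log10 of the net vote count, clamped to at least 1."""
    return log10(max(abs(net_votes), 1))


previous = 0.0
for votes in (10, 100, 1000, 10000):
    points = vote_points(votes)
    # Each tenfold increase in votes adds the same single point.
    print(f"{votes:>6} votes -> {points:.1f} points (+{points - previous:.1f})")
    previous = points
```

The first 10 votes are worth as much as the next 90, which in turn are worth as much as the next 900. A post that is already popular gains very little from each additional vote.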
The HackerNews algorithm
HackerNews, like Reddit, has open source code. It is written in the Arc language, and that source includes the code used to rank articles.
For ease of understanding, the source code has been reduced to a simpler formula.
In the HackerNews system, the formula is easier to understand than Reddit's:
- p – the votes in support of the article. 1 is subtracted from it to ignore the author's own vote.
- t – the time between publication and the current time, in hours. For instance, a message posted 2 hours ago has t = 2.
- G – a constant that defaults to 1.8.
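With the symbols listed above, the simplified formula is commonly written as shown here. The +2 offset added to t is not described in this article. It is taken from the widely cited simplified form and should be treated as an assumption.

```latex
\text{score} = \frac{p - 1}{(t + 2)^{G}}, \qquad G = 1.8
```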
HackerNews divides the number of votes by a function of the time since publication to calculate the rating. There are some features to consider, including:
- The score is inversely proportional to a power of the article's age, so it falls as the age grows.
- G acts as a gravitational constant that controls how fast the score declines. The higher it is, the lower the score will be after a given period.
- With G = 1.8, the article's score after a day is much lower than the original, and later it approaches 0.
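The decay described above can be illustrated with a minimal sketch. It assumes the commonly cited simplified form (p - 1) / (t + 2)^G. The +2 offset and the function name `hn_score` are assumptions that do not come from this article, and this is not the Arc source.

```python
def hn_score(p: int, t_hours: float, gravity: float = 1.8) -> float:
    """Simplified score: votes minus the author's own, divided by age to the power G.

    The +2 offset is an assumption taken from the commonly cited simplified form.
    """
    return (p - 1) / (t_hours + 2) ** gravity


# The same 101-vote article at different ages.
for hours in (0, 2, 24, 72):
    print(f"t = {hours:>2} h -> score {hn_score(101, hours):.3f}")

# A larger G makes the score fall faster.
print(hn_score(101, 24, gravity=1.8) > hn_score(101, 24, gravity=2.0))
```

With the default G, the score after 24 hours is a small fraction of the initial score, and it keeps shrinking toward 0 without reaching it.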