Future Impact Football Analytics
This project is maintained by Taylor Killian, Abhishek Malali and Virgile Audi
The primary motivation for the project outlined below is to answer the following question: Can we evaluate and infer a player's value to his team based solely on the impact that he has on matches? To simplify our analysis, we define "impact" to mean the goals and assists a player provides for his team during various game states.
When football clubs do not perform up to their own and their fans' expectations, they look to infuse their team with more talented players, typically drawn from lower-tier or financially poorer teams. The financial incentive for both teams in any transaction is clear: the buying team hopes that the new addition will put them over the top talent-wise and improve their standing in league competition, while the selling team may look at the new funds as a way to prepare for the future and secure several new players to improve their team long-term.
Within the last 5 years, the economy of player transfers has undergone significant inflation, with top-tier clubs showing a willingness to overpay for marginal talent or for players that may not be a good fit in their team. There is some notion that the transfer system is broken and in need of revitalization. There are multiple ways to address the psychology of club executives and try to identify their motivations for spending money to procure new talent for their team. We refrain from any investigation into club finances; instead, we want to determine a method by which we can accurately and objectively place a value on any player based on the impact that he has on the matches he plays in.
Our primary motivation is to quantify the impact (based on goals scored, assists made and consistency of play) an individual player has on the games in which he plays for his team. We intend to use this quantity for each player to infer his value (agnostic of team at first). With this calculated player impact value, we attempt to group and classify separate tiers of player impact, with the goal of identifying which players have performed up to the value placed on them in reality.
The player transfer market has seen ever-increasing transfer amounts over the years. Ever-increasing broadcasting deals have made the leagues richer, which is reflected in the spending. We looked at the top 100 transfers in the last 20 years. The last five years alone account for 65 transfers in the top 100 list, with the transfer record being shattered twice. This trend motivated us to investigate whether the market was inflated and whether or not the players justified their respective transfer amounts.
The data we use for our analysis is scraped from World Football.net. World Football has match-by-match data on the players who participated in each game as well as the goal scorers. The site had data for all five leagues in which we were interested. The URLs were not hard to deal with provided we had a list of the teams that participated in a league for that year. An example url that we visited to acquire data can be viewed here. We scraped the goal scorers and the players who assisted them, along with the time at which each goal was scored. We also recorded whether the goal scorer or assist maker was substituted. We scraped the data from the top 5 European leagues (British, Spanish, German, Italian and French) for the complete 2012-13 and 2013-14 seasons. Overall we scraped 760 pages for each league, which totaled ~3800 webpages.
The transfer values were scraped from the Wikipedia page for transfers for each league for each season we considered. The French and German leagues did not have transfer values listed for the seasons under consideration. The Spanish league did not have transfer data available for the summer of 2014. An example of the data which we had to scrape can be viewed here.
We wanted to visualize where the big-money transfers were happening. The plot below was built in D3 specifically for that purpose. It was meant to show how a player's price relates to different metrics such as goals, assists and the metric score generated by us. This could also help find groupings of players who were available in a similar price range while providing the same attacking influence in terms of goals.
The aim of deriving this metric was to quantify a player's impact on a game independently of his team's performance, similar to the plus-minus metric used in basketball. From the data we acquired, we consider goals scored and all relevant aspects of each goal (i.e. scorer name, assist maker name, time of match, type of goal scored, etc.).
We split each 90-min match into three time intervals: Early (≤ 20 min), Mid (> 20 min and < 80 min) and Late (≥ 80 min). We also split the goals/assists into three categories: tiebreakers, equalizers and nominal. Each of these kinds of goals/assists was weighted differently depending on when in the match it was recorded.
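The bucketing described above can be sketched in a few lines of Python. The time thresholds come straight from the text. The definitions of the three goal categories are not spelled out in this write-up, so the ones used here (and the function names) are our own assumptions for illustration only.

```python
def time_interval(minute):
    """Bucket a goal minute using the thresholds stated in the text."""
    if minute <= 20:
        return "early"
    if minute >= 80:
        return "late"
    return "mid"


def goal_category(scoring_side_goals, other_side_goals):
    """Classify a goal from the score just BEFORE it was scored.

    Assumed definitions (hypothetical, not confirmed by the write-up):
      tiebreaker: the score was level, so the goal creates a lead
      equalizer:  the scoring side trailed by one, so the goal levels the score
      nominal:    every other situation
    """
    diff = scoring_side_goals - other_side_goals
    if diff == 0:
        return "tiebreaker"
    if diff == -1:
        return "equalizer"
    return "nominal"


# A goal in the 85th minute at 1-1 would be a late tiebreaker.
label = (time_interval(85), goal_category(1, 1))
```

Under these assumptions an assist would simply inherit the interval and category of the goal it led to.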
Players were given a bonus if they recorded a statistic after coming into the match as a substitute. We also gave a bonus to players who recorded a positive statistic while on the "away" team (as a rule, to simplify our analysis and not dwell on extremely rare events, we ignore own goals). It is well understood and intuitive that it is more difficult in football to score away from home; we chose to reflect that in our analysis with a small increment in the weights.
We also chose to reward players for consistency. We view a player as more valuable if he scores a few goals frequently than if he scores a lot at one time but is otherwise absent from the scoresheet. To do this, we incrementally increased the weight of a goal/assist for every match in a continuous streak in which the player registered a positive statistic.
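To make the weighting scheme concrete, here is a minimal sketch of how a season score could be accumulated. Every number below is a hypothetical placeholder, since the actual weights are not listed in this write-up. The way the pieces are combined (a category weight times an interval weight, plus additive bonuses) is likewise an assumption made for illustration.

```python
# Hypothetical placeholder weights; the real values are not given in the text.
CATEGORY_WEIGHT = {"tiebreaker": 1.5, "equalizer": 1.25, "nominal": 1.0}
INTERVAL_WEIGHT = {"early": 1.0, "mid": 1.0, "late": 1.25}
SUBSTITUTE_BONUS = 0.25   # contribution made after coming on as a substitute
AWAY_BONUS = 0.125        # contribution made for the away team
STREAK_STEP = 0.125       # added per consecutive earlier match with a contribution


def contribution_weight(category, interval, as_substitute, away, streak_length):
    """Weight of a single goal or assist under the assumed scheme."""
    weight = CATEGORY_WEIGHT[category] * INTERVAL_WEIGHT[interval]
    if as_substitute:
        weight += SUBSTITUTE_BONUS
    if away:
        weight += AWAY_BONUS
    return weight + STREAK_STEP * streak_length


def season_score(matches):
    """Sum weights over matches given in chronological order.

    `matches` is a list with one entry per match; each entry is a list of
    contribution dicts (empty when the player had no goal or assist).
    """
    total, streak = 0.0, 0
    for contributions in matches:
        if not contributions:
            streak = 0          # a blank match ends the streak
            continue
        for c in contributions:
            total += contribution_weight(
                c["category"], c["interval"], c["as_substitute"], c["away"], streak
            )
        streak += 1
    return total


plain = {"category": "nominal", "interval": "mid", "as_substitute": False, "away": False}
consistent = season_score([[plain], [plain], [plain]])   # one per match
burst = season_score([[plain, plain, plain], [], []])    # all in one match
```

With these placeholder numbers, three contributions spread over three consecutive matches score higher than three contributions in a single match, which is the behaviour the consistency reward is meant to produce.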
As football enthusiasts, a key question we wanted to address was which players are valued similarly according to our metric. This would be meaningless if it were done using only goals and assists, since those numbers can distract from a player's true impact on his team. Hence we used the entirety of the data, including the unwrapped contribution time series. This clusters players who contribute similarly within games as well as over the seasons.
We found clusters of similarly influential players for both seasons. An interesting way of looking at long-term performance would be finding players who move between clusters over seasons. This can indicate a player's improvement or decline in performance from season to season. With enough data and multiple seasons, we would then be able to track a player's trajectory and possibly predict his future performance.
An interesting note to make here is that Messi is classified with Robin van Persie, who had fewer goals and assists. Van Persie was, however, equally influential in helping his team to the British Premier League title. These were the kinds of clusters and observations we were hoping to see when we processed and analyzed our data.
Using traditional statistics, we look at the top players in terms of goals and assists. The first plot shows these players for the 2012-13 season. The top players are typically those you may have heard of even if you aren't an ardent soccer fan.
We evaluate the top players for the 2012-13 season using the metric which we have defined. Our list of top 20 players is very similar to the list we generated using traditional statistics. There are, however, a few names that might come across as anomalies, like Stefan Kiessling. This shows the intent of our metric, where credit is given to players who contribute to their team's success but are easily overlooked when only looking at their absolute statistics. This demonstrates the utility of valuing players based on the impact they have on the outcome of a match.
We chart the ranking according to our metric against the ranking from traditional statistics. The 45-degree line is plotted for reference. We see that certain players who were ranked highly by the goals + assists metric fell down the ranking ladder when the metric changed. The metric does not affect the top players, since we see a close grouping near the origin. But as we go down the ranking ladder we see significant jumps. We find many players who weren't ranked well before now within the top 50. This suggests that we are doing a decent job of finding underrated players who contribute immensely but go unnoticed.
For the two most globally recognisable players, Lionel Messi and Cristiano Ronaldo, we see how their scores evolve over the season in the plot below. Consistent performance over the course of the season is the most desirable trend among successful forwards.
As is shown in the time series above, Messi consistently outperformed Ronaldo in almost every game for the 2012-13 season. The contribution was recognized and rewarded as Messi was named the FIFA World Player of the Year following the 2012 season.
We were keen on analyzing these time series for some interesting players in a season. But the complexity of analyzing every set of players visually and making key discoveries made this cumbersome.
To check the consistency of our metric against what actually happened in the seasons we evaluated, we combine the scores for each player in a team. We then rank each team in the league by its aggregate score and compare this ranking to the position in which the team actually finished the season. As the figure below shows, the rankings predicted by our measure were not far from the actual results and showed correlations no worse than 0.73.
The leagues that our metric most accurately predicted, when aggregated by team, were the Spanish (corr = 0.92) and French (corr = 0.88) leagues. Traditionally these leagues are known for allowing more goals and, since our metric is at present based solely on scoring statistics, these successes are understandable. The British, German and Italian leagues are more measured defensively, and we infer that the less accurate prediction for these leagues is due to not accounting for defensive statistics.
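The team-level check can be sketched as follows. The team names and scores are made up. The write-up does not say which correlation coefficient was used, so the Spearman rank correlation here is an assumption, as are the function names.

```python
from collections import defaultdict


def team_ranking(player_scores):
    """Rank teams by summed player score; rank 1 is the highest total.

    `player_scores` is an iterable of (team, score) pairs.
    """
    totals = defaultdict(float)
    for team, score in player_scores:
        totals[team] += score
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {team: position for position, team in enumerate(ordered, start=1)}


def spearman(predicted, actual):
    """Spearman rank correlation for two tie-free rankings of the same teams."""
    n = len(predicted)
    d_squared = sum((predicted[t] - actual[t]) ** 2 for t in predicted)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up data: four hypothetical teams and their players' impact scores.
scores = [("A", 10), ("A", 5), ("B", 12), ("C", 3), ("D", 8), ("D", 1)]
predicted = team_ranking(scores)            # totals: A=15, B=12, D=9, C=3
actual = {"A": 2, "B": 1, "C": 4, "D": 3}   # hypothetical final league table
rho = spearman(predicted, actual)           # 0.8 for this toy data
```

A rank-based coefficient seems a natural fit here because both quantities being compared are league positions, but a Pearson correlation on the same ranks would give the same value when there are no ties.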
In order to further validate our metric of player impact, we fit two separate regressions, each trained on the data from the 2012-13 season and tested on the 2013-14 season. Initially, we trained a random forest regressor and then utilized an ordinary linear regression.
Once we trained our regressions, we wanted to get a sense of the influence each feature, derived from our metric, had on our prediction model. We see that the importance of 'ngoals' and 'nassists' (the number of goals and assists respectively) far outweighs that of the others, as would be expected. It is satisfying to see that the other features describing different aspects of the goal are also nontrivial and contribute to the final prediction model. With the assurance that our model was utilizing the features extracted from our data correctly, we move on to actually predicting the performance of players, based on our metric, in the following season. We fully expect the linear regression to predict a player's performance more accurately, since the metric that the regression was trained on is a simple linear weighting of goals and assists recorded by each player.
[Figure panels: Random Forest (left), Linear Regression (right)]
As expected, the linear regression very nearly predicts the measured impact for each player in the 2013-14 season. The correlation between predicted and actual value was 0.994. The random forest regression doesn't perform that much worse (correlation = 0.965) but you can see that it undervalues the players who score higher as per our metric. This is likely due to the majority of players having lower scores, thus biasing our regression slightly.
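The train-on-one-season, test-on-the-next setup for the linear model can be sketched without any external libraries. This is not the code we ran. The data are invented, only the two features 'ngoals' and 'nassists' are used, and the helper names are hypothetical. It does illustrate why a linear regression recovers a metric that is itself a linear weighting almost exactly.

```python
def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]            # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Invented "2012-13" training data: (ngoals, nassists) -> impact score,
# generated here as goals + 0.5 * assists purely for illustration.
train_X = [(0, 0), (1, 0), (0, 1), (2, 3), (5, 2), (8, 4)]
train_y = [0.0, 1.0, 0.5, 3.5, 6.0, 10.0]
coefs = fit_ols(train_X, train_y)

# Invented "2013-14" test data with small deviations from the linear rule.
test_X = [(10, 2), (3, 7), (0, 1), (20, 5)]
test_y = [11.2, 6.4, 0.5, 23.0]
r = pearson([predict(coefs, row) for row in test_X], test_y)
```

Because the toy target is an exact linear function of the features, the fitted coefficients match the generating weights, and the small deviations in the test season barely lower the correlation.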
In order to evaluate the trend of player performance with respect to transfer value we consider the subset of players who were transferred between the 2012-13 and 2013-14 seasons (of the data we collected). We use the ranking of the cluster they are assigned to in each season to compare their performance across the two seasons. The difference between the two rankings is then shown below.
The performance differential of these players is ordered from left to right based on the amount that the players were transferred for between the two seasons. Gareth Bale, who was transferred from Tottenham Hotspur to Real Madrid for a record $132 M, is on the far right. We use this figure to show, at least on the surface, whether the transfer of a certain player was a good or bad investment for the team. Clearly, there is some of both, along with a majority of players who perform similarly across the two seasons. There isn't enough data to infer any sort of trend (i.e. highest valued players generally underperform) from this analysis, but it is easy to see why players like Alexis Sanchez and Marouane Fellaini have such divergent reputations in the world of football.
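The bookkeeping behind this comparison is simple enough to sketch. The player names, cluster ranks and fees below are made up, and the sign convention (rank 1 is the best cluster, so a positive differential means improvement) is an assumption on our part.

```python
def rank_differentials(rank_before, rank_after, fees):
    """Cluster-rank change per transferred player, ordered by ascending fee.

    Assumes rank 1 is the best cluster, so a positive differential means the
    player moved to a better-ranked cluster after the transfer.
    """
    players = [p for p in fees if p in rank_before and p in rank_after]
    players.sort(key=fees.get)
    return [(p, rank_before[p] - rank_after[p]) for p in players]


# Hypothetical players with cluster ranks in each season and fees in millions.
rank_2012_13 = {"Player A": 3, "Player B": 1, "Player C": 2}
rank_2013_14 = {"Player A": 1, "Player B": 2, "Player C": 2}
fees = {"Player A": 10.0, "Player B": 50.0, "Player C": 25.0}

differentials = rank_differentials(rank_2012_13, rank_2013_14, fees)
# [('Player A', 2), ('Player C', 0), ('Player B', -1)]
```

Read left to right, the list mirrors the layout of the figure: the cheapest transfer first and the most expensive last.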
We set out to quantify the impact an individual player has on the games he plays for his team. We used this calculated value to compare that performance against other players and, more importantly, against how these players have been valued in reality. We learned early on in our analysis that to accurately characterize a player, one needs more data about very specific actions during game play. These kinds of statistics center on what a player does to effectively further his team's style of play, whether that is to be an effective passer, adventurous dribbler or lock-down defender. By basing valuation solely on goals and assists registered, we took the risk of overlooking the effect these other contributions have on rating a player's performance. We understood the effect our assumptions would have on our analysis, but we grew to better appreciate the work that is done to record and analyze more detailed match data.
We succeeded in creating an effective metric that, while not providing much predictive power, sufficiently characterizes a player's value to their team. We found effective ways to visualize this metric that allowed us to compare players and find relationships between players in different leagues. We expected to find large deviations in our metric from the transfer values placed on these players in reality, hoping to find some undervalued diamonds in the rough or overvalued dead weight. Generally we saw that players were given transfer values commensurate with their performances. We aren't able to comment on the rate of inflation in transfer values based on our analysis but there is some validity in the values placed on players in relation to their peers.