Welcome to Footy Trends!
Footy Trends is a portfolio project that explores information and technical tools for better understanding the dynamics of Australian rules football. In particular:
- Back end cloud infrastructure on Amazon Web Services:
  - Security and Logging accounts created using Control Tower
  - S3
  - CloudFront
- Python tools & libraries
- Machine learning systems
The Game
Australian rules football is a full-contact sport played on a large oval-shaped field. It has no timeouts, and player substitution is done on the fly. Play stoppages due to infractions, out of bounds, the ball getting tied up or a point being scored are short (about 5-20 seconds). The stoppage after a goal is scored is about 90 seconds. There is no offside rule, and the ball can move a long way quickly because the player in possession can kick it in any direction at any time. These features make the game action-packed and exciting to watch!
Other interesting aspects of the game are that it is long (4 quarters of 20 minutes) and that the bench is short relative to the number of players on the field. Each team has 18 players on the field and up to 4 on the bench. In addition, the maximum number of rotations allowed in a single game is 75.
Though it varies with position and role, each player has a high total workload relative to other sports. This work is a combination of walking, jogging and sprinting, in addition to laying and weathering tackles, jumping, kicking and handballing (throwing the ball is not allowed).
GPS trackers on players have reported distances covered as high as 17 kilometres in a game. Quite a feat, given that many of these kilometres are not in a straight line and are covered during contests. Apparently this number has gone down to more like 13 kilometres in recent years due to the use of "infield subbing" (i.e. dynamically swapping players between higher and lower work rate positions/roles).
Player build
The characteristics of the game would seem to select against players who are very heavily or very lightly built: the former would lack the endurance, and the latter would be outmatched in physical contests, possibly to the point of injury. Pure size, though, may not be as heavily selected against, because of the trade-off between endurance on one side and physicality (for tackling) and height (for jumping) on the other. Anecdotally this holds: when watching a game there are no very heavy builds like that of a front row rugby player, shot putter or NFL linebacker. Nor are there many very lithe builds like those found in marathon running or volleyball.
Let's do some data analytics to see if we can get some insights. Python has some great libraries for this, so let's use it in the JupyterLab IDE. We start by importing libraries for storing, manipulating and plotting data:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Kaggle is a community where data sets and models are shared. Michael Stone has provided a really good AFL data set that we'll use. Thanks Michael! The data set covers the years 2012-2023.
import kagglehub
path = kagglehub.dataset_download("stoney71/aflstats")
How is this dataset structured?
!ls {path}
games.csv players.csv stats.csv
Let's take a peek at the players csv file
!head {path}/'players.csv'
PlayerId,DisplayName,Height,Weight,Dob,Position,Origin
2020654979,"Aarts, Jake",177,75,8-Dec-94,Forward,Richmond
2018655703,"Abbott, Ryan",200,100,25-Jun-91,Ruck,Grovedale Tigers
2002652211,"Ablett, Gary",182,87,14-May-84,Forward,Geelong Falcons
2014651814,"Acres, Blake",191,90,7-Oct-95,Midfield,West Perth
2007653719,"Adams, Leigh",176,79,6-Apr-88,Forward,Eastern Ranges
2016652575,"Adams, Marcus",192,93,30-Jun-93,Defender,West Perth
2012656391,"Adams, Taylor",181,83,20-Sep-93,Midfield,Geelong Falcons
2004652814,"Adcock, Jed",183,83,15-Nov-85,"Defender, Forward",GWV Rebels
2006658504,"Addison, Dylan",184,84,7-Oct-87,Forward,St George
How many players are there in total? Let's count the lines in the file (the count includes the header row)
!wc -l {path}/'players.csv'
1650 /Users/nick/.cache/kagglehub/datasets/stoney71/aflstats/versions/4/players.csv
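Because `wc -l` counts every line, the figure above includes the CSV header row. A minimal sketch of counting data rows with pandas instead, using only the first three rows from the file preview as an inline sample (the `sample_csv` string and the variable names are illustrative, not part of the data set):

```python
import io
import pandas as pd

# Inline sample: the header plus the first three rows from the file preview.
sample_csv = '''PlayerId,DisplayName,Height,Weight,Dob,Position,Origin
2020654979,"Aarts, Jake",177,75,8-Dec-94,Forward,Richmond
2018655703,"Abbott, Ryan",200,100,25-Jun-91,Ruck,Grovedale Tigers
2002652211,"Ablett, Gary",182,87,14-May-84,Forward,Geelong Falcons
'''

sample = pd.read_csv(io.StringIO(sample_csv))

line_count = len(sample_csv.splitlines())    # every line, header included
row_count = len(sample)                      # data rows only
unique_ids = sample["PlayerId"].nunique()    # distinct players, in case of duplicate rows

print(line_count, row_count, unique_ids)
```

Applied to the full file, the same logic would suggest one fewer player than the line count, assuming a single header row and no duplicate `PlayerId` values.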
players = pd.read_csv(path + '/players.csv')
players.head()
| | PlayerId | DisplayName | Height | Weight | Dob | Position | Origin |
|---|---|---|---|---|---|---|---|
| 0 | 2020654979 | Aarts, Jake | 177 | 75 | 8-Dec-94 | Forward | Richmond |
| 1 | 2018655703 | Abbott, Ryan | 200 | 100 | 25-Jun-91 | Ruck | Grovedale Tigers |
| 2 | 2002652211 | Ablett, Gary | 182 | 87 | 14-May-84 | Forward | Geelong Falcons |
| 3 | 2014651814 | Acres, Blake | 191 | 90 | 7-Oct-95 | Midfield | West Perth |
| 4 | 2007653719 | Adams, Leigh | 176 | 79 | 6-Apr-88 | Forward | Eastern Ranges |
And plot a histogram to see the height distribution of the players
title = "AFL player heights"
plt.title(title)
sns.histplot(players, x='Height',discrete=True)
plt.savefig(title)
It looks like a normal distribution skewed to the right (towards taller players). Let's calculate the median and the mean
players['Height'].median()
187.0
players['Height'].mean()
187.7835051546392
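A mean above the median is consistent with a right skew, and pandas can quantify this directly with `Series.skew()`. A minimal sketch, run on just the nine heights visible in the file preview rather than the full data set, so the numbers are illustrative only:

```python
import pandas as pd

# Heights (cm) of the nine players in the file preview; illustrative sample only.
heights = pd.Series([177, 200, 182, 191, 176, 192, 181, 183, 184], name="Height")

mean_height = heights.mean()
median_height = heights.median()
skewness = heights.skew()   # sample skewness; above 0 means a longer right (taller) tail

print(f"mean={mean_height:.2f} median={median_height:.1f} skew={skewness:.2f}")
```

On the full data the same call would be `players['Height'].skew()`.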
What about weights?
title = "AFL player weights"
plt.title(title)
sns.histplot(players, x='Weight', discrete=True)
plt.savefig(title)
This also looks like a normal distribution skewed right. The jagged bars on both plots are a little bit surprising. It seems like there is a big enough sample that it should be a bit smoother. I wonder if there is estimating at the source, rounding and/or a conversion from metric happening somewhere in the chain.
players['Weight'].median()
85.0
players['Weight'].mean()
85.72650090964221
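One way to probe the rounding suspicion raised by the jagged bars is to look at how often each final digit occurs. The sketch below is a rough check, not a formal test, and it runs on the nine weights from the file preview only, so its output says nothing about the full data set:

```python
import pandas as pd

# Weights (kg) of the nine players in the file preview; illustrative sample only.
weights = pd.Series([75, 100, 87, 90, 79, 93, 83, 83, 84], name="Weight")

# Count how often each final digit occurs. In a large sample of unrounded
# measurements the ten digits would be expected to appear about equally often.
last_digit_counts = (weights % 10).value_counts().sort_index()

# Share of values that are exact multiples of 5, a common sign of rounding.
share_multiple_of_5 = (weights % 5 == 0).mean()

print(last_digit_counts)
print(f"share of multiples of 5: {share_multiple_of_5:.2f}")
```

With the full `players` frame, a share of multiples of 5 well above 0.2 would hint at rounding somewhere in the chain.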
Let's do a scatterplot of height vs weight
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
plt.savefig(title)
Starting the axes at the edge of the data may be deceiving. Let's make the axes start at 0,0
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
# Set the x-axis and y-axis to start at 0 before saving, so the saved figure includes the change
plt.xlim(0)
plt.ylim(0)
plt.savefig(title)
plt.ylim()  # report the final y-axis limits
(0.0, 213.15)
players['Weight'].min()/players['Weight'].max()
0.5423728813559322
players['Height'].min()/players['Height'].max()
0.7962085308056872
Though we don't know the players' body fat percentages, BMI is a reasonable proxy for build for professional athletes
# BMI = weight (kg) / height (m) squared; Height is stored in cm
players_BMI = players['Weight'] / (players['Height'] / 100) ** 2
players.insert(loc=4, column='BMI', value=players_BMI)
players.head()
| | PlayerId | DisplayName | Height | Weight | BMI | Dob | Position | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020654979 | Aarts, Jake | 177 | 75 | 23.939481 | 8-Dec-94 | Forward | Richmond |
| 1 | 2018655703 | Abbott, Ryan | 200 | 100 | 25.000000 | 25-Jun-91 | Ruck | Grovedale Tigers |
| 2 | 2002652211 | Ablett, Gary | 182 | 87 | 26.264944 | 14-May-84 | Forward | Geelong Falcons |
| 3 | 2014651814 | Acres, Blake | 191 | 90 | 24.670376 | 7-Oct-95 | Midfield | West Perth |
| 4 | 2007653719 | Adams, Leigh | 176 | 79 | 25.503616 | 6-Apr-88 | Forward | Eastern Ranges |
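The BMI column can be spot-checked by hand. A small sketch, assuming (as the values in the table suggest) that `Weight` is in kilograms and `Height` in centimetres; the `bmi` helper is a hypothetical name used only for this check:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

# Spot checks against the first two rows of the table above.
print(round(bmi(75, 177), 6))    # Aarts, Jake
print(round(bmi(100, 200), 6))   # Abbott, Ryan
```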
Are different player builds more common in smaller or bigger players?
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height", hue='BMI', palette='coolwarm')
plt.title(title)
plt.savefig(title)
Yes! Lighter builds are more common in lighter players, and heavier builds are more common in heavier players. Percentile hue breakdowns illustrate this better. I may include some in the future.
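A percentile breakdown of this kind could be built with `pd.qcut`, which splits a column into equal-sized groups. The sketch below is one possible approach on the nine preview rows only; the three band labels are hypothetical names, and the choice of terciles is an assumption:

```python
import pandas as pd

# Height (cm) and weight (kg) of the nine players in the file preview; illustrative only.
sample = pd.DataFrame({
    "Height": [177, 200, 182, 191, 176, 192, 181, 183, 184],
    "Weight": [75, 100, 87, 90, 79, 93, 83, 83, 84],
})
sample["BMI"] = sample["Weight"] / (sample["Height"] / 100) ** 2

# Split BMI into three equal-sized groups (terciles). The labels are hypothetical.
sample["BMI band"] = pd.qcut(sample["BMI"], q=3, labels=["lighter", "middle", "heavier"])

band_counts = sample["BMI band"].value_counts()
print(sample[["Height", "Weight", "BMI", "BMI band"]])
```

The resulting column could then be passed as `hue='BMI band'` to `sns.scatterplot` in place of the continuous `BMI` hue.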
What could the potential causes of this be? For the lighter builds in lighter players, one cause could be that young players have not filled out yet: they may have reached their full height but not their full weight, and if they continue in the league their build will get heavier as they fill out. It could also be that some play styles rely on avoiding direct physicality such as tackling, and/or that physicality is asymmetric (being tackled vs laying a tackle). If injury is more common in lighter builds, it's possible that these players have shorter careers. Comparing player build to each of age, career length and tackling would be interesting.
For heavier builds in heavier players, this may be due to specific roles where physicality is important but the player does not need to move far in a given play or a given game, so endurance is less of a requirement.
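As a first step towards the age comparison, the `Dob` strings can be parsed into dates and turned into ages. A minimal sketch on three preview rows; the reference date is an assumption chosen to match the last season in the data set:

```python
import pandas as pd

# Dates of birth from the file preview; illustrative sample only.
sample = pd.DataFrame({
    "DisplayName": ["Aarts, Jake", "Abbott, Ryan", "Ablett, Gary"],
    "Dob": ["8-Dec-94", "25-Jun-91", "14-May-84"],
})

# Parse day-month-two-digit-year strings. With %y, years 69-99 map to 1969-1999
# and 00-68 map to 2000-2068, which suits the birth years in this sample.
sample["Dob"] = pd.to_datetime(sample["Dob"], format="%d-%b-%y")

# Hypothetical reference date: the end of the last season in the data set.
reference_date = pd.Timestamp("2023-12-31")
sample["Age"] = (reference_date - sample["Dob"]).dt.days / 365.25

print(sample)
```

Age at a fixed date is only a rough stand-in; age at each game would need the dates in `games.csv`.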
Machine Learning
Just for fun, let's fit a linear machine learning model to the players' Height vs Weight
X = players[['Weight']]
y = players['Height']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
LinearRegression()
lm.get_params()
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
print(lm.intercept_)
128.93269601484045
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
| | Coefficient |
|---|---|
| Weight | 0.686751 |
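The slope of ~0.69 means the model predicts about 0.69 cm of extra height per extra kilogram of weight. The slope on its own does not say how strong the linear relationship is; a correlation coefficient or R² would measure that.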
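To separate slope from strength, the Pearson correlation coefficient can be computed alongside the fitted line. A sketch using NumPy on the nine preview rows only (so the printed values will not match the full-data model), with `np.polyfit` swapped in for `LinearRegression` to keep it self-contained:

```python
import numpy as np

# Height (cm) and weight (kg) of the nine players in the file preview; illustrative only.
height = np.array([177, 200, 182, 191, 176, 192, 181, 183, 184], dtype=float)
weight = np.array([75, 100, 87, 90, 79, 93, 83, 83, 84], dtype=float)

# Least-squares line: height = slope * weight + intercept
slope, intercept = np.polyfit(weight, height, deg=1)

# Pearson correlation coefficient, which measures linear strength on a -1 to 1 scale.
r = np.corrcoef(weight, height)[0, 1]

# The two are linked: slope = r * (std of height / std of weight).
slope_from_r = r * height.std() / weight.std()

print(f"slope={slope:.3f} cm/kg, r={r:.3f}, r^2={r**2:.3f}")
```

The slope depends on the units and spread of each variable, while `r` does not, which is why the same slope can come from a tight or a loose cloud of points.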
Let's draw the predicted line on the original scatterplot to see how it looks
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
# Plot the regression line
plt.plot(X,lm.predict(X), label='Regression Line', color='red')
plt.legend()
plt.title(title)
plt.savefig(title)
lm_height_predictions = lm.predict(X_test)
lm_residuals = y_test - lm_height_predictions
# Use a dedicated title so the residual plot is not saved over the earlier figure
title = "AFL players Height vs Weight linear model residuals"
plt.title(title)
plt.scatter(X_test,lm_residuals)
plt.xlabel("Weight")
plt.ylabel("Height residuals")
plt.savefig(title)
More volatility at the lower weights? Or an artifact of a larger sample size?
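One way to separate the two explanations is to compare residual spread across weight bands that each hold the same number of players, so sample size no longer differs between bands. The sketch below uses synthetic stand-in data, not the real residuals, and the choice of four bands is an assumption:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the test-set weights and residuals (not the real data).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Weight": rng.normal(loc=86, scale=8, size=500),
    "Residual": rng.normal(loc=0, scale=4.5, size=500),
})

# Cut weight into four equal-count bands, then summarise the residuals per band.
frame["Weight band"] = pd.qcut(frame["Weight"], q=4)
spread = frame.groupby("Weight band", observed=True)["Residual"].agg(["count", "std"])

print(spread)
```

In the notebook the frame would be built from `X_test['Weight']` and `lm_residuals`; a clearly larger `std` in the lowest band would point to real extra volatility rather than a sample-size effect.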
Let's apply some standard tools to these residuals
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, lm_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, lm_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lm_height_predictions)))
MAE: 3.5297148653398764
MSE: 21.056239448144126
RMSE: 4.588707818999171
MAE indicates the model is off by an average of ~3.5 cm when predicting a player's height from their weight. MSE averages the squared errors, so larger errors are penalized more heavily, but it is expressed in squared units (cm²). RMSE is the square root of MSE, which brings it back to centimetres while keeping the extra weight on larger errors.
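A tiny worked example can make the difference between the three metrics concrete. The four error values below are hypothetical, chosen so that one large miss sits among three small ones:

```python
import numpy as np

# Hypothetical prediction errors in cm: three small misses and one large one.
errors = np.array([1.0, -1.0, 1.0, 9.0])

mae = np.mean(np.abs(errors))    # (1 + 1 + 1 + 9) / 4 = 3.0
mse = np.mean(errors ** 2)       # (1 + 1 + 1 + 81) / 4 = 21.0
rmse = np.sqrt(mse)              # square root of 21, back in cm

print(mae, mse, rmse)
```

The single 9 cm miss contributes 81 of the 84 squared units, which is why RMSE ends up well above MAE here.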
Is there a slight exponential relationship of Height to Weight?
y_log_train = np.log(y_train)
em = LinearRegression().fit(X_train,y_log_train)
em_log_height_predictions = em.predict(X_test)
em_height_predictions = np.exp(em_log_height_predictions)
print('MAE:', metrics.mean_absolute_error(y_test, em_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, em_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, em_height_predictions)))
MAE: 3.521240117676183
MSE: 20.962440540186822
RMSE: 4.578475787878191
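The two models differ by less than 0.01 cm in MAE, which may be within the noise of a single train/test split. A sketch of how repeated random splits could be used to check this, on synthetic stand-in data (not the real data set) and with `np.polyfit` standing in for `LinearRegression`; the `split_mae` helper and the 20 repeats are assumptions for illustration:

```python
import numpy as np

# Synthetic stand-in data (not the real data set): height roughly linear in weight.
rng = np.random.default_rng(0)
weight = rng.normal(loc=86, scale=8, size=600)
height = 129 + 0.69 * weight + rng.normal(scale=4.5, size=600)

def split_mae(seed):
    """Return (linear MAE, log-linear MAE) for one random 70/30 split."""
    order = np.random.default_rng(seed).permutation(len(weight))
    test, train = order[:180], order[180:]
    # Linear model: height = a * weight + b
    a, b = np.polyfit(weight[train], height[train], deg=1)
    linear_pred = a * weight[test] + b
    # Log-linear model: log(height) = c * weight + d, so height = exp(c * weight + d)
    c, d = np.polyfit(weight[train], np.log(height[train]), deg=1)
    exp_pred = np.exp(c * weight[test] + d)
    return (np.mean(np.abs(height[test] - linear_pred)),
            np.mean(np.abs(height[test] - exp_pred)))

results = np.array([split_mae(seed) for seed in range(20)])
differences = results[:, 0] - results[:, 1]   # positive means the log-linear fit did better

print(f"mean difference: {differences.mean():.4f} cm, "
      f"spread across splits: {differences.std():.4f} cm")
```

If the mean difference is small compared with its spread across splits, the data give little reason to prefer one model over the other.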
em_residuals = y_test - em_height_predictions
# Use a dedicated title so the comparison plot is not saved over an earlier figure
title = 'AFL players Height vs Weight model residuals'
plt.scatter(X_test,lm_residuals, label = "Linear model", color='blue')
plt.scatter(X_test, em_residuals, label = "Exponential model", color='orange')
plt.title(title)
plt.xlabel("Weight")
plt.ylabel("Height residuals")
plt.legend()
plt.savefig(title)