Welcome to Footy Trends!

Footy Trends is a portfolio project that explores information and technical tools for better understanding the dynamics of Australian rules football. In particular:

  • Back end cloud infrastructure on Amazon Web Services:
    • Security and Logging accounts created using Control Tower
    • S3
    • CloudFront
  • Python tools & libraries
  • Machine learning systems

The Game

Australian rules football is a full-contact sport played on a large oval-shaped field. It has no timeouts, and player substitution is done on the fly. Play stoppages for infractions, out of bounds, the ball getting tied up or a point being scored are short (about 5-20 seconds). The stoppage after a goal is about 90 seconds. There is no offside rule, and the ball can move a long way quickly because the player in possession can kick it in any direction at any time. These features make the game action-packed and exciting to watch!

Other interesting aspects of the game are that it is long (4 quarters of 20 minutes) and that the bench is short relative to the number of players on the field. Each team has 18 players on the field and up to 4 on the bench. In addition, the maximum number of rotations allowed in a single game is 75.

Though it varies with position and role, each player has a high total workload relative to other sports. This work is a combination of walking, jogging and sprinting, in addition to laying and weathering tackles, jumping, kicking and handballing (throwing the ball is not allowed).

GPS trackers on players have reported distances covered as high as 17 kilometres in a game. Quite a feat, given that many of these kilometres are not in a straight line and are covered during contests. Apparently this number has come down to around 13 kilometres in recent years due to the use of "infield subbing" (i.e. dynamically swapping players between higher and lower work rate positions/roles).

Player build

The characteristics of the game would seem to select against players who are very heavily or very lightly built: the former would lack the endurance, and the latter would be outmatched in physical contests, possibly to the point of injury. Pure size may not be as heavily selected against, because endurance trades off against physicality (for tackling) and height (for jumping). Anecdotally this holds; when watching a game there are no very heavy builds like those of a front-row rugby player, shot putter or NFL linebacker. Nor are there many very lithe builds like those found in marathon running or volleyball.

Let's do some data analytics to see if we can get some insights. Python has some great libraries for this, so let's use it in the JupyterLab IDE. We start by importing libraries for storing, manipulating and plotting data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Kaggle is a community where data sets and models are shared. Michael Stone has provided a really good AFL data set that we'll use. Thanks Michael! The data set covers the years 2012-2023.

In [2]:
import kagglehub
path = kagglehub.dataset_download("stoney71/aflstats")

How is this dataset structured?

In [3]:
!ls {path}
games.csv   players.csv stats.csv

Let's take a peek at the players CSV file

In [4]:
!head {path}/'players.csv'
PlayerId,DisplayName,Height,Weight,Dob,Position,Origin
2020654979,"Aarts, Jake",177,75,8-Dec-94,Forward,Richmond
2018655703,"Abbott, Ryan",200,100,25-Jun-91,Ruck,Grovedale Tigers
2002652211,"Ablett, Gary",182,87,14-May-84,Forward,Geelong Falcons
2014651814,"Acres, Blake",191,90,7-Oct-95,Midfield,West Perth
2007653719,"Adams, Leigh",176,79,6-Apr-88,Forward,Eastern Ranges
2016652575,"Adams, Marcus",192,93,30-Jun-93,Defender,West Perth
2012656391,"Adams, Taylor",181,83,20-Sep-93,Midfield,Geelong Falcons
2004652814,"Adcock, Jed",183,83,15-Nov-85,"Defender, Forward",GWV Rebels
2006658504,"Addison, Dylan",184,84,7-Oct-87,Forward,St George

How many players are there in total? Let's count the lines in the file

In [5]:
!wc -l {path}/'players.csv'
    1650 /Users/nick/.cache/kagglehub/datasets/stoney71/aflstats/versions/4/players.csv

It's 1650 minus 1 for the header line, so 1649 players

We use the pandas library to read the players CSV file into a DataFrame. We then take a peek at the DataFrame to make sure it was read in correctly

In [6]:
players = pd.read_csv(path +  '/players.csv')
players.head()
Out[6]:
PlayerId DisplayName Height Weight Dob Position Origin
0 2020654979 Aarts, Jake 177 75 8-Dec-94 Forward Richmond
1 2018655703 Abbott, Ryan 200 100 25-Jun-91 Ruck Grovedale Tigers
2 2002652211 Ablett, Gary 182 87 14-May-84 Forward Geelong Falcons
3 2014651814 Acres, Blake 191 90 7-Oct-95 Midfield West Perth
4 2007653719 Adams, Leigh 176 79 6-Apr-88 Forward Eastern Ranges
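Beyond eyeballing `head()`, a few quick checks can confirm the file was parsed as expected. The following is a minimal sketch that uses only the first three rows shown above as stand-in data (the name `sample` is illustrative); on the real DataFrame the same calls would apply to `players`.

```python
import io
import pandas as pd

# First rows of players.csv as shown above; the full file has many more rows.
sample_csv = io.StringIO(
    'PlayerId,DisplayName,Height,Weight,Dob,Position,Origin\n'
    '2020654979,"Aarts, Jake",177,75,8-Dec-94,Forward,Richmond\n'
    '2018655703,"Abbott, Ryan",200,100,25-Jun-91,Ruck,Grovedale Tigers\n'
    '2002652211,"Ablett, Gary",182,87,14-May-84,Forward,Geelong Falcons\n'
)
sample = pd.read_csv(sample_csv)

# Row count excludes the header line, so it should match "wc -l" minus 1.
n_rows, n_cols = sample.shape

# Count missing values per column; Height and Weight should be numeric.
missing_per_column = sample.isna().sum()

print(n_rows, n_cols)
print(sample.dtypes)
print(missing_per_column)
```

Note that the quoted names such as "Aarts, Jake" contain a comma, so a correct column count is a useful sign that the quoting was handled properly.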

And plot a histogram to see the height distribution of the players

In [7]:
title = "AFL player heights"

plt.title(title)
sns.histplot(players, x='Height',discrete=True)
plt.savefig(title)
[Figure: AFL player heights]

It looks like a normal distribution skewed to the right (towards taller players). Let's calculate the median and the mean

In [8]:
players['Height'].median()
Out[8]:
187.0
In [9]:
players['Height'].mean()
Out[9]:
187.7835051546392
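A mean above the median is consistent with a right skew, and pandas can also put a number on it. This is a small sketch using only the nine heights from the file preview above, not the full data set, so the printed values are illustrative only.

```python
import pandas as pd

# Heights (cm) of the nine players in the preview above, not the full data set.
heights = pd.Series([177, 200, 182, 191, 176, 192, 181, 183, 184], name="Height")

mean_height = heights.mean()
median_height = heights.median()

# Sample skewness: a value above 0 suggests a longer tail on the taller side.
skewness = heights.skew()

print(f"mean={mean_height:.2f} median={median_height:.1f} skew={skewness:.2f}")
```

On the real data the same call would be `players['Height'].skew()`.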

What about weights?

In [10]:
title = "AFL player weights"
plt.title(title)
sns.histplot(players, x='Weight', discrete=True)
plt.savefig(title)
[Figure: AFL player weights]

This also looks like a normal distribution skewed right. The jagged bars on both plots are a little surprising. The sample seems big enough that the histograms should be smoother. I wonder if there is estimating at the source, rounding and/or a unit conversion happening somewhere in the chain.

In [11]:
players['Weight'].median()
Out[11]:
85.0
In [12]:
players['Weight'].mean()
Out[12]:
85.72650090964221
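One way to probe the rounding suspicion raised above is to look at the last digit of each value: if weights were often estimated or rounded, digits such as 0 and 5 would show up more than the roughly 10% each that unrounded measurements would give. The sketch below uses the nine weights from the file preview as stand-in data, so its counts say nothing about the full data set.

```python
import pandas as pd

# Weights (kg) of the nine players in the preview above, not the full data set.
weights = pd.Series([75, 100, 87, 90, 79, 93, 83, 83, 84], name="Weight")

# How often does each final digit occur?
last_digit_counts = (weights % 10).value_counts().sort_index()

# Share of values that are a multiple of 5; about 0.2 would be expected
# if the final digit were evenly spread.
share_multiple_of_5 = (weights % 5 == 0).mean()

print(last_digit_counts)
print(f"share of multiples of 5: {share_multiple_of_5:.2f}")
```

The same two lines applied to `players['Weight']` and `players['Height']` would show whether particular digits are over-represented.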

Let's do a scatterplot of height vs weight

In [13]:
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
plt.savefig(title)
[Figure: AFL players Height vs Weight]

Starting the axes at the edge of the data may be deceiving. Let's make both axes start at 0

In [14]:
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
# Set the x-axis and y-axis to start at 0 before saving,
# so that the saved image uses the same limits as the displayed plot
plt.xlim(0)
plt.ylim(0)
plt.savefig(title)
[Figure: AFL players Height vs Weight, axes starting at 0]

Deceiving in the other direction, but it illustrates the point

Weight varies more than height in relative terms. The lightest player is about 54% of the weight of the heaviest player; the shortest player is about 80% of the height of the tallest player

In [15]:
players['Weight'].min()/players['Weight'].max()
Out[15]:
0.5423728813559322
In [16]:
players['Height'].min()/players['Height'].max()
Out[16]:
0.7962085308056872
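The min/max ratio depends on just two players, so a single unusual entry can move it a lot. The coefficient of variation (standard deviation divided by mean) is a hedged alternative that uses every row and is also unit-free. The sketch below runs on the nine preview rows only, so the numbers are illustrative.

```python
import pandas as pd

# Height (cm) and weight (kg) of the nine players in the preview above.
sample = pd.DataFrame({
    "Height": [177, 200, 182, 191, 176, 192, 181, 183, 184],
    "Weight": [75, 100, 87, 90, 79, 93, 83, 83, 84],
})

# Coefficient of variation: spread relative to the typical value.
cv = sample.std() / sample.mean()

print(cv.round(3))
```

On the real data this would be `players[['Height', 'Weight']].std() / players[['Height', 'Weight']].mean()`.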

Though we don't know the players' body fat percentages, BMI is a reasonable proxy for build in professional athletes

In [17]:
players_BMI= players['Weight'] / (players['Height']/100 * players['Height']/100)
In [18]:
players.insert(loc=4, column='BMI', value=players_BMI)
In [19]:
players.head()
Out[19]:
PlayerId DisplayName Height Weight BMI Dob Position Origin
0 2020654979 Aarts, Jake 177 75 23.939481 8-Dec-94 Forward Richmond
1 2018655703 Abbott, Ryan 200 100 25.000000 25-Jun-91 Ruck Grovedale Tigers
2 2002652211 Ablett, Gary 182 87 26.264944 14-May-84 Forward Geelong Falcons
3 2014651814 Acres, Blake 191 90 24.670376 7-Oct-95 Midfield West Perth
4 2007653719 Adams, Leigh 176 79 25.503616 6-Apr-88 Forward Eastern Ranges
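For reference, BMI is weight in kilograms divided by the square of height in metres, which is why the heights in centimetres are divided by 100 first. A minimal sketch of the same calculation as a reusable function (the name `bmi` is illustrative, not part of the notebook), checked against the first rows of the output above:

```python
import pandas as pd

def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

# First three players from the output above.
sample = pd.DataFrame({"Height": [177, 200, 182], "Weight": [75, 100, 87]})
sample["BMI"] = bmi(sample["Weight"], sample["Height"])

print(sample)
```

The function works on plain numbers as well as on pandas columns, for example `bmi(100, 200)` for a 100 kg, 200 cm player.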

Are different player builds more common in smaller or bigger players?

In [20]:
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height", hue='BMI', palette='coolwarm')
plt.title(title)
plt.savefig(title)
[Figure: AFL players Height vs Weight, coloured by BMI]

Yes! Lighter builds are more common in lighter players and heavier builds are more common in heavier players. Percentile hue breakdowns would illustrate this better. I may include some in the future

What could the potential causes of this be? For the lighter builds in lighter players, one cause could be that young players have not filled out yet: they may have reached their full height but not their full weight, and if they continue in the league their build will get heavier as they fill out. It could also be that some play styles rely on avoiding direct physicality like tackling, and/or that physicality is asymmetric (being tackled vs laying a tackle). If injury is more common in lighter builds, it's possible that these players have shorter careers. Comparing player build to each of age, career length and tackling would be interesting

For heavier builds in heavier players, this may be due to specific roles where physicality is important but the player does not need to move far in a given play or game, so endurance is less of a requirement
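As a first step towards the age comparison suggested above, the `Dob` column would need to be parsed into dates. This is a sketch under stated assumptions: the day-month-two-digit-year format is inferred from the preview rows, and the reference date is a hypothetical choice (the end of the last year the data set covers), not something the data set specifies.

```python
import pandas as pd

# Three players from the preview above.
sample = pd.DataFrame({
    "DisplayName": ["Aarts, Jake", "Abbott, Ryan", "Ablett, Gary"],
    "Dob": ["8-Dec-94", "25-Jun-91", "14-May-84"],
})

# Dob looks like day-abbreviated month-two digit year, e.g. "8-Dec-94".
dob = pd.to_datetime(sample["Dob"], format="%d-%b-%y")

# Hypothetical reference date (an assumption, not taken from the data set).
reference_date = pd.Timestamp("2023-12-31")

# Approximate age in years at the reference date.
sample["AgeYears"] = (reference_date - dob).dt.days / 365.25

print(sample)
```

A fixed reference date gives each player one age, whereas build may change over a career; comparing against the date of each game in `games.csv` might be a better fit, but the structure of that file is not shown here.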

Machine Learning

Just for fun, let's fit a linear machine learning model to the players' Height vs Weight

In [21]:
X = players[['Weight']]
y = players['Height']
In [22]:
from sklearn.model_selection import train_test_split
In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
In [24]:
from sklearn.linear_model import LinearRegression
In [25]:
lm = LinearRegression()
In [26]:
lm.fit(X_train,y_train)
Out[26]:
LinearRegression()
In [27]:
lm.get_params()
Out[27]:
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
In [28]:
print(lm.intercept_)
128.93269601484045
In [29]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Out[29]:
Coefficient
Weight 0.686751

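~0.69: each extra kilogram of weight adds about 0.69 cm to the predicted height. The slope alone does not measure how strong the linear relationship is; that needs a correlation coefficient or R²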
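A regression slope depends on the units of the inputs, whereas the correlation coefficient does not, which is why the latter is the usual measure of how strong a linear relationship is. The sketch below illustrates this on the nine preview rows only (so its numbers are not the full data set's), using NumPy rather than the scikit-learn model above.

```python
import numpy as np

# Weight (kg) and height (cm) of the nine players in the preview above.
weights = np.array([75, 100, 87, 90, 79, 93, 83, 83, 84], dtype=float)
heights = np.array([177, 200, 182, 191, 176, 192, 181, 183, 184], dtype=float)

# Least-squares line: height = slope * weight + intercept
slope, intercept = np.polyfit(weights, heights, deg=1)

# Pearson correlation and R^2 (for a one-variable linear fit, R^2 = r^2)
r = np.corrcoef(weights, heights)[0, 1]
r_squared = r ** 2

# Same data with weight expressed in grams: the slope shrinks 1000-fold,
# but the correlation is unchanged.
slope_per_gram = np.polyfit(weights * 1000, heights, deg=1)[0]
r_grams = np.corrcoef(weights * 1000, heights)[0, 1]

print(f"slope={slope:.3f} cm/kg, slope={slope_per_gram:.6f} cm/g")
print(f"r={r:.3f}, r (grams)={r_grams:.3f}, R^2={r_squared:.3f}")
```

With the fitted scikit-learn model, `lm.score(X_test, y_test)` returns R² on the held-out rows.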

Let's draw the predicted line on the original scatterplot to see how it looks

In [30]:
title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")

# Plot the regression line
plt.plot(X,lm.predict(X), label='Regression Line', color='red')
plt.legend()
plt.title(title)
plt.savefig(title)
[Figure: AFL players Height vs Weight, with regression line]

Looks reasonable...

How about we plot the residuals (actual - predicted) to see if there is a pattern

In [31]:
lm_height_predictions = lm.predict(X_test)
lm_residuals = y_test - lm_height_predictions
In [32]:
# Give this figure its own title so that saving it does not overwrite the previous plot
title = "AFL players Height vs Weight linear model residuals"
plt.title(title)
plt.scatter(X_test,lm_residuals)
plt.xlabel("Weight")
plt.ylabel("Height residuals")

plt.savefig(title)
[Figure: AFL players Height vs Weight linear model residuals]

More variability at the lower weights? Or an artifact of the larger sample size there?
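One way to separate the two explanations is to compare the residual spread across weight bins that each hold the same number of players, so that sample size cannot drive the visual impression. The sketch below uses synthetic stand-in data (an assumption, generated with constant spread), not the AFL residuals; with the real data, `Weight` would come from `X_test` and `Residual` from `lm_residuals`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (an assumption), not the AFL data:
# weights spread evenly over a range, residuals with constant spread.
df = pd.DataFrame({
    "Weight": rng.uniform(65, 115, size=300),
    "Residual": rng.normal(0, 4.5, size=300),
})

# Quartile bins hold equal numbers of rows, removing the sample-size effect.
df["WeightBin"] = pd.qcut(df["Weight"], q=4)

# Row count and residual standard deviation per bin.
spread = df.groupby("WeightBin", observed=True)["Residual"].agg(["count", "std"])

print(spread)
```

If the standard deviation were clearly larger in the lowest bins despite equal counts, that would point to genuinely greater variability rather than a sample-size artifact.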

Let's apply some standard tools to these residuals

In [33]:
from sklearn import metrics
In [34]:
print('MAE:', metrics.mean_absolute_error(y_test, lm_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, lm_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lm_height_predictions)))
MAE: 3.5297148653398764
MSE: 21.056239448144126
RMSE: 4.588707818999171

MAE indicates the model is off by an average of ~3.5 cm when predicting a player's Height from their Weight. MSE squares each error before averaging, so it penalizes larger errors more heavily, but its units are cm². RMSE is the square root of MSE, which brings the figure back to cm (~4.6 cm here) while keeping the extra weight on larger errors.
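To make the three metrics concrete, here is a small worked sketch on made-up numbers (three hypothetical players, not taken from the data set). It shows that a single large error pulls RMSE above MAE.

```python
import numpy as np

actual = np.array([180.0, 190.0, 200.0])     # made-up heights in cm
predicted = np.array([182.0, 189.0, 194.0])  # made-up predictions in cm

errors = actual - predicted        # -2, 1 and 6 cm

mae = np.mean(np.abs(errors))      # (2 + 1 + 6) / 3, in cm
mse = np.mean(errors ** 2)         # (4 + 1 + 36) / 3, in cm squared
rmse = np.sqrt(mse)                # back in cm

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f}")
```

RMSE is never smaller than MAE, and the gap between them grows as the errors become more uneven.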

Is there a slight exponential relationship of Height to Weight?

In [35]:
y_log_train = np.log(y_train)
In [36]:
em = LinearRegression().fit(X_train,y_log_train)
In [37]:
em_log_height_predictions = em.predict(X_test)
In [38]:
em_height_predictions = np.exp(em_log_height_predictions)
In [39]:
print('MAE:', metrics.mean_absolute_error(y_test, em_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, em_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, em_height_predictions)))
MAE: 3.521240117676183
MSE: 20.962440540186822
RMSE: 4.578475787878191

No... very similar values to the linear model
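One possible reason the two fits score so similarly is that an exponential curve is nearly a straight line over a narrow range. As a sketch, write $H$ for height, $W$ for weight, $\bar{W}$ for a typical weight, and $a$, $b$ for the intercept and slope of the log model (their fitted values are not shown above):

```latex
\ln H = a + bW
\quad\Longrightarrow\quad
H = e^{a}\,e^{bW}
  = e^{a + b\bar{W}}\,e^{b(W - \bar{W})}
  \approx e^{a + b\bar{W}}\bigl(1 + b\,(W - \bar{W})\bigr)
  \quad\text{when } |b\,(W - \bar{W})| \ll 1
```

Since the shortest player is about 80% of the height of the tallest, $\ln H$ spans at most about $\ln(1/0.8) \approx 0.22$ across the whole data set, so the exponent stays small and the exponential model behaves almost like a linear one.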

Let's plot the residuals of both models together

In [40]:
em_residuals = y_test - em_height_predictions
In [41]:
# Give this figure its own title so that saving it does not overwrite an earlier plot
title = 'AFL players Height vs Weight model residuals'
plt.scatter(X_test,lm_residuals, label = "Linear model", color='blue')
plt.scatter(X_test, em_residuals, label = "Exponential model", color='orange')
plt.title(title)
plt.xlabel("Weight")
plt.ylabel("Height residuals")
plt.legend()
plt.savefig(title)
[Figure: AFL players Height vs Weight model residuals, linear and exponential models]

Possibly some fit differences between the two models at the extremes of weight