import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import kagglehub
path = kagglehub.dataset_download("stoney71/aflstats")

!ls {path}

games.csv   players.csv stats.csv

!head {path}/'players.csv'

PlayerId,DisplayName,Height,Weight,Dob,Position,Origin
2020654979,"Aarts, Jake",177,75,8-Dec-94,Forward,Richmond
2018655703,"Abbott, Ryan",200,100,25-Jun-91,Ruck,Grovedale Tigers
2002652211,"Ablett, Gary",182,87,14-May-84,Forward,Geelong Falcons
2014651814,"Acres, Blake",191,90,7-Oct-95,Midfield,West Perth
2007653719,"Adams, Leigh",176,79,6-Apr-88,Forward,Eastern Ranges
2016652575,"Adams, Marcus",192,93,30-Jun-93,Defender,West Perth
2012656391,"Adams, Taylor",181,83,20-Sep-93,Midfield,Geelong Falcons
2004652814,"Adcock, Jed",183,83,15-Nov-85,"Defender, Forward",GWV Rebels
2006658504,"Addison, Dylan",184,84,7-Oct-87,Forward,St George

!wc -l {path}/'players.csv'

    1650 /Users/nick/.cache/kagglehub/datasets/stoney71/aflstats/versions/4/players.csv

players = pd.read_csv(path +  '/players.csv')
players.head()

title = "AFL player heights"

plt.title(title)
sns.histplot(players, x='Height',discrete=True)
plt.savefig(title)

players['Height'].median()

187.0

players['Height'].mean()

187.7835051546392

title = "AFL player weights"
plt.title(title)
sns.histplot(players, x='Weight', discrete=True)
plt.savefig(title)

players['Weight'].median()

85.0

players['Weight'].mean()

85.72650090964221

title = "AFL players Height vs Weight"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
plt.savefig(title)

title = "AFL players Height vs Weight (axes from 0)"
sns.scatterplot(players, x = "Weight", y= "Height")
plt.title(title)
# Set the x-axis and y-axis to start at 0 before saving,
# so the saved figure includes the new limits
plt.xlim(0)
plt.ylim(0)
plt.savefig(title)

players['Weight'].min()/players['Weight'].max()

0.5423728813559322

players['Height'].min()/players['Height'].max()

0.7962085308056872

# BMI = weight (kg) / height (m) squared; Height is stored in cm
players_BMI = players['Weight'] / (players['Height'] / 100) ** 2

players.insert(loc=4, column='BMI', value=players_BMI)

players.head()

title = "AFL players Height vs Weight by BMI"
sns.scatterplot(players, x = "Weight", y= "Height", hue='BMI', palette='coolwarm')
plt.title(title)
plt.savefig(title)

X = players[['Weight']]
y = players['Height']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

lm.fit(X_train,y_train)

LinearRegression()


lm.get_params()

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

print(lm.intercept_)

128.93269601484045

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

title = "AFL players Height vs Weight with regression line"
sns.scatterplot(players, x = "Weight", y= "Height")

# Plot the regression line
plt.plot(X,lm.predict(X), label='Regression Line', color='red')
plt.legend()
plt.title(title)
plt.savefig(title)

lm_height_predictions = lm.predict(X_test)
lm_residuals = y_test - lm_height_predictions

# Reassign title so savefig does not overwrite the earlier scatterplot file
title = "AFL players Height vs Weight linear model residuals"
plt.title(title)
plt.scatter(X_test,lm_residuals)
plt.xlabel("Weight")
plt.ylabel("Height residuals")

plt.savefig(title)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, lm_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, lm_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lm_height_predictions)))

MAE: 3.5297148653398764
MSE: 21.056239448144126
RMSE: 4.588707818999171

y_log_train = np.log(y_train)

em = LinearRegression().fit(X_train,y_log_train)

em_log_height_predictions = em.predict(X_test)

em_height_predictions = np.exp(em_log_height_predictions)

print('MAE:', metrics.mean_absolute_error(y_test, em_height_predictions))
print('MSE:', metrics.mean_squared_error(y_test, em_height_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, em_height_predictions)))

MAE: 3.521240117676183
MSE: 20.962440540186822
RMSE: 4.578475787878191

em_residuals = y_test - em_height_predictions

# Reassign title so savefig does not reuse the name of an earlier figure
title = 'AFL players Height vs Weight model residuals'
plt.scatter(X_test,lm_residuals, label = "Linear model", color='blue')
plt.scatter(X_test, em_residuals, label = "Exponential model", color='orange')
plt.title(title)
plt.xlabel("Weight")
plt.ylabel("Height residuals")
plt.legend()
plt.savefig(title)

Welcome to Footy Trends!

The Game

Player build

Kaggle is a community where data sets and models are shared. Michael Stone has provided a really good AFL data set, which we'll use. Thanks, Michael! The data set covers the years 2012-2023

How is this dataset structured?

Let's take a peek at the players CSV file

How many players are there in total? Let's count the lines in the file

It's 1650 minus 1 for the header line, so 1649 players

We use the pandas library to import the players CSV file and store it as a DataFrame. We then take a peek at the DataFrame to make sure it was read in correctly

And draw a histogram (one bar per centimetre) to see the height distribution of the players

It looks roughly normal but skewed to the right, towards taller players. Let's calculate the median and the mean
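A quick numeric check of skew is to compare the mean with the median and to look at the sample skewness. The sketch below uses only the nine heights visible in the `head` output of players.csv, so its numbers are illustrative and will differ from the full data set; `sample_heights` is a name introduced here.

```python
import pandas as pd

# Heights (cm) of the nine players shown in the head of players.csv.
# This is an illustrative subset, not the full data set.
sample_heights = pd.Series([177, 200, 182, 191, 176, 192, 181, 183, 184])

mean_height = sample_heights.mean()
median_height = sample_heights.median()
# Positive skewness indicates a longer tail on the right (taller) side
skewness = sample_heights.skew()

print(f"mean={mean_height:.2f} median={median_height:.1f} skew={skewness:.2f}")
```

A mean above the median together with a positive skewness is consistent with a right-skewed distribution.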

What about weights?

Let's do a scatterplot of height vs weight

Starting the axes at the edge of the data may be deceiving. Let's start both axes at 0

This is deceiving in the other direction, but it illustrates the point

Weight varies more than height as a percentage of each. The lightest player is about 54% of the weight of the heaviest player; the shortest player is about 80% of the height of the tallest player
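The min/max ratio depends only on the two most extreme players. As a complementary check, the coefficient of variation (standard deviation divided by mean) uses every row. This is a minimal sketch on the nine rows visible in the `head` output of players.csv, so the numbers are illustrative only; `sample`, `range_ratio` and `cv` are names introduced here.

```python
import pandas as pd

# The nine players shown in the head of players.csv (illustrative subset)
sample = pd.DataFrame({
    "Height": [177, 200, 182, 191, 176, 192, 181, 183, 184],
    "Weight": [75, 100, 87, 90, 79, 93, 83, 83, 84],
})

# Smallest value as a fraction of the largest, per column
range_ratio = sample.min() / sample.max()
# Coefficient of variation: spread relative to the mean, per column
cv = sample.std() / sample.mean()

print(pd.DataFrame({"min/max": range_ratio, "cv": cv}))
```

A lower min/max ratio and a higher coefficient of variation for Weight both point the same way: weight varies more than height in relative terms.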

Though we don't know the players' body fat percentages, BMI is a reasonable proxy for build in professional athletes
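For reference, BMI is weight in kilograms divided by the square of height in metres. The helper below is a small sketch (the function name `bmi` is introduced here, it is not part of the notebook) checked against a player from the sample rows: 200 cm and 100 kg.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100  # the data set stores height in centimetres
    return weight_kg / height_m ** 2


# A 200 cm, 100 kg player, as in the sample rows
print(bmi(100, 200))
```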

Are different player builds more common in smaller or bigger players?

Yes! Lighter builds are more common in lighter players and heavier builds are more common in heavier players. Breaking the hue down by percentile would illustrate this better. I may include some in the future
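One possible way to build a percentile breakdown is `pd.qcut`, which splits a column into equal-sized quantile bins. The sketch below is an assumption about how that could look, not the author's planned approach: it uses the nine rows visible in the `head` output of players.csv, three bins, and the hypothetical labels `lighter`, `middle` and `heavier`.

```python
import pandas as pd

# The nine players shown in the head of players.csv (illustrative subset)
sample = pd.DataFrame({
    "Height": [177, 200, 182, 191, 176, 192, 181, 183, 184],
    "Weight": [75, 100, 87, 90, 79, 93, 83, 83, 84],
})
sample["BMI"] = sample["Weight"] / (sample["Height"] / 100) ** 2

# Split BMI into three equal-sized groups (terciles)
sample["Build"] = pd.qcut(
    sample["BMI"], q=3, labels=["lighter", "middle", "heavier"]
)

print(sample.sort_values("Weight"))
```

The resulting `Build` column is categorical, so it could be passed as `hue="Build"` to `sns.scatterplot` in place of the continuous BMI values.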

For heavier builds in heavier players, this may be due to specific roles where physicality is important but the player does not need to move far in a given play or game, so endurance is less of a requirement

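Machine Learning

Just for fun, let's fit a linear machine learning model that predicts the players' Height from their Weight

The coefficient is ~0.69, meaning each extra kilogram of weight corresponds to about 0.69 cm of extra height. This is a slope rather than a correlation, so on its own it does not tell us how strong the linear relationship is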
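To judge the strength of a linear relationship, the Pearson correlation r (or R², its square in simple regression) is more informative than the slope, because the slope depends on the units of both variables. The sketch below computes both with NumPy on the nine rows visible in the `head` output of players.csv, so the values are illustrative and will not match the full data set.

```python
import numpy as np

# The nine players shown in the head of players.csv (illustrative subset)
weight = np.array([75, 100, 87, 90, 79, 93, 83, 83, 84], dtype=float)
height = np.array([177, 200, 182, 191, 176, 192, 181, 183, 184], dtype=float)

# Least-squares line: height = intercept + slope * weight
slope, intercept = np.polyfit(weight, height, deg=1)

# Pearson correlation, which is unit-free and lies between -1 and 1
r = np.corrcoef(weight, height)[0, 1]
r_squared = r ** 2

print(f"slope={slope:.3f} cm/kg intercept={intercept:.1f} cm")
print(f"r={r:.3f} r_squared={r_squared:.3f}")
```

The two quantities are linked by slope = r × (std of Height / std of Weight), which is why the slope changes if the units change while r does not.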

Let's draw the predicted line on the original scatterplot to see how it looks

Looks reasonable...

How about we plot the residuals (actual - predicted) to see if there is a pattern

More volatility at the lower weights? Or an artifact of a larger sample size?
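One way to separate the two explanations is to bin the residuals by weight and report the standard deviation next to the count for each bin. The visual range of a scatter grows with the number of points even when the spread is constant, whereas the standard deviation does not. The sketch below uses synthetic data generated with a fixed seed purely to show the mechanics; the column names and the choice of four bins are assumptions, and the numbers say nothing about the AFL data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for test-set weights and model residuals
rng = np.random.default_rng(0)
n = 300
frame = pd.DataFrame({
    "Weight": rng.normal(86, 8, size=n),
    "Residual": rng.normal(0, 4.5, size=n),
})

# Four equal-width weight bins
frame["WeightBin"] = pd.cut(frame["Weight"], bins=4)

# Spread and sample size per bin
summary = (
    frame.groupby("WeightBin", observed=True)["Residual"]
    .agg(["count", "std"])
)
print(summary)
```

If the `std` column is similar across bins while `count` differs, the apparent extra volatility is more likely a sample size effect.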

Let's apply some standard tools to these residuals

MAE indicates the model is off by an average of ~3.5 cm when predicting a player's Height from their Weight. MSE averages the squared errors, so it penalizes larger errors more heavily. RMSE is the square root of MSE, which puts it back in the same units as Height (cm)
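To see how the three metrics react to large errors, here is a small sketch with two made-up sets of prediction errors that share the same MAE. The helper `error_metrics` and the error values are hypothetical and chosen only for easy arithmetic.

```python
import numpy as np


def error_metrics(errors):
    """Return (MAE, MSE, RMSE) for a sequence of prediction errors."""
    errors = np.asarray(errors, dtype=float)
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    return mae, mse, rmse


# Same total absolute error, spread evenly vs concentrated in one prediction
even = error_metrics([3, 3, 3, 3])
uneven = error_metrics([1, 1, 1, 9])

print("even:  ", even)
print("uneven:", uneven)
```

Both sets have an MAE of 3, but the single error of 9 raises the MSE from 9 to 21 and the RMSE from 3 to about 4.58. An RMSE noticeably above the MAE is a sign that a few predictions are off by much more than the rest.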

Is there a slight exponential relationship of Height to Weight? To check, we fit a linear model to log(Height) and exponentiate its predictions

No... the error values are very similar to those of the linear model
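The comparison can be reproduced in miniature with NumPy alone. This sketch fits both models to the nine rows visible in the `head` output of players.csv and scores them on the same rows, so unlike the train/test split used in the notebook it is an in-sample comparison, and its numbers are illustrative only. The helper `rmse` is a name introduced here.

```python
import numpy as np

# The nine players shown in the head of players.csv (illustrative subset)
weight = np.array([75, 100, 87, 90, 79, 93, 83, 83, 84], dtype=float)
height = np.array([177, 200, 182, 191, 176, 192, 181, 183, 184], dtype=float)


def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((actual - predicted) ** 2))


# Linear model: height = a + b * weight
b_lin, a_lin = np.polyfit(weight, height, deg=1)
linear_pred = a_lin + b_lin * weight

# Exponential model: log(height) = c + d * weight,
# so height = exp(c + d * weight)
d_exp, c_exp = np.polyfit(weight, np.log(height), deg=1)
exp_pred = np.exp(c_exp + d_exp * weight)

rmse_lin = rmse(height, linear_pred)
rmse_exp = rmse(height, exp_pred)
print(f"linear RMSE={rmse_lin:.3f} exponential RMSE={rmse_exp:.3f}")
```

Over a narrow range of weights an exponential curve is almost a straight line, which is one reason the two models can end up with nearly identical errors.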

Let's plot the residuals of both models together

Possibly some fit differences between the models at the extremes of weight

	PlayerId	DisplayName	Height	Weight	Dob	Position	Origin
0	2020654979	Aarts, Jake	177	75	8-Dec-94	Forward	Richmond
1	2018655703	Abbott, Ryan	200	100	25-Jun-91	Ruck	Grovedale Tigers
2	2002652211	Ablett, Gary	182	87	14-May-84	Forward	Geelong Falcons
3	2014651814	Acres, Blake	191	90	7-Oct-95	Midfield	West Perth
4	2007653719	Adams, Leigh	176	79	6-Apr-88	Forward	Eastern Ranges

	PlayerId	DisplayName	Height	Weight	BMI	Dob	Position	Origin
0	2020654979	Aarts, Jake	177	75	23.939481	8-Dec-94	Forward	Richmond
1	2018655703	Abbott, Ryan	200	100	25.000000	25-Jun-91	Ruck	Grovedale Tigers
2	2002652211	Ablett, Gary	182	87	26.264944	14-May-84	Forward	Geelong Falcons
3	2014651814	Acres, Blake	191	90	24.670376	7-Oct-95	Midfield	West Perth
4	2007653719	Adams, Leigh	176	79	25.503616	6-Apr-88	Forward	Eastern Ranges