colindcarroll.com | ColCarroll | @colindcarroll
get_df(2016).head()
|   | ast | blk | day_num | dr | fga | fga3 | fgm | fgm3 | fta | ftm | ... | num_ot | or | pf | score | season | stl | team | team_name | to | won |
|---|-----|-----|---------|----|-----|------|-----|------|-----|-----|-----|--------|----|----|-------|--------|-----|------|-----------|----|-----|
| 0 | 13 | 6 | 11 | 28 | 57 | 17 | 29 | 4 | 27 | 15 | ... | 0 | 10 | 21 | 77 | 2016 | 11 | 1104 | Alabama | 12 | True |
| 1 | 6 | 4 | 11 | 27 | 55 | 19 | 19 | 7 | 26 | 19 | ... | 0 | 12 | 25 | 64 | 2016 | 7 | 1244 | Kennesaw | 16 | False |
| 2 | 10 | 1 | 11 | 23 | 64 | 29 | 25 | 8 | 17 | 10 | ... | 1 | 14 | 25 | 68 | 2016 | 12 | 1105 | Alabama A&M | 15 | True |
| 3 | 11 | 7 | 11 | 30 | 57 | 27 | 22 | 7 | 26 | 16 | ... | 1 | 18 | 21 | 67 | 2016 | 6 | 1408 | Tulane | 19 | False |
| 4 | 12 | 6 | 11 | 34 | 61 | 20 | 24 | 6 | 32 | 25 | ... | 0 | 17 | 21 | 79 | 2016 | 4 | 1112 | Arizona | 14 | True |

5 rows × 22 columns
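`get_df` isn't defined on these slides. A plausible minimal sketch, assuming the Kaggle detailed-results CSV from that era (the file name and Kaggle column names are assumptions), splits each game into a winner row and a loser row:

```python
import pandas as pd

def get_df(season):
    """One row per team per game (CSV name and column names are assumptions)."""
    games = pd.read_csv("RegularSeasonDetailedResults.csv")
    games = games[games.Season == season]

    def team_rows(prefix, won):
        # Kaggle prefixes winner columns with "W" and loser columns with "L"
        # (Wteam, Wscore, Wfgm, ...); strip the prefix and lowercase.
        cols = {c: c[len(prefix):].lower() for c in games if c.startswith(prefix)}
        side = games[list(cols)].rename(columns=cols)
        side["won"] = won
        side["season"] = games["Season"].values
        side["day_num"] = games["Daynum"].values
        return side

    # Each game contributes one winner row and one loser row.
    return pd.concat([team_rows("W", True), team_rows("L", False)],
                     ignore_index=True)
```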
predict('North Carolina', 'Connecticut')
North Carolina has a 78% chance of beating Connecticut
Predicted Score: North Carolina 79 Connecticut 70
Regression
predict_scores('North Carolina', 'Connecticut')
'North Carolina 79 Connecticut 70'
Classification
predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'
It is surprisingly hard not to peek at the future: each game's features may only be built from games played before it (see the sketch after the table below).
get_features(get_training_data(results_2016)).head()
|   | avg_score_ | avg_score_opponent | home_game_ | home_game_opponent | transformed_fg_pct_ | transformed_fg_pct_opponent | transformed_avg_won_ | transformed_avg_won_opponent |
|---|------------|--------------------|------------|---------------------|---------------------|------------------------------|----------------------|------------------------------|
| 0 | 69.7 | 73.4 | True | False | -0.3 | -0.2 | -1.3 | -1.5 |
| 1 | 69.4 | 72.2 | False | True | -0.3 | -0.4 | -1.9 | -0.7 |
| 2 | 69.5 | 74.6 | False | True | -0.2 | -0.2 | -0.7 | 0.3 |
| 3 | 86.5 | 67.2 | True | False | 0.2 | -0.3 | 0.5 | 1.4 |
| 4 | 77.7 | 74.6 | False | True | -0.4 | -0.1 | 0.5 | 0.4 |
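`get_features` isn't shown either; a hypothetical sketch of the leakage guard is to compute each team's running averages with a one-game shift, so a game's features never include that game's own result:

```python
import pandas as pd

def add_avg_score(games: pd.DataFrame) -> pd.DataFrame:
    """Attach each team's average score over *previous* games only."""
    games = games.sort_values("day_num").copy()
    # shift(1) drops the current game from the running mean; without it,
    # the feature would peek at the very score it is trying to predict.
    games["avg_score_"] = (
        games.groupby("team")["score"]
             .transform(lambda s: s.shift(1).expanding().mean())
    )
    return games
```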
When using linear regression, we assume that $$ \mathbf{score} = w_1 \cdot \mathbf{avg\_score} + w_2 \cdot \mathbf{fg\_pct} + \cdots + w_m \cdot \mathbf{win\_pct} $$
Try to find $(w_1, w_2, \ldots, w_m)$.
> There's linear algebra, and then there's everything else we do to pay the bills so we can do more linear algebra.
>
> — Tim Hopper 🔭 (@tdhopper) January 5, 2016
Try to find a $\mathbf{w}$ satisfying $\mathbf{y} = X\mathbf{w}$.
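A minimal numpy sketch of that solve (the numbers are stand-ins pulled from the feature table above, not the talk's actual training set):

```python
import numpy as np

# Three features per game: avg_score_, home_game_, transformed_fg_pct_.
X = np.array([[69.7, 1.0, -0.3],
              [69.4, 0.0, -0.3],
              [69.5, 0.0, -0.2],
              [86.5, 1.0,  0.2]])
y = np.array([77.0, 64.0, 68.0, 67.0])  # scores to predict

# np.linalg.lstsq finds the w minimizing ||X w - y||^2.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```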
explain_model()
predicted_score_first =
    +0.60 * avg_score_
    +0.32 * avg_score_opponent
    +3.34 * transformed_fg_pct_
    -1.00 * transformed_fg_pct_opponent
    -0.16 * transformed_avg_won_
    -15.28 * transformed_avg_won_opponent
    +0.11 * home_game_
    -0.65 * home_game_opponent
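Output like that can be produced by zipping fitted coefficients with feature names. A hypothetical version (the slides' `explain_model` takes no arguments; this sketch passes the model and names explicitly):

```python
from sklearn.linear_model import LinearRegression

def explain_model(model, feature_names):
    """Render a fitted linear model as '+w1 * name1 +w2 * name2 ...'."""
    terms = ("%+.2f * %s" % (coef, name)
             for coef, name in zip(model.coef_, feature_names))
    return "predicted_score_first = " + " ".join(terms)

# X and y as in the least-squares sketch above.
model = LinearRegression().fit(X, y)
print(explain_model(model, ["avg_score_", "home_game_", "transformed_fg_pct_"]))
```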
The loss is the sum of the squared differences between the predictions and the labels.
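In the matrix notation above, that loss is $$ L(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \sum_{i=1}^{n} \left( \mathbf{x}_i^\top \mathbf{w} - y_i \right)^2, $$ and the least-squares $\mathbf{w}$ is its minimizer.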
Given our data, $(\color{red}X, \mathbf{y}\color{black})$, Bayes' Rule says that $$ P( \color{blue}\mathbf{w} \color{black}| \color{red} X, \mathbf{y} \color{black})~ \propto P( \color{red}X, \mathbf{y} \color{black}| \color{blue} \mathbf{w} \color{black}) P(\color{blue} \mathbf{w} \color{black}) $$
The probability of the weights given the data is proportional to the probability of the data given the weights times the probability of the weights.
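To make that concrete, a sketch (not from the talk): with a Gaussian prior $\mathbf{w} \sim N(0, \alpha^{-1}I)$ and Gaussian noise with variance $\sigma^2$, the posterior over $\mathbf{w}$ is Gaussian with a closed form:

```python
import numpy as np

def posterior(X, y, alpha=1.0, sigma=1.0):
    """Posterior mean and covariance of w under the prior w ~ N(0, I/alpha)
    and the likelihood y ~ N(X w, sigma**2 I)."""
    precision = alpha * np.eye(X.shape[1]) + (X.T @ X) / sigma ** 2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma ** 2
    return mean, cov

# As alpha -> 0 (a flat prior), the posterior mean approaches the
# least-squares solution from the sketch above.
```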
$X\mathbf{w}$ is the nearest point to $\mathbf{y}$ in the $m$-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$.
(*not actually guaranteed)
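A quick numerical check of that picture (random stand-in data): the least-squares residual $\mathbf{y} - X\mathbf{w}$ is orthogonal to every column of $X$, which is exactly what makes $X\mathbf{w}$ the nearest point in the subspace.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))  # n = 10 games, m = 3 features (stand-ins)
y = rng.normal(size=10)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w

# X w is the projection of y onto the column space of X, so the
# residual has zero inner product with each column (up to float error).
print(X.T @ residual)
```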
predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'
| Linear Regression | Logistic Regression |
|---|---|
| $$\mathbf{y} = X\mathbf{w}$$ | $$\mathbf{y} = \sigma(X\mathbf{w})$$ |
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
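A hypothetical scikit-learn sketch of the classification side (the features and win/loss labels are stand-ins from the tables above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[69.7, 1.0, -0.3],
              [69.4, 0.0, -0.3],
              [69.5, 0.0, -0.2],
              [86.5, 1.0,  0.2]])
won = np.array([True, False, True, False])

# Logistic regression fits y = sigma(X w) to the win/loss labels;
# predict_proba applies the sigmoid to return win probabilities.
clf = LogisticRegression().fit(X, won)
print(clf.predict_proba(X[:1]))  # columns: P(lost), P(won)
```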
Stan | Edward | emcee
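Of those tools, emcee is the most bare-bones: you hand it a log-posterior and it samples. A minimal sketch for the regression above (the data and priors are stand-ins, not from the talk):

```python
import numpy as np
import emcee

def log_prob(w, X, y, sigma=1.0):
    # Log-posterior: standard-normal prior on w plus Gaussian likelihood.
    log_prior = -0.5 * np.sum(w ** 2)
    residual = y - X @ w
    log_like = -0.5 * np.sum(residual ** 2) / sigma ** 2
    return log_prior + log_like

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # stand-in features
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=20)

ndim, nwalkers = 3, 8
p0 = rng.normal(size=(nwalkers, ndim))          # initial walker positions
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=(X, y))
sampler.run_mcmc(p0, 1000)
samples = sampler.get_chain(flat=True)          # posterior draws of w
print(samples.mean(axis=0))
```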