
Machine Learning

and

Probabilistic Programming

Colin Carroll, Kensho

colindcarroll.com
ColCarroll
@colindcarroll

Kensho

Technology that brings transparency to complex systems

Motivating example from

Raw Data

get_df(2016).head()
ast blk day_num dr fga fga3 fgm fgm3 fta ftm ... num_ot or pf score season stl team team_name to won
0 13 6 11 28 57 17 29 4 27 15 ... 0 10 21 77 2016 11 1104 Alabama 12 True
1 6 4 11 27 55 19 19 7 26 19 ... 0 12 25 64 2016 7 1244 Kennesaw 16 False
2 10 1 11 23 64 29 25 8 17 10 ... 1 14 25 68 2016 12 1105 Alabama A&M 15 True
3 11 7 11 30 57 27 22 7 26 16 ... 1 18 21 67 2016 6 1408 Tulane 19 False
4 12 6 11 34 61 20 24 6 32 25 ... 0 17 21 79 2016 4 1112 Arizona 14 True

5 rows × 22 columns

Model

predict('North Carolina', 'Connecticut')
North Carolina has a 78% chance of beating Connecticut
Predicted Score:
	North Carolina 79 Connecticut 70

What is going on here?

Two models:

Regression

predict_scores('North Carolina', 'Connecticut')
'North Carolina 79 Connecticut 70'

Classification

predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'

Model Building

what-if.xkcd.com/5
  • Transform raw data to features

  • Train a model

  • Measure how accurate you expect the model to be
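In code, those three steps look roughly like the sketch below. The arrays are made-up stand-ins for the real features and scores, and scikit-learn's LinearRegression is just one reasonable choice of model.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data: in the real pipeline, step 1 turns the raw box scores
# into one row of features per game and a score to predict.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(size=500)

# Step 2: train a model on part of the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Step 3: estimate accuracy on games the model has never seen.
print(mean_absolute_error(y_test, model.predict(X_test)))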

Turning raw data into features

Surprisingly hard not to peek at the future.

get_features(get_training_data(results_2016)).head() 
avg_score_ avg_score_opponent home_game_ home_game_opponent transformed_fg_pct_ transformed_fg_pct_opponent transformed_avg_won_ transformed_avg_won_opponent
0 69.7 73.4 True False -0.3 -0.2 -1.3 -1.5
1 69.4 72.2 False True -0.3 -0.4 -1.9 -0.7
2 69.5 74.6 False True -0.2 -0.2 -0.7 0.3
3 86.5 67.2 True False 0.2 -0.3 0.5 1.4
4 77.7 74.6 False True -0.4 -0.1 0.5 0.4
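One way to avoid peeking, sketched below with a hypothetical per-team game log: build each team's running average only from games that happened earlier, by shifting the expanding mean down one game. The column name matches the feature table above, but the construction here is an assumption.

import pandas as pd

# Hypothetical game log, already sorted chronologically within each team.
games = pd.DataFrame({
    "team": ["UNC", "UNC", "UNC", "UConn", "UConn", "UConn"],
    "score": [80, 72, 91, 65, 77, 70],
})

# The feature for game n only sees games 1..n-1, so no future information leaks in.
games["avg_score_"] = (
    games.groupby("team")["score"]
         .transform(lambda s: s.expanding().mean().shift(1))
)
print(games)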

Nonlinear models

Turning features into a model

When using linear regression we assume that $$ \mathbf{score} = w_1 \cdot \mathbf{avg\_score} + w_2 \cdot \mathbf{fg\_pct} + \cdots + w_m \cdot \mathbf{win\_pct} $$

Try to find $(w_1, w_2, \ldots, w_m)$.

More concisely

Try to find a $\mathbf{w}$ satisfying $\mathbf{y} = X\mathbf{w}$.
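With NumPy, finding a least-squares $\mathbf{w}$ is one call; the X and y here are made-up stand-ins for the feature matrix and the scores.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                      # one row per game, one column per feature
y = X @ rng.normal(size=8) + rng.normal(size=200)  # scores to predict

# Least-squares solution to y = X w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)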

explain_model()
predicted_score_first =
	+0.60 * avg_score_
	+0.32 * avg_score_opponent
	+3.34 * transformed_fg_pct_
	-1.00 * transformed_fg_pct_opponent
	-0.16 * transformed_avg_won_
	-15.28 * transformed_avg_won_opponent
	+0.11 * home_game_
	-0.65 * home_game_opponent

What can we say about linear regression?

Linear regression minimizes the sum of squared errors

$$error(\mathbf{w}) = \sum_j (\color{red}\mathbf{x}_j \cdot \mathbf{w} \color{black} - \color{blue}y_j\color{black})^2$$
The sum of the squared differences between the predictions and the labels

Linear regression finds the most likely weights

Given our data, $(\color{red}X, \mathbf{y}\color{black})$, Bayes' Rule says that $$ P( \color{blue}\mathbf{w} \color{black}| \color{red} X, \mathbf{y} \color{black})~ \propto P( \color{red}X, \mathbf{y} \color{black}| \color{blue} \mathbf{w} \color{black}) P(\color{blue} \mathbf{w} \color{black}) $$

The probability of the weights given the data is proportional to the probability of the data given the weights times the probability of the weights.
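To connect this with the previous slide, assume the noise on each game is independent Gaussian with variance $\sigma^2$ and put a flat prior on $\mathbf{w}$. Then

$$ P(X, \mathbf{y} \mid \mathbf{w}) \propto \prod_j \exp\left(-\frac{(\mathbf{x}_j \cdot \mathbf{w} - y_j)^2}{2\sigma^2}\right) = \exp\left(-\frac{1}{2\sigma^2} \sum_j (\mathbf{x}_j \cdot \mathbf{w} - y_j)^2\right), $$

so the $\mathbf{w}$ that maximizes the posterior is exactly the one that minimizes the sum of squared errors.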

Linear regression is geometrically pleasant

(and syntactically terrifying)

$X\mathbf{w}$ is the nearest point to $\mathbf{y}$ in the $m$-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$.
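A quick numerical check of that picture (with made-up data): the least-squares fit $X\mathbf{w}$ agrees with the orthogonal projection of $\mathbf{y}$ onto the column space of $X$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
projection = X @ np.linalg.pinv(X) @ y      # project y onto the columns of X
print(np.allclose(X @ w, projection))       # True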

More guarantees* for linear regression:

  • If there is no noise, the true $\mathbf{w}$ will be recovered
  • $\mathbf{w}$ is unique
  • $\mathbf{w}$ exists

(*not actually guaranteed)

How wrong will I be?

A brief warning

Cross validation, testing, overfitting...
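For instance, k-fold cross validation estimates out-of-sample error before you trust the model; a sketch with stand-in data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(size=500)

# Average error over 5 held-out folds; a big gap versus training error suggests overfitting.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())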

Logistic regression, briefly

predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'
Linear Regression: $$\mathbf{y} = X\mathbf{w}$$
Logistic Regression: $$\mathbf{y} = \sigma(X\mathbf{w})$$

What is $\sigma$?

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
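As code it is tiny: it squashes any real number into (0, 1), which is what turns $X\mathbf{w}$ into a probability like the 78% above.

import numpy as np

def sigma(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

print(sigma(0.0))    # 0.5 -- a toss-up
print(sigma(1.27))   # about 0.78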

What is $\sigma$?

Challenger Disaster

Challenger dataset

Teaser: Probabilistic Programming

Probabilistic Programming

Stan, Edward, emcee

Probabilistic Programming

Let's explicitly write down our model and data

$$ \mathbf{y} = \mathbf{x} \cdot \text{weights} + \text{intercept} + Normal(0, \sigma) $$
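A sketch of what fitting this model could look like with emcee (one of the libraries above). The toy data, flat priors, and variable names here are assumptions for illustration, not the exact model from the talk.

import numpy as np
import emcee

# Toy data standing in for the real features and scores.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

def log_prob(params, x, y):
    # Log posterior for y = x * weights + intercept + Normal(0, sigma), flat priors.
    weights, intercept, log_sigma = params
    resid = y - (weights * x + intercept)
    return -0.5 * np.sum((resid / np.exp(log_sigma)) ** 2) - len(y) * log_sigma

nwalkers, ndim = 32, 3
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=(x, y))
sampler.run_mcmc(rng.normal(size=(nwalkers, ndim)), 2000)
posterior = sampler.get_chain(discard=500, flat=True)   # draws of (weights, intercept, log_sigma)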

Resources