
Machine Learning

and

Probabilistic Programming

Colin Carroll, Kensho

colindcarroll.com
ColCarroll
@colindcarroll

Kensho

Technology that brings transparency to complex systems

Motivating example from

Raw Data

get_df(2016).head()
ast blk day_num dr fga fga3 fgm fgm3 fta ftm ... num_ot or pf score season stl team team_name to won
0 13 6 11 28 57 17 29 4 27 15 ... 0 10 21 77 2016 11 1104 Alabama 12 True
1 6 4 11 27 55 19 19 7 26 19 ... 0 12 25 64 2016 7 1244 Kennesaw 16 False
2 10 1 11 23 64 29 25 8 17 10 ... 1 14 25 68 2016 12 1105 Alabama A&M 15 True
3 11 7 11 30 57 27 22 7 26 16 ... 1 18 21 67 2016 6 1408 Tulane 19 False
4 12 6 11 34 61 20 24 6 32 25 ... 0 17 21 79 2016 4 1112 Arizona 14 True

5 rows × 22 columns

Model

predict('North Carolina', 'Connecticut')
North Carolina has a 78% chance of beating Connecticut
Predicted Score:
	North Carolina 79 Connecticut 70

What is going on here?

Two models:

Regression

predict_scores('North Carolina', 'Connecticut')
'North Carolina 79 Connecticut 70'

Classification

predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'

Model Building

what-if.xkcd.com/5
  • Transform raw data to features

  • Train a model

  • Measure how accurate you expect the model to be
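In code, those three steps look roughly like the sketch below. The arrays are made-up stand-ins for the real features and scores, and scikit-learn's LinearRegression is just one reasonable choice of model.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data: in the real pipeline, step 1 turns the raw box scores
# into one row of features per game and a score to predict.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(size=500)

# Step 2: train a model on part of the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Step 3: estimate accuracy on games the model has never seen.
print(mean_absolute_error(y_test, model.predict(X_test)))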

Turning raw data into features

Surprisingly hard not to peek at the future.

get_features(get_training_data(results_2016)).head() 
avg_score_ avg_score_opponent home_game_ home_game_opponent transformed_fg_pct_ transformed_fg_pct_opponent transformed_avg_won_ transformed_avg_won_opponent
0 69.7 73.4 True False -0.3 -0.2 -1.3 -1.5
1 69.4 72.2 False True -0.3 -0.4 -1.9 -0.7
2 69.5 74.6 False True -0.2 -0.2 -0.7 0.3
3 86.5 67.2 True False 0.2 -0.3 0.5 1.4
4 77.7 74.6 False True -0.4 -0.1 0.5 0.4
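One way to avoid peeking, sketched below with a hypothetical per-team game log: build each team's running average only from games that happened earlier, by shifting the expanding mean down one game. The column name matches the feature table above, but the construction here is an assumption.

import pandas as pd

# Hypothetical game log, already sorted chronologically within each team.
games = pd.DataFrame({
    "team": ["UNC", "UNC", "UNC", "UConn", "UConn", "UConn"],
    "score": [80, 72, 91, 65, 77, 70],
})

# The feature for game n only sees games 1..n-1, so no future information leaks in.
games["avg_score_"] = (
    games.groupby("team")["score"]
         .transform(lambda s: s.expanding().mean().shift(1))
)
print(games)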

Nonlinear models

Turning features into a model

When using linear regression we assume that $$ \mathbf{score} = w_1 \cdot \mathbf{avg\_score} + w_2 \cdot \mathbf{fg\_pct} + \cdots + w_m \cdot \mathbf{win\_pct} $$

Try to find $(w_1, w_2, \ldots, w_m)$.

More concisely

Try to find a $\mathbf{w}$ satisfying $\mathbf{y} = X\mathbf{w}$.
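With NumPy, finding a least-squares $\mathbf{w}$ is one call; the X and y here are made-up stand-ins for the feature matrix and the scores.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                      # one row per game, one column per feature
y = X @ rng.normal(size=8) + rng.normal(size=200)  # scores to predict

# Least-squares solution to y = X w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)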

explain_model()
predicted_score_first =
	+0.60 * avg_score_
	+0.32 * avg_score_opponent
	+3.34 * transformed_fg_pct_
	-1.00 * transformed_fg_pct_opponent
	-0.16 * transformed_avg_won_
	-15.28 * transformed_avg_won_opponent
	+0.11 * home_game_
	-0.65 * home_game_opponent

What can we say about linear regression?

Linear regression minimizes the sum of squared errors

$$error(\mathbf{w}) = \sum_j (\color{red}\mathbf{x}_j \cdot \mathbf{w} \color{black} - \color{blue}y_j\color{black})^2$$
The sum of the squared differences between the predictions and the labels

Linear regression finds the most likely weights

Given our data, $(\color{red}X, \mathbf{y}\color{black})$, Bayes' Rule says that $$ P( \color{blue}\mathbf{w} \color{black}| \color{red} X, \mathbf{y} \color{black})~ \propto P( \color{red}X, \mathbf{y} \color{black}| \color{blue} \mathbf{w} \color{black}) P(\color{blue} \mathbf{w} \color{black}) $$

The probability of the weights given the data is proportional to the probability of the data given the weights times the probability of the weights.
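To connect this with the previous slide, assume the noise on each game is independent Gaussian with variance $\sigma^2$ and put a flat prior on $\mathbf{w}$. Then

$$ P(X, \mathbf{y} \mid \mathbf{w}) \propto \prod_j \exp\left(-\frac{(\mathbf{x}_j \cdot \mathbf{w} - y_j)^2}{2\sigma^2}\right) = \exp\left(-\frac{1}{2\sigma^2} \sum_j (\mathbf{x}_j \cdot \mathbf{w} - y_j)^2\right), $$

so the $\mathbf{w}$ that maximizes the posterior is exactly the one that minimizes the sum of squared errors.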

Linear regression is geometrically pleasant

(and syntactically terrifying)

$X\mathbf{w}$ is the nearest point to $\mathbf{y}$ in the $m$-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$.
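A quick numerical check of that picture (with made-up data): the least-squares fit $X\mathbf{w}$ agrees with the orthogonal projection of $\mathbf{y}$ onto the column space of $X$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
projection = X @ np.linalg.pinv(X) @ y      # project y onto the columns of X
print(np.allclose(X @ w, projection))       # True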

More guarantees* for linear regression:

  • If there is no noise, the true $\mathbf{w}$ will be recovered
  • $\mathbf{w}$ is unique
  • $\mathbf{w}$ exists

(*not actually guaranteed)

How wrong will I be?

A brief warning

Cross validation, testing, overfitting...
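For instance, k-fold cross validation estimates out-of-sample error before you trust the model; a sketch with stand-in data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(size=500)

# Average error over 5 held-out folds; a big gap versus training error suggests overfitting.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())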

Logistic regression, briefly

predict_winner('North Carolina', 'Connecticut')
'North Carolina has a 78% chance of beating Connecticut'
Linear Regression: $$\mathbf{y} = X\mathbf{w}$$
Logistic Regression: $$\mathbf{y} = \sigma(X\mathbf{w})$$

What is $\sigma$?

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
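As code it is tiny: it squashes any real number into (0, 1), which is what turns $X\mathbf{w}$ into a probability like the 78% above.

import numpy as np

def sigma(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

print(sigma(0.0))    # 0.5 -- a toss-up
print(sigma(1.27))   # about 0.78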

What is $\sigma$?

Challenger Disaster

Challenger dataset

Teaser: Probabilistic Programming

Probabilistic Programming

Stan, Edward, emcee

Probabilistic Programming

Let's explicitly write down our model and data

$$ \mathbf{y} = \mathbf{x} \cdot \text{weights} + \text{intercept} + Normal(0, \sigma) $$
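A sketch of what fitting this model could look like with emcee (one of the libraries above). The toy data, flat priors, and variable names here are assumptions for illustration, not the exact model from the talk.

import numpy as np
import emcee

# Toy data standing in for the real features and scores.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

def log_prob(params, x, y):
    # Log posterior for y = x * weights + intercept + Normal(0, sigma), flat priors.
    weights, intercept, log_sigma = params
    resid = y - (weights * x + intercept)
    return -0.5 * np.sum((resid / np.exp(log_sigma)) ** 2) - len(y) * log_sigma

nwalkers, ndim = 32, 3
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=(x, y))
sampler.run_mcmc(rng.normal(size=(nwalkers, ndim)), 2000)
posterior = sampler.get_chain(discard=500, flat=True)   # draws of (weights, intercept, log_sigma)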

Resources