Hadley Wickham (paraphrased):
Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.
lmplot in seaborn: lm stands for linear model
The linear model:
\[ Y = Xb + e \]
where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”.
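To make the notation concrete, here is a minimal sketch that simulates data from \(Y = Xb + e\) and recovers \(b\) by least squares. The sample size, the "true" parameter values, and the noise level are all made up for illustration; only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
# Design matrix X: a column of ones (intercept) plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
b_true = np.array([1.0, 2.0, -0.5])   # made-up "true" parameters
e = rng.normal(scale=0.1, size=n)     # errors
Y = X @ b_true + e

# Least-squares estimate of b
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residuals: the part of Y the fitted model does not explain
residuals = Y - X @ b_hat
print(b_hat)
```

The estimates in b_hat should land close to b_true because the simulated errors are small relative to the signal.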
Substantial statistical functionality is available in the statsmodels package, which is available in CoCalc.
Example: statistical modeling for inference
Let’s get an example ready:
Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Supervised machine learning: model “trained” and optimized for predictive power
Regression problem: predicting a numeric outcome
Classification problem: predicting a categorical outcome
Task: predict the median earnings of graduates 10 years after graduation based on a series of college characteristics
Method: train a model on a subset of the data, then test the model on the remaining subset
Example method: random forest regression
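The train-then-test workflow might look like the sketch below. It uses synthetic data rather than the college dataset, and the column names (x1, x2, outcome) and model settings are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
# Synthetic stand-in for the real data; column names are hypothetical
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
})
df['outcome'] = 3 * df['x1'] - 2 * df['x2'] + rng.normal(scale=0.5, size=n)
features = ['x1', 'x2']

# Default split: 75 percent training, 25 percent test
train, test = train_test_split(df, random_state=0)

# Fit on the training subset only
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(train[features], train['outcome'])

print(rf.oob_score_)                              # R^2 on out-of-bag samples
print(rf.score(test[features], test['outcome']))  # R^2 on the held-out test set
```

Scoring on rows the model never saw during fitting gives a more honest estimate of predictive power than scoring on the training data.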
The train_test_split() function randomly splits your data into training (75 percent, by default) and test datasets.
# Out-of-bag score: how the model performs on the out-of-bag samples
print(rf1.oob_score_)
# Feature importance plot
fip = pd.DataFrame(data = {'importance': rf1.feature_importances_,
                           'feature': features})
fip.sort_values('importance', ascending = False, inplace = True)
sns.barplot(x = 'importance', y = 'feature', data = fip)
from sklearn.metrics import confusion_matrix
predicted_class = rf2.predict(test[features2])
# Prediction accuracy on test set
rf2.score(test[features2], test['is_private'])
# "Confusion" matrix (rows: actual class, columns: predicted class)
confusion_matrix(test['is_private'], predicted_class)
# What did we get wrong?
nomatch = test[test['is_private'] != predicted_class]
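To read a confusion matrix, it helps to see one on a tiny example. The labels below are made up; the only thing assumed from scikit-learn is that confusion_matrix takes the true labels first and the predicted labels second.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1]   # made-up actual labels
y_pred = [0, 1, 1, 1, 1]   # made-up predicted labels

# cm[i, j] counts observations whose true class is i and predicted class is j
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Off-diagonal entries are the misclassifications: here, two observations of class 0 were predicted as class 1, and no observation of class 1 was predicted as class 0.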
def find_neighbors(university):
    # Get the index of the university
    uni_index = colleges[colleges.name == university].index[0]
    # Get the indices of the neighboring universities
    neighbors = list(model[uni_index])[1:]
    # Identify the names of the neighboring universities
    for idx in neighbors:
        nname = colleges.iloc[idx]['name']
        print(nname)

find_neighbors("Texas Christian University")
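The function above indexes into an array of neighbor indices. One way such an array could be produced is with scikit-learn's NearestNeighbors, sketched here on a made-up feature matrix. This is an assumption about how the array was built, not necessarily the approach used for the college data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2-D feature matrix: one row per institution
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0],
              [5.1, 5.0],
              [0.0, 0.2]])

nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors(X)

# Each row lists a point's nearest neighbors, closest first;
# the first entry is the point itself (distance 0)
print(indices)

# Dropping the first entry leaves the true neighbors
neighbors_of_first = list(indices[0])[1:]
print(neighbors_of_first)
```

Skipping the first entry of each row mirrors the [1:] slice in find_neighbors, which excludes the queried university from its own neighbor list.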