Description
1 AutoML for Regression
I once worked with an old engineer who would quietly listen to younger engineers arguing over what each thought was the best solution to a problem. Eventually, he would say, “There is no point in arguing about things that can be tested.” And then he would go and do an experiment that ended the argument.
As we get better and better at working with these models, we can begin to guess which will be best. However, a lot of the time we can just try all of them.
In this exercise, you will get a data set for regression and you will use pycaret to find the best candidates and test them against each other.
1.1 Training and Comparing
train_concrete.csv and test_concrete.csv contain data about the compressive strength of several different concrete mixes: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
You will write a program called concrete_train.py that will use pycaret’s compare_models (no turbo!) to try a large variety of regression algorithms on train_concrete.csv.
It will pick the best six (based on R²) and it will tune (using at least 24 different parameter combinations) and finalize each before saving the finalized model to a pickle file. Thus six .pkl files will be created.
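If it helps to see the shape of such a program, here is a minimal sketch assuming a recent pycaret. setup, compare_models, tune_model, finalize_model, and save_model are real pycaret functions, but the exact arguments and the idea of naming each pickle after the estimator class are my choices, not requirements:

import pandas as pd
from pycaret.regression import (
    setup,
    compare_models,
    tune_model,
    finalize_model,
    save_model,
)

train = pd.read_csv("train_concrete.csv")

# The target column csMPa matches the setup table below.
setup(data=train, target="csMPa")

# Try everything, including the slow models (turbo=False), and keep
# the six with the best cross-validated R2.
best_six = compare_models(n_select=6, turbo=False, sort="R2")

for model in best_six:
    # tune_model's random search tries n_iter parameter combinations,
    # which is where "24 candidates" in the output comes from.
    tuned = tune_model(model, n_iter=24)
    # finalize_model refits on the whole training set.
    final = finalize_model(tuned)
    # save_model appends ".pkl", so this writes e.g. CatBoostRegressor.pkl.
    save_model(final, type(tuned).__name__)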
Run the program and save the output to train.txt.
train.txt should look like this:
*** Setting up session ***
       Description       Value
0       Session id        8371
1           Target       csMPa
2      Target type  Regression
…
18             USI        6171
*** Set up: 1.89 seconds
                                     Model      MAE       MSE     RMSE       R2   RMSLE    MAPE  TT (Sec)
catboost                CatBoost Regressor   2.8945   18.6345   4.2833   0.9337  0.1358  0.1003     0.227
lightgbm   Light Gradient Boosting Machine   3.4278   24.0534   4.8647   0.9143  0.1576  0.1191     0.016
et                   Extra Trees Regressor   3.5456   26.9685   5.1605   0.9043  0.1622  0.1235     0.032
rf                 Random Forest Regressor   3.8756   27.8363   5.2477   0.9010  0.1772  0.1404     0.035
gbr            Gradient Boosting Regressor   3.9316   28.5122   5.3121   0.8986  0.1760  0.1383     0.016
mlp                          MLP Regressor   5.1949   46.8561   6.8208   0.8327  0.2217  0.1769     0.090
dt                 Decision Tree Regressor   5.0301   56.0932   7.3678   0.8006  0.2311  0.1713     0.007
ada                     AdaBoost Regressor   6.3680   61.0570   7.7998   0.7840  0.2828  0.2631     0.017
knn                  K Neighbors Regressor   7.3850   96.5281   9.7726   0.6541  0.3188  0.2843     0.007
br                          Bayesian Ridge   8.1946  108.8475  10.4094   0.6097  0.3320  0.3135     0.007
kr                            Kernel Ridge   8.2325  108.9195  10.4127   0.6092  0.3307  0.3135     0.009
en                             Elastic Net   8.2139  109.0079  10.4165   0.6091  0.3315  0.3134     0.090
ridge                     Ridge Regression   8.2163  109.0042  10.4161   0.6091  0.3312  0.3131     0.089
lr                       Linear Regression   8.2163  109.0043  10.4161   0.6091  0.3312  0.3131     0.219
lasso                     Lasso Regression   8.2147  109.0795  10.4200   0.6088  0.3318  0.3137     0.095
ard      Automatic Relevance Determination   8.2609  109.4030  10.4368   0.6073  0.3314  0.3149     0.007
huber                      Huber Regressor   8.1080  116.0969  10.6962   0.5809  0.3235  0.3038     0.010
par           Passive Aggressive Regressor   9.6928  149.4162  12.0777   0.4758  0.3902  0.3636     0.007
lar                 Least Angle Regression   9.9147  163.0199  12.6530   0.4182  0.4281  0.3720     0.007
omp            Orthogonal Matching Pursuit  12.0965  216.9526  14.6893   0.2359  0.4757  0.5022     0.007
svm              Support Vector Regression  12.0595  227.7054  15.0594   0.1996  0.4822  0.5051     0.009
tr                      TheilSen Regressor   9.0357  232.6367  14.6477   0.1558  0.3313  0.3066     0.171
llar          Lasso Least Angle Regression  13.6897  286.5223  16.8957  -0.0068  0.5397  0.6003     0.007
dummy                      Dummy Regressor  13.6897  286.5223  16.8957  -0.0068  0.5397  0.6003     0.006
ransac             Random Sample Consensus  10.4945  352.2832  17.8668  -0.2651  0.3615  0.3362     0.017
*** compare_models: 16.59 seconds
Best: CatBoostRegressor LGBMRegressor ExtraTreesRegressor RandomForestRegressor GradientBoostingRegressor MLPRegressor
*** 0 – CatBoostRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     3.4779  28.4514  5.3340  0.9139  0.1772  0.1307
…
9     2.5817  15.6806  3.9599  0.9387  0.1492  0.1040
Mean  3.0409  19.2967  4.3551  0.9313  0.1484  0.1077
Std   0.3354   5.1192  0.5742  0.0193  0.0199  0.0141
*** 1 – LGBMRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     3.5398  29.0514  5.3899  0.9121  0.1635  0.1275
…
9     2.9727  19.1409  4.3750  0.9252  0.1640  0.1160
Mean  3.1203  21.8118  4.6207  0.9222  0.1522  0.1100
Std   0.3422   6.4237  0.6787  0.0236  0.0230  0.0142
*** 2 – ExtraTreesRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     5.3227  53.8647  7.3393  0.8370  0.2351  0.2000
…
9     5.0418  38.2233  6.1825  0.8506  0.2439  0.2172
Mean  4.9365  41.6118  6.4389  0.8523  0.2119  0.1816
Std   0.2409   5.1511  0.3898  0.0196  0.0268  0.0253
*** 3 – RandomForestRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     4.6211  46.6803  6.8323  0.8588  0.2290  0.1846
…
9     4.6621  32.9120  5.7369  0.8714  0.2293  0.2000
Mean  4.5181  35.5683  5.9522  0.8745  0.2030  0.1707
Std   0.1097   4.5604  0.3733  0.0121  0.0318  0.0272
*** 4 – GradientBoostingRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     3.3277  25.1146  5.0114  0.9240  0.1740  0.1271
…
9     2.9214  19.6030  4.4275  0.9234  0.1694  0.1215
Mean  3.1014  20.4411  4.4966  0.9272  0.1542  0.1114
Std   0.2730   4.1364  0.4706  0.0171  0.0203  0.0145
*** 5 – MLPRegressor ***
Fitting 10 folds for each of 24 candidates, totalling 240 fits
         MAE      MSE    RMSE      R2   RMSLE    MAPE
Fold
0     5.6160  56.9794  7.5485  0.8276  0.2771  0.1929
…
9     5.1128  37.7620  6.1451  0.8524  0.2474  0.2167
Mean  5.1949  46.8561  6.8208  0.8327  0.2217  0.1769
Std   0.4478   7.8475  0.5772  0.0343  0.0291  0.0239
Transformation Pipeline and Model Successfully Saved
*** Tuning and finalizing: 165.12 seconds
*** Total time: 183.60 seconds
(Yes, depending on the versions of the libraries you have installed, there may be some warnings from this process. I’m not showing those here.)
When I run this, I end up with a pickle file for each of the top six models:
- LGBMRegressor.pkl
- CatBoostRegressor.pkl
- MLPRegressor.pkl
- ExtraTreesRegressor.pkl
- RandomForestRegressor.pkl
- GradientBoostingRegressor.pkl
1.2 Testing
You will write a program called concrete_test.py that will scan the current directory for .pkl files. It will use pycaret to load them in, one at a time.
Each model will be tested on test_concrete.csv. The program will print the time that inference required and the R² value.
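Here is a minimal sketch of what that scan might look like, assuming scikit-learn’s r2_score is available and that your pycaret version names the prediction column prediction_label (older versions used Label). load_model and predict_model are real pycaret functions; everything else is one way to do it, not the required way:

import glob
import time

import pandas as pd
from sklearn.metrics import r2_score
from pycaret.regression import load_model, predict_model

test = pd.read_csv("test_concrete.csv")

for path in sorted(glob.glob("*.pkl")):
    name = path[: -len(".pkl")]
    model = load_model(name)  # load_model takes the name without ".pkl"
    start = time.perf_counter()
    predictions = predict_model(model, data=test)
    elapsed = time.perf_counter() - start
    r2 = r2_score(test["csMPa"], predictions["prediction_label"])
    print(f"{name}:")
    print(f"    Inference: {elapsed:.4f} seconds")
    print(f"    R2 on test data = {r2:.4f}")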
Run the program and save the output to test.txt.
My test.txt looks like this:
GradientBoostingRegressor:
Inference: 0.0095 seconds
R2 on test data = 0.9110
RandomForestRegressor:
Inference: 0.0319 seconds
R2 on test data = 0.8815
AdaBoostRegressor:
Inference: 0.0147 seconds
R2 on test data = 0.7239
ExtraTreesRegressor:
Inference: 0.0170 seconds
R2 on test data = 0.9007
MLPRegressor:
Inference: 0.0052 seconds
R2 on test data = 0.7861
CatBoostRegressor:
Inference: 0.0030 seconds
R2 on test data = 0.9079
LGBMRegressor:
Inference: 0.0071 seconds
R2 on test data = 0.9101
Which would you use if accuracy were most important? What if speed were also really important?
2 χ² Testing for Independence between Categorical Variables
Sometimes we will look at two categorical variables and try to figure out if they are related. Does knowing that a mouse has a particular gene tell us anything about the probability that it will get cancer?
You are given a CSV with the results of this sort of experiment called mice.csv. Write a program check_mice.py that does the analysis. Put the analysis into a LaTeX file (mice.tex). Convert that to a PDF (mice.pdf). Include both files in your zip file.
For example, you should start out with a contingency table. (I did these examples with different data.)
Gene    No Cancer   Has Cancer   Total
R              34            2      36
J               4           45      49
K              17           18      35
Total          55           65     120
Then show conditional proportions:

Gene    No Cancer   Has Cancer   Total
R           94.4%         5.6%   30.0%
J            8.2%        91.8%   40.8%
K           48.6%        51.4%   29.2%
All         45.8%        54.2%
Then show the expected counts if the gene and cancer were independent (row total × column total ÷ grand total; e.g. 36 × 55 / 120 = 16.5 for gene R with no cancer):

Gene    No Cancer   Has Cancer   Total
R            16.5         19.5      36
J            22.5         26.5      49
K            16.0         19.0      35
            45.8%        54.2%
Use the two tables to find χ²:

χ² = 62.379

Note the degrees of freedom. (It is 2.)
And compute the p-value:

p = 2.853273173286652 × 10⁻¹⁴
And then give a proclamation: “It seems very, very unlikely that we would have seen these numbers if the gene and cancer were independent.”
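A minimal sketch of the core computation, assuming scipy and pandas are available and that mice.csv has one row per mouse with columns named gene and cancer (hypothetical names; use whatever the file actually contains):

import pandas as pd
from scipy.stats import chi2_contingency

# One row per mouse; the column names "gene" and "cancer" are
# assumptions about mice.csv.
df = pd.read_csv("mice.csv")

# Observed contingency table.
observed = pd.crosstab(df["gene"], df["cancer"])
print(observed)

# Conditional proportions: each row divided by its row total.
print(pd.crosstab(df["gene"], df["cancer"], normalize="index"))

# chi2_contingency returns the statistic, the p-value, the degrees
# of freedom, and the expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)
print("Expected counts:\n", expected)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p}")

From there it is a matter of writing the tables and the proclamation into mice.tex and compiling it to mice.pdf.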
3 Criteria for Success
If your name is Fred Jones, you will turn in a zip file called HW10_Jones_Fred.zip of a directory called HW10_Jones_Fred. It will contain:
- concrete_train.py
- concrete_test.py
- check_mice.py
- test.txt
- train.txt
- mice.tex
- mice.pdf
- train_concrete.csv
- test_concrete.csv
- mice.csv
Be sure to format your Python code with black before you submit it.
We would run your code like this:
cd HW10_Jones_Fred
python3 concrete_train.py
python3 concrete_test.py
python3 check_mice.py
Do this work by yourself. StackOverflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.
The template files for the python programs have import statements. Do not use any frameworks not in those import statements.