To do this, I first loaded the data, dropped rows with missing values, and converted the `age` and `edu` columns back to integers before splitting out the features and target:

```python
import pandas as pd

# Read the city persons data and drop rows with missing values
url = "https://raw.githubusercontent.com/tyler-frazier/intro_data_science/main/data/city_persons.csv"
pns = pd.read_csv(url, sep=",")
pns.dropna(inplace=True)
display(pns.dtypes)

# dropna leaves age and edu as floats; cast them back to integers
pns['age'] = pns['age'].astype(int)
pns['edu'] = pns['edu'].astype(int)

# Features and target: predict wealthC, dropping both wealth columns from X
X = pns.drop(["wealthC", "wealthI"], axis=1)
y = pns.wealthC
```
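All of the scores reported below come from a single train/test split. The split itself is not shown in the write-up, so the 80/20 ratio and `random_state` in this sketch are assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set; the 80/20 split and random_state are assumptions,
# not values taken from the original write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```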
I executed the KNN classification method over the range np.arange(10, 80) of neighbor values.
Max. testing score: 0.546
Alpha value: 63
*[Plot: testing scores across the k sweep]*
With distance weighting added:
Max. testing score: 0.504
Alpha value: 53
*[Plot: testing scores across the k sweep]*
There was not a marked difference between the two models; the model without distance weighting actually performed slightly better, but only by a small margin.
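A minimal sketch of the k sweep described above; the loop structure and best-score bookkeeping are assumptions, but the np.arange(10, 80) range and the weights='distance' toggle follow the text:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_sweep(weights="uniform"):
    # Track the best testing score and the k ("alpha") that produced it
    best_k, best_test = None, 0.0
    for k in np.arange(10, 80):
        knn = KNeighborsClassifier(n_neighbors=int(k), weights=weights)
        knn.fit(X_train, y_train)
        test = knn.score(X_test, y_test)
        if test > best_test:
            best_k, best_test = int(k), test
    return best_k, best_test

print(knn_sweep())            # without distance weighting
print(knn_sweep("distance"))  # with distance weighting
```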
Training score: 0.555
Testing score: 0.539
Again, there was no significant difference between this model and the KNN models: it performed slightly better than the KNN with distance weighting and slightly worse than the KNN without.
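The write-up does not name this model. Purely to illustrate how a single pair of train/test scores like these would be produced, here is a sketch using logistic regression, which is an assumption rather than necessarily the model used:

```python
from sklearn.linear_model import LogisticRegression

# LogisticRegression is an assumption for illustration only;
# the write-up does not say which model produced these scores.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Training:", model.score(X_train, y_train))
print("Testing:", model.score(X_test, y_test))
```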
| Trees | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- |
| Training | 0.7942708333333334 | 0.7942708333333334 | 0.7942708333333334 | 0.7942708333333334 |
| Testing | 0.5012201073694486 | 0.5021961932650073 | 0.5070766227428014 | 0.5041483650561249 |
The training scores were identical across all tree counts, while the testing scores differed slightly: 1000 trees produced the best testing score, and 100 trees the worst.
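A sketch of the tree-count comparison, assuming scikit-learn's RandomForestClassifier with n_estimators set to each value in the table (the random_state is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

for n in (100, 500, 1000, 5000):
    rf = RandomForestClassifier(n_estimators=n, random_state=42)  # random_state assumed
    rf.fit(X_train, y_train)
    print(n, rf.score(X_train, y_train), rf.score(X_test, y_test))
```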
| Sample splits | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training | 0.6595052083333334 | 0.6549479166666666 | 0.6536458333333334 | 0.6471354166666666 | 0.6451822916666666 | 0.6435546875 | 0.6432291666666666 | 0.642578125 | 0.6373697916666666 | 0.638671875 |
| Testing | 0.550024402147389 | 0.5475841874084919 | 0.5490483162518301 | 0.5505124450951684 | 0.5470961444607125 | 0.552464616886286 | 0.5466081015129332 | 0.5475841874084919 | 0.5485602733040508 | 0.5495363591996095 |
While the training scores for this model were not as high as those of the previous model, the testing scores were slightly higher.
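Assuming "sample splits" maps to scikit-learn's min_samples_split parameter, the sweep over 20 through 29 might look like this sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Vary the minimum number of samples required to split an internal node
for s in range(20, 30):
    rf = RandomForestClassifier(min_samples_split=s, random_state=42)
    rf.fit(X_train, y_train)
    print(s, rf.score(X_train, y_train), rf.score(X_test, y_test))
```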
| Trees | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- |
| Training | 0.7942708333333334 | 0.7942708333333334 | 0.7942708333333334 | 0.7942708333333334 |
| Testing | 0.5036603221083455 | 0.5012201073694486 | 0.4987798926305515 | 0.5031722791605662 |
Looking at the training and testing values for the model with standardization, there is not much of a difference from the model without standardization. The training values were all equivalent, just as in the unstandardized model, and the testing values are about the same.
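A minimal sketch of the standardization step, assuming scikit-learn's StandardScaler fit on the training features only and then applied to the test features:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit scaling on training data only
X_test_std = scaler.transform(X_test)        # reuse the training statistics

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_std, y_train)
print(rf.score(X_train_std, y_train), rf.score(X_test_std, y_test))
```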
After merging wealth classes 2 and 3, I reran the models, starting with the KNN classifier without distance weighting.

Max. testing score: 0.555
Alpha value: 50
*[Plot: testing scores across the k sweep]*
There is a slight increase in the maximum testing score (0.546 before the merge vs. 0.555 after), but not by a significant amount.
With distance weighting added:

Max. testing score: 0.505
Alpha value: 79
*[Plot: testing scores across the k sweep]*
The testing score increased slightly. This is to be expected: merging the wealth classes reduces the number of groups, which lowers the variance, and less variance usually results in more accurate classification.
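The recoding itself is not shown in the write-up; here is a sketch of merging wealth classes 2 and 3 (relabeling class 2 as class 3 is an assumption about the direction of the merge):

```python
from sklearn.model_selection import train_test_split

# Collapse wealth class 2 into class 3; which label survives is an assumption
y_merged = y.replace(2, 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_merged, test_size=0.2, random_state=42
)
```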
Training: 0.554
Testing: 0.531
It performed slightly worse than before the merge, but not by a large amount; the previous testing score was 0.539.
| Trees | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- |
| Training | 0.7913411458333334 | 0.7913411458333334 | 0.7913411458333334 | 0.7913411458333334 |
| Testing | 0.51634943875061 | 0.5119570522205954 | 0.5148853099072719 | 0.51634943875061 |
The testing values generally increased by about 0.01, which is a larger change than any other we have observed.
| Sample splits | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training | 0.6572265625 | 0.6510416666666666 | 0.6552734375 | 0.6516927083333334 | 0.6438802083333334 | 0.6468098958333334 | 0.6412760416666666 | 0.6383463541666666 | 0.6402994791666666 | 0.63671875 |
| Testing | 0.5588091752074182 | 0.5534407027818448 | 0.5514885309907271 | 0.5495363591996095 | 0.5563689604685212 | 0.5529526598340654 | 0.5602733040507565 | 0.55734504636408 | 0.5568570034163006 | 0.5592972181551976 |
The testing scores increased overall compared to the unmerged model.
| Trees | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- |
| Training | 0.798828125 | 0.798828125 | 0.798828125 | 0.798828125 |
| Testing | 0.48462664714494874 | 0.49097120546608103 | 0.4904831625183016 | 0.4870668618838458 |
This model yielded the most significantly worsened testing scores; the general decrease was around 0.02.
The model that yielded the best results was the random forest model with 26 sample splits after combining wealth classes 2 and 3. However, it shows a slight overfit, with a gap of 0.081 between the training and testing scores. No single model performed significantly better than the others, though models generally performed slightly better, by a small margin, after the wealth classes were merged. Beyond that, most testing scores fell within a range of roughly 0.49 to 0.56.