To do this, first load and clean the data, then define the features and target:
```python
import pandas as pd

url = "https://raw.githubusercontent.com/tyler-frazier/intro_data_science/main/data/persons.csv"
pns = pd.read_csv(url, sep=",")

# Check whether the age column contains any null values
check_nan = pns['age'].isnull().values.any()

# Drop rows containing null values
pns.dropna(inplace=True)

# display(pns.dtypes)
# Cast the float columns to integers
pns['age'] = pns['age'].astype(int)
pns['edu'] = pns['edu'].astype(int)
# display(pns.dtypes)

# Features: every column except the two wealth targets; target: wealthC
X = pns.drop(["wealthC", "wealthI"], axis=1)
y = pns.wealthC
```
- Drop any null values, then cast the float columns (`age` and `edu`) to integers.
- Set `X` (the feature variables) to every column except the two target variables, `wealthC` and `wealthI`.
- Set `y` (the target variable) to `wealthC`, as specified.
The R^2 value without K-Fold or standardization: 0.736
The R^2 value with standardization and no K-Fold: 0.736
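A minimal sketch of how these two scores can be computed, assuming scikit-learn's `LinearRegression` and a single train/test split (the split fraction and `random_state` here are assumptions, not taken from the original run):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Single train/test split; test_size and random_state are assumptions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=146)

# Without standardization
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print(lin_reg.score(X_test, y_test))

# With standardization: fit the scaler on the training data only
ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)
X_test_s = ss.transform(X_test)
lin_reg.fit(X_train_s, y_train)
print(lin_reg.score(X_test_s, y_test))
```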
We can use the `DoKFold` function defined in class to compute the R^2.
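I don't reproduce the class version verbatim here, but a minimal sketch of such a function, assuming scikit-learn's `KFold`, might look like this:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def DoKFold(model, X, y, k, standardize=False, random_state=146):
    """Return lists of training and testing R^2 scores over k folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
    X = np.asarray(X)
    y = np.asarray(y)
    train_scores, test_scores = [], []
    for idx_train, idx_test in kf.split(X):
        X_train, X_test = X[idx_train], X[idx_test]
        y_train, y_test = y[idx_train], y[idx_test]
        if standardize:
            # Scale each fold using its own training data only
            ss = StandardScaler()
            X_train = ss.fit_transform(X_train)
            X_test = ss.transform(X_test)
        model.fit(X_train, y_train)
        train_scores.append(model.score(X_train, y_train))
        test_scores.append(model.score(X_test, y_test))
    return train_scores, test_scores
```

The reported testing scores would then be the mean of `test_scores`, e.g. `np.mean(DoKFold(LinearRegression(), X, y, 20)[1])` (the fold count of 20 is an assumption).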
The R^2 value with K-Fold and no standardization (testing score): 0.756
The R^2 value with K-Fold and standardization (testing score): 0.756
These results show that K-Fold cross-validation was slightly more effective than a single basic regression (testing R^2 of 0.756 vs. 0.736). We also see that standardizing the feature variables did not change the R^2 values.
For ridge regression with `wealthC` as the target, the alpha range I tested was `np.linspace(70, 80, 20)`.

Optimal alpha value: 73.684

Training score: 0.736

Testing score: 0.735
Plot:
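The optimal alpha comes from sweeping the range and keeping the value with the best average testing score. A minimal sketch of that search, assuming the `DoKFold` sketch above with k = 20 folds and standardized features (both assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

a_range = np.linspace(70, 80, 20)
k = 20  # fold count; an assumption

test_means = []
for a in a_range:
    _, test_scores = DoKFold(Ridge(alpha=a), X, y, k, standardize=True)
    test_means.append(np.mean(test_scores))

idx = np.argmax(test_means)
print('Optimal alpha:', a_range[idx], 'Testing score:', test_means[idx])
```

The same loop works for every alpha search in this write-up; only the model and the range change.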
For lasso regression with `wealthC` as the target, the alpha range I tested was `np.linspace(0.001, 0.003, 10)`.

Optimal alpha value: 0.00189

Training score: 0.736

Testing score: 0.735
Plot:
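This search reuses the ridge loop above, swapping in `Lasso(alpha=a)` from `sklearn.linear_model` and this much finer alpha range, since lasso penalties operate on a far smaller scale for this data.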
Switching the target to `wealthI`, the alpha range I tested for ridge regression was `np.linspace(90, 110, 20)`.

Optimal alpha value: 97.368

Training score: 0.826

Testing score: 0.825
Plot:
For lasso regression with `wealthI` as the target, the alpha range I tested was `np.linspace(0.1, 0.2, 10)`.

Optimal alpha value: 0.13333

Training score: 0.826

Testing score: 0.825
Plot:
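For the `wealthI` runs, the only change to the earlier setup is the target; the feature matrix is unchanged because both wealth columns were already dropped from it:

```python
# Swap the target; X stays the same since it excludes both wealth columns
y = pns.wealthI
```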
There was little to no difference between the different types of regression models. Linear regression with K-Fold cross-validation performed slightly better than without it. The biggest difference came from changing the target variable from `wealthC` to `wealthI`: R^2 values were considerably higher with `wealthI` as the target. Additionally, standardizing the dataset did not make a meaningful difference, as seen when performing linear regression with and without standardization.