Self-learning guide for machine learning
Scikit-Learn is the go-to library for machine learning in Python.
You should know the following
import sklearn
sklearn.__version__
# you should see version output
describe: to understand the data
make_blobs: function to make a classification dataset
make_regression: function to make a regression dataset
Create data like this
import pandas as pd
from sklearn.model_selection import train_test_split
tip_data = pd.DataFrame({
    'bill': [50.00, 30.00, 60.00, 40.00, 65.00, 20.00, 10.00, 15.00, 25.00, 35.00],
    'tip': [12.00, 7.00, 13.00, 8.00, 15.00, 5.00, 2.00, 2.00, 3.00, 4.00],
})
x = tip_data[['bill']]
y = tip_data[['tip']]
# Use train_test_split function (20% test split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
Print out x_train and x_test.
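The make_blobs and make_regression helpers listed earlier can generate practice data for the same kind of split. The sketch below is one possible way to try them out; the sample counts, noise level, random_state values, and the column names 'feature' and 'target' are illustrative assumptions, not part of the guide.

```python
import pandas as pd
from sklearn.datasets import make_blobs, make_regression
from sklearn.model_selection import train_test_split

# Synthetic classification data: 100 points in 3 clusters, 2 features each
X_blobs, y_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Synthetic regression data: 100 samples, 1 feature, with some noise
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=42)

# describe() reports count, mean, std, min, quartiles, and max for each column
# ('feature' and 'target' are hypothetical column names)
reg_df = pd.DataFrame({'feature': X_reg[:, 0], 'target': y_reg})
print(reg_df.describe())

# Same 20% test split as above; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```

Passing random_state is optional; without it, each run produces a different split.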
Read the house-sales-simplified.csv.
import pandas as pd
house_sales = pd.read_csv(...)
X = ...  # extract all columns except `SalePrice`
y = ...  # extract the `SalePrice` column
Now split the X, y data into training and testing sets.
Print out the length of train and test datasets.
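One possible approach to this exercise is sketched below. Because the CSV path is not given here, the sketch uses a small made-up DataFrame as a stand-in: the 'Bedrooms' and 'LivingArea' columns and all values are hypothetical, and only `SalePrice` comes from the guide. The 20% test size is also an assumption carried over from the tip example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real CSV; only the SalePrice column is named in the guide
house_sales = pd.DataFrame({
    'Bedrooms': [2, 3, 3, 4, 2, 5, 3, 4, 2, 3],
    'LivingArea': [900, 1400, 1500, 2100, 850, 3000, 1600, 2200, 950, 1450],
    'SalePrice': [200000, 310000, 330000, 450000, 190000,
                  640000, 350000, 470000, 210000, 320000],
})

# Features: every column except the target
X = house_sales.drop(columns=['SalePrice'])
# Target: the SalePrice column as a Series
y = house_sales['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# With 10 rows and test_size=0.2, this prints 8 and 2
print(len(X_train), len(X_test))
```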
Since the data is too large to inspect visually, how can we programmatically ensure there are no common elements between the train and test sets?
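One way to answer this, sketched under the assumption that the DataFrame has a unique index: train_test_split preserves the original row labels, so a row that landed in both sets would show up in both indexes. The 1000-row DataFrame and its column names below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a default unique index (0..999)
df = pd.DataFrame({'feature': range(1000), 'target': range(1000)})
X = df[['feature']]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Row labels survive the split, so shared labels would mean shared rows
common = X_train.index.intersection(X_test.index)
print(len(common))  # 0

# Together the two sets should account for every original row
print(len(X_train) + len(X_test) == len(X))  # True
```

Comparing index labels rather than row values avoids flagging distinct rows that happen to hold identical values.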