🤖 Machine Learning · scikit-learn · Python
Classification & Regression with scikit-learn
Two classic sklearn toy datasets, two different ML paradigms. Built a K-Nearest Neighbors classifier
on the Iris dataset to predict flower species from sepal and petal measurements, then trained a
linear regression model on the Diabetes dataset to forecast disease progression from BMI.
Real data, real models: explored, evaluated, and visualized from scratch.
150 Iris Samples
442 Diabetes Samples
96% KNN Accuracy
2 ML Models
R² 0.34 Regression Score
Dataset 1 · Classification
Iris Flower Classifier
150 samples across three iris species (Setosa, Versicolor, Virginica), each described by four measurements.
Used K-Nearest Neighbors (k=3) to classify species. Toggle axes below to explore how different feature
combinations separate the clusters — petal dimensions are the clearest signal.
Interactive scatter plot with selectable X and Y axes. Legend: Setosa (50), Versicolor (50), Virginica (50).
KNN Confusion Matrix (rows: actual species, columns: predicted species)
            Setosa   Versic.   Virgin.
Setosa          50         0         0
Versic.          0        47         3
Virgin.          0         3        47
✓ 96% Accuracy: 144/150 correct (scored on the same 150 samples used to fit the model)
Only Versicolor ↔ Virginica get confused — those two species overlap in petal space.
Setosa is perfectly separable.
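The headline accuracy can be read straight off the matrix. A minimal sketch, assuming the counts shown on this page and the usual sklearn layout (rows are actual species, columns are predicted); the variable names are illustrative only:

```python
import numpy as np

# Confusion matrix as reported on this page (rows: actual, columns: predicted).
labels = ["setosa", "versicolor", "virginica"]
cm = np.array([
    [50, 0, 0],
    [0, 47, 3],
    [0, 3, 47],
])

# Accuracy is the diagonal (correct predictions) over all predictions.
correct = int(np.trace(cm))
total = int(cm.sum())
accuracy = correct / total

# Recall per species: correct predictions divided by that species' row total.
recall = cm.diagonal() / cm.sum(axis=1)

print(f"Accuracy: {correct}/{total} = {accuracy:.2f}")
for name, value in zip(labels, recall):
    print(f"Recall for {name}: {value:.2f}")
```

All six errors sit in the Versicolor and Virginica rows, so those two classes each have a recall of 47/50 while Setosa stays at 50/50.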
Dataset 2 · Regression
Diabetes Disease Progression
442 patient records with 10 baseline features (age, sex, BMI, blood pressure, and 6 serum measurements).
Fitted a linear regression model using BMI as the predictor —
the single feature with the strongest correlation to one-year disease progression.
The model explains about 34% of the variance (R² = 0.34), which is solid for a single-feature linear fit on noisy medical data.
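One way to check the "strongest correlation" claim is to rank every feature by its Pearson correlation with the target. This is a rough sketch rather than the exact exploration done for the project; the names `correlations`, `ranked`, and `best_feature` are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Pearson correlation of each feature column with the target.
correlations = {
    name: float(np.corrcoef(X[:, j], y)[0, 1])
    for j, name in enumerate(diabetes.feature_names)
}

# Rank by absolute correlation, strongest first.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name:>4}: r = {r:+.3f}, r^2 = {r * r:.3f}")

best_feature = ranked[0][0]
print("Strongest single feature:", best_feature)
```

For a one-feature linear fit, the squared correlation equals the model's R², which is why the BMI row lines up with the 0.344 score reported here.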
R² Score
0.344
Model explains 34.4% of variance
Coefficient
949.4
+1 std BMI → about +45 progression units (949.4 is the slope per unit of the scaled feature; 1 std ≈ 0.048 units)
Intercept
152.1
Baseline at mean BMI
Feature: s6
Blood Sugar
Serum measurement — glucose level proxy
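The coefficient card is easier to interpret once the feature scaling is taken into account. sklearn documents the diabetes features as mean-centered and scaled by the standard deviation times the square root of the sample count, so one unit of the scaled BMI column is much larger than one standard deviation. A hedged sketch of the conversion, with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()      # default loading returns the scaled features
bmi = diabetes.data[:, [2]]     # column 2 is BMI; [[2]] keeps shape (442, 1)
y = diabetes.target
n_samples = bmi.shape[0]

reg = LinearRegression().fit(bmi, y)
coef_scaled = reg.coef_[0]      # slope per unit of the scaled feature

# One standard deviation of BMI, expressed in the scaled feature's units.
one_std = bmi.std()             # close to 1 / sqrt(n_samples)
effect_per_std = coef_scaled * one_std

print(f"Slope per scaled unit: {coef_scaled:.1f}")
print(f"1 std in scaled units: {one_std:.4f}")
print(f"Effect of +1 std BMI:  {effect_per_std:.1f} progression units")
```

Read this way, a patient one standard deviation above mean BMI is predicted to sit roughly 45 units above the intercept, not 949.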
Source Code
The Python
Full working implementations — classification and regression, top to bottom.
# Iris Classification with K-Nearest Neighbors
# Dataset: 150 samples, 3 species, 4 features
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# Shape of the data
print("Data shape:", X.shape)      # (150, 4)
print("Target shape:", y.shape)    # (150,)
print("Target names:", iris.target_names)
# → ['setosa' 'versicolor' 'virginica']

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
predicted = knn.predict(X)

# First 10 predicted vs expected (using species names)
print("\nFirst 10 results:")
for i in range(10):
    print(f"  Predicted: {iris.target_names[predicted[i]]:<12}"
          f"Expected: {iris.target_names[y[i]]}")

# Values the model got wrong
wrong = [(i, iris.target_names[predicted[i]], iris.target_names[y[i]])
         for i in range(len(y)) if predicted[i] != y[i]]
print(f"\nWrong predictions ({len(wrong)}):")
for idx, pred, exp in wrong:
    print(f"  Index {idx}: Predicted '{pred}', Expected '{exp}'")

# Confusion matrix visualization
cm = confusion_matrix(y, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=iris.target_names)
disp.plot(cmap='Blues')
plt.title("Iris KNN Confusion Matrix (k=3)")
plt.tight_layout()
plt.show()
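The script above scores the classifier on the same samples it was fitted on. As a complementary check, and not part of the original project code, here is a small sketch of 5-fold cross-validation, where every fold is scored on samples the model did not see during fitting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# cv=5 splits the data into five stratified folds; each fold is held out once.
scores = cross_val_score(knn, X, y, cv=5)
mean_accuracy = scores.mean()

print("Fold accuracies:", scores.round(3))
print(f"Mean held-out accuracy: {mean_accuracy:.3f}")
```

If the held-out mean lands close to the in-sample figure, that suggests the 96% is not just the model memorizing its neighbors.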
# Diabetes Regression with Linear Regression
# Dataset: 442 patients, 10 features, target = 1-year disease progression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import datasets

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# How many samples and features?
print("Samples:", X.shape[0])     # 442
print("Features:", X.shape[1])    # 10

# Feature s6 represents blood glucose level (serum measurement)
print("Feature names:", diabetes.feature_names)

# Isolate BMI (feature index 2), the strongest single predictor
bmi = X[:, 2].reshape(-1, 1)

# Fit linear regression
reg = LinearRegression()
reg.fit(bmi, y)

# Print coefficient and intercept
print(f"Coefficient: {reg.coef_[0]:.2f}")       # 949.44
print(f"Intercept: {reg.intercept_:.2f}")       # 152.13
print(f"R² Score: {reg.score(bmi, y):.4f}")     # 0.3439

# Scatterplot with regression line
bmi_range = np.linspace(bmi.min(), bmi.max(), 100).reshape(-1, 1)
y_pred = reg.predict(bmi_range)
plt.figure(figsize=(8, 5))
plt.scatter(bmi, y, color='steelblue', alpha=0.4, s=20, label='Actual')
plt.plot(bmi_range, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("BMI (standardized)")
plt.ylabel("Disease Progression (1 year)")
plt.title("Diabetes: BMI vs Disease Progression")
plt.legend()
plt.tight_layout()
plt.show()
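For context on the single-feature choice, the BMI-only fit can be set against a fit on all 10 features. This comparison is an illustrative sketch, not something taken from the project source; the cross-validated scores use sklearn's default 5-fold split for regressors:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
bmi = X[:, [2]]  # BMI only, kept 2-D

# In-sample R²: adding features can never lower this for ordinary least squares.
r2_bmi = LinearRegression().fit(bmi, y).score(bmi, y)
r2_all = LinearRegression().fit(X, y).score(X, y)

# Cross-validated R² gives a fairer view of how each model generalizes.
cv_bmi = cross_val_score(LinearRegression(), bmi, y, cv=5, scoring="r2")
cv_all = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(f"In-sample R²  BMI only: {r2_bmi:.3f}   all features: {r2_all:.3f}")
print(f"5-fold CV R²  BMI only: {cv_bmi.mean():.3f}   all features: {cv_all.mean():.3f}")
```

The gap between the two rows shows how much signal the other nine features carry beyond BMI, which is the trade-off behind keeping the model to one interpretable predictor.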
Project Breakdown
By the Numbers
Business Problem
Healthcare data is abundant but often underused. The challenge: can a simple model trained
on basic patient measurements predict who is at higher risk for diabetes complications?
And for classification tasks like species identification, how cleanly can geometric measurements
separate distinct categories — without any deep learning overhead?
One-Sentence Summary
Two sklearn toy datasets, two modeling approaches. Trained a K-Nearest Neighbors classifier
(k=3) on the Iris dataset achieving 96% accuracy, and a Linear Regression model on the
Diabetes dataset with R²=0.344. Explored confusion matrices, coefficients, intercepts,
and feature importance through visualization.
KNN species classifier with interactive axis toggling to visualize feature separation.
Confusion matrix revealing where versicolor and virginica overlap. Linear regression
isolating BMI as the strongest diabetes predictor. Complete metric reporting: accuracy,
coefficient, intercept, R², and misclassification list.
My Role
Sole developer — loaded and explored both datasets from scratch, chose appropriate
model architectures, tuned hyperparameters (k value), interpreted all model outputs,
and built the visualizations. Worked through the feature selection decision for the
diabetes regression (BMI vs trying all 10 features).
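Tuning k can be sketched as a small sweep scored with cross-validation. The candidate values below are arbitrary picks for illustration and not the grid used in the project:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]  # illustrative choices, odd to reduce tie votes
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy across five held-out folds for this k.
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in mean_scores.items():
    print(f"k={k:>2}: mean CV accuracy = {score:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best k in this sweep:", best_k)
```

On a dataset this small the differences between neighboring k values tend to be a sample or two, so the sweep is more useful for ruling out bad settings than for crowning a single winner.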
Biggest Challenge
Understanding why the confusion matrix showed Versicolor ↔ Virginica errors even at
96% accuracy required going back to the scatter plots. Petal length vs. petal width
makes the overlap visible — they genuinely share measurement space. That was the
moment sklearn clicked: models are only as good as the separability in your data.
What I Learned
The difference between classification and regression isn't just syntax — it's a
fundamentally different question being asked. I also learned to read R² critically:
0.34 on noisy medical data with one feature is actually meaningful, not "bad."
Feature selection matters more than model complexity at this scale.
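Reading R² critically is easier with its definition in hand: one minus the ratio of residual variation to total variation around the mean. A short sketch that computes it by hand for the BMI model and compares it with sklearn's `score`; the names `ss_res` and `ss_tot` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
bmi = diabetes.data[:, [2]]  # BMI column, kept 2-D
y = diabetes.target

reg = LinearRegression().fit(bmi, y)
y_hat = reg.predict(bmi)

# Residual sum of squares: variation the model leaves unexplained.
ss_res = np.sum((y - y_hat) ** 2)
# Total sum of squares: variation around the mean of the target.
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = reg.score(bmi, y)

print(f"Manual R²:  {r2_manual:.4f}")
print(f"sklearn R²: {r2_sklearn:.4f}")
```

An R² near 0.34 means the BMI line removes about a third of the squared error that a constant "predict the mean" baseline would make.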
Course Context
Built for AI: Principles and Application (4V98) at Baylor University. This project introduced
supervised machine learning concepts using scikit-learn's built-in datasets — chosen
specifically because they isolate modeling skill from data-cleaning noise. The iris
dataset is a classification benchmark; the diabetes dataset introduces real-world
regression complexity with overlapping, correlated features.
GitHub & Demo
⌥ GitHub Repository ↗
Built in VS Code with GitHub Copilot. Full source includes both model scripts, the interactive
chart visualizations embedded on this page, and inline comments explaining each sklearn step.
Available on request.