Scientific ML (2 Part Series)
1 Predicting Protein Secondary Structures with Machine Learning
2 Training a Stellar Classification Model with Ensemble Machine Learning
Training a Stellar Classification Model with Ensemble Machine Learning
Hey there, fellow data enthusiasts! Today, I’m excited to share how I built a stellar classification model to categorize galaxies, quasars (QSOs), and stars using data from the Sloan Digital Sky Survey (SDSS17). This project, inspired by a recent technical whitepaper, combines K-Nearest Neighbors (KNN), Gaussian Mixture Models (GMM), CatBoost, a Neural Network, and a Logistic Regression Meta-Learner to hit a test accuracy of 97.37%. Let’s dive into how I trained this cosmic classifier, step by step!
The Mission: Classifying the Cosmos
Imagine you’ve got 100,000 observations of celestial objects—galaxies, quasars, and stars—each with 17 features like magnitudes (u, g, r, i, z), redshift, and positional data. Your job? Sort them into three buckets: GALAXY (0), QSO (1), and STAR (2). The SDSS17 dataset is a treasure trove, but it’s messy—imbalanced classes (59% galaxies, 22% stars, 19% quasars), outliers, and noisy metadata. My goal was to build a robust, interpretable model using ensemble techniques. Here’s how I did it.
Step 1: Prepping the Data
First things first: clean the data. The raw dataset needed some TLC to shine. Here’s what I did:
- Outlier Removal: Used the Interquartile Range (IQR) method to ditch 14,266 outliers. For each feature, I calculated Q1 − 1.5 × IQR and Q3 + 1.5 × IQR and dropped anything outside that range (a sketch follows this list).
- Feature Engineering: Created interaction terms like `redshift_u` (redshift × u magnitude) to capture astrophysical relationships. Redshift's a big deal in astronomy, so I multiplied it with each photometric band (u, g, r, i, z).
- Drop the Noise: Tossed metadata like `obj_ID` and `run_ID`; they're irrelevant for classification.
- Label Encoding: Mapped classes to numbers: GALAXY = 0, QSO = 1, STAR = 2.
- Split the Data: Used a 60/20/20 train/validation/test split with `random_state=42` for reproducibility and stratified sampling to preserve class ratios.
- Standardization: Applied `StandardScaler` to normalize features to zero mean and unit variance.
- Balance the Classes: Hit the training set with SMOTE (Synthetic Minority Oversampling Technique) to even out the imbalanced classes.
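The outlier filter isn't in the preprocessing snippet below, so here's a minimal sketch of the IQR step. Assume it runs right after loading the CSV; the helper name and the exact column list are mine, not from the original notebook.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows where every listed column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = np.ones(len(df), dtype=bool)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Hypothetical usage on the photometric bands and redshift:
# data = remove_iqr_outliers(data, ['u', 'g', 'r', 'i', 'z', 'redshift'])
```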
Here’s a quick snippet of the preprocessing in Python:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import pandas as pd

# Load data (assuming a CSV from SDSS17)
data = pd.read_csv('sdss17_stellar_data.csv')

# Feature engineering: redshift x photometric band interactions
for band in ['u', 'g', 'r', 'i', 'z']:
    data[f'redshift_{band}'] = data['redshift'] * data[band]

# Drop metadata irrelevant for classification
data = data.drop(columns=['obj_ID', 'run_ID', 'rerun_ID', 'cam_col',
                          'field_ID', 'spec_obj_ID', 'fiber_ID'])

# Encode labels
data['class'] = data['class'].map({'GALAXY': 0, 'QSO': 1, 'STAR': 2})

# Split features and target
X = data.drop('class', axis=1)
y = data['class']

# Train/validation/test split (60/20/20, stratified)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

# Scale features (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Apply SMOTE to the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
```
This gave me a clean, balanced dataset ready for modeling.
Step 2: Building the Ensemble Crew
I didn’t settle for one model—I built an ensemble of five to leverage their unique strengths. Here’s the lineup:
- K-Nearest Neighbors (KNN):
  - Tuned `n_neighbors` with GridSearchCV (tried 3, 5, 7, 10; picked 3). See the tuning sketch after this list.
  - Great for local patterns, but struggles with global structure.
- Gaussian Mixture Model (GMM):
  - Used BIC to pick 10 components. It's unsupervised, so I evaluated it with Adjusted Rand Index (ARI ≈ 0.26); not a star player here, but useful for exploration. A sketch of the BIC selection also follows this list.
- CatBoost:
  - Tuned `depth` (8), `learning_rate` (0.1), and `l2_leaf_reg` (1) with GridSearchCV.
  - Ran it on a GPU for speed. This gradient booster nailed complex interactions.
- Neural Network (HybridNN):
  - Built a net with two hidden layers (128 and 64 nodes, ReLU, BatchNorm, dropout 0.5) and an output layer (3 nodes).
  - Fed it CatBoost's probabilities as extra features: a pretty cool hybrid twist!
  - Trained with Adam, cross-entropy loss, and early stopping (patience=5).
- Meta-Learner (Logistic Regression):
  - Stacked predictions from KNN, CatBoost, and the Neural Network to make the final call.
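Two of the tuning steps above are easy to sketch. First, the KNN grid search; the grid values come from the list, while `cv=5` and accuracy scoring are my assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [3, 5, 7, 10]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)  # n_neighbors=3 won in my runs
```

Second, BIC-driven component selection for the GMM plus the ARI check; the candidate range is illustrative:

```python
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Fit a GMM per candidate component count and keep the one minimizing BIC
bic = {n: GaussianMixture(n_components=n, random_state=42).fit(X_train).bic(X_train)
       for n in range(2, 16)}
best_n = min(bic, key=bic.get)  # 10 components in my runs

gmm = GaussianMixture(n_components=best_n, random_state=42).fit(X_train)
print(adjusted_rand_score(y_train, gmm.predict(X_train)))  # ARI ≈ 0.26
```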
Here’s how I trained the Neural Network with PyTorch:
```python
import numpy as np
import torch
import torch.nn as nn

class HybridNN(nn.Module):
    def __init__(self, input_size):
        super(HybridNN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, 3),
        )

    def forward(self, x):
        return self.net(x)

# Hybrid input: original features plus CatBoost's 3 class probabilities.
# (catboost is the fitted CatBoostClassifier from the Step 3 snippet; in the
# actual pipeline it's trained before the network.)
X_train_hybrid = np.hstack([X_train, catboost.predict_proba(X_train)])

model = HybridNN(input_size=X_train_hybrid.shape[1])
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (simplified: full-batch; early stopping sketched below)
inputs = torch.tensor(X_train_hybrid, dtype=torch.float32)
labels = torch.tensor(y_train.values, dtype=torch.long)
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
```
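The loop above omits the early stopping I mentioned; here's a minimal version with patience = 5. The epoch cap and checkpoint filename are illustrative, and the validation inputs use the same features-plus-CatBoost-probabilities layout as training:

```python
# Validation tensors in the hybrid layout
X_val_hybrid = np.hstack([X_val, catboost.predict_proba(X_val)])
val_inputs = torch.tensor(X_val_hybrid, dtype=torch.float32)
val_labels = torch.tensor(y_val.values, dtype=torch.long)

best_val, patience, wait = float('inf'), 5, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

    # Check validation loss and stop if it hasn't improved for `patience` epochs
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(val_inputs), val_labels).item()
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        torch.save(model.state_dict(), 'hybrid_nn_best.pt')  # illustrative filename
    else:
        wait += 1
        if wait >= patience:
            print(f'Early stopping at epoch {epoch + 1}')
            break
```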
Step 3: Stacking with the Meta-Learner
The magic happens with the Meta-Learner. I grabbed probability outputs from KNN, CatBoost, and the Neural Network, stacked them into a new feature set, and trained a Logistic Regression model on top. Here’s the stacking process:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression

# Train base models
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

catboost = CatBoostClassifier(depth=8, learning_rate=0.1, l2_leaf_reg=1,
                              task_type='GPU', verbose=False)
catboost.fit(X_train, y_train)

# Get probabilities for stacking
knn_probs = knn.predict_proba(X_val)
catboost_probs = catboost.predict_proba(X_val)

# The hybrid NN expects original features plus CatBoost's probabilities,
# and softmax turns its raw logits into probabilities
X_val_hybrid = np.hstack([X_val, catboost_probs])
model.eval()
with torch.no_grad():
    nn_probs = torch.softmax(
        model(torch.tensor(X_val_hybrid, dtype=torch.float32)), dim=1).numpy()

# Stack predictions and train the Meta-Learner on the validation split
stacked_features = np.hstack([knn_probs, catboost_probs, nn_probs])
meta_learner = LogisticRegression()
meta_learner.fit(stacked_features, y_val)
```
Step 4: Evaluating the Results
On the test set, the Meta-Learner hit 97.37% accuracy—better than KNN (94.44%), CatBoost (97.04%), or the Neural Network (96.98%) alone. Precision, recall, and F1-scores hovered around 0.95-0.96, showing balanced performance across classes. The QSO class, originally underrepresented, saw the biggest boost from stacking.
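For completeness, here's roughly how the test-set numbers come out, reusing the stacking layout from Step 3; the `target_names` ordering follows the label encoding from Step 1:

```python
from sklearn.metrics import accuracy_score, classification_report

# Build the same stacked feature layout on the held-out test split
X_test_hybrid = np.hstack([X_test, catboost.predict_proba(X_test)])
model.eval()
with torch.no_grad():
    nn_test_probs = torch.softmax(
        model(torch.tensor(X_test_hybrid, dtype=torch.float32)), dim=1).numpy()

stacked_test = np.hstack([knn.predict_proba(X_test),
                          catboost.predict_proba(X_test),
                          nn_test_probs])
y_pred = meta_learner.predict(stacked_test)

print(accuracy_score(y_test, y_pred))  # ≈ 0.9737
print(classification_report(y_test, y_pred, target_names=['GALAXY', 'QSO', 'STAR']))
```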
Feature importance (via CatBoost's permutation importance) and SHAP values (for the Neural Network) pointed to `redshift` and its interactions as the MVPs, which makes sense: redshift tells us a lot about cosmic distances and object types.
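A hedged sketch of the permutation-importance check, using scikit-learn's implementation against the fitted CatBoost model; `n_repeats` and the choice of the validation split are my assumptions:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the accuracy drop on the validation split
result = permutation_importance(catboost, X_val, y_val,
                                n_repeats=5, random_state=42, scoring='accuracy')
importances = pd.Series(result.importances_mean, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
# redshift and the redshift_* interaction terms dominate
```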
Lessons Learned
- Ensemble Power: Combining models smoothed out individual weaknesses, especially for tricky quasars.
- Feature Engineering: Those redshift interactions were gold—don’t skip the domain knowledge!
- Compute Trade-Offs: Training took ~20 minutes total (mostly CatBoost and NN on GPU), but inference was lightning-fast.
What’s Next?
I’d love to tweak this further—maybe add polynomial features, try XGBoost in the stack, or scale it to bigger surveys. For now, I’m thrilled with 97.37% accuracy and a model that’s both powerful and interpretable.
What do you think? Have you tackled similar classification challenges? Drop a comment—I’d love to chat about stars, code, or both!
Kaggle: https://www.kaggle.com/code/allanwandia/stellar-classification-and-supervised-learning
Kaggle model card: https://www.kaggle.com/models/allanwandia/nexaastro_v1