Training a Stellar Classification Model with Ensemble Machine Learning

Scientific ML (2 Part Series)

1 Predicting Protein Secondary Structures with Machine Learning
2 Training a Stellar Classification Model with Ensemble Machine Learning


Hey there, fellow data enthusiasts! Today, I’m excited to share how I built a stellar classification model to categorize galaxies, quasars (QSOs), and stars using data from the Sloan Digital Sky Survey (SDSS17). This project, inspired by a recent technical whitepaper, combines K-Nearest Neighbors (KNN), Gaussian Mixture Models (GMM), CatBoost, a Neural Network, and a Logistic Regression Meta-Learner to hit a test accuracy of 97.37%. Let’s dive into how I trained this cosmic classifier, step by step!

The Mission: Classifying the Cosmos

Imagine you’ve got 100,000 observations of celestial objects—galaxies, quasars, and stars—each with 17 features like magnitudes (u, g, r, i, z), redshift, and positional data. Your job? Sort them into three buckets: GALAXY (0), QSO (1), and STAR (2). The SDSS17 dataset is a treasure trove, but it’s messy—imbalanced classes (59% galaxies, 22% stars, 19% quasars), outliers, and noisy metadata. My goal was to build a robust, interpretable model using ensemble techniques. Here’s how I did it.

Step 1: Prepping the Data

First things first: clean the data. The raw dataset needed some TLC to shine. Here’s what I did:

  1. Outlier Removal: Used the Interquartile Range (IQR) method to ditch 14,266 outliers. For each feature, I calculated Q1 − 1.5 × IQR and Q3 + 1.5 × IQR and dropped anything outside that range (a sketch of this filter follows the main snippet below).
  2. Feature Engineering: Created interaction terms like redshift_u (redshift × u magnitude) to capture astrophysical relationships. Redshift’s a big deal in astronomy, so I multiplied it with each photometric band (u, g, r, i, z).
  3. Drop the Noise: Tossed metadata like obj_ID and run_ID—they’re irrelevant for classification.
  4. Label Encoding: Mapped classes to numbers: GALAXY = 0, QSO = 1, STAR = 2.
  5. Split the Data: Used a 60/20/20 train/validation/test split with random_state=42 for reproducibility and stratified sampling to preserve class ratios.
  6. Standardization: Applied StandardScaler to normalize features to zero mean and unit variance.
  7. Balance the Classes: Hit the training set with SMOTE (Synthetic Minority Oversampling Technique) to even out the imbalanced classes.

Here’s a quick snippet of the preprocessing in Python:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import pandas as pd

# Load data (assuming a CSV export from SDSS17)
data = pd.read_csv('sdss17_stellar_data.csv')

# Feature engineering: redshift x photometric band interactions
for band in ['u', 'g', 'r', 'i', 'z']:
    data[f'redshift_{band}'] = data['redshift'] * data[band]

# Drop metadata that carries no physical signal
data = data.drop(columns=['obj_ID', 'run_ID', 'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'fiber_ID'])

# Encode labels
data['class'] = data['class'].map({'GALAXY': 0, 'QSO': 1, 'STAR': 2})

# Split features and target
X = data.drop('class', axis=1)
y = data['class']

# Train/validation/test split (60/20/20, stratified)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

# Scale features: fit on train only, transform val/test to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Apply SMOTE to the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
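One thing the snippet above skips is step 1, the outlier filter. Here's a minimal sketch of how the IQR cut can be applied right after loading (the exact feature list here is an assumption; extend it to all numeric features you want to filter):

# IQR outlier filter (sketch): keep rows inside the 1.5 * IQR whiskers for every feature
feature_cols = ['u', 'g', 'r', 'i', 'z', 'redshift']  # assumed subset of the 17 features
mask = pd.Series(True, index=data.index)
for col in feature_cols:
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask &= data[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
data = data[mask]  # drops ~14,266 rows on this dataset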


This gave me a clean, balanced dataset ready for modeling.

Step 2: Building the Ensemble Crew

I didn’t settle for one model—I built an ensemble of five to leverage their unique strengths. Here’s the lineup:

  1. K-Nearest Neighbors (KNN):

    • Tuned n_neighbors with GridSearchCV (tried 3, 5, 7, 10; picked 3); see the tuning sketch after this list.
    • Great for local patterns, but struggles with global structure.
  2. Gaussian Mixture Model (GMM):

    • Used BIC to pick 10 components (component selection is sketched after this list). It's unsupervised, so I evaluated it with the Adjusted Rand Index (ARI ≈ 0.26). Not a star player here, but useful for exploration.
  3. CatBoost:

    • Tuned depth (8), learning_rate (0.1), and l2_leaf_reg (1) with GridSearchCV.
    • Ran it on a GPU for speed. This gradient booster nailed complex interactions.
  4. Neural Network (HybridNN):

    • Built a net with two hidden layers (128 and 64 nodes, ReLU, BatchNorm, dropout 0.5) and an output layer (3 nodes).
    • Fed it CatBoost’s probabilities as extra features—pretty cool hybrid twist!
    • Trained with Adam, cross-entropy loss, and early stopping (patience=5).
  5. Meta-Learner (Logistic Regression):

    • Stacked predictions from KNN, CatBoost, and Neural Network to make the final call.
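
Before we get to the network, here's roughly how the KNN grid search and the BIC-driven GMM component selection can be wired up (a sketch; the candidate grids beyond the values quoted above are my assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# KNN: grid search over the candidate neighbor counts
knn_search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 10]}, cv=5)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_)  # n_neighbors=3 won here

# GMM: pick the component count that minimizes BIC, then score with ARI
bics = {}
for k in range(2, 16):  # assumed search range
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X_train)
    bics[k] = gmm.bic(X_train)
best_k = min(bics, key=bics.get)  # 10 in this project
gmm = GaussianMixture(n_components=best_k, random_state=42).fit(X_train)
print(adjusted_rand_score(y_train, gmm.predict(X_train)))  # ~0.26 on this data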

Here’s how I trained the Neural Network with PyTorch:

import numpy as np
import torch
import torch.nn as nn

class HybridNN(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, 3)
        )

    def forward(self, x):
        return self.net(x)

# Hybrid input: the scaled features plus CatBoost's 3 class probabilities.
# This assumes the CatBoost model shown in Step 3 has already been fitted.
X_train_hybrid = np.hstack([X_train, catboost.predict_proba(X_train)])

model = HybridNN(input_size=X_train_hybrid.shape[1])
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (simplified: full batch, early stopping not shown)
inputs = torch.tensor(X_train_hybrid, dtype=torch.float32)
labels = torch.tensor(y_train.values, dtype=torch.long)
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
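
The loop above is deliberately bare-bones. The early stopping I mentioned (patience=5) looks roughly like this sketch, reusing inputs/labels from above and building the matching hybrid validation set:

# Early stopping on validation loss (patience=5)
val_inputs = torch.tensor(np.hstack([X_val, catboost.predict_proba(X_val)]), dtype=torch.float32)
val_labels = torch.tensor(y_val.values, dtype=torch.long)

best_loss, wait, patience = float('inf'), 0, 5
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(val_inputs), val_labels).item()
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        wait += 1
        if wait >= patience:
            break

model.load_state_dict(best_state)  # roll back to the best checkpoint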


Step 3: Stacking with the Meta-Learner

The magic happens with the Meta-Learner. I grabbed probability outputs from KNN, CatBoost, and the Neural Network, stacked them into a new feature set, and trained a Logistic Regression model on top. Here’s the stacking process:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression

# Train base models
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

catboost = CatBoostClassifier(depth=8, learning_rate=0.1, l2_leaf_reg=1, task_type='GPU', verbose=0)
catboost.fit(X_train, y_train)

# Get probabilities for stacking
knn_probs = knn.predict_proba(X_val)
catboost_probs = catboost.predict_proba(X_val)

# The network expects the hybrid features (X_val plus CatBoost probabilities);
# eval mode freezes BatchNorm/Dropout, and softmax turns logits into probabilities
X_val_hybrid = np.hstack([X_val, catboost_probs])
model.eval()
with torch.no_grad():
    nn_probs = torch.softmax(model(torch.tensor(X_val_hybrid, dtype=torch.float32)), dim=1).numpy()

# Stack predictions
stacked_features = np.hstack([knn_probs, catboost_probs, nn_probs])

# Train Meta-Learner on the validation-set predictions
meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(stacked_features, y_val)


Step 4: Evaluating the Results

On the test set, the Meta-Learner hit 97.37% accuracy—better than KNN (94.44%), CatBoost (97.04%), or the Neural Network (96.98%) alone. Precision, recall, and F1-scores hovered around 0.95-0.96, showing balanced performance across classes. The QSO class, originally underrepresented, saw the biggest boost from stacking.
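
For completeness, here's how those test numbers can be reproduced from the pieces above (a sketch that reuses the fitted models and the scaled X_test from the earlier snippets):

from sklearn.metrics import accuracy_score, classification_report

# Build the same stacked feature matrix for the held-out test set
knn_test = knn.predict_proba(X_test)
cat_test = catboost.predict_proba(X_test)
model.eval()
with torch.no_grad():
    nn_test = torch.softmax(model(torch.tensor(np.hstack([X_test, cat_test]), dtype=torch.float32)), dim=1).numpy()

stacked_test = np.hstack([knn_test, cat_test, nn_test])
y_pred = meta_learner.predict(stacked_test)

print(accuracy_score(y_test, y_pred))  # ~0.9737
print(classification_report(y_test, y_pred, target_names=['GALAXY', 'QSO', 'STAR']))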

Feature importance (via CatBoost’s permutation importance) and SHAP values (for the Neural Network) pointed to redshift and its interactions as MVPs—makes sense, since redshift tells us a lot about cosmic distances and object types.
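
If you want to poke at this yourself, here's a sketch: CatBoost's built-in get_feature_importance stands in for the permutation variant here, and a model-agnostic KernelExplainer handles the network (the wrapper function and sample size are my assumptions):

import shap

# CatBoost: built-in feature importances, paired with the original column names
for name, score in zip(X.columns, catboost.get_feature_importance()):
    print(f'{name}: {score:.2f}')

# Neural network: model-agnostic SHAP on a small background sample (slow, so keep it small)
def nn_predict(x):
    model.eval()
    with torch.no_grad():
        return torch.softmax(model(torch.tensor(x, dtype=torch.float32)), dim=1).numpy()

background = shap.sample(X_val_hybrid, 100)
explainer = shap.KernelExplainer(nn_predict, background)
shap_values = explainer.shap_values(X_val_hybrid[:100])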

Lessons Learned

  • Ensemble Power: Combining models smoothed out individual weaknesses, especially for tricky quasars.
  • Feature Engineering: Those redshift interactions were gold—don’t skip the domain knowledge!
  • Compute Trade-Offs: Training took ~20 minutes total (mostly CatBoost and NN on GPU), but inference was lightning-fast.

What’s Next?

I’d love to tweak this further—maybe add polynomial features, try XGBoost in the stack, or scale it to bigger surveys. For now, I’m thrilled with 97.37% accuracy and a model that’s both powerful and interpretable.

What do you think? Have you tackled similar classification challenges? Drop a comment—I’d love to chat about stars, code, or both!
Kaggle: https://www.kaggle.com/code/allanwandia/stellar-classification-and-supervised-learning
Kaggle model card: https://www.kaggle.com/models/allanwandia/nexaastro_v1

