Unveiling the Cosmos: Discovering Exoplanets with Machine Learning

Join us as we navigate through the Transit Method, light curve preprocessing, and the art of training ML models.

KDAG IIT KGP
Oct 4, 2023
An exoplanet transiting a star

In our quest to explore the cosmos, we, a dedicated team of aspiring scientists, delved into the fascinating world of Transiting Exoplanet Discovery using Machine Learning (ML) techniques. Join us on this cosmic journey as we demystify the process step by step, from the basics of the Transit Method to training ML models for exoplanet detection.

What is the Transit Method?

  • Introduction to Exoplanets:

Exoplanets, planets beyond our Solar System that orbit stars other than the Sun, have captivated the imaginations of astronomers for decades. These distant worlds hold secrets waiting to be uncovered.

  • The Transit Phenomenon:

The Transit Method, an essential technique in exoplanet discovery, relies on an exoplanet passing in front of its parent star as observed from Earth. This passage causes a temporary dip in the star's brightness.

  • Understanding Lightcurves:

To grasp this phenomenon, we employ lightcurves: graphs displaying the star's brightness over time. The dips in brightness within these curves provide valuable clues about the exoplanet's properties, such as its size, orbital radius, and temperature.
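
To get a feel for the numbers, here is a rough back-of-the-envelope sketch (our own illustration, not part of the original analysis): the fractional dip in brightness is approximately the square of the planet-to-star radius ratio.

# Rough illustration: transit depth ~ (R_planet / R_star)^2
R_sun = 696_340      # solar radius in km
R_jupiter = 69_911   # Jupiter radius in km

depth = (R_jupiter / R_sun) ** 2
print(f"Transit depth: {depth:.4f} (~{depth * 100:.1f}% dip)")
# A Jupiter-size planet dims a Sun-like star by only about 1%,
# which is why precise photometry is essential.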

Workflow

Our journey begins with a carefully crafted workflow that ensures a systematic approach to exoplanet discovery.

Our Workflow
  • Step 1: Data Acquisition: We kickstart the process by downloading lightcurve data for stars both with and without exoplanets, which serves as the foundation for our analysis.
  • Step 2: Lightcurve Processing: This crucial step involves meticulous analysis to distinguish lightcurves with exoplanets from those without.
  • Step 3: NASA's Cumulative Dataset: NASA's vast cumulative dataset becomes our playground for deeper analysis.
  • Step 4: ML Model Application: With a wealth of data at our disposal, we employ various ML models to identify exoplanets.
  • Step 5: Model Evaluation: We rigorously analyse the performance of these models to determine the best algorithm for Exoplanet Detection.

Light Curve Preprocessing

Lightkurve, a Python library, emerges as a vital tool for analysing time series data of celestial objects. It grants access to data from NASA's Kepler and TESS telescopes.

● Target pixel file: The target pixel file (TPF) is downloaded through a function of the Lightkurve library; each pixel here refers to an actual pixel on the camera of the Kepler or TESS telescope.

We provide the KIC ID of a star, and a small pixel stamp (5x5 pixels in this case) is generated. The boxes show the luminosity recorded by each pixel over time, but it is hard to interpret anything from this directly, so we plot light curves instead.
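
For readers who want to reproduce this, a minimal sketch using Lightkurve's search interface follows (the quarter is our own arbitrary choice; the library installs with pip install lightkurve):

import lightkurve as lk

# Search for and download the target pixel file (TPF) of KIC 6922244;
# quarter=16 is an illustrative choice, any available quarter works
tpf = lk.search_targetpixelfile("KIC 6922244", author="Kepler", quarter=16).download()

# Plot one cadence of the pixel stamp
tpf.plot(frame=0)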

Light curve plot: In this graph, the vertical lines show periodic dips in the flux output of the star. The raw curve also drifts, which could mean the star is dimming, moving up and down, or simply that the telescope itself is moving; this is why preprocessing is needed.
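
The light curve is extracted from the TPF by summing the flux inside an aperture at each cadence; a minimal sketch, reusing the tpf object from above:

# Convert the pixel data to a light curve using Kepler's pipeline aperture
lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)
lc.plot()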

Different preprocessing steps applied to the light curve

1. Flattening: Removes long-term trends from the data for better analysis. After flattening, we can identify the exact period at which the planet crosses the star.
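
A minimal sketch using Lightkurve's flatten() method (the window length is our own typical choice, not a value from the original analysis):

# Remove long-term trends (stellar variability, spacecraft drift)
# with a Savitzky-Golay filter
flat_lc = lc.flatten(window_length=401)
flat_lc.plot()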

2. Folding: Phase plots contain "folded" data that fit within a mathematically determined period; these are often referred to as folded light curves. Folding stacks all the periods on top of one another so the dips line up, giving a better and more precise view of them. By observing the amount of change in luminosity, we can interpret more features, such as the size of the planet.
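
Folding is a one-liner in Lightkurve; a sketch, using the period found in the next step:

# Stack all transits on top of each other by folding on the orbital period
folded_lc = flat_lc.fold(period=3.5225)  # period in days, from the periodogram below
folded_lc.plot()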

3. Periodogram: The folding above requires an exact period, and that period is determined with the help of a periodogram. It creates a frequency graph that shows all the repetitive patterns in the light curve and tells us which one is most likely to be periodic.

This graph shows that a period of 3.5225 days is the most probable.
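
A sketch of how such a periodogram can be computed with Lightkurve's Box Least Squares (BLS) method (the trial-period grid is our own illustrative choice):

import numpy as np

# Search a grid of trial periods for box-shaped dips
periodogram = flat_lc.to_periodogram(method="bls", period=np.arange(0.5, 10, 0.001))
periodogram.plot()

# The best-fitting period, ~3.5225 days for KIC 6922244
print(periodogram.period_at_max_power)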

4. Folding and scattering: Finally, we use a scatter plot of the folded curve to get a better view of, and more information about, the transit. The significant drop gives reasonable evidence that something is going around the star periodically, and the depth and width of this dip give more information about the planet.
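
In Lightkurve this is simply the scatter() method on the folded light curve:

# Scatter plot of the folded curve: the depth and width of the dip are
# easier to read off than in the connected line plot
folded_lc.scatter()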

Deciding Exoplanet Presence

The above plots were made for the star KIC 6922244, around which an exoplanet is confirmed to be present.

Now that we know how the graphs look when an exoplanet orbits a star, let's look at the same plots for a star where it is pre-determined that no exoplanet is present. We'll use KIC 40521343 for this example, repeating the steps below (a consolidated code sketch follows the list).

1. Plotting the pixel graph

2. Plotting the light curve

3. Flattening of the curve

4. Folding of the curve

5. Periodogram of the curve

6. Folding and scattering of the curve
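
The six plots above can be reproduced for any target with a small helper. This is our own convenience wrapper, sketched under the same assumptions as before (window length and period grid are illustrative choices), not part of the original notebook:

def plot_transit_diagnostics(kic_id):
    """Reproduce the six diagnostic plots above for any KIC target."""
    import numpy as np
    import lightkurve as lk

    tpf = lk.search_targetpixelfile(kic_id, author="Kepler").download()
    tpf.plot(frame=0)                                    # 1. pixel graph

    lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)
    lc.plot()                                            # 2. light curve

    flat_lc = lc.flatten(window_length=401)
    flat_lc.plot()                                       # 3. flattened curve

    # The periodogram (step 5) is computed before folding (step 4)
    # because folding needs the period it finds
    pg = flat_lc.to_periodogram(method="bls", period=np.arange(0.5, 10, 0.001))
    folded_lc = flat_lc.fold(period=pg.period_at_max_power)
    folded_lc.plot()                                     # 4. folded curve
    pg.plot()                                            # 5. periodogram
    folded_lc.scatter()                                  # 6. folded and scattered

plot_transit_diagnostics("KIC 40521343")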

Conclusions from the plots:

  1. The light curve does not show any periodic change in the flux intensity of the star.
  2. Folding the curve does not make any difference to the light curve.
  3. The periodogram also implies that there is no specific repetitive pattern that is periodic, i.e. the changes in the star's flux intensity are uneven.
  4. The folded and scattered curve also does not hint at any periodicity in the flux intensity.

From the above points, we can conclude that no exoplanet exists around the star.

Training an ML Model

Now that we know how to determine whether there is an exoplanet around a star, we take a dataset provided by the NASA Exoplanet Archive, which contains data about different stars, including whether an exoplanet has been found around each.

Here, we delve into the technical aspects of our project, explaining how we prepared the data for ML model training.

  • Importing the dataset and necessary libraries:

We import the libraries required to train our ML models.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
pd.set_option('max_columns', None)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

import warnings
warnings.filterwarnings(action='ignore')

We also import the dataset.

data = pd.read_csv('cumulative.csv')
  • Data Preprocessing:

Our data preprocessing involves dropping unused columns and columns whose values are all missing, filling the remaining missing values, one-hot encoding categorical features, splitting the data into train and test sets, and scaling it using StandardScaler().

def preprocess_inputs(df):
    df = df.copy()

    # Drop unused columns
    df = df.drop(['rowid', 'kepid', 'kepoi_name', 'kepler_name', 'koi_pdisposition', 'koi_score'], axis=1)

    # Limit target values to CANDIDATE and CONFIRMED
    false_positive_rows = df.query("koi_disposition == 'FALSE POSITIVE'").index
    df = df.drop(false_positive_rows, axis=0).reset_index(drop=True)

    # Drop columns with all missing values
    df = df.drop(['koi_teq_err1', 'koi_teq_err2'], axis=1)

    # Fill remaining missing values
    df['koi_tce_delivname'] = df['koi_tce_delivname'].fillna(df['koi_tce_delivname'].mode()[0])
    for column in df.columns[df.isna().sum() > 0]:
        df[column] = df[column].fillna(df[column].mean())

    # One-hot encode koi_tce_delivname column
    delivname_dummies = pd.get_dummies(df['koi_tce_delivname'], prefix='delivname')
    df = pd.concat([df, delivname_dummies], axis=1)
    df = df.drop('koi_tce_delivname', axis=1)

    # Split df into X and y
    y = df['koi_disposition']
    X = df.drop('koi_disposition', axis=1)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = preprocess_inputs(data)
  • Finally, Training the ML Models:

We train the following ML algorithms on the preprocessed data:

Logistic Regression, Neural Network — MLP Classifier, Random Forest Classifier and Gradient Boosting Classifier.

models = {
    "Logistic Regression": LogisticRegression(),
    "Neural Network": MLPClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

Analysing Model Performance

  • The ML Algorithms: We applied various ML algorithms, including Logistic Regression, Neural Network (MLP Classifier), Random Forest Classifier, and Gradient Boosting Classifier.
  • Performance Metrics: We evaluate these models based on accuracy, precision, F1-score, and recall metrics.
def get_classifications(y_test, y_pred, positive_label='CONFIRMED'):
    tp = 0
    fn = 0
    fp = 0
    tn = 0

    for y_t, y_p in zip(y_test, y_pred):
        if y_t == positive_label:
            if y_p == positive_label:
                tp += 1
            else:
                fn += 1
        else:
            if y_p == positive_label:
                fp += 1
            else:
                tn += 1

    return tp, fn, fp, tn

def get_accuracy(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + fn + fp + tn)
    return acc

def get_precision(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    return precision

def get_recall(tp, fn, fp, tn):
    recall = tp / (tp + fn)
    return recall

def get_f1_score(tp, fn, fp, tn):
    precision = get_precision(tp, fn, fp, tn)
    recall = get_recall(tp, fn, fp, tn)
    f1_score = (2 * precision * recall) / (precision + recall)
    return f1_score
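
With these helpers in place, the scores below can be produced roughly as follows (a usage sketch; the original post reports them as plots):

# Evaluate every trained model on the held-out test set
for name, model in models.items():
    y_pred = model.predict(X_test)
    tp, fn, fp, tn = get_classifications(y_test, y_pred)
    print(f"{name}: "
          f"accuracy={get_accuracy(tp, fn, fp, tn):.3f}, "
          f"precision={get_precision(tp, fn, fp, tn):.3f}, "
          f"recall={get_recall(tp, fn, fp, tn):.3f}, "
          f"f1={get_f1_score(tp, fn, fp, tn):.3f}")
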
Accuracy & Precision Scores
Recall & F1 Scores

Choosing the Best Model

After rigorous analysis, we conclude that the Gradient Boosting Classifier boasts the highest accuracy, making it our top choice for exoplanet detection on future data.
