Data Science Capstone Project

Winning Space Race with Data Science

This was my capstone project from the IBM Data Science Professional Certificate course. It involved applying my understanding of Data Science methodology, Python libraries (NumPy, Pandas, Matplotlib, Seaborn, Folium, and Scikit-Learn), SQL, math, machine learning, and presentation creation.

The major steps of the project were data collection, data cleaning (selecting which data to keep, transforming it, loading it into a database, and querying it), data analysis, data visualization, interactive dashboard creation, machine learning model training, and final report writing.
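As a rough illustration of the "place in database and query" step, the sketch below loads a small table into an in-memory SQLite database and summarizes it with SQL. The table name, column names, and sample rows are hypothetical stand-ins, not the project's actual schema or data.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows; the real project data is not reproduced here.
launches = pd.DataFrame({
    "flight_number": [1, 2, 3, 4],
    "launch_site": ["Site A", "Site A", "Site B", "Site B"],
    "payload_mass_kg": [500.0, 3100.0, 2200.0, 5300.0],
    "landing_success": [0, 1, 0, 1],
})

# Load the cleaned table into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
launches.to_sql("launches", conn, index=False)

# Count launches and compute the landing success rate per site.
query = """
    SELECT launch_site,
           COUNT(*) AS n_launches,
           AVG(landing_success) AS success_rate
    FROM launches
    GROUP BY launch_site
    ORDER BY launch_site
"""
site_summary = pd.read_sql_query(query, conn)
conn.close()

print(site_summary)
```

An in-memory database keeps the example self-contained; a file path or another database engine could be substituted without changing the query.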

The Jupyter Notebooks and final presentation (.pdf) can be accessed via the links below. These files and the raw and processed data are stored on my GitHub page under the data-science-capstone repository.

Summary: In competition with SpaceX, a rival rocket launch company wants to make predictions about the success/failure of SpaceX Falcon 9 rocket first stage landings. What is the nature and extent of the data that we have on SpaceX Falcon 9 first stage landings? Which machine learning model would work best (have the highest accuracy) to predict the outcome of a Falcon 9 first stage landing from a future launch? Will a future Falcon 9 first stage landing be successful?

Methodology: Data was collected from the SpaceX public API and publicly available data on Wikipedia. Data wrangling included extracting launch outcome information to serve as the dependent variable in the Machine Learning models. SQL queries and data visualizations (static plots, interactive maps, and an interactive dashboard) were created to discover insights about the data set and answer questions. Predictive analysis was pursued using Logistic Regression, SVM (Support Vector Machine), Decision Tree, and KNN (k-Nearest Neighbors) Machine Learning models.
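The wrangling step that turns launch outcome information into a dependent variable might look something like the sketch below. It assumes, purely for illustration, that each outcome is a string whose first token says whether the booster landed (for example "True ASDS" or "False Ocean"); the column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical outcome strings; the format is an assumption for illustration.
df = pd.DataFrame({
    "outcome": ["True ASDS", "False Ocean", "None None", "True RTLS"],
})

# Label a row 1 when the first token reports a successful landing, else 0.
# Missing or failed landings ("None ...", "False ...") both map to 0.
df["landing_class"] = df["outcome"].str.split().str[0].eq("True").astype(int)

print(df)
```

Collapsing the outcome to a single 0/1 column gives the classifiers a binary target to predict.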

Results: Launch data, including flight number, date of launch, payload mass, orbit type, launch site, mission outcome, and other variables, were explored and visualized. Logistic Regression, SVM (Support Vector Machine), and KNN (k-Nearest Neighbors) performed equally well on this dataset.
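One way a comparison of the four model types could be run is sketched below with scikit-learn. This is not the project's actual training code: the data is synthetic (generated with make_classification), the hyperparameters are library defaults, and the 5-fold cross-validation setup is an assumption, so the printed accuracies say nothing about the real Falcon 9 results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the launch features and the binary landing label.
X, y = make_classification(n_samples=90, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Score each model by mean 5-fold cross-validated accuracy.
# Scaling sits inside the pipeline so it is refit on each training fold.
scores = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

On a small dataset, several models can easily tie on accuracy, so cross-validated scores tend to be a steadier basis for comparison than a single train/test split.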