Data Viz, Imputation, & Feature Selection

  • Tech Stack: Python (numpy, pandas, matplotlib, plotly, keras, sklearn (t-SNE, PCA), UMAP)
  • GitHub URL: Project Link

This project demonstrates how three dimensionality reduction techniques (PCA, t-SNE, and UMAP) make the high-dimensional image data in the MNIST dataset easier to visualize and explore.

PCA
Principal Component Analysis is a linear dimensionality reduction technique that projects high-dimensional data onto a smaller set of orthogonal axes (principal components), ordered so that each successive axis captures as much of the remaining variance as possible.
t-SNE
t-Distributed Stochastic Neighbor Embedding is a nonlinear method that embeds high-dimensional data in two or three dimensions while preserving local neighborhood structure, so that similar points show up as visible clusters.
UMAP
Uniform Manifold Approximation and Projection is a nonlinear technique that reduces dimensionality by modeling the data’s manifold structure, preserving both local and some global relationships for clearer visualization and clustering.
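
The three techniques described above can be compared side by side with a short sketch. This is not the project's actual code. It assumes scikit-learn's small 8x8 `load_digits` set as a stand-in for MNIST (so nothing has to be downloaded), a 500-sample subset to keep t-SNE quick, and illustrative hyperparameters (`perplexity=30`, `n_neighbors=15`, `min_dist=0.1`). UMAP comes from the separate `umap-learn` package, so it is treated as optional here.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

try:
    import umap  # provided by the umap-learn package; optional in this sketch
except ImportError:
    umap = None

# Assumption: the 8x8 digits set stands in for MNIST (28x28) to avoid a download.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]   # subsample so t-SNE finishes quickly
X = X / 16.0              # pixel intensities in this dataset range from 0 to 16

embeddings = {}

# PCA: linear projection onto the two directions of highest variance.
pca = PCA(n_components=2, random_state=0)
embeddings["PCA"] = pca.fit_transform(X)

# t-SNE: nonlinear embedding that preserves local neighborhoods.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embeddings["t-SNE"] = tsne.fit_transform(X)

# UMAP: nonlinear embedding based on a neighborhood graph of the data.
if umap is not None:
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
    embeddings["UMAP"] = reducer.fit_transform(X)

for name, emb in embeddings.items():
    print(name, emb.shape)
print("variance explained by 2 PCA components:", pca.explained_variance_ratio_.sum())
```

Each entry in `embeddings` is an array with one 2D point per image, which can be passed directly to a scatter plot colored by `y`.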

By applying these methods, we revealed clear clusters and patterns among handwritten digits, making complex data more interpretable.
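
A minimal sketch of how such a cluster plot might be produced with matplotlib is shown here. It is an illustration under assumptions, not the project's plotting code: it uses scikit-learn's `load_digits` set in place of MNIST, PCA as the reducer because it is the fastest of the three, and a hypothetical output file name `digits_pca.png` written to the system temp directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the 8x8 digits set stands in for MNIST.
X, y = load_digits(return_X_y=True)
emb = PCA(n_components=2, random_state=0).fit_transform(X)

# One point per image, colored by its digit label.
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
fig.colorbar(points, ax=ax, ticks=range(10), label="digit")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_title("Digits projected to 2D with PCA")

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.gettempdir(), "digits_pca.png")
fig.savefig(out_path, dpi=120)
plt.close(fig)
```

Swapping the `PCA` line for a t-SNE or UMAP reducer leaves the plotting code unchanged, since each returns an array of 2D coordinates.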

Building on these insights, a neural network model achieved 94% accuracy in classifying digits, highlighting the combined value of visualization and machine learning for data understanding and prediction.
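
For readers who want to try the classification step, here is a small sketch. It is not the project's Keras model: it swaps in scikit-learn's `MLPClassifier` and the 8x8 `load_digits` set so that it runs without a deep learning framework or a dataset download. The single hidden layer of 64 units, the 80/20 split, and `max_iter=500` are all assumptions for illustration, and the accuracy it reports should not be compared directly with the figure quoted above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumption: the 8x8 digits set stands in for MNIST.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel intensities (0 to 16) into the range 0 to 1

# Hold out 20% of the images, keeping class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in model: a small fully connected network with one hidden layer.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of held-out digits classified correctly
print(f"held-out accuracy: {accuracy:.3f}")
```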