Projects

Showcasing my technical projects and achievements

Parallelizing Text-to-Image Generation Using Diffusion

Python PyTorch Dask CLIP
  • Built a DDPM model from scratch and engineered a Dask-powered distributed preprocessing pipeline (multiprocessing backend) for text-to-image datasets, achieving 1.48x CPU throughput scaling over a single-threaded baseline.
  • Accelerated training with Automatic Mixed Precision (AMP) and added Exponential Moving Average (EMA) weight averaging (β = 0.999), reducing training time by 11% (23.5 → 20.9 hrs/epoch) and loss variance by 29% through gradient histogram stabilization.
1.48x CPU throughput scaling
11% training time reduction
29% loss variance reduction
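
To illustrate the EMA weight averaging with β = 0.999 mentioned above, here is a minimal pure-Python sketch. It is not the project's training code: the function name `ema_update` and the toy weight lists are hypothetical, and in a PyTorch setup the same update would typically be applied to each parameter tensor after the optimizer step.

```python
def ema_update(ema_weights, model_weights, beta=0.999):
    """Return updated EMA weights: beta * ema + (1 - beta) * current."""
    return [beta * e + (1.0 - beta) * w
            for e, w in zip(ema_weights, model_weights)]


# Hypothetical toy weights, used only to show how slowly the average moves.
ema = [0.0, 1.0]
current = [1.0, 1.0]
for _ in range(1000):
    ema = ema_update(ema, current)

# After n steps toward a constant target, the gap shrinks by a factor beta**n.
print(ema)
```

With β = 0.999 the average has an effective horizon of roughly 1 / (1 - β) = 1000 steps, which is why it smooths step-to-step noise in the weights.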

Fintech-Data-Processing-ETL-Platform

Python Snowflake Airflow FastAPI
  • Built a master financial statement database for US public companies, implementing three distinct storage strategies (Raw Staging, JSON Transformation, Denormalized Fact Tables) with Snowflake to optimize data access patterns.
  • Engineered automated ETL pipelines with Apache Airflow for SEC data retrieval, scraping the SEC Markets Data page, staging files in S3, and validating schema integrity with the Data Validation Tool (DVT).
  • Developed a client-facing analytics solution combining Streamlit for visualization and FastAPI backend services, enabling financial analysts to perform real-time fundamental analysis of public company data.
Master financial database
Automated ETL pipelines
Real-time analytics

Academic Projects

NVIDIA GB200 GPU Programming Challenge

CuteDSL CUDA PTX Triton NVFP4/FP8
  • Placed 24th in GPUMODE's NVFP4 Gated Dual GEMM competition; optimized Blackwell B200 tensor core kernels for block-scaled matrix operations using CuteDSL (CUTLASS Python wrapper) and inline PTX assembly.
  • Achieved 3-6x speedup over baseline through custom SwiGLU epilogue in PTX (halving SFU instruction count), TMA memory pipelining, and async scheduling optimizations; progressed from 60th to 4th across 3 challenges.
Ranked 24th in GPUMODE
3-6x speedup over baseline
Blackwell B200 optimization
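
For readers unfamiliar with the SwiGLU epilogue referenced above, here is a small pure-Python reference sketch of the math, not the PTX kernel. The function names are hypothetical. The `silu_tanh` variant uses the identity sigmoid(x) = 0.5 · (1 + tanh(x / 2)), which needs one transcendental call instead of an exponential plus a division; whether the competition kernel used this identity is an assumption, not something stated above.

```python
import math


def silu(x):
    """SiLU via the textbook form x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def silu_tanh(x):
    """Same function via sigmoid(x) = 0.5 * (1 + tanh(x / 2))."""
    return 0.5 * x * (1.0 + math.tanh(0.5 * x))


def swiglu(gate, up):
    """Elementwise SwiGLU: silu(gate) * up for two equal-length vectors."""
    return [silu(g) * u for g, u in zip(gate, up)]


# Hypothetical inputs standing in for the two GEMM outputs of one row.
print(swiglu([0.0, 1.0, -2.0], [3.0, 2.0, 1.0]))
```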

Medical Knowledge Graph RAG System

Neo4j LangChain Graph RAG Python

Developed a Graph RAG framework for medical data retrieval that uses automated triplet extraction and dynamic knowledge graph construction with Neo4j. The system optimizes semantic search through LangChain and retrieves medical information with more contextual understanding than traditional retrieval methods.

Automated triplet extraction
Dynamic knowledge graph construction
Semantic search optimization
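
The idea of building a knowledge graph incrementally from extracted triplets can be sketched in a few lines of plain Python. This is an in-memory illustration only, assuming (subject, relation, object) tuples; it does not use Neo4j or LangChain, and the function names and the toy triplets are hypothetical.

```python
from collections import defaultdict


def add_triplets(graph, triplets):
    """Merge (subject, relation, object) triplets into the graph in place."""
    for subj, rel, obj in triplets:
        graph[subj].add((rel, obj))  # a set makes repeated triplets harmless
    return graph


def neighbors(graph, entity, relation=None):
    """Return sorted objects linked to entity, optionally filtered by relation."""
    return sorted(obj for rel, obj in graph.get(entity, ())
                  if relation is None or rel == relation)


# Toy triplets arriving in two batches, as if extracted from two documents.
graph = defaultdict(set)
add_triplets(graph, [("metformin", "TREATS", "type 2 diabetes")])
add_triplets(graph, [("metformin", "HAS_SIDE_EFFECT", "nausea"),
                     ("metformin", "TREATS", "type 2 diabetes")])

print(neighbors(graph, "metformin", "TREATS"))
```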

Llama3 - Pure C/CUDA Implementation

C/CUDA llm.c Performance Optimization GPU Kernels

Implementation of Llama 3.1 in pure C/CUDA, built on Karpathy's llm.c. Optimized several kernels, including RMSNorm (2.3x speedup) and SwiGLU with bfloat16, and integrated the optimizations into the multi-head attention (MHA) implementation. Performance gains come from cooperative groups, coalesced memory access, and efficient load/store operations. Features a complete forward/backward pass with attention mechanisms and RoPE embeddings.

2.3x RMSNorm speedup
Optimized GPU kernels
Complete forward/backward pass
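
As a reference for what the RMSNorm kernel computes, here is a minimal pure-Python sketch of the forward pass. It is not the CUDA kernel: the function name is hypothetical and the default `eps=1e-5` is an assumption chosen for illustration.

```python
import math


def rmsnorm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Hypothetical 2-element activation with unit weights.
out = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(out)
```

The reduction over `x` followed by an elementwise scale is the pattern a GPU kernel parallelizes, typically with one block or warp per row.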

Qwen600 CUDA to ROCm Port for AMD MI300X

C++ HIP CUDA ROCm

Ported six essential transformer kernels from NVIDIA CUDA to AMD ROCm/HIP, optimizing for the MI300X architecture (304 CUs, 192 GB HBM3, 5.3 TB/s peak bandwidth). Achieved 12.1 ms per generated token with 79-83% memory bandwidth utilization. Implemented BFloat16 optimizations with coalesced memory access patterns and fused operations, reducing memory traffic by 50%. Features full CUDA-to-HIP conversion with multi-GPU architecture support and an optimized softmax implementation.

12.1ms token generation
50% memory traffic reduction
Multi-GPU architecture support

Stock Price Forecaster

Streamlit Hidden Markov Models Time Series Python

Streamlit application that forecasts stock prices using Hidden Markov Models trained on historical data. Supports 1-day and multi-day forecasts for major tech stocks (AAPL, TSLA, NVDA). Models a multivariate time series of opening price, closing price, and volume. Trained by maximum likelihood estimation, with the number of hidden states selected via the Akaike Information Criterion to prevent overfitting. Deployed at stock-hmm-analysis.streamlit.app.

Multi-day forecasting
Hidden Markov Models
Live deployment
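
The AIC-based choice of hidden state count can be sketched as follows. This is a minimal illustration, not the app's code: it assumes a Gaussian HMM with full covariance matrices over three features (open, close, volume), and the log-likelihood values are made-up placeholders rather than results from the project.

```python
def gaussian_hmm_param_count(n_states, n_features):
    """Free parameters of a Gaussian HMM with full covariances (assumed)."""
    start = n_states - 1                                   # initial distribution
    trans = n_states * (n_states - 1)                      # transition matrix rows
    means = n_states * n_features                          # one mean vector per state
    covs = n_states * n_features * (n_features + 1) // 2   # symmetric covariances
    return start + trans + means + covs


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def select_n_states(log_likelihoods, n_features):
    """Pick the state count whose fitted model has the lowest AIC."""
    return min(log_likelihoods,
               key=lambda n: aic(log_likelihoods[n],
                                 gaussian_hmm_param_count(n, n_features)))


# Placeholder log-likelihoods keyed by number of hidden states.
fitted = {2: -1000.0, 3: -950.0, 4: -945.0}
best = select_n_states(fitted, n_features=3)
print(best)
```

The penalty term grows with the state count, so a small likelihood gain from an extra state is not enough to win.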

Work Experience Projects

Advanced Suicidal Ideation Classification System

AWS SageMaker ETL Pipelines SFT-DPO Training Model Sharding

Led development of a multi-conversation classification system processing diverse text inputs to detect mental health risks. Created comprehensive ETL pipelines for 123,000 training samples, implemented hybrid SFT-DPO training methodologies, and deployed on AWS SageMaker with model sharding across GPU instances. Achieved 71% macro-AUC improvement and 40% output consistency improvement through constitutional AI controls and real-time monitoring.

71% macro-AUC improvement
40% output consistency improvement
123,000 training samples
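
For context on the macro-AUC metric cited above, here is a minimal pure-Python sketch that averages one-vs-rest AUC over classes. It is an assumption that the metric was computed this way; the function names and the toy labels and scores are hypothetical and unrelated to the real dataset.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, y_score, n_classes):
    """Unweighted mean of one-vs-rest AUC across classes."""
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes


# Toy 3-class example with per-class scores for four samples.
y_true = [0, 1, 2, 0]
y_score = [[0.8, 0.1, 0.1],
           [0.2, 0.7, 0.1],
           [0.1, 0.2, 0.7],
           [0.6, 0.3, 0.1]]
print(macro_auc(y_true, y_score, n_classes=3))
```

Because every class counts equally, macro-AUC rewards improvements on rare classes as much as on common ones.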

Two-Stage Cardiac Arrhythmia Detection System

PyTorch CNN-Transformer Signal Processing Medical AI

Designed a novel neural architecture that processes ECG signals through specialized CNN-Transformer hybrids. The first stage classifies overall rhythm patterns with 94-98% accuracy, and the second stage provides adaptive ectopic beat detection with 90-95% accuracy. Dynamic windowing algorithms adjust to patient-specific heart rates, and the system achieves perfect classification for critical arrhythmias such as atrial flutter, enabling reliable automated cardiac screening.

94-98% rhythm accuracy
90-95% ectopic detection
Perfect critical arrhythmia detection

OCR Pipeline with Layout Detection

PyTorch ONNX CUDA

Deep learning-based OCR system for financial documents using ViT + CRNN architectures, with 98% text extraction accuracy and 42% faster processing through ONNX Runtime optimization and CUDA kernels.

98% text extraction accuracy
42% faster processing
Financial document optimization