Showcasing my technical projects and achievements
Developed a Graph RAG framework for medical data retrieval that uses automated triplet extraction and dynamic knowledge graph construction with Neo4j. The system implements semantic search through LangChain and retrieves medical information with more surrounding context than traditional retrieval methods.
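A minimal sketch of the triplet-to-graph idea in plain Python, not the project's Neo4j or LangChain code. The names `build_graph` and `neighbors` are hypothetical, and the two example triplets stand in for whatever the extraction step would produce.

```python
from collections import defaultdict


def build_graph(triplets):
    """Index (subject, relation, object) triplets as an adjacency map."""
    graph = defaultdict(list)
    for subj, rel, obj in triplets:
        graph[subj].append((rel, obj))
        # Store the reverse edge too so lookups work from either entity.
        graph[obj].append((f"inverse:{rel}", subj))
    return graph


def neighbors(graph, entity):
    """Return the one-hop facts around an entity, sorted for stable output."""
    # .get avoids creating an empty entry for unknown entities.
    return sorted(graph.get(entity, []))


# Hypothetical triplets, standing in for the output of an extraction step.
triplets = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "may_cause", "nausea"),
]
graph = build_graph(triplets)
context = neighbors(graph, "metformin")
```

In a retrieval pipeline, the one-hop facts in `context` would be appended to the passages returned by semantic search, which is where the extra context comes from.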
Implementation of Llama 3.1 in pure C/CUDA, built on Karpathy's llm.c. Optimized several kernels, including RMSNorm (2.3x speedup) and SwiGLU with bfloat16, and integrated the optimizations into the MHA implementation. The performance gains come from cooperative groups, coalesced memory access, and efficient load/store operations. Features a complete forward/backward pass implementation with attention mechanisms and RoPE embeddings.
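For reference, the math behind two of these kernels can be written as a short CPU sketch in plain Python. This is the kind of baseline a CUDA kernel's output could be checked against, not the project's code; the function names and the default `eps` value are assumptions.

```python
import math


def rmsnorm(x, weight, eps=1e-5):
    """Reference RMSNorm: scale x by the reciprocal of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Each element is normalized, then multiplied by its learned gain.
    return [v * inv_rms * w for v, w in zip(x, weight)]


def swiglu(gate, up):
    """Reference SwiGLU activation: silu(gate) * up, elementwise."""
    # silu(g) = g * sigmoid(g) = g / (1 + exp(-g))
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

A GPU version computes the same sum of squares with a parallel reduction across threads, which is where cooperative groups and coalesced loads come in.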
Ported six essential transformer kernels from NVIDIA CUDA to AMD ROCm/HIP, optimizing for the MI300X architecture (304 CUs, 192 GB HBM3, 5.2 TB/s bandwidth). Achieved 12.1 ms per generated token with 79-83% memory bandwidth utilization. Implemented BFloat16 optimizations with coalesced memory access patterns and fused operations, reducing memory traffic by 50%. Features full CUDA-to-HIP conversion with multi-GPU architecture support and an optimized softmax implementation.
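The relationship between these figures can be sketched as simple arithmetic. The snippet assumes that utilization is measured against the 5.2 TB/s peak over the whole 12.1 ms token interval and that 1 TB is 10^12 bytes; both are assumptions about how the numbers were taken, and the function names are hypothetical.

```python
def bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth achieved over an interval."""
    return bytes_moved / seconds / peak_bytes_per_s


def implied_traffic(utilization, seconds, peak_bytes_per_s):
    """Bytes that must have moved to reach a given utilization."""
    return utilization * seconds * peak_bytes_per_s


PEAK = 5.2e12        # bytes per second, from the 5.2 TB/s figure above
TOKEN_TIME = 12.1e-3  # seconds, from the 12.1 ms figure above

# Memory traffic per token implied by the 79% and 83% utilization bounds.
low = implied_traffic(0.79, TOKEN_TIME, PEAK)
high = implied_traffic(0.83, TOKEN_TIME, PEAK)
```

Under those assumptions, `low` and `high` bracket the bytes read and written per generated token, which is the quantity that BFloat16 storage and kernel fusion reduce.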
Streamlit application that forecasts stock prices with Hidden Markov Models trained on historical data. Supports 1-day and multi-day forecasts for major tech stocks (AAPL, TSLA, NVDA). Models a multivariate time series of opening price, closing price, and volume. Trained using maximum likelihood estimation, with the number of hidden states selected via the Akaike Information Criterion to prevent overfitting. Deployed at stock-hmm-analysis.streamlit.app.
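The AIC selection step can be sketched as follows. This is an illustration, not the app's code: it assumes a Gaussian HMM with one full covariance matrix per state, and the function names are hypothetical. The log-likelihoods would come from fitting one model per candidate state count.

```python
def gaussian_hmm_param_count(n_states, n_features):
    """Free parameters of a Gaussian HMM with full covariance matrices."""
    start = n_states - 1                    # initial distribution sums to 1
    transitions = n_states * (n_states - 1)  # each row sums to 1
    means = n_states * n_features
    covariances = n_states * n_features * (n_features + 1) // 2  # symmetric
    return start + transitions + means + covariances


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_n_states(log_likelihoods, n_features):
    """Pick the state count with the lowest AIC.

    log_likelihoods maps a candidate state count to the maximized
    log-likelihood of the model fitted with that many states.
    """
    scores = {
        n: aic(ll, gaussian_hmm_param_count(n, n_features))
        for n, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores
```

With three observed features (open, close, volume), each extra state adds a mean vector, a covariance matrix, and more transition probabilities, so the `2 * n_params` penalty grows quickly and a larger model is chosen only when it improves the likelihood enough to pay for it.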
Led development of a multi-conversation classification system processing diverse text inputs to detect mental health risks. Created comprehensive ETL pipelines for 123,000 training samples, implemented hybrid SFT-DPO training methodologies, and deployed on AWS SageMaker with model sharding across GPU instances. Achieved 71% macro-AUC improvement and 40% output consistency improvement through constitutional AI controls and real-time monitoring.
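For readers unfamiliar with the DPO half of the training recipe, the per-pair objective can be sketched in plain Python. This is the standard DPO loss for one preference pair, not the project's training code; the function name and the default `beta` are assumptions, and the inputs are summed token log-probabilities of each response under the trained policy and under the frozen SFT reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Computes -log(sigmoid(beta * margin)), where the margin is how much
    more the policy prefers the chosen response than the reference does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), split by sign to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.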
Designed a novel two-stage neural architecture that processes ECG signals through specialized CNN-Transformer hybrids. The first stage classifies overall rhythm patterns with 94-98% accuracy, while the second stage provides adaptive ectopic beat detection with 90-95% accuracy. Features dynamic windowing algorithms that adjust to patient-specific heart rates, and achieves perfect classification for critical arrhythmias like atrial flutter, enabling reliable automated cardiac screening.
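One way heart-rate-adaptive windowing can work is sketched below. This is an illustrative assumption, not the project's algorithm: each beat window is sized as a fraction of the local RR interval around an already-detected R-peak. The function name and the 35%/65% split are hypothetical.

```python
def adaptive_windows(r_peaks, n_samples, pre_frac=0.35, post_frac=0.65):
    """Return one (start, end) sample window per R-peak.

    r_peaks holds ascending R-peak positions in samples. Each window spans
    a fraction of the local RR interval before and after the peak, so fast
    heart rates get short windows and slow rates get long ones.
    """
    if len(r_peaks) < 2:
        raise ValueError("need at least two R-peaks to estimate an RR interval")
    windows = []
    for i, peak in enumerate(r_peaks):
        # Average the RR intervals on either side of this beat, where present.
        adjacent = []
        if i > 0:
            adjacent.append(peak - r_peaks[i - 1])
        if i < len(r_peaks) - 1:
            adjacent.append(r_peaks[i + 1] - peak)
        local_rr = sum(adjacent) / len(adjacent)
        # Clip to the signal bounds so slices never run off either end.
        start = max(0, peak - round(pre_frac * local_rr))
        end = min(n_samples, peak + round(post_frac * local_rr))
        windows.append((start, end))
    return windows
```

Because the window length tracks the local RR interval, the same code yields roughly one full beat per window at any heart rate, instead of truncating slow beats or merging fast ones as a fixed-length window would.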
Deep learning-based OCR system for financial documents using ViT + CRNN architectures, with 98% text extraction accuracy and 42% faster processing through ONNX Runtime optimization and CUDA kernels.