Projects - Ramshankar Bhuvaneswaran

Parallelizing Text-to-Image Generation Using Diffusion

Python PyTorch Dask CLIP

→ Built a DDPM model from scratch and engineered a Dask-powered distributed preprocessing pipeline (multiprocessing backend) for text-to-image datasets, achieving 1.48x CPU throughput scaling over a single-threaded baseline.
→ Accelerated training via Automatic Mixed Precision (AMP) with Exponential Moving Average (EMA) weight averaging (β = 0.999), reducing training time by 11% (23.5 → 20.9 hrs/epoch) and loss variance by 29% through gradient histogram stabilization.
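The EMA weight averaging mentioned above can be illustrated with a minimal, framework-free sketch. It assumes weights are held in a plain dict of floats; the function name `ema_update` and the toy scalar weight are hypothetical and are not taken from the project's code. Only the decay β = 0.999 comes from the description.

```python
def ema_update(shadow, weights, beta=0.999):
    """Blend the current weights into the running average, in place.

    shadow  : dict of averaged weights (updated and returned)
    weights : dict of the model's current weights, same keys as shadow
    beta    : decay factor; 0.999 matches the value quoted in the text
    """
    for name, value in weights.items():
        shadow[name] = beta * shadow[name] + (1.0 - beta) * value
    return shadow


# Toy usage: one scalar "weight" that jumps from 0.0 to 1.0 and stays there.
shadow = {"w": 0.0}
for _ in range(1000):
    ema_update(shadow, {"w": 1.0})

# After n steps the average has closed a fraction 1 - beta**n of the gap.
ema_after_1000 = shadow["w"]
```

With β = 0.999 the average reacts slowly (roughly a 1000-step horizon), which is why EMA weights tend to be smoother than the raw training weights.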

GitHub Repository

1.48x CPU throughput scaling

11% training time reduction

29% loss variance reduction

Fintech-Data-Processing-ETL-Platform

Python Snowflake Airflow FastAPI

→ Built a master financial statement database for US public companies, implementing three distinct storage strategies (Raw Staging, JSON Transformation, Denormalized Fact Tables) with Snowflake to optimize data access patterns.
→ Engineered automated ETL pipelines with Apache Airflow for SEC data retrieval, scraping the SEC Markets Data page and staging the results in S3, with the Data Validation Tool (DVT) checking schema integrity.
→ Developed a client-facing analytics solution combining Streamlit for visualization and FastAPI backend services, enabling financial analysts to perform real-time fundamental analysis of public company data.

GitHub Repository

Master financial database

Automated ETL pipelines

Real-time analytics

Academic Projects

NVIDIA GB200 GPU Programming Challenge

CuteDSL CUDA PTX Triton NVFP4/FP8

→ Ranked 24th in GPUMODE's NVFP4 Gated Dual GEMM competition; optimized Blackwell B200 tensor core kernels for block-scaled matrix operations using CuteDSL (CUTLASS Python wrapper) and inline PTX assembly.
→ Achieved a 3-6x speedup over the baseline through a custom SwiGLU epilogue in PTX (halving SFU instruction count), TMA memory pipelining, and async scheduling optimizations; progressed from 60th to 4th across 3 challenges.
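For readers unfamiliar with the operation, here is a plain-Python reference of what a gated dual GEMM with a SwiGLU epilogue computes, assuming the usual definition silu(A·B_gate) * (A·B_up). It says nothing about the PTX, TMA, or NVFP4 block-scaling details of the competition kernels; the function names are hypothetical.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gated_dual_gemm(a, b_gate, b_up):
    """Reference semantics: silu(a @ b_gate) * (a @ b_up), elementwise."""
    gate = matmul(a, b_gate)
    up = matmul(a, b_up)
    return [
        [silu(g) * u for g, u in zip(gate_row, up_row)]
        for gate_row, up_row in zip(gate, up)
    ]


# Toy 1x2 input against two 2x1 weight matrices.
a = [[1.0, 2.0]]
b_gate = [[1.0], [0.0]]  # gate pre-activation: 1.0
b_up = [[3.0], [4.0]]    # up projection: 11.0
result = gated_dual_gemm(a, b_gate, b_up)
```

A fused kernel applies the activation and the elementwise product in the epilogue of the two GEMMs, so the intermediate `gate` and `up` matrices are never written out to memory.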

Ranked 24th in GPUMODE

3-6x speedup over baseline

Blackwell B200 optimization

Medical Knowledge Graph RAG System

Neo4j LangChain Graph RAG Python

Developed a comprehensive Graph RAG framework for medical data retrieval using automated triplet extraction and dynamic knowledge graph construction with Neo4j. The system implements semantic search optimization through LangChain and retrieves medical information with better contextual understanding than traditional retrieval methods.

GitHub Repository

Automated triplet extraction

Dynamic knowledge graph construction

Semantic search optimization

Llama3 - Pure C/CUDA Implementation

C/CUDA llm.c Performance Optimization GPU Kernels

Implementation of Llama 3.1 in pure C/CUDA, built on Karpathy's llm.c. Optimized several kernels, including RMSNorm (2.3x speedup) and SwiGLU with bfloat16, and integrated the optimizations into the multi-head attention (MHA) implementation. Achieved the performance gains through cooperative groups, coalesced memory access, and efficient load/store operations. Features a complete forward/backward pass implementation with attention mechanisms and RoPE embeddings.
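As a point of reference for the RMSNorm kernel mentioned above, the operation itself can be written in a few lines of plain Python. This is a sketch of the math only, not of the CUDA kernel; the `eps` default is an assumption.

```python
import math


def rmsnorm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then scale by a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# Toy usage: with unit weights and eps=0 the output has a mean square of 1.
row = [3.0, 4.0]
normed = rmsnorm(row, [1.0, 1.0], eps=0.0)
```

The reduction over `x` (the sum of squares) is the part a GPU kernel typically parallelizes across a row.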

2.3x RMSNorm speedup

Optimized GPU kernels

Complete forward/backward pass

Qwen600 CUDA to ROCm Port for AMD MI300X

C++ HIP CUDA ROCm

Ported six essential transformer kernels from NVIDIA CUDA to AMD ROCm/HIP, optimizing for the MI300X architecture (304 CUs, 192 GB HBM3, 5.3 TB/s peak bandwidth). Achieved 12.1 ms per generated token with 79-83% memory bandwidth utilization. Implemented BFloat16 optimizations with coalesced memory access patterns and fused operations, reducing memory traffic by 50%. Features full CUDA-to-HIP conversion with multi-GPU architecture support and an optimized softmax implementation.
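The description does not say how the softmax kernel was optimized, so the sketch below shows only the numerically stable reference computation (subtracting the row maximum before exponentiating) that a ported kernel's output could be checked against. The function name is hypothetical.

```python
import math


def stable_softmax(scores):
    """Softmax with max subtraction so large logits do not overflow exp()."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Logits this large would overflow a naive exp(s) / sum(exp(s)).
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```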

GitHub Repository Technical Blog

12.1ms token generation

50% memory traffic reduction

Multi-GPU architecture support

Stock Price Forecaster

Streamlit Hidden Markov Models Time Series Python

Streamlit application that forecasts stock prices from historical data using Hidden Markov Models. Supports 1-day and multi-day forecasts for major tech stocks (AAPL, TSLA, NVDA). Models a multivariate time series of opening price, closing price, and volume. Trained using maximum likelihood estimation, with the number of hidden states selected via the Akaike Information Criterion to prevent overfitting. Deployed at stock-hmm-analysis.streamlit.app.
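Selecting the number of hidden states by AIC can be sketched as follows. AIC is 2k − 2·ln(L), where k is the number of free parameters and L the maximized likelihood. The parameter count below assumes a Gaussian HMM with diagonal covariances, and the log-likelihood values are made-up placeholders, not results from this project.

```python
def gaussian_hmm_param_count(n_states, n_features):
    """Free parameters of a Gaussian HMM, assuming diagonal covariances."""
    start_probs = n_states - 1                # initial distribution sums to 1
    transitions = n_states * (n_states - 1)   # each transition row sums to 1
    means = n_states * n_features
    variances = n_states * n_features
    return start_probs + transitions + means + variances


def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def pick_n_states(log_likelihoods, n_features):
    """log_likelihoods maps a candidate state count to its fitted log-likelihood."""
    return min(
        log_likelihoods,
        key=lambda n: aic(log_likelihoods[n], gaussian_hmm_param_count(n, n_features)),
    )


# Hypothetical fits on 3 features (open, close, volume).
candidate_fits = {2: -1250.0, 3: -1190.0, 4: -1185.0}
best_n_states = pick_n_states(candidate_fits, n_features=3)
```

In this toy example the fourth state improves the log-likelihood only slightly, so the penalty on its extra parameters makes three states the lower-AIC choice.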

GitHub Repository

Multi-day forecasting

Hidden Markov Models

Live deployment

Work Experience Projects

Advanced Suicidal Ideation Classification System

AWS SageMaker ETL Pipelines SFT-DPO Training Model Sharding

Led development of a multi-conversation classification system processing diverse text inputs to detect mental health risks. Created comprehensive ETL pipelines for 123,000 training samples, implemented hybrid SFT-DPO training methodologies, and deployed on AWS SageMaker with model sharding across GPU instances. Achieved 71% macro-AUC improvement and 40% output consistency improvement through constitutional AI controls and real-time monitoring.
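For context on the macro-AUC metric quoted above, here is a small pure-Python sketch that assumes the common one-vs-rest definition: compute a binary AUC per class and take the unweighted mean. The labels and scores are toy values, not data from the system.

```python
def binary_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, y_score, n_classes):
    """Unweighted mean of one-vs-rest AUCs; y_score[i][c] is the score of class c for sample i."""
    per_class = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / n_classes


# Toy 3-class example with four samples.
y_true = [0, 1, 2, 1]
y_score = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.50, 0.25, 0.25],
]
toy_macro_auc = macro_auc(y_true, y_score, n_classes=3)
```

Because every class contributes equally to the mean, macro-AUC does not let a frequent class mask poor ranking on a rare one, which matters when the high-risk class is a small share of the data.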

71% macro-AUC improvement

40% output consistency improvement

123,000 training samples

Two-Stage Cardiac Arrhythmia Detection System

PyTorch CNN-Transformer Signal Processing Medical AI

Designed novel neural architecture processing ECG signals through specialized CNN-Transformer hybrids. First stage classifies overall rhythm patterns with 94-98% accuracy, while second stage provides adaptive ectopic beat detection with 90-95% accuracy. Features dynamic windowing algorithms adjusting to patient-specific heart rates and achieves perfect classification for critical arrhythmias like atrial flutter, enabling reliable automated cardiac screening.

94-98% rhythm accuracy

90-95% ectopic detection

Perfect critical arrhythmia detection

OCR Pipeline with Layout Detection

PyTorch ONNX CUDA

Deep learning-based OCR system for financial documents using ViT + CRNN architectures with 98% text extraction accuracy and 42% faster processing through ONNX runtime optimization and CUDA kernels.

98% text extraction accuracy

42% faster processing

Financial document optimization