Work Experience

My professional journey and key achievements in AI/ML engineering

Graduate Teaching and Research Assistant

Northeastern University | Boston, MA | Jan 2025 – Dec 2025
  • Implemented knowledge distillation pipeline compressing Qwen image-to-image model into lightweight GAN architectures, achieving 65% parameter reduction while maintaining visual quality; integrated distillation with quantization-aware training for efficient deployment.
  • Authored a research paper, BitSkip, investigating the compositional effects of quantization and early exit in LLMs.
  • Served as Teaching Assistant, mentoring students on distributed computing workflows including SLURM job scheduling, parallel processing architectures, and debugging assignments across CPU and GPU environments.
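The distillation pipeline above pairs a large teacher with a compact student trained to match the teacher's softened outputs. A minimal sketch of the temperature-scaled distillation loss, in pure Python with no ML framework (illustrative only; the actual pipeline would compute this over framework tensors):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's relative preferences between classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay consistent across T.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

In practice this term is blended with the ordinary cross-entropy on hard labels; the temperature and blend weight are tuning knobs, not fixed values.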

AI Engineer

BulkBeings | Chennai, India | May 2024 – Aug 2024
  • Optimized deep learning frameworks for GPU performance by developing custom CUDA kernels and integrating them into PyTorch training pipelines for Mixtral and Llama models, achieving 42% training acceleration on an 8× A100 FSDP configuration through memory coalescing, warp-level primitives, and kernel fusion.
  • Designed and scaled ML cluster orchestration across cloud (GCP) and on-prem environments using Kubernetes; deployed Ray clusters for distributed training and online inference, managing GPU scheduling and resource allocation for multi-node training jobs.
  • Engineered high-performance GPU kernels for attention mechanisms and feedforward layers using CUDA (CUTLASS) and OpenMP; debugged gradient explosion in multi-GPU distributed training by implementing mixed-precision strategies and gradient clipping, reducing OOM errors by 85% through systematic memory profiling.
  • Built a distributed data preprocessing pipeline with PySpark, processing 200GB+ datasets and implementing sparse feature selection that reduced data transfer overhead by 35% while maintaining training convergence.
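The gradient-explosion fix above relies on clipping by global norm: rescale every gradient uniformly when their combined L2 norm exceeds a cap. A simplified sketch over plain Python lists (a real PyTorch pipeline would call `torch.nn.utils.clip_grad_norm_` instead):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Global L2 norm across all parameters' gradients (grads is a
    # list of per-parameter gradient lists).
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    # Rescale uniformly so the clipped global norm equals max_norm,
    # preserving the gradient's direction while bounding its size.
    scale = max_norm / global_norm
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

grads = [[3.0, 4.0], [0.0]]  # global norm = sqrt(9 + 16) = 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # → 5.0; clipped gradients now have global norm 1.0
```

Clipping on the global norm (rather than per-tensor) keeps the relative balance between layers intact, which matters when a single layer's gradients spike.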

ML Engineer (Research)

BulkBeings | Chennai, India | May 2023 – Dec 2023
  • Prototyped and productionized an OCR model (ViT+CRNN) achieving 98% precision while making the model 42% faster via ONNX/CUDA optimization; deployed on an AWS EC2 instance behind API Gateway, achieving sub-100 ms p99 latency.
  • Implemented observability stack with Prometheus, Grafana, and OpenTelemetry for ML inference services; built custom metrics dashboards tracking GPU utilization, model latency (P50/P95/P99), and throughput; configured alerting for SLA violations.
  • Developed a two-stage Conv1D-Transformer architecture for beat-level ECG classification, achieving an 89% F1-score across beat classes including ectopic beats; applied quantization and kernel fusion to deploy on an L4 GPU while meeting SLA constraints.
  • Built automated retraining pipelines in Python that extracted data from production databases using SQL and Airflow, reducing the model refresh cycle from 2 weeks to 3 days and improving prediction accuracy by 12%.
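The P50/P95/P99 metrics tracked above are latency percentiles. A minimal nearest-rank percentile sketch over raw samples (Prometheus actually estimates these from histogram buckets via `histogram_quantile`; this exact version is only illustrative):

```python
import math

def percentile(samples, pct):
    # Nearest-rank percentile: sort the samples, then pick the value
    # at rank ceil(pct/100 * n). Exact, but O(n log n) per call, so
    # monitoring stacks approximate it with histograms instead.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 90, 14, 13, 16, 120, 12, 15]
print(percentile(latencies_ms, 50))  # → 14, the typical request
print(percentile(latencies_ms, 99))  # → 120, the tail an SLA tracks
```

The gap between P50 and P99 is why dashboards track both: a healthy median can hide a tail that violates the SLA.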

ML Engineer Intern

Velozity Global Solutions Pvt | Chennai, India | Jan 2022 – Aug 2022
  • Architected end-to-end ML pipelines in AWS, implementing predictive segments (high-value, at-risk) using XGBoost on behavioral patterns and validating clusters against campaign response data, resulting in a 45% increase in campaign conversion rates.
  • Identified critical data leakage in feature pipeline causing 20% overestimation of model performance; redesigned temporal feature extraction logic ensuring proper train-test split, leading to more reliable production deployments.
  • Led a cross-functional team in developing a retail mix optimization system for a supermarket chain in India using Bayesian hierarchical models (PyMC3), increasing recurring-customer LTV prediction accuracy by 34%; shared the optimization techniques with the ML community.
  • Built monitoring system detecting performance degradation in production models; implemented automated retraining pipeline triggered by drift detection, maintaining model accuracy above 90% threshold and preventing $50K potential revenue loss.
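The drift-triggered retraining above needs a numeric drift signal. One common choice is the Population Stability Index (PSI) between training-time and live feature distributions; a minimal sketch (the 0.2 threshold is a conventional rule of thumb, not a value from the original system):

```python
import math

def population_stability_index(expected, actual):
    # PSI between two binned distributions given as fractions summing
    # to 1. Larger values mean the live distribution has shifted
    # further from the training-time baseline.
    eps = 1e-6  # guard against log(0) on empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

def should_retrain(expected, actual, threshold=0.2):
    # Hypothetical trigger: PSI > 0.2 is a common "significant drift"
    # rule of thumb that would kick off the retraining pipeline.
    return population_stability_index(expected, actual) > threshold

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time feature histogram
print(should_retrain(baseline, [0.25, 0.25, 0.25, 0.25]))  # → False
print(should_retrain(baseline, [0.70, 0.10, 0.10, 0.10]))  # → True
```

Gating retraining on a drift score like this avoids both extremes: retraining on every deploy, or only after accuracy has already degraded in production.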

Technical Achievements

GPU Kernel Optimization

Developed and optimized high-performance GPU kernels for inference/training workloads, demonstrating deep knowledge of memory hierarchy and compute/memory-bound optimization strategies.
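Whether a kernel is compute- or memory-bound falls out of its arithmetic intensity (FLOPs per byte of memory traffic) versus the hardware's ridge point, per the roofline model. A simplified sketch (the peak-FLOPs and bandwidth numbers below are illustrative, not any specific GPU's spec):

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    # Arithmetic intensity: useful work per byte of memory traffic.
    intensity = flops / bytes_moved
    # Ridge point: the intensity above which the kernel is limited
    # by compute throughput rather than memory bandwidth.
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative device: 300 TFLOP/s peak, 2 TB/s HBM → ridge = 150.
PEAK_FLOPS, PEAK_BW = 300e12, 2e12
# Elementwise add: 1 FLOP per 12 bytes (read two fp32, write one).
print(roofline_bound(1, 12, PEAK_FLOPS, PEAK_BW))     # → memory-bound
# Large GEMM tile: thousands of FLOPs per byte once data is reused.
print(roofline_bound(4000, 12, PEAK_FLOPS, PEAK_BW))  # → compute-bound
```

This split drives the optimization strategy: memory-bound kernels want coalescing and fusion to cut traffic, while compute-bound kernels want better instruction mix and occupancy.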

Compiler Integration

Worked with GPU programming stacks (CUDA, HIP) and graph compilers to optimize deep learning frameworks for a range of hardware architectures, including AMD GPUs, ensuring correct stream integration.

Performance Crisis Resolution

Identified 70% redundant model calls through profiling; architected Redis caching system reducing monthly GPU costs by $12K and inference latency by 42%.
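The caching layer above works by keying on a hash of the request payload and skipping the model on a hit. An in-process sketch with a dict standing in for Redis (the class and counters are illustrative, not the production code; a real deployment would add TTLs and eviction):

```python
import hashlib
import json

class InferenceCache:
    # Stand-in for the Redis layer: key on a stable hash of the
    # request payload, serve repeats from the cache, and count
    # hits/misses to quantify redundant model calls.
    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def predict(self, payload):
        # sort_keys makes the hash stable under dict key ordering.
        key = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if key in self.store:
            self.hits += 1  # cache hit: skip the GPU entirely
            return self.store[key]
        self.misses += 1
        result = self.model_fn(payload)  # cache miss: run the model
        self.store[key] = result
        return result

cache = InferenceCache(lambda p: {"score": len(p["text"])})
for _ in range(10):
    cache.predict({"text": "same request"})
print(cache.hits, cache.misses)  # → 9 1
```

With 70% of calls redundant, a hit rate like this translates directly into fewer GPU invocations, which is where both the cost and latency savings come from.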