Work Experience

My professional journey and key achievements in AI/ML engineering

Graduate Teaching and Research Assistant

Northeastern University | Boston, MA | Jan 2025 – Dec 2025
  • Implemented knowledge distillation pipeline compressing Qwen image-to-image model into lightweight GAN architectures, achieving 65% parameter reduction while maintaining visual quality; integrated distillation with quantization-aware training for efficient deployment.
  • Authored a research paper, BitSkip, investigating the compositional effects of quantization and early exit in LLMs.
  • Served as Teaching Assistant, mentoring students on distributed computing workflows including SLURM job scheduling, parallel processing architectures, and debugging assignments across CPU and GPU environments.
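The distillation pipeline above pairs a large teacher with a compact student trained to match the teacher's softened outputs. A minimal sketch of the temperature-scaled distillation loss, in pure Python with no ML framework (illustrative only; the actual pipeline would compute this over framework tensors):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's relative preferences between classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay consistent across T.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

In practice this term is blended with the ordinary cross-entropy on hard labels; the temperature and blend weight are tuning knobs, not fixed values.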

AI Engineer

BulkBeings | Chennai, India | May 2024 – Aug 2024
  • Optimized deep learning frameworks for GPU performance by developing custom CUDA kernels and integrating them into PyTorch training pipelines for Mixtral and Llama models, achieving 42% training acceleration on an 8× A100 FSDP configuration through memory coalescing, warp-level primitives, and kernel fusion.
  • Designed and scaled ML cluster orchestration across cloud (GCP) and on-prem environments using Kubernetes; deployed Ray clusters for distributed training and online inference, managing GPU scheduling and resource allocation for multi-node training jobs.
  • Engineered high-performance GPU kernels for attention mechanisms and feedforward layers using CUDA (CUTLASS) and OpenMP; debugged gradient explosion in multi-GPU distributed training by implementing mixed-precision strategies and gradient clipping, reducing OOM errors by 85% through systematic memory profiling.
  • Built a distributed data preprocessing pipeline with PySpark, processing 200GB+ datasets and implementing sparse feature selection that reduced data transfer overhead by 35% while maintaining training convergence.
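The gradient-explosion fix above relies on clipping by global norm: rescale every gradient uniformly when their combined L2 norm exceeds a cap. A simplified sketch over plain Python lists (a real PyTorch pipeline would call `torch.nn.utils.clip_grad_norm_` instead):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Global L2 norm across all parameters' gradients (grads is a
    # list of per-parameter gradient lists).
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    # Rescale uniformly so the clipped global norm equals max_norm,
    # preserving the gradient's direction while bounding its size.
    scale = max_norm / global_norm
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

grads = [[3.0, 4.0], [0.0]]  # global norm = sqrt(9 + 16) = 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # → 5.0; clipped gradients now have global norm 1.0
```

Clipping on the global norm (rather than per-tensor) keeps the relative balance between layers intact, which matters when a single layer's gradients spike.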

ML Engineer (Research)

BulkBeings | Chennai, India | May 2023 – Dec 2023
  • Prototyped and productionized an OCR model (ViT+CRNN) achieving 98% precision while making the model 42% faster via ONNX/CUDA optimization; deployed on an AWS EC2 instance behind API Gateway, achieving sub-100 ms p99 latency.
  • Implemented observability stack with Prometheus, Grafana, and OpenTelemetry for ML inference services; built custom metrics dashboards tracking GPU utilization, model latency (P50/P95/P99), and throughput; configured alerting for SLA violations.
  • Developed a two-stage Conv1D-Transformer architecture for beat-level ECG classification, achieving an 89% F1-score across beat classes including ectopic beats; applied quantization and kernel fusion to deploy on an L4 GPU while meeting SLA constraints.
  • Built automated retraining pipelines in Python that extracted data from production databases using SQL and Airflow, reducing the model refresh cycle from 2 weeks to 3 days and improving prediction accuracy by 12%.
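The P50/P95/P99 metrics tracked above are latency percentiles. A minimal nearest-rank percentile sketch over raw samples (Prometheus actually estimates these from histogram buckets via `histogram_quantile`; this exact version is only illustrative):

```python
import math

def percentile(samples, pct):
    # Nearest-rank percentile: sort the samples, then pick the value
    # at rank ceil(pct/100 * n). Exact, but O(n log n) per call, so
    # monitoring stacks approximate it with histograms instead.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 90, 14, 13, 16, 120, 12, 15]
print(percentile(latencies_ms, 50))  # → 14, the typical request
print(percentile(latencies_ms, 99))  # → 120, the tail an SLA tracks
```

The gap between P50 and P99 is why dashboards track both: a healthy median can hide a tail that violates the SLA.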

ML Engineer Intern

Velozity Global Solutions Pvt | Chennai, India | Jan 2022 – Aug 2022
  • Architected end-to-end ML pipelines in AWS, implementing predictive segments (high-value, at-risk) using XGBoost on behavioral patterns and validating clusters against campaign response data, resulting in a 45% increase in campaign conversion rates.
  • Identified critical data leakage in feature pipeline causing 20% overestimation of model performance; redesigned temporal feature extraction logic ensuring proper train-test split, leading to more reliable production deployments.
  • Led a cross-functional team in developing a retail mix optimization system for a supermarket chain in India using Bayesian hierarchical models (PyMC3), increasing recurring-customer LTV prediction accuracy by 34%; shared the optimization techniques with the ML community.
  • Built monitoring system detecting performance degradation in production models; implemented automated retraining pipeline triggered by drift detection, maintaining model accuracy above 90% threshold and preventing $50K potential revenue loss.
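The drift-triggered retraining above needs a numeric drift signal. One common choice is the Population Stability Index (PSI) between training-time and live feature distributions; a minimal sketch (the 0.2 threshold is a conventional rule of thumb, not a value from the original system):

```python
import math

def population_stability_index(expected, actual):
    # PSI between two binned distributions given as fractions summing
    # to 1. Larger values mean the live distribution has shifted
    # further from the training-time baseline.
    eps = 1e-6  # guard against log(0) on empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

def should_retrain(expected, actual, threshold=0.2):
    # Hypothetical trigger: PSI > 0.2 is a common "significant drift"
    # rule of thumb that would kick off the retraining pipeline.
    return population_stability_index(expected, actual) > threshold

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time feature histogram
print(should_retrain(baseline, [0.25, 0.25, 0.25, 0.25]))  # → False
print(should_retrain(baseline, [0.70, 0.10, 0.10, 0.10]))  # → True
```

Gating retraining on a drift score like this avoids both extremes: retraining on every deploy, or only after accuracy has already degraded in production.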

Technical Achievements

GPU Kernel Optimization

Developed and optimized high-performance GPU kernels for inference/training workloads, demonstrating deep knowledge of memory hierarchy and compute/memory-bound optimization strategies.
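Whether a kernel is compute- or memory-bound falls out of its arithmetic intensity (FLOPs per byte of memory traffic) versus the hardware's ridge point, per the roofline model. A simplified sketch (the peak-FLOPs and bandwidth numbers below are illustrative, not any specific GPU's spec):

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    # Arithmetic intensity: useful work per byte of memory traffic.
    intensity = flops / bytes_moved
    # Ridge point: the intensity above which the kernel is limited
    # by compute throughput rather than memory bandwidth.
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative device: 300 TFLOP/s peak, 2 TB/s HBM → ridge = 150.
PEAK_FLOPS, PEAK_BW = 300e12, 2e12
# Elementwise add: 1 FLOP per 12 bytes (read two fp32, write one).
print(roofline_bound(1, 12, PEAK_FLOPS, PEAK_BW))     # → memory-bound
# Large GEMM tile: thousands of FLOPs per byte once data is reused.
print(roofline_bound(4000, 12, PEAK_FLOPS, PEAK_BW))  # → compute-bound
```

This split drives the optimization strategy: memory-bound kernels want coalescing and fusion to cut traffic, while compute-bound kernels want better instruction mix and occupancy.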

Compiler Integration

Worked with GPU programming stacks (CUDA, HIP) and graph compilers to optimize deep learning frameworks for a range of hardware architectures, including AMD GPUs, ensuring correct stream integration.

Performance Crisis Resolution

Identified 70% redundant model calls through profiling; architected Redis caching system reducing monthly GPU costs by $12K and inference latency by 42%.
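The caching layer above works by keying on a hash of the request payload and skipping the model on a hit. An in-process sketch with a dict standing in for Redis (the class and counters are illustrative, not the production code; a real deployment would add TTLs and eviction):

```python
import hashlib
import json

class InferenceCache:
    # Stand-in for the Redis layer: key on a stable hash of the
    # request payload, serve repeats from the cache, and count
    # hits/misses to quantify redundant model calls.
    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def predict(self, payload):
        # sort_keys makes the hash stable under dict key ordering.
        key = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if key in self.store:
            self.hits += 1  # cache hit: skip the GPU entirely
            return self.store[key]
        self.misses += 1
        result = self.model_fn(payload)  # cache miss: run the model
        self.store[key] = result
        return result

cache = InferenceCache(lambda p: {"score": len(p["text"])})
for _ in range(10):
    cache.predict({"text": "same request"})
print(cache.hits, cache.misses)  # → 9 1
```

With 70% of calls redundant, a hit rate like this translates directly into fewer GPU invocations, which is where both the cost and latency savings come from.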