My professional journey and key achievements in AI/ML engineering
Developed and optimized high-performance GPU kernels for inference and training workloads, applying deep knowledge of the memory hierarchy and of optimization strategies for compute-bound and memory-bound kernels.
Worked with GPU programming platforms (CUDA, HIP) and graph compilers to optimize deep learning frameworks for multiple hardware architectures, including AMD GPUs, with correct stream integration.
Identified through profiling that 70% of model calls were redundant; architected a Redis caching system that reduced monthly GPU costs by $12K and inference latency by 42%.