Amol Umbarkar is an AI infrastructure engineer specializing in optimizing large language model (LLM) training and inference at scale. He currently works on the TIR AI Platform at E2E Networks, where he has been instrumental in architecting and developing the platform since its early days, with a focus on scaling model training and improving inference performance across large GPU clusters. With deep expertise in NeMo, PyTorch, vLLM, and Lightning, and in distributed systems built on Slurm and NCCL, Amol has built and optimized systems for high-performance AI workloads.
Over the years, he has often taken on the role of a “day-0 engineer,” writing the first lines of code for several impactful products, including the TIR AI Platform, the enterprise version of SigNoz, and hyperML, an open-source framework for running AI on Kubernetes. Previously, he contributed to building enterprise features at SigNoz and led engineering initiatives across cloud infrastructure and product development. Amol actively shares his thoughts on engineering and AI systems through his writing platform, mindhash.xyz, where he discusses real-world lessons from building and scaling modern AI infrastructure.
