I build infrastructure that doesn't break — and scale systems that shouldn't slow down.
Lead DevOps/SRE Engineer with 6+ years building and scaling high-availability platforms across multi-cloud environments. I architect Kubernetes platforms at scale (40+ nodes, 26+ clusters), design GPU-capable cloud infrastructure, and deliver production-grade observability.
Currently leading an 8-member DevOps/SRE team at Eptura (Fortune 500 SaaS) across 4 time zones, managing a $2M+ annual multi-cloud budget. Extending deep platform engineering expertise into AI/ML infrastructure — model serving, GPU scheduling, and inference optimization.
I also build AI products independently through KraftAI, applying serverless inference patterns and modern AI APIs to solve real-world problems.
AWS, Azure, GCP — production expertise across all three
GPU scheduling, vLLM, Triton, inference at scale
SOC2, ISO 27001, GDPR delivered ahead of schedule
8-person team across 4 time zones, working asynchronously
Enterprise-grade tools I use to build, scale, and secure infrastructure
Enterprise-scale implementations with measurable business impact
Architected and operate multi-cloud Kubernetes platforms (EKS & AKS) across 26+ clusters and 40+ nodes for Fortune 500 SaaS. Migrated 300+ production workloads with zero downtime, improving resource utilization by 45%.
Architected Prometheus/Grafana/Loki monitoring for 40-node production environments with AI-powered log analysis and anomaly detection. Implemented an SLO-driven reliability culture, reducing MTTR by 65%.
Kubernetes-native platform for Ethereum, Bitcoin, and Dogecoin nodes with auto-scaling, chain-sync dashboards, RPC load balancing, and automated snapshot recovery. Handles 50M+ API calls/month.
Solo AI products venture (kraftai.in) built with Gemini API, Next.js, and serverless inference patterns. Applying platform engineering expertise to build AI-powered tools and services for real-world use cases.
AI-powered Kubernetes incident response bot that detects anomalies, correlates alerts, and suggests runbooks. Combines SRE best practices with ML-driven pattern recognition for faster incident resolution.
Delivered $280K+ in total savings through cloud spend governance, Karpenter-based node autoscaling, fixes for IP address exhaustion on EKS, and resource right-sizing across multi-cloud environments.
Extending platform engineering expertise into AI infrastructure — solving the operational challenges of serving and scaling ML workloads in production
Working with NVIDIA GPU Operator, device plugin, MIG partitioning, and time-slicing for efficient GPU resource allocation across K8s clusters.
Hands-on with vLLM, Triton Inference Server, and Ray Serve — deploying inference endpoints with autoscaling strategies for production traffic.
Building a solo AI products venture using Gemini API, Next.js, and serverless inference patterns to solve real-world problems with AI.
Applying Karpenter autoscaling, Prometheus observability, and FinOps expertise to GPU-specific infrastructure challenges and inference latency SLOs.
Building teams and delivering impact at scale
Lead and mentor an 8-member DevOps/SRE team across 4 time zones. Manage a $2M+ annual multi-cloud budget with FinOps discipline. Architect multi-cloud Kubernetes platforms (EKS & AKS) across 26+ clusters, sustaining 99.99% production uptime. Migrated 300+ workloads with zero downtime. Standardized GitOps with Argo CD, increasing deploy frequency 3x. Delivered SOC2 compliance and GDPR readiness.
Built DevOps and SRE foundations for 15+ production environments with 99.9% uptime. Designed multi-cloud deployments across AWS, Azure, and GCP. Modernized infrastructure with Terraform and Kubernetes, generating $160K in annual savings. Optimized CI/CD pipelines, reducing build time by 62%. Built blockchain infrastructure (Ethereum, Dogecoin, and IPFS nodes). Automated 100+ hours/month of ops tasks. Mentored 5 engineers.
GPA: 8.2/10. Ghaziabad, India. Strong foundation in computer science, distributed systems, and software engineering principles.
Endorsements from colleagues and leaders
Hritik's expertise in debugging complex Kubernetes issues was invaluable. He quickly identified the root cause in our pod networking configuration that had stumped the team for days.
Hritik's debugging and troubleshooting skills are world-class. He consistently resolved complex infrastructure issues that other engineers couldn't solve.
He managed our entire GitLab infrastructure for 300+ employees. His expertise achieved 95% deployment success rate and transformed our release process.
Available for leadership roles, consulting, and technical collaborations across IST, CET/CEST, and global time zones. Open to international travel.