Hritik Chaudhary | Lead DevOps / SRE Engineer • AI & ML Infrastructure

// About

Infrastructure at Scale.
Reliability by Design.

Lead DevOps/SRE Engineer with 6+ years building and scaling high-availability platforms across multi-cloud environments. I architect Kubernetes platforms at scale (40+ nodes, 26+ clusters), design GPU-capable cloud infrastructure, and deliver production-grade observability.

I currently lead an 8-member DevOps/SRE team at Eptura (Fortune 500 SaaS) across 4 time zones and manage a $2M+ annual multi-cloud budget. I am extending that platform engineering expertise into AI/ML infrastructure: model serving, GPU scheduling, and inference optimization.

I also build AI products independently through KraftAI, applying serverless inference patterns and modern AI APIs to solve real-world problems.

Multi-Cloud

AWS, Azure, GCP — production expertise across all three

AI Infrastructure

GPU scheduling, vLLM, Triton, inference at scale

Compliance

SOC2, ISO 27001, GDPR delivered ahead of schedule

Leadership

8-person team across 4 time zones, global async

// Tech Stack

Skills & Expertise

Enterprise-grade tools I use to build, scale, and secure infrastructure

Cloud & Compute

AWS (EKS, EC2, Fargate), Azure (AKS), GCP (GKE), GPU Instances, DigitalOcean

Containers & Orchestration

Kubernetes, Docker, Helm, Istio, Karpenter, Podman

AI/ML Infrastructure

vLLM, Triton Inference Server, Ray Serve, GPU Scheduling, NVIDIA Device Plugin

CI/CD & GitOps

Argo CD, GitHub Actions, Jenkins, GitLab CI, Azure DevOps, Flux

Infrastructure as Code

Terraform, Ansible, CloudFormation, Pulumi, CDK

Observability & APM

Prometheus, Grafana, Loki, ELK, Jaeger, Datadog

Security & DevSecOps

HashiCorp Vault, SAST/DAST, Snyk, Aqua, CIS Hardening, TLS/mTLS

Blockchain Infra

Ethereum (Geth), Bitcoin Core, IPFS, Dogecoin, Solidity, Hardhat

Programming & SRE

Python, Go, Bash, Node.js, TypeScript, SLOs/SLIs

// Featured Work

Key Projects

Enterprise-scale implementations with measurable business impact

Eptura DevOps

Multi-Cloud Kubernetes Platform

Architected and operate multi-cloud Kubernetes platforms (EKS & AKS) spanning 26+ clusters and 40+ nodes for a Fortune 500 SaaS company. Migrated 300+ production workloads with zero downtime and improved resource utilization by 45%.

99.99% Uptime

300+ Workloads

45% Better Utilization

EKS, AKS, Karpenter, Argo CD, Terraform

Eptura Observability

Production Observability Stack

Architected Prometheus/Grafana/Loki monitoring for 40-node production environments, with AI-powered log analysis and anomaly detection. Established an SLO-driven reliability culture, reducing MTTR by 65%.

65% MTTR Reduction

40+ Nodes Monitored

3x Deploy Frequency

Prometheus, Grafana, Loki, SLOs
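
For context on what an SLO-driven approach implies in practice, an availability target can be converted into a downtime budget. The Python sketch below is illustrative only: the 30-day window and the function names are assumptions, not details of the stack described here.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.9999; window_days is an assumed
    rolling window, not a value taken from this page.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = allowed_downtime_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, compared with about 43 minutes at 99.9%.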

RapidInnovation Blockchain

Blockchain Node Platform

Kubernetes-native platform for Ethereum, Bitcoin, and Dogecoin nodes with auto-scaling, chain-sync dashboards, RPC load balancing, and automated snapshot recovery. Handles 50M+ API calls/month.

50M+ API Calls/Mo

99.9% Uptime

Auto Scaling

Geth, Bitcoin Core, Kubernetes, IPFS

KraftAI AI Product

KraftAI — AI Products Venture

Solo AI products venture (kraftai.in) built with Gemini API, Next.js, and serverless inference patterns. Applying platform engineering expertise to build AI-powered tools and services for real-world use cases.

Solo Founder

AI First

Live Production

Next.js, Gemini API, Serverless, AI/ML

Open Source AI + K8s

K8s AI Incident Bot

AI-powered Kubernetes incident response bot that detects anomalies, correlates alerts, and suggests runbooks. Combines SRE best practices with ML-driven pattern recognition for faster incident resolution.

AI Driven

K8s Native

Auto Runbooks

Python, Kubernetes, AI/ML, Incident Response
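
To make "correlates alerts" concrete, here is one simple way such grouping could work: alerts from the same namespace that arrive close together are treated as a single incident. This is a hypothetical sketch, not the bot's actual logic. The `Alert` fields, the `correlate` function, and the 5-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    namespace: str
    timestamp: float  # seconds since epoch


def correlate(alerts: list[Alert], window_seconds: float = 300.0) -> list[list[Alert]]:
    """Group alerts per namespace when consecutive alerts are within window_seconds."""
    groups: list[list[Alert]] = []
    open_group: dict[str, list[Alert]] = {}  # namespace -> most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group.get(alert.namespace)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)  # close enough to the previous alert: same incident
        else:
            group = [alert]  # gap too large or first alert: start a new incident
            groups.append(group)
            open_group[alert.namespace] = group
    return groups
```

A real system would likely correlate on richer signals (labels, topology, alert history). Time-and-namespace grouping is only the simplest baseline.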

Eptura FinOps

FinOps & Cloud Optimization

Delivered $280K+ total savings through cloud spend governance, Karpenter-based node autoscaling, IP exhaustion solutions on EKS, and resource right-sizing across multi-cloud environments.

$280K+ Total Savings

$2M+ Budget

35% Cost Reduction

Karpenter, FinOps, Right-sizing, Spot Instances

// AI/ML Infrastructure

Building the Future of AI Ops

Extending platform engineering expertise into AI infrastructure — solving the operational challenges of serving and scaling ML workloads in production

GPU Scheduling in Kubernetes

Working with NVIDIA GPU Operator, device plugin, MIG partitioning, and time-slicing for efficient GPU resource allocation across K8s clusters.
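
As an illustration of how a workload asks for a GPU once the NVIDIA device plugin is running, the Python sketch below builds a Pod manifest that sets the `nvidia.com/gpu` extended resource in its limits. The helper `gpu_pod_manifest`, the pod name, and the image are placeholders invented for this example. MIG or time-slicing setups may advertise different resource names or counts, which this sketch does not cover.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest requesting whole GPUs via nvidia.com/gpu."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; the JSON output can be piped to `kubectl apply -f -`.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-test:latest")
    print(json.dumps(manifest, indent=2))
```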

Model Serving Frameworks

Hands-on with vLLM, Triton Inference Server, and Ray Serve — deploying inference endpoints with autoscaling strategies for production traffic.
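
One common autoscaling strategy for inference endpoints scales replicas in proportion to a per-replica load metric, as the Kubernetes Horizontal Pod Autoscaler does. The Python sketch below reproduces that proportional rule for illustration. The page does not say which autoscaler or metric is used, so the function name, the metric (for example, in-flight requests per replica), and the bounds are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Proportional scaling: ceil(current_replicas * current_metric / target_metric).

    Returns the current count unchanged when the ratio is within `tolerance`
    of 1.0, and otherwise clamps the result to [min_replicas, max_replicas].
    """
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("current_replicas must be >= 1 and target_metric > 0")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas each averaging 30 in-flight requests against a target of 20 would scale to 6 under this rule.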

KraftAI — AI Products

Building a solo AI products venture using Gemini API, Next.js, and serverless inference patterns to solve real-world problems with AI.

GPU Observability & Cost

Applying Karpenter autoscaling, Prometheus observability, and FinOps expertise to GPU-specific infrastructure challenges and inference latency SLOs.

// Career

Work Experience

Building teams and delivering impact at scale

March 2024 — Present

Eptura (Fortune 500 SaaS)

Lead DevOps / SRE Engineer

Lead and mentor 8-member DevOps/SRE team across 4 time zones. Manage $2M+ annual multi-cloud budget with FinOps discipline. Architect multi-cloud Kubernetes platforms (EKS & AKS) across 26+ clusters sustaining 99.99% production uptime. Migrated 300+ workloads with zero downtime. Standardized GitOps with Argo CD, increasing deploy frequency 3x. Delivered SOC2 compliance and GDPR readiness.

AWS, Azure, Kubernetes, Argo CD, Karpenter, SOC2, FinOps

August 2020 — March 2024

RapidInnovation

Senior DevOps Engineer

Built DevOps and SRE foundations for 15+ production environments with 99.9% uptime. Designed multi-cloud deployments across AWS, Azure, GCP. Modernized infrastructure with Terraform and Kubernetes, generating $160K annual savings. Optimized CI/CD pipelines reducing build time by 62%. Built blockchain infrastructure (Ethereum, Dogecoin, IPFS nodes). Automated 100+ hours/month of ops tasks. Mentored 5 engineers.

Multi-Cloud, Terraform, Blockchain, Ethereum, IPFS, CI/CD

2017 — 2021

ABES Institute of Technology

B.Tech Computer Science & Engineering

GPA: 8.2/10. Ghaziabad, India. Strong foundation in computer science, distributed systems, and software engineering principles.

// Testimonials

What People Say

Endorsements from colleagues and leaders

“Hritik's expertise in debugging complex Kubernetes issues was invaluable. He quickly identified the root cause in our pod networking configuration that had stumped the team for days.”

Jonathan Smith

Lead Developer, Flush Project

“Hritik's debugging and troubleshooting skills are world-class. He consistently resolved complex infrastructure issues that other engineers couldn't solve.”

David Rogers

CTO, RapidInnovation

“He managed our entire GitLab infrastructure for 300+ employees. His expertise achieved 95% deployment success rate and transformed our release process.”

Vineet Kulkarni

DevOps Manager, RapidInnovation

// Contact

Let's Work Together

Let's build infrastructure that doesn't sleep.

Available for leadership roles, consulting, and technical collaborations across IST, CET/CEST, and global time zones. Open to international travel.

Email

hritikchaudhary016@gmail.com

Connect with me

GitHub

github.com/hritikch24

+91 8859820935
