Senior AI Site Reliability Engineer (AI SRE):
OpenKyber is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands-on, owning end-to-end stability for mission-critical model serving, data pipelines, and GPU-intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for OpenKyber's AI products.
Responsibilities:
• Own SLOs/SLAs for availability, latency, performance, and cost across AI services
• Architect and operate highly available, fault-tolerant AI/ML infrastructure
• Lead incident response, deep-dive troubleshooting, root cause analysis (RCA), and postmortems
• Deploy, monitor, and scale ML models and real-time inference services
• Manage the model lifecycle (training, validation, deployment, rollback)
• Detect and mitigate model drift, data skew, and inference degradation
• Build observability for model accuracy, data quality, pipelines, and system health
• Implement logging, tracing, and alerting for AI workloads
• Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
• Optimize cloud compute (GPU/CPU) for performance and cost efficiency
• Ensure secure handling of data, models, APIs, and compliance requirements
Must Have Skills:
• 7+ years in SRE, DevOps, or Platform Engineering
• Proven experience running production AI/ML systems at scale
• Strong Python; Go/Java a plus
• Deep expertise with Linux, Docker, Kubernetes
• Cloud experience with AWS, Google Cloud Platform, or Azure
• Strong understanding of model serving, inference pipelines, data pipelines, and feature stores
• Experience with GPU workloads and performance tuning
• Advanced troubleshooting across data, model, and infrastructure layers
• Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
• ML monitoring (model metrics, drift detection, inference health)
• CI/CD, MLOps, IaC (Terraform, CloudFormation)
Nice to Have:
• Experience with Kubeflow, MLflow, SageMaker, or Vertex AI
• Background in ML or data science
• Experience with real-time, high-throughput inference systems
• Exposure to AI governance, explainability, or responsible AI
Success Indicators:
• AI services consistently exceed reliability and performance targets
• Incidents decrease through strong operational rigor and automation
• Models are deployed safely, quickly, and with confidence
• Engineering teams rely on the platform and tooling you build
For applications and inquiries, contact:
[email protected]