◆Posted Mar 28, 2026

Junior Cloud Automation Engineer

Senior AI Site Reliability Engineer (AI SRE)

OpenKyber is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands-on, owning end-to-end stability for mission-critical model serving, data pipelines, and GPU-intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for OpenKyber's AI products.

Responsibilities:
• Own SLOs/SLAs for availability, latency, performance, and cost across AI services
• Architect and operate highly available, fault-tolerant AI/ML infrastructure
• Lead incident response, deep-dive troubleshooting, RCA, and postmortems
• Deploy, monitor, and scale ML models and real-time inference services
• Manage the model lifecycle (training, validation, deployment, rollback)
• Detect and mitigate model drift, data skew, and inference degradation
• Build observability for model accuracy, data quality, pipelines, and system health
• Implement logging, tracing, and alerting for AI workloads
• Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
• Optimize cloud compute (GPU/CPU) for performance and cost efficiency
• Ensure secure handling of data, models, and APIs, and meet compliance requirements

Must-Have Skills:
• 7+ years in SRE, DevOps, or Platform Engineering
• Proven experience running production AI/ML systems at scale
• Strong Python; Go/Java a plus
• Deep expertise with Linux, Docker, Kubernetes
• Cloud experience with AWS, Google Cloud Platform, or Azure
• Strong understanding of model serving, inference pipelines, data pipelines, and feature stores
• Experience with GPU workloads and performance tuning
• Advanced troubleshooting across data, model, and infrastructure layers
• Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
• ML monitoring (model metrics, drift detection, inference health)
• CI/CD, MLOps, IaC (Terraform, CloudFormation)

Nice to Have:
• Experience with Kubeflow, MLflow, SageMaker, Vertex AI
• Background in ML or data science
• Experience with real-time, high-throughput inference systems
• Exposure to AI governance, explainability, or responsible AI

Success Indicators:
• AI services consistently exceed reliability and performance targets
• Incidents decrease through strong operational rigor and automation
• Models are deployed safely, quickly, and with confidence
• Engineering teams rely on the platform and tooling you build

For applications and inquiries, contact: [email protected]
