About the Company
A fast-growing, venture-backed startup is building a next-generation AI compute platform focused on decentralized, high-performance infrastructure. The company is rethinking how organizations access and scale compute by integrating global data centers into a unified, serverless platform.
Their mission is to democratize access to AI compute and provide an end-to-end lifecycle solution, from raw data to deployed models, through a combination of platform infrastructure and forward-deployed engineering.
With a global footprint and early traction, the team is tackling challenges across multi-cloud orchestration, GPU scheduling, and enterprise-grade infrastructure, with a strong focus on security and compliance.
The Role
This is a high-impact infrastructure role focused on designing and scaling the distributed systems that power large AI/ML workloads.
You'll work across:
- Core platform architecture
- Multi-cloud compute orchestration
- Managed services development
- Customer-facing deployments
This role requires a strong mix of systems engineering and product thinking, with exposure to both backend infrastructure and the end-user experience.
What You'll Work On
Compute Platform & Multi-Cloud Architecture
- Design abstraction layers across cloud providers (AWS, GCP, Azure, bare-metal)
- Build systems that unify compute, storage, and networking across environments
- Expand global compute capacity by integrating with cloud and data center providers
- Architect reusable, composable infrastructure components
Managed Services & Platform Development
- Own services end-to-end (design → deployment → monitoring)
- Build orchestration systems for GPU workloads and container scheduling
- Develop APIs and control planes for provisioning, scaling, and lifecycle management
- Drive improvements in performance, reliability, and cost efficiency
Infrastructure & Platform Services
- Build systems for billing, usage tracking, and cost attribution
- Develop observability tooling (metrics, logging, tracing)
- Establish engineering standards and best practices
- Mentor engineers and contribute to system design decisions
What They're Looking For
Core Requirements
- 4+ years building distributed systems, backend infrastructure, or cloud platforms
- Strong experience with AWS, GCP, or Azure
- Deep understanding of:
  - Compute (VMs, instances)
  - Storage (object, block, file systems)
  - Networking (VPCs, load balancers, security groups)
- Experience with Kubernetes and container orchestration
- Strong programming skills (Golang preferred; Python/Rust a plus)
- Experience building APIs, control planes, or platform services
- Familiarity with databases (Postgres, Redis, etc.) and messaging systems (Kafka, RabbitMQ)
Nice to Have
- GPU orchestration or AI/ML infrastructure experience
- HPC or cluster management (Kubernetes, Slurm)
- Data engineering or large-scale ETL systems
- Systems-level programming (low-level infra, operators, daemons)
- ML platform engineering (training/inference pipelines)
- Experience deploying into enterprise or on-prem environments
Oscar Associates Limited (US) is acting as an Employment Agency in relation to this vacancy.