AI Infra Engineer

Perplexity AI
Location: San Francisco; Palo Alto
Posted 15 days ago
Full-time | AI Infra Engineer | AI Infrastructure | Kubernetes | Slurm
Salary: $220,000 - $405,000

About the Role

We are looking for an AI Infra Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, running primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Requirements

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with a focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments
  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

Qualifications

  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high-performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Benefits

  • Full-time U.S. employees enjoy a comprehensive benefits program including equity, health, dental, vision, retirement, fitness, commuter and dependent care accounts, and more.
  • Full-time employees outside the U.S. enjoy a comprehensive benefits program tailored to their region of residence.
  • USD salary ranges apply only to U.S.-based positions. International salaries are set based on the local market. Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Frequently Asked Questions

How do I apply for this AI Infra Engineer position?

Click the "Apply Now" button on this page; you will be taken to the employer's application page.

Is this position remote?

This role is based in San Francisco; Palo Alto. Check the full description for remote or hybrid options.

What is the salary range?

The listed salary range for this position is $220,000 - $405,000. Final compensation may vary based on experience, qualifications, and location.

When was this job posted?

This position was posted 15 days ago. We recommend applying promptly as positions can fill quickly.
