Luma AI

    Software Engineer - Site Reliability

    Luma AI
    Posted 12/6/2025
    Lead/Manager
    Full-time
    Technology
    Linux
    Cloud Infrastructure
    Security
    Performance Tuning
    Automation

    Job Description

    About Luma AI

    Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

    Where You Come In

    We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure. You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

    What You’ll Do

    Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
    Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
    Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
    Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
    Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
    Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

    Who You Are

    8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
    Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
    Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
    Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
    Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
    Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
    Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.

    What Sets You Apart (Bonus Points)

    Deep expertise with GPU tooling such as DCGM (NVIDIA) or ROCm (AMD).
    Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
    Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
