Aliso Viejo, California, United States Hybrid/Remote Options
StubHub
StubHub to be the safest, most convenient way to offer a ticket to the millions of fans who browse our platform around the world. StubHub is looking for a Senior Site Reliability Engineer (SRE) to design and develop next-generation technologies and complex features. As a Senior SRE at StubHub, you will be at the forefront of tackling significant … and automation to reduce toil across engineering teams. Ensure systems effectively balance cost, performance, and reliability at scale. What You've Done: Extensive experience (typically 5+ years) in a site reliability engineering or a related role, demonstrating a strong command of incident management, mitigation, and prevention, troubleshooting, and performance tuning. Experience with developing robust, mission-critical systems using one or …
Los Angeles, California, United States Hybrid/Remote Options
StubHub
StubHub to be the safest, most convenient way to offer a ticket to the millions of fans who browse our platform around the world. StubHub is looking for a Senior Site Reliability Engineer (SRE) to design and develop next-generation technologies and complex features. As a Senior SRE at StubHub, you will be at the forefront of tackling significant … and automation to reduce toil across engineering teams. Ensure systems effectively balance cost, performance, and reliability at scale. What You've Done: Extensive experience (typically 5+ years) in a site reliability engineering or a related role, demonstrating a strong command of incident management, mitigation, and prevention, troubleshooting, and performance tuning. Experience with developing robust, mission-critical systems using one or …
City of London, London, United Kingdom Hybrid/Remote Options
Motive Group
Senior/Staff Site Reliability Engineer - Observability | London (Hybrid) If you care deeply about building and operating world-class infrastructure for AI at scale, this one’s worth your time. We’re working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large-scale GPU clusters, global telemetry …
City of London, London, United Kingdom Hybrid/Remote Options
Techfellow Limited
in Office] Role Overview We’re representing a global trading and digital assets firm at the forefront of high-performance technology and infrastructure innovation. The business is seeking a Site Reliability & Infrastructure Engineer to help design, automate, and scale the systems that underpin its global trading platforms. This role sits within a high-performing 11-person infrastructure team … that combines Site Reliability and Core Infrastructure responsibilities - owning everything from AWS cloud systems to on-prem deployments. The team is expanding to meet new strategic demands, including increased automation, enhanced observability, and the rollout of new colocation environments to support lower-latency trading. It’s a technically hands-on position that blends architecture, build, and operational ownership, suited … to an engineer with curiosity, precision, and a drive to constantly improve how infrastructure is built and run... Key Responsibilities: Design, build, and maintain highly available infrastructure across both cloud (AWS) and on-prem environments. Implement automation across the stack using Infrastructure-as-Code principles (Terraform, Ansible, or similar). Administer and optimise Kubernetes clusters across multiple regions, improving resilience …
Denver, Colorado, United States Hybrid/Remote Options
Checkr
problems with innovative solutions that advance our mission. Checkr is recognized on Forbes Cloud List and is a Y Combinator 2024 Breakthrough Company. About the role: As a Site Reliability Engineer II on the Platform team, you will uncover issues and technical challenges across the engineering teams and platforms, and develop creative solutions to resolve them. You … to a wide array of customers, including operations, developers, technical architects, and executives. What You Bring: Degree in Computer Science (or related field). 3+ years experience as a software engineer (Ruby, Python, or GoLang). 3+ years experience in maintaining and observing production customer-facing environments in AWS or Azure. 3+ years experience as incident commander or response team member of …
MIO Partners, Inc. (MIO) provides proprietary investment products to McKinsey's retirement plan and partners and offers independent, high-quality financial advice to McKinsey's partners. We manage a wide array of investment vehicles with significant expertise and a long …
Cupertino, California, United States Hybrid/Remote Options
Monks
While Monks may contact potential candidates via LinkedIn, all applications must be submitted through our official website. About the Role We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our Platform Engineering team, supporting a world-class media production environment for a leading global technology company. This is a crucial role within … a Managed Services model, focused on ensuring the high availability, performance, and resilience of critical server, storage, and media workflow systems. You will be one of two dedicated on-site SREs who will partner with remote and consulting staff to provide around-the-clock operational support and continuous infrastructure improvement. Key Responsibilities Infrastructure Management: Maintain and troubleshoot all production … and maintenance of custom applications and dashboards that support media workflows, including tools for project deployment, directory services integration, and ticketing. Remote/On-Demand Support: Provide active on-site support and participate in a 24/7 on-call rotation for critical interventions (e.g., power/cooling issues). Backup and Archive: Manage the Backup and Archive environment …
Irvine, California, United States Hybrid/Remote Options
Varnish Software
At Varnish Software, we empower the world's largest content providers to deliver lightning-fast web and streaming experiences, ensuring resilience and scalability for massive audiences. Our open-source roots combined with enterprise-level robustness help us lead the way …
Role Overview: This isn't a "keep the lights on" SRE role. This is a strategic, high-impact opportunity to build the nervous system for a platform that transforms how networks of satellites, ground stations, and fleets are interconnected and …
San Jose, California, United States Hybrid/Remote Options
EZ Texting
Who We Are: EZ Texting is a recognized leader in text message marketing for small and medium-sized businesses and organizations, setting the standard for professional texting. Our messaging solutions allow everyone to easily and effectively reach their mobile audiences.
Boston, Massachusetts, United States Hybrid/Remote Options
Axon
you'll take ownership and drive real change. Constantly grow as you work hard for a mission that matters at a company where you matter. Your Impact As a Site Reliability Engineer within the APX SRE organization, you'll focus on delivering practical, scalable solutions to support the reliability and performance of our mission-critical, cloud-native global …
Atlanta, Georgia, United States Hybrid/Remote Options
Axon
you'll take ownership and drive real change. Constantly grow as you work hard for a mission that matters at a company where you matter. Your Impact As a Site Reliability Engineer within the APX SRE organization, you'll focus on delivering practical, scalable solutions to support the reliability and performance of our mission-critical, cloud-native global …
autonomous systems better than the last. The state of our autonomous systems is nascent, with the foundational pieces either recently completed or currently under development. As a Site Reliability Engineer you will be embedded with a cross-functional team that has key responsibilities for certain portions of our systems. Over the course of the first year … successive years you will be given specific mission-critical objectives that help build out and improve our autonomous systems and simultaneously build out your personal brand as an exceptional engineer who has built and maintained amazing systems that can grow to internet scale. Minimum Qualifications: Curiosity, a willingness to learn, a passion to continually improve, and unbridled enthusiasm to … systems, a related field; comparable certifications; or equivalent direct work experience. A minimum of 8 years of experience in hands-on technical roles. A minimum of 2 years of Site Reliability Engineering experience. Experience building autonomous systems that manage software operational details without human intervention. Preferred Qualifications: M.S. in computer science, information systems, a related field; comparable certifications; or …
Johnson City, Tennessee, United States Hybrid/Remote Options
Palantir Technologies
the people who need it, our platforms empower our partners to develop lifesaving drugs, forecast supply chain disruptions, locate missing children, and more. The Role We're looking for Site Reliability Engineers who can help us build, operate, and maintain high-performance, scalable, and reliable services for our production infrastructure, across both cloud & on-prem environments. Site Reliability …
San Francisco, California, United States Hybrid/Remote Options
Crusoe
you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure. About This Role: At Crusoe, our Site Reliability Engineering (SRE) team is responsible for ensuring the reliability, performance, and operational excellence of our critical infrastructure. We are dedicated to building and maintaining highly available and resilient …
Senior Linux SRE, Outside IR35 - 12-month contract initially. Fully remote role across UK/Europe. Our client is a consumer-facing tech business and they are looking for a Senior SRE with a strong background in Linux infrastructure and …
Seattle, Washington, United States Hybrid/Remote Options
Lambda
GPU. If you'd like to build the world's best deep learning cloud, join us. Note: This position requires presence in our upcoming Seattle office location or on-site with strategic customers 4 days per week; Lambda's designated work from home day is currently Tuesday. About The Role We're looking for a Forward Deployed Engineer … ambiguity is the default state. Your job is to map problems, structure delivery paths, and ship solutions that create measurable impact. What You'll Do Customer Engagement: Embed on-site with a named strategic customer, becoming an extension of their team. Act as the primary technical liaison between Lambda and the customer organization. Navigate ambiguous requirements to identify root … elevate the capabilities of the broader team. Represent Lambda with executive presence in high-stakes customer interactions. About You Must-Have: 6+ years of experience in an SRE, software engineer, or similar role, with a deep knowledge of running Linux clusters and systems. Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators …
San Francisco, California, United States Hybrid/Remote Options
Lambda
upgrades, patching, and deletion. Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability. About You Must-Have: 6+ years of experience in an SRE, operations engineer, or similar role, with a deep knowledge of running Linux clusters and systems. Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators … Work on cutting-edge Managed Kubernetes platforms for AI/ML workloads. Influence the platform roadmap and help shape operations and reliability best practices. Collaborate with a highly skilled engineer. Opportunity to mentor and grow within a fast-growing, technology-driven environment. Salary Range Information: The annual salary range for this position has been set based on market data …
San Francisco, California, United States Hybrid/Remote Options
WorkOS
funded, having raised an $80M Series B. Our fast-growing customer base includes hundreds of rapidly growing SaaS companies like Webflow, Vercel, Plaid, Loom, and Drata. About the Site Reliability Engineering Team: The Site Reliability Engineering (SRE) team ensures the WorkOS platform remains fast, reliable, and resilient at scale. We build the systems and practices that keep …
developers signing up to use MongoDB every month, it's no wonder that leading organizations, like Samsung and Toyota, trust MongoDB to build next-generation, AI-powered applications. The Site Reliability Engineering team designs and builds the global infrastructure on which we deploy our services, focusing on the above-mentioned flagship MongoDB Atlas platform. As our customers grow and …
Colorado Springs, Colorado, United States Hybrid/Remote Options
Pushpay
products (both internal and external) and associated infrastructure which may constitute intellectual property that belongs to Pushpay. What You'll Bring: 7+ years of relevant SRE, System, or Software engineer experience; "relevant" being: Developing Internet-scale multi-user web/mobile/cloud type software products. Applicable tertiary qualifications. Strong passion for developing new software and systems that are …