Site Reliability Engineer (Hiring Immediately)
Find the latest job opportunities in AI and tech.
RunPod offers GPU cloud computing for AI/ML, providing secure and community cloud options, on-demand and spot pods, and serverless GPU scaling.
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.
Contribute to a company with a global impact based in the US, Canada, and Europe.
- 5+ years of experience in Site Reliability Engineering or a similar role
- 3+ years of experience in a technical leadership or management position
- Deep understanding of Linux systems, containerization, virtualization, and networking technologies
- Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
- Expertise in infrastructure-as-code and configuration management tools
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation, continuous learning, and technical excellence
- Develop and implement strategic plans to enhance the reliability, scalability, and efficiency of our infrastructure
- Collaborate with cross-functional teams to align SRE initiatives with broader organizational goals
- Establish and maintain SLIs, SLOs, and SLAs for critical systems and services
- Drive the adoption of best practices in automation, monitoring, and incident response
Software Engineer, Site Reliability Engineer.
Fireworks AI offers a fast and efficient platform for building and deploying generative AI applications with a focus on speed, value, and scalability.
Tyk AI Studio is an AI gateway and management solution that helps organizations harness AI's potential while ensuring governance, security, compliance, and control.
- Proven experience in a senior SRE role or similar.
- Strong knowledge of cloud technologies and SLA, SLO, and SLI management.
- Experience leading teams and implementing SCRUM processes.
- Excellent communication and leadership skills.
- Experience line managing, mentoring, and coaching.
- Collaborate with the Principal SRE to shape and implement the SRE strategic plan.
- Lead the SRE team in translating strategy into actionable plans, coordinating these through the SCRUM process.
- Address wellbeing and performance concerns, fostering a positive and productive team environment.
- Work with the Principal SRE and Scrum Master to analyze wellbeing survey outcomes and develop improvement plans.
Invisible AI is an on-premise computer vision platform for manufacturing that uses AI to improve worker productivity and safety by analyzing manual assembly work.
- Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
- 5+ years of experience building and managing infrastructure at scale, particularly on the edge.
- Proficiency in Python, Docker, Linux systems, and scripting (Bash, Python).
- Strong expertise with infrastructure automation tools (Terraform, Ansible).
- Experience managing observability and monitoring systems, particularly Prometheus.
- Deep understanding of networking concepts and protocols.
- Design, build, and maintain scalable and resilient infrastructure on the edge.
- Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
- Deploy and manage containerized applications using Docker and related technologies.
- Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
- Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).
xAI's Grok is a powerful, multilingual large language model available on X and via API, focused on accelerating scientific discovery.
- Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go.
- Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty.
- Expert knowledge of deployment technologies such as Pulumi or Terraform.
- Expert knowledge of Kubernetes.
- Improving our observability by adding/adjusting metrics.
- Building easily parsable dashboards.
- Designing and overseeing our on-call rotations.
- Improving our deployment process to increase reliability.
Luminance is an AI-powered legal tech platform that streamlines contract lifecycle management with features including AI-powered negotiation and an intelligent contract repository.
- Bachelor's or Master's degree with a First or 2:1, preferably in a technical subject.
- Excellent problem-solving skills, including diagnosing issues within complex systems.
- Ability and desire to identify root causes of issues, and propose and implement structural improvements.
- Strong communication skills and capability to perform in scenarios with urgency.
- Knowledge of the design and operation of web-based software applications, based on technologies such as node.js, PostgreSQL, or Elasticsearch.
- Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible, Prometheus.
Fathom is a free AI meeting assistant that records, transcribes, and summarizes your meetings, saving you time and improving productivity.
- 6+ years.
- Scaling existing tools.
- Enhancing automation for scaling infrastructure.
- Playing a key role in diversifying and scaling the platform.
- Evaluating options to replace existing real-time data pipeline.
- Providing platform support to engineering.
AppTek.ai provides AI-powered speech and language solutions including ASR, NMT, NLP/U, LLMs, and TTS, serving diverse industries globally.
- BS in a field related to Computational Linguistics, Computer/Data Science.
- 2+ years of industry experience (desirable for Site Reliability Engineer role).
- Strong knowledge of Linux.
- Strong knowledge of AWS.
- Docker.
- Scripting languages (Bash, Python).
- Familiarity with load-testing tools.
- Must be a U.S. citizen capable of obtaining a Secret clearance (for Computational Linguist and Linguist roles).
- On-call first-level response.
- Respond to customer issue reports.
- Troubleshoot problems to maintain service SLAs.
- End-to-end monitoring across infrastructure and services for metrics/alerts/logs.
Linc's CX automation platform uses AI to streamline retail customer service, boosting efficiency and delighting customers.
- B.S. in Computer Science or a related field.
- 1+ years of site reliability engineering experience.
- Familiarity with at least one cloud service provider, preferably AWS.
- Familiar with basic SQL commands and Internet protocols.
- Proficient in cloud application orchestration tools like Kubernetes, Helm.
- Experience with monitoring stacks, preferably Datadog.
- Collaborate with engineering teams to define and maintain service SLAs.
- Monitor metrics, alerts, logs across infrastructure and applications.
- Create and maintain tools to monitor the platform.
- Respond to incidents, troubleshoot, investigate root causes.
- Conduct post-incident investigation and report.
QED.ai provides AI-driven solutions for data scarcity in health and agriculture, offering tools for data digitization, geospatial mapping, and spectroscopy.
Travel to exotic places around the world.
Ask Sage is a versatile, secure Generative AI platform for government and commercial use, offering significant productivity improvements and LLM-agnostic support.
- 3+ years in site reliability engineering, Kubernetes administration, or related role.
- Deep expertise in Kubernetes and containers.
- Strong understanding of cloud infrastructure, automation tools, and best practices for high availability and performance.
- Monitor system performance and reliability.
Hebbia is an enterprise-grade AI platform that empowers knowledge workers by automating complex tasks and providing insights from various data sources. It’s designed for seamless integration and high security.
- 4+ years software development experience at a venture-backed startup or top technology firm.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
- Strong expertise in managing CI/CD pipelines and deployment automation.
- Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
- Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.
- Company: AI Tech Suite
- Location: London, UK (Hybrid / WFH Options)
- Employment Type: Full-time
- Posted