diagnose and resolve issues, maintain, configure, and troubleshoot hardware and software components Knowledge of networking hardware devices and systems Desired: Familiarity with advanced Kubernetes ecosystem tools such as Rook, Ceph, MetalLB, Artifactory, and ActiveMQ Experience with Active Directory integration and management Proficiency in scripting languages such as Python, Bash, and PowerShell Thorough understanding and application of DevOps principles and methodologies
role, you’ll be responsible for developing and maintaining systems that support their machine learning workflows, virtual infrastructure, and cloud-native environments. You should have strong experience in Kubernetes, Ceph, and virtualization, along with sound knowledge of networking and security principles. Responsibilities Manage and scale Kubernetes clusters for infrastructure. Develop infrastructure automation using Python and tools like Ansible or Terraform. Monitor … infrastructure health and performance using Prometheus, Grafana, and logs. Maintain and integrate the identity provider system with OAuth/OIDC/LDAP. Deploy and maintain Ceph clusters for distributed, fault-tolerant storage. Monitor and manage user identities and access permissions within cloud platforms using cloud technologies such as Cloud IAM System and Cloud Intrusion Detection System. Set up and enforce networking … and virtualization management. Understanding of networking fundamentals: firewalls, VPNs, NAT, VLANs, DNS, routing. Linux system administration experience, including shell scripting and performance tuning. Nice to have: Certifications in Kubernetes, Ceph, or cloud technologies (AWS/GCP/Azure). Familiar with ML infrastructure: Ray (KubeRay), MLflow, Kubeflow, Nexus (PyPI), or model serving platforms. Familiar with Git, GitOps workflows to
implementing solutions in a complex global enterprise environment. • Ability to work in a 24x7 on-call, after-hours rotation environment. • Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN). Desired Skills: • Proactive approach to identifying complex problems, performance bottlenecks, and areas for improvement. • Advocate for
a collaborative, low-ego environment where infrastructure is truly seen as a strategic asset. What You’ll Do: Design, build, and support high-performance storage systems (e.g., NFS, S3, Ceph, GPFS, Lustre, VAST, WEKA, DDN) Collaborate closely with Quant Researchers and AI/ML teams to translate their workloads into scalable storage infrastructure Optimize performance across Linux kernel, storage, and
have a strong technical background in building, testing, and supporting Linux-based, large-scale, high-performance workloads, likely with deep exposure to technologies and vendors such as NFS, S3, Ceph, GPFS, Lustre, RoCE, VAST, WEKA, or DDN. You will work directly with our researchers to understand their technology ecosystem, which uses the latest AI/ML modelling technologies. You
Denver, Colorado, United States Hybrid / WFH Options
Boom Supersonic
and source control Have deployed and monitored distributed systems, such as microservices or client/server architectures Hands-on experience designing and managing petabyte-scale storage systems (Lustre, BeeGFS, Ceph, ZFS) Know how to wrangle fleets of Linux workstations with configuration management and automation tools Familiarity with containerization (Docker, Singularity) and infrastructure-as-code (Terraform, Ansible, CDK) Are comfortable coordinating
San Francisco, California, United States Hybrid / WFH Options
Crusoe
first cloud environments. What You'll Bring to the Team: 5+ years of professional experience in SRE, systems, or storage engineering. Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms. Proficiency in a programming language such as Python, Go, Java, or C. Experience with Infrastructure as Code and
Ability to build applications from source and troubleshoot compilation issues Experience with compilers such as GNU, Intel, and AOCC Storage installation and tuning experience (ZFS, XFS, GPFS, Lustre, Hadoop, Ceph, Object Storage) Shell scripting experience (Bash, Perl, Python) Virtualization experience (VMware, Xen, Hyper-V, KVM, etc.) Experience with x86 bootstrap process (BIOS, RAID, Fibre Channel, etc.) Experience with batch control
Fort Belvoir, Virginia, United States Hybrid / WFH Options
Enlighten, an HII - Mission Technologies Company
schedule. Must be willing/able to help open/close the workspace during regular business hours as needed. Preferred Requirements Experience with big data technologies such as Hadoop, Accumulo, Ceph, Spark, NiFi, Kafka, PostgreSQL, Elasticsearch, Hive, Drill, Impala, Trino, Presto, etc. Experience with containers, EKS, Diode, CI/CD, and Terraform is a plus. We have many more additional great
critical production infrastructure, spanning hundreds of Kubernetes clusters across on-premises environments from large data centers to edge devices, is seeking a Senior Infrastructure Engineer with deep expertise in Ceph. This individual will enhance the scale, reliability, and performance of ruggedized Kubernetes offerings operating under complex and novel constraints. Ideal candidates are passionate about infrastructure at scale … adept in Ceph, and eager to contribute to the broader open-source ecosystem. Key Responsibilities Manage Ceph at Scale: Design, deploy, and maintain Ceph storage solutions across a variety of hardware environments with an emphasis on high availability and performance. Automate Deployments: Create automation frameworks and tooling to manage large-scale Ceph deployments, minimizing manual effort and maximizing operational efficiency. … Innovate and Contribute: Drive the integration of emerging tools and features from the Ceph and CNCF ecosystems, and contribute upstream to relevant open-source projects. Community Engagement: Actively participate in the Ceph developer and CNCF communities through collaboration, contribution, and knowledge sharing. Infrastructure Evolution: Partner with peers to architect and build scalable, secure, and resilient infrastructure for next-generation deployments.
of Kubernetes clusters using infrastructure-as-code (IaC) tools (e.g., Terraform, Ansible) solutions for on-prem Kubernetes clusters and applications. Implement and manage cloud-native persistent storage solutions (e.g., Ceph) with Kubernetes clusters. Manage and optimize storage solutions within the Kubernetes environment, including persistent volumes and storage classes. Customer First! Provide technical guidance and support to customers deploying our applications … Kubernetes Engine (RKE2) Experience with database technologies like Elasticsearch, MySQL, and BigQuery, cloud technologies like S3, Pub/Sub, and RabbitMQ, and caching technologies like Redis. Experience with cloud-native storage solutions (e.g., Ceph, Rook) with a proven ability to develop and manage custom Kubernetes controllers or operators. Good understanding of public cloud design considerations and limitations in areas of microservice architectures, security, global
will be an added advantage OpenShift monitoring and writing custom alerts using Prometheus Alertmanager Checkmk to monitor physical infrastructure Experience with Red Hat Quay Container Registry Experience with Red Hat Ceph Storage Experience with Red Hat OpenStack Experience with maintaining Dell PowerEdge Servers
structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN) Proactive approach to identifying problems, performance bottlenecks, and areas for improvement Agile/Scrum experience. Physical
and/or Alexandria, VA 5 days a week. Flexibility is essential to adapt to schedule changes as needed. Preferred Requirements Experience with big data technologies such as Hadoop, Accumulo, Ceph, Spark, NiFi, Kafka, PostgreSQL, Elasticsearch, Hive, Drill, Impala, Trino, Presto, etc. Experience with containers, EKS, Diode, CI/CD, and Terraform is a plus. Work could possibly require some on
work on-site 4-5 days/week. Flexibility is key to accommodate any schedule changes required by the customer. Preferred Requirements Experience with big data technologies such as Hadoop, Accumulo, Ceph, Spark, NiFi, Kafka, PostgreSQL, Elasticsearch, Hive, Drill, Impala, Trino, Presto, etc. Experience with containers, EKS, Diode, CI/CD, and Terraform is a plus. Work could possibly require some on
performance compute environments Partner with research teams to translate AI/ML and data modeling requirements into scalable storage solutions Work with technologies such as NFS, S3, GPFS, Lustre, Ceph, VAST, WEKA, or DDN Optimize storage and network performance using advanced benchmarking and kernel tuning techniques Contribute to infrastructure automation using Terraform, Ansible, and CI/CD tools like GitLab