DevOps Specialist
Job Title: DevOps Specialist & Data Engineer
Location: Remote
Type: Full-time
Experience Level: Senior
Industry: Generative AI / Artificial Intelligence / Machine Learning
Reports To: Head of Engineering / CTO
About Us
Ready to join a cutting-edge AI company? We’re on a mission to become the OpenAI of the spicy content industry, building a full-spectrum ecosystem of revolutionary AI infrastructure and products. Our platform, OhChat, features digital twins of real-world personalities and original AI characters, enabling users to interact with lifelike AI-generated characters through text, voice, and images, with a roadmap that includes agentic superModels, API integrations, and video capabilities.
Role Overview
We are looking for a Senior DevOps Specialist with a strong Python and data engineering background to support our R&D and tech teams by designing, building, and maintaining robust infrastructure and data pipelines across AWS and GCP. You will be instrumental in ensuring our systems are scalable, observable, cost-effective, and secure. This role is hands-on, cross-functional, and central to our product and research success.
Key Responsibilities

DevOps & Infrastructure
- Design, implement, and maintain infrastructure on AWS and Google Cloud Platform (GCP) to support high-performance computing workloads and scalable services.
- Collaborate with R&D teams to provision and manage compute environments for model training and experimentation.
- Maintain and monitor systems, implement observability solutions (e.g., logging, metrics, tracing), and proactively resolve infrastructure issues.
- Manage CI/CD pipelines for rapid, reliable deployment of services and models.
- Ensure high availability, disaster recovery, and robust security practices across environments.

Data Engineering
- Build and maintain data processing pipelines for model training, experimentation, and analytics.
- Work closely with machine learning engineers and researchers to understand data requirements and workflows.
- Design and implement solutions for data ingestion, transformation, and storage using tools such as Scrapy, Playwright, agentic workflows (e.g., crawl4ai), or equivalent.
- Optimize and benchmark AI training, inference, and data workflows to ensure high performance, scalability, cost-efficiency, and an exceptional customer experience.
- Maintain data quality, lineage, and compliance across multiple environments.

Requirements
- 5+ years of experience in DevOps, Site Reliability Engineering, or Data Engineering roles.
- Deep expertise with AWS and GCP, including services like EC2, S3, Lambda, IAM, GKE, BigQuery, and more.
- Strong proficiency in infrastructure-as-code tools (e.g., Terraform, Pulumi, CloudFormation).
- Extensive hands-on experience with Docker, Kubernetes, and CI/CD tools such as GitHub Actions, Bitbucket Pipelines, or Jenkins, with a strong ability to optimize CI/CD workflows as well as AI training and inference pipelines for performance and reliability.
- Exceptional programming skills in Python. You are expected to write clean, efficient, and production-ready code. You should be highly proficient with modern Python programming paradigms and tooling.
- Proficiency in data-centric programming and scripting languages (e.g., Python, SQL, Bash).
- Proven experience designing and maintaining scalable ETL/ELT pipelines.
- Focused, sharp, and results-oriented: You are decisive, work with a high degree of autonomy, and consistently deliver high-quality results. You are quick to understand and solve the core of a problem and know how to summarize it efficiently for stakeholders.
- Effective communicator and concise in reporting: You should be able to communicate technical insights in a clear and actionable manner, both verbally and in written form. Your reports should be precise, insightful, and aligned with business objectives.

Nice to Have
- Experience supporting AI/ML model training infrastructure (e.g., GPU orchestration, model serving) for both diffusion and LLM pipelines.
- Familiarity with data lake architectures and tools like Delta Lake, LakeFS, or Databricks.
- Knowledge of security and compliance best practices (e.g., SOC2, ISO 27001).
- Exposure to MLOps platforms or frameworks (e.g., MLflow, Kubeflow, Vertex AI).

What We Offer
- Competitive salary + equity
- Flexible work environment and remote-friendly culture
- Opportunities to work on cutting-edge AI/ML technology
- Fast-paced environment with high impact and visibility
- Professional growth support and resources

Company: OhChat
Location: Bradford, UK
Employment Type: Full-time