Site Reliability Engineer

You will join a team of high-performing engineers and have a significant impact on how we manage our growing infrastructure and service delivery. You’ll be tasked with maintaining the health of the Domino platform in a variety of environments, enhancing our observability systems, engineering reliability into our stack, and governing our infrastructure.

We are especially interested in engineers with experience operating services on GCP or Azure, or implementing security policies and controls on cloud service providers.

Qualifications

Tech we use is listed in parentheses; comparable experience is OK.

  • Experience managing cloud environments (AWS, GCP, Azure)
  • Strong coding ability (Python, Bash)
  • Systems fluency (Linux, storage, networking)
  • Experience with container management (Kubernetes, Docker)
  • Observability systems (New Relic, Prometheus)
  • Operating stacks based on modern software components
    (Redis, Elasticsearch, RabbitMQ, MongoDB, PostgreSQL, Play)
  • Programming experience (Python, Go, Bash)
  • Infrastructure and configuration automation (Terraform, SaltStack)
  • Exceptional problem-solving acumen

Responsibilities

  • Engineer reliability and performance into our product and services
  • Instrument and monitor service health
  • Manage and secure our cloud-based infrastructure
  • Diagnose and fix issues in a distributed, containerized application
  • Respond to incidents (on-call) and perform root cause analysis
  • Implement and manage access control and security services
  • Collaborate with developers and PMs to continuously improve Domino
  • Develop tools and processes to improve efficiency and reduce toil

Domino Data Lab

Powering model-driven businesses.

Technology we use

Python
Go
Scala
TypeScript
PostgreSQL
MongoDB
Elasticsearch
Redis
React
AWS
Docker
Node.js
Bash
RabbitMQ
