Senior Site Reliability Engineer – Observability

Sr. Site Reliability Engineer 

The Sr. Site Reliability Engineer will be a key member of the Observability team, focused on scaling and supporting the observability platform across Fastly’s technology stack. You will work alongside other internal engineering and support teams, and your experience in logging, metrics, and synthetic tracing will be critical to helping Fastly scale.

What We’re Looking For

  • 6+ years of experience running high-availability systems and supporting distributed infrastructure
  • Experience with open-source monitoring tools such as Graphite, Grafana, and Prometheus, or other tools such as Datadog and New Relic
  • Experience with programming languages such as Go, Ruby or Python
  • Expert understanding of Linux systems at both high and low levels
  • Passion for building great products that address real problems
  • Great oral and written communication skills
  • Excellent listening skills and a high degree of empathy

What You’ll Do

  • Improve and grow our Prometheus, Graphite, and Splunk systems
  • Troubleshoot, diagnose and resolve performance and reliability issues affecting the Observability infrastructure
  • Build dashboards for insights and visibility into critical business metrics
  • Design and develop solutions that deliver value for our internal customer teams
  • Participate in incident reviews to improve alerts for earlier detection and, where possible, proactive mitigation
  • Instrument Fastly’s systems and integrate monitoring and alerting into them to gain better insight into the systems that support our customers

Fastly

Fastly’s edge cloud platform enables the best of the web to thrive, and helps you deliver better online experiences.

Technology we use

Python
C
Go
Rust
MySQL