Walt Disney Careers – Site Reliability Engineer

Job Description:

As an SRE, you are regarded by your fellow team members as a trusted advisor on all things reliability: someone who clearly understands SRE principles and best practices and can explain them thoroughly to any given audience. To be successful in this role, you will continuously uphold and improve all relevant reliability aspects of our services, with an increased focus on SLIs and SLOs, while raising the reliability of a variety of large-scale user-facing and internal services.

Job Responsibilities:

  • Continuously refine monitoring processes, configurations, and thresholds;
  • Write code that improves scalability, performance, maintainability, and security;
  • Collaborate and provide technical excellence within and across teams;
  • Identify areas of improvement in reliability, efficiency, and operations;
  • Develop useful telemetry, alerts, and response procedures to reduce Mean Time To Repair (MTTR);
  • Practice and promote sustainable incident response and blameless postmortems;
  • Add, tune, and maintain alert configurations and documentation as needed;
  • Deploy and manage innovative modern cloud technologies using infrastructure-as-code, self-healing, and security automation patterns;
  • Consult on best practices and develop tools to enable smooth adoption of good service reliability practices and methods;
  • Develop runbooks and tools to streamline processes and shorten problem resolution time;
  • Operate in a high-pressure environment and quickly troubleshoot complex issues across distributed applications while successfully handling multiple priorities;
  • Build tools to help your SRE team quickly pinpoint, isolate, and resolve issues related to infrastructure, platform services, and applications.

Job Requirements:

  • Creative, innovative, outside-the-box thinking
  • 2-5 years of experience in SRE, DevOps, technical operations, systems engineering, software engineering, or a related discipline
  • Proficient, collaborative, and experienced in building reliable, scalable enterprise systems
  • Excellent communication skills, both verbal and written
  • Passionate and curious about ways to leverage technology while continually learning
  • Ability to identify root-cause sources of instability in high-traffic, large-scale distributed systems
  • Experience in building and operating large-scale production systems
  • Skilled in the use of containers in enterprise production environments (e.g. Docker, Kubernetes, LXC, AWS ECS and EKS)
  • Experience with configuration management and orchestration (e.g. Terraform, CloudFormation, Ansible)
  • Comfortable in one or more of the following languages: Python, Java, Scala, Go, Rust, Ruby, or similar
  • Experience with scripting languages such as Ruby, Bash, PowerShell, or Python
  • Skilled in Cloud/PaaS/SaaS environments (e.g. AWS, Azure, Google Cloud)
  • Hands-on experience using source control (Git, GitHub) and feature branching strategies

Qualification & Experience:

  • Experience with DevOps methodologies and/or SRE
  • Experience with container orchestration systems, such as AWS ECS or Kubernetes
  • Experience with monitoring and observability tooling such as Datadog, Prometheus, Grafana
  • Experience with automating infrastructure, deployment, and testing using tools like CloudFormation, Ansible, or Terraform, and the ability to explain the Infrastructure as Code paradigm
  • Experience with Service Level Objectives and Error Budgets
  • Experience with configuration management, such as Puppet and Ansible
  • Understanding of the principles and methodologies behind Chaos Engineering
  • Experience with software development in Java, Scala, etc.
  • BS degree in Computer Science, Electrical & Computer Engineering, or Mathematics

Job Details:

Company: Disney

Vacancy Type: Full Time

Job Location: Manchester, England, UK

Application Deadline: N/A
