Site Reliability Engineer @ RedHat

This job is no longer available.

Posted almost 5 years ago

**Company description**

At Red Hat, we connect an innovative community of customers, partners, and contributors to deliver an open source stack of trusted, high-performing solutions. We offer cloud, Linux, middleware, storage, and virtualization technologies, together with award-winning global customer support, consulting, and implementation services. Red Hat is a rapidly growing company supporting more than 90% of Fortune 500 companies.

**Job summary**

The Red Hat Cloud Services Site Reliability Engineering (SRE) team is looking for a Site Reliability Engineer to join us in Westford, MA. In this position, you will play a key role in a team responsible for keeping managed services running in a Red Hat OpenShift environment available and secure. You'll be responsible for problem detection and automated recovery scenarios, incident management, and understanding complicated, interconnected data points to determine when issues arise. As a site reliability engineer, you'll need to be able to work in a complicated and fast-paced environment, while quickly learning new skills and creating ways to consistently meet service-level agreements (SLAs) and keep a cloud-based service running for our customers.

**Primary job responsibilities**

+ Actively work to automatically detect potential issues in a large containerized environment

+ Participate in a regular on-call schedule

+ Write automation scripts to auto-correct or completely prevent issues in our online solution

+ Resolve customer issues escalated from Red Hat's Global Support team

+ Track and review changes in a highly dynamic environment

+ Identify single points of failure and other high-risk architecture issues and propose more resilient resolutions

+ Perform and oversee releases to ensure that proper life cycle and policies are followed

+ Perform software updates, testing, and Common Vulnerabilities and Exposures (CVE) analyses

+ Respond to security threats

+ Create and maintain standard operating procedures (SOPs) for performing maintenance tasks and remediating problems in our environment

**Required skills**

+ Experience running Linux servers such as Red Hat Enterprise Linux (RHEL), CentOS, or Fedora

+ Basic knowledge of monitoring systems; knowledge of Zabbix or Nagios is a plus

+ Basic understanding of configuration management systems like Red Hat Ansible Automation, Puppet, or Chef

+ Demonstrated ability to quickly and accurately troubleshoot issues

+ Experience with object-oriented programming in at least one dynamic language; experience with Python is a plus

+ Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP

The following are considered a plus:

+ Experience with Kubernetes

+ Experience with containers

+ Some experience with cloud technologies like Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), Red Hat OpenStack Platform, or Amazon Web Services (AWS)

Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, uniformed services, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.

Red Hat does not seek or accept unsolicited resumes or CVs from recruitment agencies. We are not responsible for, and will not pay, any fees, commissions, or any other payment related to unsolicited resumes or CVs except as required in a written contract between Red Hat and the recruitment agency or party requesting payment of a fee.

**Job ID** _69336_
**Category** _Software Engineering_
