Posted about 1 year ago
We are looking for a talented engineer with a blend of core distributed systems operations experience and systems-level Java expertise to join our Cloud Operations team. The ideal candidate has experience with infrastructure as code, deployment automation, configuration management, and multi-cloud provisioning in environments providing 99.9%+ SLAs. Working with a mix of private and public hybrid cloud environments as part of an elite worldwide team of engineers, you will take operational responsibility and provide technical expertise, workarounds, root-cause analysis, and patches on core database technologies in our flagship products inspired by Apache Cassandra.
Past experience running large-scale distributed systems is required. Qualified candidates will have a deep understanding of both traditional RDBMS and non-relational (ideally Cassandra) database technologies, with the ability to communicate with both developer and non-technical audiences. You will be comfortable working with many teams to lead incident management, drawing on your superb troubleshooting and interpersonal skills.
We are extremely selective, but the chosen few are those who are energized by the exciting challenges associated with introducing a new, disruptive technology to customers seeking a managed cloud solution for their database technology needs. If you are highly energetic, entrepreneurial, technical, and driven to constantly learn new products and technologies, this is the opportunity for you.
Essential Job Functions:
- Build and operate customer database environments; participate in 24x7 on-call rotating schedules with other members of the world-wide operations team.
- Monitor and evaluate the health of live distributed systems in non-production and production environments, using industry-standard and proprietary tooling.
- Plan, schedule, and implement all OS, database, and cluster administration operations as needed (changes to topology, add nodes, remove nodes, configuration changes, rolling restarts, apply patches and upgrades, etc.).
- Execute all planned and unplanned maintenance activities (including upgrades, migrations, fixes, etc.).
- Proactively address cluster issues that result from cloud infrastructure issues outside of customer control (cloud platform / console issues).
- Investigate and respond to tickets regarding the health and security of the system; assign priority and coordinate with internal and customer teams to ensure SLA compliance.
- Build updates and patches as needed to fix system issues.
- Work with internal and external application and infrastructure teams to deploy updates to cloud systems; communicate changes, scheduled and unscheduled activities, updates, etc.
- Provide feedback to the Cloud development team on encountered issues to facilitate longer term improvements to the operational capabilities of the next generation system.
Qualifications:
- Expert troubleshooting skills with large software deployments and distributed systems.
- 6+ years of experience working in an operations-related, customer-facing role, handling critical, time-sensitive issues and engaging directly with customer technical resources.
- Deep understanding of the software development life cycle, change management, and zero-downtime release management.
- Experience with performance profiling and optimization, preferably in a distributed environment; able to debug and identify network issues.
- Strong Linux environment / OS performance and troubleshooting skills.
- Strong understanding of Java, Ruby, and/or another programming language.
- Strong automation skills using tools such as Ansible, Chef, Terraform, Jenkins, etc.
- Working knowledge of ELK and Graphite/Grafana.
- Comfortable reading, reviewing, and modifying others' code.
- Self-motivated with ability to multi-task and work under minimal supervision.
- Excellent written and verbal communication skills.
- Some travel may be required on an infrequent basis.
- BA / BS / MS in Computer Science, or equivalent.
- This role is located in Asia Pacific, preferably Sydney.
- Recent experience in a 24x7 production operations environment supporting a highly available SaaS or cloud provider solution.
- 3+ years of operational experience with Apache Cassandra or DataStax Enterprise in a developer or support role.
- Experience working in an environment focused on release automation (Jenkins, TFS, etc.) and containerization (Docker or Kubernetes).
- Solid experience in system administration and configuration management automation (Ansible, Puppet or Chef) and cloud provisioning tooling (Terraform, ARM, CloudFormation).
- Experience building, testing, deploying and operating highly scalable and resilient cloud-based infrastructure (AWS, GCP, Azure) in a medium or large enterprise.