We are seeking a skilled Data Site Reliability Engineer (SRE) who has experience with data platforms to join a dynamic international company.
The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. As an SRE, you will collaborate closely with development and operations teams to design, implement, and maintain robust infrastructure solutions. You will also be involved in monitoring, troubleshooting, and optimizing our systems to meet the demands of our rapidly growing business.
The role sits in the Cloud Engineering team, where you will develop and maintain cloud-native technology:
Highly scalable Kubernetes clusters
Cloud access management automation and integration with Kubernetes
As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage.
As an SRE, you'll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.
The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers.
Responsibilities:
Qualifications:
Strong sense of ownership and integrity demonstrated through clear communication and collaboration
Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale
Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software in a large-scale environment
The ability to design, author, and release code in languages like Go, Python, or Java
Acute drive to automate manual operations and to improve them through repeated iteration
Understanding of the Linux operating system, standard networking protocols, and components
Experience with cloud-native services on AWS/GCP
Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Terraform, CloudFormation, ArgoCD, and Flux)
Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks
Excellent troubleshooting and problem-solving skills
Experience with scale testing, disaster recovery, and capacity planning
Effective communication and collaboration skills, with the ability to drive and promote technical partnerships across teams
Incident response and/or incident management experience
This is a long-term remote contract role; candidates in the surrounding regions are preferred due to the time zone.
If the above matches your skill set, please apply.
Date Posted: 28/05/2024
Job ID: 80115663