Aventus

Site Reliability Engineer

Job Description

We are seeking a skilled Data Site Reliability Engineer (SRE) who has experience with data platforms to join a dynamic international company.

The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. As an SRE, you will collaborate closely with development and operations teams to design, implement, and maintain robust infrastructure solutions. You will also be involved in monitoring, troubleshooting, and optimizing our systems to meet the demands of our rapidly growing business.

The role will sit in the Cloud Engineering team, where you will develop and maintain cloud-native technology:

  • Highly scalable Kubernetes clusters
  • Cloud access management automation and integration with k8s

As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage.

As an SRE, you'll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.

The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers.

Responsibilities:

  1. Design, implement, and maintain scalable and reliable data infrastructure solutions for storing, processing, and analyzing large volumes of data.
  2. Collaborate with data engineering and data science teams to define and implement operational requirements for data pipelines, ETL processes, and analytical workflows.
  3. Automate deployment, configuration, and monitoring of data systems and services to ensure efficient and reliable operation.
  4. Develop and maintain monitoring and alerting systems to proactively identify and address issues with data availability, quality, and performance.
  5. Troubleshoot and resolve data-related issues in a timely manner, minimizing impact on downstream applications and users.
  6. Implement data governance and security best practices to ensure the confidentiality, integrity, and availability of our data assets.
  7. Perform capacity planning and performance tuning to optimize the performance and cost-effectiveness of our data infrastructure.
  8. Participate in on-call rotations and respond to data-related incidents outside of regular business hours when necessary.
  9. Evaluate and adopt new data technologies and tools to improve the efficiency, reliability, and scalability of our data infrastructure.
  10. Document system designs, configurations, and operational procedures to facilitate knowledge sharing and collaboration.

Qualifications

  • Strong sense of ownership and integrity, demonstrated through clear communication and collaboration
  • Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale
  • Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software, in a large-scale environment
  • The ability to design, author, and release code in languages like Go, Python, or Java
  • Acute drive to automate manual operations and to improve them through repeated iteration
  • Understanding of the Linux operating system and of standard networking protocols and components
  • Experience with cloud-native services on AWS/GCP
  • Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Terraform, CloudFormation, ArgoCD, and Flux)
  • Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks
  • Excellent troubleshooting and problem-solving skills
  • Experience with scale testing, disaster recovery, and capacity planning
  • Effective communication and collaboration skills, with the ability to drive and promote technical partnerships across teams
  • Incident response and/or incident management experience

This is a long-term remote contract role; candidates in the surrounding regions are preferred due to the time zone.

If the above matches your skill set, please apply.

More Info

Industry: Other

Function: Technology

Job Type: Permanent Job

Date Posted: 28/05/2024

Job ID: 80115663

Last Updated: 23-11-2024 05:58:21 PM