Site Reliability Engineer

Aventus

6 months ago

Exp: 0-2 Years

United Arab Emirates, Dubai

Job Description

We are seeking a skilled Data Site Reliability Engineer (SRE) who has experience with data platforms to join a dynamic international company.

The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. As an SRE, you will collaborate closely with development and operations teams to design, implement, and maintain robust infrastructure solutions. You will also be involved in monitoring, troubleshooting, and optimizing our systems to meet the demands of our rapidly growing business. The role will sit in the Cloud Engineering team, where you will develop and maintain cloud-native technology:

Highly scalable Kubernetes clusters

Cloud access management automation and integration with Kubernetes

As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage.

As an SRE, you'll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.

The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers.

Responsibilities:

Design, implement, and maintain scalable and reliable data infrastructure solutions for storing, processing, and analyzing large volumes of data.
Collaborate with data engineering and data science teams to define and implement operational requirements for data pipelines, ETL processes, and analytical workflows.
Automate deployment, configuration, and monitoring of data systems and services to ensure efficient and reliable operation.
Develop and maintain monitoring and alerting systems to proactively identify and address issues with data availability, quality, and performance.
Troubleshoot and resolve data-related issues in a timely manner, minimizing impact on downstream applications and users.
Implement data governance and security best practices to ensure the confidentiality, integrity, and availability of our data assets.
Perform capacity planning and performance tuning to optimize the performance and cost-effectiveness of our data infrastructure.
Participate in on-call rotations and respond to data-related incidents outside of regular business hours when necessary.
Evaluate and adopt new data technologies and tools to improve the efficiency, reliability, and scalability of our data infrastructure.
Document system designs, configurations, and operational procedures to facilitate knowledge sharing and collaboration.

Qualifications

Strong sense of ownership and integrity demonstrated through clear communication and collaboration.

Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale.

Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software in a large-scale environment.

The ability to design, author, and release code in languages such as Go, Python, or Java.

Acute drive to automate manual operations and to improve them through repeated iteration.

Understanding of the Linux Operating System, standard networking protocols, and