Overview
The Lead Systems Engineer - Computing Technology co-designs, leads the implementation of, and provides Level 3 expert support for large-scale private cloud computing and/or HPC infrastructure, with a specific emphasis on computing technologies, including the hardware layer, operating system, hypervisor, and orchestration services.
Responsibilities
- Co-design, lead implementation, and manage hybrid virtualization and containerized platforms based on OpenStack, VMware VCF, and/or Red Hat OpenShift, ensuring platform stability, performance, and compliance with industry standards and best practices.
- Define and oversee the implementation of the roadmap for all virtualization and HPC platforms across the company.
- Collaborate with architecture and engineering teams on technology stack component evaluation and selection, ensuring solutions are designed following best practices and optimized from both functional and non-functional perspectives.
- Lead regular capacity planning exercises to anticipate and accommodate the growing demands on the virtualized environment and HPC infrastructure, ensuring it meets current and future requirements.
- Develop and oversee plans to enhance the reliability of the computing infrastructure, addressing potential points of failure and ensuring high availability of services.
- Lead regular performance assessments and implement improvements based on findings in collaboration with relevant teams.
- Define and oversee the execution of disaster recovery strategies, ensuring system integrity, availability, and protection across all platforms and environments.
- Design and enhance the observability stack in collaboration with the infrastructure operations team, ensuring monitoring coverage and accuracy.
- Provide L3 expert support, including on-call shifts, and act as the final tier of resolution for L2 support teams through problem analysis and communication with vendors' technical support.
- Lead the collaboration with architecture and engineering teams on technology stack component evaluation and selection, ensuring solutions adhere to best practices and are optimized for both functional and non-functional requirements.
- Lead the analysis and implementation of performance optimization strategies for the cloud computing and/or HPC environment to maximize efficiency and resource utilization.
- Lead and mentor a team of engineers and collaborate with other infrastructure engineering and systems architect teams on solution design and delivery.
- Collaborate with security management teams to ensure that systems are secured against cybersecurity threats.
- Write and maintain relevant documentation, ensuring completeness and quality.
- Work closely with process management and operational teams, contributing to process development, standardizing the collaboration framework, and improving collaboration efficiency.
- Participate in the hiring process by conducting technical interviews and contributing to the team's growth and expertise.
Qualifications
- Bachelor's or master's degree in computer science, engineering, software engineering, or a related technology field.
- 2+ years of experience leading a team of 3+ engineers, holding accountability for quality and timely delivery of infrastructure projects.
- 7+ years of experience and deep expertise in designing, implementing, and managing private cloud stacks with a focus on compute and virtualization technologies.
- Extensive hands-on experience with at least one of the following platforms/stacks: OpenStack, Apache CloudStack, VMware VCF, or Red Hat OpenShift, and with related computing technologies such as x86 hardware, OS, KVM/ESXi, and orchestration services.
- 7+ years of hands-on experience in Linux environments and 3+ years of experience in a senior systems or infrastructure engineering role.
- Profound understanding of hardware architecture and components (x86 and ARM, NUMA, memory types and channels, NIC types, etc.).
- Good understanding of network and storage types and architecture.
- Good understanding of Cloud Native concepts and technologies.
- Experience in managing large-scale public or private cloud environments and/or working in a cloud service provider environment is highly desirable.
- Advanced programming and scripting skills in Python and/or Golang, and Bash.
- Good knowledge of data center network design and related technologies (OSI model, TCP/IP stack, routing, VLAN/VXLAN, etc.).
- Understanding of storage types, architecture, and protocols, such as object, block, and file storage, NFS/SMB, iSCSI, FC, etc.
- Experience with integration of identity management, access management, and authorization solutions (PKI, LDAP, OAuth, OpenID).
- Hands-on experience with monitoring and observability tools such as Zabbix or Nagios, Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash, Kibana).
- Understanding of CI/CD principles, the Infrastructure as Code (IaC) approach, and software-defined infrastructure solutions.
- Experience with database management and optimization for both SQL and NoSQL databases such as MySQL, PostgreSQL, MongoDB, or Cassandra is highly desirable.
- Experience with ITSM tools such as Jira, Redmine, ServiceNow, etc.
- Relevant certifications in Linux, virtualization, and cloud computing are a plus.
- Knowledge of and experience working with GPU hardware and AI hardware accelerators is a plus.
- Strong organizational skills with the ability to multitask and prioritize.
- A proactive approach to problem-solving and decision-making.