About the job

About the Role:

We are seeking a highly skilled Platform and Site Reliability Engineer (SRE) with a strong background in software engineering and cloud infrastructure. In this role, you will play a critical part in building, scaling, and securing our Cloud offering. You will focus on optimizing system architecture to balance cost, performance, and security while ensuring the platform remains highly available and resilient.

If you have a passion for infrastructure-as-code (IaC), deep observability, and scaling distributed systems to handle massive workloads, we want to hear from you.

What You Will Do:

Cloud Infrastructure & Architecture: Help build, scale, and optimize our core Cloud offering. Continually improve system architecture to reduce operational costs while maintaining world-class security and performance standards.

Observability & Metrics: Design, implement, and track metrics for platform uptime. Deepen visibility into our distributed environments by capturing, structuring, and analyzing relevant metrics and logs.

Security & Compliance: Implement and maintain robust intrusion detection, automated remediation, and proactive patch management systems. Actively contribute to and support our SOC 2 and GDPR compliance initiatives.

CI/CD & Release Management: Design and optimize CI/CD systems to accelerate deployment velocity, embedding proper change and release management processes directly into the engineering lifecycle.

Reliability & On-Call: Participate in monitoring, alerting, and incident response rotations to maintain strict platform availability, performance, and disaster recovery readiness.

Requirements & Qualifications:

Education: Bachelor of Engineering in Computer Science (or equivalent academic background).

Experience: 7+ years of professional experience designing, operating, and scaling highly available, distributed cloud systems.

Scale Mindset: Proven track record working with large-scale infrastructure (ideally environments supporting millions of managed databases or highly distributed, high-volume workloads).

Cost & Performance Optimization: Demonstrated success in optimizing cloud architectures to significantly reduce infrastructure overhead without sacrificing resiliency.

We are looking for a candidate with deep, practical expertise in the following areas and technologies:

Core Competencies: Site Reliability Engineering (SRE), Infrastructure as Code (IaC), Incident Management.

Cloud & Orchestration: AWS, Kubernetes, Linux.

Automation & IaC Tools: Pulumi, Terraform, Ansible.

Databases: PostgreSQL (experience operating high-volume managed databases is a strong plus).

Observability & Telemetry: Prometheus, Grafana.

Programming Languages: JavaScript (Node.js), Go.

Cloud Platform Engineer and SRE
