We are seeking a highly skilled Platform and Site Reliability Engineer (SRE) with a strong background in software engineering and cloud infrastructure. In this role, you will play a critical part in building, scaling, and securing our Cloud offering. You will focus on optimizing system architecture to balance cost, performance, and security while ensuring the platform remains highly available and resilient.
If you have a passion for infrastructure-as-code (IaC), deep observability, and scaling distributed systems to handle massive workloads, we want to hear from you.
Cloud Infrastructure & Architecture: Help build, scale, and optimize our core Cloud offering. Continually improve system architecture to reduce operational costs while maintaining world-class security and performance standards.
Observability & Metrics: Design, implement, and track metrics for platform uptime. Deepen visibility into our distributed environments by capturing, structuring, and analyzing relevant metrics and logs.
Security & Compliance: Implement and maintain robust intrusion detection, automated remediation, and proactive patch management systems. Actively contribute to and support our SOC 2 and GDPR compliance initiatives.
CI/CD & Release Management: Design and optimize CI/CD systems to accelerate deployment velocity, embedding proper change and release management processes directly into the engineering lifecycle.
Reliability & On-Call: Participate in monitoring, alerting, and incident response rotations to maintain strict platform availability, performance, and disaster recovery readiness.
Requirements & Qualifications
Education: Bachelor of Engineering in Computer Science (or equivalent academic background).
Experience: 7+ years of professional experience designing, operating, and scaling highly available, distributed cloud systems.
Scale Mindset: Proven track record working with large-scale infrastructure (ideally environments supporting millions of managed databases or highly distributed, high-volume workloads).
Cost & Performance Optimization: Demonstrated success in optimizing cloud architectures to significantly reduce infrastructure overhead without sacrificing resiliency.
We are looking for a candidate with deep, practical expertise in the following areas and technologies:
Core Competencies: Site Reliability Engineering (SRE), Infrastructure as Code (IaC), Incident Management.
Cloud & Orchestration: AWS, Kubernetes, Linux.
Automation & IaC Tools: Pulumi, Terraform, Ansible.
Databases: PostgreSQL (experience handling high-volume managed databases is a huge plus).
Observability & Telemetry: Prometheus, Grafana.
Programming Languages: Node.js, Golang.