
Site Reliability Engineer
- Tel Aviv
- Permanent position
- Full-time
- Develop and implement SRE capabilities to enhance the reliability, availability, and performance of Admin solutions.
- Design and maintain proactive monitoring and alerting systems for deep visibility into critical business flows, beyond simple statuses, to identify functional issues.
- Drive improvements in the Software Development Lifecycle (SDLC) for reliability and scalability from design to deployment.
- Collaborate with development and operations teams to troubleshoot production incidents affecting the purchase flow and to perform root cause analysis.
- Lead SRE initiatives to boost system resilience and operational efficiency.
- Implement best practices for incident management, conduct blameless post-mortems, and contribute to capacity planning and performance testing to ensure scalability.
- 5+ years of experience as a Site Reliability/DevOps Engineer
- Deep understanding of e-commerce flows, specifically back-office operations and order processing - must
- Experience as an Automation/Software Engineer with a strong understanding of software development principles and in building, testing, and deploying distributed systems - must
- Experience in designing, implementing, and using monitoring and observability platforms such as Datadog, New Relic, Prometheus/Grafana, or the ELK stack - must
- Proficiency in scripting and automation using languages such as Python or Java - must
- Ability to create dashboards, alerts, and insightful queries - must
- Experience with AWS services to build and operate scalable and resilient applications (e.g., EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch) - plus
- Experience in automating infrastructure provisioning, application deployments, and repetitive operational tasks - plus
- Proactive approach with excellent problem-solving skills
- Strong collaborator, with an ability to work with cross-functional teams
- Proficient in English