SRE & Production Team Leader
- Ramat Gan
- Permanent position
- Full-time position
- Design, build, and manage our SRE framework to ensure observability, resilience, and high availability.
- Develop and automate solutions for proactive monitoring, incident response, and performance optimization.
- Improve and maintain our alerting and monitoring stack, leveraging tools like Datadog, Prometheus, and Grafana.
- Lead post-mortem analysis and implement continuous improvement initiatives.
- Collaborate with DevOps, Engineering, and Product teams to ensure smooth and efficient delivery of reliable services.
- SRE & Production Manager with 5+ years of experience in SRE, Production Engineering, or DevOps, including 2+ years in a leadership role.
- Experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
- A problem solver, capable of finding creative solutions and getting things done.
- Fluent in incident management, RCA processes, and operational best practices.
- Experience with AWS (EKS, EC2, RDS, S3, networking configurations).
- Experience in high-scale distributed systems.
- Background in security and compliance for cloud infrastructure.
- Understanding of cost optimization and resource management in cloud environments.
- Familiarity with machine learning or predictive analytics for proactive reliability management.
- Proficiency in Python, Go, or Bash for automation and scripting.
Mploy