Site Reliability Engineer
MWDN
- Israel
- Permanent position
- Full-time
Cyberint
Why does MWDN rock?
MWDN connects exceptional tech talent with leading companies across Israel, the USA, Great Britain, and Western Europe. We aim to ensure our employees enjoy a rewarding and secure experience while collaborating with prestigious international clients.
Here’s what you can expect as an MWDN employee:
- Security first. We vet our clients to eliminate risks, ensuring reliability and timely payment for your hard work, with no fraud or unforeseen events.
- Career support. If a match isn't right, we're here for you. We actively assist our employees in finding new opportunities that fit their skills and aspirations.
- Legal assistance. We provide guidance on legal matters (e.g., opening and administering your private entrepreneur account, taxes, army enrollment, etc.).
- Professional development. We offer English courses and other engaging activities, including team-building events.
- Domain: Computer and network security
- Location: Israel
- Company size: 51-200 employees
- Founded in: 2009
Requirements:
- 3+ years of experience in an SRE, DevOps, or similar role in a SaaS/cloud-native environment.
- Strong experience with Kubernetes and cloud-based distributed systems.
- Hands-on experience building or maintaining monitoring stacks such as Prometheus, Grafana, ELK, etc.
- Proficiency in Python, Bash, or similar scripting languages.
- Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, ArgoCD).
- Solid analytical and problem-solving skills with a passion for operational excellence.
- Exposure to AI-based tooling (e.g., OpenAI API, LLM-based bots) to automate operations or enhance incident response processes.
- Upper-intermediate English level.
Nice to have:
- Proficiency with AWS (EC2, S3, Lambda, Streaming, EMR, EKS).
- Experience with Infrastructure as Code tools (Terraform, Helm, etc.).
- Experience with incident management platforms (e.g., PagerDuty).
- Security-first mindset and experience in the cybersecurity industry.
- Experience with service mesh, zero-downtime deployments, or chaos engineering.
- Contributions to AI-assisted SRE initiatives or platform operations & monitoring automation.
Responsibilities:
- Design, implement, and maintain monitoring and alerting systems (e.g., Prometheus, Grafana) to detect and prevent reliability issues.
- Develop tools and automation (Python, Bash, etc.) for improving infrastructure reliability and operational efficiency.
- Collaborate with R&D and Product teams to embed reliability-first principles into every stage of the development process.
- Participate in and improve incident response processes, including running blameless postmortems and implementing preventive measures.
- Enhance our Infrastructure-as-Code (IaC) and CI/CD practices to streamline deployments and reduce risk.
- Maintain and extend internal AI-driven tools, such as bots that support SRE workflows (on-call management, triaging, etc.).
- Document infrastructure, playbooks, and operational procedures to facilitate onboarding and knowledge sharing.
What we offer:
- People-oriented management without bureaucracy
- A friendly atmosphere, confirmed by how often former employees return
- Flexible working schedule
- 29 paid days off per year (18 working days of vacation plus 11 national holidays)
- 10 sick leave days
- Full financial and legal support for private entrepreneurs
- Free English classes with native speakers or Ukrainian teachers (your choice)
- Dedicated HR