Project Description:
- As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability, scalability, and performance of our systems. You will collaborate with cross-functional teams to design, build, and maintain scalable infrastructure, automate operational processes, and respond to incidents swiftly. The ideal candidate is passionate about automation, has a deep understanding of system architecture, and is dedicated to delivering high-quality, reliable services.
Responsibilities:
• System and Service Reliability: Ensure overall reliability and performance, Monitor system health, Perform root cause analysis
• Incident/Ticket Management: Manage alerts, Respond to incidents, Triage, investigate and mitigate incidents, Coordinate with cross-functional teams
• Automation and Tooling: Drive automation and process improvements, Develop automation tools, scripts and infrastructure, Identify and automate repetitive tasks to reduce manual work
• Capacity Planning and Scalability: Collaborate with development and infrastructure teams, Conduct capacity planning exercises, Forecast resource requirements, Optimize system scalability to handle increased workloads
• Performance and Optimization: Monitor and analyze performance metrics, Identify bottlenecks and recommend optimizations, Collaborate to optimize application code, database queries and system configurations
• Reliability Engineering Practices: Advocate and implement reliability engineering practices, Run error budgeting and reviews, Conduct blameless postmortems
• Continuous Improvement: Analyze incident trends and monitor system metrics, Gather feedback from DevOps, app developers and customers, Identify areas of improvement and collaborate with development teams
• Collaboration and Communication: Foster collaboration with development, operations and cross-functional teams, Act as a bridge between different teams, Share knowledge and promote effective communication, Create and contribute to documentation and share best practices
Mandatory Skills Description:
• 5-7 years of experience
• Bachelor's or higher degree in Computer Science, Information Technology, or related field.
• Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
• Strong problem-solving skills and the ability to troubleshoot complex issues.
• Excellent communication and collaboration skills.
• Strong programming/scripting skills (e.g., Python, Go, Shell) for automation and tooling.
• Ability to troubleshoot and make simple code fixes (.NET)
• Proficient in cloud computing platforms (e.g., AWS, GCP; Azure preferred).
• Familiarity with CI/CD pipelines and version control systems (e.g., Jenkins, Git, GitHub Actions).
• Proficient in setting up and integrating with monitoring tools (e.g., Dynatrace, Moogsoft, Azure Monitor)
• Service Management (Incident, Change, Problem, Alert Management)
• Schedule: Amenable to a US-hours shift, following the client's business hours (Central Time)
Nice-to-Have Skills Description:
• Knowledge of containerization and orchestration tools (e.g., Docker, Kubernetes) is a plus.
• Knowledge of Linux/Unix systems and networking is a plus.
• Knowledge of configuration management tools (e.g., Ansible, Puppet, Chef) is a plus.
Languages:
- English: Advanced