IT Systems and Infrastructure Manager

Salary undisclosed

Job Title: IT Systems and Infrastructure Manager

Location: Pasay

Job Type: Full time

Job Summary:

We are seeking a highly skilled and proactive IT Systems and Infrastructure Manager to oversee systems and event monitoring, incident management, and infrastructure reliability. The role requires continuous system surveillance, proactive issue resolution, and coordination of critical incident responses to minimize downtime. This position involves working outside normal business hours, including nights, weekends, and holidays, to ensure 24/7 operational stability.

Key Responsibilities:

1. Systems and Events Monitoring:

  • Oversee real-time monitoring of IT infrastructure, servers, networks, and applications.
  • Implement and maintain event management systems to detect and escalate performance anomalies.
  • Ensure proactive alerts and notifications are configured for early issue detection.
  • Work with relevant teams to fine-tune monitoring thresholds and automation rules.

2. Incident Management & Response:

  • Develop and enforce incident response procedures to minimize downtime and business impact.
  • Act as the primary escalation point for critical incidents, coordinating resolution efforts across teams.
  • Lead post-incident root cause analysis (RCA) and drive permanent resolutions.
  • Maintain detailed documentation of incidents, impact analysis, and corrective actions.

3. IT Infrastructure Operations & Maintenance:

  • Ensure high availability and performance of servers, databases, storage, and cloud infrastructure.
  • Collaborate with cross-functional teams on system upgrades, patches, and improvements.
  • Work with vendors and service providers to resolve infrastructure issues efficiently.
  • Develop disaster recovery (DR) plans and ensure regular testing.

4. Shift-Based Operations & Availability:

  • Work beyond standard business hours, including overnight shifts, weekends, and on-call duties.
  • Ensure continuous monitoring and support coverage for mission-critical systems.
  • Provide leadership and direction to the IT support team during off-hours.

5. Documentation & Reporting:

  • Maintain accurate records of system performance, incidents, and resolutions.
  • Generate and present periodic reports on system health, incident trends, and improvement initiatives.
  • Update and maintain IT policies, procedures, and best practices.

Qualifications & Experience:

  • Bachelor’s degree in IT, Computer Science, or a related field.
  • 5+ years of experience in IT infrastructure, systems monitoring, and incident management.
  • Hands-on experience with monitoring tools (e.g., Nagios, Splunk, SolarWinds, Prometheus, New Relic).
  • Strong understanding of ITIL principles and best practices for incident and problem management.
  • Experience with Windows/Linux servers, cloud platforms (AWS, Azure, GCP), and networking fundamentals.
  • Familiarity with automation and scripting (PowerShell, Python, Bash) is a plus.
  • Excellent troubleshooting skills and ability to remain calm under pressure.
  • Willingness to work in a shift-based/on-call environment to support 24/7 operations.
