JOB DETAILS

Operations Manager, GPU

Company: Singtel Group
Location: Singapore
Work Mode: On Site
Posted: June 14, 2026
About The Company
Singtel Group is Asia's leading communications technology group, providing a portfolio of services spanning next-generation communication, digital services and digital infrastructure, including regional data centre arm Nxera and regional IT services arm NCS. The Group has a presence in Asia, Australia and Africa and reaches over 820 million mobile customers in 20 countries. Singtel is dedicated to continuous innovation, harnessing next-generation technologies to create new and exciting customer experiences as we shape a more sustainable, digital future.
About the Role

About Singtel Digital InfraCo – RE:AI

Singtel Digital InfraCo’s RE:AI division is building Asia’s most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions, and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.

Be a Part of Something BIG!

The Operations Manager, GPU Operations is responsible for leading the day-to-day operations of Singtel’s GPU-as-a-Service (GPUaaS) platform. This role ensures high levels of system availability, performance, security, and reliability across GPU infrastructure and supporting data centre operations.

The role serves as the primary operational interface with GPU infrastructure engineering teams, collaborating on platform upgrades, observability, security enhancements, and continuous operational improvements.

Make an Impact by 

  • Acting as the overall coordinator and primary point of contact for end-to-end GPUaaS operations, including data centre operations and operational reporting.
  • Leading daily GPUaaS and data centre operations covering hardware, environmental controls, networking, security, and supporting software platforms.
  • Managing operations teams, vendors, and consultants during both normal operations and emergency situations.
  • Coordinating with internal teams and external partners to implement GPUaaS enhancements and data centre initiatives.
  • Implementing, validating, and continuously improving operational plans to ensure platform stability across GPU hardware, software, and data centre infrastructure (e.g. power and cooling).
  • Leading incident response and resolution for GPUaaS environments, including root cause analysis (RCA) and timely communication to customers and stakeholders.
  • Presenting operational status, risks, and improvement plans to senior management and relevant stakeholders.
  • Ensuring incidents are addressed or escalated in accordance with criticality, impact, and SLA/SLO requirements.
  • Building and leading a high-performing operations team, fostering collaboration, innovation, and continuous improvement.
  • Setting clear goals, mentoring team members, and supporting professional development.
  • Leading security incident management and enforcing security and compliance best practices within the GPUaaS environment.
  • Monitoring industry security trends and implementing measures to protect customer data and platform integrity.
  • Participating in scheduled or on-call support outside standard working hours as required.

Skills for Success

  • Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
  • Minimum of 8 years’ experience in data centre operations and management, including at least 3 years in a leadership or managerial role.
  • Strong knowledge of data centre infrastructure, including servers, networking, storage, physical security, and cybersecurity.
  • Experience with electrical and mechanical systems, maintenance, and facilities operations.
  • Proven people leadership and vendor management capabilities.
  • Strong organisational skills and adaptability to changing operational requirements.
  • Effective interpersonal, communication, and presentation skills.
  • Experience managing customer interactions and driving service quality improvements.

Desirable qualifications 

  • Experience in Linux and hypervisor administration for GPU infrastructure and GPUaaS.
  • Strong skills in complex technical problem-solving, with a proactive approach to system operation and optimization.
  • Knowledge of storage technologies and experience in capacity planning, troubleshooting, and data protection.
  • Experience in GPU and GPU infrastructure management, including configuration, monitoring, and performance
  • Experience with liquid cooling systems specific to GPU infrastructure operation and monitoring.
  • Understanding of GPU cluster architectures and operations, including GPU-based systems, collective communications (e.g. NCCL, RDMA), AI/HPC networking (e.g. InfiniBand), and containerized or orchestrated environments supporting AI and HPC workloads.

Rewards that Go Beyond 

  • Flexible work arrangements
  • Full suite of health and wellness benefits 
  • Ongoing training and development programs 
  • Internal mobility opportunities

 Your Career Growth Starts Here. Apply Now!

Key Skills
Data Centre Operations, GPU Infrastructure Management, People Leadership, Vendor Management, Cybersecurity, Incident Response, Linux Administration, Hypervisor Administration, Capacity Planning, Liquid Cooling Systems, GPU Cluster Architectures, InfiniBand, RDMA, NCCL, Containerization, Orchestration
Categories
Technology, Management & Leadership, Engineering, Data & Analytics, Security & Safety
Benefits
Flexible Work Arrangements, Health and Wellness Benefits, Ongoing Training and Development Programs, Internal Mobility Opportunities
Job Information
Core Responsibilities
Lead the day-to-day operations of the GPU-as-a-Service platform, ensuring high system availability, security, and reliability. Coordinate end-to-end data centre operations, manage vendor relationships, and lead incident response and root cause analysis.
Job Type
Full-time
Experience Level
10+
Company Size
8337
Visa Sponsorship
No
Language
English
Working Hours
40 hours