Cloud Platform Engineer

Role Description:
• Design and implement cloud-native database infrastructure using Terraform/Ansible to provision managed DB instances across multiple clouds (RDS/Azure DB/Cloud SQL) and self-managed clusters.
• Automate configuration management, security hardening, and patching of database instances across all environments. Automate workflows to reduce manual effort and improve reliability.
• Develop internal tools and scripts (Python/Bash) to enable production support teams to manage their own database instances and environments safely. Develop scripts for routine operational tasks like backups, health checks, etc.
• Integrate advanced observability platforms (Dynatrace, CloudWatch) with AIOps tools to establish SLOs and train models for anomaly detection and proactive forecasting of database degradation (for example, predicting slow queries or imminent connection pool exhaustion).
• Design, deploy, and govern AI-powered agents (using Azure Copilot/AWS Bedrock) to achieve autonomous self-healing capabilities and automated resource management.
• Implement advanced monitoring (CloudWatch, Dynatrace) for key database metrics (SLIs/SLOs) such as latency, throughput, error rates, and connection pools. Develop and train predictive ML models to analyze historical telemetry and forecast potential system outages or performance bottlenecks. Configure proactive monitoring and alerting for critical services.
• Respond to alerts and create self-healing actions based on them.
• Design and implement cross-region/multi-AZ replication, automated failover strategies, and point-in-time recovery (PITR) procedures for mission-critical databases. Conduct disaster recovery planning and DR drills.
• Execute backup strategies and validate recovery procedures using Rubrik, and perform restores as needed.
• Work closely with application operations/production support teams to troubleshoot issues on the database layer (performance, locks, schema) and the platform layer (multi-cloud/middleware/network, resource limits) to find root causes.
• Lead incident response and root cause analysis (RCA) for database outages, performance degradations, and data integrity issues. Collaborate with DBAs and application teams for root cause analysis.
• Implement AI tools to perform real-time root cause analysis (RCA), correlate complex event data (logs, metrics), and auto-generate runbooks.
• Define and automate scaling strategies (read replicas, sharding, auto-scaling) based on predicted load and business growth. Provide input for capacity planning and resource optimization.
• Implement cost management policies, including rightsizing instances, managing storage tiers, and defining lifecycle rules for backups and snapshots.
• Proactively analyze query performance, index usage, and database configuration, making and automating changes to optimize throughput and reduce latency. Support DBA teams in performance tuning initiatives.
• Implement robust secrets management solutions (AWS Secrets Manager, HashiCorp Vault) for database credentials, ensuring applications retrieve secrets securely at runtime.
• Ensure database environments meet regulatory requirements (PCI, HIPAA, GDPR) through encryption-at-rest and in-transit, audit logging, and automated compliance checks.
• Define and enforce least-privilege access policies (IAM roles, service accounts) for databases.
• Implement encryption and data masking policies as directed.
• Manage security and compliance by utilizing AI agents to detect configuration drift and auto-generate compliant updates for IAM, network, and security policies.
• Apply patches and perform upgrades in coordination with DBA teams.
• Validate post-upgrade functionality and compliance.
• 8+ years of experience in Oracle/DB2/MSSQL/Snowflake/PostgreSQL/MySQL administration, with a strong focus on AIOps integration.
• 5+ years of experience in public cloud operations (AWS, Azure, GCP).
• Deep, demonstrable expertise designing and operationalizing solutions leveraging AWS Bedrock/Agent Frameworks and Azure Copilot for DB Operations.
• Expertise in Infrastructure as Code (Terraform, CloudFormation), Ansible, and CI/CD pipelines, including supervising AI-generated infrastructure artifacts.
• Expertise integrating observability platforms into AI/ML platforms for predictive analysis and anomaly detection - Advanced (7+ Years)
• Hands-on experience with Informatica PowerCenter/PowerBI/Cognos/Sapiens/Alteryx/IDMC/ILM/SAS/BusinessObjects/Glue/SPSS/ODI is a plus - Advanced (7+ Years)
• Proficiency in scripting languages (Python, Bash) - Advanced (7+ Years)
Competitive compensation and benefits package:
Note: Benefits differ based on employee level.
About Capgemini
Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization of over 340,000 team members in more than 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported €22.5 billion in revenues in 2023.
https://www.capgemini.com/us-en/about-us/who-we-are/