Sr. Manager, Site Reliability & Innovation, IT

Company: Citic

Job Type: Full Time

Hong Kong

Job Description - Sr. Manager, Site Reliability & Innovation, IT

Position Description
We are seeking a Senior Site Reliability Engineer who will be responsible for both build and shared services operations, including monitoring, site reliability engineering (SRE), and ensuring the stability, scalability, and performance of critical systems.
The ideal candidate is a strong technical problem-solver, capable of delivering end-to-end monitoring and reliability solutions and of diagnosing complex issues during critical incidents.

Key Areas of Responsibility

Own monitoring, Kubernetes platform reliability, and SRE operations to ensure highly reliable, available, and performant systems
Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, and Grafana
Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines
Administer and troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configuration, patching, and maintenance, while determining appropriate monitoring requirements for system changes
Analyze logs, investigate issues, and perform fault finding to identify performance exceptions
Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability
Operate, maintain, and optimize Kubernetes environments, including cluster health, workload reliability, capacity planning, and platform observability
Continuously research and adopt modern monitoring and SRE tools and practices

Requirements

Bachelor’s degree or higher in Computer Science / Engineering
Around 8-10 years of experience within IT, preferably in site reliability engineering, production support, platform engineering, or investment banking environments
Strong experience configuring and maintaining monitoring and observability platforms, including ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, Grafana, and Kibana
Experience with automation (e.g., Bash, Python, Ansible, CI/CD tools) is a must
Hands-on experience building and implementing Prometheus pipelines, including exporters, scraping configurations, relabelling, metric routing, and integrations with long-term storage (e.g., VictoriaMetrics)
Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch
Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems
Understanding of metrics, logging, alerting, dashboards, and observability pipelines
Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization
Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery / Business Continuity Planning activities
Experience managing GPU-enabled infrastructure for AI or machine learning platforms is preferred
Strong hands-on experience with Kubernetes, including cluster operations, workload orchestration, troubleshooting, scaling, and production support
Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems
Operations experience, including participation in on-call rotations with after-hours and weekend support
Self-motivated and adaptable, able to prioritize, learn continuously, and manage multiple responsibilities effectively
Excellent English; proficiency in Chinese is an advantage

Original job Sr. Manager, Site Reliability & Innovation, IT posted on GrabJobs.