VP of Site Reliability

Company: Titan Ai

Job Type: Full Time

United States

Job Description - VP of Site Reliability

About Titan

Titan builds AI software for banks: purpose-built small language models, a banking ontology, and AI bankers that financial institutions can trust. Our models outperform general-purpose LLMs by 30 to 80 percent on banking tasks. Customers include community banks, credit unions, and large regional and super-regional institutions. We are backed by leading fintech investors and operate under the compliance, audit, and model-risk standards that banking requires.

Why This Role Exists

Titan is scaling from a handful of live banking customers to thirty, then to hundreds. Each bank deploys differently: Azure, private cloud, or the bank's existing infrastructure. The core problem this role solves is making the platform work consistently and reliably across all of them, managing the last-mile deployment complexity that grows with every new customer.

This is a hands-on, principal-level role. You are not coming in to build an org chart. You are coming in to do the work: write the runbooks, stand up the on-call rotation, own incident command when a bank has an outage, and build the deployment playbook that takes us from client 10 to client 350. The practices get built before the teams do.

What You Own

Site Reliability Engineering. You build the SRE practice and operate it yourself first: SLO framework, on-call rotation, and incident command process. You write the SLOs, run the rotation, lead incident response at live bank customers, and produce the postmortems. Once the practice is stable and documented, you bring in an SRE Lead to own it and grow the function.

Production Support. Before the first support hire, you define severity tiers, SLA commitments per customer tier, and escalation paths, and you route alerts into a real queue. You are the accountable technical owner when a bank has a production incident. Once the structure works, you hire into it cost-efficiently and hand off the day-to-day to a Support Lead as customer volume justifies it.

Engineering Operations. You set the operating system across all four engineering lanes: sprint discipline, release rituals, code review standards, change management evidence, and the metrics the CEO and board read monthly. You own the SOC 2 artifacts, model risk review documentation, and the change traceability that bank examiners scrutinize.

What You Will Not Own

• Technical direction and architecture. Owned by the CTO and Chief Architect.

• Quality engineering. QE is being built as a separate function with dedicated QE engineers and a QE Lead hired independently.

• People management of the AI Toolbelt, Product Engineering, and Banking Models lanes. Lane leads manage their own teams. You influence through process, not reporting lines.

Who You Are

Ten or more years in engineering, with at least five years personally building SRE or platform operations functions at a software company selling into enterprise or regulated markets. You have not spent your career in internal bank IT. You come from companies that ship software to customers and operate it at scale: ServiceNow, MongoDB, AWS, GCP, or comparable. You have managed multi-tenant and multi-deployment-model infrastructure and know the last-mile complexity that comes with it.

You have written SLOs that people actually use. You have stood up an on-call rotation from nothing. You have been the technical owner during a production incident and know what it costs to not have a process. You earn trust from senior engineers without leaning on title. You see process as leverage, not overhead. You are not here to manage. You are here to build.

What Success Looks Like

In your first 90 days: a diagnostic of engineering operations shared with the CEO and CTO, written SLOs on customer-facing services with a clear baseline of where performance stands, and the support triage structure defined before the first support hire. In your first six months: the on-call rotation is running, incident command has been tested in production, and the deployment playbook covers the three deployment models we operate. At one year: the platform runs reliably from client 10 to client 30 and the foundation is in place to reach 100. The operating system runs without you prompting it.

Compensation and Structure

• Competitive base and meaningful equity.

• Atlanta, GA strongly preferred. West Coast considered on a case-by-case basis. Remote-friendly for the right candidate.

• Reports to the CEO. Peer to the CTO and the Chief Customer and Growth Officer.

Original job VP of Site Reliability posted on GrabJobs.
