About the Role
You’ll join a team of SREs who work closely with other teams of world-class engineers to tenaciously and creatively solve problems and reduce manual toil wherever possible. We expect AI and automation to be a force multiplier in everything you do — from accelerating root-cause analysis and enriching alerts, to generating runbooks and codifying remediation so that the platform increasingly heals itself.
Successful candidates are heavily results-driven, bring well-established expertise across both traditional and bleeding-edge technology, and have a strong desire to continuously grow and improve themselves and our platform. This is a global operation spanning multiple regions and time zones, and the role demands the flexibility and commitment that a 24/7 payment platform requires.
This position participates in an engineering on-call rotation and provides after-hours support for production issue escalations on a rotational basis.
This position is based in the Philadelphia area with a hybrid schedule. Remote arrangements may be considered for exceptional candidates, with occasional travel to Philadelphia required.
Primary Responsibilities
- Build and maintain a comprehensive understanding of the platform and custom application stack.
- Implement, maintain, and continuously improve observability strategies and metrics that ensure complete system health for numerous complex products throughout all stages of the development lifecycle, up to and including production.
- Continuously identify automation opportunities and follow through to successful implementation, applying AI-assisted tooling to accelerate development and reduce manual effort.
- Design, build, and maintain automated remediation and self-healing workflows that detect, triage, and resolve common failure modes with minimal human intervention.
- Leverage AI/ML-driven observability — anomaly detection, alert correlation, and intelligent noise reduction — to surface issues earlier and shorten time to detection.
- Use AI-assisted analysis to accelerate root-cause investigation, enrich incident context, and generate first-draft postmortems and runbooks for human review.
- Handle escalations and collaborate effectively with other team members to quickly determine the root cause of any type of service degradation.
- Implement, maintain, and continuously improve incident response procedures and other operational documentation, automating documentation generation and upkeep wherever practical.
- Assist with troubleshooting and remediation of failed scheduled jobs and data-related concerns.
- Champion responsible, secure adoption of AI tooling across the SRE function by sharing patterns, prompts, and automations that raise the productivity of the whole team.
AI Enablement & Automation
- Apply AI-assisted development and operations tools to write, review, and accelerate automation and infrastructure code, including Anthropic (Claude), OpenAI (Codex), Azure AI services (Foundry, Azure SRE Agent), and the agentic workflows built on them.
- Build and integrate automation that turns repetitive operational work into codified, repeatable, and self-service workflows.
- Use AIOps and ML-driven observability capabilities within the APM stack for anomaly detection, predictive alerting, and alert correlation.
- Develop and refine prompts, agents, and integrations that connect monitoring, ticketing, and remediation systems into faster end-to-end response loops.
- Evaluate emerging AI tooling for reliability and operations use cases, and advocate for adoption where it delivers measurable improvements in toil reduction, MTTR, or availability.
- Ensure all AI and automation usage adheres to FreedomPay’s security, privacy, and PCI obligations — keeping sensitive data appropriately protected and human review in place for high-impact actions.
Required Background and Experience
- BS degree in Computer Science or an equivalent field, or comparable years of relevant experience.
- Minimum of 5 years of hands-on technical experience in highly available, high-throughput, web-based technology environments.
- Demonstrated history of self-directed learning — someone who independently seeks out knowledge, builds new skills without being told to, and doesn’t wait for formal training to close gaps.
- Next-level problem-solving abilities and a strong bias toward practical, proven solutions.
- A track record of identifying and eliminating manual toil through automation.
- Excellent communication and organizational skills, with a strong sense of ownership and service.
Required Technical Skills
- Expert-level proficiency in an enterprise APM platform and its AI/ML-driven (AIOps) capabilities; Dynatrace experience is strongly preferred, though deep expertise in comparable tools such as Datadog or New Relic is readily transferable.
- Hands-on experience with AI-assisted development and automation tools — such as Anthropic (Claude), OpenAI (Codex), and Azure AI services (Foundry, Azure SRE Agent) — and a demonstrated ability to apply them to real operational and engineering work.
- Proficiency in scripting and automation — PowerShell and/or Python — to build tooling and remediation workflows.
- Strong SQL / T-SQL skills.
- Solid understanding of core networking concepts: DNS, HTTP/HTTPS, load balancing, and TCP/IP routing and switching.
- Working knowledge of modern technology infrastructure including container orchestration, IaaS/PaaS cloud services, Azure, and VMware.
- Working knowledge of application development processes.
Preferred Technical Skills and Experience
- Proven track record of successfully implementing SLIs and SLOs and fostering their adoption across an organization.
- Experience implementing enterprise incident management practices.
- Experience building AIOps or ML-driven automation into production observability and incident response.
- Azure Kubernetes Service (AKS) and broader container orchestration experience.
- Windows Server (IIS) administration.
- PagerDuty Process Automation (formerly Rundeck) or comparable runbook automation platforms.
- Comprehensive experience supporting real-time transaction processing applications.
- PCI policies and best practices.
Additional Experience, a Plus
- AI/ML model deployment, evaluation, or operations (MLOps).
- Documentation automation and self-service tooling / service catalog implementation.
- Experience integrating QA test automation into CI/CD pipelines.