Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response;
Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), and partnering with performance and operations teams to meet or exceed targets;
Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness;
Collaborating with product, infrastructure, architecture, and development teams to redesign services with built-in redundancy, failover, and graceful degradation;
Driving automation and observability improvements to reduce noise, speed up fault detection, and support predictive failure mitigation;
Contributing to the design and maintenance of our Business Continuity and Disaster Recovery (BCDR) Plan, ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions;
Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide the evolution of stability practices;
Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs;
Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform;
Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance;
Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders.
Educated to BSc degree level in Software Engineering, Computer Science, or a related discipline;
Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation;
Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies;
Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness;
Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation;
Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery;
Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy;
Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs;
Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions;
Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making;
Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action;
Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams;
Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments;
Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner;
Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle;
Fluency in English.