Responsibilities
1. Own the planning, technical architecture design, and R&D of the company's overall infrastructure stability platform, including but not limited to global disaster recovery architecture design, a contingency plan drill platform, a risk inspection system, a change risk control system, full-link monitoring management, and other stability-related products.
2. Build a management and computation framework for infrastructure component SLIs, SLOs, and SLAs, help business teams identify their core SLIs, and improve the efficiency of exception handling and collaboration.
3. Lead the design of stability assurance solutions and implement them in platform products; use those products to reduce collaboration costs, improve service efficiency, and achieve efficient, automated, and intelligent operations.
4. Keep following cutting-edge industry solutions and, in light of ByteDance's actual situation, explore directions for stability assurance products, put them into practice from a technical product perspective, and continuously improve infrastructure stability.
Qualifications
1. Bachelor's degree or above in a computer-related major, with five or more years of work experience in a related field.
2. Solid foundation in computer software; familiar with the Linux operating system; proficient in at least one of Go, Python, or Java.
3. Strong architecture design and software development experience; able to break work down into concrete, achievable goals and to guide team members in technology and product design.
4. Familiar with full-link observability products such as monitoring, alerting, logs, events, and traces, and has built inspection, alerting, diagnosis, contingency plan, self-healing, and other such systems from 0 to 1.
Candidates who meet any of the following conditions will be given priority:
1. Familiar with common disaster recovery architectures and recovery solutions, such as same-city multi-active, multi-region multi-active, master-slave synchronization, and backup and recovery. Experience building a disaster recovery platform is a plus.
2. Has thought holistically about stability and SLA assurance systems and can use a range of technical and operational means to ensure and improve system stability. Hands-on operations and maintenance experience in a vertical business is a plus.
3. Understands traffic access and distribution systems, is familiar with stress testing and rate limiting mechanisms, has a deep understanding of and practical experience with global Internet traffic scheduling, and is familiar with IDC network architecture.