Platform Reliability Operations
This is a critical role with a wide range of responsibilities, including:
● Analyze and improve system design to reduce failure modes and promote self-healing systems.
● Establish and maintain robust systems that facilitate observability, encompassing logging, monitoring, distributed tracing, alerting, and offline test tools.
● Work with development partners to shape the architecture, design, and implementation of new and existing systems to enhance their reliability, performance, efficiency, and scalability.
● Work both independently and as part of a geographically dispersed yet integrated team.
● Collaborate with service engineers to establish Service Level Agreements (SLAs) and Service Level Objectives (SLOs) for backend services.
● Identify the indications or cues that demonstrate the effectiveness of an application, and know how to improve or repair its performance.
● Assess options and suggest solutions when information is limited or unclear. This position requires a level of comfort and assurance in dealing with uncertain situations.
● Work seamlessly within a team as well as manage individual tasks.
● Respond to emerging incidents, solve critical issues, and follow through with a plan for resolution or future mitigation.
● Act as an SME on the Engineering Operations team, partnering with backend services teams and application teams to overcome challenges across all the platforms where we stream our service.
Qualities / Experience We’re Seeking
We believe the right individual will have the following skills and experience to be successful in the role:
● 5+ years of experience in software development.
● Degree in Computer Science or a related field, or equivalent work experience.
● Solid engineering and coding skills, data structure knowledge, and the ability to write high-performance, production-quality code.
● Experience building service-oriented APIs and cloud services (preferably on AWS).
● Experience designing, implementing, and deploying microservices.
● Extremely technical, hands-on server software experience.
● Proficient in Golang and JavaScript, and quick to learn new languages.
● Experience in the Linux environment and a good understanding of its fundamentals and internals: filesystems, modern memory management, threads and processes, the user/kernel-space divide, etc.
● A good understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring, and storage systems.
● Working knowledge of the TCP/IP stack, internet routing, and load balancing.
● Grit, drive, and a deep feeling of ownership.
Bonus Points for Experience with the following:
● Golang
● TypeScript
● Kubernetes
● Terraform
● OpenTelemetry
● Istio
● Datadog
● Helm Charts
● HLS video transcoding, distribution & playback
● Experience designing, implementing, and running services in high-demand, high-traffic environments
● Experience with high-availability services