Overview
Lead Software Engineer, Observability and Reliability
This role is for an experienced engineer who believes reliability is built through software, not heroics. You will operate at the intersection of systems engineering, platform development, and observability, helping ensure that large-scale, customer-facing services remain fast, resilient, and predictable as they grow. Your work will directly shape how engineering teams design, ship, and operate critical systems.
The Opportunity
You will join a reliability-focused engineering group within a fast-growing software platform that supports data-driven, highly personalized digital experiences. The environment is complex, distributed, and performance-sensitive. Reliability is treated as a product feature, and the team approaches operations as an engineering discipline grounded in automation, measurement, and continuous improvement.
What This Team Does
The reliability and observability team builds core backend services, internal platforms, and automation that allow product engineering teams to release software safely and scale it with confidence. The group partners closely with feature teams, embedding where needed to improve architecture, performance, and operational maturity.
The team also acts as educators and advocates, helping engineers across the organization learn how to debug distributed systems, design self-healing services, and push system performance to its practical limits.
Your Impact
As a Lead Software Engineer in Observability and Reliability, you will define how complex production problems are solved and prevented. You will own key technical areas, set direction for reliability improvements, and influence how engineering teams think about availability, scalability, and efficiency.
Your work will improve customer-facing stability while also increasing the productivity of product engineers by reducing operational friction, noise, and uncertainty.
What You Will Do
You will design, build, and operate foundational services that enable highly available and scalable systems. You will identify systemic bottlenecks and lead efforts to remove them, achieving meaningful gains in throughput, latency, and resilience.
You will develop tooling, automation, and processes that prevent incidents before they happen, working with partners to address root causes rather than symptoms. You will define and own the technical roadmap for your domain, collaborating with stakeholders to prioritize the highest-impact work.
You will write and maintain production software that improves service availability, operational efficiency, and performance. You will work closely with product engineers and other reliability engineers to ship changes that matter.
You will participate in an on-call rotation with a strong emphasis on learning, prevention, and alert quality. When issues arise, you will help drive clear diagnosis, resolution, and long-term fixes.
You will use data and quantitative analysis to understand system behavior, guide scaling decisions, and measure improvement. You will actively promote reliability best practices through design reviews, documentation, and hands-on collaboration.
Technical Environment
The systems you work on run in cloud-based environments and rely on technologies such as Python, container orchestration platforms, infrastructure as code, relational and in-memory data stores, and Linux-based operating systems. Observability, automation, and safe deployment practices are core to how work gets done.
What You Bring
You bring a decade or more of experience in site reliability engineering, platform engineering, or DevOps-focused roles. You have spent significant time operating production systems and understand how software behaves under real-world conditions.
You are comfortable leading through incidents and can guide teams from failure through root cause analysis to durable prevention. You have a strong understanding of Linux systems and networking fundamentals, from the operating system up through application-level behavior.
You have experience building software as part of an engineering team and write high-quality code in languages such as Python or Go. You apply sound engineering practices to both product code and operational tooling.
You are curious and growth-oriented, always looking to improve your own skills and raise the bar for those around you. You have begun experimenting with artificial intelligence in professional or personal projects and are eager to explore how new tools can responsibly improve reliability and efficiency.
About Andiamo
Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go-to-market professionals with the world's largest and most well-known companies.
For over 20 years, we've maintained the status of tier-one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com.
