Job description

About the company

Reddit is a network of communities centered around shared interests, trust, and lively discussion. It hosts more than 100,000 active communities and serves about 126 million daily active unique visitors, making it one of the largest information platforms on the internet.

Role overview

The Site Experience SRE team works across infrastructure, product engineering, and user experience to keep Reddit’s web, mobile, API, feed, media delivery, and real-time systems fast, dependable, and resilient. As a Staff Site Reliability Engineer, you will own reliability programs for high-impact, user-facing systems at internet scale and collaborate closely with product and infrastructure teams to improve availability, latency, scalability, and operational maturity.

This is a senior technical leadership role for someone who enjoys solving deep distributed-systems problems, improving large-scale service reliability, and shaping engineering practices across the organization.

Key responsibilities

You will lead reliability engineering efforts for critical user experiences, design for scale, reduce operational risk, and build automation that improves deployment safety, incident handling, and remediation. You will also guide incident response, drive sustainable fixes through postmortems, define reliability standards, and mentor engineers to strengthen the company’s reliability culture.

What you will do

Own reliability engineering for core user-facing systems and services.
Improve performance and resilience across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
Work with product and infrastructure teams to build systems that stay available and performant under very high global traffic volumes.
Influence architectural choices around failover, redundancy, graceful degradation, traffic handling, and capacity planning.
Spot systemic risks and reliability constraints across services, dependencies, deployments, and infrastructure.
Create preventive strategies and engineering improvements that lower incident rates and improve overall service health.
Remove repetitive operations through tooling and automation.
Build guardrails and workflows that improve deployment safety, incident response, and remediation.
Lead major incident response efforts across engineering groups.
Run blameless postmortems, determine root causes, and ensure long-term corrective actions are delivered.
Set and promote standards for SLIs/SLOs, capacity management, release practices, and operational maturity.
Mentor engineers across SRE and software engineering teams and help raise the bar for operational excellence.

Requirements

At least 8 years of experience in Site Reliability Engineering, Infrastructure Engineering, or similar roles supporting large distributed systems.
Proven ability to collaborate well and influence technical direction across teams.
Hands-on experience supporting high-traffic production environments used by end users.
Strong understanding of distributed systems, networking, Linux, or cloud-native architecture.
Experience building highly available systems with strong operational practices.
Programming ability in Go, Python, or comparable languages.
Working knowledge of observability tooling such as metrics, logs, traces, and alerting.
Experience improving reliability using SLOs, automation, incident management, and performance tuning.
Ability to troubleshoot difficult issues across applications, infrastructure, networking, and services.

Preferred experience

Experience running systems at internet-scale traffic levels.
Background with Kubernetes, containers, cloud infrastructure, and modern deployment platforms.
Exposure to tools and systems such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar technologies.
Experience with CDN tuning, edge reliability, traffic engineering, or global infrastructure.
Contributions to open-source software or participation in technical communities.
Experience leading major incident response and large-scale operational change initiatives.

Why this role stands out

You will help shape the reliability and performance of a major consumer platform used by millions every day, while tackling complex engineering challenges at massive scale and influencing the future of reliability engineering at Reddit.

Benefits

Global benefit programs designed to support different lifestyles, including workspace, professional development, and caregiving support.
Family planning support.
Gender-affirming care.
Mental health and coaching benefits.
Private medical, dental, and vision coverage.
Personal retirement savings account with employer matching.
Cycle to Work and Tax Saver schemes.
Flexible vacation and paid volunteer time off.
Generous paid parental leave.

Interview and privacy notice

For some roles and locations, interviews may be recorded, transcribed, and summarized using AI. Candidates can opt out before any scheduled interview.

During interviews, the company may collect identifiers, professional and employment-related information, sensory information such as audio or video recordings, and any other information you choose to share. This information is used to assess your application for employment or contractor work. It is not sold or shared with third parties for marketing. Interview recordings are deleted promptly after a hiring decision is made. Additional details are available in the candidate privacy policy for employees and contractors.

Equal opportunity and accommodations

Reddit is an equal opportunity employer and is committed to representing the diverse communities it serves. Reasonable accommodations are available for qualified individuals with disabilities and disabled veterans during the application process. If you need an accommodation during the interview stage, notify your recruiter.

Staff Site Reliability Engineer - Site Experience

Where you'll work