Job description

Role overview

This position focuses on keeping production environments highly available, fast, and dependable through cloud operations, automation, monitoring, and disciplined incident handling.

What you'll do

Design, build, and maintain scalable AWS-based infrastructure using Terraform or CloudFormation.
Set up and operate observability and monitoring platforms such as Prometheus, Grafana, Splunk, or Datadog.
Respond to incidents, perform root cause analysis (RCA), participate in on-call rotations, and work with SLIs, SLOs, and error budgets.
Automate recurring operational work to increase reliability, efficiency, and recovery speed.
Support Kubernetes and Docker workloads, CI/CD pipelines, and ITIL-based operational processes, and maintain runbooks.

Skills and experience needed

Hands-on experience in SRE, DevOps, production support, or cloud operations, including exposure to AWS.
Working knowledge of Kubernetes, Docker, Linux, and core networking concepts.
Ability to script in Python, Bash, or Go.
Experience with monitoring platforms and with incident resolution and root cause analysis (RCA) workflows.
Familiarity with infrastructure as code, CI/CD tools, and enterprise support systems is preferred.

Experience

A minimum of 5 years of experience in SRE, DevOps, cloud, or production support roles is required.

Site Reliability Engineer (SRE)
