Site Reliability Engineer (m/f/d)

gridscale GmbH

Köln, Nordrhein-Westfalen, Deutschland ・ Full-time

Experience: Any
Openings: 1
Posted: 1 hour ago
Work mode: In office
Resume: Required to apply

Job description

At our company, it’s all about #OneTeam! Join gridscale and help shape the future of the cloud together with OVH.

As a leading tech company, we’ve been working for over two decades to reduce our environmental footprint - with innovative solutions and an open cloud designed to be sustainable from the ground up: #SustainableByDesign.

Our Tech Stack 🚀

OpenStack · Kubernetes · KVM · Linux · Bare-metal

Ansible · Terraform · FluxCD / ArgoCD · Git · Go · Python

Claude Code / Cursor / agentic coding tooling

Your Role 💻

You'll help build, operate, and industrialize OVHcloud's on-premises cloud platform (OPCP). You'll join a small, senior team that owns the OpenStack-based infrastructure and the Kubernetes / GitOps stack our customer-facing cloud runs on, and that treats AI-assisted engineering as a first-class part of how we work.

The platform is actively in build mode, so joining now means real influence on the architecture, the automation strategy, and how we adopt AI in platform engineering. As a Senior, you shape the focus of your role around your strengths and interests: there's a clear backbone of automation, compute-lifecycle, and platform work, plus an explicit AI-substrate workstream. You're at home in a security-oriented, highly automated (GitOps) environment, keep an overview in ambiguous situations, and make well-founded decisions on that basis.

Your Tasks

  • Design and build OpenStack-based on-prem infrastructure that deploys itself autonomously - discovering available hardware and bringing up a functional datacenter in minutes.

  • Develop Infrastructure as Code with Ansible and Terraform - typically spec-first with LLM assistance, then human-validated; push this further via custom agent / sub-agent setups, agentic test generation, and prompt-engineered review loops.

  • Drive the ongoing development of our Kubernetes stack and GitOps workflows (FluxCD / ArgoCD).

  • Own the full lifecycle of our compute infrastructure - from bare-metal (firmware, provisioning, hardware health) through hypervisors to virtual compute nodes - and build the automation that keeps capacity healthy and rolls out updates without disturbing tenant workloads.

  • Build and extend the AI substrate that compounds our output: Markdown knowledge bases as retrieval substrate, agentic prototypes for incident triage and capacity planning, and deeper integration of agentic coding tools into daily work.

  • Contribute to the self-healing direction, turning today's manual runbooks into tomorrow's reasoning agents. Auto-remediation isn't a separate team here - it's how platform work is meant to land.

  • Design and implement test suites aligned with functional and technical specs (non-regression, performance, security).

  • Document and package the solution so users can deploy and operate it without friction, and keep improving the platform based on telemetry and user feedback.

  • Act as a technical reference and mentor across automation, platform engineering, and AI-tooling topics.

What we offer you 💼

  • A platform that is genuinely in build mode - your architectural decisions stick.

  • A senior team where seniority means autonomy, not just a title.

  • AI-augmented engineering as a first-class workflow - Claude Code and comparable agentic tooling, Markdown-KB-as-substrate, and room to push the practice further. Modern tooling that compounds your work instead of just sitting next to it.

  • Exceptional team spirit across all departments and national borders - we live #OneTeam

  • Exciting work in a highly innovative, international environment with cutting-edge technologies

  • 32 vacation days, increasing with length of service

  • Flexible working hours, home-office options, and a secure permanent position with market- and performance-based compensation

  • Employer-funded pension plan and an attractive insurance package

  • OVHcloud covers 50% of public transportation costs

  • Up to €400 per year toward sports activities (gym membership, classes, etc.)

  • Attractive discounts at numerous shops and companies through Corporate Benefits

  • A contribution toward leasing your cargo bike

  • Regular company events and free cold and hot beverages

Your Profile

  • Several years of hands-on experience running production infrastructure (SRE, Platform, or DevOps).

  • Solid OpenStack experience - deployed, operated, and debugged it in production.

  • End-to-end compute infrastructure management, from bare-metal lifecycle through hypervisor and virtual compute node operations (migration, host evacuation, graceful drains, capacity rebalancing). The skill matters more than the specific tooling - what counts is having done it at scale and automated it.

  • Strong with Infrastructure as Code (Ansible, Terraform) and GitOps (FluxCD or ArgoCD), plus solid Linux administration including on bare-metal.

  • Active, daily practice of AI-assisted engineering, with opinions formed from real use. You can describe a workflow where an LLM saved you half a day, and one where you should have skipped it. Theoretical interest doesn't count.

  • Fluent English, written and spoken - our team is distributed, and this is the working language.

  • Nice to Have

    • Production experience with Kubernetes and the cloud-native ecosystem.

    • Production-quality Go and/or Python.

    • Deeper agentic tooling craft (Claude Code, Cursor, Aider): custom agent / sub-agent setups, hooks, prompt engineering, your own workflows or skills, and managing a Markdown-first knowledge base as a substrate for AI workflows.

    • Advanced compute-node tuning (CPU pinning, NUMA, hugepages, SR-IOV / PCI passthrough) and basic network debugging (VLANs, BGP).

    • Observability tooling (Prometheus, Loki, Grafana, etc.) and auto-remediation / self-healing systems (StackStorm, Event-Driven Ansible, or similar).

    • Experience in security-critical environments and with edge or multi-site deployments.

  • Soft Skills

    • A continuous-improvement mindset and ownership for what you build.

    • You see AI tooling as a structural shift in how engineering gets done - not a trend, not a threat - and want to shape how the team adopts it.

    • You enjoy sharing knowledge and learning from peers, and you can synthesize ideas clearly.
