Site Reliability Engineer

Location:

Type: Full Time

Ledgebrook is a tech-enabled E&S MGA on a mission to modernize specialty insurance. The industry is burdened with legacy technology and inefficient processes that prevent innovation at scale. We are changing that. Our goal is to become a best-in-class, full-stack insurer and reinsurer, using AI and data-driven insights to transform underwriting, pricing, and risk selection.

We believe in talent density: fewer, better people working together as one. We win as a team, and our success is shared through generous equity packages for all employees.

About the Role

We’re seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of Ledgebrook’s cloud-native infrastructure and applications. You’ll be at the forefront of operational excellence, driving automation, observability, and stability across our platforms.

This role combines infrastructure engineering, systems automation, and software development practices to proactively build resilient systems and reduce downtime, directly influencing our ability to scale rapidly and reliably.

What You’ll Work On

Reliability & Scalability – Architect and maintain highly available, scalable cloud infrastructure to ensure the seamless operation of Ledgebrook's production and internal environments.
Observability & Monitoring – Implement and optimize monitoring, logging, tracing, and alerting systems to proactively detect and resolve issues before they impact business operations.
Incident Response & Management – Lead and participate in incident response efforts, performing root cause analysis and implementing corrective actions and improvements to prevent recurrence.
Infrastructure as Code (IaC) – Automate infrastructure deployments and management using IaC tools (Terraform), ensuring consistency, repeatability, and security.
Continuous Integration & Delivery (CI/CD) – Develop and maintain robust CI/CD pipelines, accelerating software delivery while maintaining system stability and security.
Performance Optimization – Identify performance bottlenecks and work cross-functionally to optimize application and infrastructure performance.
Security & Compliance – Ensure infrastructure adheres to best practices for security, compliance, and disaster recovery planning, maintaining up-to-date documentation and procedures.

About You

Here at Ledgebrook, we’re passionate about building a team that thrives on continuous learning and shares our excitement about building a company from the ground up. We’re looking for people who bring:

A passion for delivering world-class customer service to our internal and external customers
Intellectual curiosity and a strong desire to innovate rather than follow the status quo
A hunger for continuous learning and opportunities to grow
Agile prioritization skills coupled with a keen sense of urgency: we balance getting it right with getting it done right now
A strong drive and desire to win together as a high-performing team
A moral compass to "do the right thing, period." We have zero tolerance for toxic behavior
An eagerness to actively participate and connect with the whole team, across remote locations
An honest, transparent communication style
A proactive, solution-oriented mindset. We don’t look for blame, we look for the solution

Tech Stack

Infrastructure & Cloud: AWS (ECS, EKS, Lambda, S3, RDS, CloudWatch, IAM)
IaC & Automation: Terraform, CloudFormation
CI/CD & Containerization: GitHub Actions, Docker, Amazon Elastic Container Service (ECS)
Observability & Monitoring: Datadog, Sentry
Languages & Scripting: Python, Bash, JavaScript/TypeScript

Must Haves

5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure-focused engineering roles
Extensive hands-on experience with AWS infrastructure and services
Strong knowledge of containerization and container orchestration (Docker, Kubernetes, ECS)
Experience building and maintaining CI/CD pipelines
Proficiency in infrastructure automation (Terraform or CloudFormation)
Solid scripting/coding experience (Python, Bash, JavaScript/TypeScript)

Nice to Haves

Experience in insurance, fintech, or regulated industries
Familiarity with Socotra or other insurance-specific platforms
Background in managing production environments
Experience working in high-growth startups or fast-paced technology environments
