Golang Site Reliability Jobs


Hand-Picked Golang jobs • Apply directly to companies • Clear salary ranges

Browse 100+ Golang Site Reliability Jobs (1 new this week) in November 2022 at companies like MaxMind Inc, Gtmhub and Monzo with salaries from $100,000 to $500,000 working as a Site Reliability Engineer or Site Reliability Engineer (Platform).

2-Click Apply

  1. Upload Your CV
  2. Go to your Inbox & Confirm Your Application





10 of 107 Site Reliability Jobs • Sort by Date
Site Reliability Engineer
MaxMind Inc
Remote - US & Canada
$130,000 to $180,000 a year
December 2021
1 Applicant This Week
This job posting is no longer available

Job Description

MaxMind (www.maxmind.com) is looking for a talented, experienced, and highly motivated Site Reliability Engineer (SRE) to join us. We help protect thousands of companies worldwide from fraud, screening over a billion online transactions each year, and we provide IP intelligence data to thousands more. As an SRE, you will play an important role in the continuous improvement and maintenance of MaxMind’s products, services and internal systems to deliver a performant and secure solution.

This is a full-time remote position, and we welcome candidates to apply from the following US states: CA, CO, FL, LA, MN, NV, NY (excluding New York City and Yonkers), NC, NV, OR, PA, TX, WA, and from anywhere in Canada.

MaxMind does not currently sponsor US employment visas. For Canadian candidates, you must be eligible/authorized to work in Canada.

The Position Overview

As a MaxMind SRE, you will make a difference in defining broader architectural, design, and technical objectives of MaxMind, and achieving customer satisfaction by:

  • Building performant and scalable SaaS solutions and the tools to maintain them
  • Offering ideas and suggestions for improving the development tool set, technical direction, and software architecture
  • Identifying, triaging, and resolving system issues
  • Researching changes in technologies, development environments, and tools
  • Enabling and extending complex system monitoring
  • Updating configuration management and deployments
  • Collaborating with, mentoring, and advising software engineers and the product team
  • Providing after-hours on-call support in rotation with other members of the team

Our salary range for Site Reliability Engineering roles begins at $130k USD or $150k CAD (in Canada), with the specific offer depending upon skills and experience. See more about benefits and compensation below.

Minimum Qualifications

  • Experience as a Site Reliability Engineer and/or System Administrator and/or DevOps Engineer for Highly Available SaaS solutions processing web traffic
  • Knowledge of TCP/IP, HTTP, DNS, TLS, and SMTP
  • Experience building complex monitoring solutions to support identification of issues with high availability web services
  • Able to investigate and resolve issues with Linux performance and network latency/reachability
  • Experience with Ansible, Terraform, or other configuration management and infrastructure as code software
  • Programming experience in Go or another language. Our SRE code is mostly Ansible and Terraform, but we also have a small amount of Go and Perl. We're happy to hear from you if you are more familiar with other programming languages or configuration management software too
  • Significant experience with Linux systems
  • Experience with version control, preferably Git
  • Strong analytical and problem-solving skills, with logical and repeatable debugging and problem solving approaches
  • Ready to learn new things
  • Excellent written and verbal communication skills with ability to communicate clearly with partners and end users
  • Able to work with a geographically distributed team

Desired, but not required

  • Experience managing PostgreSQL, including streaming replication and backups
  • Experience with Google Cloud or another major cloud provider
  • Experience doing security audits, security compliance, or penetration testing
  • Experience with HAProxy configuration, Docker, Kubernetes or other container tools, ELK/Elastic Stack, Cloudflare, systemd configuration, and open source technologies
  • Experience with emerging cloud platforms and infrastructure tools

Our Site Reliability Engineering Practices

Our Site Reliability Engineers are members of our Engineering team, working together to contribute to our customers’ success. At MaxMind, we are committed to security, and the contributions of our SRE team are integral to our work. To learn more about our commitment to security, visit https://www.maxmind.com/en/company/commitment-to-security. We have built a culture of peers, with highly developed practices and processes to work together remotely. To learn more about working at MaxMind, visit https://www.maxmind.com/en/company/working-at-maxmind.

We use Linux, PostgreSQL, Ansible, and Terraform to deliver our solution. We use a wide variety of tools to manage and monitor our systems, including Prometheus, Grafana, Nagios, and the Elastic/ELK stack. All work goes through internal code review on GitHub Enterprise.

Our goal is to automate as much as possible. Our tools are written in Go and Perl. We also want to improve our coding practices for the SRE code we write, writing libraries and tests wherever possible instead of one-off scripts.

Why work at MaxMind?

In a recent survey, employees listed having a supportive work culture, good co-workers, autonomy, and feeling trusted, valued, and respected as some of the things they like most about working here.

MaxMind has a social mission. MaxMind donates over 60% of profits to charities.

MaxMind’s compensation strives to reward getting stuff done, quality of work, and working well with others.

Our culture is very important to us. We’re friendly, collaborative, and work-focused. We don’t like office politics and unnecessary stress. We like to have productive workdays and don’t like work to chase us when we’re done for the day. We maintain a set of core, overlapping hours, but are flexible with specific start and end times and are understanding about appointments and life events. We care about helping each other succeed.

Our engineering team works remotely so communication centers around video chat, group chat, and Agile planning tools.

Normally, we hold a company summit one time per year in Waltham, MA.

Benefits

In addition to competitive compensation, our US benefits include medical, dental, vision, life, and short and long term disability insurance, a Safe Harbor 401(k) with employer contribution, Health Savings Account, Limited Purpose Flexible Spending Account, Dependent Care Account, paid parental leave, and public transit reimbursement.

Our Canadian benefits include medical, dental, vision, life, accidental death and dismemberment, critical illness, short and long term disability insurance, Employee and Family Assistance Program, and paid parental leave. You also have access to a group Registered Retirement Savings Plan. In lieu of a Canadian RRSP contribution we provide a bonus payout at the end of each year that employees may decide to use toward retirement savings.

Everyone participates in a company performance-based bonus plan. MaxMind offers a $2,000 professional development budget and five days for professional development annually.

Diversity and Inclusion

We're committed to diversity and inclusion and are mindful of incorporating them into all aspects of our company.

We encourage and sincerely welcome applications from candidates of color, women, queer candidates, candidates with family caregiving responsibilities, transgender candidates, and from other communities not well represented in the tech world.

See our complete diversity and inclusion statement - https://www.maxmind.com/en/company/working-at-maxmind.

Resumes without cover letters will not be considered. We want to know about you. Please tell us why you’re interested in MaxMind and in this position in particular. Please share any projects or accomplishments and include a link so we can learn more. One of the first steps in our interviewing process is a homework assignment, and we will ask you for a submission so we can gain insight into your work.


Site Reliability Engineer
Gtmhub
Sofia, Bulgaria
€30,000 to €35,000 a year
July 2019
1 Applicant This Week
This job posting is no longer available

Job Description

Gtmhub is the world’s most beautiful and intuitive Objectives and Key Results (OKRs) management and employee experience solution. We build enterprise-scale software with a consumer-grade experience.

We help organizations amplify revenue growth by aligning every employee with their corporate purpose using the OKRs method. We are big believers in the power of employee experience to drive productivity, so our product facilitates best practice employee success features.

At heart, we are product people who love data so much that we built the only solution that integrates more than 150 data connectors to allow for true automation of progress and productivity management.

The Role

The term site reliability engineering is credited to Benjamin Treynor Sloss, Vice President of Engineering at Google. He said site reliability engineering is “what happens when a software engineer is tasked with what used to be called operations.”

To us, a Site Reliability Engineer (SRE) is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.

SREs design and implement automation with software to replace human labor. They want systems that are automatic, not just automated—such that their services are able to run and repair themselves.

Responsibilities

  • Engage in and improve the entire lifecycle of services, from inception and design through to deployment, operation, and refinement/system tuning
  • Support services before they go live through activities like system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
  • Identify performance bottlenecks and troubleshoot performance issues
  • Scale systems sustainably through mechanisms like automation, and evolve systems by advocating for changes that improve reliability and velocity
  • Practice sustainable incident response and postmortems

Basic Qualifications

  • Experience with algorithms, data structures, complexity analysis, and software design
  • Ability to work across teams (business and technical) to continuously analyze system performance in production, troubleshoot consumer reported issues, and proactively identify areas requiring optimization

Preferred Qualifications

  • Expertise in designing, analyzing and troubleshooting large-scale distributed systems
  • A systematic problem-solving approach, accompanying effective communication skills, a sense of ownership, self-direction, and drive
  • Ability to debug and optimize code and to automate routine tasks
  • Practical experience in supporting application reliability practices for consumer-facing web and mobile experiences

The Stack

Our tech stack includes (but is not limited to):

Kubernetes, Docker, Golang, Java, GAP, ELK, OpenTracing, Python, OpenShift, Terraform, Ansible

We started in Sofia in 2015 with a mission to ship a world-class data management and analytics engine which allows companies to automatically track and visualize KPIs in real-time and create custom insights to inform goal setting, performance management, and long-term strategic decision making. Today we operate across offices in Sofia, London, Berlin, and San Francisco.

Apply today if our mission inspires you! Join us in developing yourself and others as our Site Reliability Engineer.


Site Reliability Engineer (Platform)
Monzo
London, UK / Remote (EU)
£59,000 to £116,000 a year
September 2020
3 Applicants This Week

Job Description

At Monzo we’re aiming to build the best current account in the world. We are always keen to hear from capable, creative engineers who want to help us accomplish that goal 🚀

We’re currently looking for Site Reliability Engineers (SREs) to join our Platform team.

We’re looking for SREs who are software engineers at heart - you’re as comfortable writing software to solve problems as you are operating AWS or Kubernetes. If you’re a software engineer who has some good cloud infrastructure experience already, or you’re eager to get really familiar with systems, tooling and libraries, this could be the role for you.

As a team, we’re responsible for designing, building, and operating the services we consume from AWS, along with the software we run on top like Kubernetes, Cassandra, Prometheus, and Kafka. We’re also responsible for operating our three physical data centres, our network, and being on-call for the things we own and run.

To achieve this, we’re organised into three squads within the Platform Group: Infrastructure Platform, Storage Platform, and Backend Platform. Each squad is responsible for solving a specific set of problems for our customers and our engineers. We’re looking for engineers who are interested in joining our Infrastructure Platform or Storage Platform squads right now, but there are opportunities to move between them as you gain experience with our platform.

We've posted a good overview of our platform on our blog if you’d like to learn more.

We're investing a lot of up-front effort in building a scalable, secure, and extensible architecture for our millions of customers. Come and help us build a state-of-the-art microservices platform and build the kind of bank you want to use.

Our engineers have a variety of different backgrounds

We have several non-graduates; only some of us studied Computer Science; some of us have worked in huge companies; some have only ever worked in startups; others are former consultants. As long as you enjoy learning new things, we’d love to talk to you. We do not ask for formal qualifications or degree requirements for any of our engineering roles.

We are actively creating an equitable environment for all of our engineers to thrive

Diversity and inclusion are a priority for us and we are making sure we have lots of support for all of our people to grow at Monzo. We provide a sponsorship framework in Engineering for women and people of colour; all of our leaders are trained on privilege awareness and we are creating partnerships with organisations dedicated to supporting underrepresented groups. You can read more in our 2020 Diversity and Inclusion report.

Monzo works in project-based sprints in small, interdisciplinary teams

We have around 150 engineers out of roughly 1,400 people in total - and we have big ambitions. There are many interesting challenges ahead, and we're happy for people to move between teams or to specialise, whatever you prefer. As an engineer here you'd be able to work directly with anyone across the company, and we run regular knowledge-sharing sessions so you’ll learn heaps about everything from how banks work to effective communication.

We encourage an open and transparent working environment

You can get involved in any aspect of the business you are interested in and, following Stripe’s example, all emails in the company are visible in an email archive. We contribute to open source software as much as possible. We’ve also made our product roadmap public and give sneak peeks of features in our community forum. Our technology blog is a good place to learn even more about what we do!

At Monzo you will get to work with a lot of exciting new technology.

We rely heavily on the following tools and technologies:

You should apply if:

Our open roles are for mid-level to senior Site Reliability Engineers at present. Apply if:

  • the work we’re doing sounds exciting!
  • you’re a software engineer at heart and you’re comfortable writing software to solve problems
  • you’re interested in distributed systems and writing resilient, scalable software
  • you have strong experience working on the backend of a technology product
  • you’re familiar with some of our Platform technologies, or specialise in just one part
  • you want to help build, scale and operate a platform to support a product that you (and everyone you know) use every day
  • you’re keen to learn more about new technologies and the arcane inner workings of the financial industry
  • you’re comfortable working in a team that deals with ambiguity

Logistics

Salary ranges between £59,000 - £116,000 plus stock options and other benefits.

We can help you relocate to London & we can sponsor visas.

This role can be based in our London office, but we're open to distributed working (as long as you can spend around 20% of your time in London).

We have payroll set up in four countries: the UK, Ireland, France, and Spain. Right now, we can only hire people who work from those countries and we’ll keep this updated with new ones as we expand and are able to hire from more places 🌎

We're almost always hiring engineers, so there's no closing date for this job.

We offer flexible working hours and trust you to work enough hours to do your job well, at times that suit you and your team.

Diversity and inclusion is a priority for us – if we want to solve problems for people around the world, our team has to represent our customers. So we need to attract the best talent and create an environment that supports and includes them. You can read more about diversity and inclusion on our blog.

If you prefer to work part-time, we'll make this happen whenever we can - whether this is to help you meet other commitments or strike a great work-life balance.

Our interview process is normally a phone interview, a coding task and a call to discuss it, and 2-3 hours of onsite interviews that can be conducted via Hangouts as well. We promise not to ask you any brain teasers or trick questions. We might design a system together on a whiteboard, the same way we often work together, but we won’t make you write code on one.

Equal Opportunity Statement

At Monzo, we're embracing diversity in all of its forms and fostering an inclusive environment for all people to do the best work of their lives with us. This is integral to our mission of making money work for everyone.

We're an equal opportunity employer. All applicants will be considered for employment without attention to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity, or disability status.


Perks & Benefits

https://monzo.com/careers/#benefits

Site Reliability Engineer
PubNative
Berlin, Germany
€40,000 to €65,000 a year
October 2018
2 Applicants This Week
This job posting is no longer available

Job Description

PubNative is a mobile publisher platform that serves native ads via a scalable and flexible API for mobile apps and web. Our publisher-first approach focuses on the specific needs of each publisher across all verticals. Our ad serving technology is used by developers and publishers around the world.

Our system consists of a myriad of high load Golang-based APIs, iOS SDKs, Ruby/Rails 5 dashboard, Scala and Spark data- and ML pipelines, Druid OLAP system, running on a Mesos and Kubernetes cluster.

We're always on call to keep our networks up and running, ensuring our users have the best and fastest experience possible. We follow the “Infrastructure as Code” model and immutable deployment strategies.

We are looking for a Site Reliability Engineer (m/f) to help us build and operate infrastructure platforms, and provide technical consultancy to engineering teams on how to build reliable, scalable and efficient services.

Our Responsibilities:

  • You help us build a hybrid, poly-cloud-provider environment
  • You help to design, develop and operate monitoring and tracking platforms
  • You drive scalability and operability of supported systems/infrastructure
  • You participate in the on-call rotation and are on-call for the services you build and support
  • You work with other teams to provide consultations in systems architecture support for new and existing production systems
  • You write code so that you can automate tasks and support SLAs for production systems, and you support other engineering teams on reliability, scalability and efficiency topics
  • You manage OS images/templates via Packer and provision infrastructure via Terraform
  • You support CI/CD and make new pipelines
  • You engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
  • You support services before they go live through activities such as system design consulting
  • You maintain services once they are live by measuring and monitoring availability, latency, and overall system health

Our Requirements:

  • 3+ years of experience in a Site Reliability role/Full-stack developer
  • Experience with public cloud providers (AWS, Google Cloud, Digital Ocean, etc.) and Infrastructure as Code (Terraform)
  • Strong programming skills and familiarity with modern programming languages: Go, Ruby, Python, Shell etc.
  • Knowledge of managing Docker containers and microservices via Kubernetes
  • Experience building and monitoring systems and metric collection pipelines
  • Track record of building automation and solving multi-datacenter/clouds infrastructure problems
  • Knowledge of algorithms, data structures, complexity analysis, software design and reverse engineering
  • Interest in designing, analyzing and troubleshooting large-scale distributed systems
  • Experience working with source control (Git)
  • Experience with continuous integration platforms such as TeamCity, Jenkins, CircleCI etc.
  • Understanding of Agile, DevOps practices such as CI/CD, automated testing etc.


Senior Engineer Tools & Platforms SRE
Digital Ocean
New York / Cambridge / Palo Alto, United States / Remote
$155,000 to $190,000 a year
July 2019
3 Applicants This Week

Job Description

Do you ever wonder what happens inside the cloud?

Based in New York, DigitalOcean is a dynamic, high-growth technology company that serves a robust and passionate community of developers, teams, and businesses around the world. We believe that today’s entrepreneurs are changing the world through software. Our mission is to empower these entrepreneurs by bringing modern app development within reach for any developer, anywhere in the world.

We want people who are passionate about building the systems, culture, and processes that will improve the resiliency, reliability, scaling, and performance for cloud services.

We are looking for an experienced Site Reliability Engineer to work closely with our product engineering and infrastructure teams. Reporting to the Director of Platform Systems, the Site Reliability Engineer will be performing a mix of hands-on development, coaching, and collaborating with other teams and stakeholders to help bring DigitalOcean’s engineering systems and culture up to the next level.

This is a key opportunity to make a significant impact in DigitalOcean’s engineering and operational systems and influence future product designs and requirements. This role is essential to accelerate the improvement of the high expectations our customers have of DigitalOcean as we continue to grow and expand.

What You’ll Be Doing:

  • Performing hands-on technical work to directly improve the reliability, resiliency, and scaling of our key platform systems
  • Working with stakeholders to develop and implement reliability and performance metrics
  • Facilitating DigitalOcean’s culture of learning by providing insight and recommendations for improvement
  • Coaching teams and individuals on reliability best practices and solutions
  • Working with other SREs and engineering leaders to define the architectures and practices that should be adopted in order to deliver on our engineering and operational goals
  • Establishing best practices for development, architecture, deployment, and operations
  • Working with peer SREs to improve services and processes (including architecture reviews, incident response, monitoring) in a cross-functional manner throughout the engineering organization

What We’ll Expect From You:

  • Distinguished track record as SRE (or similar role) with hands-on experience implementing reliability, process, and scaling solutions
  • History of fostering positive relationships with stakeholders and a track record of successful collaboration and coaching
  • Clear communication skills (both written and verbal) to document processes and architectures
  • Experience implementing disaster recovery best practices
  • Developing robust solutions that facilitate streamlined resolution of customer inquiries through use of technologies for automation, deflection, and issue management
  • Adept in Ruby and Go with a broad understanding of the full technology stack for a modern infrastructure
  • Advocate of effective development environments with the use of CI/CD tooling and configuration management technologies such as Chef or Ansible

Why You’ll Like Working for DigitalOcean:

  • We have amazing people. We can promise you will work with some of the smartest and most interesting people in the industry. We work hard but we always have fun doing it. We care deeply about each other and take our “no jerks” rule very seriously.
  • We value development. We are a high-performance organization that is always challenging ourselves to continuously grow. That means we maintain a growth mindset in everything we do and invest deeply in employee development. You’ll need to be great to get hired here and we promise you’ll get even better.
  • We care about you. We offer competitive health, dental, and vision benefits for employees and their dependents, a monthly gym reimbursement to support your physical health, and a monthly commute allowance to make your trips to and from work easier.
  • We invest in your future. We offer competitive compensation and a 401k plan with up to a 4% employer match. We also provide all employees with Kindles and reimbursement for relevant conferences, training, and education.
  • We want you to love where you work. We have great office spaces located in the heart of SoHo NYC and Cambridge and offer daily catered lunches to keep your hunger at bay. We’re also very remote-friendly—we use Slack to communicate across the company—and all remote employees have the opportunity to onboard in-office and take an all-expenses paid trip to our annual company offsite, Shark Week, to get quality in-person time with the team at least once a year. We also allow employees to customize their workstations to meet their needs—whether remote or in office.
  • We value diversity and inclusivity. We are an equal opportunity employer and we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Senior Site Reliability Engineer
TextNow
Remote (United States)
$150,000 to $230,000 a year
October 2021
1 Applicant This Week
This job posting is no longer available

Job Description

TextNow is based around a simple idea: Communication belongs to everyone. We work hard to help people stay connected by offering a solution that makes phone service free. At TextNow, we work together to solve complex and interesting problems that have a positive impact on our customers' lives.

Join us in our mission to help people stay connected with technology that is free (or as close to free as possible.)

TextNow is looking for motivated Site Reliability Engineers (SREs) to own infrastructure, monitoring, logging, CI/CD, reliability and everything in between!

What You’ll Do:

  • Be responsible for maintaining and scaling production services and servers for complex and high throughput.
  • Improve scalability, service reliability, capacity, and performance.
  • Write automation code for provisioning and operating infrastructure at scale.
  • Build tools for internal use to support software engineering best practices.
  • You are not an operator; you’re an experienced software engineer focused on operations.
  • Work with development teams to make sure the applications fit nicely within the infrastructure and scalability/reliability/security is designed and implemented from the start.
  • Participate in on-call rotation, being responsible for uptime and support.
  • Roll up the sleeves to troubleshoot incidents, formulate and test your hypotheses, and narrow down possibilities to find the root cause.

Who You Are:

  • Creator of cool stuff with experience deploying web apps and distributed, service-oriented architectures.
  • Brilliant Collaborator with 8+ years of professional experience in an operationally focused role, preferably in DevOps or SRE, with a B.S., M.S., or PhD. in Computer Science (or equivalent).
  • Someone who takes action and ownership with proven ability to use automation tools.
  • Respectfully candid with the ability to motivate people to act and work on behalf of our customer.
  • A bold risk-taker and self-starter who loves to solve challenging problems.
  • Resourceful and scrappy with the ability to be strategic, roll up your sleeves and execute.

Other:

  • Strong knowledge of Linux and open source software
  • Understanding of modern web architecture (HTTPS, REST) and technology stacks
  • 2+ years of experience with programming/scripting languages (Bash, Go, Python, Ruby, etc.)
  • Experience with deployment automation using Ansible, Puppet, and Terraform
  • Experience supporting various databases such as MariaDB, Redis, and various NoSQL engines
  • Experience deploying containers using Docker and Kubernetes
  • Experience working in the Amazon public cloud (AWS)
  • Experience supporting mobile applications (Android and iOS)
  • Experience in the telecommunications industry

Benefits:

  • Strong work-life blend
  • Flexible work arrangements (WFH, remote)
  • Employee stock options
  • Unlimited vacation
  • Competitive pay and benefits
  • Parental leave
  • Benefits for both physical and mental well-being

Diversity and Inclusion:

At TextNow, our mission is built around inclusion and offering a service for EVERYONE, in an industry that traditionally only caters to the few who have the means to afford it. We believe that diversity of thought and inclusion of others promotes a greater feeling of belonging and higher levels of engagement. We know that if we work together, we can do amazing things, and that our differences are what make our product and company great.

For TextNow Candidates:

The People and Culture team is available to support you through the hiring process by providing reasonable accommodations to help enable a barrier-free interview experience. If you need assistance applying for a role due to a disability or special need, please let us know by completing this form. Once received our Equity, Diversity and Inclusion Specialist will reach out to you and assist with accommodations that you may require.


Site Reliability Engineer
Rebellion Defense
Washington, DC / Chicago, Illinois, United States
$100,000 to $200,000 a year
November 2020
6 Applicants This Week

Job Description

We are looking for a Site Reliability Engineer (SRE). As an SRE, you will be tasked with the reliability and operation of our production environments. SREs are tasked with ensuring teams within the company receive help maintaining software at scale, as well as help designing and developing software for scale. SREs are expected to engage with the product teams to ensure the delivery of our software is as seamless as possible.

This position is based out of our Washington, D.C. or Chicago, Illinois office locations. An active clearance or the ability to obtain a TS/SCI clearance will be required.

We look for a track record of the following:

  • Working alongside high-energy engineering teams to drive the adoption of best practices that enable the scalability and reliability of deployed software,
  • Defining architecture and building services at scale on public infrastructure such as AWS and Azure,
  • Experience designing, implementing, deploying, and operating high scale production services,
  • Experience facilitating the definition and implementation of SLIs and SLOs,
  • Understanding how to carefully spend error budget to handle regular deployment of large changes to production,
  • Deep experience in Linux operating systems, and systems engineering,
  • Comfort delivering critical software in Go and Python,
  • Willingness to debug problems across the stack,
  • Comfort working on underspecified problems and the ability to rapidly learn and iterate on solutions,
  • Experience building the wrong system enough times to avoid the common pitfalls, whether building something personally or advising others.

You might be a good fit if you have:

  • 5+ years of relevant SRE experience in the tech industry,
  • demonstrable knowledge of TCP/IP, HTTP, and web application security, and experience supporting web application architecture,
  • experience working with a variety of storage systems, application architectures, compute infrastructure and network management systems,
  • experience designing, implementing, deploying, and operating high scale production services,
  • experience defining architecture and building services at scale on public infrastructure such as AWS and Azure,
  • proven knowledge of at least one higher-level language (e.g. Python or Golang),
  • the ability and desire to build and learn new systems with new technologies.

Rebellion is a well-capitalized technology start-up firm that is passionate about defining and delivering modern, life-changing software products to the US Department of Defense (DoD), the UK Ministry of Defence (MoD), and their allies. At Rebellion we believe in operating what we own: we deliver all of our products as managed services, which allows our product teams to maintain operational ownership across all deployments. Expect talented, motivated, intense, and interesting co-workers.

Compensation includes meaningful equity ownership, competitive salaries, full medical coverage, disability and life insurance, and transit reimbursement.

An Equal Opportunity Employer/Veterans/Disabled. Rebellion Defense is an equal opportunity employer and makes employment decisions on the basis of merit and business needs. Rebellion Defense does not discriminate against applicants on the basis of race, color, religion, sex, sexual orientation, gender, gender identity, national origin, veteran status, disability, or any other protected characteristic in accordance with federal, state, and local law.


Site Reliability Engineer
Castor EDC
Amsterdam, The Netherlands
€60,000 to €80,000 a year
February 2020
1 Applicant This Week
This job posting is no longer available

Job Description

Our true purpose at Castor

Castor is one of the leading platforms for data collection in medical research. We believe standardizing and reusing datasets is key to overcoming the healthcare challenges of the future.

How we operate

Our main Electronic Data Capture (EDC) application runs on a proven stack consisting of Ubuntu, Nginx, PHP and MySQL. For our cloud installations, we orchestrate these setups by using Terraform combined with Ansible for the server configuration management.

Due to the nature of processing medical data, we have clients in different regions across the globe, often with specific regulatory constraints around where and how their research data is stored. To meet these customer demands we combine both traditional as well as cloud-based hosting solutions.

Most of our clients prefer to run in Azure, but we’re using Google Cloud Platform for things like Kubernetes hosting of greenfield projects, blob storage for scalable file uploads, and its Key Management Service (KMS) to further secure our data.

For our metrics we’ve begun standardizing on Prometheus and we’re moving towards Loki for log aggregation. We use PagerDuty for alerting, communicate via Slack and host our code on GitHub.

Why we’re growing our team

With our recent expansion have come new challenges, both in how we organize ourselves and in how we manage and scale our infrastructure in the future.

To further these efforts we have formed a Platform team consisting of SRE and Software Engineering, which we are now looking to grow by adding another SRE.

Additionally, due to the sensitive nature of medical data, Castor is certified for both ISO 9001 (quality) and ISO/IEC 27001 (information security). We also have to adhere to a number of other regulations, including Good Clinical Practice (GCP) guidelines.

Our goal is to unite these requirements with emerging SRE practices around infrastructure as code and other principles to create a well designed and documented system, while still allowing us to remain flexible to change.

How you will contribute

Our absolute commitment to patient data security and privacy informs our vendor selection with certified datacenter and cloud providers. To achieve real impact in medical research, Castor needs to operate securely around the world.

Historically, our production platform has run on top of managed hosting services. This model doesn’t scale well for our global footprint, which is why we are currently expanding our in-house knowledge and transitioning to Infrastructure-as-a-Service providers.

As a Site Reliability Engineer, you’ll have the ability to shape our operations and continuously deliver a working product. Working very closely with the development teams, you’ll collaborate in supporting and structuring our efforts around automation, observability and security. With your help we plan to scale the Castor platform to the next level.

Some things we worked on recently

Whilst there are many operational challenges as we continue to grow and scale at Castor, our Platform team has made great improvements to a variety of our systems already. To give you some examples of what we achieved last month:

  • Migrated our DNS to AWS Route53
  • Set up automatic documentation pipelines using MkDocs
  • Moved our CI/CD pipelines from Jenkins to CircleCI
  • Built a key-service on AWS Lambda to store disk encryption keys off-site for an otherwise region-local setup

Your background

You have helped run web-facing services under production workloads and have experienced the challenges that come with maintaining and scaling these systems. Making and owning decisions about systems architecture together with your team is something you enjoy and feel comfortable with.

Qualities we’re looking for include:

  • A good grasp of how *NIX systems operate
  • The ability to evaluate and implement best practices for IT operations
  • A working knowledge of both cloud-native and traditional systems architecture and the trade-offs between them
  • Experience with a configuration management framework such as Ansible, Chef, Puppet or SaltStack
  • The ability and desire to work with a wide range of open source technologies
  • A strong privacy and security mindset
  • Experience with some aspects of Observability and distributed systems: from monitoring, logging and metrics instrumentation to resiliency to failure
  • A good understanding of how relational databases operate
  • Experience with at least one programming or scripting language, preferably Python or Go(lang)
  • Knowledge that a list of skills and requirements doesn’t mean you have to tick every single box to apply ;)

How we say thank you

At Castor we truly live our core values, believing we can achieve anything with a healthy and happy team. With this in mind, we offer the following benefits:

  • Our own ‘Castor Burrow’ - brand new offices by Amsterdam Amstelstation
  • A competitive salary plus an annual company bonus plan
  • Employee Stock Option Programme incentive
  • 30 days annual leave plus 6 public holidays
  • An individual training and professional development budget
  • Flexible working with the opportunity to work from home 1 day per week
  • Meditation room with daily yoga, mindfulness and company subscription to Calm
  • Lunch and healthy snacks in the office every day
  • A new Mac or Dell laptop

Site Reliability Engineer
Goldman Sachs
London, United Kingdom
£40,000 to £100,000 a year
November 2018
1 Applicant This Week
This job posting is no longer available

Job Description

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform services, and ensures they meet the requirements of our internal and external users. We look for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems, which can evolve and adapt to changes in our fast-paced, global business environment.

Skills & Requirements

  • Proficiency in one or more of the following: Go, Python, C, C++, Java, Perl, Ruby or shell scripting
  • Experience with algorithms, data structures and software design
  • Experience with UNIX operating system internals and/or networking
  • Experience with distributed systems design, maintenance, and troubleshooting
  • Hands-on experience with debugging and optimizing code, as well as automation
  • Strong interpersonal skills, drive, and ownership
  • Coding beyond simple scripts
  • Solving novel problems from first principles

ABOUT GOLDMAN SACHS

The Goldman Sachs Group, Inc. is a leading global investment banking, securities and investment management firm that provides a wide range of financial services to a substantial and diversified client base that includes corporations, financial institutions, governments and individuals. Founded in 1869, the firm is headquartered in New York and maintains offices in all major financial centers around the world.


Senior Site Reliability Engineer, CORE
Netflix
Los Gatos, California, United States
$250,000 to $500,000 a year
January 2020
6 Applicants This Week

Job Description

At Netflix, we strive to bring joy to people across the world through amazing stories. As we grow internationally, we are continually enhancing our cloud-based infrastructure to improve our performance, scalability, and reliability.

The SRE team's goal is to ensure customer joy by successfully managing risk and minimizing impact across Netflix. We do this through cross-functional engagement with other engineering teams, managing issues when they happen, as well as promoting reliability and resilience practices throughout the organization.

Outcomes

  • Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks
  • Increase our reliability through establishing guidance and methods of improvement
  • Form and maintain relationships with internal and external partners
  • Develop deeper insights and analysis into the quality of experience for our customers

We Value

  • Curiosity about how complex sociotechnical systems successfully operate at scale when failure is inevitable
  • People who see influence as their preferred tool for cultivating relationships
  • Collaboration and continuous improvement
  • A desire to learn and readiness to teach
  • Iteration as the path forward

Our Work

  • Drive incidents to resolution by coordinating with multiple engineering teams
  • Identify sources of instability in large-scale distributed systems and drive operational excellence
  • Analyze complex systems from a reliability and resilience perspective
  • Engage with product teams to diagnose operational surprises and carry forward improvements
  • Improve reliability and drive down the burden of toil with tooling and automation

Nice to Have

  • Experience with global, continuous delivery methods
  • Development with Python, Go, Java, or JavaScript/Node.js
  • Involvement with incident management and response
  • Knowledge of cloud platforms like AWS and microservices architecture
  • Deep network analysis
  • Linux systems engineering capability

Things that show how we think

