Site Reliability Engineer - 18-00797
Ref No.: 18-00797
Location: Boston, Massachusetts
Position Type: Direct Placement
Start Date: 03/30/2018
The role

Site Reliability Engineers are responsible for the pulse of the software ecosystem.
We monitor the system, improve it ourselves, and recommend improvements for other teams to implement.
The name of the game is automating our job, because hiring linearly with our traffic growth is unsustainable.
We are involved in incident and change management.
We also act as consultants for engineers when new code and services are getting ready to launch.
Our client recently acquired a workforce health coaching platform company.
One of your first responsibilities in this role will be supporting the acquired team's services and cloud infrastructure and helping them integrate with the rest of the infrastructure.

Responsibilities
  • Detective: SREs handle problems in live production systems, both on their own and in collaboration with systems and application engineers.
  • Ambassador: Keep the company informed about the status of our client's services, the impact of known issues, and the progress of ongoing investigations.
  • Developer: Design and refactor parts of the product backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks.
  • Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant.
What do we look for?

We want people who:
  • Write code in Python and JavaScript, and not just for classes.
  • Dig into the details of how a system, library, or tool works instead of just blindly using it.
  • Are willing and eager to wear many hats, as illustrated by the roles described above.
  • Dive into things that "aren't their problem."
  • Are willing to teach and lead others.
Requirements
  • You have 5+ years of experience as a systems/operations engineer or system administrator
  • You are comfortable with the Python programming language and ecosystem
  • You are very comfortable using and administering Linux servers
  • You can work independently with limited supervision
  • You can communicate effectively with peers and tailor your communication to your audience
  • You have a willingness to dive in and assist coworkers when incidents arise
  • You're willing to participate in the team's production on-call rotation
Nice-to-haves
  • Experience working with high-traffic, scalable web applications and services
  • Experience building, deploying, and operating your own web service
  • Familiarity with JavaScript (ES 2015+), Node.js, and their ecosystems
  • Knowledge of the administration and/or performance tuning of MySQL, Cassandra, or MongoDB
  • Prior experience being part of an on-call rotation and responding to production incidents
  • Experience with cloud computing platforms like AWS or Google Cloud Platform
  • Familiarity with configuration management and provisioning tools like Puppet, Chef, Ansible, or Terraform
  • Experience developing and shepherding processes around change and incident management
  • Experience with one or more of the technologies in our stack (or similar technologies):
    • Frameworks: Mongoose, Hibernate, Spring, Finagle, Finatra, Thrift
    • Messaging: Kafka
    • Caching: Memcached, Redis
    • Logging and Monitoring: Prometheus, Graphite, StatsD, Nagios, Logstash, Kibana, New Relic APM
  • Our client just made an acquisition, so this person would help the acquired company manage its AWS infrastructure and, ideally, support its Node.js backend and MongoDB database (so a few different technologies).