Site Reliability Engineer - 17-02234
Ref No.: 17-02234
Location: Boston, Massachusetts
Position Type: Direct Placement
The role

Site Reliability Engineers are responsible for the pulse of the software ecosystem.
We monitor the system, improve it ourselves, and suggest improvements for others to implement.
The name of the game is automating our job, because hiring linearly with our traffic growth is unsustainable.
We are involved in incident and change management.
We also act as consultants for engineers when new code and services are getting ready to launch.

  • Detective: Handle problems in live production systems, both on your own and in collaboration with systems and application engineers.
  • Ambassador: Keep the company informed about the status of the client's services, the impact of known issues, and the progress of ongoing investigations.
  • Developer: Design and refactor parts of the client's backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks.
  • Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant.
  • You have 3+ years of experience as a software engineer, site reliability engineer, or operations engineer
  • You're very comfortable with the Java language and ecosystem
  • You can work independently with limited supervision
  • You can communicate effectively with peers and tailor your communication to your audience
  • You're willing to dive in and assist coworkers when incidents arise
  • You're willing to participate in the team's production on-call rotation
  • Some familiarity with Python and its ecosystem
  • Expertise in concurrency and multi-threaded code (particularly in Java)
  • Experience working with high-traffic, scalable web applications and services
  • Experience building, deploying, and operating your own web service
  • Experience being part of an on-call rotation and responding to production incidents
  • Experience with one or more of the technologies in our stack (or similar technologies):
    • OS: Linux
    • Languages: Java, Python
    • Frameworks: Hibernate, Spring, Finagle, Finatra, Thrift
    • Databases: MySQL, Cassandra
    • Messaging: Kafka
    • Caching: Memcached, Redis
    • Logging and Monitoring: Prometheus, Graphite, StatsD, Nagios, Logstash, Kibana
    • Other: Aurora/Mesos, Tomcat, Elasticsearch, Puppet, Ansible, Terraform