Datto, the world’s leading provider of IT solutions delivered through managed service providers, is looking for a Staff Site Reliability Engineer to join a growing team. Datto is a creative company at its core and is an exciting and dynamic workplace. We're 100% focused on our managed service provider partners and believe that with the right technology, managed service providers can change how businesses around the world operate. Datto provides data protection, business continuity, networking, business management, and file backup and sync products that empower and protect the clients of our 17,000+ partners. We're headquartered in Norwalk, Connecticut, and have 22 offices worldwide. Learn more at datto.com.
More than someone who checks every box, we’re looking for people who are excited to work and grow at Datto. If that's you, we hope you apply for the role!
You enjoy teamwork
You come with new ideas and a unique point of view. You look forward to collaborating with a diverse team. You eagerly seek and give help. Transparency tops your list of values, and you contribute to a culture of respect and inclusion.
Inquisitive and focused, you see every challenge as an opportunity. You would rather create the future than wait for it.
You’re customer-focused and take pride in your work.
You put extra attention into details with all you do. You care about the work you provide to customers and how it reflects on yourself and Datto. When you find or see something wrong, you attempt to resolve it. You look for opportunities to not only better yourself, but others around you. You aim to be the best that you can be and always do the right thing.
What you’ll do
The Staff Site Reliability Engineer will ensure the overall system reliability, uptime, health, and performance of Datto’s Partner Portal. The Partner Portal is the heart of our partners’ interaction with Datto. From the Partner Portal, they can get the latest partner news, open tickets, check on the status of their devices and networks, administer their users and clients, get help and support, and much more.
You’ll work closely with various stakeholders to understand and help shape the architecture and design of the Partner Portal. The goals are to resolve service-impacting issues quickly, detect and self-heal problems before they become service-impacting, and give developers the information and data they need to improve the long-term reliability of the platform.
Your job function and responsibilities include:
- Develop, deploy, operate, and maintain the systems, services, and tooling for Datto’s Partner Portal that provide constant feedback to stakeholders
- Implement best practices promoting service availability/reliability and fault tolerance
- Serve as a quality and reliability ambassador as part of an Agile software development team
- Drive product reliability improvements through monitoring, alerting, and application of software development best practices
- Identify creative ways to break the products, uncover and report defects, and validate that systems/solutions are operating as intended
- Write, review, and execute test plans/strategies for validating product/system performance, scalability, and reliability
- Build and maintain test frameworks and environments for executing performance, scale, system, and resiliency tests
- Write code to automate operational tasks, reducing the need for human interaction and improving the reliability and repeatability of these processes
- Assist in finding, reproducing, and characterizing defects and triaging them with Product Management, Software Development, and Quality Assurance Engineers
- Participate in release processes/pipelines and in monitoring/debugging production environments
- Maintain open communication with Engineering and Product teams around system performance and reliability
- Educate your peers in QA, Engineering, and Product on performance and scalability best practices, metrics, and guidelines
- Maintain and communicate testing timelines, schedules and status reports
- Suggest and drive efforts to improve testing processes and methodology, metrics collection / success measurement, test coverage, and product reliability
- Troubleshoot complex issues quickly and effectively; continually improve processes and reliability based on post-mortem analysis
- Participate in a rotational on-call program and enhance troubleshooting techniques and utilities to ensure quick resolution of service-impacting issues
- Other duties as assigned by Management
- A Bachelor’s degree in Computer Science, Management Information Systems or Software Engineering; or equivalent work experience
- 5+ years of hands-on experience with performance, scalability, and reliability testing techniques, tools, and activities
- Experience with software build, package, configuration, and release management tools (e.g., GitLab, Jenkins, Ansible, Salt, Puppet)
- Proficient with Linux, MySQL, and Shell scripting
- Familiar with containerization (Docker/Kubernetes) concepts
- Familiarity with object-oriented programming languages and concepts (Python, Java, Golang, etc.) and exposure to writing automated tests with common test frameworks such as Pytest
- Familiarity with software networking and basic network troubleshooting
- Knowledge of infrastructure (networking, hypervisors, storage, security, etc.); experience working with a private cloud is a plus
- Excellent problem-solving skills, and the ability to troubleshoot complex issues quickly and effectively
- Able to find opportunities for improvement and tackle them without external direction
- Ability to “think outside of the box” and find creative solutions to operational problems
- Dedication to collaboration, “teaching others to fish”-style knowledge sharing, and cross-training
- Solid understanding of agile software product cycles (Scrum and Kanban)
- Independent contributor who enjoys taking ownership of things and driving them to completion
- Strong root cause analysis and troubleshooting competency
- Strong tendency to automate and monitor everything
- Excellent communication skills
- Ability to operate in a fast-paced environment
- Self-motivated & willing to learn
- Ability to work independently and as part of a team
- Experience with testing APIs and implementing API automated testing
- Experience with data visualization tools such as Kibana and Grafana
- Familiarity with success measurement (SLI/SLO/SLA)
- Experience with metrics collection, time series queries, middleware such as Telegraf, and backends such as OpenTSDB or Prometheus
- Exposure to chaos engineering tools and load testing techniques at scale
At Datto, we’re committed to cultivating a healthy, positive, and growth-enabling environment. We are proud of our wide-ranging benefits package, which is available to all full-time employees, including:
- Comprehensive health-care benefits
- Flexible paid time off policy
- Generous paid parental leave
- “Datto University” virtual onboarding program
- Access to more than 5,000 courses via LinkedIn Learning
- Education reimbursement
- Employee Assistance Program
- Headspace App
- Charity match program
- A dynamic and socially active work culture, including Employee Resource Groups
- Networking and career development opportunities
- And more!
By submitting an application, you acknowledge we will process your data in order to consider you for the position you apply for and for other open positions within our company for which you may be suited. We collect and store your data in accordance with our Recruiting Privacy Practices.
Datto is an equal opportunity employer.