Site Reliability Engineer
Our client is an industry leader that serves more than 100 million customers a year, thanks to a singular focus on customers. They are now innovating with technology to enhance the customer experience.
We are migrating our systems from an on-premises data center to the cloud. As our Site Reliability Engineer, you will work with application development and infrastructure teams to monitor, iterate on, and improve our performance, SLAs, and availability. You will also have the opportunity to implement a new monitoring and alerting stack using your favorite cloud technologies.
The position involves the following day-to-day responsibilities:
- Own our logging and monitoring toolchain – development, hosting, and improvement
- Application Performance Monitoring
- Event Correlation
- Log Mining
- Provide subject matter expertise in the areas of performance, logging, and alerting
- Work with security and compliance to ensure we are managing risk appropriately
- Diagnose issues in concert with your peers
The position calls for the following qualifications:
- Bachelor’s degree in computer science or other relevant major, or equivalent additional training or background
- 4+ years’ experience
- Experience with cloud providers (e.g. AWS, Azure, Google Cloud)
- Familiarity with open-source tools such as Elasticsearch, Logstash, and Kibana
- Experience with existing APM tools such as AppDynamics, New Relic, etc.
- Experience with incident and change management tools such as ServiceNow, Jira Service Desk, etc.
- Relevant experience with business intelligence and analytics, as it pertains to creating visualizations and event correlations
- Excellent written and verbal communication skills
- Familiarity with machine learning-based log mining tools such as Sumo Logic and Splunk
- Strong understanding of data security and privacy technologies
- A solid understanding of data security and privacy laws
- Experience with at least one programming or scripting language
- Basic understanding of Linux/Unix administration and troubleshooting
- Basic understanding of Windows administration and troubleshooting
- Knowledge of configuration management tools like Ansible
- Knowledge of infrastructure provisioning tools like Terraform
- Experience with Docker containers