We’re a leading global security authority that’s disrupting our own category. Our encryption is trusted by the major ecommerce brands, the world’s largest companies, the major cloud providers, the financial systems of entire countries, entire internets of things, and even little things like surgically embedded pacemakers. We help companies put trust, an abstract idea, to work. That’s digital trust for the real world.

PRIMARY RESPONSIBILITIES AND COMPETENCIES

Perform proactive daily monitoring of our services, including reviewing system and application logs, and manage the incident life cycle (Detection, Confirmation, Notification, Repair/Isolation, Escalation, Resolution and Reporting) to ensure quick service restoration
Repair and recover from hardware or software failures. Coordinate and communicate with impacted stakeholders and clients, escalating where appropriate
Work closely with development and engineering teams to help build, maintain and extend support for all production services
Review the entire environment and execute initiatives to reduce failures and defects and improve overall performance
Monitor and troubleshoot issues across the entire stack - hardware, software, application and network
Demonstrate technical leadership with incident handling and troubleshooting
Document current and future configuration processes and policies
Assist with the implementation and development of SRE tools and applications
Manage and support SRE tools and applications
Perform periodic on-call duty as part of a global team
Must be able to maintain a Trusted role
Able to install and manage web certificates (SSL, Client Auth)
Prior working knowledge of Salt, Splunk, JIRA, Atlassian Wiki, New Relic, and Nagios
Complete and submit daily shift-end reports to peers and management. Complete and submit incident reports in a timely manner

REQUIRED QUALIFICATIONS (EDUCATION, EXPERIENCE, &/OR CERTIFICATION)

BS degree in Systems Engineering, Computer Science or a related field; 2 to 5 years of related experience preferred
Strong working knowledge of application deployment and troubleshooting in Kubernetes, Rancher and Docker environments (essential experience)
2 to 5 years’ experience in web server management, e.g., Apache, Nginx and Linux
In-depth system/OS-level knowledge (essential experience)
TCP/IP Networking experience (advantageous)
Strong knowledge of one or more of the following: AWS, Azure, OCI, OpenStack, MariaDB, Cassandra (advantageous)
On-call duty for a full week, approximately once every 8 weeks, for incident-related work (Senior only, 5+ years of experience)
Working knowledge of scripting languages including shell, Perl, PHP and Python (advantageous)
Effective analytical, troubleshooting and problem-solving skills with respect to network troubleshooting, performance tuning and application data flow
Excellent written, verbal and presentation skills

This job is no longer accepting applications

See more open positions at DigiCert

Site Reliability Engineer