Site Reliability Engineer

DigiCert

Software Engineering
India
Posted 6+ months ago

ABOUT DIGICERT

We’re a leading, global security authority that’s disrupting our own category. Our encryption is trusted by the major ecommerce brands, the world’s largest companies, the major cloud providers, entire country financial systems, entire internets of things and even down to little things like surgically embedded pacemakers. We help companies put trust—an abstract idea—to work. That’s digital trust for the real world.

PRIMARY RESPONSIBILITIES AND COMPETENCIES

  • Perform proactive daily monitoring of our services, including reviewing system and application logs, and manage the incident life cycle (Detection, Confirmation, Notification, Repair/Isolation, Escalation, Resolution and Reporting) to ensure quick service restoration
  • Repair and recover from hardware or software failures. Coordinate and communicate with impacted stakeholders and clients, escalating where appropriate
  • Work closely with development and engineering teams helping to build, maintain and extend support for all production services
  • Review the entire environment and execute initiatives to reduce failures and defects and improve overall performance
  • Monitor and troubleshoot issues across the entire stack - hardware, software, application and network
  • Demonstrate technical leadership with incident handling and troubleshooting
  • Document current and future configuration processes and policies
  • Assist with the implementation and development of SRE tools and applications
  • Manage and support SRE tools and applications
  • Perform periodic on-call duty as part of a global team
  • Must be able to maintain a Trusted role
  • Able to install and manage web certificates (SSL, Client Auth)
  • Prior working knowledge of Salt, Splunk, JIRA, Atlassian Wiki, NewRelic, and Nagios
  • Complete and submit daily shift-end reports to peers and management. Complete and submit incident reports in a timely manner

REQUIRED QUALIFICATIONS (EDUCATION, EXPERIENCE, AND/OR CERTIFICATION)

  • B.S. degree in Systems Engineering, Computer Science or a related field; 2 to 5 years of related experience preferred
  • Strong working knowledge of application deployments and troubleshooting in Kubernetes, Rancher and Docker environments (essential experience)
  • 2 to 5 years' experience managing web servers and related systems, e.g., Apache, Nginx and Linux
  • In-depth system/OS-level knowledge (essential experience)
  • TCP/IP Networking experience (advantageous)
  • Strong knowledge in one or more of the following: AWS, Azure, OCI, OpenStack, MariaDB, Cassandra (advantageous)
  • On-call duty for a full week at a time, roughly once every 8 weeks (estimated), covering incident-related work (Senior only, 5+ years of experience)
  • Working knowledge of scripting languages including shell, Perl, PHP and Python (advantageous)
  • Effective analytical, troubleshooting and problem-solving skills, particularly in network troubleshooting, performance tuning and application data flow
  • Excellent written, verbal and presentation skills
