Site Reliability Engineer - Big Data
Location: Reston, VA
Posted on: June 23, 2025
Job Description:
Verisign helps enable the security, stability, and resiliency of
the internet. We are a trusted provider of internet infrastructure
services for the networked world and deliver unmatched performance
in domain name system (DNS) services. We are a mission-focused,
values-driven company where each individual can contribute to
building a stronger, more secure internet. We offer a dynamic and
flexible work environment with competitive benefits and the ability
to grow your career.

Within Verisign, our team is responsible for building and managing
the Verisign Data Platform, which enables the creation of
large-scale, high-throughput (millions of requests per second) data
products and services that deliver actionable operational and
business intelligence. To help us advance the platform, we are
looking for a highly skilled mid-level Site Reliability Engineer
(SRE). This role will play a critical part in ensuring the
stability, performance, and security of our data platforms. An
ideal candidate deeply cares about big data systems and automation,
is fluent in Infrastructure-as-Code and CI/CD, and is eager to
learn as needed. The successful candidate should have an
understanding of fundamentals, including core computer science
concepts, operating systems, networking, file systems, and
databases, accompanied by hands-on experience managing large-scale
distributed systems. Acquiring these competencies typically
requires the equivalent of a bachelor's degree and 6 or more years
of practical work experience; we are also open to other career
paths.

The candidate will be involved in all aspects of the data platform,
including ideation, design, implementation, deployment, customer
onboarding, and support. This implies regular cross-team
collaboration with Data Engineering, Infrastructure, Engineering,
Security, and Operations teams. As part of the team, we expect the
candidate to take ownership of the data platform, regularly
interacting with internal customers and proactively identifying,
prioritizing, and delivering on their common data platform needs.

Key Responsibilities:
- Architect, design, deploy, monitor, and operate large-scale data platforms such as Hadoop, Kafka, Spark, and Druid running both on physical servers and on top of Kubernetes
- Participate in technical designs and proofs of concept for software solutions that combine open-source components, COTS (commercial off-the-shelf) components, and custom-developed components
- Deploy and manage production releases with minimal supervision
- Automate cluster provisioning (CI/CD, Infrastructure-as-Code), scaling, and monitoring using Ansible, Python, Jenkins, Terraform, and other relevant tools
- Build and deploy containerized applications using Docker and Kubernetes
- Troubleshoot complex issues in large, distributed environments
- Upgrade large-scale data platforms (including patching and deploying releases) to improve system capabilities and security while ensuring minimal customer impact
- Perform occasional operations support functions, including problem isolation and resolution
- Participate in the on-call rotation to monitor the health of the production systems and respond to incidents or customer needs
- Ensure platform SLOs by collecting, visualizing, and alerting on relevant telemetry
- Support data platform customers and continuously improve the monitoring, performance, and functionality of the clusters
- Stay up to date with industry data platform best practices and standards, focusing on hybrid cloud environments

The candidate must have:
- Bachelor's degree in computer science or a related technical field, or an equivalent combination of education and experience
- 5 years of experience managing big data platforms (Hadoop, Spark, Kafka, Druid)
- Excellent understanding of Linux configuration and administration
- Strong automation experience: not just developing automation, but knowing why we automate and what to automate
- Strong understanding of Infrastructure-as-Code
- Strong written and verbal communication skills, with the ability to clearly and succinctly describe complex issues
- Familiarity with networking protocols and systems

Desired Skills, Experience, and Attributes:
- Experience with a high-level scripting language such as Python
- Experience with Red Hat Enterprise Linux and/or FreeBSD
- Experience with network troubleshooting using tools such as ping, traceroute, and dig
- Deployment automation experience using tools such as Ansible
- Experience working with teams using Kanban and/or Scrum a plus
- Experience with Docker or Kubernetes in a production environment
- Experience with OpenStack in a production environment
- Experience administering Unix systems in a large-scale environment
- Experience using Jenkins in a continuous integration and delivery environment

This position is based in our Reston, VA office and offers a
flexible, hybrid work schedule. The pay range is $108,900 -
$147,300. The anticipated annual base salary range for this
position is noted above; however, base pay offered may vary
depending on job-related knowledge, skills, and experience.
Verisign offers a discretionary bonus based on individual and
company performance, and certain roles may be eligible for
discretionary stock awards.

Verisign is an equal opportunity employer. That means we recruit,
hire, compensate, train, promote, transfer, and administer all
terms and conditions of employment without regard to race, color,
religion, national origin, sex, sexual orientation, gender
identity, age, protected veteran status, disability, or other
protected categories under applicable law.