Site Reliability Engineer

Extelligence is an intelligent partner that goes the extra mile. We provide customized information management solutions for major industries. Our team in Prague and Bucharest works with international companies, transforming and adding value to their businesses on a daily basis. We are growing quickly and are looking to bring more talented individuals into our team.

We are seeking an experienced Site Reliability Engineer (SRE) to support our client's live and VOD streaming platform. You will work in a highly dynamic environment, overseeing cloud-hosted infrastructure as code and contributing to the optimization of streaming services. The role requires advanced skills in managing cloud and on-premises services, performance monitoring, and implementing zero-trust security measures.

This role involves managing a complex infrastructure, including load balancers, caching, monitoring, and custom dashboards, and ensuring high availability and resilience across multiple layers. You will also engage in troubleshooting, incident resolution, and performance optimization, especially in distributed environments.

Key Responsibilities:

Infrastructure and System Management

  • Apply advanced knowledge of the K8s platform and related tools.
  • Oversee and optimize AWS Network Load Balancers and HAProxy for both internal app-to-app and external communications.
  • Manage Grafana, Prometheus, Zabbix, and other monitoring tools.
  • Implement rate limits on login systems to ensure secure and reliable access.
  • Utilize Varnish caching and ensure availability across multiple availability zones
    (self-healing for DB instances).
  • Work with Terraform to automate infrastructure provisioning and management.
  • Support Redis and Dragonfly (a multicore Redis rewrite) for high-performance caching
    and data management.

Monitoring and Logging

  • Leverage Percona Monitoring and Management (built on Grafana) for PostgreSQL,
    MySQL, and ProxySQL metrics and advanced session monitoring.
  • Integrate Kibana for extensive log analysis (Nginx, Varnish, HAProxy, Syslogs, and APIs).
  • Utilize Sentry for error tracking, performance profiling, and application-level alerting.
  • Implement Thanos (Prometheus) for scraping and collecting metrics from all monitoring targets.

Data and User Analytics

  • Set up and monitor Firebase Analytics for mobile crash logs and release metrics.
  • Collaborate with BI teams on Conviva for user tracking, visual reports, performance
    insights, and tracking releases and analytics on mobile platforms.

Security and Compliance

  • Work with Cloudflare for CDN, firewall, DDoS protection, SSL offloading, DNS, and
    pre-entry security measures.
  • Enforce zero-trust architecture principles across production environments.

Incident Management and Escalation

  • Provide 24/7 on-call support in collaboration with in-house technicians and third-party
    vendors.
  • Work closely with Nova to manage Kibana (ELK) instances for reactive monitoring.
  • Maintain audit logs for tracking system and user-level events.

Requirements:

Key Qualifications

  • Proven experience in a DevOps/SRE role supporting high-availability live streaming or VOD platforms.
  • Strong skills in AWS, K8s, Terraform, Grafana, and Percona Monitoring.
  • Experience with HAProxy, Redis (including Dragonfly), Kibana, and Varnish.
  • Proficiency in Sentry for error tracking, Thanos (Prometheus) for metrics,
    and Cloudflare for security.
  • Knowledge of Firebase for mobile app analytics and Conviva for performance tracking.
  • Familiarity with zero-trust security principles.
  • Excellent incident management and troubleshooting skills, with availability for 24/7
    on-call support as needed.

Preferred Skills

  • Background in live/VOD streaming environments, handling user tracking
    and performance optimization.
  • Familiarity with Scylla (NoSQL database) for user data and “continue watching”
    functionality.
  • Familiarity with RDS for MySQL for user data.
  • Proficiency in Nginx, syslog management, and in-depth application log analysis.

Working with Extelligence:

  • We take care of the things that matter to contractors. For example, we guarantee on-time payment for your work, so you will never have to chase us for payment.
  • We seek long-term relationships with our team and offer opportunities to extend cooperation beyond the first contract or project.
  • Extelligence is a multicultural team with more than 15 different nationalities.
  • We also organize events to bring our team together, including team-building activities and social events.

Job Type: Contract
Job Location: Hybrid in Prague
