Site Reliability Engineer (SRE)

Job Description

Industry: IT Services and IT Consulting
Seniority for this role: Mid-Senior level
Job Title: Production Environment Manager / Site Reliability Engineer (SRE)
Type: Hybrid
Term: Daily Rate Contract, Long Term – Starts with 6 months
Multiple positions available

Role Overview

The Production Environment Manager will be responsible for overseeing and improving the stability, performance, and reliability of production systems. This role focuses on ensuring efficient operations, resolving incidents, automating processes, and enhancing monitoring strategies to optimize platform performance. The candidate will be a key player in managing CI/CD pipelines, automating infrastructure processes, and mentoring junior resources. This position requires a blend of hands-on technical expertise, problem-solving skills, and the ability to collaborate with global teams to drive platform resilience and reliability.

Key Responsibilities

Production Management
- Plan, manage, and oversee all aspects of the Production Environment to ensure system stability, availability, and performance.
- Define and implement strategies for Application Performance Monitoring (APM), optimization, and proactive performance improvements.
- Respond to production incidents, conduct root cause analysis, and implement fixes to reduce incident recurrence.
- Measure and document incident reduction trends over time while enhancing system reliability.

Monitoring & Optimization
- Design, develop, and standardize monitoring and alerting mechanisms to provide end-to-end visibility for production applications.
- Take a holistic approach to problem-solving during production incidents, diagnosing issues across the entire technology stack to minimize recovery time.
- Continuously analyze platform performance, identify operational gaps, and recommend improvements.

DevOps & CI/CD Support
- Support the deployment of code across multiple environments (dev, staging, production).
- Maintain and optimize CI/CD pipelines using tools like Jenkins and scripting languages (Groovy, YAML, Shell).
- Ensure seamless software promotion into higher environments with operational gating and validations.
- Lead automation initiatives across infrastructure, deployment, and monitoring to improve speed and efficiency.

System Reliability & Scaling
- Improve system scalability through automation and sustainable system evolution.
- Proactively measure and monitor availability, latency, and system health, ensuring high standards of performance.
- Engage in end-to-end lifecycle management of services, from inception and design to deployment, operation, and optimization.
- Participate in system design consulting, capacity planning, and launch readiness reviews.

Collaboration & Mentorship
- Collaborate with globally distributed teams across multiple time zones and tech hubs.
- Share knowledge with team members, mentor junior engineers, and foster a culture of learning and collaboration.
- Conduct training sessions and workshops as needed to improve team understanding of processes, tools, and systems.

On-Call & Off-Hours Support
- Perform on-call duties on a rotational basis, ensuring swift incident response and resolution.
- Willingness to work off-hours for urgent incidents, deployments, or planned maintenance activities.

Requirements

Must-Have Skills
- Production Support Experience: Proven experience in supporting cloud-based applications (AWS, Azure, GCP, etc.) in a production environment.
- Automation & Configuration Management: Expertise with Ansible or Chef for automating infrastructure and application processes.
- CI/CD Pipelines: Proficiency in managing CI/CD pipelines using tools like Jenkins. Experience writing and troubleshooting Groovy scripting and YAML configurations.
- Linux Administration: Strong knowledge of Linux operating systems, including system troubleshooting, performance tuning, and shell scripting.
- Scripting & Automation: Proficiency in Shell scripting for automating workflows and resolving incidents.
- Monitoring & Troubleshooting: Hands-on experience designing monitoring solutions and resolving complex system issues across distributed systems.
- Incident Management: Experience in responding to incidents, performing root cause analysis, and driving incident resolution processes.

Good-to-Have Skills
- ITSM tools experience (e.g., ServiceNow, Jira).
- Experience working with observability platforms (e.g., Prometheus, Grafana, Splunk, Datadog).
- Knowledge of Infrastructure-as-Code (IaC) tools like Terraform.
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Exposure to Agile methodologies and DevOps best practices.
- Experience with capacity planning and system design consulting.

Soft Skills
- Strong problem-solving and analytical skills, with the ability to diagnose issues across the technology stack.
- Excellent verbal and written communication skills to collaborate with global teams.
- Ability to mentor and train junior resources effectively.
- Team player with a proactive mindset and passion for automating repetitive tasks.
- Flexibility to work in a dynamic, fast-paced environment with occasional off-hours support.

Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in Production Support, DevOps, or Site Reliability Engineering roles.
- Relevant certifications (e.g., AWS/GCP/Azure, Ansible, Jenkins) are a plus.

Key Performance Indicators (KPIs)
- Reduction in incident count and Mean Time to Recovery (MTTR).
- Improved system uptime, performance, and availability.
- Efficiency of CI/CD pipelines and automation processes.
- Adoption and effectiveness of monitoring and alerting systems.
- Contribution to knowledge sharing and team mentoring.