Site Reliability Engineer

About Parkar:

We love building software products. With a decade of experience and a global presence across four countries, we've established ourselves as a trusted partner for over 100 organizations, helping them leverage technology to drive transformative growth. Staying at the forefront of technological advancements, we actively explore and integrate the latest trends into our solutions. From cloud computing and blockchain to AI-driven operations (AIOps), Generative AI, and Machine Learning, we enable our clients to stay ahead in the rapidly evolving digital landscape.

Our flagship platform, Vector, harnesses the power of technology to revolutionize IT operations. The latest release, Vector 2.0, introduces cutting-edge GenAI capabilities, empowering our clients to achieve unprecedented efficiency and drive business growth.

Join us on this journey of innovation, where we apply technology to drive meaningful change and shape the future of business. Partner with Parkar Digital and experience the transformative power of AI.

For more info, visit our website: www.parkar.in

LinkedIn - https://www.linkedin.com/company/parkar-digital/

We are looking for a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will focus on enhancing system reliability, automation, and performance, ensuring high availability and scalability of our applications. You will work closely with development and operations teams to improve deployment pipelines, monitoring, and incident response.

Key Responsibilities

Design, develop, and maintain scalable, reliable, and secure infrastructure.

Implement monitoring, logging, and alerting solutions using tools like Datadog (preferred); experience with Prometheus, Grafana, ELK Stack, or Splunk is an advantage.

Improve system observability and enhance incident response through automation and root cause analysis.

Optimize CI/CD pipelines to ensure smooth deployments and minimal downtime.

Automate infrastructure provisioning and management using Terraform, Ansible, or Kubernetes.

Ensure high availability and disaster recovery through load balancing, failover mechanisms, and backups.

Collaborate with development teams to enhance application performance, reliability, and scalability.

Manage cloud-based environments (AWS, Azure, or GCP) and keep resource utilization efficient.

Strengthen security practices, including vulnerability assessments and patch management.

Participate in on-call rotations to troubleshoot and resolve critical system issues.

Key Skills and Qualifications

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.

Strong knowledge of Linux/Unix systems and shell scripting.

Hands-on experience with cloud platforms (AWS, Azure, or GCP).

Expertise in Kubernetes, Docker, and container orchestration.

Experience with CI/CD tools like Jenkins, GitHub Actions, or GitLab CI.

Proficiency in Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation.

Solid experience with monitoring and observability tools (Prometheus, Grafana, ELK, Splunk, or New Relic).

Strong knowledge of networking, security, and system architecture.

Experience with scripting languages like Python, Bash, or Go.

Familiarity with database performance tuning and optimization.

Strong problem-solving skills and the ability to work in a fast-paced, Agile environment.