
AWS Cloud Ops Monitoring Support Engineer (RARR Job 6271)
For A Next-Generation Global Technology Solutions Company
3 - 5 Years
Full Time
Up to 30 Days
Up to 8 LPA
1 Position(s)
Bangalore / Bengaluru, Coimbatore
Posted By: RARR Technologies Pvt Ltd
Posted / Updated: Today
Job Description
Role Overview:
We are looking for an Infrastructure & Application Operations Engineer to manage and monitor a large-scale AWS-based platform comprising microservices, mobile, and web applications. The role involves proactive monitoring, incident management, advanced troubleshooting, RCA, and continuous improvement, working closely with the JCH DevOps and Infrastructure teams.
Key Responsibilities:
- Infrastructure Operations
  - Monitor AWS infrastructure across all regions, including compute, storage, networking, security, and logs.
  - Manage access controls, including IAM roles, database credentials, and secrets management.
  - Respond to infrastructure alerts and perform first-level and advanced troubleshooting.
  - Track and manage TLS/SSL certificate expirations to avoid service disruptions.
- Microservices & Advanced Application Support
  - Monitor and support 72+ microservices, 8 mobile applications, and 2 web applications, along with underlying AWS infrastructure.
  - Handle complex production escalations, perform root cause analysis (RCA), and ensure timely resolution.
  - Ensure system stability and reliability, preventing recurrence of incidents.
  - Perform quarterly OS and base-image updates and support vulnerability remediation activities.
  - Monitor application health checks, generate health reports, and analyze pod restarts and stability issues in containerized environments.
- Patch & Update Monitoring
  - Monitor and assess the impact of:
    - AWS platform updates
    - JCH-applied patches
    - Third-party component updates
- Monitoring & Observability
  - Use industry-standard monitoring and observability tools including:
    - Datadog
    - Grafana
    - AWS CloudWatch
    - Kibana / OpenSearch
    - Prometheus
  - Ensure comprehensive infrastructure and application monitoring coverage.
- Incident Management
  - Provide 24×7 monitoring, alerting, diagnostics, and escalation support.
  - Follow structured incident management processes with P1–P4 severity classification.
  - Coordinate with internal teams and JCH stakeholders during critical incidents.
- Reporting & Knowledge Management
  - Prepare monthly operational reports, including SLA compliance, uptime, incident trends, and performance metrics.
  - Maintain and continuously update:
    - Standard Operating Procedures (SOPs)
    - Runbooks
    - Configuration documentation
    - Centralized knowledge base
  - Create detailed RCA documentation for infrastructure and platform issues.
- Continuous Improvement & Optimization
  - Continuously improve monitoring by:
    - Reducing false positives
    - Tuning alert thresholds
    - Enhancing observability coverage
  - Recommend cost optimization and performance improvements across infrastructure and applications.
  - Conduct periodic performance and service reviews with the JCH team.
  - Work on pre-defined DevOps and Infrastructure tasks in collaboration with the JCH team, based on associate bandwidth and prioritization.
Experience & Qualifications:
Technical Expertise:
- Strong hands-on experience with AWS infrastructure and operations.
- Experience supporting microservices-based, containerized applications.
- Solid understanding of incident management, monitoring, and production support models.
- Hands-on experience with monitoring and observability tools (Datadog, Grafana, CloudWatch, Prometheus, ELK/OpenSearch).
- Experience with RCA, post-incident analysis, and documentation.