Job Description
Salary:
Overview :
We are seeking a highly skilled and experienced Datadog Subject Matter Expert (SME) to join our team. In this role, you will leverage your in-depth knowledge of Datadog’s monitoring, observability, and cloud infrastructure capabilities to guide the implementation, optimization, and maintenance of Datadog solutions across our systems. The ideal candidate will have extensive experience with Datadog’s features, including APM, log management, infrastructure monitoring, and cloud security.
This position will require close collaboration with cross-functional teams, including DevOps, IT, engineering, and security, to ensure Datadog is being used effectively to monitor system performance, diagnose issues, and enhance overall operational efficiency.
Key Responsibilities :
- Datadog Implementation and Integration :
Lead the deployment and integration of Datadog solutions across various cloud environments (AWS, Azure, GCP) and on-premise systems. Ensure Datadog agents, monitors, dashboards, and integrations are configured according to best practices.
- Monitoring and Troubleshooting :
Provide subject matter expertise in configuring Datadog’s monitoring tools for infrastructure, applications, logs, and security. Troubleshoot complex issues related to system performance, application logs, and infrastructure alerts using Datadog.
- Optimization :
Continuously monitor and optimize Datadog usage to ensure efficient use of resources, minimize costs, and maximize the value derived from the platform. Recommend and implement improvements to existing Datadog setups.
- Custom Dashboards & Alerts :
Design, create, and maintain custom Datadog dashboards and alerting systems tailored to the needs of different teams and stakeholders. Ensure alert thresholds, notification channels, and escalation paths are properly configured.
- Collaboration & Knowledge Sharing :
Act as the primary point of contact for Datadog-related inquiries. Work closely with DevOps, IT, and engineering teams to ensure Datadog is aligned with business goals. Provide training and mentorship to junior team members.
- Data Analysis & Reporting :
Analyze data collected from Datadog to provide actionable insights into application performance, infrastructure health, and overall system reliability. Generate regular reports and provide recommendations for process improvements.
- Best Practices & Documentation :
Establish and enforce Datadog best practices for monitoring, alerting, and dashboard design. Create and maintain comprehensive documentation for Datadog configurations and integrations.
- Stay Current with Datadog Updates :
Keep up to date with the latest Datadog features, releases, and industry trends. Provide recommendations for adopting new tools or features that can enhance system observability.
Qualifications :
- Technical Skills :
- Deep knowledge of Datadog, including its full suite of monitoring and observability tools (Infrastructure Monitoring, APM, Log Management, Synthetics, etc.).
- Experience with cloud platforms (AWS, Azure, GCP) and container technologies (Docker, Kubernetes).
- Familiarity with programming/scripting languages (e.g., Python, Bash, Go) for automating and customizing Datadog tasks.
- Strong understanding of network protocols, databases, and application performance management (APM).
- Experience with CI/CD pipelines, DevOps tools, and integration of Datadog in modern software delivery workflows.
- Experience :
- Minimum of [X] years of experience in a monitoring, DevOps, or site reliability engineering (SRE) role, with at least [Y] years of hands-on experience specifically with Datadog.
- Proven experience deploying and configuring Datadog in complex, distributed environments.
- Experience in troubleshooting performance bottlenecks, diagnosing application errors, and providing root cause analysis.
- Soft Skills :
- Strong problem-solving and analytical skills with the ability to quickly understand complex systems.
- Excellent communication skills, with the ability to translate technical concepts for non-technical stakeholders.
- Ability to work independently and as part of a collaborative team.
- Strong organizational skills and attention to detail.
Preferred :
- Datadog certifications (e.g., Datadog Certified Expert).
- Salesforce experience.
- Experience with log aggregation, APM, and synthetic monitoring.
- Familiarity with cloud security monitoring using Datadog.
- Experience with infrastructure-as-code (IaC) tools like Terraform or CloudFormation.
Remote work.