Unified Platform Monitoring
How One Customer Improved Application Observability through Unified Platform Monitoring
Location: United States
Company Size: Large Enterprise (210,000+ employees)
Solution: Unified Platform Monitoring, Incident Triage
The client was a national franchise network with a robust e-commerce program generating over 23 million user sessions per month. As a result of COVID-19 in-store restrictions, they experienced an unprecedented surge in e-commerce traffic, and platform stability became a growing concern.
Confronted with a marked increase in production challenges, the client needed optimized monitoring and defect triage processes. In particular, regular outages and a large backlog of defects caused measurable customer friction, resulting in hundreds of thousands of dollars of lost revenue.
“With an infrastructure as big as ours, it was impossible to know all the issues and incidents when they occurred. And when we did find an issue, half the time we discovered it after the fact because an end user had reported it to us.”
The first priority was to tackle monitoring visibility across the entire organization. Production defects were difficult to identify and diagnose, and then challenging to resolve across the Site Reliability, Product Management, and Engineering teams.
The DORIAN Group worked across multiple departments in the organization to first gather solution requirements and better understand internal operations. They conducted a multi-vendor proof of concept (POC), then implemented the organization’s chosen solution: Datadog, an application monitoring system that allows teams to monitor the health of their distributed applications.
Within months, the client had a comprehensive and user-friendly observability platform across the entire e-commerce organization, allowing them to:
- Enhance their ability to identify and resolve active failures
- Create consolidated dashboards with custom alarms across Site Reliability, Engineering, and Service Desk
- Use synthetic monitoring to identify user pain points before they cause major production disruptions
- Break down communication silos between departments
- Quickly diagnose the root cause of failures
With this new observability tool in place, The DORIAN Group then worked with the client’s team to standardize priority designations for issues, creating a streamlined triage and resolution process.
“Datadog brought in so many improvements. Through our work with The DORIAN Group, we now have complete visibility on our issues and the capability of being proactive rather than reactive. Now, 90% of the time, we see problem areas as they arise and can act on them immediately… It’s been a tremendous asset to our team and we’ve seen huge improvements across the board.”
With the help of The DORIAN Group, the client optimized their platform monitoring and defect resolution across the organization.
Mean Time to Recovery (MTTR) was reduced by 60%, and within weeks defect resolutions consistently exceeded creations. These resolved issues in turn increased customer satisfaction, leading to a significant improvement in customer mobile app ratings.
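For readers less familiar with the metric, the sketch below shows one common way MTTR and a percentage reduction can be computed from incident records. It is illustrative only: the function names and the sample timestamps are hypothetical and are not the client's data or the method used in this engagement. The sample values were simply chosen so that the arithmetic works out to a 60% reduction.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of incident timestamp pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before` (both timedeltas)."""
    return 100.0 * ((before - after) / before)


# Hypothetical incidents as (detected_at, resolved_at) pairs.
incidents_before = [
    (datetime(2020, 3, 1, 9, 0), datetime(2020, 3, 1, 13, 0)),   # 4 hours
    (datetime(2020, 3, 8, 14, 0), datetime(2020, 3, 8, 20, 0)),  # 6 hours
]
incidents_after = [
    (datetime(2020, 9, 1, 9, 0), datetime(2020, 9, 1, 10, 0)),   # 1 hour
    (datetime(2020, 9, 8, 14, 0), datetime(2020, 9, 8, 17, 0)),  # 3 hours
]

mttr_before = mean_time_to_recovery(incidents_before)  # mean of 4h and 6h
mttr_after = mean_time_to_recovery(incidents_after)    # mean of 1h and 3h
reduction = percent_reduction(mttr_before, mttr_after)

print(f"MTTR before: {mttr_before}, after: {mttr_after}, reduction: {reduction:.0f}%")
```

In practice the detection and resolution timestamps would come from the incident tracking or monitoring system rather than being entered by hand, and teams differ on whether the clock starts at failure, detection, or ticket creation.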