Client Profile
Our client is a publicly traded, Fortune 50 general merchandise retailer with more than 1,800 stores spread across every U.S. state and the District of Columbia. It ranks among the 10 largest retailers in the U.S. and employs more than 350,000 people. In 2019, it reported more than $75 billion in revenue. The client also owns 45 brands unique to its stores and operates several subsidiaries that range from a grocery delivery service to a skin care and beauty e-commerce site.
Business Challenge
The client had acquired a software company with a proprietary, cloud-based platform to simplify the process of selling and activating connected products such as mobile phones. The platform integrates with the largest cellular network providers and supports ancillary processes like payment programs, insurance, fraud protection, and warranties.
The software company ran its applications on virtual machines in a data center. As part of the acquisition, the client wanted to segregate these servers from the legacy company, moving them to a reliable and scalable platform.
The client’s new division also needed to mature its processes to handle the demands of joining such a large corporation, including stricter security requirements, more redundancies, faster deployments, and the capacity to handle a steady stream of new feature sets.
The product platform had been migrated to the Amazon Web Services (AWS) cloud computing environment, but it did not take advantage of all cloud capabilities and faced several challenges:
- Site Reliability Engineers (SREs) needed a better way to handle the surging volume of work in a fast-paced, agile environment that supported upwards of 10 development teams, hundreds of AWS instances, and weekly deployments.
- Important but lower-priority items suffered constant delays.
- Expected cost savings were often delayed or never realized.
- Legacy operations, monolithic applications, and outdated real-time system health dashboards were difficult to maintain and required many manual processes and interventions.
- When issues occurred, manual escalation practices delayed responses and restoration activities.
- Tracking high volumes of infrastructure changes manually was difficult and time-consuming.
- A lack of modular or customizable building blocks led to inefficient environment builds.
To achieve the stable, reliable, and highly available service the client expected from the platform, the new division’s operations needed to mature to a higher standard. More cohesive, heavily automated, efficient processes were required to accommodate the larger pipeline of work and manage the various environments from development to production. The client also needed to ensure the platform could handle peak periods during the holiday shopping season, when transaction volume on the platform increased significantly (40X).
Solution & Approach
The client engaged Auxis as a key partner to drive automation and to mature its SRE, DevOps, security, and overall operations.
Auxis provided DevOps support for the division, serving as the central knowledge base for the platform, engineering tasks, deployments, and production support. In this role, Auxis gained valuable insight into the platform’s struggles and identified automation opportunities that improved overall results by streamlining processes, controlling costs, and delivering faster, more consistent deployments.
Key transformations included:
- Scaling the system to handle the far higher load (about 40X) during peak holiday seasons. Auxis worked to streamline the scalability of the platform to meet business needs.
- Managing cloud infrastructure using a DevOps mindset. In collaboration with the client, Auxis DevOps experts adopted a unified approach built on Chef and Terraform to automate builds. This combination not only allows more agility but also keeps changes consistent across environments. Auxis also created new Lambda functions that scanned for frequent infrastructure changes, automatically creating or modifying performance dashboards.
- Automating CI/CD (Continuous Integration/Continuous Delivery) and monitoring/availability for immediate awareness of potential issues. Adopting a modular, integrated solution that combined CloudWatch metrics, alarms, and automated actions ensured all changes across environments would be automatically monitored using statistics like CPU usage, network throughput, memory and disk utilization, and unhealthy instances.
- Integrating alerts with the PagerDuty incident notification solution, generating immediate notifications and responses. The teams also integrated the Slack communications tool with GitHub, heightening awareness of potential issues like build failures and triggering faster resolutions.
- Establishing operational efficiencies with automated content pushes and health checks. Auxis wrote Ansible plays that automated the scripts used to copy databases and swap them into the production environment with minimal effort. Other plays ran health checks and verified that code was updated properly during deployments. Additional scripts automatically copied artifacts linked to specific releases from development to production without manual intervention.
- Creating modules to simplify deployment of changes. To establish a routine and reusable foundation for building new infrastructure, Auxis created modules for all major resources, including AWS instances, load balancers, base applications, databases, and monitoring. Every module was created with versioning capabilities, making the rollout of changes as simple as increasing the version of the module.
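As a rough illustration of the monitoring approach described in the list above, the following Python sketch generates CloudWatch alarm definitions for a set of instances so that every instance is covered by the same rules. It is not the implementation Auxis built. The function name `build_alarm_params`, the thresholds, the alert topic ARN, and the instance ID are all hypothetical and chosen only for the example.

```python
from typing import Dict, List

# Hypothetical SNS topic that would forward alerts to an incident tool.
# This ARN is an assumption for illustration, not taken from the case study.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-pagerduty-alerts"

# (metric name, statistic, threshold) for two standard AWS/EC2 metrics.
# The threshold values are illustrative assumptions.
METRIC_RULES = [
    ("CPUUtilization", "Average", 80.0),
    ("StatusCheckFailed", "Maximum", 1.0),
]


def build_alarm_params(instance_ids: List[str]) -> List[Dict]:
    """Return one alarm definition per (instance, metric) pair."""
    alarms = []
    for instance_id in sorted(instance_ids):
        for metric, statistic, threshold in METRIC_RULES:
            alarms.append({
                "AlarmName": f"{instance_id}-{metric}",
                "Namespace": "AWS/EC2",
                "MetricName": metric,
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Statistic": statistic,
                "Period": 300,             # seconds per evaluation window
                "EvaluationPeriods": 2,    # consecutive breaching windows
                "Threshold": threshold,
                "ComparisonOperator": "GreaterThanOrEqualToThreshold",
                "AlarmActions": [ALERT_TOPIC_ARN],
            })
    return alarms


if __name__ == "__main__":
    # Hypothetical instance ID, used only to show the generated names.
    for params in build_alarm_params(["i-0123456789abcdef0"]):
        print(params["AlarmName"], params["Threshold"])
```

The dictionary keys match the keyword arguments of the boto3 CloudWatch client's `put_metric_alarm` call, so each definition could be applied with `put_metric_alarm(**params)`. Generating the definitions from a list keeps the rules identical across environments, which is the consistency benefit the list above describes.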
Results
By applying automation and other best practices, Auxis streamlined delivery management, system maintenance, and application deployments for the client, creating a more scalable model going forward. Key benefits include:
- 100%+ faster deployment times with more successful outcomes: The time it took to create new environments plummeted. Deployments now take less than 30 minutes on average instead of several hours, making them more than twice as fast.