Constructor Holiday Readiness Program: Ensuring peak performance during peak demand
Overview
Constructor’s conversion optimization and discovery benefits are only as good as our uptime and performance. For this reason, we maintain a robust process of performance validation and monitoring. During the holiday season, and especially the peak demand period of Black Friday and Cyber Monday, we raise our standards in all of these areas because this is the most important selling period for many of our customers.
Survey of peak demand for 2020
In planning our preparations for the 2021 holiday season, we reviewed daily and peak demand changes during the 2020 holiday season, as well as how our baseline traffic has grown since then.
Last Black Friday, our overall traffic increased 200% over daily baseline levels and peaked at 500% of baseline. Not only did we maintain 100% uptime, but performance during the peak demand periods actually improved relative to equivalent periods (due to changes in traffic patterns). Since then, average daily traffic has grown by over 383%, and the system has scaled to meet it without interruption or degradation.
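As a quick aid for reading the percentages above, the following sketch converts them into multiples of baseline traffic. It assumes the literal reading of each phrase ("increased N% over" versus "N% of"); the function names are hypothetical and exist only for this illustration.

```python
def multiplier_from_increase(percent_increase):
    """'Increased N% over baseline' is read as baseline * (1 + N/100)."""
    return 1 + percent_increase / 100


def multiplier_from_percent_of(percent_of_baseline):
    """'N% of baseline' is read as baseline * (N/100)."""
    return percent_of_baseline / 100


# Figures taken from the paragraph above.
black_friday_overall = multiplier_from_increase(200)  # 3.0x daily baseline
black_friday_peak = multiplier_from_percent_of(500)   # 5.0x daily baseline
growth_since = multiplier_from_increase(383)          # about 4.83x (source says "over 383%")
```

Under this reading, average daily traffic today is close to the level of last Black Friday's peak.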
Performance improvements over the past year
Over the past year we have worked continuously to drive even better performance and scalability, contributing to improved latencies and zero downtime. Some example projects and outcomes include:
- Optimized scale-out policies
- Introduced stand-by server pools
- Improved instance boot and data download time by ~300%
- Increased performance of personalization service
- Doubled performance of underlying search & browse servers
- Decreased index update delivery times
- Increased database read capacity
Scale-out performance testing
We have tested scaling to 2000% of current average daily traffic volume, while validating the continued performance of the following:
- Database connections
- Monitoring infrastructure
- Networking infrastructure
- Response latencies at the median and at the 90th, 95th, and 99th percentiles
- Response latencies for each customer, and each product used by each customer
- Data ingestion SLA times
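To make the latency checks in this list concrete, here is a minimal sketch of summarizing response times at the median and tail percentiles. The nearest-rank method, the function names, and the sample data are assumptions chosen for illustration; they do not describe Constructor's actual tooling.

```python
import math


def latency_percentile(samples_ms, pct):
    """Return the pct-th percentile of samples_ms using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarize latencies at the median, 90th, 95th, and 99th percentiles."""
    return {f"p{p}": latency_percentile(samples_ms, p) for p in (50, 90, 95, 99)}


# Hypothetical response times in milliseconds: 1, 2, ..., 100.
samples = list(range(1, 101))
report = latency_report(samples)
```

Tracking the tail percentiles alongside the median matters because a small share of slow requests can go unnoticed in an average while still affecting real shoppers.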
Chaos and anti-fragility testing
We also use chaos testing to validate that catastrophic failure of supporting infrastructure does not impact critical features (primarily request/response times for search, autosuggest, browse, recommendations, and collections). The failure scenarios we test include:
- Disabled MySQL
- Disabled index builders
- Disabled personalization queues
- Disabled supplemental ranking engines
- Availability zone and data center failures
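The idea behind these scenarios can be sketched as a critical-path function that degrades gracefully when a supporting service is switched off. Everything named here (`PersonalizationService`, `DependencyDown`, `search`, `chaos_check`) is a hypothetical stand-in for illustration, not Constructor's implementation.

```python
class DependencyDown(Exception):
    """Raised by a supporting service that has been disabled."""


class PersonalizationService:
    """Hypothetical supporting dependency; `enabled` acts as the chaos switch."""

    def __init__(self):
        self.enabled = True

    def rerank(self, results, user_id):
        if not self.enabled:
            raise DependencyDown("personalization unavailable")
        # Toy personalization: reverse the base order.
        return list(reversed(results))


def search(query, user_id, personalization):
    """Critical path: must return results even if personalization fails."""
    base_results = [f"{query}-item-{i}" for i in range(3)]
    try:
        return personalization.rerank(base_results, user_id)
    except DependencyDown:
        # Degrade gracefully to the unpersonalized ranking.
        return base_results


def chaos_check():
    """Disable the dependency and confirm that search still responds."""
    service = PersonalizationService()
    service.enabled = False
    return search("shoes", "user-1", service)
```

A chaos test in this style passes when the critical feature still returns a usable response while the dependency is disabled.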
All of the above is in addition to the rigorous performance test and rollout plan we use for every release:
- Full test suite on every pull request (incremental code change).
- Production traffic replay for all deployment builds (multiple times a week).
- Rolling, risk-adjusted deployment procedures across worldwide data centers.
- Canary deployment for deploys touching critical path request/response lifecycle.
- Automatic build failures if sensitive thresholds on result quality, latency, memory consumption, CPU consumption and more are breached at any of these levels.
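As an illustration of the last point, a threshold gate of this kind might look like the sketch below. The metric names and limits are hypothetical (this document does not state the real thresholds), and `gate_build` is an assumed helper name.

```python
# Hypothetical limits; the real thresholds are not given in this document.
THRESHOLDS = {
    "p99_latency_ms": 250.0,
    "memory_mb": 2048.0,
    "cpu_percent": 80.0,
}


def breached_thresholds(measured, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that are missing or above their limit."""
    breaches = []
    for name, limit in thresholds.items():
        value = measured.get(name)
        # A missing measurement is treated as a breach so gaps cannot pass silently.
        if value is None or value > limit:
            breaches.append(name)
    return sorted(breaches)


def gate_build(measured):
    """Return True if all thresholds hold; raise to fail the build otherwise."""
    breaches = breached_thresholds(measured)
    if breaches:
        raise RuntimeError("build failed, thresholds breached: " + ", ".join(breaches))
    return True
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a build should not pass simply because a measurement was never collected.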
Standard on-call procedures
At all times we have multiple on-call schedules for the following teams:
- Front-end and client teams
- Data science and result quality teams
- Core platform and response performance teams
Each of these has multiple fallbacks and tiered escalation policies.
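A tiered escalation policy can be pictured with the following minimal sketch. The tier names and wait times are invented for illustration and are not Constructor's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationTier:
    responder: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


# Hypothetical policy: each tier is paged if the previous one does not acknowledge.
POLICY = [
    EscalationTier("primary on-call", 5),
    EscalationTier("secondary on-call", 10),
    EscalationTier("engineering manager", 15),
]


def responder_to_page(minutes_unacknowledged, policy=POLICY):
    """Return who should currently be paged for an unacknowledged alert."""
    elapsed = 0
    for tier in policy:
        elapsed += tier.wait_minutes
        if minutes_unacknowledged < elapsed:
            return tier.responder
    # Past every tier: keep paging the final fallback.
    return policy[-1].responder
```

For example, under these assumed wait times an alert that has gone unacknowledged for 7 minutes has already moved from the primary to the secondary responder.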
Automated alerting
Alerting is automated across dozens of metrics to ensure we are aware of incidents within seconds. A few representative examples:
- Queuing times
- Per-service latencies
- Memory and CPU consumption
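One common way to alert on metrics like these is to fire only after several consecutive samples breach a threshold, which filters out one-off spikes. The sketch below shows that rule; the class name, threshold, and sample values are assumptions for illustration and say nothing about the alerting system actually in use.

```python
from collections import deque


class ConsecutiveBreachAlert:
    """Fire when the last `required_breaches` samples all exceed `threshold`."""

    def __init__(self, threshold, required_breaches):
        if required_breaches < 1:
            raise ValueError("required_breaches must be at least 1")
        self.threshold = threshold
        # Only the most recent `required_breaches` outcomes are retained.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, value):
        """Record a sample and return True when the alert should fire."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical per-service latency samples in milliseconds.
alert = ConsecutiveBreachAlert(threshold=200.0, required_breaches=3)
fired = [alert.observe(v) for v in [150, 250, 260, 270, 100, 300]]
```

Here the alert fires only on the fourth sample, once three readings in a row have exceeded the assumed 200 ms limit.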
Special holiday on-call procedures
In addition, we take special precautions during peak holiday shopping periods:
- We over-provision all infrastructure above and beyond our typical scale-out policy.
- We double the on-call rotation, using the automatic notification and escalation policies described above.
- The entire account and product team monitors throughout the Black Friday / Cyber Monday period, with elevated focus during other holiday periods (such as Boxing Day).
Conclusion
At Constructor, we take uptime, performance, and service stability very seriously because the best conversion optimization and ML are moot if we don’t deliver fast and stable service consistently. The goal of this document is to provide our customers with a broad overview of our site reliability practices, as well as a specific view of our holiday readiness procedures. As always, please feel free to reach out to your Customer Success Manager if you have any further questions.