Constructor 2022 Holiday Readiness Program: Ensuring peak performance during peak demand
Overview
Constructor’s conversion optimization and discovery benefits are only as good as our uptime and performance. For this reason, we have a robust process of performance validation and monitoring. During the holiday season and the peak demand period of Black Friday and Cyber Monday, we increase our standards in all of these areas out of recognition that it is the most important selling period for many of our customers.
Survey of peak demand for 2021
In planning our preparations for the 2022 holiday season, we looked back at daily and peak demand changes during the 2021 holiday season and also reviewed how our baseline traffic has increased since then.
Last Black Friday our overall traffic increased 44% compared to Black Friday 2020, which was already 200% over daily baseline levels. Not only did we maintain our 100% uptime, but performance during the peak demand periods actually improved relative to equivalent periods (due to changes in traffic patterns). Since then the system has already scaled without interruption or degradation. For the past several months our average daily traffic has been greater than last year’s Black Friday event.
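To make the compounding of these figures concrete, here is a small worked example. It is a sketch under two assumptions that the text does not spell out: that "200% over baseline" means three times the 2020 daily baseline, and that the 44% increase applies to the 2020 Black Friday volume. The function name is illustrative only.

```python
def peak_multiple(pct_over_baseline: float, yoy_increase_pct: float) -> float:
    """Return the later peak as a multiple of the earlier daily baseline."""
    # "200% over baseline" is read here as 1 + 2.0 = 3.0x the baseline (assumption).
    earlier_peak = 1 + pct_over_baseline / 100
    # A 44% year-over-year increase multiplies that peak by 1.44.
    return earlier_peak * (1 + yoy_increase_pct / 100)


multiple = peak_multiple(200, 44)
print(f"Black Friday 2021 was roughly {multiple:.2f}x the 2020 daily baseline")
```

Under those assumptions, the 2021 peak works out to a little over four times the 2020 daily baseline.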
Performance improvements over the past year
Over the past year we have worked continuously to drive even better performance and scalability, contributing to improved latencies and zero downtime. Some example projects and outcomes include:
- Decreased index update delivery times with incremental indices
- Rolled out a new data center in Asia
- Continued to reduce instance boot and data download times, by a further ~50%
- Introduced parallel processing of requests
- Added more IPs per service region
- Created independently scalable sub-services in each region
- Increased service dependency redundancy
- Improved our internal service monitoring platform
- Upgraded our operations and incident response platform
- Introduced IP-based traffic prioritization
Scale-out performance testing
We have tested scaling to 2000% of current average daily traffic volume, while validating the continued performance of the following:
- Database connections
- Monitoring infrastructure
- Networking infrastructure
- Response latencies at the median and at the 90th, 95th, and 99th percentiles
- Response latencies for each customer, and each product used by each customer
- Data ingestion SLA times
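For readers less familiar with percentile latencies, the following minimal sketch shows how the median, 90th, 95th, and 99th percentiles can be computed from a set of response times. It uses Python's standard library and synthetic data. The function name and the choice of the "inclusive" interpolation method are assumptions made for illustration, not a description of our internal tooling.

```python
import statistics


def latency_summary(latencies_ms):
    """Return median, p90, p95, and p99 for a list of latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points; cuts[i - 1] is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}


# Synthetic sample: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
samples = list(range(1, 101))
summary = latency_summary(samples)
print(summary)
```

Tracking the higher percentiles alongside the median matters because a healthy median can hide a slow tail that only a small share of requests experience.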
Chaos and anti-fragility testing
We also use chaos testing to validate that catastrophic failure of supporting infrastructure does not impact critical features (primarily request/response times for search, autosuggest, browse, recommendations, and collections). We test the following failure scenarios:
- Disabled MySQL
- Disabled index builders
- Disabled personalization queues
- Disabled supplemental ranking engines
- Availability zone and data center failures
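The idea behind these scenarios can be sketched in a few lines: a supporting dependency is deliberately disabled, and the test confirms that the critical path still returns results. The example below is a toy illustration, not our production code. The class names, the `disabled` switch, and the in-memory index are all hypothetical.

```python
class PersonalizationUnavailable(Exception):
    """Raised by the (hypothetical) personalization client when its backend is down."""


class PersonalizationClient:
    def __init__(self):
        self.disabled = False  # chaos switch: set to True to simulate an outage

    def rerank(self, user_id, results):
        if self.disabled:
            raise PersonalizationUnavailable("personalization backend disabled")
        # Toy personalization: move items the user has favorited to the front.
        favorites = {"u1": {"red shoes"}}.get(user_id, set())
        return sorted(results, key=lambda item: item not in favorites)


def search(query, user_id, index, personalization):
    """Serve results from the index; personalization is best-effort only."""
    results = [item for item in index if query in item]
    try:
        return personalization.rerank(user_id, results)
    except PersonalizationUnavailable:
        # Degrade gracefully: return unpersonalized results instead of an error.
        return results


index = ["blue shoes", "red shoes", "green hat"]
client = PersonalizationClient()

healthy = search("shoes", "u1", index, client)
client.disabled = True  # inject the failure
degraded = search("shoes", "u1", index, client)

print("healthy: ", healthy)
print("degraded:", degraded)
```

In this sketch the degraded response loses its personalized ordering but still contains every matching item, which is the property a chaos test of this kind is meant to confirm.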
All of the above is in addition to the rigorous performance test and rollout plan we use for every release:
- Full test suite on every pull request (incremental code change).
- Production traffic replay for all deployment builds (multiple times a week).
- Rolling, risk-adjusted deployment procedures across worldwide data centers.
- Canary deployment for deploys touching critical path request/response lifecycle.
- Automatic build failures if sensitive thresholds on result quality, latency, memory consumption, CPU consumption and more are breached at any of these levels.
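A threshold gate of the kind described in the last item can be sketched as follows. This is a simplified illustration: the metric names and limits are hypothetical, and a real gate would draw its measurements from the test and traffic-replay runs rather than a hard-coded dictionary.

```python
# Hypothetical limits; real thresholds would be tuned per service.
THRESHOLDS = {
    "p99_latency_ms": 250.0,
    "peak_memory_mb": 2048.0,
    "cpu_utilization_pct": 80.0,
}


def find_breaches(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their limit.

    A metric that is missing from the measurements is treated as a breach,
    so that a silent measurement failure cannot pass the gate.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, float("inf")) > limit
    )


def gate(metrics):
    """Return a process exit code: 0 if the build may proceed, 1 if it must fail."""
    breaches = find_breaches(metrics)
    for name in breaches:
        print(f"FAIL: {name} exceeded its limit of {THRESHOLDS[name]}")
    return 1 if breaches else 0


good_build = {"p99_latency_ms": 180.0, "peak_memory_mb": 1500.0, "cpu_utilization_pct": 62.0}
slow_build = {"p99_latency_ms": 310.0, "peak_memory_mb": 1500.0, "cpu_utilization_pct": 62.0}

print("good build exit code:", gate(good_build))
print("slow build exit code:", gate(slow_build))
```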
Standard on-call procedures
At all times we have multiple on-call schedules for the following teams:
- Front-end and client teams
- Data science and result quality teams
- Core platform and response performance teams
Each of these schedules has multiple fallbacks and tiered escalation policies.
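A tiered escalation policy can be pictured as an ordered list of responders, each with a window in which to acknowledge a page before it moves to the next tier. The sketch below is illustrative only: the tier names and timeouts are invented for the example and do not reflect our actual schedules.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    responder: str
    ack_timeout_s: int  # how long to wait for acknowledgement before escalating


def responder_at(policy, seconds_since_alert):
    """Return who holds an unacknowledged page after the given elapsed time."""
    elapsed = 0
    for tier in policy:
        elapsed += tier.ack_timeout_s
        if seconds_since_alert < elapsed:
            return tier.responder
    # The final tier keeps the page until someone acknowledges it.
    return policy[-1].responder


# Hypothetical three-tier policy.
policy = [
    Tier("primary on-call", 300),
    Tier("secondary on-call", 300),
    Tier("engineering lead", 600),
]

for t in (0, 300, 5000):
    print(f"{t:>5}s unacknowledged -> {responder_at(policy, t)}")
```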
Automated alerting
Alerting is automated across dozens of metrics to ensure we are aware of incidents within seconds. A few representative metrics:
- Queuing times
- Per-service latencies
- Memory and CPU consumption
Special holiday on-call procedures
In addition, we take special precautions during peak holiday shopping periods:
- We will over-provision all infrastructure above and beyond typical scale-out policy.
- We will double the on-call rotation, using the automatic notification and escalation policies described above.
- The entire account and product team will be monitoring throughout the Black Friday / Cyber Monday period, with elevated focus on other holiday periods (such as Boxing Day).
Code freeze during critical time periods
We will freeze all deployments except for the most critical fixes from Black Friday through to Cyber Monday.
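As an illustration of how such a freeze window could be enforced in a deployment pipeline, the sketch below computes Black Friday as the day after US Thanksgiving (the fourth Thursday of November) and blocks non-critical deploys through Cyber Monday. The function names and the `critical_fix` flag are hypothetical; this is not a description of our actual tooling.

```python
from datetime import date, timedelta


def black_friday(year):
    """Return Black Friday: the day after the fourth Thursday of November."""
    nov1 = date(year, 11, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    thanksgiving = first_thursday + timedelta(weeks=3)
    return thanksgiving + timedelta(days=1)


def deploy_allowed(day, critical_fix=False):
    """Allow deploys outside the freeze, or critical fixes at any time."""
    start = black_friday(day.year)
    end = start + timedelta(days=3)  # Cyber Monday, inclusive
    in_freeze = start <= day <= end
    return critical_fix or not in_freeze


print("Freeze starts:", black_friday(2022))
print("Routine deploy on 2022-11-26 allowed?", deploy_allowed(date(2022, 11, 26)))
print("Critical fix on 2022-11-26 allowed?", deploy_allowed(date(2022, 11, 26), critical_fix=True))
```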
Conclusion
At Constructor, we take uptime, performance, and service stability very seriously because the best conversion optimization and ML are moot if we don’t deliver fast and stable service consistently. The goal of this document is to provide our customers with a broad overview of our site reliability practices, as well as a specific view of our holiday readiness procedures. As always, please feel free to reach out to your Customer Success Manager if you have any further questions.