Tech numbers and how we scaled for the biggest beauty week (Black Friday) in Brazil!

Marcio Ribeiro
6 min read · Dec 3, 2020
Beauty Week 2020

Beleza Na Web and Grupo Boticário hit their largest beauty week numbers across the entire e-commerce platform this year. Here are some of the tech numbers and the lessons we learned while preparing for the week.

Photo: Pieter Stam De Jonge/AFP/Ritzau Scanpix
  • 4 Kubernetes clusters (AWS EKS).
  • 60+ microservices.
  • 476 K8s EC2 worker nodes.
  • 4,000+ pods.
  • 200+ deployments.
  • 15 TB of RAM in use across the Kubernetes clusters.
  • 900 CPU cores.
  • 250,000 requests per minute on the API gateway.
  • 197,000 requests per minute on the storefront.
  • 13,000+ active searches per minute on the Elasticsearch cluster (self-managed on EC2).
  • 1 Elasticsearch cluster with 200 i3.2xlarge data nodes.
  • 25 RDS PostgreSQL databases.
  • 25 MongoDB replica sets + 1 sharded cluster (all self-managed on EC2).
  • 6.5 million Lambda executions on our main function.
Sudden spikes of over 100% in traffic in just a few minutes.

A 100% spike in traffic in less than 5 minutes, with all average response times staying in check.

We achieved this by testing our environment and pre-scaling to the expected demand before it actually arrived.
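
This is not our exact tooling, but a minimal sketch of the idea using the official Kubernetes Python client: raise the HPA floor (minReplicas) for a service ahead of the announced promotion window so the pods are already warm when traffic doubles. The deployment name, namespace and replica count here are illustrative.

```python
# Sketch: pre-scale a service by raising its HPA minReplicas before the peak.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
autoscaling = client.AutoscalingV1Api()

def pre_scale(hpa_name: str, namespace: str, min_replicas: int) -> None:
    """Raise the HPA floor so the deployment scales out before demand arrives."""
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=hpa_name,
        namespace=namespace,
        body={"spec": {"minReplicas": min_replicas}},
    )

# e.g. an hour before the promotion goes live (values illustrative):
pre_scale("storefront", "production", min_replicas=25)
```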

Stress testing was essential to achieving this result. We tested at 2.5x the volume expected for the beauty week.

We performed the backend tests using Locust scripts written in house; front-end testing was done with a partner.
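
Below is a minimal sketch of what one of those Locust scripts can look like; the endpoints, weights and think times are illustrative, not our actual test plan.

```python
# locustfile.py — minimal backend load-test sketch (hypothetical endpoints).
from locust import HttpUser, task, between

class StorefrontUser(HttpUser):
    wait_time = between(1, 3)  # think time between requests, in seconds

    @task(3)  # search is weighted 3x heavier than product views
    def search(self):
        self.client.get("/api/search", params={"q": "shampoo"})

    @task(1)
    def product_page(self):
        self.client.get("/api/products/12345")
```

Run it headless against a test host while ramping users, e.g. `locust -f locustfile.py --host https://staging.example.internal --headless --users 5000 --spawn-rate 100`, and watch the backend dashboards as the load ramps.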

Several application optimizations were required, but we also adjusted some infrastructure numbers:

  • On Node.js apps, 1 CPU core and 1.5 GB of RAM per pod gave optimal performance within the V8 engine's constraints. With this configuration we were able to go from 12 to 98 pods organically, letting the HPA do its work on the cluster with minimal spikes in response times (see the sketch below).
  • We configured 1 GB of memory headroom for all backend Java APIs running in Kubernetes.

All these values were found by stress testing the platform and its services.
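
As a rough illustration of the first bullet above, this is how the request/limit pairing can be applied with the Kubernetes Python client; the deployment name and namespace are hypothetical, and in practice these values live in the deployment manifests.

```python
# Sketch: set the 1 CPU / 1.5 GB pairing on a Node.js deployment's container.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

resources = {
    "requests": {"cpu": "1", "memory": "1536Mi"},
    "limits": {"cpu": "1", "memory": "1536Mi"},
}

apps.patch_namespaced_deployment(
    name="storefront-node",  # hypothetical deployment
    namespace="production",
    body={"spec": {"template": {"spec": {"containers": [
        {"name": "storefront-node", "resources": resources}
    ]}}}},
)
```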

Our API gateway during peak time, with a stable response time.
The aggregation API, a core piece of the platform, also with a stable response time.

Our core backend APIs also maintained flat response times during the traffic peak.

During testing we found that the backend took a little too long to scale to demand, and letting the HPA work organically sometimes resulted in a higher average response time. This happens because the HPA scales on CPU and memory (JMX exporter custom metrics for the Java APIs) and not on JVM response time or requests per minute. For 2021 we plan to scale on those signals as well.

The minimum of 25 pods per service shown above was found during stress testing.

Optimizing and right-sizing the EC2 instances that MongoDB runs on can significantly reduce I/O wait times.

By optimizing our MongoDB EC2 instances and EBS volumes we were able to reduce CPU I/O wait from 2-3% to a very low average of 0.2% of CPU time.

This means fast query times for our APIs and an almost flat average response time at the application level.
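
One simple way (among many) to watch this while testing different instance and volume configurations is to sample CPU I/O wait directly on the MongoDB hosts; psutil exposes it on Linux.

```python
# Sketch: sample CPU I/O wait on a MongoDB host once per minute (Linux only).
import psutil

while True:
    cpu = psutil.cpu_times_percent(interval=60)  # blocks and averages over 60s
    print(f"iowait: {cpu.iowait:.2f}% of CPU time")
```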

Another key performance indicator for MongoDB: page faults!

The page faults metric should always be 0, or as close to it as possible. Anything higher indicates that MongoDB is reading too much data from disk because the dataset does not fit properly in memory.

This increases response times and degrades the overall health of the cluster.
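
A hedged sketch of how the counter can be read from the driver side with pymongo; MongoDB exposes it under serverStatus → extra_info.page_faults (the connection URI is illustrative).

```python
# Sketch: read MongoDB's page fault counter via serverStatus.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.internal:27017")
status = client.admin.command("serverStatus")

# The counter is cumulative since startup, so in practice you chart and alert
# on its rate of change rather than the raw value.
print("page faults since startup:", status["extra_info"]["page_faults"])
```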

Checkout API with stable response times after all adjustments.

By upgrading the instance from m5.xlarge to m5.2xlarge we were able to give MongoDB enough memory for the dataset, and the page faults stabilized.

On the EBS volumes, if the sum of read/write IOPS went over the disk's maximum, or if the read/write queue length metrics went above 1, we moved them to provisioned IOPS (io1) volumes and removed the IOPS bottleneck.
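
A sketch of the kind of CloudWatch check behind that decision, using boto3; the volume ID is hypothetical and the five-minute period is arbitrary.

```python
# Sketch: estimate peak IOPS and queue length of an EBS volume from CloudWatch.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical volume

def peak(metric: str, stat: str, minutes: int = 60) -> float:
    """Highest datapoint of an AWS/EBS metric over the last N minutes."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,
        Statistics=[stat],
    )
    points = resp["Datapoints"]
    return max(p[stat] for p in points) if points else 0.0

# Sum of ops per 300s window divided by 300 ≈ IOPS in that window.
iops = (peak("VolumeReadOps", "Sum") + peak("VolumeWriteOps", "Sum")) / 300
queue = peak("VolumeQueueLength", "Average")
print(f"peak ~{iops:.0f} IOPS, queue length {queue:.2f}")
```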

Average query time after creating an index.

On the RDS PostgreSQL instances we acted on two fronts:

  1. The data layer: index and query optimization (sketched below).
  2. The RDS parameter group options.

By optimizing both we achieved results like the one above: the average query time for the API dropped from 70ms to 3.5ms.

This ensured our instances could withstand the higher traffic volume and that our APIs kept low response times even under load.
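
On the data-layer side the workflow is the classic one: confirm the bad plan with EXPLAIN ANALYZE, add the missing index, measure again. A minimal sketch with psycopg2, where the table, column and connection details are hypothetical:

```python
import psycopg2

# Connection parameters are illustrative.
conn = psycopg2.connect("dbname=catalog user=app host=rds.example.internal")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

cur = conn.cursor()

# Before: confirm the hot query is doing a sequential scan.
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = %s", (42,))
print("\n".join(row[0] for row in cur.fetchall()))

# Add the missing index without blocking writes.
cur.execute(
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer "
    "ON orders (customer_id)"
)

cur.close()
conn.close()
```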

Database tuning is very specific to the workload in question.

Some of the parameter group values we modified:
wal_buffers, work_mem, effective_cache_size, effective_io_concurrency, shared_buffers …
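
For reference, a hedged sketch of how such changes can be applied to a parameter group with boto3; the group name and values are placeholders, since the right numbers depend entirely on the instance size and workload.

```python
# Sketch: apply PostgreSQL tuning values to an RDS parameter group.
import boto3

rds = boto3.client("rds")
rds.modify_db_parameter_group(
    DBParameterGroupName="bf-postgres-params",  # hypothetical group
    Parameters=[
        # Illustrative values only; work_mem is in kB.
        {"ParameterName": "work_mem", "ParameterValue": "65536",
         "ApplyMethod": "immediate"},
        {"ParameterName": "effective_io_concurrency", "ParameterValue": "200",
         "ApplyMethod": "immediate"},
        # shared_buffers is static, so it only takes effect after a reboot.
        {"ParameterName": "shared_buffers", "ParameterValue": "1048576",
         "ApplyMethod": "pending-reboot"},
    ],
)
```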

We also had to adjust the connection pool on the application side so we wouldn't run out of available connections (causing pods to fail their health checks) or overload the RDS server with too many connections.
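
The services here are Java, but the sizing rule is stack-agnostic: pool size × number of pods must stay below the instance's max_connections, with some headroom. A sketch of the equivalent knobs in Python/SQLAlchemy (all values illustrative):

```python
# Sketch: size the per-pod connection pool against the RDS connection budget.
from sqlalchemy import create_engine

PODS = 25            # replicas of this service
RDS_MAX_CONN = 1000  # max_connections on the instance

pool_size = RDS_MAX_CONN // PODS - 5  # leave headroom for admin/migrations

engine = create_engine(
    "postgresql+psycopg2://app:secret@rds.example.internal/catalog",
    pool_size=pool_size,   # steady-state connections per pod
    max_overflow=0,        # never burst past the budget
    pool_timeout=5,        # fail fast instead of piling up waiters
    pool_pre_ping=True,    # drop dead connections before using them
)
```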

A high-volume Elasticsearch cluster, hence the 200 data nodes.

We use the recommended i3 instance family to get cost-efficient I/O performance for the cluster.

Our data nodes experience large spikes in IOPS, hence the choice of instance family.

With EBS disks these spikes would be a problem, and provisioning 200 io1 volumes is not cost-efficient!

Each index configuration will depend on data size and access frequency. Some of our largest indexes hold over 10GB of data and are responsible for over half the traffic in the cluster; for those we used 11 shards with 10 replicas each. We also removed old, unused versions of our indexes (kept for rollback purposes), which helped reduce the load on the data nodes.
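
A sketch of that layout with the official Elasticsearch Python client (8.x style); the index names are illustrative.

```python
# Sketch: create a hot index with 11 primary shards x 10 replicas and drop an
# old version kept only for rollback.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://search.example.internal:9200")

es.indices.create(
    index="products-v42",
    settings={
        "number_of_shards": 11,    # roughly 1GB of data per primary shard here
        "number_of_replicas": 10,  # spread read traffic across the data nodes
    },
)

# Old versions were kept only for rollback; deleting them frees heap and disk
# on the data nodes.
es.indices.delete(index="products-v40")
```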

Stable response time on our search API, the only one that talks directly to our Elasticsearch cluster.

We also had to configure DNS caching in the Node.js applications. Without it, DNS requests inside the front-end cluster passed the 5,000 per minute mark and some packets were dropped by AWS due to the hard limit of 1,024 packets per second per network interface. When that happens, DNS resolution errors start occurring inside the cluster.

DNS requests were reduced from 5,000 per minute to 300 on the front-end cluster.

Something that should be clear by now: without data, metrics, and some form of measurement, you cannot optimize anything. It is a basic principle; if you want to optimize, you must measure before and after. We always work on top of metrics.

With all this preparation and testing we had a great Black Friday, with a stable platform even during peak times and sudden spikes. We learned some lessons, all of which were written down for next year.

The DevOps team behind all this infrastructure is composed of only six people; however, everything is highly automated, everything is monitored, and all alarms are automatic as well.


Marcio Ribeiro

SRE and Internal Developer Platform Manager at Grupo Boticário