Implementing distributed calculations: A journey from bottlenecks to breakthroughs

Symphony
October 13th, 2024

Initial problem description

Our system was grappling with the challenge of executing long-running, CPU-intensive calculations, two of which were vital to our workflow. 

These calculations, both optimization algorithms, were essential for determining the optimal solution. Given the complexity of the problems the system addresses, we had to employ both to achieve the desired outcome.

  • The first calculation is a stochastic optimization, which generates multiple solutions by using varying input data points. While this method can be effective, it has the drawback of requiring sequential computations, which can limit efficiency.

  • The second calculation follows a well-known approach of iterative gradient-based optimization. However, this comes with difficulties, such as the unpredictable number of iterations required to reach the optimal solution and the complexity of gradient calculations in multi-dimensional spaces. 
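
To illustrate why the iteration count is unpredictable, here is a generic sketch of an iterative gradient-based loop (this is not our actual algorithm; the objective, step size, and tolerance are placeholders). The loop only stops once the gradient norm drops below a tolerance, so the number of iterations depends entirely on the data and the starting point.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.01, tol=1e-6, max_iter=100_000):
    # Iterate until the gradient is (nearly) zero; how many iterations that
    # takes depends on the data and the starting point, hence "unpredictable".
    x = np.asarray(x0, dtype=float)
    for i in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, i          # converged after i iterations
        x = x - step * g
    return x, max_iter           # stopped without meeting the tolerance

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
solution, iterations = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```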

The problem? Legacy services were designed to run continuously but could only handle one task at a time. This architecture led to performance bottlenecks, limiting the number of concurrent calculations and restricting service availability to just one user at a time. As a result, users experienced delays, waiting for the service to become available, which could lead to frustration. 

Efficient task management was crucial to maintaining optimal system performance and user satisfaction.

However, continuously running services capable of handling such tasks around the clock can be expensive. We needed a smarter solution to keep the system fast and responsive without breaking the bank.

We considered horizontal scaling—adding more service instances to handle tasks concurrently. While this could address the issue of supporting multiple users, it introduced high costs and didn’t improve the performance of individual tasks. Additionally, this approach didn’t tackle the root issue: optimizing how each calculation was processed. To truly enhance performance, we needed to rethink our approach beyond just adding more resources.

Initial infrastructure

The system we implemented followed a microservice architecture, a design paradigm that aligns with the principle of separation of concerns. Each microservice is responsible for a specific aspect or feature of the system, ensuring that the overall system is more maintainable, scalable, and resilient. The microservices were also containerized using Docker and deployed on AWS ECS (Elastic Container Service), enabling flexible management and independent scaling of the services.

Key Characteristics of Microservice Architecture

Microservice architecture is characterized by several key principles. One of the most important is the separation of concerns, where each microservice is designed to handle a single responsibility within the system. This modular approach simplifies the development, testing, and maintenance of each service independently. Scalability is another critical characteristic, as microservices can be scaled independently based on demand. This allows for efficient resource allocation, ensuring that high-demand services have the necessary capacity without over-provisioning resources for other parts of the system. 

The architecture also enhances resilience: by isolating services, failures in one service are less likely to impact others, helping to maintain the overall stability and availability of the system. Flexibility in technology is a further advantage, as different microservices can be developed using various programming languages, frameworks, or databases, allowing for the selection of the best tools for each specific task. Lastly, microservices offer ease of deployment, with each service being deployable independently. This independence means that new features and updates can be rolled out more frequently and with less risk of affecting the entire system.

Our microservices are architected with a clear separation of concerns, divided into two distinct components:

  • Web Part: This component handles user-facing operations, such as processing HTTP requests and executing simple I/O tasks like CRUD operations on the database. It operates in a synchronous, request-response model, ensuring low-latency interaction for users. This part is designed to scale horizontally to maintain responsiveness and throughput under varying load conditions.

  • Consumer Part: This component is designed for handling long-running, compute-intensive tasks, such as complex calculations, large data migrations, or synchronization processes. It operates asynchronously, decoupling heavy processing from the Web Part to avoid blocking the main application flow. The Consumer Part ensures scalability by leveraging parallel processing and task queuing, allowing it to scale independently based on demand and optimize resource utilization.

By isolating these responsibilities, we achieve key architectural characteristics such as scalability, fault tolerance, and resilience. This ensures the system remains responsive while efficiently processing background tasks.

Communication between the web components and consumer components was facilitated through RabbitMQ. The web parts would send messages into designated message queues, where each consumer would monitor a specific queue. Once a message was available in the queue, the relevant consumer would pick it up for processing. This approach ensured a decoupled and asynchronous workflow, where web services didn't need to wait for the consumer to finish processing, allowing for more responsive interactions.

First calculation optimization 

The calculation was divided into n sequential parts, with each part generating a result that was stored in the database. Every step had a set of calculations that needed to be performed in a specific order to produce the final outcome. Each of these steps involved complex mathematical computations, creating a bottleneck in the process as the system could not proceed until each calculation was completed in sequence.

The flow for the calculation looked like this: the web part sent a message through RabbitMQ, as mentioned previously, and the consumer processed it.

[Diagram: calculation flow from the web part through RabbitMQ to the consumer]

The initial infrastructure was designed to run a single consumer instance 24/7 that handled all the calculations. Due to its limited computing resources (4 CPUs and 8 GB of RAM), this sequential processing led to poor performance and scalability issues, and directly impacted the user experience.

As a temporary solution, we simply upgraded the instance to 8 CPUs and 16 GB of RAM. However, this highlights an important architectural consideration: while vertical scaling (boosting instance resources) can offer immediate improvements in processing speed, it is not a sustainable long-term strategy. In distributed systems where scalability, efficiency, and cost-effectiveness are critical, vertical scaling quickly reaches its limits. The focus should shift toward horizontal scaling and architectural optimizations to ensure the system can handle increased workloads more flexibly and efficiently.

We did performance testing with only one user running the calculations.

[Table: performance test results comparing the original (4 CPU / 8 GB) and upgraded (8 CPU / 16 GB) instances]

Relying on resource upgrades (CPU/RAM) offers only momentary improvements, as demonstrated by the performance testing results. Even with the stronger instance, the service runtime decreased by only 50%, and this approach did not solve the issue of handling multiple users or calculations concurrently. The core bottleneck—sequential processing on a single consumer—remains unaddressed.

Our solution

First implementation   

To address our performance issue, we applied a common optimization pattern by examining the calculations to identify any segments that could be executed concurrently without impacting the accuracy of the final result.

By doing so, we managed to extract n calculations that had no effect on each other. That enabled us to run them concurrently on a single consumer, but, as mentioned earlier, we had to run a stronger consumer instance capable of processing those n calculations at the same time.
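
As a rough sketch of this idea (the sub-calculation and combination functions below are placeholders, not our actual math), a process pool is one way to let the n independent parts run on separate CPU cores while the final combination step stays sequential:

```python
from concurrent.futures import ProcessPoolExecutor

def run_part(part_input):
    # Placeholder for one independent, CPU-bound sub-calculation.
    return sum(x * x for x in part_input)

def combine(partial_results):
    # Placeholder for the sequential step that merges the partial results.
    return sum(partial_results)

def run_calculation(independent_inputs):
    # The n independent parts run in parallel on separate CPU cores;
    # only the final combination step stays sequential.
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(run_part, independent_inputs))
    return combine(partial_results)
```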

With this approach we managed to slightly improve the performance of one calculation but still had a problem with multiple users running the calculations. 

New approach 

Once we managed to run the calculations concurrently, we had to make some infrastructure changes to further improve performance and enable multiple users to run the calculations simultaneously. Our solution was horizontal scaling, where multiple consumers would work in parallel, handling multiple calculations at the same time. However, running multiple consumers continuously would have been prohibitively expensive, especially considering that all our microservices were containerized and deployed on AWS ECS instances.

To address this, we designed the consumers to start on demand when a user initiates a calculation. Each consumer listens to the same message queue, where the web part of the microservice sends calculation tasks. This architecture allows any available consumer to pick up a task and begin processing it.

The on-demand start of consumers was implemented using AWS Lambda, which triggers at regular intervals, typically every few seconds. This mechanism is known as autoscaling. The Lambda function monitors the message queue (managed by RabbitMQ) and starts new consumer instances based on the number of messages waiting to be processed. By dynamically scaling the number of consumers to match the workload, the system optimizes resource usage and reduces operational costs, ensuring that only the necessary number of consumers are active at any given time. This ensures that our system is always scaling efficiently based on real-time demand. We set a cap of 30 consumers running concurrently to avoid over-provisioning and manage costs effectively.
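
A simplified sketch of what such an autoscaling Lambda handler can look like is shown below. The RabbitMQ management API URL, queue name, cluster, and task definition are illustrative placeholders rather than our exact configuration:

```python
import os
import boto3
import requests

ecs = boto3.client("ecs")
MAX_CONSUMERS = 30  # cap to avoid over-provisioning

def handler(event, context):
    # Read the queue depth from the RabbitMQ management API (URL and credentials are placeholders).
    resp = requests.get(
        "https://rabbitmq.example.com/api/queues/%2F/calculation_tasks",
        auth=(os.environ["RABBITMQ_USER"], os.environ["RABBITMQ_PASSWORD"]),
        timeout=5,
    )
    pending_messages = resp.json()["messages"]

    # Count consumer tasks that are already running.
    running = ecs.list_tasks(cluster="calculations", desiredStatus="RUNNING")["taskArns"]
    to_start = min(pending_messages, MAX_CONSUMERS) - len(running)

    # Start one consumer task per unassigned message, up to the cap.
    for _ in range(max(to_start, 0)):
        ecs.run_task(
            cluster="calculations",                 # illustrative cluster name
            taskDefinition="calculation-consumer",  # illustrative task definition
            count=1,
            # launch-type and networking settings omitted for brevity
        )
    return {"pending": pending_messages, "started": max(to_start, 0)}
```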

Once a consumer finishes processing a calculation, it sends a message through RabbitMQ to notify the user that their calculation is complete. This decouples the notification logic from the calculation process and ensures reliable communication with users.

Each consumer is designed to shut down immediately after finishing one calculation, further reducing unnecessary resource usage.

To highlight the cost-effectiveness of this approach, consider the continuously running AWS instance mentioned before, with 8 CPUs and 16 GB of RAM, which could cost around $500 to $700 per month. In contrast, our on-demand consumers, each running for only about 3-5 minutes with 2 CPUs and 4 GB of RAM, are significantly cheaper. To put this in perspective, one user triggers n calculations that each run for about 5 minutes, which would cost $0.00265 * n. By scaling up only when needed, we dramatically reduce operational costs while maintaining the flexibility to handle high workloads.

Key Improvements

  1. Task Segmentation: We adjusted the consumer's calculation and the infrastructure so that tasks are processed as smaller, independent units of the larger calculation. This allowed us to parallelize the workload effectively.

  2. Result Storing and Notification: Each consumer runs separately, storing its part of the result in the database and notifying the user.

Benefits of the New Implementation

  • On-Demand Services: With this approach, there was no need to maintain services running 24/7. We could start up and shut down services as required, significantly reducing operational costs.
  • Enhanced Performance: By distributing the workload across multiple tasks and running them concurrently, the solution became more performant.
  • Cost Efficiency: The ability to manage services dynamically according to demand made the solution more cost-effective.

For example, consider a single request that fans out into three independent sub-calculations.

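A minimal sketch of what that publishing step might look like (using the pika client; the queue, payload, and identifier names are illustrative assumptions, not our actual schema):

```python
import json
import uuid
import pika

# Connect to RabbitMQ (connection parameters are illustrative).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="calculation_tasks", durable=True)

request_id = str(uuid.uuid4())  # ties all sub-calculations to the initiating request

# One message per independent sub-calculation; here the request fans out into three.
for part in ("part_1", "part_2", "part_3"):
    channel.basic_publish(
        exchange="",
        routing_key="calculation_tasks",
        body=json.dumps({"request_id": request_id, "part": part}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
connection.close()
```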

In this example, the web part of the microservice would send three messages to the RabbitMQ queue. The AWS Lambda function, responsible for autoscaling, would read the number of pending messages in the queue and trigger the startup of three new consumers, each of which would take one message. These consumers would then process the three calculations simultaneously. Once all the calculations are completed, the consumers would shut down, ensuring that resources are efficiently utilized without running unnecessary instances. This approach allows the system to handle multiple calculations concurrently while maintaining cost-effectiveness and scalability. 

Each consumer contained logic to shut itself down immediately after processing a single task, and each consumer was designed never to take on more than one task. This gave us full control and observability over all calculations while keeping the system infrastructure fairly simple. If a calculation fails, the message is automatically routed to a dead letter queue (DLQ); RabbitMQ dead-letters a message when it is rejected, when it expires, or when it exceeds a configured delivery limit. Once in the dead letter queue, the message is retried after a configured delay (typically implemented by giving the DLQ a message TTL and dead-lettering expired messages back to the original queue). This retry mechanism gives the team ample time to investigate the issue, resolve any bugs, and deploy a new version of the calculation, ensuring minimal downtime while preserving the failed messages for reprocessing.
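
A sketch of the queue topology this relies on, using the common RabbitMQ TTL plus dead-letter pattern (queue names and the retry delay are illustrative, not our exact configuration):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# Main work queue: rejected (failed) messages are dead-lettered to the DLQ.
channel.queue_declare(
    queue="calculation_tasks",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",  # default exchange
        "x-dead-letter-routing-key": "calculation_tasks_dlq",
    },
)

# DLQ: messages sit here for the retry delay, then are routed back
# to the main queue so a (re-deployed) consumer can pick them up again.
channel.queue_declare(
    queue="calculation_tasks_dlq",
    durable=True,
    arguments={
        "x-message-ttl": 600_000,  # retry delay in ms (illustrative)
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": "calculation_tasks",
    },
)
connection.close()
```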

In theory, there is no bound on the number of calculations, although, as already mentioned, we set a limit of 30 consumers directly in the Lambda function. Likewise, there is in principle no bound on the number of users initiating calculations at the same time, since all of the calculations run separately.

Implementation of each consumer 

After successfully implementing the infrastructural changes for our new solution, we turned our attention to the individual implementation of the consumer.

Asynchronous Consumer

We built a fully asynchronous consumer from scratch using asyncio, aio_pika, and tenacity. This allowed us to optimize task handling and improve overall efficiency. Using asyncio and aio_pika, we created the mechanism for reading messages from the queue. With tenacity, we implemented a retry mechanism for checking the queue: if no message appears across all retries, the consumer shuts down.
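
A condensed sketch of this consumer loop, assuming illustrative connection and queue names and a placeholder calculation, might look like this:

```python
import asyncio
import json

import aio_pika
from tenacity import RetryError, retry, retry_if_result, stop_after_attempt, wait_fixed

AMQP_URL = "amqp://guest:guest@rabbitmq/"  # illustrative connection string
QUEUE_NAME = "calculation_tasks"           # illustrative queue name


def run_calculation(payload):
    # Placeholder for the actual CPU-bound calculation (numpy-based in practice).
    return sum(payload.get("inputs", []))


@retry(
    retry=retry_if_result(lambda message: message is None),  # retry while the queue is empty
    wait=wait_fixed(2),
    stop=stop_after_attempt(5),
)
async def get_next_message(queue):
    # Returns None when the queue is empty; tenacity retries a few times before giving up.
    return await queue.get(fail=False)


async def main():
    connection = await aio_pika.connect_robust(AMQP_URL)
    async with connection:
        channel = await connection.channel()
        queue = await channel.declare_queue(QUEUE_NAME, durable=True)
        try:
            message = await get_next_message(queue)
        except RetryError:
            return  # no work after several checks: shut the consumer down

        async with message.process():  # ack on success, reject on error
            result = run_calculation(json.loads(message.body))
            # ...store the partial result and notify the user here...


if __name__ == "__main__":
    asyncio.run(main())
```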

Consumer Workflow

The consumer operates through a well-defined workflow. It starts with a task monitoring loop, running continuously to check for new tasks. Upon receiving a task, the consumer processes it asynchronously, utilizing the asyncio library to ensure all calculation methods are handled without blocking. For optimized performance, we've integrated the numpy library, which enhances the efficiency of loops and CPU-bound operations, leading to significantly faster calculations.

Once the task is completed, the consumer shuts down automatically, ensuring that resources are used efficiently and not wasted on idle processes.

Error handling

Our calculation had n parts so each of the consumers handled one of those n parts. For each part a notification was sent to the user regarding the state of the calculation. 

One complex part was handling multiple users running calculations at the same time, so that each user would receive the corresponding notification for each task. To solve this, we tagged every task with a unique identifier of the request that triggered the calculations.

If a calculation fails, the consumer notifies the user and the initiating calculation task message goes directly to the dead letter queue, from where it is retried after the configured delay described earlier. This mechanism gives us plenty of time to investigate, fix, and deploy a corrected version of the calculation.

Performance results of the optimization 

After implementing the solution, we repeated the performance testing with the same examples, but this time using weaker service instances, since each consumer now runs only a single calculation and there was no benefit or performance enhancement in running stronger ones:

[Table: performance test results of the optimized implementation on weaker instances]

The optimized implementation is around 9 times faster than the upgraded implementation, even when using weaker service instances. With it, we managed to handle complex calculations on a data set 10 times larger than the system could previously process. The system can now tackle larger calculations while keeping processing time under 30 minutes, and every user sees the same performance, since new consumers start up for every triggered calculation.

Potential downsides

With this implementation, the main downside is more complex troubleshooting: detailed logging has to be implemented on each consumer to help whenever a problem arises.

Second calculation optimization

This calculation uses an iterative gradient-based method, but it has challenges like the unpredictable number of iterations to find the optimal solution and the difficulty of calculating gradients in multi-dimensional spaces.

The initial infrastructure is the same as in the example before. 

Each iteration involves a set of complex calculations where we, once again, try to find the smallest extractable portions that can run concurrently and divide the overall calculation into those small tasks. In this case, each iteration produces its own set of results which, if the criteria are not met, become the inputs for the next iteration.

The plan is similar to the previous solution: we start up multiple consumers that process all those tasks concurrently and calculate each of the results. One consumer initiates the smaller calculation tasks and serves as an orchestrator.

Since the calculation now spans more than one iteration, we needed to implement a consumer that shuts down only when the final solution is obtained, not after a single processed task as in the previous example.

With multiple consumers now working in parallel to calculate different portions of the result, we needed a reliable way to ensure that only one consumer would be responsible for gathering and finalizing these results. To achieve this, we turned to a high-performance data storage solution that excels in fast read and write operations. In our case, we chose Redis.

Each of the consumers caches a portion of the result, and one consumer reads from Redis until all the results are present. Once all the results are available, that consumer checks whether they meet the criteria and returns if they do; otherwise, it starts a new iteration by sending an initiation task to the orchestrator consumer with the newly calculated results as the inputs for the next iteration. The same consumer also handles deleting the data from Redis once all results have been read.
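
A rough sketch of this aggregation step with redis-py (the key names, polling interval, and the criteria/orchestrator hooks are illustrative assumptions, not our exact implementation):

```python
import json
import time

import redis

r = redis.Redis(host="redis", decode_responses=True)  # illustrative connection


def store_partial_result(calc_id, part_index, result):
    # Each worker consumer caches its portion of the iteration's result.
    r.hset(f"calc:{calc_id}:results", str(part_index), json.dumps(result))


def collect_results(calc_id, expected_parts, poll_interval=1.0):
    # The aggregating consumer polls until every portion is present.
    key = f"calc:{calc_id}:results"
    while r.hlen(key) < expected_parts:
        time.sleep(poll_interval)
    results = [json.loads(v) for v in r.hgetall(key).values()]
    r.delete(key)  # clean up once everything has been read
    return results


def finish_or_iterate(calc_id, expected_parts, criteria_met, start_next_iteration):
    # criteria_met and start_next_iteration are hypothetical callables supplied by
    # the aggregating consumer (convergence check / message to the orchestrator).
    results = collect_results(calc_id, expected_parts)
    if criteria_met(results):
        return results               # optimum found: report it and shut down
    start_next_iteration(results)    # results become the inputs of the next iteration
    return None
```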

This approach allows for flexible adaptation of results and stored objects since Redis does not enforce a strict model. Additionally, if the calculation inputs involve a large set of data, we can cache these inputs in Redis. This way, all consumers can access the data in every iteration without repeatedly sending it, potentially reducing the time required for data transmission.

Error handling

Given that multiple consumers will process portions of the solution simultaneously, we will implement a fast-stop mechanism in case of failure. If one consumer fails, it will cache the unique identifier for the calculation. Other consumers will check for this identifier when processing tasks to determine if the calculation has already failed, ensuring a quick halt to avoid wasted resources.

With this approach, we address the problem of multiple notifications to the user. Now, all consumers are aware if a calculation fails and only the first one will send a notification to the user. This consumer will also delete all cached values from Redis to prevent unnecessary data retention.

In the event of an unhandled exception, we plan to implement a slow stop mechanism. Each consumer will have an idle stop timeout. If results are not present in Redis after this timeout, the consumer will delete all cached values and shut down. This ensures that the system can gracefully handle unexpected issues without leaving stale data.
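
A rough sketch of such a fast-stop flag in Redis (key names are illustrative; SET with nx=True ensures only the first failing consumer sends the notification and cleans up):

```python
import redis

r = redis.Redis(host="redis", decode_responses=True)  # illustrative connection


def mark_failed(calc_id):
    # Set the failure flag only if it is not already set; the consumer that sets
    # it first is the one that notifies the user and cleans up cached results.
    first_to_fail = r.set(f"calc:{calc_id}:failed", "1", nx=True)
    if first_to_fail:
        r.delete(f"calc:{calc_id}:results")
        # ...send the single failure notification to the user here...


def should_stop(calc_id):
    # Workers call this while processing to halt quickly once any
    # other consumer has reported a failure for this calculation.
    return r.exists(f"calc:{calc_id}:failed") == 1
```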

[Diagram: second calculation with the orchestrator consumer, worker consumers, and Redis]

Performance results of the optimization 

Prior to optimization, we ran a calculation using an average-sized dataset and a moderate number of iterations. 

[Table: performance comparison of the initial and optimized second calculation]

We observed a significant performance improvement, with our optimized solution being more than 12 times faster than the initial implementation.

Conclusion

This approach is more complex but allows us to further parallelize our calculations and, with the use of common data storage, in our case Redis, significantly increase the performance. By efficiently managing tasks and handling errors, we ensure that our system remains robust, scalable, and user-friendly.

About the author

Matej Marincic is a Software Engineer from the Sarajevo branch who specializes in backend development and integrating DevOps practices. His main language is Python, and he also brings experience with AWS, Jenkins, Java, C++, and C#. Matej enjoys working on complex software and infrastructure challenges and has a keen interest in solving mathematical problems, contributing to diverse projects with a thoughtful and reliable approach.
