Hence, the application is fully distributed, and both the data preprocessing step and the training step can be performed in a distributed manner. The computing requirements for this application are:
- The need to train many Machine Learning (ML) models for pattern recognition, usually with complex architectures. Currently, AI Investments trains a dedicated model for each market, each time interval (hourly, daily, weekly), and each investment strategy. This means that for 200 markets, 3 time intervals, and 5 investment strategies, 3000 models must be trained and periodically re-trained on new data.
- Complex preprocessing of each training data set: the data is first transposed and joined, and then a Wavelet transformation is applied. Different Wavelet functions are used depending on the training results, but mostly functions from the Daubechies and Symlet families are used.
- Predictions based on the trained networks must be produced in near real-time: for the shorter time intervals, one minute is the processing deadline within which data must be gathered from the markets, transposed, transformed, and the prediction produced.
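The Wavelet step in the preprocessing above can be illustrated with a minimal sketch. The snippet below is not the project's actual pipeline (a real deployment would more likely use a library such as PyWavelets, and deeper multi-level decompositions); it implements a single level of the orthogonal Daubechies D4 transform directly, with hypothetical price data, to show how a series is split into approximation and detail coefficients:

```python
import numpy as np

# Daubechies D4 (db2) low-pass analysis filter coefficients.
SQRT3 = np.sqrt(3.0)
H = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2))
# High-pass filter via the quadrature-mirror relation g[k] = (-1)^k h[L-1-k].
G = H[::-1] * np.array([1.0, -1.0, 1.0, -1.0])

def dwt_level(signal):
    """One level of the discrete wavelet transform with periodic extension.

    Returns (approximation, detail) coefficient arrays, each half the
    input length; the input length is assumed even.
    """
    n = len(signal)
    ext = np.concatenate([signal, signal[: len(H) - 1]])  # periodic padding
    # Correlate each filter with the signal and downsample by 2.
    approx = np.convolve(ext, H[::-1], mode="valid")[::2][: n // 2]
    detail = np.convolve(ext, G[::-1], mode="valid")[::2][: n // 2]
    return approx, detail

# Hypothetical normalized price series for one market and time interval.
prices = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.05])
approx, detail = dwt_level(prices)
```

Because the D4 filters are orthogonal, the transform preserves the signal's energy, which makes it easy to sanity-check the implementation.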
These requirements imply a high-performance architecture for training in terms of parallel processing, and a reliable architecture for prediction in terms of low latency and availability. At the same time, the infrastructure costs should obviously be minimized.
The deployment utility aims to minimize costs under the constraint that the training of the given number of models must finish by a certain deadline. This type of utility will be used in the production environment for re-training the models used for real-time predictions. The initial time to train one model is estimated from the number of epochs needed and the average time to train one epoch; combined with the number of models to train, this gives the number of worker computers needed to meet the training deadline. Preferably, on-premise servers are used, and machines in the Cloud are deployed only if the on-premise servers are too few to meet the training deadline. The MELODIC platform continuously monitors the time it takes to train a model and estimates the time remaining to finish training all models. Based on that estimate, MELODIC optimizes the number of worker machines required to meet the deadline by autonomously adding or removing virtual machines in the Cloud.
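The initial sizing step described above can be sketched as follows. This is a simplified illustration, not MELODIC's actual optimizer: it assumes homogeneous workers that each train one model at a time, and the model counts, epoch counts, and epoch times in the example are hypothetical:

```python
import math

def workers_needed(n_models, epochs_per_model, secs_per_epoch,
                   deadline_secs, on_prem_workers):
    """Estimate the total worker count needed to train all models by the
    deadline, and how many cloud VMs must supplement the on-premise
    servers. Assumes homogeneous workers, one model per worker at a time."""
    model_secs = epochs_per_model * secs_per_epoch          # time per model
    total = math.ceil(n_models * model_secs / deadline_secs)
    cloud = max(0, total - on_prem_workers)                 # on-premise first
    return total, cloud

# Hypothetical figures: 3000 models, 100 epochs of 5 s each,
# a 48-hour deadline, and 4 local GPU servers.
total, cloud = workers_needed(3000, 100, 5.0, 48 * 3600, 4)
```

At runtime, MELODIC re-runs this kind of estimate with the monitored per-epoch times instead of the initial guesses, scaling the cloud worker count up or down as the remaining-work estimate changes.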
For training and experiments, a hybrid infrastructure is used where part of the application is deployed in the Cloud and part on-premise. This architecture is forced by the very high prices of virtual machines with GPUs or dedicated ML training hardware in the Cloud. The cost of using HPC solutions is likewise very high compared with the current hybrid deployment. The baseline infrastructure for training is therefore on-premise servers with GPUs, with Cloud resources used only when the time constraints of the training cannot be met on the local computational resources. Even assuming that the leading Cloud providers are equipped with the highest-quality network connections, the communication overhead could be a problem that must be considered, and work is ongoing to extend MELODIC to take latency into account when assigning the location of data and processing jobs. However, for this application, the time to transfer the data for training a model is insignificant compared with the training process: measurements indicate that transferring the data normally takes up to 5 seconds, whereas training takes over 7 hours on average. Hence, network delay and bandwidth are not bottlenecks for completing the training.
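The claim that transfer time is negligible follows directly from the measured figures quoted above (5 seconds of transfer against more than 7 hours of training per model):

```python
transfer_secs = 5.0          # measured upper bound for moving one training set
training_secs = 7 * 3600.0   # lower bound on average training time per model
# Fraction of a job's wall-clock time spent on data transfer:
# roughly 0.02 %, so the network is nowhere near the bottleneck.
overhead = transfer_secs / (transfer_secs + training_secs)
```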