For the AI Investments application, it is assumed that on average 10 models are trained in parallel, that up to 50 models could be trained at peak, and that the average load rises to 18 models for 120 hours per month. The distribution of the training time is presented in the diagram, which shows the percentage of models trained within a given time. The figure shows that training time is not constant and depends on the size of the model, the length of the training period, and the performance of the server and GPU.
With these parameters two different scenarios were considered.
- A scenario with 5 on-premise servers, each with two GPUs. This corresponds to two models trained per server in the normal average case. Additional GPU servers running in the Cloud must therefore be used whenever the load exceeds this level.
Based on the above assumptions, the three-year Total Cost of Ownership (TCO) is composed of the following elements:
- The initial purchase of 5 on-premise servers at 5 000 USD each, giving a total investment of 25 000 USD.
- The cost of running and maintaining a server, covering electricity, cooling, and repairs, is assumed to be 10% of the hardware cost per year, i.e. 2 500 USD per year. Thus the total cost of infrastructure operation is 7 500 USD over the three-year server depreciation period.
- Cloud resources are used only when needed. The private infrastructure can handle up to 10 models trained at once, and Cloud resources are used only when more models must be trained. Based on the parameters described above, the Cloud resources are needed during the 120 hours per month when the average load is higher than normal. On average, 8 additional servers are needed during this time (18 models minus the 10 handled on-premise), since the Cloud servers have only one GPU each. The cost of such a Cloud server is around one USD per hour, so the cost of the used Cloud resources will be 960 USD per month, and 34 560 USD over three years.
Adding the cost elements listed above gives a TCO of 67 060 USD for the three years of operation. The real cost will be lower, as the additional servers in the Cloud are started on demand by the MELODIC platform and stopped when no longer needed. Hence, the extra servers will be running only when the number of models being trained exceeds the normal average supported by the private infrastructure.
- A scenario with 25 private servers, each with two GPUs. This corresponds to two models trained per server at peak need. The three-year cost of these servers is 162 500 USD. This is composed of 125 000 USD in hardware cost for the 25 servers and 37 500 USD in maintenance cost, assumed to be 10% of the hardware cost per year.
The difference between the two scenarios is 95 440 USD (162 500 USD minus 67 060 USD), so the first scenario could save almost 60% of the cost of running the application on a private infrastructure.
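The arithmetic behind the two scenarios can be reproduced with a short script. The following is a minimal sketch, not part of the MELODIC platform: the constant and function names are hypothetical, the input values are the ones stated in the text, and it assumes one model per GPU and a flat hourly Cloud price with no partial-hour billing or discounts.

```python
# All inputs are the figures quoted in the text; names are illustrative only.
YEARS = 3
SERVER_PRICE_USD = 5_000             # price of one two-GPU on-premise server
MAINTENANCE_PERCENT_PER_YEAR = 10    # electricity, cooling, repairs
CLOUD_USD_PER_SERVER_HOUR = 1        # approximate price of a one-GPU Cloud server
AVERAGE_MODELS = 10                  # load covered by 5 on-premise servers
HIGH_LOAD_MODELS = 18                # average load during the busy hours
HIGH_LOAD_HOURS_PER_MONTH = 120
GPUS_PER_CLOUD_SERVER = 1            # assumption: one model per GPU


def on_premise_cost(servers: int, years: int = YEARS) -> int:
    """Hardware purchase plus yearly maintenance over the given period."""
    hardware = servers * SERVER_PRICE_USD
    maintenance = hardware * MAINTENANCE_PERCENT_PER_YEAR * years // 100
    return hardware + maintenance


def cloud_burst_cost(years: int = YEARS) -> int:
    """Cost of the extra Cloud servers used only during high-load hours."""
    extra_models = HIGH_LOAD_MODELS - AVERAGE_MODELS
    # Ceiling division: number of Cloud servers needed for the extra models.
    extra_servers = -(-extra_models // GPUS_PER_CLOUD_SERVER)
    monthly = extra_servers * HIGH_LOAD_HOURS_PER_MONTH * CLOUD_USD_PER_SERVER_HOUR
    return monthly * 12 * years


hybrid_tco = on_premise_cost(servers=5) + cloud_burst_cost()
private_tco = on_premise_cost(servers=25)
saving = private_tco - hybrid_tco

print(f"Hybrid (5 servers + Cloud): {hybrid_tco} USD")
print(f"Private only (25 servers):  {private_tco} USD")
print(f"Saving: {saving} USD ({100 * saving / private_tco:.1f}%)")
```

Changing `HIGH_LOAD_HOURS_PER_MONTH` or `CLOUD_USD_PER_SERVER_HOUR` gives a quick view of how sensitive the saving is to the Cloud usage assumptions.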