HPC technical upgrade plan for a pharmaceutical company in Shanghai
Project background
The customer's production workload is genetic computing. Each run generates about one million files, of which roughly 60% are large files and 40% are small files, for a total of about 1.5 TB. Runs of this size take place 3 to 4 times a week.
The original computing environment consisted of 32 computing nodes running genetic computing jobs in parallel, with GlusterFS providing storage. The storage hardware was three nodes, each with 36 6 TB SATA disks in RAID 6, with RAID 5 across the three nodes. The available capacity was 432 TB, and the output bandwidth of the storage cluster was 1.5 GB/s to 2 GB/s, which could not meet the customer's growing computing needs. The main problems were:
- During computation, the limit on the number of GlusterFS threads keeps the utilization of the customer's 32-node computing cluster at only about 20%, which seriously wastes the hardware resources of the computing nodes;
- A load peak appears about half an hour after a computing task starts. Because of the thread limit, subsequent tasks must wait while storage performance is already at its ceiling, which greatly prolongs the completion time of production tasks;
- A second peak appears at the end of the computation, when results are written to storage. The computing nodes produce a large amount of data at the end of a task, but GlusterFS provides only about 1.5 GB/s to 2 GB/s of bandwidth, so writes are slow and tasks take a long time to finish.
LeoStor storage design
The customer originally used 108 hard disks, and after analysis the new storage system keeps the same count of 108 disks. The LeoRaid 4+1 redundancy mode tolerates the loss of one node or one hard disk. The requirement is that when one node fails, existing data remains readable and new data remains writable. To meet it, the minimum safe configuration is N+M+M nodes; with N=4 and M=1, that is 6 nodes. Seagate 4 TB enterprise SATA disks were selected, with 18 disks per node. The total usable capacity of the cluster is 4 TB * 108 * 80% = 345.6 TB (about 345 TB).
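The sizing arithmetic above can be checked with a short Python sketch. The function and variable names are illustrative only and not part of any LeoStor tooling, and the 80% efficiency is assumed to be the N/(N+M) data fraction of the 4+1 layout, which matches the figure quoted in the text.

```python
def min_safe_nodes(n: int, m: int) -> int:
    """Minimum node count N + M + M, as quoted in the text for an N+M layout."""
    return n + m + m


def usable_tb(disks: int, disk_tb: float, efficiency: float) -> float:
    """Usable capacity in TB for a given disk count, disk size and efficiency."""
    return disks * disk_tb * efficiency


N, M = 4, 1                      # LeoRaid 4+1
DISKS_PER_NODE = 18              # from the text

nodes = min_safe_nodes(N, M)     # 6 storage nodes
disks = nodes * DISKS_PER_NODE   # 108 disks, same count as the old system
efficiency = N / (N + M)         # assumed source of the 80% figure
new_usable = usable_tb(disks, 4, efficiency)

# Original GlusterFS system, using only the numbers given in the text.
old_raw = 3 * 36 * 6             # 648 TB raw
old_usable = 432                 # TB, as stated

print(f"nodes={nodes}, disks={disks}")
print(f"new usable capacity: {new_usable:.1f} TB ({efficiency:.0%} of raw)")
print(f"old usable capacity: {old_usable} TB ({old_usable / old_raw:.1%} of raw)")
```

The new cluster trades some raw capacity (4 TB disks instead of 6 TB) for a higher usable fraction and node-level fault tolerance.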
Equipment | Configuration | Quantity |
Metadata node | | 2 |
Storage node | | 6 |
Switch | Huawei S6720-32X-LI-32S-AC | 2 |
Delivery test
iozone was used for the storage stress test, netperf for the network test, and atop for performance monitoring. LeoStor and GlusterFS were tested with the same parameters and procedures.
In the LeoStor cluster test, 1 MB and 128 KB block sizes were mixed at a 6:4 ratio. The results were 5.97 GB/s for reads and 5.5 GB/s for writes.
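To put these figures next to the original 1.5 GB/s to 2 GB/s GlusterFS bandwidth, the following Python sketch computes a rough speedup range and the idealized time to move one 1.5 TB result set. It assumes decimal units (1 TB = 10^12 bytes, 1 GB/s = 10^9 bytes/s) and sustained peak bandwidth, so real job times will be longer. The helper name `transfer_seconds` is hypothetical.

```python
DATASET_BYTES = 1.5e12  # about 1.5 TB per run, decimal units assumed


def transfer_seconds(size_bytes: float, bandwidth_gb_s: float) -> float:
    """Idealized transfer time at a sustained bandwidth given in GB/s."""
    return size_bytes / (bandwidth_gb_s * 1e9)


old_low, old_high = 1.5, 2.0     # GlusterFS output bandwidth range, GB/s
new_read, new_write = 5.97, 5.5  # LeoStor mixed-test results, GB/s

# Most pessimistic and most optimistic ratios from the quoted numbers.
speedup_min = new_write / old_high   # 2.75
speedup_max = new_read / old_low     # 3.98

print(f"speedup range: {speedup_min:.2f}x to {speedup_max:.2f}x")
print(f"1.5 TB at {old_low} GB/s: {transfer_seconds(DATASET_BYTES, old_low):.0f} s")
print(f"1.5 TB at {old_high} GB/s: {transfer_seconds(DATASET_BYTES, old_high):.0f} s")
print(f"1.5 TB at {new_write} GB/s: {transfer_seconds(DATASET_BYTES, new_write):.0f} s")
```

Even the pessimistic ratio, LeoStor write bandwidth against the top of the GlusterFS range, is above 2x.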
Customer benefits
- Storage performance has more than doubled, improving the efficiency of the computing workload;
- Complete web-based monitoring, which shows the operating status and temperature of the hardware;
- Efficient data recovery after a hard disk failure, taking only 20% of the time required by the original two-level RAID scheme.