Horovod Command-Line Arguments Explained
- 2022-04-18 15:27:16
- Tools
- shevechco
Horovod is an easy-to-use, high-performance distributed deep learning training framework released by Uber in 2017, supporting TensorFlow, Keras, PyTorch, and Apache MXNet. The name comes from a traditional Russian folk dance in which dancers join hands and dance in a circle, much like the way distributed training processes use Horovod to communicate with one another. Horovod draws on the strengths of Facebook's "Training ImageNet in 1 Hour" work and Baidu's ring-allreduce: it delivers strong distributed training performance while keeping the front-end API simple and supporting multiple deep learning frameworks, which makes it developer-friendly and something of a benchmark project for distributed training. Because each machine learning framework exploits the underlying collective communication libraries (NCCL, OpenMPI, Gloo, etc.) to a different degree, none of them fully taps the power of those libraries on its own; Horovod integrates with these frameworks to provide a single easy-to-use, efficient solution.
horovodrun accepts the following arguments:
usage: horovodrun [-h] [-v] -np NP [-cb] [-p SSH_PORT] [--disable-cache]
                  [--start-timeout START_TIMEOUT] [--verbose]
                  [--config-file CONFIG_FILE]
                  [--fusion-threshold-mb FUSION_THRESHOLD_MB]
                  [--cycle-time-ms CYCLE_TIME_MS]
                  [--cache-capacity CACHE_CAPACITY]
                  [--hierarchical-allreduce | --no-hierarchical-allreduce]
                  [--hierarchical-allgather | --no-hierarchical-allgather]
                  [--autotune] [--autotune-log-file AUTOTUNE_LOG_FILE]
                  [--autotune-warmup-samples AUTOTUNE_WARMUP_SAMPLES]
                  [--autotune-steps-per-sample AUTOTUNE_STEPS_PER_SAMPLE]
                  [--autotune-bayes-opt-max-samples AUTOTUNE_BAYES_OPT_MAX_SAMPLES]
                  [--autotune-gaussian-process-noise AUTOTUNE_GAUSSIAN_PROCESS_NOISE]
                  [--timeline-filename TIMELINE_FILENAME]
                  [--timeline-mark-cycles] [--no-stall-check]
                  [--stall-check-warning-time-seconds STALL_CHECK_WARNING_TIME_SECONDS]
                  [--stall-check-shutdown-time-seconds STALL_CHECK_SHUTDOWN_TIME_SECONDS]
                  [--mpi-threads-disable]
                  [--num-nccl-streams NUM_NCCL_STREAMS]
                  [--mlsl-bgt-affinity MLSL_BGT_AFFINITY]
                  [--log-level {TRACE,DEBUG,INFO,WARNING,ERROR,FATAL}]
                  [--log-hide-timestamp] [-H HOSTS | -hostfile HOSTFILE]
                  [--gloo | --mpi]
                  ...

Horovod Runner

positional arguments:
  command               Command to be executed.

optional arguments:
  -h, --help            show this help message and exit

  # Show the Horovod version
  -v, --version         Shows Horovod version.

  # Total number of training processes
  -np NP, --num-proc NP
                        Total number of training processes.

  # Show which frameworks and libraries are built into Horovod
  -cb, --check-build    Shows which frameworks and libraries have been built
                        into Horovod.

  # SSH port to use on all hosts
  -p SSH_PORT, --ssh-port SSH_PORT
                        SSH port on all the hosts.

  # Without this flag, the initialization checks run only once every 60
  # minutes (as long as they pass); with it, all checks run on every call
  --disable-cache       If the flag is not set, horovodrun will perform the
                        initialization checks only once every 60 minutes --
                        if the checks successfully pass. Otherwise, all the
                        checks will run every time horovodrun is called.

  # Timeout for the checks and process startup (default: 30 seconds); can
  # also be set via the HOROVOD_START_TIMEOUT environment variable
  --start-timeout START_TIMEOUT
                        Horovodrun has to perform all the checks and start
                        the processes before the specified timeout. The
                        default value is 30 seconds. Alternatively, the
                        environment variable HOROVOD_START_TIMEOUT can also
                        be used to specify the initialization timeout.

  # Print extra messages
  --verbose             If this flag is set, extra messages will be printed.

  # YAML file with Horovod runtime parameters; it overrides command line
  # arguments given before it and is overridden by arguments after it
  --config-file CONFIG_FILE
                        Path to YAML file containing runtime parameter
                        configuration for Horovod. Note that this will
                        override any command line arguments provided before
                        this argument, and will be overridden by any
                        arguments that come after it.

tuneable parameter arguments:

  # Fusion buffer threshold in MB: the maximum amount of tensor data fused
  # into a single batch during allreduce / allgather; 0 disables fusion
  --fusion-threshold-mb FUSION_THRESHOLD_MB
                        Fusion buffer threshold in MB. This is the maximum
                        amount of tensor data that can be fused together into
                        a single batch during allreduce / allgather. Setting 0
                        disables tensor fusion. (default: 64)

  # Delay between tensor fusion cycles: a longer cycle batches more tensors
  # but adds latency to each allreduce / allgather
  --cycle-time-ms CYCLE_TIME_MS
                        Cycle time in ms. This is the delay between each
                        tensor fusion cycle. The larger the cycle time, the
                        more batching, but the greater latency between each
                        allreduce / allgather operation. (default: 5)

  # Maximum number of tensor names cached to reduce the coordination needed
  # between workers before allreduce / allgather
  --cache-capacity CACHE_CAPACITY
                        Maximum number of tensor names that will be cached to
                        reduce amount of coordination required between
                        workers before performing allreduce / allgather.
                        (default: 1024)

  # Use hierarchical allreduce instead of ring allreduce: a local allreduce
  # within each host, a parallel cross-host allreduce between equal local
  # ranks, then a local gather
  --hierarchical-allreduce
                        Perform hierarchical allreduce between workers
                        instead of ring allreduce. Hierarchical allreduce
                        performs a local allreduce / gather within a host,
                        then a parallel cross allreduce between equal local
                        ranks across workers, and finally a local gather.

  # Explicitly disable hierarchical allreduce so autotuning cannot enable it
  --no-hierarchical-allreduce
                        Explicitly disable hierarchical allreduce to prevent
                        autotuning from adjusting it.

  # Use hierarchical allgather instead of ring allgather
  --hierarchical-allgather
                        Perform hierarchical allgather between workers
                        instead of ring allgather. See hierarchical allreduce
                        for algorithm details.

  # Explicitly disable hierarchical allgather so autotuning cannot enable it
  --no-hierarchical-allgather
                        Explicitly disable hierarchical allgather to prevent
                        autotuning from adjusting it.

autotune arguments:

  # Autotune the tuneable parameters to maximize allreduce / allgather
  # throughput; explicitly set parameters are held constant during tuning
  --autotune            Perform autotuning to select parameter argument
                        values that maximize throughput for allreduce /
                        allgather. Any parameter explicitly set will be held
                        constant during tuning.

  # Log file recording each autotuning trial
  --autotune-log-file AUTOTUNE_LOG_FILE
                        Comma-separated log of trials containing each
                        hyperparameter and the score of the trial. The last
                        row will always contain the best value found.

  # Number of warmup samples discarded before optimization begins (default: 3)
  --autotune-warmup-samples AUTOTUNE_WARMUP_SAMPLES
                        Number of samples to discard before beginning the
                        optimization process during autotuning. Performance
                        during the first few batches can be affected by
                        initialization and cache warmups. (default: 3)

  # Approximate number of steps recorded per sample (default: 10)
  --autotune-steps-per-sample AUTOTUNE_STEPS_PER_SAMPLE
                        Number of steps (approximate) to record before
                        observing a sample. The sample score is defined to be
                        the median score over all batches within the sample.
                        The more batches per sample, the less variance in
                        sample scores, but the longer autotuning will take.
                        (default: 10)

  # Maximum samples collected per Bayesian optimization process (default: 20)
  --autotune-bayes-opt-max-samples AUTOTUNE_BAYES_OPT_MAX_SAMPLES
                        Maximum number of samples to collect for each
                        Bayesian optimization process. (default: 20)

  # Regularization value accounting for noise in samples (default: 0.8)
  --autotune-gaussian-process-noise AUTOTUNE_GAUSSIAN_PROCESS_NOISE
                        Regularization value [0, 1] applied to account for
                        noise in samples. (default: 0.8)

timeline arguments:

  # JSON file to which timeline events are recorded for performance debugging
  --timeline-filename TIMELINE_FILENAME
                        JSON file containing timeline of Horovod events used
                        for debugging performance. If this is provided,
                        timeline events will be recorded, which can have a
                        negative impact on training performance.

  # Mark cycles on the timeline (requires --timeline-filename)
  --timeline-mark-cycles
                        Mark cycles on the timeline. Only enabled if the
                        timeline filename is provided.

stall check arguments:

  # Disable the stall check
  --no-stall-check      Disable the stall check. The stall check will log a
                        warning when workers have stalled waiting for other
                        ranks to submit tensors.

  # Seconds before a stall warning is logged to stderr (default: 60)
  --stall-check-warning-time-seconds STALL_CHECK_WARNING_TIME_SECONDS
                        Seconds until the stall warning is logged to stderr.
                        (default: 60)

  # Seconds before Horovod shuts down due to a stall; only takes effect if
  # greater than the warning time (default: 0)
  --stall-check-shutdown-time-seconds STALL_CHECK_SHUTDOWN_TIME_SECONDS
                        Seconds until Horovod is shutdown due to stall.
                        Shutdown will only take place if this value is
                        greater than the warning time. (default: 0)

library arguments:

  # Disable MPI threading support (MPI mode only); multi-threaded MPI can
  # slow down other components, but is needed to run mpi4py on top of Horovod
  --mpi-threads-disable
                        Disable MPI threading support. Only applies when
                        running in MPI mode. In some cases, multi-threaded
                        MPI can slow down other components, but is necessary
                        if you wish to run mpi4py on top of Horovod.

  # Number of NCCL streams (default: 1)
  --num-nccl-streams NUM_NCCL_STREAMS
                        Number of NCCL streams. Only applies when running
                        with NCCL support. (default: 1)

  # MLSL background thread affinity (default: 0)
  --mlsl-bgt-affinity MLSL_BGT_AFFINITY
                        MLSL background thread affinity. Only applies when
                        running with MLSL support. (default: 0)

logging arguments:

  # Minimum level (of the six log levels) written to stderr
  --log-level {TRACE,DEBUG,INFO,WARNING,ERROR,FATAL}
                        Minimum level to log to stderr from the Horovod
                        backend. (default: WARNING)

  # Hide the timestamp in Horovod log messages
  --log-hide-timestamp  Hide the timestamp from Horovod log messages.

host arguments:

  # List of host names and slots; e.g. host1:2,host2:4,host3:1 runs 2
  # processes on host1, 4 on host2, and 1 on host3 (default: localhost:<np>)
  -H HOSTS, --hosts HOSTS
                        List of host names and the number of available slots
                        for running processes on each, of the form:
                        <hostname>:<slots> (e.g.: host1:2,host2:4,host3:1
                        indicating 2 processes can run on host1, 4 on host2,
                        and 1 on host3). If not specified, defaults to using
                        localhost:<np>

  # Host file listing host names and slots, one "<hostname> slots=<slots>"
  # entry per line
  -hostfile HOSTFILE, --hostfile HOSTFILE
                        Path to a host file containing the list of host names
                        and the number of available slots. Each line of the
                        file must be of the form: <hostname> slots=<slots>

controller arguments:

  # Use the Gloo controller (the default if Horovod was built without MPI)
  --gloo                Run Horovod using the Gloo controller. This will be
                        the default if Horovod was not built with MPI
                        support.

  # Use the MPI controller (the default if Horovod was built with MPI)
  --mpi                 Run Horovod using the MPI controller. This will be
                        the default if Horovod was built with MPI support.
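To illustrate the host arguments above, here is a minimal sketch that writes a hostfile in the exact format the help text describes. The host names (`host1`, `host2`, `host3`) and the training script name (`train.py`) are placeholders; the `horovodrun` invocations need a reachable cluster, so they are shown as comments.

```shell
# Write a hostfile in the format described above:
# "<hostname> slots=<slots>", one host per line.
# host1/host2/host3 are placeholder host names.
cat > hostfile <<'EOF'
host1 slots=2
host2 slots=4
host3 slots=1
EOF

# Hypothetical launches (train.py is a placeholder training script):
#   horovodrun -np 7 -hostfile hostfile python train.py
#   horovodrun -np 7 -H host1:2,host2:4,host3:1 python train.py
#   horovodrun -np 2 --gloo python train.py   # force the Gloo controller

# Show the file we just wrote
cat hostfile
```

The two launch forms are equivalent: `-H` takes the host list inline, while `-hostfile` reads the same information from a file, which is more convenient for larger clusters.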
Frameworks and libraries built into Horovod:
# horovodrun -cb
Horovod v0.18.1:
Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet
Available Controllers:
    [X] MPI
    [X] Gloo
Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] MLSL
    [X] MPI
    [X] Gloo
Original article: https://blog.csdn.net/thmx43/article/details/116014088
Reposted from: http://www.sulao.cn/post/826