Horovod is an easy-to-use, high-performance distributed deep learning training framework released by Uber in 2017, with support for TensorFlow, Keras, PyTorch, and Apache MXNet. The name comes from a traditional Russian folk dance in which dancers join hands and move in a circle, much like the way workers in a distributed training job communicate with each other through Horovod. Horovod draws on the strengths of Facebook's "Training ImageNet in 1 Hour" work and Baidu's ring allreduce: it delivers strong distributed training performance while keeping the front-end API simple and supporting multiple deep learning frameworks, which makes it friendly to developers and something of a reference project for distributed training. Because individual machine learning frameworks vary in how well they exploit the underlying collective communication libraries (NCCL, OpenMPI, Gloo, and so on), they often cannot use these libraries to their full potential. Horovod therefore sits on top of these frameworks and provides a single solution that is both easy to use and efficient.
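To give a sense of how small the front-end change is, here is a minimal Horovod + PyTorch training sketch. The hvd.* calls are Horovod's standard PyTorch API; the linear model, random data, and learning rate are made up purely for illustration.

```
# Minimal sketch: data-parallel training with Horovod's PyTorch front end.
# Launch with something like: horovodrun -np 4 python train.py
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()  # one Horovod process per worker
# On GPU machines you would typically also pin each process to one device:
# torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1)                # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by world size

# Wrap the optimizer so gradients are averaged across workers via allreduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Ensure every worker starts from the same model and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(5):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0:                              # log from a single worker only
        print(f"step {step}: loss {loss.item():.4f}")
```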
horovodrun accepts the following arguments (example launch commands follow the listing below):
```
usage: horovodrun [-h] [-v] -np NP [-cb] [-p SSH_PORT] [--disable-cache]
                  [--start-timeout START_TIMEOUT] [--verbose]
                  [--config-file CONFIG_FILE]
                  [--fusion-threshold-mb FUSION_THRESHOLD_MB]
                  [--cycle-time-ms CYCLE_TIME_MS]
                  [--cache-capacity CACHE_CAPACITY]
                  [--hierarchical-allreduce | --no-hierarchical-allreduce]
                  [--hierarchical-allgather | --no-hierarchical-allgather]
                  [--autotune] [--autotune-log-file AUTOTUNE_LOG_FILE]
                  [--autotune-warmup-samples AUTOTUNE_WARMUP_SAMPLES]
                  [--autotune-steps-per-sample AUTOTUNE_STEPS_PER_SAMPLE]
                  [--autotune-bayes-opt-max-samples AUTOTUNE_BAYES_OPT_MAX_SAMPLES]
                  [--autotune-gaussian-process-noise AUTOTUNE_GAUSSIAN_PROCESS_NOISE]
                  [--timeline-filename TIMELINE_FILENAME]
                  [--timeline-mark-cycles] [--no-stall-check]
                  [--stall-check-warning-time-seconds STALL_CHECK_WARNING_TIME_SECONDS]
                  [--stall-check-shutdown-time-seconds STALL_CHECK_SHUTDOWN_TIME_SECONDS]
                  [--mpi-threads-disable] [--num-nccl-streams NUM_NCCL_STREAMS]
                  [--mlsl-bgt-affinity MLSL_BGT_AFFINITY]
                  [--log-level {TRACE,DEBUG,INFO,WARNING,ERROR,FATAL}]
                  [--log-hide-timestamp] [-H HOSTS | -hostfile HOSTFILE]
                  [--gloo | --mpi]
                  ...

Horovod Runner

positional arguments:
  command               Command to be executed.

optional arguments:
  -h, --help            show this help message and exit
  # show the Horovod version
  -v, --version         Shows Horovod version.
  # total number of training processes
  -np NP, --num-proc NP
                        Total number of training processes.
  # show which frameworks and libraries are built into Horovod
  -cb, --check-build    Shows which frameworks and libraries have been built into Horovod.
  # SSH port to use on all hosts
  -p SSH_PORT, --ssh-port SSH_PORT
                        SSH port on all the hosts.
  # without this flag, successful initialization checks are cached and rerun only every 60 minutes
  --disable-cache       If the flag is not set, horovodrun will perform the initialization checks only once every 60 minutes -- if the checks successfully pass. Otherwise, all the checks will run every time horovodrun is called.
  # timeout for the checks and process start-up (default: 30 s; the HOROVOD_START_TIMEOUT environment variable works as well)
  --start-timeout START_TIMEOUT
                        Horovodrun has to perform all the checks and start the processes before the specified timeout. The default value is 30 seconds. Alternatively, the environment variable HOROVOD_START_TIMEOUT can also be used to specify the initialization timeout.
  # print extra messages
  --verbose             If this flag is set, extra messages will be printed.
  # YAML file with runtime parameters; overrides earlier command line arguments and is overridden by later ones
  --config-file CONFIG_FILE
                        Path to YAML file containing runtime parameter configuration for Horovod. Note that this will override any command line arguments provided before this argument, and will be overridden by any arguments that come after it.

tuneable parameter arguments:
  # fusion buffer threshold in MB; 0 disables tensor fusion (default: 64)
  --fusion-threshold-mb FUSION_THRESHOLD_MB
                        Fusion buffer threshold in MB. This is the maximum amount of tensor data that can be fused together into a single batch during allreduce / allgather. Setting 0 disables tensor fusion. (default: 64)
  # delay between tensor fusion cycles; longer cycles batch more but add latency (default: 5)
  --cycle-time-ms CYCLE_TIME_MS
                        Cycle time in ms. This is the delay between each tensor fusion cycle. The larger the cycle time, the more batching, but the greater latency between each allreduce / allgather operation. (default: 5)
  # maximum number of cached tensor names, to reduce coordination between workers (default: 1024)
  --cache-capacity CACHE_CAPACITY
                        Maximum number of tensor names that will be cached to reduce amount of coordination required between workers before performing allreduce / allgather. (default: 1024)
  # hierarchical allreduce: local allreduce within each host, cross allreduce between equal local ranks, then a local gather
  --hierarchical-allreduce
                        Perform hierarchical allreduce between workers instead of ring allreduce. Hierarchical allreduce performs a local allreduce / gather within a host, then a parallel cross allreduce between equal local ranks across workers, and finally a local gather.
  # explicitly disable hierarchical allreduce so that autotuning cannot enable it
  --no-hierarchical-allreduce
                        Explicitly disable hierarchical allreduce to prevent autotuning from adjusting it.
  # hierarchical allgather instead of ring allgather
  --hierarchical-allgather
                        Perform hierarchical allgather between workers instead of ring allgather. See hierarchical allreduce for algorithm details.
  # explicitly disable hierarchical allgather so that autotuning cannot enable it
  --no-hierarchical-allgather
                        Explicitly disable hierarchical allgather to prevent autotuning from adjusting it.

autotune arguments:
  # tune parameter values automatically to maximize allreduce / allgather throughput
  --autotune            Perform autotuning to select parameter argument values that maximize throughput for allreduce / allgather. Any parameter explicitly set will be held constant during tuning.
  # log file for the autotuning trials
  --autotune-log-file AUTOTUNE_LOG_FILE
                        Comma-separated log of trials containing each hyperparameter and the score of the trial. The last row will always contain the best value found.
  # number of warm-up samples discarded before tuning starts (default: 3)
  --autotune-warmup-samples AUTOTUNE_WARMUP_SAMPLES
                        Number of samples to discard before beginning the optimization process during autotuning. Performance during the first few batches can be affected by initialization and cache warmups. (default: 3)
  --autotune-steps-per-sample AUTOTUNE_STEPS_PER_SAMPLE
                        Number of steps (approximate) to record before observing a sample. The sample score is defined to be the median score over all batches within the sample. The more batches per sample, the less variance in sample scores, but the longer autotuning will take. (default: 10)
  --autotune-bayes-opt-max-samples AUTOTUNE_BAYES_OPT_MAX_SAMPLES
                        Maximum number of samples to collect for each Bayesian optimization process. (default: 20)
  --autotune-gaussian-process-noise AUTOTUNE_GAUSSIAN_PROCESS_NOISE
                        Regularization value [0, 1] applied to account for noise in samples. (default: 0.8)

timeline arguments:
  # record timeline events to a JSON file for performance debugging
  --timeline-filename TIMELINE_FILENAME
                        JSON file containing timeline of Horovod events used for debugging performance. If this is provided, timeline events will be recorded, which can have a negative impact on training performance.
  --timeline-mark-cycles
                        Mark cycles on the timeline. Only enabled if the timeline filename is provided.

stall check arguments:
  # disable the stall check
  --no-stall-check      Disable the stall check. The stall check will log a warning when workers have stalled waiting for other ranks to submit tensors.
  # seconds before a stall warning is logged to stderr (default: 60)
  --stall-check-warning-time-seconds STALL_CHECK_WARNING_TIME_SECONDS
                        Seconds until the stall warning is logged to stderr. (default: 60)
  --stall-check-shutdown-time-seconds STALL_CHECK_SHUTDOWN_TIME_SECONDS
                        Seconds until Horovod is shutdown due to stall. Shutdown will only take place if this value is greater than the warning time. (default: 0)

library arguments:
  # disable MPI threading support (MPI mode only); multi-threaded MPI can slow down other components, but is required to run mpi4py on top of Horovod
  --mpi-threads-disable
                        Disable MPI threading support. Only applies when running in MPI mode. In some cases, multi-threaded MPI can slow down other components, but is necessary if you wish to run mpi4py on top of Horovod.
  # number of NCCL streams; only applies when running with NCCL support (default: 1)
  --num-nccl-streams NUM_NCCL_STREAMS
                        Number of NCCL streams. Only applies when running with NCCL support. (default: 1)
  # MLSL background thread affinity; only applies when running with MLSL support (default: 0)
  --mlsl-bgt-affinity MLSL_BGT_AFFINITY
                        MLSL background thread affinity. Only applies when running with MLSL support. (default: 0)

logging arguments:
  # minimum of the six log levels that is written to stderr (default: WARNING)
  --log-level {TRACE,DEBUG,INFO,WARNING,ERROR,FATAL}
                        Minimum level to log to stderr from the Horovod backend. (default: WARNING).
  # hide the timestamp in Horovod log messages
  --log-hide-timestamp  Hide the timestamp from Horovod log messages.

host arguments:
  # host names and available slots per host, e.g. host1:2,host2:4,host3:1 (2 processes on host1, 4 on host2, 1 on host3); defaults to localhost:<np>
  -H HOSTS, --hosts HOSTS
                        List of host names and the number of available slots for running processes on each, of the form: <hostname>:<slots> (e.g.: host1:2,host2:4,host3:1 indicating 2 processes can run on host1, 4 on host2, and 1 on host3). If not specified, defaults to using localhost:<np>
  # host file with host names and slot counts; each line must be of the form: <hostname> slots=<slots>
  -hostfile HOSTFILE, --hostfile HOSTFILE
                        Path to a host file containing the list of host names and the number of available slots. Each line of the file must be of the form: <hostname> slots=<slots>

controller arguments:
  # use the Gloo controller (the default if Horovod was built without MPI support)
  --gloo                Run Horovod using the Gloo controller. This will be the default if Horovod was not built with MPI support.
  # use the MPI controller (the default if Horovod was built with MPI support)
  --mpi                 Run Horovod using the MPI controller. This will be the default if Horovod was built with MPI support.
```
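As a concrete illustration of the flags above, here are a few typical launch commands; server1, server2, myhostfile, autotune.csv, and train.py are placeholder names:

```
# 4 training processes on the local machine
horovodrun -np 4 python train.py

# 8 processes across two machines with 4 slots each, connecting over SSH port 12345
horovodrun -np 8 -H server1:4,server2:4 -p 12345 python train.py

# the same topology read from a host file, with autotuning enabled and trials logged to a file
horovodrun -np 8 -hostfile myhostfile --autotune --autotune-log-file autotune.csv python train.py
```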
To see which frameworks and libraries have been built into Horovod:
```
# horovodrun -cb
Horovod v0.18.1:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] MLSL
    [X] MPI
    [X] Gloo
```
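The same kind of build information can also be queried from Python. A rough sketch, assuming horovod.torch is installed and that the nccl_built / mpi_built / gloo_built helpers are available in your Horovod version:

```
# Sketch: query Horovod build information programmatically.
# Assumes the nccl_built / mpi_built / gloo_built helpers exist in the installed version.
import horovod.torch as hvd

print("NCCL support:", hvd.nccl_built())  # compiled with NCCL tensor operations?
print("MPI support: ", hvd.mpi_built())   # compiled with MPI?
print("Gloo support:", hvd.gloo_built())  # compiled with Gloo?
```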
Original article: https://blog.csdn.net/thmx43/article/details/116014088