Monitor simulation

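Flower allows you to monitor system resources while running your simulation. In addition, the Flower simulation engine lets you decide how to allocate resources on a per-client basis and how to cap total usage. Insights from resource consumption can help you make better-informed decisions and shorten execution time.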

The instructions below assume that you are using macOS and have the Homebrew package manager installed.

Downloads

brew install prometheus grafana

Prometheus is used for data collection, while Grafana lets you visualize the collected data. Both are well integrated with Ray, which Flower uses under the hood.

Overwrite the configuration files (depending on your device, they might be installed on a different path).

If you are on an M1 Mac, the paths should be:

/opt/homebrew/etc/prometheus.yml
/opt/homebrew/etc/grafana/grafana.ini

On previous-generation Intel Mac devices, the paths should be:

/usr/local/etc/prometheus.yml
/usr/local/etc/grafana/grafana.ini

Open the respective configuration files and change them. Depending on your device, use one of the following two commands:

# M1 macOS
open /opt/homebrew/etc/prometheus.yml

# Intel macOS
open /usr/local/etc/prometheus.yml

Then delete all the text in the file and paste in the new Prometheus configuration shown below. You may adjust the time intervals to your requirements:

global:
  scrape_interval: 1s
  evaluation_interval: 1s

scrape_configs:
# Scrape from each ray node as defined in the service_discovery.json provided by ray.
- job_name: 'ray'
  file_sd_configs:
  - files:
    - '/tmp/ray/prom_metrics_service_discovery.json'
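If Prometheus later shows no Ray targets, it can help to look at the service discovery file named in the configuration above. The following is a minimal sketch, assuming the file follows the standard Prometheus file-based service discovery JSON layout (a list of objects, each with a `targets` list). The helper name `list_scrape_targets` and the sample content are hypothetical and only for illustration; the real file is written by Ray and will contain different addresses.

```python
import json


def list_scrape_targets(sd_json_text: str) -> list[str]:
    """Collect every "host:port" target from a file_sd JSON document."""
    entries = json.loads(sd_json_text)
    targets = []
    for entry in entries:
        # Each entry may list several targets; a missing key means none.
        targets.extend(entry.get("targets", []))
    return targets


# Illustrative content only (assumption): not copied from a real Ray session.
SAMPLE = '[{"labels": {"job": "ray"}, "targets": ["127.0.0.1:60000", "127.0.0.1:44217"]}]'

print(list_scrape_targets(SAMPLE))
```

In practice, you could pass in the text of /tmp/ray/prom_metrics_service_discovery.json while a simulation is running; an empty result would suggest that Ray has not registered any metrics endpoints yet.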

After you have edited the Prometheus configuration, do the same with the Grafana configuration file. As before, open it using one of the following commands:

# M1 macOS
open /opt/homebrew/etc/grafana/grafana.ini

# Intel macOS
open /usr/local/etc/grafana/grafana.ini

The file should open in an editor, where you can apply the following configuration as before.

[security]
allow_embedding = true

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
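A quick way to double-check the edited Grafana file is to parse it and compare it against the settings listed above. The sketch below uses Python's standard `configparser` and assumes the file contains only sectioned settings, as in the pasted configuration; a key placed before the first section header would raise a parse error. The helper name `missing_settings` is hypothetical.

```python
import configparser

# Settings taken from the configuration shown above.
REQUIRED = {
    ("security", "allow_embedding"): "true",
    ("auth.anonymous", "enabled"): "true",
    ("auth.anonymous", "org_role"): "Viewer",
    ("paths", "provisioning"): "/tmp/ray/session_latest/metrics/grafana/provisioning",
}


def missing_settings(ini_text: str) -> list[str]:
    """Return the required settings that are absent or set to another value."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    problems = []
    for (section, key), expected in REQUIRED.items():
        actual = parser.get(section, key, fallback=None)
        if actual != expected:
            problems.append(f"[{section}] {key} = {expected}")
    return problems


SAMPLE_INI = """
[security]
allow_embedding = true

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
"""

print(missing_settings(SAMPLE_INI))
```

An empty list means every expected setting was found; otherwise each entry names a line that still needs to be added to grafana.ini.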

Congratulations, you have just installed and configured all the software needed for metrics tracking. Now, let's start it.

Tracking metrics

Before running your Flower simulation, you have to start the monitoring tools you have just installed and configured.

brew services start prometheus
brew services start grafana

When starting a simulation, include the following argument in your Python code.

import flwr as fl

fl.simulation.start_simulation(
    # ...
    # all the args you used before
    # ...
    ray_init_args={"include_dashboard": True},  # enables the Ray Dashboard
)

Now you are ready to start your workload.

Shortly after the simulation starts, you should see the following log in your terminal:

2023-01-20 16:22:58,620       INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

You can view everything at http://127.0.0.1:8265.

This is the Ray Dashboard. You can navigate to "Metrics" (the lowest option on the left panel).

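Alternatively, you can view the metrics in Grafana by clicking "View in Grafana" in the top right corner. Note that the Ray Dashboard is only accessible while the simulation is running. After the simulation ends, you can only use Grafana to explore the metrics. You can open Grafana by going to http://localhost:3000/.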

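After you finish the visualization, stop Prometheus and Grafana. This is important because, as long as they are running, they keep ports occupied on your machine (port 3000, for example).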

brew services stop prometheus
brew services stop grafana
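To confirm that the services have released their ports (or, conversely, that they are up before a run), a small connectivity check can be handy. The sketch below uses only the standard library. Port 3000 for Grafana and port 8265 for the Ray Dashboard come from this guide; port 9090 is Prometheus's usual default and is an assumption about your setup. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Assumed default ports; adjust them if your configuration differs.
SERVICES = {"Grafana": 3000, "Prometheus": 9090, "Ray Dashboard": 8265}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "in use" if is_port_open("127.0.0.1", port) else "free"
        print(f"{name} (port {port}): {state}")
```

After stopping both services, Grafana and Prometheus should be reported as free; if one is still in use, checking the running Homebrew services is a reasonable next step.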

Resource allocation

To allocate system resources to simulation clients efficiently on your own, you need to understand how the Ray library works.

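By default, the simulation (which Ray handles under the hood) starts with all the available resources on the system and shares them among the clients. This does not mean that the resources are divided equally among all clients, nor that model training happens on all of them simultaneously. You will learn more about this later in this guide. You can check the system resources by running the following: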

import ray
ray.available_resources()

In Google Colab, the result you see might look similar to this:

{'memory': 8020104807.0,
 'GPU': 1.0,
 'object_store_memory': 4010052403.0,
 'CPU': 2.0,
 'accelerator_type:T4': 1.0,
 'node:172.28.0.2': 1.0}

However, you can override the defaults. When starting a simulation, do the following (you do not need to override all of them):

num_cpus = 2
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024  # roughly 16 GB, expressed in bytes
fl.simulation.start_simulation(
    # ...
    # all the args you were specifying before
    # ...
    ray_init_args={
        "include_dashboard": True,  # we need this one for tracking
        "num_cpus": num_cpus,
        "num_gpus": num_gpus,
        "memory": ram_memory,
    },
)

Let's also specify the resources for a single client.

# Total resources for simulation
num_cpus = 4
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024  # roughly 16 GB, expressed in bytes

# Single client resources
client_num_cpus = 2
client_num_gpus = 1

fl.simulation.start_simulation(
    # ...
    # all the args you were specifying before
    # ...
    ray_init_args={
        "include_dashboard": True,  # we need this one for tracking
        "num_cpus": num_cpus,
        "num_gpus": num_gpus,
        "memory": ram_memory,
    },
    # The argument below is new
    client_resources={
        "num_cpus": client_num_cpus,
        "num_gpus": client_num_gpus,
    },
)

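Now comes the crucial part. Ray starts a new client only when all the resources that client requires are available, so clients run in parallel only as far as the resources allow.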

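In the example above, only one client runs at a time, so your clients will not run concurrently. Setting client_num_gpus = 0.5 would allow two clients to run and therefore enable them to run concurrently. Be careful not to require more resources than are available. If you specified client_num_gpus = 2, the simulation would not start (even if you had 2 GPUs but decided to set 1 in ray_init_args).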
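The arithmetic behind these three cases can be sketched as a small calculation. This is a simplified model, not Ray's or Flower's actual scheduler: it assumes that concurrency is limited only by the resources named in client_resources and ignores memory and any other Ray workloads. The helper name `max_concurrent_clients` is hypothetical.

```python
import math


def max_concurrent_clients(total: dict[str, float], per_client: dict[str, float]) -> int:
    """Estimate how many clients fit into the total resources at once."""
    limits = []
    for name, needed in per_client.items():
        if needed <= 0:
            continue  # a resource the client does not request never limits it
        available = total.get(name, 0)
        # The small epsilon guards against floating-point rounding.
        limits.append(math.floor(available / needed + 1e-9))
    if not limits:
        raise ValueError("per_client must request at least one resource")
    return min(limits)


# Totals from the example above (the values passed via ray_init_args).
total = {"num_cpus": 4, "num_gpus": 1}

print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 1}))    # 1
print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 0.5}))  # 2
print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 2}))    # 0
```

A result of 0 corresponds to the case where a single client asks for more than the simulation was given, which is why the simulation would not start.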

FAQ

Q: I don't see any metrics logged.

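A: The time range might not be set properly. The setting is in the top right corner ("Last 30 minutes" by default). Change the time range to cover the period when the simulation was running.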

Q: I see "Grafana server not detected. Please make sure the Grafana server is running and refresh this page".

A: You probably don't have Grafana running. Check the running services:

brew services list

Q: I see "This site can't be reached" when going to http://127.0.0.1:8265.

A: Either the simulation has already finished, or you still need to start Prometheus.

Resources

Ray Dashboard: https://docs.ray.io/en/latest/ray-observability/getting-started.html

Ray Metrics: https://docs.ray.io/en/latest/cluster/metrics.html