
FATE Study: Reading the Source Code Through the Logs (Part 3): Lifecycle of a Typical Job

Overview

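From a federated-modeling point of view, the whole lifecycle of a job is a DAG composed of a series of functional modules; see the official FATE documentation.
In terms of implementation, each functional module is one of the algorithm or data components in FATE.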

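Read alongside the source code and from the perspective of the logs, however, the job lifecycle can also be divided into stages according to the operations performed in each one. The stages are listed below.
Figure: job life cycle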

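  1. submit: submit the job.
  2. create: create the job and the tasks that belong to it (essentially metadata). Once created, all of them are in the waiting state.
  3. job schedule: jobs in the waiting state are polled in FIFO order; resources are requested for the job and its status is set to running.
  4. task scheduler: jobs in the running state are polled, and their tasks are scheduled one after another according to their dependencies.
  5. execute: run the tasks that were set to running in step 4.
  6. task finish: the task has finished running; resources are reclaimed and the environment is cleaned up.
  7. job finish: the job has finished running; resources are reclaimed and the environment is cleaned up.
  8. cancel: abort a job that is executing. cancel is not an operation in the normal lifecycle. It can happen at any stage after create, and once received it takes effect at the next polling round.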

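In addition there is polling, which is not a lifecycle stage. It is the polling mechanism on fate_flow_server: it detects jobs and tasks in their various states and applies the corresponding operation to each. Strictly speaking it is not part of the job lifecycle; it only plays the role of a poller.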
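The stages above can be sketched as a small state machine driven by a polling loop. This is only a toy model written for illustration, not FATE Flow's actual scheduler: the names `Task`, `Job`, `poll_once` and `run` are hypothetical, resources are assumed to be unlimited, and a running task is assumed to complete one polling round after it starts.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # names of upstream tasks
    status: str = "waiting"


@dataclass
class Job:
    job_id: str
    tasks: dict  # task name -> Task
    status: str = "waiting"
    cancel_requested: bool = False


def _schedule_tasks(job, events):
    # execute / task finish: in this toy model a running task succeeds
    # one polling round after it was started
    for task in job.tasks.values():
        if task.status == "running":
            task.status = "success"
            events.append(f"{job.job_id}/{task.name}: task success")
    # task scheduler: start waiting tasks whose dependencies all succeeded
    for task in job.tasks.values():
        ready = all(job.tasks[d].status == "success" for d in task.deps)
        if task.status == "waiting" and ready:
            task.status = "running"
            events.append(f"{job.job_id}/{task.name}: task running")
    # job finish: every task succeeded
    if all(t.status == "success" for t in job.tasks.values()):
        job.status = "success"
        events.append(f"{job.job_id}: job success")


def poll_once(jobs, events):
    """Run one polling round over the jobs in submission (FIFO) order."""
    for job in jobs:
        if job.status not in ("waiting", "running"):
            continue  # already finished or canceled
        if job.cancel_requested:
            # cancel takes effect at the next polling round
            job.status = "canceled"
            events.append(f"{job.job_id}: canceled")
        elif job.status == "waiting":
            # job schedule: (pretend to) apply for resources, mark running
            job.status = "running"
            events.append(f"{job.job_id}: job running")
        else:
            _schedule_tasks(job, events)


def run(jobs, max_rounds=20):
    """Poll until every job is finished or max_rounds is reached."""
    events = []
    for _ in range(max_rounds):
        if all(j.status in ("success", "canceled") for j in jobs):
            break
        poll_once(jobs, events)
    return events


if __name__ == "__main__":
    demo = Job("j1", {"a": Task("a"), "b": Task("b", deps=["a"])})
    for event in run([demo]):
        print(event)
```

The point of the sketch is that nothing moves a job forward except the poller: submit and create only write waiting records, and every later transition, including cancel, happens when a polling round next visits the job.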

Example

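Start with the simplest case, upload. Because upload is a single-party job, the stages from job schedule through job finish involve no multi-party coordination, all the logs are on one machine, and the analysis is relatively simple. Following the documentation, submit the following command from the CLI: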

python fate_flow_client.py -f upload -c upload_guest.json

Log locations

After the job has been submitted successfully, three main groups of logs are produced:

Console log

This is the log printed to the screen once the job has been submitted:

{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202107260820309976351&role=local&party_id=0",
        "job_dsl_path": "/data/projects/fate/jobs/202107260820309976351/job_dsl.json",
        "job_id": "202107260820309976351",
        "job_runtime_conf_on_party_path": "/data/projects/fate/jobs/202107260820309976351/local/job_runtime_on_party_conf.json",
        "job_runtime_conf_path": "/data/projects/fate/jobs/202107260820309976351/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/logs/202107260820309976351",
        "model_info": {
            "model_id": "local-0#model",
            "model_version": "202107260820309976351"
        },
        "namespace": "cl",
        "pipeline_dsl_path": "/data/projects/fate/jobs/202107260820309976351/pipeline_dsl.json",
        "table_name": "hetero_guest",
        "train_runtime_conf_path": "/data/projects/fate/jobs/202107260820309976351/train_runtime_conf.json"
    },
    "jobId": "202107260820309976351",
    "retcode": 0,
    "retmsg": "success"
}
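When following a job through the logs, the fields needed most often from this response are the job ID and the log directory. A minimal sketch for pulling them out is shown here; it assumes the console output has been captured as a string, and the helper name `summarize_submit_response` is made up for this example. The sample response is a trimmed copy of the output above.

```python
import json


def summarize_submit_response(text):
    """Extract the fields needed to locate a job's logs from the CLI output."""
    resp = json.loads(text)
    if resp.get("retcode") != 0:
        # a non-zero retcode is treated as a failed submission in this sketch
        raise RuntimeError(f"submit failed: {resp.get('retmsg')}")
    data = resp["data"]
    return {
        "job_id": resp["jobId"],
        "logs_directory": data["logs_directory"],
        "board_url": data["board_url"],
    }


# Trimmed copy of the console output shown above.
SAMPLE = """
{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202107260820309976351&role=local&party_id=0",
        "logs_directory": "/data/projects/fate/logs/202107260820309976351",
        "namespace": "cl",
        "table_name": "hetero_guest"
    },
    "jobId": "202107260820309976351",
    "retcode": 0,
    "retmsg": "success"
}
"""

if __name__ == "__main__":
    for key, value in summarize_submit_response(SAMPLE).items():
        print(f"{key}: {value}")
```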

Container log

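With a KubeFATE deployment, look at the log of the container named python. It mainly records the POST requests and their results.
The part related to submit job is excerpted below: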

1. 10.200.96.235 - - [26/Jul/2021 08:20:31] "POST /v1/party/202107260820309976351/local/0/create HTTP/1.1" 200 -
2. 10.200.96.235 - - [26/Jul/2021 08:20:31] "POST /v1/data/upload?%7B%22file%22:%20%22/data/projects/fate/examples/data/breast_hetero_guest.csv%22,%20%22head%22:%201,%20%22partition%22:%201,%20%22work_mode%22:%201,%20%22table_name%22:%20%22hetero_guest%22,%20%22namespace%22:%20%22cl%22,%20%22config%22:%20%22/data/projects/fate/cl/upload_guest.json%22,%20%22function%22:%20%22upload%22%7D HTTP/1.1" 200 -
3. 10.200.96.235 - - [26/Jul/2021 08:20:32] "POST /v1/party/202107260820309976351/local/0/resource/apply HTTP/1.1" 200 -
4. 10.200.96.235 - - [26/Jul/2021 08:20:32] "POST /v1/party/202107260820309976351/local/0/start HTTP/1.1" 200 -
5. 10.200.96.235 - - [26/Jul/2021 08:20:32] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/status/running HTTP/1.1" 200 -
6. 10.200.96.235 - - [26/Jul/2021 08:20:32] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/start HTTP/1.1" 200 -
7. 10.200.96.235 - - [26/Jul/2021 08:20:33] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/report HTTP/1.1" 200 -
8. 10.200.96.235 - - [26/Jul/2021 08:20:34] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
9. 10.200.96.235 - - [26/Jul/2021 08:20:36] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
10. 10.200.96.235 - - [26/Jul/2021 08:20:38] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
11. 10.200.96.235 - - [26/Jul/2021 08:20:40] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
12. 10.200.96.235 - - [26/Jul/2021 08:20:41] "POST /v1/party/202107260820309976351/local/0/update HTTP/1.1" 200 -
13. 10.200.96.235 - - [26/Jul/2021 08:20:42] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
14. 10.200.96.235 - - [26/Jul/2021 08:20:43] "POST /v1/tracker/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/output_data_info/save HTTP/1.1" 200 -
15. 10.200.96.235 - - [26/Jul/2021 08:20:43] "POST /v1/tracker/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/metric_data/save HTTP/1.1" 200 -
16. 10.200.96.235 - - [26/Jul/2021 08:20:43] "POST /v1/tracker/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/metric_meta/save HTTP/1.1" 200 -
17. 10.200.96.235 - - [26/Jul/2021 08:20:44] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/report HTTP/1.1" 200 -
18. 10.200.96.235 - - [26/Jul/2021 08:20:45] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/collect HTTP/1.1" 200 -
19. 10.200.96.235 - - [26/Jul/2021 08:20:45] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/report HTTP/1.1" 200 -
static conf path: /data/projects/fate/eggroll/conf/eggroll.properties
20. 10.200.96.235 - - [26/Jul/2021 08:20:47] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/status/success HTTP/1.1" 200 -
21. 10.200.96.235 - - [26/Jul/2021 08:20:47] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/stop/success HTTP/1.1" 200 -
22. 10.200.96.235 - - [26/Jul/2021 08:20:47] "POST /v1/party/202107260820309976351/local/0/model HTTP/1.1" 200 -
23. 10.200.96.235 - - [26/Jul/2021 08:20:47] "POST /v1/party/202107260820309976351/local/0/status/success HTTP/1.1" 200 -
24. 10.200.96.235 - - [26/Jul/2021 08:20:47] "POST /v1/party/202107260820309976351/local/0/stop/success HTTP/1.1" 200 -
25. 10.200.96.235 - - [26/Jul/2021 08:20:48] "POST /v1/party/202107260820309976351/local/0/clean HTTP/1.1" 200 -
26. 10.200.96.235 - - [26/Jul/2021 08:20:48] "POST /v1/party/202107260820309976351/local/0/stop/success HTTP/1.1" 200 -
27. 10.200.96.235 - - [26/Jul/2021 08:20:48] "POST /v1/party/202107260820309976351/local/0/clean HTTP/1.1" 200 -

Every stage of the lifecycle described above is reflected in this log.
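To line these requests up with the lifecycle stages, it helps to reduce each line to its timestamp and trailing action (create, resource/apply, start, status/running, and so on). The sketch below does this with a regular expression. The split into job-level and task-level requests is inferred only from the URL shapes in the excerpt above (task-level paths carry a segment that starts with the job ID followed by an underscore); it is an assumption, not a documented FATE Flow rule, and `parse_line` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Matches the timestamp, request target and status code of one access-log line.
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "POST (?P<target>\S+) HTTP/1\.1" (?P<status>\d{3})'
)


def parse_line(line):
    """Return a dict describing one POST request, or None for other lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # drop the query string, then split the path into its segments
    parts = urlsplit(match.group("target")).path.strip("/").split("/")
    api = parts[1] if len(parts) > 1 else ""
    if api not in ("party", "tracker") or len(parts) < 5:
        level, action = "other", "/".join(parts[1:])
    else:
        job_id, rest = parts[2], parts[3:]
        if len(rest) > 1 and rest[1].startswith(job_id + "_"):
            # <component>/<task id>/<n>/<role>/<party>/<action...>
            level, action = "task", "/".join(rest[5:])
        else:
            # <role>/<party>/<action...>
            level, action = "job", "/".join(rest[2:])
    return {
        "time": match.group("time"),
        "status": int(match.group("status")),
        "api": api,
        "level": level,
        "action": action,
    }


# Lines copied from the excerpt above.
SAMPLE_LINES = [
    '1. 10.200.96.235 - - [26/Jul/2021 08:20:31] "POST /v1/party/202107260820309976351/local/0/create HTTP/1.1" 200 -',
    '5. 10.200.96.235 - - [26/Jul/2021 08:20:32] "POST /v1/party/202107260820309976351/upload_0/202107260820309976351_upload_0/0/local/0/status/running HTTP/1.1" 200 -',
    "static conf path: /data/projects/fate/eggroll/conf/eggroll.properties",
]

if __name__ == "__main__":
    for raw in SAMPLE_LINES:
        print(parse_line(raw))
```

Printing `time`, `level` and `action` for the whole excerpt gives a compact timeline that can be read directly against the stage list in the overview.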

FATE Flow logs

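Logs are generated under two directories: fate_flow and the job ID directory. To tell them apart, they are written below as fate_flow/xxx.log and ${job_log_dir}/xxx.log respectively.
In addition, ${job_log_dir} can be further divided into the following directories: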

  • ${job_log_dir}
  • ${job_log_dir}/${role}/${party}
  • ${job_log_dir}/${role}/${party}/${taskid} = ${task_log_dir}

    fate_flow_server logs, located under /data/projects/fate/logs/fate_flow. See the earlier post in this series for descriptions:

    - fate_flow_audit.log
    - fate_flow_detect.log
    - fate_flow_stat.log
    - fate_flow_schedule.log
    - peewee.log

Job logs

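These are located under /data/projects/fate/logs/${jobid} (referred to below as job_log_dir). There are too many of them to go through here, so only the directory layout is listed; see the earlier post in this series for a description of each log.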

- fate_flow_audit.log
- fate_flow_schedule.log
- fate_flow_sql.log
+ ${role}
  + ${party}
    - DEBUG.log
    - INFO.log
    + ${taskid}
      - DEBUG.log
      - fate_flow_schedule.log
      - INFO.log
      - peewee.log
      - PROFILING.log
      - stat.log
      - std.log
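The directory conventions above can be written down as a couple of path helpers. This is a small sketch, assuming the default log root /data/projects/fate/logs that appears in this post; the function names are made up for illustration, and the `role`, `party` and `task_id` values must be taken from the job actually being inspected.

```python
from pathlib import Path

# Log root used by the deployment in this post; adjust for other setups.
LOG_ROOT = Path("/data/projects/fate/logs")


def job_log_dir(job_id, root=LOG_ROOT):
    """${job_log_dir}: the per-job log directory."""
    return Path(root) / job_id


def task_log_dir(job_id, role, party, task_id, root=LOG_ROOT):
    """${task_log_dir} = ${job_log_dir}/${role}/${party}/${taskid}."""
    return job_log_dir(job_id, root) / role / str(party) / task_id


def list_logs(directory):
    """Return every *.log file under directory, relative to it and sorted."""
    directory = Path(directory)
    return sorted(str(p.relative_to(directory)) for p in directory.rglob("*.log"))


if __name__ == "__main__":
    job_dir = job_log_dir("202107260820309976351")
    print(job_dir)
    if job_dir.is_dir():
        for name in list_logs(job_dir):
            print(name)
```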