Weekend goal: finish developing the simple distributed version (no graph partitioning) and test medium graphs on the available resources; test (or possibly rewrite) the thesis code.
Today's goal: simple distributed version, code + debug + run a small test
Morning
- 1
    - twitter query and hw single node result
    - design the simple distributed version
    - learn simple MPI distribution with one full copy of the graph per node
- 2
    - mpi test
        - simple example programs work, but collective calls do not
        - test with the tutorial examples
- 3
    - fix the broken communication
Afternoon
- 1
    - fix the communication
    - fixed it, hahaha; error messages plus careful analysis really are the best
    - test time measurement
- 2
    - test time measurement
- 3
    - code the simple distributed version
- 4
    - code the simple distributed version
    - debug the multiple-run timing logic
    - generate data
Evening
1
    - simulate multiple machines on a single machine:
        - debug multiple run (the timings currently look strange)
        - (checkanswer and single run currently look fine)
    - new server account
2
    - multi-machine:
        - check checkanswer (see whether each node writes its own check_ans file, then look for fails)
        - check single run (see whether the `>`-redirected file output is one per node (it is not; the master node writes it); check that the times are right; see whether the memory statistics are per node)
3
    - check multiple run (same checks as above)
    - run both 100-scale datasets single-node on hnu (averaged over 20 runs) to see whether multi-node gives a speedup
        - it does not, sob
    - mod: generate data
        - fixed some errors here
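The averaged-runs experiment above (20 runs, single-node vs multi-node) can be scripted; a minimal sketch in POSIX shell, assuming GNU `date` for `%N`. `avg_runtime_ms` is a helper name I made up, and the binary in the usage comment is hypothetical:

```shell
# avg_runtime_ms N CMD...: run CMD N times and print the mean wall time in ms.
# Requires GNU date (nanosecond %N); command output is discarded.
avg_runtime_ms() {
  n=$1; shift
  total=0
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    total=$(( total + (end - start) / 1000000 ))
    i=$(( i + 1 ))
  done
  echo $(( total / n ))
}

# Example (hypothetical binary): avg_runtime_ms 20 mpirun -np 2 ./my_query input.graph
```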
4
    - mod: set up the cluster for testing
        - one machine did not have MPI installed, so install it
    - I do not understand why fewer leaves makes it so much slower; the communication only happens after the times have already been taken
        - maybe the 100x dataset split over two nodes gives only 50 leaves each, against 32 threads?
        - try the 7500 one
        - or is the timing method wrong?
Problems while setting up the mod cluster:

mod90 and mod91 can ssh into each other without a password, and the firewalls on both are off.
Below are the errors from running the programs in /home/yuanzhiqiu/tutorials/mpi-broadcast-and-collective-communication/code.
Running bcast on each machine by itself works.

On mod90
(mod90 has the Open MPI I just installed; also, on mod90 the bcast program sometimes hangs even when run single-node.)
Running helloworld and bcast:
```bash
[yuanzhiqiu@server90 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
orted: Error: unknown option "--tree-spawn"
Type 'orted --help' for usage.
Usage: orted [OPTION]...
-d|--debug Debug the OpenRTE
--daemonize Daemonize the orted into the background
--debug-daemons Enable debugging of OpenRTE daemons
--debug-daemons-file Enable debugging of OpenRTE daemons, storing output
in files
-h|--help This help message
--hnp Direct the orted to act as the HNP
--hnp-uri <arg0> URI for the HNP
-nodes|--nodes <arg0>
Regular expression defining nodes in system
-output-filename|--output-filename <arg0>
Redirect output from application processes into
filename.rank
--parent-uri <arg0> URI for the parent if tree launch is enabled.
-report-bindings|--report-bindings
Whether to report process bindings to stderr
--report-uri <arg0> Report this process' uri on indicated pipe
-s|--spin Have the orted spin until we can connect a debugger
to it
--set-sid Direct the orted to separate from the current
session
--singleton-died-pipe <arg0>
Watch on indicated pipe for singleton termination
--test-suicide <arg0>
Suicide instead of clean abort after delay
--tmpdir <arg0> Set the root for the session directory tree
-xterm|--xterm <arg0>
Create a new xterm window and display output from
the specified ranks there
For additional mpirun arguments, run 'mpirun --help <category>'
The following categories exist: general (Defaults to this option), debug,
output, input, mapping, ranking, binding, devel (arguments useful to OMPI
Developers), compatibility (arguments supported for backwards compatibility),
launch (arguments to modify launch options), and dvm (Distributed Virtual
Machine arguments).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
```
On mod91
Running helloworld:
```bash
[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 ./mpi_hello_world
[server91:20464] [[64313,0],0] tcp_peer_recv_connect_ack: received different version from [[64313,0],1]: 4.0.0 instead of 3.1.0
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[64313,0],0] on node server91
Remote daemon: [[64313,0],1] on node server90
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[yuanzhiqiu@server91 code]$
```
Running bcast:
```bash
[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
[server91:20427] [[64258,0],0] tcp_peer_recv_connect_ack: received different version from [[64258,0],1]: 4.0.0 instead of 3.1.0
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[64258,0],0] on node server91
Remote daemon: [[64258,0],1] on node server90
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[yuanzhiqiu@server91 code]$
```
Analysis

Judging from mod91's error, the two servers have mismatched Open MPI versions.
Another observation: why is the version 4.0.0 rather than 4.1.5?
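A version mismatch like this can be caught before launching any job; a minimal sketch, assuming passwordless ssh between the nodes (`report_mpi_version` is a helper name I made up, and the hostnames in the comment are the ones from this log):

```shell
# report_mpi_version HOST: print HOST's Open MPI version line, or a fallback.
# "localhost" skips ssh; any other host assumes passwordless ssh is set up.
report_mpi_version() {
  if [ "$1" = "localhost" ]; then
    ver=$(mpirun --version 2>/dev/null | head -n1)
  else
    ver=$(ssh "$1" 'mpirun --version 2>/dev/null | head -n1')
  fi
  echo "$1: ${ver:-mpirun not found}"
}

# Example: run before an mpirun job and eyeball that the versions match.
# report_mpi_version server90
# report_mpi_version server91
```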
Symptoms:

Running the (s2,s4) cluster from s2:
## Programs without collective calls work
## The tutorial gather/scatter examples work
## The tutorial bcast example also reports an error but finishes, error (11)
```bash
[yuanzhiqiu@s2 code]$ mpirun -np 2 ./compare_bcast 1000 10
Data size = 4000, Trials = 10
Avg my_bcast time = 0.000008
Avg MPI_Bcast time = 0.000005
[yuanzhiqiu@s2 code]$ mpirun -np 2 --host s2,s4 ./compare_bcast 1000 10
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: s2
PID: 43343
Message: connect() to 115.157.197.31:1024 failed
Error: Resource temporarily unavailable (11)
Data size = 4000, Trials = 10
Avg my_bcast time = 0.010016
Avg MPI_Bcast time = 0.010018
```
The following commands produce the same output as above (the NIC used for communication was found via `ifconfig` or `netstat -nr`). Specifying the NIC:
```bash
mpirun -np 2 --host s2,s4 --mca btl_tcp_if_include 115.157.197.0/24 ./compare_bcast 1000 10
mpirun -np 2 --host s2,s4 --mca btl_tcp_if_include em1 ./compare_bcast 1000 10
```
Specifying the transports:
```bash
mpirun -np 2 --host s2,s4 --mca btl tcp,vader,self ./compare_bcast 1000 10
```
## My own program hangs, error (115)
```bash
[yuanzhiqiu@s2 tool]$ mpirun -np 2 --host s2,s4 ./mpi_test t
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: s2
PID: 43860
Message: connect() to 115.157.197.31:1024 failed
Error: Operation now in progress (115)
```
It hangs here.
The commands with the extra parameters give the same error as above.
## My analysis
The error message is "connect() to 115.157.197.31:1024 failed", so it should be network-related.
## Attempted fixes
### mpirun parameters
See the symptoms above.
### Firewall
Analysis: the simple (s2,s4) cluster examples can be run from s2 but not from s4, so the cause may be that s2's firewall is not off.
But firewalld is off on both s2 and s4:
```bash
[yuanzhiqiu@s4 ~]$ systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
```
Noticed a hint in the error output:

This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

So not only firewalld but also iptables needs to be turned off?

On s4:
```bash
[yuanzhiqiu@s4 ~]$ service iptables status
Redirecting to /bin/systemctl status iptables.service
● iptables.service - IPv4 firewall with iptables
Loaded: loaded (/usr/lib/systemd/system/iptables.service; enabled; vendor preset: disabled)
Active: active (exited) since Mon 2023-03-13 11:24:04 CST; 1 months 9 days ago
Process: 1956 ExecStart=/usr/libexec/iptables/iptables.init start (code=exited, status=0/SUCCESS)
Main PID: 1956 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/iptables.service
```
While on s2:
```bash
[yuanzhiqiu@s2 ~]$ service iptables status
Redirecting to /bin/systemctl status iptables.service
Unit iptables.service could not be found.
```
So try turning it off on s4.

Firewall commands:
1: check firewall status
    systemctl status firewalld
    service iptables status
2: stop the firewall temporarily
    systemctl stop firewalld
    service iptables stop
3: disable the firewall permanently
    systemctl disable firewalld
    chkconfig iptables off
4: re-enable / restart the firewall
    systemctl enable firewalld
    service iptables restart
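The status checks above can be wrapped into one helper so every node gets checked the same way before launching mpirun; a sketch assuming systemd hosts (`service_state` is my own name, not a standard command):

```shell
# service_state NAME: print "active" or "inactive" for a systemd unit.
# On hosts without systemd, or where the unit does not exist,
# this falls back to "inactive".
service_state() {
  if systemctl is-active --quiet "$1" 2>/dev/null; then
    echo active
  else
    echo inactive
  fi
}

# Example: both should print "inactive" on every node before mpirun will connect.
# service_state firewalld
# service_state iptables
```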
And indeed it works! Now the cluster can be run from s4, and bcast runs too!