
Weekend goal: finish developing the simple distributed version (no graph partitioning) and test it on a medium graph with the resources available; test (or possibly rewrite) the thesis code
Today's goal: simple distributed version — code + debug + run a small test

Morning

  • 1
    • twitter query and hw single node result
    • design the simple distributed version
    • study the simple MPI scheme with a full copy of the graph on every node
  • 2
    • mpi test
      • simple example programs work, but collective calls do not
        • test with the tutorial's examples
  • 3
    • fix the broken communication
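The "full copy of the graph on every node" scheme above comes down to partitioning only the query list, since every rank already holds the whole graph. A plain-Python sketch of the partitioning idea (not the actual project code; in the real MPI program each rank would take its slice from `rank` and `size`):

```python
def split_queries(queries, nranks):
    # Every rank holds a full copy of the graph; only the query
    # list is partitioned, round-robin, one slice per rank.
    return [queries[r::nranks] for r in range(nranks)]

# 10 queries across 3 ranks: slice sizes 4, 3, 3 and nothing lost.
parts = split_queries(list(range(10)), 3)
```

Each rank then answers its own slice locally, and rank 0 gathers the results.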

Afternoon

  • 1
    • fix the communication
      • fixed! Error messages plus careful analysis really are the best
    • test time measurement
  • 2
    • test time measurement
  • 3
    • code the simple distributed version
  • 4
    • code the simple distributed version
    • debug the multiple-run timing logic
    • generate data

Evening

  • 1

    • simulate multiple machines on one machine:
      debug multiple run (the timings currently look strange)
      (checkanswer and single run look fine so far)
    • new server account
  • 2

    • multiple machines:
      • check checkanswer (see whether each node writes its own check_ans file, then look for failures)
      • check single run (see whether the > file output is one per node — it is not, the master node writes it; check whether the timings are right; see whether memory statistics are per node)
  • 3

    • check multiple run (same checks as above)
    • run the two 100-scale datasets single-node on hnu (averaged over 20 runs) and see whether multi-machine is any faster
      • it is not, sadly
    • generate data on mod
      • fixed a few errors here
  • 4

    • set up the mod cluster for testing
      • one machine was missing MPI, so install it
  • I don't understand why it is much slower even though each node has fewer leaves — the communication only happens after the times have already been taken.

    • Possibly because splitting the 100x dataset across two nodes leaves only 50 leaves each, against 32 threads?
      • try the 7500-leaf case
    • Or is the timing method itself wrong?
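On the timing question: the usual MPI pattern is to synchronize all ranks with a barrier, time the region on every rank with MPI_Wtime, and then max-reduce the per-rank elapsed times, because the job is only as fast as its slowest rank. A plain-Python sketch of that reduction step (the per-rank times here are invented numbers):

```python
# Hypothetical per-rank elapsed seconds, measured after a common barrier.
rank_elapsed = [0.41, 0.39, 0.44, 0.40]

# Wall time of the parallel region is set by the slowest rank
# (MPI_Reduce with MPI_MAX in the real code).
job_time = max(rank_elapsed)

# Reporting only rank 0's time, or the mean, understates the real runtime.
rank0_time = rank_elapsed[0]
mean_time = sum(rank_elapsed) / len(rank_elapsed)
```

If the current code reports only the master rank's clock, the "strange" multiple-run numbers could come from that.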

Problems encountered while setting up the mod cluster:

Passwordless ssh works between mod90 and mod91, and the firewall is off on both.
Below are the errors from running the programs in /home/yuanzhiqiu/tutorials/mpi-broadcast-and-collective-communication/code.
Running bcast single-machine works on each node by itself.

On mod90

mod90 has the Open MPI I just installed.
Also, on mod90 even the single-machine bcast run sometimes hangs.

Running helloworld and running bcast

[yuanzhiqiu@server90 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
orted: Error: unknown option "--tree-spawn"
Type 'orted --help' for usage.
Usage: orted [OPTION]...
-d|--debug               Debug the OpenRTE
   --daemonize           Daemonize the orted into the background
   --debug-daemons       Enable debugging of OpenRTE daemons
   --debug-daemons-file  Enable debugging of OpenRTE daemons, storing output
                         in files
-h|--help                This help message
   --hnp                 Direct the orted to act as the HNP
   --hnp-uri <arg0>      URI for the HNP
   -nodes|--nodes <arg0>  
                         Regular expression defining nodes in system
   -output-filename|--output-filename <arg0>  
                         Redirect output from application processes into
                         filename.rank
   --parent-uri <arg0>   URI for the parent if tree launch is enabled.
   -report-bindings|--report-bindings 
                         Whether to report process bindings to stderr
   --report-uri <arg0>   Report this process' uri on indicated pipe
-s|--spin                Have the orted spin until we can connect a debugger
                         to it
   --set-sid             Direct the orted to separate from the current
                         session
   --singleton-died-pipe <arg0>  
                         Watch on indicated pipe for singleton termination
   --test-suicide <arg0>  
                         Suicide instead of clean abort after delay
   --tmpdir <arg0>       Set the root for the session directory tree
   -xterm|--xterm <arg0>  
                         Create a new xterm window and display output from
                         the specified ranks there

For additional mpirun arguments, run 'mpirun --help <category>'

The following categories exist: general (Defaults to this option), debug,
    output, input, mapping, ranking, binding, devel (arguments useful to OMPI
    Developers), compatibility (arguments supported for backwards compatibility),
    launch (arguments to modify launch options), and dvm (Distributed Virtual
    Machine arguments).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

On mod91

Running helloworld

[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 ./mpi_hello_world
[server91:20464] [[64313,0],0] tcp_peer_recv_connect_ack: received different version from [[64313,0],1]: 4.0.0 instead of 3.1.0
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[64313,0],0] on node server91
  Remote daemon: [[64313,0],1] on node server90

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[yuanzhiqiu@server91 code]$ 

Running bcast

[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
[server91:20427] [[64258,0],0] tcp_peer_recv_connect_ack: received different version from [[64258,0],1]: 4.0.0 instead of 3.1.0
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[64258,0],0] on node server91
  Remote daemon: [[64258,0],1] on node server90

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[yuanzhiqiu@server91 code]$ 

Analysis

Judging from the error on mod91, the two servers are running different versions of Open MPI.
Another puzzle: why is the reported version 4.0.0 rather than 4.1.5?

Phenomena:

Running the (s2,s4) cluster from s2:
## programs with no collective calls work
## the tutorial's gather/scatter examples work
## the tutorial's bcast example also errors, but it runs to completion, error (11)
```bash
[yuanzhiqiu@s2 code]$ mpirun -np 2 ./compare_bcast 1000 10
Data size = 4000, Trials = 10
Avg my_bcast time = 0.000008
Avg MPI_Bcast time = 0.000005
[yuanzhiqiu@s2 code]$ mpirun -np 2 --host s2,s4 ./compare_bcast 1000 10

WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.

Your Open MPI job may now hang or fail.

Local host: s2
PID: 43343
Message: connect() to 115.157.197.31:1024 failed
Error: Resource temporarily unavailable (11)

Data size = 4000, Trials = 10
Avg my_bcast time = 0.010016
Avg MPI_Bcast time = 0.010018
```

The following commands produce the same output as above (the NIC used for communication was found with `ifconfig` or `netstat -nr`).

Specifying the NIC:

```bash
mpirun -np 2 --host s2,s4 --mca btl_tcp_if_include 115.157.197.0/24 ./compare_bcast 1000 10
mpirun -np 2 --host s2,s4 --mca btl_tcp_if_include em1 ./compare_bcast 1000 10
```

Specifying the transports explicitly:

```bash
mpirun -np 2 --host s2,s4 --mca btl tcp,vader,self ./compare_bcast 1000 10
```

## my own example hangs, error (115)

```bash
[yuanzhiqiu@s2 tool]$ mpirun -np 2 --host s2,s4 ./mpi_test t

WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.

Your Open MPI job may now hang or fail.

Local host: s2
PID: 43860
Message: connect() to 115.157.197.31:1024 failed
Error: Operation now in progress (115)
```

(it hangs here)

The commands with the extra parameters fail with the same error.
## my analysis
The message is "connect() to 115.157.197.31:1024 failed", so it should be network-related.
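Since the message names a concrete address and port, a first check is whether that endpoint is reachable at all outside of MPI. A minimal probe (the host and port to pass would be the ones from the log, 115.157.197.31 and 1024):

```python
import socket

def can_connect(host, port, timeout=1.0):
    # Bare TCP reachability probe for the endpoint MPI complained about.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The same check can be done by hand with `nc -vz <host> <port>`; if the plain TCP connect also fails, the problem is in the network path (firewall, routing), not in Open MPI.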
## attempted fixes
### mpirun parameters
See the phenomena above.
### firewall
Analysis: the simple (s2,s4) cluster examples run when launched from s2 but not from s4, so the cause might be that the firewall on s2 is not actually off.
But firewalld is off on both s2 and s4:
```bash
[yuanzhiqiu@s4 ~]$ systemctl status firewalld 
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)       
   Active: inactive (dead)
     Docs: man:firewalld(1)

```

One hint in the error output stood out:

This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

So besides firewalld, iptables needs to be off as well?
On s4:

[yuanzhiqiu@s4 ~]$ service iptables status
Redirecting to /bin/systemctl status iptables.service
● iptables.service - IPv4 firewall with iptables
   Loaded: loaded (/usr/lib/systemd/system/iptables.service; enabled; vendor preset: disabled)
   Active: active (exited) since Mon 2023-03-13 11:24:04 CST; 1 months 9 days ago
  Process: 1956 ExecStart=/usr/libexec/iptables/iptables.init start (code=exited, status=0/SUCCESS)
 Main PID: 1956 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/iptables.service

While on s2:

[yuanzhiqiu@s2 ~]$ service iptables status
Redirecting to /bin/systemctl status iptables.service
Unit iptables.service could not be found.

So let's stop it on s4 and see.

Firewall commands

1: check firewall status
    systemctl status firewalld
    service iptables status

2: stop the firewall temporarily
    systemctl stop firewalld
    service iptables stop

3: disable the firewall permanently
    systemctl disable firewalld
    chkconfig iptables off

4: re-enable / restart the firewall
    systemctl enable firewalld
    service iptables restart

And that was it! The cluster now runs from s4 as well, and bcast works too!
