Today's goals: get the mod cluster set up, run 7500 on it, and find out why the distributed version is slower than single-machine; copy the twitter notes into Word; if there's time, run the thesis code and fill in the tables (sob, hoping to go out and eat something this afternoon ~~crossed out~~ no matter what, I'm going out to eat this afternoon, because that's the only way to make sure I can keep grinding nonstop next week)
Early morning
- 1
- Install the same Open MPI version on all nodes
- Copy the twitter notes into Word
- Fix the cluster problem!
Morning
- 1,2,3
- run 7500 on mod cluster
- Install a newer gcc
- As expected, the distributed run is slower: 3x the time, and every node takes 3x the single-node time, so the next step is to use profiling tools to find out why
- Look for distributed profiling tools and the tools MPI itself provides (quick timing sketch after this list)
- run 7500 on mod cluster
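Before reaching for a full profiler, a rough wall-clock comparison already narrows things down. A minimal sketch; `./run_7500` is a hypothetical stand-in for the actual 7500 binary:

```bash
# Hypothetical binary name; substitute the real 7500 executable.
# Same total process count, one node vs. spread across two nodes:
time mpirun -np 4 --host server90:4 ./run_7500
time mpirun -np 4 --host server90:2,server91:2 ./run_7500
```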
Afternoon
- 1
- Studied man mpirun; the cause may be that each process only gets one core
- 2,3,4
- Verify whether that's really the cause, and fix it (binding check sketch after this list)
- Hahahahahahahahahaha it really was the cause! Damn!
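A sketch of how the binding hypothesis can be checked and worked around; `--report-bindings` and `--bind-to` are standard Open MPI mpirun options, and Open MPI's default policy can pin each rank to a single core, which starves multithreaded processes:

```bash
# Show where each rank is bound; the report is printed to stderr at startup.
mpirun -np 2 --host server90,server91 --report-bindings ./compare_bcast 1000 10

# If every rank is pinned to one core, relax the binding so threads can spread out:
mpirun -np 2 --host server90,server91 --bind-to none ./compare_bcast 1000 10
```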
Go out and have fun!
Evening
# Problems hit while setting up the mod cluster
After installing Open MPI 4.1.5 on every node:
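Worth double-checking that every node really picks up the same build; a minimal sketch with standard commands:

```bash
# Every node should report the same Open MPI version (4.1.5 here).
for h in server90 server91; do
    ssh "$h" 'mpirun --version | head -1'
done
```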
## On mod90
helloworld runs across the cluster, but bcast errors out:
```bash
[yuanzhiqiu@server90 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: server90
PID: 50670
[server90:50665] 1 more process has sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
[server90:50665] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
## On mod91

helloworld runs across the cluster, but bcast errors out:

```bash
[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 ./compare_bcast 1000 10
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: server91
PID: 11323
[server91:11318] 1 more process has sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
[server91:11318] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
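Side note: as the message itself says, help-message aggregation can be turned off so every process prints its own warning; `orte_base_help_aggregate` is the MCA parameter named in the log:

```bash
# Print the warning from every process instead of the aggregated summary.
mpirun -np 2 --host server90,server91 --mca orte_base_help_aggregate 0 ./compare_bcast 1000 10
```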
## Fix
Searching turned up the cause of this error and its fix: https://www.mail-archive.com/users@lists.open-mpi.org/msg34182.html
```txt
That typically occurs when some nodes have multiple interfaces, and
several nodes have a similar IP on a private/unused interface.
I suggest you explicitly restrict the interface Open MPI should be using.
For example, you can
mpirun --mca btl_tcp_if_include eth0 ...
```

First check the interface names with ifconfig, then:

```bash
[yuanzhiqiu@server90 code]$ mpirun -np 2 --host server90,server91 --mca btl_tcp_if_include em1 ./compare_bcast 1000 10
Data size = 4000, Trials = 10
Avg my_bcast time = 0.000161
Avg MPI_Bcast time = 0.000158
[yuanzhiqiu@server91 code]$ mpirun -np 2 --host server90,server91 --mca btl_tcp_if_include em1 ./compare_bcast 1000 10
Data size = 4000, Trials = 10
Avg my_bcast time = 0.000159
Avg MPI_Bcast time = 0.000158
```
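To avoid retyping the MCA flag every time, the setting can be made persistent; Open MPI reads per-user defaults from `$HOME/.openmpi/mca-params.conf` (standard Open MPI behavior, assuming the default config search path on these machines):

```bash
# Persist the interface restriction for future mpirun invocations.
mkdir -p ~/.openmpi
echo "btl_tcp_if_include = em1" >> ~/.openmpi/mca-params.conf
```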