https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
OpenMPI cluster configuration - HelloWooo - 博客园 (cnblogs.com)
Steps: 1. Establish connectivity 2. Run the MPI program
Establishing connectivity
Your machines are gonna be talking over the network via SSH
1. Configure /etc/hosts
Example cluster:
manager, worker
On manager, /etc/hosts contains:
127.0.0.1 manager
<ip_address> worker
On worker:
127.0.0.1 worker
<ip_address> manager
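A quick way to confirm that a node's hosts file names every cluster member is to scan it for each hostname. The sketch below assumes a POSIX shell with awk; missing_hosts is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Print every name from the argument list that has no entry in the given
# hosts-format file. Comment lines and trailing comments are ignored.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        if ! awk -v want="$name" '
            { sub(/#.*/, "") }   # strip comments before matching
            { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

For example, missing_hosts /etc/hosts manager worker prints nothing when both names have an entry, and prints the missing name otherwise.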
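2. Set up SSH
SSH access between manager and worker:
Passwordless SSH login must work in both directions, under the same username on both machines: running ssh worker on manager logs in to worker, and running ssh manager on worker logs in to manager.
On manager (repeat the same steps on worker, targeting manager):
- On manager:
ssh-keygen
(skip this step if ~/.ssh/ on manager already contains id_rsa and id_rsa.pub)
- Then:
ssh-copy-id worker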
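Whether key-based login really works can be checked without risking a hanging password prompt. The sketch below assumes OpenSSH, whose BatchMode=yes option makes ssh fail instead of prompting; check_passwordless_ssh and the SSH_CMD override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# SSH_CMD can be overridden (for example for a dry run); it defaults to ssh.
SSH_CMD=${SSH_CMD:-ssh}

# Report, per host, whether login succeeds without any password prompt.
check_passwordless_ssh() {
    status=0
    for host in "$@"; do
        # BatchMode=yes: fail rather than ask for a password or passphrase.
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            status=1
        fi
    done
    return $status
}
```

Run check_passwordless_ssh worker on manager and check_passwordless_ssh manager on worker; both should print ok.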
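3. Disable the firewall on the cluster nodes
A simple program that makes no MPI communication calls can run without disabling the firewall on the head node;
most programs do communicate, in which case the firewall must be disabled on every node:
Firewall commands:
1: Check the firewall status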
systemctl status firewalld
service iptables status
2: Stop the firewall temporarily
systemctl stop firewalld
service iptables stop
3: Disable the firewall permanently
systemctl disable firewalld
chkconfig iptables off
4: Turn the firewall back on
systemctl enable firewalld
systemctl start firewalld
service iptables restart
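Check that the firewall is fully off
The firewall is fully off only when both commands below report inactive or report that the service does not exist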
systemctl status firewalld
service iptables status
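The firewalld half of this check can be scripted instead of reading the status output by eye. This is a minimal sketch that assumes a systemd-based node and uses systemctl is-active, which prints the unit's state; service_is_off is a hypothetical helper name, and the legacy iptables service is not covered by it.

```shell
#!/bin/sh
# Succeeds when systemd does not report the given service as active.
# A machine without systemctl, or without the unit, counts as "off".
service_is_off() {
    state=$(systemctl is-active "$1" 2>/dev/null) || true
    [ "$state" != "active" ]
}

if service_is_off firewalld; then
    echo "firewalld: off"
else
    echo "firewalld: still running"
fi
```

The check has to be run on every node, since each machine has its own firewall.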
If a run below fails with a connection error, something in the connectivity setup above is usually not right.
For example:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: s2
PID: 43860
Message: connect() to 115.157.197.31:1024 failed
This means that s2 cannot connect to 115.157.197.31:1024.
Another example:
A process or daemon was unable to complete a TCP connection
to another process:
Local host: s2
Remote host: s4
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
This means that s2 cannot connect to s4.
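To narrow down errors like the two above, a plain TCP probe from the complaining host to its peer can help. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are installed; can_connect is a hypothetical helper name. Open MPI picks its TCP ports at run time, so probing the port from an old error message only makes sense while something is listening there; probing port 22 at least shows whether the peer is reachable at all.

```shell
#!/bin/sh
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp redirection, so bash is invoked explicitly.
can_connect() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, on s2: can_connect s4 22 && echo reachable || echo blocked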
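Running the MPI program with mpirun
Open MPI must be installed on every node (that is, the mpirun command is available): https://sites.google.com/site/rangsiman1993/comp-env/program-install/install-openmpi
The same executable must exist at the same path on every node
- Alternatively, configure NFS to share the directory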
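Without NFS, one way to keep the executable identical everywhere is to copy it to the same absolute path on each node. This is a sketch under the assumption that scp works between the nodes and that the destination directory already exists on each of them; push_to_nodes and the COPY_CMD override are hypothetical names.

```shell
#!/bin/sh
# COPY_CMD can be set to "echo" for a dry run; it defaults to scp.
COPY_CMD=${COPY_CMD:-scp}

# Copy one file to the same absolute path on each listed node.
push_to_nodes() {
    file=$1
    shift
    # Resolve to an absolute path so the destination matches the source.
    abs=$(cd "$(dirname "$file")" && pwd)/$(basename "$file")
    for node in "$@"; do
        $COPY_CMD "$abs" "$node:$abs" || return 1
    done
}
```

For example, push_to_nodes ./mpi-hello-world worker, run on manager from the directory that holds the binary.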
-np
specifies the number of processes to launch
Method 1: --host
On manager or worker:
mpirun -np 1 --host manager ./mpi-hello-world
mpirun -np 2 --host manager,worker ./mpi-hello-world
mpirun -np 4 --host manager:2,worker:2 ./mpi-hello-world
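In the last command, the :2 suffix gives each host two slots, so -np 4 fills them exactly. The sketch below builds such a --host value and the matching process count for any node list; host_arg and total_procs are hypothetical helper names.

```shell
#!/bin/sh
# Build a --host value that gives every node the same number of slots,
# for example "manager:2,worker:2".
host_arg() {
    slots=$1
    shift
    list=
    for node in "$@"; do
        list=${list:+$list,}$node:$slots
    done
    echo "$list"
}

# Process count that fills every slot exactly once.
total_procs() {
    slots=$1
    shift
    echo $(( slots * $# ))
}
```

With these, mpirun -np "$(total_procs 2 manager worker)" --host "$(host_arg 2 manager worker)" ./mpi-hello-world expands to the third command above.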
Method 2: --hostfile
manager and worker both have the same host_file at the same path:
manager
worker
Then, on manager or worker:
mpirun -np 10 --hostfile host_file ./mpi-hello-world
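With bare hostnames in host_file, the number of slots per host is left to Open MPI's defaults. Open MPI's hostfile format also accepts an explicit slots=N field per line, which makes the capacity behind -np 10 visible. The sketch below generates such a file; make_hostfile is a hypothetical helper name, and 5 slots per node is only an assumption chosen to add up to 10.

```shell
#!/bin/sh
# Print hostfile lines of the form "<host> slots=<n>".
make_hostfile() {
    slots=$1
    shift
    for node in "$@"; do
        echo "$node slots=$slots"
    done
}
```

For example, make_hostfile 5 manager worker > host_file writes two lines, manager slots=5 and worker slots=5.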
Experiment log
Passwordless SSH login:
s2->s4, s4->s2
Firewalls disabled on both nodes
The cluster job (s4, s2) can be launched from either s2 or s4