
https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

OpenMPI Cluster Configuration - HelloWooo - 博客园 (cnblogs.com)

Steps: 1. Establish connectivity 2. Run the MPI program

Establishing connectivity

The machines will talk to each other over the network via SSH.

1. Configure /etc/hosts

The cluster has two nodes: manager and worker.

/etc/hosts on manager contains:

127.0.0.1       manager
<ip_address>    worker

On worker:

127.0.0.1       worker
<ip_address>    manager
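
A quick way to confirm that a hosts file names both nodes is sketched below. The helper name has_host_entry, the temporary sample file, and the placeholder address 192.0.2.10 are assumptions for illustration only; on a real node you would pass /etc/hosts instead.

```shell
# has_host_entry FILE NAME (hypothetical helper):
# succeed if some non-comment line of FILE lists NAME after the address.
has_host_entry() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Try it on a throwaway file shaped like the manager's /etc/hosts.
sample=$(mktemp)
printf '127.0.0.1       manager\n192.0.2.10      worker\n' > "$sample"
for node in manager worker; do
    if has_host_entry "$sample" "$node"; then
        echo "$node: ok"
    else
        echo "$node: missing"
    fi
done
```

If a node is reported as missing, mpirun and ssh will not be able to resolve that name from the machine where the check ran.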

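2. Set up SSH

SSH access from manager to worker:

Passwordless SSH login in both directions is all that is needed: running ssh worker on manager logs in to worker, and running ssh manager on worker logs in to manager (with the same username on both).
On manager (and likewise on worker):

  • Run ssh-keygen on manager (skip this step if ~/.ssh/ on manager already contains id_rsa and id_rsa.pub)
  • Then run ssh-copy-id worker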
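
The two steps above, plus a check that the login really is passwordless, can be wrapped up roughly as follows. This is a sketch, not a tested setup script: the function name setup_passwordless_ssh is made up here, and it assumes the default RSA key path ~/.ssh/id_rsa mentioned above.

```shell
# setup_passwordless_ssh PEER (hypothetical helper)
setup_passwordless_ssh() {
    peer=$1
    # Generate a key pair only if none exists yet.
    [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -f "$HOME/.ssh/id_rsa" || return 1
    # Install the public key on the peer (asks for the password one last time).
    ssh-copy-id "$peer" || return 1
    # BatchMode=yes makes ssh fail instead of prompting, so success here
    # means key-based login works without a password.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$peer" true
}

# On manager:  setup_passwordless_ssh worker
# On worker:   setup_passwordless_ssh manager
```

The BatchMode check matters because mpirun starts remote processes over SSH non-interactively; a login that still prompts for a password will make the job hang or fail.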

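3. Disable the firewall on the cluster nodes

For a simple program that makes no MPI communication calls, the head node does not need its firewall disabled.
Programs usually do make communication calls, in which case every node needs its firewall disabled:

Firewall commands:

1: Check the firewall status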
    systemctl status firewalld
    service  iptables status

2: Stop the firewall temporarily
    systemctl stop firewalld
    service  iptables stop

3: Disable the firewall permanently
    systemctl disable firewalld
    chkconfig iptables off

4: Re-enable and start the firewall
    systemctl enable firewalld
    systemctl start firewalld
    service iptables restart

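Check that the firewall is really off

The firewall is completely off only when both of the following commands report inactive or report that the service does not exist: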

systemctl status firewalld
service  iptables status

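If a run later fails with an error that points to a connection problem, something in this connectivity setup is usually not right.

For example: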

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: s2
  PID:        43860
  Message:    connect() to 115.157.197.31:1024 failed

This means that s2 cannot connect to 115.157.197.31:1024.
Another example:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    s2
  Remote host:   s4
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

This means that s2 cannot connect to s4.

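Running MPI programs with mpirun

  1. Install Open MPI on every node (so that the mpirun command is available) https://sites.google.com/site/rangsiman1993/comp-env/program-install/install-openmpi

  2. The same executable must exist at the same path on every node

    1. Alternatively, set up NFS to share the directory
  3. Run command: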
    http://selkie.macalester.edu/csinparallel/modules/Patternlets/build/html/MessagePassing/RunningMPI.html

-np sets the number of processes to launch

Method 1: --host

Then, on manager or worker:

mpirun -np 1 --host manager ./mpi-hello-world
mpirun -np 2 --host manager,worker ./mpi-hello-world
mpirun -np 4 --host manager:2,worker:2 ./mpi-hello-world
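
In the host:N form used in the last command, N is the number of slots (processes) allowed on that host, so -np should not exceed the sum of the N values. A small sketch that builds such a --host argument is shown below; the helper name host_arg is an assumption for illustration, and the script only prints the mpirun command rather than running it.

```shell
# host_arg SLOTS HOST... (hypothetical helper):
# print HOST:SLOTS for every host, joined by commas.
host_arg() {
    slots=$1
    shift
    out=
    for h in "$@"; do
        out="${out:+$out,}$h:$slots"
    done
    printf '%s\n' "$out"
}

hosts=$(host_arg 2 manager worker)
# 2 hosts x 2 slots each = 4 processes
echo "mpirun -np 4 --host $hosts ./mpi-hello-world"
```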

Method 2: --hostfile

manager and worker both have the same host_file at the same path:

manager
worker

Then, on manager or worker:

mpirun -np 10 --hostfile host_file ./mpi-hello-world
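
Open MPI hostfiles also accept a slots=N field on each line, which makes the per-node process limit explicit. The sketch below writes such a file and sums the declared slots so that -np matches them. The slot counts (5 per node) and the temporary file location are assumptions chosen to match -np 10 above; the script only prints the resulting mpirun command.

```shell
# Write a hostfile that declares slots explicitly (the counts are assumed).
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
manager slots=5
worker slots=5
EOF

# Sum the declared slots so that -np does not exceed them.
total=$(awk -F'slots=' 'NF > 1 { split($2, a, " "); sum += a[1] } END { print sum + 0 }' "$hostfile")

echo "mpirun -np $total --hostfile $hostfile ./mpi-hello-world"
```

If -np is larger than the number of slots available, Open MPI refuses to start the job unless oversubscription is explicitly allowed (for example with --oversubscribe).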

Experiment log

Passwordless SSH login:
s2->s4, s4->s2
Firewalls disabled on both

Jobs on the cluster (s4, s2) can be launched from both s2 and s4
