[LinuxFailSafe] No cluster_id entry on sec. node

Daniel Berg daniel_berg@mail.com
Wed, 17 Jul 2002 06:37:17 -0500


> Does the error still occur if you made the /etc/hosts adjustments I told you
> to? (ie, listing _only_ the internal IPs of the nodes, each with short & long
> name).

I've made the changes you told me to, so now the only things in
/etc/hosts are the names, short names and IPs of the two nodes in
the cluster, and the localhost entry.
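
A rough way to double-check that nothing else is left in the file
is sketched below. This is only an illustration in Python, not
something FailSafe ships: the function name is made up, and the
addresses and the example.com domain are placeholders, not my real
ones. Only the node names are the real ones.

```python
def unexpected_host_names(hosts_text, allowed_names):
    """Return names in hosts-file text that are not in allowed_names."""
    unexpected = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        for name in fields[1:]:  # fields[0] is the IP address
            if name not in allowed_names:
                unexpected.append(name)
    return unexpected


# Placeholder addresses and domain (assumptions for illustration only).
sample = """\
127.0.0.1   localhost
192.0.2.1   tore.example.com tore
192.0.2.3   ingo.example.com ingo
192.0.2.9   gateway.example.com gateway
"""
allowed = {"localhost", "tore", "tore.example.com", "ingo", "ingo.example.com"}
print(unexpected_host_names(sample, allowed))
```

An empty list would mean the file holds only the expected entries.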

> You also haven't enabled / configured STONITH for the nodes in the cluster.
> That won't work either.
> 
Well, now I have done that, using the ssh module. However, this is
not the primary issue, so I can deal with it some other time,
once I have the whole cluster going.


> If cdbd_log on all nodes says the database has been replicated correctly, and

I think the cdbd log looks OK; it's syncing with the other
node and everything looks rather nice.

> if crsd_log looks clean too, you'll have to do a "Start HA Services"

The crsd log is not clean, but I thought this had to do with
my not having any serial connections between the nodes.

Here is a dump from it. It more or less repeats itself, so I'm sending a slice of it:

Wed Jul 17 13:19:12.883 <N crsd crs 21405:0 crsd_main.c:200> Crsd restarted.
Wed Jul 17 13:19:15.958 <W crsd crs 21405:0 crs_config.c:667> CI_ERR_NOTFOUND, SystemController information for node tyson not found, requests will be ignored.
Wed Jul 17 13:19:18.042 <W crsd crs 21405:0 crsd_pending.c:492> CI_CRSERR_INVAL, The node specified for monitoring has its controlled port disabled. Ignoring this request.
Wed Jul 17 13:19:19.714 <W crsd crs 21405:0 crs_config.c:667> CI_ERR_NOTFOUND, SystemController information for node tyson not found, requests will be ignored.
Wed Jul 17 13:19:19.856 <W crsd crs 21405:0 crsd_pending.c:492> CI_CRSERR_INVAL, The node specified for monitoring has its controlled port disabled. Ignoring this request.
Wed Jul 17 13:19:21.184 <W crsd crs 21405:0 crs_config.c:667> CI_ERR_NOTFOUND, SystemController information for node tyson not found, requests will be ignored.
Wed Jul 17 13:19:22.273 <W crsd crs 21405:0 crsd_pending.c:492> CI_CRSERR_INVAL, The node specified for monitoring has its controlled port disabled. Ignoring this request.
Wed Jul 17 13:19:24.489 <W crsd crs 21405:0 crs_config.c:667> CI_ERR_NOTFOUND, SystemController information for node tyson not found, requests will be ignored.
Wed Jul 17 13:19:25.583 <W crsd crs 21405:0 crsd_pending.c:492> CI_CRSERR_INVAL, The node specified for monitoring has its controlled port disabled. Ignoring this request.


The cmsd log (it looks fine up to this point, but then this shows up):

Wed Jul 17 13:26:19.134 <I0 ha_cmsd cms 21597:0 cmsd_config.c:665> End configuration.
Wed Jul 17 13:26:21.145 <W ha_cmsd cms 21597:0 cmsd_bcast.c:142> Message (from node tore:1) with a different CDB checksum
	local checksums  = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0x8d98537b0060c3b4
	remote checksums = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0xdbc4f08e0ccc25e6,
	Rejecting message ...
Wed Jul 17 13:26:25.253 <I0 ha_cmsd crs 21597:0 cmsd_reset.c:119> Attempted to start reset line monitoring for the following (1) node(s).
Wed Jul 17 13:26:25.253 <I0 ha_cmsd crs 21597:0 cmsd_reset.c:133> Node tore id 1 : failed to start monitoring.
Wed Jul 17 13:26:25.254 <I0 ha_cmsd cms 21597:0 cmsd_client.c:93> LAST MESSAGE IN THE cms SUBSYSTEM REPEATED ONCE
Wed Jul 17 13:26:25.254 <I0 ha_cmsd cms 21597:0 cmsd_client.c:93> client registration: gcd, id 21593
Wed Jul 17 13:26:27.166 <W ha_cmsd cms 21597:0 cmsd_bcast.c:142> Message (from node tore:1) with a different CDB checksum
	local checksums  = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0x8d98537b0060c3b4
	remote checksums = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0xdbc4f08e0ccc25e6,
	Rejecting message ...
Wed Jul 17 13:26:28.263 <W ha_cmsd cms 21597:0 cmsd_service.c:297> CI_CMSERR_NOMEMB, client error: name gcd, id 21593 command 5 error 0x200c
Wed Jul 17 13:26:30.176 <W ha_cmsd cms 21597:0 cmsd_bcast.c:142> Message (from node tore:1) with a different CDB checksum
	local checksums  = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0x8d98537b0060c3b4
	remote checksums = 0x4f1d8f674c066e88:0xc47ff2b988a07c7:0xdbc4f08e0ccc25e6,
	Rejecting message ...
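
To make the mismatch above easier to see: in every warning only the
third of the three colon-separated fields differs. A minimal Python
sketch for comparing two such strings follows. The function name is
made up for illustration, and I am not assuming anything about what
each field covers.

```python
def differing_checksum_fields(local, remote):
    """Return the 0-based positions where two colon-separated checksum strings differ."""
    # The remote line in the log ends with a trailing comma, so strip it.
    local_fields = [int(f, 16) for f in local.strip().rstrip(",").split(":")]
    remote_fields = [int(f, 16) for f in remote.strip().rstrip(",").split(":")]
    if len(local_fields) != len(remote_fields):
        raise ValueError("checksum strings have different field counts")
    return [i for i, (a, b) in enumerate(zip(local_fields, remote_fields)) if a != b]


# Values copied from the cmsd_bcast.c warning above.
local = "0x4f1d8f674c066e88:0xc47ff2b988a07c7:0x8d98537b0060c3b4"
remote = "0x4f1d8f674c066e88:0xc47ff2b988a07c7:0xdbc4f08e0ccc25e6,"
print(differing_checksum_fields(local, remote))
```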


The gcd-log:

Wed Jul 17 13:27:28.965 <I0 ha_gcd gcd 21608:0 gcd_options.c:721> Value of gcd_incno = 36.
Wed Jul 17 13:27:28.970 <N ha_gcd gcd 21608:0 gcd_init.c:206> My node name = ingo.
Wed Jul 17 13:27:28.972 <W ha_gcd ipc 21608:0 ipc_clnt.c:295> CI_IPCERR_NOSERVER, Connection file /var/run/failsafe/comm/cmsd-ipc_ingo not present.
Wed Jul 17 13:27:28.972 <E ha_gcd cms 21608:0 cms_ipc.c:83> CI_IPCERR_NOSERVER, cms ipc: ipcclnt_connect() failed, file /var/run/failsafe/comm/cmsd-ipc_ingo
.Check if the cmsd daemon is running.
Wed Jul 17 13:27:29.974 <W ha_gcd ipc 21608:0 ipc_clnt.c:295> CI_IPCERR_NOSERVER, Connection file /var/run/failsafe/comm/cmsd-ipc_ingo not present.
Wed Jul 17 13:27:29.974 <E ha_gcd cms 21608:0 cms_ipc.c:83> CI_IPCERR_NOSERVER, cms ipc: ipcclnt_connect() failed, file /var/run/failsafe/comm/cmsd-ipc_ingo
.Check if the cmsd daemon is running.
Wed Jul 17 13:27:30.984 <W ha_gcd ipc 21608:0 ipc_clnt.c:295> CI_IPCERR_NOSERVER, Connection file /var/run/failsafe/comm/cmsd-ipc_ingo not present.
Wed Jul 17 13:27:30.984 <E ha_gcd cms 21608:0 cms_ipc.c:83> CI_IPCERR_NOSERVER, cms ipc: ipcclnt_connect() failed, file /var/run/failsafe/comm/cmsd-ipc_ingo
.Check if the cmsd daemon is running.
****and so on*****
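
The file gcd complains about can be checked directly. The Python
sketch below only tests whether the connection file named in the
log exists. The function name is made up, and the presence of the
file does not by itself prove that cmsd is running.

```python
import os


def cmsd_ipc_file_present(node_name, comm_dir="/var/run/failsafe/comm"):
    """Return True if the cmsd connection file named in the gcd log exists."""
    return os.path.exists(os.path.join(comm_dir, "cmsd-ipc_" + node_name))


# Directory and file prefix are taken from the gcd log above.
print(cmsd_ipc_file_present("ingo"))
```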

And the cdbd log:

Wed Jul 17 13:19:22.654 ingo cdbd  - Checking quorum with 2 members for any unknown members.
Wed Jul 17 13:19:22.654 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:19:22.662 ingo cdbd  - CDB on node tore (1) marked obsolete in fs2d_adopt_new_quorum
Wed Jul 17 13:19:22.663 ingo cdbd  - Local CDB obsolete for new quorum: cluster id: 0x00000000.0x3d327e1026fec9c0, master: 2, sequence: 682, member count: 2, members:  3,  2
Wed Jul 17 13:19:25.599 ingo cdbd  - Machine 1 request_quorum accepted
Wed Jul 17 13:19:25.686 ingo cdbd  - Checking quorum with 2 members for any unknown members.
Wed Jul 17 13:19:25.686 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:19:25.687 ingo cdbd  - Starting to receive CDB sync series from machine 2
Wed Jul 17 13:19:25.706 ingo cdbd  - Checking quorum with 2 members for any unknown members.
Wed Jul 17 13:19:25.706 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:19:25.714 ingo cdbd  - CDB on node tore (1) marked current in fs2d_adopt_new_quorum
Wed Jul 17 13:19:25.718 ingo cdbd  - Local CDB obsolete for new quorum: cluster id: 0x00000000.0x3d327e1026fec9c0, master: 1, sequence: 682, member count: 2, members:  3,  1
Wed Jul 17 13:19:25.866 ingo cdbd  - Checking quorum with 2 members for any unknown members.
Wed Jul 17 13:19:25.866 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:19:25.867 ingo cdbd  - Starting to receive CDB sync series from machine 1
Wed Jul 17 13:19:27.044 ingo cdbd  - Machine 1 request_quorum accepted
Wed Jul 17 13:19:27.163 ingo cdbd  - Checking quorum with 2 members for any unknown members.
Wed Jul 17 13:19:27.163 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:19:27.171 ingo cdbd  - CDB on node tyson (2) marked obsolete in fs2d_adopt_new_quorum
Wed Jul 17 13:19:27.172 ingo cdbd  - Local CDB obsolete for new quorum: cluster id: 0x00000000.0x3d327e1026fec9c0, master: 1, sequence: 683, member count: 2, members:  3,  1
Wed Jul 17 13:19:29.844 ingo cdbd  - Machine 1 request_quorum accepted
Wed Jul 17 13:20:43.009 ingo cdbd  - Finished receiving CDB sync series from machine 1
Wed Jul 17 13:20:43.011 ingo cdbd  - Starting to receive CDB sync series from machine 1
Wed Jul 17 13:20:43.043 ingo cdbd  - Finished receiving CDB sync series from machine 1
Wed Jul 17 13:20:43.208 ingo cdbd  - Machine 2 machine_sync failed with lock_timeout error
Wed Jul 17 13:20:43.210 ingo cdbd  - Checking quorum with 3 members for any unknown members.
Wed Jul 17 13:20:43.210 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:20:43.230 ingo cdbd  - CDB on node ingo (3) marked current in fs2d_adopt_new_quorum
Wed Jul 17 13:20:43.365 ingo cdbd  - New quorum: cluster id: 0x00000000.0x3d327e1026fec9c0, master: 1, sequence: 684, member count: 3, members:  2,  3,  1
Wed Jul 17 13:20:43.365 ingo cdbd  - New quorum: cluster id: 0x00000000.0x3d327e1026fec9c0, master: 1, sequence: 684, member count: 3, members:  2,  3,  1
Wed Jul 17 13:20:43.365 ingo cdbd  - Checking quorum with 3 members for any unknown members.
Wed Jul 17 13:20:43.365 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:26:51.009 ingo cdbd  - Checking quorum with 3 members for any unknown members.
Wed Jul 17 13:26:51.009 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:26:51.285 ingo cdbd  - Checking quorum with 3 members for any unknown members.
Wed Jul 17 13:26:51.285 ingo cdbd  - All quorum member machines are known to us.
Wed Jul 17 13:26:51.285 ingo cdbd  - CDB on node tyson (2) marked current in fs2proc_machine_register_1
Wed Jul 17 13:28:30.598 ingo cdbd  - Checking quorum with 3 members for any unknown members.
Wed Jul 17 13:28:30.598 ingo cdbd  - All quorum member machines are known to us.
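
To follow how master, sequence and membership change across the
quorum lines above, they can be pulled out with a small Python
sketch like the one below. The regular expression is only matched
against the line format shown in this log, and the function name is
made up for illustration.

```python
import re

# Matches the tail of the "New quorum" / "Local CDB obsolete" lines above.
QUORUM_RE = re.compile(
    r"master: (\d+), sequence: (\d+), member count: (\d+), members:\s*([\d,\s]+)"
)


def parse_quorum(line):
    """Extract master, sequence and member ids from a cdbd quorum line, or None."""
    m = QUORUM_RE.search(line)
    if m is None:
        return None
    members = [int(x) for x in m.group(4).split(",") if x.strip()]
    return {
        "master": int(m.group(1)),
        "sequence": int(m.group(2)),
        "count": int(m.group(3)),
        "members": members,
    }


line = ("Wed Jul 17 13:20:43.365 ingo cdbd  - New quorum: cluster id: "
        "0x00000000.0x3d327e1026fec9c0, master: 1, sequence: 684, "
        "member count: 3, members:  2,  3,  1")
print(parse_quorum(line))
```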


The local machine is tore, but these logs were taken from the second
node in the cluster, ingo, in case that matters.

> (haActivate on the commandline, if you prefer) before the nodes will come
> "up".

This is the big problem: when I start HA services, only one node
comes up and the other one shows as down. That is when I start them
sequentially, beginning with the node I'm running fstask on. If I
start the whole cluster at the same time, they go into the UNKNOWN
state.

Sincerely, Daniel

-------------------------------------------
Daniel Berg
CC-systems sweden

