[Linux-HA] ocf::LVM monitor needs excessive time to complete
Maloja01
maloja01 at arcor.de
Fri Aug 5 08:41:06 MDT 2011
Hi,
Processes in state D are in uninterruptible sleep, which usually means they are blocked in a kernel call or a device request.
Do you have a problem with your storage? This is not cluster-related.
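
A quick way to see where they hang, as a sketch only: list the processes in state D together with the kernel wait channel reported by ps. The helper name filter_d_state is just something I made up here, and the wchan column may only show "-" or an address on some kernels.

```shell
# Keep the header line plus every process whose STAT column starts with D
# (uninterruptible sleep). Expects STAT in the second column.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# pid, state, kernel wait channel (32 characters wide), full command line
ps -eo pid,stat,wchan:32,args | filter_d_state
```

If the wait channel points at block I/O or SCSI functions for all of them, that would support the storage theory.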
Kind regards
Fabian
On 08/05/2011 01:55 PM, Ulrich Windl wrote:
> Hi,
>
> We run a cluster with about 30 LVM VGs that are monitored every minute with a timeout of 90s. Surprisingly, the LVM monitor times out even when the system is in its nominal state.
>
> I suspect this has to do with multiple LVM commands being run in parallel like this:
> # ps ax |grep vg
> 2014 pts/0 D+ 0:00 vgs
> 2580 ? D 0:00 vgdisplay -v NFS_C11_IO
> 2638 ? D 0:00 vgck CBW_DB_BTD
> 2992 ? D 0:00 vgdisplay -v C11_DB_Exe
> 3002 ? D 0:00 vgdisplay -v C11_DB_15k
> 4564 pts/2 S+ 0:00 grep vg
> # ps ax |grep vg
> 8095 ? D 0:00 vgck CBW_DB_Exe
> 8119 ? D 0:00 vgdisplay -v C11_DB_FATA
> 8194 ? D 0:00 vgdisplay -v NFS_SAP_Exe
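
To check whether the monitors really pile up, one could sample the process table for a few seconds instead of looking at single snapshots. This is only a sketch: count_lvm_cmds is a made-up helper, the name pattern (vg*, lv*, pv*) is my assumption of what counts as an LVM command, and the field positions assume the "ps ax" layout shown above.

```shell
# Count LVM commands in "ps ax"-style input (PID TTY STAT TIME COMMAND).
# Field 3 is the process state, field 5 the command name.
count_lvm_cmds() {
    awk '{ cmd = $5; sub(/.*\//, "", cmd) }   # strip a leading path, if any
         cmd ~ /^(vg|lv|pv)[a-z]+$/ { n++; if ($3 ~ /^D/) d++ }
         END { printf "%d running, %d in state D\n", n, d }'
}

# Take three samples, one second apart.
for i in 1 2 3; do
    ps ax | count_lvm_cmds
    sleep 1
done
```

A count that jumps up once a minute and then drains slowly would fit the idea that all monitors are started together.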
>
> When I tried a "vgs" manually, it could not be suspended or killed, and it took more than 30 seconds to complete.
>
> Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a machine with lots of disks, RAM and CPUs).
>
> As I had changed all the timeouts via "crm configure edit", I suspect the LRM starts all these monitors at the same time, creating massive parallelism. Maybe a random start delay would be more useful than having the user specify a variable start delay for the monitor. Possibly those stuck monitor operations also affect monitors that would otherwise finish in time.
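
The manual variant mentioned here (a different start delay per monitor) could at least be generated rather than typed. A rough sketch, and not the random delay proposed above: stagger_monitors is a hypothetical helper, the 2 s step is an assumption, and interval and timeout are the values given at the start of this post. It only prints "op monitor" fragments that could be pasted in via "crm configure edit"; it changes nothing by itself.

```shell
# Print one staggered "op monitor" fragment per volume group name.
stagger_monitors() {
    step=2       # seconds between consecutive monitors (assumed value)
    interval=60  # monitor interval in seconds, as described in the post
    i=0
    for vg in "$@"; do
        # Wrap around so the delay never exceeds the monitor interval.
        delay=$(( (i * step) % interval ))
        printf '%s: op monitor interval="%ss" timeout="90s" start-delay="%ss"\n' \
            "$vg" "$interval" "$delay"
        i=$((i + 1))
    done
}

stagger_monitors NFS_C11_IO CBW_DB_BTD C11_DB_Exe
```

With about 30 VGs and a 2 s step the delays spread over the whole 60 s interval. Whether the monitors stay spread out after restarts or failovers is something I have not verified.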
>
> Here's a part of the mess on one node:
> Aug 5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[30000] CRM_meta_interval=[10000] volgrpname=[CBW_CI] : pid [29910] timed out
> Aug 5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation prm_cbw_ci_mnt_lvm_monitor_10000 (360) Timed Out (timeout=30000ms)
> Aug 5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] iflabel=[a20] ip=[172.20.17.54] stayed in operation list for 24020 ms (longer than 10000 ms)
> Aug 5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client 14529, its parameters: CRM_meta_record_pending=[true] raidconf=[/etc/mdadm/mdadm.conf] crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1] raiddev=[/dev/md8] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[60000] stayed in operation list for 24010 ms (longer than 10000 ms)
> Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update 416: fail-count-prm_cbw_ci_mnt_lvm=9
> Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04
>
> Regards,
> Ulrich