Friday, December 13, 2013

'zpool import' May Appear to Hang, or Take a Very Long Time, if Processing a Large Destroy/Delete Operation

Symptoms
The customer has a 2-node cluster that panicked because one of its zpools appeared to be hung.

The core files show a cluster-induced panic:

Nov 27 21:37:22 xxx cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
The zpool itself did not show any signs of checksum errors or of being suspended.

The TXG sync thread's stack shows:

==== kernel thread: 0xfffffe80027fcc40  PID: 0 ====
cmd: sched
t_wchan: 0xfffffea208addca0  sobj: condition var (from zfs:zio_wait+0x53)
t_procp: 0xfffffffffbc27700 (proc_sched)
 p_as: 0xfffffffffbc293e0 (kas)
 p_zone: 0xfffffffffbce9f60 (global)
t_stk: 0xfffffe80027fcc40  sp: 0xfffffe80027fbe60  t_stkbase: 0xfffffe80027f5000
t_pri: 60  pctcpu: 0.222453
t_cid: 0 (SYS)  t_cldata: NULL
t_transience: 0  t_wkld_flags: 0
t_cpupart: 0xfffffffffbc76720(0)  last CPU: 1
idle: 13 ticks (0.013s)
start: Wed Nov 27 21:16:05 2013
age: 1277 seconds (21 minutes 17 seconds)
t_state:     TS_SLEEP
t_flag:      8 (T_TALLOCSTK)
t_proc_flag: 0 (none set)
t_schedflag: 3 (TS_LOAD|TS_DONT_SWAP)
p_flag:      1 (SSYS)

pc:      unix:_resume_from_idle+0xfb resume_return:  addq   $0x8,%rsp
startpc: zfs:txg_sync_thread+0:  pushq  %rbp

0xfffffe80027fbe60 unix:_resume_from_idle+0xfb resume_return()
                 unix:swtch - frame recycled
0xfffffe80027fbe80 void genunix:cv_wait+0x68(kcondvar_t *, kmutex_t *)
0xfffffe80027fbeb0 int zfs:zio_wait+0x53(zio_t *)
0xfffffe80027fbf60 int zfs:arc_read_nolock+0x67b(zio_t *, spa_t *, const blkptr_t *, arc_done_func_t *, void *, int, int, uint32_t *, const zbookmark_t *)
0xfffffe80027fbfd0 int zfs:arc_read+0x77(zio_t *, spa_t *, const blkptr_t *, arc_buf_t *, arc_done_func_t *, void *, int, int, uint32_t *, const zbookmark_t *)
0xfffffe80027fc000 int zfs:dsl_read+0x2c(zio_t *, spa_t *, const blkptr_t *, arc_buf_t *, arc_done_func_t *, void *, int, int, uint32_t *, const zbookmark_t *)
0xfffffe80027fc0d0 int zfs:traverse_visitbp+0x23d(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc140 int zfs:traverse_dnode+0x77(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, uint64_t, uint64_t)
0xfffffe80027fc210 int zfs:traverse_visitbp+0x3c9(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc2e0 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc3b0 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc480 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc550 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc620 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc6f0 int zfs:traverse_visitbp+0x2ba(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc760 int zfs:traverse_dnode+0x77(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, uint64_t, uint64_t)
0xfffffe80027fc830 int zfs:traverse_visitbp+0x45b(traverse_data_t *, const dnode_phys_t *, arc_buf_t *, blkptr_t *, const zbookmark_t *)
0xfffffe80027fc900 int zfs:traverse_impl+0x109(spa_t *, dsl_dataset_t *, blkptr_t *, uint64_t, int, blkptr_cb_t *, void *)
0xfffffe80027fc920 int zfs:traverse_dataset+0x32(dsl_dataset_t *, uint64_t, int, blkptr_cb_t *, void *)
0xfffffe80027fca00 void zfs:dsl_dataset_destroy_sync+0x626(void *, void *, dmu_tx_t *)
0xfffffe80027fca40 void zfs:dsl_sync_task_group_sync+0x10f(dsl_sync_task_group_t *, dmu_tx_t *)
0xfffffe80027fcab0 void zfs:dsl_pool_sync+0x1ab(dsl_pool_t *, uint64_t)
0xfffffe80027fcb70 void zfs:spa_sync+0x2d7(spa_t *, uint64_t)
0xfffffe80027fcc20 void zfs:txg_sync_thread+0x1d2(dsl_pool_t *)
0xfffffe80027fcc30 unix:thread_start+8()
-- end of kernel thread's stack --
Changes
The zpool history shows that, upon every import, there is an underlying destroy operation running in the syncing context:

2013-11-27.22:55:56 zpool import -o cachefile=none -N comm19-3 [user root on xxx:global]
2013-11-27.22:55:56 [internal destroy_begin_sync txg:25623963] dataset = 523549 [user root on xxx]
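These internal syncing-context records are only listed when zpool history is asked for them. A minimal sketch of how to retrieve the entries above (-i lists internal events, -l adds the user/host detail; the pool name comm19-3 is taken from the history shown):

# List internal (syncing-context) events such as destroy_begin_sync
zpool history -il comm19-3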
Cause
The issue is a large clone destroy running in the "syncing" context, which slows down everything that happens whenever the zpool is imported.

This behavior is well understood and is tracked as:

Bug 16423386 - dnode_next_offset() doesn't do exhaustive search
Solution
One way to get access to the zpool data (with limited access) is to perform a read-only import of the zpool. This prevents any modifications to the zpool.
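For example, a minimal sketch, assuming the installed ZFS release supports the readonly pool property at import time (the pool name comm19-3 is taken from the history above):

# Import read-only so that the pending destroy is not resumed in the
# syncing context; cachefile=none keeps the pool out of the cache file.
zpool import -o readonly=on -o cachefile=none comm19-3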

The second option is to wait for the destroy to complete. This can take a long time, depending on how large the destroy is, the underlying zpool I/O bandwidth, and so on.
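While waiting, one way to confirm that the destroy is still making progress is to watch the pool's I/O and to re-check the TXG sync thread's stack. A sketch, assuming mdb -k access on the importing node (the traverse_visitbp()/zio_wait() frames shown above should keep cycling as the traversal advances):

# Watch I/O on the pool in 5-second samples
zpool iostat comm19-3 5

# Re-run periodically to confirm the zfs sync thread's stack keeps changing
echo "::stacks -m zfs" | mdb -k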

If neither option is feasible, ask your Oracle support engineer to request an IDR to fix Bug 16423386.
