intel-kernel/kernel/rcu · openEuler/intel-kernel - AtomGit

Jiacheng Yu · rcu: Fix racy re-initialization of irq_work causing hangs

File	Last commit	Last updated
Kconfig	rcu: Make TASKS_RUDE_RCU select IRQ_WORK stable inclusion from stable-v5.10.121 commit 10f30cba8f6c4bcbc5c17443fd6a9999d3991ae3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6CQ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=10f30cba8f6c4bcbc5c17443fd6a9999d3991ae3 -------------------------------- [ Upstream commit 46e861be589881e0905b9ade3d8439883858721c ] The TASKS_RUDE_RCU does not select IRQ_WORK, which can result in build failures for kernels that do not otherwise select IRQ_WORK. This commit therefore causes the TASKS_RUDE_RCU Kconfig option to select IRQ_WORK. Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Acked-by: Xie XiuQi <xiexiuqi@huawei.com>	3 years ago
Kconfig.debug	rcu: Add RCU stall diagnosis information mainline inclusion from mainline-v6.3-rc1 commit be42f00b73a0f50710d16eb7cb4efda0cce062dd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be42f00b73a0f50710d16eb7cb4efda0cce062dd -------------------------------- Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. 
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Conflicts: Documentation/admin-guide/kernel-parameters.txt kernel/rcu/Kconfig.debug [Change RCU_CPU_STALL_CPUTIME to be enabled by default] kernel/rcu/rcu.h kernel/rcu/tree.h kernel/rcu/tree_stall.h kernel/rcu/update.c Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>	2 years ago
Makefile	rcuperf: Change rcuperf to rcuscale This commit further avoids conflation of rcuperf with the kernel's perf feature by renaming kernel/rcu/rcuperf.c to kernel/rcu/rcuscale.c, and also by similarly renaming the functions and variables inside this file. This has the side effect of changing the names of the kernel boot parameters, so kernel-parameters.txt and ver_functions.sh are also updated. The rcutorture --torture type was also updated from rcuperf to rcuscale. [ paulmck: Fix bugs located by Stephen Rothwell. ] Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	5 years ago
rcu.h	rcu: Add RCU stall diagnosis information (same commit as Kconfig.debug above: mainline commit be42f00b73a0f50710d16eb7cb4efda0cce062dd)	2 years ago
rcu_segcblist.c	rcu/segcblist: Prevent useless GP start if no CBs to accelerate The rcu_segcblist_accelerate() function returns true iff it is necessary to request another grace period. A tracing session showed that this function unnecessarily requests grace periods. For example, consider the following sequence of events: 1. Callbacks are queued only on the NEXT segment of CPU A's callback list. 2. CPU A runs RCU_SOFTIRQ, accelerating these callbacks from NEXT to WAIT. 3. Thus rcu_segcblist_accelerate() returns true, requesting grace period N. 4. RCU's grace-period kthread wakes up on CPU B and starts grace period N. 5. CPU A notices the new grace period and invokes RCU_SOFTIRQ. 6. CPU A's RCU_SOFTIRQ again invokes rcu_segcblist_accelerate(), but there are no new callbacks. However, rcu_segcblist_accelerate() nevertheless (uselessly) requests a new grace period N+1. This extra grace period results in additional lock contention and also additional wakeups, all for no good reason. This commit therefore adds a check to rcu_segcblist_accelerate() that prevents the return of true when there are no new callbacks. 
This change reduces the number of grace periods (GPs) and wakeups in each of eleven five-second rcutorture runs as follows:
+----+-------------------+-------------------+
| #  | Number of GPs     | Number of Wakeups |
+====+=========+=========+=========+=========+
| 1  | With    | Without | With    | Without |
+----+---------+---------+---------+---------+
| 2  |      75 |      89 |     113 |     119 |
+----+---------+---------+---------+---------+
| 3  |      62 |      91 |     105 |     123 |
+----+---------+---------+---------+---------+
| 4  |      60 |      79 |      98 |     110 |
+----+---------+---------+---------+---------+
| 5  |      63 |      79 |      99 |     112 |
+----+---------+---------+---------+---------+
| 6  |      57 |      89 |      96 |     123 |
+----+---------+---------+---------+---------+
| 7  |      64 |      85 |      97 |     118 |
+----+---------+---------+---------+---------+
| 8  |      58 |      83 |      98 |     113 |
+----+---------+---------+---------+---------+
| 9  |      57 |      77 |      89 |     104 |
+----+---------+---------+---------+---------+
| 10 |      66 |      82 |      98 |     119 |
+----+---------+---------+---------+---------+
| 11 |      52 |      82 |      83 |     117 |
+----+---------+---------+---------+---------+
The reduction in the number of wakeups ranges from 5% to 40%. Cc: urezki@gmail.com [ paulmck: Rework commit log and comment. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	5 years ago
rcu_segcblist.h	rcu: Remove kfree_rcu() special casing and lazy-callback handling This commit removes kfree_rcu() special-casing and the lazy-callback handling from Tree RCU. It moves some of this special casing to Tiny RCU, the removal of which will be the subject of later commits. This results in a nice negative delta. Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> [ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	6 years ago
rcuscale.c	rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle() stable inclusion from stable-v5.10.197 commit 55887adc76e19aec9763186e2c1d0a3481d20e96 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I96Q8P Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=55887adc76e19aec9763186e2c1d0a3481d20e96 -------------------------------- [ Upstream commit e60c122a1614b4f65b29a7bef9d83b9fd30e937a ] The rcuscale.holdoff module parameter can be used to delay the start of rcu_scale_writer() kthread. However, the hung-task timeout will trigger when the timeout specified by rcuscale.holdoff is greater than hung_task_timeout_secs: runqemu kvm nographic slirp qemuparams="-smp 4 -m 2048M" bootparams="rcuscale.shutdown=0 rcuscale.holdoff=300" [ 247.071753] INFO: task rcu_scale_write:59 blocked for more than 122 seconds. [ 247.072529] Not tainted 6.4.0-rc1-00134-gb9ed6de8d4ff #7 [ 247.073400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.074331] task:rcu_scale_write state:D stack:30144 pid:59 ppid:2 flags:0x00004000 [ 247.075346] Call Trace: [ 247.075660] <TASK> [ 247.075965] __schedule+0x635/0x1280 [ 247.076448] ? __pfx___schedule+0x10/0x10 [ 247.076967] ? schedule_timeout+0x2dc/0x4d0 [ 247.077471] ? __pfx_lock_release+0x10/0x10 [ 247.078018] ? enqueue_timer+0xe2/0x220 [ 247.078522] schedule+0x84/0x120 [ 247.078957] schedule_timeout+0x2e1/0x4d0 [ 247.079447] ? __pfx_schedule_timeout+0x10/0x10 [ 247.080032] ? __pfx_rcu_scale_writer+0x10/0x10 [ 247.080591] ? __pfx_process_timeout+0x10/0x10 [ 247.081163] ? __pfx_sched_set_fifo_low+0x10/0x10 [ 247.081760] ? __pfx_rcu_scale_writer+0x10/0x10 [ 247.082287] rcu_scale_writer+0x6b1/0x7f0 [ 247.082773] ? mark_held_locks+0x29/0xa0 [ 247.083252] ? __pfx_rcu_scale_writer+0x10/0x10 [ 247.083865] ? __pfx_rcu_scale_writer+0x10/0x10 [ 247.084412] kthread+0x179/0x1c0 [ 247.084759] ? 
__pfx_kthread+0x10/0x10 [ 247.085098] ret_from_fork+0x2c/0x50 [ 247.085433] </TASK> This commit therefore replaces schedule_timeout_uninterruptible() with schedule_timeout_idle(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: sanglipeng <sanglipeng1@jd.com>	2 years ago
rcutorture.c	rcutorture: Avoid problematic critical section nesting on PREEMPT_RT stable inclusion from stable-5.10.80 commit 7f43cda650d5ca7cac9ced26bb2f3f64643ddb9d bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7f43cda650d5ca7cac9ced26bb2f3f64643ddb9d -------------------------------- [ Upstream commit 71921a9606ddbcc1d98c00eca7ae82c373d1fecd ] rcutorture is generating some nesting scenarios that are not compatible with PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking, then continue with disabling preemption or interrupts. When unlocking, do the reverse: enable interrupts and preemption first, and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Chen Jun <chenjun102@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Acked-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>	4 years ago
refscale.c	refscale: Fix uninitalized use of wait_queue_head_t stable inclusion from stable-v5.10.195 commit 066fbd8bc981cf49923bf828b7b4092894df577f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I95JOC Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=066fbd8bc981cf49923bf828b7b4092894df577f -------------------------------- [ Upstream commit f5063e8948dad7f31adb007284a5d5038ae31bb8 ] Running the refscale test occasionally crashes the kernel with the following error: [ 8569.952896] BUG: unable to handle page fault for address: ffffffffffffffe8 [ 8569.952900] #PF: supervisor read access in kernel mode [ 8569.952902] #PF: error_code(0x0000) - not-present page [ 8569.952904] PGD c4b048067 P4D c4b049067 PUD c4b04b067 PMD 0 [ 8569.952910] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI [ 8569.952916] Hardware name: Dell Inc. PowerEdge R750/0WMWCR, BIOS 1.2.4 05/28/2021 [ 8569.952917] RIP: 0010:prepare_to_wait_event+0x101/0x190 : [ 8569.952940] Call Trace: [ 8569.952941] <TASK> [ 8569.952944] ref_scale_reader+0x380/0x4a0 [refscale] [ 8569.952959] kthread+0x10e/0x130 [ 8569.952966] ret_from_fork+0x1f/0x30 [ 8569.952973] </TASK> The likely cause is that init_waitqueue_head() is called after the call to the torture_create_kthread() function that creates the ref_scale_reader kthread. Although this init_waitqueue_head() call will very likely complete before this kthread is created and starts running, it is possible that the calling kthread will be delayed between the calls to torture_create_kthread() and init_waitqueue_head(). In this case, the new kthread will use the waitqueue head before it is properly initialized, which is not good for the kernel's health and well-being. The above crash happened here: static inline void __add_wait_queue(...) { : if (!(wq->flags & WQ_FLAG_PRIORITY)) <=== Crash here The offset of flags from list_head entry in wait_queue_entry is -0x18. 
If reader_tasks[i].wq.head.next is NULL as allocated reader_task structure is zero initialized, the instruction will try to access address 0xffffffffffffffe8, which is exactly the fault address listed above. This commit therefore invokes init_waitqueue_head() before creating the kthread. Fixes: 653ed64b01dc ("refperf: Add a test to measure performance of read-side synchronization") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: sanglipeng <sanglipeng1@jd.com>	2 years ago
srcutiny.c	srcu: Provide polling interfaces for Tiny SRCU grace periods mainline inclusion from mainline-5.10.62 commit b6ae3854075e67a2764e30447f8603ef964aecc5 bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b6ae3854075e67a2764e30447f8603ef964aecc5 -------------------------------- commit 8b5bd67cf6422b63ee100d76d8de8960ca2df7f0 upstream. There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/ Reported-by: Kent Overstreet <kent.overstreet@gmail.com> [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chen Jun <chenjun102@huawei.com> Acked-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>	4 years ago
srcutree.c	srcu: Tighten cleanup_srcu_struct() GP checks mainline inclusion from mainline-v5.19-rc1 commit 8ed00760203d8018bee042fbfe8e076579be2c2b category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IBP415 CVE: CVE-2022-49651 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed00760203d8018bee042fbfe8e076579be2c2b -------------------------------- Currently, cleanup_srcu_struct() checks for a grace period in progress, but it does not check for a grace period that has not yet started but which might start at any time. Such a situation could result in a use-after-free bug, so this commit adds a check for a grace period that is needed but not yet started to cleanup_srcu_struct(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>	1 year ago
sync.c	rcu/sync: Simplify the state machine With this patch rcu_sync has a single state variable and the transition rules become really simple: GP_IDLE - owned by the first rcu_sync_enter() which moves it to GP_ENTER - owned by rcu-callback which moves it to GP_PASSED - owned by the last rcu_sync_exit() which moves it to GP_EXIT - and this is the only "nontrivial" state. rcu-callback moves it back to GP_IDLE unless another enter() comes before a GP pass. If rcu-callback is invoked before the next rcu_sync_exit() it must see gp_count incremented by that enter() and set GP_PASSED. Otherwise, if the next rcu_sync_exit() wins the race, it will move it to GP_REPLAY - owned by rcu-callback which moves it to GP_EXIT Signed-off-by: Oleg Nesterov <oleg@redhat.com> [ paulmck: While here, apply READ_ONCE() and WRITE_ONCE() to ->gp_state. ] [ paulmck: Tweaks to make htmldocs happy. (Reported by kbuild test robot.) ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	6 years ago
tasks.h	rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow mainline inclusion from mainline-v6.10-rc1 commit cc5645fddb0ce28492b15520306d092730dffa48 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA6SGB CVE: CVE-2024-38577 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc5645fddb0ce28492b15520306d092730dffa48 --------------------------- There is a possibility of buffer overflow in show_rcu_tasks_trace_gp_kthread() if counters, passed to sprintf() are huge. Counter numbers, needed for this are unrealistically high, but buffer overflow is still possible. Use snprintf() with buffer size instead of sprintf(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: edf3775f0ad6 ("rcu-tasks: Add count for idle tasks on offline CPUs") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Conflicts: kernel/rcu/tasks.h [yyl: adjust context] Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>	1 year ago
tiny.c	rcu: Rename _kfree_callback/_kfree_rcu_offset/kfree_call_* The following changes are introduced: 1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(), as well as the associated trace events, so the rcu_kfree_callback(), becomes rcu_kvfree_callback(). The reason is to be aligned with kvfree() notation. 2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU paths use kvfree() now instead of kfree(), thus rename it. 3. Rename kfree_call_rcu() to the kvfree_call_rcu(). The reason is, it is capable of freeing vmalloc() memory now. Do the same with __kfree_rcu() macro, it becomes __kvfree_rcu(), the goal is the same. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	5 years ago
tree.c	rcu: Fix racy re-initialization of irq_work causing hangs mainline inclusion from mainline-v6.17-rc2 commit 61399e0c5410567ef60cb1cda34cca42903842e3 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8988 CVE: CVE-2025-39744 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=61399e0c5410567ef60cb1cda34cca42903842e3 -------------------------------- RCU re-initializes the deferred QS irq work everytime before attempting to queue it. However there are situations where the irq work is attempted to be queued even though it is already queued. In that case re-initializing messes-up with the irq work queue that is about to be handled. The chances for that to happen are higher when the architecture doesn't support self-IPIs and irq work are then all lazy, such as with the following sequence: 1) rcu_read_unlock() is called when IRQs are disabled and there is a grace period involving blocked tasks on the node. The irq work is then initialized and queued. 2) The related tasks are unblocked and the CPU quiescent state is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE, allowing the irq work to be requeued in the future (note the previous one hasn't fired yet). 3) A new grace period starts and the node has blocked tasks. 4) rcu_read_unlock() is called when IRQs are disabled again. The irq work is re-initialized (but it's queued! and its node is cleared) and requeued. Which means it's requeued to itself. 5) The irq work finally fires with the tick. But since it was requeued to itself, it loops and hangs. Fix this with initializing the irq work only once before the CPU boots. 
Fixes: b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> Conflicts: kernel/rcu/tree.c [Context conflicts] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com>	4 months ago
tree.h	rcu: Fix racy re-initialization of irq_work causing hangs (same commit as tree.c above: mainline commit 61399e0c5410567ef60cb1cda34cca42903842e3, CVE-2025-39744)	4 months ago
tree_exp.h	rcu: Defer RCU kthreads wakeup when CPU is dying mainline inclusion from mainline-v6.8-rc2 commit e787644caf7628ad3269c1fbd321c3255cf51710 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9NZ3E Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=e787644caf7628ad3269c1fbd321c3255cf51710 -------------------------------- When the CPU goes idle for the last time during the CPU down hotplug process, RCU reports a final quiescent state for the current CPU. If this quiescent state propagates up to the top, some tasks may then be woken up to complete the grace period: the main grace period kthread and/or the expedited main workqueue (or kworker). If those kthreads have a SCHED_FIFO policy, the wake up can indirectly arm the RT bandwith timer to the local offline CPU. Since this happens after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the timer gets ignored. Therefore if the RCU kthreads are waiting for RT bandwidth to be available, they may never be actually scheduled. This triggers TREE03 rcutorture hangs: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved) rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6) rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2eb/0xa80 schedule+0x1f/0x90 schedule_timeout+0x163/0x270 ? __pfx_process_timeout+0x10/0x10 rcu_gp_fqs_loop+0x37c/0x5b0 ? __pfx_rcu_gp_kthread+0x10/0x10 rcu_gp_kthread+0x17c/0x200 kthread+0xde/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2b/0x40 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> The situation can't be solved with just unpinning the timer. The hrtimer infrastructure and the nohz heuristics involved in finding the best remote target for an unpinned timer would then also need to handle enqueues from an offline CPU in the most horrendous way. So fix this on the RCU side instead and defer the wake up to an online CPU if it's too late for the local one. Reported-by: Paul E. McKenney <paulmck@kernel.org> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Conflicts: kernel/rcu/tree.c [s/HK_TYPE_RCU/HK_FLAG_RCU/] Signed-off-by: Wei Li <liwei391@huawei.com>	1 year ago
tree_plugin.h	rcu: Fix racy re-initialization of irq_work causing hangs (same commit as tree.c above: mainline commit 61399e0c5410567ef60cb1cda34cca42903842e3, CVE-2025-39744)	4 months ago
tree_stall.h	rcu: shorten the critical section that rnp->lock protects in rcu_dump_cpu_stacks hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7MQJB -------------------------------- Concurrent ltp stress testcases cause a hardlockup issue in KunPeng920: ------------[ cut here ]------------ [ 2301.316914] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 2301.320386] try_charge+0x2c8/0x600 [ 2301.325566] rcu: 69-...0: (23 ticks this GP) idle=fb2/1/0x4000000000000000 softirq=39368/39368 fqs=5591 [ 2301.335766] (detected by 29, t=15006 jiffies, g=91585, q=3635521) [ 2301.345458] Sending NMI from CPU 29 to CPUs 69: ------------[ cut here ]------------ [ 2379.033470] NMI watchdog: Watchdog detected hard LOCKUP on cpu 69 [ 2379.033523] CPU: 69 PID: 2143608 Comm: memcg_test_1 Kdump: loaded Tainted: G W 5.10.0-sp2-lockdepdbg+ #45 [ 2379.033524] Hardware name: Huawei TaiShan 5280 V2/BC82AMDDA, BIOS 1.93 10/13/2022 [ 2379.033525] pstate: 00400089 (nzcv daIf +PAN -UAO -TCO BTYPE=--) [ 2379.033525] pc : native_queued_spin_lock_slowpath+0x264/0x330 [ 2379.033526] lr : rcu_iw_handler+0xc4/0x130 [ 2379.033527] sp : ffff80001022be60 [ 2379.033528] x29: ffff80001022be60 x28: ffff2f6c7cc44d00 [ 2379.033529] x27: ffff2f6b81ee21a8 x26: ffff80001022c000 [ 2379.033530] x25: ffff800010228000 x24: ffffd913b32a6000 [ 2379.033532] x23: 0000000000000000 x22: ffffd913b3aa4000 [ 2379.033533] x21: ffffd913b32ba5f0 x20: ffff2f6b7ffa37c0 [ 2379.033534] x19: ffffd913b365c440 x18: 0000000000000060 [ 2379.033535] x17: 0000000000000000 x16: 0000000000000000 [ 2379.033537] x15: ffffffffffffffff x14: ffff8000bcb9b4f0 [ 2379.033538] x13: 00000000fffffffd x12: 0000000000000040 [ 2379.033539] x11: ffff2f63806bada0 x10: ffff2f63806bada2 [ 2379.033541] x9 : 0000000000000000 x8 : 0000000000000000 [ 2379.033542] x7 : ffff2f6b7ffa3740 x6 : ffffd913b2f67740 [ 2379.033543] x5 : ffff2f6b7ffa3740 x4 : 0000000001180101 [ 2379.033544] x3 : ffffd913b365c440 x2 : 
0000000000000118 [ 2379.033545] x1 : 0000000001180000 x0 : 0000000000000000 [ 2379.033547] Call trace: [ 2379.033547] native_queued_spin_lock_slowpath+0x264/0x330 [ 2379.033548] irq_work_single+0x38/0x9c [ 2379.033548] flush_smp_call_function_queue+0x144/0x26c [ 2379.033549] generic_smp_call_function_single_interrupt+0x1c/0x30 [ 2379.033550] do_handle_IPI+0x84/0x2e4 [ 2379.033550] ipi_handler+0x24/0x3c [ 2379.033551] handle_percpu_devid_fasteoi_ipi+0x84/0x14c [ 2379.033552] __handle_domain_irq+0x84/0xf0 [ 2379.033553] gic_handle_irq+0x78/0x2c0 [ 2379.033553] el1_irq+0xb8/0x140 [ 2379.033554] dump_stack+0xe8/0x140 [ 2379.033554] dump_header+0x50/0x19c [ 2379.033555] out_of_memory+0x338/0x380 [ 2379.033556] mem_cgroup_out_of_memory+0x128/0x144 [ 2379.033557] mem_cgroup_oom+0x188/0x250 [ 2379.033557] try_charge+0x2c8/0x600 [ 2379.033558] mem_cgroup_charge+0x128/0x424 [ 2379.033559] wp_page_copy+0xc8/0xb40 [ 2379.033559] do_wp_page+0x228/0x594 [ 2379.033560] handle_pte_fault+0x1f8/0x21c [ 2379.033561] __handle_mm_fault+0x1b0/0x380 [ 2379.033561] handle_mm_fault+0xf4/0x250 [ 2379.033562] do_page_fault+0x188/0x454 [ 2379.033563] do_mem_abort+0x48/0xb0 [ 2379.033563] el0_da+0x44/0x80 [ 2379.033564] el0_sync_handler+0x88/0xb4 [ 2379.033564] el0_sync+0x160/0x180 cpu29 cpu69 rcu_dump_cpu_stacks() grab rnp->lock nmi_trigger_cpumask_backtrace() arm64_send_ipi() do_handle_IPI flush_smp_call_function_queue rcu_iw_handler spin rnp->lock deadlock nmi_cpu_backtrace wait for 10s or backtrace_mask clear For arm64 w/o NMI-triggered stack traces, IPI backtrace feature is used, while in rcu_dump_cpu_stacks(), raw_spin_lock_irqsave_rcu_node() will grab the rcu_node->lock to protect the rcu_node data used in the for_each_leaf_node_possible_cpu loop, while the process of backtrace for the rcu stalled cpu may be longer than expected, causing potential concurrent issue while someone contending for the same rcu_node->lock. 
As the call trace above shows, rcu_node->lock is not released until the backtraces of all stalled CPUs have finished in nmi_cpu_backtrace() or the 10s timeout in nmi_trigger_cpumask_backtrace() expires. If pending IPI callbacks in the smp call_single_queue ahead of the ipi_cpu_backtrace callback contend for the same rcu_node->lock, deadlock is inevitable. To avoid this, shorten the critical section that rcu_node->lock protects so that the lock is not held while waiting for the backtrace to finish. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Fixes: 81ad62c3883e ("arm64: Fix the ipi backtrace warning when softlockup") Signed-off-by: Wei Li <liwei391@huawei.com>	2 years ago
update.c	rcu: Add RCU stall diagnosis information mainline inclusion from mainline-v6.3-rc1 commit be42f00b73a0f50710d16eb7cb4efda0cce062dd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be42f00b73a0f50710d16eb7cb4efda0cce062dd -------------------------------- Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. 
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Conflicts: Documentation/admin-guide/kernel-parameters.txt kernel/rcu/Kconfig.debug [Change RCU_CPU_STALL_CPUTIME to be enabled by default] kernel/rcu/rcu.h kernel/rcu/tree.h kernel/rcu/tree_stall.h kernel/rcu/update.c Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>	2 years ago
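The tree_stall.h entry above describes shortening the critical section that rcu_node->lock protects so that the slow backtrace step does not run with the lock held. The general pattern can be sketched in user-space C as below. This is only an illustration, not the kernel patch: `struct node`, `dump_cpu()` and `dump_stalled_cpus()` are hypothetical names, and a pthread mutex stands in for the raw spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 64

/* Hypothetical stand-in for an rcu_node: a lock plus a mask of stalled CPUs. */
struct node {
	pthread_mutex_t lock;
	uint64_t stalled_mask;	/* bit n set => CPU n is considered stalled */
};

/* Stand-in for the slow per-CPU backtrace; returns 1 for each CPU dumped. */
static int dump_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	printf("backtrace for CPU %d\n", cpu);
	return 1;
}

/*
 * Copy the mask while holding the lock, drop the lock, and only then
 * do the slow per-CPU work. Other contenders for np->lock are blocked
 * for the duration of one copy rather than for every backtrace.
 */
static int dump_stalled_cpus(struct node *np)
{
	uint64_t snapshot;
	int cpu, dumped = 0;

	pthread_mutex_lock(&np->lock);
	snapshot = np->stalled_mask;	/* short critical section */
	pthread_mutex_unlock(&np->lock);

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (snapshot & (UINT64_C(1) << cpu))
			dumped += dump_cpu(cpu);	/* lock not held here */
	}
	return dumped;
}
```

The trade-off in this sketch is that the snapshot may be slightly stale by the time a CPU is dumped. For diagnostic output that is usually acceptable, whereas holding the lock across the dump is what allows the deadlock described in the commit message.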