kernel/fs/proc · openEuler/kernel - AtomGit

Nni-cunshumm/cma_folio: make CMA folio allocation per-task opt-in via procfs

文件	最后提交记录	最后更新时间
Kconfig	proc: make config PROC_CHILDREN depend on PROC_FS Commit 2e13ba54a268 ("fs, proc: introduce CONFIG_PROC_CHILDREN") introduces the config PROC_CHILDREN to configure kernels to provide the /proc/<pid>/task/<tid>/children file. When one deselects PROC_FS for kernel builds without /proc/, the config PROC_CHILDREN has no effect anymore, but is still visible in menuconfig. Add the dependency on PROC_FS to make the PROC_CHILDREN option disappear for kernel builds without /proc/. Link: https://lkml.kernel.org/r/20220909122529.1941-1-lukas.bulwahn@gmail.com Fixes: 2e13ba54a268 ("fs, proc: introduce CONFIG_PROC_CHILDREN") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Iago López Galeiras <iago@endocode.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
Makefile	arm64: Refactor the xcall proc code hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/IBV2E4 -------------------------------- Refactor the xcall proc code by move to proc_xcall.c and replace fast_syscall_enabled() with system_supports_xcall(). Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>	10 个月前
array.c	procfs: fix missing RCU protection when reading real_parent in do_task_stat() stable inclusion from stable-v6.6.128 commit 0e64bd46a04a4fd61279aca9f53a664e9e5f7e7e category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8688 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0e64bd46a04a4fd61279aca9f53a664e9e5f7e7e -------------------------------- [ Upstream commit 76149d53502cf17ef3ae454ff384551236fba867 ] When reading /proc/[pid]/stat, do_task_stat() accesses task->real_parent without proper RCU protection, which leads to: cpu 0 cpu 1 ----- ----- do_task_stat var = task->real_parent release_task call_rcu(delayed_put_task_struct) task_tgid_nr_ns(var) rcu_read_lock <--- Too late to protect task->real_parent! task_pid_ptr <--- UAF! rcu_read_unlock This patch uses task_ppid_nr_ns() instead of task_tgid_nr_ns() to add proper RCU protection for accessing task->real_parent. Link: https://lkml.kernel.org/r/20260128083007.3173016-1-alexjlzheng@tencent.com Fixes: 06fffb1267c9 ("do_task_stat: don't take rcu_read_lock()") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: ruippan <ruippan@tencent.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Conflicts: fs/proc/array.c [Context Conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>	3 个月前
base.c	mm/cma_folio: make CMA folio allocation per-task opt-in via procfs euleros inclusion category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/9121 --------------------------------------------- Add task_struct::folio_cma_enabled and expose it through /proc/<pid>/folio_cma_enabled. alloc_cma_folio_mpol() now consults current->folio_cma_enabled before attempting a CMA folio allocation; if unset, the request falls back to the normal page allocator. This lets administrators opt individual processes into CMA folio usage instead of having it enabled globally for all eligible allocations. Signed-off-by: Ni Cunshu <nicunshu@huawei.com>	1 个月前
bootconfig.c	proc: bootconfig: Add null pointer check kzalloc is a memory allocation function which can return NULL when some internal memory errors happen. It is safer to add null pointer check. Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn Cc: stable@vger.kernel.org Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	4 年前
cmdline.c	proc: mark /proc/cmdline as permanent /proc/cmdline is never removed, mark is as permanent for slightly faster open and close. Link: https://lkml.kernel.org/r/Y66xAveh2yUsP7m9@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
consoles.c	proc: consoles: use console_list_lock for list iteration The console_lock is used in part to guarantee safe list iteration. The console_list_lock should be used because list synchronization responsibility will be removed from the console_lock in a later change. Note, the console_lock is still needed to serialize the device() callback with other console operations. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221116162152.193147-35-john.ogness@linutronix.de	3 年前
cpuinfo.c	x86/aperfmperf: Replace aperfmperf_get_khz() The frequency invariance infrastructure provides the APERF/MPERF samples already. Utilize them for the cpu frequency display in /proc/cpuinfo. The sample is considered valid for 20ms. So for idle or isolated NOHZ full CPUs the function returns 0, which is matching the previous behaviour. This gets rid of the mass IPIs and a delay of 20ms for stabilizing observed by Eric when reading /proc/cpuinfo. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220415161206.875029458@linutronix.de	4 年前
devices.c	proc: mark more files as permanent Mark /proc/devices /proc/kpagecount /proc/kpageflags /proc/kpagecgroup /proc/loadavg /proc/meminfo /proc/softirqs /proc/uptime /proc/version as permanent /proc entries, saving alloc/free and some list/spinlock ops per use. These files are never removed by the kernel so it is OK to mark them. Link: https://lkml.kernel.org/r/Yyn527DzDMa+r0Yj@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
etmem_proc.c	etmem: fix use-after-free of mm in the scan release process euleros inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBFHR4 CVE: NA ---------------------------------------------------- In the mm_idle_release function, etmem first uses the mmdrop to release this mm, and then call page_scan_release, resulting in a use-after-free problem. Instead, this patch swaps the placement of mmdrop and page_scan_release to avoid uaf problem. Fixes: 5d3b64fd78b8 ("etmem: add etmem scan feature") Signed-off-by: chenrenhui <chenrenhui1@huawei.com>	1 年前
etmem_scan.c	etmem: add etmem scan feature euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T1MB?from=project-issue CVE: NA ------------------------------------------------- This patch implements the etmem feature. etmem scan module communicates with the user space program through registered proc file system. It periodically scans the vma segments of the target process, by walking its page table and check access bit of each page, before reporting the scan results to user space, so that we can better classify hotness of pages and further migrate hot ones to fast memory tier and cold ones to slow memory tier. Signed-off-by: yanxiaodan <yanxiaodan@huawei.com> Signed-off-by: Feilong Lin <linfeilong@huawei.com> Signed-off-by: geruijun <geruijun@huawei.com> Signed-off-by: liubo <liubo254@huawei.com> Signed-off-by: Yuchen Tang <tangyuchen5@huawei.com>	2 年前
etmem_scan.h	etmem: add etmem scan feature euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T1MB?from=project-issue CVE: NA ------------------------------------------------- This patch implements the etmem feature. etmem scan module communicates with the user space program through registered proc file system. It periodically scans the vma segments of the target process, by walking its page table and check access bit of each page, before reporting the scan results to user space, so that we can better classify hotness of pages and further migrate hot ones to fast memory tier and cold ones to slow memory tier. Signed-off-by: yanxiaodan <yanxiaodan@huawei.com> Signed-off-by: Feilong Lin <linfeilong@huawei.com> Signed-off-by: geruijun <geruijun@huawei.com> Signed-off-by: liubo <liubo254@huawei.com> Signed-off-by: Yuchen Tang <tangyuchen5@huawei.com>	2 年前
etmem_swap.c	mm: madvise: pageout: ignore references rather than clearing young mainline inclusion from mainline-v6.9-rc1 commit 2864f3d0f5831a50253befc5d4583868268b7153 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2864f3d0f5831a50253befc5d4583868268b7153 -------------------------------- While doing MADV_PAGEOUT, the current code will clear PTE young so that vmscan won't read young flags to allow the reclamation of madvised folios to go ahead. It seems we can do it by directly ignoring references, thus we can remove tlb flush in madvise and rmap overhead in vmscan. Regarding the side effect, in the original code, if a parallel thread runs side by side to access the madvised memory with the thread doing madvise, folios will get a chance to be re-activated by vmscan (though the time gap is actually quite small since checking PTEs is done immediately after clearing PTEs young). But with this patch, they will still be reclaimed. But this behaviour doing PAGEOUT and doing access at the same time is quite silly like DoS. So probably, we don't need to care. Or ignoring the new access during the quite small time gap is even better. For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep its behaviour as is since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual address is actually much more aggressive on reclamation. To untouch paddr's DAMOS_PAGEOUT, we simply pass ignore_references as false in reclaim_pages(). A microbench as below has shown 6% decrement on the latency of MADV_PAGEOUT, #define PGSIZE 4096 main() { int i; #define SIZE 51210241024 volatile long *p = mmap(NULL, SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long)) p[i] = 0x11; madvise(p, SIZE, MADV_PAGEOUT); } w/o patch w/ patch root@10:~# time ./a.out root@10:~# time ./a.out real 0m49.634s real 0m46.334s user 0m0.637s user 0m0.648s sys 0m47.434s sys 0m44.265s Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/vmscan.c mm/etmem.c include/linux/swap.h fs/proc/etmem_swap.c [ Adapt reclaim_pages() and reclaim_folio_list() used in etmem. ] Signed-off-by: Liu Shixin <liushixin2@huawei.com>	2 年前
fd.c	proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation stable inclusion from stable-v6.6.34 commit 0c08b92f982731c2a1b808a6b55e0cfb3307305c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAD6H2 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0c08b92f982731c2a1b808a6b55e0cfb3307305c -------------------------------- commit 0a960ba49869ebe8ff859d000351504dd6b93b68 upstream. The following commits loosened the permissions of /proc/<PID>/fdinfo/ directory, as well as the files within it, from 0500 to 0555 while also introducing a PTRACE_MODE_READ check between the current task and <PID>'s task: - commit 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") - commit 1927e498aee1 ("procfs: prevent unprivileged processes accessing fdinfo dir") Before those changes, inode based system calls like inotify_add_watch(2) would fail when the current task didn't have sufficient read permissions: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0500, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] This matches the documented behavior in the inotify_add_watch(2) man page: ERRORS EACCES Read access to the given file is not permitted. After those changes, inotify_add_watch(2) started succeeding despite the current task not having PTRACE_MODE_READ privileges on the target task: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = 1757 openat(AT_FDCWD, "/proc/1/task/1/fdinfo", O_RDONLY\|O_NONBLOCK\|O_CLOEXEC\|O_DIRECTORY) = -1 EACCES (Permission denied) [...] This change in behavior broke .NET prior to v7. See the github link below for the v7 commit that inadvertently/quietly (?) fixed .NET after the kernel changes mentioned above. Return to the old behavior by moving the PTRACE_MODE_READ check out of the file .open operation and into the inode .permission operation: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] Reported-by: Kevin Parsons (Microsoft) <parsonskev@gmail.com> Link: https://github.com/dotnet/runtime/commit/89e5469ac591b82d38510fe7de98346cce74ad4f Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") Cc: stable@vger.kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Hardik Garg <hargar@linux.microsoft.com> Cc: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Wang Hai <wanghai38@huawei.com>	1 年前
fd.h	fs: port ->permission() to pass mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	3 年前
generic.c	fs/proc: fix uaf in proc_readdir_de() stable inclusion from stable-v6.6.117 commit 67272c11f379d9aa5e0f6b16286b9d89b3f76046 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/11253 CVE: CVE-2025-40271 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=67272c11f379d9aa5e0f6b16286b9d89b3f76046 -------------------------------- commit 895b4c0c79b092d732544011c3cecaf7322c36a1 upstream. Pde is erased from subdir rbtree through rb_erase(), but not set the node to EMPTY, which may result in uaf access. We should use RB_CLEAR_NODE() set the erased node to EMPTY, then pde_subdir_next() will return NULL to avoid uaf access. We found an uaf issue while using stress-ng testing, need to run testcase getdent and tun in the same time. The steps of the issue is as follows: 1) use getdent to traverse dir /proc/pid/net/dev_snmp6/, and current pde is tun3; 2) in the [time windows] unregister netdevice tun3 and tun2, and erase them from rbtree. erase tun3 first, and then erase tun2. the pde(tun2) will be released to slab; 3) continue to getdent process, then pde_subdir_next() will return pde(tun2) which is released, it will case uaf access. CPU 0 \| CPU 1 ------------------------------------------------------------------------- traverse dir /proc/pid/net/dev_snmp6/ \| unregister_netdevice(tun->dev) //tun3 tun2 sys_getdents64() \| iterate_dir() \| proc_readdir() \| proc_readdir_de() \| snmp6_unregister_dev() pde_get(de); \| proc_remove() read_unlock(&proc_subdir_lock); \| remove_proc_subtree() \| write_lock(&proc_subdir_lock); [time window] \| rb_erase(&root->subdir_node, &parent->subdir); \| write_unlock(&proc_subdir_lock); read_lock(&proc_subdir_lock); \| next = pde_subdir_next(de); \| pde_put(de); \| de = next; //UAF \| rbtree of dev_snmp6 \| pde(tun3) / \ NULL pde(tun2) Link: https://lkml.kernel.org/r/20251025024233.158363-1-albin_yang@163.com Signed-off-by: Wei Yang <albinwyang@tencent.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: wangzijie <wangzijie1@honor.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Li Nan <linan122@huawei.com>	4 个月前
inode.c	fix proc_sys_compare() handling of in-lookup dentries stable inclusion from stable-v6.6.99 commit f9b3d28f1f62bef49293e2eb3d5ded248341e7f8 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8365 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f9b3d28f1f62bef49293e2eb3d5ded248341e7f8 -------------------------------- [ Upstream commit b969f9614885c20f903e1d1f9445611daf161d6d ] There's one case where ->d_compare() can be called for an in-lookup dentry; usually that's nothing special from ->d_compare() point of view, but... proc_sys_compare() is weird. The thing is, /proc/sys subdirectories can look differently for different processes. Up to and including having the same name resolve to different dentries - all of them hashed. The way it's done is ->d_compare() refusing to admit a match unless this dentry is supposed to be visible to this caller. The information needed to discriminate between them is stored in inode; it is set during proc_sys_lookup() and until it's done d_splice_alias() we really can't tell who should that dentry be visible for. Normally there's no negative dentries in /proc/sys; we can run into a dying dentry in RCU dcache lookup, but those can be safely rejected. However, ->d_compare() is also called for in-lookup dentries, before they get positive - or hashed, for that matter. In case of match we will wait until dentry leaves in-lookup state and repeat ->d_compare() afterwards. In other words, the right behaviour is to treat the name match as sufficient for in-lookup dentries; if dentry is not for us, we'll see that when we recheck once proc_sys_lookup() is done with it. While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there. Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") Reported-by: NeilBrown <neilb@brown.name> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit f9b3d28f1f62bef49293e2eb3d5ded248341e7f8) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>	4 个月前
internal.h	proc: use the same treatment to check proc_lseek as ones for proc_read_iter et.al mainline inclusion from mainline-v6.17-rc1 commit ff7ec8dc1b646296f8d94c39339e8d3833d16c05 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/ICUC9Z Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ff7ec8dc1b646296f8d94c39339e8d3833d16c05 -------------------------------- Check pde->proc_ops->proc_lseek directly may cause UAF in rmmod scenario. It's a gap in proc_reg_open() after commit 654b33ada4ab("proc: fix UAF in proc_get_inode()"). Followed by AI Viro's suggestion, fix it in same manner. Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()") Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: fs/proc/inode.c [Commit 133097647206 ("proc: use __auto_type more") removed the definition of release in proc_reg_open().] Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>	9 个月前
interrupts.c	proc: introduce proc_create_seq{,_data} Variants of proc_create{,_data} that directly take a struct seq_operations argument and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>	7 年前
kcore.c	fs/proc/kcore.c: Clear ret value in read_kcore_iter after successful iov_iter_zero stable inclusion from stable-v6.6.64 commit 6868deee4a6bb8d324e883e2f540aaa33499a0f9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBL4B6 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6868deee4a6bb8d324e883e2f540aaa33499a0f9 -------------------------------- commit 088f294609d8f8816dc316681aef2eb61982e0da upstream. If iov_iter_zero succeeds after failed copy_from_kernel_nofault, we need to reset the ret value to zero otherwise it will be returned as final return value of read_kcore_iter. This fixes objdump -d dump over /proc/kcore for me. Cc: stable@vger.kernel.org Cc: Alexander Gordeev <agordeev@linux.ibm.com> Fixes: 3d5854d75e31 ("fs/proc/kcore.c: allow translation of physical memory addresses") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20241121231118.3212000-1-jolsa@kernel.org Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Wen Zhiwei <wenzhiwei@kylinos.cn>	1 年前
kmsg.c	Merge tag 'printk-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Initialize pointer hashing using the system workqueue. It avoids taking locks in printk()/vsprintf() code path - Misc code clean up * tag 'printk-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: Mark __printk percpu data ready __ro_after_init printk: Remove bogus comment vs. boot consoles printk: Remove write only variable nr_ext_console_drivers printk: Declare log_wait properly printk: Make pr_flush() static lib/vsprintf: Initialize vsprintf's pointer hash once the random core is ready. lib/vsprintf: Remove static_branch_likely() from __ptr_to_hashval(). lib/vnsprintf: add const modifier for param 'bitmap'	3 年前
loadavg.c	proc: mark more files as permanent Mark /proc/devices /proc/kpagecount /proc/kpageflags /proc/kpagecgroup /proc/loadavg /proc/meminfo /proc/softirqs /proc/uptime /proc/version as permanent /proc entries, saving alloc/free and some list/spinlock ops per use. These files are never removed by the kernel so it is OK to mark them. Link: https://lkml.kernel.org/r/Yyn527DzDMa+r0Yj@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
mem_reliable.c	mm: mem_reliable: Alloc task memory from reliable region hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8USBA CVE: NA ------------------------------------------ Adding reliable flag for user task. User task with reliable flag can only alloc memory from mirrored region. PF_RELIABLE is added to represent the task's reliable flag. - For init task, which is regarded as special task which alloc memory from mirrored region. - For normal user tasks, The reliable flag can be set via procfs interface shown as below and can be inherited via fork(). User can change a user task's reliable flag by $ echo [0/1] > /proc/<pid>/reliable and check a user task's reliable flag by $ cat /proc/<pid>/reliable Note, global init task's reliable file can not be accessed. Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>	2 年前
meminfo.c	mm/numa_remote: add pre-online count in meminfo hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID3KH8 -------------------------------- In pre-online mode, the pre-online remote memory is actually inaccessible. We should strip it out of total memory count and display it separately, since such memory can't be used yet. Signed-off-by: Liu Shixin <liushixin2@huawei.com>	7 个月前
namespaces.c	Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull openat2 support from Al Viro: "This is the openat2() series from Aleksa Sarai. I'm afraid that the rest of namei stuff will have to wait - it got zero review the last time I'd posted #work.namei, and there had been a leak in the posted series I'd caught only last weekend. I was going to repost it on Monday, but the window opened and the odds of getting any review during that... Oh, well. Anyway, openat2 part should be ready; that _did_ get sane amount of review and public testing, so here it comes" From Aleksa's description of the series: "For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE\|O_DIRECTORY\|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful. In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added: LOOKUP_NO_XDEV: Blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted). LOOKUP_NO_MAGICLINKS: Blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name. It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW\|NO_MAGICLINKS) on a magic-link and it will not fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link. In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required. LOOKUP_BENEATH: Disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional to protect against various races that would allow escape using "..". Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion. In addition, two new flags are added that expand on the above ideas: LOOKUP_NO_SYMLINKS: Does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component. LOOKUP_IN_ROOT: This is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not. If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT. The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few). In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready. Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes though stale NFS handles)" * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Documentation: path-lookup: include new LOOKUP flags selftests: add openat2(2) selftests open: introduce openat2(2) syscall namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution namei: LOOKUP_IN_ROOT: chroot-like scoped resolution namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution namei: LOOKUP_NO_XDEV: block mountpoint crossing namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution namei: LOOKUP_NO_SYMLINKS: block symlink resolution namei: allow set_root() to produce errors namei: allow nd_jump_link() to produce errors nsfs: clean-up ns_get_path() signature to return int namei: only return -ECHILD from follow_dotdot_rcu()	6 年前
nommu.c	fs: create helper file_user_path() for user displayed mapped file path mainline inclusion from mainline-v6.7-rc1 commit 08582d678fcf11fc86188f0b92239d3d49667d8e category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/IBHLU4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08582d678fcf11fc86188f0b92239d3d49667d8e -------------------------------- Overlayfs uses backing files with "fake" overlayfs f_path and "real" underlying f_inode, in order to use underlying inode aops for mapped files and to display the overlayfs path in /proc/<pid>/maps. In preparation for storing the overlayfs "fake" path instead of the underlying "real" path in struct backing_file, define a noop helper file_user_path() that returns f_path for now. Use the new helper in procfs and kernel logs whenever a path of a mapped file is displayed to users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Yifan Qiao <qiaoyifan4@huawei.com>	1 年前
page.c	mm: introduce CMA-based folio allocation euleros inclusion category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/9121 --------------------------------------------- Add a new memory management feature that routes large folio allocations (order > 0) through dedicated Contiguous Memory Allocator (CMA) regions. This allows the kernel to reserve physical contiguous memory early for large folios (e.g., THP) and return it back to CMA on release, improving fragmentation behavior for workloads that benefit from contiguous pages. The following limitations are inherent to the current design and should be considered before enabling it in production: Once allocated, CMA folios are immobile. They cannot be migrated by compaction or NUMA balancing, so the reserved memory can become a permanent "sieve" of unmovable blocks. CMA memory is reserved at boot and subtracted from the buddy. If the workload does not actually use it, the pages stay idle and are unavailable for normal allocations even under OOM pressure. The feature is controlled by CONFIG_CMA_FOLIO (depends on CMA and TRANSPARENT_HUGEPAGE). When enabled, a CMA area per NUMA node is reserved via the "folio_cma=" kernel command line. A runtime switch /proc/sys/vm/cma_folio_enabled is also provided. Signed-off-by: Ni Cunshu <nicunshu@huawei.com>	1 个月前
proc_net.c	Merge tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs fixes from Christian Brauner: "Mode changes to files under /proc/<pid>/ aren't supported ever since commit 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files"). Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, mode changes on /proc/thread-self/comm were accidently allowed. Similar, mode changes for all files beneath /proc/<pid>/net/ are blocked but mode changes on /proc/<pid>/net itself were accidently allowed. Both issues come down to not using the generic proc_setattr() helper which blocks all mode changes. This is rectified with this pull request. This also removes a strange nolibc test that abused /proc/<pid>/net for testing mode changes. Using procfs for this test never made a lot of sense given procfs has special semantics for almost everything anway. Both changes are minor user-visible changes. It is however very unlikely that mode changes on proc/<pid>/net and /proc/thread-self/comm are something that userspace relies on" * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: procfs: block chmod on /proc/thread-self/comm proc: use generic setattr() for /proc/$PID/net selftests/nolibc: drop test chmod_net	2 年前
proc_sysctl.c	fix proc_sys_compare() handling of in-lookup dentries stable inclusion from stable-v6.6.99 commit f9b3d28f1f62bef49293e2eb3d5ded248341e7f8 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8365 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f9b3d28f1f62bef49293e2eb3d5ded248341e7f8 -------------------------------- [ Upstream commit b969f9614885c20f903e1d1f9445611daf161d6d ] There's one case where ->d_compare() can be called for an in-lookup dentry; usually that's nothing special from ->d_compare() point of view, but... proc_sys_compare() is weird. The thing is, /proc/sys subdirectories can look differently for different processes. Up to and including having the same name resolve to different dentries - all of them hashed. The way it's done is ->d_compare() refusing to admit a match unless this dentry is supposed to be visible to this caller. The information needed to discriminate between them is stored in inode; it is set during proc_sys_lookup() and until it's done d_splice_alias() we really can't tell who should that dentry be visible for. Normally there's no negative dentries in /proc/sys; we can run into a dying dentry in RCU dcache lookup, but those can be safely rejected. However, ->d_compare() is also called for in-lookup dentries, before they get positive - or hashed, for that matter. In case of match we will wait until dentry leaves in-lookup state and repeat ->d_compare() afterwards. In other words, the right behaviour is to treat the name match as sufficient for in-lookup dentries; if dentry is not for us, we'll see that when we recheck once proc_sys_lookup() is done with it. While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there. Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") Reported-by: NeilBrown <neilb@brown.name> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit f9b3d28f1f62bef49293e2eb3d5ded248341e7f8) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>	4 个月前
proc_tty.c	proc: delete unused <linux/uaccess.h> includes Those aren't necessary after seq files won. Link: https://lkml.kernel.org/r/YqnA3mS7KBt8Z4If@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
proc_xcall.c	xcall2.0: Fix stack-out-of-bounds bug in xcall_write() hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/release-management/issues/ID5CMS -------------------------------- The following command cause a stack-out-of-bounds KASAN bug. commit fd83d0fe16f1 ("xcall: Rework the early exception vector of XCALL and SYNC") delete the count limit for copy_from_user(). Due to copy_from_user() copy one extra character into the "buf", which overwrites the original trailing '\0' character in the "buf", causing kstrtouint() to access out of bounds. Fix it by limiting copy_from_user() to copy at most the buf's maximum size minus one to reserve space for a '\0'. echo 1234 > /proc/1/xcall or echo !123 > /proc/1/xcall xcall_write count: 5 ================================================================== BUG: KASAN: stack-out-of-bounds in _kstrtoull+0x120/0x150 Read of size 1 at addr ffff800085d67b85 by task sh/1209 CPU: 4 PID: 1209 Comm: sh Not tainted 6.6.0-27103-g5fd643bb2ba7-dirty #232 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x94/0xf0 show_stack+0x1c/0x30 dump_stack_lvl+0x44/0x58 print_report+0x164/0x5b8 kasan_report+0x80/0xc0 __asan_load1+0x58/0x60 _kstrtoull+0x120/0x150 kstrtoull+0x38/0x48 kstrtouint+0x70/0xe0 xcall_write+0x274/0x3e0 vfs_write+0x228/0x548 ksys_write+0xc8/0x170 __arm64_sys_write+0x48/0x60 invoke_syscall+0x64/0x148 el0_svc_common.constprop.0+0x7c/0x110 do_el0_svc+0x68/0xb8 el0_slow_syscall+0x3c/0x128 .slow_syscall+0x16c/0x170 The buggy address belongs to stack of task sh/1209 and is located at offset 69 in frame: xcall_write+0x0/0x3e0 This frame has 3 objects: [32, 36) 'sc_no' [48, 52) 'val' [64, 69) 'buf' The buggy address belongs to the virtual mapping at [ffff800085d60000, ffff800085d69000) created by: kernel_clone+0xf4/0x550 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105556 flags: 0x2000000000000000(node=0\|zone=2) raw: 2000000000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff800085d67a80: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00 ffff800085d67b00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f2 04 f2 >ffff800085d67b80: 05 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffff800085d67c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 f2 f2 f2 ffff800085d67c80: f2 f2 00 00 00 00 00 00 f3 f3 f3 f3 00 00 00 00 ================================================================== Disabling lock debugging due to kernel taint Fixes: fd83d0fe16f1 ("xcall: Rework the early exception vector of XCALL and SYNC") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>	7 个月前
root.c	fs: super: dynamically allocate the s_shrink mainline inclusion category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IC3A7I ------------------------------------------------- In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the s_shrink, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct super_block. Link: https://lkml.kernel.org/r/20230911094444.68966-39-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	8 个月前
self.c	procfs: convert to ctime accessor functions In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Acked-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-65-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2 年前
softirqs.c	proc/softirqs: replace seq_printf with seq_put_decimal_ull_width stable inclusion from stable-v6.6.64 commit fe8c40810a110a33f666040b1ea29f927b28241a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBL4B6 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fe8c40810a110a33f666040b1ea29f927b28241a -------------------------------- [ Upstream commit 84b9749a3a704dcc824a88aa8267247c801d51e4 ] seq_printf is costy, on a system with n CPUs, reading /proc/softirqs would yield 10*n decimal values, and the extra cost parsing format string grows linearly with number of cpus. Replace seq_printf with seq_put_decimal_ull_width have significant performance improvement. On an 8CPUs system, reading /proc/softirqs show ~40% performance gain with this patch. Signed-off-by: David Wang <00107082@163.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Wen Zhiwei <wenzhiwei@kylinos.cn>	1 年前
stat.c	bpf: treewide: Annotate BPF kfuncs in BTF mainline inclusion from mainline-v6.9-rc1 commit 6f3189f38a3e995232e028a4c341164c4aca1b20 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f3189f38a3e995232e028a4c341164c4aca1b20 ---------------------------------------------------------------------- This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: fs/verity/measure.c kernel/bpf/cpumask.c kernel/bpf/helpers.c kernel/cgroup/rstat.c kernel/trace/bpf_trace.c net/core/filter.c net/core/xdp.c tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c arch/arm64/kernel/bpf-rvi.c block/blk-cgroup.c drivers/hid/bpf/hid_bpf_dispatch.c fs/proc/stat.c kernel/bpf-rvi/common_kfuncs.c kernel/cgroup/cpuset.c kernel/sched/bpf_sched.c kernel/sched/cpuacct.c net/xfrm/xfrm_state_bpf.c [Fix conflicts because of missing some kfuns or context diff. And USE KFUNCS for self-dev kfuncs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com>	2 个月前
task_mmu.c	mm: fix the inaccurate memory statistics issue for users stable inclusion from stable-v6.6.99 commit 56995226431a37182128d0b0adf39cd003bf94d7 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8365 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=56995226431a37182128d0b0adf39cd003bf94d7 -------------------------------- commit 82241a83cd15aaaf28200a40ad1a8b480012edaf upstream. On some large machines with a high number of CPUs running a 64K pagesize kernel, we found that the 'RES' field is always 0 displayed by the top command for some processes, which will cause a lot of confusion for users. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd The main reason is that the batch size of the percpu counter is quite large on these machines, caching a significant percpu value, since converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"). Intuitively, the batch number should be optimized, but on some paths, performance may take precedence over statistical accuracy. Therefore, introducing a new interface to add the percpu statistical count and display it to users, which can remove the confusion. In addition, this change is not expected to be on a performance-critical path, so the modification should be acceptable. In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and dec/inc_mm_counter(), which are all wrappers around percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu batch caching to avoid 'fbc->lock' contention. This patch changes task_mem() and task_statm() to get the accurate mm counters under the 'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock contention due to the percpu batch caching of the mm counters. The following test also confirm the theoretical analysis. I run the stress-ng that stresses anon page faults in 32 threads on my 32 cores machine, while simultaneously running a script that starts 32 threads to busy-loop pread each stress-ng thread's /proc/pid/status interface. From the following data, I did not observe any obvious impact of this patch on the stress-ng tests. w/o patch: stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle) stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec w/patch: stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle) stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec On comparing a very simple app which just allocates & touches some memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus tree (4c06e63b9203) I can see that on latest Linus tree the values for VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while they do report values on v6.1 and a Linus tree with this patch. Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by Donet Tom <donettom@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 56995226431a37182128d0b0adf39cd003bf94d7) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>	4 个月前
task_nommu.c	fs: create helper file_user_path() for user displayed mapped file path mainline inclusion from mainline-v6.7-rc1 commit 08582d678fcf11fc86188f0b92239d3d49667d8e category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/IBHLU4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08582d678fcf11fc86188f0b92239d3d49667d8e -------------------------------- Overlayfs uses backing files with "fake" overlayfs f_path and "real" underlying f_inode, in order to use underlying inode aops for mapped files and to display the overlayfs path in /proc/<pid>/maps. In preparation for storing the overlayfs "fake" path instead of the underlying "real" path in struct backing_file, define a noop helper file_user_path() that returns f_path for now. Use the new helper in procfs and kernel logs whenever a path of a mapped file is displayed to users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Yifan Qiao <qiaoyifan4@huawei.com>	1 年前
thread_self.c	procfs: convert to ctime accessor functions In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Acked-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-65-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2 年前
uptime.c	proc: mark more files as permanent Mark /proc/devices /proc/kpagecount /proc/kpageflags /proc/kpagecgroup /proc/loadavg /proc/meminfo /proc/softirqs /proc/uptime /proc/version as permanent /proc entries, saving alloc/free and some list/spinlock ops per use. These files are never removed by the kernel so it is OK to mark them. Link: https://lkml.kernel.org/r/Yyn527DzDMa+r0Yj@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
util.c	fs/proc/util.c: include fs/proc/internal.h for name_to_int() name_to_int() is defined in fs/proc/util.c and declared in fs/proc/internal.h, but the declaration isn't included at the point of the definition. Include the header to enforce that the definition matches the declaration. This addresses a gcc warning when -Wmissing-prototypes is enabled. Link: http://lkml.kernel.org/r/20181115001833.49371-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	7 年前
version.c	proc: mark more files as permanent Mark /proc/devices /proc/kpagecount /proc/kpageflags /proc/kpagecgroup /proc/loadavg /proc/meminfo /proc/softirqs /proc/uptime /proc/version as permanent /proc entries, saving alloc/free and some list/spinlock ops per use. These files are never removed by the kernel so it is OK to mark them. Link: https://lkml.kernel.org/r/Yyn527DzDMa+r0Yj@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	3 年前
vmcore.c	fs/proc: fix softlockup in __read_vmcore (part 2) stable inclusion from stable-v6.6.74 commit a5a2ee8144c3897d37403a69118c3e3dc5713958 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IBLWTG CVE: CVE-2025-21694 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a5a2ee8144c3897d37403a69118c3e3dc5713958 -------------------------------- commit cbc5dde0a461240046e8a41c43d7c3b76d5db952 upstream. Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the number of softlockups in __read_vmcore at kdump time have gone down, but they still happen sometimes. In a memory constrained environment like the kdump image, a softlockup is not just a harmless message, but it can interfere with things like RCU freeing memory, causing the crashdump to get stuck. The second loop in __read_vmcore has a lot more opportunities for natural sleep points, like scheduling out while waiting for a data write to happen, but apparently that is not always enough. Add a cond_resched() to the second loop in __read_vmcore to (hopefully) get rid of the softlockups. Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Li Huafei <lihuafei1@huawei.com>	1 年前