Diffstat (limited to 'openvz-sources/022.078-r3/0100_patch-022stab078-core.patch')
-rw-r--r-- | openvz-sources/022.078-r3/0100_patch-022stab078-core.patch | 85981 |
1 files changed, 0 insertions, 85981 deletions
diff --git a/openvz-sources/022.078-r3/0100_patch-022stab078-core.patch b/openvz-sources/022.078-r3/0100_patch-022stab078-core.patch deleted file mode 100644 index a179de7..0000000 --- a/openvz-sources/022.078-r3/0100_patch-022stab078-core.patch +++ /dev/null @@ -1,85981 +0,0 @@ -diff -uprN linux-2.6.8.1.orig/COPYING.SWsoft linux-2.6.8.1-ve022stab078/COPYING.SWsoft ---- linux-2.6.8.1.orig/COPYING.SWsoft 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/COPYING.SWsoft 2006-05-11 13:05:37.000000000 +0400 -@@ -0,0 +1,350 @@ -+ -+Nothing in this license should be construed as a grant by SWsoft of any rights -+beyond the rights specified in the GNU General Public License, and nothing in -+this license should be construed as a waiver by SWsoft of its patent, copyright -+and/or trademark rights, beyond the waiver required by the GNU General Public -+License. This license is expressly inapplicable to any product that is not -+within the scope of the GNU General Public License -+ -+---------------------------------------- -+ -+ GNU GENERAL PUBLIC LICENSE -+ Version 2, June 1991 -+ -+ Copyright (C) 1989, 1991 Free Software Foundation, Inc. -+ 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA -+ Everyone is permitted to copy and distribute verbatim copies -+ of this license document, but changing it is not allowed. -+ -+ Preamble -+ -+ The licenses for most software are designed to take away your -+freedom to share and change it. By contrast, the GNU General Public -+License is intended to guarantee your freedom to share and change free -+software--to make sure the software is free for all its users. This -+General Public License applies to most of the Free Software -+Foundation's software and to any other program whose authors commit to -+using it. (Some other Free Software Foundation software is covered by -+the GNU Library General Public License instead.) You can apply it to -+your programs, too. 
-+ -+ When we speak of free software, we are referring to freedom, not -+price. Our General Public Licenses are designed to make sure that you -+have the freedom to distribute copies of free software (and charge for -+this service if you wish), that you receive source code or can get it -+if you want it, that you can change the software or use pieces of it -+in new free programs; and that you know you can do these things. -+ -+ To protect your rights, we need to make restrictions that forbid -+anyone to deny you these rights or to ask you to surrender the rights. -+These restrictions translate to certain responsibilities for you if you -+distribute copies of the software, or if you modify it. -+ -+ For example, if you distribute copies of such a program, whether -+gratis or for a fee, you must give the recipients all the rights that -+you have. You must make sure that they, too, receive or can get the -+source code. And you must show them these terms so they know their -+rights. -+ -+ We protect your rights with two steps: (1) copyright the software, and -+(2) offer you this license which gives you legal permission to copy, -+distribute and/or modify the software. -+ -+ Also, for each author's protection and ours, we want to make certain -+that everyone understands that there is no warranty for this free -+software. If the software is modified by someone else and passed on, we -+want its recipients to know that what they have is not the original, so -+that any problems introduced by others will not reflect on the original -+authors' reputations. -+ -+ Finally, any free program is threatened constantly by software -+patents. We wish to avoid the danger that redistributors of a free -+program will individually obtain patent licenses, in effect making the -+program proprietary. To prevent this, we have made it clear that any -+patent must be licensed for everyone's free use or not licensed at all. 
-+ -+ The precise terms and conditions for copying, distribution and -+modification follow. -+ -+ GNU GENERAL PUBLIC LICENSE -+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION -+ -+ 0. This License applies to any program or other work which contains -+a notice placed by the copyright holder saying it may be distributed -+under the terms of this General Public License. The "Program", below, -+refers to any such program or work, and a "work based on the Program" -+means either the Program or any derivative work under copyright law: -+that is to say, a work containing the Program or a portion of it, -+either verbatim or with modifications and/or translated into another -+language. (Hereinafter, translation is included without limitation in -+the term "modification".) Each licensee is addressed as "you". -+ -+Activities other than copying, distribution and modification are not -+covered by this License; they are outside its scope. The act of -+running the Program is not restricted, and the output from the Program -+is covered only if its contents constitute a work based on the -+Program (independent of having been made by running the Program). -+Whether that is true depends on what the Program does. -+ -+ 1. You may copy and distribute verbatim copies of the Program's -+source code as you receive it, in any medium, provided that you -+conspicuously and appropriately publish on each copy an appropriate -+copyright notice and disclaimer of warranty; keep intact all the -+notices that refer to this License and to the absence of any warranty; -+and give any other recipients of the Program a copy of this License -+along with the Program. -+ -+You may charge a fee for the physical act of transferring a copy, and -+you may at your option offer warranty protection in exchange for a fee. -+ -+ 2. 
You may modify your copy or copies of the Program or any portion -+of it, thus forming a work based on the Program, and copy and -+distribute such modifications or work under the terms of Section 1 -+above, provided that you also meet all of these conditions: -+ -+ a) You must cause the modified files to carry prominent notices -+ stating that you changed the files and the date of any change. -+ -+ b) You must cause any work that you distribute or publish, that in -+ whole or in part contains or is derived from the Program or any -+ part thereof, to be licensed as a whole at no charge to all third -+ parties under the terms of this License. -+ -+ c) If the modified program normally reads commands interactively -+ when run, you must cause it, when started running for such -+ interactive use in the most ordinary way, to print or display an -+ announcement including an appropriate copyright notice and a -+ notice that there is no warranty (or else, saying that you provide -+ a warranty) and that users may redistribute the program under -+ these conditions, and telling the user how to view a copy of this -+ License. (Exception: if the Program itself is interactive but -+ does not normally print such an announcement, your work based on -+ the Program is not required to print an announcement.) -+ -+These requirements apply to the modified work as a whole. If -+identifiable sections of that work are not derived from the Program, -+and can be reasonably considered independent and separate works in -+themselves, then this License, and its terms, do not apply to those -+sections when you distribute them as separate works. But when you -+distribute the same sections as part of a whole which is a work based -+on the Program, the distribution of the whole must be on the terms of -+this License, whose permissions for other licensees extend to the -+entire whole, and thus to each and every part regardless of who wrote it. 
-+ -+Thus, it is not the intent of this section to claim rights or contest -+your rights to work written entirely by you; rather, the intent is to -+exercise the right to control the distribution of derivative or -+collective works based on the Program. -+ -+In addition, mere aggregation of another work not based on the Program -+with the Program (or with a work based on the Program) on a volume of -+a storage or distribution medium does not bring the other work under -+the scope of this License. -+ -+ 3. You may copy and distribute the Program (or a work based on it, -+under Section 2) in object code or executable form under the terms of -+Sections 1 and 2 above provided that you also do one of the following: -+ -+ a) Accompany it with the complete corresponding machine-readable -+ source code, which must be distributed under the terms of Sections -+ 1 and 2 above on a medium customarily used for software interchange; or, -+ -+ b) Accompany it with a written offer, valid for at least three -+ years, to give any third party, for a charge no more than your -+ cost of physically performing source distribution, a complete -+ machine-readable copy of the corresponding source code, to be -+ distributed under the terms of Sections 1 and 2 above on a medium -+ customarily used for software interchange; or, -+ -+ c) Accompany it with the information you received as to the offer -+ to distribute corresponding source code. (This alternative is -+ allowed only for noncommercial distribution and only if you -+ received the program in object code or executable form with such -+ an offer, in accord with Subsection b above.) -+ -+The source code for a work means the preferred form of the work for -+making modifications to it. For an executable work, complete source -+code means all the source code for all modules it contains, plus any -+associated interface definition files, plus the scripts used to -+control compilation and installation of the executable. 
However, as a -+special exception, the source code distributed need not include -+anything that is normally distributed (in either source or binary -+form) with the major components (compiler, kernel, and so on) of the -+operating system on which the executable runs, unless that component -+itself accompanies the executable. -+ -+If distribution of executable or object code is made by offering -+access to copy from a designated place, then offering equivalent -+access to copy the source code from the same place counts as -+distribution of the source code, even though third parties are not -+compelled to copy the source along with the object code. -+ -+ 4. You may not copy, modify, sublicense, or distribute the Program -+except as expressly provided under this License. Any attempt -+otherwise to copy, modify, sublicense or distribute the Program is -+void, and will automatically terminate your rights under this License. -+However, parties who have received copies, or rights, from you under -+this License will not have their licenses terminated so long as such -+parties remain in full compliance. -+ -+ 5. You are not required to accept this License, since you have not -+signed it. However, nothing else grants you permission to modify or -+distribute the Program or its derivative works. These actions are -+prohibited by law if you do not accept this License. Therefore, by -+modifying or distributing the Program (or any work based on the -+Program), you indicate your acceptance of this License to do so, and -+all its terms and conditions for copying, distributing or modifying -+the Program or works based on it. -+ -+ 6. Each time you redistribute the Program (or any work based on the -+Program), the recipient automatically receives a license from the -+original licensor to copy, distribute or modify the Program subject to -+these terms and conditions. You may not impose any further -+restrictions on the recipients' exercise of the rights granted herein. 
-+You are not responsible for enforcing compliance by third parties to -+this License. -+ -+ 7. If, as a consequence of a court judgment or allegation of patent -+infringement or for any other reason (not limited to patent issues), -+conditions are imposed on you (whether by court order, agreement or -+otherwise) that contradict the conditions of this License, they do not -+excuse you from the conditions of this License. If you cannot -+distribute so as to satisfy simultaneously your obligations under this -+License and any other pertinent obligations, then as a consequence you -+may not distribute the Program at all. For example, if a patent -+license would not permit royalty-free redistribution of the Program by -+all those who receive copies directly or indirectly through you, then -+the only way you could satisfy both it and this License would be to -+refrain entirely from distribution of the Program. -+ -+If any portion of this section is held invalid or unenforceable under -+any particular circumstance, the balance of the section is intended to -+apply and the section as a whole is intended to apply in other -+circumstances. -+ -+It is not the purpose of this section to induce you to infringe any -+patents or other property right claims or to contest validity of any -+such claims; this section has the sole purpose of protecting the -+integrity of the free software distribution system, which is -+implemented by public license practices. Many people have made -+generous contributions to the wide range of software distributed -+through that system in reliance on consistent application of that -+system; it is up to the author/donor to decide if he or she is willing -+to distribute software through any other system and a licensee cannot -+impose that choice. -+ -+This section is intended to make thoroughly clear what is believed to -+be a consequence of the rest of this License. -+ -+ 8. 
If the distribution and/or use of the Program is restricted in -+certain countries either by patents or by copyrighted interfaces, the -+original copyright holder who places the Program under this License -+may add an explicit geographical distribution limitation excluding -+those countries, so that distribution is permitted only in or among -+countries not thus excluded. In such case, this License incorporates -+the limitation as if written in the body of this License. -+ -+ 9. The Free Software Foundation may publish revised and/or new versions -+of the General Public License from time to time. Such new versions will -+be similar in spirit to the present version, but may differ in detail to -+address new problems or concerns. -+ -+Each version is given a distinguishing version number. If the Program -+specifies a version number of this License which applies to it and "any -+later version", you have the option of following the terms and conditions -+either of that version or of any later version published by the Free -+Software Foundation. If the Program does not specify a version number of -+this License, you may choose any version ever published by the Free Software -+Foundation. -+ -+ 10. If you wish to incorporate parts of the Program into other free -+programs whose distribution conditions are different, write to the author -+to ask for permission. For software which is copyrighted by the Free -+Software Foundation, write to the Free Software Foundation; we sometimes -+make exceptions for this. Our decision will be guided by the two goals -+of preserving the free status of all derivatives of our free software and -+of promoting the sharing and reuse of software generally. -+ -+ NO WARRANTY -+ -+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY -+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN -+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES -+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED -+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF -+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS -+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE -+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, -+REPAIR OR CORRECTION. -+ -+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING -+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR -+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, -+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING -+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED -+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY -+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER -+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE -+POSSIBILITY OF SUCH DAMAGES. -+ -+ END OF TERMS AND CONDITIONS -+ -+ How to Apply These Terms to Your New Programs -+ -+ If you develop a new program, and you want it to be of the greatest -+possible use to the public, the best way to achieve this is to make it -+free software which everyone can redistribute and change under these terms. -+ -+ To do so, attach the following notices to the program. It is safest -+to attach them to the start of each source file to most effectively -+convey the exclusion of warranty; and each file should have at least -+the "copyright" line and a pointer to where the full notice is found. 
-+ -+ <one line to give the program's name and a brief idea of what it does.> -+ Copyright (C) <year> <name of author> -+ -+ This program is free software; you can redistribute it and/or modify -+ it under the terms of the GNU General Public License as published by -+ the Free Software Foundation; either version 2 of the License, or -+ (at your option) any later version. -+ -+ This program is distributed in the hope that it will be useful, -+ but WITHOUT ANY WARRANTY; without even the implied warranty of -+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -+ GNU General Public License for more details. -+ -+ You should have received a copy of the GNU General Public License -+ along with this program; if not, write to the Free Software -+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA -+ -+ -+Also add information on how to contact you by electronic and paper mail. -+ -+If the program is interactive, make it output a short notice like this -+when it starts in an interactive mode: -+ -+ Gnomovision version 69, Copyright (C) year name of author -+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. -+ This is free software, and you are welcome to redistribute it -+ under certain conditions; type `show c' for details. -+ -+The hypothetical commands `show w' and `show c' should show the appropriate -+parts of the General Public License. Of course, the commands you use may -+be called something other than `show w' and `show c'; they could even be -+mouse-clicks or menu items--whatever suits your program. -+ -+You should also get your employer (if you work as a programmer) or your -+school, if any, to sign a "copyright disclaimer" for the program, if -+necessary. Here is a sample; alter the names: -+ -+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program -+ `Gnomovision' (which makes passes at compilers) written by James Hacker. 
-+ -+ <signature of Ty Coon>, 1 April 1989 -+ Ty Coon, President of Vice -+ -+This General Public License does not permit incorporating your program into -+proprietary programs. If your program is a subroutine library, you may -+consider it more useful to permit linking proprietary applications with the -+library. If this is what you want to do, use the GNU Library General -+Public License instead of this License. -diff -uprN linux-2.6.8.1.orig/Documentation/cachetlb.txt linux-2.6.8.1-ve022stab078/Documentation/cachetlb.txt ---- linux-2.6.8.1.orig/Documentation/cachetlb.txt 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Documentation/cachetlb.txt 2006-05-11 13:05:30.000000000 +0400 -@@ -142,6 +142,11 @@ changes occur: - The ia64 sn2 platform is one example of a platform - that uses this interface. - -+8) void lazy_mmu_prot_update(pte_t pte) -+ This interface is called whenever the protection on -+ any user PTEs change. This interface provides a notification -+ to architecture specific code to take appropiate action. -+ - - Next, we have the cache flushing interfaces. 
In general, when Linux - is changing an existing virtual-->physical mapping to a new value, -diff -uprN linux-2.6.8.1.orig/Documentation/filesystems/Locking linux-2.6.8.1-ve022stab078/Documentation/filesystems/Locking ---- linux-2.6.8.1.orig/Documentation/filesystems/Locking 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Documentation/filesystems/Locking 2006-05-11 13:05:35.000000000 +0400 -@@ -90,7 +90,7 @@ prototypes: - void (*destroy_inode)(struct inode *); - void (*read_inode) (struct inode *); - void (*dirty_inode) (struct inode *); -- void (*write_inode) (struct inode *, int); -+ int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); -diff -uprN linux-2.6.8.1.orig/Documentation/filesystems/vfs.txt linux-2.6.8.1-ve022stab078/Documentation/filesystems/vfs.txt ---- linux-2.6.8.1.orig/Documentation/filesystems/vfs.txt 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Documentation/filesystems/vfs.txt 2006-05-11 13:05:35.000000000 +0400 -@@ -176,7 +176,7 @@ filesystem. As of kernel 2.1.99, the fol - - struct super_operations { - void (*read_inode) (struct inode *); -- void (*write_inode) (struct inode *, int); -+ int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); -diff -uprN linux-2.6.8.1.orig/Documentation/i386/zero-page.txt linux-2.6.8.1-ve022stab078/Documentation/i386/zero-page.txt ---- linux-2.6.8.1.orig/Documentation/i386/zero-page.txt 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Documentation/i386/zero-page.txt 2006-05-11 13:05:29.000000000 +0400 -@@ -28,7 +28,8 @@ Offset Type Description - - 0xa0 16 bytes System description table truncated to 16 bytes. - ( struct sys_desc_table_struct ) -- 0xb0 - 0x1c3 Free. Add more parameters here if you really need them. -+ 0xb0 - 0x13f Free. 
Add more parameters here if you really need them. -+ 0x140- 0x1be EDID_INFO Video mode setup - - 0x1c4 unsigned long EFI system table pointer - 0x1c8 unsigned long EFI memory descriptor size -diff -uprN linux-2.6.8.1.orig/Documentation/power/swsusp.txt linux-2.6.8.1-ve022stab078/Documentation/power/swsusp.txt ---- linux-2.6.8.1.orig/Documentation/power/swsusp.txt 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Documentation/power/swsusp.txt 2006-05-11 13:05:25.000000000 +0400 -@@ -211,8 +211,8 @@ A: All such kernel threads need to be fi - where it is safe to be frozen (no kernel semaphores should be held at - that point and it must be safe to sleep there), and add: - -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - Q: What is the difference between between "platform", "shutdown" and - "firmware" in /sys/power/disk? -diff -uprN linux-2.6.8.1.orig/Documentation/ve.txt linux-2.6.8.1-ve022stab078/Documentation/ve.txt ---- linux-2.6.8.1.orig/Documentation/ve.txt 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/Documentation/ve.txt 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,37 @@ -+ OpenVZ Overview -+ --------------- -+ (C) SWsoft, 2005, http://www.sw-soft.com, All rights reserved. -+ Licensing governed by "linux/COPYING.SWsoft" file. -+ -+OpenVZ is a virtualization technology which allows to run multiple -+isolated VPSs (Virtual Private Server) on a single operating system. -+It uses a single instance of Linux kernel in memory which efficiently -+manages resources between VPSs. -+ -+Virtual environment (VE) notion which is used in kernel is the original -+name of more modern notion of Virtual Private Server (VPS). -+ -+From user point of view, every VPS is an isolated operating system with -+private file system, private set of users, private root superuser, -+private set of processes and so on. 
Every application which do not -+require direct hardware access can't feel the difference between VPS -+and real standalone server. -+ -+From kernel point of view, VPS is an isolated set of processes spawned -+from their private 'init' process. Kernel controls which resources are -+accessible inside VPS and which amount of these resources can be -+consumed/used by VPS processes. Also kernel provides isolation between -+VPSs thus ensuring that one VPS can't use private resources of another -+VPS, make DoS/hack/crash attack on it's neighbour and so on. -+ -+main Open Virtuozzo config options: -+ CONFIG_FAIRSCHED=y -+ CONFIG_SCHED_VCPU=y -+ CONFIG_VE=y -+ CONFIG_VE_CALLS=m -+ CONFIG_VE_NETDEV=m -+ CONFIG_VE_IPTABLES=y -+ -+Official product pages: -+ http://www.virtuozzo.com -+ http://openvz.org -diff -uprN linux-2.6.8.1.orig/Documentation/vsched.txt linux-2.6.8.1-ve022stab078/Documentation/vsched.txt ---- linux-2.6.8.1.orig/Documentation/vsched.txt 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/Documentation/vsched.txt 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,83 @@ -+Copyright (C) 2005 SWsoft. All rights reserved. -+Licensing governed by "linux/COPYING.SWsoft" file. -+ -+Hierarchical CPU schedulers -+~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ -+Hierarchical CPU scheduler is a stack of CPU schedulers which allows -+to organize different policies of scheduling in the system and/or between -+groups of processes. -+ -+Virtuozzo uses a hierarchical Fair CPU scheduler organized as a 2-stage -+CPU scheduler, where the scheduling decisions are made in 2 steps: -+1. On the first step Fair CPU scheduler selects a group of processes -+ which should get some CPU time. -+2. Then standard Linux scheduler chooses a process inside the group. -+Such scheduler efficiently allows to isolate one group of processes -+from another and still allows a group to use more than 1 CPU on SMP systems. 
-+ -+This document describes a new middle layer of Virtuozzo hierarchical CPU -+scheduler which makes decisions after Fair scheduler, but before Linux -+scheduler and which is called VCPU scheduler. -+ -+ -+Where VCPU scheduler comes from? -+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ -+Existing hierarchical CPU scheduler uses isolated algorithms on each stage -+of decision making, i.e. every scheduler makes its decisions without -+taking into account the details of other schedulers. This can lead to a number -+of problems described below. -+ -+On SMP systems there are possible situations when the first CPU scheduler -+in the hierarchy (e.g. Fair scheduler) wants to schedule some group of -+processes on the physical CPU, but the underlying process scheduler -+(e.g. Linux O(1) CPU scheduler) is unable to schedule any processes -+on this physical CPU. Usually this happens due to the fact that Linux -+kernel scheduler uses per-physical CPU runqueues. -+ -+Another problem is that Linux scheduler also knows nothing about -+Fair scheduler and can't balance efficiently without taking into account -+statistics about process groups from Fair scheduler. Without such -+statistics Linux scheduler can concentrate all processes on one physical -+CPU, thus making CPU consuming highly inefficient. -+ -+VCPU scheduler solves these problems by adding a new layer between -+Fair schedule and Linux scheduler. -+ -+VCPU scheduler -+~~~~~~~~~~~~~~ -+ -+VCPU scheduler is a CPU scheduler which splits notion of -+physical and virtual CPUs (VCPU and PCPU). This means that tasks are -+running on virtual CPU runqueues, while VCPUs are running on PCPUs. -+ -+The Virtuozzo hierarchical fair scheduler becomes 3 stage CPU scheduler: -+1. First, Fair CPU scheduler select a group of processes. -+2. Then VCPU scheduler select a virtual CPU to run (this is actually -+ a runqueue). -+3. Standard Linux scheduler chooses a process from the runqueue. 
-+ -+For example on the picture below PCPU0 executes tasks from -+VCPU1 runqueue and PCPU1 is idle: -+ -+ virtual | physical | virtual -+ idle CPUs | CPUs | CPUS -+--------------------|------------------------|-------------------------- -+ | | ----------------- -+ | | | virtual sched X | -+ | | | ----------- | -+ | | | | VCPU0 | | -+ | | | ----------- | -+ ------------ | ----------- | ----------- | -+| idle VCPU0 | | | PCPU0 | <---> | | VCPU1 | | -+ ------------ | ----------- | ----------- | -+ | | ----------------- -+ | | -+ | | ----------------- -+ | | | virtual sched Y | -+ ------------ ----------- | | ----------- | -+| idle VCPU1 | <---> | PCPU1 | | | | VCPU0 | | -+ ------------ ----------- | | ----------- | -+ | | ----------------- -+ | | -diff -uprN linux-2.6.8.1.orig/Makefile linux-2.6.8.1-ve022stab078/Makefile ---- linux-2.6.8.1.orig/Makefile 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/Makefile 2006-05-11 13:05:49.000000000 +0400 -@@ -1,7 +1,10 @@ - VERSION = 2 - PATCHLEVEL = 6 - SUBLEVEL = 8 --EXTRAVERSION = .1 -+EXTRAVERSION-y = smp -+EXTRAVERSION- = up -+EXTRAVERSION-n = up -+EXTRAVERSION = -022stab078-$(EXTRAVERSION-$(CONFIG_SMP)) - NAME=Zonked Quokka - - # *DOCUMENTATION* -diff -uprN linux-2.6.8.1.orig/arch/alpha/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/alpha/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/alpha/kernel/ptrace.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/alpha/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -354,7 +354,7 @@ do_sys_ptrace(long request, long pid, lo - */ - case PTRACE_KILL: - ret = 0; -- if (child->state == TASK_ZOMBIE) -+ if (child->exit_state == EXIT_ZOMBIE) - break; - child->exit_code = SIGKILL; - /* make sure single-step breakpoint is gone. 
*/ -diff -uprN linux-2.6.8.1.orig/arch/arm/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/arm/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/arm/kernel/ptrace.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/arm/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -677,7 +677,7 @@ static int do_ptrace(int request, struct - /* make sure single-step breakpoint is gone. */ - child->ptrace &= ~PT_SINGLESTEP; - ptrace_cancel_bpt(child); -- if (child->state != TASK_ZOMBIE) { -+ if (child->exit_state != EXIT_ZOMBIE) { - child->exit_code = SIGKILL; - wake_up_process(child); - } -diff -uprN linux-2.6.8.1.orig/arch/arm/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/arm/kernel/signal.c ---- linux-2.6.8.1.orig/arch/arm/kernel/signal.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/arm/kernel/signal.c 2006-05-11 13:05:25.000000000 +0400 -@@ -548,9 +548,10 @@ static int do_signal(sigset_t *oldset, s - if (!user_mode(regs)) - return 0; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (current->ptrace & PT_SINGLESTEP) -diff -uprN linux-2.6.8.1.orig/arch/arm26/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/arm26/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/arm26/kernel/ptrace.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/arm26/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -614,7 +614,7 @@ static int do_ptrace(int request, struct - /* make sure single-step breakpoint is gone. 
*/ - child->ptrace &= ~PT_SINGLESTEP; - ptrace_cancel_bpt(child); -- if (child->state != TASK_ZOMBIE) { -+ if (child->exit_state != EXIT_ZOMBIE) { - child->exit_code = SIGKILL; - wake_up_process(child); - } -diff -uprN linux-2.6.8.1.orig/arch/cris/arch-v10/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/cris/arch-v10/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/cris/arch-v10/kernel/ptrace.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/cris/arch-v10/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -185,7 +185,7 @@ sys_ptrace(long request, long pid, long - case PTRACE_KILL: - ret = 0; - -- if (child->state == TASK_ZOMBIE) -+ if (child->exit_state == EXIT_ZOMBIE) - break; - - child->exit_code = SIGKILL; -diff -uprN linux-2.6.8.1.orig/arch/h8300/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/h8300/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/h8300/kernel/ptrace.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/h8300/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -199,7 +199,7 @@ asmlinkage int sys_ptrace(long request, - case PTRACE_KILL: { - - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - h8300_disable_trace(child); -diff -uprN linux-2.6.8.1.orig/arch/i386/boot/setup.S linux-2.6.8.1-ve022stab078/arch/i386/boot/setup.S ---- linux-2.6.8.1.orig/arch/i386/boot/setup.S 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/boot/setup.S 2006-05-11 13:05:38.000000000 +0400 -@@ -156,7 +156,7 @@ cmd_line_ptr: .long 0 # (Header versio - # can be located anywhere in - # low memory 0x10000 or higher. 
- --ramdisk_max: .long (MAXMEM-1) & 0x7fffffff -+ramdisk_max: .long (__MAXMEM-1) & 0x7fffffff - # (Header version 0x0203 or later) - # The highest safe address for - # the contents of an initrd -diff -uprN linux-2.6.8.1.orig/arch/i386/boot/video.S linux-2.6.8.1-ve022stab078/arch/i386/boot/video.S ---- linux-2.6.8.1.orig/arch/i386/boot/video.S 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/boot/video.S 2006-05-11 13:05:37.000000000 +0400 -@@ -123,6 +123,9 @@ video: pushw %ds # We use different seg - cmpw $ASK_VGA, %ax # Bring up the menu - jz vid2 - -+#ifndef CONFIG_FB -+ mov $VIDEO_80x25, %ax # hack to force 80x25 mode -+#endif - call mode_set # Set the mode - jc vid1 - -@@ -1901,7 +1904,7 @@ store_edid: - - movl $0x13131313, %eax # memset block with 0x13 - movw $32, %cx -- movw $0x440, %di -+ movw $0x140, %di - cld - rep - stosl -@@ -1910,7 +1913,7 @@ store_edid: - movw $0x01, %bx - movw $0x00, %cx - movw $0x01, %dx -- movw $0x440, %di -+ movw $0x140, %di - int $0x10 - - popw %di # restore all registers -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/Makefile linux-2.6.8.1-ve022stab078/arch/i386/kernel/Makefile ---- linux-2.6.8.1.orig/arch/i386/kernel/Makefile 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/Makefile 2006-05-11 13:05:38.000000000 +0400 -@@ -7,7 +7,7 @@ extra-y := head.o init_task.o vmlinux.ld - obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \ - ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \ - pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \ -- doublefault.o -+ doublefault.o entry_trampoline.o - - obj-y += cpu/ - obj-y += timers/ -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/acpi/boot.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/boot.c ---- linux-2.6.8.1.orig/arch/i386/kernel/acpi/boot.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/boot.c 2006-05-11 13:05:38.000000000 +0400 -@@ -484,7 +484,7 
@@ acpi_scan_rsdp ( - * RSDP signature. - */ - for (offset = 0; offset < length; offset += 16) { -- if (strncmp((char *) (start + offset), "RSD PTR ", sig_len)) -+ if (strncmp((char *) __va(start + offset), "RSD PTR ", sig_len)) - continue; - return (start + offset); - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/acpi/sleep.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/sleep.c ---- linux-2.6.8.1.orig/arch/i386/kernel/acpi/sleep.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/sleep.c 2006-05-11 13:05:38.000000000 +0400 -@@ -19,13 +19,29 @@ extern void zap_low_mappings(void); - - extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long)); - --static void init_low_mapping(pgd_t *pgd, int pgd_limit) -+static void map_low(pgd_t *pgd_base, unsigned long start, unsigned long end) - { -- int pgd_ofs = 0; -- -- while ((pgd_ofs < pgd_limit) && (pgd_ofs + USER_PTRS_PER_PGD < PTRS_PER_PGD)) { -- set_pgd(pgd, *(pgd+USER_PTRS_PER_PGD)); -- pgd_ofs++, pgd++; -+ unsigned long vaddr; -+ pmd_t *pmd; -+ pgd_t *pgd; -+ int i, j; -+ -+ pgd = pgd_base; -+ -+ for (i = 0; i < PTRS_PER_PGD; pgd++, i++) { -+ vaddr = i*PGDIR_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ pmd = pmd_offset(pgd, 0); -+ for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { -+ vaddr = i*PGDIR_SIZE + j*PMD_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ if (vaddr < start) -+ continue; -+ set_pmd(pmd, __pmd(_KERNPG_TABLE + _PAGE_PSE + -+ vaddr - start)); -+ } - } - } - -@@ -39,7 +55,9 @@ int acpi_save_state_mem (void) - { - if (!acpi_wakeup_address) - return 1; -- init_low_mapping(swapper_pg_dir, USER_PTRS_PER_PGD); -+ if (!cpu_has_pse) -+ return 1; -+ map_low(swapper_pg_dir, 0, LOW_MAPPINGS_SIZE); - memcpy((void *) acpi_wakeup_address, &wakeup_start, &wakeup_end - &wakeup_start); - acpi_copy_wakeup_routine(acpi_wakeup_address); - -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/acpi/wakeup.S linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/wakeup.S ---- 
linux-2.6.8.1.orig/arch/i386/kernel/acpi/wakeup.S 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/acpi/wakeup.S 2006-05-11 13:05:38.000000000 +0400 -@@ -67,6 +67,13 @@ wakeup_code: - movw $0x0e00 + 'i', %fs:(0x12) - - # need a gdt -+ #use the gdt copied in this low mem -+ lea temp_gdt_table - wakeup_code, %eax -+ xor %ebx, %ebx -+ movw %ds, %bx -+ shll $4, %ebx -+ addl %ebx, %eax -+ movl %eax, real_save_gdt + 2 - wakeup_code - lgdt real_save_gdt - wakeup_code - - movl real_save_cr0 - wakeup_code, %eax -@@ -89,6 +96,7 @@ real_save_cr4: .long 0 - real_magic: .long 0 - video_mode: .long 0 - video_flags: .long 0 -+temp_gdt_table: .fill GDT_ENTRIES, 8, 0 - - bogus_real_magic: - movw $0x0e00 + 'B', %fs:(0x12) -@@ -231,6 +239,13 @@ ENTRY(acpi_copy_wakeup_routine) - movl %edx, real_save_cr0 - wakeup_start (%eax) - sgdt real_save_gdt - wakeup_start (%eax) - -+ # gdt wont be addressable from real mode in 4g4g split -+ # copying it to the lower mem -+ xor %ecx, %ecx -+ movw saved_gdt, %cx -+ movl saved_gdt + 2, %esi -+ lea temp_gdt_table - wakeup_start (%eax), %edi -+ rep movsb - movl saved_videomode, %edx - movl %edx, video_mode - wakeup_start (%eax) - movl acpi_video_flags, %edx -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/apic.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/apic.c ---- linux-2.6.8.1.orig/arch/i386/kernel/apic.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/apic.c 2006-05-11 13:05:40.000000000 +0400 -@@ -970,9 +970,7 @@ void __init setup_boot_APIC_clock(void) - - void __init setup_secondary_APIC_clock(void) - { -- local_irq_disable(); /* FIXME: Do we need this? --RR */ - setup_APIC_timer(calibration_result); -- local_irq_enable(); - } - - void __init disable_APIC_timer(void) -@@ -1035,7 +1033,7 @@ int setup_profiling_timer(unsigned int m - * value into /proc/profile. 
- */ - --inline void smp_local_timer_interrupt(struct pt_regs * regs) -+asmlinkage void smp_local_timer_interrupt(struct pt_regs * regs) - { - int cpu = smp_processor_id(); - -@@ -1088,11 +1086,18 @@ inline void smp_local_timer_interrupt(st - - void smp_apic_timer_interrupt(struct pt_regs regs) - { -- int cpu = smp_processor_id(); -+#ifdef CONFIG_4KSTACKS -+ union irq_ctx *curctx; -+ union irq_ctx *irqctx; -+ u32 *isp; -+#endif -+ int cpu; -+ struct ve_struct *envid; - - /* - * the NMI deadlock-detector uses this. - */ -+ cpu = smp_processor_id(); - irq_stat[cpu].apic_timer_irqs++; - - /* -@@ -1105,9 +1110,35 @@ void smp_apic_timer_interrupt(struct pt_ - * Besides, if we don't timer interrupts ignore the global - * interrupt lock, which is the WrongThing (tm) to do. - */ -+ envid = set_exec_env(get_ve0()); - irq_enter(); -+#ifdef CONFIG_4KSTACKS -+ curctx = (union irq_ctx *) current_thread_info(); -+ irqctx = hardirq_ctx[cpu]; -+ if (curctx == irqctx) { -+ smp_local_timer_interrupt(®s); -+ } else { -+ /* build the stack frame on the IRQ stack */ -+ isp = (u32*) ((char*)irqctx + sizeof(*irqctx)); -+ irqctx->tinfo.task = curctx->tinfo.task; -+ irqctx->tinfo.real_stack = curctx->tinfo.real_stack; -+ irqctx->tinfo.virtual_stack = curctx->tinfo.virtual_stack; -+ irqctx->tinfo.previous_esp = current_stack_pointer(); -+ -+ *--isp = (u32) ®s; -+ asm volatile( -+ " xchgl %%ebx,%%esp \n" -+ " call smp_local_timer_interrupt \n" -+ " xchgl %%ebx,%%esp \n" -+ : : "b"(isp) -+ : "memory", "cc", "edx", "ecx" -+ ); -+ } -+#else - smp_local_timer_interrupt(®s); -+#endif - irq_exit(); -+ (void)set_exec_env(envid); - } - - /* -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/asm-offsets.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/asm-offsets.c ---- linux-2.6.8.1.orig/arch/i386/kernel/asm-offsets.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/asm-offsets.c 2006-05-11 13:05:38.000000000 +0400 -@@ -61,5 +61,19 @@ void foo(void) - 
DEFINE(TSS_sysenter_esp0, offsetof(struct tss_struct, esp0) - - sizeof(struct tss_struct)); - -+ DEFINE(TI_task, offsetof (struct thread_info, task)); -+ DEFINE(TI_exec_domain, offsetof (struct thread_info, exec_domain)); -+ DEFINE(TI_flags, offsetof (struct thread_info, flags)); -+ DEFINE(TI_preempt_count, offsetof (struct thread_info, preempt_count)); -+ DEFINE(TI_addr_limit, offsetof (struct thread_info, addr_limit)); -+ DEFINE(TI_real_stack, offsetof (struct thread_info, real_stack)); -+ DEFINE(TI_virtual_stack, offsetof (struct thread_info, virtual_stack)); -+ DEFINE(TI_user_pgd, offsetof (struct thread_info, user_pgd)); -+ -+ DEFINE(FIX_ENTRY_TRAMPOLINE_0_addr, -+ __fix_to_virt(FIX_ENTRY_TRAMPOLINE_0)); -+ DEFINE(FIX_VSYSCALL_addr, __fix_to_virt(FIX_VSYSCALL)); - DEFINE(PAGE_SIZE_asm, PAGE_SIZE); -+ DEFINE(task_thread_db7, -+ offsetof (struct task_struct, thread.debugreg[7])); - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/cpu/amd.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/amd.c ---- linux-2.6.8.1.orig/arch/i386/kernel/cpu/amd.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/amd.c 2006-05-11 13:05:28.000000000 +0400 -@@ -28,6 +28,22 @@ static void __init init_amd(struct cpuin - int mbytes = num_physpages >> (20-PAGE_SHIFT); - int r; - -+#ifdef CONFIG_SMP -+ unsigned long long value; -+ -+ /* Disable TLB flush filter by setting HWCR.FFDIS on K8 -+ * bit 6 of msr C001_0015 -+ * -+ * Errata 63 for SH-B3 steppings -+ * Errata 122 for all steppings (F+ have it disabled by default) -+ */ -+ if (c->x86 == 15) { -+ rdmsrl(MSR_K7_HWCR, value); -+ value |= 1 << 6; -+ wrmsrl(MSR_K7_HWCR, value); -+ } -+#endif -+ - /* - * FIXME: We should handle the K5 here. 
Set up the write - * range and also turn on MSR 83 bits 4 and 31 (write alloc, -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/cpu/common.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/common.c ---- linux-2.6.8.1.orig/arch/i386/kernel/cpu/common.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/common.c 2006-05-11 13:05:38.000000000 +0400 -@@ -196,7 +196,10 @@ int __init have_cpuid_p(void) - - /* Do minimum CPU detection early. - Fields really needed: vendor, cpuid_level, family, model, mask, cache alignment. -- The others are not touched to avoid unwanted side effects. */ -+ The others are not touched to avoid unwanted side effects. -+ -+ WARNING: this function is only called on the BP. Don't add code here -+ that is supposed to run on all CPUs. */ - void __init early_cpu_detect(void) - { - struct cpuinfo_x86 *c = &boot_cpu_data; -@@ -228,8 +231,6 @@ void __init early_cpu_detect(void) - if (cap0 & (1<<19)) - c->x86_cache_alignment = ((misc >> 8) & 0xff) * 8; - } -- -- early_intel_workaround(c); - } - - void __init generic_identify(struct cpuinfo_x86 * c) -@@ -275,6 +276,8 @@ void __init generic_identify(struct cpui - get_model_name(c); /* Default name */ - } - } -+ -+ early_intel_workaround(c); - } - - static void __init squash_the_stupid_serial_number(struct cpuinfo_x86 *c) -@@ -554,12 +557,16 @@ void __init cpu_init (void) - set_tss_desc(cpu,t); - cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; - load_TR_desc(); -- load_LDT(&init_mm.context); -+ if (cpu) -+ load_LDT(&init_mm.context); - - /* Set up doublefault TSS pointer in the GDT */ - __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss); - cpu_gdt_table[cpu][GDT_ENTRY_DOUBLEFAULT_TSS].b &= 0xfffffdff; - -+ if (cpu) -+ trap_init_virtual_GDT(); -+ - /* Clear %fs and %gs. 
*/ - asm volatile ("xorl %eax, %eax; movl %eax, %fs; movl %eax, %gs"); - -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/cpu/intel.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/intel.c ---- linux-2.6.8.1.orig/arch/i386/kernel/cpu/intel.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/intel.c 2006-05-11 13:05:38.000000000 +0400 -@@ -10,6 +10,7 @@ - #include <asm/processor.h> - #include <asm/msr.h> - #include <asm/uaccess.h> -+#include <asm/desc.h> - - #include "cpu.h" - -@@ -19,8 +20,6 @@ - #include <mach_apic.h> - #endif - --extern int trap_init_f00f_bug(void); -- - #ifdef CONFIG_X86_INTEL_USERCOPY - /* - * Alignment at which movsl is preferred for bulk memory copies. -@@ -97,10 +96,13 @@ static struct _cache_table cache_table[] - { 0x70, LVL_TRACE, 12 }, - { 0x71, LVL_TRACE, 16 }, - { 0x72, LVL_TRACE, 32 }, -+ { 0x78, LVL_2, 1024 }, - { 0x79, LVL_2, 128 }, - { 0x7a, LVL_2, 256 }, - { 0x7b, LVL_2, 512 }, - { 0x7c, LVL_2, 1024 }, -+ { 0x7d, LVL_2, 2048 }, -+ { 0x7f, LVL_2, 512 }, - { 0x82, LVL_2, 256 }, - { 0x83, LVL_2, 512 }, - { 0x84, LVL_2, 1024 }, -@@ -147,7 +149,7 @@ static void __init init_intel(struct cpu - - c->f00f_bug = 1; - if ( !f00f_workaround_enabled ) { -- trap_init_f00f_bug(); -+ trap_init_virtual_IDT(); - printk(KERN_NOTICE "Intel Pentium with F0 0F bug - workaround enabled.\n"); - f00f_workaround_enabled = 1; - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/cpu/mtrr/if.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/mtrr/if.c ---- linux-2.6.8.1.orig/arch/i386/kernel/cpu/mtrr/if.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/mtrr/if.c 2006-05-11 13:05:40.000000000 +0400 -@@ -358,7 +358,7 @@ static int __init mtrr_if_init(void) - return -ENODEV; - - proc_root_mtrr = -- create_proc_entry("mtrr", S_IWUSR | S_IRUGO, &proc_root); -+ create_proc_entry("mtrr", S_IWUSR | S_IRUGO, NULL); - if (proc_root_mtrr) { - proc_root_mtrr->owner = THIS_MODULE; - 
proc_root_mtrr->proc_fops = &mtrr_fops; -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/cpu/proc.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/proc.c ---- linux-2.6.8.1.orig/arch/i386/kernel/cpu/proc.c 2004-08-14 14:56:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/cpu/proc.c 2006-05-11 13:05:40.000000000 +0400 -@@ -3,6 +3,8 @@ - #include <linux/string.h> - #include <asm/semaphore.h> - #include <linux/seq_file.h> -+#include <linux/vsched.h> -+#include <linux/fairsched.h> - - /* - * Get CPU information for use by the procfs. -@@ -58,11 +60,17 @@ static int show_cpuinfo(struct seq_file - struct cpuinfo_x86 *c = v; - int i, n = c - cpu_data; - int fpu_exception; -+ unsigned long vcpu_khz; - - #ifdef CONFIG_SMP -- if (!cpu_online(n)) -+ if (!vcpu_online(n)) - return 0; - #endif -+#ifdef CONFIG_VE -+ vcpu_khz = ve_scale_khz(cpu_khz); -+#else -+ vcpu_khz = cpu_khz; -+#endif - seq_printf(m, "processor\t: %d\n" - "vendor_id\t: %s\n" - "cpu family\t: %d\n" -@@ -81,14 +89,14 @@ static int show_cpuinfo(struct seq_file - - if ( cpu_has(c, X86_FEATURE_TSC) ) { - seq_printf(m, "cpu MHz\t\t: %lu.%03lu\n", -- cpu_khz / 1000, (cpu_khz % 1000)); -+ vcpu_khz / 1000, (vcpu_khz % 1000)); - } - - /* Cache size */ - if (c->x86_cache_size >= 0) - seq_printf(m, "cache size\t: %d KB\n", c->x86_cache_size); - #ifdef CONFIG_X86_HT -- if (cpu_has_ht) { -+ if (smp_num_siblings > 1) { - extern int phys_proc_id[NR_CPUS]; - seq_printf(m, "physical id\t: %d\n", phys_proc_id[n]); - seq_printf(m, "siblings\t: %d\n", smp_num_siblings); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/doublefault.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/doublefault.c ---- linux-2.6.8.1.orig/arch/i386/kernel/doublefault.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/doublefault.c 2006-05-11 13:05:38.000000000 +0400 -@@ -8,12 +8,13 @@ - #include <asm/pgtable.h> - #include <asm/processor.h> - #include <asm/desc.h> -+#include <asm/fixmap.h> - - #define 
DOUBLEFAULT_STACKSIZE (1024) - static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE]; - #define STACK_START (unsigned long)(doublefault_stack+DOUBLEFAULT_STACKSIZE) - --#define ptr_ok(x) ((x) > 0xc0000000 && (x) < 0xc1000000) -+#define ptr_ok(x) (((x) > __PAGE_OFFSET && (x) < (__PAGE_OFFSET + 0x01000000)) || ((x) >= FIXADDR_START)) - - static void doublefault_fn(void) - { -@@ -39,8 +40,8 @@ static void doublefault_fn(void) - - printk("eax = %08lx, ebx = %08lx, ecx = %08lx, edx = %08lx\n", - t->eax, t->ebx, t->ecx, t->edx); -- printk("esi = %08lx, edi = %08lx\n", -- t->esi, t->edi); -+ printk("esi = %08lx, edi = %08lx, ebp = %08lx\n", -+ t->esi, t->edi, t->ebp); - } - } - -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/entry.S linux-2.6.8.1-ve022stab078/arch/i386/kernel/entry.S ---- linux-2.6.8.1.orig/arch/i386/kernel/entry.S 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/entry.S 2006-05-11 13:05:43.000000000 +0400 -@@ -43,8 +43,10 @@ - #include <linux/config.h> - #include <linux/linkage.h> - #include <asm/thread_info.h> -+#include <asm/asm_offsets.h> - #include <asm/errno.h> - #include <asm/segment.h> -+#include <asm/page.h> - #include <asm/smp.h> - #include <asm/page.h> - #include "irq_vectors.h" -@@ -81,7 +83,102 @@ VM_MASK = 0x00020000 - #define resume_kernel restore_all - #endif - --#define SAVE_ALL \ -+#ifdef CONFIG_X86_HIGH_ENTRY -+ -+#ifdef CONFIG_X86_SWITCH_PAGETABLES -+ -+#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) -+/* -+ * If task is preempted in __SWITCH_KERNELSPACE, and moved to another cpu, -+ * __switch_to repoints %esp to the appropriate virtual stack; but %ebp is -+ * left stale, so we must check whether to repeat the real stack calculation. 
-+ */ -+#define repeat_if_esp_changed \ -+ xorl %esp, %ebp; \ -+ testl $-THREAD_SIZE, %ebp; \ -+ jnz 0b -+#else -+#define repeat_if_esp_changed -+#endif -+ -+/* clobbers ebx, edx and ebp */ -+ -+#define __SWITCH_KERNELSPACE \ -+ cmpl $0xff000000, %esp; \ -+ jb 1f; \ -+ \ -+ /* \ -+ * switch pagetables and load the real stack, \ -+ * keep the stack offset: \ -+ */ \ -+ \ -+ movl $swapper_pg_dir-__PAGE_OFFSET, %edx; \ -+ \ -+ /* GET_THREAD_INFO(%ebp) intermixed */ \ -+0: \ -+ movl %esp, %ebp; \ -+ movl %esp, %ebx; \ -+ andl $(-THREAD_SIZE), %ebp; \ -+ andl $(THREAD_SIZE-1), %ebx; \ -+ orl TI_real_stack(%ebp), %ebx; \ -+ repeat_if_esp_changed; \ -+ \ -+ movl %edx, %cr3; \ -+ movl %ebx, %esp; \ -+1: -+ -+#endif -+ -+ -+#define __SWITCH_USERSPACE \ -+ /* interrupted any of the user return paths? */ \ -+ \ -+ movl EIP(%esp), %eax; \ -+ \ -+ cmpl $int80_ret_start_marker, %eax; \ -+ jb 33f; /* nope - continue with sysexit check */\ -+ cmpl $int80_ret_end_marker, %eax; \ -+ jb 22f; /* yes - switch to virtual stack */ \ -+33: \ -+ cmpl $sysexit_ret_start_marker, %eax; \ -+ jb 44f; /* nope - continue with user check */ \ -+ cmpl $sysexit_ret_end_marker, %eax; \ -+ jb 22f; /* yes - switch to virtual stack */ \ -+ /* return to userspace? */ \ -+44: \ -+ movl EFLAGS(%esp),%ecx; \ -+ movb CS(%esp),%cl; \ -+ testl $(VM_MASK | 3),%ecx; \ -+ jz 2f; \ -+22: \ -+ /* \ -+ * switch to the virtual stack, then switch to \ -+ * the userspace pagetables. 
\ -+ */ \ -+ \ -+ GET_THREAD_INFO(%ebp); \ -+ movl TI_virtual_stack(%ebp), %edx; \ -+ movl TI_user_pgd(%ebp), %ecx; \ -+ \ -+ movl %esp, %ebx; \ -+ andl $(THREAD_SIZE-1), %ebx; \ -+ orl %ebx, %edx; \ -+int80_ret_start_marker: \ -+ movl %edx, %esp; \ -+ movl %ecx, %cr3; \ -+ \ -+ __RESTORE_ALL_USER; \ -+int80_ret_end_marker: \ -+2: -+ -+#else /* !CONFIG_X86_HIGH_ENTRY */ -+ -+#define __SWITCH_KERNELSPACE -+#define __SWITCH_USERSPACE -+ -+#endif -+ -+#define __SAVE_ALL \ - cld; \ - pushl %es; \ - pushl %ds; \ -@@ -96,7 +193,7 @@ VM_MASK = 0x00020000 - movl %edx, %ds; \ - movl %edx, %es; - --#define RESTORE_INT_REGS \ -+#define __RESTORE_INT_REGS \ - popl %ebx; \ - popl %ecx; \ - popl %edx; \ -@@ -105,29 +202,44 @@ VM_MASK = 0x00020000 - popl %ebp; \ - popl %eax - --#define RESTORE_REGS \ -- RESTORE_INT_REGS; \ --1: popl %ds; \ --2: popl %es; \ --.section .fixup,"ax"; \ --3: movl $0,(%esp); \ -- jmp 1b; \ --4: movl $0,(%esp); \ -- jmp 2b; \ --.previous; \ -+#define __RESTORE_REGS \ -+ __RESTORE_INT_REGS; \ -+ popl %ds; \ -+ popl %es; -+ -+#define __RESTORE_REGS_USER \ -+ __RESTORE_INT_REGS; \ -+111: popl %ds; \ -+222: popl %es; \ -+ jmp 666f; \ -+444: movl $0,(%esp); \ -+ jmp 111b; \ -+555: movl $0,(%esp); \ -+ jmp 222b; \ -+666: \ - .section __ex_table,"a";\ - .align 4; \ -- .long 1b,3b; \ -- .long 2b,4b; \ -+ .long 111b,444b;\ -+ .long 222b,555b;\ - .previous - -+#define __RESTORE_ALL_USER \ -+ __RESTORE_REGS_USER \ -+ __RESTORE_IRET -+ -+#ifdef CONFIG_X86_HIGH_ENTRY -+#define __RESTORE_ALL \ -+ __RESTORE_REGS \ -+ __RESTORE_IRET -+#else /* !CONFIG_X86_HIGH_ENTRY */ -+#define __RESTORE_ALL __RESTORE_ALL_USER -+#endif - --#define RESTORE_ALL \ -- RESTORE_REGS \ -+#define __RESTORE_IRET \ - addl $4, %esp; \ --1: iret; \ -+333: iret; \ - .section .fixup,"ax"; \ --2: sti; \ -+666: sti; \ - movl $(__USER_DS), %edx; \ - movl %edx, %ds; \ - movl %edx, %es; \ -@@ -136,10 +248,18 @@ VM_MASK = 0x00020000 - .previous; \ - .section __ex_table,"a";\ - .align 4; \ -- .long 1b,2b; 
\ -+ .long 333b,666b;\ - .previous - -+#define SAVE_ALL \ -+ __SAVE_ALL; \ -+ __SWITCH_KERNELSPACE; -+ -+#define RESTORE_ALL \ -+ __SWITCH_USERSPACE; \ -+ __RESTORE_ALL; - -+.section .entry.text,"ax" - - ENTRY(lcall7) - pushfl # We get a different stack layout with call -@@ -240,17 +360,9 @@ sysenter_past_esp: - pushl $(__USER_CS) - pushl $SYSENTER_RETURN - --/* -- * Load the potential sixth argument from user stack. -- * Careful about security. -- */ -- cmpl $__PAGE_OFFSET-3,%ebp -- jae syscall_fault --1: movl (%ebp),%ebp --.section __ex_table,"a" -- .align 4 -- .long 1b,syscall_fault --.previous -+ /* -+ * No six-argument syscall is ever used with sysenter. -+ */ - - pushl %eax - SAVE_ALL -@@ -266,12 +378,35 @@ sysenter_past_esp: - movl TI_flags(%ebp), %ecx - testw $_TIF_ALLWORK_MASK, %cx - jne syscall_exit_work -+ -+#ifdef CONFIG_X86_SWITCH_PAGETABLES -+ -+ GET_THREAD_INFO(%ebp) -+ movl TI_virtual_stack(%ebp), %edx -+ movl TI_user_pgd(%ebp), %ecx -+ movl %esp, %ebx -+ andl $(THREAD_SIZE-1), %ebx -+ orl %ebx, %edx -+sysexit_ret_start_marker: -+ movl %edx, %esp -+ movl %ecx, %cr3 -+ /* -+ * only ebx is not restored by the userspace sysenter vsyscall -+ * code, it assumes it to be callee-saved. 
-+ */ -+ movl EBX(%esp), %ebx -+#endif -+ - /* if something modifies registers it must also disable sysexit */ - movl EIP(%esp), %edx - movl OLDESP(%esp), %ecx -+ xorl %ebp,%ebp - sti - sysexit -- -+#ifdef CONFIG_X86_SWITCH_PAGETABLES -+sysexit_ret_end_marker: -+ nop -+#endif - - # system call handler stub - ENTRY(system_call) -@@ -321,6 +456,22 @@ work_notifysig: # deal with pending s - # vm86-space - xorl %edx, %edx - call do_notify_resume -+ -+#if CONFIG_X86_HIGH_ENTRY -+ /* -+ * Reload db7 if necessary: -+ */ -+ movl TI_flags(%ebp), %ecx -+ testb $_TIF_DB7, %cl -+ jnz work_db7 -+ -+ jmp restore_all -+ -+work_db7: -+ movl TI_task(%ebp), %edx; -+ movl task_thread_db7(%edx), %edx; -+ movl %edx, %db7; -+#endif - jmp restore_all - - ALIGN -@@ -358,14 +509,6 @@ syscall_exit_work: - jmp resume_userspace - - ALIGN --syscall_fault: -- pushl %eax # save orig_eax -- SAVE_ALL -- GET_THREAD_INFO(%ebp) -- movl $-EFAULT,EAX(%esp) -- jmp resume_userspace -- -- ALIGN - syscall_badsys: - movl $-ENOSYS,EAX(%esp) - jmp resume_userspace -@@ -376,7 +519,7 @@ syscall_badsys: - */ - .data - ENTRY(interrupt) --.text -+.previous - - vector=0 - ENTRY(irq_entries_start) -@@ -386,7 +529,7 @@ ENTRY(irq_entries_start) - jmp common_interrupt - .data - .long 1b --.text -+.previous - vector=vector+1 - .endr - -@@ -427,12 +570,17 @@ error_code: - movl ES(%esp), %edi # get the function address - movl %eax, ORIG_EAX(%esp) - movl %ecx, ES(%esp) -- movl %esp, %edx - pushl %esi # push the error code -- pushl %edx # push the pt_regs pointer - movl $(__USER_DS), %edx - movl %edx, %ds - movl %edx, %es -+ -+/* clobbers edx, ebx and ebp */ -+ __SWITCH_KERNELSPACE -+ -+ leal 4(%esp), %edx # prepare pt_regs -+ pushl %edx # push pt_regs -+ - call *%edi - addl $8, %esp - jmp ret_from_exception -@@ -523,7 +671,7 @@ nmi_stack_correct: - pushl %edx - call do_nmi - addl $8, %esp -- RESTORE_ALL -+ jmp restore_all - - nmi_stack_fixup: - FIX_STACK(12,nmi_stack_correct, 1) -@@ -600,6 +748,8 @@ 
ENTRY(spurious_interrupt_bug) - pushl $do_spurious_interrupt_bug - jmp error_code - -+.previous -+ - .data - ENTRY(sys_call_table) - .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */ -@@ -887,4 +1037,26 @@ ENTRY(sys_call_table) - .long sys_mq_getsetattr - .long sys_ni_syscall /* reserved for kexec */ - -+ .rept 500-(.-sys_call_table)/4 -+ .long sys_ni_syscall -+ .endr -+ .long sys_fairsched_mknod /* 500 */ -+ .long sys_fairsched_rmnod -+ .long sys_fairsched_chwt -+ .long sys_fairsched_mvpr -+ .long sys_fairsched_rate -+ -+ .rept 510-(.-sys_call_table)/4 -+ .long sys_ni_syscall -+ .endr -+ -+ .long sys_getluid /* 510 */ -+ .long sys_setluid -+ .long sys_setublimit -+ .long sys_ubstat -+ .long sys_ni_syscall -+ .long sys_ni_syscall -+ .long sys_lchmod /* 516 */ -+ .long sys_lutime -+ - syscall_table_size=(.-sys_call_table) -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/entry_trampoline.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/entry_trampoline.c ---- linux-2.6.8.1.orig/arch/i386/kernel/entry_trampoline.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/entry_trampoline.c 2006-05-11 13:05:38.000000000 +0400 -@@ -0,0 +1,75 @@ -+/* -+ * linux/arch/i386/kernel/entry_trampoline.c -+ * -+ * (C) Copyright 2003 Ingo Molnar -+ * -+ * This file contains the needed support code for 4GB userspace -+ */ -+ -+#include <linux/init.h> -+#include <linux/smp.h> -+#include <linux/mm.h> -+#include <linux/sched.h> -+#include <linux/kernel.h> -+#include <linux/string.h> -+#include <linux/highmem.h> -+#include <asm/desc.h> -+#include <asm/atomic_kmap.h> -+ -+extern char __entry_tramp_start, __entry_tramp_end, __start___entry_text; -+ -+void __init init_entry_mappings(void) -+{ -+#ifdef CONFIG_X86_HIGH_ENTRY -+ -+ void *tramp; -+ int p; -+ -+ /* -+ * We need a high IDT and GDT for the 4G/4G split: -+ */ -+ trap_init_virtual_IDT(); -+ -+ __set_fixmap(FIX_ENTRY_TRAMPOLINE_0, __pa((unsigned long)&__entry_tramp_start), 
PAGE_KERNEL_EXEC); -+ __set_fixmap(FIX_ENTRY_TRAMPOLINE_1, __pa((unsigned long)&__entry_tramp_start) + PAGE_SIZE, PAGE_KERNEL_EXEC); -+ tramp = (void *)fix_to_virt(FIX_ENTRY_TRAMPOLINE_0); -+ -+ printk("mapped 4G/4G trampoline to %p.\n", tramp); -+ BUG_ON((void *)&__start___entry_text != tramp); -+ /* -+ * Virtual kernel stack: -+ */ -+ BUG_ON(__kmap_atomic_vaddr(KM_VSTACK_TOP) & (THREAD_SIZE-1)); -+ BUG_ON(sizeof(struct desc_struct)*NR_CPUS*GDT_ENTRIES > 2*PAGE_SIZE); -+ BUG_ON((unsigned int)&__entry_tramp_end - (unsigned int)&__entry_tramp_start > 2*PAGE_SIZE); -+ -+ /* -+ * set up the initial thread's virtual stack related -+ * fields: -+ */ -+ for (p = 0; p < ARRAY_SIZE(current->thread_info->stack_page); p++) -+ current->thread_info->stack_page[p] = virt_to_page((char *)current->thread_info + (p*PAGE_SIZE)); -+ -+ current->thread_info->virtual_stack = (void *)__kmap_atomic_vaddr(KM_VSTACK_TOP); -+ -+ for (p = 0; p < ARRAY_SIZE(current->thread_info->stack_page); p++) { -+ __kunmap_atomic_type(KM_VSTACK_TOP-p); -+ __kmap_atomic(current->thread_info->stack_page[p], KM_VSTACK_TOP-p); -+ } -+#endif -+ current->thread_info->real_stack = (void *)current->thread_info; -+ current->thread_info->user_pgd = NULL; -+ current->thread.esp0 = (unsigned long)current->thread_info->real_stack + THREAD_SIZE; -+} -+ -+ -+ -+void __init entry_trampoline_setup(void) -+{ -+ /* -+ * old IRQ entries set up by the boot code will still hang -+ * around - they are a sign of hw trouble anyway, now they'll -+ * produce a double fault message. 
-+ */ -+ trap_init_virtual_GDT(); -+} -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/i386_ksyms.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/i386_ksyms.c ---- linux-2.6.8.1.orig/arch/i386/kernel/i386_ksyms.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/i386_ksyms.c 2006-05-11 13:05:38.000000000 +0400 -@@ -92,7 +92,6 @@ EXPORT_SYMBOL_NOVERS(__down_failed_inter - EXPORT_SYMBOL_NOVERS(__down_failed_trylock); - EXPORT_SYMBOL_NOVERS(__up_wakeup); - /* Networking helper routines. */ --EXPORT_SYMBOL(csum_partial_copy_generic); - /* Delay loops */ - EXPORT_SYMBOL(__ndelay); - EXPORT_SYMBOL(__udelay); -@@ -106,13 +105,17 @@ EXPORT_SYMBOL_NOVERS(__get_user_4); - EXPORT_SYMBOL(strpbrk); - EXPORT_SYMBOL(strstr); - -+#if !defined(CONFIG_X86_UACCESS_INDIRECT) - EXPORT_SYMBOL(strncpy_from_user); --EXPORT_SYMBOL(__strncpy_from_user); -+EXPORT_SYMBOL(__direct_strncpy_from_user); - EXPORT_SYMBOL(clear_user); - EXPORT_SYMBOL(__clear_user); - EXPORT_SYMBOL(__copy_from_user_ll); - EXPORT_SYMBOL(__copy_to_user_ll); - EXPORT_SYMBOL(strnlen_user); -+#else /* CONFIG_X86_UACCESS_INDIRECT */ -+EXPORT_SYMBOL(direct_csum_partial_copy_generic); -+#endif - - EXPORT_SYMBOL(dma_alloc_coherent); - EXPORT_SYMBOL(dma_free_coherent); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/i387.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/i387.c ---- linux-2.6.8.1.orig/arch/i386/kernel/i387.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/i387.c 2006-05-11 13:05:38.000000000 +0400 -@@ -227,6 +227,7 @@ void set_fpu_twd( struct task_struct *ts - static int convert_fxsr_to_user( struct _fpstate __user *buf, - struct i387_fxsave_struct *fxsave ) - { -+ struct _fpreg tmp[8]; /* 80 bytes scratch area */ - unsigned long env[7]; - struct _fpreg __user *to; - struct _fpxreg *from; -@@ -243,23 +244,25 @@ static int convert_fxsr_to_user( struct - if ( __copy_to_user( buf, env, 7 * sizeof(unsigned long) ) ) - return 1; - -- to = &buf->_st[0]; 
-+ to = tmp; - from = (struct _fpxreg *) &fxsave->st_space[0]; - for ( i = 0 ; i < 8 ; i++, to++, from++ ) { - unsigned long __user *t = (unsigned long __user *)to; - unsigned long *f = (unsigned long *)from; - -- if (__put_user(*f, t) || -- __put_user(*(f + 1), t + 1) || -- __put_user(from->exponent, &to->exponent)) -- return 1; -+ *t = *f; -+ *(t + 1) = *(f+1); -+ to->exponent = from->exponent; - } -+ if (copy_to_user(buf->_st, tmp, sizeof(struct _fpreg [8]))) -+ return 1; - return 0; - } - - static int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave, - struct _fpstate __user *buf ) - { -+ struct _fpreg tmp[8]; /* 80 bytes scratch area */ - unsigned long env[7]; - struct _fpxreg *to; - struct _fpreg __user *from; -@@ -267,6 +270,8 @@ static int convert_fxsr_from_user( struc - - if ( __copy_from_user( env, buf, 7 * sizeof(long) ) ) - return 1; -+ if (copy_from_user(tmp, buf->_st, sizeof(struct _fpreg [8]))) -+ return 1; - - fxsave->cwd = (unsigned short)(env[0] & 0xffff); - fxsave->swd = (unsigned short)(env[1] & 0xffff); -@@ -278,15 +283,14 @@ static int convert_fxsr_from_user( struc - fxsave->fos = env[6]; - - to = (struct _fpxreg *) &fxsave->st_space[0]; -- from = &buf->_st[0]; -+ from = tmp; - for ( i = 0 ; i < 8 ; i++, to++, from++ ) { - unsigned long *t = (unsigned long *)to; - unsigned long __user *f = (unsigned long __user *)from; - -- if (__get_user(*t, f) || -- __get_user(*(t + 1), f + 1) || -- __get_user(to->exponent, &from->exponent)) -- return 1; -+ *t = *f; -+ *(t + 1) = *(f + 1); -+ to->exponent = from->exponent; - } - return 0; - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/init_task.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/init_task.c ---- linux-2.6.8.1.orig/arch/i386/kernel/init_task.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/init_task.c 2006-05-11 13:05:38.000000000 +0400 -@@ -27,7 +27,7 @@ EXPORT_SYMBOL(init_mm); - */ - union thread_union init_thread_union - 
__attribute__((__section__(".data.init_task"))) = -- { INIT_THREAD_INFO(init_task) }; -+ { INIT_THREAD_INFO(init_task, init_thread_union) }; - - /* - * Initial task structure. -@@ -45,5 +45,5 @@ EXPORT_SYMBOL(init_task); - * section. Since TSS's are completely CPU-local, we want them - * on exact cacheline boundaries, to eliminate cacheline ping-pong. - */ --struct tss_struct init_tss[NR_CPUS] __cacheline_aligned = { [0 ... NR_CPUS-1] = INIT_TSS }; -+struct tss_struct init_tss[NR_CPUS] __attribute__((__section__(".data.tss"))) = { [0 ... NR_CPUS-1] = INIT_TSS }; - -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/io_apic.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/io_apic.c ---- linux-2.6.8.1.orig/arch/i386/kernel/io_apic.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/io_apic.c 2006-05-11 13:05:28.000000000 +0400 -@@ -635,7 +635,7 @@ failed: - return 0; - } - --static int __init irqbalance_disable(char *str) -+int __init irqbalance_disable(char *str) - { - irqbalance_disabled = 1; - return 0; -@@ -652,7 +652,7 @@ static inline void move_irq(int irq) - } - } - --__initcall(balanced_irq_init); -+late_initcall(balanced_irq_init); - - #else /* !CONFIG_IRQBALANCE */ - static inline void move_irq(int irq) { } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/irq.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/irq.c ---- linux-2.6.8.1.orig/arch/i386/kernel/irq.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/irq.c 2006-05-11 13:05:40.000000000 +0400 -@@ -45,6 +45,9 @@ - #include <asm/desc.h> - #include <asm/irq.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_task.h> -+ - /* - * Linux has a controller-independent x86 interrupt architecture. 
- * every controller has a 'controller-template', that is used -@@ -79,6 +82,68 @@ static void register_irq_proc (unsigned - #ifdef CONFIG_4KSTACKS - union irq_ctx *hardirq_ctx[NR_CPUS]; - union irq_ctx *softirq_ctx[NR_CPUS]; -+union irq_ctx *overflow_ctx[NR_CPUS]; -+#endif -+ -+#ifdef CONFIG_DEBUG_STACKOVERFLOW -+static void report_stack_overflow(unsigned long delta) -+{ -+ printk("Stack overflow %lu task=%s (%p)", -+ delta, current->comm, current); -+ dump_stack(); -+} -+ -+void check_stack_overflow(void) -+{ -+ /* Debugging check for stack overflow: is there less than 512KB free? */ -+ long esp; -+ unsigned long flags; -+#ifdef CONFIG_4KSTACKS -+ u32 *isp; -+ union irq_ctx * curctx; -+ union irq_ctx * irqctx; -+#endif -+ -+ __asm__ __volatile__("andl %%esp,%0" : -+ "=r" (esp) : "0" (THREAD_SIZE - 1)); -+ if (likely(esp > (sizeof(struct thread_info) + STACK_WARN))) -+ return; -+ -+ local_irq_save(flags); -+#ifdef CONFIG_4KSTACKS -+ curctx = (union irq_ctx *) current_thread_info(); -+ irqctx = overflow_ctx[smp_processor_id()]; -+ -+ if (curctx == irqctx) -+ report_stack_overflow(esp); -+ else { -+ /* build the stack frame on the IRQ stack */ -+ isp = (u32*) ((char*)irqctx + sizeof(*irqctx)); -+ irqctx->tinfo.task = curctx->tinfo.task; -+ irqctx->tinfo.real_stack = curctx->tinfo.real_stack; -+ irqctx->tinfo.virtual_stack = curctx->tinfo.virtual_stack; -+ irqctx->tinfo.previous_esp = current_stack_pointer(); -+ -+ *--isp = (u32) esp; -+ -+ asm volatile( -+ " xchgl %%ebx,%%esp \n" -+ " call report_stack_overflow \n" -+ " xchgl %%ebx,%%esp \n" -+ : -+ : "b"(isp) -+ : "memory", "cc", "eax", "edx", "ecx" -+ ); -+ } -+#else -+ report_stack_overflow(esp); -+#endif -+ local_irq_restore(flags); -+} -+#else -+void check_stack_overflow(void) -+{ -+} - #endif - - /* -@@ -221,15 +286,19 @@ asmlinkage int handle_IRQ_event(unsigned - { - int status = 1; /* Force the "do bottom halves" bit */ - int retval = 0; -+ struct user_beancounter *ub; - - if (!(action->flags & 
SA_INTERRUPT)) - local_irq_enable(); - -+ ub = set_exec_ub(get_ub0()); - do { - status |= action->flags; - retval |= action->handler(irq, action->dev_id, regs); - action = action->next; - } while (action); -+ (void)set_exec_ub(ub); -+ - if (status & SA_SAMPLE_RANDOM) - add_interrupt_randomness(irq); - local_irq_disable(); -@@ -270,7 +339,7 @@ static void report_bad_irq(int irq, irq_ - - static int noirqdebug; - --static int __init noirqdebug_setup(char *str) -+int __init noirqdebug_setup(char *str) - { - noirqdebug = 1; - printk("IRQ lockup detection disabled\n"); -@@ -429,23 +498,13 @@ asmlinkage unsigned int do_IRQ(struct pt - irq_desc_t *desc = irq_desc + irq; - struct irqaction * action; - unsigned int status; -+ struct ve_struct *envid; - -+ envid = set_exec_env(get_ve0()); - irq_enter(); - --#ifdef CONFIG_DEBUG_STACKOVERFLOW -- /* Debugging check for stack overflow: is there less than 1KB free? */ -- { -- long esp; -+ check_stack_overflow(); - -- __asm__ __volatile__("andl %%esp,%0" : -- "=r" (esp) : "0" (THREAD_SIZE - 1)); -- if (unlikely(esp < (sizeof(struct thread_info) + STACK_WARN))) { -- printk("do_IRQ: stack overflow: %ld\n", -- esp - sizeof(struct thread_info)); -- dump_stack(); -- } -- } --#endif - kstat_this_cpu.irqs[irq]++; - spin_lock(&desc->lock); - desc->handler->ack(irq); -@@ -513,6 +572,8 @@ asmlinkage unsigned int do_IRQ(struct pt - /* build the stack frame on the IRQ stack */ - isp = (u32*) ((char*)irqctx + sizeof(*irqctx)); - irqctx->tinfo.task = curctx->tinfo.task; -+ irqctx->tinfo.real_stack = curctx->tinfo.real_stack; -+ irqctx->tinfo.virtual_stack = curctx->tinfo.virtual_stack; - irqctx->tinfo.previous_esp = current_stack_pointer(); - - *--isp = (u32) action; -@@ -541,7 +602,6 @@ asmlinkage unsigned int do_IRQ(struct pt - } - - #else -- - for (;;) { - irqreturn_t action_ret; - -@@ -568,6 +628,7 @@ out: - spin_unlock(&desc->lock); - - irq_exit(); -+ (void)set_exec_env(envid); - - return 1; - } -@@ -995,13 +1056,15 @@ static int 
irq_affinity_read_proc(char * - return len; - } - -+int no_irq_affinity; -+ - static int irq_affinity_write_proc(struct file *file, const char __user *buffer, - unsigned long count, void *data) - { - int irq = (long)data, full_count = count, err; - cpumask_t new_value, tmp; - -- if (!irq_desc[irq].handler->set_affinity) -+ if (!irq_desc[irq].handler->set_affinity || no_irq_affinity) - return -EIO; - - err = cpumask_parse(buffer, count, new_value); -@@ -1122,6 +1185,9 @@ void init_irq_proc (void) - */ - static char softirq_stack[NR_CPUS * THREAD_SIZE] __attribute__((__aligned__(THREAD_SIZE))); - static char hardirq_stack[NR_CPUS * THREAD_SIZE] __attribute__((__aligned__(THREAD_SIZE))); -+#ifdef CONFIG_DEBUG_STACKOVERFLOW -+static char overflow_stack[NR_CPUS * THREAD_SIZE] __attribute__((__aligned__(THREAD_SIZE))); -+#endif - - /* - * allocate per-cpu stacks for hardirq and for softirq processing -@@ -1151,8 +1217,19 @@ void irq_ctx_init(int cpu) - - softirq_ctx[cpu] = irqctx; - -- printk("CPU %u irqstacks, hard=%p soft=%p\n", -- cpu,hardirq_ctx[cpu],softirq_ctx[cpu]); -+#ifdef CONFIG_DEBUG_STACKOVERFLOW -+ irqctx = (union irq_ctx*) &overflow_stack[cpu*THREAD_SIZE]; -+ irqctx->tinfo.task = NULL; -+ irqctx->tinfo.exec_domain = NULL; -+ irqctx->tinfo.cpu = cpu; -+ irqctx->tinfo.preempt_count = HARDIRQ_OFFSET; -+ irqctx->tinfo.addr_limit = MAKE_MM_SEG(0); -+ -+ overflow_ctx[cpu] = irqctx; -+#endif -+ -+ printk("CPU %u irqstacks, hard=%p soft=%p overflow=%p\n", -+ cpu,hardirq_ctx[cpu],softirq_ctx[cpu],overflow_ctx[cpu]); - } - - extern asmlinkage void __do_softirq(void); -@@ -1173,6 +1250,8 @@ asmlinkage void do_softirq(void) - curctx = current_thread_info(); - irqctx = softirq_ctx[smp_processor_id()]; - irqctx->tinfo.task = curctx->task; -+ irqctx->tinfo.real_stack = curctx->real_stack; -+ irqctx->tinfo.virtual_stack = curctx->virtual_stack; - irqctx->tinfo.previous_esp = current_stack_pointer(); - - /* build the stack frame on the softirq stack */ -diff -uprN 
linux-2.6.8.1.orig/arch/i386/kernel/ldt.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/ldt.c ---- linux-2.6.8.1.orig/arch/i386/kernel/ldt.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/ldt.c 2006-05-11 13:05:38.000000000 +0400 -@@ -2,7 +2,7 @@ - * linux/kernel/ldt.c - * - * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds -- * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> -+ * Copyright (C) 1999, 2003 Ingo Molnar <mingo@redhat.com> - */ - - #include <linux/errno.h> -@@ -18,6 +18,8 @@ - #include <asm/system.h> - #include <asm/ldt.h> - #include <asm/desc.h> -+#include <linux/highmem.h> -+#include <asm/atomic_kmap.h> - - #ifdef CONFIG_SMP /* avoids "defined but not used" warnig */ - static void flush_ldt(void *null) -@@ -29,34 +31,31 @@ static void flush_ldt(void *null) - - static int alloc_ldt(mm_context_t *pc, int mincount, int reload) - { -- void *oldldt; -- void *newldt; -- int oldsize; -+ int oldsize, newsize, i; - - if (mincount <= pc->size) - return 0; -+ /* -+ * LDT got larger - reallocate if necessary. 
-+ */ - oldsize = pc->size; - mincount = (mincount+511)&(~511); -- if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE) -- newldt = vmalloc(mincount*LDT_ENTRY_SIZE); -- else -- newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL); -- -- if (!newldt) -- return -ENOMEM; -- -- if (oldsize) -- memcpy(newldt, pc->ldt, oldsize*LDT_ENTRY_SIZE); -- oldldt = pc->ldt; -- memset(newldt+oldsize*LDT_ENTRY_SIZE, 0, (mincount-oldsize)*LDT_ENTRY_SIZE); -- pc->ldt = newldt; -- wmb(); -+ newsize = mincount*LDT_ENTRY_SIZE; -+ for (i = 0; i < newsize; i += PAGE_SIZE) { -+ int nr = i/PAGE_SIZE; -+ BUG_ON(i >= 64*1024); -+ if (!pc->ldt_pages[nr]) { -+ pc->ldt_pages[nr] = alloc_page(GFP_HIGHUSER|__GFP_UBC); -+ if (!pc->ldt_pages[nr]) -+ return -ENOMEM; -+ clear_highpage(pc->ldt_pages[nr]); -+ } -+ } - pc->size = mincount; -- wmb(); -- - if (reload) { - #ifdef CONFIG_SMP - cpumask_t mask; -+ - preempt_disable(); - load_LDT(pc); - mask = cpumask_of_cpu(smp_processor_id()); -@@ -67,24 +66,32 @@ static int alloc_ldt(mm_context_t *pc, i - load_LDT(pc); - #endif - } -- if (oldsize) { -- if (oldsize*LDT_ENTRY_SIZE > PAGE_SIZE) -- vfree(oldldt); -- else -- kfree(oldldt); -- } - return 0; - } - - static inline int copy_ldt(mm_context_t *new, mm_context_t *old) - { -- int err = alloc_ldt(new, old->size, 0); -- if (err < 0) -+ int i, err, size = old->size, nr_pages = (size*LDT_ENTRY_SIZE + PAGE_SIZE-1)/PAGE_SIZE; -+ -+ err = alloc_ldt(new, size, 0); -+ if (err < 0) { -+ new->size = 0; - return err; -- memcpy(new->ldt, old->ldt, old->size*LDT_ENTRY_SIZE); -+ } -+ for (i = 0; i < nr_pages; i++) -+ copy_user_highpage(new->ldt_pages[i], old->ldt_pages[i], 0); - return 0; - } - -+static void free_ldt(mm_context_t *mc) -+{ -+ int i; -+ -+ for (i = 0; i < MAX_LDT_PAGES; i++) -+ if (mc->ldt_pages[i]) -+ __free_page(mc->ldt_pages[i]); -+} -+ - /* - * we do not have to muck with descriptors here, that is - * done in switch_mm() as needed. 
-@@ -96,10 +103,13 @@ int init_new_context(struct task_struct - - init_MUTEX(&mm->context.sem); - mm->context.size = 0; -+ memset(mm->context.ldt_pages, 0, sizeof(struct page *) * MAX_LDT_PAGES); - old_mm = current->mm; - if (old_mm && old_mm->context.size > 0) { - down(&old_mm->context.sem); - retval = copy_ldt(&mm->context, &old_mm->context); -+ if (retval < 0) -+ free_ldt(&mm->context); - up(&old_mm->context.sem); - } - return retval; -@@ -107,23 +117,21 @@ int init_new_context(struct task_struct - - /* - * No need to lock the MM as we are the last user -+ * Do not touch the ldt register, we are already -+ * in the next thread. - */ - void destroy_context(struct mm_struct *mm) - { -- if (mm->context.size) { -- if (mm == current->active_mm) -- clear_LDT(); -- if (mm->context.size*LDT_ENTRY_SIZE > PAGE_SIZE) -- vfree(mm->context.ldt); -- else -- kfree(mm->context.ldt); -- mm->context.size = 0; -- } -+ int i, nr_pages = (mm->context.size*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE; -+ -+ for (i = 0; i < nr_pages; i++) -+ __free_page(mm->context.ldt_pages[i]); -+ mm->context.size = 0; - } - - static int read_ldt(void __user * ptr, unsigned long bytecount) - { -- int err; -+ int err, i; - unsigned long size; - struct mm_struct * mm = current->mm; - -@@ -138,8 +146,25 @@ static int read_ldt(void __user * ptr, u - size = bytecount; - - err = 0; -- if (copy_to_user(ptr, mm->context.ldt, size)) -- err = -EFAULT; -+ /* -+ * This is necessary just in case we got here straight from a -+ * context-switch where the ptes were set but no tlb flush -+ * was done yet. We rather avoid doing a TLB flush in the -+ * context-switch path and do it here instead. 
-+ */ -+ __flush_tlb_global(); -+ -+ for (i = 0; i < size; i += PAGE_SIZE) { -+ int nr = i / PAGE_SIZE, bytes; -+ char *kaddr = kmap(mm->context.ldt_pages[nr]); -+ -+ bytes = size - i; -+ if (bytes > PAGE_SIZE) -+ bytes = PAGE_SIZE; -+ if (copy_to_user(ptr + i, kaddr, bytes)) -+ err = -EFAULT; -+ kunmap(mm->context.ldt_pages[nr]); -+ } - up(&mm->context.sem); - if (err < 0) - return err; -@@ -158,7 +183,7 @@ static int read_default_ldt(void __user - - err = 0; - address = &default_ldt[0]; -- size = 5*sizeof(struct desc_struct); -+ size = 5*LDT_ENTRY_SIZE; - if (size > bytecount) - size = bytecount; - -@@ -200,7 +225,15 @@ static int write_ldt(void __user * ptr, - goto out_unlock; - } - -- lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt); -+ /* -+ * No rescheduling allowed from this point to the install. -+ * -+ * We do a TLB flush for the same reason as in the read_ldt() path. -+ */ -+ preempt_disable(); -+ __flush_tlb_global(); -+ lp = (__u32 *) ((ldt_info.entry_number << 3) + -+ (char *) __kmap_atomic_vaddr(KM_LDT_PAGE0)); - - /* Allow LDTs to be cleared by the user. 
*/ - if (ldt_info.base_addr == 0 && ldt_info.limit == 0) { -@@ -221,6 +254,7 @@ install: - *lp = entry_1; - *(lp+1) = entry_2; - error = 0; -+ preempt_enable(); - - out_unlock: - up(&mm->context.sem); -@@ -248,3 +282,26 @@ asmlinkage int sys_modify_ldt(int func, - } - return ret; - } -+ -+/* -+ * load one particular LDT into the current CPU -+ */ -+void load_LDT_nolock(mm_context_t *pc, int cpu) -+{ -+ struct page **pages = pc->ldt_pages; -+ int count = pc->size; -+ int nr_pages, i; -+ -+ if (likely(!count)) { -+ pages = &default_ldt_page; -+ count = 5; -+ } -+ nr_pages = (count*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE; -+ -+ for (i = 0; i < nr_pages; i++) { -+ __kunmap_atomic_type(KM_LDT_PAGE0 - i); -+ __kmap_atomic(pages[i], KM_LDT_PAGE0 - i); -+ } -+ set_ldt_desc(cpu, (void *)__kmap_atomic_vaddr(KM_LDT_PAGE0), count); -+ load_LDT_desc(); -+} -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/mpparse.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/mpparse.c ---- linux-2.6.8.1.orig/arch/i386/kernel/mpparse.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/mpparse.c 2006-05-11 13:05:38.000000000 +0400 -@@ -690,7 +690,7 @@ void __init get_smp_config (void) - * Read the physical hardware table. Anything here will - * override the defaults. - */ -- if (!smp_read_mpc((void *)mpf->mpf_physptr)) { -+ if (!smp_read_mpc((void *)phys_to_virt(mpf->mpf_physptr))) { - smp_found_config = 0; - printk(KERN_ERR "BIOS bug, MP table errors detected!...\n"); - printk(KERN_ERR "... disabling SMP support. 
(tell your hw vendor)\n"); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/nmi.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/nmi.c ---- linux-2.6.8.1.orig/arch/i386/kernel/nmi.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/nmi.c 2006-05-11 13:05:49.000000000 +0400 -@@ -31,7 +31,12 @@ - #include <asm/mpspec.h> - #include <asm/nmi.h> - --unsigned int nmi_watchdog = NMI_NONE; -+#ifdef CONFIG_NMI_WATCHDOG -+#define NMI_DEFAULT NMI_IO_APIC -+#else -+#define NMI_DEFAULT NMI_NONE -+#endif -+unsigned int nmi_watchdog = NMI_DEFAULT; - static unsigned int nmi_hz = HZ; - static unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */ - static unsigned int nmi_p4_cccr_val; -@@ -459,6 +464,21 @@ void touch_nmi_watchdog (void) - alert_counter[i] = 0; - } - -+static spinlock_t show_regs_lock = SPIN_LOCK_UNLOCKED; -+ -+void smp_show_regs(struct pt_regs *regs, void *info) -+{ -+ if (regs == NULL) -+ return; -+ -+ bust_spinlocks(1); -+ spin_lock(&show_regs_lock); -+ printk("----------- IPI show regs -----------"); -+ show_regs(regs); -+ spin_unlock(&show_regs_lock); -+ bust_spinlocks(0); -+} -+ - void nmi_watchdog_tick (struct pt_regs * regs) - { - -@@ -486,7 +506,11 @@ void nmi_watchdog_tick (struct pt_regs * - bust_spinlocks(1); - printk("NMI Watchdog detected LOCKUP on CPU%d, eip %08lx, registers:\n", cpu, regs->eip); - show_registers(regs); -- printk("console shuts up ...\n"); -+ smp_nmi_call_function(smp_show_regs, NULL, 1); -+ bust_spinlocks(1); -+ /* current CPU messages should go bottom */ -+ if (!decode_call_traces) -+ smp_show_regs(regs, NULL); - console_silent(); - spin_unlock(&nmi_print_lock); - bust_spinlocks(0); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/process.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/process.c ---- linux-2.6.8.1.orig/arch/i386/kernel/process.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/process.c 2006-05-11 13:05:49.000000000 +0400 -@@ -46,6 +46,7 @@ 
- #include <asm/i387.h> - #include <asm/irq.h> - #include <asm/desc.h> -+#include <asm/atomic_kmap.h> - #ifdef CONFIG_MATH_EMULATION - #include <asm/math_emu.h> - #endif -@@ -219,11 +220,14 @@ __setup("idle=", idle_setup); - void show_regs(struct pt_regs * regs) - { - unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L; -+ extern int die_counter; - - printk("\n"); -- printk("Pid: %d, comm: %20s\n", current->pid, current->comm); -- printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id()); -- print_symbol("EIP is at %s\n", regs->eip); -+ printk("Pid: %d, comm: %20s, oopses: %d\n", current->pid, current->comm, die_counter); -+ printk("EIP: %04x:[<%08lx>] CPU: %d, VCPU: %d:%d\n",0xffff & regs->xcs,regs->eip, smp_processor_id(), -+ task_vsched_id(current), task_cpu(current)); -+ if (decode_call_traces) -+ print_symbol("EIP is at %s\n", regs->eip); - - if (regs->xcs & 3) - printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp); -@@ -247,6 +251,8 @@ void show_regs(struct pt_regs * regs) - : "=r" (cr4): "0" (0)); - printk("CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n", cr0, cr2, cr3, cr4); - show_trace(NULL, ®s->esp); -+ if (!decode_call_traces) -+ printk(" EIP: [<%08lx>]\n",regs->eip); - } - - /* -@@ -272,6 +278,13 @@ int kernel_thread(int (*fn)(void *), voi - { - struct pt_regs regs; - -+ /* Don't allow kernel_thread() inside VE */ -+ if (!ve_is_super(get_exec_env())) { -+ printk("kernel_thread call inside VE\n"); -+ dump_stack(); -+ return -EPERM; -+ } -+ - memset(®s, 0, sizeof(regs)); - - regs.ebx = (unsigned long) fn; -@@ -311,6 +324,9 @@ void flush_thread(void) - struct task_struct *tsk = current; - - memset(tsk->thread.debugreg, 0, sizeof(unsigned long)*8); -+#ifdef CONFIG_X86_HIGH_ENTRY -+ clear_thread_flag(TIF_DB7); -+#endif - memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); - /* - * Forget coprocessor state.. 
-@@ -324,9 +340,8 @@ void release_thread(struct task_struct * - if (dead_task->mm) { - // temporary debugging check - if (dead_task->mm->context.size) { -- printk("WARNING: dead process %8s still has LDT? <%p/%d>\n", -+ printk("WARNING: dead process %8s still has LDT? <%d>\n", - dead_task->comm, -- dead_task->mm->context.ldt, - dead_task->mm->context.size); - BUG(); - } -@@ -350,7 +365,7 @@ int copy_thread(int nr, unsigned long cl - { - struct pt_regs * childregs; - struct task_struct *tsk; -- int err; -+ int err, i; - - childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p->thread_info)) - 1; - *childregs = *regs; -@@ -361,7 +376,18 @@ int copy_thread(int nr, unsigned long cl - p->thread.esp = (unsigned long) childregs; - p->thread.esp0 = (unsigned long) (childregs+1); - -+ /* -+ * get the two stack pages, for the virtual stack. -+ * -+ * IMPORTANT: this code relies on the fact that the task -+ * structure is an THREAD_SIZE aligned piece of physical memory. -+ */ -+ for (i = 0; i < ARRAY_SIZE(p->thread_info->stack_page); i++) -+ p->thread_info->stack_page[i] = -+ virt_to_page((unsigned long)p->thread_info + (i*PAGE_SIZE)); -+ - p->thread.eip = (unsigned long) ret_from_fork; -+ p->thread_info->real_stack = p->thread_info; - - savesegment(fs,p->thread.fs); - savesegment(gs,p->thread.gs); -@@ -513,10 +539,42 @@ struct task_struct fastcall * __switch_t - - __unlazy_fpu(prev_p); - -+#ifdef CONFIG_X86_HIGH_ENTRY -+{ -+ int i; -+ /* -+ * Set the ptes of the virtual stack. (NOTE: a one-page TLB flush is -+ * needed because otherwise NMIs could interrupt the -+ * user-return code with a virtual stack and stale TLBs.) 
-+ */ -+ for (i = 0; i < ARRAY_SIZE(next_p->thread_info->stack_page); i++) { -+ __kunmap_atomic_type(KM_VSTACK_TOP-i); -+ __kmap_atomic(next_p->thread_info->stack_page[i], KM_VSTACK_TOP-i); -+ } -+ /* -+ * NOTE: here we rely on the task being the stack as well -+ */ -+ next_p->thread_info->virtual_stack = -+ (void *)__kmap_atomic_vaddr(KM_VSTACK_TOP); -+} -+#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) -+ /* -+ * If next was preempted on entry from userspace to kernel, -+ * and now it's on a different cpu, we need to adjust %esp. -+ * This assumes that entry.S does not copy %esp while on the -+ * virtual stack (with interrupts enabled): which is so, -+ * except within __SWITCH_KERNELSPACE itself. -+ */ -+ if (unlikely(next->esp >= TASK_SIZE)) { -+ next->esp &= THREAD_SIZE - 1; -+ next->esp |= (unsigned long) next_p->thread_info->virtual_stack; -+ } -+#endif -+#endif - /* - * Reload esp0, LDT and the page table pointer: - */ -- load_esp0(tss, next); -+ load_virtual_esp0(tss, next_p); - - /* - * Load the per-thread Thread-Local Storage descriptor. 
-@@ -759,6 +817,8 @@ asmlinkage int sys_get_thread_area(struc - if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) - return -EINVAL; - -+ memset(&info, 0, sizeof(info)); -+ - desc = current->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; - - info.entry_number = idx; -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/i386/kernel/ptrace.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/ptrace.c 2006-05-11 13:05:49.000000000 +0400 -@@ -253,7 +253,7 @@ asmlinkage int sys_ptrace(long request, - } - ret = -ESRCH; - read_lock(&tasklist_lock); -- child = find_task_by_pid(pid); -+ child = find_task_by_pid_ve(pid); - if (child) - get_task_struct(child); - read_unlock(&tasklist_lock); -@@ -388,7 +388,7 @@ asmlinkage int sys_ptrace(long request, - long tmp; - - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -@@ -541,8 +541,10 @@ void do_syscall_trace(struct pt_regs *re - return; - /* the 0x80 provides a way for the tracing parent to distinguish - between a syscall stop and SIGTRAP delivery */ -+ set_pn_state(current, entryexit ? PN_STOP_LEAVE : PN_STOP_ENTRY); - ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) - ? 
0x80 : 0)); -+ clear_pn_state(current); - - /* - * this isn't the same as continuing with a signal, but it will do -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/reboot.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/reboot.c ---- linux-2.6.8.1.orig/arch/i386/kernel/reboot.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/reboot.c 2006-05-11 13:05:38.000000000 +0400 -@@ -233,12 +233,11 @@ void machine_real_restart(unsigned char - CMOS_WRITE(0x00, 0x8f); - spin_unlock_irqrestore(&rtc_lock, flags); - -- /* Remap the kernel at virtual address zero, as well as offset zero -- from the kernel segment. This assumes the kernel segment starts at -- virtual address PAGE_OFFSET. */ -- -- memcpy (swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, -- sizeof (swapper_pg_dir [0]) * KERNEL_PGD_PTRS); -+ /* -+ * Remap the first 16 MB of RAM (which includes the kernel image) -+ * at virtual address zero: -+ */ -+ setup_identity_mappings(swapper_pg_dir, 0, LOW_MAPPINGS_SIZE); - - /* - * Use `swapper_pg_dir' as our page directory. -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/setup.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/setup.c ---- linux-2.6.8.1.orig/arch/i386/kernel/setup.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/setup.c 2006-05-11 13:05:38.000000000 +0400 -@@ -39,6 +39,7 @@ - #include <linux/efi.h> - #include <linux/init.h> - #include <linux/edd.h> -+#include <linux/mmzone.h> - #include <video/edid.h> - #include <asm/e820.h> - #include <asm/mpspec.h> -@@ -1073,7 +1074,19 @@ static unsigned long __init setup_memory - INITRD_START ? 
INITRD_START + PAGE_OFFSET : 0; - initrd_end = initrd_start+INITRD_SIZE; - } -- else { -+ else if ((max_low_pfn << PAGE_SHIFT) < -+ PAGE_ALIGN(INITRD_START + INITRD_SIZE)) { -+ /* GRUB places initrd as high as possible, so when -+ VMALLOC_AREA is bigger than std Linux has, such -+ initrd is inaccessiable in normal zone (highmem) */ -+ -+ /* initrd should be totally in highmem, sorry */ -+ BUG_ON(INITRD_START < (max_low_pfn << PAGE_SHIFT)); -+ -+ initrd_copy = INITRD_SIZE; -+ printk(KERN_ERR "initrd: GRUB workaround enabled\n"); -+ /* initrd is copied from highmem in initrd_move() */ -+ } else { - printk(KERN_ERR "initrd extends beyond end of memory " - "(0x%08lx > 0x%08lx)\ndisabling initrd\n", - INITRD_START + INITRD_SIZE, -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/signal.c ---- linux-2.6.8.1.orig/arch/i386/kernel/signal.c 2004-08-14 14:55:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/signal.c 2006-05-11 13:05:45.000000000 +0400 -@@ -42,6 +42,7 @@ sys_sigsuspend(int history0, int history - mask &= _BLOCKABLE; - spin_lock_irq(¤t->sighand->siglock); - saveset = current->blocked; -+ set_sigsuspend_state(current, saveset); - siginitset(¤t->blocked, mask); - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); -@@ -50,8 +51,10 @@ sys_sigsuspend(int history0, int history - while (1) { - current->state = TASK_INTERRUPTIBLE; - schedule(); -- if (do_signal(regs, &saveset)) -+ if (do_signal(regs, &saveset)) { -+ clear_sigsuspend_state(current); - return -EINTR; -+ } - } - } - -@@ -70,6 +73,7 @@ sys_rt_sigsuspend(struct pt_regs regs) - - spin_lock_irq(¤t->sighand->siglock); - saveset = current->blocked; -+ set_sigsuspend_state(current, saveset); - current->blocked = newset; - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); -@@ -78,8 +82,10 @@ sys_rt_sigsuspend(struct pt_regs regs) - while (1) { - current->state = TASK_INTERRUPTIBLE; - schedule(); -- if (do_signal(®s, 
&saveset)) -+ if (do_signal(®s, &saveset)) { -+ clear_sigsuspend_state(current); - return -EINTR; -+ } - } - } - -@@ -132,28 +138,29 @@ sys_sigaltstack(unsigned long ebx) - */ - - static int --restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, int *peax) -+restore_sigcontext(struct pt_regs *regs, -+ struct sigcontext __user *__sc, int *peax) - { -- unsigned int err = 0; -+ struct sigcontext scratch; /* 88 bytes of scratch area */ - - /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; - --#define COPY(x) err |= __get_user(regs->x, &sc->x) -+ if (copy_from_user(&scratch, __sc, sizeof(scratch))) -+ return -EFAULT; -+ -+#define COPY(x) regs->x = scratch.x - - #define COPY_SEG(seg) \ -- { unsigned short tmp; \ -- err |= __get_user(tmp, &sc->seg); \ -+ { unsigned short tmp = scratch.seg; \ - regs->x##seg = tmp; } - - #define COPY_SEG_STRICT(seg) \ -- { unsigned short tmp; \ -- err |= __get_user(tmp, &sc->seg); \ -+ { unsigned short tmp = scratch.seg; \ - regs->x##seg = tmp|3; } - - #define GET_SEG(seg) \ -- { unsigned short tmp; \ -- err |= __get_user(tmp, &sc->seg); \ -+ { unsigned short tmp = scratch.seg; \ - loadsegment(seg,tmp); } - - #define FIX_EFLAGS (X86_EFLAGS_AC | X86_EFLAGS_OF | X86_EFLAGS_DF | \ -@@ -176,27 +183,29 @@ restore_sigcontext(struct pt_regs *regs, - COPY_SEG_STRICT(ss); - - { -- unsigned int tmpflags; -- err |= __get_user(tmpflags, &sc->eflags); -+ unsigned int tmpflags = scratch.eflags; - regs->eflags = (regs->eflags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); - regs->orig_eax = -1; /* disable syscall checks */ - } - - { -- struct _fpstate __user * buf; -- err |= __get_user(buf, &sc->fpstate); -+ struct _fpstate * buf = scratch.fpstate; - if (buf) { - if (verify_area(VERIFY_READ, buf, sizeof(*buf))) -- goto badframe; -- err |= restore_i387(buf); -+ return -EFAULT; -+ if (restore_i387(buf)) -+ return -EFAULT; -+ } else { -+ struct task_struct *me = 
current; -+ if (me->used_math) { -+ clear_fpu(me); -+ me->used_math = 0; -+ } - } - } - -- err |= __get_user(*peax, &sc->eax); -- return err; -- --badframe: -- return 1; -+ *peax = scratch.eax; -+ return 0; - } - - asmlinkage int sys_sigreturn(unsigned long __unused) -@@ -265,46 +274,47 @@ badframe: - */ - - static int --setup_sigcontext(struct sigcontext __user *sc, struct _fpstate __user *fpstate, -+setup_sigcontext(struct sigcontext __user *__sc, struct _fpstate __user *fpstate, - struct pt_regs *regs, unsigned long mask) - { -- int tmp, err = 0; -+ struct sigcontext sc; /* 88 bytes of scratch area */ -+ int tmp; - - tmp = 0; - __asm__("movl %%gs,%0" : "=r"(tmp): "0"(tmp)); -- err |= __put_user(tmp, (unsigned int __user *)&sc->gs); -+ *(unsigned int *)&sc.gs = tmp; - __asm__("movl %%fs,%0" : "=r"(tmp): "0"(tmp)); -- err |= __put_user(tmp, (unsigned int __user *)&sc->fs); -- -- err |= __put_user(regs->xes, (unsigned int __user *)&sc->es); -- err |= __put_user(regs->xds, (unsigned int __user *)&sc->ds); -- err |= __put_user(regs->edi, &sc->edi); -- err |= __put_user(regs->esi, &sc->esi); -- err |= __put_user(regs->ebp, &sc->ebp); -- err |= __put_user(regs->esp, &sc->esp); -- err |= __put_user(regs->ebx, &sc->ebx); -- err |= __put_user(regs->edx, &sc->edx); -- err |= __put_user(regs->ecx, &sc->ecx); -- err |= __put_user(regs->eax, &sc->eax); -- err |= __put_user(current->thread.trap_no, &sc->trapno); -- err |= __put_user(current->thread.error_code, &sc->err); -- err |= __put_user(regs->eip, &sc->eip); -- err |= __put_user(regs->xcs, (unsigned int __user *)&sc->cs); -- err |= __put_user(regs->eflags, &sc->eflags); -- err |= __put_user(regs->esp, &sc->esp_at_signal); -- err |= __put_user(regs->xss, (unsigned int __user *)&sc->ss); -+ *(unsigned int *)&sc.fs = tmp; -+ *(unsigned int *)&sc.es = regs->xes; -+ *(unsigned int *)&sc.ds = regs->xds; -+ sc.edi = regs->edi; -+ sc.esi = regs->esi; -+ sc.ebp = regs->ebp; -+ sc.esp = regs->esp; -+ sc.ebx = regs->ebx; -+ sc.edx = 
regs->edx; -+ sc.ecx = regs->ecx; -+ sc.eax = regs->eax; -+ sc.trapno = current->thread.trap_no; -+ sc.err = current->thread.error_code; -+ sc.eip = regs->eip; -+ *(unsigned int *)&sc.cs = regs->xcs; -+ sc.eflags = regs->eflags; -+ sc.esp_at_signal = regs->esp; -+ *(unsigned int *)&sc.ss = regs->xss; - - tmp = save_i387(fpstate); - if (tmp < 0) -- err = 1; -- else -- err |= __put_user(tmp ? fpstate : NULL, &sc->fpstate); -+ return 1; -+ sc.fpstate = tmp ? fpstate : NULL; - - /* non-iBCS2 extensions.. */ -- err |= __put_user(mask, &sc->oldmask); -- err |= __put_user(current->thread.cr2, &sc->cr2); -+ sc.oldmask = mask; -+ sc.cr2 = current->thread.cr2; - -- return err; -+ if (copy_to_user(__sc, &sc, sizeof(sc))) -+ return 1; -+ return 0; - } - - /* -@@ -443,7 +453,7 @@ static void setup_rt_frame(int sig, stru - /* Create the ucontext. */ - err |= __put_user(0, &frame->uc.uc_flags); - err |= __put_user(0, &frame->uc.uc_link); -- err |= __put_user(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); -+ err |= __put_user(current->sas_ss_sp, (unsigned long *)&frame->uc.uc_stack.ss_sp); - err |= __put_user(sas_ss_flags(regs->esp), - &frame->uc.uc_stack.ss_flags); - err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); -@@ -565,9 +575,10 @@ int fastcall do_signal(struct pt_regs *r - if ((regs->xcs & 3) != 3) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -@@ -580,7 +591,9 @@ int fastcall do_signal(struct pt_regs *r - * have been cleared if the watchpoint triggered - * inside the kernel. - */ -- __asm__("movl %0,%%db7" : : "r" (current->thread.debugreg[7])); -+ if (unlikely(current->thread.debugreg[7])) { -+ __asm__("movl %0,%%db7" : : "r" (current->thread.debugreg[7])); -+ } - - /* Whee! Actually deliver the signal. 
*/ - handle_signal(signr, &info, oldset, regs); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/smp.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/smp.c ---- linux-2.6.8.1.orig/arch/i386/kernel/smp.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/smp.c 2006-05-11 13:05:38.000000000 +0400 -@@ -22,6 +22,7 @@ - - #include <asm/mtrr.h> - #include <asm/tlbflush.h> -+#include <asm/nmi.h> - #include <mach_ipi.h> - #include <mach_apic.h> - -@@ -122,7 +123,7 @@ static inline int __prepare_ICR2 (unsign - return SET_APIC_DEST_FIELD(mask); - } - --inline void __send_IPI_shortcut(unsigned int shortcut, int vector) -+void __send_IPI_shortcut(unsigned int shortcut, int vector) - { - /* - * Subtle. In the case of the 'never do double writes' workaround -@@ -157,7 +158,7 @@ void fastcall send_IPI_self(int vector) - /* - * This is only used on smaller machines. - */ --inline void send_IPI_mask_bitmask(cpumask_t cpumask, int vector) -+void send_IPI_mask_bitmask(cpumask_t cpumask, int vector) - { - unsigned long mask = cpus_addr(cpumask)[0]; - unsigned long cfg; -@@ -326,10 +327,12 @@ asmlinkage void smp_invalidate_interrupt - - if (flush_mm == cpu_tlbstate[cpu].active_mm) { - if (cpu_tlbstate[cpu].state == TLBSTATE_OK) { -+#ifndef CONFIG_X86_SWITCH_PAGETABLES - if (flush_va == FLUSH_ALL) - local_flush_tlb(); - else - __flush_tlb_one(flush_va); -+#endif - } else - leave_mm(cpu); - } -@@ -395,21 +398,6 @@ static void flush_tlb_others(cpumask_t c - spin_unlock(&tlbstate_lock); - } - --void flush_tlb_current_task(void) --{ -- struct mm_struct *mm = current->mm; -- cpumask_t cpu_mask; -- -- preempt_disable(); -- cpu_mask = mm->cpu_vm_mask; -- cpu_clear(smp_processor_id(), cpu_mask); -- -- local_flush_tlb(); -- if (!cpus_empty(cpu_mask)) -- flush_tlb_others(cpu_mask, mm, FLUSH_ALL); -- preempt_enable(); --} -- - void flush_tlb_mm (struct mm_struct * mm) - { - cpumask_t cpu_mask; -@@ -441,7 +429,10 @@ void flush_tlb_page(struct vm_area_struc - - if 
(current->active_mm == mm) { - if(current->mm) -- __flush_tlb_one(va); -+#ifndef CONFIG_X86_SWITCH_PAGETABLES -+ __flush_tlb_one(va) -+#endif -+ ; - else - leave_mm(smp_processor_id()); - } -@@ -547,6 +538,89 @@ int smp_call_function (void (*func) (voi - return 0; - } - -+static spinlock_t nmi_call_lock = SPIN_LOCK_UNLOCKED; -+static struct nmi_call_data_struct { -+ smp_nmi_function func; -+ void *info; -+ atomic_t started; -+ atomic_t finished; -+ cpumask_t cpus_called; -+ int wait; -+} *nmi_call_data; -+ -+static int smp_nmi_callback(struct pt_regs * regs, int cpu) -+{ -+ smp_nmi_function func; -+ void *info; -+ int wait; -+ -+ func = nmi_call_data->func; -+ info = nmi_call_data->info; -+ wait = nmi_call_data->wait; -+ ack_APIC_irq(); -+ /* prevent from calling func() multiple times */ -+ if (cpu_test_and_set(cpu, nmi_call_data->cpus_called)) -+ return 0; -+ /* -+ * notify initiating CPU that I've grabbed the data and am -+ * about to execute the function -+ */ -+ mb(); -+ atomic_inc(&nmi_call_data->started); -+ /* at this point the nmi_call_data structure is out of scope */ -+ irq_enter(); -+ func(regs, info); -+ irq_exit(); -+ if (wait) -+ atomic_inc(&nmi_call_data->finished); -+ -+ return 0; -+} -+ -+/* -+ * This function tries to call func(regs, info) on each cpu. -+ * Func must be fast and non-blocking. -+ * May be called with disabled interrupts and from any context. 
-+ */ -+int smp_nmi_call_function(smp_nmi_function func, void *info, int wait) -+{ -+ struct nmi_call_data_struct data; -+ int cpus; -+ -+ cpus = num_online_cpus() - 1; -+ if (!cpus) -+ return 0; -+ -+ data.func = func; -+ data.info = info; -+ data.wait = wait; -+ atomic_set(&data.started, 0); -+ atomic_set(&data.finished, 0); -+ cpus_clear(data.cpus_called); -+ /* prevent this cpu from calling func if NMI happens */ -+ cpu_set(smp_processor_id(), data.cpus_called); -+ -+ if (!spin_trylock(&nmi_call_lock)) -+ return -1; -+ -+ nmi_call_data = &data; -+ set_nmi_ipi_callback(smp_nmi_callback); -+ mb(); -+ -+ /* Send a message to all other CPUs and wait for them to respond */ -+ send_IPI_allbutself(APIC_DM_NMI); -+ while (atomic_read(&data.started) != cpus) -+ barrier(); -+ -+ unset_nmi_ipi_callback(); -+ if (wait) -+ while (atomic_read(&data.finished) != cpus) -+ barrier(); -+ spin_unlock(&nmi_call_lock); -+ -+ return 0; -+} -+ - static void stop_this_cpu (void * dummy) - { - /* -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/smpboot.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/smpboot.c ---- linux-2.6.8.1.orig/arch/i386/kernel/smpboot.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/smpboot.c 2006-05-11 13:05:40.000000000 +0400 -@@ -309,6 +309,8 @@ static void __init synchronize_tsc_bp (v - if (!buggy) - printk("passed.\n"); - ; -+ /* TSC reset. kill whatever might rely on old values */ -+ VE_TASK_INFO(current)->wakeup_stamp = 0; - } - - static void __init synchronize_tsc_ap (void) -@@ -334,6 +336,8 @@ static void __init synchronize_tsc_ap (v - atomic_inc(&tsc_count_stop); - while (atomic_read(&tsc_count_stop) != num_booting_cpus()) mb(); - } -+ /* TSC reset. kill whatever might rely on old values */ -+ VE_TASK_INFO(current)->wakeup_stamp = 0; - } - #undef NR_LOOPS - -@@ -405,8 +409,6 @@ void __init smp_callin(void) - setup_local_APIC(); - map_cpu_to_logical_apicid(); - -- local_irq_enable(); -- - /* - * Get our bogomips. 
- */ -@@ -419,7 +421,7 @@ void __init smp_callin(void) - smp_store_cpu_info(cpuid); - - disable_APIC_timer(); -- local_irq_disable(); -+ - /* - * Allow the master to continue. - */ -@@ -463,6 +465,10 @@ int __init start_secondary(void *unused) - */ - local_flush_tlb(); - cpu_set(smp_processor_id(), cpu_online_map); -+ -+ /* We can take interrupts now: we're officially "up". */ -+ local_irq_enable(); -+ - wmb(); - return cpu_idle(); - } -@@ -499,7 +505,7 @@ static struct task_struct * __init fork_ - * don't care about the eip and regs settings since - * we'll never reschedule the forked task. - */ -- return copy_process(CLONE_VM|CLONE_IDLETASK, 0, ®s, 0, NULL, NULL); -+ return copy_process(CLONE_VM|CLONE_IDLETASK, 0, ®s, 0, NULL, NULL, 0); - } - - #ifdef CONFIG_NUMA -@@ -810,6 +816,9 @@ static int __init do_boot_cpu(int apicid - - idle->thread.eip = (unsigned long) start_secondary; - -+ /* Cosmetic: sleep_time won't be changed afterwards for the idle -+ * thread; keep it 0 rather than -cycles. */ -+ VE_TASK_INFO(idle)->sleep_time = 0; - unhash_process(idle); - - /* start_eip had better be page-aligned! 
*/ -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/sys_i386.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/sys_i386.c ---- linux-2.6.8.1.orig/arch/i386/kernel/sys_i386.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/sys_i386.c 2006-05-11 13:05:40.000000000 +0400 -@@ -217,7 +217,7 @@ asmlinkage int sys_uname(struct old_utsn - if (!name) - return -EFAULT; - down_read(&uts_sem); -- err=copy_to_user(name, &system_utsname, sizeof (*name)); -+ err=copy_to_user(name, &ve_utsname, sizeof (*name)); - up_read(&uts_sem); - return err?-EFAULT:0; - } -@@ -233,15 +233,15 @@ asmlinkage int sys_olduname(struct oldol - - down_read(&uts_sem); - -- error = __copy_to_user(&name->sysname,&system_utsname.sysname,__OLD_UTS_LEN); -+ error = __copy_to_user(name->sysname,ve_utsname.sysname,__OLD_UTS_LEN); - error |= __put_user(0,name->sysname+__OLD_UTS_LEN); -- error |= __copy_to_user(&name->nodename,&system_utsname.nodename,__OLD_UTS_LEN); -+ error |= __copy_to_user(name->nodename,ve_utsname.nodename,__OLD_UTS_LEN); - error |= __put_user(0,name->nodename+__OLD_UTS_LEN); -- error |= __copy_to_user(&name->release,&system_utsname.release,__OLD_UTS_LEN); -+ error |= __copy_to_user(name->release,ve_utsname.release,__OLD_UTS_LEN); - error |= __put_user(0,name->release+__OLD_UTS_LEN); -- error |= __copy_to_user(&name->version,&system_utsname.version,__OLD_UTS_LEN); -+ error |= __copy_to_user(name->version,ve_utsname.version,__OLD_UTS_LEN); - error |= __put_user(0,name->version+__OLD_UTS_LEN); -- error |= __copy_to_user(&name->machine,&system_utsname.machine,__OLD_UTS_LEN); -+ error |= __copy_to_user(name->machine,ve_utsname.machine,__OLD_UTS_LEN); - error |= __put_user(0,name->machine+__OLD_UTS_LEN); - - up_read(&uts_sem); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/sysenter.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/sysenter.c ---- linux-2.6.8.1.orig/arch/i386/kernel/sysenter.c 2004-08-14 14:56:23.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/arch/i386/kernel/sysenter.c 2006-05-11 13:05:38.000000000 +0400 -@@ -18,13 +18,18 @@ - #include <asm/msr.h> - #include <asm/pgtable.h> - #include <asm/unistd.h> -+#include <linux/highmem.h> - - extern asmlinkage void sysenter_entry(void); - - void enable_sep_cpu(void *info) - { - int cpu = get_cpu(); -+#ifdef CONFIG_X86_HIGH_ENTRY -+ struct tss_struct *tss = (struct tss_struct *) __fix_to_virt(FIX_TSS_0) + cpu; -+#else - struct tss_struct *tss = init_tss + cpu; -+#endif - - tss->ss1 = __KERNEL_CS; - tss->esp1 = sizeof(struct tss_struct) + (unsigned long) tss; -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/time.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/time.c ---- linux-2.6.8.1.orig/arch/i386/kernel/time.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/time.c 2006-05-11 13:05:29.000000000 +0400 -@@ -362,7 +362,7 @@ void __init hpet_time_init(void) - xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ); - wall_to_monotonic.tv_nsec = -xtime.tv_nsec; - -- if (hpet_enable() >= 0) { -+ if ((hpet_enable() >= 0) && hpet_use_timer) { - printk("Using HPET for base-timer\n"); - } - -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/time_hpet.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/time_hpet.c ---- linux-2.6.8.1.orig/arch/i386/kernel/time_hpet.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/time_hpet.c 2006-05-11 13:05:29.000000000 +0400 -@@ -26,6 +26,7 @@ - unsigned long hpet_period; /* fsecs / HPET clock */ - unsigned long hpet_tick; /* hpet clks count per tick */ - unsigned long hpet_address; /* hpet memory map physical address */ -+int hpet_use_timer; - - static int use_hpet; /* can be used for runtime check of hpet */ - static int boot_hpet_disable; /* boottime override for HPET timer */ -@@ -88,8 +89,7 @@ int __init hpet_enable(void) - * So, we are OK with HPET_EMULATE_RTC part too, where we need - * to have atleast 2 timers. 
- */ -- if (!(id & HPET_ID_NUMBER) || -- !(id & HPET_ID_LEGSUP)) -+ if (!(id & HPET_ID_NUMBER)) - return -1; - - hpet_period = hpet_readl(HPET_PERIOD); -@@ -109,6 +109,8 @@ int __init hpet_enable(void) - if (hpet_tick_rem > (hpet_period >> 1)) - hpet_tick++; /* rounding the result */ - -+ hpet_use_timer = id & HPET_ID_LEGSUP; -+ - /* - * Stop the timers and reset the main counter. - */ -@@ -118,21 +120,30 @@ int __init hpet_enable(void) - hpet_writel(0, HPET_COUNTER); - hpet_writel(0, HPET_COUNTER + 4); - -- /* -- * Set up timer 0, as periodic with first interrupt to happen at -- * hpet_tick, and period also hpet_tick. -- */ -- cfg = hpet_readl(HPET_T0_CFG); -- cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | -- HPET_TN_SETVAL | HPET_TN_32BIT; -- hpet_writel(cfg, HPET_T0_CFG); -- hpet_writel(hpet_tick, HPET_T0_CMP); -+ if (hpet_use_timer) { -+ /* -+ * Set up timer 0, as periodic with first interrupt to happen at -+ * hpet_tick, and period also hpet_tick. -+ */ -+ cfg = hpet_readl(HPET_T0_CFG); -+ cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | -+ HPET_TN_SETVAL | HPET_TN_32BIT; -+ hpet_writel(cfg, HPET_T0_CFG); -+ /* -+ * Some systems seems to need two writes to HPET_T0_CMP, -+ * to get interrupts working -+ */ -+ hpet_writel(hpet_tick, HPET_T0_CMP); -+ hpet_writel(hpet_tick, HPET_T0_CMP); -+ } - - /* - * Go! 
- */ - cfg = hpet_readl(HPET_CFG); -- cfg |= HPET_CFG_ENABLE | HPET_CFG_LEGACY; -+ if (hpet_use_timer) -+ cfg |= HPET_CFG_LEGACY; -+ cfg |= HPET_CFG_ENABLE; - hpet_writel(cfg, HPET_CFG); - - use_hpet = 1; -@@ -181,7 +192,8 @@ int __init hpet_enable(void) - #endif - - #ifdef CONFIG_X86_LOCAL_APIC -- wait_timer_tick = wait_hpet_tick; -+ if (hpet_use_timer) -+ wait_timer_tick = wait_hpet_tick; - #endif - return 0; - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/timers/timer_hpet.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/timers/timer_hpet.c ---- linux-2.6.8.1.orig/arch/i386/kernel/timers/timer_hpet.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/timers/timer_hpet.c 2006-05-11 13:05:29.000000000 +0400 -@@ -79,7 +79,7 @@ static unsigned long get_offset_hpet(voi - - eax = hpet_readl(HPET_COUNTER); - eax -= hpet_last; /* hpet delta */ -- -+ eax = min(hpet_tick, eax); - /* - * Time offset = (hpet delta) * ( usecs per HPET clock ) - * = (hpet delta) * ( usecs per tick / HPET clocks per tick) -@@ -105,9 +105,12 @@ static void mark_offset_hpet(void) - last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low; - rdtsc(last_tsc_low, last_tsc_high); - -- offset = hpet_readl(HPET_T0_CMP) - hpet_tick; -- if (unlikely(((offset - hpet_last) > hpet_tick) && (hpet_last != 0))) { -- int lost_ticks = (offset - hpet_last) / hpet_tick; -+ if (hpet_use_timer) -+ offset = hpet_readl(HPET_T0_CMP) - hpet_tick; -+ else -+ offset = hpet_readl(HPET_COUNTER); -+ if (unlikely(((offset - hpet_last) >= (2*hpet_tick)) && (hpet_last != 0))) { -+ int lost_ticks = ((offset - hpet_last) / hpet_tick) - 1; - jiffies_64 += lost_ticks; - } - hpet_last = offset; -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/timers/timer_tsc.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/timers/timer_tsc.c ---- linux-2.6.8.1.orig/arch/i386/kernel/timers/timer_tsc.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/timers/timer_tsc.c 
2006-05-11 13:05:39.000000000 +0400 -@@ -81,7 +81,7 @@ static int count2; /* counter for mark_o - * Equal to 2^32 * (1 / (clocks per usec) ). - * Initialized in time_init. - */ --static unsigned long fast_gettimeoffset_quotient; -+unsigned long fast_gettimeoffset_quotient; - - static unsigned long get_offset_tsc(void) - { -@@ -474,7 +474,7 @@ static int __init init_tsc(char* overrid - if (cpu_has_tsc) { - unsigned long tsc_quotient; - #ifdef CONFIG_HPET_TIMER -- if (is_hpet_enabled()){ -+ if (is_hpet_enabled() && hpet_use_timer) { - unsigned long result, remain; - printk("Using TSC for gettimeofday\n"); - tsc_quotient = calibrate_tsc_hpet(NULL); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/traps.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/traps.c ---- linux-2.6.8.1.orig/arch/i386/kernel/traps.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/traps.c 2006-05-11 13:05:49.000000000 +0400 -@@ -54,12 +54,8 @@ - - #include "mach_traps.h" - --asmlinkage int system_call(void); --asmlinkage void lcall7(void); --asmlinkage void lcall27(void); -- --struct desc_struct default_ldt[] = { { 0, 0 }, { 0, 0 }, { 0, 0 }, -- { 0, 0 }, { 0, 0 } }; -+struct desc_struct default_ldt[] __attribute__((__section__(".data.default_ldt"))) = { { 0, 0 }, { 0, 0 }, { 0, 0 }, { 0, 0 }, { 0, 0 } }; -+struct page *default_ldt_page; - - /* Do we ignore FPU interrupts ? 
*/ - char ignore_fpu_irq = 0; -@@ -93,45 +89,41 @@ asmlinkage void machine_check(void); - - static int kstack_depth_to_print = 24; - --static int valid_stack_ptr(struct task_struct *task, void *p) -+static inline int valid_stack_ptr(struct thread_info *tinfo, void *p) - { -- if (p <= (void *)task->thread_info) -- return 0; -- if (kstack_end(p)) -- return 0; -- return 1; -+ return p > (void *)tinfo && -+ p < (void *)tinfo + THREAD_SIZE - 3; - } - --#ifdef CONFIG_FRAME_POINTER --static void print_context_stack(struct task_struct *task, unsigned long *stack, -- unsigned long ebp) -+static inline unsigned long print_context_stack(struct thread_info *tinfo, -+ unsigned long *stack, unsigned long ebp) - { - unsigned long addr; - -- while (valid_stack_ptr(task, (void *)ebp)) { -+#ifdef CONFIG_FRAME_POINTER -+ while (valid_stack_ptr(tinfo, (void *)ebp)) { - addr = *(unsigned long *)(ebp + 4); -- printk(" [<%08lx>] ", addr); -- print_symbol("%s", addr); -- printk("\n"); -+ printk(" [<%08lx>]", addr); -+ if (decode_call_traces) { -+ print_symbol(" %s", addr); -+ printk("\n"); -+ } - ebp = *(unsigned long *)ebp; - } --} - #else --static void print_context_stack(struct task_struct *task, unsigned long *stack, -- unsigned long ebp) --{ -- unsigned long addr; -- -- while (!kstack_end(stack)) { -+ while (valid_stack_ptr(tinfo, stack)) { - addr = *stack++; - if (__kernel_text_address(addr)) { - printk(" [<%08lx>]", addr); -- print_symbol(" %s", addr); -- printk("\n"); -+ if (decode_call_traces) { -+ print_symbol(" %s", addr); -+ printk("\n"); -+ } - } - } --} - #endif -+ return ebp; -+} - - void show_trace(struct task_struct *task, unsigned long * stack) - { -@@ -140,11 +132,6 @@ void show_trace(struct task_struct *task - if (!task) - task = current; - -- if (!valid_stack_ptr(task, stack)) { -- printk("Stack pointer is garbage, not printing trace\n"); -- return; -- } -- - if (task == current) { - /* Grab ebp right from our regs */ - asm ("movl %%ebp, %0" : "=r" (ebp) : ); -@@ 
-157,11 +144,14 @@ void show_trace(struct task_struct *task - struct thread_info *context; - context = (struct thread_info *) - ((unsigned long)stack & (~(THREAD_SIZE - 1))); -- print_context_stack(task, stack, ebp); -+ ebp = print_context_stack(context, stack, ebp); - stack = (unsigned long*)context->previous_esp; - if (!stack) - break; -- printk(" =======================\n"); -+ if (decode_call_traces) -+ printk(" =======================\n"); -+ else -+ printk(" =<ctx>= "); - } - } - -@@ -185,8 +175,12 @@ void show_stack(struct task_struct *task - printk("\n "); - printk("%08lx ", *stack++); - } -- printk("\nCall Trace:\n"); -+ printk("\nCall Trace:"); -+ if (decode_call_traces) -+ printk("\n"); - show_trace(task, esp); -+ if (!decode_call_traces) -+ printk("\n"); - } - - /* -@@ -197,6 +191,8 @@ void dump_stack(void) - unsigned long stack; - - show_trace(current, &stack); -+ if (!decode_call_traces) -+ printk("\n"); - } - - EXPORT_SYMBOL(dump_stack); -@@ -216,9 +212,10 @@ void show_registers(struct pt_regs *regs - ss = regs->xss & 0xffff; - } - print_modules(); -- printk("CPU: %d\nEIP: %04x:[<%08lx>] %s\nEFLAGS: %08lx" -+ printk("CPU: %d, VCPU: %d:%d\nEIP: %04x:[<%08lx>] %s\nEFLAGS: %08lx" - " (%s) \n", -- smp_processor_id(), 0xffff & regs->xcs, regs->eip, -+ smp_processor_id(), task_vsched_id(current), task_cpu(current), -+ 0xffff & regs->xcs, regs->eip, - print_tainted(), regs->eflags, UTS_RELEASE); - print_symbol("EIP is at %s\n", regs->eip); - printk("eax: %08lx ebx: %08lx ecx: %08lx edx: %08lx\n", -@@ -227,8 +224,10 @@ void show_registers(struct pt_regs *regs - regs->esi, regs->edi, regs->ebp, esp); - printk("ds: %04x es: %04x ss: %04x\n", - regs->xds & 0xffff, regs->xes & 0xffff, ss); -- printk("Process %s (pid: %d, threadinfo=%p task=%p)", -- current->comm, current->pid, current_thread_info(), current); -+ printk("Process %s (pid: %d, veid=%d, threadinfo=%p task=%p)", -+ current->comm, current->pid, -+ VEID(VE_TASK_INFO(current)->owner_env), -+ 
current_thread_info(), current); - /* - * When in-kernel, we also print out the stack and code at the - * time of the fault.. -@@ -244,8 +243,10 @@ void show_registers(struct pt_regs *regs - - for(i=0;i<20;i++) - { -- unsigned char c; -- if(__get_user(c, &((unsigned char*)regs->eip)[i])) { -+ unsigned char c = 0; -+ if ((user_mode(regs) && get_user(c, &((unsigned char*)regs->eip)[i])) || -+ (!user_mode(regs) && __direct_get_user(c, &((unsigned char*)regs->eip)[i]))) { -+ - bad: - printk(" Bad EIP value."); - break; -@@ -269,16 +270,14 @@ static void handle_BUG(struct pt_regs *r - - eip = regs->eip; - -- if (eip < PAGE_OFFSET) -- goto no_bug; -- if (__get_user(ud2, (unsigned short *)eip)) -+ if (__direct_get_user(ud2, (unsigned short *)eip)) - goto no_bug; - if (ud2 != 0x0b0f) - goto no_bug; -- if (__get_user(line, (unsigned short *)(eip + 2))) -+ if (__direct_get_user(line, (unsigned short *)(eip + 4))) - goto bug; -- if (__get_user(file, (char **)(eip + 4)) || -- (unsigned long)file < PAGE_OFFSET || __get_user(c, file)) -+ if (__direct_get_user(file, (char **)(eip + 7)) || -+ __direct_get_user(c, file)) - file = "<bad filename>"; - - printk("------------[ cut here ]------------\n"); -@@ -292,11 +291,18 @@ bug: - printk("Kernel BUG\n"); - } - -+static void inline check_kernel_csum_bug(void) -+{ -+ if (kernel_text_csum_broken) -+ printk("Kernel code checksum mismatch detected %d times\n", -+ kernel_text_csum_broken); -+} -+ - spinlock_t die_lock = SPIN_LOCK_UNLOCKED; -+int die_counter; - - void die(const char * str, struct pt_regs * regs, long err) - { -- static int die_counter; - int nl = 0; - - console_verbose(); -@@ -319,6 +325,7 @@ void die(const char * str, struct pt_reg - if (nl) - printk("\n"); - show_registers(regs); -+ check_kernel_csum_bug(); - bust_spinlocks(0); - spin_unlock_irq(&die_lock); - if (in_interrupt()) -@@ -531,6 +538,7 @@ static int dummy_nmi_callback(struct pt_ - } - - static nmi_callback_t nmi_callback = dummy_nmi_callback; -+static 
nmi_callback_t nmi_ipi_callback = dummy_nmi_callback; - - asmlinkage void do_nmi(struct pt_regs * regs, long error_code) - { -@@ -544,9 +552,20 @@ asmlinkage void do_nmi(struct pt_regs * - if (!nmi_callback(regs, cpu)) - default_do_nmi(regs); - -+ nmi_ipi_callback(regs, cpu); - nmi_exit(); - } - -+void set_nmi_ipi_callback(nmi_callback_t callback) -+{ -+ nmi_ipi_callback = callback; -+} -+ -+void unset_nmi_ipi_callback(void) -+{ -+ nmi_ipi_callback = dummy_nmi_callback; -+} -+ - void set_nmi_callback(nmi_callback_t callback) - { - nmi_callback = callback; -@@ -591,10 +610,18 @@ asmlinkage void do_debug(struct pt_regs - if (regs->eflags & X86_EFLAGS_IF) - local_irq_enable(); - -- /* Mask out spurious debug traps due to lazy DR7 setting */ -+ /* -+ * Mask out spurious debug traps due to lazy DR7 setting or -+ * due to 4G/4G kernel mode: -+ */ - if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) { - if (!tsk->thread.debugreg[7]) - goto clear_dr7; -+ if (!user_mode(regs)) { -+ // restore upon return-to-userspace: -+ set_thread_flag(TIF_DB7); -+ goto clear_dr7; -+ } - } - - if (regs->eflags & VM_MASK) -@@ -836,19 +863,52 @@ asmlinkage void math_emulate(long arg) - - #endif /* CONFIG_MATH_EMULATION */ - --#ifdef CONFIG_X86_F00F_BUG --void __init trap_init_f00f_bug(void) -+void __init trap_init_virtual_IDT(void) - { -- __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO); -- - /* -- * Update the IDT descriptor and reload the IDT so that -- * it uses the read-only mapped virtual address. -+ * "idt" is magic - it overlaps the idt_descr -+ * variable so that updating idt will automatically -+ * update the idt descriptor.. 
- */ -- idt_descr.address = fix_to_virt(FIX_F00F_IDT); -+ __set_fixmap(FIX_IDT, __pa(&idt_table), PAGE_KERNEL_RO); -+ idt_descr.address = __fix_to_virt(FIX_IDT); -+ - __asm__ __volatile__("lidt %0" : : "m" (idt_descr)); - } -+ -+void __init trap_init_virtual_GDT(void) -+{ -+ int cpu = smp_processor_id(); -+ struct Xgt_desc_struct *gdt_desc = cpu_gdt_descr + cpu; -+ struct Xgt_desc_struct tmp_desc = {0, 0}; -+ struct tss_struct * t; -+ -+ __asm__ __volatile__("sgdt %0": "=m" (tmp_desc): :"memory"); -+ -+#ifdef CONFIG_X86_HIGH_ENTRY -+ if (!cpu) { -+ int i; -+ __set_fixmap(FIX_GDT_0, __pa(cpu_gdt_table), PAGE_KERNEL); -+ __set_fixmap(FIX_GDT_1, __pa(cpu_gdt_table) + PAGE_SIZE, PAGE_KERNEL); -+ for(i = 0; i < FIX_TSS_COUNT; i++) -+ __set_fixmap(FIX_TSS_0 - i, __pa(init_tss) + i * PAGE_SIZE, PAGE_KERNEL); -+ } -+ -+ gdt_desc->address = __fix_to_virt(FIX_GDT_0) + sizeof(cpu_gdt_table[0]) * cpu; -+#else -+ gdt_desc->address = (unsigned long)cpu_gdt_table[cpu]; - #endif -+ __asm__ __volatile__("lgdt %0": "=m" (*gdt_desc)); -+ -+#ifdef CONFIG_X86_HIGH_ENTRY -+ t = (struct tss_struct *) __fix_to_virt(FIX_TSS_0) + cpu; -+#else -+ t = init_tss + cpu; -+#endif -+ set_tss_desc(cpu, t); -+ cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; -+ load_TR_desc(); -+} - - #define _set_gate(gate_addr,type,dpl,addr,seg) \ - do { \ -@@ -875,17 +935,17 @@ void set_intr_gate(unsigned int n, void - _set_gate(idt_table+n,14,0,addr,__KERNEL_CS); - } - --static void __init set_trap_gate(unsigned int n, void *addr) -+void __init set_trap_gate(unsigned int n, void *addr) - { - _set_gate(idt_table+n,15,0,addr,__KERNEL_CS); - } - --static void __init set_system_gate(unsigned int n, void *addr) -+void __init set_system_gate(unsigned int n, void *addr) - { - _set_gate(idt_table+n,15,3,addr,__KERNEL_CS); - } - --static void __init set_call_gate(void *a, void *addr) -+void __init set_call_gate(void *a, void *addr) - { - _set_gate(a,12,3,addr,__KERNEL_CS); - } -@@ -907,6 +967,7 @@ void __init 
trap_init(void) - #ifdef CONFIG_X86_LOCAL_APIC - init_apic_mappings(); - #endif -+ init_entry_mappings(); - - set_trap_gate(0,÷_error); - set_intr_gate(1,&debug); -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/vm86.c linux-2.6.8.1-ve022stab078/arch/i386/kernel/vm86.c ---- linux-2.6.8.1.orig/arch/i386/kernel/vm86.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/vm86.c 2006-05-11 13:05:38.000000000 +0400 -@@ -124,7 +124,7 @@ struct pt_regs * fastcall save_v86_state - tss = init_tss + get_cpu(); - current->thread.esp0 = current->thread.saved_esp0; - current->thread.sysenter_cs = __KERNEL_CS; -- load_esp0(tss, ¤t->thread); -+ load_virtual_esp0(tss, current); - current->thread.saved_esp0 = 0; - put_cpu(); - -@@ -307,7 +307,7 @@ static void do_sys_vm86(struct kernel_vm - tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0; - if (cpu_has_sep) - tsk->thread.sysenter_cs = 0; -- load_esp0(tss, &tsk->thread); -+ load_virtual_esp0(tss, tsk); - put_cpu(); - - tsk->thread.screen_bitmap = info->screen_bitmap; -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/vmlinux.lds.S linux-2.6.8.1-ve022stab078/arch/i386/kernel/vmlinux.lds.S ---- linux-2.6.8.1.orig/arch/i386/kernel/vmlinux.lds.S 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/vmlinux.lds.S 2006-05-11 13:05:38.000000000 +0400 -@@ -5,13 +5,17 @@ - #include <asm-generic/vmlinux.lds.h> - #include <asm/thread_info.h> - -+#include <linux/config.h> -+#include <asm/page.h> -+#include <asm/asm_offsets.h> -+ - OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") - OUTPUT_ARCH(i386) - ENTRY(startup_32) - jiffies = jiffies_64; - SECTIONS - { -- . = 0xC0000000 + 0x100000; -+ . = __PAGE_OFFSET + 0x100000; - /* read-only */ - _text = .; /* Text and read-only data */ - .text : { -@@ -21,6 +25,19 @@ SECTIONS - *(.gnu.warning) - } = 0x9090 - -+#ifdef CONFIG_X86_4G -+ . = ALIGN(PAGE_SIZE_asm); -+ __entry_tramp_start = .; -+ . 
= FIX_ENTRY_TRAMPOLINE_0_addr; -+ __start___entry_text = .; -+ .entry.text : AT (__entry_tramp_start) { *(.entry.text) } -+ __entry_tramp_end = __entry_tramp_start + SIZEOF(.entry.text); -+ . = __entry_tramp_end; -+ . = ALIGN(PAGE_SIZE_asm); -+#else -+ .entry.text : { *(.entry.text) } -+#endif -+ - _etext = .; /* End of text section */ - - . = ALIGN(16); /* Exception table */ -@@ -36,15 +53,12 @@ SECTIONS - CONSTRUCTORS - } - -- . = ALIGN(4096); -+ . = ALIGN(PAGE_SIZE_asm); - __nosave_begin = .; - .data_nosave : { *(.data.nosave) } -- . = ALIGN(4096); -+ . = ALIGN(PAGE_SIZE_asm); - __nosave_end = .; - -- . = ALIGN(4096); -- .data.page_aligned : { *(.data.idt) } -- - . = ALIGN(32); - .data.cacheline_aligned : { *(.data.cacheline_aligned) } - -@@ -54,7 +68,7 @@ SECTIONS - .data.init_task : { *(.data.init_task) } - - /* will be freed after init */ -- . = ALIGN(4096); /* Init code and data */ -+ . = ALIGN(PAGE_SIZE_asm); /* Init code and data */ - __init_begin = .; - .init.text : { - _sinittext = .; -@@ -93,7 +107,7 @@ SECTIONS - from .altinstructions and .eh_frame */ - .exit.text : { *(.exit.text) } - .exit.data : { *(.exit.data) } -- . = ALIGN(4096); -+ . = ALIGN(PAGE_SIZE_asm); - __initramfs_start = .; - .init.ramfs : { *(.init.ramfs) } - __initramfs_end = .; -@@ -101,10 +115,22 @@ SECTIONS - __per_cpu_start = .; - .data.percpu : { *(.data.percpu) } - __per_cpu_end = .; -- . = ALIGN(4096); -+ . = ALIGN(PAGE_SIZE_asm); - __init_end = .; - /* freed after init ends here */ -- -+ -+ . = ALIGN(PAGE_SIZE_asm); -+ .data.page_aligned_tss : { *(.data.tss) } -+ -+ . = ALIGN(PAGE_SIZE_asm); -+ .data.page_aligned_default_ldt : { *(.data.default_ldt) } -+ -+ . = ALIGN(PAGE_SIZE_asm); -+ .data.page_aligned_idt : { *(.data.idt) } -+ -+ . 
= ALIGN(PAGE_SIZE_asm); -+ .data.page_aligned_gdt : { *(.data.gdt) } -+ - __bss_start = .; /* BSS */ - .bss : { - *(.bss.page_aligned) -@@ -132,4 +158,6 @@ SECTIONS - .stab.index 0 : { *(.stab.index) } - .stab.indexstr 0 : { *(.stab.indexstr) } - .comment 0 : { *(.comment) } -+ -+ - } -diff -uprN linux-2.6.8.1.orig/arch/i386/kernel/vsyscall-sysenter.S linux-2.6.8.1-ve022stab078/arch/i386/kernel/vsyscall-sysenter.S ---- linux-2.6.8.1.orig/arch/i386/kernel/vsyscall-sysenter.S 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/kernel/vsyscall-sysenter.S 2006-05-11 13:05:38.000000000 +0400 -@@ -12,6 +12,11 @@ - .type __kernel_vsyscall,@function - __kernel_vsyscall: - .LSTART_vsyscall: -+ cmpl $192, %eax -+ jne 1f -+ int $0x80 -+ ret -+1: - push %ecx - .Lpush_ecx: - push %edx -@@ -84,7 +89,7 @@ SYSENTER_RETURN: - .byte 0x04 /* DW_CFA_advance_loc4 */ - .long .Lpop_ebp-.Lenter_kernel - .byte 0x0e /* DW_CFA_def_cfa_offset */ -- .byte 0x12 /* RA at offset 12 now */ -+ .byte 0x0c /* RA at offset 12 now */ - .byte 0xc5 /* DW_CFA_restore %ebp */ - .byte 0x04 /* DW_CFA_advance_loc4 */ - .long .Lpop_edx-.Lpop_ebp -diff -uprN linux-2.6.8.1.orig/arch/i386/lib/checksum.S linux-2.6.8.1-ve022stab078/arch/i386/lib/checksum.S ---- linux-2.6.8.1.orig/arch/i386/lib/checksum.S 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/lib/checksum.S 2006-05-11 13:05:38.000000000 +0400 -@@ -280,14 +280,14 @@ unsigned int csum_partial_copy_generic ( - .previous - - .align 4 --.globl csum_partial_copy_generic -+.globl direct_csum_partial_copy_generic - - #ifndef CONFIG_X86_USE_PPRO_CHECKSUM - - #define ARGBASE 16 - #define FP 12 - --csum_partial_copy_generic: -+direct_csum_partial_copy_generic: - subl $4,%esp - pushl %edi - pushl %esi -@@ -422,7 +422,7 @@ DST( movb %cl, (%edi) ) - - #define ARGBASE 12 - --csum_partial_copy_generic: -+direct_csum_partial_copy_generic: - pushl %ebx - pushl %edi - pushl %esi -diff -uprN 
linux-2.6.8.1.orig/arch/i386/lib/getuser.S linux-2.6.8.1-ve022stab078/arch/i386/lib/getuser.S ---- linux-2.6.8.1.orig/arch/i386/lib/getuser.S 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/lib/getuser.S 2006-05-11 13:05:38.000000000 +0400 -@@ -9,6 +9,7 @@ - * return value. - */ - #include <asm/thread_info.h> -+#include <asm/asm_offsets.h> - - - /* -diff -uprN linux-2.6.8.1.orig/arch/i386/lib/usercopy.c linux-2.6.8.1-ve022stab078/arch/i386/lib/usercopy.c ---- linux-2.6.8.1.orig/arch/i386/lib/usercopy.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/lib/usercopy.c 2006-05-11 13:05:38.000000000 +0400 -@@ -9,7 +9,6 @@ - #include <linux/mm.h> - #include <linux/highmem.h> - #include <linux/blkdev.h> --#include <linux/module.h> - #include <asm/uaccess.h> - #include <asm/mmx.h> - -@@ -77,7 +76,7 @@ do { \ - * and returns @count. - */ - long --__strncpy_from_user(char *dst, const char __user *src, long count) -+__direct_strncpy_from_user(char *dst, const char __user *src, long count) - { - long res; - __do_strncpy_from_user(dst, src, count, res); -@@ -103,7 +102,7 @@ __strncpy_from_user(char *dst, const cha - * and returns @count. - */ - long --strncpy_from_user(char *dst, const char __user *src, long count) -+direct_strncpy_from_user(char *dst, const char __user *src, long count) - { - long res = -EFAULT; - if (access_ok(VERIFY_READ, src, 1)) -@@ -148,7 +147,7 @@ do { \ - * On success, this will be zero. - */ - unsigned long --clear_user(void __user *to, unsigned long n) -+direct_clear_user(void __user *to, unsigned long n) - { - might_sleep(); - if (access_ok(VERIFY_WRITE, to, n)) -@@ -168,7 +167,7 @@ clear_user(void __user *to, unsigned lon - * On success, this will be zero. 
- */ - unsigned long --__clear_user(void __user *to, unsigned long n) -+__direct_clear_user(void __user *to, unsigned long n) - { - __do_clear_user(to, n); - return n; -@@ -185,7 +184,7 @@ __clear_user(void __user *to, unsigned l - * On exception, returns 0. - * If the string is too long, returns a value greater than @n. - */ --long strnlen_user(const char __user *s, long n) -+long direct_strnlen_user(const char __user *s, long n) - { - unsigned long mask = -__addr_ok(s); - unsigned long res, tmp; -@@ -568,8 +567,7 @@ survive: - return n; - } - --unsigned long --__copy_from_user_ll(void *to, const void __user *from, unsigned long n) -+unsigned long __copy_from_user_ll(void *to, const void __user *from, unsigned long n) - { - if (movsl_is_ok(to, from, n)) - __copy_user_zeroing(to, from, n); -@@ -578,53 +576,3 @@ __copy_from_user_ll(void *to, const void - return n; - } - --/** -- * copy_to_user: - Copy a block of data into user space. -- * @to: Destination address, in user space. -- * @from: Source address, in kernel space. -- * @n: Number of bytes to copy. -- * -- * Context: User context only. This function may sleep. -- * -- * Copy data from kernel space to user space. -- * -- * Returns number of bytes that could not be copied. -- * On success, this will be zero. -- */ --unsigned long --copy_to_user(void __user *to, const void *from, unsigned long n) --{ -- might_sleep(); -- if (access_ok(VERIFY_WRITE, to, n)) -- n = __copy_to_user(to, from, n); -- return n; --} --EXPORT_SYMBOL(copy_to_user); -- --/** -- * copy_from_user: - Copy a block of data from user space. -- * @to: Destination address, in kernel space. -- * @from: Source address, in user space. -- * @n: Number of bytes to copy. -- * -- * Context: User context only. This function may sleep. -- * -- * Copy data from user space to kernel space. -- * -- * Returns number of bytes that could not be copied. -- * On success, this will be zero. 
-- * -- * If some data could not be copied, this function will pad the copied -- * data to the requested size using zero bytes. -- */ --unsigned long --copy_from_user(void *to, const void __user *from, unsigned long n) --{ -- might_sleep(); -- if (access_ok(VERIFY_READ, from, n)) -- n = __copy_from_user(to, from, n); -- else -- memset(to, 0, n); -- return n; --} --EXPORT_SYMBOL(copy_from_user); -diff -uprN linux-2.6.8.1.orig/arch/i386/math-emu/fpu_system.h linux-2.6.8.1-ve022stab078/arch/i386/math-emu/fpu_system.h ---- linux-2.6.8.1.orig/arch/i386/math-emu/fpu_system.h 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/math-emu/fpu_system.h 2006-05-11 13:05:38.000000000 +0400 -@@ -15,6 +15,7 @@ - #include <linux/sched.h> - #include <linux/kernel.h> - #include <linux/mm.h> -+#include <asm/atomic_kmap.h> - - /* This sets the pointer FPU_info to point to the argument part - of the stack frame of math_emulate() */ -@@ -22,7 +23,7 @@ - - /* s is always from a cpu register, and the cpu does bounds checking - * during register load --> no further bounds checks needed */ --#define LDT_DESCRIPTOR(s) (((struct desc_struct *)current->mm->context.ldt)[(s) >> 3]) -+#define LDT_DESCRIPTOR(s) (((struct desc_struct *)__kmap_atomic_vaddr(KM_LDT_PAGE0))[(s) >> 3]) - #define SEG_D_SIZE(x) ((x).b & (3 << 21)) - #define SEG_G_BIT(x) ((x).b & (1 << 23)) - #define SEG_GRANULARITY(x) (((x).b & (1 << 23)) ? 
4096 : 1) -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/fault.c linux-2.6.8.1-ve022stab078/arch/i386/mm/fault.c ---- linux-2.6.8.1.orig/arch/i386/mm/fault.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/fault.c 2006-05-11 13:05:38.000000000 +0400 -@@ -26,36 +26,11 @@ - #include <asm/uaccess.h> - #include <asm/hardirq.h> - #include <asm/desc.h> -+#include <asm/tlbflush.h> - - extern void die(const char *,struct pt_regs *,long); - - /* -- * Unlock any spinlocks which will prevent us from getting the -- * message out -- */ --void bust_spinlocks(int yes) --{ -- int loglevel_save = console_loglevel; -- -- if (yes) { -- oops_in_progress = 1; -- return; -- } --#ifdef CONFIG_VT -- unblank_screen(); --#endif -- oops_in_progress = 0; -- /* -- * OK, the message is on the console. Now we call printk() -- * without oops_in_progress set so that printk will give klogd -- * a poke. Hold onto your hats... -- */ -- console_loglevel = 15; /* NMI oopser may have shut the console up */ -- printk(" "); -- console_loglevel = loglevel_save; --} -- --/* - * Return EIP plus the CS segment base. The segment limit is also - * adjusted, clamped to the kernel/user address space (whichever is - * appropriate), and returned in *eip_limit. -@@ -103,8 +78,17 @@ static inline unsigned long get_segment_ - if (seg & (1<<2)) { - /* Must lock the LDT while reading it. */ - down(¤t->mm->context.sem); -+#if 1 -+ /* horrible hack for 4/4 disabled kernels. -+ I'm not quite sure what the TLB flush is good for, -+ it's mindlessly copied from the read_ldt code */ -+ __flush_tlb_global(); -+ desc = kmap(current->mm->context.ldt_pages[(seg&~7)/PAGE_SIZE]); -+ desc = (void *)desc + ((seg & ~7) % PAGE_SIZE); -+#else - desc = current->mm->context.ldt; - desc = (void *)desc + (seg & ~7); -+#endif - } else { - /* Must disable preemption while reading the GDT. 
*/ - desc = (u32 *)&cpu_gdt_table[get_cpu()]; -@@ -117,6 +101,9 @@ static inline unsigned long get_segment_ - (desc[1] & 0xff000000); - - if (seg & (1<<2)) { -+#if 1 -+ kunmap((void *)((unsigned long)desc & PAGE_MASK)); -+#endif - up(¤t->mm->context.sem); - } else - put_cpu(); -@@ -232,6 +219,8 @@ asmlinkage void do_page_fault(struct pt_ - - tsk = current; - -+ check_stack_overflow(); -+ - info.si_code = SEGV_MAPERR; - - /* -@@ -247,6 +236,17 @@ asmlinkage void do_page_fault(struct pt_ - * (error_code & 4) == 0, and that the fault was not a - * protection error (error_code & 1) == 0. - */ -+#ifdef CONFIG_X86_4G -+ /* -+ * On 4/4 all kernels faults are either bugs, vmalloc or prefetch -+ */ -+ /* If it's vm86 fall through */ -+ if (unlikely(!(regs->eflags & VM_MASK) && ((regs->xcs & 3) == 0))) { -+ if (error_code & 3) -+ goto bad_area_nosemaphore; -+ goto vmalloc_fault; -+ } -+#else - if (unlikely(address >= TASK_SIZE)) { - if (!(error_code & 5)) - goto vmalloc_fault; -@@ -256,6 +256,7 @@ asmlinkage void do_page_fault(struct pt_ - */ - goto bad_area_nosemaphore; - } -+#endif - - mm = tsk->mm; - -@@ -333,7 +334,6 @@ good_area: - goto bad_area; - } - -- survive: - /* - * If for any reason at all we couldn't handle the fault, - * make sure we exit gracefully rather than endlessly redo -@@ -472,14 +472,14 @@ no_context: - */ - out_of_memory: - up_read(&mm->mmap_sem); -- if (tsk->pid == 1) { -- yield(); -- down_read(&mm->mmap_sem); -- goto survive; -+ if (error_code & 4) { -+ /* -+ * 0-order allocation always success if something really -+ * fatal not happen: beancounter overdraft or OOM. 
Den -+ */ -+ force_sig(SIGKILL, tsk); -+ return; - } -- printk("VM: killing process %s\n", tsk->comm); -- if (error_code & 4) -- do_exit(SIGKILL); - goto no_context; - - do_sigbus: -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/highmem.c linux-2.6.8.1-ve022stab078/arch/i386/mm/highmem.c ---- linux-2.6.8.1.orig/arch/i386/mm/highmem.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/highmem.c 2006-05-11 13:05:38.000000000 +0400 -@@ -41,12 +41,45 @@ void *kmap_atomic(struct page *page, enu - if (!pte_none(*(kmap_pte-idx))) - BUG(); - #endif -- set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); -+ /* -+ * If the page is not a normal RAM page, then map it -+ * uncached to be on the safe side - it could be device -+ * memory that must not be prefetched: -+ */ -+ if (PageReserved(page)) -+ set_pte(kmap_pte-idx, mk_pte(page, kmap_prot_nocache)); -+ else -+ set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); - __flush_tlb_one(vaddr); - - return (void*) vaddr; - } - -+/* -+ * page frame number based kmaps - useful for PCI mappings. -+ * NOTE: we map the page with the same mapping as what user is using. 
-+ */ -+void *kmap_atomic_pte(pte_t *pte, enum km_type type) -+{ -+ enum fixed_addresses idx; -+ unsigned long vaddr; -+ -+ /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ -+ inc_preempt_count(); -+ -+ idx = type + KM_TYPE_NR*smp_processor_id(); -+ vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -+#ifdef CONFIG_DEBUG_HIGHMEM -+ if (!pte_none(*(kmap_pte-idx))) -+ BUG(); -+#endif -+ set_pte(kmap_pte-idx, *pte); -+ __flush_tlb_one(vaddr); -+ -+ return (void*) vaddr; -+} -+ -+ - void kunmap_atomic(void *kvaddr, enum km_type type) - { - #ifdef CONFIG_DEBUG_HIGHMEM -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/hugetlbpage.c linux-2.6.8.1-ve022stab078/arch/i386/mm/hugetlbpage.c ---- linux-2.6.8.1.orig/arch/i386/mm/hugetlbpage.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/hugetlbpage.c 2006-05-11 13:05:38.000000000 +0400 -@@ -18,6 +18,8 @@ - #include <asm/tlb.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_vmpages.h> -+ - static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) - { - pgd_t *pgd; -@@ -43,6 +45,7 @@ static void set_huge_pte(struct mm_struc - pte_t entry; - - mm->rss += (HPAGE_SIZE / PAGE_SIZE); -+ ub_unused_privvm_dec(mm_ub(mm), HPAGE_SIZE / PAGE_SIZE, vma); - if (write_access) { - entry = - pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); -@@ -83,6 +86,7 @@ int copy_hugetlb_page_range(struct mm_st - get_page(ptepage); - set_pte(dst_pte, entry); - dst->rss += (HPAGE_SIZE / PAGE_SIZE); -+ ub_unused_privvm_dec(mm_ub(dst), HPAGE_SIZE / PAGE_SIZE, vma); - addr += HPAGE_SIZE; - } - return 0; -@@ -219,6 +223,7 @@ void unmap_hugepage_range(struct vm_area - put_page(page); - } - mm->rss -= (end - start) >> PAGE_SHIFT; -+ ub_unused_privvm_inc(mm_ub(mm), (end - start) >> PAGE_SHIFT, vma); - flush_tlb_range(vma, start, end); - } - -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/init.c linux-2.6.8.1-ve022stab078/arch/i386/mm/init.c ---- linux-2.6.8.1.orig/arch/i386/mm/init.c 2004-08-14 
14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/init.c 2006-05-11 13:05:38.000000000 +0400 -@@ -27,6 +27,7 @@ - #include <linux/slab.h> - #include <linux/proc_fs.h> - #include <linux/efi.h> -+#include <linux/initrd.h> - - #include <asm/processor.h> - #include <asm/system.h> -@@ -39,143 +40,14 @@ - #include <asm/tlb.h> - #include <asm/tlbflush.h> - #include <asm/sections.h> -+#include <asm/setup.h> -+#include <asm/desc.h> - - DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); - unsigned long highstart_pfn, highend_pfn; - - static int do_test_wp_bit(void); - --/* -- * Creates a middle page table and puts a pointer to it in the -- * given global directory entry. This only returns the gd entry -- * in non-PAE compilation mode, since the middle layer is folded. -- */ --static pmd_t * __init one_md_table_init(pgd_t *pgd) --{ -- pmd_t *pmd_table; -- --#ifdef CONFIG_X86_PAE -- pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE); -- set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT)); -- if (pmd_table != pmd_offset(pgd, 0)) -- BUG(); --#else -- pmd_table = pmd_offset(pgd, 0); --#endif -- -- return pmd_table; --} -- --/* -- * Create a page table and place a pointer to it in a middle page -- * directory entry. -- */ --static pte_t * __init one_page_table_init(pmd_t *pmd) --{ -- if (pmd_none(*pmd)) { -- pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); -- set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE)); -- if (page_table != pte_offset_kernel(pmd, 0)) -- BUG(); -- -- return page_table; -- } -- -- return pte_offset_kernel(pmd, 0); --} -- --/* -- * This function initializes a certain range of kernel virtual memory -- * with new bootmem page tables, everywhere page tables are missing in -- * the given range. -- */ -- --/* -- * NOTE: The pagetables are allocated contiguous on the physical space -- * so we can cache the place of the first one and move around without -- * checking the pgd every time. 
-- */ --static void __init page_table_range_init (unsigned long start, unsigned long end, pgd_t *pgd_base) --{ -- pgd_t *pgd; -- pmd_t *pmd; -- int pgd_idx, pmd_idx; -- unsigned long vaddr; -- -- vaddr = start; -- pgd_idx = pgd_index(vaddr); -- pmd_idx = pmd_index(vaddr); -- pgd = pgd_base + pgd_idx; -- -- for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) { -- if (pgd_none(*pgd)) -- one_md_table_init(pgd); -- -- pmd = pmd_offset(pgd, vaddr); -- for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, pmd_idx++) { -- if (pmd_none(*pmd)) -- one_page_table_init(pmd); -- -- vaddr += PMD_SIZE; -- } -- pmd_idx = 0; -- } --} -- --static inline int is_kernel_text(unsigned long addr) --{ -- if (addr >= (unsigned long)_stext && addr <= (unsigned long)__init_end) -- return 1; -- return 0; --} -- --/* -- * This maps the physical memory to kernel virtual address space, a total -- * of max_low_pfn pages, by creating page tables starting from address -- * PAGE_OFFSET. -- */ --static void __init kernel_physical_mapping_init(pgd_t *pgd_base) --{ -- unsigned long pfn; -- pgd_t *pgd; -- pmd_t *pmd; -- pte_t *pte; -- int pgd_idx, pmd_idx, pte_ofs; -- -- pgd_idx = pgd_index(PAGE_OFFSET); -- pgd = pgd_base + pgd_idx; -- pfn = 0; -- -- for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) { -- pmd = one_md_table_init(pgd); -- if (pfn >= max_low_pfn) -- continue; -- for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) { -- unsigned int address = pfn * PAGE_SIZE + PAGE_OFFSET; -- -- /* Map with big pages if possible, otherwise create normal page tables. 
*/ -- if (cpu_has_pse) { -- unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1; -- -- if (is_kernel_text(address) || is_kernel_text(address2)) -- set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC)); -- else -- set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE)); -- pfn += PTRS_PER_PTE; -- } else { -- pte = one_page_table_init(pmd); -- -- for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) { -- if (is_kernel_text(address)) -- set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC)); -- else -- set_pte(pte, pfn_pte(pfn, PAGE_KERNEL)); -- } -- } -- } -- } --} -- - static inline int page_kills_ppro(unsigned long pagenr) - { - if (pagenr >= 0x70000 && pagenr <= 0x7003F) -@@ -223,11 +95,8 @@ static inline int page_is_ram(unsigned l - return 0; - } - --#ifdef CONFIG_HIGHMEM - pte_t *kmap_pte; --pgprot_t kmap_prot; - --EXPORT_SYMBOL(kmap_prot); - EXPORT_SYMBOL(kmap_pte); - - #define kmap_get_fixmap_pte(vaddr) \ -@@ -235,29 +104,7 @@ EXPORT_SYMBOL(kmap_pte); - - void __init kmap_init(void) - { -- unsigned long kmap_vstart; -- -- /* cache the first kmap pte */ -- kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN); -- kmap_pte = kmap_get_fixmap_pte(kmap_vstart); -- -- kmap_prot = PAGE_KERNEL; --} -- --void __init permanent_kmaps_init(pgd_t *pgd_base) --{ -- pgd_t *pgd; -- pmd_t *pmd; -- pte_t *pte; -- unsigned long vaddr; -- -- vaddr = PKMAP_BASE; -- page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base); -- -- pgd = swapper_pg_dir + pgd_index(vaddr); -- pmd = pmd_offset(pgd, vaddr); -- pte = pte_offset_kernel(pmd, vaddr); -- pkmap_page_table = pte; -+ kmap_pte = kmap_get_fixmap_pte(__fix_to_virt(FIX_KMAP_BEGIN)); - } - - void __init one_highpage_init(struct page *page, int pfn, int bad_ppro) -@@ -272,6 +119,8 @@ void __init one_highpage_init(struct pag - SetPageReserved(page); - } - -+#ifdef CONFIG_HIGHMEM -+ - #ifndef CONFIG_DISCONTIGMEM - void __init set_highmem_pages_init(int bad_ppro) - { -@@ -283,12 +132,9 @@ 
void __init set_highmem_pages_init(int b - #else - extern void set_highmem_pages_init(int); - #endif /* !CONFIG_DISCONTIGMEM */ -- - #else --#define kmap_init() do { } while (0) --#define permanent_kmaps_init(pgd_base) do { } while (0) --#define set_highmem_pages_init(bad_ppro) do { } while (0) --#endif /* CONFIG_HIGHMEM */ -+# define set_highmem_pages_init(bad_ppro) do { } while (0) -+#endif - - unsigned long long __PAGE_KERNEL = _PAGE_KERNEL; - unsigned long long __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC; -@@ -299,31 +145,125 @@ unsigned long long __PAGE_KERNEL_EXEC = - extern void __init remap_numa_kva(void); - #endif - --static void __init pagetable_init (void) -+static __init void prepare_pagetables(pgd_t *pgd_base, unsigned long address) -+{ -+ pgd_t *pgd; -+ pmd_t *pmd; -+ pte_t *pte; -+ -+ pgd = pgd_base + pgd_index(address); -+ pmd = pmd_offset(pgd, address); -+ if (!pmd_present(*pmd)) { -+ pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); -+ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))); -+ } -+} -+ -+static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base) -+{ -+ unsigned long vaddr; -+ -+ for (vaddr = start; vaddr != end; vaddr += PAGE_SIZE) -+ prepare_pagetables(pgd_base, vaddr); -+} -+ -+void setup_identity_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end) - { - unsigned long vaddr; -- pgd_t *pgd_base = swapper_pg_dir; -+ pgd_t *pgd; -+ int i, j, k; -+ pmd_t *pmd; -+ pte_t *pte, *pte_base; -+ -+ pgd = pgd_base; - -+ for (i = 0; i < PTRS_PER_PGD; pgd++, i++) { -+ vaddr = i*PGDIR_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ pmd = pmd_offset(pgd, 0); -+ for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { -+ vaddr = i*PGDIR_SIZE + j*PMD_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ if (vaddr < start) -+ continue; -+ if (cpu_has_pse) { -+ unsigned long __pe; -+ -+ set_in_cr4(X86_CR4_PSE); -+ boot_cpu_data.wp_works_ok = 1; -+ __pe = _KERNPG_TABLE + _PAGE_PSE + vaddr - start; -+ /* Make it "global" too if 
supported */ -+ if (cpu_has_pge) { -+ set_in_cr4(X86_CR4_PGE); -+#if !defined(CONFIG_X86_SWITCH_PAGETABLES) -+ __pe += _PAGE_GLOBAL; -+ __PAGE_KERNEL |= _PAGE_GLOBAL; -+#endif -+ } -+ set_pmd(pmd, __pmd(__pe)); -+ continue; -+ } -+ if (!pmd_present(*pmd)) -+ pte_base = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); -+ else -+ pte_base = pte_offset_kernel(pmd, 0); -+ pte = pte_base; -+ for (k = 0; k < PTRS_PER_PTE; pte++, k++) { -+ vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ if (vaddr < start) -+ continue; -+ *pte = mk_pte_phys(vaddr-start, PAGE_KERNEL); -+ } -+ set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base))); -+ } -+ } -+} -+ -+static void __init pagetable_init (void) -+{ -+ unsigned long vaddr, end; -+ pgd_t *pgd_base; - #ifdef CONFIG_X86_PAE - int i; -- /* Init entries of the first-level page table to the zero page */ -- for (i = 0; i < PTRS_PER_PGD; i++) -- set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT)); - #endif - -- /* Enable PSE if available */ -- if (cpu_has_pse) { -- set_in_cr4(X86_CR4_PSE); -- } -+ /* -+ * This can be zero as well - no problem, in that case we exit -+ * the loops anyway due to the PTRS_PER_* conditions. -+ */ -+ end = (unsigned long)__va(max_low_pfn*PAGE_SIZE); - -- /* Enable PGE if available */ -- if (cpu_has_pge) { -- set_in_cr4(X86_CR4_PGE); -- __PAGE_KERNEL |= _PAGE_GLOBAL; -- __PAGE_KERNEL_EXEC |= _PAGE_GLOBAL; -+ pgd_base = swapper_pg_dir; -+#ifdef CONFIG_X86_PAE -+ /* -+ * It causes too many problems if there's no proper pmd set up -+ * for all 4 entries of the PGD - so we allocate all of them. -+ * PAE systems will not miss this extra 4-8K anyway ... 
-+ */ -+ for (i = 0; i < PTRS_PER_PGD; i++) { -+ pmd_t *pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE); -+ set_pgd(pgd_base + i, __pgd(__pa(pmd) + 0x1)); - } -+#endif -+ /* -+ * Set up lowmem-sized identity mappings at PAGE_OFFSET: -+ */ -+ setup_identity_mappings(pgd_base, PAGE_OFFSET, end); - -- kernel_physical_mapping_init(pgd_base); -+ /* -+ * Add flat-mode identity-mappings - SMP needs it when -+ * starting up on an AP from real-mode. (In the non-PAE -+ * case we already have these mappings through head.S.) -+ * All user-space mappings are explicitly cleared after -+ * SMP startup. -+ */ -+#if defined(CONFIG_SMP) && defined(CONFIG_X86_PAE) -+ setup_identity_mappings(pgd_base, 0, 16*1024*1024); -+#endif - remap_numa_kva(); - - /* -@@ -331,22 +271,57 @@ static void __init pagetable_init (void) - * created - mappings will be set by set_fixmap(): - */ - vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK; -- page_table_range_init(vaddr, 0, pgd_base); -+ fixrange_init(vaddr, 0, pgd_base); - -- permanent_kmaps_init(pgd_base); -+#ifdef CONFIG_HIGHMEM -+ { -+ pgd_t *pgd; -+ pmd_t *pmd; -+ pte_t *pte; - --#ifdef CONFIG_X86_PAE -- /* -- * Add low memory identity-mappings - SMP needs it when -- * starting up on an AP from real-mode. In the non-PAE -- * case we already have these mappings through head.S. -- * All user-space mappings are explicitly cleared after -- * SMP startup. -- */ -- pgd_base[0] = pgd_base[USER_PTRS_PER_PGD]; -+ /* -+ * Permanent kmaps: -+ */ -+ vaddr = PKMAP_BASE; -+ fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base); -+ -+ pgd = swapper_pg_dir + pgd_index(vaddr); -+ pmd = pmd_offset(pgd, vaddr); -+ pte = pte_offset_kernel(pmd, vaddr); -+ pkmap_page_table = pte; -+ } - #endif - } - -+/* -+ * Clear kernel pagetables in a PMD_SIZE-aligned range. 
-+ */ -+static void clear_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end) -+{ -+ unsigned long vaddr; -+ pgd_t *pgd; -+ pmd_t *pmd; -+ int i, j; -+ -+ pgd = pgd_base; -+ -+ for (i = 0; i < PTRS_PER_PGD; pgd++, i++) { -+ vaddr = i*PGDIR_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ pmd = pmd_offset(pgd, 0); -+ for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { -+ vaddr = i*PGDIR_SIZE + j*PMD_SIZE; -+ if (end && (vaddr >= end)) -+ break; -+ if (vaddr < start) -+ continue; -+ pmd_clear(pmd); -+ } -+ } -+ flush_tlb_all(); -+} -+ - #if defined(CONFIG_PM_DISK) || defined(CONFIG_SOFTWARE_SUSPEND) - /* - * Swap suspend & friends need this for resume because things like the intel-agp -@@ -365,25 +340,16 @@ static inline void save_pg_dir(void) - } - #endif - --void zap_low_mappings (void) --{ -- int i; - -+void zap_low_mappings(void) -+{ - save_pg_dir(); - -+ printk("zapping low mappings.\n"); - /* - * Zap initial low-memory mappings. -- * -- * Note that "pgd_clear()" doesn't do it for -- * us, because pgd_clear() is a no-op on i386. - */ -- for (i = 0; i < USER_PTRS_PER_PGD; i++) --#ifdef CONFIG_X86_PAE -- set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(empty_zero_page))); --#else -- set_pgd(swapper_pg_dir+i, __pgd(0)); --#endif -- flush_tlb_all(); -+ clear_mappings(swapper_pg_dir, 0, 16*1024*1024); - } - - #ifndef CONFIG_DISCONTIGMEM -@@ -454,7 +420,6 @@ static void __init set_nx(void) - } - } - } -- - /* - * Enables/disables executability of a given kernel page and - * returns the previous setting. -@@ -512,7 +477,15 @@ void __init paging_init(void) - set_in_cr4(X86_CR4_PAE); - #endif - __flush_tlb_all(); -- -+ /* -+ * Subtle. SMP is doing it's boot stuff late (because it has to -+ * fork idle threads) - but it also needs low mappings for the -+ * protected-mode entry to work. We zap these entries only after -+ * the WP-bit has been tested. 
-+ */ -+#ifndef CONFIG_SMP -+ zap_low_mappings(); -+#endif - kmap_init(); - zone_sizes_init(); - } -@@ -561,6 +534,37 @@ extern void set_max_mapnr_init(void); - - static struct kcore_list kcore_mem, kcore_vmalloc; - -+#ifdef CONFIG_BLK_DEV_INITRD -+/* -+ * This function move initrd from highmem to normal zone, if needed. -+ * Note, we have to do it before highmem pages are given to buddy allocator. -+ */ -+static void initrd_move(void) -+{ -+ unsigned long i, start, off; -+ struct page *page; -+ void *addr; -+ -+ if (initrd_copy <= 0) -+ return; -+ -+ initrd_start = (unsigned long) -+ alloc_bootmem_low_pages(PAGE_ALIGN(INITRD_SIZE)); -+ initrd_end = INITRD_START + initrd_copy; -+ start = (initrd_end - initrd_copy) & PAGE_MASK; -+ off = (initrd_end - initrd_copy) & ~PAGE_MASK; -+ for (i = 0; i < initrd_copy; i += PAGE_SIZE) { -+ page = pfn_to_page((start + i) >> PAGE_SHIFT); -+ addr = kmap_atomic(page, KM_USER0); -+ memcpy((void *)initrd_start + i, -+ addr, PAGE_SIZE); -+ kunmap_atomic(addr, KM_USER0); -+ } -+ initrd_start += off; -+ initrd_end = initrd_start + initrd_copy; -+} -+#endif -+ - void __init mem_init(void) - { - extern int ppro_with_ram_bug(void); -@@ -593,6 +597,9 @@ void __init mem_init(void) - high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); - #endif - -+#ifdef CONFIG_BLK_DEV_INITRD -+ initrd_move(); -+#endif - /* this will put all low memory onto the freelists */ - totalram_pages += __free_all_bootmem(); - -@@ -631,38 +638,57 @@ void __init mem_init(void) - if (boot_cpu_data.wp_works_ok < 0) - test_wp_bit(); - -- /* -- * Subtle. SMP is doing it's boot stuff late (because it has to -- * fork idle threads) - but it also needs low mappings for the -- * protected-mode entry to work. We zap these entries only after -- * the WP-bit has been tested. 
-- */ --#ifndef CONFIG_SMP -- zap_low_mappings(); --#endif -+ entry_trampoline_setup(); -+ default_ldt_page = virt_to_page(default_ldt); -+ load_LDT(&init_mm.context); - } - --kmem_cache_t *pgd_cache; --kmem_cache_t *pmd_cache; -+kmem_cache_t *pgd_cache, *pmd_cache, *kpmd_cache; - - void __init pgtable_cache_init(void) - { -+ void (*ctor)(void *, kmem_cache_t *, unsigned long); -+ void (*dtor)(void *, kmem_cache_t *, unsigned long); -+ - if (PTRS_PER_PMD > 1) { - pmd_cache = kmem_cache_create("pmd", - PTRS_PER_PMD*sizeof(pmd_t), - PTRS_PER_PMD*sizeof(pmd_t), -- 0, -+ SLAB_UBC, - pmd_ctor, - NULL); - if (!pmd_cache) - panic("pgtable_cache_init(): cannot create pmd cache"); -+ -+ if (TASK_SIZE > PAGE_OFFSET) { -+ kpmd_cache = kmem_cache_create("kpmd", -+ PTRS_PER_PMD*sizeof(pmd_t), -+ PTRS_PER_PMD*sizeof(pmd_t), -+ SLAB_UBC, -+ kpmd_ctor, -+ NULL); -+ if (!kpmd_cache) -+ panic("pgtable_cache_init(): " -+ "cannot create kpmd cache"); -+ } - } -+ -+ if (PTRS_PER_PMD == 1 || TASK_SIZE <= PAGE_OFFSET) -+ ctor = pgd_ctor; -+ else -+ ctor = NULL; -+ -+ if (PTRS_PER_PMD == 1 && TASK_SIZE <= PAGE_OFFSET) -+ dtor = pgd_dtor; -+ else -+ dtor = NULL; -+ - pgd_cache = kmem_cache_create("pgd", - PTRS_PER_PGD*sizeof(pgd_t), - PTRS_PER_PGD*sizeof(pgd_t), -- 0, -- pgd_ctor, -- PTRS_PER_PMD == 1 ? 
pgd_dtor : NULL); -+ SLAB_UBC, -+ ctor, -+ dtor); - if (!pgd_cache) - panic("pgtable_cache_init(): Cannot create pgd cache"); - } -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/pageattr.c linux-2.6.8.1-ve022stab078/arch/i386/mm/pageattr.c ---- linux-2.6.8.1.orig/arch/i386/mm/pageattr.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/pageattr.c 2006-05-11 13:05:38.000000000 +0400 -@@ -67,22 +67,21 @@ static void flush_kernel_map(void *dummy - - static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) - { -- struct page *page; -- unsigned long flags; -- - set_pte_atomic(kpte, pte); /* change init_mm */ -- if (PTRS_PER_PMD > 1) -- return; -- -- spin_lock_irqsave(&pgd_lock, flags); -- for (page = pgd_list; page; page = (struct page *)page->index) { -- pgd_t *pgd; -- pmd_t *pmd; -- pgd = (pgd_t *)page_address(page) + pgd_index(address); -- pmd = pmd_offset(pgd, address); -- set_pte_atomic((pte_t *)pmd, pte); -+#ifndef CONFIG_X86_PAE -+ { -+ struct list_head *l; -+ if (TASK_SIZE > PAGE_OFFSET) -+ return; -+ spin_lock(&mmlist_lock); -+ list_for_each(l, &init_mm.mmlist) { -+ struct mm_struct *mm = list_entry(l, struct mm_struct, mmlist); -+ pmd_t *pmd = pmd_offset(pgd_offset(mm, address), address); -+ set_pte_atomic((pte_t *)pmd, pte); -+ } -+ spin_unlock(&mmlist_lock); - } -- spin_unlock_irqrestore(&pgd_lock, flags); -+#endif - } - - /* -diff -uprN linux-2.6.8.1.orig/arch/i386/mm/pgtable.c linux-2.6.8.1-ve022stab078/arch/i386/mm/pgtable.c ---- linux-2.6.8.1.orig/arch/i386/mm/pgtable.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/mm/pgtable.c 2006-05-11 13:05:40.000000000 +0400 -@@ -5,8 +5,10 @@ - #include <linux/config.h> - #include <linux/sched.h> - #include <linux/kernel.h> -+#include <linux/module.h> - #include <linux/errno.h> - #include <linux/mm.h> -+#include <linux/vmalloc.h> - #include <linux/swap.h> - #include <linux/smp.h> - #include <linux/highmem.h> -@@ -21,6 +23,7 @@ - #include 
<asm/e820.h> - #include <asm/tlb.h> - #include <asm/tlbflush.h> -+#include <asm/atomic_kmap.h> - - void show_mem(void) - { -@@ -53,6 +56,7 @@ void show_mem(void) - printk("%d reserved pages\n",reserved); - printk("%d pages shared\n",shared); - printk("%d pages swap cached\n",cached); -+ vprintstat(); - } - - /* -@@ -143,9 +147,10 @@ struct page *pte_alloc_one(struct mm_str - struct page *pte; - - #ifdef CONFIG_HIGHPTE -- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0); -+ pte = alloc_pages(GFP_KERNEL_UBC|__GFP_SOFT_UBC| -+ __GFP_HIGHMEM|__GFP_REPEAT, 0); - #else -- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); -+ pte = alloc_pages(GFP_KERNEL_UBC|__GFP_SOFT_UBC|__GFP_REPEAT, 0); - #endif - if (pte) - clear_highpage(pte); -@@ -157,11 +162,20 @@ void pmd_ctor(void *pmd, kmem_cache_t *c - memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t)); - } - -+void kpmd_ctor(void *__pmd, kmem_cache_t *cache, unsigned long flags) -+{ -+ pmd_t *kpmd, *pmd; -+ kpmd = pmd_offset(&swapper_pg_dir[PTRS_PER_PGD-1], -+ (PTRS_PER_PMD - NR_SHARED_PMDS)*PMD_SIZE); -+ pmd = (pmd_t *)__pmd + (PTRS_PER_PMD - NR_SHARED_PMDS); -+ -+ memset(__pmd, 0, (PTRS_PER_PMD - NR_SHARED_PMDS)*sizeof(pmd_t)); -+ memcpy(pmd, kpmd, NR_SHARED_PMDS*sizeof(pmd_t)); -+} -+ - /* -- * List of all pgd's needed for non-PAE so it can invalidate entries -- * in both cached and uncached pgd's; not needed for PAE since the -- * kernel pmd is shared. If PAE were not to share the pmd a similar -- * tactic would be needed. This is essentially codepath-based locking -+ * List of all pgd's needed so it can invalidate entries in both cached -+ * and uncached pgd's. This is essentially codepath-based locking - * against pageattr.c; it is the unique case in which a valid change - * of kernel pagetables can't be lazily synchronized by vmalloc faults. - * vmalloc faults work because attached pagetables are never freed. 
-@@ -169,6 +183,12 @@ void pmd_ctor(void *pmd, kmem_cache_t *c - * checks at dup_mmap(), exec(), and other mmlist addition points - * could be used. The locking scheme was chosen on the basis of - * manfred's recommendations and having no core impact whatsoever. -+ * -+ * Lexicon for #ifdefless conditions to config options: -+ * (a) PTRS_PER_PMD == 1 means non-PAE. -+ * (b) PTRS_PER_PMD > 1 means PAE. -+ * (c) TASK_SIZE > PAGE_OFFSET means 4:4. -+ * (d) TASK_SIZE <= PAGE_OFFSET means non-4:4. - * -- wli - */ - spinlock_t pgd_lock = SPIN_LOCK_UNLOCKED; -@@ -194,26 +214,38 @@ static inline void pgd_list_del(pgd_t *p - next->private = (unsigned long)pprev; - } - --void pgd_ctor(void *pgd, kmem_cache_t *cache, unsigned long unused) -+void pgd_ctor(void *__pgd, kmem_cache_t *cache, unsigned long unused) - { -+ pgd_t *pgd = __pgd; - unsigned long flags; - -- if (PTRS_PER_PMD == 1) -- spin_lock_irqsave(&pgd_lock, flags); -+ if (PTRS_PER_PMD == 1) { -+ if (TASK_SIZE <= PAGE_OFFSET) -+ spin_lock_irqsave(&pgd_lock, flags); -+ else -+ memcpy(&pgd[PTRS_PER_PGD - NR_SHARED_PMDS], -+ &swapper_pg_dir[PTRS_PER_PGD - NR_SHARED_PMDS], -+ NR_SHARED_PMDS*sizeof(pgd_t)); -+ } - -- memcpy((pgd_t *)pgd + USER_PTRS_PER_PGD, -- swapper_pg_dir + USER_PTRS_PER_PGD, -- (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); -+ if (TASK_SIZE <= PAGE_OFFSET) -+ memcpy(&pgd[USER_PTRS_PER_PGD], -+ &swapper_pg_dir[USER_PTRS_PER_PGD], -+ (PTRS_PER_PGD - USER_PTRS_PER_PGD)*sizeof(pgd_t)); - - if (PTRS_PER_PMD > 1) - return; - -- pgd_list_add(pgd); -- spin_unlock_irqrestore(&pgd_lock, flags); -- memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); -+ if (TASK_SIZE > PAGE_OFFSET) -+ memset(pgd, 0, (PTRS_PER_PGD - NR_SHARED_PMDS)*sizeof(pgd_t)); -+ else { -+ pgd_list_add(pgd); -+ spin_unlock_irqrestore(&pgd_lock, flags); -+ memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); -+ } - } - --/* never called when PTRS_PER_PMD > 1 */ -+/* Never called when PTRS_PER_PMD > 1 || TASK_SIZE > PAGE_OFFSET */ - void 
pgd_dtor(void *pgd, kmem_cache_t *cache, unsigned long unused) - { - unsigned long flags; /* can be called from interrupt context */ -@@ -231,15 +263,31 @@ pgd_t *pgd_alloc(struct mm_struct *mm) - if (PTRS_PER_PMD == 1 || !pgd) - return pgd; - -+ /* -+ * In the 4G userspace case alias the top 16 MB virtual -+ * memory range into the user mappings as well (these -+ * include the trampoline and CPU data structures). -+ */ - for (i = 0; i < USER_PTRS_PER_PGD; ++i) { -- pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL); -+ pmd_t *pmd; -+ -+ if (TASK_SIZE > PAGE_OFFSET && i == USER_PTRS_PER_PGD - 1) -+ pmd = kmem_cache_alloc(kpmd_cache, GFP_KERNEL); -+ else -+ pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL); -+ - if (!pmd) - goto out_oom; - set_pgd(&pgd[i], __pgd(1 + __pa((u64)((u32)pmd)))); - } -- return pgd; - -+ return pgd; - out_oom: -+ /* -+ * we don't have to handle the kpmd_cache here, since it's the -+ * last allocation, and has either nothing to free or when it -+ * succeeds the whole operation succeeds. -+ */ - for (i--; i >= 0; i--) - kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1)); - kmem_cache_free(pgd_cache, pgd); -@@ -250,10 +298,27 @@ void pgd_free(pgd_t *pgd) - { - int i; - -- /* in the PAE case user pgd entries are overwritten before usage */ -- if (PTRS_PER_PMD > 1) -- for (i = 0; i < USER_PTRS_PER_PGD; ++i) -- kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1)); - /* in the non-PAE case, clear_page_tables() clears user pgd entries */ -+ if (PTRS_PER_PMD == 1) -+ goto out_free; -+ -+ /* in the PAE case user pgd entries are overwritten before usage */ -+ for (i = 0; i < USER_PTRS_PER_PGD; ++i) { -+ pmd_t *pmd = __va(pgd_val(pgd[i]) - 1); -+ -+ /* -+ * only userspace pmd's are cleared for us -+ * by mm/memory.c; it's a slab cache invariant -+ * that we must separate the kernel pmd slab -+ * all times, else we'll have bad pmd's. 
-+ */ -+ if (TASK_SIZE > PAGE_OFFSET && i == USER_PTRS_PER_PGD - 1) -+ kmem_cache_free(kpmd_cache, pmd); -+ else -+ kmem_cache_free(pmd_cache, pmd); -+ } -+out_free: - kmem_cache_free(pgd_cache, pgd); - } -+ -+EXPORT_SYMBOL(show_mem); -diff -uprN linux-2.6.8.1.orig/arch/i386/pci/fixup.c linux-2.6.8.1-ve022stab078/arch/i386/pci/fixup.c ---- linux-2.6.8.1.orig/arch/i386/pci/fixup.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/pci/fixup.c 2006-05-11 13:05:29.000000000 +0400 -@@ -210,10 +210,7 @@ static void __devinit pci_fixup_transpar - */ - static void __init pci_fixup_nforce2(struct pci_dev *dev) - { -- u32 val, fixed_val; -- u8 rev; -- -- pci_read_config_byte(dev, PCI_REVISION_ID, &rev); -+ u32 val; - - /* - * Chip Old value New value -@@ -223,17 +220,14 @@ static void __init pci_fixup_nforce2(str - * Northbridge chip version may be determined by - * reading the PCI revision ID (0xC1 or greater is C18D). - */ -- fixed_val = rev < 0xC1 ? 0x1F01FF01 : 0x9F01FF01; -- - pci_read_config_dword(dev, 0x6c, &val); - - /* -- * Apply fixup only if C1 Halt Disconnect is enabled -- * (bit28) because it is not supported on some boards. 
-+ * Apply fixup if needed, but don't touch disconnect state - */ -- if ((val & (1 << 28)) && val != fixed_val) { -+ if ((val & 0x00FF0000) != 0x00010000) { - printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n"); -- pci_write_config_dword(dev, 0x6c, fixed_val); -+ pci_write_config_dword(dev, 0x6c, (val & 0xFF00FFFF) | 0x00010000); - } - } - -diff -uprN linux-2.6.8.1.orig/arch/i386/power/cpu.c linux-2.6.8.1-ve022stab078/arch/i386/power/cpu.c ---- linux-2.6.8.1.orig/arch/i386/power/cpu.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/i386/power/cpu.c 2006-05-11 13:05:38.000000000 +0400 -@@ -83,9 +83,7 @@ do_fpu_end(void) - static void fix_processor_context(void) - { - int cpu = smp_processor_id(); -- struct tss_struct * t = init_tss + cpu; - -- set_tss_desc(cpu,t); /* This just modifies memory; should not be necessary. But... This is necessary, because 386 hardware has concept of busy TSS or some similar stupidity. */ - cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; - - load_TR_desc(); /* This does ltr */ -diff -uprN linux-2.6.8.1.orig/arch/ia64/hp/common/sba_iommu.c linux-2.6.8.1-ve022stab078/arch/ia64/hp/common/sba_iommu.c ---- linux-2.6.8.1.orig/arch/ia64/hp/common/sba_iommu.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/hp/common/sba_iommu.c 2006-05-11 13:05:30.000000000 +0400 -@@ -475,7 +475,7 @@ sba_search_bitmap(struct ioc *ioc, unsig - * purges IOTLB entries in power-of-two sizes, so we also - * allocate IOVA space in power-of-two sizes. 
- */ -- bits_wanted = 1UL << get_iovp_order(bits_wanted << PAGE_SHIFT); -+ bits_wanted = 1UL << get_iovp_order(bits_wanted << iovp_shift); - - if (likely(bits_wanted == 1)) { - unsigned int bitshiftcnt; -@@ -684,7 +684,7 @@ sba_free_range(struct ioc *ioc, dma_addr - unsigned long m; - - /* Round up to power-of-two size: see AR2305 note above */ -- bits_not_wanted = 1UL << get_iovp_order(bits_not_wanted << PAGE_SHIFT); -+ bits_not_wanted = 1UL << get_iovp_order(bits_not_wanted << iovp_shift); - for (; bits_not_wanted > 0 ; res_ptr++) { - - if (unlikely(bits_not_wanted > BITS_PER_LONG)) { -@@ -757,7 +757,7 @@ sba_io_pdir_entry(u64 *pdir_ptr, unsigne - #ifdef ENABLE_MARK_CLEAN - /** - * Since DMA is i-cache coherent, any (complete) pages that were written via -- * DMA can be marked as "clean" so that update_mmu_cache() doesn't have to -+ * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to - * flush them when they get mapped into an executable vm-area. - */ - static void -diff -uprN linux-2.6.8.1.orig/arch/ia64/ia32/binfmt_elf32.c linux-2.6.8.1-ve022stab078/arch/ia64/ia32/binfmt_elf32.c ---- linux-2.6.8.1.orig/arch/ia64/ia32/binfmt_elf32.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/ia32/binfmt_elf32.c 2006-05-11 13:05:38.000000000 +0400 -@@ -18,6 +18,8 @@ - #include <asm/param.h> - #include <asm/signal.h> - -+#include <ub/ub_vmpages.h> -+ - #include "ia32priv.h" - #include "elfcore32.h" - -@@ -84,7 +86,11 @@ ia64_elf32_init (struct pt_regs *regs) - vma->vm_ops = &ia32_shared_page_vm_ops; - down_write(¤t->mm->mmap_sem); - { -- insert_vm_struct(current->mm, vma); -+ if (insert_vm_struct(current->mm, vma)) { -+ kmem_cache_free(vm_area_cachep, vma); -+ up_write(¤t->mm->mmap_sem); -+ return; -+ } - } - up_write(¤t->mm->mmap_sem); - } -@@ -93,6 +99,11 @@ ia64_elf32_init (struct pt_regs *regs) - * Install LDT as anonymous memory. 
This gives us all-zero segment descriptors - * until a task modifies them via modify_ldt(). - */ -+ if (ub_memory_charge(mm_ub(current->mm), -+ PAGE_ALIGN(IA32_LDT_ENTRIES * IA32_LDT_ENTRY_SIZE), -+ VM_WRITE, NULL, UB_SOFT)) -+ return; -+ - vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); - if (vma) { - memset(vma, 0, sizeof(*vma)); -@@ -103,10 +114,21 @@ ia64_elf32_init (struct pt_regs *regs) - vma->vm_flags = VM_READ|VM_WRITE|VM_MAYREAD|VM_MAYWRITE; - down_write(¤t->mm->mmap_sem); - { -- insert_vm_struct(current->mm, vma); -+ if (insert_vm_struct(current->mm, vma)) { -+ kmem_cache_free(vm_area_cachep, vma); -+ up_write(¤t->mm->mmap_sem); -+ ub_memory_uncharge(mm_ub(current->mm), -+ PAGE_ALIGN(IA32_LDT_ENTRIES * -+ IA32_LDT_ENTRY_SIZE), -+ VM_WRITE, NULL); -+ return; -+ } - } - up_write(¤t->mm->mmap_sem); -- } -+ } else -+ ub_memory_uncharge(mm_ub(current->mm), -+ PAGE_ALIGN(IA32_LDT_ENTRIES * IA32_LDT_ENTRY_SIZE), -+ VM_WRITE, NULL); - - ia64_psr(regs)->ac = 0; /* turn off alignment checking */ - regs->loadrs = 0; -@@ -148,10 +170,10 @@ ia64_elf32_init (struct pt_regs *regs) - int - ia32_setup_arg_pages (struct linux_binprm *bprm, int executable_stack) - { -- unsigned long stack_base; -+ unsigned long stack_base, vm_end, vm_start; - struct vm_area_struct *mpnt; - struct mm_struct *mm = current->mm; -- int i; -+ int i, ret; - - stack_base = IA32_STACK_TOP - MAX_ARG_PAGES*PAGE_SIZE; - mm->arg_start = bprm->p + stack_base; -@@ -161,23 +183,29 @@ ia32_setup_arg_pages (struct linux_binpr - bprm->loader += stack_base; - bprm->exec += stack_base; - -+ vm_end = IA32_STACK_TOP; -+ vm_start = PAGE_MASK & (unsigned long)bprm->p; -+ -+ ret = ub_memory_charge(mm_ub(mm), vm_end - vm_start, VM_STACK_FLAGS, -+ NULL, UB_HARD); -+ if (ret) -+ goto out; -+ -+ ret = -ENOMEM; - mpnt = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); - if (!mpnt) -- return -ENOMEM; -+ goto out_uncharge; - -- if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p)) -- >> 
PAGE_SHIFT)) { -- kmem_cache_free(vm_area_cachep, mpnt); -- return -ENOMEM; -- } -+ if (security_vm_enough_memory((vm_end - vm_start) >> PAGE_SHIFT)) -+ goto out_free; - - memset(mpnt, 0, sizeof(*mpnt)); - - down_write(¤t->mm->mmap_sem); - { - mpnt->vm_mm = current->mm; -- mpnt->vm_start = PAGE_MASK & (unsigned long) bprm->p; -- mpnt->vm_end = IA32_STACK_TOP; -+ mpnt->vm_start = vm_start; -+ mpnt->vm_end = vm_end; - if (executable_stack == EXSTACK_ENABLE_X) - mpnt->vm_flags = VM_STACK_FLAGS | VM_EXEC; - else if (executable_stack == EXSTACK_DISABLE_X) -@@ -186,7 +214,8 @@ ia32_setup_arg_pages (struct linux_binpr - mpnt->vm_flags = VM_STACK_FLAGS; - mpnt->vm_page_prot = (mpnt->vm_flags & VM_EXEC)? - PAGE_COPY_EXEC: PAGE_COPY; -- insert_vm_struct(current->mm, mpnt); -+ if ((ret = insert_vm_struct(current->mm, mpnt))) -+ goto out_up; - current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; - } - -@@ -205,6 +234,16 @@ ia32_setup_arg_pages (struct linux_binpr - current->thread.ppl = ia32_init_pp_list(); - - return 0; -+ -+out_up: -+ up_write(¤t->mm->mmap_sem); -+ vm_unacct_memory((vm_end - vm_start) >> PAGE_SHIFT); -+out_free: -+ kmem_cache_free(vm_area_cachep, mpnt); -+out_uncharge: -+ ub_memory_uncharge(mm_ub(mm), vm_end - vm_start, VM_STACK_FLAGS, NULL); -+out: -+ return ret; - } - - static void -diff -uprN linux-2.6.8.1.orig/arch/ia64/ia32/ia32_entry.S linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32_entry.S ---- linux-2.6.8.1.orig/arch/ia64/ia32/ia32_entry.S 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32_entry.S 2006-05-11 13:05:27.000000000 +0400 -@@ -387,7 +387,7 @@ ia32_syscall_table: - data8 sys32_rt_sigaction - data8 sys32_rt_sigprocmask /* 175 */ - data8 sys_rt_sigpending -- data8 sys32_rt_sigtimedwait -+ data8 compat_rt_sigtimedwait - data8 sys32_rt_sigqueueinfo - data8 sys32_rt_sigsuspend - data8 sys32_pread /* 180 */ -diff -uprN linux-2.6.8.1.orig/arch/ia64/ia32/ia32_signal.c 
linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32_signal.c ---- linux-2.6.8.1.orig/arch/ia64/ia32/ia32_signal.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32_signal.c 2006-05-11 13:05:34.000000000 +0400 -@@ -59,19 +59,19 @@ struct rt_sigframe_ia32 - int sig; - int pinfo; - int puc; -- siginfo_t32 info; -+ compat_siginfo_t info; - struct ucontext_ia32 uc; - struct _fpstate_ia32 fpstate; - char retcode[8]; - }; - - int --copy_siginfo_from_user32 (siginfo_t *to, siginfo_t32 *from) -+copy_siginfo_from_user32 (siginfo_t *to, compat_siginfo_t *from) - { - unsigned long tmp; - int err; - -- if (!access_ok(VERIFY_READ, from, sizeof(siginfo_t32))) -+ if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t))) - return -EFAULT; - - err = __get_user(to->si_signo, &from->si_signo); -@@ -110,12 +110,12 @@ copy_siginfo_from_user32 (siginfo_t *to, - } - - int --copy_siginfo_to_user32 (siginfo_t32 *to, siginfo_t *from) -+copy_siginfo_to_user32 (compat_siginfo_t *to, siginfo_t *from) - { - unsigned int addr; - int err; - -- if (!access_ok(VERIFY_WRITE, to, sizeof(siginfo_t32))) -+ if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t))) - return -EFAULT; - - /* If you change siginfo_t structure, please be sure -@@ -459,7 +459,7 @@ ia32_rt_sigsuspend (compat_sigset_t *use - sigset_t oldset, set; - - scr->scratch_unat = 0; /* avoid leaking kernel bits to user level */ -- memset(&set, 0, sizeof(&set)); -+ memset(&set, 0, sizeof(set)); - - if (sigsetsize > sizeof(sigset_t)) - return -EINVAL; -@@ -505,6 +505,7 @@ sys32_signal (int sig, unsigned int hand - - sigact_set_handler(&new_sa, handler, 0); - new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK; -+ sigemptyset(&new_sa.sa.sa_mask); - - ret = do_sigaction(sig, &new_sa, &old_sa); - -@@ -574,33 +575,7 @@ sys32_rt_sigprocmask (int how, compat_si - } - - asmlinkage long --sys32_rt_sigtimedwait (compat_sigset_t *uthese, siginfo_t32 *uinfo, -- struct compat_timespec *uts, unsigned int sigsetsize) --{ -- 
extern int copy_siginfo_to_user32 (siginfo_t32 *, siginfo_t *); -- mm_segment_t old_fs = get_fs(); -- struct timespec t; -- siginfo_t info; -- sigset_t s; -- int ret; -- -- if (copy_from_user(&s.sig, uthese, sizeof(compat_sigset_t))) -- return -EFAULT; -- if (uts && get_compat_timespec(&t, uts)) -- return -EFAULT; -- set_fs(KERNEL_DS); -- ret = sys_rt_sigtimedwait(&s, uinfo ? &info : NULL, uts ? &t : NULL, -- sigsetsize); -- set_fs(old_fs); -- if (ret >= 0 && uinfo) { -- if (copy_siginfo_to_user32(uinfo, &info)) -- return -EFAULT; -- } -- return ret; --} -- --asmlinkage long --sys32_rt_sigqueueinfo (int pid, int sig, siginfo_t32 *uinfo) -+sys32_rt_sigqueueinfo (int pid, int sig, compat_siginfo_t *uinfo) - { - mm_segment_t old_fs = get_fs(); - siginfo_t info; -diff -uprN linux-2.6.8.1.orig/arch/ia64/ia32/ia32priv.h linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32priv.h ---- linux-2.6.8.1.orig/arch/ia64/ia32/ia32priv.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/ia32/ia32priv.h 2006-05-11 13:05:27.000000000 +0400 -@@ -229,7 +229,7 @@ typedef union sigval32 { - - #define SIGEV_PAD_SIZE32 ((SIGEV_MAX_SIZE/sizeof(int)) - 3) - --typedef struct siginfo32 { -+typedef struct compat_siginfo { - int si_signo; - int si_errno; - int si_code; -@@ -279,7 +279,7 @@ typedef struct siginfo32 { - int _fd; - } _sigpoll; - } _sifields; --} siginfo_t32; -+} compat_siginfo_t; - - typedef struct sigevent32 { - sigval_t32 sigev_value; -diff -uprN linux-2.6.8.1.orig/arch/ia64/ia32/sys_ia32.c linux-2.6.8.1-ve022stab078/arch/ia64/ia32/sys_ia32.c ---- linux-2.6.8.1.orig/arch/ia64/ia32/sys_ia32.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/ia32/sys_ia32.c 2006-05-11 13:05:42.000000000 +0400 -@@ -770,7 +770,7 @@ emulate_mmap (struct file *file, unsigne - ia32_set_pp((unsigned int)start, (unsigned int)end, flags); - if (start > pstart) { - if (flags & MAP_SHARED) -- printk(KERN_INFO -+ ve_printk(VE_LOG, KERN_INFO - "%s(%d): 
emulate_mmap() can't share head (addr=0x%lx)\n", - current->comm, current->pid, start); - ret = mmap_subpage(file, start, min(PAGE_ALIGN(start), end), prot, flags, -@@ -783,7 +783,7 @@ emulate_mmap (struct file *file, unsigne - } - if (end < pend) { - if (flags & MAP_SHARED) -- printk(KERN_INFO -+ ve_printk(VE_LOG, KERN_INFO - "%s(%d): emulate_mmap() can't share tail (end=0x%lx)\n", - current->comm, current->pid, end); - ret = mmap_subpage(file, max(start, PAGE_START(end)), end, prot, flags, -@@ -814,7 +814,7 @@ emulate_mmap (struct file *file, unsigne - is_congruent = (flags & MAP_ANONYMOUS) || (offset_in_page(poff) == 0); - - if ((flags & MAP_SHARED) && !is_congruent) -- printk(KERN_INFO "%s(%d): emulate_mmap() can't share contents of incongruent mmap " -+ ve_printk(VE_LOG, KERN_INFO "%s(%d): emulate_mmap() can't share contents of incongruent mmap " - "(addr=0x%lx,off=0x%llx)\n", current->comm, current->pid, start, off); - - DBG("mmap_body: mapping [0x%lx-0x%lx) %s with poff 0x%llx\n", pstart, pend, -@@ -1521,7 +1521,7 @@ getreg (struct task_struct *child, int r - return __USER_DS; - case PT_CS: return __USER_CS; - default: -- printk(KERN_ERR "ia32.getreg(): unknown register %d\n", regno); -+ ve_printk(VE_LOG, KERN_ERR "ia32.getreg(): unknown register %d\n", regno); - break; - } - return 0; -@@ -1547,18 +1547,18 @@ putreg (struct task_struct *child, int r - case PT_EFL: child->thread.eflag = value; break; - case PT_DS: case PT_ES: case PT_FS: case PT_GS: case PT_SS: - if (value != __USER_DS) -- printk(KERN_ERR -+ ve_printk(VE_LOG, KERN_ERR - "ia32.putreg: attempt to set invalid segment register %d = %x\n", - regno, value); - break; - case PT_CS: - if (value != __USER_CS) -- printk(KERN_ERR -+ ve_printk(VE_LOG, KERN_ERR - "ia32.putreg: attempt to to set invalid segment register %d = %x\n", - regno, value); - break; - default: -- printk(KERN_ERR "ia32.putreg: unknown register %d\n", regno); -+ ve_printk(VE_LOG, KERN_ERR "ia32.putreg: unknown register %d\n", regno); 
- break; - } - } -@@ -1799,7 +1799,7 @@ sys32_ptrace (int request, pid_t pid, un - - ret = -ESRCH; - read_lock(&tasklist_lock); -- child = find_task_by_pid(pid); -+ child = find_task_by_pid_ve(pid); - if (child) - get_task_struct(child); - read_unlock(&tasklist_lock); -@@ -2419,7 +2419,7 @@ sys32_sendfile (int out_fd, int in_fd, i - ret = sys_sendfile(out_fd, in_fd, offset ? &of : NULL, count); - set_fs(old_fs); - -- if (!ret && offset && put_user(of, offset)) -+ if (offset && put_user(of, offset)) - return -EFAULT; - - return ret; -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/acpi.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/acpi.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/acpi.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/acpi.c 2006-05-11 13:05:30.000000000 +0400 -@@ -430,8 +430,9 @@ acpi_numa_arch_fixup (void) - { - int i, j, node_from, node_to; - -- /* If there's no SRAT, fix the phys_id */ -+ /* If there's no SRAT, fix the phys_id and mark node 0 online */ - if (srat_num_cpus == 0) { -+ node_set_online(0); - node_cpuid[0].phys_id = hard_smp_processor_id(); - return; - } -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/asm-offsets.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/asm-offsets.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/asm-offsets.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/asm-offsets.c 2006-05-11 13:05:40.000000000 +0400 -@@ -38,11 +38,21 @@ void foo(void) - DEFINE(IA64_TASK_CLEAR_CHILD_TID_OFFSET,offsetof (struct task_struct, clear_child_tid)); - DEFINE(IA64_TASK_GROUP_LEADER_OFFSET, offsetof (struct task_struct, group_leader)); - DEFINE(IA64_TASK_PENDING_OFFSET,offsetof (struct task_struct, pending)); -+#ifdef CONFIG_VE -+ DEFINE(IA64_TASK_PID_OFFSET, offsetof -+ (struct task_struct, pids[PIDTYPE_PID].vnr)); -+#else - DEFINE(IA64_TASK_PID_OFFSET, offsetof (struct task_struct, pid)); -+#endif - DEFINE(IA64_TASK_REAL_PARENT_OFFSET, offsetof (struct task_struct, 
real_parent)); - DEFINE(IA64_TASK_SIGHAND_OFFSET,offsetof (struct task_struct, sighand)); - DEFINE(IA64_TASK_SIGNAL_OFFSET,offsetof (struct task_struct, signal)); -+#ifdef CONFIG_VE -+ DEFINE(IA64_TASK_TGID_OFFSET, offsetof -+ (struct task_struct, pids[PIDTYPE_TGID].vnr)); -+#else - DEFINE(IA64_TASK_TGID_OFFSET, offsetof (struct task_struct, tgid)); -+#endif - DEFINE(IA64_TASK_THREAD_KSP_OFFSET, offsetof (struct task_struct, thread.ksp)); - DEFINE(IA64_TASK_THREAD_ON_USTACK_OFFSET, offsetof (struct task_struct, thread.on_ustack)); - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/entry.S linux-2.6.8.1-ve022stab078/arch/ia64/kernel/entry.S ---- linux-2.6.8.1.orig/arch/ia64/kernel/entry.S 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/entry.S 2006-05-11 13:05:43.000000000 +0400 -@@ -51,8 +51,11 @@ - * setup a null register window frame. - */ - ENTRY(ia64_execve) -- .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(3) -- alloc loc1=ar.pfs,3,2,4,0 -+ /* -+ * Allocate 8 input registers since ptrace() may clobber them -+ */ -+ .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) -+ alloc loc1=ar.pfs,8,2,4,0 - mov loc0=rp - .body - mov out0=in0 // filename -@@ -113,8 +116,11 @@ END(ia64_execve) - * u64 tls) - */ - GLOBAL_ENTRY(sys_clone2) -- .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(6) -- alloc r16=ar.pfs,6,2,6,0 -+ /* -+ * Allocate 8 input registers since ptrace() may clobber them -+ */ -+ .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) -+ alloc r16=ar.pfs,8,2,6,0 - DO_SAVE_SWITCH_STACK - adds r2=PT(R16)+IA64_SWITCH_STACK_SIZE+16,sp - mov loc0=rp -@@ -142,8 +148,11 @@ END(sys_clone2) - * Deprecated. Use sys_clone2() instead. 
- */ - GLOBAL_ENTRY(sys_clone) -- .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(5) -- alloc r16=ar.pfs,5,2,6,0 -+ /* -+ * Allocate 8 input registers since ptrace() may clobber them -+ */ -+ .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) -+ alloc r16=ar.pfs,8,2,6,0 - DO_SAVE_SWITCH_STACK - adds r2=PT(R16)+IA64_SWITCH_STACK_SIZE+16,sp - mov loc0=rp -@@ -1139,7 +1148,7 @@ ENTRY(notify_resume_user) - ;; - (pNonSys) mov out2=0 // out2==0 => not a syscall - .fframe 16 -- .spillpsp ar.unat, 16 // (note that offset is relative to psp+0x10!) -+ .spillsp ar.unat, 16 - st8 [sp]=r9,-16 // allocate space for ar.unat and save it - st8 [out1]=loc1,-8 // save ar.pfs, out1=&sigscratch - .body -@@ -1165,7 +1174,7 @@ GLOBAL_ENTRY(sys_rt_sigsuspend) - adds out2=8,sp // out2=&sigscratch->ar_pfs - ;; - .fframe 16 -- .spillpsp ar.unat, 16 // (note that offset is relative to psp+0x10!) -+ .spillsp ar.unat, 16 - st8 [sp]=r9,-16 // allocate space for ar.unat and save it - st8 [out2]=loc1,-8 // save ar.pfs, out2=&sigscratch - .body -@@ -1183,7 +1192,10 @@ END(sys_rt_sigsuspend) - - ENTRY(sys_rt_sigreturn) - PT_REGS_UNWIND_INFO(0) -- alloc r2=ar.pfs,0,0,1,0 -+ /* -+ * Allocate 8 input registers since ptrace() may clobber them -+ */ -+ alloc r2=ar.pfs,8,0,1,0 - .prologue - PT_REGS_SAVES(16) - adds sp=-16,sp -@@ -1537,5 +1549,19 @@ sys_call_table: - data8 sys_ni_syscall - data8 sys_ni_syscall - data8 sys_ni_syscall -+.rept 1500-1280 -+ data8 sys_ni_syscall // 1280 - 1499 -+.endr -+ data8 sys_fairsched_mknod // 1500 -+ data8 sys_fairsched_rmnod -+ data8 sys_fairsched_chwt -+ data8 sys_fairsched_mvpr -+ data8 sys_fairsched_rate -+ data8 sys_getluid // 1505 -+ data8 sys_setluid -+ data8 sys_setublimit -+ data8 sys_ubstat -+ data8 sys_lchmod -+ data8 sys_lutime // 1510 - - .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/entry.h 
linux-2.6.8.1-ve022stab078/arch/ia64/kernel/entry.h ---- linux-2.6.8.1.orig/arch/ia64/kernel/entry.h 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/entry.h 2006-05-11 13:05:30.000000000 +0400 -@@ -1,14 +1,25 @@ - #include <linux/config.h> - - /* -- * Preserved registers that are shared between code in ivt.S and entry.S. Be -- * careful not to step on these! -+ * Preserved registers that are shared between code in ivt.S and -+ * entry.S. Be careful not to step on these! - */ --#define pLvSys p1 /* set 1 if leave from syscall; otherwise, set 0 */ --#define pKStk p2 /* will leave_{kernel,syscall} return to kernel-stacks? */ --#define pUStk p3 /* will leave_{kernel,syscall} return to user-stacks? */ --#define pSys p4 /* are we processing a (synchronous) system call? */ --#define pNonSys p5 /* complement of pSys */ -+#define PRED_LEAVE_SYSCALL 1 /* TRUE iff leave from syscall */ -+#define PRED_KERNEL_STACK 2 /* returning to kernel-stacks? */ -+#define PRED_USER_STACK 3 /* returning to user-stacks? */ -+#define PRED_SYSCALL 4 /* inside a system call? 
*/ -+#define PRED_NON_SYSCALL 5 /* complement of PRED_SYSCALL */ -+ -+#ifdef __ASSEMBLY__ -+# define PASTE2(x,y) x##y -+# define PASTE(x,y) PASTE2(x,y) -+ -+# define pLvSys PASTE(p,PRED_LEAVE_SYSCALL) -+# define pKStk PASTE(p,PRED_KERNEL_STACK) -+# define pUStk PASTE(p,PRED_USER_STACK) -+# define pSys PASTE(p,PRED_SYSCALL) -+# define pNonSys PASTE(p,PRED_NON_SYSCALL) -+#endif - - #define PT(f) (IA64_PT_REGS_##f##_OFFSET) - #define SW(f) (IA64_SWITCH_STACK_##f##_OFFSET) -@@ -49,7 +60,7 @@ - .spillsp @priunat,SW(AR_UNAT)+16+(off); \ - .spillsp ar.rnat,SW(AR_RNAT)+16+(off); \ - .spillsp ar.bspstore,SW(AR_BSPSTORE)+16+(off); \ -- .spillsp pr,SW(PR)+16+(off)) -+ .spillsp pr,SW(PR)+16+(off) - - #define DO_SAVE_SWITCH_STACK \ - movl r28=1f; \ -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/fsys.S linux-2.6.8.1-ve022stab078/arch/ia64/kernel/fsys.S ---- linux-2.6.8.1.orig/arch/ia64/kernel/fsys.S 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/fsys.S 2006-05-11 13:05:40.000000000 +0400 -@@ -70,6 +70,7 @@ ENTRY(fsys_getpid) - FSYS_RETURN - END(fsys_getpid) - -+#ifndef CONFIG_VE - ENTRY(fsys_getppid) - .prologue - .altrp b6 -@@ -116,6 +117,7 @@ ENTRY(fsys_getppid) - #endif - FSYS_RETURN - END(fsys_getppid) -+#endif - - ENTRY(fsys_set_tid_address) - .prologue -@@ -445,9 +447,9 @@ EX(.fail_efault, ld8 r14=[r33]) // r14 - ;; - - st8 [r2]=r14 // update current->blocked with new mask -- cmpxchg4.acq r14=[r9],r18,ar.ccv // current->thread_info->flags <- r18 -+ cmpxchg4.acq r8=[r9],r18,ar.ccv // current->thread_info->flags <- r18 - ;; -- cmp.ne p6,p0=r17,r14 // update failed? -+ cmp.ne p6,p0=r17,r8 // update failed? 
- (p6) br.cond.spnt.few 1b // yes -> retry - - #ifdef CONFIG_SMP -@@ -597,8 +599,9 @@ GLOBAL_ENTRY(fsys_bubble_down) - ;; - mov rp=r2 // set the real return addr - tbit.z p8,p0=r3,TIF_SYSCALL_TRACE -- --(p8) br.call.sptk.many b6=b6 // ignore this return addr -+ ;; -+(p10) br.cond.spnt.many ia64_ret_from_syscall // p10==true means out registers are more than 8 -+(p8) br.call.sptk.many b6=b6 // ignore this return addr - br.cond.sptk ia64_trace_syscall - END(fsys_bubble_down) - -@@ -626,7 +629,11 @@ fsyscall_table: - data8 0 // chown - data8 0 // lseek // 1040 - data8 fsys_getpid // getpid -+#ifdef CONFIG_VE -+ data8 0 // getppid -+#else - data8 fsys_getppid // getppid -+#endif - data8 0 // mount - data8 0 // umount - data8 0 // setuid // 1045 -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/gate.S linux-2.6.8.1-ve022stab078/arch/ia64/kernel/gate.S ---- linux-2.6.8.1.orig/arch/ia64/kernel/gate.S 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/gate.S 2006-05-11 13:05:35.000000000 +0400 -@@ -81,6 +81,7 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) - LOAD_FSYSCALL_TABLE(r14) - - mov r16=IA64_KR(CURRENT) // 12 cycle read latency -+ tnat.nz p10,p9=r15 - mov r19=NR_syscalls-1 - ;; - shladd r18=r17,3,r14 -@@ -119,7 +120,8 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) - #endif - - mov r10=-1 -- mov r8=ENOSYS -+(p10) mov r8=EINVAL -+(p9) mov r8=ENOSYS - FSYS_RETURN - END(__kernel_syscall_via_epc) - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/irq.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/irq.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/irq.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/irq.c 2006-05-11 13:05:38.000000000 +0400 -@@ -56,6 +56,8 @@ - #include <asm/delay.h> - #include <asm/irq.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_task.h> - - /* - * Linux has a controller-independent x86 interrupt architecture. 
-@@ -256,15 +258,18 @@ int handle_IRQ_event(unsigned int irq, - { - int status = 1; /* Force the "do bottom halves" bit */ - int retval = 0; -+ struct user_beancounter *ub; - - if (!(action->flags & SA_INTERRUPT)) - local_irq_enable(); - -+ ub = set_exec_ub(get_ub0()); - do { - status |= action->flags; - retval |= action->handler(irq, action->dev_id, regs); - action = action->next; - } while (action); -+ (void)set_exec_ub(ub); - if (status & SA_SAMPLE_RANDOM) - add_interrupt_randomness(irq); - local_irq_disable(); -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/irq_ia64.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/irq_ia64.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/irq_ia64.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/irq_ia64.c 2006-05-11 13:05:40.000000000 +0400 -@@ -101,6 +101,7 @@ void - ia64_handle_irq (ia64_vector vector, struct pt_regs *regs) - { - unsigned long saved_tpr; -+ struct ve_struct *ve; - - #if IRQ_DEBUG - { -@@ -137,6 +138,12 @@ ia64_handle_irq (ia64_vector vector, str - * 16 (without this, it would be ~240, which could easily lead - * to kernel stack overflows). - */ -+ -+#ifdef CONFIG_HOTPLUG_CPU -+#warning "Fix fixup_irqs & ia64_process_pending_intr to set correct env and ub!" -+#endif -+ -+ ve = set_exec_env(get_ve0()); - irq_enter(); - saved_tpr = ia64_getreg(_IA64_REG_CR_TPR); - ia64_srlz_d(); -@@ -162,6 +169,7 @@ ia64_handle_irq (ia64_vector vector, str - * come through until ia64_eoi() has been done. 
- */ - irq_exit(); -+ (void)set_exec_env(ve); - } - - #ifdef CONFIG_HOTPLUG_CPU -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/ivt.S linux-2.6.8.1-ve022stab078/arch/ia64/kernel/ivt.S ---- linux-2.6.8.1.orig/arch/ia64/kernel/ivt.S 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/ivt.S 2006-05-11 13:05:35.000000000 +0400 -@@ -51,6 +51,7 @@ - #include <asm/system.h> - #include <asm/thread_info.h> - #include <asm/unistd.h> -+#include <asm/errno.h> - - #if 1 - # define PSR_DEFAULT_BITS psr.ac -@@ -732,10 +733,12 @@ ENTRY(break_fault) - ssm psr.ic | PSR_DEFAULT_BITS - ;; - srlz.i // guarantee that interruption collection is on -+ mov r3=NR_syscalls - 1 - ;; - (p15) ssm psr.i // restore psr.i -+ // p10==true means out registers are more than 8 or r15's Nat is true -+(p10) br.cond.spnt.many ia64_ret_from_syscall - ;; -- mov r3=NR_syscalls - 1 - movl r16=sys_call_table - - adds r15=-1024,r15 // r15 contains the syscall number---subtract 1024 -@@ -836,8 +839,11 @@ END(interrupt) - * On exit: - * - executing on bank 1 registers - * - psr.ic enabled, interrupts restored -+ * - p10: TRUE if syscall is invoked with more than 8 out -+ * registers or r15's Nat is true - * - r1: kernel's gp - * - r3: preserved (same as on entry) -+ * - r8: -EINVAL if p10 is true - * - r12: points to kernel stack - * - r13: points to current task - * - p15: TRUE if interrupts need to be re-enabled -@@ -852,7 +858,7 @@ GLOBAL_ENTRY(ia64_syscall_setup) - add r17=PT(R11),r1 // initialize second base pointer - ;; - alloc r19=ar.pfs,8,0,0,0 // ensure in0-in7 are writable -- st8 [r16]=r29,PT(CR_IFS)-PT(CR_IPSR) // save cr.ipsr -+ st8 [r16]=r29,PT(AR_PFS)-PT(CR_IPSR) // save cr.ipsr - tnat.nz p8,p0=in0 - - st8.spill [r17]=r11,PT(CR_IIP)-PT(R11) // save r11 -@@ -860,31 +866,36 @@ GLOBAL_ENTRY(ia64_syscall_setup) - (pKStk) mov r18=r0 // make sure r18 isn't NaT - ;; - -+ st8 [r16]=r26,PT(CR_IFS)-PT(AR_PFS) // save ar.pfs - st8 [r17]=r28,PT(AR_UNAT)-PT(CR_IIP) // save 
cr.iip - mov r28=b0 // save b0 (2 cyc) --(p8) mov in0=-1 - ;; - -- st8 [r16]=r0,PT(AR_PFS)-PT(CR_IFS) // clear cr.ifs - st8 [r17]=r25,PT(AR_RSC)-PT(AR_UNAT) // save ar.unat --(p9) mov in1=-1 -+ dep r19=0,r19,38,26 // clear all bits but 0..37 [I0] -+(p8) mov in0=-1 - ;; - -- st8 [r16]=r26,PT(AR_RNAT)-PT(AR_PFS) // save ar.pfs -+ st8 [r16]=r19,PT(AR_RNAT)-PT(CR_IFS) // store ar.pfs.pfm in cr.ifs -+ extr.u r11=r19,7,7 // I0 // get sol of ar.pfs -+ and r8=0x7f,r19 // A // get sof of ar.pfs -+ - st8 [r17]=r27,PT(AR_BSPSTORE)-PT(AR_RSC)// save ar.rsc -- tnat.nz p10,p0=in2 -+ tbit.nz p15,p0=r29,IA64_PSR_I_BIT // I0 -+(p9) mov in1=-1 -+ ;; - - (pUStk) sub r18=r18,r22 // r18=RSE.ndirty*8 -- tbit.nz p15,p0=r29,IA64_PSR_I_BIT -- tnat.nz p11,p0=in3 -+ tnat.nz p10,p0=in2 -+ add r11=8,r11 - ;; - (pKStk) adds r16=PT(PR)-PT(AR_RNAT),r16 // skip over ar_rnat field - (pKStk) adds r17=PT(B0)-PT(AR_BSPSTORE),r17 // skip over ar_bspstore field -+ tnat.nz p11,p0=in3 -+ ;; - (p10) mov in2=-1 -- -+ tnat.nz p12,p0=in4 // [I0] - (p11) mov in3=-1 -- tnat.nz p12,p0=in4 -- tnat.nz p13,p0=in5 - ;; - (pUStk) st8 [r16]=r24,PT(PR)-PT(AR_RNAT) // save ar.rnat - (pUStk) st8 [r17]=r23,PT(B0)-PT(AR_BSPSTORE) // save ar.bspstore -@@ -892,36 +903,41 @@ GLOBAL_ENTRY(ia64_syscall_setup) - ;; - st8 [r16]=r31,PT(LOADRS)-PT(PR) // save predicates - st8 [r17]=r28,PT(R1)-PT(B0) // save b0 --(p12) mov in4=-1 -+ tnat.nz p13,p0=in5 // [I0] - ;; - st8 [r16]=r18,PT(R12)-PT(LOADRS) // save ar.rsc value for "loadrs" - st8.spill [r17]=r20,PT(R13)-PT(R1) // save original r1 --(p13) mov in5=-1 -+(p12) mov in4=-1 - ;; - - .mem.offset 0,0; st8.spill [r16]=r12,PT(AR_FPSR)-PT(R12) // save r12 - .mem.offset 8,0; st8.spill [r17]=r13,PT(R15)-PT(R13) // save r13 -- tnat.nz p14,p0=in6 -+(p13) mov in5=-1 - ;; - st8 [r16]=r21,PT(R8)-PT(AR_FPSR) // save ar.fpsr -- st8.spill [r17]=r15 // save r15 -- tnat.nz p8,p0=in7 -+ tnat.nz p14,p0=in6 -+ cmp.lt p10,p9=r11,r8 // frame size can't be more than local+8 - ;; - stf8 [r16]=f1 // ensure 
pt_regs.r8 != 0 (see handle_syscall_error) -+(p9) tnat.nz p10,p0=r15 - adds r12=-16,r1 // switch to kernel memory stack (with 16 bytes of scratch) --(p14) mov in6=-1 -+ -+ st8.spill [r17]=r15 // save r15 -+ tnat.nz p8,p0=in7 -+ nop.i 0 - - mov r13=r2 // establish `current' - movl r1=__gp // establish kernel global pointer - ;; -+(p14) mov in6=-1 - (p8) mov in7=-1 -- tnat.nz p9,p0=r15 -+ nop.i 0 - - cmp.eq pSys,pNonSys=r0,r0 // set pSys=1, pNonSys=0 - movl r17=FPSR_DEFAULT - ;; - mov.m ar.fpsr=r17 // set ar.fpsr to kernel default value --(p9) mov r15=-1 -+(p10) mov r8=-EINVAL - br.ret.sptk.many b7 - END(ia64_syscall_setup) - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/mca.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/mca.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/mca.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/mca.c 2006-05-11 13:05:40.000000000 +0400 -@@ -501,13 +501,13 @@ init_handler_platform (pal_min_state_are - #endif - { - struct task_struct *g, *t; -- do_each_thread (g, t) { -+ do_each_thread_all(g, t) { - if (t == current) - continue; - - printk("\nBacktrace of pid %d (%s)\n", t->pid, t->comm); - show_stack(t, NULL); -- } while_each_thread (g, t); -+ } while_each_thread_all(g, t); - } - #ifdef CONFIG_SMP - if (!tasklist_lock.write_lock) -@@ -691,6 +691,7 @@ ia64_mca_wakeup_ipi_wait(void) - irr = ia64_getreg(_IA64_REG_CR_IRR3); - break; - } -+ cpu_relax(); - } while (!(irr & (1UL << irr_bit))) ; - } - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/perfmon.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/perfmon.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/perfmon.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/perfmon.c 2006-05-11 13:05:40.000000000 +0400 -@@ -2582,7 +2582,7 @@ pfm_task_incompatible(pfm_context_t *ctx - return -EINVAL; - } - -- if (task->state == TASK_ZOMBIE) { -+ if (task->exit_state == EXIT_ZOMBIE) { - DPRINT(("cannot attach to zombie task [%d]\n", 
task->pid)); - return -EBUSY; - } -@@ -2619,7 +2619,7 @@ pfm_get_task(pfm_context_t *ctx, pid_t p - - read_lock(&tasklist_lock); - -- p = find_task_by_pid(pid); -+ p = find_task_by_pid_ve(pid); - - /* make sure task cannot go away while we operate on it */ - if (p) get_task_struct(p); -@@ -4177,12 +4177,12 @@ pfm_check_task_exist(pfm_context_t *ctx) - - read_lock(&tasklist_lock); - -- do_each_thread (g, t) { -+ do_each_thread_ve(g, t) { - if (t->thread.pfm_context == ctx) { - ret = 0; - break; - } -- } while_each_thread (g, t); -+ } while_each_thread_ve(g, t); - - read_unlock(&tasklist_lock); - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/process.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/process.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/process.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/process.c 2006-05-11 13:05:40.000000000 +0400 -@@ -185,6 +185,8 @@ default_idle (void) - while (!need_resched()) - if (pal_halt && !pmu_active) - safe_halt(); -+ else -+ cpu_relax(); - } - - #ifdef CONFIG_HOTPLUG_CPU -@@ -601,7 +603,7 @@ dump_fpu (struct pt_regs *pt, elf_fpregs - return 1; /* f0-f31 are always valid so we always return 1 */ - } - --asmlinkage long -+long - sys_execve (char *filename, char **argv, char **envp, struct pt_regs *regs) - { - int error; -@@ -626,6 +628,13 @@ kernel_thread (int (*fn)(void *), void * - struct pt_regs pt; - } regs; - -+ /* Don't allow kernel_thread() inside VE */ -+ if (!ve_is_super(get_exec_env())) { -+ printk("kernel_thread call inside VE\n"); -+ dump_stack(); -+ return -EPERM; -+ } -+ - memset(®s, 0, sizeof(regs)); - regs.pt.cr_iip = helper_fptr[0]; /* set entry point (IP) */ - regs.pt.r1 = helper_fptr[1]; /* set GP */ -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/ptrace.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/ptrace.c 2006-05-11 
13:05:49.000000000 +0400 -@@ -1,7 +1,7 @@ - /* - * Kernel support for the ptrace() and syscall tracing interfaces. - * -- * Copyright (C) 1999-2003 Hewlett-Packard Co -+ * Copyright (C) 1999-2004 Hewlett-Packard Co - * David Mosberger-Tang <davidm@hpl.hp.com> - * - * Derived from the x86 and Alpha versions. Most of the code in here -@@ -31,9 +31,6 @@ - - #include "entry.h" - --#define p4 (1UL << 4) /* for pSys (see entry.h) */ --#define p5 (1UL << 5) /* for pNonSys (see entry.h) */ -- - /* - * Bits in the PSR that we allow ptrace() to change: - * be, up, ac, mfl, mfh (the user mask; five bits total) -@@ -304,7 +301,6 @@ put_rnat (struct task_struct *task, stru - long num_regs, nbits; - struct pt_regs *pt; - unsigned long cfm, *urbs_kargs; -- struct unw_frame_info info; - - pt = ia64_task_regs(task); - kbsp = (unsigned long *) sw->ar_bspstore; -@@ -316,11 +312,8 @@ put_rnat (struct task_struct *task, stru - * If entered via syscall, don't allow user to set rnat bits - * for syscall args. 
- */ -- unw_init_from_blocked_task(&info,task); -- if (unw_unwind_to_user(&info) == 0) { -- unw_get_cfm(&info,&cfm); -- urbs_kargs = ia64_rse_skip_regs(urbs_end,-(cfm & 0x7f)); -- } -+ cfm = pt->cr_ifs; -+ urbs_kargs = ia64_rse_skip_regs(urbs_end, -(cfm & 0x7f)); - } - - if (urbs_kargs >= urnat_addr) -@@ -480,27 +473,18 @@ ia64_poke (struct task_struct *child, st - unsigned long - ia64_get_user_rbs_end (struct task_struct *child, struct pt_regs *pt, unsigned long *cfmp) - { -- unsigned long *krbs, *bspstore, cfm; -- struct unw_frame_info info; -+ unsigned long *krbs, *bspstore, cfm = pt->cr_ifs; - long ndirty; - - krbs = (unsigned long *) child + IA64_RBS_OFFSET/8; - bspstore = (unsigned long *) pt->ar_bspstore; - ndirty = ia64_rse_num_regs(krbs, krbs + (pt->loadrs >> 19)); -- cfm = pt->cr_ifs & ~(1UL << 63); - -- if (in_syscall(pt)) { -- /* -- * If bit 63 of cr.ifs is cleared, the kernel was entered via a system -- * call and we need to recover the CFM that existed on entry to the -- * kernel by unwinding the kernel stack. 
-- */ -- unw_init_from_blocked_task(&info, child); -- if (unw_unwind_to_user(&info) == 0) { -- unw_get_cfm(&info, &cfm); -- ndirty += (cfm & 0x7f); -- } -- } -+ if (in_syscall(pt)) -+ ndirty += (cfm & 0x7f); -+ else -+ cfm &= ~(1UL << 63); /* clear valid bit */ -+ - if (cfmp) - *cfmp = cfm; - return (unsigned long) ia64_rse_skip_regs(bspstore, ndirty); -@@ -591,7 +575,7 @@ find_thread_for_addr (struct task_struct - goto out; - } while ((p = next_thread(p)) != child); - -- do_each_thread(g, p) { -+ do_each_thread_ve(g, p) { - if (child->mm != mm) - continue; - -@@ -599,7 +583,7 @@ find_thread_for_addr (struct task_struct - child = p; - goto out; - } -- } while_each_thread(g, p); -+ } while_each_thread_ve(g, p); - out: - mmput(mm); - return child; -@@ -682,8 +666,8 @@ convert_to_non_syscall (struct task_stru - } - - unw_get_pr(&prev_info, &pr); -- pr &= ~pSys; -- pr |= pNonSys; -+ pr &= ~(1UL << PRED_SYSCALL); -+ pr |= (1UL << PRED_NON_SYSCALL); - unw_set_pr(&prev_info, pr); - - pt->cr_ifs = (1UL << 63) | cfm; -@@ -854,6 +838,13 @@ access_uarea (struct task_struct *child, - *data = (pt->cr_ipsr & IPSR_READ_MASK); - return 0; - -+ case PT_AR_RSC: -+ if (write_access) -+ pt->ar_rsc = *data | (3 << 2); /* force PL3 */ -+ else -+ *data = pt->ar_rsc; -+ return 0; -+ - case PT_AR_RNAT: - urbs_end = ia64_get_user_rbs_end(child, pt, NULL); - rnat_addr = (long) ia64_rse_rnat_addr((long *) urbs_end); -@@ -909,9 +900,6 @@ access_uarea (struct task_struct *child, - ptr = (unsigned long *) - ((long) pt + offsetof(struct pt_regs, ar_bspstore)); - break; -- case PT_AR_RSC: -- ptr = (unsigned long *) ((long) pt + offsetof(struct pt_regs, ar_rsc)); -- break; - case PT_AR_UNAT: - ptr = (unsigned long *) ((long) pt + offsetof(struct pt_regs, ar_unat)); - break; -@@ -997,12 +985,14 @@ access_uarea (struct task_struct *child, - } - - static long --ptrace_getregs (struct task_struct *child, struct pt_all_user_regs *ppr) -+ptrace_getregs (struct task_struct *child, struct pt_all_user_regs 
__user *ppr) - { -+ unsigned long psr, ec, lc, rnat, bsp, cfm, nat_bits, val; -+ struct unw_frame_info info; -+ struct ia64_fpreg fpval; - struct switch_stack *sw; - struct pt_regs *pt; - long ret, retval; -- struct unw_frame_info info; - char nat = 0; - int i; - -@@ -1023,12 +1013,21 @@ ptrace_getregs (struct task_struct *chil - return -EIO; - } - -+ if (access_uarea(child, PT_CR_IPSR, &psr, 0) < 0 -+ || access_uarea(child, PT_AR_EC, &ec, 0) < 0 -+ || access_uarea(child, PT_AR_LC, &lc, 0) < 0 -+ || access_uarea(child, PT_AR_RNAT, &rnat, 0) < 0 -+ || access_uarea(child, PT_AR_BSP, &bsp, 0) < 0 -+ || access_uarea(child, PT_CFM, &cfm, 0) -+ || access_uarea(child, PT_NAT_BITS, &nat_bits, 0)) -+ return -EIO; -+ - retval = 0; - - /* control regs */ - - retval |= __put_user(pt->cr_iip, &ppr->cr_iip); -- retval |= access_uarea(child, PT_CR_IPSR, &ppr->cr_ipsr, 0); -+ retval |= __put_user(psr, &ppr->cr_ipsr); - - /* app regs */ - -@@ -1039,11 +1038,11 @@ ptrace_getregs (struct task_struct *chil - retval |= __put_user(pt->ar_ccv, &ppr->ar[PT_AUR_CCV]); - retval |= __put_user(pt->ar_fpsr, &ppr->ar[PT_AUR_FPSR]); - -- retval |= access_uarea(child, PT_AR_EC, &ppr->ar[PT_AUR_EC], 0); -- retval |= access_uarea(child, PT_AR_LC, &ppr->ar[PT_AUR_LC], 0); -- retval |= access_uarea(child, PT_AR_RNAT, &ppr->ar[PT_AUR_RNAT], 0); -- retval |= access_uarea(child, PT_AR_BSP, &ppr->ar[PT_AUR_BSP], 0); -- retval |= access_uarea(child, PT_CFM, &ppr->cfm, 0); -+ retval |= __put_user(ec, &ppr->ar[PT_AUR_EC]); -+ retval |= __put_user(lc, &ppr->ar[PT_AUR_LC]); -+ retval |= __put_user(rnat, &ppr->ar[PT_AUR_RNAT]); -+ retval |= __put_user(bsp, &ppr->ar[PT_AUR_BSP]); -+ retval |= __put_user(cfm, &ppr->cfm); - - /* gr1-gr3 */ - -@@ -1053,7 +1052,9 @@ ptrace_getregs (struct task_struct *chil - /* gr4-gr7 */ - - for (i = 4; i < 8; i++) { -- retval |= unw_access_gr(&info, i, &ppr->gr[i], &nat, 0); -+ if (unw_access_gr(&info, i, &val, &nat, 0) < 0) -+ return -EIO; -+ retval |= __put_user(val, 
&ppr->gr[i]); - } - - /* gr8-gr11 */ -@@ -1077,7 +1078,9 @@ ptrace_getregs (struct task_struct *chil - /* b1-b5 */ - - for (i = 1; i < 6; i++) { -- retval |= unw_access_br(&info, i, &ppr->br[i], 0); -+ if (unw_access_br(&info, i, &val, 0) < 0) -+ return -EIO; -+ __put_user(val, &ppr->br[i]); - } - - /* b6-b7 */ -@@ -1088,8 +1091,9 @@ ptrace_getregs (struct task_struct *chil - /* fr2-fr5 */ - - for (i = 2; i < 6; i++) { -- retval |= access_fr(&info, i, 0, (unsigned long *) &ppr->fr[i], 0); -- retval |= access_fr(&info, i, 1, (unsigned long *) &ppr->fr[i] + 1, 0); -+ if (unw_get_fr(&info, i, &fpval) < 0) -+ return -EIO; -+ retval |= __copy_to_user(&ppr->fr[i], &fpval, sizeof (fpval)); - } - - /* fr6-fr11 */ -@@ -1103,8 +1107,9 @@ ptrace_getregs (struct task_struct *chil - /* fr16-fr31 */ - - for (i = 16; i < 32; i++) { -- retval |= access_fr(&info, i, 0, (unsigned long *) &ppr->fr[i], 0); -- retval |= access_fr(&info, i, 1, (unsigned long *) &ppr->fr[i] + 1, 0); -+ if (unw_get_fr(&info, i, &fpval) < 0) -+ return -EIO; -+ retval |= __copy_to_user(&ppr->fr[i], &fpval, sizeof (fpval)); - } - - /* fph */ -@@ -1118,22 +1123,25 @@ ptrace_getregs (struct task_struct *chil - - /* nat bits */ - -- retval |= access_uarea(child, PT_NAT_BITS, &ppr->nat, 0); -+ retval |= __put_user(nat_bits, &ppr->nat); - - ret = retval ? 
-EIO : 0; - return ret; - } - - static long --ptrace_setregs (struct task_struct *child, struct pt_all_user_regs *ppr) -+ptrace_setregs (struct task_struct *child, struct pt_all_user_regs __user *ppr) - { -+ unsigned long psr, rsc, ec, lc, rnat, bsp, cfm, nat_bits, val = 0; -+ struct unw_frame_info info; - struct switch_stack *sw; -+ struct ia64_fpreg fpval; - struct pt_regs *pt; - long ret, retval; -- struct unw_frame_info info; -- char nat = 0; - int i; - -+ memset(&fpval, 0, sizeof(fpval)); -+ - retval = verify_area(VERIFY_READ, ppr, sizeof(struct pt_all_user_regs)); - if (retval != 0) { - return -EIO; -@@ -1156,22 +1164,22 @@ ptrace_setregs (struct task_struct *chil - /* control regs */ - - retval |= __get_user(pt->cr_iip, &ppr->cr_iip); -- retval |= access_uarea(child, PT_CR_IPSR, &ppr->cr_ipsr, 1); -+ retval |= __get_user(psr, &ppr->cr_ipsr); - - /* app regs */ - - retval |= __get_user(pt->ar_pfs, &ppr->ar[PT_AUR_PFS]); -- retval |= __get_user(pt->ar_rsc, &ppr->ar[PT_AUR_RSC]); -+ retval |= __get_user(rsc, &ppr->ar[PT_AUR_RSC]); - retval |= __get_user(pt->ar_bspstore, &ppr->ar[PT_AUR_BSPSTORE]); - retval |= __get_user(pt->ar_unat, &ppr->ar[PT_AUR_UNAT]); - retval |= __get_user(pt->ar_ccv, &ppr->ar[PT_AUR_CCV]); - retval |= __get_user(pt->ar_fpsr, &ppr->ar[PT_AUR_FPSR]); - -- retval |= access_uarea(child, PT_AR_EC, &ppr->ar[PT_AUR_EC], 1); -- retval |= access_uarea(child, PT_AR_LC, &ppr->ar[PT_AUR_LC], 1); -- retval |= access_uarea(child, PT_AR_RNAT, &ppr->ar[PT_AUR_RNAT], 1); -- retval |= access_uarea(child, PT_AR_BSP, &ppr->ar[PT_AUR_BSP], 1); -- retval |= access_uarea(child, PT_CFM, &ppr->cfm, 1); -+ retval |= __get_user(ec, &ppr->ar[PT_AUR_EC]); -+ retval |= __get_user(lc, &ppr->ar[PT_AUR_LC]); -+ retval |= __get_user(rnat, &ppr->ar[PT_AUR_RNAT]); -+ retval |= __get_user(bsp, &ppr->ar[PT_AUR_BSP]); -+ retval |= __get_user(cfm, &ppr->cfm); - - /* gr1-gr3 */ - -@@ -1181,11 +1189,9 @@ ptrace_setregs (struct task_struct *chil - /* gr4-gr7 */ - - for (i = 4; i 
< 8; i++) { -- long ret = unw_get_gr(&info, i, &ppr->gr[i], &nat); -- if (ret < 0) { -- return ret; -- } -- retval |= unw_access_gr(&info, i, &ppr->gr[i], &nat, 1); -+ retval |= __get_user(val, &ppr->gr[i]); -+ if (unw_set_gr(&info, i, val, 0) < 0) /* NaT bit will be set via PT_NAT_BITS */ -+ return -EIO; - } - - /* gr8-gr11 */ -@@ -1209,7 +1215,8 @@ ptrace_setregs (struct task_struct *chil - /* b1-b5 */ - - for (i = 1; i < 6; i++) { -- retval |= unw_access_br(&info, i, &ppr->br[i], 1); -+ retval |= __get_user(val, &ppr->br[i]); -+ unw_set_br(&info, i, val); - } - - /* b6-b7 */ -@@ -1220,8 +1227,9 @@ ptrace_setregs (struct task_struct *chil - /* fr2-fr5 */ - - for (i = 2; i < 6; i++) { -- retval |= access_fr(&info, i, 0, (unsigned long *) &ppr->fr[i], 1); -- retval |= access_fr(&info, i, 1, (unsigned long *) &ppr->fr[i] + 1, 1); -+ retval |= __copy_from_user(&fpval, &ppr->fr[i], sizeof(fpval)); -+ if (unw_set_fr(&info, i, fpval) < 0) -+ return -EIO; - } - - /* fr6-fr11 */ -@@ -1235,8 +1243,9 @@ ptrace_setregs (struct task_struct *chil - /* fr16-fr31 */ - - for (i = 16; i < 32; i++) { -- retval |= access_fr(&info, i, 0, (unsigned long *) &ppr->fr[i], 1); -- retval |= access_fr(&info, i, 1, (unsigned long *) &ppr->fr[i] + 1, 1); -+ retval |= __copy_from_user(&fpval, &ppr->fr[i], sizeof(fpval)); -+ if (unw_set_fr(&info, i, fpval) < 0) -+ return -EIO; - } - - /* fph */ -@@ -1250,7 +1259,16 @@ ptrace_setregs (struct task_struct *chil - - /* nat bits */ - -- retval |= access_uarea(child, PT_NAT_BITS, &ppr->nat, 1); -+ retval |= __get_user(nat_bits, &ppr->nat); -+ -+ retval |= access_uarea(child, PT_CR_IPSR, &psr, 1); -+ retval |= access_uarea(child, PT_AR_RSC, &rsc, 1); -+ retval |= access_uarea(child, PT_AR_EC, &ec, 1); -+ retval |= access_uarea(child, PT_AR_LC, &lc, 1); -+ retval |= access_uarea(child, PT_AR_RNAT, &rnat, 1); -+ retval |= access_uarea(child, PT_AR_BSP, &bsp, 1); -+ retval |= access_uarea(child, PT_CFM, &cfm, 1); -+ retval |= access_uarea(child, 
PT_NAT_BITS, &nat_bits, 1); - - ret = retval ? -EIO : 0; - return ret; -@@ -1300,7 +1318,7 @@ sys_ptrace (long request, pid_t pid, uns - ret = -ESRCH; - read_lock(&tasklist_lock); - { -- child = find_task_by_pid(pid); -+ child = find_task_by_pid_ve(pid); - if (child) { - if (peek_or_poke) - child = find_thread_for_addr(child, addr); -@@ -1393,7 +1411,7 @@ sys_ptrace (long request, pid_t pid, uns - * sigkill. Perhaps it should be put in the status - * that it wants to exit. - */ -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - goto out_tsk; - child->exit_code = SIGKILL; - -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/salinfo.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/salinfo.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/salinfo.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/salinfo.c 2006-05-11 13:05:30.000000000 +0400 -@@ -417,7 +417,12 @@ retry: - - if (!data->saved_num) - call_on_cpu(cpu, salinfo_log_read_cpu, data); -- data->state = data->log_size ? 
STATE_LOG_RECORD : STATE_NO_DATA; -+ if (!data->log_size) { -+ data->state = STATE_NO_DATA; -+ clear_bit(cpu, &data->cpu_event); -+ } else { -+ data->state = STATE_LOG_RECORD; -+ } - } - - static ssize_t -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/signal.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/signal.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/signal.c 2006-05-11 13:05:40.000000000 +0400 -@@ -95,7 +95,7 @@ sys_sigaltstack (const stack_t *uss, sta - static long - restore_sigcontext (struct sigcontext *sc, struct sigscratch *scr) - { -- unsigned long ip, flags, nat, um, cfm; -+ unsigned long ip, flags, nat, um, cfm, rsc; - long err; - - /* Always make any pending restarted system calls return -EINTR */ -@@ -107,7 +107,7 @@ restore_sigcontext (struct sigcontext *s - err |= __get_user(ip, &sc->sc_ip); /* instruction pointer */ - err |= __get_user(cfm, &sc->sc_cfm); - err |= __get_user(um, &sc->sc_um); /* user mask */ -- err |= __get_user(scr->pt.ar_rsc, &sc->sc_ar_rsc); -+ err |= __get_user(rsc, &sc->sc_ar_rsc); - err |= __get_user(scr->pt.ar_unat, &sc->sc_ar_unat); - err |= __get_user(scr->pt.ar_fpsr, &sc->sc_ar_fpsr); - err |= __get_user(scr->pt.ar_pfs, &sc->sc_ar_pfs); -@@ -120,6 +120,7 @@ restore_sigcontext (struct sigcontext *s - err |= __copy_from_user(&scr->pt.r15, &sc->sc_gr[15], 8); /* r15 */ - - scr->pt.cr_ifs = cfm | (1UL << 63); -+ scr->pt.ar_rsc = rsc | (3 << 2); /* force PL3 */ - - /* establish new instruction pointer: */ - scr->pt.cr_iip = ip & ~0x3UL; -@@ -267,7 +268,7 @@ ia64_rt_sigreturn (struct sigscratch *sc - si.si_signo = SIGSEGV; - si.si_errno = 0; - si.si_code = SI_KERNEL; -- si.si_pid = current->pid; -+ si.si_pid = virt_pid(current); - si.si_uid = current->uid; - si.si_addr = sc; - force_sig_info(SIGSEGV, &si, current); -@@ -290,12 +291,10 @@ setup_sigcontext (struct sigcontext *sc, - - if (on_sig_stack((unsigned long) sc)) - flags |= 
IA64_SC_FLAG_ONSTACK; -- if ((ifs & (1UL << 63)) == 0) { -- /* if cr_ifs isn't valid, we got here through a syscall */ -+ if ((ifs & (1UL << 63)) == 0) -+ /* if cr_ifs doesn't have the valid bit set, we got here through a syscall */ - flags |= IA64_SC_FLAG_IN_SYSCALL; -- cfm = scr->ar_pfs & ((1UL << 38) - 1); -- } else -- cfm = ifs & ((1UL << 38) - 1); -+ cfm = ifs & ((1UL << 38) - 1); - ia64_flush_fph(current); - if ((current->thread.flags & IA64_THREAD_FPH_VALID)) { - flags |= IA64_SC_FLAG_FPH_VALID; -@@ -429,7 +428,7 @@ setup_frame (int sig, struct k_sigaction - si.si_signo = SIGSEGV; - si.si_errno = 0; - si.si_code = SI_KERNEL; -- si.si_pid = current->pid; -+ si.si_pid = virt_pid(current); - si.si_uid = current->uid; - si.si_addr = frame; - force_sig_info(SIGSEGV, &si, current); -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/smp.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/smp.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/smp.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/smp.c 2006-05-11 13:05:30.000000000 +0400 -@@ -290,11 +290,11 @@ smp_call_function_single (int cpuid, voi - - /* Wait for response */ - while (atomic_read(&data.started) != cpus) -- barrier(); -+ cpu_relax(); - - if (wait) - while (atomic_read(&data.finished) != cpus) -- barrier(); -+ cpu_relax(); - call_data = NULL; - - spin_unlock_bh(&call_lock); -@@ -349,11 +349,11 @@ smp_call_function (void (*func) (void *i - - /* Wait for response */ - while (atomic_read(&data.started) != cpus) -- barrier(); -+ cpu_relax(); - - if (wait) - while (atomic_read(&data.finished) != cpus) -- barrier(); -+ cpu_relax(); - call_data = NULL; - - spin_unlock(&call_lock); -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/smpboot.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/smpboot.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/smpboot.c 2004-08-14 14:54:52.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/smpboot.c 2006-05-11 13:05:40.000000000 +0400 -@@ -363,7 +363,7 
@@ fork_by_hand (void) - * Don't care about the IP and regs settings since we'll never reschedule the - * forked task. - */ -- return copy_process(CLONE_VM|CLONE_IDLETASK, 0, 0, 0, NULL, NULL); -+ return copy_process(CLONE_VM|CLONE_IDLETASK, 0, 0, 0, NULL, NULL, 0); - } - - struct create_idle { -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/time.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/time.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/time.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/time.c 2006-05-11 13:05:40.000000000 +0400 -@@ -36,6 +36,9 @@ u64 jiffies_64 = INITIAL_JIFFIES; - - EXPORT_SYMBOL(jiffies_64); - -+unsigned int cpu_khz; /* TSC clocks / usec, not used here */ -+EXPORT_SYMBOL(cpu_khz); -+ - #define TIME_KEEPER_ID 0 /* smp_processor_id() of time-keeper */ - - #ifdef CONFIG_IA64_DEBUG_IRQ -@@ -389,6 +392,8 @@ ia64_init_itm (void) - register_time_interpolator(&itc_interpolator); - } - -+ cpu_khz = local_cpu_data->proc_freq / 1000; -+ - /* Setup the CPU local timer tick */ - ia64_cpu_local_tick(); - } -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/traps.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/traps.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/traps.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/traps.c 2006-05-11 13:05:24.000000000 +0400 -@@ -35,34 +35,6 @@ trap_init (void) - fpswa_interface = __va(ia64_boot_param->fpswa); - } - --/* -- * Unlock any spinlocks which will prevent us from getting the message out (timerlist_lock -- * is acquired through the console unblank code) -- */ --void --bust_spinlocks (int yes) --{ -- int loglevel_save = console_loglevel; -- -- if (yes) { -- oops_in_progress = 1; -- return; -- } -- --#ifdef CONFIG_VT -- unblank_screen(); --#endif -- oops_in_progress = 0; -- /* -- * OK, the message is on the console. Now we call printk() without -- * oops_in_progress set so that printk will give klogd a poke. Hold onto -- * your hats... 
-- */ -- console_loglevel = 15; /* NMI oopser may have shut the console up */ -- printk(" "); -- console_loglevel = loglevel_save; --} -- - void - die (const char *str, struct pt_regs *regs, long err) - { -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/unaligned.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/unaligned.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/unaligned.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/unaligned.c 2006-05-11 13:05:40.000000000 +0400 -@@ -24,7 +24,7 @@ - #include <asm/uaccess.h> - #include <asm/unaligned.h> - --extern void die_if_kernel(char *str, struct pt_regs *regs, long err) __attribute__ ((noreturn)); -+extern void die_if_kernel(char *str, struct pt_regs *regs, long err); - - #undef DEBUG_UNALIGNED_TRAP - -@@ -1281,7 +1281,7 @@ within_logging_rate_limit (void) - { - static unsigned long count, last_time; - -- if (jiffies - last_time > 5*HZ) -+ if (jiffies - last_time > 60*HZ) - count = 0; - if (++count < 5) { - last_time = jiffies; -@@ -1339,7 +1339,7 @@ ia64_handle_unaligned (unsigned long ifa - if (user_mode(regs)) - tty_write_message(current->signal->tty, buf); - buf[len-1] = '\0'; /* drop '\r' */ -- printk(KERN_WARNING "%s", buf); /* watch for command names containing %s */ -+ ve_printk(VE_LOG, KERN_WARNING "%s", buf); /* watch for command names containing %s */ - } - } else { - if (within_logging_rate_limit()) -diff -uprN linux-2.6.8.1.orig/arch/ia64/kernel/unwind.c linux-2.6.8.1-ve022stab078/arch/ia64/kernel/unwind.c ---- linux-2.6.8.1.orig/arch/ia64/kernel/unwind.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/kernel/unwind.c 2006-05-11 13:05:30.000000000 +0400 -@@ -48,7 +48,6 @@ - #include "unwind_i.h" - - #define MIN(a,b) ((a) < (b) ? 
(a) : (b)) --#define p5 5 - - #define UNW_LOG_CACHE_SIZE 7 /* each unw_script is ~256 bytes in size */ - #define UNW_CACHE_SIZE (1 << UNW_LOG_CACHE_SIZE) -@@ -365,7 +364,7 @@ unw_access_gr (struct unw_frame_info *in - if (info->pri_unat_loc) - nat_addr = info->pri_unat_loc; - else -- nat_addr = &info->sw->ar_unat; -+ nat_addr = &info->sw->caller_unat; - nat_mask = (1UL << ((long) addr & 0x1f8)/8); - } - } else { -@@ -527,7 +526,7 @@ unw_access_ar (struct unw_frame_info *in - case UNW_AR_UNAT: - addr = info->unat_loc; - if (!addr) -- addr = &info->sw->ar_unat; -+ addr = &info->sw->caller_unat; - break; - - case UNW_AR_LC: -@@ -1787,7 +1786,7 @@ run_script (struct unw_script *script, s - - case UNW_INSN_SETNAT_MEMSTK: - if (!state->pri_unat_loc) -- state->pri_unat_loc = &state->sw->ar_unat; -+ state->pri_unat_loc = &state->sw->caller_unat; - /* register off. is a multiple of 8, so the least 3 bits (type) are 0 */ - s[dst+1] = ((unsigned long) state->pri_unat_loc - s[dst]) | UNW_NAT_MEMSTK; - break; -@@ -1905,7 +1904,7 @@ unw_unwind (struct unw_frame_info *info) - num_regs = 0; - if ((info->flags & UNW_FLAG_INTERRUPT_FRAME)) { - info->pt = info->sp + 16; -- if ((pr & (1UL << pNonSys)) != 0) -+ if ((pr & (1UL << PRED_NON_SYSCALL)) != 0) - num_regs = *info->cfm_loc & 0x7f; /* size of frame */ - info->pfs_loc = - (unsigned long *) (info->pt + offsetof(struct pt_regs, ar_pfs)); -@@ -1951,20 +1950,30 @@ EXPORT_SYMBOL(unw_unwind); - int - unw_unwind_to_user (struct unw_frame_info *info) - { -- unsigned long ip; -+ unsigned long ip, sp, pr = 0; - - while (unw_unwind(info) >= 0) { -- if (unw_get_rp(info, &ip) < 0) { -- unw_get_ip(info, &ip); -- UNW_DPRINT(0, "unwind.%s: failed to read return pointer (ip=0x%lx)\n", -- __FUNCTION__, ip); -- return -1; -+ unw_get_sp(info, &sp); -+ if ((long)((unsigned long)info->task + IA64_STK_OFFSET - sp) -+ < IA64_PT_REGS_SIZE) { -+ UNW_DPRINT(0, "unwind.%s: ran off the top of the kernel stack\n", -+ __FUNCTION__); -+ break; - } -- if (ip < 
FIXADDR_USER_END) -+ if (unw_is_intr_frame(info) && -+ (pr & (1UL << PRED_USER_STACK))) - return 0; -+ if (unw_get_pr (info, &pr) < 0) { -+ unw_get_rp(info, &ip); -+ UNW_DPRINT(0, "unwind.%s: failed to read " -+ "predicate register (ip=0x%lx)\n", -+ __FUNCTION__, ip); -+ return -1; -+ } - } - unw_get_ip(info, &ip); -- UNW_DPRINT(0, "unwind.%s: failed to unwind to user-level (ip=0x%lx)\n", __FUNCTION__, ip); -+ UNW_DPRINT(0, "unwind.%s: failed to unwind to user-level (ip=0x%lx)\n", -+ __FUNCTION__, ip); - return -1; - } - EXPORT_SYMBOL(unw_unwind_to_user); -@@ -2239,11 +2248,11 @@ unw_init (void) - if (8*sizeof(unw_hash_index_t) < UNW_LOG_HASH_SIZE) - unw_hash_index_t_is_too_narrow(); - -- unw.sw_off[unw.preg_index[UNW_REG_PRI_UNAT_GR]] = SW(AR_UNAT); -+ unw.sw_off[unw.preg_index[UNW_REG_PRI_UNAT_GR]] = SW(CALLER_UNAT); - unw.sw_off[unw.preg_index[UNW_REG_BSPSTORE]] = SW(AR_BSPSTORE); -- unw.sw_off[unw.preg_index[UNW_REG_PFS]] = SW(AR_UNAT); -+ unw.sw_off[unw.preg_index[UNW_REG_PFS]] = SW(AR_PFS); - unw.sw_off[unw.preg_index[UNW_REG_RP]] = SW(B0); -- unw.sw_off[unw.preg_index[UNW_REG_UNAT]] = SW(AR_UNAT); -+ unw.sw_off[unw.preg_index[UNW_REG_UNAT]] = SW(CALLER_UNAT); - unw.sw_off[unw.preg_index[UNW_REG_PR]] = SW(PR); - unw.sw_off[unw.preg_index[UNW_REG_LC]] = SW(AR_LC); - unw.sw_off[unw.preg_index[UNW_REG_FPSR]] = SW(AR_FPSR); -diff -uprN linux-2.6.8.1.orig/arch/ia64/lib/memcpy_mck.S linux-2.6.8.1-ve022stab078/arch/ia64/lib/memcpy_mck.S ---- linux-2.6.8.1.orig/arch/ia64/lib/memcpy_mck.S 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/lib/memcpy_mck.S 2006-05-11 13:05:30.000000000 +0400 -@@ -309,7 +309,7 @@ EK(.ex_handler, (p[D]) st8 [dst1] = t15, - add src_pre_mem=0,src0 // prefetch src pointer - add dst_pre_mem=0,dst0 // prefetch dest pointer - and src0=-8,src0 // 1st src pointer --(p7) mov ar.lc = r21 -+(p7) mov ar.lc = cnt - (p8) mov ar.lc = r0 - ;; - TEXT_ALIGN(32) -@@ -634,8 +634,11 @@ END(memcpy) - clrrrb - ;; - alloc 
saved_pfs_stack=ar.pfs,3,3,3,0 -+ cmp.lt p8,p0=A,r0 - sub B = dst0, saved_in0 // how many byte copied so far - ;; -+(p8) mov A = 0; // A shouldn't be negative, cap it -+ ;; - sub C = A, B - sub D = saved_in2, A - ;; -diff -uprN linux-2.6.8.1.orig/arch/ia64/lib/swiotlb.c linux-2.6.8.1-ve022stab078/arch/ia64/lib/swiotlb.c ---- linux-2.6.8.1.orig/arch/ia64/lib/swiotlb.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/lib/swiotlb.c 2006-05-11 13:05:30.000000000 +0400 -@@ -337,7 +337,7 @@ swiotlb_map_single (struct device *hwdev - - /* - * Since DMA is i-cache coherent, any (complete) pages that were written via -- * DMA can be marked as "clean" so that update_mmu_cache() doesn't have to -+ * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to - * flush them when they get mapped into an executable vm-area. - */ - static void -diff -uprN linux-2.6.8.1.orig/arch/ia64/mm/contig.c linux-2.6.8.1-ve022stab078/arch/ia64/mm/contig.c ---- linux-2.6.8.1.orig/arch/ia64/mm/contig.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/mm/contig.c 2006-05-11 13:05:40.000000000 +0400 -@@ -19,6 +19,7 @@ - #include <linux/efi.h> - #include <linux/mm.h> - #include <linux/swap.h> -+#include <linux/module.h> - - #include <asm/meminit.h> - #include <asm/pgalloc.h> -@@ -297,3 +298,5 @@ paging_init (void) - #endif /* !CONFIG_VIRTUAL_MEM_MAP */ - zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); - } -+ -+EXPORT_SYMBOL(show_mem); -diff -uprN linux-2.6.8.1.orig/arch/ia64/mm/discontig.c linux-2.6.8.1-ve022stab078/arch/ia64/mm/discontig.c ---- linux-2.6.8.1.orig/arch/ia64/mm/discontig.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/mm/discontig.c 2006-05-11 13:05:40.000000000 +0400 -@@ -21,6 +21,7 @@ - #include <asm/meminit.h> - #include <asm/numa.h> - #include <asm/sections.h> -+#include <linux/module.h> - - /* - * Track per-node information needed to setup the boot memory 
allocator, the -@@ -671,3 +672,5 @@ void paging_init(void) - - zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); - } -+ -+EXPORT_SYMBOL(show_mem); -diff -uprN linux-2.6.8.1.orig/arch/ia64/mm/fault.c linux-2.6.8.1-ve022stab078/arch/ia64/mm/fault.c ---- linux-2.6.8.1.orig/arch/ia64/mm/fault.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/mm/fault.c 2006-05-11 13:05:38.000000000 +0400 -@@ -16,6 +16,8 @@ - #include <asm/uaccess.h> - #include <asm/hardirq.h> - -+#include <ub/beancounter.h> -+ - extern void die (char *, struct pt_regs *, long); - - /* -@@ -36,6 +38,11 @@ expand_backing_store (struct vm_area_str - if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur - || (((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)) - return -ENOMEM; -+ -+ if (ub_memory_charge(mm_ub(vma->vm_mm), PAGE_SIZE, -+ vma->vm_flags, vma->vm_file, UB_HARD)) -+ return -ENOMEM; -+ - vma->vm_end += PAGE_SIZE; - vma->vm_mm->total_vm += grow; - if (vma->vm_flags & VM_LOCKED) -@@ -213,9 +220,6 @@ ia64_do_page_fault (unsigned long addres - return; - } - -- if (ia64_done_with_exception(regs)) -- return; -- - /* - * Since we have no vma's for region 5, we might get here even if the address is - * valid, due to the VHPT walker inserting a non present translation that becomes -@@ -226,6 +230,9 @@ ia64_do_page_fault (unsigned long addres - if (REGION_NUMBER(address) == 5 && mapped_kernel_page_is_present(address)) - return; - -+ if (ia64_done_with_exception(regs)) -+ return; -+ - /* - * Oops. The kernel tried to access some bad page. We'll have to terminate things - * with extreme prejudice. 
-@@ -244,13 +251,13 @@ ia64_do_page_fault (unsigned long addres - - out_of_memory: - up_read(&mm->mmap_sem); -- if (current->pid == 1) { -- yield(); -- down_read(&mm->mmap_sem); -- goto survive; -- } -- printk(KERN_CRIT "VM: killing process %s\n", current->comm); -- if (user_mode(regs)) -- do_exit(SIGKILL); -+ if (user_mode(regs)) { -+ /* -+ * 0-order allocation always success if something really -+ * fatal not happen: beancounter overdraft or OOM. Den -+ */ -+ force_sig(SIGKILL, current); -+ return; -+ } - goto no_context; - } -diff -uprN linux-2.6.8.1.orig/arch/ia64/mm/init.c linux-2.6.8.1-ve022stab078/arch/ia64/mm/init.c ---- linux-2.6.8.1.orig/arch/ia64/mm/init.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/mm/init.c 2006-05-11 13:05:38.000000000 +0400 -@@ -37,6 +37,8 @@ - #include <asm/unistd.h> - #include <asm/mca.h> - -+#include <ub/ub_vmpages.h> -+ - DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); - - extern void ia64_tlb_init (void); -@@ -76,7 +78,7 @@ check_pgt_cache (void) - } - - void --update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte) -+lazy_mmu_prot_update (pte_t pte) - { - unsigned long addr; - struct page *page; -@@ -85,7 +87,6 @@ update_mmu_cache (struct vm_area_struct - return; /* not an executable page... */ - - page = pte_page(pte); -- /* don't use VADDR: it may not be mapped on this CPU (or may have just been flushed): */ - addr = (unsigned long) page_address(page); - - if (test_bit(PG_arch_1, &page->flags)) -@@ -118,6 +119,10 @@ ia64_init_addr_space (void) - - ia64_set_rbs_bot(); - -+ if (ub_memory_charge(mm_ub(current->mm), PAGE_SIZE, -+ VM_DATA_DEFAULT_FLAGS, NULL, UB_SOFT)) -+ return; -+ - /* - * If we're out of memory and kmem_cache_alloc() returns NULL, we simply ignore - * the problem. 
When the process attempts to write to the register backing store -@@ -131,8 +136,18 @@ ia64_init_addr_space (void) - vma->vm_end = vma->vm_start + PAGE_SIZE; - vma->vm_page_prot = protection_map[VM_DATA_DEFAULT_FLAGS & 0x7]; - vma->vm_flags = VM_DATA_DEFAULT_FLAGS | VM_GROWSUP; -- insert_vm_struct(current->mm, vma); -- } -+ down_write(&current->mm->mmap_sem); -+ if (insert_vm_struct(current->mm, vma)) { -+ up_write(&current->mm->mmap_sem); -+ kmem_cache_free(vm_area_cachep, vma); -+ ub_memory_uncharge(mm_ub(current->mm), PAGE_SIZE, -+ VM_DATA_DEFAULT_FLAGS, NULL); -+ return; -+ } -+ up_write(&current->mm->mmap_sem); -+ } else -+ ub_memory_uncharge(mm_ub(current->mm), PAGE_SIZE, -+ VM_DATA_DEFAULT_FLAGS, NULL); - - /* map NaT-page at address zero to speed up speculative dereferencing of NULL: */ - if (!(current->personality & MMAP_PAGE_ZERO)) { -@@ -143,7 +158,13 @@ ia64_init_addr_space (void) - vma->vm_end = PAGE_SIZE; - vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT); - vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED; -- insert_vm_struct(current->mm, vma); -+ down_write(&current->mm->mmap_sem); -+ if (insert_vm_struct(current->mm, vma)) { -+ up_write(&current->mm->mmap_sem); -+ kmem_cache_free(vm_area_cachep, vma); -+ return; -+ } -+ up_write(&current->mm->mmap_sem); - } - } - } -@@ -260,8 +281,9 @@ setup_gate (void) - struct page *page; - - /* -- * Map the gate page twice: once read-only to export the ELF headers etc. and once -- * execute-only page to enable privilege-promotion via "epc": -+ * Map the gate page twice: once read-only to export the ELF -+ * headers etc. 
and once execute-only page to enable -+ * privilege-promotion via "epc": - */ - page = virt_to_page(ia64_imva(__start_gate_section)); - put_kernel_page(page, GATE_ADDR, PAGE_READONLY); -@@ -270,6 +292,20 @@ setup_gate (void) - put_kernel_page(page, GATE_ADDR + PAGE_SIZE, PAGE_GATE); - #else - put_kernel_page(page, GATE_ADDR + PERCPU_PAGE_SIZE, PAGE_GATE); -+ /* Fill in the holes (if any) with read-only zero pages: */ -+ { -+ unsigned long addr; -+ -+ for (addr = GATE_ADDR + PAGE_SIZE; -+ addr < GATE_ADDR + PERCPU_PAGE_SIZE; -+ addr += PAGE_SIZE) -+ { -+ put_kernel_page(ZERO_PAGE(0), addr, -+ PAGE_READONLY); -+ put_kernel_page(ZERO_PAGE(0), addr + PERCPU_PAGE_SIZE, -+ PAGE_READONLY); -+ } -+ } - #endif - ia64_patch_gate(); - } -diff -uprN linux-2.6.8.1.orig/arch/ia64/mm/tlb.c linux-2.6.8.1-ve022stab078/arch/ia64/mm/tlb.c ---- linux-2.6.8.1.orig/arch/ia64/mm/tlb.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/mm/tlb.c 2006-05-11 13:05:40.000000000 +0400 -@@ -57,7 +57,7 @@ wrap_mmu_context (struct mm_struct *mm) - - read_lock(&tasklist_lock); - repeat: -- for_each_process(tsk) { -+ for_each_process_all(tsk) { - if (!tsk->mm) - continue; - tsk_context = tsk->mm->context; -diff -uprN linux-2.6.8.1.orig/arch/ia64/pci/pci.c linux-2.6.8.1-ve022stab078/arch/ia64/pci/pci.c ---- linux-2.6.8.1.orig/arch/ia64/pci/pci.c 2004-08-14 14:55:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/pci/pci.c 2006-05-11 13:05:31.000000000 +0400 -@@ -55,13 +55,13 @@ struct pci_fixup pcibios_fixups[1]; - */ - - #define PCI_SAL_ADDRESS(seg, bus, devfn, reg) \ -- ((u64)(seg << 24) | (u64)(bus << 16) | \ -+ ((u64)((u64) seg << 24) | (u64)(bus << 16) | \ - (u64)(devfn << 8) | (u64)(reg)) - - /* SAL 3.2 adds support for extended config space. 
*/ - - #define PCI_SAL_EXT_ADDRESS(seg, bus, devfn, reg) \ -- ((u64)(seg << 28) | (u64)(bus << 20) | \ -+ ((u64)((u64) seg << 28) | (u64)(bus << 20) | \ - (u64)(devfn << 12) | (u64)(reg)) - - static int -diff -uprN linux-2.6.8.1.orig/arch/ia64/sn/io/hwgfs/ramfs.c linux-2.6.8.1-ve022stab078/arch/ia64/sn/io/hwgfs/ramfs.c ---- linux-2.6.8.1.orig/arch/ia64/sn/io/hwgfs/ramfs.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ia64/sn/io/hwgfs/ramfs.c 2006-05-11 13:05:32.000000000 +0400 -@@ -97,7 +97,7 @@ static int hwgfs_symlink(struct inode * - inode = hwgfs_get_inode(dir->i_sb, S_IFLNK|S_IRWXUGO, 0); - if (inode) { - int l = strlen(symname)+1; -- error = page_symlink(inode, symname, l); -+ error = page_symlink(inode, symname, l, GFP_KERNEL); - if (!error) { - d_instantiate(dentry, inode); - dget(dentry); -diff -uprN linux-2.6.8.1.orig/arch/m68k/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/m68k/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/m68k/kernel/ptrace.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/m68k/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -277,7 +277,7 @@ asmlinkage int sys_ptrace(long request, - long tmp; - - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. 
*/ -diff -uprN linux-2.6.8.1.orig/arch/m68knommu/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/m68knommu/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/m68knommu/kernel/ptrace.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/m68knommu/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -271,7 +271,7 @@ asmlinkage int sys_ptrace(long request, - long tmp; - - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/irixelf.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/irixelf.c ---- linux-2.6.8.1.orig/arch/mips/kernel/irixelf.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/irixelf.c 2006-05-11 13:05:35.000000000 +0400 -@@ -127,7 +127,9 @@ static void set_brk(unsigned long start, - end = PAGE_ALIGN(end); - if (end <= start) - return; -+ down_write(¤t->mm->mmap_sem); - do_brk(start, end - start); -+ up_write(¤t->mm->mmap_sem); - } - - -@@ -376,7 +378,9 @@ static unsigned int load_irix_interp(str - - /* Map the last of the bss segment */ - if (last_bss > len) { -+ down_write(¤t->mm->mmap_sem); - do_brk(len, (last_bss - len)); -+ up_write(¤t->mm->mmap_sem); - } - kfree(elf_phdata); - -@@ -448,7 +452,12 @@ static inline int look_for_irix_interpre - if (retval < 0) - goto out; - -- file = open_exec(*name); -+ /* -+ * I don't understand this loop. -+ * Are we suppose to break the loop after successful open and -+ * read, or close the file, or store it somewhere? 
--SAW -+ */ -+ file = open_exec(*name, bprm); - if (IS_ERR(file)) { - retval = PTR_ERR(file); - goto out; -@@ -564,7 +573,9 @@ void irix_map_prda_page (void) - unsigned long v; - struct prda *pp; - -+ down_write(¤t->mm->mmap_sem); - v = do_brk (PRDA_ADDRESS, PAGE_SIZE); -+ up_write(¤t->mm->mmap_sem); - - if (v < 0) - return; -@@ -855,8 +866,11 @@ static int load_irix_library(struct file - - len = (elf_phdata->p_filesz + elf_phdata->p_vaddr+ 0xfff) & 0xfffff000; - bss = elf_phdata->p_memsz + elf_phdata->p_vaddr; -- if (bss > len) -+ if (bss > len) { -+ down_write(¤t->mm->mmap_sem); - do_brk(len, bss-len); -+ up_write(¤t->mm->mmap_sem); -+ } - kfree(elf_phdata); - return 0; - } -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/irixsig.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/irixsig.c ---- linux-2.6.8.1.orig/arch/mips/kernel/irixsig.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/irixsig.c 2006-05-11 13:05:25.000000000 +0400 -@@ -184,9 +184,10 @@ asmlinkage int do_irix_signal(sigset_t * - if (!user_mode(regs)) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/mips/kernel/ptrace.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -277,7 +277,7 @@ asmlinkage int sys_ptrace(long request, - */ - case PTRACE_KILL: - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/ptrace32.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/ptrace32.c ---- 
linux-2.6.8.1.orig/arch/mips/kernel/ptrace32.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/ptrace32.c 2006-05-11 13:05:26.000000000 +0400 -@@ -262,7 +262,7 @@ asmlinkage int sys32_ptrace(int request, - */ - case PTRACE_KILL: - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/signal.c ---- linux-2.6.8.1.orig/arch/mips/kernel/signal.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/signal.c 2006-05-11 13:05:25.000000000 +0400 -@@ -556,9 +556,10 @@ asmlinkage int do_signal(sigset_t *oldse - if (!user_mode(regs)) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -diff -uprN linux-2.6.8.1.orig/arch/mips/kernel/signal32.c linux-2.6.8.1-ve022stab078/arch/mips/kernel/signal32.c ---- linux-2.6.8.1.orig/arch/mips/kernel/signal32.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/mips/kernel/signal32.c 2006-05-11 13:05:25.000000000 +0400 -@@ -704,9 +704,10 @@ asmlinkage int do_signal32(sigset_t *old - if (!user_mode(regs)) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -diff -uprN linux-2.6.8.1.orig/arch/parisc/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/parisc/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/parisc/kernel/ptrace.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/parisc/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -303,7 +303,7 @@ 
long sys_ptrace(long request, pid_t pid, - * that it wants to exit. - */ - DBG(("sys_ptrace(KILL)\n")); -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - goto out_tsk; - child->exit_code = SIGKILL; - goto out_wake_notrap; -diff -uprN linux-2.6.8.1.orig/arch/ppc/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/ppc/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/ppc/kernel/ptrace.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ppc/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -377,7 +377,7 @@ int sys_ptrace(long request, long pid, l - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -diff -uprN linux-2.6.8.1.orig/arch/ppc64/boot/zlib.c linux-2.6.8.1-ve022stab078/arch/ppc64/boot/zlib.c ---- linux-2.6.8.1.orig/arch/ppc64/boot/zlib.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ppc64/boot/zlib.c 2006-05-11 13:05:34.000000000 +0400 -@@ -1307,7 +1307,7 @@ local int huft_build( - { - *t = (inflate_huft *)Z_NULL; - *m = 0; -- return Z_OK; -+ return Z_DATA_ERROR; - } - - -@@ -1351,6 +1351,7 @@ local int huft_build( - if ((j = *p++) != 0) - v[x[j]++] = i; - } while (++i < n); -+ n = x[g]; /* set n to length of v */ - - - /* Generate the Huffman codes and for each, make the table entries */ -diff -uprN linux-2.6.8.1.orig/arch/ppc64/kernel/ioctl32.c linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ioctl32.c ---- linux-2.6.8.1.orig/arch/ppc64/kernel/ioctl32.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ioctl32.c 2006-05-11 13:05:29.000000000 +0400 -@@ -41,7 +41,6 @@ IOCTL_TABLE_START - #include <linux/compat_ioctl.h> - #define DECLARES - #include "compat_ioctl.c" --COMPATIBLE_IOCTL(TCSBRKP) - COMPATIBLE_IOCTL(TIOCSTART) - 
COMPATIBLE_IOCTL(TIOCSTOP) - COMPATIBLE_IOCTL(TIOCSLTC) -diff -uprN linux-2.6.8.1.orig/arch/ppc64/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/ppc64/kernel/ptrace.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -182,7 +182,7 @@ int sys_ptrace(long request, long pid, l - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -diff -uprN linux-2.6.8.1.orig/arch/ppc64/kernel/ptrace32.c linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ptrace32.c ---- linux-2.6.8.1.orig/arch/ppc64/kernel/ptrace32.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/ppc64/kernel/ptrace32.c 2006-05-11 13:05:26.000000000 +0400 -@@ -314,7 +314,7 @@ int sys32_ptrace(long request, long pid, - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. 
*/ -diff -uprN linux-2.6.8.1.orig/arch/s390/kernel/compat_exec.c linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_exec.c ---- linux-2.6.8.1.orig/arch/s390/kernel/compat_exec.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_exec.c 2006-05-11 13:05:33.000000000 +0400 -@@ -39,7 +39,7 @@ int setup_arg_pages32(struct linux_binpr - unsigned long stack_base; - struct vm_area_struct *mpnt; - struct mm_struct *mm = current->mm; -- int i; -+ int i, ret; - - stack_base = STACK_TOP - MAX_ARG_PAGES*PAGE_SIZE; - mm->arg_start = bprm->p + stack_base; -@@ -68,7 +68,11 @@ int setup_arg_pages32(struct linux_binpr - /* executable stack setting would be applied here */ - mpnt->vm_page_prot = PAGE_COPY; - mpnt->vm_flags = VM_STACK_FLAGS; -- insert_vm_struct(mm, mpnt); -+ if ((ret = insert_vm_struct(mm, mpnt))) { -+ up_write(&mm->mmap_sem); -+ kmem_cache_free(vm_area_cachep, mpnt); -+ return ret; -+ } - mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; - } - -diff -uprN linux-2.6.8.1.orig/arch/s390/kernel/compat_ioctl.c linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_ioctl.c ---- linux-2.6.8.1.orig/arch/s390/kernel/compat_ioctl.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_ioctl.c 2006-05-11 13:05:29.000000000 +0400 -@@ -65,9 +65,6 @@ COMPATIBLE_IOCTL(BIODASDSATTR) - COMPATIBLE_IOCTL(TAPE390_DISPLAY) - #endif - --/* This one should be architecture independent */ --COMPATIBLE_IOCTL(TCSBRKP) -- - /* s390 doesn't need handlers here */ - COMPATIBLE_IOCTL(TIOCGSERIAL) - COMPATIBLE_IOCTL(TIOCSSERIAL) -diff -uprN linux-2.6.8.1.orig/arch/s390/kernel/compat_signal.c linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_signal.c ---- linux-2.6.8.1.orig/arch/s390/kernel/compat_signal.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/s390/kernel/compat_signal.c 2006-05-11 13:05:34.000000000 +0400 -@@ -245,9 +245,6 @@ sys32_sigaction(int sig, const struct ol - return 
ret; - } - --int --do_sigaction(int sig, const struct k_sigaction *act, struct k_sigaction *oact); -- - asmlinkage long - sys32_rt_sigaction(int sig, const struct sigaction32 __user *act, - struct sigaction32 __user *oact, size_t sigsetsize) -diff -uprN linux-2.6.8.1.orig/arch/s390/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/s390/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/s390/kernel/ptrace.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/s390/kernel/ptrace.c 2006-05-11 13:05:49.000000000 +0400 -@@ -626,7 +626,7 @@ do_ptrace(struct task_struct *child, lon - * perhaps it should be put in the status that it wants to - * exit. - */ -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - return 0; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -diff -uprN linux-2.6.8.1.orig/arch/s390/mm/fault.c linux-2.6.8.1-ve022stab078/arch/s390/mm/fault.c ---- linux-2.6.8.1.orig/arch/s390/mm/fault.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/s390/mm/fault.c 2006-05-11 13:05:24.000000000 +0400 -@@ -61,17 +61,9 @@ void bust_spinlocks(int yes) - if (yes) { - oops_in_progress = 1; - } else { -- int loglevel_save = console_loglevel; - oops_in_progress = 0; - console_unblank(); -- /* -- * OK, the message is on the console. Now we call printk() -- * without oops_in_progress set so that printk will give klogd -- * a poke. Hold onto your hats... 
-- */ -- console_loglevel = 15; -- printk(" "); -- console_loglevel = loglevel_save; -+ wake_up_klogd(); - } - } - -diff -uprN linux-2.6.8.1.orig/arch/sh/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/sh/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/sh/kernel/ptrace.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sh/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -217,7 +217,7 @@ asmlinkage int sys_ptrace(long request, - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/sh/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/sh/kernel/signal.c ---- linux-2.6.8.1.orig/arch/sh/kernel/signal.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sh/kernel/signal.c 2006-05-11 13:05:25.000000000 +0400 -@@ -584,9 +584,10 @@ int do_signal(struct pt_regs *regs, sigs - if (!user_mode(regs)) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -diff -uprN linux-2.6.8.1.orig/arch/sh64/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/sh64/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/sh64/kernel/ptrace.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sh64/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -257,7 +257,7 @@ asmlinkage int sys_ptrace(long request, - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/sh64/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/sh64/kernel/signal.c ---- 
linux-2.6.8.1.orig/arch/sh64/kernel/signal.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sh64/kernel/signal.c 2006-05-11 13:05:25.000000000 +0400 -@@ -705,10 +705,11 @@ int do_signal(struct pt_regs *regs, sigs - if (!user_mode(regs)) - return 1; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -- } -+ if (unlikely(test_thread_flag(TIF_FREEZE))) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; -+ } - - if (!oldset) - oldset = ¤t->blocked; -diff -uprN linux-2.6.8.1.orig/arch/sparc/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/sparc/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/sparc/kernel/ptrace.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sparc/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -567,7 +567,7 @@ asmlinkage void do_ptrace(struct pt_regs - * exit. - */ - case PTRACE_KILL: { -- if (child->state == TASK_ZOMBIE) { /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) { /* already dead */ - pt_succ_return(regs, 0); - goto out_tsk; - } -diff -uprN linux-2.6.8.1.orig/arch/sparc64/kernel/binfmt_aout32.c linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/binfmt_aout32.c ---- linux-2.6.8.1.orig/arch/sparc64/kernel/binfmt_aout32.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/binfmt_aout32.c 2006-05-11 13:05:33.000000000 +0400 -@@ -49,7 +49,9 @@ static void set_brk(unsigned long start, - end = PAGE_ALIGN(end); - if (end <= start) - return; -+ down_write(¤t->mm->mmap_sem); - do_brk(start, end - start); -+ up_write(¤t->mm->mmap_sem); - } - - /* -@@ -246,10 +248,14 @@ static int load_aout32_binary(struct lin - if (N_MAGIC(ex) == NMAGIC) { - loff_t pos = fd_offset; - /* Fuck me plenty... 
*/ -+ down_write(¤t->mm->mmap_sem); - error = do_brk(N_TXTADDR(ex), ex.a_text); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, (char __user *)N_TXTADDR(ex), - ex.a_text, &pos); -+ down_write(¤t->mm->mmap_sem); - error = do_brk(N_DATADDR(ex), ex.a_data); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, (char __user *)N_DATADDR(ex), - ex.a_data, &pos); - goto beyond_if; -@@ -257,8 +263,10 @@ static int load_aout32_binary(struct lin - - if (N_MAGIC(ex) == OMAGIC) { - loff_t pos = fd_offset; -+ down_write(¤t->mm->mmap_sem); - do_brk(N_TXTADDR(ex) & PAGE_MASK, - ex.a_text+ex.a_data + PAGE_SIZE - 1); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, (char __user *)N_TXTADDR(ex), - ex.a_text+ex.a_data, &pos); - } else { -@@ -272,7 +280,9 @@ static int load_aout32_binary(struct lin - - if (!bprm->file->f_op->mmap) { - loff_t pos = fd_offset; -+ down_write(¤t->mm->mmap_sem); - do_brk(0, ex.a_text+ex.a_data); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, - (char __user *)N_TXTADDR(ex), - ex.a_text+ex.a_data, &pos); -@@ -389,7 +399,9 @@ static int load_aout32_library(struct fi - len = PAGE_ALIGN(ex.a_text + ex.a_data); - bss = ex.a_text + ex.a_data + ex.a_bss; - if (bss > len) { -+ down_write(¤t->mm->mmap_sem); - error = do_brk(start_addr + len, bss - len); -+ up_write(¤t->mm->mmap_sem); - retval = error; - if (error != start_addr + len) - goto out; -diff -uprN linux-2.6.8.1.orig/arch/sparc64/kernel/ioctl32.c linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/ioctl32.c ---- linux-2.6.8.1.orig/arch/sparc64/kernel/ioctl32.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/ioctl32.c 2006-05-11 13:05:29.000000000 +0400 -@@ -475,7 +475,6 @@ IOCTL_TABLE_START - #include <linux/compat_ioctl.h> - #define DECLARES - #include "compat_ioctl.c" --COMPATIBLE_IOCTL(TCSBRKP) - COMPATIBLE_IOCTL(TIOCSTART) - COMPATIBLE_IOCTL(TIOCSTOP) - COMPATIBLE_IOCTL(TIOCSLTC) -diff -uprN 
linux-2.6.8.1.orig/arch/sparc64/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/sparc64/kernel/ptrace.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/sparc64/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -559,7 +559,7 @@ asmlinkage void do_ptrace(struct pt_regs - * exit. - */ - case PTRACE_KILL: { -- if (child->state == TASK_ZOMBIE) { /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) { /* already dead */ - pt_succ_return(regs, 0); - goto out_tsk; - } -diff -uprN linux-2.6.8.1.orig/arch/um/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/um/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/um/kernel/ptrace.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/um/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -163,7 +163,7 @@ int sys_ptrace(long request, long pid, l - */ - case PTRACE_KILL: { - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/um/kernel/tt/process_kern.c linux-2.6.8.1-ve022stab078/arch/um/kernel/tt/process_kern.c ---- linux-2.6.8.1.orig/arch/um/kernel/tt/process_kern.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/um/kernel/tt/process_kern.c 2006-05-11 13:05:26.000000000 +0400 -@@ -65,7 +65,7 @@ void *switch_to_tt(void *prev, void *nex - panic("write of switch_pipe failed, errno = %d", -err); - - reading = 1; -- if((from->state == TASK_ZOMBIE) || (from->state == TASK_DEAD)) -+ if((from->exit_state == EXIT_ZOMBIE) || (from->exit_state == EXIT_DEAD)) - os_kill_process(os_getpid(), 0); - - err = os_read_file(from->thread.mode.tt.switch_pipe[0], &c, sizeof(c)); -diff -uprN linux-2.6.8.1.orig/arch/v850/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/v850/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/v850/kernel/ptrace.c 2004-08-14 
14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/v850/kernel/ptrace.c 2006-05-11 13:05:26.000000000 +0400 -@@ -238,7 +238,7 @@ int sys_ptrace(long request, long pid, l - */ - case PTRACE_KILL: - rval = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - wake_up_process(child); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/boot/compressed/head.S linux-2.6.8.1-ve022stab078/arch/x86_64/boot/compressed/head.S ---- linux-2.6.8.1.orig/arch/x86_64/boot/compressed/head.S 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/boot/compressed/head.S 2006-05-11 13:05:45.000000000 +0400 -@@ -35,7 +35,7 @@ - startup_32: - cld - cli -- movl $(__KERNEL_DS),%eax -+ movl $(__BOOT_DS),%eax - movl %eax,%ds - movl %eax,%es - movl %eax,%fs -@@ -77,7 +77,7 @@ startup_32: - jnz 3f - addl $8,%esp - xorl %ebx,%ebx -- ljmp $(__KERNEL_CS), $0x100000 -+ ljmp $(__BOOT_CS), $0x100000 - - /* - * We come here, if we were loaded high. 
-@@ -105,7 +105,7 @@ startup_32: - popl %eax # hcount - movl $0x100000,%edi - cli # make sure we don't get interrupted -- ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine -+ ljmp $(__BOOT_CS), $0x1000 # and jump to the move routine - - /* - * Routine (template) for moving the decompressed kernel in place, -@@ -128,7 +128,7 @@ move_routine_start: - movsl - movl %ebx,%esi # Restore setup pointer - xorl %ebx,%ebx -- ljmp $(__KERNEL_CS), $0x100000 -+ ljmp $(__BOOT_CS), $0x100000 - move_routine_end: - - -@@ -138,5 +138,5 @@ user_stack: - .fill 4096,4,0 - stack_start: - .long user_stack+4096 -- .word __KERNEL_DS -+ .word __BOOT_DS - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/boot/setup.S linux-2.6.8.1-ve022stab078/arch/x86_64/boot/setup.S ---- linux-2.6.8.1.orig/arch/x86_64/boot/setup.S 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/boot/setup.S 2006-05-11 13:05:45.000000000 +0400 -@@ -727,7 +727,7 @@ flush_instr: - subw $DELTA_INITSEG, %si - shll $4, %esi # Convert to 32-bit pointer - # NOTE: For high loaded big kernels we need a --# jmpi 0x100000,__KERNEL_CS -+# jmpi 0x100000,__BOOT_CS - # - # but we yet haven't reloaded the CS register, so the default size - # of the target offset still is 16 bit. -@@ -738,7 +738,7 @@ flush_instr: - .byte 0x66, 0xea # prefix + jmpi-opcode - code32: .long 0x1000 # will be set to 0x100000 - # for big kernels -- .word __KERNEL_CS -+ .word __BOOT_CS - - # Here's a bunch of information about your current kernel.. 
- kernel_version: .ascii UTS_RELEASE -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_aout.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_aout.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_aout.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_aout.c 2006-05-11 13:05:40.000000000 +0400 -@@ -113,7 +113,9 @@ static void set_brk(unsigned long start, - end = PAGE_ALIGN(end); - if (end <= start) - return; -+ down_write(¤t->mm->mmap_sem); - do_brk(start, end - start); -+ up_write(¤t->mm->mmap_sem); - } - - #if CORE_DUMP -@@ -323,7 +325,10 @@ static int load_aout_binary(struct linux - pos = 32; - map_size = ex.a_text+ex.a_data; - -+ down_write(¤t->mm->mmap_sem); - error = do_brk(text_addr & PAGE_MASK, map_size); -+ up_write(¤t->mm->mmap_sem); -+ - if (error != (text_addr & PAGE_MASK)) { - send_sig(SIGKILL, current, 0); - return error; -@@ -343,14 +348,14 @@ static int load_aout_binary(struct linux - if ((ex.a_text & 0xfff || ex.a_data & 0xfff) && - (N_MAGIC(ex) != NMAGIC) && (jiffies-error_time2) > 5*HZ) - { -- printk(KERN_NOTICE "executable not page aligned\n"); -+ ve_printk(VE_LOG, KERN_NOTICE "executable not page aligned\n"); - error_time2 = jiffies; - } - - if ((fd_offset & ~PAGE_MASK) != 0 && - (jiffies-error_time) > 5*HZ) - { -- printk(KERN_WARNING -+ ve_printk(VE_LOG, KERN_WARNING - "fd_offset is not page aligned. 
Please convert program: %s\n", - bprm->file->f_dentry->d_name.name); - error_time = jiffies; -@@ -359,7 +364,9 @@ static int load_aout_binary(struct linux - - if (!bprm->file->f_op->mmap||((fd_offset & ~PAGE_MASK) != 0)) { - loff_t pos = fd_offset; -+ down_write(¤t->mm->mmap_sem); - do_brk(N_TXTADDR(ex), ex.a_text+ex.a_data); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file,(char *)N_TXTADDR(ex), - ex.a_text+ex.a_data, &pos); - flush_icache_range((unsigned long) N_TXTADDR(ex), -@@ -461,14 +468,15 @@ static int load_aout_library(struct file - static unsigned long error_time; - if ((jiffies-error_time) > 5*HZ) - { -- printk(KERN_WARNING -+ ve_printk(VE_LOG, KERN_WARNING - "N_TXTOFF is not page aligned. Please convert library: %s\n", - file->f_dentry->d_name.name); - error_time = jiffies; - } - #endif -- -+ down_write(¤t->mm->mmap_sem); - do_brk(start_addr, ex.a_text + ex.a_data + ex.a_bss); -+ up_write(¤t->mm->mmap_sem); - - file->f_op->read(file, (char *)start_addr, - ex.a_text + ex.a_data, &pos); -@@ -492,7 +500,9 @@ static int load_aout_library(struct file - len = PAGE_ALIGN(ex.a_text + ex.a_data); - bss = ex.a_text + ex.a_data + ex.a_bss; - if (bss > len) { -+ down_write(¤t->mm->mmap_sem); - error = do_brk(start_addr + len, bss - len); -+ up_write(¤t->mm->mmap_sem); - retval = error; - if (error != start_addr + len) - goto out; -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_binfmt.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_binfmt.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_binfmt.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_binfmt.c 2006-05-11 13:05:45.000000000 +0400 -@@ -27,12 +27,14 @@ - #include <asm/ia32.h> - #include <asm/vsyscall32.h> - -+#include <ub/ub_vmpages.h> -+ - #define ELF_NAME "elf/i386" - - #define AT_SYSINFO 32 - #define AT_SYSINFO_EHDR 33 - --int sysctl_vsyscall32 = 1; -+int sysctl_vsyscall32 = 0; - - #define ARCH_DLINFO do { \ - if (sysctl_vsyscall32) { \ -@@ 
-46,7 +48,7 @@ struct elf_phdr; - - #define IA32_EMULATOR 1 - --#define ELF_ET_DYN_BASE (TASK_UNMAPPED_32 + 0x1000000) -+#define ELF_ET_DYN_BASE (TASK_UNMAPPED_BASE + 0x1000000) - - #undef ELF_ARCH - #define ELF_ARCH EM_386 -@@ -73,8 +75,8 @@ typedef elf_greg_t elf_gregset_t[ELF_NGR - * Dumping its extra ELF program headers includes all the other information - * a debugger needs to easily find how the vsyscall DSO was being used. - */ --#define ELF_CORE_EXTRA_PHDRS (VSYSCALL32_EHDR->e_phnum) --#define ELF_CORE_WRITE_EXTRA_PHDRS \ -+#define DO_ELF_CORE_EXTRA_PHDRS (VSYSCALL32_EHDR->e_phnum) -+#define DO_ELF_CORE_WRITE_EXTRA_PHDRS \ - do { \ - const struct elf32_phdr *const vsyscall_phdrs = \ - (const struct elf32_phdr *) (VSYSCALL32_BASE \ -@@ -96,7 +98,7 @@ do { \ - DUMP_WRITE(&phdr, sizeof(phdr)); \ - } \ - } while (0) --#define ELF_CORE_WRITE_EXTRA_DATA \ -+#define DO_ELF_CORE_WRITE_EXTRA_DATA \ - do { \ - const struct elf32_phdr *const vsyscall_phdrs = \ - (const struct elf32_phdr *) (VSYSCALL32_BASE \ -@@ -109,6 +111,21 @@ do { \ - } \ - } while (0) - -+extern int sysctl_at_vsyscall; -+ -+#define ELF_CORE_EXTRA_PHDRS ({ (sysctl_at_vsyscall != 0 ? 
\ -+ DO_ELF_CORE_EXTRA_PHDRS : 0); }) -+ -+#define ELF_CORE_WRITE_EXTRA_PHDRS do { \ -+ if (sysctl_at_vsyscall != 0) \ -+ DO_ELF_CORE_WRITE_EXTRA_PHDRS; \ -+ } while (0) -+ -+#define ELF_CORE_WRITE_EXTRA_DATA do { \ -+ if (sysctl_at_vsyscall != 0) \ -+ DO_ELF_CORE_WRITE_EXTRA_DATA; \ -+ } while (0) -+ - struct elf_siginfo - { - int si_signo; /* signal number */ -@@ -303,6 +320,10 @@ MODULE_AUTHOR("Eric Youngdale, Andi Klee - - static void elf32_init(struct pt_regs *); - -+#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1 -+#define arch_setup_additional_pages syscall32_setup_pages -+extern int syscall32_setup_pages(struct linux_binprm *, int exstack); -+ - #include "../../../fs/binfmt_elf.c" - - static void elf32_init(struct pt_regs *regs) -@@ -327,10 +348,10 @@ static void elf32_init(struct pt_regs *r - - int setup_arg_pages(struct linux_binprm *bprm, int executable_stack) - { -- unsigned long stack_base; -+ unsigned long stack_base, vm_end, vm_start; - struct vm_area_struct *mpnt; - struct mm_struct *mm = current->mm; -- int i; -+ int i, ret; - - stack_base = IA32_STACK_TOP - MAX_ARG_PAGES * PAGE_SIZE; - mm->arg_start = bprm->p + stack_base; -@@ -340,22 +361,28 @@ int setup_arg_pages(struct linux_binprm - bprm->loader += stack_base; - bprm->exec += stack_base; - -+ vm_end = IA32_STACK_TOP; -+ vm_start = PAGE_MASK & (unsigned long)bprm->p; -+ -+ ret = -ENOMEM; -+ if (ub_memory_charge(mm_ub(mm), vm_end - vm_start, -+ vm_stack_flags32, NULL, UB_HARD)) -+ goto out; -+ - mpnt = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); -- if (!mpnt) -- return -ENOMEM; -- -- if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) { -- kmem_cache_free(vm_area_cachep, mpnt); -- return -ENOMEM; -- } -+ if (!mpnt) -+ goto out_uncharge; -+ -+ if (security_vm_enough_memory((vm_end - vm_start)>>PAGE_SHIFT)) -+ goto out_uncharge_free; - - memset(mpnt, 0, sizeof(*mpnt)); - - down_write(&mm->mmap_sem); - { - mpnt->vm_mm = mm; -- mpnt->vm_start = PAGE_MASK 
& (unsigned long) bprm->p; -- mpnt->vm_end = IA32_STACK_TOP; -+ mpnt->vm_start = vm_start; -+ mpnt->vm_end = vm_end; - if (executable_stack == EXSTACK_ENABLE_X) - mpnt->vm_flags = vm_stack_flags32 | VM_EXEC; - else if (executable_stack == EXSTACK_DISABLE_X) -@@ -364,7 +391,8 @@ int setup_arg_pages(struct linux_binprm - mpnt->vm_flags = vm_stack_flags32; - mpnt->vm_page_prot = (mpnt->vm_flags & VM_EXEC) ? - PAGE_COPY_EXEC : PAGE_COPY; -- insert_vm_struct(mm, mpnt); -+ if ((ret = insert_vm_struct(mm, mpnt))) -+ goto out_up; - mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; - } - -@@ -379,6 +407,17 @@ int setup_arg_pages(struct linux_binprm - up_write(&mm->mmap_sem); - - return 0; -+ -+out_up: -+ up_write(&mm->mmap_sem); -+ vm_unacct_memory((vm_end - vm_start) >> PAGE_SHIFT); -+out_uncharge_free: -+ kmem_cache_free(vm_area_cachep, mpnt); -+out_uncharge: -+ ub_memory_uncharge(mm_ub(mm), vm_end - vm_start, -+ vm_stack_flags32, NULL); -+out: -+ return ret; - } - - static unsigned long -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_ioctl.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_ioctl.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_ioctl.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_ioctl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -24,17 +24,27 @@ - static int tiocgdev(unsigned fd, unsigned cmd, unsigned int __user *ptr) - { - -- struct file *file = fget(fd); -+ struct file *file; - struct tty_struct *real_tty; -+ int ret; - -+ file = fget(fd); - if (!file) - return -EBADF; -+ -+ ret = -EINVAL; - if (file->f_op->ioctl != tty_ioctl) -- return -EINVAL; -+ goto out; - real_tty = (struct tty_struct *)file->private_data; - if (!real_tty) -- return -EINVAL; -- return put_user(new_encode_dev(tty_devnum(real_tty)), ptr); -+ goto out; -+ -+ ret = put_user(new_encode_dev(tty_devnum(real_tty)), ptr); -+ -+out: -+ fput(file); -+ -+ return ret; - } - - #define RTC_IRQP_READ32 _IOR('p', 0x0b, unsigned int) /* 
Read IRQ rate */ -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_signal.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_signal.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ia32_signal.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32_signal.c 2006-05-11 13:05:45.000000000 +0400 -@@ -44,10 +44,10 @@ - asmlinkage int do_signal(struct pt_regs *regs, sigset_t *oldset); - void signal_fault(struct pt_regs *regs, void __user *frame, char *where); - --int ia32_copy_siginfo_to_user(siginfo_t32 __user *to, siginfo_t *from) -+int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) - { - int err; -- if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t32))) -+ if (!access_ok (VERIFY_WRITE, to, sizeof(compat_siginfo_t))) - return -EFAULT; - - /* If you change siginfo_t structure, please make sure that -@@ -95,11 +95,11 @@ int ia32_copy_siginfo_to_user(siginfo_t3 - return err; - } - --int ia32_copy_siginfo_from_user(siginfo_t *to, siginfo_t32 __user *from) -+int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from) - { - int err; - u32 ptr32; -- if (!access_ok (VERIFY_READ, from, sizeof(siginfo_t32))) -+ if (!access_ok (VERIFY_READ, from, sizeof(compat_siginfo_t))) - return -EFAULT; - - err = __get_user(to->si_signo, &from->si_signo); -@@ -122,6 +122,7 @@ sys32_sigsuspend(int history0, int histo - mask &= _BLOCKABLE; - spin_lock_irq(¤t->sighand->siglock); - saveset = current->blocked; -+ set_sigsuspend_state(current, saveset); - siginitset(¤t->blocked, mask); - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); -@@ -130,8 +131,10 @@ sys32_sigsuspend(int history0, int histo - while (1) { - current->state = TASK_INTERRUPTIBLE; - schedule(); -- if (do_signal(®s, &saveset)) -+ if (do_signal(®s, &saveset)) { -+ clear_sigsuspend_state(current); - return -EINTR; -+ } - } - } - -@@ -187,7 +190,7 @@ struct rt_sigframe - int sig; - u32 pinfo; - u32 puc; -- struct siginfo32 info; -+ struct 
compat_siginfo info; - struct ucontext_ia32 uc; - struct _fpstate_ia32 fpstate; - char retcode[8]; -@@ -260,6 +263,12 @@ ia32_restore_sigcontext(struct pt_regs * - if (verify_area(VERIFY_READ, buf, sizeof(*buf))) - goto badframe; - err |= restore_i387_ia32(current, buf, 0); -+ } else { -+ struct task_struct *me = current; -+ if (me->used_math) { -+ clear_fpu(me); -+ me->used_math = 0; -+ } - } - } - -@@ -522,7 +531,7 @@ void ia32_setup_rt_frame(int sig, struct - } - err |= __put_user((u32)(u64)&frame->info, &frame->pinfo); - err |= __put_user((u32)(u64)&frame->uc, &frame->puc); -- err |= ia32_copy_siginfo_to_user(&frame->info, info); -+ err |= copy_siginfo_to_user32(&frame->info, info); - if (err) - goto give_sigsegv; - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ia32entry.S linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32entry.S ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ia32entry.S 2004-08-14 14:55:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ia32entry.S 2006-05-11 13:05:29.000000000 +0400 -@@ -436,7 +436,7 @@ ia32_sys_call_table: - .quad sys_init_module - .quad sys_delete_module - .quad quiet_ni_syscall /* 130 get_kernel_syms */ -- .quad sys32_quotactl /* quotactl */ -+ .quad sys_quotactl /* quotactl */ - .quad sys_getpgid - .quad sys_fchdir - .quad quiet_ni_syscall /* bdflush */ -@@ -482,7 +482,7 @@ ia32_sys_call_table: - .quad sys32_rt_sigaction - .quad sys32_rt_sigprocmask /* 175 */ - .quad sys32_rt_sigpending -- .quad sys32_rt_sigtimedwait -+ .quad compat_rt_sigtimedwait - .quad sys32_rt_sigqueueinfo - .quad stub32_rt_sigsuspend - .quad sys32_pread /* 180 */ -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/ptrace32.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ptrace32.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/ptrace32.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/ptrace32.c 2006-05-11 13:05:40.000000000 +0400 -@@ -205,7 +205,7 @@ static struct task_struct *find_target(i - - *err = -ESRCH; - 
read_lock(&tasklist_lock); -- child = find_task_by_pid(pid); -+ child = find_task_by_pid_ve(pid); - if (child) - get_task_struct(child); - read_unlock(&tasklist_lock); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/sys_ia32.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/sys_ia32.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/sys_ia32.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/sys_ia32.c 2006-05-11 13:05:49.000000000 +0400 -@@ -658,11 +658,12 @@ sys32_waitpid(compat_pid_t pid, unsigned - int sys32_ni_syscall(int call) - { - struct task_struct *me = current; -- static char lastcomm[8]; -- if (strcmp(lastcomm, me->comm)) { -- printk(KERN_INFO "IA32 syscall %d from %s not implemented\n", call, -- current->comm); -- strcpy(lastcomm, me->comm); -+ static char lastcomm[sizeof(me->comm)]; -+ -+ if (strncmp(lastcomm, me->comm, sizeof(lastcomm))) { -+ ve_printk(VE_LOG, KERN_INFO "IA32 syscall %d from %s not implemented\n", -+ call, me->comm); -+ strncpy(lastcomm, me->comm, sizeof(lastcomm)); - } - return -ENOSYS; - } -@@ -782,51 +783,13 @@ sys32_rt_sigpending(compat_sigset_t __us - - - asmlinkage long --sys32_rt_sigtimedwait(compat_sigset_t __user *uthese, siginfo_t32 __user *uinfo, -- struct compat_timespec __user *uts, compat_size_t sigsetsize) --{ -- sigset_t s; -- compat_sigset_t s32; -- struct timespec t; -- int ret; -- mm_segment_t old_fs = get_fs(); -- siginfo_t info; -- -- if (copy_from_user (&s32, uthese, sizeof(compat_sigset_t))) -- return -EFAULT; -- switch (_NSIG_WORDS) { -- case 4: s.sig[3] = s32.sig[6] | (((long)s32.sig[7]) << 32); -- case 3: s.sig[2] = s32.sig[4] | (((long)s32.sig[5]) << 32); -- case 2: s.sig[1] = s32.sig[2] | (((long)s32.sig[3]) << 32); -- case 1: s.sig[0] = s32.sig[0] | (((long)s32.sig[1]) << 32); -- } -- if (uts && get_compat_timespec(&t, uts)) -- return -EFAULT; -- if (uinfo) { -- /* stop data leak to user space in case of structure fill mismatch -- * between sys_rt_sigtimedwait & 
ia32_copy_siginfo_to_user. -- */ -- memset(&info, 0, sizeof(info)); -- } -- set_fs (KERNEL_DS); -- ret = sys_rt_sigtimedwait(&s, uinfo ? &info : NULL, uts ? &t : NULL, -- sigsetsize); -- set_fs (old_fs); -- if (ret >= 0 && uinfo) { -- if (ia32_copy_siginfo_to_user(uinfo, &info)) -- return -EFAULT; -- } -- return ret; --} -- --asmlinkage long --sys32_rt_sigqueueinfo(int pid, int sig, siginfo_t32 __user *uinfo) -+sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo) - { - siginfo_t info; - int ret; - mm_segment_t old_fs = get_fs(); - -- if (ia32_copy_siginfo_from_user(&info, uinfo)) -+ if (copy_siginfo_from_user32(&info, uinfo)) - return -EFAULT; - set_fs (KERNEL_DS); - ret = sys_rt_sigqueueinfo(pid, sig, &info); -@@ -947,7 +910,7 @@ sys32_sendfile(int out_fd, int in_fd, co - ret = sys_sendfile(out_fd, in_fd, offset ? &of : NULL, count); - set_fs(old_fs); - -- if (!ret && offset && put_user(of, offset)) -+ if (offset && put_user(of, offset)) - return -EFAULT; - - return ret; -@@ -1067,13 +1030,13 @@ asmlinkage long sys32_olduname(struct ol - - down_read(&uts_sem); - -- error = __copy_to_user(&name->sysname,&system_utsname.sysname,__OLD_UTS_LEN); -+ error = __copy_to_user(&name->sysname,&ve_utsname.sysname,__OLD_UTS_LEN); - __put_user(0,name->sysname+__OLD_UTS_LEN); -- __copy_to_user(&name->nodename,&system_utsname.nodename,__OLD_UTS_LEN); -+ __copy_to_user(&name->nodename,&ve_utsname.nodename,__OLD_UTS_LEN); - __put_user(0,name->nodename+__OLD_UTS_LEN); -- __copy_to_user(&name->release,&system_utsname.release,__OLD_UTS_LEN); -+ __copy_to_user(&name->release,&ve_utsname.release,__OLD_UTS_LEN); - __put_user(0,name->release+__OLD_UTS_LEN); -- __copy_to_user(&name->version,&system_utsname.version,__OLD_UTS_LEN); -+ __copy_to_user(&name->version,&ve_utsname.version,__OLD_UTS_LEN); - __put_user(0,name->version+__OLD_UTS_LEN); - { - char *arch = "x86_64"; -@@ -1096,7 +1059,7 @@ long sys32_uname(struct old_utsname __us - if (!name) - return -EFAULT; - 
down_read(&uts_sem); -- err=copy_to_user(name, &system_utsname, sizeof (*name)); -+ err=copy_to_user(name, &ve_utsname, sizeof (*name)); - up_read(&uts_sem); - if (personality(current->personality) == PER_LINUX32) - err |= copy_to_user(&name->machine, "i686", 5); -@@ -1316,23 +1279,11 @@ long sys32_fadvise64_64(int fd, __u32 of - long sys32_vm86_warning(void) - { - struct task_struct *me = current; -- static char lastcomm[8]; -- if (strcmp(lastcomm, me->comm)) { -- printk(KERN_INFO "%s: vm86 mode not supported on 64 bit kernel\n", -- me->comm); -- strcpy(lastcomm, me->comm); -- } -- return -ENOSYS; --} -- --long sys32_quotactl(void) --{ -- struct task_struct *me = current; -- static char lastcomm[8]; -- if (strcmp(lastcomm, me->comm)) { -- printk(KERN_INFO "%s: 32bit quotactl not supported on 64 bit kernel\n", -+ static char lastcomm[sizeof(me->comm)]; -+ if (strncmp(lastcomm, me->comm, sizeof(lastcomm))) { -+ ve_printk(VE_LOG, KERN_INFO "%s: vm87 mode not supported on 64 bit kernel\n", - me->comm); -- strcpy(lastcomm, me->comm); -+ strncpy(lastcomm, me->comm, sizeof(lastcomm)); - } - return -ENOSYS; - } -diff -uprN linux-2.6.8.1.orig/arch/x86_64/ia32/syscall32.c linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/syscall32.c ---- linux-2.6.8.1.orig/arch/x86_64/ia32/syscall32.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/ia32/syscall32.c 2006-05-11 13:05:48.000000000 +0400 -@@ -4,11 +4,13 @@ - on demand because 32bit cannot reach the kernel's fixmaps */ - - #include <linux/mm.h> -+#include <linux/mman.h> - #include <linux/string.h> - #include <linux/kernel.h> - #include <linux/gfp.h> - #include <linux/init.h> - #include <linux/stringify.h> -+#include <linux/security.h> - #include <asm/proto.h> - #include <asm/tlbflush.h> - #include <asm/ia32_unistd.h> -@@ -30,32 +32,64 @@ extern int sysctl_vsyscall32; - char *syscall32_page; - static int use_sysenter __initdata = -1; - --/* RED-PEN: This knows too much about high level VM */ --/* 
Alternative would be to generate a vma with appropriate backing options -- and let it be handled by generic VM */ --int map_syscall32(struct mm_struct *mm, unsigned long address) --{ -- pte_t *pte; -- pmd_t *pmd; -- int err = 0; -- -- down_read(&mm->mmap_sem); -- spin_lock(&mm->page_table_lock); -- pmd = pmd_alloc(mm, pgd_offset(mm, address), address); -- if (pmd && (pte = pte_alloc_map(mm, pmd, address)) != NULL) { -- if (pte_none(*pte)) { -- set_pte(pte, -- mk_pte(virt_to_page(syscall32_page), -- PAGE_KERNEL_VSYSCALL)); -- } -- /* Flush only the local CPU. Other CPUs taking a fault -- will just end up here again */ -- __flush_tlb_one(address); -- } else -- err = -ENOMEM; -- spin_unlock(&mm->page_table_lock); -- up_read(&mm->mmap_sem); -- return err; -+static struct page * -+syscall32_nopage(struct vm_area_struct *vma, unsigned long adr, int *type) -+{ -+ struct page *p = virt_to_page(adr - vma->vm_start + syscall32_page); -+ get_page(p); -+ return p; -+} -+ -+/* Prevent VMA merging */ -+static void syscall32_vma_close(struct vm_area_struct *vma) -+{ -+} -+ -+static struct vm_operations_struct syscall32_vm_ops = { -+ .close = syscall32_vma_close, -+ .nopage = syscall32_nopage, -+}; -+ -+struct linux_binprm; -+ -+/* Setup a VMA at program startup for the vsyscall page */ -+int syscall32_setup_pages(struct linux_binprm *bprm, int exstack) -+{ -+ int npages = (VSYSCALL32_END - VSYSCALL32_BASE) >> PAGE_SHIFT; -+ struct vm_area_struct *vma; -+ struct mm_struct *mm = current->mm; -+ int ret; -+ -+ vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); -+ if (!vma) -+ return -ENOMEM; -+ if (security_vm_enough_memory(npages)) { -+ kmem_cache_free(vm_area_cachep, vma); -+ return -ENOMEM; -+ } -+ -+ memset(vma, 0, sizeof(struct vm_area_struct)); -+ /* Could randomize here */ -+ vma->vm_start = VSYSCALL32_BASE; -+ vma->vm_end = VSYSCALL32_END; -+ /* MAYWRITE to allow gdb to COW and set breakpoints */ -+ vma->vm_flags = 
VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC|VM_MAYEXEC|VM_MAYWRITE; -+ vma->vm_flags |= mm->def_flags; -+ vma->vm_page_prot = protection_map[vma->vm_flags & 7]; -+ vma->vm_ops = &syscall32_vm_ops; -+ vma->vm_mm = mm; -+ -+ down_write(&mm->mmap_sem); -+ ret = insert_vm_struct(mm, vma); -+ if (ret) { -+ up_write(&mm->mmap_sem); -+ kmem_cache_free(vm_area_cachep, vma); -+ vm_unacct_memory(npages); -+ return ret; -+ } -+ mm->total_vm += npages; -+ up_write(&mm->mmap_sem); -+ return 0; - } - - static int __init init_syscall32(void) -@@ -63,7 +97,6 @@ static int __init init_syscall32(void) - syscall32_page = (void *)get_zeroed_page(GFP_KERNEL); - if (!syscall32_page) - panic("Cannot allocate syscall32 page"); -- SetPageReserved(virt_to_page(syscall32_page)); - if (use_sysenter > 0) { - memcpy(syscall32_page, syscall32_sysenter, - syscall32_sysenter_end - syscall32_sysenter); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/acpi/wakeup.S linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/acpi/wakeup.S ---- linux-2.6.8.1.orig/arch/x86_64/kernel/acpi/wakeup.S 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/acpi/wakeup.S 2006-05-11 13:05:45.000000000 +0400 -@@ -77,7 +77,7 @@ wakeup_code: - - .byte 0x66, 0xea # prefix + jmpi-opcode - .long wakeup_32 - __START_KERNEL_map -- .word __KERNEL_CS -+ .word __BOOT_CS - - .code32 - wakeup_32: -@@ -96,13 +96,13 @@ wakeup_32: - jnc bogus_cpu - movl %edx,%edi - -- movw $__KERNEL_DS, %ax -+ movw $__BOOT_DS, %ax - movw %ax, %ds - movw %ax, %es - movw %ax, %fs - movw %ax, %gs - -- movw $__KERNEL_DS, %ax -+ movw $__BOOT_DS, %ax - movw %ax, %ss - - mov $(wakeup_stack - __START_KERNEL_map), %esp -@@ -187,7 +187,7 @@ reach_compatibility_mode: - - wakeup_jumpvector: - .long wakeup_long64 - __START_KERNEL_map -- .word __KERNEL_CS -+ .word __BOOT_CS - - .code64 - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/apic.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/apic.c ---- 
linux-2.6.8.1.orig/arch/x86_64/kernel/apic.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/apic.c 2006-05-11 13:05:40.000000000 +0400 -@@ -883,6 +883,7 @@ void smp_local_timer_interrupt(struct pt - */ - void smp_apic_timer_interrupt(struct pt_regs *regs) - { -+ struct ve_struct *ve; - /* - * the NMI deadlock-detector uses this. - */ -@@ -898,9 +899,11 @@ void smp_apic_timer_interrupt(struct pt_ - * Besides, if we don't timer interrupts ignore the global - * interrupt lock, which is the WrongThing (tm) to do. - */ -+ ve = set_exec_env(get_ve0()); - irq_enter(); - smp_local_timer_interrupt(regs); - irq_exit(); -+ (void)set_exec_env(ve); - } - - /* -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/entry.S linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/entry.S ---- linux-2.6.8.1.orig/arch/x86_64/kernel/entry.S 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/entry.S 2006-05-11 13:05:40.000000000 +0400 -@@ -173,6 +173,10 @@ rff_trace: - * - * XXX if we had a free scratch register we could save the RSP into the stack frame - * and report it properly in ps. Unfortunately we haven't. -+ * -+ * When user can change the frames always force IRET. That is because -+ * it deals with uncanonical addresses better. SYSRET has trouble -+ * with them due to bugs in both AMD and Intel CPUs. - */ - - ENTRY(system_call) -@@ -236,7 +240,10 @@ sysret_signal: - xorl %esi,%esi # oldset -> arg2 - call ptregscall_common - 1: movl $_TIF_NEED_RESCHED,%edi -- jmp sysret_check -+ /* Use IRET because user could have changed frame. This -+ works because ptregscall_common has called FIXUP_TOP_OF_STACK. 
*/ -+ cli -+ jmp int_with_check - - /* Do syscall tracing */ - tracesys: -@@ -257,7 +264,8 @@ tracesys: - call syscall_trace_leave - RESTORE_TOP_OF_STACK %rbx - RESTORE_REST -- jmp ret_from_sys_call -+ /* Use IRET because user could have changed frame */ -+ jmp int_ret_from_sys_call - - badsys: - movq $-ENOSYS,RAX-ARGOFFSET(%rsp) -@@ -358,20 +366,9 @@ ENTRY(stub_execve) - popq %r11 - CFI_ADJUST_CFA_OFFSET -8 - SAVE_REST -- movq %r11, %r15 - FIXUP_TOP_OF_STACK %r11 - call sys_execve -- GET_THREAD_INFO(%rcx) -- bt $TIF_IA32,threadinfo_flags(%rcx) -- jc exec_32bit - RESTORE_TOP_OF_STACK %r11 -- movq %r15, %r11 -- RESTORE_REST -- push %r11 -- ret -- --exec_32bit: -- CFI_ADJUST_CFA_OFFSET REST_SKIP - movq %rax,RAX(%rsp) - RESTORE_REST - jmp int_ret_from_sys_call -@@ -728,7 +725,7 @@ ENTRY(kernel_thread) - xorl %r9d,%r9d - - # clone now -- call do_fork -+ call do_fork_kthread - movq %rax,RAX(%rsp) - xorl %edi,%edi - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/head.S linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/head.S ---- linux-2.6.8.1.orig/arch/x86_64/kernel/head.S 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/head.S 2006-05-11 13:05:45.000000000 +0400 -@@ -39,7 +39,7 @@ startup_32: - - movl %ebx,%ebp /* Save trampoline flag */ - -- movl $__KERNEL_DS,%eax -+ movl $__BOOT_DS,%eax - movl %eax,%ds - - /* If the CPU doesn't support CPUID this will double fault. -@@ -159,7 +159,14 @@ reach_long64: - /* esi is pointer to real mode structure with interesting info. - pass it to C */ - movl %esi, %edi -- -+ -+ /* Switch to __KERNEL_CS. The segment is the same, but selector -+ * is different. 
*/ -+ pushq $__KERNEL_CS -+ pushq $switch_cs -+ lretq -+switch_cs: -+ - /* Finally jump to run C code and to be on real kernel address - * Since we are running on identity-mapped space we have to jump - * to the full 64bit address , this is only possible as indirect -@@ -192,7 +199,7 @@ pGDT32: - .org 0xf10 - ljumpvector: - .long reach_long64-__START_KERNEL_map -- .word __KERNEL_CS -+ .word __BOOT_CS - - ENTRY(stext) - ENTRY(_stext) -@@ -326,7 +333,7 @@ gdt: - ENTRY(gdt_table32) - .quad 0x0000000000000000 /* This one is magic */ - .quad 0x0000000000000000 /* unused */ -- .quad 0x00af9a000000ffff /* __KERNEL_CS */ -+ .quad 0x00af9a000000ffff /* __BOOT_CS */ - gdt32_end: - - /* We need valid kernel segments for data and code in long mode too -@@ -337,23 +344,30 @@ gdt32_end: - .align L1_CACHE_BYTES - - /* The TLS descriptors are currently at a different place compared to i386. -- Hopefully nobody expects them at a fixed place (Wine?) */ -+ Hopefully nobody expects them at a fixed place (Wine?) -+ Descriptors rearranged to plase 32bit and TLS selectors in the same -+ places, because it is really necessary. sysret/exit mandates order -+ of kernel/user cs/ds, so we have to extend gdt. -+*/ - - ENTRY(cpu_gdt_table) -- .quad 0x0000000000000000 /* NULL descriptor */ -- .quad 0x008f9a000000ffff /* __KERNEL_COMPAT32_CS */ -- .quad 0x00af9a000000ffff /* __KERNEL_CS */ -- .quad 0x00cf92000000ffff /* __KERNEL_DS */ -- .quad 0x00cffe000000ffff /* __USER32_CS */ -- .quad 0x00cff2000000ffff /* __USER_DS, __USER32_DS */ -- .quad 0x00affa000000ffff /* __USER_CS */ -- .quad 0x00cf9a000000ffff /* __KERNEL32_CS */ -- .quad 0,0 /* TSS */ -- .quad 0 /* LDT */ -- .quad 0,0,0 /* three TLS descriptors */ -- .quad 0 /* unused now */ -- .quad 0x00009a000000ffff /* __KERNEL16_CS - 16bit PM for S3 wakeup. 
*/ -+ .quad 0x0000000000000000 /* 0 NULL descriptor */ -+ .quad 0x008f9a000000ffff /* 1 __KERNEL_COMPAT32_CS */ -+ .quad 0x00af9a000000ffff /* 2 __BOOT_CS */ -+ .quad 0x00cf92000000ffff /* 3 __BOOT_DS */ -+ .quad 0,0 /* 4,5 TSS */ -+ .quad 0,0,0 /* 6-8 three TLS descriptors */ -+ .quad 0x00009a000000ffff /* 9 __KERNEL16_CS - 16bit PM for S3 wakeup. */ - /* base must be patched for real base address. */ -+ .quad 0 /* 10 LDT */ -+ .quad 0x00cf9a000000ffff /* 11 __KERNEL32_CS */ -+ .quad 0x00af9a000000ffff /* 12 __KERNEL_CS */ -+ .quad 0x00cf92000000ffff /* 13 __KERNEL_DS */ -+ .quad 0x00cffe000000ffff /* 14 __USER32_CS */ -+ .quad 0x00cff2000000ffff /* 15 __USER_DS, __USER32_DS */ -+ .quad 0x00affa000000ffff /* 16 __USER_CS */ -+ .quad 0,0,0,0,0,0,0 -+ .quad 0,0,0,0,0,0,0,0 - gdt_end: - /* asm/segment.h:GDT_ENTRIES must match this */ - /* This should be a multiple of the cache line size */ -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/irq.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/irq.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/irq.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/irq.c 2006-05-11 13:05:40.000000000 +0400 -@@ -45,7 +45,8 @@ - #include <asm/desc.h> - #include <asm/irq.h> - -- -+#include <ub/beancounter.h> -+#include <ub/ub_task.h> - - /* - * Linux has a controller-independent x86 interrupt architecture. 
-@@ -213,15 +214,18 @@ inline void synchronize_irq(unsigned int - int handle_IRQ_event(unsigned int irq, struct pt_regs * regs, struct irqaction * action) - { - int status = 1; /* Force the "do bottom halves" bit */ -+ struct user_beancounter *ub; - - if (!(action->flags & SA_INTERRUPT)) - local_irq_enable(); - -+ ub = set_exec_ub(get_ub0()); - do { - status |= action->flags; - action->handler(irq, action->dev_id, regs); - action = action->next; - } while (action); -+ (void)set_exec_ub(ub); - if (status & SA_SAMPLE_RANDOM) - add_interrupt_randomness(irq); - local_irq_disable(); -@@ -340,9 +344,11 @@ asmlinkage unsigned int do_IRQ(struct pt - irq_desc_t *desc = irq_desc + irq; - struct irqaction * action; - unsigned int status; -+ struct ve_struct *ve; - - if (irq > 256) BUG(); - -+ ve = set_exec_env(get_ve0()); - irq_enter(); - kstat_cpu(cpu).irqs[irq]++; - spin_lock(&desc->lock); -@@ -405,6 +411,7 @@ out: - spin_unlock(&desc->lock); - - irq_exit(); -+ (void)set_exec_env(ve); - return 1; - } - -@@ -833,6 +840,8 @@ static int irq_affinity_read_proc (char - return len; - } - -+int no_irq_affinity; -+ - static int irq_affinity_write_proc (struct file *file, - const char __user *buffer, - unsigned long count, void *data) -@@ -840,7 +849,7 @@ static int irq_affinity_write_proc (stru - int irq = (long) data, full_count = count, err; - cpumask_t tmp, new_value; - -- if (!irq_desc[irq].handler->set_affinity) -+ if (!irq_desc[irq].handler->set_affinity || no_irq_affinity) - return -EIO; - - err = cpumask_parse(buffer, count, new_value); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/nmi.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/nmi.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/nmi.c 2004-08-14 14:55:31.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/nmi.c 2006-05-11 13:05:29.000000000 +0400 -@@ -59,6 +59,7 @@ static int panic_on_timeout; - unsigned int nmi_watchdog = NMI_DEFAULT; - static unsigned int nmi_hz = HZ; - unsigned int nmi_perfctr_msr; 
/* the MSR to reset in NMI handler */ -+static unsigned int nmi_p4_cccr_val; - - /* Note that these events don't tick when the CPU idles. This means - the frequency varies with CPU load. */ -@@ -70,12 +71,41 @@ unsigned int nmi_perfctr_msr; /* the MSR - #define K7_EVENT_CYCLES_PROCESSOR_IS_RUNNING 0x76 - #define K7_NMI_EVENT K7_EVENT_CYCLES_PROCESSOR_IS_RUNNING - --#define P6_EVNTSEL0_ENABLE (1 << 22) --#define P6_EVNTSEL_INT (1 << 20) --#define P6_EVNTSEL_OS (1 << 17) --#define P6_EVNTSEL_USR (1 << 16) --#define P6_EVENT_CPU_CLOCKS_NOT_HALTED 0x79 --#define P6_NMI_EVENT P6_EVENT_CPU_CLOCKS_NOT_HALTED -+#define MSR_P4_MISC_ENABLE 0x1A0 -+#define MSR_P4_MISC_ENABLE_PERF_AVAIL (1<<7) -+#define MSR_P4_MISC_ENABLE_PEBS_UNAVAIL (1<<12) -+#define MSR_P4_PERFCTR0 0x300 -+#define MSR_P4_CCCR0 0x360 -+#define P4_ESCR_EVENT_SELECT(N) ((N)<<25) -+#define P4_ESCR_OS (1<<3) -+#define P4_ESCR_USR (1<<2) -+#define P4_CCCR_OVF_PMI0 (1<<26) -+#define P4_CCCR_OVF_PMI1 (1<<27) -+#define P4_CCCR_THRESHOLD(N) ((N)<<20) -+#define P4_CCCR_COMPLEMENT (1<<19) -+#define P4_CCCR_COMPARE (1<<18) -+#define P4_CCCR_REQUIRED (3<<16) -+#define P4_CCCR_ESCR_SELECT(N) ((N)<<13) -+#define P4_CCCR_ENABLE (1<<12) -+/* Set up IQ_COUNTER0 to behave like a clock, by having IQ_CCCR0 filter -+ CRU_ESCR0 (with any non-null event selector) through a complemented -+ max threshold. 
[IA32-Vol3, Section 14.9.9] */ -+#define MSR_P4_IQ_COUNTER0 0x30C -+#define P4_NMI_CRU_ESCR0 (P4_ESCR_EVENT_SELECT(0x3F)|P4_ESCR_OS|P4_ESCR_USR) -+#define P4_NMI_IQ_CCCR0 \ -+ (P4_CCCR_OVF_PMI0|P4_CCCR_THRESHOLD(15)|P4_CCCR_COMPLEMENT| \ -+ P4_CCCR_COMPARE|P4_CCCR_REQUIRED|P4_CCCR_ESCR_SELECT(4)|P4_CCCR_ENABLE) -+ -+static __init inline int nmi_known_cpu(void) -+{ -+ switch (boot_cpu_data.x86_vendor) { -+ case X86_VENDOR_AMD: -+ return boot_cpu_data.x86 == 15; -+ case X86_VENDOR_INTEL: -+ return boot_cpu_data.x86 == 15; -+ } -+ return 0; -+} - - /* Run after command line and cpu_init init, but before all other checks */ - void __init nmi_watchdog_default(void) -@@ -83,19 +113,10 @@ void __init nmi_watchdog_default(void) - if (nmi_watchdog != NMI_DEFAULT) - return; - -- /* For some reason the IO APIC watchdog doesn't work on the AMD -- 8111 chipset. For now switch to local APIC mode using -- perfctr0 there. On Intel CPUs we don't have code to handle -- the perfctr and the IO-APIC seems to work, so use that. */ -- -- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) { -- nmi_watchdog = NMI_LOCAL_APIC; -- printk(KERN_INFO -- "Using local APIC NMI watchdog using perfctr0\n"); -- } else { -- printk(KERN_INFO "Using IO APIC NMI watchdog\n"); -+ if (nmi_known_cpu()) -+ nmi_watchdog = NMI_LOCAL_APIC; -+ else - nmi_watchdog = NMI_IO_APIC; -- } - } - - /* Why is there no CPUID flag for this? */ -@@ -181,7 +202,10 @@ static void disable_lapic_nmi_watchdog(v - wrmsr(MSR_K7_EVNTSEL0, 0, 0); - break; - case X86_VENDOR_INTEL: -- wrmsr(MSR_IA32_EVNTSEL0, 0, 0); -+ if (boot_cpu_data.x86 == 15) { -+ wrmsr(MSR_P4_IQ_CCCR0, 0, 0); -+ wrmsr(MSR_P4_CRU_ESCR0, 0, 0); -+ } - break; - } - nmi_active = -1; -@@ -296,6 +320,14 @@ late_initcall(init_lapic_nmi_sysfs); - * Original code written by Keith Owens. 
- */ - -+static void clear_msr_range(unsigned int base, unsigned int n) -+{ -+ unsigned int i; -+ -+ for(i = 0; i < n; ++i) -+ wrmsr(base+i, 0, 0); -+} -+ - static void setup_k7_watchdog(void) - { - int i; -@@ -327,6 +359,47 @@ static void setup_k7_watchdog(void) - wrmsr(MSR_K7_EVNTSEL0, evntsel, 0); - } - -+static int setup_p4_watchdog(void) -+{ -+ unsigned int misc_enable, dummy; -+ -+ rdmsr(MSR_P4_MISC_ENABLE, misc_enable, dummy); -+ if (!(misc_enable & MSR_P4_MISC_ENABLE_PERF_AVAIL)) -+ return 0; -+ -+ nmi_perfctr_msr = MSR_P4_IQ_COUNTER0; -+ nmi_p4_cccr_val = P4_NMI_IQ_CCCR0; -+#ifdef CONFIG_SMP -+ if (smp_num_siblings == 2) -+ nmi_p4_cccr_val |= P4_CCCR_OVF_PMI1; -+#endif -+ -+ if (!(misc_enable & MSR_P4_MISC_ENABLE_PEBS_UNAVAIL)) -+ clear_msr_range(0x3F1, 2); -+ /* MSR 0x3F0 seems to have a default value of 0xFC00, but current -+ docs doesn't fully define it, so leave it alone for now. */ -+ if (boot_cpu_data.x86_model >= 0x3) { -+ /* MSR_P4_IQ_ESCR0/1 (0x3ba/0x3bb) removed */ -+ clear_msr_range(0x3A0, 26); -+ clear_msr_range(0x3BC, 3); -+ } else { -+ clear_msr_range(0x3A0, 31); -+ } -+ clear_msr_range(0x3C0, 6); -+ clear_msr_range(0x3C8, 6); -+ clear_msr_range(0x3E0, 2); -+ clear_msr_range(MSR_P4_CCCR0, 18); -+ clear_msr_range(MSR_P4_PERFCTR0, 18); -+ -+ wrmsr(MSR_P4_CRU_ESCR0, P4_NMI_CRU_ESCR0, 0); -+ wrmsr(MSR_P4_IQ_CCCR0, P4_NMI_IQ_CCCR0 & ~P4_CCCR_ENABLE, 0); -+ Dprintk("setting P4_IQ_COUNTER0 to 0x%08lx\n", -(cpu_khz/nmi_hz*1000)); -+ wrmsr(MSR_P4_IQ_COUNTER0, -(cpu_khz/nmi_hz*1000), -1); -+ apic_write(APIC_LVTPC, APIC_DM_NMI); -+ wrmsr(MSR_P4_IQ_CCCR0, nmi_p4_cccr_val, 0); -+ return 1; -+} -+ - void setup_apic_nmi_watchdog(void) - { - switch (boot_cpu_data.x86_vendor) { -@@ -337,6 +410,13 @@ void setup_apic_nmi_watchdog(void) - return; - setup_k7_watchdog(); - break; -+ case X86_VENDOR_INTEL: -+ if (boot_cpu_data.x86 != 15) -+ return; -+ if (!setup_p4_watchdog()) -+ return; -+ break; -+ - default: - return; - } -@@ -414,8 +494,20 @@ void 
nmi_watchdog_tick (struct pt_regs * - last_irq_sums[cpu] = sum; - alert_counter[cpu] = 0; - } -- if (nmi_perfctr_msr) -+ if (nmi_perfctr_msr) { -+ if (nmi_perfctr_msr == MSR_P4_IQ_COUNTER0) { -+ /* -+ * P4 quirks: -+ * - An overflown perfctr will assert its interrupt -+ * until the OVF flag in its CCCR is cleared. -+ * - LVTPC is masked on interrupt and must be -+ * unmasked by the LVTPC handler. -+ */ -+ wrmsr(MSR_P4_IQ_CCCR0, nmi_p4_cccr_val, 0); -+ apic_write(APIC_LVTPC, APIC_DM_NMI); -+ } - wrmsr(nmi_perfctr_msr, -(cpu_khz/nmi_hz*1000), -1); -+ } - } - - static int dummy_nmi_callback(struct pt_regs * regs, int cpu) -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/process.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/process.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/process.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/process.c 2006-05-11 13:05:49.000000000 +0400 -@@ -209,7 +209,8 @@ void __show_regs(struct pt_regs * regs) - printk("Pid: %d, comm: %.20s %s %s\n", - current->pid, current->comm, print_tainted(), UTS_RELEASE); - printk("RIP: %04lx:[<%016lx>] ", regs->cs & 0xffff, regs->rip); -- printk_address(regs->rip); -+ if (decode_call_traces) -+ printk_address(regs->rip); - printk("\nRSP: %04lx:%016lx EFLAGS: %08lx\n", regs->ss, regs->rsp, regs->eflags); - printk("RAX: %016lx RBX: %016lx RCX: %016lx\n", - regs->rax, regs->rbx, regs->rcx); -@@ -606,7 +607,7 @@ long do_arch_prctl(struct task_struct *t - - switch (code) { - case ARCH_SET_GS: -- if (addr >= TASK_SIZE) -+ if (addr >= TASK_SIZE_OF(task)) - return -EPERM; - cpu = get_cpu(); - /* handle small bases via the GDT because that's faster to -@@ -632,7 +633,7 @@ long do_arch_prctl(struct task_struct *t - case ARCH_SET_FS: - /* Not strictly needed for fs, but do it for symmetry - with gs */ -- if (addr >= TASK_SIZE) -+ if (addr >= TASK_SIZE_OF(task)) - return -EPERM; - cpu = get_cpu(); - /* handle small bases via the GDT because that's faster to -@@ -711,3 
+712,20 @@ int dump_task_regs(struct task_struct *t - - return 1; - } -+ -+long do_fork_kthread(unsigned long clone_flags, -+ unsigned long stack_start, -+ struct pt_regs *regs, -+ unsigned long stack_size, -+ int __user *parent_tidptr, -+ int __user *child_tidptr) -+{ -+ if (ve_is_super(get_exec_env())) -+ return do_fork(clone_flags, stack_start, regs, stack_size, -+ parent_tidptr, child_tidptr); -+ -+ /* Don't allow kernel_thread() inside VE */ -+ printk("kernel_thread call inside VE\n"); -+ dump_stack(); -+ return -EPERM; -+} -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/ptrace.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/ptrace.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/ptrace.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/ptrace.c 2006-05-11 13:05:49.000000000 +0400 -@@ -128,12 +128,12 @@ static int putreg(struct task_struct *ch - value &= 0xffff; - return 0; - case offsetof(struct user_regs_struct,fs_base): -- if (!((value >> 48) == 0 || (value >> 48) == 0xffff)) -+ if (value >= TASK_SIZE_OF(child)) - return -EIO; - child->thread.fs = value; - return 0; - case offsetof(struct user_regs_struct,gs_base): -- if (!((value >> 48) == 0 || (value >> 48) == 0xffff)) -+ if (value >= TASK_SIZE_OF(child)) - return -EIO; - child->thread.gs = value; - return 0; -@@ -148,6 +148,11 @@ static int putreg(struct task_struct *ch - return -EIO; - value &= 0xffff; - break; -+ case offsetof(struct user_regs_struct, rip): -+ /* Check if the new RIP address is canonical */ -+ if (value >= TASK_SIZE_OF(child)) -+ return -EIO; -+ break; - } - put_stack_long(child, regno - sizeof(struct pt_regs), value); - return 0; -@@ -169,6 +174,15 @@ static unsigned long getreg(struct task_ - return child->thread.fs; - case offsetof(struct user_regs_struct, gs_base): - return child->thread.gs; -+ case offsetof(struct user_regs_struct, cs): -+ if (test_tsk_thread_flag(child, TIF_SYSCALL_TRACE)) { -+ val = get_stack_long(child, regno - sizeof(struct 
pt_regs)); -+ if (val == __USER_CS) -+ return 0x33; -+ if (val == __USER32_CS) -+ return 0x23; -+ } -+ /* fall through */ - default: - regno = regno - sizeof(struct pt_regs); - val = get_stack_long(child, regno); -@@ -202,7 +216,7 @@ asmlinkage long sys_ptrace(long request, - } - ret = -ESRCH; - read_lock(&tasklist_lock); -- child = find_task_by_pid(pid); -+ child = find_task_by_pid_ve(pid); - if (child) - get_task_struct(child); - read_unlock(&tasklist_lock); -@@ -246,7 +260,7 @@ asmlinkage long sys_ptrace(long request, - break; - - switch (addr) { -- case 0 ... sizeof(struct user_regs_struct): -+ case 0 ... sizeof(struct user_regs_struct) - sizeof(long): - tmp = getreg(child, addr); - break; - case offsetof(struct user, u_debugreg[0]): -@@ -285,33 +299,37 @@ asmlinkage long sys_ptrace(long request, - break; - - case PTRACE_POKEUSR: /* write the word at location addr in the USER area */ -+ { -+ int dsize; -+ -+ dsize = test_tsk_thread_flag(child, TIF_IA32) ? 3 : 7; - ret = -EIO; - if ((addr & 7) || - addr > sizeof(struct user) - 7) - break; - - switch (addr) { -- case 0 ... sizeof(struct user_regs_struct): -+ case 0 ... 
sizeof(struct user_regs_struct) - sizeof(long): - ret = putreg(child, addr, data); - break; - /* Disallows to set a breakpoint into the vsyscall */ - case offsetof(struct user, u_debugreg[0]): -- if (data >= TASK_SIZE-7) break; -+ if (data >= TASK_SIZE_OF(child) - dsize) break; - child->thread.debugreg0 = data; - ret = 0; - break; - case offsetof(struct user, u_debugreg[1]): -- if (data >= TASK_SIZE-7) break; -+ if (data >= TASK_SIZE_OF(child) - dsize) break; - child->thread.debugreg1 = data; - ret = 0; - break; - case offsetof(struct user, u_debugreg[2]): -- if (data >= TASK_SIZE-7) break; -+ if (data >= TASK_SIZE_OF(child) - dsize) break; - child->thread.debugreg2 = data; - ret = 0; - break; - case offsetof(struct user, u_debugreg[3]): -- if (data >= TASK_SIZE-7) break; -+ if (data >= TASK_SIZE_OF(child) - dsize) break; - child->thread.debugreg3 = data; - ret = 0; - break; -@@ -333,6 +351,7 @@ asmlinkage long sys_ptrace(long request, - break; - } - break; -+ } - case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */ - case PTRACE_CONT: { /* restart after signal. */ - long tmp; -@@ -393,7 +412,7 @@ asmlinkage long sys_ptrace(long request, - long tmp; - - ret = 0; -- if (child->state == TASK_ZOMBIE) /* already dead */ -+ if (child->exit_state == EXIT_ZOMBIE) /* already dead */ - break; - child->exit_code = SIGKILL; - /* make sure the single step bit is not set. */ -@@ -498,8 +517,10 @@ static void syscall_trace(struct pt_regs - current_thread_info()->flags, current->ptrace); - #endif - -+ set_pn_state(current, (regs->rax != -ENOSYS) ? PN_STOP_LEAVE : PN_STOP_ENTRY); - ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) - ? 0x80 : 0)); -+ clear_pn_state(current); - /* - * this isn't the same as continuing with a signal, but it will do - * for normal use. 
strace only continues with a signal if the -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/setup.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/setup.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/setup.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/setup.c 2006-05-11 13:05:28.000000000 +0400 -@@ -642,6 +642,23 @@ static int __init init_amd(struct cpuinf - int r; - int level; - -+#ifdef CONFIG_SMP -+ unsigned long value; -+ -+ /* -+ * Disable TLB flush filter by setting HWCR.FFDIS on K8 -+ * bit 6 of msr C001_0015 -+ * -+ * Errata 63 for SH-B3 steppings -+ * Errata 122 for all steppings (F+ have it disabled by default) -+ */ -+ if (c->x86 == 15) { -+ rdmsrl(MSR_K8_HWCR, value); -+ value |= 1 << 6; -+ wrmsrl(MSR_K8_HWCR, value); -+ } -+#endif -+ - /* Bit 31 in normal CPUID used for nonstandard 3DNow ID; - 3DNow is IDd by bit 31 in extended CPUID (1*32+31) anyway */ - clear_bit(0*32+31, &c->x86_capability); -@@ -1086,7 +1103,7 @@ static int show_cpuinfo(struct seq_file - seq_printf(m, "cache size\t: %d KB\n", c->x86_cache_size); - - #ifdef CONFIG_X86_HT -- if (cpu_has_ht) { -+ if (smp_num_siblings > 1) { - seq_printf(m, "physical id\t: %d\n", phys_proc_id[c - cpu_data]); - seq_printf(m, "siblings\t: %d\n", smp_num_siblings); - } -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/signal.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/signal.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/signal.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/signal.c 2006-05-11 13:05:45.000000000 +0400 -@@ -29,6 +29,7 @@ - #include <asm/uaccess.h> - #include <asm/i387.h> - #include <asm/proto.h> -+#include <asm/ia32_unistd.h> - - /* #define DEBUG_SIG 1 */ - -@@ -54,6 +55,7 @@ sys_rt_sigsuspend(sigset_t __user *unews - - spin_lock_irq(¤t->sighand->siglock); - saveset = current->blocked; -+ set_sigsuspend_state(current, saveset); - current->blocked = newset; - recalc_sigpending(); - 
spin_unlock_irq(¤t->sighand->siglock); -@@ -65,8 +67,10 @@ sys_rt_sigsuspend(sigset_t __user *unews - while (1) { - current->state = TASK_INTERRUPTIBLE; - schedule(); -- if (do_signal(®s, &saveset)) -+ if (do_signal(®s, &saveset)) { -+ clear_sigsuspend_state(current); - return -EINTR; -+ } - } - } - -@@ -124,6 +128,12 @@ restore_sigcontext(struct pt_regs *regs, - if (verify_area(VERIFY_READ, buf, sizeof(*buf))) - goto badframe; - err |= restore_i387(buf); -+ } else { -+ struct task_struct *me = current; -+ if (me->used_math) { -+ clear_fpu(me); -+ me->used_math = 0; -+ } - } - } - -@@ -287,7 +297,7 @@ static void setup_rt_frame(int sig, stru - if (ka->sa.sa_flags & SA_RESTORER) { - err |= __put_user(ka->sa.sa_restorer, &frame->pretcode); - } else { -- printk("%s forgot to set SA_RESTORER for signal %d.\n", me->comm, sig); -+ ve_printk(VE_LOG, "%s forgot to set SA_RESTORER for signal %d.\n", me->comm, sig); - goto give_sigsegv; - } - -@@ -349,7 +359,7 @@ handle_signal(unsigned long sig, siginfo - #endif - - /* Are we from a system call? */ -- if (regs->orig_rax >= 0) { -+ if ((long)regs->orig_rax >= 0) { - /* If so, check system call restarting.. */ - switch (regs->rax) { - case -ERESTART_RESTARTBLOCK: -@@ -411,9 +421,10 @@ int do_signal(struct pt_regs *regs, sigs - return 1; - } - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -- goto no_signal; -+ if (test_thread_flag(TIF_FREEZE)) { -+ refrigerator(); -+ if (!signal_pending(current)) -+ goto no_signal; - } - - if (!oldset) -@@ -436,7 +447,7 @@ int do_signal(struct pt_regs *regs, sigs - - no_signal: - /* Did we come from a system call? 
*/ -- if (regs->orig_rax >= 0) { -+ if ((long)regs->orig_rax >= 0) { - /* Restart the system call - no handlers present */ - long res = regs->rax; - if (res == -ERESTARTNOHAND || -@@ -446,7 +457,9 @@ int do_signal(struct pt_regs *regs, sigs - regs->rip -= 2; - } - if (regs->rax == (unsigned long)-ERESTART_RESTARTBLOCK) { -- regs->rax = __NR_restart_syscall; -+ regs->rax = test_thread_flag(TIF_IA32) ? -+ __NR_ia32_restart_syscall : -+ __NR_restart_syscall; - regs->rip -= 2; - } - } -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/smpboot.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/smpboot.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/smpboot.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/smpboot.c 2006-05-11 13:05:40.000000000 +0400 -@@ -309,8 +309,6 @@ void __init smp_callin(void) - Dprintk("CALLIN, before setup_local_APIC().\n"); - setup_local_APIC(); - -- local_irq_enable(); -- - /* - * Get our bogomips. - */ -@@ -324,8 +322,6 @@ void __init smp_callin(void) - */ - smp_store_cpu_info(cpuid); - -- local_irq_disable(); -- - /* - * Allow the master to continue. - */ -@@ -397,7 +393,8 @@ static struct task_struct * __init fork_ - * don't care about the eip and regs settings since - * we'll never reschedule the forked task. 
- */ -- return copy_process(CLONE_VM|CLONE_IDLETASK, 0, ®s, 0, NULL, NULL); -+ return copy_process(CLONE_VM|CLONE_IDLETASK, 0, -+ ®s, 0, NULL, NULL, 0); - } - - #if APIC_DEBUG -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/sys_x86_64.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/sys_x86_64.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/sys_x86_64.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/sys_x86_64.c 2006-05-11 13:05:40.000000000 +0400 -@@ -69,13 +69,7 @@ out: - static void find_start_end(unsigned long flags, unsigned long *begin, - unsigned long *end) - { --#ifdef CONFIG_IA32_EMULATION -- if (test_thread_flag(TIF_IA32)) { -- *begin = TASK_UNMAPPED_32; -- *end = IA32_PAGE_OFFSET; -- } else --#endif -- if (flags & MAP_32BIT) { -+ if (!test_thread_flag(TIF_IA32) && (flags & MAP_32BIT)) { - /* This is usually used needed to map code in small - model, so it needs to be in the first 31bit. Limit - it to that. This means we need to move the -@@ -83,12 +77,12 @@ static void find_start_end(unsigned long - conflicts with the heap, but we assume that glibc - malloc knows how to fall back to mmap. Give it 1GB - of playground for now. 
-AK */ -- *begin = 0x40000000; -- *end = 0x80000000; -- } else { -- *begin = TASK_UNMAPPED_64; -- *end = TASK_SIZE; -- } -+ *begin = 0x40000000; -+ *end = 0x80000000; -+ } else { -+ *begin = TASK_UNMAPPED_BASE; -+ *end = TASK_SIZE; -+ } - } - - unsigned long -@@ -146,7 +140,7 @@ asmlinkage long sys_uname(struct new_uts - { - int err; - down_read(&uts_sem); -- err = copy_to_user(name, &system_utsname, sizeof (*name)); -+ err = copy_to_user(name, &ve_utsname, sizeof (*name)); - up_read(&uts_sem); - if (personality(current->personality) == PER_LINUX32) - err |= copy_to_user(&name->machine, "i686", 5); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/time.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/time.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/time.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/time.c 2006-05-11 13:05:40.000000000 +0400 -@@ -55,10 +55,13 @@ static int nohpet __initdata = 0; - unsigned int cpu_khz; /* TSC clocks / usec, not used here */ - unsigned long hpet_period; /* fsecs / HPET clock */ - unsigned long hpet_tick; /* HPET clocks / interrupt */ -+static int hpet_use_timer; - unsigned long vxtime_hz = PIT_TICK_RATE; - int report_lost_ticks; /* command line option */ - unsigned long long monotonic_base; - -+EXPORT_SYMBOL(cpu_khz); -+ - struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ - - volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -@@ -96,7 +99,9 @@ static inline unsigned int do_gettimeoff - - static inline unsigned int do_gettimeoffset_hpet(void) - { -- return ((hpet_readl(HPET_COUNTER) - vxtime.last) * vxtime.quot) >> 32; -+ /* cap counter read to one tick to avoid inconsistencies */ -+ unsigned long counter = hpet_readl(HPET_COUNTER) - vxtime.last; -+ return (min(counter,hpet_tick) * vxtime.quot) >> 32; - } - - unsigned int (*do_gettimeoffset)(void) = do_gettimeoffset_tsc; -@@ -270,7 +275,7 @@ unsigned long long monotonic_clock(void) - - last_offset = 
vxtime.last; - base = monotonic_base; -- this_offset = hpet_readl(HPET_T0_CMP) - hpet_tick; -+ this_offset = hpet_readl(HPET_COUNTER); - - } while (read_seqretry(&xtime_lock, seq)); - offset = (this_offset - last_offset); -@@ -309,7 +314,14 @@ static irqreturn_t timer_interrupt(int i - - write_seqlock(&xtime_lock); - -- if (vxtime.hpet_address) { -+ if (vxtime.hpet_address) -+ offset = hpet_readl(HPET_COUNTER); -+ -+ if (hpet_use_timer) { -+ /* if we're using the hpet timer functionality, -+ * we can more accurately know the counter value -+ * when the timer interrupt occured. -+ */ - offset = hpet_readl(HPET_T0_CMP) - hpet_tick; - delay = hpet_readl(HPET_COUNTER) - offset; - } else { -@@ -630,8 +642,7 @@ static int hpet_init(void) - - id = hpet_readl(HPET_ID); - -- if (!(id & HPET_ID_VENDOR) || !(id & HPET_ID_NUMBER) || -- !(id & HPET_ID_LEGSUP)) -+ if (!(id & HPET_ID_VENDOR) || !(id & HPET_ID_NUMBER)) - return -1; - - hpet_period = hpet_readl(HPET_PERIOD); -@@ -641,6 +652,8 @@ static int hpet_init(void) - hpet_tick = (1000000000L * (USEC_PER_SEC / HZ) + hpet_period / 2) / - hpet_period; - -+ hpet_use_timer = (id & HPET_ID_LEGSUP); -+ - /* - * Stop the timers and reset the main counter. - */ -@@ -656,17 +669,20 @@ static int hpet_init(void) - * and period also hpet_tick. - */ - -- hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | -- HPET_TN_32BIT, HPET_T0_CFG); -- hpet_writel(hpet_tick, HPET_T0_CMP); -- hpet_writel(hpet_tick, HPET_T0_CMP); /* AK: why twice? */ -+ if (hpet_use_timer) { -+ hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | -+ HPET_TN_32BIT, HPET_T0_CFG); -+ hpet_writel(hpet_tick, HPET_T0_CMP); -+ hpet_writel(hpet_tick, HPET_T0_CMP); /* AK: why twice? */ -+ cfg |= HPET_CFG_LEGACY; -+ } - - /* - * Go! 
- */ - -- cfg |= HPET_CFG_ENABLE | HPET_CFG_LEGACY; -- hpet_writel(cfg, HPET_CFG); -+ cfg |= HPET_CFG_ENABLE; -+ hpet_writel(cfg, HPET_CFG); - - return 0; - } -@@ -719,9 +735,11 @@ void __init time_init(void) - set_normalized_timespec(&wall_to_monotonic, - -xtime.tv_sec, -xtime.tv_nsec); - -- if (!hpet_init()) { -+ if (!hpet_init()) - vxtime_hz = (1000000000000000L + hpet_period / 2) / - hpet_period; -+ -+ if (hpet_use_timer) { - cpu_khz = hpet_calibrate_tsc(); - timename = "HPET"; - } else { -@@ -754,12 +772,12 @@ void __init time_init_smp(void) - char *timetype; - - if (vxtime.hpet_address) { -- timetype = "HPET"; -+ timetype = hpet_use_timer ? "HPET" : "PIT/HPET"; - vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick; - vxtime.mode = VXTIME_HPET; - do_gettimeoffset = do_gettimeoffset_hpet; - } else { -- timetype = "PIT/TSC"; -+ timetype = hpet_use_timer ? "HPET/TSC" : "PIT/TSC"; - vxtime.mode = VXTIME_TSC; - } - printk(KERN_INFO "time.c: Using %s based timekeeping.\n", timetype); -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/trampoline.S linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/trampoline.S ---- linux-2.6.8.1.orig/arch/x86_64/kernel/trampoline.S 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/trampoline.S 2006-05-11 13:05:45.000000000 +0400 -@@ -46,7 +46,7 @@ r_base = . 
- lidt idt_48 - r_base # load idt with 0, 0 - lgdt gdt_48 - r_base # load gdt with whatever is appropriate - -- movw $__KERNEL_DS,%ax -+ movw $__BOOT_DS,%ax - movw %ax,%ds - movw %ax,%es - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/traps.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/traps.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/traps.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/traps.c 2006-05-11 13:05:49.000000000 +0400 -@@ -91,6 +91,9 @@ int printk_address(unsigned long address - char *delim = ":"; - char namebuf[128]; - -+ if (!decode_call_traces) -+ return printk("[<%016lx>]", address); -+ - symname = kallsyms_lookup(address, &symsize, &offset, &modname, namebuf); - if (!symname) - return printk("[<%016lx>]", address); -@@ -143,7 +146,7 @@ void show_trace(unsigned long *stack) - if (__kernel_text_address(addr)) { - i += printk_address(addr); - i += printk(" "); -- if (i > 50) { -+ if (i > 50 && decode_call_traces) { - printk("\n"); - i = 0; - } -@@ -172,7 +175,7 @@ void show_trace(unsigned long *stack) - if (__kernel_text_address(addr)) { - i += printk_address(addr); - i += printk(" "); -- if (i > 50) { -+ if (i > 50 && decode_call_traces) { - printk("\n "); - i = 0; - } -@@ -188,7 +191,7 @@ void show_trace(unsigned long *stack) - if (__kernel_text_address(addr)) { - i += printk_address(addr); - i += printk(" "); -- if (i > 50) { -+ if (i > 50 && decode_call_traces) { - printk("\n "); - i = 0; - } -@@ -254,10 +257,13 @@ void show_registers(struct pt_regs *regs - - rsp = regs->rsp; - -- printk("CPU %d ", cpu); -+ printk("CPU: %d, VCPU: %d:%d ", cpu, task_vsched_id(current), -+ task_cpu(current)); - __show_regs(regs); -- printk("Process %s (pid: %d, threadinfo %p, task %p)\n", -- cur->comm, cur->pid, cur->thread_info, cur); -+ printk("Process %s (pid: %d, veid=%d, threadinfo %p, task %p)\n", -+ cur->comm, cur->pid, -+ VEID(VE_TASK_INFO(current)->owner_env), -+ cur->thread_info, cur); - - /* - * When 
in-kernel, we also print out the stack and code at the -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/vmlinux.lds.S linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/vmlinux.lds.S ---- linux-2.6.8.1.orig/arch/x86_64/kernel/vmlinux.lds.S 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/vmlinux.lds.S 2006-05-11 13:05:29.000000000 +0400 -@@ -44,32 +44,31 @@ SECTIONS - } - __bss_end = .; - -- . = ALIGN(64); -+ . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .data.cacheline_aligned : { *(.data.cacheline_aligned) } - -+#define AFTER(x) BINALIGN(LOADADDR(x) + SIZEOF(x), 16) -+#define BINALIGN(x,y) (((x) + (y) - 1) & ~((y) - 1)) -+#define CACHE_ALIGN(x) BINALIGN(x, CONFIG_X86_L1_CACHE_BYTES) -+ - .vsyscall_0 -10*1024*1024: AT ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095)) { *(.vsyscall_0) } - __vsyscall_0 = LOADADDR(.vsyscall_0); -- . = ALIGN(64); -- .xtime_lock : AT ((LOADADDR(.vsyscall_0) + SIZEOF(.vsyscall_0) + 63) & ~(63)) { *(.xtime_lock) } -+ . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); -+ .xtime_lock : AT CACHE_ALIGN(AFTER(.vsyscall_0)) { *(.xtime_lock) } - xtime_lock = LOADADDR(.xtime_lock); -- . = ALIGN(16); -- .vxtime : AT ((LOADADDR(.xtime_lock) + SIZEOF(.xtime_lock) + 15) & ~(15)) { *(.vxtime) } -+ .vxtime : AT AFTER(.xtime_lock) { *(.vxtime) } - vxtime = LOADADDR(.vxtime); -- . = ALIGN(16); -- .wall_jiffies : AT ((LOADADDR(.vxtime) + SIZEOF(.vxtime) + 15) & ~(15)) { *(.wall_jiffies) } -+ .wall_jiffies : AT AFTER(.vxtime) { *(.wall_jiffies) } - wall_jiffies = LOADADDR(.wall_jiffies); -- . = ALIGN(16); -- .sys_tz : AT ((LOADADDR(.wall_jiffies) + SIZEOF(.wall_jiffies) + 15) & ~(15)) { *(.sys_tz) } -+ .sys_tz : AT AFTER(.wall_jiffies) { *(.sys_tz) } - sys_tz = LOADADDR(.sys_tz); -- . = ALIGN(16); -- .sysctl_vsyscall : AT ((LOADADDR(.sys_tz) + SIZEOF(.sys_tz) + 15) & ~(15)) { *(.sysctl_vsyscall) } -- sysctl_vsyscall = LOADADDR(.sysctl_vsyscall); -- . 
= ALIGN(16); -- .jiffies : AT ((LOADADDR(.sysctl_vsyscall) + SIZEOF(.sysctl_vsyscall) + 15) & ~(15)) { *(.jiffies) } -- jiffies = LOADADDR(.jiffies); -- . = ALIGN(16); -- .xtime : AT ((LOADADDR(.jiffies) + SIZEOF(.jiffies) + 15) & ~(15)) { *(.xtime) } -+ .sysctl_vsyscall : AT AFTER(.sys_tz) { *(.sysctl_vsyscall) } -+ sysctl_vsyscall = LOADADDR(.sysctl_vsyscall); -+ .xtime : AT AFTER(.sysctl_vsyscall) { *(.xtime) } - xtime = LOADADDR(.xtime); -+ . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); -+ .jiffies : AT CACHE_ALIGN(AFTER(.xtime)) { *(.jiffies) } -+ jiffies = LOADADDR(.jiffies); - .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT (LOADADDR(.vsyscall_0) + 1024) { *(.vsyscall_1) } - . = LOADADDR(.vsyscall_0) + 4096; - -diff -uprN linux-2.6.8.1.orig/arch/x86_64/kernel/vsyscall.c linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/vsyscall.c ---- linux-2.6.8.1.orig/arch/x86_64/kernel/vsyscall.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/kernel/vsyscall.c 2006-05-11 13:05:29.000000000 +0400 -@@ -165,14 +165,12 @@ static void __init map_vsyscall(void) - - static int __init vsyscall_init(void) - { -- if ((unsigned long) &vgettimeofday != VSYSCALL_ADDR(__NR_vgettimeofday)) -- panic("vgettimeofday link addr broken"); -- if ((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime)) -- panic("vtime link addr broken"); -- if (VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)) -- panic("fixmap first vsyscall %lx should be %lx", __fix_to_virt(VSYSCALL_FIRST_PAGE), -- VSYSCALL_ADDR(0)); -+ BUG_ON(((unsigned long) &vgettimeofday != -+ VSYSCALL_ADDR(__NR_vgettimeofday))); -+ BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime)); -+ BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE))); - map_vsyscall(); -+ sysctl_vsyscall = 1; - - return 0; - } -diff -uprN linux-2.6.8.1.orig/arch/x86_64/lib/copy_user.S linux-2.6.8.1-ve022stab078/arch/x86_64/lib/copy_user.S ---- linux-2.6.8.1.orig/arch/x86_64/lib/copy_user.S 2004-08-14 14:55:34.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/arch/x86_64/lib/copy_user.S 2006-05-11 13:05:30.000000000 +0400 -@@ -73,7 +73,7 @@ bad_to_user: - * rdx count - * - * Output: -- * eax uncopied bytes or 0 if successfull. -+ * eax uncopied bytes or 0 if successful. - */ - .globl copy_user_generic - .p2align 4 -@@ -179,9 +179,9 @@ copy_user_generic: - movl $8,%r9d - subl %ecx,%r9d - movl %r9d,%ecx -- subq %r9,%rdx -- jz .Lsmall_align -- js .Lsmall_align -+ cmpq %r9,%rdx -+ jz .Lhandle_7 -+ js .Lhandle_7 - .Lalign_1: - .Ls11: movb (%rsi),%bl - .Ld11: movb %bl,(%rdi) -@@ -189,10 +189,8 @@ copy_user_generic: - incq %rdi - decl %ecx - jnz .Lalign_1 -+ subq %r9,%rdx - jmp .Lafter_bad_alignment --.Lsmall_align: -- addq %r9,%rdx -- jmp .Lhandle_7 - #endif - - /* table sorted by exception address */ -@@ -219,8 +217,8 @@ copy_user_generic: - .quad .Ls10,.Le_byte - .quad .Ld10,.Le_byte - #ifdef FIX_ALIGNMENT -- .quad .Ls11,.Le_byte -- .quad .Ld11,.Le_byte -+ .quad .Ls11,.Lzero_rest -+ .quad .Ld11,.Lzero_rest - #endif - .quad .Le5,.Le_zero - .previous -diff -uprN linux-2.6.8.1.orig/arch/x86_64/lib/csum-copy.S linux-2.6.8.1-ve022stab078/arch/x86_64/lib/csum-copy.S ---- linux-2.6.8.1.orig/arch/x86_64/lib/csum-copy.S 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/lib/csum-copy.S 2006-05-11 13:05:30.000000000 +0400 -@@ -188,8 +188,8 @@ csum_partial_copy_generic: - source - movw (%rdi),%bx - adcl %ebx,%eax -- dest - decl %ecx -+ dest - movw %bx,(%rsi) - leaq 2(%rdi),%rdi - leaq 2(%rsi),%rsi -diff -uprN linux-2.6.8.1.orig/arch/x86_64/mm/fault.c linux-2.6.8.1-ve022stab078/arch/x86_64/mm/fault.c ---- linux-2.6.8.1.orig/arch/x86_64/mm/fault.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/mm/fault.c 2006-05-11 13:05:40.000000000 +0400 -@@ -34,27 +34,6 @@ - #include <asm/kdebug.h> - #include <asm-generic/sections.h> - --void bust_spinlocks(int yes) --{ -- int loglevel_save = console_loglevel; -- if (yes) { -- oops_in_progress = 1; -- } else { 
--#ifdef CONFIG_VT -- unblank_screen(); --#endif -- oops_in_progress = 0; -- /* -- * OK, the message is on the console. Now we call printk() -- * without oops_in_progress set so that printk will give klogd -- * a poke. Hold onto your hats... -- */ -- console_loglevel = 15; /* NMI oopser may have shut the console up */ -- printk(" "); -- console_loglevel = loglevel_save; -- } --} -- - /* Sometimes the CPU reports invalid exceptions on prefetch. - Check that here and ignore. - Opcode checker based on code by Richard Brunner */ -@@ -219,7 +198,7 @@ int unhandled_signal(struct task_struct - } - - int page_fault_trace; --int exception_trace = 1; -+int exception_trace = 0; - - /* - * This routine handles page faults. It determines the address, -@@ -261,7 +240,7 @@ asmlinkage void do_page_fault(struct pt_ - local_irq_enable(); - - if (unlikely(page_fault_trace)) -- printk("pagefault rip:%lx rsp:%lx cs:%lu ss:%lu address %lx error %lx\n", -+ ve_printk(VE_LOG, "pagefault rip:%lx rsp:%lx cs:%lu ss:%lu address %lx error %lx\n", - regs->rip,regs->rsp,regs->cs,regs->ss,address,error_code); - - tsk = current; -@@ -281,8 +260,27 @@ asmlinkage void do_page_fault(struct pt_ - if (unlikely(in_atomic() || !mm)) - goto bad_area_nosemaphore; - -- again: -- down_read(&mm->mmap_sem); -+ /* When running in the kernel we expect faults to occur only to -+ * addresses in user space. All other faults represent errors in the -+ * kernel and should generate an OOPS. Unfortunatly, in the case of an -+ * erroneous fault occuring in a code path which already holds mmap_sem -+ * we will deadlock attempting to validate the fault against the -+ * address space. Luckily the kernel only validly references user -+ * space from well defined areas of code, which are listed in the -+ * exceptions table. -+ * -+ * As the vast majority of faults will be valid we will only perform -+ * the source reference check when there is a possibilty of a deadlock. 
-+ * Attempt to lock the address space, if we cannot we then validate the -+ * source. If this is invalid we can skip the address space check, -+ * thus avoiding the deadlock. -+ */ -+ if (!down_read_trylock(&mm->mmap_sem)) { -+ if ((error_code & 4) == 0 && -+ !search_exception_tables(regs->rip)) -+ goto bad_area_nosemaphore; -+ down_read(&mm->mmap_sem); -+ } - - vma = find_vma(mm, address); - if (!vma) -@@ -349,17 +347,6 @@ bad_area: - up_read(&mm->mmap_sem); - - bad_area_nosemaphore: -- --#ifdef CONFIG_IA32_EMULATION -- /* 32bit vsyscall. map on demand. */ -- if (test_thread_flag(TIF_IA32) && -- address >= 0xffffe000 && address < 0xffffe000 + PAGE_SIZE) { -- if (map_syscall32(mm, address) < 0) -- goto out_of_memory2; -- return; -- } --#endif -- - /* User mode accesses just cause a SIGSEGV */ - if (error_code & 4) { - if (is_prefetch(regs, address)) -@@ -376,7 +363,7 @@ bad_area_nosemaphore: - return; - - if (exception_trace && unhandled_signal(tsk, SIGSEGV)) { -- printk(KERN_INFO -+ ve_printk(VE_LOG, KERN_INFO - "%s[%d]: segfault at %016lx rip %016lx rsp %016lx error %lx\n", - tsk->comm, tsk->pid, address, regs->rip, - regs->rsp, error_code); -@@ -440,14 +427,14 @@ no_context: - */ - out_of_memory: - up_read(&mm->mmap_sem); --out_of_memory2: -- if (current->pid == 1) { -- yield(); -- goto again; -- } -- printk("VM: killing process %s\n", tsk->comm); -- if (error_code & 4) -- do_exit(SIGKILL); -+ if (error_code & 4) { -+ /* -+ * 0-order allocation always success if something really -+ * fatal not happen: beancounter overdraft or OOM. 
Den -+ */ -+ force_sig(SIGKILL, tsk); -+ return; -+ } - goto no_context; - - do_sigbus: -diff -uprN linux-2.6.8.1.orig/arch/x86_64/mm/init.c linux-2.6.8.1-ve022stab078/arch/x86_64/mm/init.c ---- linux-2.6.8.1.orig/arch/x86_64/mm/init.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/mm/init.c 2006-05-11 13:05:40.000000000 +0400 -@@ -22,6 +22,7 @@ - #include <linux/pagemap.h> - #include <linux/bootmem.h> - #include <linux/proc_fs.h> -+#include <linux/module.h> - - #include <asm/processor.h> - #include <asm/system.h> -@@ -80,6 +81,8 @@ void show_mem(void) - printk("%d pages swap cached\n",cached); - } - -+EXPORT_SYMBOL(show_mem); -+ - /* References to section boundaries */ - - extern char _text, _etext, _edata, __bss_start, _end[]; -@@ -578,9 +581,9 @@ static __init int x8664_sysctl_init(void - __initcall(x8664_sysctl_init); - #endif - --/* Pseudo VMAs to allow ptrace access for the vsyscall pages. x86-64 has two -- different ones: one for 32bit and one for 64bit. Use the appropiate -- for the target task. */ -+/* A pseudo VMAs to allow ptrace access for the vsyscall page. This only -+ covers the 64bit vsyscall page now. 32bit has a real VMA now and does -+ not need special handling anymore. */ - - static struct vm_area_struct gate_vma = { - .vm_start = VSYSCALL_START, -@@ -588,19 +591,15 @@ static struct vm_area_struct gate_vma = - .vm_page_prot = PAGE_READONLY - }; - --static struct vm_area_struct gate32_vma = { -- .vm_start = VSYSCALL32_BASE, -- .vm_end = VSYSCALL32_END, -- .vm_page_prot = PAGE_READONLY --}; -- - struct vm_area_struct *get_gate_vma(struct task_struct *tsk) - { -- return test_tsk_thread_flag(tsk, TIF_IA32) ? &gate32_vma : &gate_vma; -+ return test_tsk_thread_flag(tsk, TIF_IA32) ? 
NULL : &gate_vma; - } - - int in_gate_area(struct task_struct *task, unsigned long addr) - { - struct vm_area_struct *vma = get_gate_vma(task); -+ if (!vma) -+ return 0; - return (addr >= vma->vm_start) && (addr < vma->vm_end); - } -diff -uprN linux-2.6.8.1.orig/arch/x86_64/mm/ioremap.c linux-2.6.8.1-ve022stab078/arch/x86_64/mm/ioremap.c ---- linux-2.6.8.1.orig/arch/x86_64/mm/ioremap.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/arch/x86_64/mm/ioremap.c 2006-05-11 13:05:30.000000000 +0400 -@@ -16,7 +16,7 @@ - #include <asm/fixmap.h> - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> -- -+#include <asm/proto.h> - - static inline void remap_area_pte(pte_t * pte, unsigned long address, unsigned long size, - unsigned long phys_addr, unsigned long flags) -@@ -99,7 +99,31 @@ static int remap_area_pages(unsigned lon - } - - /* -- * Generic mapping function (not visible outside): -+ * Fix up the linear direct mapping of the kernel to avoid cache attribute -+ * conflicts. -+ */ -+static int -+ioremap_change_attr(unsigned long phys_addr, unsigned long size, -+ unsigned long flags) -+{ -+ int err = 0; -+ if (flags && phys_addr + size - 1 < (end_pfn_map << PAGE_SHIFT)) { -+ unsigned long npages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; -+ unsigned long vaddr = (unsigned long) __va(phys_addr); -+ -+ /* -+ * Must use a address here and not struct page because the phys addr -+ * can be a in hole between nodes and not have an memmap entry. -+ */ -+ err = change_page_attr_addr(vaddr,npages,__pgprot(__PAGE_KERNEL|flags)); -+ if (!err) -+ global_flush_tlb(); -+ } -+ return err; -+} -+ -+/* -+ * Generic mapping function - */ - - /* -@@ -155,12 +179,17 @@ void * __ioremap(unsigned long phys_addr - /* - * Ok, go for it.. 
- */ -- area = get_vm_area(size, VM_IOREMAP); -+ area = get_vm_area(size, VM_IOREMAP | (flags << 24)); - if (!area) - return NULL; - area->phys_addr = phys_addr; - addr = area->addr; - if (remap_area_pages((unsigned long) addr, phys_addr, size, flags)) { -+ remove_vm_area((void *)(PAGE_MASK & (unsigned long) addr)); -+ return NULL; -+ } -+ if (ioremap_change_attr(phys_addr, size, flags) < 0) { -+ area->flags &= 0xffffff; - vunmap(addr); - return NULL; - } -@@ -191,43 +220,34 @@ void * __ioremap(unsigned long phys_addr - - void *ioremap_nocache (unsigned long phys_addr, unsigned long size) - { -- void *p = __ioremap(phys_addr, size, _PAGE_PCD); -- if (!p) -- return p; -- -- if (phys_addr + size < virt_to_phys(high_memory)) { -- struct page *ppage = virt_to_page(__va(phys_addr)); -- unsigned long npages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; -- -- BUG_ON(phys_addr+size > (unsigned long)high_memory); -- BUG_ON(phys_addr + size < phys_addr); -- -- if (change_page_attr(ppage, npages, PAGE_KERNEL_NOCACHE) < 0) { -- iounmap(p); -- p = NULL; -- } -- global_flush_tlb(); -- } -- -- return p; -+ return __ioremap(phys_addr, size, _PAGE_PCD); - } - - void iounmap(void *addr) - { -- struct vm_struct *p; -+ struct vm_struct *p, **pprev; -+ - if (addr <= high_memory) - return; -- p = remove_vm_area((void *)(PAGE_MASK & (unsigned long) addr)); -+ -+ write_lock(&vmlist_lock); -+ for (p = vmlist, pprev = &vmlist; p != NULL; pprev = &p->next, p = *pprev) -+ if (p->addr == (void *)(PAGE_MASK & (unsigned long)addr)) -+ break; - if (!p) { - printk("__iounmap: bad address %p\n", addr); -- return; -- } -- -- if (p->flags && p->phys_addr < virt_to_phys(high_memory)) { -- change_page_attr(virt_to_page(__va(p->phys_addr)), -+ goto out_unlock; -+ } -+ *pprev = p->next; -+ unmap_vm_area(p); -+ if ((p->flags >> 24) && -+ p->phys_addr + p->size - 1 < virt_to_phys(high_memory)) { -+ change_page_attr_addr((unsigned long)__va(p->phys_addr), - p->size >> PAGE_SHIFT, - PAGE_KERNEL); - 
global_flush_tlb();
- }
-+out_unlock:
-+	write_unlock(&vmlist_lock);
- 	kfree(p);
- }
-diff -uprN linux-2.6.8.1.orig/arch/x86_64/mm/pageattr.c linux-2.6.8.1-ve022stab078/arch/x86_64/mm/pageattr.c
---- linux-2.6.8.1.orig/arch/x86_64/mm/pageattr.c	2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/arch/x86_64/mm/pageattr.c	2006-05-11 13:05:30.000000000 +0400
-@@ -61,7 +61,10 @@ static void flush_kernel_map(void *addre
- 			asm volatile("clflush (%0)" :: "r" (address + i));
- 	} else
- 		asm volatile("wbinvd":::"memory");
--	__flush_tlb_one(address);
-+	if (address)
-+		__flush_tlb_one(address);
-+	else
-+		__flush_tlb_all();
- }
- 
- 
-@@ -111,13 +114,12 @@ static void revert_page(unsigned long ad
- }
- 
- static int
---__change_page_attr(unsigned long address, struct page *page, pgprot_t prot,
---		   pgprot_t ref_prot)
-+__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
-+		   pgprot_t ref_prot)
- {
- 	pte_t *kpte;
- 	struct page *kpte_page;
- 	unsigned kpte_flags;
--
- 	kpte = lookup_address(address);
- 	if (!kpte) return 0;
- 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
-@@ -125,20 +127,20 @@ __change_page_attr(unsigned long address
- 	if (pgprot_val(prot) != pgprot_val(ref_prot)) {
- 		if ((kpte_flags & _PAGE_PSE) == 0) {
- 			pte_t old = *kpte;
--			pte_t standard = mk_pte(page, ref_prot);
-+			pte_t standard = pfn_pte(pfn, ref_prot);
- 
--			set_pte(kpte, mk_pte(page, prot));
-+			set_pte(kpte, pfn_pte(pfn, prot));
- 			if (pte_same(old,standard))
- 				get_page(kpte_page);
- 		} else {
- 			struct page *split = split_large_page(address, prot, ref_prot);
- 			if (!split)
- 				return -ENOMEM;
--			get_page(kpte_page);
-+			get_page(split);
- 			set_pte(kpte,mk_pte(split, ref_prot));
- 		}
- 	} else if ((kpte_flags & _PAGE_PSE) == 0) {
--		set_pte(kpte, mk_pte(page, ref_prot));
-+		set_pte(kpte, pfn_pte(pfn, ref_prot));
- 		__put_page(kpte_page);
- 	}
- 
-@@ -162,31 +164,38 @@ __change_page_attr(unsigned long address
-  *
-  * Caller must call global_flush_tlb() after this.
- */
--int change_page_attr(struct page *page, int numpages, pgprot_t prot)
-+int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
- {
- 	int err = 0;
- 	int i;
- 
- 	down_write(&init_mm.mmap_sem);
--	for (i = 0; i < numpages; !err && i++, page++) {
--		unsigned long address = (unsigned long)page_address(page);
--		err = __change_page_attr(address, page, prot, PAGE_KERNEL);
-+	for (i = 0; i < numpages; i++, address += PAGE_SIZE) {
-+		unsigned long pfn = __pa(address) >> PAGE_SHIFT;
-+
-+		err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
- 		if (err)
- 			break;
- 		/* Handle kernel mapping too which aliases part of the
- 		 * lowmem */
- 		/* Disabled right now. Fixme */
--		if (0 && page_to_phys(page) < KERNEL_TEXT_SIZE) {
-+		if (0 && __pa(address) < KERNEL_TEXT_SIZE) {
- 			unsigned long addr2;
--			addr2 = __START_KERNEL_map + page_to_phys(page);
--			err = __change_page_attr(addr2, page, prot,
--						 PAGE_KERNEL_EXEC);
-+			addr2 = __START_KERNEL_map + __pa(address);
-+			err = __change_page_attr(addr2, pfn, prot, PAGE_KERNEL_EXEC);
- 		}
- 	}
- 	up_write(&init_mm.mmap_sem);
- 	return err;
- }
- 
-+/* Don't call this for MMIO areas that may not have a mem_map entry */
-+int change_page_attr(struct page *page, int numpages, pgprot_t prot)
-+{
-+	unsigned long addr = (unsigned long)page_address(page);
-+	return change_page_attr_addr(addr, numpages, prot);
-+}
-+
- void global_flush_tlb(void)
- {
- 	struct deferred_page *df, *next_df;
-@@ -194,6 +203,8 @@ void global_flush_tlb(void)
- 	down_read(&init_mm.mmap_sem);
- 	df = xchg(&df_list, NULL);
- 	up_read(&init_mm.mmap_sem);
-+	if (!df)
-+		return;
- 	flush_map((df && !df->next) ? df->address : 0);
- 	for (; df; df = next_df) {
- 		next_df = df->next;
-diff -uprN linux-2.6.8.1.orig/drivers/base/class.c linux-2.6.8.1-ve022stab078/drivers/base/class.c
---- linux-2.6.8.1.orig/drivers/base/class.c	2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/base/class.c	2006-05-11 13:05:42.000000000 +0400
-@@ -69,8 +69,13 @@ static struct kobj_type ktype_class = {
- };
- 
- /* Hotplug events for classes go to the class_obj subsys */
--static decl_subsys(class, &ktype_class, NULL);
-+decl_subsys(class, &ktype_class, NULL);
- 
-+#ifndef CONFIG_VE
-+#define visible_class_subsys class_subsys
-+#else
-+#define visible_class_subsys (*get_exec_env()->class_subsys)
-+#endif
- 
- int class_create_file(struct class * cls, const struct class_attribute * attr)
- {
-@@ -143,7 +148,7 @@ int class_register(struct class * cls)
- 	if (error)
- 		return error;
- 
--	subsys_set_kset(cls, class_subsys);
-+	subsys_set_kset(cls, visible_class_subsys);
- 
- 	error = subsystem_register(&cls->subsys);
- 	if (!error) {
-@@ -304,8 +309,13 @@ static struct kset_hotplug_ops class_hot
- 	.hotplug	= class_hotplug,
- };
- 
--static decl_subsys(class_obj, &ktype_class_device, &class_hotplug_ops);
-+decl_subsys(class_obj, &ktype_class_device, &class_hotplug_ops);
- 
-+#ifndef CONFIG_VE
-+#define visible_class_obj_subsys class_obj_subsys
-+#else
-+#define visible_class_obj_subsys (*get_exec_env()->class_obj_subsys)
-+#endif
- 
- static int class_device_add_attrs(struct class_device * cd)
- {
-@@ -342,7 +352,7 @@ static void class_device_remove_attrs(st
- 
- void class_device_initialize(struct class_device *class_dev)
- {
--	kobj_set_kset_s(class_dev, class_obj_subsys);
-+	kobj_set_kset_s(class_dev, visible_class_obj_subsys);
- 	kobject_init(&class_dev->kobj);
- 	INIT_LIST_HEAD(&class_dev->node);
- }
-@@ -505,12 +515,19 @@ void class_interface_unregister(struct c
- 		class_put(parent);
- }
- 
--
-+void prepare_sysfs_classes(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->class_subsys = &class_subsys;
-+	get_ve0()->class_obj_subsys = &class_obj_subsys;
-+#endif
-+}
- 
- int __init classes_init(void)
- {
- 	int retval;
- 
-+	prepare_sysfs_classes();
- 	retval = subsystem_register(&class_subsys);
- 	if (retval)
- 		return retval;
-@@ -542,3 +559,6 @@ EXPORT_SYMBOL(class_device_remove_file);
- 
- EXPORT_SYMBOL(class_interface_register);
- EXPORT_SYMBOL(class_interface_unregister);
-+
-+EXPORT_SYMBOL(class_subsys);
-+EXPORT_SYMBOL(class_obj_subsys);
-diff -uprN linux-2.6.8.1.orig/drivers/block/floppy.c linux-2.6.8.1-ve022stab078/drivers/block/floppy.c
---- linux-2.6.8.1.orig/drivers/block/floppy.c	2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/block/floppy.c	2006-05-11 13:05:35.000000000 +0400
-@@ -3774,7 +3774,7 @@ static int floppy_open(struct inode *ino
- 	 * Needed so that programs such as fdrawcmd still can work on write
- 	 * protected disks */
- 	if (filp->f_mode & 2
--	    || permission(filp->f_dentry->d_inode, 2, NULL) == 0)
-+	    || permission(filp->f_dentry->d_inode, 2, NULL, NULL) == 0)
- 		filp->private_data = (void *)8;
- 
- 	if (UFDCS->rawcmd == 1)
-diff -uprN linux-2.6.8.1.orig/drivers/block/genhd.c linux-2.6.8.1-ve022stab078/drivers/block/genhd.c
---- linux-2.6.8.1.orig/drivers/block/genhd.c	2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/block/genhd.c	2006-05-11 13:05:40.000000000 +0400
-@@ -18,6 +18,8 @@
- #define MAX_PROBE_HASH 255	/* random */
- 
- static struct subsystem block_subsys;
-+struct subsystem *get_block_subsys(void) {return &block_subsys;}
-+EXPORT_SYMBOL(get_block_subsys);
- 
- /*
-  * Can be deleted altogether. Later.
-diff -uprN linux-2.6.8.1.orig/drivers/block/ioctl.c linux-2.6.8.1-ve022stab078/drivers/block/ioctl.c
---- linux-2.6.8.1.orig/drivers/block/ioctl.c	2004-08-14 14:54:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/block/ioctl.c	2006-05-11 13:05:34.000000000 +0400
-@@ -219,3 +219,5 @@ int blkdev_ioctl(struct inode *inode, st
- 	}
- 	return -ENOTTY;
- }
-+
-+EXPORT_SYMBOL_GPL(blkdev_ioctl);
-diff -uprN linux-2.6.8.1.orig/drivers/block/ll_rw_blk.c linux-2.6.8.1-ve022stab078/drivers/block/ll_rw_blk.c
---- linux-2.6.8.1.orig/drivers/block/ll_rw_blk.c	2004-08-14 14:54:49.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/block/ll_rw_blk.c	2006-05-11 13:05:31.000000000 +0400
-@@ -263,6 +263,45 @@ void blk_queue_make_request(request_queu
- EXPORT_SYMBOL(blk_queue_make_request);
- 
- /**
-+ * blk_queue_ordered - does this queue support ordered writes
-+ * @q:     the request queue
-+ * @flag:  see below
-+ *
-+ * Description:
-+ *   For journalled file systems, doing ordered writes on a commit
-+ *   block instead of explicitly doing wait_on_buffer (which is bad
-+ *   for performance) can be a big win. Block drivers supporting this
-+ *   feature should call this function and indicate so.
-+ *
-+ **/
-+void blk_queue_ordered(request_queue_t *q, int flag)
-+{
-+	if (flag)
-+		set_bit(QUEUE_FLAG_ORDERED, &q->queue_flags);
-+	else
-+		clear_bit(QUEUE_FLAG_ORDERED, &q->queue_flags);
-+}
-+
-+EXPORT_SYMBOL(blk_queue_ordered);
-+
-+/**
-+ * blk_queue_issue_flush_fn - set function for issuing a flush
-+ * @q:     the request queue
-+ * @iff:   the function to be called issuing the flush
-+ *
-+ * Description:
-+ *   If a driver supports issuing a flush command, the support is notified
-+ *   to the block layer by defining it through this call.
-+ *
-+ **/
-+void blk_queue_issue_flush_fn(request_queue_t *q, issue_flush_fn *iff)
-+{
-+	q->issue_flush_fn = iff;
-+}
-+
-+EXPORT_SYMBOL(blk_queue_issue_flush_fn);
-+
-+/**
-  * blk_queue_bounce_limit - set bounce buffer limit for queue
-  * @q:  the request queue for the device
-  * @dma_addr:   bus address limit
-@@ -1925,10 +1964,11 @@ int blk_execute_rq(request_queue_t *q, s
- 	}
- 
- 	rq->flags |= REQ_NOMERGE;
--	rq->waiting = &wait;
-+	if (!rq->waiting)
-+		rq->waiting = &wait;
- 	elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 1);
- 	generic_unplug_device(q);
--	wait_for_completion(&wait);
-+	wait_for_completion(rq->waiting);
- 	rq->waiting = NULL;
- 
- 	if (rq->errors)
-@@ -1939,6 +1979,72 @@ int blk_execute_rq(request_queue_t *q, s
- 
- EXPORT_SYMBOL(blk_execute_rq);
- 
-+/**
-+ * blkdev_issue_flush - queue a flush
-+ * @bdev:	blockdev to issue flush for
-+ * @error_sector:	error sector
-+ *
-+ * Description:
-+ *    Issue a flush for the block device in question. Caller can supply
-+ *    room for storing the error offset in case of a flush error, if they
-+ *    wish to. Caller must run wait_for_completion() on its own.
-+ */
-+int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
-+{
-+	request_queue_t *q;
-+
-+	if (bdev->bd_disk == NULL)
-+		return -ENXIO;
-+
-+	q = bdev_get_queue(bdev);
-+	if (!q)
-+		return -ENXIO;
-+	if (!q->issue_flush_fn)
-+		return -EOPNOTSUPP;
-+
-+	return q->issue_flush_fn(q, bdev->bd_disk, error_sector);
-+}
-+
-+EXPORT_SYMBOL(blkdev_issue_flush);
-+
-+/**
-+ * blkdev_scsi_issue_flush_fn - issue flush for SCSI devices
-+ * @q:		device queue
-+ * @disk:	gendisk
-+ * @error_sector:	error offset
-+ *
-+ * Description:
-+ *    Devices understanding the SCSI command set, can use this function as
-+ *    a helper for issuing a cache flush. Note: driver is required to store
-+ *    the error offset (in case of error flushing) in ->sector of struct
-+ *    request.
-+ */
-+int blkdev_scsi_issue_flush_fn(request_queue_t *q, struct gendisk *disk,
-+			       sector_t *error_sector)
-+{
-+	struct request *rq = blk_get_request(q, WRITE, __GFP_WAIT);
-+	int ret;
-+
-+	rq->flags |= REQ_BLOCK_PC | REQ_SOFTBARRIER;
-+	rq->sector = 0;
-+	memset(rq->cmd, 0, sizeof(rq->cmd));
-+	rq->cmd[0] = 0x35;
-+	rq->cmd_len = 12;
-+	rq->data = NULL;
-+	rq->data_len = 0;
-+	rq->timeout = 60 * HZ;
-+
-+	ret = blk_execute_rq(q, disk, rq);
-+
-+	if (ret && error_sector)
-+		*error_sector = rq->sector;
-+
-+	blk_put_request(rq);
-+	return ret;
-+}
-+
-+EXPORT_SYMBOL(blkdev_scsi_issue_flush_fn);
-+
- void drive_stat_acct(struct request *rq, int nr_sectors, int new_io)
- {
- 	int rw = rq_data_dir(rq);
-@@ -2192,7 +2298,7 @@ EXPORT_SYMBOL(__blk_attempt_remerge);
- static int __make_request(request_queue_t *q, struct bio *bio)
- {
- 	struct request *req, *freereq = NULL;
--	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra;
-+	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, err, sync;
- 	sector_t sector;
- 
- 	sector = bio->bi_sector;
-@@ -2210,9 +2316,11 @@ static int __make_request(request_queue_
- 
- 	spin_lock_prefetch(q->queue_lock);
- 
--	barrier = test_bit(BIO_RW_BARRIER, &bio->bi_rw);
--
--	ra = bio->bi_rw & (1 << BIO_RW_AHEAD);
-+	barrier = bio_barrier(bio);
-+	if (barrier && !(q->queue_flags & (1 << QUEUE_FLAG_ORDERED))) {
-+		err = -EOPNOTSUPP;
-+		goto end_io;
-+	}
- 
- again:
- 	spin_lock_irq(q->queue_lock);
-@@ -2238,6 +2346,7 @@ again:
- 			drive_stat_acct(req, nr_sectors, 0);
- 			if (!attempt_back_merge(q, req))
- 				elv_merged_request(q, req);
-+			sync = bio_sync(bio);
- 			goto out;
- 
- 		case ELEVATOR_FRONT_MERGE:
-@@ -2264,6 +2373,7 @@ again:
- 			drive_stat_acct(req, nr_sectors, 0);
- 			if (!attempt_front_merge(q, req))
- 				elv_merged_request(q, req);
-+			sync = bio_sync(bio);
- 			goto out;
- 
- 		/*
-@@ -2292,7 +2402,8 @@ get_rq:
- 		/*
- 		 * READA bit set
- 		 */
--		if (ra)
-+		err = -EWOULDBLOCK;
-+		if (bio_rw_ahead(bio))
- 			goto end_io;
- 
- 		freereq = get_request_wait(q, rw);
-@@ -2303,10 +2414,9 @@ get_rq:
- 	req->flags |= REQ_CMD;
- 
- 	/*
--	 * inherit FAILFAST from bio and don't stack up
--	 * retries for read ahead
-+	 * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
- 	 */
--	if (ra || test_bit(BIO_RW_FAILFAST, &bio->bi_rw))
-+	if (bio_rw_ahead(bio) || bio_failfast(bio))
- 		req->flags |= REQ_FAILFAST;
- 
- 	/*
-@@ -2329,18 +2439,19 @@ get_rq:
- 	req->rq_disk = bio->bi_bdev->bd_disk;
- 	req->start_time = jiffies;
- 
-+	sync = bio_sync(bio);
- 	add_request(q, req);
- out:
- 	if (freereq)
- 		__blk_put_request(q, freereq);
--	if (bio_sync(bio))
-+	if (sync)
- 		__generic_unplug_device(q);
- 
- 	spin_unlock_irq(q->queue_lock);
- 	return 0;
- 
- end_io:
--	bio_endio(bio, nr_sectors << 9, -EWOULDBLOCK);
-+	bio_endio(bio, nr_sectors << 9, err);
- 	return 0;
- }
- 
-@@ -2647,10 +2758,17 @@ void blk_recalc_rq_sectors(struct reques
- static int __end_that_request_first(struct request *req, int uptodate,
- 				    int nr_bytes)
- {
--	int total_bytes, bio_nbytes, error = 0, next_idx = 0;
-+	int total_bytes, bio_nbytes, error, next_idx = 0;
- 	struct bio *bio;
- 
- 	/*
-+	 * extend uptodate bool to allow < 0 value to be direct io error
-+	 */
-+	error = 0;
-+	if (end_io_error(uptodate))
-+		error = !uptodate ? -EIO : uptodate;
-+
-+	/*
- 	 * for a REQ_BLOCK_PC request, we want to carry any eventual
- 	 * sense key with us all the way through
- 	 */
-@@ -2658,7 +2776,6 @@ static int __end_that_request_first(stru
- 		req->errors = 0;
- 
- 	if (!uptodate) {
--		error = -EIO;
- 		if (blk_fs_request(req) && !(req->flags & REQ_QUIET))
- 			printk("end_request: I/O error, dev %s, sector %llu\n",
- 				req->rq_disk ? req->rq_disk->disk_name : "?",
-@@ -2741,7 +2858,7 @@ static int __end_that_request_first(stru
- /**
-  * end_that_request_first - end I/O on a request
-  * @req:      the request being processed
-- * @uptodate: 0 for I/O error
-+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
-  * @nr_sectors: number of sectors to end I/O on
-  *
-  * Description:
-@@ -2762,7 +2879,7 @@ EXPORT_SYMBOL(end_that_request_first);
- /**
-  * end_that_request_chunk - end I/O on a request
-  * @req:      the request being processed
-- * @uptodate: 0 for I/O error
-+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
-  * @nr_bytes: number of bytes to complete
-  *
-  * Description:
-diff -uprN linux-2.6.8.1.orig/drivers/block/scsi_ioctl.c linux-2.6.8.1-ve022stab078/drivers/block/scsi_ioctl.c
---- linux-2.6.8.1.orig/drivers/block/scsi_ioctl.c	2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/block/scsi_ioctl.c	2006-05-11 13:05:38.000000000 +0400
-@@ -304,7 +304,8 @@ static int sg_scsi_ioctl(struct file *fi
- 		struct gendisk *bd_disk, Scsi_Ioctl_Command __user *sic)
- {
- 	struct request *rq;
--	int err, in_len, out_len, bytes, opcode, cmdlen;
-+	int err;
-+	unsigned int in_len, out_len, bytes, opcode, cmdlen;
- 	char *buffer = NULL, sense[SCSI_SENSE_BUFFERSIZE];
- 
- 	/*
-@@ -316,7 +317,7 @@ static int sg_scsi_ioctl(struct file *fi
- 		return -EFAULT;
- 	if (in_len > PAGE_SIZE || out_len > PAGE_SIZE)
- 		return -EINVAL;
--	if (get_user(opcode, sic->data))
-+	if (get_user(opcode, (int *)sic->data))
- 		return -EFAULT;
- 
- 	bytes = max(in_len, out_len);
-diff -uprN linux-2.6.8.1.orig/drivers/char/keyboard.c linux-2.6.8.1-ve022stab078/drivers/char/keyboard.c
---- linux-2.6.8.1.orig/drivers/char/keyboard.c	2004-08-14 14:56:26.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/keyboard.c	2006-05-11 13:05:24.000000000 +0400
-@@ -1063,7 +1063,7 @@ void kbd_keycode(unsigned int keycode, i
- 		sysrq_down = down;
- 		return;
- 	}
--	if (sysrq_down && down && !rep) {
-+	if ((sysrq_down || sysrq_eat_all()) && down && !rep) {
- 		handle_sysrq(kbd_sysrq_xlate[keycode], regs, tty);
- 		return;
- 	}
-diff -uprN linux-2.6.8.1.orig/drivers/char/n_tty.c linux-2.6.8.1-ve022stab078/drivers/char/n_tty.c
---- linux-2.6.8.1.orig/drivers/char/n_tty.c	2004-08-14 14:54:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/n_tty.c	2006-05-11 13:05:33.000000000 +0400
-@@ -946,13 +946,13 @@ static inline int copy_from_read_buf(str
- 
- {
- 	int retval;
--	ssize_t n;
-+	size_t n;
- 	unsigned long flags;
- 
- 	retval = 0;
- 	spin_lock_irqsave(&tty->read_lock, flags);
- 	n = min(tty->read_cnt, N_TTY_BUF_SIZE - tty->read_tail);
--	n = min((ssize_t)*nr, n);
-+	n = min(*nr, n);
- 	spin_unlock_irqrestore(&tty->read_lock, flags);
- 	if (n) {
- 		mb();
-diff -uprN linux-2.6.8.1.orig/drivers/char/pty.c linux-2.6.8.1-ve022stab078/drivers/char/pty.c
---- linux-2.6.8.1.orig/drivers/char/pty.c	2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/pty.c	2006-05-11 13:05:40.000000000 +0400
-@@ -32,22 +32,48 @@
- #include <asm/bitops.h>
- #include <linux/devpts_fs.h>
- 
-+#include <ub/ub_misc.h>
-+
- #if defined(CONFIG_LEGACY_PTYS) || defined(CONFIG_UNIX98_PTYS)
- 
- #ifdef CONFIG_LEGACY_PTYS
- static struct tty_driver *pty_driver, *pty_slave_driver;
-+
-+struct tty_driver *get_pty_driver(void) {return pty_driver;}
-+struct tty_driver *get_pty_slave_driver(void) {return pty_slave_driver;}
-+
-+EXPORT_SYMBOL(get_pty_driver);
-+EXPORT_SYMBOL(get_pty_slave_driver);
- #endif
- 
- /* These are global because they are accessed in tty_io.c */
- #ifdef CONFIG_UNIX98_PTYS
- struct tty_driver *ptm_driver;
- struct tty_driver *pts_driver;
-+EXPORT_SYMBOL(ptm_driver);
-+EXPORT_SYMBOL(pts_driver);
-+
-+#ifdef CONFIG_VE
-+#define ve_ptm_driver	(get_exec_env()->ptm_driver)
-+#else
-+#define ve_ptm_driver	ptm_driver
-+#endif
-+
-+void prepare_pty(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->ptm_driver = ptm_driver;
-+	/* don't clean ptm_driver and co. here, they are used in vecalls.c */
-+#endif
-+}
- #endif
- 
- static void pty_close(struct tty_struct * tty, struct file * filp)
- {
- 	if (!tty)
- 		return;
-+
-+	ub_pty_uncharge(tty);
- 	if (tty->driver->subtype == PTY_TYPE_MASTER) {
- 		if (tty->count > 1)
- 			printk("master pty_close: count = %d!!\n", tty->count);
-@@ -61,14 +87,18 @@ static void pty_close(struct tty_struct
- 	if (!tty->link)
- 		return;
- 	tty->link->packet = 0;
-+	set_bit(TTY_OTHER_CLOSED, &tty->link->flags);
- 	wake_up_interruptible(&tty->link->read_wait);
- 	wake_up_interruptible(&tty->link->write_wait);
--	set_bit(TTY_OTHER_CLOSED, &tty->link->flags);
- 	if (tty->driver->subtype == PTY_TYPE_MASTER) {
- 		set_bit(TTY_OTHER_CLOSED, &tty->flags);
- #ifdef CONFIG_UNIX98_PTYS
--		if (tty->driver == ptm_driver)
-+		if (tty->driver->flags & TTY_DRIVER_DEVPTS_MEM) {
-+			struct ve_struct *old_env;
-+			old_env = set_exec_env(VE_OWNER_TTY(tty));
- 			devpts_pty_kill(tty->index);
-+			set_exec_env(old_env);
-+		}
- #endif
- 		tty_vhangup(tty->link);
- 	}
-@@ -288,6 +318,8 @@ static int pty_open(struct tty_struct *t
- 
- 	if (!tty || !tty->link)
- 		goto out;
-+	if (ub_pty_charge(tty))
-+		goto out;
- 
- 	retval = -EIO;
- 	if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
-@@ -455,6 +487,7 @@ static int __init pty_init(void)
- 		panic("Couldn't register Unix98 pts driver");
- 
- 	pty_table[1].data = &ptm_driver->refcount;
-+	prepare_pty();
- #endif /* CONFIG_UNIX98_PTYS */
- 
- 	return 0;
-diff -uprN linux-2.6.8.1.orig/drivers/char/qtronix.c linux-2.6.8.1-ve022stab078/drivers/char/qtronix.c
---- linux-2.6.8.1.orig/drivers/char/qtronix.c	2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/qtronix.c	2006-05-11 13:05:32.000000000 +0400
-@@ -537,7 +537,7 @@ repeat:
- 		i--;
- 	}
- 	if (count-i) {
--		file->f_dentry->d_inode->i_atime = CURRENT_TIME;
-+		file->f_dentry->d_inode->i_atime = current_fs_time(inode->i_sb);
- 		return count-i;
- 	}
- 	if (signal_pending(current))
-diff -uprN linux-2.6.8.1.orig/drivers/char/random.c linux-2.6.8.1-ve022stab078/drivers/char/random.c
---- linux-2.6.8.1.orig/drivers/char/random.c	2004-08-14 14:54:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/random.c	2006-05-11 13:05:33.000000000 +0400
-@@ -1720,8 +1720,9 @@ random_write(struct file * file, const c
- 	if (p == buffer) {
- 		return (ssize_t)ret;
- 	} else {
--		file->f_dentry->d_inode->i_mtime = CURRENT_TIME;
--		mark_inode_dirty(file->f_dentry->d_inode);
-+		struct inode *inode = file->f_dentry->d_inode;
-+		inode->i_mtime = current_fs_time(inode->i_sb);
-+		mark_inode_dirty(inode);
- 		return (ssize_t)(p - buffer);
- 	}
- }
-@@ -1917,7 +1918,7 @@ static int poolsize_strategy(ctl_table *
- 			     void __user *oldval, size_t __user *oldlenp,
- 			     void __user *newval, size_t newlen, void **context)
- {
--	int	len;
-+	unsigned int	len;
- 
- 	sysctl_poolsize = random_state->poolinfo.POOLBYTES;
- 
-diff -uprN linux-2.6.8.1.orig/drivers/char/raw.c linux-2.6.8.1-ve022stab078/drivers/char/raw.c
---- linux-2.6.8.1.orig/drivers/char/raw.c	2004-08-14 14:55:34.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/raw.c	2006-05-11 13:05:34.000000000 +0400
-@@ -122,7 +122,7 @@ raw_ioctl(struct inode *inode, struct fi
- {
- 	struct block_device *bdev = filp->private_data;
- 
--	return ioctl_by_bdev(bdev, command, arg);
-+	return blkdev_ioctl(bdev->bd_inode, filp, command, arg);
- }
- 
- static void bind_device(struct raw_config_request *rq)
-diff -uprN linux-2.6.8.1.orig/drivers/char/sonypi.c linux-2.6.8.1-ve022stab078/drivers/char/sonypi.c
---- linux-2.6.8.1.orig/drivers/char/sonypi.c	2004-08-14 14:55:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/sonypi.c	2006-05-11 13:05:32.000000000 +0400
-@@ -489,7 +489,8 @@ repeat:
- 		i--;
- 	}
- 	if (count - i) {
--		file->f_dentry->d_inode->i_atime = CURRENT_TIME;
-+		struct inode *inode = file->f_dentry->d_inode;
-+		inode->i_atime = current_fs_time(inode->i_sb);
- 		return count-i;
- 	}
- 	if (signal_pending(current))
-diff -uprN linux-2.6.8.1.orig/drivers/char/sysrq.c linux-2.6.8.1-ve022stab078/drivers/char/sysrq.c
---- linux-2.6.8.1.orig/drivers/char/sysrq.c	2004-08-14 14:56:00.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/sysrq.c	2006-05-11 13:05:40.000000000 +0400
-@@ -31,10 +31,12 @@
- #include <linux/suspend.h>
- #include <linux/writeback.h>
- #include <linux/buffer_head.h>		/* for fsync_bdev() */
-+#include <linux/kallsyms.h>
- 
- #include <linux/spinlock.h>
- 
- #include <asm/ptrace.h>
-+#include <asm/uaccess.h>
- 
- extern void reset_vc(unsigned int);
- 
-@@ -131,6 +133,296 @@ static struct sysrq_key_op sysrq_mountro
- 	.action_msg	= "Emergency Remount R/O",
- };
- 
-+#ifdef CONFIG_SYSRQ_DEBUG
-+/*
-+ * Alt-SysRq debugger
-+ * Implemented functions:
-+ *	dumping memory
-+ *	resolvind symbols
-+ *	writing memory
-+ *	quitting :)
-+ */
-+
-+/* Memory accessing routines */
-+#define DUMP_LINES	22
-+unsigned long *dumpmem_addr;
-+
-+static void dump_mem(void)
-+{
-+	unsigned long value[4];
-+	mm_segment_t old_fs;
-+	int line, err;
-+
-+	old_fs = get_fs();
-+	set_fs(KERNEL_DS);
-+	err = 0;
-+	for (line = 0; line < DUMP_LINES; line++) {
-+		err |= __get_user(value[0], dumpmem_addr++);
-+		err |= __get_user(value[1], dumpmem_addr++);
-+		err |= __get_user(value[2], dumpmem_addr++);
-+		err |= __get_user(value[3], dumpmem_addr++);
-+		if (err) {
-+			printk("Invalid address 0x%p\n", dumpmem_addr - 4);
-+			break;
-+		}
-+		printk("0x%p: %08lx %08lx %08lx %08lx\n", dumpmem_addr - 4,
-+				value[0], value[1], value[2], value[3]);
-+	}
-+	set_fs(old_fs);
-+}
-+
-+static unsigned long *writemem_addr;
-+
-+static void write_mem(unsigned long val)
-+{
-+	mm_segment_t old_fs;
-+	unsigned long old_val;
-+
-+	old_fs = get_fs();
-+	set_fs(KERNEL_DS);
-+	if (__get_user(old_val, writemem_addr))
-+		goto err;
-+	printk("Changing [0x%p] %08lX to %08lX\n", writemem_addr, old_val, val);
-+	__put_user(val, writemem_addr);
-+err:
-+	set_fs(old_fs);
-+}
-+
-+/* reading user input */
-+#define NAME_LEN	(64)
-+static struct {
-+	unsigned long hex;
-+	char name[NAME_LEN + 1];
-+	void (*entered)(void);
-+} debug_input;
-+
-+static void debug_read_hex(int key)
-+{
-+	static int entered = 0;
-+	int val;
-+
-+	if (key >= '0' && key <= '9')
-+		val = key - '0';
-+	else if (key >= 'a' && key <= 'f')
-+		val = key - 'a' + 0xa;
-+	else
-+		return;
-+
-+	entered++;
-+	debug_input.hex = (debug_input.hex << 4) + val;
-+	printk("%c", key);
-+	if (entered != sizeof(unsigned long) * 2)
-+		return;
-+
-+	printk("\n");
-+	entered = 0;
-+	debug_input.entered();
-+}
-+
-+static void debug_read_string(int key)
-+{
-+	static int pos;
-+	static int shift;
-+
-+	if (key == 0) {
-+		/* actually key == 0 not only for shift */
-+		shift = 1;
-+		return;
-+	}
-+
-+	if (key == 0x0d) /* enter */
-+		goto finish;
-+
-+	if (key >= 'a' && key <= 'z') {
-+		if (shift)
-+			key = key - 'a' + 'A';
-+		goto correct;
-+	}
-+	if (key == '-') {
-+		if (shift)
-+			key = '_';
-+		goto correct;
-+	}
-+	if (key >= '0' && key <= '9')
-+		goto correct;
-+	return;
-+
-+correct:
-+	debug_input.name[pos] = key;
-+	pos++;
-+	shift = 0;
-+	printk("%c", key);
-+	if (pos != NAME_LEN)
-+		return;
-+
-+finish:
-+	printk("\n");
-+	pos = 0;
-+	shift = 0;
-+	debug_input.entered();
-+	memset(debug_input.name, 0, NAME_LEN);
-+}
-+
-+static int sysrq_debug_mode;
-+#define DEBUG_SELECT_ACTION	1
-+#define DEBUG_READ_INPUT	2
-+static struct sysrq_key_op *debug_sysrq_key_table[];
-+static void (*handle_debug_input)(int key);
-+static void swap_opts(struct sysrq_key_op **);
-+#define PROMPT	"> "
-+
-+int sysrq_eat_all(void)
-+{
-+	return sysrq_debug_mode;
-+}
-+
-+static inline void debug_switch_read_input(void (*fn_read)(int),
-+		void (*fn_fini)(void))
-+{
-+	WARN_ON(fn_read == NULL || fn_fini == NULL);
-+	debug_input.entered = fn_fini;
-+	handle_debug_input = fn_read;
-+	sysrq_debug_mode = DEBUG_READ_INPUT;
-+}
-+
-+static inline void debug_switch_select_action(void)
-+{
-+	sysrq_debug_mode = DEBUG_SELECT_ACTION;
-+	handle_debug_input = NULL;
-+	printk(PROMPT);
-+}
-+
-+/* handle key press in debug mode */
-+static void __handle_debug(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	if (sysrq_debug_mode == DEBUG_SELECT_ACTION) {
-+		__handle_sysrq(key, pt_regs, tty);
-+		if (sysrq_debug_mode)
-+			printk(PROMPT);
-+	} else {
-+		__sysrq_lock_table();
-+		handle_debug_input(key);
-+		__sysrq_unlock_table();
-+	}
-+}
-+
-+/* dump memory */
-+static void debug_dumpmem_addr_entered(void)
-+{
-+	dumpmem_addr = (unsigned long *)debug_input.hex;
-+	dump_mem();
-+	debug_switch_select_action();
-+}
-+
-+static void sysrq_handle_dumpmem(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	debug_switch_read_input(debug_read_hex, debug_dumpmem_addr_entered);
-+}
-+static struct sysrq_key_op sysrq_debug_dumpmem = {
-+	.handler	= sysrq_handle_dumpmem,
-+	.help_msg	= "Dump memory\n",
-+	.action_msg	= "Enter address",
-+};
-+
-+static void sysrq_handle_dumpnext(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	dump_mem();
-+}
-+static struct sysrq_key_op sysrq_debug_dumpnext = {
-+	.handler	= sysrq_handle_dumpnext,
-+	.help_msg	= "dump neXt\n",
-+	.action_msg	= "",
-+};
-+
-+/* resolve symbol */
-+static void debug_resolve_name_entered(void)
-+{
-+	unsigned long sym_addr;
-+
-+	sym_addr = kallsyms_lookup_name(debug_input.name);
-+	printk("%s: %08lX\n", debug_input.name, sym_addr);
-+	if (sym_addr) {
-+		printk("Now you can dump it via X\n");
-+		dumpmem_addr = (unsigned long *)sym_addr;
-+	}
-+	debug_switch_select_action();
-+}
-+
-+static void sysrq_handle_resolve(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	debug_switch_read_input(debug_read_string, debug_resolve_name_entered);
-+}
-+static struct sysrq_key_op sysrq_debug_resove = {
-+	.handler	= sysrq_handle_resolve,
-+	.help_msg	= "Resolve symbol\n",
-+	.action_msg	= "Enter symbol name",
-+};
-+
-+/* write memory */
-+static void debug_writemem_val_entered(void)
-+{
-+	write_mem(debug_input.hex);
-+	debug_switch_select_action();
-+}
-+
-+static void debug_writemem_addr_entered(void)
-+{
-+	mm_segment_t old_fs;
-+	unsigned long val;
-+
-+	writemem_addr = (unsigned long *)debug_input.hex;
-+	old_fs = get_fs();
-+	set_fs(KERNEL_DS);
-+	if (!__get_user(val, writemem_addr))
-+		printk(" [0x%p] = %08lX\n", writemem_addr, val);
-+	set_fs(old_fs);
-+	debug_switch_read_input(debug_read_hex, debug_writemem_val_entered);
-+}
-+
-+static void sysrq_handle_writemem(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	debug_switch_read_input(debug_read_hex, debug_writemem_addr_entered);
-+}
-+static struct sysrq_key_op sysrq_debug_writemem = {
-+	.handler	= sysrq_handle_writemem,
-+	.help_msg	= "Write memory\n",
-+	.action_msg	= "Enter address and then value",
-+};
-+
-+/* switch to debug mode */
-+static void sysrq_handle_debug(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	swap_opts(debug_sysrq_key_table);
-+	printk("Welcome sysrq debugging mode\n"
-+			"Press H for help\n");
-+	debug_switch_select_action();
-+}
-+static struct sysrq_key_op sysrq_debug_enter = {
-+	.handler	= sysrq_handle_debug,
-+	.help_msg	= "start Degugging",
-+	.action_msg	= "Select desired action",
-+};
-+
-+/* quit debug mode */
-+static void sysrq_handle_quit(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	swap_opts(NULL);
-+	sysrq_debug_mode = 0;
-+}
-+static struct sysrq_key_op sysrq_debug_quit = {
-+	.handler	= sysrq_handle_quit,
-+	.help_msg	= "Quit debug mode\n",
-+	.action_msg	= "Thank you for using debugger",
-+};
-+#endif
-+
- /* END SYNC SYSRQ HANDLERS BLOCK */
- 
- 
-@@ -139,8 +431,13 @@ static struct sysrq_key_op sysrq_mountro
- static void sysrq_handle_showregs(int key, struct pt_regs *pt_regs,
- 				  struct tty_struct *tty)
- {
-+	bust_spinlocks(1);
- 	if (pt_regs)
- 		show_regs(pt_regs);
-+	bust_spinlocks(0);
-+#ifdef __i386__
-+	smp_nmi_call_function(smp_show_regs, NULL, 0);
-+#endif
- }
- static struct sysrq_key_op sysrq_showregs_op = {
- 	.handler	= sysrq_handle_showregs,
-@@ -183,7 +480,7 @@ static void send_sig_all(int sig)
- {
- 	struct task_struct *p;
- 
--	for_each_process(p) {
-+	for_each_process_all(p) {
- 		if (p->mm && p->pid != 1)
- 			/* Not swapper, init nor kernel thread */
- 			force_sig(sig, p);
-@@ -214,13 +511,26 @@ static struct sysrq_key_op sysrq_kill_op
- 	.action_msg	= "Kill All Tasks",
- };
- 
-+#ifdef CONFIG_SCHED_VCPU
-+static void sysrq_handle_vschedstate(int key, struct pt_regs *pt_regs,
-+		struct tty_struct *tty)
-+{
-+	show_vsched();
-+}
-+static struct sysrq_key_op sysrq_vschedstate_op = {
-+	.handler	= sysrq_handle_vschedstate,
-+	.help_msg	= "showvsChed",
-+	.action_msg	= "Show Vsched",
-+};
-+#endif
-+
- /* END SIGNAL SYSRQ HANDLERS BLOCK */
- 
- 
- /* Key Operations table and lock */
- static spinlock_t sysrq_key_table_lock = SPIN_LOCK_UNLOCKED;
- #define SYSRQ_KEY_TABLE_LENGTH 36
--static struct sysrq_key_op *sysrq_key_table[SYSRQ_KEY_TABLE_LENGTH] = {
-+static struct sysrq_key_op *def_sysrq_key_table[SYSRQ_KEY_TABLE_LENGTH] = {
- /* 0 */	&sysrq_loglevel_op,
- /* 1 */	&sysrq_loglevel_op,
- /* 2 */	&sysrq_loglevel_op,
-@@ -235,8 +545,16 @@ static struct sysrq_key_op *sysrq_key_ta
- 				 it is handled specially on the sparc
- 				 and will never arrive */
- /* b */	&sysrq_reboot_op,
-+#ifdef CONFIG_SCHED_VCPU
-+/* c */	&sysrq_vschedstate_op,
-+#else
- /* c */	NULL,
-+#endif
-+#ifdef CONFIG_SYSRQ_DEBUG
-+/* d */	&sysrq_debug_enter,
-+#else
- /* d */	NULL,
-+#endif
- /* e */	&sysrq_term_op,
- /* f */	NULL,
- /* g */	NULL,
-@@ -270,6 +588,29 @@ static struct sysrq_key_op *sysrq_key_ta
- /* z */	NULL
- };
- 
-+#ifdef CONFIG_SYSRQ_DEBUG
-+static struct sysrq_key_op *debug_sysrq_key_table[SYSRQ_KEY_TABLE_LENGTH] = {
-+	[13] = &sysrq_debug_dumpmem,	/* d */
-+	[26] = &sysrq_debug_quit,	/* q */
-+	[27] = &sysrq_debug_resove,	/* r */
-+	[32] = &sysrq_debug_writemem,	/* w */
-+	[33] = &sysrq_debug_dumpnext,	/* x */
-+};
-+
-+static struct sysrq_key_op **sysrq_key_table = def_sysrq_key_table;
-+
-+/* call swap_opts(NULL) to restore opts to defaults */
-+static void swap_opts(struct sysrq_key_op **swap_to)
-+{
-+	if (swap_to)
-+		sysrq_key_table = swap_to;
-+	else
-+		sysrq_key_table = def_sysrq_key_table;
-+}
-+#else
-+#define sysrq_key_table def_sysrq_key_table
-+#endif
-+
- /* key2index calculation, -1 on invalid index */
- static int sysrq_key_table_key2index(int key) {
- 	int retval;
-@@ -358,6 +699,12 @@ void handle_sysrq(int key, struct pt_reg
- {
- 	if (!sysrq_enabled)
- 		return;
-+#ifdef CONFIG_SYSRQ_DEBUG
-+	if (sysrq_debug_mode) {
-+		__handle_debug(key, pt_regs, tty);
-+		return;
-+	}
-+#endif
- 	__handle_sysrq(key, pt_regs, tty);
- }
- 
-diff -uprN linux-2.6.8.1.orig/drivers/char/tty_io.c linux-2.6.8.1-ve022stab078/drivers/char/tty_io.c
---- linux-2.6.8.1.orig/drivers/char/tty_io.c	2004-08-14 14:55:34.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/drivers/char/tty_io.c	2006-05-11 13:05:40.000000000 +0400
-@@ -86,6 +86,7 @@
- #include <linux/string.h>
- #include <linux/slab.h>
- #include <linux/poll.h>
-+#include <linux/ve_owner.h>
- #include <linux/proc_fs.h>
- #include <linux/init.h>
- #include <linux/module.h>
-@@ -103,6 +104,7 @@
- #include <linux/devfs_fs_kernel.h>
- 
- #include <linux/kmod.h>
-+#include <ub/ub_mem.h>
- 
- #undef TTY_DEBUG_HANGUP
- 
-@@ -120,7 +122,12 @@ struct termios tty_std_termios = {	/* fo
- 
- EXPORT_SYMBOL(tty_std_termios);
- 
-+/* this lock protects tty_drivers list, this pretty guys do no locking */
-+rwlock_t tty_driver_guard = RW_LOCK_UNLOCKED;
-+EXPORT_SYMBOL(tty_driver_guard);
-+
- LIST_HEAD(tty_drivers);			/* linked list of tty drivers */
-+EXPORT_SYMBOL(tty_drivers);
- struct tty_ldisc ldiscs[NR_LDISCS];	/* line disc dispatch table */
- 
- /* Semaphore to protect creating and releasing a tty */
-@@ -130,6 +137,13 @@ DECLARE_MUTEX(tty_sem);
- extern struct tty_driver *ptm_driver;	/* Unix98 pty masters; for /dev/ptmx */
- extern int pty_limit;		/* Config limit on Unix98 ptys */
- static DEFINE_IDR(allocated_ptys);
-+#ifdef CONFIG_VE
-+#define ve_allocated_ptys	(*(get_exec_env()->allocated_ptys))
-+#define ve_ptm_driver	(get_exec_env()->ptm_driver)
-+#else
-+#define ve_allocated_ptys	allocated_ptys
-+#define ve_ptm_driver	ptm_driver
-+#endif
- static DECLARE_MUTEX(allocated_ptys_lock);
- #endif
- 
-@@ -150,11 +164,25 @@ extern void rs_360_init(void);
- static void release_mem(struct tty_struct *tty, int idx);
- 
- 
-+DCL_VE_OWNER(TTYDRV, TAIL_SOFT, struct tty_driver, owner_env, , ())
-+DCL_VE_OWNER(TTY, TAIL_SOFT, struct tty_struct, owner_env, , ())
-+
-+void prepare_tty(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->allocated_ptys = &allocated_ptys;
-+	/*
-+	 * in this case, tty_register_driver() setups
-+	 * owner_env correctly right from the bootup
-+	 */
-+#endif
-+}
-+
- static struct tty_struct *alloc_tty_struct(void)
- {
- 	struct tty_struct *tty;
- 
--	tty = kmalloc(sizeof(struct tty_struct), GFP_KERNEL);
-+	tty = ub_kmalloc(sizeof(struct tty_struct), GFP_KERNEL);
- 	if (tty)
- 		memset(tty, 0, sizeof(struct tty_struct));
- 	return tty;
-@@ -307,14 +335,37 @@ struct tty_driver *get_tty_driver(dev_t
- {
- 	struct tty_driver *p;
- 
-+	read_lock(&tty_driver_guard);
- 	list_for_each_entry(p, &tty_drivers, tty_drivers) {
- 		dev_t base = MKDEV(p->major, p->minor_start);
- 		if (device < base || device >= base + p->num)
- 			continue;
- 		*index = device - base;
--		return p;
-+#ifdef CONFIG_VE
-+		if (in_interrupt())
-+			goto found;
-+		if (p->major!=PTY_MASTER_MAJOR && p->major!=PTY_SLAVE_MAJOR
-+#ifdef CONFIG_UNIX98_PTYS
-+		    && (p->major<UNIX98_PTY_MASTER_MAJOR ||
-+			p->major>UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT-1) &&
-+		    (p->major<UNIX98_PTY_SLAVE_MAJOR ||
-+		     p->major>UNIX98_PTY_SLAVE_MAJOR+UNIX98_PTY_MAJOR_COUNT-1)
-+#endif
-+		    ) goto found;
-+		if (ve_is_super(VE_OWNER_TTYDRV(p)) &&
-+		    ve_is_super(get_exec_env()))
-+			goto found;
-+		if (!ve_accessible_strict(VE_OWNER_TTYDRV(p), get_exec_env()))
-+			continue;
-+#endif
-+		goto found;
- 	}
-+	read_unlock(&tty_driver_guard);
- 	return NULL;
-+
-+found:
-+	read_unlock(&tty_driver_guard);
-+	return p;
- }
- 
- /*
-@@ -410,7 +461,6 @@ void do_tty_hangup(void *data)
- 	struct file * cons_filp = NULL;
- 	struct file *filp, *f = NULL;
- 	struct task_struct *p;
--	struct pid *pid;
- 	int    closecount = 0, n;
- 
- 	if (!tty)
-@@ -481,8 +531,7 @@ void do_tty_hangup(void *data)
- 
- 	read_lock(&tasklist_lock);
- 	if (tty->session > 0) {
--		struct list_head *l;
--		for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) {
-+		do_each_task_pid_all(tty->session, PIDTYPE_SID, p) {
- 			if (p->signal->tty == tty)
- 				p->signal->tty = NULL;
- 			if (!p->signal->leader)
-@@ -491,7 +540,7 @@ void do_tty_hangup(void *data)
- 			send_group_sig_info(SIGCONT, SEND_SIG_PRIV, p);
- 			if (tty->pgrp > 0)
- 				p->signal->tty_old_pgrp = tty->pgrp;
--		}
-+		} while_each_task_pid_all(tty->session, PIDTYPE_SID, p);
- 	}
- 	read_unlock(&tasklist_lock);
- 
-@@ -563,15 +612,15 @@ void disassociate_ctty(int on_exit)
- {
- 	struct tty_struct *tty;
- 	struct task_struct *p;
--	struct list_head *l;
--	struct pid *pid;
- 	int tty_pgrp = -1;
- 
- 	lock_kernel();
- 
-+	down(&tty_sem);
- 	tty = current->signal->tty;
- 	if (tty) {
- 		tty_pgrp = tty->pgrp;
-+		up(&tty_sem);
- 		if (on_exit && tty->driver->type != TTY_DRIVER_TYPE_PTY)
- 			tty_vhangup(tty);
- 	} else {
-@@ -579,6 +628,7 @@ void disassociate_ctty(int on_exit)
- 			kill_pg(current->signal->tty_old_pgrp, SIGHUP, on_exit);
- 			kill_pg(current->signal->tty_old_pgrp, SIGCONT, on_exit);
- 		}
-+		up(&tty_sem);
- 		unlock_kernel();
- 		return;
- 	}
-@@ -588,14 +638,19 @@ void disassociate_ctty(int on_exit)
- 		kill_pg(tty_pgrp, SIGCONT, on_exit);
- 	}
- 
-+	/* Must lock changes to tty_old_pgrp */
-+	down(&tty_sem);
- 	current->signal->tty_old_pgrp = 0;
- 	tty->session = 0;
- 	tty->pgrp = -1;
- 
-+	/* Now clear signal->tty under the lock */
- 	read_lock(&tasklist_lock);
--	for_each_task_pid(current->signal->session, PIDTYPE_SID, p, l, pid)
-+	do_each_task_pid_all(current->signal->session, PIDTYPE_SID, p) {
- 		p->signal->tty = NULL;
-+	} while_each_task_pid_all(current->signal->session, PIDTYPE_SID, p);
- 	read_unlock(&tasklist_lock);
-+	up(&tty_sem);
- 	unlock_kernel();
- }
- 
-@@ -656,7 +711,7 @@ static ssize_t tty_read(struct file * fi
- 		i = -EIO;
- 	unlock_kernel();
- 	if (i > 0)
--		inode->i_atime = CURRENT_TIME;
-+		inode->i_atime = current_fs_time(inode->i_sb);
- 	return i;
- }
- 
-@@ -702,7 +757,8 @@ static inline ssize_t do_tty_write(
- 		}
- 	}
- 	if (written) {
--		file->f_dentry->d_inode->i_mtime = CURRENT_TIME;
-+		struct inode *inode = file->f_dentry->d_inode;
-+		inode->i_mtime = current_fs_time(inode->i_sb);
- 		ret = written;
- 	}
- 	up(&tty->atomic_write);
-@@ -760,27 +816,28 @@ static inline void tty_line_name(struct
-  * really quite straightforward. The semaphore locking can probably be
-  * relaxed for the (most common) case of reopening a tty.
-  */
--static int init_dev(struct tty_driver *driver, int idx,
--	struct tty_struct **ret_tty)
-+static int init_dev(struct tty_driver *driver, int idx,
-+	struct tty_struct *i_tty, struct tty_struct **ret_tty)
- {
- 	struct tty_struct *tty, *o_tty;
- 	struct termios *tp, **tp_loc, *o_tp, **o_tp_loc;
- 	struct termios *ltp, **ltp_loc, *o_ltp, **o_ltp_loc;
-+	struct ve_struct * owner;
- 	int retval=0;
- 
--	/*
--	 * Check whether we need to acquire the tty semaphore to avoid
--	 * race conditions.  For now, play it safe.
-- */ -- down(&tty_sem); -+ owner = VE_OWNER_TTYDRV(driver); - -- /* check whether we're reopening an existing tty */ -- if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { -- tty = devpts_get_tty(idx); -- if (tty && driver->subtype == PTY_TYPE_MASTER) -- tty = tty->link; -- } else { -- tty = driver->ttys[idx]; -+ if (i_tty) -+ tty = i_tty; -+ else { -+ /* check whether we're reopening an existing tty */ -+ if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { -+ tty = devpts_get_tty(idx); -+ if (tty && driver->subtype == PTY_TYPE_MASTER) -+ tty = tty->link; -+ } else { -+ tty = driver->ttys[idx]; -+ } - } - if (tty) goto fast_track; - -@@ -808,6 +865,7 @@ static int init_dev(struct tty_driver *d - tty->driver = driver; - tty->index = idx; - tty_line_name(driver, idx, tty->name); -+ SET_VE_OWNER_TTY(tty, owner); - - if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { - tp_loc = &tty->termios; -@@ -818,7 +876,7 @@ static int init_dev(struct tty_driver *d - } - - if (!*tp_loc) { -- tp = (struct termios *) kmalloc(sizeof(struct termios), -+ tp = (struct termios *) ub_kmalloc(sizeof(struct termios), - GFP_KERNEL); - if (!tp) - goto free_mem_out; -@@ -826,7 +884,7 @@ static int init_dev(struct tty_driver *d - } - - if (!*ltp_loc) { -- ltp = (struct termios *) kmalloc(sizeof(struct termios), -+ ltp = (struct termios *) ub_kmalloc(sizeof(struct termios), - GFP_KERNEL); - if (!ltp) - goto free_mem_out; -@@ -841,6 +899,7 @@ static int init_dev(struct tty_driver *d - o_tty->driver = driver->other; - o_tty->index = idx; - tty_line_name(driver->other, idx, o_tty->name); -+ SET_VE_OWNER_TTY(o_tty, owner); - - if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { - o_tp_loc = &o_tty->termios; -@@ -852,7 +911,7 @@ static int init_dev(struct tty_driver *d - - if (!*o_tp_loc) { - o_tp = (struct termios *) -- kmalloc(sizeof(struct termios), GFP_KERNEL); -+ ub_kmalloc(sizeof(struct termios), GFP_KERNEL); - if (!o_tp) - goto free_mem_out; - *o_tp = driver->other->init_termios; -@@ -860,7 +919,7 @@ static int 
init_dev(struct tty_driver *d - - if (!*o_ltp_loc) { - o_ltp = (struct termios *) -- kmalloc(sizeof(struct termios), GFP_KERNEL); -+ ub_kmalloc(sizeof(struct termios), GFP_KERNEL); - if (!o_ltp) - goto free_mem_out; - memset(o_ltp, 0, sizeof(struct termios)); -@@ -878,6 +937,10 @@ static int init_dev(struct tty_driver *d - *o_ltp_loc = o_ltp; - o_tty->termios = *o_tp_loc; - o_tty->termios_locked = *o_ltp_loc; -+#ifdef CONFIG_VE -+ if (driver->other->refcount == 0) -+ (void)get_ve(owner); -+#endif - driver->other->refcount++; - if (driver->subtype == PTY_TYPE_MASTER) - o_tty->count++; -@@ -902,6 +965,10 @@ static int init_dev(struct tty_driver *d - *ltp_loc = ltp; - tty->termios = *tp_loc; - tty->termios_locked = *ltp_loc; -+#ifdef CONFIG_VE -+ if (driver->refcount == 0) -+ (void)get_ve(owner); -+#endif - driver->refcount++; - tty->count++; - -@@ -956,7 +1023,6 @@ success: - - /* All paths come through here to release the semaphore */ - end_init: -- up(&tty_sem); - return retval; - - /* Release locally allocated memory ... 
nothing placed in slots */ -@@ -1010,6 +1076,10 @@ static void release_mem(struct tty_struc - } - o_tty->magic = 0; - o_tty->driver->refcount--; -+#ifdef CONFIG_VE -+ if (o_tty->driver->refcount == 0) -+ put_ve(VE_OWNER_TTY(o_tty)); -+#endif - file_list_lock(); - list_del_init(&o_tty->tty_files); - file_list_unlock(); -@@ -1032,6 +1102,10 @@ static void release_mem(struct tty_struc - - tty->magic = 0; - tty->driver->refcount--; -+#ifdef CONFIG_VE -+ if (tty->driver->refcount == 0) -+ put_ve(VE_OWNER_TTY(tty)); -+#endif - file_list_lock(); - list_del_init(&tty->tty_files); - file_list_unlock(); -@@ -1054,6 +1128,9 @@ static void release_dev(struct file * fi - int devpts_master, devpts; - int idx; - char buf[64]; -+#ifdef CONFIG_UNIX98_PTYS -+ struct idr *idr_alloced; -+#endif - - tty = (struct tty_struct *)filp->private_data; - if (tty_paranoia_check(tty, filp->f_dentry->d_inode, "release_dev")) -@@ -1069,6 +1146,9 @@ static void release_dev(struct file * fi - devpts = (tty->driver->flags & TTY_DRIVER_DEVPTS_MEM) != 0; - devpts_master = pty_master && devpts; - o_tty = tty->link; -+#ifdef CONFIG_UNIX98_PTYS -+ idr_alloced = tty->owner_env->allocated_ptys; -+#endif - - #ifdef TTY_PARANOIA_CHECK - if (idx < 0 || idx >= tty->driver->num) { -@@ -1152,9 +1232,14 @@ static void release_dev(struct file * fi - * each iteration we avoid any problems. - */ - while (1) { -+ /* Guard against races with tty->count changes elsewhere and -+ opens on /dev/tty */ -+ -+ down(&tty_sem); - tty_closing = tty->count <= 1; - o_tty_closing = o_tty && - (o_tty->count <= (pty_master ? 1 : 0)); -+ up(&tty_sem); - do_sleep = 0; - - if (tty_closing) { -@@ -1190,6 +1275,8 @@ static void release_dev(struct file * fi - * both sides, and we've completed the last operation that could - * block, so it's safe to proceed with closing. 
- */ -+ -+ down(&tty_sem); - if (pty_master) { - if (--o_tty->count < 0) { - printk(KERN_WARNING "release_dev: bad pty slave count " -@@ -1203,7 +1290,8 @@ static void release_dev(struct file * fi - tty->count, tty_name(tty, buf)); - tty->count = 0; - } -- -+ up(&tty_sem); -+ - /* - * We've decremented tty->count, so we need to remove this file - * descriptor off the tty->tty_files list; this serves two -@@ -1235,15 +1323,15 @@ static void release_dev(struct file * fi - */ - if (tty_closing || o_tty_closing) { - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - - read_lock(&tasklist_lock); -- for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) -+ do_each_task_pid_all(tty->session, PIDTYPE_SID, p) { - p->signal->tty = NULL; -+ } while_each_task_pid_all(tty->session, PIDTYPE_SID, p); - if (o_tty) -- for_each_task_pid(o_tty->session, PIDTYPE_SID, p,l, pid) -+ do_each_task_pid_all(o_tty->session, PIDTYPE_SID, p) { - p->signal->tty = NULL; -+ } while_each_task_pid_all(o_tty->session, PIDTYPE_SID, p); - read_unlock(&tasklist_lock); - } - -@@ -1294,7 +1382,7 @@ static void release_dev(struct file * fi - /* Make this pty number available for reallocation */ - if (devpts) { - down(&allocated_ptys_lock); -- idr_remove(&allocated_ptys, idx); -+ idr_remove(idr_alloced, idx); - up(&allocated_ptys_lock); - } - #endif -@@ -1315,7 +1403,7 @@ static void release_dev(struct file * fi - */ - static int tty_open(struct inode * inode, struct file * filp) - { -- struct tty_struct *tty; -+ struct tty_struct *tty, *c_tty; - int noctty, retval; - struct tty_driver *driver; - int index; -@@ -1327,12 +1415,18 @@ retry_open: - noctty = filp->f_flags & O_NOCTTY; - index = -1; - retval = 0; -+ c_tty = NULL; -+ -+ down(&tty_sem); - - if (device == MKDEV(TTYAUX_MAJOR,0)) { -- if (!current->signal->tty) -+ if (!current->signal->tty) { -+ up(&tty_sem); - return -ENXIO; -+ } - driver = current->signal->tty->driver; - index = current->signal->tty->index; -+ c_tty = 
current->signal->tty; - filp->f_flags |= O_NONBLOCK; /* Don't let /dev/tty block */ - /* noctty = 1; */ - goto got_driver; -@@ -1341,6 +1435,12 @@ retry_open: - if (device == MKDEV(TTY_MAJOR,0)) { - extern int fg_console; - extern struct tty_driver *console_driver; -+#ifdef CONFIG_VE -+ if (!ve_is_super(get_exec_env())) { -+ up(&tty_sem); -+ return -ENODEV; -+ } -+#endif - driver = console_driver; - index = fg_console; - noctty = 1; -@@ -1348,6 +1448,12 @@ retry_open: - } - #endif - if (device == MKDEV(TTYAUX_MAJOR,1)) { -+#ifdef CONFIG_VE -+ if (!ve_is_super(get_exec_env())) { -+ up(&tty_sem); -+ return -ENODEV; -+ } -+#endif - driver = console_device(&index); - if (driver) { - /* Don't let /dev/console block */ -@@ -1355,6 +1461,7 @@ retry_open: - noctty = 1; - goto got_driver; - } -+ up(&tty_sem); - return -ENODEV; - } - -@@ -1364,29 +1471,33 @@ retry_open: - - /* find a device that is not in use. */ - down(&allocated_ptys_lock); -- if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { -+ if (!idr_pre_get(&ve_allocated_ptys, GFP_KERNEL)) { - up(&allocated_ptys_lock); -+ up(&tty_sem); - return -ENOMEM; - } -- idr_ret = idr_get_new(&allocated_ptys, NULL, &index); -+ idr_ret = idr_get_new(&ve_allocated_ptys, NULL, &index); - if (idr_ret < 0) { - up(&allocated_ptys_lock); -+ up(&tty_sem); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { -- idr_remove(&allocated_ptys, index); -+ idr_remove(&ve_allocated_ptys, index); - up(&allocated_ptys_lock); -+ up(&tty_sem); - return -EIO; - } - up(&allocated_ptys_lock); - -- driver = ptm_driver; -- retval = init_dev(driver, index, &tty); -+ driver = ve_ptm_driver; -+ retval = init_dev(driver, index, NULL, &tty); -+ up(&tty_sem); - if (retval) { - down(&allocated_ptys_lock); -- idr_remove(&allocated_ptys, index); -+ idr_remove(&ve_allocated_ptys, index); - up(&allocated_ptys_lock); - return retval; - } -@@ -1398,10 +1509,13 @@ retry_open: - #endif - { - driver = get_tty_driver(device, &index); 
-- if (!driver) -+ if (!driver) { -+ up(&tty_sem); - return -ENODEV; -+ } - got_driver: -- retval = init_dev(driver, index, &tty); -+ retval = init_dev(driver, index, c_tty, &tty); -+ up(&tty_sem); - if (retval) - return retval; - } -@@ -1435,7 +1549,7 @@ got_driver: - #ifdef CONFIG_UNIX98_PTYS - if (index != -1) { - down(&allocated_ptys_lock); -- idr_remove(&allocated_ptys, index); -+ idr_remove(&ve_allocated_ptys, index); - up(&allocated_ptys_lock); - } - #endif -@@ -1566,10 +1680,12 @@ static int tiocswinsz(struct tty_struct - - static int tioccons(struct file *file) - { -+ if (!capable(CAP_SYS_ADMIN)) -+ return -EPERM; -+ if (!ve_is_super(get_exec_env())) -+ return -EACCES; - if (file->f_op->write == redirected_tty_write) { - struct file *f; -- if (!capable(CAP_SYS_ADMIN)) -- return -EPERM; - spin_lock(&redirect_lock); - f = redirect; - redirect = NULL; -@@ -1606,8 +1722,6 @@ static int fionbio(struct file *file, in - - static int tiocsctty(struct tty_struct *tty, int arg) - { -- struct list_head *l; -- struct pid *pid; - task_t *p; - - if (current->signal->leader && -@@ -1630,8 +1744,9 @@ static int tiocsctty(struct tty_struct * - */ - - read_lock(&tasklist_lock); -- for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) -+ do_each_task_pid_all(tty->session, PIDTYPE_SID, p) { - p->signal->tty = NULL; -+ } while_each_task_pid_all(tty->session, PIDTYPE_SID, p); - read_unlock(&tasklist_lock); - } else - return -EPERM; -@@ -1653,7 +1768,7 @@ static int tiocgpgrp(struct tty_struct * - */ - if (tty == real_tty && current->signal->tty != real_tty) - return -ENOTTY; -- return put_user(real_tty->pgrp, p); -+ return put_user(pid_type_to_vpid(PIDTYPE_PGID, real_tty->pgrp), p); - } - - static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p) -@@ -1673,6 +1788,9 @@ static int tiocspgrp(struct tty_struct * - return -EFAULT; - if (pgrp < 0) - return -EINVAL; -+ pgrp = vpid_to_pid(pgrp); -+ if (pgrp < 0) -+ return -EPERM; - if 
(session_of_pgrp(pgrp) != current->signal->session) - return -EPERM; - real_tty->pgrp = pgrp; -@@ -1689,7 +1807,7 @@ static int tiocgsid(struct tty_struct *t - return -ENOTTY; - if (real_tty->session <= 0) - return -ENOTTY; -- return put_user(real_tty->session, p); -+ return put_user(pid_type_to_vpid(PIDTYPE_SID, real_tty->session), p); - } - - static int tiocsetd(struct tty_struct *tty, int __user *p) -@@ -1938,8 +2056,6 @@ static void __do_SAK(void *arg) - #else - struct tty_struct *tty = arg; - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - int session; - int i; - struct file *filp; -@@ -1952,7 +2068,7 @@ static void __do_SAK(void *arg) - if (tty->driver->flush_buffer) - tty->driver->flush_buffer(tty); - read_lock(&tasklist_lock); -- for_each_task_pid(session, PIDTYPE_SID, p, l, pid) { -+ do_each_task_pid_all(session, PIDTYPE_SID, p) { - if (p->signal->tty == tty || session > 0) { - printk(KERN_NOTICE "SAK: killed process %d" - " (%s): p->signal->session==tty->session\n", -@@ -1979,7 +2095,7 @@ static void __do_SAK(void *arg) - spin_unlock(&p->files->file_lock); - } - task_unlock(p); -- } -+ } while_each_task_pid_all(session, PIDTYPE_SID, p); - read_unlock(&tasklist_lock); - #endif - } -@@ -2303,8 +2419,11 @@ int tty_register_driver(struct tty_drive - - if (!driver->put_char) - driver->put_char = tty_default_put_char; -- -+ -+ SET_VE_OWNER_TTYDRV(driver, get_exec_env()); -+ write_lock_irq(&tty_driver_guard); - list_add(&driver->tty_drivers, &tty_drivers); -+ write_unlock_irq(&tty_driver_guard); - - if ( !(driver->flags & TTY_DRIVER_NO_DEVFS) ) { - for(i = 0; i < driver->num; i++) -@@ -2331,7 +2450,9 @@ int tty_unregister_driver(struct tty_dri - unregister_chrdev_region(MKDEV(driver->major, driver->minor_start), - driver->num); - -+ write_lock_irq(&tty_driver_guard); - list_del(&driver->tty_drivers); -+ write_unlock_irq(&tty_driver_guard); - - /* - * Free the termios and termios_locked structures because -@@ -2459,6 +2580,7 @@ static int 
__init tty_init(void) - - vty_init(); - #endif -+ prepare_tty(); - return 0; - } - module_init(tty_init); -diff -uprN linux-2.6.8.1.orig/drivers/char/vt.c linux-2.6.8.1-ve022stab078/drivers/char/vt.c ---- linux-2.6.8.1.orig/drivers/char/vt.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/char/vt.c 2006-05-11 13:05:33.000000000 +0400 -@@ -748,6 +748,8 @@ inline int resize_screen(int currcons, i - * [this is to be used together with some user program - * like resize that changes the hardware videomode] - */ -+#define VC_RESIZE_MAXCOL (32767) -+#define VC_RESIZE_MAXROW (32767) - int vc_resize(int currcons, unsigned int cols, unsigned int lines) - { - unsigned long old_origin, new_origin, new_scr_end, rlth, rrem, err = 0; -@@ -760,6 +762,9 @@ int vc_resize(int currcons, unsigned int - if (!vc_cons_allocated(currcons)) - return -ENXIO; - -+ if (cols > VC_RESIZE_MAXCOL || lines > VC_RESIZE_MAXROW) -+ return -EINVAL; -+ - new_cols = (cols ? cols : video_num_columns); - new_rows = (lines ? 
lines : video_num_lines); - new_row_size = new_cols << 1; -diff -uprN linux-2.6.8.1.orig/drivers/ide/pci/cmd64x.c linux-2.6.8.1-ve022stab078/drivers/ide/pci/cmd64x.c ---- linux-2.6.8.1.orig/drivers/ide/pci/cmd64x.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/ide/pci/cmd64x.c 2006-05-11 13:05:27.000000000 +0400 -@@ -596,7 +596,7 @@ static unsigned int __devinit init_chips - - #ifdef __i386__ - if (dev->resource[PCI_ROM_RESOURCE].start) { -- pci_write_config_byte(dev, PCI_ROM_ADDRESS, dev->resource[PCI_ROM_RESOURCE].start | PCI_ROM_ADDRESS_ENABLE); -+ pci_write_config_dword(dev, PCI_ROM_ADDRESS, dev->resource[PCI_ROM_RESOURCE].start | PCI_ROM_ADDRESS_ENABLE); - printk(KERN_INFO "%s: ROM enabled at 0x%08lx\n", name, dev->resource[PCI_ROM_RESOURCE].start); - } - #endif -diff -uprN linux-2.6.8.1.orig/drivers/ide/pci/hpt34x.c linux-2.6.8.1-ve022stab078/drivers/ide/pci/hpt34x.c ---- linux-2.6.8.1.orig/drivers/ide/pci/hpt34x.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/ide/pci/hpt34x.c 2006-05-11 13:05:27.000000000 +0400 -@@ -251,7 +251,7 @@ static unsigned int __devinit init_chips - - if (cmd & PCI_COMMAND_MEMORY) { - if (pci_resource_start(dev, PCI_ROM_RESOURCE)) { -- pci_write_config_byte(dev, PCI_ROM_ADDRESS, -+ pci_write_config_dword(dev, PCI_ROM_ADDRESS, - dev->resource[PCI_ROM_RESOURCE].start | PCI_ROM_ADDRESS_ENABLE); - printk(KERN_INFO "HPT345: ROM enabled at 0x%08lx\n", - dev->resource[PCI_ROM_RESOURCE].start); -diff -uprN linux-2.6.8.1.orig/drivers/ide/pci/hpt366.c linux-2.6.8.1-ve022stab078/drivers/ide/pci/hpt366.c ---- linux-2.6.8.1.orig/drivers/ide/pci/hpt366.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/ide/pci/hpt366.c 2006-05-11 13:05:27.000000000 +0400 -@@ -1089,7 +1089,7 @@ static unsigned int __devinit init_chips - u8 test = 0; - - if (dev->resource[PCI_ROM_RESOURCE].start) -- pci_write_config_byte(dev, PCI_ROM_ADDRESS, -+ pci_write_config_dword(dev, 
PCI_ROM_ADDRESS, - dev->resource[PCI_ROM_RESOURCE].start | PCI_ROM_ADDRESS_ENABLE); - - pci_read_config_byte(dev, PCI_CACHE_LINE_SIZE, &test); -diff -uprN linux-2.6.8.1.orig/drivers/ieee1394/ieee1394_core.c linux-2.6.8.1-ve022stab078/drivers/ieee1394/ieee1394_core.c ---- linux-2.6.8.1.orig/drivers/ieee1394/ieee1394_core.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/ieee1394/ieee1394_core.c 2006-05-11 13:05:25.000000000 +0400 -@@ -1034,8 +1034,8 @@ static int hpsbpkt_thread(void *__hi) - if (khpsbpkt_kill) - break; - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -+ if (test_thread_flag(TIF_FREEZE)) { -+ refrigerator(); - continue; - } - -diff -uprN linux-2.6.8.1.orig/drivers/ieee1394/nodemgr.c linux-2.6.8.1-ve022stab078/drivers/ieee1394/nodemgr.c ---- linux-2.6.8.1.orig/drivers/ieee1394/nodemgr.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/ieee1394/nodemgr.c 2006-05-11 13:05:25.000000000 +0400 -@@ -1481,8 +1481,8 @@ static int nodemgr_host_thread(void *__h - - if (down_interruptible(&hi->reset_sem) || - down_interruptible(&nodemgr_serialize)) { -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -+ if (test_thread_flag(TIF_FREEZE)) { -+ refrigerator(); - continue; - } - printk("NodeMgr: received unexpected signal?!\n" ); -diff -uprN linux-2.6.8.1.orig/drivers/input/serio/serio.c linux-2.6.8.1-ve022stab078/drivers/input/serio/serio.c ---- linux-2.6.8.1.orig/drivers/input/serio/serio.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/input/serio/serio.c 2006-05-11 13:05:25.000000000 +0400 -@@ -153,8 +153,8 @@ static int serio_thread(void *nothing) - do { - serio_handle_events(); - wait_event_interruptible(serio_wait, !list_empty(&serio_event_list)); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - } while (!signal_pending(current)); - - printk(KERN_DEBUG "serio: kseriod exiting\n"); -diff 
-uprN linux-2.6.8.1.orig/drivers/input/serio/serport.c linux-2.6.8.1-ve022stab078/drivers/input/serio/serport.c ---- linux-2.6.8.1.orig/drivers/input/serio/serport.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/input/serio/serport.c 2006-05-11 13:05:33.000000000 +0400 -@@ -66,6 +66,9 @@ static int serport_ldisc_open(struct tty - struct serport *serport; - char name[64]; - -+ if (!capable(CAP_SYS_ADMIN)) -+ return -EPERM; -+ - serport = kmalloc(sizeof(struct serport), GFP_KERNEL); - if (unlikely(!serport)) - return -ENOMEM; -diff -uprN linux-2.6.8.1.orig/drivers/md/md.c linux-2.6.8.1-ve022stab078/drivers/md/md.c ---- linux-2.6.8.1.orig/drivers/md/md.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/md/md.c 2006-05-11 13:05:25.000000000 +0400 -@@ -2822,8 +2822,8 @@ int md_thread(void * arg) - - wait_event_interruptible(thread->wqueue, - test_bit(THREAD_WAKEUP, &thread->flags)); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - clear_bit(THREAD_WAKEUP, &thread->flags); - -diff -uprN linux-2.6.8.1.orig/drivers/net/8139too.c linux-2.6.8.1-ve022stab078/drivers/net/8139too.c ---- linux-2.6.8.1.orig/drivers/net/8139too.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/8139too.c 2006-05-11 13:05:25.000000000 +0400 -@@ -1624,8 +1624,8 @@ static int rtl8139_thread (void *data) - do { - timeout = interruptible_sleep_on_timeout (&tp->thr_wait, timeout); - /* make swsusp happy with our thread */ -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - } while (!signal_pending (current) && (timeout > 0)); - - if (signal_pending (current)) { -diff -uprN linux-2.6.8.1.orig/drivers/net/forcedeth.c linux-2.6.8.1-ve022stab078/drivers/net/forcedeth.c ---- linux-2.6.8.1.orig/drivers/net/forcedeth.c 2004-08-14 14:54:47.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/drivers/net/forcedeth.c 2006-05-11 13:05:27.000000000 +0400 -@@ -1618,6 +1618,9 @@ static int nv_open(struct net_device *de - writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus); - dprintk(KERN_INFO "startup: got 0x%08x.\n", miistat); - } -+ /* set linkspeed to invalid value, thus force nv_update_linkspeed -+ * to init hw */ -+ np->linkspeed = 0; - ret = nv_update_linkspeed(dev); - nv_start_rx(dev); - nv_start_tx(dev); -diff -uprN linux-2.6.8.1.orig/drivers/net/irda/sir_kthread.c linux-2.6.8.1-ve022stab078/drivers/net/irda/sir_kthread.c ---- linux-2.6.8.1.orig/drivers/net/irda/sir_kthread.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/irda/sir_kthread.c 2006-05-11 13:05:25.000000000 +0400 -@@ -136,8 +136,8 @@ static int irda_thread(void *startup) - remove_wait_queue(&irda_rq_queue.kick, &wait); - - /* make swsusp happy with our thread */ -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - run_irda_queue(); - } -diff -uprN linux-2.6.8.1.orig/drivers/net/irda/stir4200.c linux-2.6.8.1-ve022stab078/drivers/net/irda/stir4200.c ---- linux-2.6.8.1.orig/drivers/net/irda/stir4200.c 2004-08-14 14:54:52.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/irda/stir4200.c 2006-05-11 13:05:25.000000000 +0400 -@@ -767,7 +767,7 @@ static int stir_transmit_thread(void *ar - && !signal_pending(current)) - { - /* if suspending, then power off and wait */ -- if (current->flags & PF_FREEZE) { -+ if (test_thread_flag(TIF_FREEZE)) { - if (stir->receiving) - receive_stop(stir); - else -@@ -775,7 +775,7 @@ static int stir_transmit_thread(void *ar - - write_reg(stir, REG_CTRL1, CTRL1_TXPWD|CTRL1_RXPWD); - -- refrigerator(PF_FREEZE); -+ refrigerator(); - - if (change_speed(stir, stir->speed)) - break; -diff -uprN linux-2.6.8.1.orig/drivers/net/irda/vlsi_ir.h linux-2.6.8.1-ve022stab078/drivers/net/irda/vlsi_ir.h ---- 
linux-2.6.8.1.orig/drivers/net/irda/vlsi_ir.h 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/irda/vlsi_ir.h 2006-05-11 13:05:40.000000000 +0400 -@@ -58,7 +58,7 @@ typedef void irqreturn_t; - - /* PDE() introduced in 2.5.4 */ - #ifdef CONFIG_PROC_FS --#define PDE(inode) ((inode)->u.generic_ip) -+#define LPDE(inode) ((inode)->u.generic_ip) - #endif - - /* irda crc16 calculation exported in 2.5.42 */ -diff -uprN linux-2.6.8.1.orig/drivers/net/loopback.c linux-2.6.8.1-ve022stab078/drivers/net/loopback.c ---- linux-2.6.8.1.orig/drivers/net/loopback.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/loopback.c 2006-05-11 13:05:45.000000000 +0400 -@@ -127,6 +127,11 @@ static int loopback_xmit(struct sk_buff - { - struct net_device_stats *lb_stats; - -+ if (unlikely(get_exec_env()->disable_net)) { -+ kfree_skb(skb); -+ return 0; -+ } -+ - skb_orphan(skb); - - skb->protocol=eth_type_trans(skb,dev); -@@ -183,6 +188,30 @@ static struct net_device_stats *get_stat - return stats; - } - -+static void loopback_destructor(struct net_device *dev) -+{ -+ kfree(dev->priv); -+ dev->priv = NULL; -+} -+ -+struct net_device templ_loopback_dev = { -+ .name = "lo", -+ .mtu = (16 * 1024) + 20 + 20 + 12, -+ .hard_start_xmit = loopback_xmit, -+ .hard_header = eth_header, -+ .hard_header_cache = eth_header_cache, -+ .header_cache_update = eth_header_cache_update, -+ .hard_header_len = ETH_HLEN, /* 14 */ -+ .addr_len = ETH_ALEN, /* 6 */ -+ .tx_queue_len = 0, -+ .type = ARPHRD_LOOPBACK, /* 0x0001*/ -+ .rebuild_header = eth_rebuild_header, -+ .flags = IFF_LOOPBACK, -+ .features = NETIF_F_SG|NETIF_F_FRAGLIST -+ |NETIF_F_NO_CSUM|NETIF_F_HIGHDMA -+ |NETIF_F_LLTX|NETIF_F_VIRTUAL, -+}; -+ - struct net_device loopback_dev = { - .name = "lo", - .mtu = (16 * 1024) + 20 + 20 + 12, -@@ -212,9 +241,11 @@ int __init loopback_init(void) - memset(stats, 0, sizeof(struct net_device_stats)); - loopback_dev.priv = stats; - loopback_dev.get_stats = 
&get_stats; -+ loopback_dev.destructor = &loopback_destructor; - } - - return register_netdev(&loopback_dev); - }; - - EXPORT_SYMBOL(loopback_dev); -+EXPORT_SYMBOL(templ_loopback_dev); -diff -uprN linux-2.6.8.1.orig/drivers/net/net_init.c linux-2.6.8.1-ve022stab078/drivers/net/net_init.c ---- linux-2.6.8.1.orig/drivers/net/net_init.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/net_init.c 2006-05-11 13:05:40.000000000 +0400 -@@ -51,6 +51,7 @@ - #include <linux/if_ltalk.h> - #include <linux/rtnetlink.h> - #include <net/neighbour.h> -+#include <ub/ub_mem.h> - - /* The network devices currently exist only in the socket namespace, so these - entries are unused. The only ones that make sense are -@@ -83,7 +84,7 @@ struct net_device *alloc_netdev(int size - & ~NETDEV_ALIGN_CONST; - alloc_size += sizeof_priv + NETDEV_ALIGN_CONST; - -- p = kmalloc (alloc_size, GFP_KERNEL); -+ p = ub_kmalloc(alloc_size, GFP_KERNEL); - if (!p) { - printk(KERN_ERR "alloc_dev: Unable to allocate device.\n"); - return NULL; -@@ -392,6 +393,10 @@ int register_netdev(struct net_device *d - - out: - rtnl_unlock(); -+ if (err == 0 && dev->reg_state != NETREG_REGISTERED) { -+ unregister_netdev(dev); -+ err = -ENOMEM; -+ } - return err; - } - -diff -uprN linux-2.6.8.1.orig/drivers/net/open_vznet.c linux-2.6.8.1-ve022stab078/drivers/net/open_vznet.c ---- linux-2.6.8.1.orig/drivers/net/open_vznet.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/drivers/net/open_vznet.c 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,190 @@ -+/* -+ * open_vznet.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+/* -+ * Virtual Networking device used to change VE ownership on packets -+ */ -+ -+#include <linux/kernel.h> -+#include <linux/module.h> -+#include <linux/seq_file.h> -+ -+#include <linux/inet.h> -+#include <net/ip.h> -+#include <linux/skbuff.h> -+#include <linux/venet.h> -+ -+void veip_stop(struct ve_struct *ve) -+{ -+ struct list_head *p, *tmp; -+ -+ write_lock_irq(&veip_hash_lock); -+ if (ve->veip == NULL) -+ goto unlock; -+ list_for_each_safe(p, tmp, &ve->veip->ip_lh) { -+ struct ip_entry_struct *ptr; -+ ptr = list_entry(p, struct ip_entry_struct, ve_list); -+ ptr->active_env = NULL; -+ list_del(&ptr->ve_list); -+ list_del(&ptr->ip_hash); -+ kfree(ptr); -+ } -+ veip_put(ve->veip); -+ ve->veip = NULL; -+unlock: -+ write_unlock_irq(&veip_hash_lock); -+} -+ -+int veip_start(struct ve_struct *ve) -+{ -+ int err; -+ -+ err = 0; -+ write_lock_irq(&veip_hash_lock); -+ ve->veip = veip_findcreate(ve->veid); -+ if (ve->veip == NULL) -+ err = -ENOMEM; -+ write_unlock_irq(&veip_hash_lock); -+ return err; -+} -+ -+int veip_entry_add(struct ve_struct *ve, struct sockaddr_in *addr) -+{ -+ struct ip_entry_struct *entry, *found; -+ int err; -+ -+ entry = kmalloc(sizeof(struct ip_entry_struct), GFP_KERNEL); -+ if (entry == NULL) -+ return -ENOMEM; -+ -+ memset(entry, 0, sizeof(struct ip_entry_struct)); -+ entry->ip = addr->sin_addr.s_addr; -+ -+ write_lock_irq(&veip_hash_lock); -+ err = -EADDRINUSE; -+ found = ip_entry_lookup(entry->ip); -+ if (found != NULL) -+ goto out_unlock; -+ else { -+ ip_entry_hash(entry, ve->veip); -+ found = entry; -+ entry = NULL; -+ } -+ err = 0; -+ found->active_env = ve; -+out_unlock: -+ write_unlock_irq(&veip_hash_lock); -+ if (entry != NULL) -+ kfree(entry); -+ return err; -+} -+ -+int veip_entry_del(envid_t veid, struct sockaddr_in *addr) -+{ -+ struct ip_entry_struct *found; -+ int err; -+ -+ err = -EADDRNOTAVAIL; -+ write_lock_irq(&veip_hash_lock); -+ found = ip_entry_lookup(addr->sin_addr.s_addr); -+ if (found == NULL) -+ goto 
out; -+ if (found->active_env->veid != veid) -+ goto out; -+ -+ err = 0; -+ found->active_env = NULL; -+ -+ list_del(&found->ip_hash); -+ list_del(&found->ve_list); -+ kfree(found); -+out: -+ write_unlock_irq(&veip_hash_lock); -+ return err; -+} -+ -+static struct ve_struct *venet_find_ve(__u32 ip) -+{ -+ struct ip_entry_struct *entry; -+ -+ entry = ip_entry_lookup(ip); -+ if (entry == NULL) -+ return NULL; -+ -+ return entry->active_env; -+} -+ -+int venet_change_skb_owner(struct sk_buff *skb) -+{ -+ struct ve_struct *ve, *ve_old; -+ struct iphdr *iph; -+ -+ ve_old = skb->owner_env; -+ iph = skb->nh.iph; -+ -+ read_lock(&veip_hash_lock); -+ if (!ve_is_super(ve_old)) { -+ /* from VE to host */ -+ ve = venet_find_ve(iph->saddr); -+ if (ve == NULL) -+ goto out_drop; -+ if (!ve_accessible_strict(ve, ve_old)) -+ goto out_source; -+ skb->owner_env = get_ve0(); -+ } else { -+ /* from host to VE */ -+ ve = venet_find_ve(iph->daddr); -+ if (ve == NULL) -+ goto out_drop; -+ skb->owner_env = ve; -+ } -+ read_unlock(&veip_hash_lock); -+ -+ return 0; -+ -+out_drop: -+ read_unlock(&veip_hash_lock); -+ return -ESRCH; -+ -+out_source: -+ read_unlock(&veip_hash_lock); -+ if (net_ratelimit()) { -+ printk(KERN_WARNING "Dropped packet, source wrong " -+ "veid=%u src-IP=%u.%u.%u.%u " -+ "dst-IP=%u.%u.%u.%u\n", -+ skb->owner_env->veid, -+ NIPQUAD(skb->nh.iph->saddr), -+ NIPQUAD(skb->nh.iph->daddr)); -+ } -+ return -EACCES; -+} -+ -+#ifdef CONFIG_PROC_FS -+int veip_seq_show(struct seq_file *m, void *v) -+{ -+ struct list_head *p; -+ struct ip_entry_struct *entry; -+ char s[16]; -+ -+ p = (struct list_head *)v; -+ if (p == ip_entry_hash_table) { -+ seq_puts(m, "Version: 2.5\n"); -+ return 0; -+ } -+ entry = list_entry(p, struct ip_entry_struct, ip_hash); -+ sprintf(s, "%u.%u.%u.%u", NIPQUAD(entry->ip)); -+ seq_printf(m, "%15s %10u\n", s, 0); -+ return 0; -+} -+#endif -+ -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Virtuozzo Virtual Network Device"); 
-+MODULE_LICENSE("GPL v2"); -diff -uprN linux-2.6.8.1.orig/drivers/net/ppp_async.c linux-2.6.8.1-ve022stab078/drivers/net/ppp_async.c ---- linux-2.6.8.1.orig/drivers/net/ppp_async.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/ppp_async.c 2006-05-11 13:05:33.000000000 +0400 -@@ -973,7 +973,7 @@ static void async_lcp_peek(struct asyncp - data += 4; - dlen -= 4; - /* data[0] is code, data[1] is length */ -- while (dlen >= 2 && dlen >= data[1]) { -+ while (dlen >= 2 && dlen >= data[1] && data[1] >= 2) { - switch (data[0]) { - case LCP_MRU: - val = (data[2] << 8) + data[3]; -diff -uprN linux-2.6.8.1.orig/drivers/net/tun.c linux-2.6.8.1-ve022stab078/drivers/net/tun.c ---- linux-2.6.8.1.orig/drivers/net/tun.c 2004-08-14 14:55:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/tun.c 2006-05-11 13:05:42.000000000 +0400 -@@ -44,6 +44,7 @@ - - #include <asm/system.h> - #include <asm/uaccess.h> -+#include <ub/beancounter.h> - - #ifdef TUN_DEBUG - static int debug; -@@ -71,6 +72,7 @@ static int tun_net_close(struct net_devi - static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev) - { - struct tun_struct *tun = netdev_priv(dev); -+ struct user_beancounter *ub; - - DBG(KERN_INFO "%s: tun_net_xmit %d\n", tun->dev->name, skb->len); - -@@ -90,6 +92,19 @@ static int tun_net_xmit(struct sk_buff * - if (skb_queue_len(&tun->readq) >= dev->tx_queue_len) - goto drop; - } -+ -+ ub = netdev_bc(dev)->exec_ub; -+ if (ub && (skb_bc(skb)->charged == 0)) { -+ unsigned long charge; -+ charge = skb_charge_fullsize(skb); -+ if (charge_beancounter(ub, UB_OTHERSOCKBUF, charge, 1)) -+ goto drop; -+ get_beancounter(ub); -+ skb_bc(skb)->ub = ub; -+ skb_bc(skb)->charged = charge; -+ skb_bc(skb)->resource = UB_OTHERSOCKBUF; -+ } -+ - skb_queue_tail(&tun->readq, skb); - - /* Notify and wake up reader process */ -@@ -174,22 +189,26 @@ static __inline__ ssize_t tun_get_user(s - { - struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) }; - struct 
sk_buff *skb; -- size_t len = count; -+ size_t len = count, align = 0; - - if (!(tun->flags & TUN_NO_PI)) { -- if ((len -= sizeof(pi)) > len) -+ if ((len -= sizeof(pi)) > count) - return -EINVAL; - - if(memcpy_fromiovec((void *)&pi, iv, sizeof(pi))) - return -EFAULT; - } -+ -+ if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV) -+ align = NET_IP_ALIGN; - -- if (!(skb = alloc_skb(len + 2, GFP_KERNEL))) { -+ if (!(skb = alloc_skb(len + align, GFP_KERNEL))) { - tun->stats.rx_dropped++; - return -ENOMEM; - } - -- skb_reserve(skb, 2); -+ if (align) -+ skb_reserve(skb, align); - if (memcpy_fromiovec(skb_put(skb, len), iv, len)) - return -EFAULT; - -@@ -322,6 +341,7 @@ static ssize_t tun_chr_readv(struct file - - ret = tun_put_user(tun, skb, (struct iovec *) iv, len); - -+ /* skb will be uncharged in kfree_skb() */ - kfree_skb(skb); - break; - } -@@ -355,6 +375,7 @@ static void tun_setup(struct net_device - dev->stop = tun_net_close; - dev->get_stats = tun_net_stats; - dev->destructor = free_netdev; -+ dev->features |= NETIF_F_VIRTUAL; - } - - static struct tun_struct *tun_get_by_name(const char *name) -@@ -363,8 +384,9 @@ static struct tun_struct *tun_get_by_nam - - ASSERT_RTNL(); - list_for_each_entry(tun, &tun_dev_list, list) { -- if (!strncmp(tun->dev->name, name, IFNAMSIZ)) -- return tun; -+ if (ve_accessible_strict(tun->dev->owner_env, get_exec_env()) && -+ !strncmp(tun->dev->name, name, IFNAMSIZ)) -+ return tun; - } - - return NULL; -@@ -383,7 +405,8 @@ static int tun_set_iff(struct file *file - - /* Check permissions */ - if (tun->owner != -1 && -- current->euid != tun->owner && !capable(CAP_NET_ADMIN)) -+ current->euid != tun->owner && -+ !capable(CAP_NET_ADMIN) && !capable(CAP_VE_NET_ADMIN)) - return -EPERM; - } - else if (__dev_get_by_name(ifr->ifr_name)) -diff -uprN linux-2.6.8.1.orig/drivers/net/venet_core.c linux-2.6.8.1-ve022stab078/drivers/net/venet_core.c ---- linux-2.6.8.1.orig/drivers/net/venet_core.c 1970-01-01 03:00:00.000000000 +0300 -+++ 
linux-2.6.8.1-ve022stab078/drivers/net/venet_core.c 2006-05-11 13:05:45.000000000 +0400 -@@ -0,0 +1,626 @@ -+/* -+ * venet_core.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+/* -+ * Common part for Virtuozzo virtual network devices -+ */ -+ -+#include <linux/kernel.h> -+#include <linux/sched.h> -+#include <linux/interrupt.h> -+#include <linux/fs.h> -+#include <linux/types.h> -+#include <linux/string.h> -+#include <linux/socket.h> -+#include <linux/errno.h> -+#include <linux/fcntl.h> -+#include <linux/in.h> -+#include <linux/init.h> -+#include <linux/module.h> -+#include <linux/tcp.h> -+#include <linux/proc_fs.h> -+#include <linux/seq_file.h> -+ -+#include <asm/system.h> -+#include <asm/uaccess.h> -+#include <asm/io.h> -+#include <asm/unistd.h> -+ -+#include <linux/inet.h> -+#include <linux/netdevice.h> -+#include <linux/etherdevice.h> -+#include <net/ip.h> -+#include <linux/skbuff.h> -+#include <net/sock.h> -+#include <linux/if_ether.h> /* For the statistics structure. 
*/ -+#include <linux/if_arp.h> /* For ARPHRD_ETHER */ -+#include <linux/venet.h> -+#include <linux/ve_proto.h> -+#include <linux/vzctl.h> -+#include <linux/vzctl_venet.h> -+ -+struct list_head ip_entry_hash_table[VEIP_HASH_SZ]; -+rwlock_t veip_hash_lock = RW_LOCK_UNLOCKED; -+LIST_HEAD(veip_lh); -+ -+#define ip_entry_hash_function(ip) (ntohl(ip) & (VEIP_HASH_SZ - 1)) -+ -+void ip_entry_hash(struct ip_entry_struct *entry, struct veip_struct *veip) -+{ -+ list_add(&entry->ip_hash, -+ ip_entry_hash_table + ip_entry_hash_function(entry->ip)); -+ list_add(&entry->ve_list, &veip->ip_lh); -+} -+ -+void veip_put(struct veip_struct *veip) -+{ -+ if (!list_empty(&veip->ip_lh)) -+ return; -+ if (!list_empty(&veip->src_lh)) -+ return; -+ if (!list_empty(&veip->dst_lh)) -+ return; -+ -+ list_del(&veip->list); -+ kfree(veip); -+} -+ -+struct ip_entry_struct *ip_entry_lookup(u32 addr) -+{ -+ struct ip_entry_struct *entry; -+ struct list_head *tmp; -+ -+ list_for_each(tmp, ip_entry_hash_table + ip_entry_hash_function(addr)) { -+ entry = list_entry(tmp, struct ip_entry_struct, ip_hash); -+ if (entry->ip != addr) -+ continue; -+ return entry; -+ } -+ return NULL; -+} -+ -+struct veip_struct *veip_find(envid_t veid) -+{ -+ struct veip_struct *ptr; -+ list_for_each_entry(ptr, &veip_lh, list) { -+ if (ptr->veid != veid) -+ continue; -+ return ptr; -+ } -+ return NULL; -+} -+ -+struct veip_struct *veip_findcreate(envid_t veid) -+{ -+ struct veip_struct *ptr; -+ -+ ptr = veip_find(veid); -+ if (ptr != NULL) -+ return ptr; -+ -+ ptr = kmalloc(sizeof(struct veip_struct), GFP_ATOMIC); -+ if (ptr == NULL) -+ return NULL; -+ memset(ptr, 0, sizeof(struct veip_struct)); -+ INIT_LIST_HEAD(&ptr->ip_lh); -+ INIT_LIST_HEAD(&ptr->src_lh); -+ INIT_LIST_HEAD(&ptr->dst_lh); -+ list_add(&ptr->list, &veip_lh); -+ ptr->veid = veid; -+ return ptr; -+} -+ -+/* -+ * Device functions -+ */ -+ -+static int venet_open(struct net_device *dev) -+{ -+ if (!try_module_get(THIS_MODULE)) -+ return -EBUSY; -+ return 0; 
-+} -+ -+static int venet_close(struct net_device *master) -+{ -+ module_put(THIS_MODULE); -+ return 0; -+} -+ -+static void venet_destructor(struct net_device *dev) -+{ -+ kfree(dev->priv); -+ dev->priv = NULL; -+} -+ -+/* -+ * The higher levels take care of making this non-reentrant (it's -+ * called with bh's disabled). -+ */ -+static int venet_xmit(struct sk_buff *skb, struct net_device *dev) -+{ -+ struct net_device_stats *stats = (struct net_device_stats *)dev->priv; -+ struct net_device *rcv = NULL; -+ struct iphdr *iph; -+ int length; -+ -+ if (unlikely(get_exec_env()->disable_net)) -+ goto outf; -+ -+ /* -+ * Optimise so buffers with skb->free=1 are not copied but -+ * instead are lobbed from tx queue to rx queue -+ */ -+ if (atomic_read(&skb->users) != 1) { -+ struct sk_buff *skb2 = skb; -+ skb = skb_clone(skb, GFP_ATOMIC); /* Clone the buffer */ -+ if (skb == NULL) { -+ kfree_skb(skb2); -+ goto out; -+ } -+ kfree_skb(skb2); -+ } else -+ skb_orphan(skb); -+ -+ if (skb->protocol != __constant_htons(ETH_P_IP)) -+ goto outf; -+ -+ iph = skb->nh.iph; -+ if (MULTICAST(iph->daddr)) -+ goto outf; -+ -+ if (venet_change_skb_owner(skb) < 0) -+ goto outf; -+ -+ if (unlikely(VE_OWNER_SKB(skb)->disable_net)) -+ goto outf; -+ -+ rcv = VE_OWNER_SKB(skb)->_venet_dev; -+ if (!rcv) -+ /* VE going down */ -+ goto outf; -+ -+ dev_hold(rcv); -+ -+ if (!(rcv->flags & IFF_UP)) { -+ /* Target VE does not want to receive packets */ -+ dev_put(rcv); -+ goto outf; -+ } -+ -+ skb->pkt_type = PACKET_HOST; -+ skb->dev = rcv; -+ -+ skb->mac.raw = skb->data; -+ memset(skb->data - dev->hard_header_len, 0, dev->hard_header_len); -+ -+ dst_release(skb->dst); -+ skb->dst = NULL; -+#ifdef CONFIG_NETFILTER -+ nf_conntrack_put(skb->nfct); -+ skb->nfct = NULL; -+#ifdef CONFIG_NETFILTER_DEBUG -+ skb->nf_debug = 0; -+#endif -+#endif -+ length = skb->len; -+ -+ netif_rx(skb); -+ -+ stats->tx_bytes += length; -+ stats->tx_packets++; -+ if (rcv) { -+ struct net_device_stats *rcv_stats = -+ (struct 
net_device_stats *)rcv->priv; -+ rcv_stats->rx_bytes += length; -+ rcv_stats->rx_packets++; -+ dev_put(rcv); -+ } -+ -+ return 0; -+ -+outf: -+ kfree_skb(skb); -+ ++stats->tx_dropped; -+out: -+ return 0; -+} -+ -+static struct net_device_stats *get_stats(struct net_device *dev) -+{ -+ return (struct net_device_stats *)dev->priv; -+} -+ -+/* Initialize the rest of the LOOPBACK device. */ -+int venet_init_dev(struct net_device *dev) -+{ -+ dev->hard_start_xmit = venet_xmit; -+ dev->priv = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL); -+ if (dev->priv == NULL) -+ return -ENOMEM; -+ memset(dev->priv, 0, sizeof(struct net_device_stats)); -+ dev->get_stats = get_stats; -+ dev->open = venet_open; -+ dev->stop = venet_close; -+ dev->destructor = venet_destructor; -+ -+ /* -+ * Fill in the generic fields of the device structure. -+ */ -+ dev->type = ARPHRD_VOID; -+ dev->hard_header_len = ETH_HLEN; -+ dev->mtu = 1500; /* eth_mtu */ -+ dev->tx_queue_len = 0; -+ -+ memset(dev->broadcast, 0xFF, ETH_ALEN); -+ -+ /* New-style flags. 
*/ -+ dev->flags = IFF_BROADCAST|IFF_NOARP|IFF_POINTOPOINT; -+ return 0; -+} -+ -+static void venet_setup(struct net_device *dev) -+{ -+ dev->init = venet_init_dev; -+ /* -+ * No other features, as they are: -+ * - checksumming is required, and nobody else will done our job -+ */ -+ dev->features |= NETIF_F_VENET | NETIF_F_VIRTUAL; -+} -+ -+#ifdef CONFIG_PROC_FS -+static int veinfo_seq_show(struct seq_file *m, void *v) -+{ -+ struct ve_struct *ve = (struct ve_struct *)v; -+ struct list_head *tmp; -+ -+ seq_printf(m, "%10u %5u %5u", ve->veid, -+ ve->class_id, atomic_read(&ve->pcounter)); -+ read_lock(&veip_hash_lock); -+ if (ve->veip == NULL) -+ goto unlock; -+ list_for_each(tmp, &ve->veip->ip_lh) { -+ char ip[16]; -+ struct ip_entry_struct *entry; -+ -+ entry = list_entry(tmp, struct ip_entry_struct, ve_list); -+ if (entry->active_env == NULL) -+ continue; -+ -+ sprintf(ip, "%u.%u.%u.%u", NIPQUAD(entry->ip)); -+ seq_printf(m, " %15s", ip); -+ } -+unlock: -+ read_unlock(&veip_hash_lock); -+ seq_putc(m, '\n'); -+ return 0; -+} -+ -+static void *ve_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ struct ve_struct *ve, *curve; -+ loff_t l; -+ -+ curve = get_exec_env(); -+ read_lock(&ve_list_guard); -+ if (!ve_is_super(curve)) { -+ if (*pos != 0) -+ return NULL; -+ return curve; -+ } -+ for (ve = ve_list_head, l = *pos; -+ ve != NULL && l > 0; -+ ve = ve->next, l--); -+ return ve; -+} -+ -+static void *ve_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ struct ve_struct *ve = (struct ve_struct *)v; -+ -+ if (!ve_is_super(get_exec_env())) -+ return NULL; -+ (*pos)++; -+ return ve->next; -+} -+ -+static void ve_seq_stop(struct seq_file *m, void *v) -+{ -+ read_unlock(&ve_list_guard); -+} -+ -+ -+static struct seq_operations veinfo_seq_op = { -+ start: ve_seq_start, -+ next: ve_seq_next, -+ stop: ve_seq_stop, -+ show: veinfo_seq_show -+}; -+ -+static int veinfo_open(struct inode *inode, struct file *file) -+{ -+ return seq_open(file, &veinfo_seq_op); -+} -+ 
-+static struct file_operations proc_veinfo_operations = { -+ open: veinfo_open, -+ read: seq_read, -+ llseek: seq_lseek, -+ release: seq_release -+}; -+ -+static void *veip_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ loff_t l; -+ struct list_head *p; -+ int i; -+ -+ l = *pos; -+ write_lock_irq(&veip_hash_lock); -+ if (l == 0) -+ return ip_entry_hash_table; -+ for (i = 0; i < VEIP_HASH_SZ; i++) { -+ list_for_each(p, ip_entry_hash_table + i) { -+ if (--l == 0) -+ return p; -+ } -+ } -+ return NULL; -+} -+ -+static void *veip_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ struct list_head *p; -+ -+ p = (struct list_head *)v; -+ while (1) { -+ p = p->next; -+ if (p < ip_entry_hash_table || -+ p >= ip_entry_hash_table + VEIP_HASH_SZ) { -+ (*pos)++; -+ return p; -+ } -+ if (++p >= ip_entry_hash_table + VEIP_HASH_SZ) -+ return NULL; -+ } -+ return NULL; -+} -+ -+static void veip_seq_stop(struct seq_file *m, void *v) -+{ -+ write_unlock_irq(&veip_hash_lock); -+} -+ -+static struct seq_operations veip_seq_op = { -+ start: veip_seq_start, -+ next: veip_seq_next, -+ stop: veip_seq_stop, -+ show: veip_seq_show -+}; -+ -+static int veip_open(struct inode *inode, struct file *file) -+{ -+ return seq_open(file, &veip_seq_op); -+} -+ -+static struct file_operations proc_veip_operations = { -+ open: veip_open, -+ read: seq_read, -+ llseek: seq_lseek, -+ release: seq_release -+}; -+#endif -+ -+int real_ve_ip_map(envid_t veid, int op, struct sockaddr *uservaddr, int addrlen) -+{ -+ int err; -+ struct sockaddr_in addr; -+ struct ve_struct *ve; -+ -+ err = -EPERM; -+ if (!capable(CAP_SETVEID)) -+ goto out; -+ -+ err = -EINVAL; -+ if (addrlen != sizeof(struct sockaddr_in)) -+ goto out; -+ -+ err = move_addr_to_kernel(uservaddr, addrlen, &addr); -+ if (err < 0) -+ goto out; -+ -+ switch (op) -+ { -+ case VE_IP_ADD: -+ ve = get_ve_by_id(veid); -+ err = -ESRCH; -+ if (!ve) -+ goto out; -+ -+ down_read(&ve->op_sem); -+ if (ve->is_running) -+ err = veip_entry_add(ve, 
&addr); -+ up_read(&ve->op_sem); -+ put_ve(ve); -+ break; -+ -+ case VE_IP_DEL: -+ err = veip_entry_del(veid, &addr); -+ break; -+ default: -+ err = -EINVAL; -+ } -+ -+out: -+ return err; -+} -+ -+int venet_ioctl(struct inode *ino, struct file *file, unsigned int cmd, -+ unsigned long arg) -+{ -+ int err; -+ -+ err = -ENOTTY; -+ switch(cmd) { -+ case VENETCTL_VE_IP_MAP: { -+ struct vzctl_ve_ip_map s; -+ err = -EFAULT; -+ if (copy_from_user(&s, (void *)arg, sizeof(s))) -+ break; -+ err = real_ve_ip_map(s.veid, s.op, s.addr, s.addrlen); -+ } -+ break; -+ } -+ return err; -+} -+ -+static struct vzioctlinfo venetcalls = { -+ type: VENETCTLTYPE, -+ func: venet_ioctl, -+ owner: THIS_MODULE, -+}; -+ -+int venet_dev_start(struct ve_struct *env) -+{ -+ struct net_device *dev_venet; -+ int err; -+ -+ dev_venet = alloc_netdev(0, "venet%d", venet_setup); -+ if (!dev_venet) -+ return -ENOMEM; -+ err = dev_alloc_name(dev_venet, dev_venet->name); -+ if (err<0) -+ goto err; -+ if ((err = register_netdev(dev_venet)) != 0) -+ goto err; -+ env->_venet_dev = dev_venet; -+ return 0; -+err: -+ free_netdev(dev_venet); -+ printk(KERN_ERR "VENET initialization error err=%d\n", err); -+ return err; -+} -+ -+static int venet_start(unsigned int hooknum, void *data) -+{ -+ struct ve_struct *env; -+ int err; -+ -+ env = (struct ve_struct *)data; -+ if (env->veip) -+ return -EEXIST; -+ if (!ve_is_super(env) && !try_module_get(THIS_MODULE)) -+ return 0; -+ -+ err = veip_start(env); -+ if (err) -+ goto err; -+ -+ err = venet_dev_start(env); -+ if (err) -+ goto err_free; -+ return 0; -+ -+err_free: -+ veip_stop(env); -+err: -+ if (!ve_is_super(env)) -+ module_put(THIS_MODULE); -+ return err; -+} -+ -+static int venet_stop(unsigned int hooknum, void *data) -+{ -+ struct ve_struct *env; -+ -+ env = (struct ve_struct *)data; -+ veip_stop(env); -+ if (!ve_is_super(env)) -+ module_put(THIS_MODULE); -+ return 0; -+} -+ -+#define VE_HOOK_PRI_NET 0 -+ -+static struct ve_hook venet_ve_hook_init = { -+ hook: 
venet_start, -+ undo: venet_stop, -+ hooknum: VE_HOOK_INIT, -+ priority: VE_HOOK_PRI_NET -+}; -+ -+static struct ve_hook venet_ve_hook_fini = { -+ hook: venet_stop, -+ hooknum: VE_HOOK_FINI, -+ priority: VE_HOOK_PRI_NET -+}; -+ -+__init int venet_init(void) -+{ -+#ifdef CONFIG_PROC_FS -+ struct proc_dir_entry *de; -+#endif -+ int i, err; -+ -+ if (get_ve0()->_venet_dev != NULL) -+ return -EEXIST; -+ -+ for (i = 0; i < VEIP_HASH_SZ; i++) -+ INIT_LIST_HEAD(ip_entry_hash_table + i); -+ -+ err = venet_start(VE_HOOK_INIT, (void *)get_ve0()); -+ if (err) -+ return err; -+ -+#ifdef CONFIG_PROC_FS -+ de = create_proc_glob_entry("vz/veinfo", -+ S_IFREG|S_IRUSR, NULL); -+ if (de) -+ de->proc_fops = &proc_veinfo_operations; -+ else -+ printk(KERN_WARNING "venet: can't make veinfo proc entry\n"); -+ -+ de = create_proc_entry("vz/veip", S_IFREG|S_IRUSR, NULL); -+ if (de) -+ de->proc_fops = &proc_veip_operations; -+ else -+ printk(KERN_WARNING "venet: can't make veip proc entry\n"); -+#endif -+ -+ ve_hook_register(&venet_ve_hook_init); -+ ve_hook_register(&venet_ve_hook_fini); -+ vzioctl_register(&venetcalls); -+ return 0; -+} -+ -+__exit void venet_exit(void) -+{ -+ struct net_device *dev_venet; -+ -+ vzioctl_unregister(&venetcalls); -+ ve_hook_unregister(&venet_ve_hook_fini); -+ ve_hook_unregister(&venet_ve_hook_init); -+#ifdef CONFIG_PROC_FS -+ remove_proc_entry("vz/veip", NULL); -+ remove_proc_entry("vz/veinfo", NULL); -+#endif -+ -+ dev_venet = get_ve0()->_venet_dev; -+ if (dev_venet != NULL) { -+ get_ve0()->_venet_dev = NULL; -+ unregister_netdev(dev_venet); -+ free_netdev(dev_venet); -+ } -+ veip_stop(get_ve0()); -+} -+ -+module_init(venet_init); -+module_exit(venet_exit); -diff -uprN linux-2.6.8.1.orig/drivers/net/wireless/airo.c linux-2.6.8.1-ve022stab078/drivers/net/wireless/airo.c ---- linux-2.6.8.1.orig/drivers/net/wireless/airo.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/net/wireless/airo.c 2006-05-11 13:05:25.000000000 +0400 -@@ 
-2901,8 +2901,8 @@ static int airo_thread(void *data) { - flush_signals(current); - - /* make swsusp happy with our thread */ -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - if (test_bit(JOB_DIE, &ai->flags)) - break; -diff -uprN linux-2.6.8.1.orig/drivers/pci/probe.c linux-2.6.8.1-ve022stab078/drivers/pci/probe.c ---- linux-2.6.8.1.orig/drivers/pci/probe.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/pci/probe.c 2006-05-11 13:05:40.000000000 +0400 -@@ -26,6 +26,7 @@ LIST_HEAD(pci_root_buses); - EXPORT_SYMBOL(pci_root_buses); - - LIST_HEAD(pci_devices); -+EXPORT_SYMBOL(pci_devices); - - /* - * PCI Bus Class -diff -uprN linux-2.6.8.1.orig/drivers/pci/quirks.c linux-2.6.8.1-ve022stab078/drivers/pci/quirks.c ---- linux-2.6.8.1.orig/drivers/pci/quirks.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/pci/quirks.c 2006-05-11 13:05:28.000000000 +0400 -@@ -292,6 +292,46 @@ static void __devinit quirk_ich4_lpc_acp - quirk_io_region(dev, region, 64, PCI_BRIDGE_RESOURCES+1); - } - -+#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) -+#include <asm/irq.h> -+ -+static void __devinit quirk_intel_irqbalance(struct pci_dev *dev) -+{ -+ u8 config, rev; -+ u32 word; -+ extern struct pci_raw_ops *raw_pci_ops; -+ -+ pci_read_config_byte(dev, PCI_CLASS_REVISION, &rev); -+ if (rev > 0x9) -+ return; -+ -+ printk(KERN_INFO "Intel E7520/7320/7525 detected."); -+ -+ /* enable access to config space*/ -+ pci_read_config_byte(dev, 0xf4, &config); -+ config |= 0x2; -+ pci_write_config_byte(dev, 0xf4, config); -+ -+ /* read xTPR register */ -+ raw_pci_ops->read(0, 0, 0x40, 0x4c, 2, &word); -+ -+ if (!(word & (1 << 13))) { -+ printk(KERN_INFO "Disabling irq balancing and affinity\n"); -+#ifdef __i386__ -+#ifdef CONFIG_IRQBALANCE -+ irqbalance_disable(""); -+#endif -+ noirqdebug_setup(""); -+#endif -+ no_irq_affinity = 1; -+ } -+ -+ config &= ~0x2; -+ /* 
disable access to config space*/ -+ pci_write_config_byte(dev, 0xf4, config); -+} -+#endif -+ - /* - * VIA ACPI: One IO region pointed to by longword at - * 0x48 or 0x20 (256 bytes of ACPI registers) -@@ -1039,6 +1079,10 @@ static struct pci_fixup pci_fixups[] __d - #endif /* CONFIG_SCSI_SATA */ - - { PCI_FIXUP_FINAL, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_SMCH, quirk_pciehp_msi }, -+#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) -+ { PCI_FIXUP_FINAL, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7320_MCH, quirk_intel_irqbalance }, -+ { PCI_FIXUP_FINAL, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7525_MCH, quirk_intel_irqbalance }, -+#endif - - { 0 } - }; -diff -uprN linux-2.6.8.1.orig/drivers/pcmcia/cs.c linux-2.6.8.1-ve022stab078/drivers/pcmcia/cs.c ---- linux-2.6.8.1.orig/drivers/pcmcia/cs.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/pcmcia/cs.c 2006-05-11 13:05:25.000000000 +0400 -@@ -724,8 +724,8 @@ static int pccardd(void *__skt) - } - - schedule(); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - if (!skt->thread) - break; -diff -uprN linux-2.6.8.1.orig/drivers/sbus/char/bbc_envctrl.c linux-2.6.8.1-ve022stab078/drivers/sbus/char/bbc_envctrl.c ---- linux-2.6.8.1.orig/drivers/sbus/char/bbc_envctrl.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/sbus/char/bbc_envctrl.c 2006-05-11 13:05:40.000000000 +0400 -@@ -614,7 +614,7 @@ void bbc_envctrl_cleanup(void) - int found = 0; - - read_lock(&tasklist_lock); -- for_each_process(p) { -+ for_each_process_all(p) { - if (p == kenvctrld_task) { - found = 1; - break; -diff -uprN linux-2.6.8.1.orig/drivers/sbus/char/envctrl.c linux-2.6.8.1-ve022stab078/drivers/sbus/char/envctrl.c ---- linux-2.6.8.1.orig/drivers/sbus/char/envctrl.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/sbus/char/envctrl.c 2006-05-11 13:05:40.000000000 +0400 -@@ -1170,7 +1170,7 @@ 
static void __exit envctrl_cleanup(void) - int found = 0; - - read_lock(&tasklist_lock); -- for_each_process(p) { -+ for_each_process_all(p) { - if (p == kenvctrld_task) { - found = 1; - break; -diff -uprN linux-2.6.8.1.orig/drivers/scsi/aic7xxx/aic79xx_osm.c linux-2.6.8.1-ve022stab078/drivers/scsi/aic7xxx/aic79xx_osm.c ---- linux-2.6.8.1.orig/drivers/scsi/aic7xxx/aic79xx_osm.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/scsi/aic7xxx/aic79xx_osm.c 2006-05-11 13:05:25.000000000 +0400 -@@ -2591,7 +2591,6 @@ ahd_linux_dv_thread(void *data) - sprintf(current->comm, "ahd_dv_%d", ahd->unit); - #else - daemonize("ahd_dv_%d", ahd->unit); -- current->flags |= PF_FREEZE; - #endif - unlock_kernel(); - -diff -uprN linux-2.6.8.1.orig/drivers/scsi/aic7xxx/aic7xxx_osm.c linux-2.6.8.1-ve022stab078/drivers/scsi/aic7xxx/aic7xxx_osm.c ---- linux-2.6.8.1.orig/drivers/scsi/aic7xxx/aic7xxx_osm.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/scsi/aic7xxx/aic7xxx_osm.c 2006-05-11 13:05:25.000000000 +0400 -@@ -2295,7 +2295,6 @@ ahc_linux_dv_thread(void *data) - sprintf(current->comm, "ahc_dv_%d", ahc->unit); - #else - daemonize("ahc_dv_%d", ahc->unit); -- current->flags |= PF_FREEZE; - #endif - unlock_kernel(); - -diff -uprN linux-2.6.8.1.orig/drivers/scsi/scsi_error.c linux-2.6.8.1-ve022stab078/drivers/scsi/scsi_error.c ---- linux-2.6.8.1.orig/drivers/scsi/scsi_error.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/scsi/scsi_error.c 2006-05-11 13:05:25.000000000 +0400 -@@ -558,7 +558,7 @@ static int scsi_request_sense(struct scs - - memcpy(scmd->cmnd, generic_sense, sizeof(generic_sense)); - -- scsi_result = kmalloc(252, GFP_ATOMIC | (scmd->device->host->hostt->unchecked_isa_dma) ? __GFP_DMA : 0); -+ scsi_result = kmalloc(252, GFP_ATOMIC | ((scmd->device->host->hostt->unchecked_isa_dma) ? 
__GFP_DMA : 0)); - - - if (unlikely(!scsi_result)) { -diff -uprN linux-2.6.8.1.orig/drivers/scsi/scsi_scan.c linux-2.6.8.1-ve022stab078/drivers/scsi/scsi_scan.c ---- linux-2.6.8.1.orig/drivers/scsi/scsi_scan.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/scsi/scsi_scan.c 2006-05-11 13:05:25.000000000 +0400 -@@ -733,7 +733,7 @@ static int scsi_probe_and_add_lun(struct - if (!sreq) - goto out_free_sdev; - result = kmalloc(256, GFP_ATOMIC | -- (host->unchecked_isa_dma) ? __GFP_DMA : 0); -+ ((host->unchecked_isa_dma) ? __GFP_DMA : 0)); - if (!result) - goto out_free_sreq; - -diff -uprN linux-2.6.8.1.orig/drivers/scsi/sg.c linux-2.6.8.1-ve022stab078/drivers/scsi/sg.c ---- linux-2.6.8.1.orig/drivers/scsi/sg.c 2004-08-14 14:55:31.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/scsi/sg.c 2006-05-11 13:05:34.000000000 +0400 -@@ -2877,23 +2877,22 @@ static void * dev_seq_start(struct seq_f - { - struct sg_proc_deviter * it = kmalloc(sizeof(*it), GFP_KERNEL); - -+ s->private = it; - if (! it) - return NULL; -+ - if (NULL == sg_dev_arr) -- goto err1; -+ return NULL; - it->index = *pos; - it->max = sg_last_dev(); - if (it->index >= it->max) -- goto err1; -+ return NULL; - return it; --err1: -- kfree(it); -- return NULL; - } - - static void * dev_seq_next(struct seq_file *s, void *v, loff_t *pos) - { -- struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; -+ struct sg_proc_deviter * it = s->private; - - *pos = ++it->index; - return (it->index < it->max) ? 
it : NULL; -@@ -2901,7 +2900,7 @@ static void * dev_seq_next(struct seq_fi - - static void dev_seq_stop(struct seq_file *s, void *v) - { -- kfree (v); -+ kfree(s->private); - } - - static int sg_proc_open_dev(struct inode *inode, struct file *file) -diff -uprN linux-2.6.8.1.orig/drivers/serial/8250.c linux-2.6.8.1-ve022stab078/drivers/serial/8250.c ---- linux-2.6.8.1.orig/drivers/serial/8250.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/serial/8250.c 2006-05-11 13:05:28.000000000 +0400 -@@ -20,27 +20,28 @@ - * membase is an 'ioremapped' cookie. - */ - #include <linux/config.h> -+#if defined(CONFIG_SERIAL_8250_CONSOLE) && defined(CONFIG_MAGIC_SYSRQ) -+#define SUPPORT_SYSRQ -+#endif -+ - #include <linux/module.h> - #include <linux/moduleparam.h> --#include <linux/tty.h> - #include <linux/ioport.h> - #include <linux/init.h> - #include <linux/console.h> - #include <linux/sysrq.h> -+#include <linux/delay.h> -+#include <linux/device.h> -+#include <linux/tty.h> -+#include <linux/tty_flip.h> - #include <linux/serial_reg.h> -+#include <linux/serial_core.h> - #include <linux/serial.h> - #include <linux/serialP.h> --#include <linux/delay.h> --#include <linux/device.h> - - #include <asm/io.h> - #include <asm/irq.h> - --#if defined(CONFIG_SERIAL_8250_CONSOLE) && defined(CONFIG_MAGIC_SYSRQ) --#define SUPPORT_SYSRQ --#endif -- --#include <linux/serial_core.h> - #include "8250.h" - - /* -@@ -827,16 +828,22 @@ receive_chars(struct uart_8250_port *up, - struct tty_struct *tty = up->port.info->tty; - unsigned char ch; - int max_count = 256; -+ char flag; - - do { -+ /* The following is not allowed by the tty layer and -+ unsafe. 
It should be fixed ASAP */ - if (unlikely(tty->flip.count >= TTY_FLIPBUF_SIZE)) { -- tty->flip.work.func((void *)tty); -- if (tty->flip.count >= TTY_FLIPBUF_SIZE) -- return; // if TTY_DONT_FLIP is set -+ if(tty->low_latency) { -+ spin_unlock(&up->port.lock); -+ tty_flip_buffer_push(tty); -+ spin_lock(&up->port.lock); -+ } -+ /* If this failed then we will throw away the -+ bytes but must do so to clear interrupts */ - } - ch = serial_inp(up, UART_RX); -- *tty->flip.char_buf_ptr = ch; -- *tty->flip.flag_buf_ptr = TTY_NORMAL; -+ flag = TTY_NORMAL; - up->port.icount.rx++; - - if (unlikely(*status & (UART_LSR_BI | UART_LSR_PE | -@@ -876,35 +883,30 @@ receive_chars(struct uart_8250_port *up, - #endif - if (*status & UART_LSR_BI) { - DEBUG_INTR("handling break...."); -- *tty->flip.flag_buf_ptr = TTY_BREAK; -+ flag = TTY_BREAK; - } else if (*status & UART_LSR_PE) -- *tty->flip.flag_buf_ptr = TTY_PARITY; -+ flag = TTY_PARITY; - else if (*status & UART_LSR_FE) -- *tty->flip.flag_buf_ptr = TTY_FRAME; -+ flag = TTY_FRAME; - } - if (uart_handle_sysrq_char(&up->port, ch, regs)) - goto ignore_char; -- if ((*status & up->port.ignore_status_mask) == 0) { -- tty->flip.flag_buf_ptr++; -- tty->flip.char_buf_ptr++; -- tty->flip.count++; -- } -+ if ((*status & up->port.ignore_status_mask) == 0) -+ tty_insert_flip_char(tty, ch, flag); - if ((*status & UART_LSR_OE) && -- tty->flip.count < TTY_FLIPBUF_SIZE) { -+ tty->flip.count < TTY_FLIPBUF_SIZE) - /* - * Overrun is special, since it's reported - * immediately, and doesn't affect the current - * character. 
- */ -- *tty->flip.flag_buf_ptr = TTY_OVERRUN; -- tty->flip.flag_buf_ptr++; -- tty->flip.char_buf_ptr++; -- tty->flip.count++; -- } -+ tty_insert_flip_char(tty, 0, TTY_OVERRUN); - ignore_char: - *status = serial_inp(up, UART_LSR); - } while ((*status & UART_LSR_DR) && (max_count-- > 0)); -+ spin_unlock(&up->port.lock); - tty_flip_buffer_push(tty); -+ spin_lock(&up->port.lock); - } - - static _INLINE_ void transmit_chars(struct uart_8250_port *up) -diff -uprN linux-2.6.8.1.orig/drivers/usb/core/hub.c linux-2.6.8.1-ve022stab078/drivers/usb/core/hub.c ---- linux-2.6.8.1.orig/drivers/usb/core/hub.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/usb/core/hub.c 2006-05-11 13:05:25.000000000 +0400 -@@ -1922,8 +1922,8 @@ static int hub_thread(void *__unused) - do { - hub_events(); - wait_event_interruptible(khubd_wait, !list_empty(&hub_event_list)); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - } while (!signal_pending(current)); - - pr_debug ("%s: khubd exiting\n", usbcore_name); -diff -uprN linux-2.6.8.1.orig/drivers/w1/w1.c linux-2.6.8.1-ve022stab078/drivers/w1/w1.c ---- linux-2.6.8.1.orig/drivers/w1/w1.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/drivers/w1/w1.c 2006-05-11 13:05:25.000000000 +0400 -@@ -465,8 +465,8 @@ int w1_control(void *data) - timeout = w1_timeout; - do { - timeout = interruptible_sleep_on_timeout(&w1_control_wait, timeout); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - } while (!signal_pending(current) && (timeout > 0)); - - if (signal_pending(current)) -@@ -536,8 +536,8 @@ int w1_process(void *data) - timeout = w1_timeout; - do { - timeout = interruptible_sleep_on_timeout(&dev->kwait, timeout); -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - } while (!signal_pending(current) 
&& (timeout > 0)); - - if (signal_pending(current)) -diff -uprN linux-2.6.8.1.orig/fs/adfs/adfs.h linux-2.6.8.1-ve022stab078/fs/adfs/adfs.h ---- linux-2.6.8.1.orig/fs/adfs/adfs.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/adfs/adfs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -72,7 +72,7 @@ int adfs_get_block(struct inode *inode, - struct buffer_head *bh, int create); - struct inode *adfs_iget(struct super_block *sb, struct object_info *obj); - void adfs_read_inode(struct inode *inode); --void adfs_write_inode(struct inode *inode,int unused); -+int adfs_write_inode(struct inode *inode,int unused); - int adfs_notify_change(struct dentry *dentry, struct iattr *attr); - - /* map.c */ -diff -uprN linux-2.6.8.1.orig/fs/adfs/inode.c linux-2.6.8.1-ve022stab078/fs/adfs/inode.c ---- linux-2.6.8.1.orig/fs/adfs/inode.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/adfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -372,10 +372,11 @@ out: - * The adfs-specific inode data has already been updated by - * adfs_notify_change() - */ --void adfs_write_inode(struct inode *inode, int unused) -+int adfs_write_inode(struct inode *inode, int unused) - { - struct super_block *sb = inode->i_sb; - struct object_info obj; -+ int ret; - - lock_kernel(); - obj.file_id = inode->i_ino; -@@ -386,7 +387,8 @@ void adfs_write_inode(struct inode *inod - obj.attr = ADFS_I(inode)->attr; - obj.size = inode->i_size; - -- adfs_dir_update(sb, &obj); -+ ret = adfs_dir_update(sb, &obj); - unlock_kernel(); -+ return ret; - } - MODULE_LICENSE("GPL"); -diff -uprN linux-2.6.8.1.orig/fs/affs/inode.c linux-2.6.8.1-ve022stab078/fs/affs/inode.c ---- linux-2.6.8.1.orig/fs/affs/inode.c 2004-08-14 14:55:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/affs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -181,7 +181,7 @@ bad_inode: - return; - } - --void -+int - affs_write_inode(struct inode *inode, int unused) - { - struct super_block *sb = inode->i_sb; -@@ 
-194,11 +194,11 @@ affs_write_inode(struct inode *inode, in - - if (!inode->i_nlink) - // possibly free block -- return; -+ return 0; - bh = affs_bread(sb, inode->i_ino); - if (!bh) { - affs_error(sb,"write_inode","Cannot read block %lu",inode->i_ino); -- return; -+ return -EIO; - } - tail = AFFS_TAIL(sb, bh); - if (tail->stype == be32_to_cpu(ST_ROOT)) { -@@ -226,6 +226,7 @@ affs_write_inode(struct inode *inode, in - mark_buffer_dirty_inode(bh, inode); - affs_brelse(bh); - affs_free_prealloc(inode); -+ return 0; - } - - int -diff -uprN linux-2.6.8.1.orig/fs/afs/mntpt.c linux-2.6.8.1-ve022stab078/fs/afs/mntpt.c ---- linux-2.6.8.1.orig/fs/afs/mntpt.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/afs/mntpt.c 2006-05-11 13:05:40.000000000 +0400 -@@ -162,6 +162,7 @@ static struct vfsmount *afs_mntpt_do_aut - char *buf, *devname = NULL, *options = NULL; - filler_t *filler; - int ret; -+ struct file_system_type *fstype; - - kenter("{%s}", mntpt->d_name.name); - -@@ -210,7 +211,12 @@ static struct vfsmount *afs_mntpt_do_aut - - /* try and do the mount */ - kdebug("--- attempting mount %s -o %s ---", devname, options); -- mnt = do_kern_mount("afs", 0, devname, options); -+ fstype = get_fs_type("afs"); -+ ret = -ENODEV; -+ if (!fstype) -+ goto error; -+ mnt = do_kern_mount(fstype, 0, devname, options); -+ put_filesystem(fstype); - kdebug("--- mount result %p ---", mnt); - - free_page((unsigned long) devname); -diff -uprN linux-2.6.8.1.orig/fs/attr.c linux-2.6.8.1-ve022stab078/fs/attr.c ---- linux-2.6.8.1.orig/fs/attr.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/attr.c 2006-05-11 13:05:32.000000000 +0400 -@@ -14,6 +14,7 @@ - #include <linux/fcntl.h> - #include <linux/quotaops.h> - #include <linux/security.h> -+#include <linux/time.h> - - /* Taken over from the old code... 
*/ - -@@ -87,11 +88,14 @@ int inode_setattr(struct inode * inode, - if (ia_valid & ATTR_GID) - inode->i_gid = attr->ia_gid; - if (ia_valid & ATTR_ATIME) -- inode->i_atime = attr->ia_atime; -+ inode->i_atime = timespec_trunc(attr->ia_atime, -+ get_sb_time_gran(inode->i_sb)); - if (ia_valid & ATTR_MTIME) -- inode->i_mtime = attr->ia_mtime; -+ inode->i_mtime = timespec_trunc(attr->ia_mtime, -+ get_sb_time_gran(inode->i_sb)); - if (ia_valid & ATTR_CTIME) -- inode->i_ctime = attr->ia_ctime; -+ inode->i_ctime = timespec_trunc(attr->ia_ctime, -+ get_sb_time_gran(inode->i_sb)); - if (ia_valid & ATTR_MODE) { - umode_t mode = attr->ia_mode; - -@@ -131,14 +135,17 @@ int setattr_mask(unsigned int ia_valid) - int notify_change(struct dentry * dentry, struct iattr * attr) - { - struct inode *inode = dentry->d_inode; -- mode_t mode = inode->i_mode; -+ mode_t mode; - int error; -- struct timespec now = CURRENT_TIME; -+ struct timespec now; - unsigned int ia_valid = attr->ia_valid; - - if (!inode) - BUG(); - -+ mode = inode->i_mode; -+ now = current_fs_time(inode->i_sb); -+ - attr->ia_ctime = now; - if (!(ia_valid & ATTR_ATIME_SET)) - attr->ia_atime = now; -diff -uprN linux-2.6.8.1.orig/fs/autofs/autofs_i.h linux-2.6.8.1-ve022stab078/fs/autofs/autofs_i.h ---- linux-2.6.8.1.orig/fs/autofs/autofs_i.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs/autofs_i.h 2006-05-11 13:05:42.000000000 +0400 -@@ -123,7 +123,7 @@ static inline struct autofs_sb_info *aut - filesystem without "magic".) 
*/ - - static inline int autofs_oz_mode(struct autofs_sb_info *sbi) { -- return sbi->catatonic || process_group(current) == sbi->oz_pgrp; -+ return sbi->catatonic || virt_pgid(current) == sbi->oz_pgrp; - } - - /* Hash operations */ -diff -uprN linux-2.6.8.1.orig/fs/autofs/init.c linux-2.6.8.1-ve022stab078/fs/autofs/init.c ---- linux-2.6.8.1.orig/fs/autofs/init.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs/init.c 2006-05-11 13:05:42.000000000 +0400 -@@ -25,6 +25,7 @@ static struct file_system_type autofs_fs - .name = "autofs", - .get_sb = autofs_get_sb, - .kill_sb = kill_anon_super, -+ .fs_flags = FS_VIRTUALIZED, - }; - - static int __init init_autofs_fs(void) -diff -uprN linux-2.6.8.1.orig/fs/autofs/inode.c linux-2.6.8.1-ve022stab078/fs/autofs/inode.c ---- linux-2.6.8.1.orig/fs/autofs/inode.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs/inode.c 2006-05-11 13:05:42.000000000 +0400 -@@ -66,7 +66,7 @@ static int parse_options(char *options, - - *uid = current->uid; - *gid = current->gid; -- *pgrp = process_group(current); -+ *pgrp = virt_pgid(current); - - *minproto = *maxproto = AUTOFS_PROTO_VERSION; - -@@ -138,7 +138,7 @@ int autofs_fill_super(struct super_block - sbi->magic = AUTOFS_SBI_MAGIC; - sbi->catatonic = 0; - sbi->exp_timeout = 0; -- sbi->oz_pgrp = process_group(current); -+ sbi->oz_pgrp = virt_pgid(current); - autofs_initialize_hash(&sbi->dirhash); - sbi->queues = NULL; - memset(sbi->symlink_bitmap, 0, sizeof(long)*AUTOFS_SYMLINK_BITMAP_LEN); -diff -uprN linux-2.6.8.1.orig/fs/autofs/root.c linux-2.6.8.1-ve022stab078/fs/autofs/root.c ---- linux-2.6.8.1.orig/fs/autofs/root.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs/root.c 2006-05-11 13:05:42.000000000 +0400 -@@ -347,7 +347,7 @@ static int autofs_root_unlink(struct ino - - /* This allows root to remove symlinks */ - lock_kernel(); -- if ( !autofs_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) ) { -+ if ( 
!autofs_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN) ) { - unlock_kernel(); - return -EACCES; - } -@@ -534,7 +534,7 @@ static int autofs_root_ioctl(struct inod - _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) >= AUTOFS_IOC_COUNT ) - return -ENOTTY; - -- if ( !autofs_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) ) -+ if ( !autofs_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN) ) - return -EPERM; - - switch(cmd) { -diff -uprN linux-2.6.8.1.orig/fs/autofs4/autofs_i.h linux-2.6.8.1-ve022stab078/fs/autofs4/autofs_i.h ---- linux-2.6.8.1.orig/fs/autofs4/autofs_i.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs4/autofs_i.h 2006-05-11 13:05:42.000000000 +0400 -@@ -91,6 +91,7 @@ struct autofs_wait_queue { - - struct autofs_sb_info { - u32 magic; -+ struct dentry *root; - struct file *pipe; - pid_t oz_pgrp; - int catatonic; -@@ -119,7 +120,7 @@ static inline struct autofs_info *autofs - filesystem without "magic".) */ - - static inline int autofs4_oz_mode(struct autofs_sb_info *sbi) { -- return sbi->catatonic || process_group(current) == sbi->oz_pgrp; -+ return sbi->catatonic || virt_pgid(current) == sbi->oz_pgrp; - } - - /* Does a dentry have some pending activity? 
*/ -diff -uprN linux-2.6.8.1.orig/fs/autofs4/init.c linux-2.6.8.1-ve022stab078/fs/autofs4/init.c ---- linux-2.6.8.1.orig/fs/autofs4/init.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs4/init.c 2006-05-11 13:05:42.000000000 +0400 -@@ -25,6 +25,7 @@ static struct file_system_type autofs_fs - .name = "autofs", - .get_sb = autofs_get_sb, - .kill_sb = kill_anon_super, -+ .fs_flags = FS_VIRTUALIZED, - }; - - static int __init init_autofs4_fs(void) -diff -uprN linux-2.6.8.1.orig/fs/autofs4/inode.c linux-2.6.8.1-ve022stab078/fs/autofs4/inode.c ---- linux-2.6.8.1.orig/fs/autofs4/inode.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs4/inode.c 2006-05-11 13:05:42.000000000 +0400 -@@ -16,6 +16,7 @@ - #include <linux/pagemap.h> - #include <linux/parser.h> - #include <asm/bitops.h> -+#include <linux/smp_lock.h> - #include "autofs_i.h" - #include <linux/module.h> - -@@ -76,6 +77,66 @@ void autofs4_free_ino(struct autofs_info - kfree(ino); - } - -+/* -+ * Deal with the infamous "Busy inodes after umount ..." message. -+ * -+ * Clean up the dentry tree. This happens with autofs if the user -+ * space program goes away due to a SIGKILL, SIGSEGV etc. 
-+ */ -+static void autofs4_force_release(struct autofs_sb_info *sbi) -+{ -+ struct dentry *this_parent = sbi->root; -+ struct list_head *next; -+ -+ spin_lock(&dcache_lock); -+repeat: -+ next = this_parent->d_subdirs.next; -+resume: -+ while (next != &this_parent->d_subdirs) { -+ struct dentry *dentry = list_entry(next, struct dentry, d_child); -+ -+ /* Negative dentry - don`t care */ -+ if (!simple_positive(dentry)) { -+ next = next->next; -+ continue; -+ } -+ -+ if (!list_empty(&dentry->d_subdirs)) { -+ this_parent = dentry; -+ goto repeat; -+ } -+ -+ next = next->next; -+ spin_unlock(&dcache_lock); -+ -+ DPRINTK("dentry %p %.*s", -+ dentry, (int)dentry->d_name.len, dentry->d_name.name); -+ -+ dput(dentry); -+ spin_lock(&dcache_lock); -+ } -+ -+ if (this_parent != sbi->root) { -+ struct dentry *dentry = this_parent; -+ -+ next = this_parent->d_child.next; -+ this_parent = this_parent->d_parent; -+ spin_unlock(&dcache_lock); -+ DPRINTK("parent dentry %p %.*s", -+ dentry, (int)dentry->d_name.len, dentry->d_name.name); -+ dput(dentry); -+ spin_lock(&dcache_lock); -+ goto resume; -+ } -+ spin_unlock(&dcache_lock); -+ -+ dput(sbi->root); -+ sbi->root = NULL; -+ shrink_dcache_sb(sbi->sb); -+ -+ return; -+} -+ - static void autofs4_put_super(struct super_block *sb) - { - struct autofs_sb_info *sbi = autofs4_sbi(sb); -@@ -85,6 +146,10 @@ static void autofs4_put_super(struct sup - if ( !sbi->catatonic ) - autofs4_catatonic_mode(sbi); /* Free wait queues, close pipe */ - -+ /* Clean up and release dangling references */ -+ if (sbi) -+ autofs4_force_release(sbi); -+ - kfree(sbi); - - DPRINTK("shutting down"); -@@ -116,7 +181,7 @@ static int parse_options(char *options, - - *uid = current->uid; - *gid = current->gid; -- *pgrp = process_group(current); -+ *pgrp = virt_pgid(current); - - *minproto = AUTOFS_MIN_PROTO_VERSION; - *maxproto = AUTOFS_MAX_PROTO_VERSION; -@@ -199,9 +264,10 @@ int autofs4_fill_super(struct super_bloc - - s->s_fs_info = sbi; - sbi->magic = 
AUTOFS_SBI_MAGIC; -+ sbi->root = NULL; - sbi->catatonic = 0; - sbi->exp_timeout = 0; -- sbi->oz_pgrp = process_group(current); -+ sbi->oz_pgrp = virt_pgid(current); - sbi->sb = s; - sbi->version = 0; - sbi->sub_version = 0; -@@ -265,6 +331,13 @@ int autofs4_fill_super(struct super_bloc - sbi->pipe = pipe; - - /* -+ * Take a reference to the root dentry so we get a chance to -+ * clean up the dentry tree on umount. -+ * See autofs4_force_release. -+ */ -+ sbi->root = dget(root); -+ -+ /* - * Success! Install the root dentry now to indicate completion. - */ - s->s_root = root; -diff -uprN linux-2.6.8.1.orig/fs/autofs4/root.c linux-2.6.8.1-ve022stab078/fs/autofs4/root.c ---- linux-2.6.8.1.orig/fs/autofs4/root.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/autofs4/root.c 2006-05-11 13:05:42.000000000 +0400 -@@ -593,7 +593,7 @@ static int autofs4_dir_unlink(struct ino - struct autofs_info *ino = autofs4_dentry_ino(dentry); - - /* This allows root to remove symlinks */ -- if ( !autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) ) -+ if ( !autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN) ) - return -EACCES; - - dput(ino->dentry); -@@ -621,7 +621,9 @@ static int autofs4_dir_rmdir(struct inod - spin_unlock(&dcache_lock); - return -ENOTEMPTY; - } -+ spin_lock(&dentry->d_lock); - __d_drop(dentry); -+ spin_unlock(&dentry->d_lock); - spin_unlock(&dcache_lock); - - dput(ino->dentry); -@@ -783,7 +785,7 @@ static int autofs4_root_ioctl(struct ino - _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) >= AUTOFS_IOC_COUNT ) - return -ENOTTY; - -- if ( !autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) ) -+ if ( !autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN) ) - return -EPERM; - - switch(cmd) { -diff -uprN linux-2.6.8.1.orig/fs/bad_inode.c linux-2.6.8.1-ve022stab078/fs/bad_inode.c ---- linux-2.6.8.1.orig/fs/bad_inode.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/bad_inode.c 2006-05-11 
13:05:32.000000000 +0400 -@@ -105,7 +105,8 @@ void make_bad_inode(struct inode * inode - remove_inode_hash(inode); - - inode->i_mode = S_IFREG; -- inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -+ inode->i_atime = inode->i_mtime = inode->i_ctime = -+ current_fs_time(inode->i_sb); - inode->i_op = &bad_inode_ops; - inode->i_fop = &bad_file_ops; - } -diff -uprN linux-2.6.8.1.orig/fs/bfs/inode.c linux-2.6.8.1-ve022stab078/fs/bfs/inode.c ---- linux-2.6.8.1.orig/fs/bfs/inode.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/bfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -85,7 +85,7 @@ static void bfs_read_inode(struct inode - brelse(bh); - } - --static void bfs_write_inode(struct inode * inode, int unused) -+static int bfs_write_inode(struct inode * inode, int unused) - { - unsigned long ino = inode->i_ino; - struct bfs_inode * di; -@@ -94,7 +94,7 @@ static void bfs_write_inode(struct inode - - if (ino < BFS_ROOT_INO || ino > BFS_SB(inode->i_sb)->si_lasti) { - printf("Bad inode number %s:%08lx\n", inode->i_sb->s_id, ino); -- return; -+ return -EIO; - } - - lock_kernel(); -@@ -103,7 +103,7 @@ static void bfs_write_inode(struct inode - if (!bh) { - printf("Unable to read inode %s:%08lx\n", inode->i_sb->s_id, ino); - unlock_kernel(); -- return; -+ return -EIO; - } - - off = (ino - BFS_ROOT_INO)%BFS_INODES_PER_BLOCK; -@@ -129,6 +129,7 @@ static void bfs_write_inode(struct inode - mark_buffer_dirty(bh); - brelse(bh); - unlock_kernel(); -+ return 0; - } - - static void bfs_delete_inode(struct inode * inode) -diff -uprN linux-2.6.8.1.orig/fs/binfmt_aout.c linux-2.6.8.1-ve022stab078/fs/binfmt_aout.c ---- linux-2.6.8.1.orig/fs/binfmt_aout.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_aout.c 2006-05-11 13:05:45.000000000 +0400 -@@ -43,13 +43,21 @@ static struct linux_binfmt aout_format = - .min_coredump = PAGE_SIZE - }; - --static void set_brk(unsigned long start, unsigned long end) -+#define 
BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE) -+ -+static int set_brk(unsigned long start, unsigned long end) - { - start = PAGE_ALIGN(start); - end = PAGE_ALIGN(end); -- if (end <= start) -- return; -- do_brk(start, end - start); -+ if (end > start) { -+ unsigned long addr; -+ down_write(¤t->mm->mmap_sem); -+ addr = do_brk(start, end - start); -+ up_write(¤t->mm->mmap_sem); -+ if (BAD_ADDR(addr)) -+ return addr; -+ } -+ return 0; - } - - /* -@@ -318,10 +326,14 @@ static int load_aout_binary(struct linux - loff_t pos = fd_offset; - /* Fuck me plenty... */ - /* <AOL></AOL> */ -+ down_write(¤t->mm->mmap_sem); - error = do_brk(N_TXTADDR(ex), ex.a_text); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, (char *) N_TXTADDR(ex), - ex.a_text, &pos); -+ down_write(¤t->mm->mmap_sem); - error = do_brk(N_DATADDR(ex), ex.a_data); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, (char *) N_DATADDR(ex), - ex.a_data, &pos); - goto beyond_if; -@@ -341,8 +353,9 @@ static int load_aout_binary(struct linux - pos = 32; - map_size = ex.a_text+ex.a_data; - #endif -- -+ down_write(¤t->mm->mmap_sem); - error = do_brk(text_addr & PAGE_MASK, map_size); -+ up_write(¤t->mm->mmap_sem); - if (error != (text_addr & PAGE_MASK)) { - send_sig(SIGKILL, current, 0); - return error; -@@ -377,7 +390,9 @@ static int load_aout_binary(struct linux - - if (!bprm->file->f_op->mmap||((fd_offset & ~PAGE_MASK) != 0)) { - loff_t pos = fd_offset; -+ down_write(¤t->mm->mmap_sem); - do_brk(N_TXTADDR(ex), ex.a_text+ex.a_data); -+ up_write(¤t->mm->mmap_sem); - bprm->file->f_op->read(bprm->file, - (char __user *)N_TXTADDR(ex), - ex.a_text+ex.a_data, &pos); -@@ -413,7 +428,11 @@ static int load_aout_binary(struct linux - beyond_if: - set_binfmt(&aout_format); - -- set_brk(current->mm->start_brk, current->mm->brk); -+ retval = set_brk(current->mm->start_brk, current->mm->brk); -+ if (retval < 0) { -+ send_sig(SIGKILL, current, 0); -+ return retval; -+ } - - retval = 
setup_arg_pages(bprm, EXSTACK_DEFAULT); - if (retval < 0) { -@@ -429,9 +448,11 @@ beyond_if: - #endif - start_thread(regs, ex.a_entry, current->mm->start_stack); - if (unlikely(current->ptrace & PT_PTRACED)) { -- if (current->ptrace & PT_TRACE_EXEC) -+ if (current->ptrace & PT_TRACE_EXEC) { -+ set_pn_state(current, PN_STOP_EXEC); - ptrace_notify ((PTRACE_EVENT_EXEC << 8) | SIGTRAP); -- else -+ clear_pn_state(current); -+ } else - send_sig(SIGTRAP, current, 0); - } - return 0; -@@ -478,8 +499,9 @@ static int load_aout_library(struct file - file->f_dentry->d_name.name); - error_time = jiffies; - } -- -+ down_write(¤t->mm->mmap_sem); - do_brk(start_addr, ex.a_text + ex.a_data + ex.a_bss); -+ up_write(¤t->mm->mmap_sem); - - file->f_op->read(file, (char __user *)start_addr, - ex.a_text + ex.a_data, &pos); -@@ -503,7 +525,9 @@ static int load_aout_library(struct file - len = PAGE_ALIGN(ex.a_text + ex.a_data); - bss = ex.a_text + ex.a_data + ex.a_bss; - if (bss > len) { -+ down_write(¤t->mm->mmap_sem); - error = do_brk(start_addr + len, bss - len); -+ up_write(¤t->mm->mmap_sem); - retval = error; - if (error != start_addr + len) - goto out; -diff -uprN linux-2.6.8.1.orig/fs/binfmt_elf.c linux-2.6.8.1-ve022stab078/fs/binfmt_elf.c ---- linux-2.6.8.1.orig/fs/binfmt_elf.c 2004-08-14 14:55:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_elf.c 2006-05-11 13:05:45.000000000 +0400 -@@ -87,7 +87,10 @@ static int set_brk(unsigned long start, - start = ELF_PAGEALIGN(start); - end = ELF_PAGEALIGN(end); - if (end > start) { -- unsigned long addr = do_brk(start, end - start); -+ unsigned long addr; -+ down_write(¤t->mm->mmap_sem); -+ addr = do_brk(start, end - start); -+ up_write(¤t->mm->mmap_sem); - if (BAD_ADDR(addr)) - return addr; - } -@@ -102,15 +105,17 @@ static int set_brk(unsigned long start, - be in memory */ - - --static void padzero(unsigned long elf_bss) -+static int padzero(unsigned long elf_bss) - { - unsigned long nbyte; - - nbyte = ELF_PAGEOFFSET(elf_bss); 
- if (nbyte) { - nbyte = ELF_MIN_ALIGN - nbyte; -- clear_user((void __user *) elf_bss, nbyte); -+ if (clear_user((void __user *) elf_bss, nbyte)) -+ return -EFAULT; - } -+ return 0; - } - - /* Let's use some macros to make this stack manipulation a litle clearer */ -@@ -126,7 +131,7 @@ static void padzero(unsigned long elf_bs - #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; }) - #endif - --static void -+static int - create_elf_tables(struct linux_binprm *bprm, struct elfhdr * exec, - int interp_aout, unsigned long load_addr, - unsigned long interp_load_addr) -@@ -171,7 +176,8 @@ create_elf_tables(struct linux_binprm *b - STACK_ALLOC(p, ((current->pid % 64) << 7)); - #endif - u_platform = (elf_addr_t __user *)STACK_ALLOC(p, len); -- __copy_to_user(u_platform, k_platform, len); -+ if (__copy_to_user(u_platform, k_platform, len)) -+ return -EFAULT; - } - - /* Create the ELF interpreter info */ -@@ -233,7 +239,8 @@ create_elf_tables(struct linux_binprm *b - #endif - - /* Now, let's put argc (and argv, envp if appropriate) on the stack */ -- __put_user(argc, sp++); -+ if (__put_user(argc, sp++)) -+ return -EFAULT; - if (interp_aout) { - argv = sp + 2; - envp = argv + argc + 1; -@@ -245,31 +252,35 @@ create_elf_tables(struct linux_binprm *b - } - - /* Populate argv and envp */ -- p = current->mm->arg_start; -+ p = current->mm->arg_end = current->mm->arg_start; - while (argc-- > 0) { - size_t len; - __put_user((elf_addr_t)p, argv++); - len = strnlen_user((void __user *)p, PAGE_SIZE*MAX_ARG_PAGES); - if (!len || len > PAGE_SIZE*MAX_ARG_PAGES) -- return; -+ return 0; - p += len; - } -- __put_user(0, argv); -+ if (__put_user(0, argv)) -+ return -EFAULT; - current->mm->arg_end = current->mm->env_start = p; - while (envc-- > 0) { - size_t len; - __put_user((elf_addr_t)p, envp++); - len = strnlen_user((void __user *)p, PAGE_SIZE*MAX_ARG_PAGES); - if (!len || len > PAGE_SIZE*MAX_ARG_PAGES) -- return; -+ return 0; - p += len; - } -- __put_user(0, envp); -+ if (__put_user(0, 
envp)) -+ return -EFAULT; - current->mm->env_end = p; - - /* Put the elf_info on the stack in the right place. */ - sp = (elf_addr_t __user *)envp + 1; -- copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t)); -+ if (copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t))) -+ return -EFAULT; -+ return 0; - } - - #ifndef elf_map -@@ -334,14 +345,17 @@ static unsigned long load_elf_interp(str - goto out; - - retval = kernel_read(interpreter,interp_elf_ex->e_phoff,(char *)elf_phdata,size); -- error = retval; -- if (retval < 0) -+ error = -EIO; -+ if (retval != size) { -+ if (retval < 0) -+ error = retval; - goto out_close; -+ } - - eppnt = elf_phdata; - for (i=0; i<interp_elf_ex->e_phnum; i++, eppnt++) { - if (eppnt->p_type == PT_LOAD) { -- int elf_type = MAP_PRIVATE | MAP_DENYWRITE; -+ int elf_type = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECPRIO; - int elf_prot = 0; - unsigned long vaddr = 0; - unsigned long k, map_addr; -@@ -399,12 +413,18 @@ static unsigned long load_elf_interp(str - * that there are zero-mapped pages up to and including the - * last bss page. 
- */ -- padzero(elf_bss); -+ if (padzero(elf_bss)) { -+ error = -EFAULT; -+ goto out_close; -+ } -+ - elf_bss = ELF_PAGESTART(elf_bss + ELF_MIN_ALIGN - 1); /* What we have mapped so far */ - - /* Map the last of the bss segment */ - if (last_bss > elf_bss) { -+ down_write(¤t->mm->mmap_sem); - error = do_brk(elf_bss, last_bss - elf_bss); -+ up_write(¤t->mm->mmap_sem); - if (BAD_ADDR(error)) - goto out_close; - } -@@ -444,7 +464,9 @@ static unsigned long load_aout_interp(st - goto out; - } - -+ down_write(¤t->mm->mmap_sem); - do_brk(0, text_data); -+ up_write(¤t->mm->mmap_sem); - if (!interpreter->f_op || !interpreter->f_op->read) - goto out; - if (interpreter->f_op->read(interpreter, addr, text_data, &offset) < 0) -@@ -452,8 +474,11 @@ static unsigned long load_aout_interp(st - flush_icache_range((unsigned long)addr, - (unsigned long)addr + text_data); - -+ -+ down_write(¤t->mm->mmap_sem); - do_brk(ELF_PAGESTART(text_data + ELF_MIN_ALIGN - 1), - interp_ex->a_bss); -+ up_write(¤t->mm->mmap_sem); - elf_entry = interp_ex->a_entry; - - out: -@@ -487,25 +512,33 @@ static int load_elf_binary(struct linux_ - unsigned long elf_entry, interp_load_addr = 0; - unsigned long start_code, end_code, start_data, end_data; - unsigned long reloc_func_desc = 0; -- struct elfhdr elf_ex; -- struct elfhdr interp_elf_ex; -- struct exec interp_ex; - char passed_fileno[6]; - struct files_struct *files; - int have_pt_gnu_stack, executable_stack = EXSTACK_DEFAULT; - unsigned long def_flags = 0; -+ struct { -+ struct elfhdr elf_ex; -+ struct elfhdr interp_elf_ex; -+ struct exec interp_ex; -+ } *loc; -+ -+ loc = kmalloc(sizeof(*loc), GFP_KERNEL); -+ if (!loc) { -+ retval = -ENOMEM; -+ goto out_ret; -+ } - - /* Get the exec-header */ -- elf_ex = *((struct elfhdr *) bprm->buf); -+ loc->elf_ex = *((struct elfhdr *) bprm->buf); - - retval = -ENOEXEC; - /* First of all, some simple consistency checks */ -- if (memcmp(elf_ex.e_ident, ELFMAG, SELFMAG) != 0) -+ if (memcmp(loc->elf_ex.e_ident, ELFMAG, 
SELFMAG) != 0) - goto out; - -- if (elf_ex.e_type != ET_EXEC && elf_ex.e_type != ET_DYN) -+ if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN) - goto out; -- if (!elf_check_arch(&elf_ex)) -+ if (!elf_check_arch(&loc->elf_ex)) - goto out; - if (!bprm->file->f_op||!bprm->file->f_op->mmap) - goto out; -@@ -513,18 +546,21 @@ static int load_elf_binary(struct linux_ - /* Now read in all of the header information */ - - retval = -ENOMEM; -- if (elf_ex.e_phentsize != sizeof(struct elf_phdr)) -+ if (loc->elf_ex.e_phentsize != sizeof(struct elf_phdr)) - goto out; -- if (elf_ex.e_phnum > 65536U / sizeof(struct elf_phdr)) -+ if (loc->elf_ex.e_phnum > 65536U / sizeof(struct elf_phdr)) - goto out; -- size = elf_ex.e_phnum * sizeof(struct elf_phdr); -+ size = loc->elf_ex.e_phnum * sizeof(struct elf_phdr); - elf_phdata = (struct elf_phdr *) kmalloc(size, GFP_KERNEL); - if (!elf_phdata) - goto out; - -- retval = kernel_read(bprm->file, elf_ex.e_phoff, (char *) elf_phdata, size); -- if (retval < 0) -+ retval = kernel_read(bprm->file, loc->elf_ex.e_phoff, (char *) elf_phdata, size); -+ if (retval != size) { -+ if (retval >= 0) -+ retval = -EIO; - goto out_free_ph; -+ } - - files = current->files; /* Refcounted so ok */ - retval = unshare_files(); -@@ -553,7 +589,7 @@ static int load_elf_binary(struct linux_ - start_data = 0; - end_data = 0; - -- for (i = 0; i < elf_ex.e_phnum; i++) { -+ for (i = 0; i < loc->elf_ex.e_phnum; i++) { - if (elf_ppnt->p_type == PT_INTERP) { - /* This is the program interpreter used for - * shared libraries - for now assume that this -@@ -561,7 +597,8 @@ static int load_elf_binary(struct linux_ - */ - - retval = -ENOMEM; -- if (elf_ppnt->p_filesz > PATH_MAX) -+ if (elf_ppnt->p_filesz > PATH_MAX || -+ elf_ppnt->p_filesz == 0) - goto out_free_file; - elf_interpreter = (char *) kmalloc(elf_ppnt->p_filesz, - GFP_KERNEL); -@@ -571,8 +608,16 @@ static int load_elf_binary(struct linux_ - retval = kernel_read(bprm->file, elf_ppnt->p_offset, - 
elf_interpreter, - elf_ppnt->p_filesz); -- if (retval < 0) -+ if (retval != elf_ppnt->p_filesz) { -+ if (retval >= 0) -+ retval = -EIO; -+ goto out_free_interp; -+ } -+ /* make sure path is NULL terminated */ -+ retval = -EINVAL; -+ if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0') - goto out_free_interp; -+ - /* If the program interpreter is one of these two, - * then assume an iBCS2 image. Otherwise assume - * a native linux image. -@@ -600,26 +645,29 @@ static int load_elf_binary(struct linux_ - * switch really is going to happen - do this in - * flush_thread(). - akpm - */ -- SET_PERSONALITY(elf_ex, ibcs2_interpreter); -+ SET_PERSONALITY(loc->elf_ex, ibcs2_interpreter); - -- interpreter = open_exec(elf_interpreter); -+ interpreter = open_exec(elf_interpreter, NULL); - retval = PTR_ERR(interpreter); - if (IS_ERR(interpreter)) - goto out_free_interp; - retval = kernel_read(interpreter, 0, bprm->buf, BINPRM_BUF_SIZE); -- if (retval < 0) -+ if (retval != BINPRM_BUF_SIZE) { -+ if (retval >= 0) -+ retval = -EIO; - goto out_free_dentry; -+ } - - /* Get the exec headers */ -- interp_ex = *((struct exec *) bprm->buf); -- interp_elf_ex = *((struct elfhdr *) bprm->buf); -+ loc->interp_ex = *((struct exec *) bprm->buf); -+ loc->interp_elf_ex = *((struct elfhdr *) bprm->buf); - break; - } - elf_ppnt++; - } - - elf_ppnt = elf_phdata; -- for (i = 0; i < elf_ex.e_phnum; i++, elf_ppnt++) -+ for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) - if (elf_ppnt->p_type == PT_GNU_STACK) { - if (elf_ppnt->p_flags & PF_X) - executable_stack = EXSTACK_ENABLE_X; -@@ -627,19 +675,19 @@ static int load_elf_binary(struct linux_ - executable_stack = EXSTACK_DISABLE_X; - break; - } -- have_pt_gnu_stack = (i < elf_ex.e_phnum); -+ have_pt_gnu_stack = (i < loc->elf_ex.e_phnum); - - /* Some simple consistency checks for the interpreter */ - if (elf_interpreter) { - interpreter_type = INTERPRETER_ELF | INTERPRETER_AOUT; - - /* Now figure out which format our binary is */ -- if 
((N_MAGIC(interp_ex) != OMAGIC) && -- (N_MAGIC(interp_ex) != ZMAGIC) && -- (N_MAGIC(interp_ex) != QMAGIC)) -+ if ((N_MAGIC(loc->interp_ex) != OMAGIC) && -+ (N_MAGIC(loc->interp_ex) != ZMAGIC) && -+ (N_MAGIC(loc->interp_ex) != QMAGIC)) - interpreter_type = INTERPRETER_ELF; - -- if (memcmp(interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0) -+ if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0) - interpreter_type &= ~INTERPRETER_ELF; - - retval = -ELIBBAD; -@@ -655,11 +703,11 @@ static int load_elf_binary(struct linux_ - } - /* Verify the interpreter has a valid arch */ - if ((interpreter_type == INTERPRETER_ELF) && -- !elf_check_arch(&interp_elf_ex)) -+ !elf_check_arch(&loc->interp_elf_ex)) - goto out_free_dentry; - } else { - /* Executables without an interpreter also need a personality */ -- SET_PERSONALITY(elf_ex, ibcs2_interpreter); -+ SET_PERSONALITY(loc->elf_ex, ibcs2_interpreter); - } - - /* OK, we are done with that, now set up the arg stuff, -@@ -699,8 +747,8 @@ static int load_elf_binary(struct linux_ - - /* Do this immediately, since STACK_TOP as used in setup_arg_pages - may depend on the personality. */ -- SET_PERSONALITY(elf_ex, ibcs2_interpreter); -- if (elf_read_implies_exec(elf_ex, have_pt_gnu_stack)) -+ SET_PERSONALITY(loc->elf_ex, ibcs2_interpreter); -+ if (elf_read_implies_exec(loc->elf_ex, have_pt_gnu_stack)) - current->personality |= READ_IMPLIES_EXEC; - - /* Do this so that we can load the interpreter, if need be. We will -@@ -720,7 +768,7 @@ static int load_elf_binary(struct linux_ - the image should be loaded at fixed address, not at a variable - address. 
*/ - -- for(i = 0, elf_ppnt = elf_phdata; i < elf_ex.e_phnum; i++, elf_ppnt++) { -+ for(i = 0, elf_ppnt = elf_phdata; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) { - int elf_prot = 0, elf_flags; - unsigned long k, vaddr; - -@@ -744,7 +792,13 @@ static int load_elf_binary(struct linux_ - nbyte = ELF_MIN_ALIGN - nbyte; - if (nbyte > elf_brk - elf_bss) - nbyte = elf_brk - elf_bss; -- clear_user((void __user *) elf_bss + load_bias, nbyte); -+ /* -+ * This bss-zeroing can fail if the ELF file -+ * specifies odd protections. So we don't check -+ * the return value -+ */ -+ (void)clear_user((void __user *)elf_bss + -+ load_bias, nbyte); - } - } - -@@ -752,12 +806,13 @@ static int load_elf_binary(struct linux_ - if (elf_ppnt->p_flags & PF_W) elf_prot |= PROT_WRITE; - if (elf_ppnt->p_flags & PF_X) elf_prot |= PROT_EXEC; - -- elf_flags = MAP_PRIVATE|MAP_DENYWRITE|MAP_EXECUTABLE; -+ elf_flags = MAP_PRIVATE|MAP_DENYWRITE|MAP_EXECUTABLE| -+ MAP_EXECPRIO; - - vaddr = elf_ppnt->p_vaddr; -- if (elf_ex.e_type == ET_EXEC || load_addr_set) { -+ if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) { - elf_flags |= MAP_FIXED; -- } else if (elf_ex.e_type == ET_DYN) { -+ } else if (loc->elf_ex.e_type == ET_DYN) { - /* Try and get dynamic programs out of the way of the default mmap - base, as well as whatever program they might try to exec. This - is because the brk will follow the loader, and is not movable. 
*/ -@@ -765,13 +820,15 @@ static int load_elf_binary(struct linux_ - } - - error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt, elf_prot, elf_flags); -- if (BAD_ADDR(error)) -- continue; -+ if (BAD_ADDR(error)) { -+ send_sig(SIGKILL, current, 0); -+ goto out_free_dentry; -+ } - - if (!load_addr_set) { - load_addr_set = 1; - load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset); -- if (elf_ex.e_type == ET_DYN) { -+ if (loc->elf_ex.e_type == ET_DYN) { - load_bias += error - - ELF_PAGESTART(load_bias + vaddr); - load_addr += load_bias; -@@ -808,7 +865,7 @@ static int load_elf_binary(struct linux_ - elf_brk = k; - } - -- elf_ex.e_entry += load_bias; -+ loc->elf_ex.e_entry += load_bias; - elf_bss += load_bias; - elf_brk += load_bias; - start_code += load_bias; -@@ -826,14 +883,18 @@ static int load_elf_binary(struct linux_ - send_sig(SIGKILL, current, 0); - goto out_free_dentry; - } -- padzero(elf_bss); -+ if (padzero(elf_bss)) { -+ send_sig(SIGSEGV, current, 0); -+ retval = -EFAULT; /* Nobody gets to see this, but.. */ -+ goto out_free_dentry; -+ } - - if (elf_interpreter) { - if (interpreter_type == INTERPRETER_AOUT) -- elf_entry = load_aout_interp(&interp_ex, -+ elf_entry = load_aout_interp(&loc->interp_ex, - interpreter); - else -- elf_entry = load_elf_interp(&interp_elf_ex, -+ elf_entry = load_elf_interp(&loc->interp_elf_ex, - interpreter, - &interp_load_addr); - if (BAD_ADDR(elf_entry)) { -@@ -848,7 +909,12 @@ static int load_elf_binary(struct linux_ - fput(interpreter); - kfree(elf_interpreter); - } else { -- elf_entry = elf_ex.e_entry; -+ elf_entry = loc->elf_ex.e_entry; -+ if (BAD_ADDR(elf_entry)) { -+ send_sig(SIGSEGV, current, 0); -+ retval = -ENOEXEC; /* Nobody gets to see this, but.. 
*/ -+ goto out_free_dentry; -+ } - } - - kfree(elf_phdata); -@@ -858,9 +924,17 @@ static int load_elf_binary(struct linux_ - - set_binfmt(&elf_format); - -+#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES -+ retval = arch_setup_additional_pages(bprm, executable_stack); -+ if (retval < 0) { -+ send_sig(SIGKILL, current, 0); -+ goto out; -+ } -+#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */ -+ - compute_creds(bprm); - current->flags &= ~PF_FORKNOEXEC; -- create_elf_tables(bprm, &elf_ex, (interpreter_type == INTERPRETER_AOUT), -+ create_elf_tables(bprm, &loc->elf_ex, (interpreter_type == INTERPRETER_AOUT), - load_addr, interp_load_addr); - /* N.B. passed_fileno might not be initialized? */ - if (interpreter_type == INTERPRETER_AOUT) -@@ -898,13 +972,17 @@ static int load_elf_binary(struct linux_ - - start_thread(regs, elf_entry, bprm->p); - if (unlikely(current->ptrace & PT_PTRACED)) { -- if (current->ptrace & PT_TRACE_EXEC) -+ if (current->ptrace & PT_TRACE_EXEC) { -+ set_pn_state(current, PN_STOP_EXEC); - ptrace_notify ((PTRACE_EVENT_EXEC << 8) | SIGTRAP); -- else -+ clear_pn_state(current); -+ } else - send_sig(SIGTRAP, current, 0); - } - retval = 0; - out: -+ kfree(loc); -+out_ret: - return retval; - - /* error cleanup */ -@@ -933,6 +1011,7 @@ out_free_ph: - static int load_elf_library(struct file *file) - { - struct elf_phdr *elf_phdata; -+ struct elf_phdr *eppnt; - unsigned long elf_bss, bss, len; - int retval, error, i, j; - struct elfhdr elf_ex; -@@ -956,43 +1035,52 @@ static int load_elf_library(struct file - /* j < ELF_MIN_ALIGN because elf_ex.e_phnum <= 2 */ - - error = -ENOMEM; -- elf_phdata = (struct elf_phdr *) kmalloc(j, GFP_KERNEL); -+ elf_phdata = kmalloc(j, GFP_KERNEL); - if (!elf_phdata) - goto out; - -+ eppnt = elf_phdata; - error = -ENOEXEC; -- retval = kernel_read(file, elf_ex.e_phoff, (char *) elf_phdata, j); -+ retval = kernel_read(file, elf_ex.e_phoff, (char *)eppnt, j); - if (retval != j) - goto out_free_ph; - - for (j = 0, i = 0; i<elf_ex.e_phnum; i++) 
-- if ((elf_phdata + i)->p_type == PT_LOAD) j++; -+ if ((eppnt + i)->p_type == PT_LOAD) -+ j++; - if (j != 1) - goto out_free_ph; - -- while (elf_phdata->p_type != PT_LOAD) elf_phdata++; -+ while (eppnt->p_type != PT_LOAD) -+ eppnt++; - - /* Now use mmap to map the library into memory. */ - down_write(&current->mm->mmap_sem); - error = do_mmap(file, -- ELF_PAGESTART(elf_phdata->p_vaddr), -- (elf_phdata->p_filesz + -- ELF_PAGEOFFSET(elf_phdata->p_vaddr)), -+ ELF_PAGESTART(eppnt->p_vaddr), -+ (eppnt->p_filesz + -+ ELF_PAGEOFFSET(eppnt->p_vaddr)), - PROT_READ | PROT_WRITE | PROT_EXEC, - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE, -- (elf_phdata->p_offset - -- ELF_PAGEOFFSET(elf_phdata->p_vaddr))); -+ (eppnt->p_offset - -+ ELF_PAGEOFFSET(eppnt->p_vaddr))); - up_write(&current->mm->mmap_sem); -- if (error != ELF_PAGESTART(elf_phdata->p_vaddr)) -+ if (error != ELF_PAGESTART(eppnt->p_vaddr)) - goto out_free_ph; - -- elf_bss = elf_phdata->p_vaddr + elf_phdata->p_filesz; -- padzero(elf_bss); -+ elf_bss = eppnt->p_vaddr + eppnt->p_filesz; -+ if (padzero(elf_bss)) { -+ error = -EFAULT; -+ goto out_free_ph; -+ } - -- len = ELF_PAGESTART(elf_phdata->p_filesz + elf_phdata->p_vaddr + ELF_MIN_ALIGN - 1); -- bss = elf_phdata->p_memsz + elf_phdata->p_vaddr; -- if (bss > len) -+ len = ELF_PAGESTART(eppnt->p_filesz + eppnt->p_vaddr + ELF_MIN_ALIGN - 1); -+ bss = eppnt->p_memsz + eppnt->p_vaddr; -+ if (bss > len) { -+ down_write(&current->mm->mmap_sem); - do_brk(len, bss - len); -+ up_write(&current->mm->mmap_sem); -+ } - error = 0; - - out_free_ph: -@@ -1172,20 +1260,20 @@ static void fill_prstatus(struct elf_prs - prstatus->pr_info.si_signo = prstatus->pr_cursig = signr; - prstatus->pr_sigpend = p->pending.signal.sig[0]; - prstatus->pr_sighold = p->blocked.sig[0]; -- prstatus->pr_pid = p->pid; -- prstatus->pr_ppid = p->parent->pid; -- prstatus->pr_pgrp = process_group(p); -- prstatus->pr_sid = p->signal->session; -+ prstatus->pr_pid = virt_pid(p); -+ prstatus->pr_ppid = virt_pid(p->parent); -+ 
prstatus->pr_pgrp = virt_pgid(p); -+ prstatus->pr_sid = virt_sid(p); - jiffies_to_timeval(p->utime, &prstatus->pr_utime); - jiffies_to_timeval(p->stime, &prstatus->pr_stime); - jiffies_to_timeval(p->cutime, &prstatus->pr_cutime); - jiffies_to_timeval(p->cstime, &prstatus->pr_cstime); - } - --static void fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p, -- struct mm_struct *mm) -+static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p, -+ struct mm_struct *mm) - { -- int i, len; -+ unsigned int i, len; - - /* first copy the parameters from user space */ - memset(psinfo, 0, sizeof(struct elf_prpsinfo)); -@@ -1193,17 +1281,18 @@ static void fill_psinfo(struct elf_prpsi - len = mm->arg_end - mm->arg_start; - if (len >= ELF_PRARGSZ) - len = ELF_PRARGSZ-1; -- copy_from_user(&psinfo->pr_psargs, -- (const char __user *)mm->arg_start, len); -+ if (copy_from_user(&psinfo->pr_psargs, -+ (const char __user *)mm->arg_start, len)) -+ return -EFAULT; - for(i = 0; i < len; i++) - if (psinfo->pr_psargs[i] == 0) - psinfo->pr_psargs[i] = ' '; - psinfo->pr_psargs[len] = 0; - -- psinfo->pr_pid = p->pid; -- psinfo->pr_ppid = p->parent->pid; -- psinfo->pr_pgrp = process_group(p); -- psinfo->pr_sid = p->signal->session; -+ psinfo->pr_pid = virt_pid(p); -+ psinfo->pr_ppid = virt_pid(p->parent); -+ psinfo->pr_pgrp = virt_pgid(p); -+ psinfo->pr_sid = virt_sid(p); - - i = p->state ? ffz(~p->state) + 1 : 0; - psinfo->pr_state = i; -@@ -1215,7 +1304,7 @@ static void fill_psinfo(struct elf_prpsi - SET_GID(psinfo->pr_gid, p->gid); - strncpy(psinfo->pr_fname, p->comm, sizeof(psinfo->pr_fname)); - -- return; -+ return 0; - } - - /* Here is the structure in which status of each thread is captured. 
*/ -@@ -1344,7 +1433,7 @@ static int elf_core_dump(long signr, str - /* capture the status of all other threads */ - if (signr) { - read_lock(&tasklist_lock); -- do_each_thread(g,p) -+ do_each_thread_ve(g,p) - if (current->mm == p->mm && current != p) { - int sz = elf_dump_thread_status(signr, p, &thread_list); - if (!sz) { -@@ -1353,7 +1442,7 @@ static int elf_core_dump(long signr, str - } else - thread_status_size += sz; - } -- while_each_thread(g,p); -+ while_each_thread_ve(g,p); - read_unlock(&tasklist_lock); - } - -diff -uprN linux-2.6.8.1.orig/fs/binfmt_em86.c linux-2.6.8.1-ve022stab078/fs/binfmt_em86.c ---- linux-2.6.8.1.orig/fs/binfmt_em86.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_em86.c 2006-05-11 13:05:35.000000000 +0400 -@@ -82,7 +82,7 @@ static int load_em86(struct linux_binprm - * Note that we use open_exec() as the name is now in kernel - * space, and we don't need to copy it. - */ -- file = open_exec(interp); -+ file = open_exec(interp, bprm); - if (IS_ERR(file)) - return PTR_ERR(file); - -diff -uprN linux-2.6.8.1.orig/fs/binfmt_flat.c linux-2.6.8.1-ve022stab078/fs/binfmt_flat.c ---- linux-2.6.8.1.orig/fs/binfmt_flat.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_flat.c 2006-05-11 13:05:35.000000000 +0400 -@@ -774,7 +774,7 @@ static int load_flat_shared_library(int - - /* Open the file up */ - bprm.filename = buf; -- bprm.file = open_exec(bprm.filename); -+ bprm.file = open_exec(bprm.filename, &bprm); - res = PTR_ERR(bprm.file); - if (IS_ERR(bprm.file)) - return res; -diff -uprN linux-2.6.8.1.orig/fs/binfmt_misc.c linux-2.6.8.1-ve022stab078/fs/binfmt_misc.c ---- linux-2.6.8.1.orig/fs/binfmt_misc.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_misc.c 2006-05-11 13:05:35.000000000 +0400 -@@ -150,7 +150,8 @@ static int load_misc_binary(struct linux - - /* if the binary is not readable than enforce mm->dumpable=0 - regardless of the interpreter's 
permissions */ -- if (permission(bprm->file->f_dentry->d_inode, MAY_READ, NULL)) -+ if (permission(bprm->file->f_dentry->d_inode, MAY_READ, -+ NULL, NULL)) - bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP; - - allow_write_access(bprm->file); -@@ -179,7 +180,7 @@ static int load_misc_binary(struct linux - - bprm->interp = iname; /* for binfmt_script */ - -- interp_file = open_exec (iname); -+ interp_file = open_exec (iname, bprm); - retval = PTR_ERR (interp_file); - if (IS_ERR (interp_file)) - goto _error; -@@ -509,7 +510,8 @@ static struct inode *bm_get_inode(struct - inode->i_gid = 0; - inode->i_blksize = PAGE_CACHE_SIZE; - inode->i_blocks = 0; -- inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -+ inode->i_atime = inode->i_mtime = inode->i_ctime = -+ current_fs_time(inode->i_sb); - } - return inode; - } -diff -uprN linux-2.6.8.1.orig/fs/binfmt_script.c linux-2.6.8.1-ve022stab078/fs/binfmt_script.c ---- linux-2.6.8.1.orig/fs/binfmt_script.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/binfmt_script.c 2006-05-11 13:05:35.000000000 +0400 -@@ -85,7 +85,7 @@ static int load_script(struct linux_binp - /* - * OK, now restart the process with the interpreter's dentry. 
- */ -- file = open_exec(interp); -+ file = open_exec(interp, bprm); - if (IS_ERR(file)) - return PTR_ERR(file); - -diff -uprN linux-2.6.8.1.orig/fs/bio.c linux-2.6.8.1-ve022stab078/fs/bio.c ---- linux-2.6.8.1.orig/fs/bio.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/bio.c 2006-05-11 13:05:28.000000000 +0400 -@@ -388,20 +388,17 @@ int bio_uncopy_user(struct bio *bio) - struct bio_vec *bvec; - int i, ret = 0; - -- if (bio_data_dir(bio) == READ) { -- char *uaddr = bio->bi_private; -+ char *uaddr = bio->bi_private; - -- __bio_for_each_segment(bvec, bio, i, 0) { -- char *addr = page_address(bvec->bv_page); -- -- if (!ret && copy_to_user(uaddr, addr, bvec->bv_len)) -- ret = -EFAULT; -+ __bio_for_each_segment(bvec, bio, i, 0) { -+ char *addr = page_address(bvec->bv_page); -+ if (bio_data_dir(bio) == READ && !ret && -+ copy_to_user(uaddr, addr, bvec->bv_len)) -+ ret = -EFAULT; - -- __free_page(bvec->bv_page); -- uaddr += bvec->bv_len; -- } -+ __free_page(bvec->bv_page); -+ uaddr += bvec->bv_len; - } -- - bio_put(bio); - return ret; - } -@@ -457,6 +454,7 @@ struct bio *bio_copy_user(request_queue_ - */ - if (!ret) { - if (!write_to_vm) { -+ unsigned long p = uaddr; - bio->bi_rw |= (1 << BIO_RW); - /* - * for a write, copy in data to kernel pages -@@ -465,8 +463,9 @@ struct bio *bio_copy_user(request_queue_ - bio_for_each_segment(bvec, bio, i) { - char *addr = page_address(bvec->bv_page); - -- if (copy_from_user(addr, (char *) uaddr, bvec->bv_len)) -+ if (copy_from_user(addr, (char *) p, bvec->bv_len)) - goto cleanup; -+ p += bvec->bv_len; - } - } - -diff -uprN linux-2.6.8.1.orig/fs/block_dev.c linux-2.6.8.1-ve022stab078/fs/block_dev.c ---- linux-2.6.8.1.orig/fs/block_dev.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/block_dev.c 2006-05-11 13:05:40.000000000 +0400 -@@ -548,9 +548,16 @@ static int do_open(struct block_device * - { - struct module *owner = NULL; - struct gendisk *disk; -- int ret = -ENXIO; -+ int ret; - 
int part; - -+#ifdef CONFIG_VE -+ ret = get_device_perms_ve(S_IFBLK, bdev->bd_dev, -+ file->f_mode&(FMODE_READ|FMODE_WRITE)); -+ if (ret) -+ return ret; -+#endif -+ ret = -ENXIO; - file->f_mapping = bdev->bd_inode->i_mapping; - lock_kernel(); - disk = get_gendisk(bdev->bd_dev, &part); -@@ -821,7 +828,7 @@ EXPORT_SYMBOL(ioctl_by_bdev); - * namespace if possible and return it. Return ERR_PTR(error) - * otherwise. - */ --struct block_device *lookup_bdev(const char *path) -+struct block_device *lookup_bdev(const char *path, int mode) - { - struct block_device *bdev; - struct inode *inode; -@@ -839,6 +846,11 @@ struct block_device *lookup_bdev(const c - error = -ENOTBLK; - if (!S_ISBLK(inode->i_mode)) - goto fail; -+#ifdef CONFIG_VE -+ error = get_device_perms_ve(S_IFBLK, inode->i_rdev, mode); -+ if (error) -+ goto fail; -+#endif - error = -EACCES; - if (nd.mnt->mnt_flags & MNT_NODEV) - goto fail; -@@ -870,12 +882,13 @@ struct block_device *open_bdev_excl(cons - mode_t mode = FMODE_READ; - int error = 0; - -- bdev = lookup_bdev(path); -+ if (!(flags & MS_RDONLY)) -+ mode |= FMODE_WRITE; -+ -+ bdev = lookup_bdev(path, mode); - if (IS_ERR(bdev)) - return bdev; - -- if (!(flags & MS_RDONLY)) -- mode |= FMODE_WRITE; - error = blkdev_get(bdev, mode, 0); - if (error) - return ERR_PTR(error); -diff -uprN linux-2.6.8.1.orig/fs/buffer.c linux-2.6.8.1-ve022stab078/fs/buffer.c ---- linux-2.6.8.1.orig/fs/buffer.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/buffer.c 2006-05-11 13:05:35.000000000 +0400 -@@ -505,6 +505,7 @@ __find_get_block_slow(struct block_devic - struct buffer_head *bh; - struct buffer_head *head; - struct page *page; -+ int all_mapped = 1; - - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); - page = find_get_page(bd_mapping, index); -@@ -522,14 +523,23 @@ __find_get_block_slow(struct block_devic - get_bh(bh); - goto out_unlock; - } -+ if (!buffer_mapped(bh)) -+ all_mapped = 0; - bh = bh->b_this_page; - } while (bh != head); - 
-- printk("__find_get_block_slow() failed. " -- "block=%llu, b_blocknr=%llu\n", -- (unsigned long long)block, (unsigned long long)bh->b_blocknr); -- printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); -- printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); -+ /* we might be here because some of the buffers on this page are -+ * not mapped. This is due to various races between -+ * file io on the block device and getblk. It gets dealt with -+ * elsewhere, don't buffer_error if we had some unmapped buffers -+ */ -+ if (all_mapped) { -+ printk("__find_get_block_slow() failed. " -+ "block=%llu, b_blocknr=%llu\n", -+ (unsigned long long)block, (unsigned long long)bh->b_blocknr); -+ printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); -+ printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); -+ } - out_unlock: - spin_unlock(&bd_mapping->private_lock); - page_cache_release(page); -@@ -1177,18 +1187,16 @@ init_page_buffers(struct page *page, str - { - struct buffer_head *head = page_buffers(page); - struct buffer_head *bh = head; -- unsigned int b_state; -- -- b_state = 1 << BH_Mapped; -- if (PageUptodate(page)) -- b_state |= 1 << BH_Uptodate; -+ int uptodate = PageUptodate(page); - - do { -- if (!(bh->b_state & (1 << BH_Mapped))) { -+ if (!buffer_mapped(bh)) { - init_buffer(bh, NULL, NULL); - bh->b_bdev = bdev; - bh->b_blocknr = block; -- bh->b_state = b_state; -+ if (uptodate) -+ set_buffer_uptodate(bh); -+ set_buffer_mapped(bh); - } - block++; - bh = bh->b_this_page; -@@ -1217,8 +1225,10 @@ grow_dev_page(struct block_device *bdev, - - if (page_has_buffers(page)) { - bh = page_buffers(page); -- if (bh->b_size == size) -+ if (bh->b_size == size) { -+ init_page_buffers(page, bdev, block, size); - return page; -+ } - if (!try_to_free_buffers(page)) - goto failed; - } -@@ -2022,8 +2032,9 @@ static int __block_prepare_write(struct - goto out; - if (buffer_new(bh)) { - clear_buffer_new(bh); -- unmap_underlying_metadata(bh->b_bdev, -- 
bh->b_blocknr); -+ if (buffer_mapped(bh)) -+ unmap_underlying_metadata(bh->b_bdev, -+ bh->b_blocknr); - if (PageUptodate(page)) { - set_buffer_uptodate(bh); - continue; -@@ -2756,21 +2767,31 @@ static int end_bio_bh_io_sync(struct bio - if (bio->bi_size) - return 1; - -+ if (err == -EOPNOTSUPP) -+ set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); -+ - bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); - bio_put(bio); - return 0; - } - --void submit_bh(int rw, struct buffer_head * bh) -+int submit_bh(int rw, struct buffer_head * bh) - { - struct bio *bio; -+ int ret = 0; - - BUG_ON(!buffer_locked(bh)); - BUG_ON(!buffer_mapped(bh)); - BUG_ON(!bh->b_end_io); - -- /* Only clear out a write error when rewriting */ -- if (test_set_buffer_req(bh) && rw == WRITE) -+ if (buffer_ordered(bh) && (rw == WRITE)) -+ rw = WRITE_BARRIER; -+ -+ /* -+ * Only clear out a write error when rewriting, should this -+ * include WRITE_SYNC as well? -+ */ -+ if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER)) - clear_buffer_write_io_error(bh); - - /* -@@ -2792,7 +2813,14 @@ void submit_bh(int rw, struct buffer_hea - bio->bi_end_io = end_bio_bh_io_sync; - bio->bi_private = bh; - -+ bio_get(bio); - submit_bio(rw, bio); -+ -+ if (bio_flagged(bio, BIO_EOPNOTSUPP)) -+ ret = -EOPNOTSUPP; -+ -+ bio_put(bio); -+ return ret; - } - - /** -@@ -2901,7 +2929,7 @@ drop_buffers(struct page *page, struct b - - bh = head; - do { -- if (buffer_write_io_error(bh)) -+ if (buffer_write_io_error(bh) && page->mapping) - set_bit(AS_EIO, &page->mapping->flags); - if (buffer_busy(bh)) - goto failed; -@@ -3100,7 +3128,7 @@ void __init buffer_init(void) - - bh_cachep = kmem_cache_create("buffer_head", - sizeof(struct buffer_head), 0, -- SLAB_PANIC, init_buffer_head, NULL); -+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_buffer_head, NULL); - for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++) - init_waitqueue_head(&bh_wait_queue_heads[i].wqh); - -diff -uprN linux-2.6.8.1.orig/fs/char_dev.c 
linux-2.6.8.1-ve022stab078/fs/char_dev.c ---- linux-2.6.8.1.orig/fs/char_dev.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/char_dev.c 2006-05-11 13:05:40.000000000 +0400 -@@ -257,6 +257,13 @@ int chrdev_open(struct inode * inode, st - struct cdev *new = NULL; - int ret = 0; - -+#ifdef CONFIG_VE -+ ret = get_device_perms_ve(S_IFCHR, inode->i_rdev, -+ filp->f_mode&(FMODE_READ|FMODE_WRITE)); -+ if (ret) -+ return ret; -+#endif -+ - spin_lock(&cdev_lock); - p = inode->i_cdev; - if (!p) { -diff -uprN linux-2.6.8.1.orig/fs/cifs/cifsfs.c linux-2.6.8.1-ve022stab078/fs/cifs/cifsfs.c ---- linux-2.6.8.1.orig/fs/cifs/cifsfs.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/cifs/cifsfs.c 2006-05-11 13:05:35.000000000 +0400 -@@ -188,7 +188,8 @@ cifs_statfs(struct super_block *sb, stru - return 0; /* always return success? what if volume is no longer available? */ - } - --static int cifs_permission(struct inode * inode, int mask, struct nameidata *nd) -+static int cifs_permission(struct inode * inode, int mask, -+ struct nameidata *nd, struct exec_perm *exec_perm) - { - struct cifs_sb_info *cifs_sb; - -@@ -200,7 +201,7 @@ static int cifs_permission(struct inode - on the client (above and beyond ACL on servers) for - servers which do not support setting and viewing mode bits, - so allowing client to check permissions is useful */ -- return vfs_permission(inode, mask); -+ return vfs_permission(inode, mask, exec_perm); - } - - static kmem_cache_t *cifs_inode_cachep; -diff -uprN linux-2.6.8.1.orig/fs/coda/dir.c linux-2.6.8.1-ve022stab078/fs/coda/dir.c ---- linux-2.6.8.1.orig/fs/coda/dir.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/coda/dir.c 2006-05-11 13:05:35.000000000 +0400 -@@ -147,7 +147,8 @@ exit: - } - - --int coda_permission(struct inode *inode, int mask, struct nameidata *nd) -+int coda_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *perm) - { - int error = 0; - 
-diff -uprN linux-2.6.8.1.orig/fs/coda/pioctl.c linux-2.6.8.1-ve022stab078/fs/coda/pioctl.c ---- linux-2.6.8.1.orig/fs/coda/pioctl.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/coda/pioctl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -25,7 +25,7 @@ - - /* pioctl ops */ - static int coda_ioctl_permission(struct inode *inode, int mask, -- struct nameidata *nd); -+ struct nameidata *nd, struct exec_perm *); - static int coda_pioctl(struct inode * inode, struct file * filp, - unsigned int cmd, unsigned long user_data); - -@@ -43,7 +43,8 @@ struct file_operations coda_ioctl_operat - - /* the coda pioctl inode ops */ - static int coda_ioctl_permission(struct inode *inode, int mask, -- struct nameidata *nd) -+ struct nameidata *nd, -+ struct exec_perm *exec_perm) - { - return 0; - } -diff -uprN linux-2.6.8.1.orig/fs/compat.c linux-2.6.8.1-ve022stab078/fs/compat.c ---- linux-2.6.8.1.orig/fs/compat.c 2004-08-14 14:55:31.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/compat.c 2006-05-11 13:05:49.000000000 +0400 -@@ -25,6 +25,7 @@ - #include <linux/file.h> - #include <linux/vfs.h> - #include <linux/ioctl32.h> -+#include <linux/virtinfo.h> - #include <linux/init.h> - #include <linux/sockios.h> /* for SIOCDEVPRIVATE */ - #include <linux/smb.h> -@@ -155,6 +156,8 @@ asmlinkage long compat_sys_statfs(const - if (!error) { - struct kstatfs tmp; - error = vfs_statfs(nd.dentry->d_inode->i_sb, &tmp); -+ if (!error) -+ error = faudit_statfs(nd.mnt->mnt_sb, &tmp); - if (!error && put_compat_statfs(buf, &tmp)) - error = -EFAULT; - path_release(&nd); -@@ -173,6 +176,8 @@ asmlinkage long compat_sys_fstatfs(unsig - if (!file) - goto out; - error = vfs_statfs(file->f_dentry->d_inode->i_sb, &tmp); -+ if (!error) -+ error = faudit_statfs(file->f_vfsmnt->mnt_sb, &tmp); - if (!error && put_compat_statfs(buf, &tmp)) - error = -EFAULT; - fput(file); -@@ -216,6 +221,8 @@ asmlinkage long compat_statfs64(const ch - if (!error) { - struct kstatfs tmp; - error = 
vfs_statfs(nd.dentry->d_inode->i_sb, &tmp); -+ if (!error) -+ error = faudit_statfs(nd.mnt->mnt_sb, &tmp); - if (!error && put_compat_statfs64(buf, &tmp)) - error = -EFAULT; - path_release(&nd); -@@ -237,6 +244,8 @@ asmlinkage long compat_fstatfs64(unsigne - if (!file) - goto out; - error = vfs_statfs(file->f_dentry->d_inode->i_sb, &tmp); -+ if (!error) -+ error = faudit_statfs(file->f_vfsmnt->mnt_sb, &tmp); - if (!error && put_compat_statfs64(buf, &tmp)) - error = -EFAULT; - fput(file); -@@ -429,6 +438,8 @@ asmlinkage long compat_sys_ioctl(unsigne - fn = d_path(filp->f_dentry, - filp->f_vfsmnt, path, - PAGE_SIZE); -+ if (IS_ERR(fn)) -+ fn = "(err)"; - } - - sprintf(buf,"'%c'", (cmd>>24) & 0x3f); -@@ -1375,7 +1386,11 @@ int compat_do_execve(char * filename, - - sched_balance_exec(); - -- file = open_exec(filename); -+ retval = virtinfo_gencall(VIRTINFO_DOEXECVE, NULL); -+ if (retval) -+ return retval; -+ -+ file = open_exec(filename, &bprm); - - retval = PTR_ERR(file); - if (IS_ERR(file)) -diff -uprN linux-2.6.8.1.orig/fs/compat_ioctl.c linux-2.6.8.1-ve022stab078/fs/compat_ioctl.c ---- linux-2.6.8.1.orig/fs/compat_ioctl.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/compat_ioctl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -640,8 +640,11 @@ int siocdevprivate_ioctl(unsigned int fd - /* Don't check these user accesses, just let that get trapped - * in the ioctl handler instead. 
- */ -- copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0], IFNAMSIZ); -- __put_user(data64, &u_ifreq64->ifr_ifru.ifru_data); -+ if (copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0], -+ IFNAMSIZ)) -+ return -EFAULT; -+ if (__put_user(data64, &u_ifreq64->ifr_ifru.ifru_data)) -+ return -EFAULT; - - return sys_ioctl(fd, cmd, (unsigned long) u_ifreq64); - } -@@ -679,6 +682,11 @@ static int dev_ifsioc(unsigned int fd, u - set_fs (old_fs); - if (!err) { - switch (cmd) { -+ /* TUNSETIFF is defined as _IOW, it should be _IORW -+ * as the data is copied back to user space, but that -+ * cannot be fixed without breaking all existing apps. -+ */ -+ case TUNSETIFF: - case SIOCGIFFLAGS: - case SIOCGIFMETRIC: - case SIOCGIFMTU: -@@ -785,13 +793,16 @@ static int routing_ioctl(unsigned int fd - r = (void *) &r4; - } - -- if (ret) -- return -EFAULT; -+ if (ret) { -+ ret = -EFAULT; -+ goto out; -+ } - - set_fs (KERNEL_DS); - ret = sys_ioctl (fd, cmd, (unsigned long) r); - set_fs (old_fs); - -+out: - if (mysock) - sockfd_put(mysock); - -@@ -2336,7 +2347,9 @@ put_dirent32 (struct dirent *d, struct c - __put_user(d->d_ino, &d32->d_ino); - __put_user(d->d_off, &d32->d_off); - __put_user(d->d_reclen, &d32->d_reclen); -- __copy_to_user(d32->d_name, d->d_name, d->d_reclen); -+ if (__copy_to_user(d32->d_name, d->d_name, d->d_reclen)) -+ return -EFAULT; -+ - return ret; - } - -@@ -2479,7 +2492,8 @@ static int serial_struct_ioctl(unsigned - if (cmd == TIOCSSERIAL) { - if (verify_area(VERIFY_READ, ss32, sizeof(SS32))) - return -EFAULT; -- __copy_from_user(&ss, ss32, offsetof(SS32, iomem_base)); -+ if (__copy_from_user(&ss, ss32, offsetof(SS32, iomem_base))) -+ return -EFAULT; - __get_user(udata, &ss32->iomem_base); - ss.iomem_base = compat_ptr(udata); - __get_user(ss.iomem_reg_shift, &ss32->iomem_reg_shift); -@@ -2492,7 +2506,8 @@ static int serial_struct_ioctl(unsigned - if (cmd == TIOCGSERIAL && err >= 0) { - if (verify_area(VERIFY_WRITE, ss32, sizeof(SS32))) - return 
-EFAULT; -- __copy_to_user(ss32,&ss,offsetof(SS32,iomem_base)); -+ if (__copy_to_user(ss32,&ss,offsetof(SS32,iomem_base))) -+ return -EFAULT; - __put_user((unsigned long)ss.iomem_base >> 32 ? - 0xffffffff : (unsigned)(unsigned long)ss.iomem_base, - &ss32->iomem_base); -diff -uprN linux-2.6.8.1.orig/fs/dcache.c linux-2.6.8.1-ve022stab078/fs/dcache.c ---- linux-2.6.8.1.orig/fs/dcache.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/dcache.c 2006-05-11 13:05:40.000000000 +0400 -@@ -19,6 +19,7 @@ - #include <linux/mm.h> - #include <linux/fs.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/init.h> - #include <linux/smp_lock.h> - #include <linux/hash.h> -@@ -26,11 +27,15 @@ - #include <linux/module.h> - #include <linux/mount.h> - #include <linux/file.h> -+#include <linux/namei.h> - #include <asm/uaccess.h> - #include <linux/security.h> - #include <linux/seqlock.h> - #include <linux/swap.h> - #include <linux/bootmem.h> -+#include <linux/kernel_stat.h> -+ -+#include <ub/ub_dcache.h> - - /* #define DCACHE_DEBUG 1 */ - -@@ -43,7 +48,10 @@ EXPORT_SYMBOL(dcache_lock); - - static kmem_cache_t *dentry_cache; - --#define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname)) -+unsigned int dentry_memusage(void) -+{ -+ return kmem_cache_memusage(dentry_cache); -+} - - /* - * This is the single most critical data structure when it comes -@@ -70,6 +78,7 @@ static void d_callback(struct rcu_head * - { - struct dentry * dentry = container_of(head, struct dentry, d_rcu); - -+ ub_dentry_free(dentry); - if (dname_external(dentry)) - kfree(dentry->d_name.name); - kmem_cache_free(dentry_cache, dentry); -@@ -109,6 +118,75 @@ static inline void dentry_iput(struct de - } - } - -+struct dcache_shrinker { -+ struct list_head list; -+ struct dentry *dentry; -+}; -+ -+DECLARE_WAIT_QUEUE_HEAD(dcache_shrinker_wq); -+ -+/* called under dcache_lock */ -+static void dcache_shrinker_add(struct dcache_shrinker *ds, -+ struct dentry 
*parent, struct dentry *dentry) -+{ -+ struct super_block *sb; -+ -+ sb = parent->d_sb; -+ ds->dentry = parent; -+ list_add(&ds->list, &sb->s_dshrinkers); -+} -+ -+/* called under dcache_lock */ -+static void dcache_shrinker_del(struct dcache_shrinker *ds) -+{ -+ if (ds == NULL || list_empty(&ds->list)) -+ return; -+ -+ list_del_init(&ds->list); -+ wake_up_all(&dcache_shrinker_wq); -+} -+ -+/* called under dcache_lock, drops inside */ -+static void dcache_shrinker_wait(struct super_block *sb) -+{ -+ DECLARE_WAITQUEUE(wq, current); -+ -+ __set_current_state(TASK_UNINTERRUPTIBLE); -+ add_wait_queue(&dcache_shrinker_wq, &wq); -+ spin_unlock(&dcache_lock); -+ -+ schedule(); -+ remove_wait_queue(&dcache_shrinker_wq, &wq); -+ __set_current_state(TASK_RUNNING); -+} -+ -+void dcache_shrinker_wait_sb(struct super_block *sb) -+{ -+ /* the root dentry can be held in dput_recursive */ -+ spin_lock(&dcache_lock); -+ while (!list_empty(&sb->s_dshrinkers)) { -+ dcache_shrinker_wait(sb); -+ spin_lock(&dcache_lock); -+ } -+ spin_unlock(&dcache_lock); -+} -+ -+/* dcache_lock protects shrinker's list */ -+static void shrink_dcache_racecheck(struct dentry *parent, int *racecheck) -+{ -+ struct super_block *sb; -+ struct dcache_shrinker *ds; -+ -+ sb = parent->d_sb; -+ list_for_each_entry(ds, &sb->s_dshrinkers, list) { -+ /* is one of dcache shrinkers working on the dentry? */ -+ if (ds->dentry == parent) { -+ *racecheck = 1; -+ break; -+ } -+ } -+} -+ - /* - * This is dput - * -@@ -127,26 +205,26 @@ static inline void dentry_iput(struct de - */ - - /* -- * dput - release a dentry -- * @dentry: dentry to release -+ * dput_recursive - go upward through the dentry tree and release dentries -+ * @dentry: starting dentry -+ * @ds: shrinker to be added to active list (see shrink_dcache_parent) - * - * Release a dentry. This will drop the usage count and if appropriate - * call the dentry unlink method as well as removing it from the queues and - * releasing its resources. 
If the parent dentries were scheduled for release - * they too may now get deleted. - * -+ * This traverse upward doesn't change d_inuse of any dentry -+ * - * no dcache lock, please. - */ -- --void dput(struct dentry *dentry) -+static void dput_recursive(struct dentry *dentry, struct dcache_shrinker *ds) - { -- if (!dentry) -- return; -- --repeat: - if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock)) - return; -+ dcache_shrinker_del(ds); - -+repeat: - spin_lock(&dentry->d_lock); - if (atomic_read(&dentry->d_count)) { - spin_unlock(&dentry->d_lock); -@@ -178,6 +256,7 @@ unhash_it: - - kill_it: { - struct dentry *parent; -+ struct dcache_shrinker lds; - - /* If dentry was on d_lru list - * delete it from there -@@ -187,18 +266,50 @@ kill_it: { - dentry_stat.nr_unused--; - } - list_del(&dentry->d_child); -+ parent = dentry->d_parent; -+ dcache_shrinker_add(&lds, parent, dentry); - dentry_stat.nr_dentry--; /* For d_free, below */ - /*drops the locks, at that point nobody can reach this dentry */ - dentry_iput(dentry); -- parent = dentry->d_parent; - d_free(dentry); -- if (dentry == parent) -+ if (unlikely(dentry == parent)) { -+ spin_lock(&dcache_lock); -+ dcache_shrinker_del(&lds); -+ spin_unlock(&dcache_lock); - return; -+ } - dentry = parent; -- goto repeat; -+ spin_lock(&dcache_lock); -+ dcache_shrinker_del(&lds); -+ if (atomic_dec_and_test(&dentry->d_count)) -+ goto repeat; -+ spin_unlock(&dcache_lock); - } - } - -+/* -+ * dput - release a dentry -+ * @dentry: dentry to release -+ * -+ * Release a dentry. This will drop the usage count and if appropriate -+ * call the dentry unlink method as well as removing it from the queues and -+ * releasing its resources. If the parent dentries were scheduled for release -+ * they too may now get deleted. -+ * -+ * no dcache lock, please. 
-+ */ -+ -+void dput(struct dentry *dentry) -+{ -+ if (!dentry) -+ return; -+ -+ spin_lock(&dcache_lock); -+ ub_dentry_uncharge(dentry); -+ spin_unlock(&dcache_lock); -+ dput_recursive(dentry, NULL); -+} -+ - /** - * d_invalidate - invalidate a dentry - * @dentry: dentry to invalidate -@@ -265,6 +376,8 @@ static inline struct dentry * __dget_loc - dentry_stat.nr_unused--; - list_del_init(&dentry->d_lru); - } -+ -+ ub_dentry_charge_nofail(dentry); - return dentry; - } - -@@ -327,13 +440,16 @@ restart: - tmp = head; - while ((tmp = tmp->next) != head) { - struct dentry *dentry = list_entry(tmp, struct dentry, d_alias); -+ spin_lock(&dentry->d_lock); - if (!atomic_read(&dentry->d_count)) { - __dget_locked(dentry); - __d_drop(dentry); -+ spin_unlock(&dentry->d_lock); - spin_unlock(&dcache_lock); - dput(dentry); - goto restart; - } -+ spin_unlock(&dentry->d_lock); - } - spin_unlock(&dcache_lock); - } -@@ -344,19 +460,27 @@ restart: - * removed. - * Called with dcache_lock, drops it and then regains. 
- */ --static inline void prune_one_dentry(struct dentry * dentry) -+static void prune_one_dentry(struct dentry * dentry) - { - struct dentry * parent; -+ struct dcache_shrinker ds; - - __d_drop(dentry); - list_del(&dentry->d_child); -+ parent = dentry->d_parent; -+ dcache_shrinker_add(&ds, parent, dentry); - dentry_stat.nr_dentry--; /* For d_free, below */ - dentry_iput(dentry); - parent = dentry->d_parent; - d_free(dentry); - if (parent != dentry) -- dput(parent); -+ /* -+ * dentry is not in use, only child (not outside) -+ * references change, so parent->d_inuse does not change -+ */ -+ dput_recursive(parent, &ds); - spin_lock(&dcache_lock); -+ dcache_shrinker_del(&ds); - } - - /** -@@ -379,6 +503,8 @@ static void prune_dcache(int count) - struct dentry *dentry; - struct list_head *tmp; - -+ cond_resched_lock(&dcache_lock); -+ - tmp = dentry_unused.prev; - if (tmp == &dentry_unused) - break; -@@ -472,6 +598,7 @@ repeat: - continue; - } - prune_one_dentry(dentry); -+ cond_resched_lock(&dcache_lock); - goto repeat; - } - spin_unlock(&dcache_lock); -@@ -536,13 +663,12 @@ positive: - * whenever the d_subdirs list is non-empty and continue - * searching. - */ --static int select_parent(struct dentry * parent) -+static int select_parent(struct dentry * parent, int * racecheck) - { - struct dentry *this_parent = parent; - struct list_head *next; - int found = 0; - -- spin_lock(&dcache_lock); - repeat: - next = this_parent->d_subdirs.next; - resume: -@@ -564,6 +690,15 @@ resume: - dentry_stat.nr_unused++; - found++; - } -+ -+ /* -+ * We can return to the caller if we have found some (this -+ * ensures forward progress). We'll be coming back to find -+ * the rest. -+ */ -+ if (found && need_resched()) -+ goto out; -+ - /* - * Descend a level if the d_subdirs list is non-empty. 
- */ -@@ -575,6 +710,9 @@ dentry->d_parent->d_name.name, dentry->d - #endif - goto repeat; - } -+ -+ if (!found && racecheck != NULL) -+ shrink_dcache_racecheck(dentry, racecheck); - } - /* - * All done at this level ... ascend and resume the search. -@@ -588,7 +726,7 @@ this_parent->d_parent->d_name.name, this - #endif - goto resume; - } -- spin_unlock(&dcache_lock); -+out: - return found; - } - -@@ -601,10 +739,66 @@ this_parent->d_parent->d_name.name, this - - void shrink_dcache_parent(struct dentry * parent) - { -- int found; -+ int found, r; -+ -+ while (1) { -+ spin_lock(&dcache_lock); -+ found = select_parent(parent, NULL); -+ if (found) -+ goto found; - -- while ((found = select_parent(parent)) != 0) -+ /* -+ * try again with a dput_recursive() race check. -+ * it returns quickly if everything was really shrinked -+ */ -+ r = 0; -+ found = select_parent(parent, &r); -+ if (found) -+ goto found; -+ if (!r) -+ break; -+ -+ /* drops the lock inside */ -+ dcache_shrinker_wait(parent->d_sb); -+ continue; -+ -+found: -+ spin_unlock(&dcache_lock); - prune_dcache(found); -+ } -+ spin_unlock(&dcache_lock); -+} -+ -+/* -+ * Move any unused anon dentries to the end of the unused list. -+ * called under dcache_lock -+ */ -+static int select_anon(struct hlist_head *head, int *racecheck) -+{ -+ struct hlist_node *lp; -+ int found = 0; -+ -+ hlist_for_each(lp, head) { -+ struct dentry *this = hlist_entry(lp, struct dentry, d_hash); -+ if (!list_empty(&this->d_lru)) { -+ dentry_stat.nr_unused--; -+ list_del_init(&this->d_lru); -+ } -+ -+ /* -+ * move only zero ref count dentries to the end -+ * of the unused list for prune_dcache -+ */ -+ if (!atomic_read(&this->d_count)) { -+ list_add_tail(&this->d_lru, &dentry_unused); -+ dentry_stat.nr_unused++; -+ found++; -+ } -+ -+ if (!found && racecheck != NULL) -+ shrink_dcache_racecheck(this, racecheck); -+ } -+ return found; - } - - /** -@@ -617,33 +811,36 @@ void shrink_dcache_parent(struct dentry - * done under dcache_lock. 
- * - */ --void shrink_dcache_anon(struct hlist_head *head) -+void shrink_dcache_anon(struct super_block *sb) - { -- struct hlist_node *lp; -- int found; -- do { -- found = 0; -+ int found, r; -+ -+ while (1) { - spin_lock(&dcache_lock); -- hlist_for_each(lp, head) { -- struct dentry *this = hlist_entry(lp, struct dentry, d_hash); -- if (!list_empty(&this->d_lru)) { -- dentry_stat.nr_unused--; -- list_del_init(&this->d_lru); -- } -+ found = select_anon(&sb->s_anon, NULL); -+ if (found) -+ goto found; - -- /* -- * move only zero ref count dentries to the end -- * of the unused list for prune_dcache -- */ -- if (!atomic_read(&this->d_count)) { -- list_add_tail(&this->d_lru, &dentry_unused); -- dentry_stat.nr_unused++; -- found++; -- } -- } -+ /* -+ * try again with a dput_recursive() race check. -+ * it returns quickly if everything was really shrinked -+ */ -+ r = 0; -+ found = select_anon(&sb->s_anon, &r); -+ if (found) -+ goto found; -+ if (!r) -+ break; -+ -+ /* drops the lock inside */ -+ dcache_shrinker_wait(sb); -+ continue; -+ -+found: - spin_unlock(&dcache_lock); - prune_dcache(found); -- } while(found); -+ } -+ spin_unlock(&dcache_lock); - } - - /* -@@ -660,12 +857,18 @@ void shrink_dcache_anon(struct hlist_hea - */ - static int shrink_dcache_memory(int nr, unsigned int gfp_mask) - { -+ int res = -1; -+ -+ KSTAT_PERF_ENTER(shrink_dcache) - if (nr) { - if (!(gfp_mask & __GFP_FS)) -- return -1; -+ goto out; - prune_dcache(nr); - } -- return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure; -+ res = (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure; -+out: -+ KSTAT_PERF_LEAVE(shrink_dcache) -+ return res; - } - - /** -@@ -685,19 +888,20 @@ struct dentry *d_alloc(struct dentry * p - - dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); - if (!dentry) -- return NULL; -+ goto err_dentry; - - if (name->len > DNAME_INLINE_LEN-1) { - dname = kmalloc(name->len + 1, GFP_KERNEL); -- if (!dname) { -- kmem_cache_free(dentry_cache, dentry); -- return 
NULL; -- } -+ if (!dname) -+ goto err_name; - } else { - dname = dentry->d_iname; - } - dentry->d_name.name = dname; - -+ if (ub_dentry_alloc(dentry)) -+ goto err_charge; -+ - dentry->d_name.len = name->len; - dentry->d_name.hash = name->hash; - memcpy(dname, name->name, name->len); -@@ -727,12 +931,23 @@ struct dentry *d_alloc(struct dentry * p - } - - spin_lock(&dcache_lock); -- if (parent) -+ if (parent) { - list_add(&dentry->d_child, &parent->d_subdirs); -+ if (parent->d_flags & DCACHE_VIRTUAL) -+ dentry->d_flags |= DCACHE_VIRTUAL; -+ } - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); - - return dentry; -+ -+err_charge: -+ if (name->len > DNAME_INLINE_LEN - 1) -+ kfree(dname); -+err_name: -+ kmem_cache_free(dentry_cache, dentry); -+err_dentry: -+ return NULL; - } - - /** -@@ -1016,6 +1231,7 @@ struct dentry * __d_lookup(struct dentry - if (!d_unhashed(dentry)) { - atomic_inc(&dentry->d_count); - found = dentry; -+ goto found; - } - terminate: - spin_unlock(&dentry->d_lock); -@@ -1026,6 +1242,17 @@ next: - rcu_read_unlock(); - - return found; -+ -+found: -+ /* -+ * d_lock and rcu_read_lock -+ * are dropped in ub_dentry_charge() -+ */ -+ if (!ub_dentry_charge(found)) -+ return found; -+ -+ dput(found); -+ return NULL; - } - - /** -@@ -1262,6 +1489,32 @@ already_unhashed: - } - - /** -+ * __d_path_add_deleted - prepend "(deleted) " text -+ * @end: a pointer to the character after free space at the beginning of the -+ * buffer -+ * @buflen: remaining free space -+ */ -+static inline char * __d_path_add_deleted(char * end, int buflen) -+{ -+ buflen -= 10; -+ if (buflen < 0) -+ return ERR_PTR(-ENAMETOOLONG); -+ end -= 10; -+ memcpy(end, "(deleted) ", 10); -+ return end; -+} -+ -+/** -+ * d_root_check - checks if dentry is accessible from current's fs root -+ * @dentry: dentry to be verified -+ * @vfsmnt: vfsmnt to which the dentry belongs -+ */ -+int d_root_check(struct dentry *dentry, struct vfsmount *vfsmnt) -+{ -+ return PTR_ERR(d_path(dentry, vfsmnt, NULL, 
0)); -+} -+ -+/** - * d_path - return the path of a dentry - * @dentry: dentry to report - * @vfsmnt: vfsmnt to which the dentry belongs -@@ -1282,36 +1535,35 @@ static char * __d_path( struct dentry *d - char *buffer, int buflen) - { - char * end = buffer+buflen; -- char * retval; -+ char * retval = NULL; - int namelen; -+ int deleted; -+ struct vfsmount *oldvfsmnt; - -- *--end = '\0'; -- buflen--; -- if (!IS_ROOT(dentry) && d_unhashed(dentry)) { -- buflen -= 10; -- end -= 10; -- if (buflen < 0) -+ oldvfsmnt = vfsmnt; -+ deleted = (!IS_ROOT(dentry) && d_unhashed(dentry)); -+ if (buffer != NULL) { -+ *--end = '\0'; -+ buflen--; -+ -+ if (buflen < 1) - goto Elong; -- memcpy(end, " (deleted)", 10); -+ /* Get '/' right */ -+ retval = end-1; -+ *retval = '/'; - } - -- if (buflen < 1) -- goto Elong; -- /* Get '/' right */ -- retval = end-1; -- *retval = '/'; -- - for (;;) { - struct dentry * parent; - - if (dentry == root && vfsmnt == rootmnt) - break; - if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { -- /* Global root? */ -+ /* root of a tree? */ - spin_lock(&vfsmount_lock); - if (vfsmnt->mnt_parent == vfsmnt) { - spin_unlock(&vfsmount_lock); -- goto global_root; -+ goto other_root; - } - dentry = vfsmnt->mnt_mountpoint; - vfsmnt = vfsmnt->mnt_parent; -@@ -1320,27 +1572,51 @@ static char * __d_path( struct dentry *d - } - parent = dentry->d_parent; - prefetch(parent); -+ if (buffer != NULL) { -+ namelen = dentry->d_name.len; -+ buflen -= namelen + 1; -+ if (buflen < 0) -+ goto Elong; -+ end -= namelen; -+ memcpy(end, dentry->d_name.name, namelen); -+ *--end = '/'; -+ retval = end; -+ } -+ dentry = parent; -+ } -+ /* the given root point is reached */ -+finish: -+ if (buffer != NULL && deleted) -+ retval = __d_path_add_deleted(end, buflen); -+ return retval; -+ -+other_root: -+ /* -+ * We traversed the tree upward and reached a root, but the given -+ * lookup terminal point wasn't encountered. 
It means either that the -+ * dentry is out of our scope or belongs to an abstract space like -+ * sock_mnt or pipe_mnt. Check for it. -+ * -+ * There are different options to check it. -+ * We may assume that any dentry tree is unreachable unless it's -+ * connected to `root' (defined as fs root of init aka child reaper) -+ * and expose all paths that are not connected to it. -+ * The other option is to allow exposing of known abstract spaces -+ * explicitly and hide the path information for other cases. -+ * This approach is more safe, let's take it. 2001/04/22 SAW -+ */ -+ if (!(oldvfsmnt->mnt_sb->s_flags & MS_NOUSER)) -+ return ERR_PTR(-EINVAL); -+ if (buffer != NULL) { - namelen = dentry->d_name.len; -- buflen -= namelen + 1; -+ buflen -= namelen; - if (buflen < 0) - goto Elong; -- end -= namelen; -- memcpy(end, dentry->d_name.name, namelen); -- *--end = '/'; -- retval = end; -- dentry = parent; -+ retval -= namelen-1; /* hit the slash */ -+ memcpy(retval, dentry->d_name.name, namelen); - } -+ goto finish; - -- return retval; -- --global_root: -- namelen = dentry->d_name.len; -- buflen -= namelen; -- if (buflen < 0) -- goto Elong; -- retval -= namelen-1; /* hit the slash */ -- memcpy(retval, dentry->d_name.name, namelen); -- return retval; - Elong: - return ERR_PTR(-ENAMETOOLONG); - } -@@ -1365,6 +1641,226 @@ char * d_path(struct dentry *dentry, str - return res; - } - -+#ifdef CONFIG_VE -+#include <net/sock.h> -+#include <linux/ip.h> -+#include <linux/file.h> -+#include <linux/namespace.h> -+#include <linux/vzratelimit.h> -+ -+static void mark_sub_tree_virtual(struct dentry *d) -+{ -+ struct dentry *orig_root; -+ -+ orig_root = d; -+ while (1) { -+ spin_lock(&d->d_lock); -+ d->d_flags |= DCACHE_VIRTUAL; -+ spin_unlock(&d->d_lock); -+ -+ if (!list_empty(&d->d_subdirs)) { -+ d = list_entry(d->d_subdirs.next, -+ struct dentry, d_child); -+ continue; -+ } -+ if (d == orig_root) -+ break; -+ while (d == list_entry(d->d_parent->d_subdirs.prev, -+ struct dentry, 
d_child)) { -+ d = d->d_parent; -+ if (d == orig_root) -+ goto out; -+ } -+ d = list_entry(d->d_child.next, -+ struct dentry, d_child); -+ } -+out: -+ return; -+} -+ -+void mark_tree_virtual(struct vfsmount *m, struct dentry *d) -+{ -+ struct vfsmount *orig_rootmnt; -+ -+ spin_lock(&dcache_lock); -+ spin_lock(&vfsmount_lock); -+ orig_rootmnt = m; -+ while (1) { -+ mark_sub_tree_virtual(d); -+ if (!list_empty(&m->mnt_mounts)) { -+ m = list_entry(m->mnt_mounts.next, -+ struct vfsmount, mnt_child); -+ d = m->mnt_root; -+ continue; -+ } -+ if (m == orig_rootmnt) -+ break; -+ while (m == list_entry(m->mnt_parent->mnt_mounts.prev, -+ struct vfsmount, mnt_child)) { -+ m = m->mnt_parent; -+ if (m == orig_rootmnt) -+ goto out; -+ } -+ m = list_entry(m->mnt_child.next, -+ struct vfsmount, mnt_child); -+ d = m->mnt_root; -+ } -+out: -+ spin_unlock(&vfsmount_lock); -+ spin_unlock(&dcache_lock); -+} -+EXPORT_SYMBOL(mark_tree_virtual); -+ -+static struct vz_rate_info area_ri = { 20, 10*HZ }; -+#define VE_AREA_ACC_CHECK 0x0001 -+#define VE_AREA_ACC_DENY 0x0002 -+#define VE_AREA_EXEC_CHECK 0x0010 -+#define VE_AREA_EXEC_DENY 0x0020 -+#define VE0_AREA_ACC_CHECK 0x0100 -+#define VE0_AREA_ACC_DENY 0x0200 -+#define VE0_AREA_EXEC_CHECK 0x1000 -+#define VE0_AREA_EXEC_DENY 0x2000 -+int ve_area_access_check = 0; -+ -+static void print_connection_info(struct task_struct *tsk) -+{ -+ struct files_struct *files; -+ int fd; -+ -+ files = get_files_struct(tsk); -+ if (!files) -+ return; -+ -+ spin_lock(&files->file_lock); -+ for (fd = 0; fd < files->max_fds; fd++) { -+ struct file *file; -+ struct inode *inode; -+ struct socket *socket; -+ struct sock *sk; -+ struct inet_opt *inet; -+ -+ file = files->fd[fd]; -+ if (file == NULL) -+ continue; -+ -+ inode = file->f_dentry->d_inode; -+ if (!inode->i_sock) -+ continue; -+ -+ socket = SOCKET_I(inode); -+ if (socket == NULL) -+ continue; -+ -+ sk = socket->sk; -+ if (sk->sk_family != PF_INET || sk->sk_type != SOCK_STREAM) -+ continue; -+ -+ inet = 
inet_sk(sk); -+ printk(KERN_ALERT "connection from %u.%u.%u.%u:%u to port %u\n", -+ NIPQUAD(inet->daddr), ntohs(inet->dport), -+ inet->num); -+ } -+ spin_unlock(&files->file_lock); -+ put_files_struct(files); -+} -+ -+static void check_alert(struct vfsmount *vfsmnt, struct dentry *dentry, -+ char *str) -+{ -+ struct task_struct *tsk; -+ unsigned long page; -+ struct super_block *sb; -+ char *p; -+ -+ if (!vz_ratelimit(&area_ri)) -+ return; -+ -+ tsk = current; -+ p = ERR_PTR(-ENOMEM); -+ page = __get_free_page(GFP_KERNEL); -+ if (page) { -+ spin_lock(&dcache_lock); -+ p = __d_path(dentry, vfsmnt, tsk->fs->root, tsk->fs->rootmnt, -+ (char *)page, PAGE_SIZE); -+ spin_unlock(&dcache_lock); -+ } -+ if (IS_ERR(p)) -+ p = "(undefined)"; -+ -+ sb = dentry->d_sb; -+ printk(KERN_ALERT "%s check alert! file:[%s] from %d/%s, dev%x\n" -+ "Task %d/%d[%s] from VE%d, execenv %d\n", -+ str, p, VE_OWNER_FSTYPE(sb->s_type)->veid, -+ sb->s_type->name, sb->s_dev, -+ tsk->pid, virt_pid(tsk), tsk->comm, -+ VE_TASK_INFO(tsk)->owner_env->veid, -+ get_exec_env()->veid); -+ -+ free_page(page); -+ -+ print_connection_info(tsk); -+ -+ read_lock(&tasklist_lock); -+ tsk = tsk->real_parent; -+ get_task_struct(tsk); -+ read_unlock(&tasklist_lock); -+ -+ printk(KERN_ALERT "Parent %d/%d[%s] from VE%d\n", -+ tsk->pid, virt_pid(tsk), tsk->comm, -+ VE_TASK_INFO(tsk)->owner_env->veid); -+ -+ print_connection_info(tsk); -+ put_task_struct(tsk); -+ dump_stack(); -+} -+#endif -+ -+int check_area_access_ve(struct dentry *dentry, struct vfsmount *mnt) -+{ -+#ifdef CONFIG_VE -+ int check, alert, deny; -+ -+ if (ve_is_super(get_exec_env())) { -+ check = ve_area_access_check & VE0_AREA_ACC_CHECK; -+ alert = dentry->d_flags & DCACHE_VIRTUAL; -+ deny = ve_area_access_check & VE0_AREA_ACC_DENY; -+ } else { -+ check = ve_area_access_check & VE_AREA_ACC_CHECK; -+ alert = !(dentry->d_flags & DCACHE_VIRTUAL); -+ deny = ve_area_access_check & VE_AREA_ACC_DENY; -+ } -+ -+ if (check && alert) -+ check_alert(mnt, dentry, 
"Access"); -+ if (deny && alert) -+ return -EACCES; -+#endif -+ return 0; -+} -+ -+int check_area_execute_ve(struct dentry *dentry, struct vfsmount *mnt) -+{ -+#ifdef CONFIG_VE -+ int check, alert, deny; -+ -+ if (ve_is_super(get_exec_env())) { -+ check = ve_area_access_check & VE0_AREA_EXEC_CHECK; -+ alert = dentry->d_flags & DCACHE_VIRTUAL; -+ deny = ve_area_access_check & VE0_AREA_EXEC_DENY; -+ } else { -+ check = ve_area_access_check & VE_AREA_EXEC_CHECK; -+ alert = !(dentry->d_flags & DCACHE_VIRTUAL); -+ deny = ve_area_access_check & VE_AREA_EXEC_DENY; -+ } -+ -+ if (check && alert) -+ check_alert(mnt, dentry, "Exec"); -+ if (deny && alert) -+ return -EACCES; -+#endif -+ return 0; -+} -+ - /* - * NOTE! The user-level library version returns a - * character pointer. The kernel system call just -@@ -1501,10 +1997,12 @@ resume: - goto repeat; - } - atomic_dec(&dentry->d_count); -+ ub_dentry_uncharge(dentry); - } - if (this_parent != root) { - next = this_parent->d_child.next; - atomic_dec(&this_parent->d_count); -+ ub_dentry_uncharge(this_parent); - this_parent = this_parent->d_parent; - goto resume; - } -@@ -1627,7 +2125,7 @@ void __init vfs_caches_init(unsigned lon - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); - - filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor); -+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, filp_ctor, filp_dtor); - - dcache_init(mempages); - inode_init(mempages); -diff -uprN linux-2.6.8.1.orig/fs/dcookies.c linux-2.6.8.1-ve022stab078/fs/dcookies.c ---- linux-2.6.8.1.orig/fs/dcookies.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/dcookies.c 2006-05-11 13:05:37.000000000 +0400 -@@ -93,12 +93,10 @@ static struct dcookie_struct * alloc_dco - if (!dcs) - return NULL; - -- atomic_inc(&dentry->d_count); -- atomic_inc(&vfsmnt->mnt_count); - dentry->d_cookie = dcs; - -- dcs->dentry = dentry; -- dcs->vfsmnt = vfsmnt; -+ dcs->dentry = dget(dentry); -+ 
dcs->vfsmnt = mntget(vfsmnt); - hash_dcookie(dcs); - - return dcs; -diff -uprN linux-2.6.8.1.orig/fs/devpts/inode.c linux-2.6.8.1-ve022stab078/fs/devpts/inode.c ---- linux-2.6.8.1.orig/fs/devpts/inode.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/devpts/inode.c 2006-05-11 13:05:42.000000000 +0400 -@@ -12,6 +12,7 @@ - - #include <linux/module.h> - #include <linux/init.h> -+#include <linux/ve.h> - #include <linux/fs.h> - #include <linux/sched.h> - #include <linux/namei.h> -@@ -25,13 +26,29 @@ - static struct vfsmount *devpts_mnt; - static struct dentry *devpts_root; - --static struct { -- int setuid; -- int setgid; -- uid_t uid; -- gid_t gid; -- umode_t mode; --} config = {.mode = 0600}; -+void prepare_devpts(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->devpts_mnt = devpts_mnt; -+ devpts_mnt = (struct vfsmount *)0x11121314; -+ -+ /* ve0.devpts_root should be filled inside fill_super() */ -+ BUG_ON(devpts_root != NULL); -+ devpts_root = (struct dentry *)0x12131415; -+#endif -+} -+ -+#ifndef CONFIG_VE -+#define visible_devpts_mnt devpts_mnt -+#define visible_devpts_root devpts_root -+#define visible_devpts_config config -+#else -+#define visible_devpts_mnt (get_exec_env()->devpts_mnt) -+#define visible_devpts_root (get_exec_env()->devpts_root) -+#define visible_devpts_config (*(get_exec_env()->devpts_config)) -+#endif -+ -+static struct devpts_config config = {.mode = 0600}; - - static int devpts_remount(struct super_block *sb, int *flags, char *data) - { -@@ -57,15 +74,16 @@ static int devpts_remount(struct super_b - } else if (sscanf(this_char, "mode=%o%c", &n, &dummy) == 1) - mode = n & ~S_IFMT; - else { -- printk("devpts: called with bogus options\n"); -+ ve_printk(VE_LOG, -+ "devpts: called with bogus options\n"); - return -EINVAL; - } - } -- config.setuid = setuid; -- config.setgid = setgid; -- config.uid = uid; -- config.gid = gid; -- config.mode = mode; -+ visible_devpts_config.setuid = setuid; -+ visible_devpts_config.setgid = setgid; 
-+ visible_devpts_config.uid = uid; -+ visible_devpts_config.gid = gid; -+ visible_devpts_config.mode = mode; - - return 0; - } -@@ -98,10 +116,10 @@ devpts_fill_super(struct super_block *s, - inode->i_fop = &simple_dir_operations; - inode->i_nlink = 2; - -- devpts_root = s->s_root = d_alloc_root(inode); -+ visible_devpts_root = s->s_root = d_alloc_root(inode); - if (s->s_root) - return 0; -- -+ - printk("devpts: get root dentry failed\n"); - iput(inode); - fail: -@@ -114,13 +132,15 @@ static struct super_block *devpts_get_sb - return get_sb_single(fs_type, flags, data, devpts_fill_super); - } - --static struct file_system_type devpts_fs_type = { -+struct file_system_type devpts_fs_type = { - .owner = THIS_MODULE, - .name = "devpts", - .get_sb = devpts_get_sb, - .kill_sb = kill_anon_super, - }; - -+EXPORT_SYMBOL(devpts_fs_type); -+ - /* - * The normal naming convention is simply /dev/pts/<number>; this conforms - * to the System V naming convention -@@ -129,7 +149,7 @@ static struct file_system_type devpts_fs - static struct dentry *get_node(int num) - { - char s[12]; -- struct dentry *root = devpts_root; -+ struct dentry *root = visible_devpts_root; - down(&root->d_inode->i_sem); - return lookup_one_len(s, root, sprintf(s, "%d", num)); - } -@@ -147,7 +167,7 @@ int devpts_pty_new(struct tty_struct *tt - struct tty_driver *driver = tty->driver; - dev_t device = MKDEV(driver->major, driver->minor_start+number); - struct dentry *dentry; -- struct inode *inode = new_inode(devpts_mnt->mnt_sb); -+ struct inode *inode = new_inode(visible_devpts_mnt->mnt_sb); - - /* We're supposed to be given the slave end of a pty */ - BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY); -@@ -158,10 +178,12 @@ int devpts_pty_new(struct tty_struct *tt - - inode->i_ino = number+2; - inode->i_blksize = 1024; -- inode->i_uid = config.setuid ? config.uid : current->fsuid; -- inode->i_gid = config.setgid ? config.gid : current->fsgid; -+ inode->i_uid = visible_devpts_config.setuid ? 
-+ visible_devpts_config.uid : current->fsuid; -+ inode->i_gid = visible_devpts_config.setgid ? -+ visible_devpts_config.gid : current->fsgid; - inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; -- init_special_inode(inode, S_IFCHR|config.mode, device); -+ init_special_inode(inode, S_IFCHR|visible_devpts_config.mode, device); - inode->i_op = &devpts_file_inode_operations; - inode->u.generic_ip = tty; - -@@ -169,7 +191,7 @@ int devpts_pty_new(struct tty_struct *tt - if (!IS_ERR(dentry) && !dentry->d_inode) - d_instantiate(dentry, inode); - -- up(&devpts_root->d_inode->i_sem); -+ up(&visible_devpts_root->d_inode->i_sem); - - return 0; - } -@@ -179,10 +201,14 @@ struct tty_struct *devpts_get_tty(int nu - struct dentry *dentry = get_node(number); - struct tty_struct *tty; - -- tty = (IS_ERR(dentry) || !dentry->d_inode) ? NULL : -- dentry->d_inode->u.generic_ip; -+ tty = NULL; -+ if (!IS_ERR(dentry)) { -+ if (dentry->d_inode) -+ tty = dentry->d_inode->u.generic_ip; -+ dput(dentry); -+ } - -- up(&devpts_root->d_inode->i_sem); -+ up(&visible_devpts_root->d_inode->i_sem); - - return tty; - } -@@ -200,7 +226,7 @@ void devpts_pty_kill(int number) - } - dput(dentry); - } -- up(&devpts_root->d_inode->i_sem); -+ up(&visible_devpts_root->d_inode->i_sem); - } - - static int __init init_devpts_fs(void) -@@ -208,17 +234,22 @@ static int __init init_devpts_fs(void) - int err = init_devpts_xattr(); - if (err) - return err; -+#ifdef CONFIG_VE -+ get_ve0()->devpts_config = &config; -+#endif - err = register_filesystem(&devpts_fs_type); - if (!err) { - devpts_mnt = kern_mount(&devpts_fs_type); - if (IS_ERR(devpts_mnt)) - err = PTR_ERR(devpts_mnt); - } -+ prepare_devpts(); - return err; - } - - static void __exit exit_devpts_fs(void) - { -+ /* the code is never called, the argument is irrelevant */ - unregister_filesystem(&devpts_fs_type); - mntput(devpts_mnt); - exit_devpts_xattr(); -diff -uprN linux-2.6.8.1.orig/fs/direct-io.c linux-2.6.8.1-ve022stab078/fs/direct-io.c 
---- linux-2.6.8.1.orig/fs/direct-io.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/direct-io.c 2006-05-11 13:05:25.000000000 +0400 -@@ -833,8 +833,10 @@ do_holes: - char *kaddr; - - /* AKPM: eargh, -ENOTBLK is a hack */ -- if (dio->rw == WRITE) -+ if (dio->rw == WRITE) { -+ page_cache_release(page); - return -ENOTBLK; -+ } - - if (dio->block_in_file >= - i_size_read(dio->inode)>>blkbits) { -diff -uprN linux-2.6.8.1.orig/fs/eventpoll.c linux-2.6.8.1-ve022stab078/fs/eventpoll.c ---- linux-2.6.8.1.orig/fs/eventpoll.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/eventpoll.c 2006-05-11 13:05:48.000000000 +0400 -@@ -149,10 +149,9 @@ - #define EP_ITEM_FROM_EPQUEUE(p) (container_of(p, struct ep_pqueue, pt)->epi) - - --struct epoll_filefd { -- struct file *file; -- int fd; --}; -+/* Maximum msec timeout value storeable in a long int */ -+#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ) -+ - - /* - * Node that is linked into the "wake_task_list" member of the "struct poll_safewake". -@@ -176,36 +175,6 @@ struct poll_safewake { - spinlock_t lock; - }; - --/* -- * This structure is stored inside the "private_data" member of the file -- * structure and rapresent the main data sructure for the eventpoll -- * interface. -- */ --struct eventpoll { -- /* Protect the this structure access */ -- rwlock_t lock; -- -- /* -- * This semaphore is used to ensure that files are not removed -- * while epoll is using them. This is read-held during the event -- * collection loop and it is write-held during the file cleanup -- * path, the epoll file exit code and the ctl operations. 
-- */ -- struct rw_semaphore sem; -- -- /* Wait queue used by sys_epoll_wait() */ -- wait_queue_head_t wq; -- -- /* Wait queue used by file->poll() */ -- wait_queue_head_t poll_wait; -- -- /* List of ready file descriptors */ -- struct list_head rdllist; -- -- /* RB-Tree root used to store monitored fd structs */ -- struct rb_root rbr; --}; -- - /* Wait structure used by the poll hooks */ - struct eppoll_entry { - /* List header used to link this structure to the "struct epitem" */ -@@ -224,50 +193,6 @@ struct eppoll_entry { - wait_queue_head_t *whead; - }; - --/* -- * Each file descriptor added to the eventpoll interface will -- * have an entry of this type linked to the hash. -- */ --struct epitem { -- /* RB-Tree node used to link this structure to the eventpoll rb-tree */ -- struct rb_node rbn; -- -- /* List header used to link this structure to the eventpoll ready list */ -- struct list_head rdllink; -- -- /* The file descriptor information this item refers to */ -- struct epoll_filefd ffd; -- -- /* Number of active wait queue attached to poll operations */ -- int nwait; -- -- /* List containing poll wait queues */ -- struct list_head pwqlist; -- -- /* The "container" of this item */ -- struct eventpoll *ep; -- -- /* The structure that describe the interested events and the source fd */ -- struct epoll_event event; -- -- /* -- * Used to keep track of the usage count of the structure. This avoids -- * that the structure will desappear from underneath our processing. -- */ -- atomic_t usecnt; -- -- /* List header used to link this item to the "struct file" items list */ -- struct list_head fllink; -- -- /* List header used to link the item to the transfer list */ -- struct list_head txlink; -- -- /* -- * This is used during the collection/transfer of events to userspace -- * to pin items empty events set. 
-- */ -- unsigned int revents; --}; - - /* Wrapper struct used by poll queueing */ - struct ep_pqueue { -@@ -282,13 +207,13 @@ static void ep_poll_safewake(struct poll - static int ep_getfd(int *efd, struct inode **einode, struct file **efile); - static int ep_file_init(struct file *file); - static void ep_free(struct eventpoll *ep); --static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd); -+struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd); - static void ep_use_epitem(struct epitem *epi); - static void ep_release_epitem(struct epitem *epi); - static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, - poll_table *pt); - static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi); --static int ep_insert(struct eventpoll *ep, struct epoll_event *event, -+int ep_insert(struct eventpoll *ep, struct epoll_event *event, - struct file *tfile, int fd); - static int ep_modify(struct eventpoll *ep, struct epitem *epi, - struct epoll_event *event); -@@ -615,6 +540,7 @@ eexit_1: - return error; - } - -+#define MAX_EVENTS (INT_MAX / sizeof(struct epoll_event)) - - /* - * Implement the event wait interface for the eventpoll file. It is the kernel -@@ -631,7 +557,7 @@ asmlinkage long sys_epoll_wait(int epfd, - current, epfd, events, maxevents, timeout)); - - /* The maximum number of event must be greater than zero */ -- if (maxevents <= 0) -+ if (maxevents <= 0 || maxevents > MAX_EVENTS) - return -EINVAL; - - /* Verify that the area passed by the user is writeable */ -@@ -816,7 +742,7 @@ static void ep_free(struct eventpoll *ep - * the returned item, so the caller must call ep_release_epitem() - * after finished using the "struct epitem". 
- */ --static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd) -+struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd) - { - int kcmp; - unsigned long flags; -@@ -916,7 +842,7 @@ static void ep_rbtree_insert(struct even - } - - --static int ep_insert(struct eventpoll *ep, struct epoll_event *event, -+int ep_insert(struct eventpoll *ep, struct epoll_event *event, - struct file *tfile, int fd) - { - int error, revents, pwake = 0; -@@ -1474,8 +1400,8 @@ static int ep_poll(struct eventpoll *ep, - * and the overflow condition. The passed timeout is in milliseconds, - * that why (t * HZ) / 1000. - */ -- jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ? -- MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000; -+ jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ? -+ MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000; - - retry: - write_lock_irqsave(&ep->lock, flags); -diff -uprN linux-2.6.8.1.orig/fs/exec.c linux-2.6.8.1-ve022stab078/fs/exec.c ---- linux-2.6.8.1.orig/fs/exec.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/exec.c 2006-05-11 13:05:49.000000000 +0400 -@@ -26,6 +26,7 @@ - #include <linux/slab.h> - #include <linux/file.h> - #include <linux/mman.h> -+#include <linux/virtinfo.h> - #include <linux/a.out.h> - #include <linux/stat.h> - #include <linux/fcntl.h> -@@ -50,6 +51,8 @@ - #include <asm/uaccess.h> - #include <asm/mmu_context.h> - -+#include <ub/ub_vmpages.h> -+ - #ifdef CONFIG_KMOD - #include <linux/kmod.h> - #endif -@@ -58,6 +61,8 @@ int core_uses_pid; - char core_pattern[65] = "core"; - /* The maximal length of core_pattern is also specified in sysctl.c */ - -+int sysctl_at_vsyscall; -+ - static struct linux_binfmt *formats; - static rwlock_t binfmt_lock = RW_LOCK_UNLOCKED; - -@@ -130,7 +135,7 @@ asmlinkage long sys_uselib(const char __ - if (!S_ISREG(nd.dentry->d_inode->i_mode)) - goto exit; - -- error = permission(nd.dentry->d_inode, MAY_READ | MAY_EXEC, &nd); -+ 
error = permission(nd.dentry->d_inode, MAY_READ | MAY_EXEC, &nd, NULL); - if (error) - goto exit; - -@@ -298,10 +303,14 @@ void install_arg_page(struct vm_area_str - struct page *page, unsigned long address) - { - struct mm_struct *mm = vma->vm_mm; -+ struct page_beancounter *pbc; - pgd_t * pgd; - pmd_t * pmd; - pte_t * pte; - -+ if (pb_alloc(&pbc)) -+ return; -+ - if (unlikely(anon_vma_prepare(vma))) - goto out_sig; - -@@ -320,9 +329,14 @@ void install_arg_page(struct vm_area_str - goto out; - } - mm->rss++; -+ vma->vm_rss++; - lru_cache_add_active(page); - set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte( - page, vma->vm_page_prot)))); -+ -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); -+ pb_add_ref(page, mm_ub(mm), &pbc); -+ - page_add_anon_rmap(page, vma, address); - pte_unmap(pte); - spin_unlock(&mm->page_table_lock); -@@ -334,6 +348,31 @@ out: - out_sig: - __free_page(page); - force_sig(SIGKILL, current); -+ pb_free(&pbc); -+} -+ -+static inline void get_stack_vma_params(struct mm_struct *mm, int exec_stack, -+ unsigned long stack_base, struct linux_binprm *bprm, -+ unsigned long *start, unsigned long *end, unsigned long *flags) -+{ -+#ifdef CONFIG_STACK_GROWSUP -+ *start = stack_base; -+ *end = PAGE_MASK & -+ (PAGE_SIZE - 1 + (unsigned long) bprm->p); -+#else -+ *start = PAGE_MASK & (unsigned long) bprm->p; -+ *end = STACK_TOP; -+#endif -+ /* Adjust stack execute permissions; explicitly enable -+ * for EXSTACK_ENABLE_X, disable for EXSTACK_DISABLE_X -+ * and leave alone (arch default) otherwise. 
*/ -+ if (unlikely(exec_stack == EXSTACK_ENABLE_X)) -+ *flags = VM_STACK_FLAGS | VM_EXEC; -+ else if (exec_stack == EXSTACK_DISABLE_X) -+ *flags = VM_STACK_FLAGS & ~VM_EXEC; -+ else -+ *flags = VM_STACK_FLAGS; -+ *flags |= mm->def_flags; - } - - int setup_arg_pages(struct linux_binprm *bprm, int executable_stack) -@@ -341,9 +380,13 @@ int setup_arg_pages(struct linux_binprm - unsigned long stack_base; - struct vm_area_struct *mpnt; - struct mm_struct *mm = current->mm; -- int i; -+ int i, ret; - long arg_size; - -+ unsigned long vm_start; -+ unsigned long vm_end; -+ unsigned long vm_flags; -+ - #ifdef CONFIG_STACK_GROWSUP - /* Move the argument and environment strings to the bottom of the - * stack space. -@@ -399,40 +442,32 @@ int setup_arg_pages(struct linux_binprm - bprm->loader += stack_base; - bprm->exec += stack_base; - -- mpnt = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); -+ get_stack_vma_params(mm, executable_stack, stack_base, bprm, -+ &vm_start, &vm_end, &vm_flags); -+ -+ ret = -ENOMEM; -+ if (ub_memory_charge(mm_ub(mm), vm_end - vm_start, vm_flags, -+ NULL, UB_SOFT)) -+ goto out; -+ mpnt = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL | __GFP_SOFT_UBC); - if (!mpnt) -- return -ENOMEM; -+ goto out_uncharge; - -- if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) { -- kmem_cache_free(vm_area_cachep, mpnt); -- return -ENOMEM; -- } -+ if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) -+ goto out_free; - - memset(mpnt, 0, sizeof(*mpnt)); - - down_write(&mm->mmap_sem); - { - mpnt->vm_mm = mm; --#ifdef CONFIG_STACK_GROWSUP -- mpnt->vm_start = stack_base; -- mpnt->vm_end = PAGE_MASK & -- (PAGE_SIZE - 1 + (unsigned long) bprm->p); --#else -- mpnt->vm_start = PAGE_MASK & (unsigned long) bprm->p; -- mpnt->vm_end = STACK_TOP; --#endif -- /* Adjust stack execute permissions; explicitly enable -- * for EXSTACK_ENABLE_X, disable for EXSTACK_DISABLE_X -- * and leave alone (arch default) otherwise. 
*/ -- if (unlikely(executable_stack == EXSTACK_ENABLE_X)) -- mpnt->vm_flags = VM_STACK_FLAGS | VM_EXEC; -- else if (executable_stack == EXSTACK_DISABLE_X) -- mpnt->vm_flags = VM_STACK_FLAGS & ~VM_EXEC; -- else -- mpnt->vm_flags = VM_STACK_FLAGS; -- mpnt->vm_flags |= mm->def_flags; -+ mpnt->vm_start = vm_start; -+ mpnt->vm_end = vm_end; -+ mpnt->vm_flags = vm_flags; -+ mpnt->vm_rss = 0; - mpnt->vm_page_prot = protection_map[mpnt->vm_flags & 0x7]; -- insert_vm_struct(mm, mpnt); -+ if ((ret = insert_vm_struct(mm, mpnt))) -+ goto out_up; - mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; - } - -@@ -447,6 +482,16 @@ int setup_arg_pages(struct linux_binprm - up_write(&mm->mmap_sem); - - return 0; -+ -+out_up: -+ up_write(&mm->mmap_sem); -+ vm_unacct_memory(arg_size >> PAGE_SHIFT); -+out_free: -+ kmem_cache_free(vm_area_cachep, mpnt); -+out_uncharge: -+ ub_memory_uncharge(mm_ub(mm), vm_end - vm_start, vm_flags, NULL); -+out: -+ return ret; - } - - EXPORT_SYMBOL(setup_arg_pages); -@@ -468,7 +513,7 @@ static inline void free_arg_pages(struct - - #endif /* CONFIG_MMU */ - --struct file *open_exec(const char *name) -+struct file *open_exec(const char *name, struct linux_binprm *bprm) - { - struct nameidata nd; - int err; -@@ -483,9 +528,13 @@ struct file *open_exec(const char *name) - file = ERR_PTR(-EACCES); - if (!(nd.mnt->mnt_flags & MNT_NOEXEC) && - S_ISREG(inode->i_mode)) { -- int err = permission(inode, MAY_EXEC, &nd); -- if (!err && !(inode->i_mode & 0111)) -- err = -EACCES; -+ int err; -+ if (bprm != NULL) { -+ bprm->perm.set = 0; -+ err = permission(inode, MAY_EXEC, &nd, -+ &bprm->perm); -+ } else -+ err = permission(inode, MAY_EXEC, &nd, NULL); - file = ERR_PTR(err); - if (!err) { - file = dentry_open(nd.dentry, nd.mnt, O_RDONLY); -@@ -524,35 +573,65 @@ int kernel_read(struct file *file, unsig - - EXPORT_SYMBOL(kernel_read); - --static int exec_mmap(struct mm_struct *mm) -+static int exec_mmap(struct linux_binprm *bprm) - { - struct task_struct *tsk; -- 
struct mm_struct * old_mm, *active_mm; -- -- /* Add it to the list of mm's */ -- spin_lock(&mmlist_lock); -- list_add(&mm->mmlist, &init_mm.mmlist); -- mmlist_nr++; -- spin_unlock(&mmlist_lock); -+ struct mm_struct *mm, *old_mm, *active_mm; -+ int ret; - - /* Notify parent that we're no longer interested in the old VM */ - tsk = current; - old_mm = current->mm; - mm_release(tsk, old_mm); - -+ if (old_mm) { -+ /* -+ * Make sure that if there is a core dump in progress -+ * for the old mm, we get out and die instead of going -+ * through with the exec. We must hold mmap_sem around -+ * checking core_waiters and changing tsk->mm. The -+ * core-inducing thread will increment core_waiters for -+ * each thread whose ->mm == old_mm. -+ */ -+ down_read(&old_mm->mmap_sem); -+ if (unlikely(old_mm->core_waiters)) { -+ up_read(&old_mm->mmap_sem); -+ return -EINTR; -+ } -+ } -+ -+ ret = 0; - task_lock(tsk); -+ mm = bprm->mm; - active_mm = tsk->active_mm; - tsk->mm = mm; - tsk->active_mm = mm; - activate_mm(active_mm, mm); - task_unlock(tsk); -+ -+ /* Add it to the list of mm's */ -+ spin_lock(&mmlist_lock); -+ list_add(&mm->mmlist, &init_mm.mmlist); -+ mmlist_nr++; -+ spin_unlock(&mmlist_lock); -+ bprm->mm = NULL; /* We're using it now */ -+ -+#ifdef CONFIG_VZ_GENCALLS -+ if (virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_EXECMMAP, -+ bprm) & NOTIFY_FAIL) { -+ /* similar to binfmt_elf */ -+ send_sig(SIGKILL, current, 0); -+ ret = -ENOMEM; -+ } -+#endif - if (old_mm) { -+ up_read(&old_mm->mmap_sem); - if (active_mm != old_mm) BUG(); - mmput(old_mm); -- return 0; -+ return ret; - } - mmdrop(active_mm); -- return 0; -+ return ret; - } - - /* -@@ -563,52 +642,26 @@ static int exec_mmap(struct mm_struct *m - */ - static inline int de_thread(struct task_struct *tsk) - { -- struct signal_struct *newsig, *oldsig = tsk->signal; -+ struct signal_struct *sig = tsk->signal; - struct sighand_struct *newsighand, *oldsighand = tsk->sighand; - spinlock_t *lock = &oldsighand->siglock; -+ 
struct task_struct *leader = NULL; - int count; - - /* - * If we don't share sighandlers, then we aren't sharing anything - * and we can just re-use it all. - */ -- if (atomic_read(&oldsighand->count) <= 1) -+ if (atomic_read(&oldsighand->count) <= 1) { -+ BUG_ON(atomic_read(&sig->count) != 1); -+ exit_itimers(sig); - return 0; -+ } - - newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL); - if (!newsighand) - return -ENOMEM; - -- spin_lock_init(&newsighand->siglock); -- atomic_set(&newsighand->count, 1); -- memcpy(newsighand->action, oldsighand->action, sizeof(newsighand->action)); -- -- /* -- * See if we need to allocate a new signal structure -- */ -- newsig = NULL; -- if (atomic_read(&oldsig->count) > 1) { -- newsig = kmem_cache_alloc(signal_cachep, GFP_KERNEL); -- if (!newsig) { -- kmem_cache_free(sighand_cachep, newsighand); -- return -ENOMEM; -- } -- atomic_set(&newsig->count, 1); -- newsig->group_exit = 0; -- newsig->group_exit_code = 0; -- newsig->group_exit_task = NULL; -- newsig->group_stop_count = 0; -- newsig->curr_target = NULL; -- init_sigpending(&newsig->shared_pending); -- INIT_LIST_HEAD(&newsig->posix_timers); -- -- newsig->tty = oldsig->tty; -- newsig->pgrp = oldsig->pgrp; -- newsig->session = oldsig->session; -- newsig->leader = oldsig->leader; -- newsig->tty_old_pgrp = oldsig->tty_old_pgrp; -- } -- - if (thread_group_empty(current)) - goto no_thread_group; - -@@ -618,7 +671,7 @@ static inline int de_thread(struct task_ - */ - read_lock(&tasklist_lock); - spin_lock_irq(lock); -- if (oldsig->group_exit) { -+ if (sig->group_exit) { - /* - * Another group action in progress, just - * return so that the signal is processed. 
-@@ -626,11 +679,9 @@ static inline int de_thread(struct task_ - spin_unlock_irq(lock); - read_unlock(&tasklist_lock); - kmem_cache_free(sighand_cachep, newsighand); -- if (newsig) -- kmem_cache_free(signal_cachep, newsig); - return -EAGAIN; - } -- oldsig->group_exit = 1; -+ sig->group_exit = 1; - zap_other_threads(current); - read_unlock(&tasklist_lock); - -@@ -640,14 +691,16 @@ static inline int de_thread(struct task_ - count = 2; - if (current->pid == current->tgid) - count = 1; -- while (atomic_read(&oldsig->count) > count) { -- oldsig->group_exit_task = current; -- oldsig->notify_count = count; -+ while (atomic_read(&sig->count) > count) { -+ sig->group_exit_task = current; -+ sig->notify_count = count; - __set_current_state(TASK_UNINTERRUPTIBLE); - spin_unlock_irq(lock); - schedule(); - spin_lock_irq(lock); - } -+ sig->group_exit_task = NULL; -+ sig->notify_count = 0; - spin_unlock_irq(lock); - - /* -@@ -656,22 +709,23 @@ static inline int de_thread(struct task_ - * and to assume its PID: - */ - if (current->pid != current->tgid) { -- struct task_struct *leader = current->group_leader, *parent; -- struct dentry *proc_dentry1, *proc_dentry2; -- unsigned long state, ptrace; -+ struct task_struct *parent; -+ struct dentry *proc_dentry1[2], *proc_dentry2[2]; -+ unsigned long exit_state, ptrace; - - /* - * Wait for the thread group leader to be a zombie. - * It should already be zombie at this point, most - * of the time. 
- */
--	while (leader->state != TASK_ZOMBIE)
-+		leader = current->group_leader;
-+		while (leader->exit_state != EXIT_ZOMBIE)
- 			yield();
- 
- 		spin_lock(&leader->proc_lock);
- 		spin_lock(&current->proc_lock);
--		proc_dentry1 = proc_pid_unhash(current);
--		proc_dentry2 = proc_pid_unhash(leader);
-+		proc_pid_unhash(current, proc_dentry1);
-+		proc_pid_unhash(leader, proc_dentry2);
- 		write_lock_irq(&tasklist_lock);
- 
- 		if (leader->tgid != current->tgid)
-@@ -709,7 +763,7 @@ static inline int de_thread(struct task_
- 		list_del(&current->tasks);
- 		list_add_tail(&current->tasks, &init_task.tasks);
- 		current->exit_signal = SIGCHLD;
--		state = leader->state;
-+		exit_state = leader->exit_state;
- 
- 		write_unlock_irq(&tasklist_lock);
- 		spin_unlock(&leader->proc_lock);
-@@ -717,37 +771,53 @@ static inline int de_thread(struct task_
- 		proc_pid_flush(proc_dentry1);
- 		proc_pid_flush(proc_dentry2);
- 
--		if (state != TASK_ZOMBIE)
-+		if (exit_state != EXIT_ZOMBIE)
- 			BUG();
--		release_task(leader);
- 	}
- 
-+	/*
-+	 * Now there are really no other threads at all,
-+	 * so it's safe to stop telling them to kill themselves.
-+	 */
-+	sig->group_exit = 0;
-+
- no_thread_group:
-+	exit_itimers(sig);
-+	if (leader)
-+		release_task(leader);
-+	BUG_ON(atomic_read(&sig->count) != 1);
- 
--	write_lock_irq(&tasklist_lock);
--	spin_lock(&oldsighand->siglock);
--	spin_lock(&newsighand->siglock);
--
--	if (current == oldsig->curr_target)
--		oldsig->curr_target = next_thread(current);
--	if (newsig)
--		current->signal = newsig;
--	current->sighand = newsighand;
--	init_sigpending(&current->pending);
--	recalc_sigpending();
--
--	spin_unlock(&newsighand->siglock);
--	spin_unlock(&oldsighand->siglock);
--	write_unlock_irq(&tasklist_lock);
-+	if (atomic_read(&oldsighand->count) == 1) {
-+		/*
-+		 * Now that we nuked the rest of the thread group,
-+		 * it turns out we are not sharing sighand any more either.
-+		 * So we can just keep it.
-+ */ -+ kmem_cache_free(sighand_cachep, newsighand); -+ } else { -+ /* -+ * Move our state over to newsighand and switch it in. -+ */ -+ spin_lock_init(&newsighand->siglock); -+ atomic_set(&newsighand->count, 1); -+ memcpy(newsighand->action, oldsighand->action, -+ sizeof(newsighand->action)); - -- if (newsig && atomic_dec_and_test(&oldsig->count)) -- kmem_cache_free(signal_cachep, oldsig); -+ write_lock_irq(&tasklist_lock); -+ spin_lock(&oldsighand->siglock); -+ spin_lock(&newsighand->siglock); - -- if (atomic_dec_and_test(&oldsighand->count)) -- kmem_cache_free(sighand_cachep, oldsighand); -+ current->sighand = newsighand; -+ recalc_sigpending(); -+ -+ spin_unlock(&newsighand->siglock); -+ spin_unlock(&oldsighand->siglock); -+ write_unlock_irq(&tasklist_lock); -+ -+ if (atomic_dec_and_test(&oldsighand->count)) -+ kmem_cache_free(sighand_cachep, oldsighand); -+ } - -- if (!thread_group_empty(current)) -- BUG(); - if (current->tgid != current->pid) - BUG(); - return 0; -@@ -786,11 +856,27 @@ static inline void flush_old_files(struc - spin_unlock(&files->file_lock); - } - -+void get_task_comm(char *buf, struct task_struct *tsk) -+{ -+ /* buf must be at least sizeof(tsk->comm) in size */ -+ task_lock(tsk); -+ strncpy(buf, tsk->comm, sizeof(tsk->comm)); -+ task_unlock(tsk); -+} -+ -+void set_task_comm(struct task_struct *tsk, char *buf) -+{ -+ task_lock(tsk); -+ strlcpy(tsk->comm, buf, sizeof(tsk->comm)); -+ task_unlock(tsk); -+} -+ - int flush_old_exec(struct linux_binprm * bprm) - { - char * name; - int i, ch, retval; - struct files_struct *files; -+ char tcomm[sizeof(current->comm)]; - - /* - * Make sure we have a private signal table and that -@@ -812,12 +898,10 @@ int flush_old_exec(struct linux_binprm * - /* - * Release all of the old mmap stuff - */ -- retval = exec_mmap(bprm->mm); -+ retval = exec_mmap(bprm); - if (retval) - goto mmap_failed; - -- bprm->mm = NULL; /* We're using it now */ -- - /* This is the point of no return */ - steal_locks(files); - 
put_files_struct(files); -@@ -831,17 +915,19 @@ int flush_old_exec(struct linux_binprm * - if (ch == '/') - i = 0; - else -- if (i < 15) -- current->comm[i++] = ch; -+ if (i < (sizeof(tcomm) - 1)) -+ tcomm[i++] = ch; - } -- current->comm[i] = '\0'; -+ tcomm[i] = '\0'; -+ set_task_comm(current, tcomm); - - flush_thread(); - - if (bprm->e_uid != current->euid || bprm->e_gid != current->egid || -- permission(bprm->file->f_dentry->d_inode,MAY_READ, NULL) || -+ permission(bprm->file->f_dentry->d_inode, MAY_READ, NULL, NULL) || - (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) - current->mm->dumpable = 0; -+ current->mm->vps_dumpable = 1; - - /* An exec changes our domain. We are no longer part of the thread - group */ -@@ -872,13 +958,6 @@ int prepare_binprm(struct linux_binprm * - struct inode * inode = bprm->file->f_dentry->d_inode; - int retval; - -- mode = inode->i_mode; -- /* -- * Check execute perms again - if the caller has CAP_DAC_OVERRIDE, -- * vfs_permission lets a non-executable through -- */ -- if (!(mode & 0111)) /* with at least _one_ execute bit set */ -- return -EACCES; - if (bprm->file->f_op == NULL) - return -EACCES; - -@@ -886,10 +965,24 @@ int prepare_binprm(struct linux_binprm * - bprm->e_gid = current->egid; - - if(!(bprm->file->f_vfsmnt->mnt_flags & MNT_NOSUID)) { -+ if (!bprm->perm.set) { -+ /* -+ * This piece of code creates a time window between -+ * MAY_EXEC permission check and setuid/setgid -+ * operations and may be considered as a security hole. -+ * This code is here for compatibility reasons, -+ * if the filesystem is unable to return info now. -+ */ -+ bprm->perm.mode = inode->i_mode; -+ bprm->perm.uid = inode->i_uid; -+ bprm->perm.gid = inode->i_gid; -+ } -+ mode = bprm->perm.mode; -+ - /* Set-uid? */ - if (mode & S_ISUID) { - current->personality &= ~PER_CLEAR_ON_SETID; -- bprm->e_uid = inode->i_uid; -+ bprm->e_uid = bprm->perm.uid; - } - - /* Set-gid? 
*/
-@@ -900,7 +993,7 @@ int prepare_binprm(struct linux_binprm *
- 		 */
- 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
- 			current->personality &= ~PER_CLEAR_ON_SETID;
--			bprm->e_gid = inode->i_gid;
-+			bprm->e_gid = bprm->perm.gid;
- 		}
- 	}
- 
-@@ -993,7 +1086,7 @@ int search_binary_handler(struct linux_b
- 
- 		loader = PAGE_SIZE*MAX_ARG_PAGES-sizeof(void *);
- 
--		file = open_exec("/sbin/loader");
-+		file = open_exec("/sbin/loader", bprm);
- 		retval = PTR_ERR(file);
- 		if (IS_ERR(file))
- 			return retval;
-@@ -1079,7 +1172,11 @@ int do_execve(char * filename,
- 	int retval;
- 	int i;
- 
--	file = open_exec(filename);
-+	retval = virtinfo_gencall(VIRTINFO_DOEXECVE, NULL);
-+	if (retval)
-+		return retval;
-+
-+	file = open_exec(filename, &bprm);
- 
- 	retval = PTR_ERR(file);
- 	if (IS_ERR(file))
-@@ -1222,7 +1319,7 @@ void format_corename(char *corename, con
- 		case 'p':
- 			pid_in_pattern = 1;
- 			rc = snprintf(out_ptr, out_end - out_ptr,
--				      "%d", current->tgid);
-+				      "%d", virt_tgid(current));
- 			if (rc > out_end - out_ptr)
- 				goto out;
- 			out_ptr += rc;
-@@ -1266,7 +1363,7 @@ void format_corename(char *corename, con
- 		case 'h':
- 			down_read(&uts_sem);
- 			rc = snprintf(out_ptr, out_end - out_ptr,
--				      "%s", system_utsname.nodename);
-+				      "%s", ve_utsname.nodename);
- 			up_read(&uts_sem);
- 			if (rc > out_end - out_ptr)
- 				goto out;
-@@ -1294,7 +1391,7 @@ void format_corename(char *corename, con
- 	if (!pid_in_pattern
- 	    && (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)) {
- 		rc = snprintf(out_ptr, out_end - out_ptr,
--			      ".%d", current->tgid);
-+			      ".%d", virt_tgid(current));
- 		if (rc > out_end - out_ptr)
- 			goto out;
- 		out_ptr += rc;
-@@ -1308,6 +1405,7 @@ static void zap_threads (struct mm_struc
- 	struct task_struct *g, *p;
- 	struct task_struct *tsk = current;
- 	struct completion *vfork_done = tsk->vfork_done;
-+	int traced = 0;
- 
- 	/*
- 	 * Make sure nobody is waiting for us to release the VM,
-@@ -1319,14 +1417,34 @@ static void zap_threads (struct mm_struc
- 	}
- 
- 
read_lock(&tasklist_lock); -- do_each_thread(g,p) -+ do_each_thread_ve(g,p) - if (mm == p->mm && p != tsk) { - force_sig_specific(SIGKILL, p); - mm->core_waiters++; -+ if (unlikely(p->ptrace) && -+ unlikely(p->parent->mm == mm)) -+ traced = 1; - } -- while_each_thread(g,p); -+ while_each_thread_ve(g,p); - - read_unlock(&tasklist_lock); -+ -+ if (unlikely(traced)) { -+ /* -+ * We are zapping a thread and the thread it ptraces. -+ * If the tracee went into a ptrace stop for exit tracing, -+ * we could deadlock since the tracer is waiting for this -+ * coredump to finish. Detach them so they can both die. -+ */ -+ write_lock_irq(&tasklist_lock); -+ do_each_thread_ve(g,p) { -+ if (mm == p->mm && p != tsk && -+ p->ptrace && p->parent->mm == mm) { -+ __ptrace_detach(p, 0); -+ } -+ } while_each_thread_ve(g,p); -+ write_unlock_irq(&tasklist_lock); -+ } - } - - static void coredump_wait(struct mm_struct *mm) -@@ -1362,7 +1480,8 @@ int do_coredump(long signr, int exit_cod - if (!binfmt || !binfmt->core_dump) - goto fail; - down_write(&mm->mmap_sem); -- if (!mm->dumpable) { -+ if (!mm->dumpable || -+ (!mm->vps_dumpable && !ve_is_super(get_exec_env()))) { - up_write(&mm->mmap_sem); - goto fail; - } -diff -uprN linux-2.6.8.1.orig/fs/ext2/acl.c linux-2.6.8.1-ve022stab078/fs/ext2/acl.c ---- linux-2.6.8.1.orig/fs/ext2/acl.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/acl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -286,7 +286,7 @@ ext2_set_acl(struct inode *inode, int ty - * inode->i_sem: don't care - */ - int --ext2_permission(struct inode *inode, int mask, struct nameidata *nd) -+__ext2_permission(struct inode *inode, int mask) - { - int mode = inode->i_mode; - -@@ -336,6 +336,29 @@ check_capabilities: - return -EACCES; - } - -+int -+ext2_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) -+{ -+ int ret; -+ -+ if (exec_perm != NULL) -+ down(&inode->i_sem); -+ -+ ret = __ext2_permission(inode, mask); -+ -+ 
if (exec_perm != NULL) { -+ if (!ret) { -+ exec_perm->set = 1; -+ exec_perm->mode = inode->i_mode; -+ exec_perm->uid = inode->i_uid; -+ exec_perm->gid = inode->i_gid; -+ } -+ up(&inode->i_sem); -+ } -+ return ret; -+} -+ - /* - * Initialize the ACLs of a new inode. Called from ext2_new_inode. - * -diff -uprN linux-2.6.8.1.orig/fs/ext2/acl.h linux-2.6.8.1-ve022stab078/fs/ext2/acl.h ---- linux-2.6.8.1.orig/fs/ext2/acl.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/acl.h 2006-05-11 13:05:35.000000000 +0400 -@@ -10,18 +10,18 @@ - #define EXT2_ACL_MAX_ENTRIES 32 - - typedef struct { -- __u16 e_tag; -- __u16 e_perm; -- __u32 e_id; -+ __le16 e_tag; -+ __le16 e_perm; -+ __le32 e_id; - } ext2_acl_entry; - - typedef struct { -- __u16 e_tag; -- __u16 e_perm; -+ __le16 e_tag; -+ __le16 e_perm; - } ext2_acl_entry_short; - - typedef struct { -- __u32 a_version; -+ __le32 a_version; - } ext2_acl_header; - - static inline size_t ext2_acl_size(int count) -@@ -59,7 +59,8 @@ static inline int ext2_acl_count(size_t - #define EXT2_ACL_NOT_CACHED ((void *)-1) - - /* acl.c */ --extern int ext2_permission (struct inode *, int, struct nameidata *); -+extern int ext2_permission (struct inode *, int, struct nameidata *, -+ struct exec_perm *); - extern int ext2_acl_chmod (struct inode *); - extern int ext2_init_acl (struct inode *, struct inode *); - -diff -uprN linux-2.6.8.1.orig/fs/ext2/balloc.c linux-2.6.8.1-ve022stab078/fs/ext2/balloc.c ---- linux-2.6.8.1.orig/fs/ext2/balloc.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/balloc.c 2006-05-11 13:05:31.000000000 +0400 -@@ -88,8 +88,8 @@ read_block_bitmap(struct super_block *sb - if (!bh) - ext2_error (sb, "read_block_bitmap", - "Cannot read block bitmap - " -- "block_group = %d, block_bitmap = %lu", -- block_group, (unsigned long) desc->bg_block_bitmap); -+ "block_group = %d, block_bitmap = %u", -+ block_group, le32_to_cpu(desc->bg_block_bitmap)); - error_out: - return bh; - } 
-diff -uprN linux-2.6.8.1.orig/fs/ext2/dir.c linux-2.6.8.1-ve022stab078/fs/ext2/dir.c ---- linux-2.6.8.1.orig/fs/ext2/dir.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/dir.c 2006-05-11 13:05:33.000000000 +0400 -@@ -251,7 +251,7 @@ ext2_readdir (struct file * filp, void * - loff_t pos = filp->f_pos; - struct inode *inode = filp->f_dentry->d_inode; - struct super_block *sb = inode->i_sb; -- unsigned offset = pos & ~PAGE_CACHE_MASK; -+ unsigned int offset = pos & ~PAGE_CACHE_MASK; - unsigned long n = pos >> PAGE_CACHE_SHIFT; - unsigned long npages = dir_pages(inode); - unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); -@@ -270,8 +270,13 @@ ext2_readdir (struct file * filp, void * - ext2_dirent *de; - struct page *page = ext2_get_page(inode, n); - -- if (IS_ERR(page)) -+ if (IS_ERR(page)) { -+ ext2_error(sb, __FUNCTION__, -+ "bad page in #%lu", -+ inode->i_ino); -+ filp->f_pos += PAGE_CACHE_SIZE - offset; - continue; -+ } - kaddr = page_address(page); - if (need_revalidate) { - offset = ext2_validate_entry(kaddr, offset, chunk_mask); -@@ -303,6 +308,7 @@ ext2_readdir (struct file * filp, void * - goto success; - } - } -+ filp->f_pos += le16_to_cpu(de->rec_len); - } - ext2_put_page(page); - } -@@ -310,7 +316,6 @@ ext2_readdir (struct file * filp, void * - success: - ret = 0; - done: -- filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset; - filp->f_version = inode->i_version; - return ret; - } -@@ -420,7 +425,7 @@ void ext2_set_link(struct inode *dir, st - ext2_set_de_type (de, inode); - err = ext2_commit_chunk(page, from, to); - ext2_put_page(page); -- dir->i_mtime = dir->i_ctime = CURRENT_TIME; -+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC; - EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL; - mark_inode_dirty(dir); - } -@@ -510,7 +515,7 @@ got_it: - de->inode = cpu_to_le32(inode->i_ino); - ext2_set_de_type (de, inode); - err = ext2_commit_chunk(page, from, to); -- dir->i_mtime = dir->i_ctime = CURRENT_TIME; -+ dir->i_mtime = dir->i_ctime = 
CURRENT_TIME_SEC; - EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL; - mark_inode_dirty(dir); - /* OFFSET_CACHE */ -@@ -558,7 +563,7 @@ int ext2_delete_entry (struct ext2_dir_e - pde->rec_len = cpu_to_le16(to-from); - dir->inode = 0; - err = ext2_commit_chunk(page, from, to); -- inode->i_ctime = inode->i_mtime = CURRENT_TIME; -+ inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC; - EXT2_I(inode)->i_flags &= ~EXT2_BTREE_FL; - mark_inode_dirty(inode); - out: -@@ -586,6 +591,7 @@ int ext2_make_empty(struct inode *inode, - goto fail; - } - kaddr = kmap_atomic(page, KM_USER0); -+ memset(kaddr, 0, chunk_size); - de = (struct ext2_dir_entry_2 *)kaddr; - de->name_len = 1; - de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1)); -diff -uprN linux-2.6.8.1.orig/fs/ext2/ext2.h linux-2.6.8.1-ve022stab078/fs/ext2/ext2.h ---- linux-2.6.8.1.orig/fs/ext2/ext2.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/ext2.h 2006-05-11 13:05:35.000000000 +0400 -@@ -5,7 +5,7 @@ - * second extended file system inode data in memory - */ - struct ext2_inode_info { -- __u32 i_data[15]; -+ __le32 i_data[15]; - __u32 i_flags; - __u32 i_faddr; - __u8 i_frag_no; -@@ -115,7 +115,7 @@ extern unsigned long ext2_count_free (st - - /* inode.c */ - extern void ext2_read_inode (struct inode *); --extern void ext2_write_inode (struct inode *, int); -+extern int ext2_write_inode (struct inode *, int); - extern void ext2_put_inode (struct inode *); - extern void ext2_delete_inode (struct inode *); - extern int ext2_sync_inode (struct inode *); -@@ -131,9 +131,6 @@ extern int ext2_ioctl (struct inode *, s - /* super.c */ - extern void ext2_error (struct super_block *, const char *, const char *, ...) - __attribute__ ((format (printf, 3, 4))); --extern NORET_TYPE void ext2_panic (struct super_block *, const char *, -- const char *, ...) -- __attribute__ ((NORET_AND format (printf, 3, 4))); - extern void ext2_warning (struct super_block *, const char *, const char *, ...) 
- __attribute__ ((format (printf, 3, 4))); - extern void ext2_update_dynamic_rev (struct super_block *sb); -diff -uprN linux-2.6.8.1.orig/fs/ext2/ialloc.c linux-2.6.8.1-ve022stab078/fs/ext2/ialloc.c ---- linux-2.6.8.1.orig/fs/ext2/ialloc.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/ialloc.c 2006-05-11 13:05:32.000000000 +0400 -@@ -57,8 +57,8 @@ read_inode_bitmap(struct super_block * s - if (!bh) - ext2_error(sb, "read_inode_bitmap", - "Cannot read inode bitmap - " -- "block_group = %lu, inode_bitmap = %lu", -- block_group, (unsigned long) desc->bg_inode_bitmap); -+ "block_group = %lu, inode_bitmap = %u", -+ block_group, le32_to_cpu(desc->bg_inode_bitmap)); - error_out: - return bh; - } -@@ -577,7 +577,7 @@ got: - inode->i_ino = ino; - inode->i_blksize = PAGE_SIZE; /* This is the optimal IO size (for stat), not the fs block size */ - inode->i_blocks = 0; -- inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; -+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC; - memset(ei->i_data, 0, sizeof(ei->i_data)); - ei->i_flags = EXT2_I(dir)->i_flags & ~EXT2_BTREE_FL; - if (S_ISLNK(mode)) -diff -uprN linux-2.6.8.1.orig/fs/ext2/inode.c linux-2.6.8.1-ve022stab078/fs/ext2/inode.c ---- linux-2.6.8.1.orig/fs/ext2/inode.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -142,12 +142,12 @@ static int ext2_alloc_block (struct inod - } - - typedef struct { -- u32 *p; -- u32 key; -+ __le32 *p; -+ __le32 key; - struct buffer_head *bh; - } Indirect; - --static inline void add_chain(Indirect *p, struct buffer_head *bh, u32 *v) -+static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v) - { - p->key = *(p->p = v); - p->bh = bh; -@@ -280,7 +280,7 @@ static Indirect *ext2_get_branch(struct - read_lock(&EXT2_I(inode)->i_meta_lock); - if (!verify_chain(chain, p)) - goto changed; -- add_chain(++p, bh, (u32*)bh->b_data + *++offsets); -+ 
add_chain(++p, bh, (__le32*)bh->b_data + *++offsets); - read_unlock(&EXT2_I(inode)->i_meta_lock); - if (!p->key) - goto no_block; -@@ -321,8 +321,8 @@ no_block: - static unsigned long ext2_find_near(struct inode *inode, Indirect *ind) - { - struct ext2_inode_info *ei = EXT2_I(inode); -- u32 *start = ind->bh ? (u32*) ind->bh->b_data : ei->i_data; -- u32 *p; -+ __le32 *start = ind->bh ? (__le32 *) ind->bh->b_data : ei->i_data; -+ __le32 *p; - unsigned long bg_start; - unsigned long colour; - -@@ -440,7 +440,7 @@ static int ext2_alloc_branch(struct inod - lock_buffer(bh); - memset(bh->b_data, 0, blocksize); - branch[n].bh = bh; -- branch[n].p = (u32*) bh->b_data + offsets[n]; -+ branch[n].p = (__le32 *) bh->b_data + offsets[n]; - *branch[n].p = branch[n].key; - set_buffer_uptodate(bh); - unlock_buffer(bh); -@@ -506,7 +506,7 @@ static inline int ext2_splice_branch(str - - /* We are done with atomic stuff, now do the rest of housekeeping */ - -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - - /* had we spliced it onto indirect block? */ - if (where->bh) -@@ -702,7 +702,7 @@ struct address_space_operations ext2_nob - * or memcmp with zero_page, whatever is better for particular architecture. - * Linus? - */ --static inline int all_zeroes(u32 *p, u32 *q) -+static inline int all_zeroes(__le32 *p, __le32 *q) - { - while (p < q) - if (*p++) -@@ -748,7 +748,7 @@ static Indirect *ext2_find_shared(struct - int depth, - int offsets[4], - Indirect chain[4], -- u32 *top) -+ __le32 *top) - { - Indirect *partial, *p; - int k, err; -@@ -768,7 +768,7 @@ static Indirect *ext2_find_shared(struct - write_unlock(&EXT2_I(inode)->i_meta_lock); - goto no_top; - } -- for (p=partial; p>chain && all_zeroes((u32*)p->bh->b_data,p->p); p--) -+ for (p=partial; p>chain && all_zeroes((__le32*)p->bh->b_data,p->p); p--) - ; - /* - * OK, we've found the last block that must survive. 
The rest of our -@@ -803,7 +803,7 @@ no_top: - * stored as little-endian 32-bit) and updating @inode->i_blocks - * appropriately. - */ --static inline void ext2_free_data(struct inode *inode, u32 *p, u32 *q) -+static inline void ext2_free_data(struct inode *inode, __le32 *p, __le32 *q) - { - unsigned long block_to_free = 0, count = 0; - unsigned long nr; -@@ -843,7 +843,7 @@ static inline void ext2_free_data(struct - * stored as little-endian 32-bit) and updating @inode->i_blocks - * appropriately. - */ --static void ext2_free_branches(struct inode *inode, u32 *p, u32 *q, int depth) -+static void ext2_free_branches(struct inode *inode, __le32 *p, __le32 *q, int depth) - { - struct buffer_head * bh; - unsigned long nr; -@@ -867,8 +867,8 @@ static void ext2_free_branches(struct in - continue; - } - ext2_free_branches(inode, -- (u32*)bh->b_data, -- (u32*)bh->b_data + addr_per_block, -+ (__le32*)bh->b_data, -+ (__le32*)bh->b_data + addr_per_block, - depth); - bforget(bh); - ext2_free_blocks(inode, nr, 1); -@@ -880,12 +880,12 @@ static void ext2_free_branches(struct in - - void ext2_truncate (struct inode * inode) - { -- u32 *i_data = EXT2_I(inode)->i_data; -+ __le32 *i_data = EXT2_I(inode)->i_data; - int addr_per_block = EXT2_ADDR_PER_BLOCK(inode->i_sb); - int offsets[4]; - Indirect chain[4]; - Indirect *partial; -- int nr = 0; -+ __le32 nr = 0; - int n; - long iblock; - unsigned blocksize; -@@ -933,7 +933,7 @@ void ext2_truncate (struct inode * inode - while (partial > chain) { - ext2_free_branches(inode, - partial->p + 1, -- (u32*)partial->bh->b_data + addr_per_block, -+ (__le32*)partial->bh->b_data+addr_per_block, - (chain+n-1) - partial); - mark_buffer_dirty_inode(partial->bh, inode); - brelse (partial->bh); -@@ -966,7 +966,7 @@ do_indirects: - case EXT2_TIND_BLOCK: - ; - } -- inode->i_mtime = inode->i_ctime = CURRENT_TIME; -+ inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC; - if (inode_needs_sync(inode)) { - sync_mapping_buffers(inode->i_mapping); - 
ext2_sync_inode (inode); -@@ -1248,9 +1248,9 @@ static int ext2_update_inode(struct inod - return err; - } - --void ext2_write_inode(struct inode *inode, int wait) -+int ext2_write_inode(struct inode *inode, int wait) - { -- ext2_update_inode(inode, wait); -+ return ext2_update_inode(inode, wait); - } - - int ext2_sync_inode(struct inode *inode) -diff -uprN linux-2.6.8.1.orig/fs/ext2/ioctl.c linux-2.6.8.1-ve022stab078/fs/ext2/ioctl.c ---- linux-2.6.8.1.orig/fs/ext2/ioctl.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/ioctl.c 2006-05-11 13:05:32.000000000 +0400 -@@ -59,7 +59,7 @@ int ext2_ioctl (struct inode * inode, st - ei->i_flags = flags; - - ext2_set_inode_flags(inode); -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - mark_inode_dirty(inode); - return 0; - } -@@ -72,7 +72,7 @@ int ext2_ioctl (struct inode * inode, st - return -EROFS; - if (get_user(inode->i_generation, (int __user *) arg)) - return -EFAULT; -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - mark_inode_dirty(inode); - return 0; - default: -diff -uprN linux-2.6.8.1.orig/fs/ext2/namei.c linux-2.6.8.1-ve022stab078/fs/ext2/namei.c ---- linux-2.6.8.1.orig/fs/ext2/namei.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/namei.c 2006-05-11 13:05:43.000000000 +0400 -@@ -30,6 +30,7 @@ - */ - - #include <linux/pagemap.h> -+#include <linux/quotaops.h> - #include "ext2.h" - #include "xattr.h" - #include "acl.h" -@@ -181,7 +182,7 @@ static int ext2_symlink (struct inode * - inode->i_mapping->a_ops = &ext2_nobh_aops; - else - inode->i_mapping->a_ops = &ext2_aops; -- err = page_symlink(inode, symname, l); -+ err = page_symlink(inode, symname, l, GFP_KERNEL); - if (err) - goto out_fail; - } else { -@@ -210,7 +211,7 @@ static int ext2_link (struct dentry * ol - if (inode->i_nlink >= EXT2_LINK_MAX) - return -EMLINK; - -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - 
ext2_inc_count(inode); - atomic_inc(&inode->i_count); - -@@ -269,6 +270,8 @@ static int ext2_unlink(struct inode * di - struct page * page; - int err = -ENOENT; - -+ DQUOT_INIT(inode); -+ - de = ext2_find_entry (dir, dentry, &page); - if (!de) - goto out; -@@ -311,6 +314,9 @@ static int ext2_rename (struct inode * o - struct ext2_dir_entry_2 * old_de; - int err = -ENOENT; - -+ if (new_inode) -+ DQUOT_INIT(new_inode); -+ - old_de = ext2_find_entry (old_dir, old_dentry, &old_page); - if (!old_de) - goto out; -@@ -336,7 +342,7 @@ static int ext2_rename (struct inode * o - goto out_dir; - ext2_inc_count(old_inode); - ext2_set_link(new_dir, new_de, new_page, old_inode); -- new_inode->i_ctime = CURRENT_TIME; -+ new_inode->i_ctime = CURRENT_TIME_SEC; - if (dir_de) - new_inode->i_nlink--; - ext2_dec_count(new_inode); -@@ -361,7 +367,7 @@ static int ext2_rename (struct inode * o - * rename. - * ext2_dec_count() will mark the inode dirty. - */ -- old_inode->i_ctime = CURRENT_TIME; -+ old_inode->i_ctime = CURRENT_TIME_SEC; - - ext2_delete_entry (old_de, old_page); - ext2_dec_count(old_inode); -diff -uprN linux-2.6.8.1.orig/fs/ext2/super.c linux-2.6.8.1-ve022stab078/fs/ext2/super.c ---- linux-2.6.8.1.orig/fs/ext2/super.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/super.c 2006-05-11 13:05:40.000000000 +0400 -@@ -37,8 +37,6 @@ static void ext2_sync_super(struct super - static int ext2_remount (struct super_block * sb, int * flags, char * data); - static int ext2_statfs (struct super_block * sb, struct kstatfs * buf); - --static char error_buf[1024]; -- - void ext2_error (struct super_block * sb, const char * function, - const char * fmt, ...) 
- { -@@ -52,51 +50,32 @@ void ext2_error (struct super_block * sb - cpu_to_le16(le16_to_cpu(es->s_state) | EXT2_ERROR_FS); - ext2_sync_super(sb, es); - } -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- if (test_opt (sb, ERRORS_PANIC)) -- panic ("EXT2-fs panic (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -- printk (KERN_CRIT "EXT2-fs error (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -- if (test_opt (sb, ERRORS_RO)) { -- printk ("Remounting filesystem read-only\n"); -+ -+ va_start(args, fmt); -+ printk(KERN_CRIT "EXT2-fs error (device %s): %s: ",sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ va_end(args); -+ -+ if (test_opt(sb, ERRORS_PANIC)) -+ panic("EXT2-fs panic from previous error\n"); -+ if (test_opt(sb, ERRORS_RO)) { -+ printk("Remounting filesystem read-only\n"); - sb->s_flags |= MS_RDONLY; - } - } - --NORET_TYPE void ext2_panic (struct super_block * sb, const char * function, -- const char * fmt, ...) --{ -- va_list args; -- struct ext2_sb_info *sbi = EXT2_SB(sb); -- -- if (!(sb->s_flags & MS_RDONLY)) { -- sbi->s_mount_state |= EXT2_ERROR_FS; -- sbi->s_es->s_state = -- cpu_to_le16(le16_to_cpu(sbi->s_es->s_state) | EXT2_ERROR_FS); -- mark_buffer_dirty(sbi->s_sbh); -- sb->s_dirt = 1; -- } -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- sb->s_flags |= MS_RDONLY; -- panic ("EXT2-fs panic (device %s): %s: %s\n", -- sb->s_id, function, error_buf); --} -- - void ext2_warning (struct super_block * sb, const char * function, - const char * fmt, ...) 
- { - va_list args; - -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- printk (KERN_WARNING "EXT2-fs warning (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ va_start(args, fmt); -+ printk(KERN_WARNING "EXT2-fs warning (device %s): %s: ", -+ sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ va_end(args); - } - - void ext2_update_dynamic_rev(struct super_block *sb) -@@ -134,7 +113,7 @@ static void ext2_put_super (struct super - if (!(sb->s_flags & MS_RDONLY)) { - struct ext2_super_block *es = sbi->s_es; - -- es->s_state = le16_to_cpu(sbi->s_mount_state); -+ es->s_state = cpu_to_le16(sbi->s_mount_state); - ext2_sync_super(sb, es); - } - db_count = sbi->s_gdb_count; -@@ -143,6 +122,9 @@ static void ext2_put_super (struct super - brelse (sbi->s_group_desc[i]); - kfree(sbi->s_group_desc); - kfree(sbi->s_debts); -+ percpu_counter_destroy(&sbi->s_freeblocks_counter); -+ percpu_counter_destroy(&sbi->s_freeinodes_counter); -+ percpu_counter_destroy(&sbi->s_dirs_counter); - brelse (sbi->s_sbh); - sb->s_fs_info = NULL; - kfree(sbi); -@@ -189,7 +171,7 @@ static int init_inodecache(void) - { - ext2_inode_cachep = kmem_cache_create("ext2_inode_cache", - sizeof(struct ext2_inode_info), -- 0, SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, -+ 0, SLAB_RECLAIM_ACCOUNT, - init_once, NULL); - if (ext2_inode_cachep == NULL) - return -ENOMEM; -@@ -449,8 +431,8 @@ static int ext2_setup_super (struct supe - (le32_to_cpu(es->s_lastcheck) + le32_to_cpu(es->s_checkinterval) <= get_seconds())) - printk ("EXT2-fs warning: checktime reached, " - "running e2fsck is recommended\n"); -- if (!(__s16) le16_to_cpu(es->s_max_mnt_count)) -- es->s_max_mnt_count = (__s16) cpu_to_le16(EXT2_DFL_MAX_MNT_COUNT); -+ if (!le16_to_cpu(es->s_max_mnt_count)) -+ es->s_max_mnt_count = cpu_to_le16(EXT2_DFL_MAX_MNT_COUNT); - es->s_mnt_count=cpu_to_le16(le16_to_cpu(es->s_mnt_count) + 1); - ext2_write_super(sb); - if (test_opt (sb, DEBUG)) -@@ -529,12 +511,18 @@ 
static int ext2_check_descriptors (struc - static loff_t ext2_max_size(int bits) - { - loff_t res = EXT2_NDIR_BLOCKS; -+ /* This constant is calculated to be the largest file size for a -+ * dense, 4k-blocksize file such that the total number of -+ * sectors in the file, including data and all indirect blocks, -+ * does not exceed 2^32. */ -+ const loff_t upper_limit = 0x1ff7fffd000LL; -+ - res += 1LL << (bits-2); - res += 1LL << (2*(bits-2)); - res += 1LL << (3*(bits-2)); - res <<= bits; -- if (res > (512LL << 32) - (1 << bits)) -- res = (512LL << 32) - (1 << bits); -+ if (res > upper_limit) -+ res = upper_limit; - return res; - } - -@@ -572,6 +560,7 @@ static int ext2_fill_super(struct super_ - int blocksize = BLOCK_SIZE; - int db_count; - int i, j; -+ __le32 features; - - sbi = kmalloc(sizeof(*sbi), GFP_KERNEL); - if (!sbi) -@@ -614,7 +603,7 @@ static int ext2_fill_super(struct super_ - es = (struct ext2_super_block *) (((char *)bh->b_data) + offset); - sbi->s_es = es; - sb->s_magic = le16_to_cpu(es->s_magic); -- sb->s_flags |= MS_ONE_SECOND; -+ set_sb_time_gran(sb, 1000000000U); - if (sb->s_magic != EXT2_SUPER_MAGIC) { - if (!silent) - printk ("VFS: Can't find ext2 filesystem on dev %s.\n", -@@ -661,17 +650,18 @@ static int ext2_fill_super(struct super_ - * previously didn't change the revision level when setting the flags, - * so there is a chance incompat flags are set on a rev 0 filesystem. 
- */ -- if ((i = EXT2_HAS_INCOMPAT_FEATURE(sb, ~EXT2_FEATURE_INCOMPAT_SUPP))) { -+ features = EXT2_HAS_INCOMPAT_FEATURE(sb, ~EXT2_FEATURE_INCOMPAT_SUPP); -+ if (features) { - printk("EXT2-fs: %s: couldn't mount because of " - "unsupported optional features (%x).\n", -- sb->s_id, i); -+ sb->s_id, le32_to_cpu(features)); - goto failed_mount; - } - if (!(sb->s_flags & MS_RDONLY) && -- (i = EXT2_HAS_RO_COMPAT_FEATURE(sb, ~EXT2_FEATURE_RO_COMPAT_SUPP))){ -+ (features = EXT2_HAS_RO_COMPAT_FEATURE(sb, ~EXT2_FEATURE_RO_COMPAT_SUPP))){ - printk("EXT2-fs: %s: couldn't mount RDWR because of " - "unsupported optional features (%x).\n", -- sb->s_id, i); -+ sb->s_id, le32_to_cpu(features)); - goto failed_mount; - } - blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); -@@ -694,7 +684,7 @@ static int ext2_fill_super(struct super_ - } - es = (struct ext2_super_block *) (((char *)bh->b_data) + offset); - sbi->s_es = es; -- if (es->s_magic != le16_to_cpu(EXT2_SUPER_MAGIC)) { -+ if (es->s_magic != cpu_to_le16(EXT2_SUPER_MAGIC)) { - printk ("EXT2-fs: Magic mismatch, very weird !\n"); - goto failed_mount; - } -@@ -937,12 +927,12 @@ static int ext2_remount (struct super_bl - es->s_state = cpu_to_le16(sbi->s_mount_state); - es->s_mtime = cpu_to_le32(get_seconds()); - } else { -- int ret; -- if ((ret = EXT2_HAS_RO_COMPAT_FEATURE(sb, -- ~EXT2_FEATURE_RO_COMPAT_SUPP))) { -+ __le32 ret = EXT2_HAS_RO_COMPAT_FEATURE(sb, -+ ~EXT2_FEATURE_RO_COMPAT_SUPP); -+ if (ret) { - printk("EXT2-fs: %s: couldn't remount RDWR because of " - "unsupported optional features (%x).\n", -- sb->s_id, ret); -+ sb->s_id, le32_to_cpu(ret)); - return -EROFS; - } - /* -@@ -1018,7 +1008,7 @@ static struct file_system_type ext2_fs_t - .name = "ext2", - .get_sb = ext2_get_sb, - .kill_sb = kill_block_super, -- .fs_flags = FS_REQUIRES_DEV, -+ .fs_flags = FS_REQUIRES_DEV | FS_VIRTUALIZED, - }; - - static int __init init_ext2_fs(void) -diff -uprN linux-2.6.8.1.orig/fs/ext2/xattr.c 
linux-2.6.8.1-ve022stab078/fs/ext2/xattr.c ---- linux-2.6.8.1.orig/fs/ext2/xattr.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/xattr.c 2006-05-11 13:05:32.000000000 +0400 -@@ -803,7 +803,7 @@ ext2_xattr_set2(struct inode *inode, str - - /* Update the inode. */ - EXT2_I(inode)->i_file_acl = new_bh ? new_bh->b_blocknr : 0; -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - if (IS_SYNC(inode)) { - error = ext2_sync_inode (inode); - if (error) -@@ -1071,7 +1071,7 @@ static inline void ext2_xattr_hash_entry - } - - if (entry->e_value_block == 0 && entry->e_value_size != 0) { -- __u32 *value = (__u32 *)((char *)header + -+ __le32 *value = (__le32 *)((char *)header + - le16_to_cpu(entry->e_value_offs)); - for (n = (le32_to_cpu(entry->e_value_size) + - EXT2_XATTR_ROUND) >> EXT2_XATTR_PAD_BITS; n; n--) { -diff -uprN linux-2.6.8.1.orig/fs/ext2/xattr.h linux-2.6.8.1-ve022stab078/fs/ext2/xattr.h ---- linux-2.6.8.1.orig/fs/ext2/xattr.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/xattr.h 2006-05-11 13:05:31.000000000 +0400 -@@ -26,20 +26,20 @@ - #define EXT2_XATTR_INDEX_SECURITY 6 - - struct ext2_xattr_header { -- __u32 h_magic; /* magic number for identification */ -- __u32 h_refcount; /* reference count */ -- __u32 h_blocks; /* number of disk blocks used */ -- __u32 h_hash; /* hash value of all attributes */ -+ __le32 h_magic; /* magic number for identification */ -+ __le32 h_refcount; /* reference count */ -+ __le32 h_blocks; /* number of disk blocks used */ -+ __le32 h_hash; /* hash value of all attributes */ - __u32 h_reserved[4]; /* zero right now */ - }; - - struct ext2_xattr_entry { - __u8 e_name_len; /* length of name */ - __u8 e_name_index; /* attribute name index */ -- __u16 e_value_offs; /* offset in disk block of value */ -- __u32 e_value_block; /* disk block attribute is stored on (n/i) */ -- __u32 e_value_size; /* size of attribute value */ -- __u32 e_hash; /* hash value of name 
and value */ -+ __le16 e_value_offs; /* offset in disk block of value */ -+ __le32 e_value_block; /* disk block attribute is stored on (n/i) */ -+ __le32 e_value_size; /* size of attribute value */ -+ __le32 e_hash; /* hash value of name and value */ - char e_name[0]; /* attribute name */ - }; - -diff -uprN linux-2.6.8.1.orig/fs/ext2/xattr_user.c linux-2.6.8.1-ve022stab078/fs/ext2/xattr_user.c ---- linux-2.6.8.1.orig/fs/ext2/xattr_user.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext2/xattr_user.c 2006-05-11 13:05:35.000000000 +0400 -@@ -40,7 +40,7 @@ ext2_xattr_user_get(struct inode *inode, - return -EINVAL; - if (!test_opt(inode->i_sb, XATTR_USER)) - return -EOPNOTSUPP; -- error = permission(inode, MAY_READ, NULL); -+ error = permission(inode, MAY_READ, NULL, NULL); - if (error) - return error; - -@@ -60,7 +60,7 @@ ext2_xattr_user_set(struct inode *inode, - if ( !S_ISREG(inode->i_mode) && - (!S_ISDIR(inode->i_mode) || inode->i_mode & S_ISVTX)) - return -EPERM; -- error = permission(inode, MAY_WRITE, NULL); -+ error = permission(inode, MAY_WRITE, NULL, NULL); - if (error) - return error; - -diff -uprN linux-2.6.8.1.orig/fs/ext3/Makefile linux-2.6.8.1-ve022stab078/fs/ext3/Makefile ---- linux-2.6.8.1.orig/fs/ext3/Makefile 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/Makefile 2006-05-11 13:05:31.000000000 +0400 -@@ -5,7 +5,7 @@ - obj-$(CONFIG_EXT3_FS) += ext3.o - - ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ -- ioctl.o namei.o super.o symlink.o hash.o -+ ioctl.o namei.o super.o symlink.o hash.o resize.o - - ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o - ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o -diff -uprN linux-2.6.8.1.orig/fs/ext3/acl.c linux-2.6.8.1-ve022stab078/fs/ext3/acl.c ---- linux-2.6.8.1.orig/fs/ext3/acl.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/acl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -291,7 +291,7 @@ 
ext3_set_acl(handle_t *handle, struct in - * inode->i_sem: don't care - */ - int --ext3_permission(struct inode *inode, int mask, struct nameidata *nd) -+__ext3_permission(struct inode *inode, int mask) - { - int mode = inode->i_mode; - -@@ -341,6 +341,29 @@ check_capabilities: - return -EACCES; - } - -+int -+ext3_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) -+{ -+ int ret; -+ -+ if (exec_perm != NULL) -+ down(&inode->i_sem); -+ -+ ret = __ext3_permission(inode, mask); -+ -+ if (exec_perm != NULL) { -+ if (!ret) { -+ exec_perm->set = 1; -+ exec_perm->mode = inode->i_mode; -+ exec_perm->uid = inode->i_uid; -+ exec_perm->gid = inode->i_gid; -+ } -+ up(&inode->i_sem); -+ } -+ return ret; -+} -+ - /* - * Initialize the ACLs of a new inode. Called from ext3_new_inode. - * -diff -uprN linux-2.6.8.1.orig/fs/ext3/acl.h linux-2.6.8.1-ve022stab078/fs/ext3/acl.h ---- linux-2.6.8.1.orig/fs/ext3/acl.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/acl.h 2006-05-11 13:05:35.000000000 +0400 -@@ -10,18 +10,18 @@ - #define EXT3_ACL_MAX_ENTRIES 32 - - typedef struct { -- __u16 e_tag; -- __u16 e_perm; -- __u32 e_id; -+ __le16 e_tag; -+ __le16 e_perm; -+ __le32 e_id; - } ext3_acl_entry; - - typedef struct { -- __u16 e_tag; -- __u16 e_perm; -+ __le16 e_tag; -+ __le16 e_perm; - } ext3_acl_entry_short; - - typedef struct { -- __u32 a_version; -+ __le32 a_version; - } ext3_acl_header; - - static inline size_t ext3_acl_size(int count) -@@ -59,7 +59,8 @@ static inline int ext3_acl_count(size_t - #define EXT3_ACL_NOT_CACHED ((void *)-1) - - /* acl.c */ --extern int ext3_permission (struct inode *, int, struct nameidata *); -+extern int ext3_permission (struct inode *, int, struct nameidata *, -+ struct exec_perm *); - extern int ext3_acl_chmod (struct inode *); - extern int ext3_init_acl (handle_t *, struct inode *, struct inode *); - -diff -uprN linux-2.6.8.1.orig/fs/ext3/balloc.c 
linux-2.6.8.1-ve022stab078/fs/ext3/balloc.c ---- linux-2.6.8.1.orig/fs/ext3/balloc.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/balloc.c 2006-05-11 13:05:31.000000000 +0400 -@@ -54,6 +54,7 @@ struct ext3_group_desc * ext3_get_group_ - - return NULL; - } -+ smp_rmb(); - - group_desc = block_group / EXT3_DESC_PER_BLOCK(sb); - desc = block_group % EXT3_DESC_PER_BLOCK(sb); -@@ -91,15 +92,16 @@ read_block_bitmap(struct super_block *sb - if (!bh) - ext3_error (sb, "read_block_bitmap", - "Cannot read block bitmap - " -- "block_group = %d, block_bitmap = %lu", -- block_group, (unsigned long) desc->bg_block_bitmap); -+ "block_group = %d, block_bitmap = %u", -+ block_group, le32_to_cpu(desc->bg_block_bitmap)); - error_out: - return bh; - } - - /* Free given blocks, update quota and i_blocks field */ --void ext3_free_blocks (handle_t *handle, struct inode * inode, -- unsigned long block, unsigned long count) -+void ext3_free_blocks_sb(handle_t *handle, struct super_block *sb, -+ unsigned long block, unsigned long count, -+ int *pdquot_freed_blocks) - { - struct buffer_head *bitmap_bh = NULL; - struct buffer_head *gd_bh; -@@ -107,18 +109,12 @@ void ext3_free_blocks (handle_t *handle, - unsigned long bit; - unsigned long i; - unsigned long overflow; -- struct super_block * sb; - struct ext3_group_desc * gdp; - struct ext3_super_block * es; - struct ext3_sb_info *sbi; - int err = 0, ret; -- int dquot_freed_blocks = 0; - -- sb = inode->i_sb; -- if (!sb) { -- printk ("ext3_free_blocks: nonexistent device"); -- return; -- } -+ *pdquot_freed_blocks = 0; - sbi = EXT3_SB(sb); - es = EXT3_SB(sb)->s_es; - if (block < le32_to_cpu(es->s_first_data_block) || -@@ -245,7 +241,7 @@ do_more: - jbd_lock_bh_state(bitmap_bh); - BUFFER_TRACE(bitmap_bh, "bit already cleared"); - } else { -- dquot_freed_blocks++; -+ (*pdquot_freed_blocks)++; - } - } - jbd_unlock_bh_state(bitmap_bh); -@@ -253,7 +249,7 @@ do_more: - spin_lock(sb_bgl_lock(sbi, block_group)); - 
gdp->bg_free_blocks_count = - cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + -- dquot_freed_blocks); -+ *pdquot_freed_blocks); - spin_unlock(sb_bgl_lock(sbi, block_group)); - percpu_counter_mod(&sbi->s_freeblocks_counter, count); - -@@ -275,6 +271,22 @@ do_more: - error_return: - brelse(bitmap_bh); - ext3_std_error(sb, err); -+ return; -+} -+ -+/* Free given blocks, update quota and i_blocks field */ -+void ext3_free_blocks(handle_t *handle, struct inode *inode, -+ unsigned long block, unsigned long count) -+{ -+ struct super_block * sb; -+ int dquot_freed_blocks; -+ -+ sb = inode->i_sb; -+ if (!sb) { -+ printk ("ext3_free_blocks: nonexistent device"); -+ return; -+ } -+ ext3_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks); - if (dquot_freed_blocks) - DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); - return; -@@ -523,6 +535,8 @@ ext3_new_block(handle_t *handle, struct - #ifdef EXT3FS_DEBUG - static int goal_hits, goal_attempts; - #endif -+ unsigned long ngroups; -+ - *errp = -ENOSPC; - sb = inode->i_sb; - if (!sb) { -@@ -574,13 +588,16 @@ ext3_new_block(handle_t *handle, struct - goto allocated; - } - -+ ngroups = EXT3_SB(sb)->s_groups_count; -+ smp_rmb(); -+ - /* - * Now search the rest of the groups. We assume that - * i and gdp correctly point to the last group visited. 
- */ -- for (bgi = 0; bgi < EXT3_SB(sb)->s_groups_count; bgi++) { -+ for (bgi = 0; bgi < ngroups; bgi++) { - group_no++; -- if (group_no >= EXT3_SB(sb)->s_groups_count) -+ if (group_no >= ngroups) - group_no = 0; - gdp = ext3_get_group_desc(sb, group_no, &gdp_bh); - if (!gdp) { -@@ -715,6 +732,7 @@ unsigned long ext3_count_free_blocks(str - unsigned long desc_count; - struct ext3_group_desc *gdp; - int i; -+ unsigned long ngroups; - #ifdef EXT3FS_DEBUG - struct ext3_super_block *es; - unsigned long bitmap_count, x; -@@ -747,7 +765,9 @@ unsigned long ext3_count_free_blocks(str - return bitmap_count; - #else - desc_count = 0; -- for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) { -+ ngroups = EXT3_SB(sb)->s_groups_count; -+ smp_rmb(); -+ for (i = 0; i < ngroups; i++) { - gdp = ext3_get_group_desc(sb, i, NULL); - if (!gdp) - continue; -diff -uprN linux-2.6.8.1.orig/fs/ext3/fsync.c linux-2.6.8.1-ve022stab078/fs/ext3/fsync.c ---- linux-2.6.8.1.orig/fs/ext3/fsync.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/fsync.c 2006-05-11 13:05:31.000000000 +0400 -@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s - - J_ASSERT(ext3_journal_current_handle() == 0); - -- smp_mb(); /* prepare for lockless i_state read */ -- if (!(inode->i_state & I_DIRTY)) -- goto out; -- - /* - * data=writeback: - * The caller's filemap_fdatawrite()/wait will sync the data. 
-diff -uprN linux-2.6.8.1.orig/fs/ext3/ialloc.c linux-2.6.8.1-ve022stab078/fs/ext3/ialloc.c ---- linux-2.6.8.1.orig/fs/ext3/ialloc.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/ialloc.c 2006-05-11 13:05:32.000000000 +0400 -@@ -64,8 +64,8 @@ read_inode_bitmap(struct super_block * s - if (!bh) - ext3_error(sb, "read_inode_bitmap", - "Cannot read inode bitmap - " -- "block_group = %lu, inode_bitmap = %lu", -- block_group, (unsigned long) desc->bg_inode_bitmap); -+ "block_group = %lu, inode_bitmap = %u", -+ block_group, le32_to_cpu(desc->bg_inode_bitmap)); - error_out: - return bh; - } -@@ -97,7 +97,7 @@ void ext3_free_inode (handle_t *handle, - unsigned long bit; - struct ext3_group_desc * gdp; - struct ext3_super_block * es; -- struct ext3_sb_info *sbi = EXT3_SB(sb); -+ struct ext3_sb_info *sbi; - int fatal = 0, err; - - if (atomic_read(&inode->i_count) > 1) { -@@ -114,6 +114,7 @@ void ext3_free_inode (handle_t *handle, - printk("ext3_free_inode: inode on nonexistent device\n"); - return; - } -+ sbi = EXT3_SB(sb); - - ino = inode->i_ino; - ext3_debug ("freeing inode %lu\n", ino); -@@ -319,8 +320,6 @@ static int find_group_orlov(struct super - desc = ext3_get_group_desc (sb, group, &bh); - if (!desc || !desc->bg_free_inodes_count) - continue; -- if (sbi->s_debts[group] >= max_debt) -- continue; - if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs) - continue; - if (le16_to_cpu(desc->bg_free_inodes_count) < min_inodes) -@@ -559,7 +558,7 @@ got: - /* This is the optimal IO size (for stat), not the fs block size */ - inode->i_blksize = PAGE_SIZE; - inode->i_blocks = 0; -- inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; -+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC; - - memset(ei->i_data, 0, sizeof(ei->i_data)); - ei->i_next_alloc_block = 0; -diff -uprN linux-2.6.8.1.orig/fs/ext3/inode.c linux-2.6.8.1-ve022stab078/fs/ext3/inode.c ---- linux-2.6.8.1.orig/fs/ext3/inode.c 2004-08-14 
14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/inode.c 2006-05-11 13:05:39.000000000 +0400 -@@ -66,6 +66,8 @@ int ext3_forget(handle_t *handle, int is - { - int err; - -+ might_sleep(); -+ - BUFFER_TRACE(bh, "enter"); - - jbd_debug(4, "forgetting bh %p: is_metadata = %d, mode %o, " -@@ -82,7 +84,7 @@ int ext3_forget(handle_t *handle, int is - (!is_metadata && !ext3_should_journal_data(inode))) { - if (bh) { - BUFFER_TRACE(bh, "call journal_forget"); -- ext3_journal_forget(handle, bh); -+ return ext3_journal_forget(handle, bh); - } - return 0; - } -@@ -303,12 +305,12 @@ static int ext3_alloc_block (handle_t *h - - - typedef struct { -- u32 *p; -- u32 key; -+ __le32 *p; -+ __le32 key; - struct buffer_head *bh; - } Indirect; - --static inline void add_chain(Indirect *p, struct buffer_head *bh, u32 *v) -+static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v) - { - p->key = *(p->p = v); - p->bh = bh; -@@ -439,7 +441,7 @@ static Indirect *ext3_get_branch(struct - /* Reader: pointers */ - if (!verify_chain(chain, p)) - goto changed; -- add_chain(++p, bh, (u32*)bh->b_data + *++offsets); -+ add_chain(++p, bh, (__le32*)bh->b_data + *++offsets); - /* Reader: end */ - if (!p->key) - goto no_block; -@@ -480,8 +482,8 @@ no_block: - static unsigned long ext3_find_near(struct inode *inode, Indirect *ind) - { - struct ext3_inode_info *ei = EXT3_I(inode); -- u32 *start = ind->bh ? (u32*) ind->bh->b_data : ei->i_data; -- u32 *p; -+ __le32 *start = ind->bh ? 
(__le32*) ind->bh->b_data : ei->i_data; -+ __le32 *p; - unsigned long bg_start; - unsigned long colour; - -@@ -609,7 +611,7 @@ static int ext3_alloc_branch(handle_t *h - } - - memset(bh->b_data, 0, blocksize); -- branch[n].p = (u32*) bh->b_data + offsets[n]; -+ branch[n].p = (__le32*) bh->b_data + offsets[n]; - *branch[n].p = branch[n].key; - BUFFER_TRACE(bh, "marking uptodate"); - set_buffer_uptodate(bh); -@@ -687,7 +689,7 @@ static int ext3_splice_branch(handle_t * - - /* We are done with atomic stuff, now do the rest of housekeeping */ - -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - ext3_mark_inode_dirty(handle, inode); - - /* had we spliced it onto indirect block? */ -@@ -780,6 +782,7 @@ reread: - if (!partial) { - clear_buffer_new(bh_result); - got_it: -+ clear_buffer_delay(bh_result); - map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); - if (boundary) - set_buffer_boundary(bh_result); -@@ -1063,11 +1066,13 @@ static int walk_page_buffers( handle_t * - * and the commit_write(). So doing the journal_start at the start of - * prepare_write() is the right place. - * -- * Also, this function can nest inside ext3_writepage() -> -- * block_write_full_page(). In that case, we *know* that ext3_writepage() -- * has generated enough buffer credits to do the whole page. So we won't -- * block on the journal in that case, which is good, because the caller may -- * be PF_MEMALLOC. -+ * [2004/09/04 SAW] journal_start() in prepare_write() causes different ranking -+ * violations if copy_from_user() triggers a page fault (mmap_sem, may be page -+ * lock, plus __GFP_FS allocations). -+ * Now we read in not up-to-date buffers in prepare_write(), and do the rest -+ * including hole instantiation and inode extension in commit_write(). -+ * -+ * Other notes. - * - * By accident, ext3 can be reentered when a transaction is open via - * quota file writes. 
If we were to commit the transaction while thus -@@ -1082,6 +1087,27 @@ static int walk_page_buffers( handle_t * - * write. - */ - -+static int ext3_get_block_delay(struct inode *inode, sector_t iblock, -+ struct buffer_head *bh, int create) -+{ -+ int ret; -+ -+ ret = ext3_get_block_handle(NULL, inode, iblock, bh, 0, 0); -+ if (ret) -+ return ret; -+ if (!buffer_mapped(bh)) { -+ set_buffer_delay(bh); -+ set_buffer_new(bh); -+ } -+ return ret; -+} -+ -+static int ext3_prepare_write(struct file *file, struct page *page, -+ unsigned from, unsigned to) -+{ -+ return block_prepare_write(page, from, to, ext3_get_block_delay); -+} -+ - static int do_journal_get_write_access(handle_t *handle, - struct buffer_head *bh) - { -@@ -1090,8 +1116,52 @@ static int do_journal_get_write_access(h - return ext3_journal_get_write_access(handle, bh); - } - --static int ext3_prepare_write(struct file *file, struct page *page, -- unsigned from, unsigned to) -+/* -+ * This function zeroes buffers not mapped to disk. -+ * We do it similarly to the error path in __block_prepare_write() to avoid -+ * keeping garbage in the page cache. -+ * Here we check BH_delay state. We know that if the buffer appears -+ * !buffer_mapped then -+ * - it was !buffer_mapped at the moment of ext3_prepare_write, and -+ * - ext3_get_block failed to map this buffer (e.g., ENOSPC). -+ * If this !mapped buffer is not up to date (it can be up to date if -+ * PageUptodate), then we zero its content. 
-+ */ -+static void ext3_clear_delayed_buffers(struct page *page, -+ unsigned from, unsigned to) -+{ -+ struct buffer_head *bh, *head, *next; -+ unsigned block_start, block_end; -+ unsigned blocksize; -+ void *kaddr; -+ -+ head = page_buffers(page); -+ blocksize = head->b_size; -+ for ( bh = head, block_start = 0; -+ bh != head || !block_start; -+ block_start = block_end, bh = next) -+ { -+ next = bh->b_this_page; -+ block_end = block_start + blocksize; -+ if (block_end <= from || block_start >= to) -+ continue; -+ if (!buffer_delay(bh)) -+ continue; -+ J_ASSERT_BH(bh, !buffer_mapped(bh)); -+ clear_buffer_new(bh); -+ clear_buffer_delay(bh); -+ if (!buffer_uptodate(bh)) { -+ kaddr = kmap_atomic(page, KM_USER0); -+ memset(kaddr + block_start, 0, bh->b_size); -+ kunmap_atomic(kaddr, KM_USER0); -+ set_buffer_uptodate(bh); -+ mark_buffer_dirty(bh); -+ } -+ } -+} -+ -+static int ext3_map_write(struct file *file, struct page *page, -+ unsigned from, unsigned to) - { - struct inode *inode = page->mapping->host; - int ret, needed_blocks = ext3_writepage_trans_blocks(inode); -@@ -1104,19 +1174,19 @@ retry: - ret = PTR_ERR(handle); - goto out; - } -- ret = block_prepare_write(page, from, to, ext3_get_block); -- if (ret) -- goto prepare_write_failed; - -- if (ext3_should_journal_data(inode)) { -+ ret = block_prepare_write(page, from, to, ext3_get_block); -+ if (!ret && ext3_should_journal_data(inode)) { - ret = walk_page_buffers(handle, page_buffers(page), - from, to, NULL, do_journal_get_write_access); - } --prepare_write_failed: -- if (ret) -- ext3_journal_stop(handle); -+ if (!ret) -+ goto out; -+ -+ ext3_journal_stop(handle); - if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries)) - goto retry; -+ ext3_clear_delayed_buffers(page, from, to); - out: - return ret; - } -@@ -1151,10 +1221,15 @@ static int commit_write_fn(handle_t *han - static int ext3_ordered_commit_write(struct file *file, struct page *page, - unsigned from, unsigned to) - { -- handle_t 
*handle = ext3_journal_current_handle(); -+ handle_t *handle; - struct inode *inode = page->mapping->host; - int ret = 0, ret2; - -+ ret = ext3_map_write(file, page, from, to); -+ if (ret) -+ return ret; -+ handle = ext3_journal_current_handle(); -+ - ret = walk_page_buffers(handle, page_buffers(page), - from, to, NULL, ext3_journal_dirty_data); - -@@ -1180,11 +1255,15 @@ static int ext3_ordered_commit_write(str - static int ext3_writeback_commit_write(struct file *file, struct page *page, - unsigned from, unsigned to) - { -- handle_t *handle = ext3_journal_current_handle(); -+ handle_t *handle; - struct inode *inode = page->mapping->host; - int ret = 0, ret2; - loff_t new_i_size; - -+ ret = ext3_map_write(file, page, from, to); -+ if (ret) -+ return ret; -+ handle = ext3_journal_current_handle(); - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; - if (new_i_size > EXT3_I(inode)->i_disksize) - EXT3_I(inode)->i_disksize = new_i_size; -@@ -1198,12 +1277,17 @@ static int ext3_writeback_commit_write(s - static int ext3_journalled_commit_write(struct file *file, - struct page *page, unsigned from, unsigned to) - { -- handle_t *handle = ext3_journal_current_handle(); -+ handle_t *handle; - struct inode *inode = page->mapping->host; - int ret = 0, ret2; - int partial = 0; - loff_t pos; - -+ ret = ext3_map_write(file, page, from, to); -+ if (ret) -+ return ret; -+ handle = ext3_journal_current_handle(); -+ - /* - * Here we duplicate the generic_commit_write() functionality - */ -@@ -1471,8 +1555,11 @@ static int ext3_journalled_writepage(str - ClearPageChecked(page); - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, - ext3_get_block); -- if (ret != 0) -- goto out_unlock; -+ if (ret != 0) { -+ ext3_journal_stop(handle); -+ unlock_page(page); -+ return ret; -+ } - ret = walk_page_buffers(handle, page_buffers(page), 0, - PAGE_CACHE_SIZE, NULL, do_journal_get_write_access); - -@@ -1498,7 +1585,6 @@ out: - - no_write: - redirty_page_for_writepage(wbc, page); 
--out_unlock: - unlock_page(page); - goto out; - } -@@ -1577,6 +1663,12 @@ static ssize_t ext3_direct_IO(int rw, st - offset, nr_segs, - ext3_direct_io_get_blocks, NULL); - -+ /* -+ * Reacquire the handle: ext3_direct_io_get_block() can restart the -+ * transaction -+ */ -+ handle = journal_current_handle(); -+ - out_stop: - if (handle) { - int err; -@@ -1765,7 +1857,7 @@ unlock: - * or memcmp with zero_page, whatever is better for particular architecture. - * Linus? - */ --static inline int all_zeroes(u32 *p, u32 *q) -+static inline int all_zeroes(__le32 *p, __le32 *q) - { - while (p < q) - if (*p++) -@@ -1812,7 +1904,7 @@ static Indirect *ext3_find_shared(struct - int depth, - int offsets[4], - Indirect chain[4], -- u32 *top) -+ __le32 *top) - { - Indirect *partial, *p; - int k, err; -@@ -1832,7 +1924,7 @@ static Indirect *ext3_find_shared(struct - if (!partial->key && *partial->p) - /* Writer: end */ - goto no_top; -- for (p=partial; p>chain && all_zeroes((u32*)p->bh->b_data,p->p); p--) -+ for (p=partial; p>chain && all_zeroes((__le32*)p->bh->b_data,p->p); p--) - ; - /* - * OK, we've found the last block that must survive. The rest of our -@@ -1871,9 +1963,9 @@ no_top: - static void - ext3_clear_blocks(handle_t *handle, struct inode *inode, struct buffer_head *bh, - unsigned long block_to_free, unsigned long count, -- u32 *first, u32 *last) -+ __le32 *first, __le32 *last) - { -- u32 *p; -+ __le32 *p; - if (try_to_extend_transaction(handle, inode)) { - if (bh) { - BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata"); -@@ -1929,15 +2021,16 @@ ext3_clear_blocks(handle_t *handle, stru - * block pointers. 
- */ - static void ext3_free_data(handle_t *handle, struct inode *inode, -- struct buffer_head *this_bh, u32 *first, u32 *last) -+ struct buffer_head *this_bh, -+ __le32 *first, __le32 *last) - { - unsigned long block_to_free = 0; /* Starting block # of a run */ - unsigned long count = 0; /* Number of blocks in the run */ -- u32 *block_to_free_p = NULL; /* Pointer into inode/ind -+ __le32 *block_to_free_p = NULL; /* Pointer into inode/ind - corresponding to - block_to_free */ - unsigned long nr; /* Current block # */ -- u32 *p; /* Pointer into inode/ind -+ __le32 *p; /* Pointer into inode/ind - for current block */ - int err; - -@@ -1996,10 +2089,10 @@ static void ext3_free_data(handle_t *han - */ - static void ext3_free_branches(handle_t *handle, struct inode *inode, - struct buffer_head *parent_bh, -- u32 *first, u32 *last, int depth) -+ __le32 *first, __le32 *last, int depth) - { - unsigned long nr; -- u32 *p; -+ __le32 *p; - - if (is_handle_aborted(handle)) - return; -@@ -2029,8 +2122,9 @@ static void ext3_free_branches(handle_t - - /* This zaps the entire block. Bottom up. 
*/ - BUFFER_TRACE(bh, "free child branches"); -- ext3_free_branches(handle, inode, bh, (u32*)bh->b_data, -- (u32*)bh->b_data + addr_per_block, -+ ext3_free_branches(handle, inode, bh, -+ (__le32*)bh->b_data, -+ (__le32*)bh->b_data + addr_per_block, - depth); - - /* -@@ -2135,13 +2229,13 @@ void ext3_truncate(struct inode * inode) - { - handle_t *handle; - struct ext3_inode_info *ei = EXT3_I(inode); -- u32 *i_data = ei->i_data; -+ __le32 *i_data = ei->i_data; - int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb); - struct address_space *mapping = inode->i_mapping; - int offsets[4]; - Indirect chain[4]; - Indirect *partial; -- int nr = 0; -+ __le32 nr = 0; - int n; - long last_block; - unsigned blocksize = inode->i_sb->s_blocksize; -@@ -2248,7 +2342,7 @@ void ext3_truncate(struct inode * inode) - /* Clear the ends of indirect blocks on the shared branch */ - while (partial > chain) { - ext3_free_branches(handle, inode, partial->bh, partial->p + 1, -- (u32*)partial->bh->b_data + addr_per_block, -+ (__le32*)partial->bh->b_data+addr_per_block, - (chain+n-1) - partial); - BUFFER_TRACE(partial->bh, "call brelse"); - brelse (partial->bh); -@@ -2282,7 +2376,7 @@ do_indirects: - ; - } - up(&ei->truncate_sem); -- inode->i_mtime = inode->i_ctime = CURRENT_TIME; -+ inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC; - ext3_mark_inode_dirty(handle, inode); - - /* In a multi-transaction truncate, we only make the final -@@ -2311,8 +2405,10 @@ static unsigned long ext3_get_inode_bloc - struct buffer_head *bh; - struct ext3_group_desc * gdp; - -+ - if ((ino != EXT3_ROOT_INO && - ino != EXT3_JOURNAL_INO && -+ ino != EXT3_RESIZE_INO && - ino < EXT3_FIRST_INO(sb)) || - ino > le32_to_cpu( - EXT3_SB(sb)->s_es->s_inodes_count)) { -@@ -2326,6 +2422,7 @@ static unsigned long ext3_get_inode_bloc - "group >= groups count"); - return 0; - } -+ smp_rmb(); - group_desc = block_group >> EXT3_DESC_PER_BLOCK_BITS(sb); - desc = block_group & (EXT3_DESC_PER_BLOCK(sb) - 1); - bh = 
EXT3_SB(sb)->s_group_desc[group_desc]; -@@ -2743,21 +2840,21 @@ out_brelse: - * `stuff()' is running, and the new i_size will be lost. Plus the inode - * will no longer be on the superblock's dirty inode list. - */ --void ext3_write_inode(struct inode *inode, int wait) -+int ext3_write_inode(struct inode *inode, int wait) - { -- if (current->flags & PF_MEMALLOC) -- return; -+ if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) -+ return 0; - - if (ext3_journal_current_handle()) { - jbd_debug(0, "called recursively, non-PF_MEMALLOC!\n"); - dump_stack(); -- return; -+ return -EIO; - } - - if (!wait) -- return; -+ return 0; - -- ext3_force_commit(inode->i_sb); -+ return ext3_force_commit(inode->i_sb); - } - - /* -@@ -2966,6 +3063,7 @@ int ext3_mark_inode_dirty(handle_t *hand - struct ext3_iloc iloc; - int err; - -+ might_sleep(); - err = ext3_reserve_inode_write(handle, inode, &iloc); - if (!err) - err = ext3_mark_iloc_dirty(handle, inode, &iloc); -diff -uprN linux-2.6.8.1.orig/fs/ext3/ioctl.c linux-2.6.8.1-ve022stab078/fs/ext3/ioctl.c ---- linux-2.6.8.1.orig/fs/ext3/ioctl.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/ioctl.c 2006-05-11 13:05:37.000000000 +0400 -@@ -67,7 +67,7 @@ int ext3_ioctl (struct inode * inode, st - * the relevant capability. 
- */ - if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) { -- if (!capable(CAP_SYS_RESOURCE)) -+ if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - } - -@@ -86,7 +86,7 @@ int ext3_ioctl (struct inode * inode, st - ei->i_flags = flags; - - ext3_set_inode_flags(inode); -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - - err = ext3_mark_iloc_dirty(handle, inode, &iloc); - flags_err: -@@ -120,7 +120,7 @@ flags_err: - return PTR_ERR(handle); - err = ext3_reserve_inode_write(handle, inode, &iloc); - if (err == 0) { -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - inode->i_generation = generation; - err = ext3_mark_iloc_dirty(handle, inode, &iloc); - } -@@ -151,6 +151,51 @@ flags_err: - return ret; - } - #endif -+ case EXT3_IOC_GROUP_EXTEND: { -+ unsigned long n_blocks_count; -+ struct super_block *sb = inode->i_sb; -+ int err; -+ -+ if (!capable(CAP_SYS_RESOURCE)) -+ return -EPERM; -+ -+ if (IS_RDONLY(inode)) -+ return -EROFS; -+ -+ if (get_user(n_blocks_count, (__u32 *)arg)) -+ return -EFAULT; -+ -+ err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count); -+ journal_lock_updates(EXT3_SB(sb)->s_journal); -+ journal_flush(EXT3_SB(sb)->s_journal); -+ journal_unlock_updates(EXT3_SB(sb)->s_journal); -+ -+ return err; -+ } -+ case EXT3_IOC_GROUP_ADD: { -+ struct ext3_new_group_data input; -+ struct super_block *sb = inode->i_sb; -+ int err; -+ -+ if (!capable(CAP_SYS_RESOURCE)) -+ return -EPERM; -+ -+ if (IS_RDONLY(inode)) -+ return -EROFS; -+ -+ if (copy_from_user(&input, (struct ext3_new_group_input *)arg, -+ sizeof(input))) -+ return -EFAULT; -+ -+ err = ext3_group_add(sb, &input); -+ journal_lock_updates(EXT3_SB(sb)->s_journal); -+ journal_flush(EXT3_SB(sb)->s_journal); -+ journal_unlock_updates(EXT3_SB(sb)->s_journal); -+ -+ return err; -+ } -+ -+ - default: - return -ENOTTY; - } -diff -uprN linux-2.6.8.1.orig/fs/ext3/namei.c linux-2.6.8.1-ve022stab078/fs/ext3/namei.c ---- linux-2.6.8.1.orig/fs/ext3/namei.c 2004-08-14 
14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/namei.c 2006-05-11 13:05:32.000000000 +0400 -@@ -71,9 +71,6 @@ static struct buffer_head *ext3_append(h - #define swap(x, y) do { typeof(x) z = x; x = y; y = z; } while (0) - #endif - --typedef struct { u32 v; } le_u32; --typedef struct { u16 v; } le_u16; -- - #ifdef DX_DEBUG - #define dxtrace(command) command - #else -@@ -82,22 +79,22 @@ typedef struct { u16 v; } le_u16; - - struct fake_dirent - { -- /*le*/u32 inode; -- /*le*/u16 rec_len; -+ __le32 inode; -+ __le16 rec_len; - u8 name_len; - u8 file_type; - }; - - struct dx_countlimit - { -- le_u16 limit; -- le_u16 count; -+ __le16 limit; -+ __le16 count; - }; - - struct dx_entry - { -- le_u32 hash; -- le_u32 block; -+ __le32 hash; -+ __le32 block; - }; - - /* -@@ -114,7 +111,7 @@ struct dx_root - char dotdot_name[4]; - struct dx_root_info - { -- le_u32 reserved_zero; -+ __le32 reserved_zero; - u8 hash_version; - u8 info_length; /* 8 */ - u8 indirect_levels; -@@ -184,42 +181,42 @@ static int ext3_dx_add_entry(handle_t *h - - static inline unsigned dx_get_block (struct dx_entry *entry) - { -- return le32_to_cpu(entry->block.v) & 0x00ffffff; -+ return le32_to_cpu(entry->block) & 0x00ffffff; - } - - static inline void dx_set_block (struct dx_entry *entry, unsigned value) - { -- entry->block.v = cpu_to_le32(value); -+ entry->block = cpu_to_le32(value); - } - - static inline unsigned dx_get_hash (struct dx_entry *entry) - { -- return le32_to_cpu(entry->hash.v); -+ return le32_to_cpu(entry->hash); - } - - static inline void dx_set_hash (struct dx_entry *entry, unsigned value) - { -- entry->hash.v = cpu_to_le32(value); -+ entry->hash = cpu_to_le32(value); - } - - static inline unsigned dx_get_count (struct dx_entry *entries) - { -- return le16_to_cpu(((struct dx_countlimit *) entries)->count.v); -+ return le16_to_cpu(((struct dx_countlimit *) entries)->count); - } - - static inline unsigned dx_get_limit (struct dx_entry *entries) - { -- return 
le16_to_cpu(((struct dx_countlimit *) entries)->limit.v); -+ return le16_to_cpu(((struct dx_countlimit *) entries)->limit); - } - - static inline void dx_set_count (struct dx_entry *entries, unsigned value) - { -- ((struct dx_countlimit *) entries)->count.v = cpu_to_le16(value); -+ ((struct dx_countlimit *) entries)->count = cpu_to_le16(value); - } - - static inline void dx_set_limit (struct dx_entry *entries, unsigned value) - { -- ((struct dx_countlimit *) entries)->limit.v = cpu_to_le16(value); -+ ((struct dx_countlimit *) entries)->limit = cpu_to_le16(value); - } - - static inline unsigned dx_root_limit (struct inode *dir, unsigned infosize) -@@ -1254,7 +1251,7 @@ static int add_dirent_to_buf(handle_t *h - * happen is that the times are slightly out of date - * and/or different from the directory change time. - */ -- dir->i_mtime = dir->i_ctime = CURRENT_TIME; -+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC; - ext3_update_dx_flag(dir); - dir->i_version++; - ext3_mark_inode_dirty(handle, dir); -@@ -2032,7 +2029,7 @@ static int ext3_rmdir (struct inode * di - * recovery. */ - inode->i_size = 0; - ext3_orphan_add(handle, inode); -- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; -+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC; - ext3_mark_inode_dirty(handle, inode); - dir->i_nlink--; - ext3_update_dx_flag(dir); -@@ -2082,7 +2079,7 @@ static int ext3_unlink(struct inode * di - retval = ext3_delete_entry(handle, dir, de, bh); - if (retval) - goto end_unlink; -- dir->i_ctime = dir->i_mtime = CURRENT_TIME; -+ dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC; - ext3_update_dx_flag(dir); - ext3_mark_inode_dirty(handle, dir); - inode->i_nlink--; -@@ -2132,7 +2129,7 @@ retry: - * We have a transaction open. All is sweetness. It also sets - * i_size in generic_commit_write(). 
- */ -- err = page_symlink(inode, symname, l); -+ err = page_symlink(inode, symname, l, GFP_NOFS); - if (err) { - ext3_dec_count(handle, inode); - ext3_mark_inode_dirty(handle, inode); -@@ -2172,7 +2169,7 @@ retry: - if (IS_DIRSYNC(dir)) - handle->h_sync = 1; - -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - ext3_inc_count(handle, inode); - atomic_inc(&inode->i_count); - -@@ -2258,7 +2255,7 @@ static int ext3_rename (struct inode * o - } else { - BUFFER_TRACE(new_bh, "get write access"); - ext3_journal_get_write_access(handle, new_bh); -- new_de->inode = le32_to_cpu(old_inode->i_ino); -+ new_de->inode = cpu_to_le32(old_inode->i_ino); - if (EXT3_HAS_INCOMPAT_FEATURE(new_dir->i_sb, - EXT3_FEATURE_INCOMPAT_FILETYPE)) - new_de->file_type = old_de->file_type; -@@ -2273,7 +2270,7 @@ static int ext3_rename (struct inode * o - * Like most other Unix systems, set the ctime for inodes on a - * rename. - */ -- old_inode->i_ctime = CURRENT_TIME; -+ old_inode->i_ctime = CURRENT_TIME_SEC; - ext3_mark_inode_dirty(handle, old_inode); - - /* -@@ -2306,14 +2303,14 @@ static int ext3_rename (struct inode * o - - if (new_inode) { - new_inode->i_nlink--; -- new_inode->i_ctime = CURRENT_TIME; -+ new_inode->i_ctime = CURRENT_TIME_SEC; - } -- old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME; -+ old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME_SEC; - ext3_update_dx_flag(old_dir); - if (dir_bh) { - BUFFER_TRACE(dir_bh, "get_write_access"); - ext3_journal_get_write_access(handle, dir_bh); -- PARENT_INO(dir_bh->b_data) = le32_to_cpu(new_dir->i_ino); -+ PARENT_INO(dir_bh->b_data) = cpu_to_le32(new_dir->i_ino); - BUFFER_TRACE(dir_bh, "call ext3_journal_dirty_metadata"); - ext3_journal_dirty_metadata(handle, dir_bh); - old_dir->i_nlink--; -diff -uprN linux-2.6.8.1.orig/fs/ext3/resize.c linux-2.6.8.1-ve022stab078/fs/ext3/resize.c ---- linux-2.6.8.1.orig/fs/ext3/resize.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/resize.c 2006-05-11 
13:05:31.000000000 +0400 -@@ -0,0 +1,996 @@ -+/* -+ * linux/fs/ext3/resize.c -+ * -+ * Support for resizing an ext3 filesystem while it is mounted. -+ * -+ * Copyright (C) 2001, 2002 Andreas Dilger <adilger@clusterfs.com> -+ * -+ * This could probably be made into a module, because it is not often in use. -+ */ -+ -+#include <linux/config.h> -+ -+#define EXT3FS_DEBUG -+ -+#include <linux/sched.h> -+#include <linux/smp_lock.h> -+#include <linux/ext3_jbd.h> -+ -+#include <linux/errno.h> -+#include <linux/slab.h> -+ -+ -+#define outside(b, first, last) ((b) < (first) || (b) >= (last)) -+#define inside(b, first, last) ((b) >= (first) && (b) < (last)) -+ -+static int verify_group_input(struct super_block *sb, -+ struct ext3_new_group_data *input) -+{ -+ struct ext3_sb_info *sbi = EXT3_SB(sb); -+ struct ext3_super_block *es = sbi->s_es; -+ unsigned start = le32_to_cpu(es->s_blocks_count); -+ unsigned end = start + input->blocks_count; -+ unsigned group = input->group; -+ unsigned itend = input->inode_table + EXT3_SB(sb)->s_itb_per_group; -+ unsigned overhead = ext3_bg_has_super(sb, group) ? -+ (1 + ext3_bg_num_gdb(sb, group) + -+ le16_to_cpu(es->s_reserved_gdt_blocks)) : 0; -+ unsigned metaend = start + overhead; -+ struct buffer_head *bh = NULL; -+ int free_blocks_count; -+ int err = -EINVAL; -+ -+ input->free_blocks_count = free_blocks_count = -+ input->blocks_count - 2 - overhead - sbi->s_itb_per_group; -+ -+ if (test_opt(sb, DEBUG)) -+ printk(KERN_DEBUG "EXT3-fs: adding %s group %u: %u blocks " -+ "(%d free, %u reserved)\n", -+ ext3_bg_has_super(sb, input->group) ? 
"normal" : -+ "no-super", input->group, input->blocks_count, -+ free_blocks_count, input->reserved_blocks); -+ -+ if (group != sbi->s_groups_count) -+ ext3_warning(sb, __FUNCTION__, -+ "Cannot add at group %u (only %lu groups)", -+ input->group, sbi->s_groups_count); -+ else if ((start - le32_to_cpu(es->s_first_data_block)) % -+ EXT3_BLOCKS_PER_GROUP(sb)) -+ ext3_warning(sb, __FUNCTION__, "Last group not full"); -+ else if (input->reserved_blocks > input->blocks_count / 5) -+ ext3_warning(sb, __FUNCTION__, "Reserved blocks too high (%u)", -+ input->reserved_blocks); -+ else if (free_blocks_count < 0) -+ ext3_warning(sb, __FUNCTION__, "Bad blocks count %u", -+ input->blocks_count); -+ else if (!(bh = sb_bread(sb, end - 1))) -+ ext3_warning(sb, __FUNCTION__, "Cannot read last block (%u)", -+ end - 1); -+ else if (outside(input->block_bitmap, start, end)) -+ ext3_warning(sb, __FUNCTION__, -+ "Block bitmap not in group (block %u)", -+ input->block_bitmap); -+ else if (outside(input->inode_bitmap, start, end)) -+ ext3_warning(sb, __FUNCTION__, -+ "Inode bitmap not in group (block %u)", -+ input->inode_bitmap); -+ else if (outside(input->inode_table, start, end) || -+ outside(itend - 1, start, end)) -+ ext3_warning(sb, __FUNCTION__, -+ "Inode table not in group (blocks %u-%u)", -+ input->inode_table, itend - 1); -+ else if (input->inode_bitmap == input->block_bitmap) -+ ext3_warning(sb, __FUNCTION__, -+ "Block bitmap same as inode bitmap (%u)", -+ input->block_bitmap); -+ else if (inside(input->block_bitmap, input->inode_table, itend)) -+ ext3_warning(sb, __FUNCTION__, -+ "Block bitmap (%u) in inode table (%u-%u)", -+ input->block_bitmap, input->inode_table, itend-1); -+ else if (inside(input->inode_bitmap, input->inode_table, itend)) -+ ext3_warning(sb, __FUNCTION__, -+ "Inode bitmap (%u) in inode table (%u-%u)", -+ input->inode_bitmap, input->inode_table, itend-1); -+ else if (inside(input->block_bitmap, start, metaend)) -+ ext3_warning(sb, __FUNCTION__, -+ "Block 
bitmap (%u) in GDT table (%u-%u)", -+ input->block_bitmap, start, metaend - 1); -+ else if (inside(input->inode_bitmap, start, metaend)) -+ ext3_warning(sb, __FUNCTION__, -+ "Inode bitmap (%u) in GDT table (%u-%u)", -+ input->inode_bitmap, start, metaend - 1); -+ else if (inside(input->inode_table, start, metaend) || -+ inside(itend - 1, start, metaend)) -+ ext3_warning(sb, __FUNCTION__, -+ "Inode table (%u-%u) overlaps GDT table (%u-%u)", -+ input->inode_table, itend - 1, start, metaend - 1); -+ else -+ err = 0; -+ brelse(bh); -+ -+ return err; -+} -+ -+static struct buffer_head *bclean(handle_t *handle, struct super_block *sb, -+ unsigned long blk) -+{ -+ struct buffer_head *bh; -+ int err; -+ -+ bh = sb_getblk(sb, blk); -+ if ((err = ext3_journal_get_write_access(handle, bh))) { -+ brelse(bh); -+ bh = ERR_PTR(err); -+ } else { -+ lock_buffer(bh); -+ memset(bh->b_data, 0, sb->s_blocksize); -+ set_buffer_uptodate(bh); -+ unlock_buffer(bh); -+ } -+ -+ return bh; -+} -+ -+/* -+ * To avoid calling the atomic setbit hundreds or thousands of times, we only -+ * need to use it within a single byte (to ensure we get endianness right). -+ * We can use memset for the rest of the bitmap as there are no other users. -+ */ -+static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap) -+{ -+ int i; -+ -+ if (start_bit >= end_bit) -+ return; -+ -+ ext3_debug("mark end bits +%d through +%d used\n", start_bit, end_bit); -+ for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++) -+ ext3_set_bit(i, bitmap); -+ if (i < end_bit) -+ memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3); -+} -+ -+/* -+ * Set up the block and inode bitmaps, and the inode table for the new group. -+ * This doesn't need to be part of the main transaction, since we are only -+ * changing blocks outside the actual filesystem. We still do journaling to -+ * ensure the recovery is correct in case of a failure just after resize. -+ * If any part of this fails, we simply abort the resize. 
-+ */ -+static int setup_new_group_blocks(struct super_block *sb, -+ struct ext3_new_group_data *input) -+{ -+ struct ext3_sb_info *sbi = EXT3_SB(sb); -+ unsigned long start = input->group * sbi->s_blocks_per_group + -+ le32_to_cpu(sbi->s_es->s_first_data_block); -+ int reserved_gdb = ext3_bg_has_super(sb, input->group) ? -+ le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) : 0; -+ unsigned long gdblocks = ext3_bg_num_gdb(sb, input->group); -+ struct buffer_head *bh; -+ handle_t *handle; -+ unsigned long block; -+ int bit; -+ int i; -+ int err = 0, err2; -+ -+ handle = ext3_journal_start_sb(sb, reserved_gdb + gdblocks + -+ 2 + sbi->s_itb_per_group); -+ if (IS_ERR(handle)) -+ return PTR_ERR(handle); -+ -+ lock_super(sb); -+ if (input->group != sbi->s_groups_count) { -+ err = -EBUSY; -+ goto exit_journal; -+ } -+ -+ if (IS_ERR(bh = bclean(handle, sb, input->block_bitmap))) { -+ err = PTR_ERR(bh); -+ goto exit_journal; -+ } -+ -+ if (ext3_bg_has_super(sb, input->group)) { -+ ext3_debug("mark backup superblock %#04lx (+0)\n", start); -+ ext3_set_bit(0, bh->b_data); -+ } -+ -+ /* Copy all of the GDT blocks into the backup in this group */ -+ for (i = 0, bit = 1, block = start + 1; -+ i < gdblocks; i++, block++, bit++) { -+ struct buffer_head *gdb; -+ -+ ext3_debug("update backup group %#04lx (+%d)\n", block, bit); -+ -+ gdb = sb_getblk(sb, block); -+ if ((err = ext3_journal_get_write_access(handle, gdb))) { -+ brelse(gdb); -+ goto exit_bh; -+ } -+ lock_buffer(bh); -+ memcpy(gdb->b_data, sbi->s_group_desc[i], bh->b_size); -+ set_buffer_uptodate(gdb); -+ unlock_buffer(bh); -+ ext3_journal_dirty_metadata(handle, gdb); -+ ext3_set_bit(bit, bh->b_data); -+ brelse(gdb); -+ } -+ -+ /* Zero out all of the reserved backup group descriptor table blocks */ -+ for (i = 0, bit = gdblocks + 1, block = start + bit; -+ i < reserved_gdb; i++, block++, bit++) { -+ struct buffer_head *gdb; -+ -+ ext3_debug("clear reserved block %#04lx (+%d)\n", block, bit); -+ -+ if (IS_ERR(gdb = 
bclean(handle, sb, block))) { -+ err = PTR_ERR(bh); -+ goto exit_bh; -+ } -+ ext3_journal_dirty_metadata(handle, gdb); -+ ext3_set_bit(bit, bh->b_data); -+ brelse(gdb); -+ } -+ ext3_debug("mark block bitmap %#04x (+%ld)\n", input->block_bitmap, -+ input->block_bitmap - start); -+ ext3_set_bit(input->block_bitmap - start, bh->b_data); -+ ext3_debug("mark inode bitmap %#04x (+%ld)\n", input->inode_bitmap, -+ input->inode_bitmap - start); -+ ext3_set_bit(input->inode_bitmap - start, bh->b_data); -+ -+ /* Zero out all of the inode table blocks */ -+ for (i = 0, block = input->inode_table, bit = block - start; -+ i < sbi->s_itb_per_group; i++, bit++, block++) { -+ struct buffer_head *it; -+ -+ ext3_debug("clear inode block %#04x (+%ld)\n", block, bit); -+ if (IS_ERR(it = bclean(handle, sb, block))) { -+ err = PTR_ERR(it); -+ goto exit_bh; -+ } -+ ext3_journal_dirty_metadata(handle, it); -+ brelse(it); -+ ext3_set_bit(bit, bh->b_data); -+ } -+ mark_bitmap_end(input->blocks_count, EXT3_BLOCKS_PER_GROUP(sb), -+ bh->b_data); -+ ext3_journal_dirty_metadata(handle, bh); -+ brelse(bh); -+ -+ /* Mark unused entries in inode bitmap used */ -+ ext3_debug("clear inode bitmap %#04x (+%ld)\n", -+ input->inode_bitmap, input->inode_bitmap - start); -+ if (IS_ERR(bh = bclean(handle, sb, input->inode_bitmap))) { -+ err = PTR_ERR(bh); -+ goto exit_journal; -+ } -+ -+ mark_bitmap_end(EXT3_INODES_PER_GROUP(sb), EXT3_BLOCKS_PER_GROUP(sb), -+ bh->b_data); -+ ext3_journal_dirty_metadata(handle, bh); -+exit_bh: -+ brelse(bh); -+ -+exit_journal: -+ unlock_super(sb); -+ if ((err2 = ext3_journal_stop(handle)) && !err) -+ err = err2; -+ -+ return err; -+} -+ -+/* -+ * Iterate through the groups which hold BACKUP superblock/GDT copies in an -+ * ext3 filesystem. The counters should be initialized to 1, 5, and 7 before -+ * calling this for the first time. In a sparse filesystem it will be the -+ * sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ... 
-+ * For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ... -+ */ -+unsigned ext3_list_backups(struct super_block *sb, unsigned *three, -+ unsigned *five, unsigned *seven) -+{ -+ unsigned *min = three; -+ int mult = 3; -+ unsigned ret; -+ -+ if (!EXT3_HAS_RO_COMPAT_FEATURE(sb, -+ EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER)) { -+ ret = *min; -+ *min += 1; -+ return ret; -+ } -+ -+ if (*five < *min) { -+ min = five; -+ mult = 5; -+ } -+ if (*seven < *min) { -+ min = seven; -+ mult = 7; -+ } -+ -+ ret = *min; -+ *min *= mult; -+ -+ return ret; -+} -+ -+/* -+ * Check that all of the backup GDT blocks are held in the primary GDT block. -+ * It is assumed that they are stored in group order. Returns the number of -+ * groups in current filesystem that have BACKUPS, or -ve error code. -+ */ -+static int verify_reserved_gdb(struct super_block *sb, -+ struct buffer_head *primary) -+{ -+ const unsigned long blk = primary->b_blocknr; -+ const unsigned long end = EXT3_SB(sb)->s_groups_count; -+ unsigned three = 1; -+ unsigned five = 5; -+ unsigned seven = 7; -+ unsigned grp; -+ __u32 *p = (__u32 *)primary->b_data; -+ int gdbackups = 0; -+ -+ while ((grp = ext3_list_backups(sb, &three, &five, &seven)) < end) { -+ if (le32_to_cpu(*p++) != grp * EXT3_BLOCKS_PER_GROUP(sb) + blk){ -+ ext3_warning(sb, __FUNCTION__, -+ "reserved GDT %ld missing grp %d (%ld)\n", -+ blk, grp, -+ grp * EXT3_BLOCKS_PER_GROUP(sb) + blk); -+ return -EINVAL; -+ } -+ if (++gdbackups > EXT3_ADDR_PER_BLOCK(sb)) -+ return -EFBIG; -+ } -+ -+ return gdbackups; -+} -+ -+/* -+ * Called when we need to bring a reserved group descriptor table block into -+ * use from the resize inode. The primary copy of the new GDT block currently -+ * is an indirect block (under the double indirect block in the resize inode). -+ * The new backup GDT blocks will be stored as leaf blocks in this indirect -+ * block, in group order. 
Even though we know all the block numbers we need, -+ * we check to ensure that the resize inode has actually reserved these blocks. -+ * -+ * Don't need to update the block bitmaps because the blocks are still in use. -+ * -+ * We get all of the error cases out of the way, so that we are sure to not -+ * fail once we start modifying the data on disk, because JBD has no rollback. -+ */ -+static int add_new_gdb(handle_t *handle, struct inode *inode, -+ struct ext3_new_group_data *input, -+ struct buffer_head **primary) -+{ -+ struct super_block *sb = inode->i_sb; -+ struct ext3_super_block *es = EXT3_SB(sb)->s_es; -+ unsigned long gdb_num = input->group / EXT3_DESC_PER_BLOCK(sb); -+ unsigned long gdblock = EXT3_SB(sb)->s_sbh->b_blocknr + 1 + gdb_num; -+ struct buffer_head **o_group_desc, **n_group_desc; -+ struct buffer_head *dind; -+ int gdbackups; -+ struct ext3_iloc iloc; -+ __u32 *data; -+ int err; -+ -+ if (test_opt(sb, DEBUG)) -+ printk(KERN_DEBUG -+ "EXT3-fs: ext3_add_new_gdb: adding group block %lu\n", -+ gdb_num); -+ -+ /* -+ * If we are not using the primary superblock/GDT copy don't resize, -+ * because the user tools have no way of handling this. Probably a -+ * bad time to do it anyways. 
-+ */ -+ if (EXT3_SB(sb)->s_sbh->b_blocknr != -+ le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block)) { -+ ext3_warning(sb, __FUNCTION__, -+ "won't resize using backup superblock at %llu\n", -+ (unsigned long long)EXT3_SB(sb)->s_sbh->b_blocknr); -+ return -EPERM; -+ } -+ -+ *primary = sb_bread(sb, gdblock); -+ if (!*primary) -+ return -EIO; -+ -+ if ((gdbackups = verify_reserved_gdb(sb, *primary)) < 0) { -+ err = gdbackups; -+ goto exit_bh; -+ } -+ -+ data = EXT3_I(inode)->i_data + EXT3_DIND_BLOCK; -+ dind = sb_bread(sb, le32_to_cpu(*data)); -+ if (!dind) { -+ err = -EIO; -+ goto exit_bh; -+ } -+ -+ data = (__u32 *)dind->b_data; -+ if (le32_to_cpu(data[gdb_num % EXT3_ADDR_PER_BLOCK(sb)]) != gdblock) { -+ ext3_warning(sb, __FUNCTION__, -+ "new group %u GDT block %lu not reserved\n", -+ input->group, gdblock); -+ err = -EINVAL; -+ goto exit_dind; -+ } -+ -+ if ((err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh))) -+ goto exit_dind; -+ -+ if ((err = ext3_journal_get_write_access(handle, *primary))) -+ goto exit_sbh; -+ -+ if ((err = ext3_journal_get_write_access(handle, dind))) -+ goto exit_primary; -+ -+ /* ext3_reserve_inode_write() gets a reference on the iloc */ -+ if ((err = ext3_reserve_inode_write(handle, inode, &iloc))) -+ goto exit_dindj; -+ -+ n_group_desc = (struct buffer_head **)kmalloc((gdb_num + 1) * -+ sizeof(struct buffer_head *), GFP_KERNEL); -+ if (!n_group_desc) { -+ err = -ENOMEM; -+ ext3_warning (sb, __FUNCTION__, -+ "not enough memory for %lu groups", gdb_num + 1); -+ goto exit_inode; -+ } -+ -+ /* -+ * Finally, we have all of the possible failures behind us... -+ * -+ * Remove new GDT block from inode double-indirect block and clear out -+ * the new GDT block for use (which also "frees" the backup GDT blocks -+ * from the reserved inode). We don't need to change the bitmaps for -+ * these blocks, because they are marked as in-use from being in the -+ * reserved inode, and will become GDT blocks (primary and backup). 
-+ */ -+ data[gdb_num % EXT3_ADDR_PER_BLOCK(sb)] = 0; -+ ext3_journal_dirty_metadata(handle, dind); -+ brelse(dind); -+ inode->i_blocks -= (gdbackups + 1) * sb->s_blocksize >> 9; -+ ext3_mark_iloc_dirty(handle, inode, &iloc); -+ memset((*primary)->b_data, 0, sb->s_blocksize); -+ ext3_journal_dirty_metadata(handle, *primary); -+ -+ o_group_desc = EXT3_SB(sb)->s_group_desc; -+ memcpy(n_group_desc, o_group_desc, -+ EXT3_SB(sb)->s_gdb_count * sizeof(struct buffer_head *)); -+ n_group_desc[gdb_num] = *primary; -+ EXT3_SB(sb)->s_group_desc = n_group_desc; -+ EXT3_SB(sb)->s_gdb_count++; -+ kfree(o_group_desc); -+ -+ es->s_reserved_gdt_blocks = -+ cpu_to_le16(le16_to_cpu(es->s_reserved_gdt_blocks) - 1); -+ ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); -+ -+ return 0; -+ -+exit_inode: -+ //ext3_journal_release_buffer(handle, iloc.bh); -+ brelse(iloc.bh); -+exit_dindj: -+ //ext3_journal_release_buffer(handle, dind); -+exit_primary: -+ //ext3_journal_release_buffer(handle, *primary); -+exit_sbh: -+ //ext3_journal_release_buffer(handle, *primary); -+exit_dind: -+ brelse(dind); -+exit_bh: -+ brelse(*primary); -+ -+ ext3_debug("leaving with error %d\n", err); -+ return err; -+} -+ -+/* -+ * Called when we are adding a new group which has a backup copy of each of -+ * the GDT blocks (i.e. sparse group) and there are reserved GDT blocks. -+ * We need to add these reserved backup GDT blocks to the resize inode, so -+ * that they are kept for future resizing and not allocated to files. -+ * -+ * Each reserved backup GDT block will go into a different indirect block. -+ * The indirect blocks are actually the primary reserved GDT blocks, -+ * so we know in advance what their block numbers are. We only get the -+ * double-indirect block to verify it is pointing to the primary reserved -+ * GDT blocks so we don't overwrite a data block by accident. The reserved -+ * backup GDT blocks are stored in their reserved primary GDT block. 
-+ */ -+static int reserve_backup_gdb(handle_t *handle, struct inode *inode, -+ struct ext3_new_group_data *input) -+{ -+ struct super_block *sb = inode->i_sb; -+ int reserved_gdb =le16_to_cpu(EXT3_SB(sb)->s_es->s_reserved_gdt_blocks); -+ struct buffer_head **primary; -+ struct buffer_head *dind; -+ struct ext3_iloc iloc; -+ unsigned long blk; -+ __u32 *data, *end; -+ int gdbackups = 0; -+ int res, i; -+ int err; -+ -+ primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL); -+ if (!primary) -+ return -ENOMEM; -+ -+ data = EXT3_I(inode)->i_data + EXT3_DIND_BLOCK; -+ dind = sb_bread(sb, le32_to_cpu(*data)); -+ if (!dind) { -+ err = -EIO; -+ goto exit_free; -+ } -+ -+ blk = EXT3_SB(sb)->s_sbh->b_blocknr + 1 + EXT3_SB(sb)->s_gdb_count; -+ data = (__u32 *)dind->b_data + EXT3_SB(sb)->s_gdb_count; -+ end = (__u32 *)dind->b_data + EXT3_ADDR_PER_BLOCK(sb); -+ -+ /* Get each reserved primary GDT block and verify it holds backups */ -+ for (res = 0; res < reserved_gdb; res++, blk++) { -+ if (le32_to_cpu(*data) != blk) { -+ ext3_warning(sb, __FUNCTION__, -+ "reserved block %lu not at offset %ld\n", -+ blk, (long)(data - (__u32 *)dind->b_data)); -+ err = -EINVAL; -+ goto exit_bh; -+ } -+ primary[res] = sb_bread(sb, blk); -+ if (!primary[res]) { -+ err = -EIO; -+ goto exit_bh; -+ } -+ if ((gdbackups = verify_reserved_gdb(sb, primary[res])) < 0) { -+ brelse(primary[res]); -+ err = gdbackups; -+ goto exit_bh; -+ } -+ if (++data >= end) -+ data = (__u32 *)dind->b_data; -+ } -+ -+ for (i = 0; i < reserved_gdb; i++) { -+ if ((err = ext3_journal_get_write_access(handle, primary[i]))) { -+ /* -+ int j; -+ for (j = 0; j < i; j++) -+ ext3_journal_release_buffer(handle, primary[j]); -+ */ -+ goto exit_bh; -+ } -+ } -+ -+ if ((err = ext3_reserve_inode_write(handle, inode, &iloc))) -+ goto exit_bh; -+ -+ /* -+ * Finally we can add each of the reserved backup GDT blocks from -+ * the new group to its reserved primary GDT block. 
-+ */ -+ blk = input->group * EXT3_BLOCKS_PER_GROUP(sb); -+ for (i = 0; i < reserved_gdb; i++) { -+ int err2; -+ data = (__u32 *)primary[i]->b_data; -+ /* printk("reserving backup %lu[%u] = %lu\n", -+ primary[i]->b_blocknr, gdbackups, -+ blk + primary[i]->b_blocknr); */ -+ data[gdbackups] = cpu_to_le32(blk + primary[i]->b_blocknr); -+ err2 = ext3_journal_dirty_metadata(handle, primary[i]); -+ if (!err) -+ err = err2; -+ } -+ inode->i_blocks += reserved_gdb * sb->s_blocksize >> 9; -+ ext3_mark_iloc_dirty(handle, inode, &iloc); -+ -+exit_bh: -+ while (--res >= 0) -+ brelse(primary[res]); -+ brelse(dind); -+ -+exit_free: -+ kfree(primary); -+ -+ return err; -+} -+ -+/* -+ * Update the backup copies of the ext3 metadata. These don't need to be part -+ * of the main resize transaction, because e2fsck will re-write them if there -+ * is a problem (basically only OOM will cause a problem). However, we -+ * _should_ update the backups if possible, in case the primary gets trashed -+ * for some reason and we need to run e2fsck from a backup superblock. The -+ * important part is that the new block and inode counts are in the backup -+ * superblocks, and the location of the new group metadata in the GDT backups. -+ * -+ * We do not need lock_super() for this, because these blocks are not -+ * otherwise touched by the filesystem code when it is mounted. We don't -+ * need to worry about last changing from sbi->s_groups_count, because the -+ * worst that can happen is that we do not copy the full number of backups -+ * at this time. The resize which changed s_groups_count will backup again. 
-+ */ -+static void update_backups(struct super_block *sb, -+ int blk_off, char *data, int size) -+{ -+ struct ext3_sb_info *sbi = EXT3_SB(sb); -+ const unsigned long last = sbi->s_groups_count; -+ const int bpg = EXT3_BLOCKS_PER_GROUP(sb); -+ unsigned three = 1; -+ unsigned five = 5; -+ unsigned seven = 7; -+ unsigned group; -+ int rest = sb->s_blocksize - size; -+ handle_t *handle; -+ int err = 0, err2; -+ -+ handle = ext3_journal_start_sb(sb, EXT3_MAX_TRANS_DATA); -+ if (IS_ERR(handle)) { -+ group = 1; -+ err = PTR_ERR(handle); -+ goto exit_err; -+ } -+ -+ while ((group = ext3_list_backups(sb, &three, &five, &seven)) < last) { -+ struct buffer_head *bh; -+ -+ /* Out of journal space, and can't get more - abort - so sad */ -+ if (handle->h_buffer_credits == 0 && -+ ext3_journal_extend(handle, EXT3_MAX_TRANS_DATA) && -+ (err = ext3_journal_restart(handle, EXT3_MAX_TRANS_DATA))) -+ break; -+ -+ bh = sb_getblk(sb, group * bpg + blk_off); -+ ext3_debug(sb, __FUNCTION__, "update metadata backup %#04lx\n", -+ bh->b_blocknr); -+ if ((err = ext3_journal_get_write_access(handle, bh))) -+ break; -+ lock_buffer(bh); -+ memcpy(bh->b_data, data, size); -+ if (rest) -+ memset(bh->b_data + size, 0, rest); -+ set_buffer_uptodate(bh); -+ unlock_buffer(bh); -+ ext3_journal_dirty_metadata(handle, bh); -+ brelse(bh); -+ } -+ if ((err2 = ext3_journal_stop(handle)) && !err) -+ err = err2; -+ -+ /* -+ * Ugh! Need to have e2fsck write the backup copies. It is too -+ * late to revert the resize, we shouldn't fail just because of -+ * the backup copies (they are only needed in case of corruption). -+ * -+ * However, if we got here we have a journal problem too, so we -+ * can't really start a transaction to mark the superblock. -+ * Chicken out and just set the flag on the hope it will be written -+ * to disk, and if not - we will simply wait until next fsck. 
-+ */ -+exit_err: -+ if (err) { -+ ext3_warning(sb, __FUNCTION__, -+ "can't update backup for group %d (err %d), " -+ "forcing fsck on next reboot\n", group, err); -+ sbi->s_mount_state &= ~EXT3_VALID_FS; -+ sbi->s_es->s_state &= ~cpu_to_le16(EXT3_VALID_FS); -+ mark_buffer_dirty(sbi->s_sbh); -+ } -+} -+ -+/* Add group descriptor data to an existing or new group descriptor block. -+ * Ensure we handle all possible error conditions _before_ we start modifying -+ * the filesystem, because we cannot abort the transaction and not have it -+ * write the data to disk. -+ * -+ * If we are on a GDT block boundary, we need to get the reserved GDT block. -+ * Otherwise, we may need to add backup GDT blocks for a sparse group. -+ * -+ * We only need to hold the superblock lock while we are actually adding -+ * in the new group's counts to the superblock. Prior to that we have -+ * not really "added" the group at all. We re-check that we are still -+ * adding in the last group in case things have changed since verifying. -+ */ -+int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input) -+{ -+ struct ext3_sb_info *sbi = EXT3_SB(sb); -+ struct ext3_super_block *es = sbi->s_es; -+ int reserved_gdb = ext3_bg_has_super(sb, input->group) ? 
-+ le16_to_cpu(es->s_reserved_gdt_blocks) : 0; -+ struct buffer_head *primary = NULL; -+ struct ext3_group_desc *gdp; -+ struct inode *inode = NULL; -+ handle_t *handle; -+ int gdb_off, gdb_num; -+ int err, err2; -+ -+ gdb_num = input->group / EXT3_DESC_PER_BLOCK(sb); -+ gdb_off = input->group % EXT3_DESC_PER_BLOCK(sb); -+ -+ if (gdb_off == 0 && !EXT3_HAS_RO_COMPAT_FEATURE(sb, -+ EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER)) { -+ ext3_warning(sb, __FUNCTION__, -+ "Can't resize non-sparse filesystem further\n"); -+ return -EPERM; -+ } -+ -+ if (reserved_gdb || gdb_off == 0) { -+ if (!EXT3_HAS_COMPAT_FEATURE(sb, -+ EXT3_FEATURE_COMPAT_RESIZE_INODE)){ -+ ext3_warning(sb, __FUNCTION__, -+ "No reserved GDT blocks, can't resize\n"); -+ return -EPERM; -+ } -+ inode = iget(sb, EXT3_RESIZE_INO); -+ if (!inode || is_bad_inode(inode)) { -+ ext3_warning(sb, __FUNCTION__, -+ "Error opening resize inode\n"); -+ iput(inode); -+ return -ENOENT; -+ } -+ } -+ -+ if ((err = verify_group_input(sb, input))) -+ goto exit_put; -+ -+ if ((err = setup_new_group_blocks(sb, input))) -+ goto exit_put; -+ -+ /* -+ * We will always be modifying at least the superblock and a GDT -+ * block. If we are adding a group past the last current GDT block, -+ * we will also modify the inode and the dindirect block. If we -+ * are adding a group with superblock/GDT backups we will also -+ * modify each of the reserved GDT dindirect blocks. -+ */ -+ handle = ext3_journal_start_sb(sb, -+ ext3_bg_has_super(sb, input->group) ? 
-+ 3 + reserved_gdb : 4); -+ if (IS_ERR(handle)) { -+ err = PTR_ERR(handle); -+ goto exit_put; -+ } -+ -+ lock_super(sb); -+ if (input->group != EXT3_SB(sb)->s_groups_count) { -+ ext3_warning(sb, __FUNCTION__, -+ "multiple resizers run on filesystem!\n"); -+ goto exit_journal; -+ } -+ -+ if ((err = ext3_journal_get_write_access(handle, sbi->s_sbh))) -+ goto exit_journal; -+ -+ /* -+ * We will only either add reserved group blocks to a backup group -+ * or remove reserved blocks for the first group in a new group block. -+ * Doing both would be mean more complex code, and sane people don't -+ * use non-sparse filesystems anymore. This is already checked above. -+ */ -+ if (gdb_off) { -+ primary = sbi->s_group_desc[gdb_num]; -+ if ((err = ext3_journal_get_write_access(handle, primary))) -+ goto exit_journal; -+ -+ if (reserved_gdb && ext3_bg_num_gdb(sb, input->group) && -+ (err = reserve_backup_gdb(handle, inode, input))) -+ goto exit_journal; -+ } else if ((err = add_new_gdb(handle, inode, input, &primary))) -+ goto exit_journal; -+ -+ /* -+ * OK, now we've set up the new group. Time to make it active. -+ * -+ * Current kernels don't lock all allocations via lock_super(), -+ * so we have to be safe wrt. concurrent accesses the group -+ * data. So we need to be careful to set all of the relevant -+ * group descriptor data etc. *before* we enable the group. -+ * -+ * The key field here is EXT3_SB(sb)->s_groups_count: as long as -+ * that retains its old value, nobody is going to access the new -+ * group. -+ * -+ * So first we update all the descriptor metadata for the new -+ * group; then we update the total disk blocks count; then we -+ * update the groups count to enable the group; then finally we -+ * update the free space counts so that the system can start -+ * using the new disk blocks. 
-+ */ -+ -+ /* Update group descriptor block for new group */ -+ gdp = (struct ext3_group_desc *)primary->b_data + gdb_off; -+ -+ gdp->bg_block_bitmap = cpu_to_le32(input->block_bitmap); -+ gdp->bg_inode_bitmap = cpu_to_le32(input->inode_bitmap); -+ gdp->bg_inode_table = cpu_to_le32(input->inode_table); -+ gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count); -+ gdp->bg_free_inodes_count = cpu_to_le16(EXT3_INODES_PER_GROUP(sb)); -+ -+ /* -+ * Make the new blocks and inodes valid next. We do this before -+ * increasing the group count so that once the group is enabled, -+ * all of its blocks and inodes are already valid. -+ * -+ * We always allocate group-by-group, then block-by-block or -+ * inode-by-inode within a group, so enabling these -+ * blocks/inodes before the group is live won't actually let us -+ * allocate the new space yet. -+ */ -+ es->s_blocks_count = cpu_to_le32(le32_to_cpu(es->s_blocks_count) + -+ input->blocks_count); -+ es->s_inodes_count = cpu_to_le32(le32_to_cpu(es->s_inodes_count) + -+ EXT3_INODES_PER_GROUP(sb)); -+ -+ /* -+ * We need to protect s_groups_count against other CPUs seeing -+ * inconsistent state in the superblock. -+ * -+ * The precise rules we use are: -+ * -+ * * Writers of s_groups_count *must* hold lock_super -+ * AND -+ * * Writers must perform a smp_wmb() after updating all dependent -+ * data and before modifying the groups count -+ * -+ * * Readers must hold lock_super() over the access -+ * OR -+ * * Readers must perform an smp_rmb() after reading the groups count -+ * and before reading any dependent data. -+ * -+ * NB. These rules can be relaxed when checking the group count -+ * while freeing data, as we can only allocate from a block -+ * group after serialising against the group count, and we can -+ * only then free after serialising in turn against that -+ * allocation. 
-+ */ -+ smp_wmb(); -+ -+ /* Update the global fs size fields */ -+ EXT3_SB(sb)->s_groups_count++; -+ -+ ext3_journal_dirty_metadata(handle, primary); -+ -+ /* Update the reserved block counts only once the new group is -+ * active. */ -+ es->s_r_blocks_count = cpu_to_le32(le32_to_cpu(es->s_r_blocks_count) + -+ input->reserved_blocks); -+ -+ /* Update the free space counts */ -+ percpu_counter_mod(&sbi->s_freeblocks_counter, -+ input->free_blocks_count); -+ percpu_counter_mod(&sbi->s_freeinodes_counter, -+ EXT3_INODES_PER_GROUP(sb)); -+ -+ ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); -+ sb->s_dirt = 1; -+ -+exit_journal: -+ unlock_super(sb); -+ if ((err2 = ext3_journal_stop(handle)) && !err) -+ err = err2; -+ if (!err) { -+ update_backups(sb, sbi->s_sbh->b_blocknr, (char *)es, -+ sizeof(struct ext3_super_block)); -+ update_backups(sb, primary->b_blocknr, primary->b_data, -+ primary->b_size); -+ } -+exit_put: -+ iput(inode); -+ return err; -+} /* ext3_group_add */ -+ -+/* Extend the filesystem to the new number of blocks specified. This entry -+ * point is only used to extend the current filesystem to the end of the last -+ * existing group. It can be accessed via ioctl, or by "remount,resize=<size>" -+ * for emergencies (because it has no dependencies on reserved blocks). -+ * -+ * If we _really_ wanted, we could use default values to call ext3_group_add() -+ * allow the "remount" trick to work for arbitrary resizing, assuming enough -+ * GDT blocks are reserved to grow to the desired size. -+ */ -+int ext3_group_extend(struct super_block *sb, struct ext3_super_block *es, -+ unsigned long n_blocks_count) -+{ -+ unsigned long o_blocks_count; -+ unsigned long o_groups_count; -+ unsigned long last; -+ int add; -+ struct buffer_head * bh; -+ handle_t *handle; -+ int err, freed_blocks; -+ -+ /* We don't need to worry about locking wrt other resizers just -+ * yet: we're going to revalidate es->s_blocks_count after -+ * taking lock_super() below. 
*/ -+ o_blocks_count = le32_to_cpu(es->s_blocks_count); -+ o_groups_count = EXT3_SB(sb)->s_groups_count; -+ -+ if (test_opt(sb, DEBUG)) -+ printk(KERN_DEBUG "EXT3-fs: extending last group from %lu to %lu blocks\n", -+ o_blocks_count, n_blocks_count); -+ -+ if (n_blocks_count == 0 || n_blocks_count == o_blocks_count) -+ return 0; -+ -+ if (n_blocks_count < o_blocks_count) { -+ ext3_warning(sb, __FUNCTION__, -+ "can't shrink FS - resize aborted"); -+ return -EBUSY; -+ } -+ -+ /* Handle the remaining blocks in the last group only. */ -+ last = (o_blocks_count - le32_to_cpu(es->s_first_data_block)) % -+ EXT3_BLOCKS_PER_GROUP(sb); -+ -+ if (last == 0) { -+ ext3_warning(sb, __FUNCTION__, -+ "need to use ext2online to resize further\n"); -+ return -EPERM; -+ } -+ -+ add = EXT3_BLOCKS_PER_GROUP(sb) - last; -+ -+ if (o_blocks_count + add > n_blocks_count) -+ add = n_blocks_count - o_blocks_count; -+ -+ if (o_blocks_count + add < n_blocks_count) -+ ext3_warning(sb, __FUNCTION__, -+ "will only finish group (%lu blocks, %u new)", -+ o_blocks_count + add, add); -+ -+ /* See if the device is actually as big as what was requested */ -+ bh = sb_bread(sb, o_blocks_count + add -1); -+ if (!bh) { -+ ext3_warning(sb, __FUNCTION__, -+ "can't read last block, resize aborted"); -+ return -ENOSPC; -+ } -+ brelse(bh); -+ -+ /* We will update the superblock, one block bitmap, and -+ * one group descriptor via ext3_free_blocks(). 
-+ */ -+ handle = ext3_journal_start_sb(sb, 3); -+ if (IS_ERR(handle)) { -+ err = PTR_ERR(handle); -+ ext3_warning(sb, __FUNCTION__, "error %d on journal start",err); -+ goto exit_put; -+ } -+ -+ lock_super(sb); -+ if (o_blocks_count != le32_to_cpu(es->s_blocks_count)) { -+ ext3_warning(sb, __FUNCTION__, -+ "multiple resizers run on filesystem!\n"); -+ err = -EBUSY; -+ goto exit_put; -+ } -+ -+ if ((err = ext3_journal_get_write_access(handle, -+ EXT3_SB(sb)->s_sbh))) { -+ ext3_warning(sb, __FUNCTION__, -+ "error %d on journal write access", err); -+ unlock_super(sb); -+ ext3_journal_stop(handle); -+ goto exit_put; -+ } -+ es->s_blocks_count = cpu_to_le32(o_blocks_count + add); -+ ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); -+ sb->s_dirt = 1; -+ unlock_super(sb); -+ ext3_debug("freeing blocks %ld through %ld\n", o_blocks_count, -+ o_blocks_count + add); -+ ext3_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks); -+ ext3_debug("freed blocks %ld through %ld\n", o_blocks_count, -+ o_blocks_count + add); -+ if ((err = ext3_journal_stop(handle))) -+ goto exit_put; -+ if (test_opt(sb, DEBUG)) -+ printk(KERN_DEBUG "EXT3-fs: extended group to %u blocks\n", -+ le32_to_cpu(es->s_blocks_count)); -+ update_backups(sb, EXT3_SB(sb)->s_sbh->b_blocknr, (char *)es, -+ sizeof(struct ext3_super_block)); -+exit_put: -+ return err; -+} /* ext3_group_extend */ -diff -uprN linux-2.6.8.1.orig/fs/ext3/super.c linux-2.6.8.1-ve022stab078/fs/ext3/super.c ---- linux-2.6.8.1.orig/fs/ext3/super.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/super.c 2006-05-11 13:05:40.000000000 +0400 -@@ -59,19 +59,19 @@ static int ext3_sync_fs(struct super_blo - * that sync() will call the filesystem's write_super callback if - * appropriate. 
- */ --handle_t *ext3_journal_start(struct inode *inode, int nblocks) -+handle_t *ext3_journal_start_sb(struct super_block *sb, int nblocks) - { - journal_t *journal; - -- if (inode->i_sb->s_flags & MS_RDONLY) -+ if (sb->s_flags & MS_RDONLY) - return ERR_PTR(-EROFS); - - /* Special case here: if the journal has aborted behind our - * backs (eg. EIO in the commit thread), then we still need to - * take the FS itself readonly cleanly. */ -- journal = EXT3_JOURNAL(inode); -+ journal = EXT3_SB(sb)->s_journal; - if (is_journal_aborted(journal)) { -- ext3_abort(inode->i_sb, __FUNCTION__, -+ ext3_abort(sb, __FUNCTION__, - "Detected aborted journal"); - return ERR_PTR(-EROFS); - } -@@ -108,17 +108,20 @@ void ext3_journal_abort_handle(const cha - char nbuf[16]; - const char *errstr = ext3_decode_error(NULL, err, nbuf); - -- printk(KERN_ERR "%s: aborting transaction: %s in %s", -- caller, errstr, err_fn); -- - if (bh) - BUFFER_TRACE(bh, "abort"); -- journal_abort_handle(handle); -+ - if (!handle->h_err) - handle->h_err = err; --} - --static char error_buf[1024]; -+ if (is_handle_aborted(handle)) -+ return; -+ -+ printk(KERN_ERR "%s: aborting transaction: %s in %s\n", -+ caller, errstr, err_fn); -+ -+ journal_abort_handle(handle); -+} - - /* Deal with the reporting of failure conditions on a filesystem such as - * inconsistencies detected or read IO failures. 
-@@ -140,7 +143,7 @@ static void ext3_handle_error(struct sup - struct ext3_super_block *es = EXT3_SB(sb)->s_es; - - EXT3_SB(sb)->s_mount_state |= EXT3_ERROR_FS; -- es->s_state |= cpu_to_le32(EXT3_ERROR_FS); -+ es->s_state |= cpu_to_le16(EXT3_ERROR_FS); - - if (sb->s_flags & MS_RDONLY) - return; -@@ -166,12 +169,11 @@ void ext3_error (struct super_block * sb - { - va_list args; - -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- -- printk (KERN_CRIT "EXT3-fs error (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ va_start(args, fmt); -+ printk(KERN_CRIT "EXT3-fs error (device %s): %s: ",sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ va_end(args); - - ext3_handle_error(sb); - } -@@ -240,21 +242,19 @@ void ext3_abort (struct super_block * sb - - printk (KERN_CRIT "ext3_abort called.\n"); - -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- -- if (test_opt (sb, ERRORS_PANIC)) -- panic ("EXT3-fs panic (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ va_start(args, fmt); -+ printk(KERN_CRIT "EXT3-fs error (device %s): %s: ",sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ va_end(args); - -- printk (KERN_CRIT "EXT3-fs abort (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ if (test_opt(sb, ERRORS_PANIC)) -+ panic("EXT3-fs panic from previous error\n"); - - if (sb->s_flags & MS_RDONLY) - return; - -- printk (KERN_CRIT "Remounting filesystem read-only\n"); -+ printk(KERN_CRIT "Remounting filesystem read-only\n"); - EXT3_SB(sb)->s_mount_state |= EXT3_ERROR_FS; - sb->s_flags |= MS_RDONLY; - EXT3_SB(sb)->s_mount_opt |= EXT3_MOUNT_ABORT; -@@ -272,15 +272,16 @@ NORET_TYPE void ext3_panic (struct super - { - va_list args; - -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -+ va_start(args, fmt); -+ printk(KERN_CRIT "EXT3-fs error (device %s): %s: ",sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ 
va_end(args); - - /* this is to prevent panic from syncing this filesystem */ - /* AKPM: is this sufficient? */ - sb->s_flags |= MS_RDONLY; -- panic ("EXT3-fs panic (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ panic ("EXT3-fs panic forced\n"); - } - - void ext3_warning (struct super_block * sb, const char * function, -@@ -288,11 +289,12 @@ void ext3_warning (struct super_block * - { - va_list args; - -- va_start (args, fmt); -- vsprintf (error_buf, fmt, args); -- va_end (args); -- printk (KERN_WARNING "EXT3-fs warning (device %s): %s: %s\n", -- sb->s_id, function, error_buf); -+ va_start(args, fmt); -+ printk(KERN_WARNING "EXT3-fs warning (device %s): %s: ", -+ sb->s_id, function); -+ vprintk(fmt, args); -+ printk("\n"); -+ va_end(args); - } - - void ext3_update_dynamic_rev(struct super_block *sb) -@@ -380,7 +382,7 @@ static void dump_orphan_list(struct supe - "inode %s:%ld at %p: mode %o, nlink %d, next %d\n", - inode->i_sb->s_id, inode->i_ino, inode, - inode->i_mode, inode->i_nlink, -- le32_to_cpu(NEXT_ORPHAN(inode))); -+ NEXT_ORPHAN(inode)); - } - } - -@@ -394,7 +396,7 @@ void ext3_put_super (struct super_block - journal_destroy(sbi->s_journal); - if (!(sb->s_flags & MS_RDONLY)) { - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); -- es->s_state = le16_to_cpu(sbi->s_mount_state); -+ es->s_state = cpu_to_le16(sbi->s_mount_state); - BUFFER_TRACE(sbi->s_sbh, "marking dirty"); - mark_buffer_dirty(sbi->s_sbh); - ext3_commit_super(sb, es, 1); -@@ -403,7 +405,9 @@ void ext3_put_super (struct super_block - for (i = 0; i < sbi->s_gdb_count; i++) - brelse(sbi->s_group_desc[i]); - kfree(sbi->s_group_desc); -- kfree(sbi->s_debts); -+ percpu_counter_destroy(&sbi->s_freeblocks_counter); -+ percpu_counter_destroy(&sbi->s_freeinodes_counter); -+ percpu_counter_destroy(&sbi->s_dirs_counter); - brelse(sbi->s_sbh); - #ifdef CONFIG_QUOTA - for (i = 0; i < MAXQUOTAS; i++) { -@@ -480,7 +484,7 @@ static int init_inodecache(void) - { - ext3_inode_cachep 
= kmem_cache_create("ext3_inode_cache", - sizeof(struct ext3_inode_info), -- 0, SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, -+ 0, SLAB_RECLAIM_ACCOUNT, - init_once, NULL); - if (ext3_inode_cachep == NULL) - return -ENOMEM; -@@ -587,7 +591,7 @@ enum { - Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, - Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, - Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, -- Opt_ignore, Opt_err, -+ Opt_ignore, Opt_err, Opt_resize, - }; - - static match_table_t tokens = { -@@ -632,7 +636,8 @@ static match_table_t tokens = { - {Opt_ignore, "noquota"}, - {Opt_ignore, "quota"}, - {Opt_ignore, "usrquota"}, -- {Opt_err, NULL} -+ {Opt_err, NULL}, -+ {Opt_resize, "resize"}, - }; - - static unsigned long get_sb_block(void **data) -@@ -656,7 +661,7 @@ static unsigned long get_sb_block(void * - } - - static int parse_options (char * options, struct super_block *sb, -- unsigned long * inum, int is_remount) -+ unsigned long * inum, unsigned long *n_blocks_count, int is_remount) - { - struct ext3_sb_info *sbi = EXT3_SB(sb); - char * p; -@@ -899,6 +904,15 @@ clear_qf_name: - break; - case Opt_ignore: - break; -+ case Opt_resize: -+ if (!n_blocks_count) { -+ printk("EXT3-fs: resize option only available " -+ "for remount\n"); -+ return 0; -+ } -+ match_int(&args[0], &option); -+ *n_blocks_count = option; -+ break; - default: - printk (KERN_ERR - "EXT3-fs: Unrecognized mount option \"%s\" " -@@ -958,8 +972,7 @@ static int ext3_setup_super(struct super - es->s_state = cpu_to_le16(le16_to_cpu(es->s_state) & ~EXT3_VALID_FS); - #endif - if (!(__s16) le16_to_cpu(es->s_max_mnt_count)) -- es->s_max_mnt_count = -- (__s16) cpu_to_le16(EXT3_DFL_MAX_MNT_COUNT); -+ es->s_max_mnt_count = cpu_to_le16(EXT3_DFL_MAX_MNT_COUNT); - es->s_mnt_count=cpu_to_le16(le16_to_cpu(es->s_mnt_count) + 1); - es->s_mtime = cpu_to_le32(get_seconds()); - ext3_update_dynamic_rev(sb); -@@ -993,6 +1006,7 @@ static int ext3_setup_super(struct super - return res; - } - -+/* 
Called at mount-time, super-block is locked */ - static int ext3_check_descriptors (struct super_block * sb) - { - struct ext3_sb_info *sbi = EXT3_SB(sb); -@@ -1168,12 +1182,18 @@ static void ext3_orphan_cleanup (struct - static loff_t ext3_max_size(int bits) - { - loff_t res = EXT3_NDIR_BLOCKS; -+ /* This constant is calculated to be the largest file size for a -+ * dense, 4k-blocksize file such that the total number of -+ * sectors in the file, including data and all indirect blocks, -+ * does not exceed 2^32. */ -+ const loff_t upper_limit = 0x1ff7fffd000LL; -+ - res += 1LL << (bits-2); - res += 1LL << (2*(bits-2)); - res += 1LL << (3*(bits-2)); - res <<= bits; -- if (res > (512LL << 32) - (1 << bits)) -- res = (512LL << 32) - (1 << bits); -+ if (res > upper_limit) -+ res = upper_limit; - return res; - } - -@@ -1215,6 +1235,7 @@ static int ext3_fill_super (struct super - int db_count; - int i; - int needs_recovery; -+ __le32 features; - - sbi = kmalloc(sizeof(*sbi), GFP_KERNEL); - if (!sbi) -@@ -1288,10 +1309,10 @@ static int ext3_fill_super (struct super - sbi->s_resuid = le16_to_cpu(es->s_def_resuid); - sbi->s_resgid = le16_to_cpu(es->s_def_resgid); - -- if (!parse_options ((char *) data, sb, &journal_inum, 0)) -+ if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0)) - goto failed_mount; - -- sb->s_flags |= MS_ONE_SECOND; -+ set_sb_time_gran(sb, 1000000000U); - sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | - ((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); - -@@ -1307,17 +1328,18 @@ static int ext3_fill_super (struct super - * previously didn't change the revision level when setting the flags, - * so there is a chance incompat flags are set on a rev 0 filesystem. 
- */ -- if ((i = EXT3_HAS_INCOMPAT_FEATURE(sb, ~EXT3_FEATURE_INCOMPAT_SUPP))) { -+ features = EXT3_HAS_INCOMPAT_FEATURE(sb, ~EXT3_FEATURE_INCOMPAT_SUPP); -+ if (features) { - printk(KERN_ERR "EXT3-fs: %s: couldn't mount because of " - "unsupported optional features (%x).\n", -- sb->s_id, i); -+ sb->s_id, le32_to_cpu(features)); - goto failed_mount; - } -- if (!(sb->s_flags & MS_RDONLY) && -- (i = EXT3_HAS_RO_COMPAT_FEATURE(sb, ~EXT3_FEATURE_RO_COMPAT_SUPP))){ -+ features = EXT3_HAS_RO_COMPAT_FEATURE(sb, ~EXT3_FEATURE_RO_COMPAT_SUPP); -+ if (!(sb->s_flags & MS_RDONLY) && features) { - printk(KERN_ERR "EXT3-fs: %s: couldn't mount RDWR because of " - "unsupported optional features (%x).\n", -- sb->s_id, i); -+ sb->s_id, le32_to_cpu(features)); - goto failed_mount; - } - blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size); -@@ -1354,7 +1376,7 @@ static int ext3_fill_super (struct super - } - es = (struct ext3_super_block *)(((char *)bh->b_data) + offset); - sbi->s_es = es; -- if (es->s_magic != le16_to_cpu(EXT3_SUPER_MAGIC)) { -+ if (es->s_magic != cpu_to_le16(EXT3_SUPER_MAGIC)) { - printk (KERN_ERR - "EXT3-fs: Magic mismatch, very weird !\n"); - goto failed_mount; -@@ -1432,13 +1454,6 @@ static int ext3_fill_super (struct super - printk (KERN_ERR "EXT3-fs: not enough memory\n"); - goto failed_mount; - } -- sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(u8), -- GFP_KERNEL); -- if (!sbi->s_debts) { -- printk("EXT3-fs: not enough memory to allocate s_bgi\n"); -- goto failed_mount2; -- } -- memset(sbi->s_debts, 0, sbi->s_groups_count * sizeof(u8)); - - percpu_counter_init(&sbi->s_freeblocks_counter); - percpu_counter_init(&sbi->s_freeinodes_counter); -@@ -1575,7 +1590,6 @@ static int ext3_fill_super (struct super - failed_mount3: - journal_destroy(sbi->s_journal); - failed_mount2: -- kfree(sbi->s_debts); - for (i = 0; i < db_count; i++) - brelse(sbi->s_group_desc[i]); - kfree(sbi->s_group_desc); -@@ -1724,10 +1738,10 @@ static journal_t 
*ext3_get_dev_journal(s - printk(KERN_ERR "EXT3-fs: I/O error on journal device\n"); - goto out_journal; - } -- if (ntohl(journal->j_superblock->s_nr_users) != 1) { -+ if (be32_to_cpu(journal->j_superblock->s_nr_users) != 1) { - printk(KERN_ERR "EXT3-fs: External journal has more than one " - "user (unsupported) - %d\n", -- ntohl(journal->j_superblock->s_nr_users)); -+ be32_to_cpu(journal->j_superblock->s_nr_users)); - goto out_journal; - } - EXT3_SB(sb)->journal_bdev = bdev; -@@ -2013,11 +2027,12 @@ int ext3_remount (struct super_block * s - struct ext3_super_block * es; - struct ext3_sb_info *sbi = EXT3_SB(sb); - unsigned long tmp; -+ unsigned long n_blocks_count = 0; - - /* - * Allow the "check" option to be passed as a remount option. - */ -- if (!parse_options(data, sb, &tmp, 1)) -+ if (!parse_options(data, sb, &tmp, &n_blocks_count, 1)) - return -EINVAL; - - if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) -@@ -2030,7 +2045,8 @@ int ext3_remount (struct super_block * s - - ext3_init_journal_params(sbi, sbi->s_journal); - -- if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY)) { -+ if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY) || -+ n_blocks_count > le32_to_cpu(es->s_blocks_count)) { - if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) - return -EROFS; - -@@ -2052,13 +2068,13 @@ int ext3_remount (struct super_block * s - - ext3_mark_recovery_complete(sb, es); - } else { -- int ret; -+ __le32 ret; - if ((ret = EXT3_HAS_RO_COMPAT_FEATURE(sb, - ~EXT3_FEATURE_RO_COMPAT_SUPP))) { - printk(KERN_WARNING "EXT3-fs: %s: couldn't " - "remount RDWR because of unsupported " - "optional features (%x).\n", -- sb->s_id, ret); -+ sb->s_id, le32_to_cpu(ret)); - return -EROFS; - } - /* -@@ -2069,6 +2085,8 @@ int ext3_remount (struct super_block * s - */ - ext3_clear_journal_err(sb, es); - sbi->s_mount_state = le16_to_cpu(es->s_state); -+ if ((ret = ext3_group_extend(sb, es, n_blocks_count))) -+ return ret; - if (!ext3_setup_super (sb, es, 0)) - sb->s_flags &= ~MS_RDONLY; - } -@@ -2085,6 
+2103,10 @@ int ext3_statfs (struct super_block * sb - if (test_opt (sb, MINIX_DF)) - overhead = 0; - else { -+ unsigned long ngroups; -+ ngroups = EXT3_SB(sb)->s_groups_count; -+ smp_rmb(); -+ - /* - * Compute the overhead (FS structures) - */ -@@ -2100,7 +2122,7 @@ int ext3_statfs (struct super_block * sb - * block group descriptors. If the sparse superblocks - * feature is turned on, then not all groups have this. - */ -- for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) -+ for (i = 0; i < ngroups; i++) - overhead += ext3_bg_has_super(sb, i) + - ext3_bg_num_gdb(sb, i); - -@@ -2108,8 +2130,7 @@ int ext3_statfs (struct super_block * sb - * Every block group has an inode bitmap, a block - * bitmap, and an inode table. - */ -- overhead += (EXT3_SB(sb)->s_groups_count * -- (2 + EXT3_SB(sb)->s_itb_per_group)); -+ overhead += (ngroups * (2 + EXT3_SB(sb)->s_itb_per_group)); - } - - buf->f_type = EXT3_SUPER_MAGIC; -@@ -2331,7 +2352,7 @@ static struct file_system_type ext3_fs_t - .name = "ext3", - .get_sb = ext3_get_sb, - .kill_sb = kill_block_super, -- .fs_flags = FS_REQUIRES_DEV, -+ .fs_flags = FS_REQUIRES_DEV | FS_VIRTUALIZED, - }; - - static int __init init_ext3_fs(void) -diff -uprN linux-2.6.8.1.orig/fs/ext3/xattr.c linux-2.6.8.1-ve022stab078/fs/ext3/xattr.c ---- linux-2.6.8.1.orig/fs/ext3/xattr.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/xattr.c 2006-05-11 13:05:32.000000000 +0400 -@@ -819,7 +819,7 @@ getblk_failed: - - /* Update the inode. */ - EXT3_I(inode)->i_file_acl = new_bh ? 
new_bh->b_blocknr : 0; -- inode->i_ctime = CURRENT_TIME; -+ inode->i_ctime = CURRENT_TIME_SEC; - ext3_mark_inode_dirty(handle, inode); - if (IS_SYNC(inode)) - handle->h_sync = 1; -@@ -1130,7 +1130,7 @@ static inline void ext3_xattr_hash_entry - } - - if (entry->e_value_block == 0 && entry->e_value_size != 0) { -- __u32 *value = (__u32 *)((char *)header + -+ __le32 *value = (__le32 *)((char *)header + - le16_to_cpu(entry->e_value_offs)); - for (n = (le32_to_cpu(entry->e_value_size) + - EXT3_XATTR_ROUND) >> EXT3_XATTR_PAD_BITS; n; n--) { -diff -uprN linux-2.6.8.1.orig/fs/ext3/xattr.h linux-2.6.8.1-ve022stab078/fs/ext3/xattr.h ---- linux-2.6.8.1.orig/fs/ext3/xattr.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/xattr.h 2006-05-11 13:05:31.000000000 +0400 -@@ -25,20 +25,20 @@ - #define EXT3_XATTR_INDEX_SECURITY 6 - - struct ext3_xattr_header { -- __u32 h_magic; /* magic number for identification */ -- __u32 h_refcount; /* reference count */ -- __u32 h_blocks; /* number of disk blocks used */ -- __u32 h_hash; /* hash value of all attributes */ -+ __le32 h_magic; /* magic number for identification */ -+ __le32 h_refcount; /* reference count */ -+ __le32 h_blocks; /* number of disk blocks used */ -+ __le32 h_hash; /* hash value of all attributes */ - __u32 h_reserved[4]; /* zero right now */ - }; - - struct ext3_xattr_entry { - __u8 e_name_len; /* length of name */ - __u8 e_name_index; /* attribute name index */ -- __u16 e_value_offs; /* offset in disk block of value */ -- __u32 e_value_block; /* disk block attribute is stored on (n/i) */ -- __u32 e_value_size; /* size of attribute value */ -- __u32 e_hash; /* hash value of name and value */ -+ __le16 e_value_offs; /* offset in disk block of value */ -+ __le32 e_value_block; /* disk block attribute is stored on (n/i) */ -+ __le32 e_value_size; /* size of attribute value */ -+ __le32 e_hash; /* hash value of name and value */ - char e_name[0]; /* attribute name */ - }; - -diff -uprN 
linux-2.6.8.1.orig/fs/ext3/xattr_user.c linux-2.6.8.1-ve022stab078/fs/ext3/xattr_user.c ---- linux-2.6.8.1.orig/fs/ext3/xattr_user.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ext3/xattr_user.c 2006-05-11 13:05:35.000000000 +0400 -@@ -42,7 +42,7 @@ ext3_xattr_user_get(struct inode *inode, - return -EINVAL; - if (!test_opt(inode->i_sb, XATTR_USER)) - return -EOPNOTSUPP; -- error = permission(inode, MAY_READ, NULL); -+ error = permission(inode, MAY_READ, NULL, NULL); - if (error) - return error; - -@@ -62,7 +62,7 @@ ext3_xattr_user_set(struct inode *inode, - if ( !S_ISREG(inode->i_mode) && - (!S_ISDIR(inode->i_mode) || inode->i_mode & S_ISVTX)) - return -EPERM; -- error = permission(inode, MAY_WRITE, NULL); -+ error = permission(inode, MAY_WRITE, NULL, NULL); - if (error) - return error; - -diff -uprN linux-2.6.8.1.orig/fs/fat/inode.c linux-2.6.8.1-ve022stab078/fs/fat/inode.c ---- linux-2.6.8.1.orig/fs/fat/inode.c 2004-08-14 14:55:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/fat/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -1227,7 +1227,7 @@ static int fat_fill_inode(struct inode * - return 0; - } - --void fat_write_inode(struct inode *inode, int wait) -+int fat_write_inode(struct inode *inode, int wait) - { - struct super_block *sb = inode->i_sb; - struct buffer_head *bh; -@@ -1237,14 +1237,14 @@ void fat_write_inode(struct inode *inode - retry: - i_pos = MSDOS_I(inode)->i_pos; - if (inode->i_ino == MSDOS_ROOT_INO || !i_pos) { -- return; -+ return 0; - } - lock_kernel(); - if (!(bh = sb_bread(sb, i_pos >> MSDOS_SB(sb)->dir_per_block_bits))) { - printk(KERN_ERR "FAT: unable to read inode block " - "for updating (i_pos %lld)\n", i_pos); - unlock_kernel(); -- return /* -EIO */; -+ return -EIO; - } - spin_lock(&fat_inode_lock); - if (i_pos != MSDOS_I(inode)->i_pos) { -@@ -1281,6 +1281,7 @@ retry: - mark_buffer_dirty(bh); - brelse(bh); - unlock_kernel(); -+ return 0; - } - - -diff -uprN linux-2.6.8.1.orig/fs/fcntl.c 
linux-2.6.8.1-ve022stab078/fs/fcntl.c ---- linux-2.6.8.1.orig/fs/fcntl.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/fcntl.c 2006-05-11 13:05:40.000000000 +0400 -@@ -14,6 +14,7 @@ - #include <linux/module.h> - #include <linux/security.h> - #include <linux/ptrace.h> -+#include <linux/ve_owner.h> - - #include <asm/poll.h> - #include <asm/siginfo.h> -@@ -219,6 +220,9 @@ static int setfl(int fd, struct file * f - struct inode * inode = filp->f_dentry->d_inode; - int error = 0; - -+ if (!capable(CAP_SYS_RAWIO)) -+ arg &= ~O_DIRECT; -+ - /* O_APPEND cannot be cleared if the file is marked as append-only */ - if (!(arg & O_APPEND) && IS_APPEND(inode)) - return -EPERM; -@@ -262,6 +266,7 @@ static int setfl(int fd, struct file * f - static void f_modown(struct file *filp, unsigned long pid, - uid_t uid, uid_t euid, int force) - { -+ pid = comb_vpid_to_pid(pid); - write_lock_irq(&filp->f_owner.lock); - if (force || !filp->f_owner.pid) { - filp->f_owner.pid = pid; -@@ -330,7 +335,7 @@ static long do_fcntl(int fd, unsigned in - * current syscall conventions, the only way - * to fix this will be in libc. 
- */ -- err = filp->f_owner.pid; -+ err = comb_pid_to_vpid(filp->f_owner.pid); - force_successful_syscall_return(); - break; - case F_SETOWN: -@@ -482,6 +487,8 @@ static void send_sigio_to_task(struct ta - - void send_sigio(struct fown_struct *fown, int fd, int band) - { -+ struct file *f; -+ struct ve_struct *env; - struct task_struct *p; - int pid; - -@@ -489,19 +496,21 @@ void send_sigio(struct fown_struct *fown - pid = fown->pid; - if (!pid) - goto out_unlock_fown; -- -+ -+ /* hack: fown's are always embedded in struct file */ -+ f = container_of(fown, struct file, f_owner); -+ env = VE_OWNER_FILP(f); -+ - read_lock(&tasklist_lock); - if (pid > 0) { -- p = find_task_by_pid(pid); -- if (p) { -+ p = find_task_by_pid_all(pid); -+ if (p && ve_accessible(VE_TASK_INFO(p)->owner_env, env)) { - send_sigio_to_task(p, fown, fd, band); - } - } else { -- struct list_head *l; -- struct pid *pidptr; -- for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr) { -+ __do_each_task_pid_ve(-pid, PIDTYPE_PGID, p, env) { - send_sigio_to_task(p, fown, fd, band); -- } -+ } __while_each_task_pid_ve(-pid, PIDTYPE_PGID, p, env); - } - read_unlock(&tasklist_lock); - out_unlock_fown: -@@ -517,6 +526,8 @@ static void send_sigurg_to_task(struct t - - int send_sigurg(struct fown_struct *fown) - { -+ struct file *f; -+ struct ve_struct *env; - struct task_struct *p; - int pid, ret = 0; - -@@ -527,18 +538,20 @@ int send_sigurg(struct fown_struct *fown - - ret = 1; - -+ /* hack: fown's are always embedded in struct file */ -+ f = container_of(fown, struct file, f_owner); -+ env = VE_OWNER_FILP(f); -+ - read_lock(&tasklist_lock); - if (pid > 0) { -- p = find_task_by_pid(pid); -- if (p) { -+ p = find_task_by_pid_all(pid); -+ if (p && ve_accessible(VE_TASK_INFO(p)->owner_env, env)) { - send_sigurg_to_task(p, fown); - } - } else { -- struct list_head *l; -- struct pid *pidptr; -- for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr) { -+ __do_each_task_pid_ve(-pid, PIDTYPE_PGID, p, env) { - 
send_sigurg_to_task(p, fown); -- } -+ } __while_each_task_pid_ve(-pid, PIDTYPE_PGID, p, env); - } - read_unlock(&tasklist_lock); - out_unlock_fown: -diff -uprN linux-2.6.8.1.orig/fs/file.c linux-2.6.8.1-ve022stab078/fs/file.c ---- linux-2.6.8.1.orig/fs/file.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/file.c 2006-05-11 13:05:39.000000000 +0400 -@@ -15,6 +15,7 @@ - - #include <asm/bitops.h> - -+#include <ub/ub_mem.h> - - /* - * Allocate an fd array, using kmalloc or vmalloc. -@@ -26,9 +27,9 @@ struct file ** alloc_fd_array(int num) - int size = num * sizeof(struct file *); - - if (size <= PAGE_SIZE) -- new_fds = (struct file **) kmalloc(size, GFP_KERNEL); -+ new_fds = (struct file **) ub_kmalloc(size, GFP_KERNEL); - else -- new_fds = (struct file **) vmalloc(size); -+ new_fds = (struct file **) ub_vmalloc(size); - return new_fds; - } - -@@ -135,9 +136,9 @@ fd_set * alloc_fdset(int num) - int size = num / 8; - - if (size <= PAGE_SIZE) -- new_fdset = (fd_set *) kmalloc(size, GFP_KERNEL); -+ new_fdset = (fd_set *) ub_kmalloc(size, GFP_KERNEL); - else -- new_fdset = (fd_set *) vmalloc(size); -+ new_fdset = (fd_set *) ub_vmalloc(size); - return new_fdset; - } - -diff -uprN linux-2.6.8.1.orig/fs/file_table.c linux-2.6.8.1-ve022stab078/fs/file_table.c ---- linux-2.6.8.1.orig/fs/file_table.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/file_table.c 2006-05-11 13:05:40.000000000 +0400 -@@ -8,6 +8,7 @@ - #include <linux/string.h> - #include <linux/slab.h> - #include <linux/file.h> -+#include <linux/ve_owner.h> - #include <linux/init.h> - #include <linux/module.h> - #include <linux/smp_lock.h> -@@ -17,6 +18,8 @@ - #include <linux/mount.h> - #include <linux/cdev.h> - -+#include <ub/ub_misc.h> -+ - /* sysctl tunables... 
*/ - struct files_stat_struct files_stat = { - .max_files = NR_FILE -@@ -56,6 +59,8 @@ void filp_dtor(void * objp, struct kmem_ - - static inline void file_free(struct file *f) - { -+ ub_file_uncharge(f); -+ put_ve(VE_OWNER_FILP(f)); - kmem_cache_free(filp_cachep, f); - } - -@@ -65,40 +70,46 @@ static inline void file_free(struct file - */ - struct file *get_empty_filp(void) - { --static int old_max; -+ static int old_max; - struct file * f; - - /* - * Privileged users can go above max_files - */ -- if (files_stat.nr_files < files_stat.max_files || -- capable(CAP_SYS_ADMIN)) { -- f = kmem_cache_alloc(filp_cachep, GFP_KERNEL); -- if (f) { -- memset(f, 0, sizeof(*f)); -- if (security_file_alloc(f)) { -- file_free(f); -- goto fail; -- } -- eventpoll_init_file(f); -- atomic_set(&f->f_count, 1); -- f->f_uid = current->fsuid; -- f->f_gid = current->fsgid; -- f->f_owner.lock = RW_LOCK_UNLOCKED; -- /* f->f_version: 0 */ -- INIT_LIST_HEAD(&f->f_list); -- return f; -- } -+ if (files_stat.nr_files >= files_stat.max_files && -+ !capable(CAP_SYS_ADMIN)) -+ goto over; -+ -+ f = kmem_cache_alloc(filp_cachep, GFP_KERNEL); -+ if (f == NULL) -+ goto fail; -+ -+ memset(f, 0, sizeof(*f)); -+ if (ub_file_charge(f)) { -+ kmem_cache_free(filp_cachep, f); -+ goto fail; - } - -+ SET_VE_OWNER_FILP(f, get_ve(get_exec_env())); -+ if (security_file_alloc(f)) { -+ file_free(f); -+ goto fail; -+ } -+ eventpoll_init_file(f); -+ atomic_set(&f->f_count, 1); -+ f->f_uid = current->fsuid; -+ f->f_gid = current->fsgid; -+ f->f_owner.lock = RW_LOCK_UNLOCKED; -+ /* f->f_version: 0 */ -+ INIT_LIST_HEAD(&f->f_list); -+ return f; -+ -+over: - /* Ran out of filps - report that */ -- if (files_stat.max_files >= old_max) { -+ if (files_stat.nr_files > old_max) { - printk(KERN_INFO "VFS: file-max limit %d reached\n", -- files_stat.max_files); -- old_max = files_stat.max_files; -- } else { -- /* Big problems... 
*/ -- printk(KERN_WARNING "VFS: filp allocation failed\n"); -+ files_stat.max_files); -+ old_max = files_stat.nr_files; - } - fail: - return NULL; -diff -uprN linux-2.6.8.1.orig/fs/filesystems.c linux-2.6.8.1-ve022stab078/fs/filesystems.c ---- linux-2.6.8.1.orig/fs/filesystems.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/filesystems.c 2006-05-11 13:05:40.000000000 +0400 -@@ -11,6 +11,7 @@ - #include <linux/kmod.h> - #include <linux/init.h> - #include <linux/module.h> -+#include <linux/ve_owner.h> - #include <asm/uaccess.h> - - /* -@@ -20,8 +21,8 @@ - * During the unload module must call unregister_filesystem(). - * We can access the fields of list element if: - * 1) spinlock is held or -- * 2) we hold the reference to the module. -- * The latter can be guaranteed by call of try_module_get(); if it -+ * 2) we hold the reference to the element. -+ * The latter can be guaranteed by call of try_filesystem(); if it - * returned 0 we must skip the element, otherwise we got the reference. - * Once the reference is obtained we can drop the spinlock. 
- */ -@@ -29,23 +30,51 @@ - static struct file_system_type *file_systems; - static rwlock_t file_systems_lock = RW_LOCK_UNLOCKED; - -+int try_get_filesystem(struct file_system_type *fs) -+{ -+ if (try_module_get(fs->owner)) { -+#ifdef CONFIG_VE -+ get_ve(VE_OWNER_FSTYPE(fs)); -+#endif -+ return 1; -+ } -+ return 0; -+} -+ - /* WARNING: This can be used only if we _already_ own a reference */ - void get_filesystem(struct file_system_type *fs) - { -+#ifdef CONFIG_VE -+ get_ve(VE_OWNER_FSTYPE(fs)); -+#endif - __module_get(fs->owner); - } - - void put_filesystem(struct file_system_type *fs) - { - module_put(fs->owner); -+#ifdef CONFIG_VE -+ put_ve(VE_OWNER_FSTYPE(fs)); -+#endif -+} -+ -+static inline int check_ve_fstype(struct file_system_type *p, -+ struct ve_struct *env) -+{ -+ return ((p->fs_flags & FS_VIRTUALIZED) || -+ ve_accessible_strict(VE_OWNER_FSTYPE(p), env)); - } - --static struct file_system_type **find_filesystem(const char *name) -+static struct file_system_type **find_filesystem(const char *name, -+ struct ve_struct *env) - { - struct file_system_type **p; -- for (p=&file_systems; *p; p=&(*p)->next) -+ for (p=&file_systems; *p; p=&(*p)->next) { -+ if (!check_ve_fstype(*p, env)) -+ continue; - if (strcmp((*p)->name,name) == 0) - break; -+ } - return p; - } - -@@ -72,8 +101,10 @@ int register_filesystem(struct file_syst - if (fs->next) - return -EBUSY; - INIT_LIST_HEAD(&fs->fs_supers); -+ if (VE_OWNER_FSTYPE(fs) == NULL) -+ SET_VE_OWNER_FSTYPE(fs, get_ve0()); - write_lock(&file_systems_lock); -- p = find_filesystem(fs->name); -+ p = find_filesystem(fs->name, VE_OWNER_FSTYPE(fs)); - if (*p) - res = -EBUSY; - else -@@ -130,11 +161,14 @@ static int fs_index(const char __user * - - err = -EINVAL; - read_lock(&file_systems_lock); -- for (tmp=file_systems, index=0 ; tmp ; tmp=tmp->next, index++) { -+ for (tmp=file_systems, index=0 ; tmp ; tmp=tmp->next) { -+ if (!check_ve_fstype(tmp, get_exec_env())) -+ continue; - if (strcmp(tmp->name,name) == 0) { - err = 
index; - break; - } -+ index++; - } - read_unlock(&file_systems_lock); - putname(name); -@@ -147,9 +181,15 @@ static int fs_name(unsigned int index, c - int len, res; - - read_lock(&file_systems_lock); -- for (tmp = file_systems; tmp; tmp = tmp->next, index--) -- if (index <= 0 && try_module_get(tmp->owner)) -- break; -+ for (tmp = file_systems; tmp; tmp = tmp->next) { -+ if (!check_ve_fstype(tmp, get_exec_env())) -+ continue; -+ if (!index) { -+ if (try_get_filesystem(tmp)) -+ break; -+ } else -+ index--; -+ } - read_unlock(&file_systems_lock); - if (!tmp) - return -EINVAL; -@@ -167,8 +207,9 @@ static int fs_maxindex(void) - int index; - - read_lock(&file_systems_lock); -- for (tmp = file_systems, index = 0 ; tmp ; tmp = tmp->next, index++) -- ; -+ for (tmp = file_systems, index = 0 ; tmp ; tmp = tmp->next) -+ if (check_ve_fstype(tmp, get_exec_env())) -+ index++; - read_unlock(&file_systems_lock); - return index; - } -@@ -204,9 +245,10 @@ int get_filesystem_list(char * buf) - read_lock(&file_systems_lock); - tmp = file_systems; - while (tmp && len < PAGE_SIZE - 80) { -- len += sprintf(buf+len, "%s\t%s\n", -- (tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev", -- tmp->name); -+ if (check_ve_fstype(tmp, get_exec_env())) -+ len += sprintf(buf+len, "%s\t%s\n", -+ (tmp->fs_flags & FS_REQUIRES_DEV) ? 
"" : "nodev", -+ tmp->name); - tmp = tmp->next; - } - read_unlock(&file_systems_lock); -@@ -218,14 +260,14 @@ struct file_system_type *get_fs_type(con - struct file_system_type *fs; - - read_lock(&file_systems_lock); -- fs = *(find_filesystem(name)); -- if (fs && !try_module_get(fs->owner)) -+ fs = *(find_filesystem(name, get_exec_env())); -+ if (fs && !try_get_filesystem(fs)) - fs = NULL; - read_unlock(&file_systems_lock); - if (!fs && (request_module("%s", name) == 0)) { - read_lock(&file_systems_lock); -- fs = *(find_filesystem(name)); -- if (fs && !try_module_get(fs->owner)) -+ fs = *(find_filesystem(name, get_exec_env())); -+ if (fs && !try_get_filesystem(fs)) - fs = NULL; - read_unlock(&file_systems_lock); - } -@@ -233,3 +275,5 @@ struct file_system_type *get_fs_type(con - } - - EXPORT_SYMBOL(get_fs_type); -+EXPORT_SYMBOL(get_filesystem); -+EXPORT_SYMBOL(put_filesystem); -diff -uprN linux-2.6.8.1.orig/fs/fs-writeback.c linux-2.6.8.1-ve022stab078/fs/fs-writeback.c ---- linux-2.6.8.1.orig/fs/fs-writeback.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/fs-writeback.c 2006-05-11 13:05:35.000000000 +0400 -@@ -133,10 +133,11 @@ out: - - EXPORT_SYMBOL(__mark_inode_dirty); - --static void write_inode(struct inode *inode, int sync) -+static int write_inode(struct inode *inode, int sync) - { - if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode)) -- inode->i_sb->s_op->write_inode(inode, sync); -+ return inode->i_sb->s_op->write_inode(inode, sync); -+ return 0; - } - - /* -@@ -170,8 +171,11 @@ __sync_single_inode(struct inode *inode, - ret = do_writepages(mapping, wbc); - - /* Don't write the inode if only I_DIRTY_PAGES was set */ -- if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) -- write_inode(inode, wait); -+ if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { -+ int err = write_inode(inode, wait); -+ if (ret == 0) -+ ret = err; -+ } - - if (wait) { - int err = filemap_fdatawait(mapping); -@@ -392,7 +396,6 @@ writeback_inodes(struct 
writeback_contro - { - struct super_block *sb; - -- spin_lock(&inode_lock); - spin_lock(&sb_lock); - restart: - sb = sb_entry(super_blocks.prev); -@@ -407,19 +410,21 @@ restart: - * be unmounted by the time it is released. - */ - if (down_read_trylock(&sb->s_umount)) { -- if (sb->s_root) -+ if (sb->s_root) { -+ spin_lock(&inode_lock); - sync_sb_inodes(sb, wbc); -+ spin_unlock(&inode_lock); -+ } - up_read(&sb->s_umount); - } - spin_lock(&sb_lock); -- if (__put_super(sb)) -+ if (__put_super_and_need_restart(sb)) - goto restart; - } - if (wbc->nr_to_write <= 0) - break; - } - spin_unlock(&sb_lock); -- spin_unlock(&inode_lock); - } - - /* -@@ -464,32 +469,6 @@ static void set_sb_syncing(int val) - spin_unlock(&sb_lock); - } - --/* -- * Find a superblock with inodes that need to be synced -- */ --static struct super_block *get_super_to_sync(void) --{ -- struct super_block *sb; --restart: -- spin_lock(&sb_lock); -- sb = sb_entry(super_blocks.prev); -- for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) { -- if (sb->s_syncing) -- continue; -- sb->s_syncing = 1; -- sb->s_count++; -- spin_unlock(&sb_lock); -- down_read(&sb->s_umount); -- if (!sb->s_root) { -- drop_super(sb); -- goto restart; -- } -- return sb; -- } -- spin_unlock(&sb_lock); -- return NULL; --} -- - /** - * sync_inodes - * -@@ -508,23 +487,39 @@ restart: - * outstanding dirty inodes, the writeback goes block-at-a-time within the - * filesystem's write_inode(). This is extremely slow. 
- */ --void sync_inodes(int wait) -+static void __sync_inodes(int wait) - { - struct super_block *sb; - -- set_sb_syncing(0); -- while ((sb = get_super_to_sync()) != NULL) { -- sync_inodes_sb(sb, 0); -- sync_blockdev(sb->s_bdev); -- drop_super(sb); -+ spin_lock(&sb_lock); -+restart: -+ list_for_each_entry(sb, &super_blocks, s_list) { -+ if (sb->s_syncing) -+ continue; -+ sb->s_syncing = 1; -+ sb->s_count++; -+ spin_unlock(&sb_lock); -+ down_read(&sb->s_umount); -+ if (sb->s_root) { -+ sync_inodes_sb(sb, wait); -+ sync_blockdev(sb->s_bdev); -+ } -+ up_read(&sb->s_umount); -+ spin_lock(&sb_lock); -+ if (__put_super_and_need_restart(sb)) -+ goto restart; - } -+ spin_unlock(&sb_lock); -+} -+ -+void sync_inodes(int wait) -+{ -+ set_sb_syncing(0); -+ __sync_inodes(0); -+ - if (wait) { - set_sb_syncing(0); -- while ((sb = get_super_to_sync()) != NULL) { -- sync_inodes_sb(sb, 1); -- sync_blockdev(sb->s_bdev); -- drop_super(sb); -- } -+ __sync_inodes(1); - } - } - -diff -uprN linux-2.6.8.1.orig/fs/hfs/hfs_fs.h linux-2.6.8.1-ve022stab078/fs/hfs/hfs_fs.h ---- linux-2.6.8.1.orig/fs/hfs/hfs_fs.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfs/hfs_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -198,7 +198,7 @@ extern struct address_space_operations h - - extern struct inode *hfs_new_inode(struct inode *, struct qstr *, int); - extern void hfs_inode_write_fork(struct inode *, struct hfs_extent *, u32 *, u32 *); --extern void hfs_write_inode(struct inode *, int); -+extern int hfs_write_inode(struct inode *, int); - extern int hfs_inode_setattr(struct dentry *, struct iattr *); - extern void hfs_inode_read_fork(struct inode *inode, struct hfs_extent *ext, - u32 log_size, u32 phys_size, u32 clump_size); -diff -uprN linux-2.6.8.1.orig/fs/hfs/inode.c linux-2.6.8.1-ve022stab078/fs/hfs/inode.c ---- linux-2.6.8.1.orig/fs/hfs/inode.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -381,7 
+381,7 @@ void hfs_inode_write_fork(struct inode * - HFS_SB(inode->i_sb)->alloc_blksz); - } - --void hfs_write_inode(struct inode *inode, int unused) -+int hfs_write_inode(struct inode *inode, int unused) - { - struct hfs_find_data fd; - hfs_cat_rec rec; -@@ -395,27 +395,27 @@ void hfs_write_inode(struct inode *inode - break; - case HFS_EXT_CNID: - hfs_btree_write(HFS_SB(inode->i_sb)->ext_tree); -- return; -+ return 0; - case HFS_CAT_CNID: - hfs_btree_write(HFS_SB(inode->i_sb)->cat_tree); -- return; -+ return 0; - default: - BUG(); -- return; -+ return -EIO; - } - } - - if (HFS_IS_RSRC(inode)) { - mark_inode_dirty(HFS_I(inode)->rsrc_inode); -- return; -+ return 0; - } - - if (!inode->i_nlink) -- return; -+ return 0; - - if (hfs_find_init(HFS_SB(inode->i_sb)->cat_tree, &fd)) - /* panic? */ -- return; -+ return -EIO; - - fd.search_key->cat = HFS_I(inode)->cat_key; - if (hfs_brec_find(&fd)) -@@ -460,6 +460,7 @@ void hfs_write_inode(struct inode *inode - } - out: - hfs_find_exit(&fd); -+ return 0; - } - - static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry, -@@ -512,11 +513,11 @@ void hfs_clear_inode(struct inode *inode - } - - static int hfs_permission(struct inode *inode, int mask, -- struct nameidata *nd) -+ struct nameidata *nd, struct exec_perm *exec_perm) - { - if (S_ISREG(inode->i_mode) && mask & MAY_EXEC) - return 0; -- return vfs_permission(inode, mask); -+ return vfs_permission(inode, mask, NULL); - } - - static int hfs_file_open(struct inode *inode, struct file *file) -diff -uprN linux-2.6.8.1.orig/fs/hfsplus/dir.c linux-2.6.8.1-ve022stab078/fs/hfsplus/dir.c ---- linux-2.6.8.1.orig/fs/hfsplus/dir.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfsplus/dir.c 2006-05-11 13:05:32.000000000 +0400 -@@ -396,7 +396,7 @@ int hfsplus_symlink(struct inode *dir, s - if (!inode) - return -ENOSPC; - -- res = page_symlink(inode, symname, strlen(symname) + 1); -+ res = page_symlink(inode, symname, strlen(symname) + 1, 
GFP_KERNEL); - if (res) { - inode->i_nlink = 0; - hfsplus_delete_inode(inode); -diff -uprN linux-2.6.8.1.orig/fs/hfsplus/hfsplus_fs.h linux-2.6.8.1-ve022stab078/fs/hfsplus/hfsplus_fs.h ---- linux-2.6.8.1.orig/fs/hfsplus/hfsplus_fs.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfsplus/hfsplus_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -333,7 +333,7 @@ extern struct address_space_operations h - void hfsplus_inode_read_fork(struct inode *, struct hfsplus_fork_raw *); - void hfsplus_inode_write_fork(struct inode *, struct hfsplus_fork_raw *); - int hfsplus_cat_read_inode(struct inode *, struct hfs_find_data *); --void hfsplus_cat_write_inode(struct inode *); -+int hfsplus_cat_write_inode(struct inode *); - struct inode *hfsplus_new_inode(struct super_block *, int); - void hfsplus_delete_inode(struct inode *); - -diff -uprN linux-2.6.8.1.orig/fs/hfsplus/inode.c linux-2.6.8.1-ve022stab078/fs/hfsplus/inode.c ---- linux-2.6.8.1.orig/fs/hfsplus/inode.c 2004-08-14 14:54:52.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfsplus/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -252,15 +252,19 @@ static void hfsplus_set_perms(struct ino - perms->dev = cpu_to_be32(HFSPLUS_I(inode).dev); - } - --static int hfsplus_permission(struct inode *inode, int mask, struct nameidata *nd) -+static int hfsplus_permission(struct inode *inode, int mask, -+ struct nameidata *nd, struct exec_perm *exec_perm) - { - /* MAY_EXEC is also used for lookup, if no x bit is set allow lookup, - * open_exec has the same test, so it's still not executable, if a x bit - * is set fall back to standard permission check. -+ * -+ * The comment above and the check below don't make much sense -+ * with S_ISREG condition... 
--SAW - */ - if (S_ISREG(inode->i_mode) && mask & MAY_EXEC && !(inode->i_mode & 0111)) - return 0; -- return vfs_permission(inode, mask); -+ return vfs_permission(inode, mask, exec_perm); - } - - -@@ -483,22 +487,22 @@ int hfsplus_cat_read_inode(struct inode - return res; - } - --void hfsplus_cat_write_inode(struct inode *inode) -+int hfsplus_cat_write_inode(struct inode *inode) - { - struct hfs_find_data fd; - hfsplus_cat_entry entry; - - if (HFSPLUS_IS_RSRC(inode)) { - mark_inode_dirty(HFSPLUS_I(inode).rsrc_inode); -- return; -+ return 0; - } - - if (!inode->i_nlink) -- return; -+ return 0; - - if (hfs_find_init(HFSPLUS_SB(inode->i_sb).cat_tree, &fd)) - /* panic? */ -- return; -+ return -EIO; - - if (hfsplus_find_cat(inode->i_sb, inode->i_ino, &fd)) - /* panic? */ -@@ -546,4 +550,5 @@ void hfsplus_cat_write_inode(struct inod - } - out: - hfs_find_exit(&fd); -+ return 0; - } -diff -uprN linux-2.6.8.1.orig/fs/hfsplus/super.c linux-2.6.8.1-ve022stab078/fs/hfsplus/super.c ---- linux-2.6.8.1.orig/fs/hfsplus/super.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hfsplus/super.c 2006-05-11 13:05:35.000000000 +0400 -@@ -94,20 +94,20 @@ static void hfsplus_read_inode(struct in - make_bad_inode(inode); - } - --void hfsplus_write_inode(struct inode *inode, int unused) -+int hfsplus_write_inode(struct inode *inode, int unused) - { - struct hfsplus_vh *vhdr; -+ int ret = 0; - - dprint(DBG_INODE, "hfsplus_write_inode: %lu\n", inode->i_ino); - hfsplus_ext_write_extent(inode); - if (inode->i_ino >= HFSPLUS_FIRSTUSER_CNID) { -- hfsplus_cat_write_inode(inode); -- return; -+ return hfsplus_cat_write_inode(inode); - } - vhdr = HFSPLUS_SB(inode->i_sb).s_vhdr; - switch (inode->i_ino) { - case HFSPLUS_ROOT_CNID: -- hfsplus_cat_write_inode(inode); -+ ret = hfsplus_cat_write_inode(inode); - break; - case HFSPLUS_EXT_CNID: - if (vhdr->ext_file.total_size != cpu_to_be64(inode->i_size)) { -@@ -148,6 +148,7 @@ void hfsplus_write_inode(struct inode *i - 
hfs_btree_write(HFSPLUS_SB(inode->i_sb).attr_tree); - break; - } -+ return ret; - } - - static void hfsplus_clear_inode(struct inode *inode) -diff -uprN linux-2.6.8.1.orig/fs/hpfs/namei.c linux-2.6.8.1-ve022stab078/fs/hpfs/namei.c ---- linux-2.6.8.1.orig/fs/hpfs/namei.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hpfs/namei.c 2006-05-11 13:05:35.000000000 +0400 -@@ -415,7 +415,7 @@ again: - d_drop(dentry); - spin_lock(&dentry->d_lock); - if (atomic_read(&dentry->d_count) > 1 || -- permission(inode, MAY_WRITE, NULL) || -+ permission(inode, MAY_WRITE, NULL, NULL) || - !S_ISREG(inode->i_mode) || - get_write_access(inode)) { - spin_unlock(&dentry->d_lock); -diff -uprN linux-2.6.8.1.orig/fs/hugetlbfs/inode.c linux-2.6.8.1-ve022stab078/fs/hugetlbfs/inode.c ---- linux-2.6.8.1.orig/fs/hugetlbfs/inode.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/hugetlbfs/inode.c 2006-05-11 13:05:40.000000000 +0400 -@@ -198,6 +198,7 @@ static void hugetlbfs_delete_inode(struc - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb); - - hlist_del_init(&inode->i_hash); -+ list_del(&inode->i_sb_list); - list_del_init(&inode->i_list); - inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; -@@ -240,6 +241,7 @@ static void hugetlbfs_forget_inode(struc - inodes_stat.nr_unused--; - hlist_del_init(&inode->i_hash); - out_truncate: -+ list_del(&inode->i_sb_list); - list_del_init(&inode->i_list); - inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; -@@ -453,7 +455,7 @@ static int hugetlbfs_symlink(struct inod - gid, S_IFLNK|S_IRWXUGO, 0); - if (inode) { - int l = strlen(symname)+1; -- error = page_symlink(inode, symname, l); -+ error = page_symlink(inode, symname, l, GFP_KERNEL); - if (!error) { - d_instantiate(dentry, inode); - dget(dentry); -@@ -731,7 +733,7 @@ struct file *hugetlb_zero_setup(size_t s - struct inode *inode; - struct dentry *dentry, *root; - struct qstr quick_string; -- char buf[16]; -+ char buf[64]; - - if 
(!can_do_hugetlb_shm()) - return ERR_PTR(-EPERM); -@@ -740,7 +742,8 @@ struct file *hugetlb_zero_setup(size_t s - return ERR_PTR(-ENOMEM); - - root = hugetlbfs_vfsmount->mnt_root; -- snprintf(buf, 16, "%lu", hugetlbfs_counter()); -+ snprintf(buf, sizeof(buf), "VE%d-%d", -+ get_exec_env()->veid, hugetlbfs_counter()); - quick_string.name = buf; - quick_string.len = strlen(quick_string.name); - quick_string.hash = 0; -diff -uprN linux-2.6.8.1.orig/fs/inode.c linux-2.6.8.1-ve022stab078/fs/inode.c ---- linux-2.6.8.1.orig/fs/inode.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/inode.c 2006-05-11 13:05:43.000000000 +0400 -@@ -9,8 +9,10 @@ - #include <linux/mm.h> - #include <linux/dcache.h> - #include <linux/init.h> -+#include <linux/kernel_stat.h> - #include <linux/quotaops.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/writeback.h> - #include <linux/module.h> - #include <linux/backing-dev.h> -@@ -99,11 +101,18 @@ struct inodes_stat_t inodes_stat; - - static kmem_cache_t * inode_cachep; - -+unsigned int inode_memusage(void) -+{ -+ return kmem_cache_memusage(inode_cachep); -+} -+ -+static struct address_space_operations vfs_empty_aops; -+struct inode_operations vfs_empty_iops; -+static struct file_operations vfs_empty_fops; -+EXPORT_SYMBOL(vfs_empty_iops); -+ - static struct inode *alloc_inode(struct super_block *sb) - { -- static struct address_space_operations empty_aops; -- static struct inode_operations empty_iops; -- static struct file_operations empty_fops; - struct inode *inode; - - if (sb->s_op->alloc_inode) -@@ -119,8 +128,8 @@ static struct inode *alloc_inode(struct - inode->i_flags = 0; - atomic_set(&inode->i_count, 1); - inode->i_sock = 0; -- inode->i_op = &empty_iops; -- inode->i_fop = &empty_fops; -+ inode->i_op = &vfs_empty_iops; -+ inode->i_fop = &vfs_empty_fops; - inode->i_nlink = 1; - atomic_set(&inode->i_writecount, 0); - inode->i_size = 0; -@@ -144,7 +153,7 @@ static struct inode 
*alloc_inode(struct - return NULL; - } - -- mapping->a_ops = &empty_aops; -+ mapping->a_ops = &vfs_empty_aops; - mapping->host = inode; - mapping->flags = 0; - mapping_set_gfp_mask(mapping, GFP_HIGHUSER); -@@ -295,10 +304,11 @@ static void dispose_list(struct list_hea - /* - * Invalidate all inodes for a device. - */ --static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose) -+static int invalidate_list(struct list_head *head, struct list_head * dispose, -+ int verify) - { - struct list_head *next; -- int busy = 0, count = 0; -+ int busy = 0, count = 0, print_once = 1; - - next = head->next; - for (;;) { -@@ -308,18 +318,63 @@ static int invalidate_list(struct list_h - next = next->next; - if (tmp == head) - break; -- inode = list_entry(tmp, struct inode, i_list); -- if (inode->i_sb != sb) -- continue; -+ inode = list_entry(tmp, struct inode, i_sb_list); - invalidate_inode_buffers(inode); - if (!atomic_read(&inode->i_count)) { - hlist_del_init(&inode->i_hash); -+ list_del(&inode->i_sb_list); - list_move(&inode->i_list, dispose); - inode->i_state |= I_FREEING; - count++; - continue; - } - busy = 1; -+ -+ if (!verify) -+ continue; -+ -+ if (print_once) { -+ struct super_block *sb = inode->i_sb; -+ printk("VFS: Busy inodes after unmount. " -+ "sb = %p, fs type = %s, sb count = %d, " -+ "sb->s_root = %s\n", sb, -+ (sb->s_type != NULL) ? sb->s_type->name : "", -+ sb->s_count, -+ (sb->s_root != NULL) ? 
-+ (char *)sb->s_root->d_name.name : ""); -+ print_once = 0; -+ } -+ -+ { -+ struct dentry *d; -+ int i; -+ -+ printk("inode = %p, inode->i_count = %d, " -+ "inode->i_nlink = %d, " -+ "inode->i_mode = %d, " -+ "inode->i_state = %ld, " -+ "inode->i_flags = %d, " -+ "inode->i_devices.next = %p, " -+ "inode->i_devices.prev = %p, " -+ "inode->i_ino = %ld\n", -+ tmp, -+ atomic_read(&inode->i_count), -+ inode->i_nlink, -+ inode->i_mode, -+ inode->i_state, -+ inode->i_flags, -+ inode->i_devices.next, -+ inode->i_devices.prev, -+ inode->i_ino); -+ printk("inode dump: "); -+ for (i = 0; i < sizeof(*tmp); i++) -+ printk("%2.2x ", *((u_char *)tmp + i)); -+ printk("\n"); -+ list_for_each_entry(d, &inode->i_dentry, d_alias) -+ printk(" d_alias %s\n", -+ d->d_name.name); -+ -+ } - } - /* only unused inodes may be cached with i_count zero */ - inodes_stat.nr_unused -= count; -@@ -342,17 +397,14 @@ static int invalidate_list(struct list_h - * fails because there are busy inodes then a non zero value is returned. - * If the discard is successful all the inodes have been discarded. - */ --int invalidate_inodes(struct super_block * sb) -+int invalidate_inodes(struct super_block * sb, int verify) - { - int busy; - LIST_HEAD(throw_away); - - down(&iprune_sem); - spin_lock(&inode_lock); -- busy = invalidate_list(&inode_in_use, sb, &throw_away); -- busy |= invalidate_list(&inode_unused, sb, &throw_away); -- busy |= invalidate_list(&sb->s_dirty, sb, &throw_away); -- busy |= invalidate_list(&sb->s_io, sb, &throw_away); -+ busy = invalidate_list(&sb->s_inodes, &throw_away, verify); - spin_unlock(&inode_lock); - - dispose_list(&throw_away); -@@ -381,7 +433,7 @@ int __invalidate_device(struct block_dev - * hold). 
- */ - shrink_dcache_sb(sb); -- res = invalidate_inodes(sb); -+ res = invalidate_inodes(sb, 0); - drop_super(sb); - } - invalidate_bdev(bdev, 0); -@@ -452,6 +504,7 @@ static void prune_icache(int nr_to_scan) - continue; - } - hlist_del_init(&inode->i_hash); -+ list_del(&inode->i_sb_list); - list_move(&inode->i_list, &freeable); - inode->i_state |= I_FREEING; - nr_pruned++; -@@ -479,6 +532,7 @@ static void prune_icache(int nr_to_scan) - */ - static int shrink_icache_memory(int nr, unsigned int gfp_mask) - { -+ KSTAT_PERF_ENTER(shrink_icache) - if (nr) { - /* - * Nasty deadlock avoidance. We may hold various FS locks, -@@ -488,6 +542,7 @@ static int shrink_icache_memory(int nr, - if (gfp_mask & __GFP_FS) - prune_icache(nr); - } -+ KSTAT_PERF_LEAVE(shrink_icache) - return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure; - } - -@@ -510,7 +565,7 @@ repeat: - continue; - if (!test(inode, data)) - continue; -- if (inode->i_state & (I_FREEING|I_CLEAR)) { -+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) { - __wait_on_freeing_inode(inode); - goto repeat; - } -@@ -535,7 +590,7 @@ repeat: - continue; - if (inode->i_sb != sb) - continue; -- if (inode->i_state & (I_FREEING|I_CLEAR)) { -+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) { - __wait_on_freeing_inode(inode); - goto repeat; - } -@@ -561,6 +616,7 @@ struct inode *new_inode(struct super_blo - if (inode) { - spin_lock(&inode_lock); - inodes_stat.nr_inodes++; -+ list_add(&inode->i_sb_list, &sb->s_inodes); - list_add(&inode->i_list, &inode_in_use); - inode->i_ino = ++last_ino; - inode->i_state = 0; -@@ -609,6 +665,7 @@ static struct inode * get_new_inode(stru - goto set_failed; - - inodes_stat.nr_inodes++; -+ list_add(&inode->i_sb_list, &sb->s_inodes); - list_add(&inode->i_list, &inode_in_use); - hlist_add_head(&inode->i_hash, head); - inode->i_state = I_LOCK|I_NEW; -@@ -657,6 +714,7 @@ static struct inode * get_new_inode_fast - if (!old) { - inode->i_ino = ino; - inodes_stat.nr_inodes++; -+ 
list_add(&inode->i_sb_list, &sb->s_inodes); - list_add(&inode->i_list, &inode_in_use); - hlist_add_head(&inode->i_hash, head); - inode->i_state = I_LOCK|I_NEW; -@@ -734,7 +792,7 @@ EXPORT_SYMBOL(iunique); - struct inode *igrab(struct inode *inode) - { - spin_lock(&inode_lock); -- if (!(inode->i_state & I_FREEING)) -+ if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) - __iget(inode); - else - /* -@@ -993,6 +1051,7 @@ void generic_delete_inode(struct inode * - { - struct super_operations *op = inode->i_sb->s_op; - -+ list_del(&inode->i_sb_list); - list_del_init(&inode->i_list); - inode->i_state|=I_FREEING; - inodes_stat.nr_inodes--; -@@ -1030,14 +1089,20 @@ static void generic_forget_inode(struct - if (!(inode->i_state & (I_DIRTY|I_LOCK))) - list_move(&inode->i_list, &inode_unused); - inodes_stat.nr_unused++; -- spin_unlock(&inode_lock); -- if (!sb || (sb->s_flags & MS_ACTIVE)) -+ if (!sb || (sb->s_flags & MS_ACTIVE)) { -+ spin_unlock(&inode_lock); - return; -+ } -+ inode->i_state |= I_WILL_FREE; -+ BUG_ON(inode->i_state & I_LOCK); -+ spin_unlock(&inode_lock); - write_inode_now(inode, 1); - spin_lock(&inode_lock); -+ inode->i_state &= ~I_WILL_FREE; - inodes_stat.nr_unused--; - hlist_del_init(&inode->i_hash); - } -+ list_del(&inode->i_sb_list); - list_del_init(&inode->i_list); - inode->i_state|=I_FREEING; - inodes_stat.nr_inodes--; -@@ -1128,19 +1193,6 @@ sector_t bmap(struct inode * inode, sect - - EXPORT_SYMBOL(bmap); - --/* -- * Return true if the filesystem which backs this inode considers the two -- * passed timespecs to be sufficiently different to warrant flushing the -- * altered time out to disk. 
-- */ --static int inode_times_differ(struct inode *inode, -- struct timespec *old, struct timespec *new) --{ -- if (IS_ONE_SECOND(inode)) -- return old->tv_sec != new->tv_sec; -- return !timespec_equal(old, new); --} -- - /** - * update_atime - update the access time - * @inode: inode accessed -@@ -1160,8 +1212,8 @@ void update_atime(struct inode *inode) - if (IS_RDONLY(inode)) - return; - -- now = current_kernel_time(); -- if (inode_times_differ(inode, &inode->i_atime, &now)) { -+ now = current_fs_time(inode->i_sb); -+ if (!timespec_equal(&inode->i_atime, &now)) { - inode->i_atime = now; - mark_inode_dirty_sync(inode); - } else { -@@ -1191,14 +1243,13 @@ void inode_update_time(struct inode *ino - if (IS_RDONLY(inode)) - return; - -- now = current_kernel_time(); -- -- if (inode_times_differ(inode, &inode->i_mtime, &now)) -+ now = current_fs_time(inode->i_sb); -+ if (!timespec_equal(&inode->i_mtime, &now)) - sync_it = 1; - inode->i_mtime = now; - - if (ctime_too) { -- if (inode_times_differ(inode, &inode->i_ctime, &now)) -+ if (!timespec_equal(&inode->i_ctime, &now)) - sync_it = 1; - inode->i_ctime = now; - } -@@ -1230,33 +1281,15 @@ int remove_inode_dquot_ref(struct inode - void remove_dquot_ref(struct super_block *sb, int type, struct list_head *tofree_head) - { - struct inode *inode; -- struct list_head *act_head; - - if (!sb->dq_op) - return; /* nothing to do */ -- spin_lock(&inode_lock); /* This lock is for inodes code */ - -+ spin_lock(&inode_lock); /* This lock is for inodes code */ - /* We hold dqptr_sem so we are safe against the quota code */ -- list_for_each(act_head, &inode_in_use) { -- inode = list_entry(act_head, struct inode, i_list); -- if (inode->i_sb == sb && !IS_NOQUOTA(inode)) -- remove_inode_dquot_ref(inode, type, tofree_head); -- } -- list_for_each(act_head, &inode_unused) { -- inode = list_entry(act_head, struct inode, i_list); -- if (inode->i_sb == sb && !IS_NOQUOTA(inode)) -- remove_inode_dquot_ref(inode, type, tofree_head); -- } -- 
list_for_each(act_head, &sb->s_dirty) { -- inode = list_entry(act_head, struct inode, i_list); -+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) - if (!IS_NOQUOTA(inode)) - remove_inode_dquot_ref(inode, type, tofree_head); -- } -- list_for_each(act_head, &sb->s_io) { -- inode = list_entry(act_head, struct inode, i_list); -- if (!IS_NOQUOTA(inode)) -- remove_inode_dquot_ref(inode, type, tofree_head); -- } - spin_unlock(&inode_lock); - } - -@@ -1372,7 +1405,7 @@ void __init inode_init(unsigned long mem - - /* inode slab cache */ - inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), -- 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, init_once, -+ 0, SLAB_RECLAIM_ACCOUNT|SLAB_HWCACHE_ALIGN|SLAB_PANIC, init_once, - NULL); - set_shrinker(DEFAULT_SEEKS, shrink_icache_memory); - } -diff -uprN linux-2.6.8.1.orig/fs/isofs/compress.c linux-2.6.8.1-ve022stab078/fs/isofs/compress.c ---- linux-2.6.8.1.orig/fs/isofs/compress.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/isofs/compress.c 2006-05-11 13:05:34.000000000 +0400 -@@ -147,8 +147,14 @@ static int zisofs_readpage(struct file * - cend = le32_to_cpu(*(u32 *)(bh->b_data + (blockendptr & bufmask))); - brelse(bh); - -+ if (cstart > cend) -+ goto eio; -+ - csize = cend-cstart; - -+ if (csize > deflateBound(1UL << zisofs_block_shift)) -+ goto eio; -+ - /* Now page[] contains an array of pages, any of which can be NULL, - and the locks on which we hold. We should now read the data and - release the pages. 
If the pages are NULL the decompressed data -diff -uprN linux-2.6.8.1.orig/fs/isofs/inode.c linux-2.6.8.1-ve022stab078/fs/isofs/inode.c ---- linux-2.6.8.1.orig/fs/isofs/inode.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/isofs/inode.c 2006-05-11 13:05:34.000000000 +0400 -@@ -685,6 +685,8 @@ root_found: - sbi->s_log_zone_size = isonum_723 (h_pri->logical_block_size); - sbi->s_max_size = isonum_733(h_pri->volume_space_size); - } else { -+ if (!pri) -+ goto out_freebh; - rootp = (struct iso_directory_record *) pri->root_directory_record; - sbi->s_nzones = isonum_733 (pri->volume_space_size); - sbi->s_log_zone_size = isonum_723 (pri->logical_block_size); -@@ -1394,6 +1396,9 @@ struct inode *isofs_iget(struct super_bl - struct inode *inode; - struct isofs_iget5_callback_data data; - -+ if (offset >= 1ul << sb->s_blocksize_bits) -+ return NULL; -+ - data.block = block; - data.offset = offset; - -diff -uprN linux-2.6.8.1.orig/fs/isofs/rock.c linux-2.6.8.1-ve022stab078/fs/isofs/rock.c ---- linux-2.6.8.1.orig/fs/isofs/rock.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/isofs/rock.c 2006-05-11 13:05:34.000000000 +0400 -@@ -53,6 +53,7 @@ - if(LEN & 1) LEN++; \ - CHR = ((unsigned char *) DE) + LEN; \ - LEN = *((unsigned char *) DE) - LEN; \ -+ if (LEN<0) LEN=0; \ - if (ISOFS_SB(inode->i_sb)->s_rock_offset!=-1) \ - { \ - LEN-=ISOFS_SB(inode->i_sb)->s_rock_offset; \ -@@ -73,6 +74,10 @@ - offset1 = 0; \ - pbh = sb_bread(DEV->i_sb, block); \ - if(pbh){ \ -+ if (offset > pbh->b_size || offset + cont_size > pbh->b_size){ \ -+ brelse(pbh); \ -+ goto out; \ -+ } \ - memcpy(buffer + offset1, pbh->b_data + offset, cont_size - offset1); \ - brelse(pbh); \ - chr = (unsigned char *) buffer; \ -@@ -103,12 +108,13 @@ int get_rock_ridge_filename(struct iso_d - struct rock_ridge * rr; - int sig; - -- while (len > 1){ /* There may be one byte for padding somewhere */ -+ while (len > 2){ /* There may be one byte for padding somewhere */ - rr = 
(struct rock_ridge *) chr; -- if (rr->len == 0) goto out; /* Something got screwed up here */ -+ if (rr->len < 3) goto out; /* Something got screwed up here */ - sig = isonum_721(chr); - chr += rr->len; - len -= rr->len; -+ if (len < 0) goto out; /* corrupted isofs */ - - switch(sig){ - case SIG('R','R'): -@@ -122,6 +128,7 @@ int get_rock_ridge_filename(struct iso_d - break; - case SIG('N','M'): - if (truncate) break; -+ if (rr->len < 5) break; - /* - * If the flags are 2 or 4, this indicates '.' or '..'. - * We don't want to do anything with this, because it -@@ -183,12 +190,13 @@ int parse_rock_ridge_inode_internal(stru - struct rock_ridge * rr; - int rootflag; - -- while (len > 1){ /* There may be one byte for padding somewhere */ -+ while (len > 2){ /* There may be one byte for padding somewhere */ - rr = (struct rock_ridge *) chr; -- if (rr->len == 0) goto out; /* Something got screwed up here */ -+ if (rr->len < 3) goto out; /* Something got screwed up here */ - sig = isonum_721(chr); - chr += rr->len; - len -= rr->len; -+ if (len < 0) goto out; /* corrupted isofs */ - - switch(sig){ - #ifndef CONFIG_ZISOFS /* No flag for SF or ZF */ -@@ -460,7 +468,7 @@ static int rock_ridge_symlink_readpage(s - struct rock_ridge *rr; - - if (!ISOFS_SB(inode->i_sb)->s_rock) -- panic ("Cannot have symlink with high sierra variant of iso filesystem\n"); -+ goto error; - - block = ei->i_iget5_block; - lock_kernel(); -@@ -485,13 +493,15 @@ static int rock_ridge_symlink_readpage(s - SETUP_ROCK_RIDGE(raw_inode, chr, len); - - repeat: -- while (len > 1) { /* There may be one byte for padding somewhere */ -+ while (len > 2) { /* There may be one byte for padding somewhere */ - rr = (struct rock_ridge *) chr; -- if (rr->len == 0) -+ if (rr->len < 3) - goto out; /* Something got screwed up here */ - sig = isonum_721(chr); - chr += rr->len; - len -= rr->len; -+ if (len < 0) -+ goto out; /* corrupted isofs */ - - switch (sig) { - case SIG('R', 'R'): -@@ -539,6 +549,7 @@ static int 
rock_ridge_symlink_readpage(s - fail: - brelse(bh); - unlock_kernel(); -+ error: - SetPageError(page); - kunmap(page); - unlock_page(page); -diff -uprN linux-2.6.8.1.orig/fs/jbd/checkpoint.c linux-2.6.8.1-ve022stab078/fs/jbd/checkpoint.c ---- linux-2.6.8.1.orig/fs/jbd/checkpoint.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/checkpoint.c 2006-05-11 13:05:32.000000000 +0400 -@@ -335,8 +335,10 @@ int log_do_checkpoint(journal_t *journal - retry = __flush_buffer(journal, jh, bhs, &batch_count, &drop_count); - } while (jh != last_jh && !retry); - -- if (batch_count) -+ if (batch_count) { - __flush_batch(journal, bhs, &batch_count); -+ retry = 1; -+ } - - /* - * If someone cleaned up this transaction while we slept, we're -diff -uprN linux-2.6.8.1.orig/fs/jbd/commit.c linux-2.6.8.1-ve022stab078/fs/jbd/commit.c ---- linux-2.6.8.1.orig/fs/jbd/commit.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/commit.c 2006-05-11 13:05:35.000000000 +0400 -@@ -103,10 +103,10 @@ void journal_commit_transaction(journal_ - { - transaction_t *commit_transaction; - struct journal_head *jh, *new_jh, *descriptor; -- struct buffer_head *wbuf[64]; -+ struct buffer_head **wbuf = journal->j_wbuf; - int bufs; - int flags; -- int err; -+ int err, data_err; - unsigned long blocknr; - char *tagp = NULL; - journal_header_t *header; -@@ -234,6 +234,7 @@ void journal_commit_transaction(journal_ - */ - - err = 0; -+ data_err = 0; - /* - * Whenever we unlock the journal and sleep, things can get added - * onto ->t_sync_datalist, so we have to keep looping back to -@@ -258,7 +259,7 @@ write_out_data: - BUFFER_TRACE(bh, "locked"); - if (!inverted_lock(journal, bh)) - goto write_out_data; -- __journal_unfile_buffer(jh); -+ __journal_temp_unlink_buffer(jh); - __journal_file_buffer(jh, commit_transaction, - BJ_Locked); - jbd_unlock_bh_state(bh); -@@ -271,7 +272,7 @@ write_out_data: - BUFFER_TRACE(bh, "start journal writeout"); - get_bh(bh); - 
wbuf[bufs++] = bh; -- if (bufs == ARRAY_SIZE(wbuf)) { -+ if (bufs == journal->j_wbufsize) { - jbd_debug(2, "submit %d writes\n", - bufs); - spin_unlock(&journal->j_list_lock); -@@ -284,6 +285,8 @@ write_out_data: - BUFFER_TRACE(bh, "writeout complete: unfile"); - if (!inverted_lock(journal, bh)) - goto write_out_data; -+ if (unlikely(!buffer_uptodate(bh))) -+ data_err = -EIO; - __journal_unfile_buffer(jh); - jbd_unlock_bh_state(bh); - journal_remove_journal_head(bh); -@@ -315,8 +318,6 @@ write_out_data: - if (buffer_locked(bh)) { - spin_unlock(&journal->j_list_lock); - wait_on_buffer(bh); -- if (unlikely(!buffer_uptodate(bh))) -- err = -EIO; - spin_lock(&journal->j_list_lock); - } - if (!inverted_lock(journal, bh)) { -@@ -324,6 +325,8 @@ write_out_data: - spin_lock(&journal->j_list_lock); - continue; - } -+ if (unlikely(!buffer_uptodate(bh))) -+ data_err = -EIO; - if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) { - __journal_unfile_buffer(jh); - jbd_unlock_bh_state(bh); -@@ -341,6 +344,12 @@ write_out_data: - } - spin_unlock(&journal->j_list_lock); - -+ /* -+ * XXX: what to do if (data_err)? -+ * Print message? -+ * Abort journal? -+ */ -+ - journal_write_revoke_records(journal, commit_transaction); - - jbd_debug(3, "JBD: commit phase 2\n"); -@@ -365,6 +374,7 @@ write_out_data: - descriptor = NULL; - bufs = 0; - while (commit_transaction->t_buffers) { -+ int error; - - /* Find the next buffer to be journaled... 
*/ - -@@ -405,9 +415,9 @@ write_out_data: - jbd_debug(4, "JBD: got buffer %llu (%p)\n", - (unsigned long long)bh->b_blocknr, bh->b_data); - header = (journal_header_t *)&bh->b_data[0]; -- header->h_magic = htonl(JFS_MAGIC_NUMBER); -- header->h_blocktype = htonl(JFS_DESCRIPTOR_BLOCK); -- header->h_sequence = htonl(commit_transaction->t_tid); -+ header->h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); -+ header->h_blocktype = cpu_to_be32(JFS_DESCRIPTOR_BLOCK); -+ header->h_sequence = cpu_to_be32(commit_transaction->t_tid); - - tagp = &bh->b_data[sizeof(journal_header_t)]; - space_left = bh->b_size - sizeof(journal_header_t); -@@ -425,11 +435,12 @@ write_out_data: - - /* Where is the buffer to be written? */ - -- err = journal_next_log_block(journal, &blocknr); -+ error = journal_next_log_block(journal, &blocknr); - /* If the block mapping failed, just abandon the buffer - and repeat this loop: we'll fall into the - refile-on-abort condition above. */ -- if (err) { -+ if (error) { -+ err = error; - __journal_abort_hard(journal); - continue; - } -@@ -473,8 +484,8 @@ write_out_data: - tag_flag |= JFS_FLAG_SAME_UUID; - - tag = (journal_block_tag_t *) tagp; -- tag->t_blocknr = htonl(jh2bh(jh)->b_blocknr); -- tag->t_flags = htonl(tag_flag); -+ tag->t_blocknr = cpu_to_be32(jh2bh(jh)->b_blocknr); -+ tag->t_flags = cpu_to_be32(tag_flag); - tagp += sizeof(journal_block_tag_t); - space_left -= sizeof(journal_block_tag_t); - -@@ -488,7 +499,7 @@ write_out_data: - /* If there's no more to do, or if the descriptor is full, - let the IO rip! */ - -- if (bufs == ARRAY_SIZE(wbuf) || -+ if (bufs == journal->j_wbufsize || - commit_transaction->t_buffers == NULL || - space_left < sizeof(journal_block_tag_t) + 16) { - -@@ -498,7 +509,7 @@ write_out_data: - submitting the IOs. "tag" still points to - the last tag we set up. 
*/ - -- tag->t_flags |= htonl(JFS_FLAG_LAST_TAG); -+ tag->t_flags |= cpu_to_be32(JFS_FLAG_LAST_TAG); - - start_journal_io: - for (i = 0; i < bufs; i++) { -@@ -613,6 +624,8 @@ wait_for_iobuf: - - jbd_debug(3, "JBD: commit phase 6\n"); - -+ if (err) -+ goto skip_commit; - if (is_journal_aborted(journal)) - goto skip_commit; - -@@ -631,9 +644,9 @@ wait_for_iobuf: - for (i = 0; i < jh2bh(descriptor)->b_size; i += 512) { - journal_header_t *tmp = - (journal_header_t*)jh2bh(descriptor)->b_data; -- tmp->h_magic = htonl(JFS_MAGIC_NUMBER); -- tmp->h_blocktype = htonl(JFS_COMMIT_BLOCK); -- tmp->h_sequence = htonl(commit_transaction->t_tid); -+ tmp->h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); -+ tmp->h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); -+ tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid); - } - - JBUFFER_TRACE(descriptor, "write commit block"); -@@ -655,8 +668,13 @@ wait_for_iobuf: - - skip_commit: /* The journal should be unlocked by now. */ - -- if (err) -+ if (err) { -+ char b[BDEVNAME_SIZE]; -+ -+ printk(KERN_ERR "Error %d writing journal on %s\n", -+ err, bdevname(journal->j_dev, b)); - __journal_abort_hard(journal); -+ } - - /* - * Call any callbacks that had been registered for handles in this -diff -uprN linux-2.6.8.1.orig/fs/jbd/journal.c linux-2.6.8.1-ve022stab078/fs/jbd/journal.c ---- linux-2.6.8.1.orig/fs/jbd/journal.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/journal.c 2006-05-11 13:05:37.000000000 +0400 -@@ -34,6 +34,7 @@ - #include <linux/suspend.h> - #include <linux/pagemap.h> - #include <asm/uaccess.h> -+#include <asm/page.h> - #include <linux/proc_fs.h> - - EXPORT_SYMBOL(journal_start); -@@ -152,6 +153,9 @@ int kjournald(void *arg) - spin_lock(&journal->j_state_lock); - - loop: -+ if (journal->j_flags & JFS_UNMOUNT) -+ goto end_loop; -+ - jbd_debug(1, "commit_sequence=%d, commit_request=%d\n", - journal->j_commit_sequence, journal->j_commit_request); - -@@ -161,11 +165,11 @@ loop: - 
del_timer_sync(journal->j_commit_timer); - journal_commit_transaction(journal); - spin_lock(&journal->j_state_lock); -- goto end_loop; -+ goto loop; - } - - wake_up(&journal->j_wait_done_commit); -- if (current->flags & PF_FREEZE) { -+ if (test_thread_flag(TIF_FREEZE)) { - /* - * The simpler the better. Flushing journal isn't a - * good idea, because that depends on threads that may -@@ -173,7 +177,7 @@ loop: - */ - jbd_debug(1, "Now suspending kjournald\n"); - spin_unlock(&journal->j_state_lock); -- refrigerator(PF_FREEZE); -+ refrigerator(); - spin_lock(&journal->j_state_lock); - } else { - /* -@@ -191,6 +195,8 @@ loop: - if (transaction && time_after_eq(jiffies, - transaction->t_expires)) - should_sleep = 0; -+ if (journal->j_flags & JFS_UNMOUNT) -+ should_sleep = 0; - if (should_sleep) { - spin_unlock(&journal->j_state_lock); - schedule(); -@@ -209,10 +215,9 @@ loop: - journal->j_commit_request = transaction->t_tid; - jbd_debug(1, "woke because of timeout\n"); - } --end_loop: -- if (!(journal->j_flags & JFS_UNMOUNT)) -- goto loop; -+ goto loop; - -+end_loop: - spin_unlock(&journal->j_state_lock); - del_timer_sync(journal->j_commit_timer); - journal->j_task = NULL; -@@ -221,10 +226,16 @@ end_loop: - return 0; - } - --static void journal_start_thread(journal_t *journal) -+static int journal_start_thread(journal_t *journal) - { -- kernel_thread(kjournald, journal, CLONE_VM|CLONE_FS|CLONE_FILES); -+ int err; -+ -+ err = kernel_thread(kjournald, journal, CLONE_VM|CLONE_FS|CLONE_FILES); -+ if (err < 0) -+ return err; -+ - wait_event(journal->j_wait_done_commit, journal->j_task != 0); -+ return 0; - } - - static void journal_kill_thread(journal_t *journal) -@@ -325,8 +336,8 @@ repeat: - /* - * Check for escaping - */ -- if (*((unsigned int *)(mapped_data + new_offset)) == -- htonl(JFS_MAGIC_NUMBER)) { -+ if (*((__be32 *)(mapped_data + new_offset)) == -+ cpu_to_be32(JFS_MAGIC_NUMBER)) { - need_copy_out = 1; - do_escape = 1; - } -@@ -720,6 +731,7 @@ journal_t * 
journal_init_dev(struct bloc - { - journal_t *journal = journal_init_common(); - struct buffer_head *bh; -+ int n; - - if (!journal) - return NULL; -@@ -735,6 +747,17 @@ journal_t * journal_init_dev(struct bloc - journal->j_sb_buffer = bh; - journal->j_superblock = (journal_superblock_t *)bh->b_data; - -+ /* journal descriptor can store up to n blocks -bzzz */ -+ n = journal->j_blocksize / sizeof(journal_block_tag_t); -+ journal->j_wbufsize = n; -+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); -+ if (!journal->j_wbuf) { -+ printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n", -+ __FUNCTION__); -+ kfree(journal); -+ journal = NULL; -+ } -+ - return journal; - } - -@@ -751,6 +774,7 @@ journal_t * journal_init_inode (struct i - struct buffer_head *bh; - journal_t *journal = journal_init_common(); - int err; -+ int n; - unsigned long blocknr; - - if (!journal) -@@ -767,6 +791,17 @@ journal_t * journal_init_inode (struct i - journal->j_maxlen = inode->i_size >> inode->i_sb->s_blocksize_bits; - journal->j_blocksize = inode->i_sb->s_blocksize; - -+ /* journal descriptor can store up to n blocks -bzzz */ -+ n = journal->j_blocksize / sizeof(journal_block_tag_t); -+ journal->j_wbufsize = n; -+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); -+ if (!journal->j_wbuf) { -+ printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n", -+ __FUNCTION__); -+ kfree(journal); -+ return NULL; -+ } -+ - err = journal_bmap(journal, 0, &blocknr); - /* If that failed, give up */ - if (err) { -@@ -808,8 +843,8 @@ static int journal_reset(journal_t *jour - journal_superblock_t *sb = journal->j_superblock; - unsigned int first, last; - -- first = ntohl(sb->s_first); -- last = ntohl(sb->s_maxlen); -+ first = be32_to_cpu(sb->s_first); -+ last = be32_to_cpu(sb->s_maxlen); - - journal->j_first = first; - journal->j_last = last; -@@ -826,8 +861,7 @@ static int journal_reset(journal_t *jour - - /* Add the dynamic fields and write it to disk. 
*/ - journal_update_superblock(journal, 1); -- journal_start_thread(journal); -- return 0; -+ return journal_start_thread(journal); - } - - /** -@@ -886,12 +920,12 @@ int journal_create(journal_t *journal) - /* OK, fill in the initial static fields in the new superblock */ - sb = journal->j_superblock; - -- sb->s_header.h_magic = htonl(JFS_MAGIC_NUMBER); -- sb->s_header.h_blocktype = htonl(JFS_SUPERBLOCK_V2); -+ sb->s_header.h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); -+ sb->s_header.h_blocktype = cpu_to_be32(JFS_SUPERBLOCK_V2); - -- sb->s_blocksize = htonl(journal->j_blocksize); -- sb->s_maxlen = htonl(journal->j_maxlen); -- sb->s_first = htonl(1); -+ sb->s_blocksize = cpu_to_be32(journal->j_blocksize); -+ sb->s_maxlen = cpu_to_be32(journal->j_maxlen); -+ sb->s_first = cpu_to_be32(1); - - journal->j_transaction_sequence = 1; - -@@ -934,9 +968,9 @@ void journal_update_superblock(journal_t - jbd_debug(1,"JBD: updating superblock (start %ld, seq %d, errno %d)\n", - journal->j_tail, journal->j_tail_sequence, journal->j_errno); - -- sb->s_sequence = htonl(journal->j_tail_sequence); -- sb->s_start = htonl(journal->j_tail); -- sb->s_errno = htonl(journal->j_errno); -+ sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); -+ sb->s_start = cpu_to_be32(journal->j_tail); -+ sb->s_errno = cpu_to_be32(journal->j_errno); - spin_unlock(&journal->j_state_lock); - - BUFFER_TRACE(bh, "marking dirty"); -@@ -987,13 +1021,13 @@ static int journal_get_superblock(journa - - err = -EINVAL; - -- if (sb->s_header.h_magic != htonl(JFS_MAGIC_NUMBER) || -- sb->s_blocksize != htonl(journal->j_blocksize)) { -+ if (sb->s_header.h_magic != cpu_to_be32(JFS_MAGIC_NUMBER) || -+ sb->s_blocksize != cpu_to_be32(journal->j_blocksize)) { - printk(KERN_WARNING "JBD: no valid journal superblock found\n"); - goto out; - } - -- switch(ntohl(sb->s_header.h_blocktype)) { -+ switch(be32_to_cpu(sb->s_header.h_blocktype)) { - case JFS_SUPERBLOCK_V1: - journal->j_format_version = 1; - break; -@@ -1005,9 +1039,9 @@ 
static int journal_get_superblock(journa - goto out; - } - -- if (ntohl(sb->s_maxlen) < journal->j_maxlen) -- journal->j_maxlen = ntohl(sb->s_maxlen); -- else if (ntohl(sb->s_maxlen) > journal->j_maxlen) { -+ if (be32_to_cpu(sb->s_maxlen) < journal->j_maxlen) -+ journal->j_maxlen = be32_to_cpu(sb->s_maxlen); -+ else if (be32_to_cpu(sb->s_maxlen) > journal->j_maxlen) { - printk (KERN_WARNING "JBD: journal file too short\n"); - goto out; - } -@@ -1035,11 +1069,11 @@ static int load_superblock(journal_t *jo - - sb = journal->j_superblock; - -- journal->j_tail_sequence = ntohl(sb->s_sequence); -- journal->j_tail = ntohl(sb->s_start); -- journal->j_first = ntohl(sb->s_first); -- journal->j_last = ntohl(sb->s_maxlen); -- journal->j_errno = ntohl(sb->s_errno); -+ journal->j_tail_sequence = be32_to_cpu(sb->s_sequence); -+ journal->j_tail = be32_to_cpu(sb->s_start); -+ journal->j_first = be32_to_cpu(sb->s_first); -+ journal->j_last = be32_to_cpu(sb->s_maxlen); -+ journal->j_errno = be32_to_cpu(sb->s_errno); - - return 0; - } -@@ -1140,6 +1174,7 @@ void journal_destroy(journal_t *journal) - iput(journal->j_inode); - if (journal->j_revoke) - journal_destroy_revoke(journal); -+ kfree(journal->j_wbuf); - kfree(journal); - } - -@@ -1252,7 +1287,7 @@ int journal_update_format (journal_t *jo - - sb = journal->j_superblock; - -- switch (ntohl(sb->s_header.h_blocktype)) { -+ switch (be32_to_cpu(sb->s_header.h_blocktype)) { - case JFS_SUPERBLOCK_V2: - return 0; - case JFS_SUPERBLOCK_V1: -@@ -1274,7 +1309,7 @@ static int journal_convert_superblock_v1 - - /* Pre-initialise new fields to zero */ - offset = ((char *) &(sb->s_feature_compat)) - ((char *) sb); -- blocksize = ntohl(sb->s_blocksize); -+ blocksize = be32_to_cpu(sb->s_blocksize); - memset(&sb->s_feature_compat, 0, blocksize-offset); - - sb->s_nr_users = cpu_to_be32(1); -@@ -1490,7 +1525,7 @@ void __journal_abort_soft (journal_t *jo - * entered abort state during the update. 
- * - * Recursive transactions are not disturbed by journal abort until the -- * final journal_stop, which will receive the -EIO error. -+ * final journal_stop. - * - * Finally, the journal_abort call allows the caller to supply an errno - * which will be recorded (if possible) in the journal superblock. This -@@ -1766,6 +1801,7 @@ static void __journal_remove_journal_hea - if (jh->b_transaction == NULL && - jh->b_next_transaction == NULL && - jh->b_cp_transaction == NULL) { -+ J_ASSERT_JH(jh, jh->b_jlist == BJ_None); - J_ASSERT_BH(bh, buffer_jbd(bh)); - J_ASSERT_BH(bh, jh2bh(jh) == bh); - BUFFER_TRACE(bh, "remove journal_head"); -diff -uprN linux-2.6.8.1.orig/fs/jbd/recovery.c linux-2.6.8.1-ve022stab078/fs/jbd/recovery.c ---- linux-2.6.8.1.orig/fs/jbd/recovery.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/recovery.c 2006-05-11 13:05:31.000000000 +0400 -@@ -191,10 +191,10 @@ static int count_tags(struct buffer_head - - nr++; - tagp += sizeof(journal_block_tag_t); -- if (!(tag->t_flags & htonl(JFS_FLAG_SAME_UUID))) -+ if (!(tag->t_flags & cpu_to_be32(JFS_FLAG_SAME_UUID))) - tagp += 16; - -- if (tag->t_flags & htonl(JFS_FLAG_LAST_TAG)) -+ if (tag->t_flags & cpu_to_be32(JFS_FLAG_LAST_TAG)) - break; - } - -@@ -239,8 +239,8 @@ int journal_recover(journal_t *journal) - - if (!sb->s_start) { - jbd_debug(1, "No recovery required, last transaction %d\n", -- ntohl(sb->s_sequence)); -- journal->j_transaction_sequence = ntohl(sb->s_sequence) + 1; -+ be32_to_cpu(sb->s_sequence)); -+ journal->j_transaction_sequence = be32_to_cpu(sb->s_sequence) + 1; - return 0; - } - -@@ -295,7 +295,7 @@ int journal_skip_recovery(journal_t *jou - ++journal->j_transaction_sequence; - } else { - #ifdef CONFIG_JBD_DEBUG -- int dropped = info.end_transaction - ntohl(sb->s_sequence); -+ int dropped = info.end_transaction - be32_to_cpu(sb->s_sequence); - #endif - jbd_debug(0, - "JBD: ignoring %d transaction%s from the journal.\n", -@@ -331,8 +331,8 @@ static int 
do_one_pass(journal_t *journa - */ - - sb = journal->j_superblock; -- next_commit_ID = ntohl(sb->s_sequence); -- next_log_block = ntohl(sb->s_start); -+ next_commit_ID = be32_to_cpu(sb->s_sequence); -+ next_log_block = be32_to_cpu(sb->s_start); - - first_commit_ID = next_commit_ID; - if (pass == PASS_SCAN) -@@ -385,13 +385,13 @@ static int do_one_pass(journal_t *journa - - tmp = (journal_header_t *)bh->b_data; - -- if (tmp->h_magic != htonl(JFS_MAGIC_NUMBER)) { -+ if (tmp->h_magic != cpu_to_be32(JFS_MAGIC_NUMBER)) { - brelse(bh); - break; - } - -- blocktype = ntohl(tmp->h_blocktype); -- sequence = ntohl(tmp->h_sequence); -+ blocktype = be32_to_cpu(tmp->h_blocktype); -+ sequence = be32_to_cpu(tmp->h_sequence); - jbd_debug(3, "Found magic %d, sequence %d\n", - blocktype, sequence); - -@@ -427,7 +427,7 @@ static int do_one_pass(journal_t *journa - unsigned long io_block; - - tag = (journal_block_tag_t *) tagp; -- flags = ntohl(tag->t_flags); -+ flags = be32_to_cpu(tag->t_flags); - - io_block = next_log_block++; - wrap(journal, next_log_block); -@@ -444,7 +444,7 @@ static int do_one_pass(journal_t *journa - unsigned long blocknr; - - J_ASSERT(obh != NULL); -- blocknr = ntohl(tag->t_blocknr); -+ blocknr = be32_to_cpu(tag->t_blocknr); - - /* If the block has been - * revoked, then we're all done -@@ -476,8 +476,8 @@ static int do_one_pass(journal_t *journa - memcpy(nbh->b_data, obh->b_data, - journal->j_blocksize); - if (flags & JFS_FLAG_ESCAPE) { -- *((unsigned int *)bh->b_data) = -- htonl(JFS_MAGIC_NUMBER); -+ *((__be32 *)bh->b_data) = -+ cpu_to_be32(JFS_MAGIC_NUMBER); - } - - BUFFER_TRACE(nbh, "marking dirty"); -@@ -572,13 +572,13 @@ static int scan_revoke_records(journal_t - - header = (journal_revoke_header_t *) bh->b_data; - offset = sizeof(journal_revoke_header_t); -- max = ntohl(header->r_count); -+ max = be32_to_cpu(header->r_count); - - while (offset < max) { - unsigned long blocknr; - int err; - -- blocknr = ntohl(* ((unsigned int *) (bh->b_data+offset))); -+ 
blocknr = be32_to_cpu(* ((__be32 *) (bh->b_data+offset))); - offset += 4; - err = journal_set_revoke(journal, blocknr, sequence); - if (err) -diff -uprN linux-2.6.8.1.orig/fs/jbd/revoke.c linux-2.6.8.1-ve022stab078/fs/jbd/revoke.c ---- linux-2.6.8.1.orig/fs/jbd/revoke.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/revoke.c 2006-05-11 13:05:31.000000000 +0400 -@@ -332,6 +332,7 @@ int journal_revoke(handle_t *handle, uns - struct block_device *bdev; - int err; - -+ might_sleep(); - if (bh_in) - BUFFER_TRACE(bh_in, "enter"); - -@@ -375,7 +376,12 @@ int journal_revoke(handle_t *handle, uns - first having the revoke cancelled: it's illegal to free a - block twice without allocating it in between! */ - if (bh) { -- J_ASSERT_BH(bh, !buffer_revoked(bh)); -+ if (!J_EXPECT_BH(bh, !buffer_revoked(bh), -+ "inconsistent data on disk")) { -+ if (!bh_in) -+ brelse(bh); -+ return -EIO; -+ } - set_buffer_revoked(bh); - set_buffer_revokevalid(bh); - if (bh_in) { -@@ -565,9 +571,9 @@ static void write_one_revoke_record(jour - if (!descriptor) - return; - header = (journal_header_t *) &jh2bh(descriptor)->b_data[0]; -- header->h_magic = htonl(JFS_MAGIC_NUMBER); -- header->h_blocktype = htonl(JFS_REVOKE_BLOCK); -- header->h_sequence = htonl(transaction->t_tid); -+ header->h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); -+ header->h_blocktype = cpu_to_be32(JFS_REVOKE_BLOCK); -+ header->h_sequence = cpu_to_be32(transaction->t_tid); - - /* Record it so that we can wait for IO completion later */ - JBUFFER_TRACE(descriptor, "file as BJ_LogCtl"); -@@ -577,8 +583,8 @@ static void write_one_revoke_record(jour - *descriptorp = descriptor; - } - -- * ((unsigned int *)(&jh2bh(descriptor)->b_data[offset])) = -- htonl(record->blocknr); -+ * ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) = -+ cpu_to_be32(record->blocknr); - offset += 4; - *offsetp = offset; - } -@@ -603,7 +609,7 @@ static void flush_descriptor(journal_t * - } - - header = (journal_revoke_header_t *) 
jh2bh(descriptor)->b_data; -- header->r_count = htonl(offset); -+ header->r_count = cpu_to_be32(offset); - set_buffer_jwrite(bh); - BUFFER_TRACE(bh, "write"); - set_buffer_dirty(bh); -diff -uprN linux-2.6.8.1.orig/fs/jbd/transaction.c linux-2.6.8.1-ve022stab078/fs/jbd/transaction.c ---- linux-2.6.8.1.orig/fs/jbd/transaction.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jbd/transaction.c 2006-05-11 13:05:39.000000000 +0400 -@@ -1046,7 +1046,12 @@ int journal_dirty_data(handle_t *handle, - /* journal_clean_data_list() may have got there first */ - if (jh->b_transaction != NULL) { - JBUFFER_TRACE(jh, "unfile from commit"); -- __journal_unfile_buffer(jh); -+ __journal_temp_unlink_buffer(jh); -+ /* It still points to the committing -+ * transaction; move it to this one so -+ * that the refile assert checks are -+ * happy. */ -+ jh->b_transaction = handle->h_transaction; - } - /* The buffer will be refiled below */ - -@@ -1060,7 +1065,8 @@ int journal_dirty_data(handle_t *handle, - if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) { - JBUFFER_TRACE(jh, "not on correct data list: unfile"); - J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow); -- __journal_unfile_buffer(jh); -+ __journal_temp_unlink_buffer(jh); -+ jh->b_transaction = handle->h_transaction; - JBUFFER_TRACE(jh, "file as data"); - __journal_file_buffer(jh, handle->h_transaction, - BJ_SyncData); -@@ -1200,11 +1206,12 @@ journal_release_buffer(handle_t *handle, - * Allow this call even if the handle has aborted --- it may be part of - * the caller's cleanup after an abort. 
- */ --void journal_forget(handle_t *handle, struct buffer_head *bh) -+int journal_forget (handle_t *handle, struct buffer_head *bh) - { - transaction_t *transaction = handle->h_transaction; - journal_t *journal = transaction->t_journal; - struct journal_head *jh; -+ int err = 0; - - BUFFER_TRACE(bh, "entry"); - -@@ -1215,6 +1222,14 @@ void journal_forget(handle_t *handle, st - goto not_jbd; - jh = bh2jh(bh); - -+ /* Critical error: attempting to delete a bitmap buffer, maybe? -+ * Don't do any jbd operations, and return an error. */ -+ if (!J_EXPECT_JH(jh, !jh->b_committed_data, -+ "inconsistent data on disk")) { -+ err = -EIO; -+ goto not_jbd; -+ } -+ - if (jh->b_transaction == handle->h_transaction) { - J_ASSERT_JH(jh, !jh->b_frozen_data); - -@@ -1225,9 +1240,6 @@ void journal_forget(handle_t *handle, st - clear_buffer_jbddirty(bh); - - JBUFFER_TRACE(jh, "belongs to current transaction: unfile"); -- J_ASSERT_JH(jh, !jh->b_committed_data); -- -- __journal_unfile_buffer(jh); - - /* - * We are no longer going to journal this buffer. -@@ -1242,15 +1254,17 @@ void journal_forget(handle_t *handle, st - */ - - if (jh->b_cp_transaction) { -+ __journal_temp_unlink_buffer(jh); - __journal_file_buffer(jh, transaction, BJ_Forget); - } else { -+ __journal_unfile_buffer(jh); - journal_remove_journal_head(bh); - __brelse(bh); - if (!buffer_jbd(bh)) { - spin_unlock(&journal->j_list_lock); - jbd_unlock_bh_state(bh); - __bforget(bh); -- return; -+ return 0; - } - } - } else if (jh->b_transaction) { -@@ -1272,7 +1286,7 @@ not_jbd: - spin_unlock(&journal->j_list_lock); - jbd_unlock_bh_state(bh); - __brelse(bh); -- return; -+ return err; - } - - /** -@@ -1402,7 +1416,8 @@ int journal_stop(handle_t *handle) - * Special case: JFS_SYNC synchronous updates require us - * to wait for the commit to complete. 
- */ -- if (handle->h_sync && !(current->flags & PF_MEMALLOC)) -+ if (handle->h_sync && !(current->flags & -+ (PF_MEMALLOC | PF_MEMDIE))) - err = log_wait_commit(journal, tid); - } else { - spin_unlock(&transaction->t_handle_lock); -@@ -1498,7 +1513,7 @@ __blist_del_buffer(struct journal_head * - * - * Called under j_list_lock. The journal may not be locked. - */ --void __journal_unfile_buffer(struct journal_head *jh) -+void __journal_temp_unlink_buffer(struct journal_head *jh) - { - struct journal_head **list = NULL; - transaction_t *transaction; -@@ -1515,7 +1530,7 @@ void __journal_unfile_buffer(struct jour - - switch (jh->b_jlist) { - case BJ_None: -- goto out; -+ return; - case BJ_SyncData: - list = &transaction->t_sync_datalist; - break; -@@ -1548,7 +1563,11 @@ void __journal_unfile_buffer(struct jour - jh->b_jlist = BJ_None; - if (test_clear_buffer_jbddirty(bh)) - mark_buffer_dirty(bh); /* Expose it to the VM */ --out: -+} -+ -+void __journal_unfile_buffer(struct journal_head *jh) -+{ -+ __journal_temp_unlink_buffer(jh); - jh->b_transaction = NULL; - } - -@@ -1804,10 +1823,10 @@ static int journal_unmap_buffer(journal_ - JBUFFER_TRACE(jh, "checkpointed: add to BJ_Forget"); - ret = __dispose_buffer(jh, - journal->j_running_transaction); -+ journal_put_journal_head(jh); - spin_unlock(&journal->j_list_lock); - jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); -- journal_put_journal_head(jh); - return ret; - } else { - /* There is no currently-running transaction. 
So the -@@ -1818,10 +1837,10 @@ static int journal_unmap_buffer(journal_ - JBUFFER_TRACE(jh, "give to committing trans"); - ret = __dispose_buffer(jh, - journal->j_committing_transaction); -+ journal_put_journal_head(jh); - spin_unlock(&journal->j_list_lock); - jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); -- journal_put_journal_head(jh); - return ret; - } else { - /* The orphan record's transaction has -@@ -1831,7 +1850,17 @@ static int journal_unmap_buffer(journal_ - } - } - } else if (transaction == journal->j_committing_transaction) { -- /* If it is committing, we simply cannot touch it. We -+ if (jh->b_jlist == BJ_Locked) { -+ /* -+ * The buffer is on the committing transaction's locked -+ * list. We have the buffer locked, so I/O has -+ * completed. So we can nail the buffer now. -+ */ -+ may_free = __dispose_buffer(jh, transaction); -+ goto zap_buffer; -+ } -+ /* -+ * If it is committing, we simply cannot touch it. We - * can remove it's next_transaction pointer from the - * running transaction if that is set, but nothing - * else. */ -@@ -1842,10 +1871,10 @@ static int journal_unmap_buffer(journal_ - journal->j_running_transaction); - jh->b_next_transaction = NULL; - } -+ journal_put_journal_head(jh); - spin_unlock(&journal->j_list_lock); - jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); -- journal_put_journal_head(jh); - return 0; - } else { - /* Good, the buffer belongs to the running transaction. 
-@@ -1870,6 +1899,7 @@ zap_buffer_unlocked: - clear_buffer_mapped(bh); - clear_buffer_req(bh); - clear_buffer_new(bh); -+ clear_buffer_delay(bh); - bh->b_bdev = NULL; - return may_free; - } -@@ -1906,7 +1936,6 @@ int journal_invalidatepage(journal_t *jo - unsigned int next_off = curr_off + bh->b_size; - next = bh->b_this_page; - -- /* AKPM: doing lock_buffer here may be overly paranoid */ - if (offset <= curr_off) { - /* This block is wholly outside the truncation point */ - lock_buffer(bh); -@@ -1958,7 +1987,7 @@ void __journal_file_buffer(struct journa - } - - if (jh->b_transaction) -- __journal_unfile_buffer(jh); -+ __journal_temp_unlink_buffer(jh); - jh->b_transaction = transaction; - - switch (jlist) { -@@ -2041,7 +2070,7 @@ void __journal_refile_buffer(struct jour - */ - - was_dirty = test_clear_buffer_jbddirty(bh); -- __journal_unfile_buffer(jh); -+ __journal_temp_unlink_buffer(jh); - jh->b_transaction = jh->b_next_transaction; - jh->b_next_transaction = NULL; - __journal_file_buffer(jh, jh->b_transaction, BJ_Metadata); -diff -uprN linux-2.6.8.1.orig/fs/jffs2/background.c linux-2.6.8.1-ve022stab078/fs/jffs2/background.c ---- linux-2.6.8.1.orig/fs/jffs2/background.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jffs2/background.c 2006-05-11 13:05:25.000000000 +0400 -@@ -93,8 +93,8 @@ static int jffs2_garbage_collect_thread( - schedule(); - } - -- if (current->flags & PF_FREEZE) { -- refrigerator(0); -+ if (test_thread_flag(TIF_FREEZE)) { -+ refrigerator(); - /* refrigerator() should recalc sigpending for us - but doesn't. No matter - allow_signal() will. 
*/ - continue; -diff -uprN linux-2.6.8.1.orig/fs/jfs/acl.c linux-2.6.8.1-ve022stab078/fs/jfs/acl.c ---- linux-2.6.8.1.orig/fs/jfs/acl.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/acl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -127,7 +127,7 @@ out: - * - * modified vfs_permission to check posix acl - */ --int jfs_permission(struct inode * inode, int mask, struct nameidata *nd) -+int __jfs_permission(struct inode * inode, int mask) - { - umode_t mode = inode->i_mode; - struct jfs_inode_info *ji = JFS_IP(inode); -@@ -206,6 +206,28 @@ check_capabilities: - return -EACCES; - } - -+int jfs_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) -+{ -+ int ret; -+ -+ if (exec_perm != NULL) -+ down(&inode->i_sem); -+ -+ ret = __jfs_permission(inode, mask); -+ -+ if (exec_perm != NULL) { -+ if (!ret) { -+ exec_perm->set = 1; -+ exec_perm->mode = inode->i_mode; -+ exec_perm->uid = inode->i_uid; -+ exec_perm->gid = inode->i_gid; -+ } -+ up(&inode->i_sem); -+ } -+ return ret; -+} -+ - int jfs_init_acl(struct inode *inode, struct inode *dir) - { - struct posix_acl *acl = NULL; -diff -uprN linux-2.6.8.1.orig/fs/jfs/inode.c linux-2.6.8.1-ve022stab078/fs/jfs/inode.c ---- linux-2.6.8.1.orig/fs/jfs/inode.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -105,10 +105,10 @@ int jfs_commit_inode(struct inode *inode - return rc; - } - --void jfs_write_inode(struct inode *inode, int wait) -+int jfs_write_inode(struct inode *inode, int wait) - { - if (test_cflag(COMMIT_Nolink, inode)) -- return; -+ return 0; - /* - * If COMMIT_DIRTY is not set, the inode isn't really dirty. 
- * It has been committed since the last change, but was still -@@ -117,12 +117,14 @@ void jfs_write_inode(struct inode *inode - if (!test_cflag(COMMIT_Dirty, inode)) { - /* Make sure committed changes hit the disk */ - jfs_flush_journal(JFS_SBI(inode->i_sb)->log, wait); -- return; -+ return 0; - } - - if (jfs_commit_inode(inode, wait)) { - jfs_err("jfs_write_inode: jfs_commit_inode failed!"); -- } -+ return -EIO; -+ } else -+ return 0; - } - - void jfs_delete_inode(struct inode *inode) -diff -uprN linux-2.6.8.1.orig/fs/jfs/jfs_acl.h linux-2.6.8.1-ve022stab078/fs/jfs/jfs_acl.h ---- linux-2.6.8.1.orig/fs/jfs/jfs_acl.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/jfs_acl.h 2006-05-11 13:05:35.000000000 +0400 -@@ -22,7 +22,7 @@ - - #include <linux/xattr_acl.h> - --int jfs_permission(struct inode *, int, struct nameidata *); -+int jfs_permission(struct inode *, int, struct nameidata *, struct exec_perm *); - int jfs_init_acl(struct inode *, struct inode *); - int jfs_setattr(struct dentry *, struct iattr *); - -diff -uprN linux-2.6.8.1.orig/fs/jfs/jfs_logmgr.c linux-2.6.8.1-ve022stab078/fs/jfs/jfs_logmgr.c ---- linux-2.6.8.1.orig/fs/jfs/jfs_logmgr.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/jfs_logmgr.c 2006-05-11 13:05:25.000000000 +0400 -@@ -2328,9 +2328,9 @@ int jfsIOWait(void *arg) - lbmStartIO(bp); - spin_lock_irq(&log_redrive_lock); - } -- if (current->flags & PF_FREEZE) { -+ if (test_thread_flag(TIF_FREEZE)) { - spin_unlock_irq(&log_redrive_lock); -- refrigerator(PF_FREEZE); -+ refrigerator(); - } else { - add_wait_queue(&jfs_IO_thread_wait, &wq); - set_current_state(TASK_INTERRUPTIBLE); -diff -uprN linux-2.6.8.1.orig/fs/jfs/jfs_txnmgr.c linux-2.6.8.1-ve022stab078/fs/jfs/jfs_txnmgr.c ---- linux-2.6.8.1.orig/fs/jfs/jfs_txnmgr.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/jfs_txnmgr.c 2006-05-11 13:05:25.000000000 +0400 -@@ -2776,9 +2776,9 @@ int jfs_lazycommit(void *arg) 
- break; - } - -- if (current->flags & PF_FREEZE) { -+ if (test_thread_flag(TIF_FREEZE)) { - LAZY_UNLOCK(flags); -- refrigerator(PF_FREEZE); -+ refrigerator(); - } else { - DECLARE_WAITQUEUE(wq, current); - -@@ -2987,9 +2987,9 @@ int jfs_sync(void *arg) - /* Add anon_list2 back to anon_list */ - list_splice_init(&TxAnchor.anon_list2, &TxAnchor.anon_list); - -- if (current->flags & PF_FREEZE) { -+ if (test_thread_flag(TIF_FREEZE)) { - TXN_UNLOCK(); -- refrigerator(PF_FREEZE); -+ refrigerator(); - } else { - DECLARE_WAITQUEUE(wq, current); - -diff -uprN linux-2.6.8.1.orig/fs/jfs/super.c linux-2.6.8.1-ve022stab078/fs/jfs/super.c ---- linux-2.6.8.1.orig/fs/jfs/super.c 2004-08-14 14:55:31.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/super.c 2006-05-11 13:05:35.000000000 +0400 -@@ -77,7 +77,7 @@ extern int jfs_sync(void *); - extern void jfs_read_inode(struct inode *inode); - extern void jfs_dirty_inode(struct inode *inode); - extern void jfs_delete_inode(struct inode *inode); --extern void jfs_write_inode(struct inode *inode, int wait); -+extern int jfs_write_inode(struct inode *inode, int wait); - - extern struct dentry *jfs_get_parent(struct dentry *dentry); - extern int jfs_extendfs(struct super_block *, s64, int); -diff -uprN linux-2.6.8.1.orig/fs/jfs/xattr.c linux-2.6.8.1-ve022stab078/fs/jfs/xattr.c ---- linux-2.6.8.1.orig/fs/jfs/xattr.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/jfs/xattr.c 2006-05-11 13:05:35.000000000 +0400 -@@ -745,7 +745,7 @@ static int can_set_xattr(struct inode *i - (!S_ISDIR(inode->i_mode) || inode->i_mode &S_ISVTX)) - return -EPERM; - -- return permission(inode, MAY_WRITE, NULL); -+ return permission(inode, MAY_WRITE, NULL, NULL); - } - - int __jfs_setxattr(struct inode *inode, const char *name, const void *value, -@@ -906,7 +906,7 @@ static int can_get_xattr(struct inode *i - { - if(strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN) == 0) - return 0; -- return permission(inode, MAY_READ, 
NULL); -+ return permission(inode, MAY_READ, NULL, NULL); - } - - ssize_t __jfs_getxattr(struct inode *inode, const char *name, void *data, -diff -uprN linux-2.6.8.1.orig/fs/libfs.c linux-2.6.8.1-ve022stab078/fs/libfs.c ---- linux-2.6.8.1.orig/fs/libfs.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/libfs.c 2006-05-11 13:05:40.000000000 +0400 -@@ -412,10 +412,13 @@ static spinlock_t pin_fs_lock = SPIN_LOC - int simple_pin_fs(char *name, struct vfsmount **mount, int *count) - { - struct vfsmount *mnt = NULL; -+ struct file_system_type *fstype; - spin_lock(&pin_fs_lock); - if (unlikely(!*mount)) { - spin_unlock(&pin_fs_lock); -- mnt = do_kern_mount(name, 0, name, NULL); -+ fstype = get_fs_type(name); -+ mnt = do_kern_mount(fstype, 0, name, NULL); -+ put_filesystem(fstype); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); - spin_lock(&pin_fs_lock); -diff -uprN linux-2.6.8.1.orig/fs/lockd/clntproc.c linux-2.6.8.1-ve022stab078/fs/lockd/clntproc.c ---- linux-2.6.8.1.orig/fs/lockd/clntproc.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/lockd/clntproc.c 2006-05-11 13:05:40.000000000 +0400 -@@ -53,10 +53,10 @@ nlmclnt_setlockargs(struct nlm_rqst *req - nlmclnt_next_cookie(&argp->cookie); - argp->state = nsm_local_state; - memcpy(&lock->fh, NFS_FH(fl->fl_file->f_dentry->d_inode), sizeof(struct nfs_fh)); -- lock->caller = system_utsname.nodename; -+ lock->caller = ve_utsname.nodename; - lock->oh.data = req->a_owner; - lock->oh.len = sprintf(req->a_owner, "%d@%s", -- current->pid, system_utsname.nodename); -+ current->pid, ve_utsname.nodename); - locks_copy_lock(&lock->fl, fl); - } - -@@ -69,7 +69,7 @@ nlmclnt_setgrantargs(struct nlm_rqst *ca - { - locks_copy_lock(&call->a_args.lock.fl, &lock->fl); - memcpy(&call->a_args.lock.fh, &lock->fh, sizeof(call->a_args.lock.fh)); -- call->a_args.lock.caller = system_utsname.nodename; -+ call->a_args.lock.caller = ve_utsname.nodename; - call->a_args.lock.oh.len = lock->oh.len; - - /* set 
default data area */ -diff -uprN linux-2.6.8.1.orig/fs/lockd/mon.c linux-2.6.8.1-ve022stab078/fs/lockd/mon.c ---- linux-2.6.8.1.orig/fs/lockd/mon.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/lockd/mon.c 2006-05-11 13:05:40.000000000 +0400 -@@ -151,7 +151,7 @@ xdr_encode_common(struct rpc_rqst *rqstp - sprintf(buffer, "%d.%d.%d.%d", (addr>>24) & 0xff, (addr>>16) & 0xff, - (addr>>8) & 0xff, (addr) & 0xff); - if (!(p = xdr_encode_string(p, buffer)) -- || !(p = xdr_encode_string(p, system_utsname.nodename))) -+ || !(p = xdr_encode_string(p, ve_utsname.nodename))) - return ERR_PTR(-EIO); - *p++ = htonl(argp->prog); - *p++ = htonl(argp->vers); -diff -uprN linux-2.6.8.1.orig/fs/locks.c linux-2.6.8.1-ve022stab078/fs/locks.c ---- linux-2.6.8.1.orig/fs/locks.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/locks.c 2006-05-11 13:05:40.000000000 +0400 -@@ -127,6 +127,8 @@ - #include <asm/semaphore.h> - #include <asm/uaccess.h> - -+#include <ub/ub_misc.h> -+ - #define IS_POSIX(fl) (fl->fl_flags & FL_POSIX) - #define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK) - #define IS_LEASE(fl) (fl->fl_flags & FL_LEASE) -@@ -146,9 +148,23 @@ static LIST_HEAD(blocked_list); - static kmem_cache_t *filelock_cache; - - /* Allocate an empty lock structure. */ --static struct file_lock *locks_alloc_lock(void) -+static struct file_lock *locks_alloc_lock(int charge) - { -- return kmem_cache_alloc(filelock_cache, SLAB_KERNEL); -+ struct file_lock *flock; -+ -+ flock = kmem_cache_alloc(filelock_cache, SLAB_KERNEL); -+ if (flock == NULL) -+ goto out; -+ flock->fl_charged = 0; -+ if (!charge) -+ goto out; -+ if (!ub_flock_charge(flock, 1)) -+ goto out; -+ -+ kmem_cache_free(filelock_cache, flock); -+ flock = NULL; -+out: -+ return flock; - } - - /* Free a lock which is not in use. 
*/ -@@ -167,6 +183,7 @@ static inline void locks_free_lock(struc - if (!list_empty(&fl->fl_link)) - panic("Attempting to free lock on active lock list"); - -+ ub_flock_uncharge(fl); - kmem_cache_free(filelock_cache, fl); - } - -@@ -247,8 +264,8 @@ static int flock_make_lock(struct file * - int type = flock_translate_cmd(cmd); - if (type < 0) - return type; -- -- fl = locks_alloc_lock(); -+ -+ fl = locks_alloc_lock(type != F_UNLCK); - if (fl == NULL) - return -ENOMEM; - -@@ -382,7 +399,7 @@ static int flock64_to_posix_lock(struct - /* Allocate a file_lock initialised to this type of lease */ - static int lease_alloc(struct file *filp, int type, struct file_lock **flp) - { -- struct file_lock *fl = locks_alloc_lock(); -+ struct file_lock *fl = locks_alloc_lock(1); - if (fl == NULL) - return -ENOMEM; - -@@ -733,8 +750,11 @@ static int __posix_lock_file(struct inod - * We may need two file_lock structures for this operation, - * so we get them in advance to avoid races. - */ -- new_fl = locks_alloc_lock(); -- new_fl2 = locks_alloc_lock(); -+ if (request->fl_type != F_UNLCK) -+ new_fl = locks_alloc_lock(1); -+ else -+ new_fl = NULL; -+ new_fl2 = locks_alloc_lock(0); - - lock_kernel(); - if (request->fl_type != F_UNLCK) { -@@ -762,7 +782,7 @@ static int __posix_lock_file(struct inod - goto out; - - error = -ENOLCK; /* "no luck" */ -- if (!(new_fl && new_fl2)) -+ if (!((request->fl_type == F_UNLCK || new_fl) && new_fl2)) - goto out; - - /* -@@ -864,19 +884,29 @@ static int __posix_lock_file(struct inod - if (!added) { - if (request->fl_type == F_UNLCK) - goto out; -+ error = -ENOLCK; -+ if (right && (left == right) && ub_flock_charge(new_fl, 1)) -+ goto out; - locks_copy_lock(new_fl, request); - locks_insert_lock(before, new_fl); - new_fl = NULL; -+ error = 0; - } - if (right) { - if (left == right) { - /* The new lock breaks the old one in two pieces, - * so we have to use the second new lock. 
- */ -+ error = -ENOLCK; -+ if (added && ub_flock_charge(new_fl2, -+ request->fl_type != F_UNLCK)) -+ goto out; -+ new_fl2->fl_charged = 1; - left = new_fl2; - new_fl2 = NULL; - locks_copy_lock(left, right); - locks_insert_lock(before, left); -+ error = 0; - } - right->fl_start = request->fl_end + 1; - locks_wake_up_blocks(right); -@@ -1024,7 +1054,6 @@ static void time_out_leases(struct inode - before = &fl->fl_next; - continue; - } -- printk(KERN_INFO "lease broken - owner pid = %d\n", fl->fl_pid); - lease_modify(before, fl->fl_type & ~F_INPROGRESS); - if (fl == *before) /* lease_modify may have freed fl */ - before = &fl->fl_next; -@@ -1146,7 +1175,7 @@ void lease_get_mtime(struct inode *inode - { - struct file_lock *flock = inode->i_flock; - if (flock && IS_LEASE(flock) && (flock->fl_type & F_WRLCK)) -- *time = CURRENT_TIME; -+ *time = current_fs_time(inode->i_sb); - else - *time = inode->i_mtime; - } -@@ -1400,7 +1429,7 @@ int fcntl_getlk(struct file *filp, struc - - flock.l_type = F_UNLCK; - if (fl != NULL) { -- flock.l_pid = fl->fl_pid; -+ flock.l_pid = pid_type_to_vpid(PIDTYPE_TGID, fl->fl_pid); - #if BITS_PER_LONG == 32 - /* - * Make sure we can represent the posix lock via -@@ -1432,7 +1461,7 @@ out: - */ - int fcntl_setlk(struct file *filp, unsigned int cmd, struct flock __user *l) - { -- struct file_lock *file_lock = locks_alloc_lock(); -+ struct file_lock *file_lock = locks_alloc_lock(0); - struct flock flock; - struct inode *inode; - int error; -@@ -1547,7 +1576,7 @@ int fcntl_getlk64(struct file *filp, str - - flock.l_type = F_UNLCK; - if (fl != NULL) { -- flock.l_pid = fl->fl_pid; -+ flock.l_pid = pid_type_to_vpid(PIDTYPE_TGID, fl->fl_pid); - flock.l_start = fl->fl_start; - flock.l_len = fl->fl_end == OFFSET_MAX ? 
0 : - fl->fl_end - fl->fl_start + 1; -@@ -1567,7 +1596,7 @@ out: - */ - int fcntl_setlk64(struct file *filp, unsigned int cmd, struct flock64 __user *l) - { -- struct file_lock *file_lock = locks_alloc_lock(); -+ struct file_lock *file_lock = locks_alloc_lock(1); - struct flock64 flock; - struct inode *inode; - int error; -@@ -1712,7 +1741,12 @@ void locks_remove_flock(struct file *fil - - while ((fl = *before) != NULL) { - if (fl->fl_file == filp) { -- if (IS_FLOCK(fl)) { -+ /* -+ * We might have a POSIX lock that was created at the same time -+ * the filp was closed for the last time. Just remove that too, -+ * regardless of ownership, since nobody can own it. -+ */ -+ if (IS_FLOCK(fl) || IS_POSIX(fl)) { - locks_delete_lock(before); - continue; - } -@@ -1720,9 +1754,7 @@ void locks_remove_flock(struct file *fil - lease_modify(before, F_UNLCK); - continue; - } -- /* FL_POSIX locks of this process have already been -- * removed in filp_close->locks_remove_posix. -- */ -+ /* What? */ - BUG(); - } - before = &fl->fl_next; -@@ -1775,7 +1807,9 @@ EXPORT_SYMBOL(posix_unblock_lock); - static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx) - { - struct inode *inode = NULL; -+ unsigned int fl_pid; - -+ fl_pid = pid_type_to_vpid(PIDTYPE_TGID, fl->fl_pid); - if (fl->fl_file != NULL) - inode = fl->fl_file->f_dentry->d_inode; - -@@ -1817,16 +1851,16 @@ static void lock_get_status(char* out, s - } - if (inode) { - #ifdef WE_CAN_BREAK_LSLK_NOW -- out += sprintf(out, "%d %s:%ld ", fl->fl_pid, -+ out += sprintf(out, "%d %s:%ld ", fl_pid, - inode->i_sb->s_id, inode->i_ino); - #else - /* userspace relies on this representation of dev_t ;-( */ -- out += sprintf(out, "%d %02x:%02x:%ld ", fl->fl_pid, -+ out += sprintf(out, "%d %02x:%02x:%ld ", fl_pid, - MAJOR(inode->i_sb->s_dev), - MINOR(inode->i_sb->s_dev), inode->i_ino); - #endif - } else { -- out += sprintf(out, "%d <none>:0 ", fl->fl_pid); -+ out += sprintf(out, "%d <none>:0 ", fl_pid); - } - if 
(IS_POSIX(fl)) { - if (fl->fl_end == OFFSET_MAX) -@@ -1875,11 +1909,17 @@ int get_locks_status(char *buffer, char - char *q = buffer; - off_t pos = 0; - int i = 0; -+ struct ve_struct *env; - - lock_kernel(); -+ env = get_exec_env(); - list_for_each(tmp, &file_lock_list) { - struct list_head *btmp; - struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link); -+ -+ if (!ve_accessible(VE_OWNER_FILP(fl->fl_file), env)) -+ continue; -+ - lock_get_status(q, fl, ++i, ""); - move_lock_status(&q, &pos, offset); - -@@ -2033,9 +2073,9 @@ EXPORT_SYMBOL(steal_locks); - static int __init filelock_init(void) - { - filelock_cache = kmem_cache_create("file_lock_cache", -- sizeof(struct file_lock), 0, SLAB_PANIC, -+ sizeof(struct file_lock), 0, SLAB_PANIC | SLAB_UBC, - init_once, NULL); - return 0; - } - --module_init(filelock_init) -+core_initcall(filelock_init); -diff -uprN linux-2.6.8.1.orig/fs/minix/inode.c linux-2.6.8.1-ve022stab078/fs/minix/inode.c ---- linux-2.6.8.1.orig/fs/minix/inode.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/minix/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -18,7 +18,7 @@ - #include <linux/vfs.h> - - static void minix_read_inode(struct inode * inode); --static void minix_write_inode(struct inode * inode, int wait); -+static int minix_write_inode(struct inode * inode, int wait); - static int minix_statfs(struct super_block *sb, struct kstatfs *buf); - static int minix_remount (struct super_block * sb, int * flags, char * data); - -@@ -505,9 +505,10 @@ static struct buffer_head *minix_update_ - return V2_minix_update_inode(inode); - } - --static void minix_write_inode(struct inode * inode, int wait) -+static int minix_write_inode(struct inode * inode, int wait) - { - brelse(minix_update_inode(inode)); -+ return 0; - } - - int minix_sync_inode(struct inode * inode) -diff -uprN linux-2.6.8.1.orig/fs/minix/namei.c linux-2.6.8.1-ve022stab078/fs/minix/namei.c ---- linux-2.6.8.1.orig/fs/minix/namei.c 2004-08-14 
14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/minix/namei.c 2006-05-11 13:05:32.000000000 +0400 -@@ -116,7 +116,7 @@ static int minix_symlink(struct inode * - - inode->i_mode = S_IFLNK | 0777; - minix_set_inode(inode, 0); -- err = page_symlink(inode, symname, i); -+ err = page_symlink(inode, symname, i, GFP_KERNEL); - if (err) - goto out_fail; - -diff -uprN linux-2.6.8.1.orig/fs/mpage.c linux-2.6.8.1-ve022stab078/fs/mpage.c ---- linux-2.6.8.1.orig/fs/mpage.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/mpage.c 2006-05-11 13:05:25.000000000 +0400 -@@ -687,6 +687,8 @@ retry: - bio = mpage_writepage(bio, page, get_block, - &last_block_in_bio, &ret, wbc); - } -+ if (unlikely(ret == WRITEPAGE_ACTIVATE)) -+ unlock_page(page); - if (ret || (--(wbc->nr_to_write) <= 0)) - done = 1; - if (wbc->nonblocking && bdi_write_congested(bdi)) { -diff -uprN linux-2.6.8.1.orig/fs/namei.c linux-2.6.8.1-ve022stab078/fs/namei.c ---- linux-2.6.8.1.orig/fs/namei.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/namei.c 2006-05-11 13:05:43.000000000 +0400 -@@ -115,11 +115,12 @@ static inline int do_getname(const char - int retval; - unsigned long len = PATH_MAX; - -- if ((unsigned long) filename >= TASK_SIZE) { -- if (!segment_eq(get_fs(), KERNEL_DS)) -+ if (!segment_eq(get_fs(), KERNEL_DS)) { -+ if ((unsigned long) filename >= TASK_SIZE) - return -EFAULT; -- } else if (TASK_SIZE - (unsigned long) filename < PATH_MAX) -- len = TASK_SIZE - (unsigned long) filename; -+ if (TASK_SIZE - (unsigned long) filename < PATH_MAX) -+ len = TASK_SIZE - (unsigned long) filename; -+ } - - retval = strncpy_from_user((char *)page, filename, len); - if (retval > 0) { -@@ -159,7 +160,7 @@ char * getname(const char __user * filen - * for filesystem access without changing the "normal" uids which - * are used for other things.. 
- */ --int vfs_permission(struct inode * inode, int mask) -+int __vfs_permission(struct inode * inode, int mask) - { - umode_t mode = inode->i_mode; - -@@ -208,7 +209,29 @@ int vfs_permission(struct inode * inode, - return -EACCES; - } - --int permission(struct inode * inode,int mask, struct nameidata *nd) -+int vfs_permission(struct inode * inode, int mask, struct exec_perm * exec_perm) -+{ -+ int ret; -+ -+ if (exec_perm != NULL) -+ down(&inode->i_sem); -+ -+ ret = __vfs_permission(inode, mask); -+ -+ if (exec_perm != NULL) { -+ if (!ret) { -+ exec_perm->set = 1; -+ exec_perm->mode = inode->i_mode; -+ exec_perm->uid = inode->i_uid; -+ exec_perm->gid = inode->i_gid; -+ } -+ up(&inode->i_sem); -+ } -+ return ret; -+} -+ -+int permission(struct inode * inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) - { - int retval; - int submask; -@@ -217,9 +240,9 @@ int permission(struct inode * inode,int - submask = mask & ~MAY_APPEND; - - if (inode->i_op && inode->i_op->permission) -- retval = inode->i_op->permission(inode, submask, nd); -+ retval = inode->i_op->permission(inode, submask, nd, exec_perm); - else -- retval = vfs_permission(inode, submask); -+ retval = vfs_permission(inode, submask, exec_perm); - if (retval) - return retval; - -@@ -302,6 +325,21 @@ static struct dentry * cached_lookup(str - if (!dentry) - dentry = d_lookup(parent, name); - -+ /* -+ * The revalidation rules are simple: -+ * d_revalidate operation is called when we're about to use a cached -+ * dentry rather than call d_lookup. -+ * d_revalidate method may unhash the dentry itself or return FALSE, in -+ * which case if the dentry can be released d_lookup will be called. -+ * -+ * Additionally, by request of NFS people -+ * (http://linux.bkbits.net:8080/linux-2.4/cset@1.181?nav=index.html|src/|src/fs|related/fs/namei.c) -+ * d_revalidate is called when `/', `.' or `..' are looked up. 
-+ * Since re-lookup is impossible on them, we introduce a hack and -+ * return an error in this case. -+ * -+ * 2003/02/19 SAW -+ */ - if (dentry && dentry->d_op && dentry->d_op->d_revalidate) { - if (!dentry->d_op->d_revalidate(dentry, nd) && !d_invalidate(dentry)) { - dput(dentry); -@@ -364,6 +402,7 @@ static struct dentry * real_lookup(struc - struct dentry * result; - struct inode *dir = parent->d_inode; - -+repeat: - down(&dir->i_sem); - /* - * First re-do the cached lookup just in case it was created -@@ -402,7 +441,7 @@ static struct dentry * real_lookup(struc - if (result->d_op && result->d_op->d_revalidate) { - if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) { - dput(result); -- result = ERR_PTR(-ENOENT); -+ goto repeat; - } - } - return result; -@@ -578,7 +617,14 @@ static inline void follow_dotdot(struct - read_unlock(¤t->fs->lock); - break; - } -- read_unlock(¤t->fs->lock); -+#ifdef CONFIG_VE -+ if (*dentry == get_exec_env()->fs_root && -+ *mnt == get_exec_env()->fs_rootmnt) { -+ read_unlock(¤t->fs->lock); -+ break; -+ } -+#endif -+ read_unlock(¤t->fs->lock); - spin_lock(&dcache_lock); - if (*dentry != (*mnt)->mnt_root) { - *dentry = dget((*dentry)->d_parent); -@@ -658,6 +704,7 @@ int fastcall link_path_walk(const char * - { - struct path next; - struct inode *inode; -+ int real_components = 0; - int err; - unsigned int lookup_flags = nd->flags; - -@@ -678,7 +725,7 @@ int fastcall link_path_walk(const char * - - err = exec_permission_lite(inode, nd); - if (err == -EAGAIN) { -- err = permission(inode, MAY_EXEC, nd); -+ err = permission(inode, MAY_EXEC, nd, NULL); - } - if (err) - break; -@@ -730,10 +777,14 @@ int fastcall link_path_walk(const char * - } - nd->flags |= LOOKUP_CONTINUE; - /* This does the actual lookups.. */ -+ real_components++; - err = do_lookup(nd, &this, &next); - if (err) - break; - /* Check mountpoints.. 
*/ -+ err = -ENOENT; -+ if ((lookup_flags & LOOKUP_STRICT) && d_mountpoint(nd->dentry)) -+ goto out_dput; - follow_mount(&next.mnt, &next.dentry); - - err = -ENOENT; -@@ -745,6 +796,10 @@ int fastcall link_path_walk(const char * - goto out_dput; - - if (inode->i_op->follow_link) { -+ err = -ENOENT; -+ if (lookup_flags & LOOKUP_STRICT) -+ goto out_dput; -+ - mntget(next.mnt); - err = do_follow_link(next.dentry, nd); - dput(next.dentry); -@@ -795,9 +850,13 @@ last_component: - err = do_lookup(nd, &this, &next); - if (err) - break; -+ err = -ENOENT; -+ if ((lookup_flags & LOOKUP_STRICT) && d_mountpoint(nd->dentry)) -+ goto out_dput; - follow_mount(&next.mnt, &next.dentry); - inode = next.dentry->d_inode; - if ((lookup_flags & LOOKUP_FOLLOW) -+ && !(lookup_flags & LOOKUP_STRICT) - && inode && inode->i_op && inode->i_op->follow_link) { - mntget(next.mnt); - err = do_follow_link(next.dentry, nd); -@@ -825,26 +884,40 @@ lookup_parent: - nd->last_type = LAST_NORM; - if (this.name[0] != '.') - goto return_base; -- if (this.len == 1) -+ if (this.len == 1) { - nd->last_type = LAST_DOT; -- else if (this.len == 2 && this.name[1] == '.') -+ goto return_reval; -+ } else if (this.len == 2 && this.name[1] == '.') { - nd->last_type = LAST_DOTDOT; -- else -- goto return_base; -+ goto return_reval; -+ } -+return_base: -+ if (!(nd->flags & LOOKUP_NOAREACHECK)) { -+ err = check_area_access_ve(nd->dentry, nd->mnt); -+ if (err) -+ break; -+ } -+ return 0; - return_reval: - /* - * We bypassed the ordinary revalidation routines. - * We may need to check the cached dentry for staleness. - */ -- if (nd->dentry && nd->dentry->d_sb && -+ if (!real_components && nd->dentry && nd->dentry->d_sb && - (nd->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) { - err = -ESTALE; - /* Note: we do not d_invalidate() */ - if (!nd->dentry->d_op->d_revalidate(nd->dentry, nd)) -+ /* -+ * This lookup is for `/' or `.' or `..'. 
-+ * The filesystem unhashed the dentry itself -+ * inside d_revalidate (otherwise, d_invalidate -+ * wouldn't succeed). As a special courtesy to -+ * NFS we return an error. 2003/02/19 SAW -+ */ - break; - } --return_base: -- return 0; -+ goto return_base; - out_dput: - dput(next.dentry); - break; -@@ -971,7 +1044,7 @@ static struct dentry * __lookup_hash(str - int err; - - inode = base->d_inode; -- err = permission(inode, MAY_EXEC, nd); -+ err = permission(inode, MAY_EXEC, nd, NULL); - dentry = ERR_PTR(err); - if (err) - goto out; -@@ -1096,7 +1169,7 @@ static inline int may_delete(struct inod - int error; - if (!victim->d_inode || victim->d_parent->d_inode != dir) - return -ENOENT; -- error = permission(dir,MAY_WRITE | MAY_EXEC, NULL); -+ error = permission(dir,MAY_WRITE | MAY_EXEC, NULL, NULL); - if (error) - return error; - if (IS_APPEND(dir)) -@@ -1133,7 +1206,7 @@ static inline int may_create(struct inod - return -EEXIST; - if (IS_DEADDIR(dir)) - return -ENOENT; -- return permission(dir,MAY_WRITE | MAY_EXEC, nd); -+ return permission(dir, MAY_WRITE | MAY_EXEC, nd, NULL); - } - - /* -@@ -1241,7 +1314,7 @@ int may_open(struct nameidata *nd, int a - if (S_ISDIR(inode->i_mode) && (flag & FMODE_WRITE)) - return -EISDIR; - -- error = permission(inode, acc_mode, nd); -+ error = permission(inode, acc_mode, nd, NULL); - if (error) - return error; - -@@ -1662,17 +1735,13 @@ out: - static void d_unhash(struct dentry *dentry) - { - dget(dentry); -- spin_lock(&dcache_lock); -- switch (atomic_read(&dentry->d_count)) { -- default: -- spin_unlock(&dcache_lock); -+ if (atomic_read(&dentry->d_count)) - shrink_dcache_parent(dentry); -- spin_lock(&dcache_lock); -- if (atomic_read(&dentry->d_count) != 2) -- break; -- case 2: -+ spin_lock(&dcache_lock); -+ spin_lock(&dentry->d_lock); -+ if (atomic_read(&dentry->d_count) == 2) - __d_drop(dentry); -- } -+ spin_unlock(&dentry->d_lock); - spin_unlock(&dcache_lock); - } - -@@ -2020,7 +2089,7 @@ int vfs_rename_dir(struct inode *old_dir 
- * we'll need to flip '..'. - */ - if (new_dir != old_dir) { -- error = permission(old_dentry->d_inode, MAY_WRITE, NULL); -+ error = permission(old_dentry->d_inode, MAY_WRITE, NULL, NULL); - if (error) - return error; - } -@@ -2090,6 +2159,9 @@ int vfs_rename(struct inode *old_dir, st - int error; - int is_dir = S_ISDIR(old_dentry->d_inode->i_mode); - -+ if (DQUOT_RENAME(old_dentry->d_inode, old_dir, new_dir)) -+ return -EXDEV; -+ - if (old_dentry->d_inode == new_dentry->d_inode) - return 0; - -@@ -2332,13 +2404,16 @@ int page_follow_link(struct dentry *dent - return res; - } - --int page_symlink(struct inode *inode, const char *symname, int len) -+int page_symlink(struct inode *inode, const char *symname, int len, -+ int gfp_mask) - { - struct address_space *mapping = inode->i_mapping; -- struct page *page = grab_cache_page(mapping, 0); -+ struct page *page; - int err = -ENOMEM; - char *kaddr; - -+ page = find_or_create_page(mapping, 0, -+ mapping_gfp_mask(mapping) | gfp_mask); - if (!page) - goto fail; - err = mapping->a_ops->prepare_write(NULL, page, 0, len-1); -diff -uprN linux-2.6.8.1.orig/fs/namespace.c linux-2.6.8.1-ve022stab078/fs/namespace.c ---- linux-2.6.8.1.orig/fs/namespace.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/namespace.c 2006-05-11 13:05:40.000000000 +0400 -@@ -37,6 +37,7 @@ static inline int sysfs_init(void) - - /* spinlock for vfsmount related operations, inplace of dcache_lock */ - spinlock_t vfsmount_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; -+EXPORT_SYMBOL(vfsmount_lock); - - static struct list_head *mount_hashtable; - static int hash_mask, hash_bits; -@@ -238,10 +239,32 @@ static int show_vfsmnt(struct seq_file * - { 0, NULL } - }; - struct proc_fs_info *fs_infop; -+ char *path_buf, *path; - -- mangle(m, mnt->mnt_devname ? 
mnt->mnt_devname : "none"); -+ /* skip FS_NOMOUNT mounts (rootfs) */ -+ if (mnt->mnt_sb->s_flags & MS_NOUSER) -+ return 0; -+ -+ path_buf = (char *) __get_free_page(GFP_KERNEL); -+ if (!path_buf) -+ return -ENOMEM; -+ path = d_path(mnt->mnt_root, mnt, path_buf, PAGE_SIZE); -+ if (IS_ERR(path)) { -+ free_page((unsigned long) path_buf); -+ /* -+ * This means that the file position will be incremented, i.e. -+ * the total number of "invisible" vfsmnt will leak. -+ */ -+ return 0; -+ } -+ -+ if (ve_is_super(get_exec_env())) -+ mangle(m, mnt->mnt_devname ? mnt->mnt_devname : "none"); -+ else -+ mangle(m, mnt->mnt_sb->s_type->name); - seq_putc(m, ' '); -- seq_path(m, mnt, mnt->mnt_root, " \t\n\\"); -+ mangle(m, path); -+ free_page((unsigned long) path_buf); - seq_putc(m, ' '); - mangle(m, mnt->mnt_sb->s_type->name); - seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw"); -@@ -364,6 +387,7 @@ void umount_tree(struct vfsmount *mnt) - spin_lock(&vfsmount_lock); - } - } -+EXPORT_SYMBOL(umount_tree); - - static int do_umount(struct vfsmount *mnt, int flags) - { -@@ -480,7 +504,7 @@ asmlinkage long sys_umount(char __user * - goto dput_and_out; - - retval = -EPERM; -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - goto dput_and_out; - - retval = do_umount(nd.mnt, flags); -@@ -505,7 +529,7 @@ asmlinkage long sys_oldumount(char __use - - static int mount_is_safe(struct nameidata *nd) - { -- if (capable(CAP_SYS_ADMIN)) -+ if (capable(CAP_VE_SYS_ADMIN)) - return 0; - return -EPERM; - #ifdef notyet -@@ -515,7 +539,7 @@ static int mount_is_safe(struct nameidat - if (current->uid != nd->dentry->d_inode->i_uid) - return -EPERM; - } -- if (permission(nd->dentry->d_inode, MAY_WRITE, nd)) -+ if (permission(nd->dentry->d_inode, MAY_WRITE, nd, NULL)) - return -EPERM; - return 0; - #endif -@@ -673,7 +697,7 @@ static int do_remount(struct nameidata * - int err; - struct super_block * sb = nd->mnt->mnt_sb; - -- if (!capable(CAP_SYS_ADMIN)) -+ if 
(!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - - if (!check_mnt(nd->mnt)) -@@ -682,6 +706,10 @@ static int do_remount(struct nameidata * - if (nd->dentry != nd->mnt->mnt_root) - return -EINVAL; - -+ /* do not allow to remount bind-mounts */ -+ if (nd->dentry != sb->s_root) -+ return -EINVAL; -+ - down_write(&sb->s_umount); - err = do_remount_sb(sb, flags, data, 0); - if (!err) -@@ -697,7 +725,7 @@ static int do_move_mount(struct nameidat - struct nameidata old_nd, parent_nd; - struct vfsmount *p; - int err = 0; -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - if (!old_name || !*old_name) - return -EINVAL; -@@ -764,15 +792,20 @@ static int do_new_mount(struct nameidata - int mnt_flags, char *name, void *data) - { - struct vfsmount *mnt; -+ struct file_system_type *fstype; - - if (!type || !memchr(type, 0, PAGE_SIZE)) - return -EINVAL; - - /* we need capabilities... */ -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - -- mnt = do_kern_mount(type, flags, name, data); -+ fstype = get_fs_type(type); -+ if (fstype == NULL) -+ return -ENODEV; -+ mnt = do_kern_mount(fstype, flags, name, data); -+ put_filesystem(fstype); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); - -@@ -809,6 +842,10 @@ int do_add_mount(struct vfsmount *newmnt - newmnt->mnt_flags = mnt_flags; - err = graft_tree(newmnt, nd); - -+ if (newmnt->mnt_mountpoint->d_flags & DCACHE_VIRTUAL) -+ /* unaccessible yet - no lock */ -+ newmnt->mnt_root->d_flags |= DCACHE_VIRTUAL; -+ - if (err == 0 && fslist) { - /* add to the specified expiration list */ - spin_lock(&vfsmount_lock); -@@ -1213,7 +1250,7 @@ static void chroot_fs_refs(struct nameid - struct fs_struct *fs; - - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_ve(g, p) { - task_lock(p); - fs = p->fs; - if (fs) { -@@ -1226,7 +1263,7 @@ static void chroot_fs_refs(struct nameid - put_fs_struct(fs); - } else - task_unlock(p); -- } while_each_thread(g, p); -+ } 
while_each_thread_ve(g, p); - read_unlock(&tasklist_lock); - } - -@@ -1339,8 +1376,13 @@ static void __init init_mount_tree(void) - struct vfsmount *mnt; - struct namespace *namespace; - struct task_struct *g, *p; -+ struct file_system_type *fstype; - -- mnt = do_kern_mount("rootfs", 0, "rootfs", NULL); -+ fstype = get_fs_type("rootfs"); -+ if (fstype == NULL) -+ panic("Can't create rootfs"); -+ mnt = do_kern_mount(fstype, 0, "rootfs", NULL); -+ put_filesystem(fstype); - if (IS_ERR(mnt)) - panic("Can't create rootfs"); - namespace = kmalloc(sizeof(*namespace), GFP_KERNEL); -@@ -1355,10 +1397,10 @@ static void __init init_mount_tree(void) - - init_task.namespace = namespace; - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_all(g, p) { - get_namespace(namespace); - p->namespace = namespace; -- } while_each_thread(g, p); -+ } while_each_thread_all(g, p); - read_unlock(&tasklist_lock); - - set_fs_pwd(current->fs, namespace->root, namespace->root->mnt_root); -@@ -1373,7 +1415,7 @@ void __init mnt_init(unsigned long mempa - int i; - - mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount), -- 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - - order = 0; - mount_hashtable = (struct list_head *) -diff -uprN linux-2.6.8.1.orig/fs/ncpfs/ioctl.c linux-2.6.8.1-ve022stab078/fs/ncpfs/ioctl.c ---- linux-2.6.8.1.orig/fs/ncpfs/ioctl.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ncpfs/ioctl.c 2006-05-11 13:05:35.000000000 +0400 -@@ -34,7 +34,7 @@ ncp_get_fs_info(struct ncp_server* serve - { - struct ncp_fs_info info; - -- if ((permission(inode, MAY_WRITE, NULL) != 0) -+ if ((permission(inode, MAY_WRITE, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) { - return -EACCES; - } -@@ -62,7 +62,7 @@ ncp_get_fs_info_v2(struct ncp_server* se - { - struct ncp_fs_info_v2 info2; - -- if ((permission(inode, MAY_WRITE, NULL) != 0) -+ if ((permission(inode, 
MAY_WRITE, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) { - return -EACCES; - } -@@ -190,7 +190,7 @@ int ncp_ioctl(struct inode *inode, struc - switch (cmd) { - case NCP_IOC_NCPREQUEST: - -- if ((permission(inode, MAY_WRITE, NULL) != 0) -+ if ((permission(inode, MAY_WRITE, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) { - return -EACCES; - } -@@ -254,7 +254,7 @@ int ncp_ioctl(struct inode *inode, struc - { - unsigned long tmp = server->m.mounted_uid; - -- if ( (permission(inode, MAY_READ, NULL) != 0) -+ if ( (permission(inode, MAY_READ, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -268,7 +268,7 @@ int ncp_ioctl(struct inode *inode, struc - { - struct ncp_setroot_ioctl sr; - -- if ( (permission(inode, MAY_READ, NULL) != 0) -+ if ( (permission(inode, MAY_READ, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -341,7 +341,7 @@ int ncp_ioctl(struct inode *inode, struc - - #ifdef CONFIG_NCPFS_PACKET_SIGNING - case NCP_IOC_SIGN_INIT: -- if ((permission(inode, MAY_WRITE, NULL) != 0) -+ if ((permission(inode, MAY_WRITE, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -364,7 +364,7 @@ int ncp_ioctl(struct inode *inode, struc - return 0; - - case NCP_IOC_SIGN_WANTED: -- if ( (permission(inode, MAY_READ, NULL) != 0) -+ if ( (permission(inode, MAY_READ, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -377,7 +377,7 @@ int ncp_ioctl(struct inode *inode, struc - { - int newstate; - -- if ( (permission(inode, MAY_WRITE, NULL) != 0) -+ if ( (permission(inode, MAY_WRITE, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -398,7 +398,7 @@ int ncp_ioctl(struct inode *inode, struc - - #ifdef CONFIG_NCPFS_IOCTL_LOCKING - case NCP_IOC_LOCKUNLOCK: -- if ( (permission(inode, MAY_WRITE, NULL) != 0) -+ if ( (permission(inode, MAY_WRITE, NULL, NULL) != 
0) - && (current->uid != server->m.mounted_uid)) - { - return -EACCES; -@@ -603,7 +603,7 @@ outrel: - #endif /* CONFIG_NCPFS_NLS */ - - case NCP_IOC_SETDENTRYTTL: -- if ((permission(inode, MAY_WRITE, NULL) != 0) && -+ if ((permission(inode, MAY_WRITE, NULL, NULL) != 0) && - (current->uid != server->m.mounted_uid)) - return -EACCES; - { -@@ -633,7 +633,7 @@ outrel: - so we have this out of switch */ - if (cmd == NCP_IOC_GETMOUNTUID) { - __kernel_uid_t uid = 0; -- if ((permission(inode, MAY_READ, NULL) != 0) -+ if ((permission(inode, MAY_READ, NULL, NULL) != 0) - && (current->uid != server->m.mounted_uid)) { - return -EACCES; - } -diff -uprN linux-2.6.8.1.orig/fs/nfs/dir.c linux-2.6.8.1-ve022stab078/fs/nfs/dir.c ---- linux-2.6.8.1.orig/fs/nfs/dir.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfs/dir.c 2006-05-11 13:05:35.000000000 +0400 -@@ -1499,7 +1499,8 @@ out: - } - - int --nfs_permission(struct inode *inode, int mask, struct nameidata *nd) -+nfs_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) - { - struct nfs_access_cache *cache = &NFS_I(inode)->cache_access; - struct rpc_cred *cred; -@@ -1541,6 +1542,7 @@ nfs_permission(struct inode *inode, int - if (!NFS_PROTO(inode)->access) - goto out_notsup; - -+ /* Can NFS fill exec_perm atomically? Don't know... 
--SAW */ - cred = rpcauth_lookupcred(NFS_CLIENT(inode)->cl_auth, 0); - if (cache->cred == cred - && time_before(jiffies, cache->jiffies + NFS_ATTRTIMEO(inode)) -@@ -1565,7 +1567,7 @@ out: - return res; - out_notsup: - nfs_revalidate_inode(NFS_SERVER(inode), inode); -- res = vfs_permission(inode, mask); -+ res = vfs_permission(inode, mask, exec_perm); - unlock_kernel(); - return res; - add_cache: -diff -uprN linux-2.6.8.1.orig/fs/nfs/direct.c linux-2.6.8.1-ve022stab078/fs/nfs/direct.c ---- linux-2.6.8.1.orig/fs/nfs/direct.c 2004-08-14 14:56:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfs/direct.c 2006-05-11 13:05:34.000000000 +0400 -@@ -72,8 +72,10 @@ nfs_get_user_pages(int rw, unsigned long - size_t array_size; - - /* set an arbitrary limit to prevent arithmetic overflow */ -- if (size > MAX_DIRECTIO_SIZE) -+ if (size > MAX_DIRECTIO_SIZE) { -+ *pages = NULL; - return -EFBIG; -+ } - - page_count = (user_addr + size + PAGE_SIZE - 1) >> PAGE_SHIFT; - page_count -= user_addr >> PAGE_SHIFT; -diff -uprN linux-2.6.8.1.orig/fs/nfs/file.c linux-2.6.8.1-ve022stab078/fs/nfs/file.c ---- linux-2.6.8.1.orig/fs/nfs/file.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfs/file.c 2006-05-11 13:05:28.000000000 +0400 -@@ -103,6 +103,9 @@ nfs_file_open(struct inode *inode, struc - static int - nfs_file_release(struct inode *inode, struct file *filp) - { -+ /* Ensure that dirty pages are flushed out with the right creds */ -+ if (filp->f_mode & FMODE_WRITE) -+ filemap_fdatawrite(filp->f_mapping); - return NFS_PROTO(inode)->file_release(inode, filp); - } - -diff -uprN linux-2.6.8.1.orig/fs/nfs/inode.c linux-2.6.8.1-ve022stab078/fs/nfs/inode.c ---- linux-2.6.8.1.orig/fs/nfs/inode.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -55,7 +55,7 @@ static int nfs_update_inode(struct inode - - static struct inode *nfs_alloc_inode(struct super_block *sb); - static void 
nfs_destroy_inode(struct inode *); --static void nfs_write_inode(struct inode *,int); -+static int nfs_write_inode(struct inode *,int); - static void nfs_delete_inode(struct inode *); - static void nfs_put_super(struct super_block *); - static void nfs_clear_inode(struct inode *); -@@ -110,12 +110,16 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat - return nfs_fileid_to_ino_t(fattr->fileid); - } - --static void -+static int - nfs_write_inode(struct inode *inode, int sync) - { - int flags = sync ? FLUSH_WAIT : 0; -+ int ret; - -- nfs_commit_inode(inode, 0, 0, flags); -+ ret = nfs_commit_inode(inode, 0, 0, flags); -+ if (ret < 0) -+ return ret; -+ return 0; - } - - static void -diff -uprN linux-2.6.8.1.orig/fs/nfs/nfsroot.c linux-2.6.8.1-ve022stab078/fs/nfs/nfsroot.c ---- linux-2.6.8.1.orig/fs/nfs/nfsroot.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfs/nfsroot.c 2006-05-11 13:05:40.000000000 +0400 -@@ -306,7 +306,7 @@ static int __init root_nfs_name(char *na - /* Override them by options set on kernel command-line */ - root_nfs_parse(name, buf); - -- cp = system_utsname.nodename; -+ cp = ve_utsname.nodename; - if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) { - printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n"); - return -1; -diff -uprN linux-2.6.8.1.orig/fs/nfsctl.c linux-2.6.8.1-ve022stab078/fs/nfsctl.c ---- linux-2.6.8.1.orig/fs/nfsctl.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfsctl.c 2006-05-11 13:05:40.000000000 +0400 -@@ -23,8 +23,14 @@ static struct file *do_open(char *name, - { - struct nameidata nd; - int error; -+ struct file_system_type *fstype; - -- nd.mnt = do_kern_mount("nfsd", 0, "nfsd", NULL); -+ fstype = get_fs_type("nfsd"); -+ if (fstype == NULL) -+ return ERR_PTR(-ENODEV); -+ -+ nd.mnt = do_kern_mount(fstype, 0, "nfsd", NULL); -+ put_filesystem(fstype); - - if (IS_ERR(nd.mnt)) - return (struct file *)nd.mnt; -diff -uprN linux-2.6.8.1.orig/fs/nfsd/nfsfh.c 
linux-2.6.8.1-ve022stab078/fs/nfsd/nfsfh.c ---- linux-2.6.8.1.orig/fs/nfsd/nfsfh.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfsd/nfsfh.c 2006-05-11 13:05:35.000000000 +0400 -@@ -56,7 +56,7 @@ int nfsd_acceptable(void *expv, struct d - /* make sure parents give x permission to user */ - int err; - parent = dget_parent(tdentry); -- err = permission(parent->d_inode, MAY_EXEC, NULL); -+ err = permission(parent->d_inode, MAY_EXEC, NULL, NULL); - if (err < 0) { - dput(parent); - break; -diff -uprN linux-2.6.8.1.orig/fs/nfsd/vfs.c linux-2.6.8.1-ve022stab078/fs/nfsd/vfs.c ---- linux-2.6.8.1.orig/fs/nfsd/vfs.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nfsd/vfs.c 2006-05-11 13:05:35.000000000 +0400 -@@ -1592,12 +1592,13 @@ nfsd_permission(struct svc_export *exp, - inode->i_uid == current->fsuid) - return 0; - -- err = permission(inode, acc & (MAY_READ|MAY_WRITE|MAY_EXEC), NULL); -+ err = permission(inode, acc & (MAY_READ|MAY_WRITE|MAY_EXEC), -+ NULL, NULL); - - /* Allow read access to binaries even when mode 111 */ - if (err == -EACCES && S_ISREG(inode->i_mode) && - acc == (MAY_READ | MAY_OWNER_OVERRIDE)) -- err = permission(inode, MAY_EXEC, NULL); -+ err = permission(inode, MAY_EXEC, NULL, NULL); - - return err? 
nfserrno(err) : 0; - } -diff -uprN linux-2.6.8.1.orig/fs/nls/nls_ascii.c linux-2.6.8.1-ve022stab078/fs/nls/nls_ascii.c ---- linux-2.6.8.1.orig/fs/nls/nls_ascii.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/nls/nls_ascii.c 2006-05-11 13:05:34.000000000 +0400 -@@ -13,7 +13,7 @@ - #include <linux/nls.h> - #include <linux/errno.h> - --static wchar_t charset2uni[128] = { -+static wchar_t charset2uni[256] = { - /* 0x00*/ - 0x0000, 0x0001, 0x0002, 0x0003, - 0x0004, 0x0005, 0x0006, 0x0007, -@@ -56,7 +56,7 @@ static wchar_t charset2uni[128] = { - 0x007c, 0x007d, 0x007e, 0x007f, - }; - --static unsigned char page00[128] = { -+static unsigned char page00[256] = { - 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ - 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ - 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ -@@ -75,11 +75,11 @@ static unsigned char page00[128] = { - 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ - }; - --static unsigned char *page_uni2charset[128] = { -- page00, NULL, NULL, NULL, NULL, NULL, NULL, NULL, -+static unsigned char *page_uni2charset[256] = { -+ page00, - }; - --static unsigned char charset2lower[128] = { -+static unsigned char charset2lower[256] = { - 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ - 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ - 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ -@@ -98,7 +98,7 @@ static unsigned char charset2lower[128] - 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ - }; - --static unsigned char charset2upper[128] = { -+static unsigned char charset2upper[256] = { - 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ - 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ - 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ -diff -uprN linux-2.6.8.1.orig/fs/ntfs/inode.h linux-2.6.8.1-ve022stab078/fs/ntfs/inode.h ---- 
linux-2.6.8.1.orig/fs/ntfs/inode.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ntfs/inode.h 2006-05-11 13:05:35.000000000 +0400 -@@ -285,7 +285,7 @@ extern void ntfs_truncate(struct inode * - - extern int ntfs_setattr(struct dentry *dentry, struct iattr *attr); - --extern void ntfs_write_inode(struct inode *vi, int sync); -+extern int ntfs_write_inode(struct inode *vi, int sync); - - static inline void ntfs_commit_inode(struct inode *vi) - { -diff -uprN linux-2.6.8.1.orig/fs/ntfs/super.c linux-2.6.8.1-ve022stab078/fs/ntfs/super.c ---- linux-2.6.8.1.orig/fs/ntfs/super.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ntfs/super.c 2006-05-11 13:05:43.000000000 +0400 -@@ -2404,7 +2404,7 @@ iput_tmp_ino_err_out_now: - * method again... FIXME: Do we need to do this twice now because of - * attribute inodes? I think not, so leave as is for now... (AIA) - */ -- if (invalidate_inodes(sb)) { -+ if (invalidate_inodes(sb, 0)) { - ntfs_error(sb, "Busy inodes left. This is most likely a NTFS " - "driver bug."); - /* Copied from fs/super.c. I just love this message. 
(-; */ -diff -uprN linux-2.6.8.1.orig/fs/open.c linux-2.6.8.1-ve022stab078/fs/open.c ---- linux-2.6.8.1.orig/fs/open.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/open.c 2006-05-11 13:05:43.000000000 +0400 -@@ -22,6 +22,7 @@ - #include <asm/uaccess.h> - #include <linux/fs.h> - #include <linux/pagemap.h> -+#include <linux/faudit.h> - - #include <asm/unistd.h> - -@@ -46,7 +47,21 @@ int vfs_statfs(struct super_block *sb, s - - EXPORT_SYMBOL(vfs_statfs); - --static int vfs_statfs_native(struct super_block *sb, struct statfs *buf) -+int faudit_statfs(struct super_block *sb, struct kstatfs *buf) -+{ -+ struct faudit_statfs_arg arg; -+ -+ arg.sb = sb; -+ arg.stat = buf; -+ -+ if (virtinfo_notifier_call(VITYPE_FAUDIT, VIRTINFO_FAUDIT_STATFS, &arg) -+ != NOTIFY_DONE) -+ return arg.err; -+ return 0; -+} -+ -+static int vfs_statfs_native(struct super_block *sb, struct vfsmount *mnt, -+ struct statfs *buf) - { - struct kstatfs st; - int retval; -@@ -55,6 +70,10 @@ static int vfs_statfs_native(struct supe - if (retval) - return retval; - -+ retval = faudit_statfs(mnt->mnt_sb, &st); -+ if (retval) -+ return retval; -+ - if (sizeof(*buf) == sizeof(st)) - memcpy(buf, &st, sizeof(st)); - else { -@@ -89,7 +108,8 @@ static int vfs_statfs_native(struct supe - return 0; - } - --static int vfs_statfs64(struct super_block *sb, struct statfs64 *buf) -+static int vfs_statfs64(struct super_block *sb, struct vfsmount *mnt, -+ struct statfs64 *buf) - { - struct kstatfs st; - int retval; -@@ -98,6 +118,10 @@ static int vfs_statfs64(struct super_blo - if (retval) - return retval; - -+ retval = faudit_statfs(mnt->mnt_sb, &st); -+ if (retval) -+ return retval; -+ - if (sizeof(*buf) == sizeof(st)) - memcpy(buf, &st, sizeof(st)); - else { -@@ -124,7 +148,8 @@ asmlinkage long sys_statfs(const char __ - error = user_path_walk(path, &nd); - if (!error) { - struct statfs tmp; -- error = vfs_statfs_native(nd.dentry->d_inode->i_sb, &tmp); -+ error = 
vfs_statfs_native(nd.dentry->d_inode->i_sb, -+ nd.mnt, &tmp); - if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) - error = -EFAULT; - path_release(&nd); -@@ -143,7 +168,8 @@ asmlinkage long sys_statfs64(const char - error = user_path_walk(path, &nd); - if (!error) { - struct statfs64 tmp; -- error = vfs_statfs64(nd.dentry->d_inode->i_sb, &tmp); -+ error = vfs_statfs64(nd.dentry->d_inode->i_sb, -+ nd.mnt, &tmp); - if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) - error = -EFAULT; - path_release(&nd); -@@ -162,7 +188,8 @@ asmlinkage long sys_fstatfs(unsigned int - file = fget(fd); - if (!file) - goto out; -- error = vfs_statfs_native(file->f_dentry->d_inode->i_sb, &tmp); -+ error = vfs_statfs_native(file->f_dentry->d_inode->i_sb, -+ file->f_vfsmnt, &tmp); - if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) - error = -EFAULT; - fput(file); -@@ -183,7 +210,8 @@ asmlinkage long sys_fstatfs64(unsigned i - file = fget(fd); - if (!file) - goto out; -- error = vfs_statfs64(file->f_dentry->d_inode->i_sb, &tmp); -+ error = vfs_statfs64(file->f_dentry->d_inode->i_sb, -+ file->f_vfsmnt, &tmp); - if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) - error = -EFAULT; - fput(file); -@@ -234,7 +262,7 @@ static inline long do_sys_truncate(const - if (!S_ISREG(inode->i_mode)) - goto dput_and_out; - -- error = permission(inode,MAY_WRITE,&nd); -+ error = permission(inode,MAY_WRITE,&nd,NULL); - if (error) - goto dput_and_out; - -@@ -388,7 +416,7 @@ asmlinkage long sys_utime(char __user * - goto dput_and_out; - - if (current->fsuid != inode->i_uid && -- (error = permission(inode,MAY_WRITE,&nd)) != 0) -+ (error = permission(inode,MAY_WRITE,&nd,NULL)) != 0) - goto dput_and_out; - } - down(&inode->i_sem); -@@ -441,7 +469,7 @@ long do_utimes(char __user * filename, s - goto dput_and_out; - - if (current->fsuid != inode->i_uid && -- (error = permission(inode,MAY_WRITE,&nd)) != 0) -+ (error = permission(inode,MAY_WRITE,&nd,NULL)) != 0) - goto dput_and_out; - } - down(&inode->i_sem); 
-@@ -500,7 +528,7 @@ asmlinkage long sys_access(const char __ - - res = __user_walk(filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd); - if (!res) { -- res = permission(nd.dentry->d_inode, mode, &nd); -+ res = permission(nd.dentry->d_inode, mode, &nd, NULL); - /* SuS v2 requires we report a read only fs too */ - if(!res && (mode & S_IWOTH) && IS_RDONLY(nd.dentry->d_inode) - && !special_file(nd.dentry->d_inode->i_mode)) -@@ -524,7 +552,7 @@ asmlinkage long sys_chdir(const char __u - if (error) - goto out; - -- error = permission(nd.dentry->d_inode,MAY_EXEC,&nd); -+ error = permission(nd.dentry->d_inode,MAY_EXEC,&nd,NULL); - if (error) - goto dput_and_out; - -@@ -557,7 +585,7 @@ asmlinkage long sys_fchdir(unsigned int - if (!S_ISDIR(inode->i_mode)) - goto out_putf; - -- error = permission(inode, MAY_EXEC, NULL); -+ error = permission(inode, MAY_EXEC, NULL, NULL); - if (!error) - set_fs_pwd(current->fs, mnt, dentry); - out_putf: -@@ -575,7 +603,7 @@ asmlinkage long sys_chroot(const char __ - if (error) - goto out; - -- error = permission(nd.dentry->d_inode,MAY_EXEC,&nd); -+ error = permission(nd.dentry->d_inode,MAY_EXEC,&nd,NULL); - if (error) - goto dput_and_out; - -@@ -776,6 +804,9 @@ struct file *dentry_open(struct dentry * - struct inode *inode; - int error; - -+ if (!capable(CAP_SYS_RAWIO)) -+ flags &= ~O_DIRECT; -+ - error = -ENFILE; - f = get_empty_filp(); - if (!f) -@@ -1082,3 +1113,81 @@ int nonseekable_open(struct inode *inode - } - - EXPORT_SYMBOL(nonseekable_open); -+ -+long sys_lchmod(char __user * filename, mode_t mode) -+{ -+ struct nameidata nd; -+ struct inode * inode; -+ int error; -+ struct iattr newattrs; -+ -+ error = user_path_walk_link(filename, &nd); -+ if (error) -+ goto out; -+ inode = nd.dentry->d_inode; -+ -+ error = -EROFS; -+ if (IS_RDONLY(inode)) -+ goto dput_and_out; -+ -+ error = -EPERM; -+ if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) -+ goto dput_and_out; -+ -+ down(&inode->i_sem); -+ if (mode == (mode_t) -1) -+ mode = inode->i_mode; -+ 
newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO); -+ newattrs.ia_valid = ATTR_MODE | ATTR_CTIME; -+ error = notify_change(nd.dentry, &newattrs); -+ up(&inode->i_sem); -+ -+dput_and_out: -+ path_release(&nd); -+out: -+ return error; -+} -+ -+long sys_lutime(char __user * filename, -+ struct utimbuf __user * times) -+{ -+ int error; -+ struct nameidata nd; -+ struct inode * inode; -+ struct iattr newattrs; -+ -+ error = user_path_walk_link(filename, &nd); -+ if (error) -+ goto out; -+ inode = nd.dentry->d_inode; -+ -+ error = -EROFS; -+ if (IS_RDONLY(inode)) -+ goto dput_and_out; -+ -+ /* Don't worry, the checks are done in inode_change_ok() */ -+ newattrs.ia_valid = ATTR_CTIME | ATTR_MTIME | ATTR_ATIME; -+ if (times) { -+ error = get_user(newattrs.ia_atime.tv_sec, &times->actime); -+ newattrs.ia_atime.tv_nsec = 0; -+ if (!error) -+ error = get_user(newattrs.ia_mtime.tv_sec, -+ &times->modtime); -+ newattrs.ia_mtime.tv_nsec = 0; -+ if (error) -+ goto dput_and_out; -+ -+ newattrs.ia_valid |= ATTR_ATIME_SET | ATTR_MTIME_SET; -+ } else { -+ if (current->fsuid != inode->i_uid && -+ (error = permission(inode, MAY_WRITE, NULL, NULL)) != 0) -+ goto dput_and_out; -+ } -+ down(&inode->i_sem); -+ error = notify_change(nd.dentry, &newattrs); -+ up(&inode->i_sem); -+dput_and_out: -+ path_release(&nd); -+out: -+ return error; -+} -diff -uprN linux-2.6.8.1.orig/fs/partitions/check.c linux-2.6.8.1-ve022stab078/fs/partitions/check.c ---- linux-2.6.8.1.orig/fs/partitions/check.c 2004-08-14 14:56:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/partitions/check.c 2006-05-11 13:05:40.000000000 +0400 -@@ -127,6 +127,7 @@ char *disk_name(struct gendisk *hd, int - - return buf; - } -+EXPORT_SYMBOL(disk_name); - - const char *bdevname(struct block_device *bdev, char *buf) - { -diff -uprN linux-2.6.8.1.orig/fs/pipe.c linux-2.6.8.1-ve022stab078/fs/pipe.c ---- linux-2.6.8.1.orig/fs/pipe.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/pipe.c 2006-05-11
13:05:39.000000000 +0400 -@@ -534,7 +534,7 @@ struct inode* pipe_new(struct inode* ino - { - unsigned long page; - -- page = __get_free_page(GFP_USER); -+ page = __get_free_page(GFP_USER_UBC); - if (!page) - return NULL; - -diff -uprN linux-2.6.8.1.orig/fs/proc/array.c linux-2.6.8.1-ve022stab078/fs/proc/array.c ---- linux-2.6.8.1.orig/fs/proc/array.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/array.c 2006-05-11 13:05:45.000000000 +0400 -@@ -73,6 +73,8 @@ - #include <linux/highmem.h> - #include <linux/file.h> - #include <linux/times.h> -+#include <linux/fairsched.h> -+#include <ub/beancounter.h> - - #include <asm/uaccess.h> - #include <asm/pgtable.h> -@@ -88,10 +90,13 @@ static inline char * task_name(struct ta - { - int i; - char * name; -+ char tcomm[sizeof(p->comm)]; -+ -+ get_task_comm(tcomm, p); - - ADDBUF(buf, "Name:\t"); -- name = p->comm; -- i = sizeof(p->comm); -+ name = tcomm; -+ i = sizeof(tcomm); - do { - unsigned char c = *name; - name++; -@@ -127,18 +132,19 @@ static const char *task_state_array[] = - "S (sleeping)", /* 1 */ - "D (disk sleep)", /* 2 */ - "T (stopped)", /* 4 */ -- "Z (zombie)", /* 8 */ -- "X (dead)" /* 16 */ -+ "T (tracing stop)", /* 8 */ -+ "Z (zombie)", /* 16 */ -+ "X (dead)" /* 32 */ - }; - - static inline const char * get_task_state(struct task_struct *tsk) - { -- unsigned int state = tsk->state & (TASK_RUNNING | -- TASK_INTERRUPTIBLE | -- TASK_UNINTERRUPTIBLE | -- TASK_ZOMBIE | -- TASK_DEAD | -- TASK_STOPPED); -+ unsigned int state = (tsk->state & (TASK_RUNNING | -+ TASK_INTERRUPTIBLE | -+ TASK_UNINTERRUPTIBLE | -+ TASK_STOPPED)) | -+ (tsk->exit_state & (EXIT_ZOMBIE | -+ EXIT_DEAD)); - const char **p = &task_state_array[0]; - - while (state) { -@@ -152,8 +158,13 @@ static inline char * task_state(struct t - { - struct group_info *group_info; - int g; -+ pid_t pid, ppid, tgid; -+ -+ pid = get_task_pid(p); -+ tgid = get_task_tgid(p); - - read_lock(&tasklist_lock); -+ ppid = get_task_ppid(p); - buffer 
+= sprintf(buffer, - "State:\t%s\n" - "SleepAVG:\t%lu%%\n" -@@ -161,13 +172,19 @@ static inline char * task_state(struct t - "Pid:\t%d\n" - "PPid:\t%d\n" - "TracerPid:\t%d\n" -+#ifdef CONFIG_FAIRSCHED -+ "FNid:\t%d\n" -+#endif - "Uid:\t%d\t%d\t%d\t%d\n" - "Gid:\t%d\t%d\t%d\t%d\n", - get_task_state(p), - (p->sleep_avg/1024)*100/(1020000000/1024), -- p->tgid, -- p->pid, p->pid ? p->real_parent->pid : 0, -- p->pid && p->ptrace ? p->parent->pid : 0, -+ tgid, -+ pid, ppid, -+ p->pid && p->ptrace ? get_task_pid(p->parent) : 0, -+#ifdef CONFIG_FAIRSCHED -+ task_fairsched_node_id(p), -+#endif - p->uid, p->euid, p->suid, p->fsuid, - p->gid, p->egid, p->sgid, p->fsgid); - read_unlock(&tasklist_lock); -@@ -186,6 +203,20 @@ static inline char * task_state(struct t - put_group_info(group_info); - - buffer += sprintf(buffer, "\n"); -+ -+#ifdef CONFIG_VE -+ buffer += sprintf(buffer, -+ "envID:\t%d\n" -+ "VPid:\t%d\n" -+ "PNState:\t%u\n" -+ "StopState:\t%u\n" -+ "SigSuspState:\t%u\n", -+ VE_TASK_INFO(p)->owner_env->veid, -+ virt_pid(p), -+ p->pn_state, -+ p->stopped_state, -+ p->sigsuspend_state); -+#endif - return buffer; - } - -@@ -231,7 +262,7 @@ static void collect_sigign_sigcatch(stru - - static inline char * task_sig(struct task_struct *p, char *buffer) - { -- sigset_t pending, shpending, blocked, ignored, caught; -+ sigset_t pending, shpending, blocked, ignored, caught, saved; - int num_threads = 0; - - sigemptyset(&pending); -@@ -239,6 +270,7 @@ static inline char * task_sig(struct tas - sigemptyset(&blocked); - sigemptyset(&ignored); - sigemptyset(&caught); -+ sigemptyset(&saved); - - /* Gather all the data with the appropriate locks held */ - read_lock(&tasklist_lock); -@@ -247,6 +279,7 @@ static inline char * task_sig(struct tas - pending = p->pending.signal; - shpending = p->signal->shared_pending.signal; - blocked = p->blocked; -+ saved = p->saved_sigset; - collect_sigign_sigcatch(p, &ignored, &caught); - num_threads = atomic_read(&p->signal->count); - 
spin_unlock_irq(&p->sighand->siglock); -@@ -261,6 +294,7 @@ static inline char * task_sig(struct tas - buffer = render_sigset_t("SigBlk:\t", &blocked, buffer); - buffer = render_sigset_t("SigIgn:\t", &ignored, buffer); - buffer = render_sigset_t("SigCgt:\t", &caught, buffer); -+ buffer = render_sigset_t("SigSvd:\t", &saved, buffer); - - return buffer; - } -@@ -275,6 +309,24 @@ static inline char *task_cap(struct task - cap_t(p->cap_effective)); - } - -+#ifdef CONFIG_USER_RESOURCE -+static inline char *task_show_ub(struct task_struct *p, char *buffer) -+{ -+ char ub_info[64]; -+ -+ print_ub_uid(get_task_ub(p), ub_info, sizeof(ub_info)); -+ buffer += sprintf(buffer, "TaskUB:\t%s\n", ub_info); -+ task_lock(p); -+ if (p->mm != NULL) -+ print_ub_uid(mm_ub(p->mm), ub_info, sizeof(ub_info)); -+ else -+ strcpy(ub_info, "N/A"); -+ task_unlock(p); -+ buffer += sprintf(buffer, "MMUB:\t%s\n", ub_info); -+ return buffer; -+} -+#endif -+ - extern char *task_mem(struct mm_struct *, char *); - int proc_pid_status(struct task_struct *task, char * buffer) - { -@@ -293,6 +345,9 @@ int proc_pid_status(struct task_struct * - #if defined(CONFIG_ARCH_S390) - buffer = task_show_regs(task, buffer); - #endif -+#ifdef CONFIG_USER_RESOURCE -+ buffer = task_show_ub(task, buffer); -+#endif - return buffer - orig; - } - -@@ -309,6 +364,9 @@ int proc_pid_stat(struct task_struct *ta - int num_threads = 0; - struct mm_struct *mm; - unsigned long long start_time; -+ char tcomm[sizeof(task->comm)]; -+ char mm_ub_info[64]; -+ char task_ub_info[64]; - - state = *get_task_state(task); - vsize = eip = esp = 0; -@@ -325,6 +383,7 @@ int proc_pid_stat(struct task_struct *ta - up_read(&mm->mmap_sem); - } - -+ get_task_comm(tcomm, task); - wchan = get_wchan(task); - - sigemptyset(&sigign); -@@ -338,12 +397,13 @@ int proc_pid_stat(struct task_struct *ta - } - if (task->signal) { - if (task->signal->tty) { -- tty_pgrp = task->signal->tty->pgrp; -+ tty_pgrp = pid_type_to_vpid(PIDTYPE_PGID, 
task->signal->tty->pgrp); - tty_nr = new_encode_dev(tty_devnum(task->signal->tty)); - } -- pgid = process_group(task); -- sid = task->signal->session; -+ pgid = get_task_pgid(task); -+ sid = get_task_sid(task); - } -+ ppid = get_task_ppid(task); - read_unlock(&tasklist_lock); - - /* scale priority and nice values from timeslices to -20..20 */ -@@ -351,18 +411,27 @@ int proc_pid_stat(struct task_struct *ta - priority = task_prio(task); - nice = task_nice(task); - -- read_lock(&tasklist_lock); -- ppid = task->pid ? task->real_parent->pid : 0; -- read_unlock(&tasklist_lock); -- - /* Temporary variable needed for gcc-2.96 */ - start_time = jiffies_64_to_clock_t(task->start_time - INITIAL_JIFFIES); - -+#ifdef CONFIG_USER_RESOURCE -+ print_ub_uid(get_task_ub(task), task_ub_info, sizeof(task_ub_info)); -+ if (mm != NULL) -+ print_ub_uid(mm_ub(mm), mm_ub_info, sizeof(mm_ub_info)); -+ else -+ strcpy(mm_ub_info, "N/A"); -+#else -+ strcpy(task_ub_info, "0"); -+ strcpy(mm_ub_info, "0"); -+#endif -+ - res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \ - %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \ --%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n", -- task->pid, -- task->comm, -+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu \ -+0 0 0 0 0 0 0 0 %d %u \ -+%s %s\n", -+ get_task_pid(task), -+ tcomm, - state, - ppid, - pgid, -@@ -382,7 +451,12 @@ int proc_pid_stat(struct task_struct *ta - nice, - num_threads, - jiffies_to_clock_t(task->it_real_value), -+#ifndef CONFIG_VE - start_time, -+#else -+ jiffies_64_to_clock_t(task->start_time - -+ get_exec_env()->init_entry->start_time), -+#endif - vsize, - mm ? 
mm->rss : 0, /* you might want to shift this left 3 */ - task->rlim[RLIMIT_RSS].rlim_cur, -@@ -405,7 +479,11 @@ int proc_pid_stat(struct task_struct *ta - task->exit_signal, - task_cpu(task), - task->rt_priority, -- task->policy); -+ task->policy, -+ virt_pid(task), -+ VEID(VE_TASK_INFO(task)->owner_env), -+ task_ub_info, -+ mm_ub_info); - if(mm) - mmput(mm); - return res; -diff -uprN linux-2.6.8.1.orig/fs/proc/base.c linux-2.6.8.1-ve022stab078/fs/proc/base.c ---- linux-2.6.8.1.orig/fs/proc/base.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/base.c 2006-05-11 13:05:40.000000000 +0400 -@@ -188,22 +188,25 @@ static int proc_fd_link(struct inode *in - struct files_struct *files; - struct file *file; - int fd = proc_type(inode) - PROC_TID_FD_DIR; -+ int err = -ENOENT; - - files = get_files_struct(task); - if (files) { - spin_lock(&files->file_lock); - file = fcheck_files(files, fd); - if (file) { -- *mnt = mntget(file->f_vfsmnt); -- *dentry = dget(file->f_dentry); -- spin_unlock(&files->file_lock); -- put_files_struct(files); -- return 0; -+ if (d_root_check(file->f_dentry, file->f_vfsmnt)) { -+ err = -EACCES; -+ } else { -+ *mnt = mntget(file->f_vfsmnt); -+ *dentry = dget(file->f_dentry); -+ err = 0; -+ } - } - spin_unlock(&files->file_lock); - put_files_struct(files); - } -- return -ENOENT; -+ return err; - } - - static int proc_exe_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt) -@@ -220,13 +223,16 @@ static int proc_exe_link(struct inode *i - while (vma) { - if ((vma->vm_flags & VM_EXECUTABLE) && - vma->vm_file) { -- *mnt = mntget(vma->vm_file->f_vfsmnt); -- *dentry = dget(vma->vm_file->f_dentry); -- result = 0; -+ result = d_root_check(vma->vm_file->f_dentry, -+ vma->vm_file->f_vfsmnt); -+ if (!result) { -+ *mnt = mntget(vma->vm_file->f_vfsmnt); -+ *dentry = dget(vma->vm_file->f_dentry); -+ } - break; - } - vma = vma->vm_next; -- } -+ } - up_read(&mm->mmap_sem); - mmput(mm); - out: -@@ -244,10 +250,12 @@ 
static int proc_cwd_link(struct inode *i - task_unlock(proc_task(inode)); - if (fs) { - read_lock(&fs->lock); -- *mnt = mntget(fs->pwdmnt); -- *dentry = dget(fs->pwd); -+ result = d_root_check(fs->pwd, fs->pwdmnt); -+ if (!result) { -+ *mnt = mntget(fs->pwdmnt); -+ *dentry = dget(fs->pwd); -+ } - read_unlock(&fs->lock); -- result = 0; - put_fs_struct(fs); - } - return result; -@@ -297,6 +305,11 @@ static int may_ptrace_attach(struct task - rmb(); - if (!task->mm->dumpable && !capable(CAP_SYS_PTRACE)) - goto out; -+ if (!task->mm->vps_dumpable && !ve_is_super(get_exec_env())) -+ goto out; -+ /* optional: defensive measure */ -+ if (!ve_accessible(VE_TASK_INFO(task)->owner_env, get_exec_env())) -+ goto out; - if (security_ptrace(current, task)) - goto out; - -@@ -329,6 +342,8 @@ static int proc_pid_cmdline(struct task_ - struct mm_struct *mm = get_task_mm(task); - if (!mm) - goto out; -+ if (!mm->arg_end) -+ goto out_mm; /* Shh! No looking before we're done */ - - len = mm->arg_end - mm->arg_start; - -@@ -351,8 +366,8 @@ static int proc_pid_cmdline(struct task_ - res = strnlen(buffer, res); - } - } -+out_mm: - mmput(mm); -- - out: - return res; - } -@@ -443,9 +458,10 @@ out: - goto exit; - } - --static int proc_permission(struct inode *inode, int mask, struct nameidata *nd) -+static int proc_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) - { -- if (vfs_permission(inode, mask) != 0) -+ if (vfs_permission(inode, mask, exec_perm) != 0) - return -EACCES; - return proc_check_root(inode); - } -@@ -767,12 +783,6 @@ static struct inode_operations proc_pid_ - .follow_link = proc_pid_follow_link - }; - --static int pid_alive(struct task_struct *p) --{ -- BUG_ON(p->pids[PIDTYPE_PID].pidptr != &p->pids[PIDTYPE_PID].pid); -- return atomic_read(&p->pids[PIDTYPE_PID].pid.count); --} -- - #define NUMBUF 10 - - static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir) -@@ -927,6 +937,10 @@ static struct inode 
*proc_pid_make_inode - struct inode * inode; - struct proc_inode *ei; - -+ if (!ve_accessible(VE_TASK_INFO(task)->owner_env, -+ VE_OWNER_FSTYPE(sb->s_type))) -+ return NULL; -+ - /* We need a new inode */ - - inode = new_inode(sb); -@@ -1030,6 +1044,10 @@ static void pid_base_iput(struct dentry - spin_lock(&task->proc_lock); - if (task->proc_dentry == dentry) - task->proc_dentry = NULL; -+#ifdef CONFIG_VE -+ if (VE_TASK_INFO(task)->glob_proc_dentry == dentry) -+ VE_TASK_INFO(task)->glob_proc_dentry = NULL; -+#endif - spin_unlock(&task->proc_lock); - iput(inode); - } -@@ -1467,14 +1485,14 @@ static int proc_self_readlink(struct den - int buflen) - { - char tmp[30]; -- sprintf(tmp, "%d", current->tgid); -+ sprintf(tmp, "%d", get_task_tgid(current)); - return vfs_readlink(dentry,buffer,buflen,tmp); - } - - static int proc_self_follow_link(struct dentry *dentry, struct nameidata *nd) - { - char tmp[30]; -- sprintf(tmp, "%d", current->tgid); -+ sprintf(tmp, "%d", get_task_tgid(current)); - return vfs_follow_link(nd,tmp); - } - -@@ -1499,24 +1517,33 @@ static struct inode_operations proc_self - * of PIDTYPE_PID. 
- */ - --struct dentry *proc_pid_unhash(struct task_struct *p) -+struct dentry *__proc_pid_unhash(struct task_struct *p, struct dentry *proc_dentry) - { -- struct dentry *proc_dentry; -- -- proc_dentry = p->proc_dentry; - if (proc_dentry != NULL) { - - spin_lock(&dcache_lock); -+ spin_lock(&proc_dentry->d_lock); - if (!d_unhashed(proc_dentry)) { - dget_locked(proc_dentry); - __d_drop(proc_dentry); -- } else -+ spin_unlock(&proc_dentry->d_lock); -+ } else { -+ spin_unlock(&proc_dentry->d_lock); - proc_dentry = NULL; -+ } - spin_unlock(&dcache_lock); - } - return proc_dentry; - } - -+void proc_pid_unhash(struct task_struct *p, struct dentry *pd[2]) -+{ -+ pd[0] = __proc_pid_unhash(p, p->proc_dentry); -+#ifdef CONFIG_VE -+ pd[1] = __proc_pid_unhash(p, VE_TASK_INFO(p)->glob_proc_dentry); -+#endif -+} -+ - /** - * proc_pid_flush - recover memory used by stale /proc/<pid>/x entries - * @proc_entry: directoy to prune. -@@ -1524,7 +1551,7 @@ struct dentry *proc_pid_unhash(struct ta - * Shrink the /proc directory that was used by the just killed thread. 
- */ - --void proc_pid_flush(struct dentry *proc_dentry) -+void __proc_pid_flush(struct dentry *proc_dentry) - { - if(proc_dentry != NULL) { - shrink_dcache_parent(proc_dentry); -@@ -1532,12 +1559,21 @@ void proc_pid_flush(struct dentry *proc_ - } - } - -+void proc_pid_flush(struct dentry *proc_dentry[2]) -+{ -+ __proc_pid_flush(proc_dentry[0]); -+#ifdef CONFIG_VE -+ __proc_pid_flush(proc_dentry[1]); -+#endif -+} -+ - /* SMP-safe */ - struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd) - { - struct task_struct *task; - struct inode *inode; - struct proc_inode *ei; -+ struct dentry *pd[2]; - unsigned tgid; - int died; - -@@ -1561,7 +1597,19 @@ struct dentry *proc_pid_lookup(struct in - goto out; - - read_lock(&tasklist_lock); -- task = find_task_by_pid(tgid); -+ task = find_task_by_pid_ve(tgid); -+ /* In theory we are allowed to lookup both /proc/VIRT_PID and -+ * /proc/GLOBAL_PID inside VE. However, current /proc implementation -+ * cannot maintain two references to one task, so that we have -+ * to prohibit /proc/GLOBAL_PID. -+ */ -+ if (task && !ve_is_super(get_exec_env()) && !is_virtual_pid(tgid)) { -+ /* However, VE_ENTERed tasks are exception, they use global -+ * pids. 
-+ */ -+ if (virt_pid(task) != tgid) -+ task = NULL; -+ } - if (task) - get_task_struct(task); - read_unlock(&tasklist_lock); -@@ -1586,16 +1634,23 @@ struct dentry *proc_pid_lookup(struct in - died = 0; - d_add(dentry, inode); - spin_lock(&task->proc_lock); -+#ifdef CONFIG_VE -+ if (ve_is_super(VE_OWNER_FSTYPE(inode->i_sb->s_type))) -+ VE_TASK_INFO(task)->glob_proc_dentry = dentry; -+ else -+ task->proc_dentry = dentry; -+#else - task->proc_dentry = dentry; -+#endif - if (!pid_alive(task)) { -- dentry = proc_pid_unhash(task); -+ proc_pid_unhash(task, pd); - died = 1; - } - spin_unlock(&task->proc_lock); - - put_task_struct(task); - if (died) { -- proc_pid_flush(dentry); -+ proc_pid_flush(pd); - goto out; - } - return NULL; -@@ -1616,7 +1671,12 @@ static struct dentry *proc_task_lookup(s - goto out; - - read_lock(&tasklist_lock); -- task = find_task_by_pid(tid); -+ task = find_task_by_pid_ve(tid); -+ /* See comment above in similar place. */ -+ if (task && !ve_is_super(get_exec_env()) && !is_virtual_pid(tid)) { -+ if (virt_pid(task) != tid) -+ task = NULL; -+ } - if (task) - get_task_struct(task); - read_unlock(&tasklist_lock); -@@ -1656,7 +1716,8 @@ out: - * tasklist lock while doing this, and we must release it before - * we actually do the filldir itself, so we use a temp buffer.. 
- */ --static int get_tgid_list(int index, unsigned long version, unsigned int *tgids) -+static int get_tgid_list(int index, unsigned long version, unsigned int *tgids, -+ struct ve_struct *owner) - { - struct task_struct *p; - int nr_tgids = 0; -@@ -1665,18 +1726,23 @@ static int get_tgid_list(int index, unsi - read_lock(&tasklist_lock); - p = NULL; - if (version) { -- p = find_task_by_pid(version); -- if (!thread_group_leader(p)) -+ struct ve_struct *oldve; -+ -+ oldve = set_exec_env(owner); -+ p = find_task_by_pid_ve(version); -+ (void)set_exec_env(oldve); -+ -+ if (p != NULL && !thread_group_leader(p)) - p = NULL; - } - - if (p) - index = 0; - else -- p = next_task(&init_task); -+ p = __first_task_ve(owner); - -- for ( ; p != &init_task; p = next_task(p)) { -- int tgid = p->pid; -+ for ( ; p != NULL; p = __next_task_ve(owner, p)) { -+ int tgid = get_task_pid_ve(p, owner); - if (!pid_alive(p)) - continue; - if (--index >= 0) -@@ -1709,7 +1775,7 @@ static int get_tid_list(int index, unsig - * via next_thread(). 
- */ - if (pid_alive(task)) do { -- int tid = task->pid; -+ int tid = get_task_pid(task); - - if (--index >= 0) - continue; -@@ -1741,7 +1807,8 @@ int proc_pid_readdir(struct file * filp, - /* - * f_version caches the last tgid which was returned from readdir - */ -- nr_tgids = get_tgid_list(nr, filp->f_version, tgid_array); -+ nr_tgids = get_tgid_list(nr, filp->f_version, tgid_array, -+ VE_OWNER_FSTYPE(filp->f_dentry->d_sb->s_type)); - - for (i = 0; i < nr_tgids; i++) { - int tgid = tgid_array[i]; -diff -uprN linux-2.6.8.1.orig/fs/proc/generic.c linux-2.6.8.1-ve022stab078/fs/proc/generic.c ---- linux-2.6.8.1.orig/fs/proc/generic.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/generic.c 2006-05-11 13:05:40.000000000 +0400 -@@ -10,7 +10,9 @@ - - #include <linux/errno.h> - #include <linux/time.h> -+#include <linux/fs.h> - #include <linux/proc_fs.h> -+#include <linux/ve_owner.h> - #include <linux/stat.h> - #include <linux/module.h> - #include <linux/mount.h> -@@ -27,6 +29,8 @@ static ssize_t proc_file_write(struct fi - size_t count, loff_t *ppos); - static loff_t proc_file_lseek(struct file *, loff_t, int); - -+static DECLARE_RWSEM(proc_tree_sem); -+ - int proc_match(int len, const char *name, struct proc_dir_entry *de) - { - if (de->namelen != len) -@@ -54,13 +58,25 @@ proc_file_read(struct file *file, char _ - ssize_t n, count; - char *start; - struct proc_dir_entry * dp; -+ unsigned long long pos; -+ -+ /* -+ * Gaah, please just use "seq_file" instead. The legacy /proc -+ * interfaces cut loff_t down to off_t for reads, and ignore -+ * the offset entirely for writes.. 
-+ */ -+ pos = *ppos; -+ if (pos > MAX_NON_LFS) -+ return 0; -+ if (nbytes > MAX_NON_LFS - pos) -+ nbytes = MAX_NON_LFS - pos; - - dp = PDE(inode); - if (!(page = (char*) __get_free_page(GFP_KERNEL))) - return -ENOMEM; - - while ((nbytes > 0) && !eof) { -- count = min_t(ssize_t, PROC_BLOCK_SIZE, nbytes); -+ count = min_t(size_t, PROC_BLOCK_SIZE, nbytes); - - start = NULL; - if (dp->get_info) { -@@ -202,32 +218,20 @@ proc_file_write(struct file *file, const - static loff_t - proc_file_lseek(struct file *file, loff_t offset, int orig) - { -- lock_kernel(); -- -- switch (orig) { -- case 0: -- if (offset < 0) -- goto out; -- file->f_pos = offset; -- unlock_kernel(); -- return(file->f_pos); -- case 1: -- if (offset + file->f_pos < 0) -- goto out; -- file->f_pos += offset; -- unlock_kernel(); -- return(file->f_pos); -- case 2: -- goto out; -- default: -- goto out; -- } -- --out: -- unlock_kernel(); -- return -EINVAL; -+ loff_t retval = -EINVAL; -+ switch (orig) { -+ case 1: -+ offset += file->f_pos; -+ /* fallthrough */ -+ case 0: -+ if (offset < 0 || offset > MAX_NON_LFS) -+ break; -+ file->f_pos = retval = offset; -+ } -+ return retval; - } - -+#ifndef CONFIG_VE - static int proc_notify_change(struct dentry *dentry, struct iattr *iattr) - { - struct inode *inode = dentry->d_inode; -@@ -248,9 +252,12 @@ static int proc_notify_change(struct den - out: - return error; - } -+#endif - - static struct inode_operations proc_file_inode_operations = { -+#ifndef CONFIG_VE - .setattr = proc_notify_change, -+#endif - }; - - /* -@@ -258,14 +265,14 @@ static struct inode_operations proc_file - * returns the struct proc_dir_entry for "/proc/tty/driver", and - * returns "serial" in residual. 
- */ --static int xlate_proc_name(const char *name, -- struct proc_dir_entry **ret, const char **residual) -+static int __xlate_proc_name(struct proc_dir_entry *root, const char *name, -+ struct proc_dir_entry **ret, const char **residual) - { - const char *cp = name, *next; - struct proc_dir_entry *de; - int len; - -- de = &proc_root; -+ de = root; - while (1) { - next = strchr(cp, '/'); - if (!next) -@@ -285,6 +292,23 @@ static int xlate_proc_name(const char *n - return 0; - } - -+#ifndef CONFIG_VE -+#define xlate_proc_loc_name xlate_proc_name -+#else -+static int xlate_proc_loc_name(const char *name, -+ struct proc_dir_entry **ret, const char **residual) -+{ -+ return __xlate_proc_name(get_exec_env()->proc_root, -+ name, ret, residual); -+} -+#endif -+ -+static int xlate_proc_name(const char *name, -+ struct proc_dir_entry **ret, const char **residual) -+{ -+ return __xlate_proc_name(&proc_root, name, ret, residual); -+} -+ - static DEFINE_IDR(proc_inum_idr); - static spinlock_t proc_inum_lock = SPIN_LOCK_UNLOCKED; /* protects the above */ - -@@ -363,31 +387,102 @@ static struct dentry_operations proc_den - struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd) - { - struct inode *inode = NULL; -- struct proc_dir_entry * de; -+ struct proc_dir_entry *lde, *gde; - int error = -ENOENT; - - lock_kernel(); -- de = PDE(dir); -- if (de) { -- for (de = de->subdir; de ; de = de->next) { -- if (de->namelen != dentry->d_name.len) -- continue; -- if (!memcmp(dentry->d_name.name, de->name, de->namelen)) { -- unsigned int ino = de->low_ino; -+ lde = LPDE(dir); -+ if (!lde) -+ goto out; - -- error = -EINVAL; -- inode = proc_get_inode(dir->i_sb, ino, de); -+ down_read(&proc_tree_sem); -+ for (lde = lde->subdir; lde ; lde = lde->next) { -+ if (lde->namelen != dentry->d_name.len) -+ continue; -+ if (!memcmp(dentry->d_name.name, lde->name, lde->namelen)) -+ break; -+ } -+#ifdef CONFIG_VE -+ gde = GPDE(dir); -+ if (gde != NULL) { -+ for (gde = 
gde->subdir; gde ; gde = gde->next) { -+ if (gde->namelen != dentry->d_name.len) -+ continue; -+ if (!memcmp(dentry->d_name.name, gde->name, gde->namelen)) - break; -- } - } - } -- unlock_kernel(); -+#else -+ gde = NULL; -+#endif -+ -+ /* -+ * There are following possible cases after lookup: -+ * -+ * lde gde -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ * NULL NULL ENOENT -+ * loc NULL found in local tree -+ * loc glob found in both trees -+ * NULL glob found in global tree -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ * -+ * We initialized inode as follows after lookup: -+ * -+ * inode->lde inode->gde -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ * loc NULL in local tree -+ * loc glob both trees -+ * glob glob global tree -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ * i.e. inode->lde is always initialized -+ */ -+ -+ if (lde == NULL && gde == NULL) -+ goto out_up; - -+ if (lde != NULL) { -+ inode = proc_get_inode(dir->i_sb, lde->low_ino, lde); -+ } else { -+ inode = proc_get_inode(dir->i_sb, gde->low_ino, gde); -+ } -+ /* -+ * We can sleep in proc_get_inode(), but since we have i_sem -+ * being taken, no one can setup GPDE/LPDE on this inode. -+ */ - if (inode) { -+#ifdef CONFIG_VE -+ GPDE(inode) = gde; -+ if (gde) { -+ atomic_inc(&gde->count); /* de_get() */ -+ /* we have taken a ref in proc_get_inode() already */ -+ __module_get(gde->owner); -+ } -+ /* if dentry is found in both trees and it is a directory -+ * then inode's nlink count must be altered, because local -+ * and global subtrees may differ. -+ * on the other hand, they may intersect, so actual nlink -+ * value is difficult to calculate - upper estimate is used -+ * instead of it. -+ * dentry found in global tree only must not be writable -+ * in non-super ve. 
-+ */ -+ if (lde && gde && lde != gde && gde->nlink > 1) -+ inode->i_nlink += gde->nlink - 2; -+ if (lde == NULL && !ve_is_super( -+ VE_OWNER_FSTYPE(dir->i_sb->s_type))) -+ inode->i_mode &= ~S_IWUGO; -+#endif -+ up_read(&proc_tree_sem); -+ unlock_kernel(); - dentry->d_op = &proc_dentry_operations; - d_add(dentry, inode); - return NULL; - } -+out_up: -+ up_read(&proc_tree_sem); -+out: -+ unlock_kernel(); - return ERR_PTR(error); - } - -@@ -434,29 +529,58 @@ int proc_readdir(struct file * filp, - filp->f_pos++; - /* fall through */ - default: -- de = de->subdir; - i -= 2; -- for (;;) { -- if (!de) { -- ret = 1; -- goto out; -- } -- if (!i) -- break; -- de = de->next; -- i--; -- } -+ } - -- do { -- if (filldir(dirent, de->name, de->namelen, filp->f_pos, -- de->low_ino, de->mode >> 12) < 0) -- goto out; -- filp->f_pos++; -- de = de->next; -- } while (de); -+ down_read(&proc_tree_sem); -+ de = de->subdir; -+ for (; de != NULL; de = de->next) { -+ if (!i) -+ break; -+ i--; - } -+ -+ for (; de != NULL; de = de->next) { -+ if (filldir(dirent, de->name, de->namelen, filp->f_pos, -+ de->low_ino, de->mode >> 12) < 0) -+ goto out_up; -+ filp->f_pos++; -+ } -+#ifdef CONFIG_VE -+ de = GPDE(inode); -+ if (de == NULL) { -+ ret = 1; -+ goto out_up; -+ } -+ de = de->subdir; -+ -+ for (; de != NULL; de = de->next) { -+ struct proc_dir_entry *p; -+ /* check that we haven't filled this dir already */ -+ for (p = LPDE(inode)->subdir; p; p = p->next) { -+ if (de->namelen != p->namelen) -+ continue; -+ if (!memcmp(de->name, p->name, p->namelen)) -+ break; -+ } -+ if (p) -+ continue; -+ /* skip first i entries */ -+ if (i > 0) { -+ i--; -+ continue; -+ } -+ if (filldir(dirent, de->name, de->namelen, filp->f_pos, -+ de->low_ino, de->mode >> 12) < 0) -+ goto out_up; -+ filp->f_pos++; -+ } -+#endif - ret = 1; --out: unlock_kernel(); -+out_up: -+ up_read(&proc_tree_sem); -+out: -+ unlock_kernel(); - return ret; - } - -@@ -475,7 +599,9 @@ static struct file_operations proc_dir_o - */ - static 
struct inode_operations proc_dir_inode_operations = { - .lookup = proc_lookup, -+#ifndef CONFIG_VE - .setattr = proc_notify_change, -+#endif - }; - - static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp) -@@ -504,6 +630,7 @@ static int proc_register(struct proc_dir - if (dp->proc_iops == NULL) - dp->proc_iops = &proc_file_inode_operations; - } -+ de_get(dir); - return 0; - } - -@@ -549,7 +676,7 @@ static struct proc_dir_entry *proc_creat - /* make sure name is valid */ - if (!name || !strlen(name)) goto out; - -- if (!(*parent) && xlate_proc_name(name, parent, &fn) != 0) -+ if (!(*parent) && xlate_proc_loc_name(name, parent, &fn) != 0) - goto out; - len = strlen(fn); - -@@ -558,6 +685,7 @@ static struct proc_dir_entry *proc_creat - - memset(ent, 0, sizeof(struct proc_dir_entry)); - memcpy(((char *) ent) + sizeof(struct proc_dir_entry), fn, len + 1); -+ atomic_set(&ent->count, 1); - ent->name = ((char *) ent) + sizeof(*ent); - ent->namelen = len; - ent->mode = mode; -@@ -571,6 +699,7 @@ struct proc_dir_entry *proc_symlink(cons - { - struct proc_dir_entry *ent; - -+ down_write(&proc_tree_sem); - ent = proc_create(&parent,name, - (S_IFLNK | S_IRUGO | S_IWUGO | S_IXUGO),1); - -@@ -588,6 +717,7 @@ struct proc_dir_entry *proc_symlink(cons - ent = NULL; - } - } -+ up_write(&proc_tree_sem); - return ent; - } - -@@ -596,6 +726,7 @@ struct proc_dir_entry *proc_mkdir_mode(c - { - struct proc_dir_entry *ent; - -+ down_write(&proc_tree_sem); - ent = proc_create(&parent, name, S_IFDIR | mode, 2); - if (ent) { - ent->proc_fops = &proc_dir_operations; -@@ -606,6 +737,7 @@ struct proc_dir_entry *proc_mkdir_mode(c - ent = NULL; - } - } -+ up_write(&proc_tree_sem); - return ent; - } - -@@ -615,7 +747,7 @@ struct proc_dir_entry *proc_mkdir(const - return proc_mkdir_mode(name, S_IRUGO | S_IXUGO, parent); - } - --struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode, -+static struct proc_dir_entry *__create_proc_entry(const char *name, mode_t 
mode, - struct proc_dir_entry *parent) - { - struct proc_dir_entry *ent; -@@ -647,6 +779,35 @@ struct proc_dir_entry *create_proc_entry - return ent; - } - -+struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode, -+ struct proc_dir_entry *parent) -+{ -+ struct proc_dir_entry *ent; -+ const char *path = name; -+ -+ ent = NULL; -+ down_write(&proc_tree_sem); -+ if (parent || xlate_proc_loc_name(path, &parent, &name) == 0) -+ ent = __create_proc_entry(name, mode, parent); -+ up_write(&proc_tree_sem); -+ return ent; -+} -+ -+struct proc_dir_entry *create_proc_glob_entry(const char *name, mode_t mode, -+ struct proc_dir_entry *parent) -+{ -+ struct proc_dir_entry *ent; -+ const char *path = name; -+ -+ ent = NULL; -+ down_write(&proc_tree_sem); -+ if (parent || xlate_proc_name(path, &parent, &name) == 0) -+ ent = __create_proc_entry(name, mode, parent); -+ up_write(&proc_tree_sem); -+ return ent; -+} -+EXPORT_SYMBOL(create_proc_glob_entry); -+ - void free_proc_entry(struct proc_dir_entry *de) - { - unsigned int ino = de->low_ino; -@@ -665,15 +826,13 @@ void free_proc_entry(struct proc_dir_ent - * Remove a /proc entry and free it if it's not currently in use. - * If it is in use, we set the 'deleted' flag. 
- */ --void remove_proc_entry(const char *name, struct proc_dir_entry *parent) -+static void __remove_proc_entry(const char *name, struct proc_dir_entry *parent) - { - struct proc_dir_entry **p; - struct proc_dir_entry *de; - const char *fn = name; - int len; - -- if (!parent && xlate_proc_name(name, &parent, &fn) != 0) -- goto out; - len = strlen(fn); - for (p = &parent->subdir; *p; p=&(*p)->next ) { - if (!proc_match(len, fn, *p)) -@@ -681,20 +840,58 @@ void remove_proc_entry(const char *name, - de = *p; - *p = de->next; - de->next = NULL; -+ de_put(parent); - if (S_ISDIR(de->mode)) - parent->nlink--; - proc_kill_inodes(de); - de->nlink = 0; - WARN_ON(de->subdir); -- if (!atomic_read(&de->count)) -- free_proc_entry(de); -- else { -- de->deleted = 1; -- printk("remove_proc_entry: %s/%s busy, count=%d\n", -- parent->name, de->name, atomic_read(&de->count)); -- } -+ de->deleted = 1; -+ de_put(de); - break; - } --out: -- return; -+} -+ -+static void __remove_proc_glob_entry(const char *name, struct proc_dir_entry *p) -+{ -+ const char *fn = name; -+ -+ if (!p && xlate_proc_name(name, &p, &fn) != 0) -+ return; -+ __remove_proc_entry(fn, p); -+} -+ -+void remove_proc_glob_entry(const char *name, struct proc_dir_entry *parent) -+{ -+ down_write(&proc_tree_sem); -+ __remove_proc_glob_entry(name, parent); -+ up_write(&proc_tree_sem); -+} -+ -+static void __remove_proc_loc_entry(const char *name, struct proc_dir_entry *p) -+{ -+ const char *fn = name; -+ -+ if (!p && xlate_proc_loc_name(name, &p, &fn) != 0) -+ return; -+ __remove_proc_entry(fn, p); -+} -+ -+void remove_proc_loc_entry(const char *name, struct proc_dir_entry *parent) -+{ -+ down_write(&proc_tree_sem); -+ __remove_proc_entry(name, parent); -+ up_write(&proc_tree_sem); -+} -+ -+/* used in cases when we don't know whether it is global or local proc tree */ -+void remove_proc_entry(const char *name, struct proc_dir_entry *parent) -+{ -+ down_write(&proc_tree_sem); -+ __remove_proc_loc_entry(name, parent); 
-+#ifdef CONFIG_VE -+ if (ve_is_super(get_exec_env())) -+ __remove_proc_glob_entry(name, parent); -+#endif -+ up_write(&proc_tree_sem); - } -diff -uprN linux-2.6.8.1.orig/fs/proc/inode.c linux-2.6.8.1-ve022stab078/fs/proc/inode.c ---- linux-2.6.8.1.orig/fs/proc/inode.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/inode.c 2006-05-11 13:05:40.000000000 +0400 -@@ -8,6 +8,7 @@ - #include <linux/proc_fs.h> - #include <linux/kernel.h> - #include <linux/mm.h> -+#include <linux/ve_owner.h> - #include <linux/string.h> - #include <linux/stat.h> - #include <linux/file.h> -@@ -22,34 +23,25 @@ - - extern void free_proc_entry(struct proc_dir_entry *); - --static inline struct proc_dir_entry * de_get(struct proc_dir_entry *de) --{ -- if (de) -- atomic_inc(&de->count); -- return de; --} -- - /* - * Decrements the use count and checks for deferred deletion. - */ --static void de_put(struct proc_dir_entry *de) -+void de_put(struct proc_dir_entry *de) - { - if (de) { -- lock_kernel(); - if (!atomic_read(&de->count)) { - printk("de_put: entry %s already free!\n", de->name); -- unlock_kernel(); - return; - } - - if (atomic_dec_and_test(&de->count)) { -- if (de->deleted) { -- printk("de_put: deferred delete of %s\n", -- de->name); -- free_proc_entry(de); -+ if (!de->deleted) { -+ printk("de_put: entry %s is not removed yet\n", -+ de->name); -+ return; - } -- } -- unlock_kernel(); -+ free_proc_entry(de); -+ } - } - } - -@@ -67,12 +59,19 @@ static void proc_delete_inode(struct ino - put_task_struct(tsk); - - /* Let go of any associated proc directory entry */ -- de = PROC_I(inode)->pde; -+ de = LPDE(inode); - if (de) { - if (de->owner) - module_put(de->owner); - de_put(de); - } -+#ifdef CONFIG_VE -+ de = GPDE(inode); -+ if (de) { -+ module_put(de->owner); -+ de_put(de); -+ } -+#endif - clear_inode(inode); - } - -@@ -99,6 +98,9 @@ static struct inode *proc_alloc_inode(st - ei->pde = NULL; - inode = &ei->vfs_inode; - inode->i_mtime = inode->i_atime = 
inode->i_ctime = CURRENT_TIME; -+#ifdef CONFIG_VE -+ GPDE(inode) = NULL; -+#endif - return inode; - } - -@@ -200,10 +202,13 @@ struct inode *proc_get_inode(struct supe - - WARN_ON(de && de->deleted); - -+ if (de != NULL && !try_module_get(de->owner)) -+ goto out_mod; -+ - inode = iget(sb, ino); - if (!inode) -- goto out_fail; -- -+ goto out_ino; -+ - PROC_I(inode)->pde = de; - if (de) { - if (de->mode) { -@@ -215,20 +220,20 @@ struct inode *proc_get_inode(struct supe - inode->i_size = de->size; - if (de->nlink) - inode->i_nlink = de->nlink; -- if (!try_module_get(de->owner)) -- goto out_fail; - if (de->proc_iops) - inode->i_op = de->proc_iops; - if (de->proc_fops) - inode->i_fop = de->proc_fops; - } - --out: - return inode; - --out_fail: -+out_ino: -+ if (de != NULL) -+ module_put(de->owner); -+out_mod: - de_put(de); -- goto out; -+ return NULL; - } - - int proc_fill_super(struct super_block *s, void *data, int silent) -@@ -251,6 +256,14 @@ int proc_fill_super(struct super_block * - s->s_root = d_alloc_root(root_inode); - if (!s->s_root) - goto out_no_root; -+ -+#ifdef CONFIG_VE -+ LPDE(root_inode) = de_get(get_exec_env()->proc_root); -+ GPDE(root_inode) = &proc_root; -+#else -+ LPDE(root_inode) = &proc_root; -+#endif -+ - parse_options(data, &root_inode->i_uid, &root_inode->i_gid); - return 0; - -diff -uprN linux-2.6.8.1.orig/fs/proc/kmsg.c linux-2.6.8.1-ve022stab078/fs/proc/kmsg.c ---- linux-2.6.8.1.orig/fs/proc/kmsg.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/kmsg.c 2006-05-11 13:05:42.000000000 +0400 -@@ -11,6 +11,7 @@ - #include <linux/kernel.h> - #include <linux/poll.h> - #include <linux/fs.h> -+#include <linux/veprintk.h> - - #include <asm/uaccess.h> - #include <asm/io.h> -@@ -40,7 +41,7 @@ static ssize_t kmsg_read(struct file *fi - - static unsigned int kmsg_poll(struct file *file, poll_table *wait) - { -- poll_wait(file, &log_wait, wait); -+ poll_wait(file, &ve_log_wait, wait); - if (do_syslog(9, NULL, 0)) - return POLLIN 
| POLLRDNORM; - return 0; -diff -uprN linux-2.6.8.1.orig/fs/proc/proc_misc.c linux-2.6.8.1-ve022stab078/fs/proc/proc_misc.c ---- linux-2.6.8.1.orig/fs/proc/proc_misc.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/proc_misc.c 2006-05-11 13:05:49.000000000 +0400 -@@ -31,6 +31,7 @@ - #include <linux/pagemap.h> - #include <linux/swap.h> - #include <linux/slab.h> -+#include <linux/virtinfo.h> - #include <linux/smp.h> - #include <linux/signal.h> - #include <linux/module.h> -@@ -44,14 +45,15 @@ - #include <linux/jiffies.h> - #include <linux/sysrq.h> - #include <linux/vmalloc.h> -+#include <linux/version.h> -+#include <linux/compile.h> - #include <asm/uaccess.h> - #include <asm/pgtable.h> - #include <asm/io.h> - #include <asm/tlb.h> - #include <asm/div64.h> -+#include <linux/fairsched.h> - --#define LOAD_INT(x) ((x) >> FSHIFT) --#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) - /* - * Warning: stuff below (imported functions) assumes that its output will fit - * into one page. For some of those functions it may be wrong. 
Moreover, we -@@ -83,15 +85,33 @@ static int loadavg_read_proc(char *page, - { - int a, b, c; - int len; -- -- a = avenrun[0] + (FIXED_1/200); -- b = avenrun[1] + (FIXED_1/200); -- c = avenrun[2] + (FIXED_1/200); -+ unsigned long __nr_running; -+ int __nr_threads; -+ unsigned long *__avenrun; -+ struct ve_struct *ve; -+ -+ ve = get_exec_env(); -+ -+ if (ve_is_super(ve)) { -+ __avenrun = &avenrun[0]; -+ __nr_running = nr_running(); -+ __nr_threads = nr_threads; -+ } -+#ifdef CONFIG_VE -+ else { -+ __avenrun = &ve->avenrun[0]; -+ __nr_running = nr_running_ve(ve); -+ __nr_threads = atomic_read(&ve->pcounter); -+ } -+#endif -+ a = __avenrun[0] + (FIXED_1/200); -+ b = __avenrun[1] + (FIXED_1/200); -+ c = __avenrun[2] + (FIXED_1/200); - len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n", - LOAD_INT(a), LOAD_FRAC(a), - LOAD_INT(b), LOAD_FRAC(b), - LOAD_INT(c), LOAD_FRAC(c), -- nr_running(), nr_threads, last_pid); -+ __nr_running, __nr_threads, last_pid); - return proc_calc_metrics(page, start, off, count, eof, len); - } - -@@ -139,6 +159,13 @@ static int uptime_read_proc(char *page, - u64 idle_jiffies = init_task.utime + init_task.stime; - - do_posix_clock_monotonic_gettime(&uptime); -+#ifdef CONFIG_VE -+ if (!ve_is_super(get_exec_env())) { -+ set_normalized_timespec(&uptime, -+ uptime.tv_sec - get_exec_env()->start_timespec.tv_sec, -+ uptime.tv_nsec - get_exec_env()->start_timespec.tv_nsec); -+ } -+#endif - jiffies_to_timespec(idle_jiffies, &idle); - len = sprintf(page,"%lu.%02lu %lu.%02lu\n", - (unsigned long) uptime.tv_sec, -@@ -152,30 +179,34 @@ static int uptime_read_proc(char *page, - static int meminfo_read_proc(char *page, char **start, off_t off, - int count, int *eof, void *data) - { -- struct sysinfo i; -- int len, committed; -- struct page_state ps; -- unsigned long inactive; -- unsigned long active; -- unsigned long free; -- unsigned long vmtot; -+ struct meminfo mi; -+ int len; -+ unsigned long dummy; - struct vmalloc_info vmi; - -- 
get_page_state(&ps); -- get_zone_counts(&active, &inactive, &free); -+ get_page_state(&mi.ps); -+ get_zone_counts(&mi.active, &mi.inactive, &dummy); - - /* - * display in kilobytes. - */ - #define K(x) ((x) << (PAGE_SHIFT - 10)) -- si_meminfo(&i); -- si_swapinfo(&i); -- committed = atomic_read(&vm_committed_space); -+ si_meminfo(&mi.si); -+ si_swapinfo(&mi.si); -+ mi.committed_space = atomic_read(&vm_committed_space); -+ mi.swapcache = total_swapcache_pages; -+ mi.cache = get_page_cache_size() - mi.swapcache - mi.si.bufferram; - -- vmtot = (VMALLOC_END-VMALLOC_START)>>10; -+ mi.vmalloc_total = (VMALLOC_END - VMALLOC_START) >> PAGE_SHIFT; - vmi = get_vmalloc_info(); -- vmi.used >>= 10; -- vmi.largest_chunk >>= 10; -+ mi.vmalloc_used = vmi.used >> PAGE_SHIFT; -+ mi.vmalloc_largest = vmi.largest_chunk >> PAGE_SHIFT; -+ -+#ifdef CONFIG_USER_RESOURCE -+ if (virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_MEMINFO, &mi) -+ & NOTIFY_FAIL) -+ return -ENOMSG; -+#endif - - /* - * Tagged format, for easy grepping and expansion. 
-@@ -198,36 +229,40 @@ static int meminfo_read_proc(char *page, - "Writeback: %8lu kB\n" - "Mapped: %8lu kB\n" - "Slab: %8lu kB\n" -- "Committed_AS: %8u kB\n" -+ "Committed_AS: %8lu kB\n" - "PageTables: %8lu kB\n" - "VmallocTotal: %8lu kB\n" - "VmallocUsed: %8lu kB\n" - "VmallocChunk: %8lu kB\n", -- K(i.totalram), -- K(i.freeram), -- K(i.bufferram), -- K(get_page_cache_size()-total_swapcache_pages-i.bufferram), -- K(total_swapcache_pages), -- K(active), -- K(inactive), -- K(i.totalhigh), -- K(i.freehigh), -- K(i.totalram-i.totalhigh), -- K(i.freeram-i.freehigh), -- K(i.totalswap), -- K(i.freeswap), -- K(ps.nr_dirty), -- K(ps.nr_writeback), -- K(ps.nr_mapped), -- K(ps.nr_slab), -- K(committed), -- K(ps.nr_page_table_pages), -- vmtot, -- vmi.used, -- vmi.largest_chunk -+ K(mi.si.totalram), -+ K(mi.si.freeram), -+ K(mi.si.bufferram), -+ K(mi.cache), -+ K(mi.swapcache), -+ K(mi.active), -+ K(mi.inactive), -+ K(mi.si.totalhigh), -+ K(mi.si.freehigh), -+ K(mi.si.totalram-mi.si.totalhigh), -+ K(mi.si.freeram-mi.si.freehigh), -+ K(mi.si.totalswap), -+ K(mi.si.freeswap), -+ K(mi.ps.nr_dirty), -+ K(mi.ps.nr_writeback), -+ K(mi.ps.nr_mapped), -+ K(mi.ps.nr_slab), -+ K(mi.committed_space), -+ K(mi.ps.nr_page_table_pages), -+ K(mi.vmalloc_total), -+ K(mi.vmalloc_used), -+ K(mi.vmalloc_largest) - ); - -+#ifdef CONFIG_HUGETLB_PAGE -+#warning Virtualize hugetlb_report_meminfo -+#else - len += hugetlb_report_meminfo(page + len); -+#endif - - return proc_calc_metrics(page, start, off, count, eof, len); - #undef K -@@ -252,8 +287,15 @@ static int version_read_proc(char *page, - { - extern char *linux_banner; - int len; -+ struct new_utsname *utsname = &ve_utsname; - -- strcpy(page, linux_banner); -+ if (ve_is_super(get_exec_env())) -+ strcpy(page, linux_banner); -+ else -+ sprintf(page, "Linux version %s (" -+ LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ") (" -+ LINUX_COMPILER ") %s\n", -+ utsname->release, utsname->version); - len = strlen(page); - return proc_calc_metrics(page, start, 
off, count, eof, len); - } -@@ -352,21 +394,14 @@ static struct file_operations proc_slabi - .release = seq_release, - }; - --int show_stat(struct seq_file *p, void *v) -+static void show_stat_ve0(struct seq_file *p) - { -- int i; -- extern unsigned long total_forks; -- unsigned long jif; -- u64 sum = 0, user = 0, nice = 0, system = 0, -- idle = 0, iowait = 0, irq = 0, softirq = 0; -- -- jif = - wall_to_monotonic.tv_sec; -- if (wall_to_monotonic.tv_nsec) -- --jif; -+ int i, j; -+ struct page_state page_state; -+ u64 sum, user, nice, system, idle, iowait, irq, softirq; - -+ sum = user = nice = system = idle = iowait = irq = softirq = 0; - for_each_cpu(i) { -- int j; -- - user += kstat_cpu(i).cpustat.user; - nice += kstat_cpu(i).cpustat.nice; - system += kstat_cpu(i).cpustat.system; -@@ -386,8 +421,8 @@ int show_stat(struct seq_file *p, void * - (unsigned long long)jiffies_64_to_clock_t(iowait), - (unsigned long long)jiffies_64_to_clock_t(irq), - (unsigned long long)jiffies_64_to_clock_t(softirq)); -- for_each_online_cpu(i) { - -+ for_each_online_cpu(i) { - /* Copy values here to work around gcc-2.95.3, gcc-2.96 */ - user = kstat_cpu(i).cpustat.user; - nice = kstat_cpu(i).cpustat.nice; -@@ -396,6 +431,7 @@ int show_stat(struct seq_file *p, void * - iowait = kstat_cpu(i).cpustat.iowait; - irq = kstat_cpu(i).cpustat.irq; - softirq = kstat_cpu(i).cpustat.softirq; -+ - seq_printf(p, "cpu%d %llu %llu %llu %llu %llu %llu %llu\n", - i, - (unsigned long long)jiffies_64_to_clock_t(user), -@@ -412,6 +448,84 @@ int show_stat(struct seq_file *p, void * - for (i = 0; i < NR_IRQS; i++) - seq_printf(p, " %u", kstat_irqs(i)); - #endif -+ get_full_page_state(&page_state); -+ seq_printf(p, "\nswap %lu %lu", -+ page_state.pswpin, page_state.pswpout); -+} -+ -+#ifdef CONFIG_VE -+static void show_stat_ve(struct seq_file *p, struct ve_struct *env) -+{ -+ int i; -+ u64 user, nice, system; -+ cycles_t idle, iowait; -+ cpumask_t ve_cpus; -+ -+ ve_cpu_online_map(env, &ve_cpus); -+ -+ user = 
nice = system = idle = iowait = 0; -+ for_each_cpu_mask(i, ve_cpus) { -+ user += VE_CPU_STATS(env, i)->user; -+ nice += VE_CPU_STATS(env, i)->nice; -+ system += VE_CPU_STATS(env, i)->system; -+ idle += ve_sched_get_idle_time(env, i); -+ iowait += ve_sched_get_iowait_time(env, i); -+ } -+ -+ seq_printf(p, "cpu %llu %llu %llu %llu %llu 0 0\n", -+ (unsigned long long)jiffies_64_to_clock_t(user), -+ (unsigned long long)jiffies_64_to_clock_t(nice), -+ (unsigned long long)jiffies_64_to_clock_t(system), -+ (unsigned long long)cycles_to_clocks(idle), -+ (unsigned long long)cycles_to_clocks(iowait)); -+ -+ for_each_cpu_mask(i, ve_cpus) { -+ user = VE_CPU_STATS(env, i)->user; -+ nice = VE_CPU_STATS(env, i)->nice; -+ system = VE_CPU_STATS(env, i)->system; -+ idle = ve_sched_get_idle_time(env, i); -+ iowait = ve_sched_get_iowait_time(env, i); -+ -+ seq_printf(p, "cpu%d %llu %llu %llu %llu %llu 0 0\n", -+ i, -+ (unsigned long long)jiffies_64_to_clock_t(user), -+ (unsigned long long)jiffies_64_to_clock_t(nice), -+ (unsigned long long)jiffies_64_to_clock_t(system), -+ (unsigned long long)cycles_to_clocks(idle), -+ (unsigned long long)cycles_to_clocks(iowait)); -+ } -+ seq_printf(p, "intr 0"); -+ seq_printf(p, "\nswap %d %d", 0, 0); -+} -+#endif -+ -+int show_stat(struct seq_file *p, void *v) -+{ -+ extern unsigned long total_forks; -+ unsigned long seq, jif; -+ struct ve_struct *env; -+ unsigned long __nr_running, __nr_iowait; -+ -+ do { -+ seq = read_seqbegin(&xtime_lock); -+ jif = - wall_to_monotonic.tv_sec; -+ if (wall_to_monotonic.tv_nsec) -+ --jif; -+ } while (read_seqretry(&xtime_lock, seq)); -+ -+ env = get_exec_env(); -+ if (ve_is_super(env)) { -+ show_stat_ve0(p); -+ __nr_running = nr_running(); -+ __nr_iowait = nr_iowait(); -+ } -+#ifdef CONFIG_VE -+ else { -+ show_stat_ve(p, env); -+ __nr_running = nr_running_ve(env); -+ __nr_iowait = nr_iowait_ve(env); -+ } -+#endif - - seq_printf(p, - "\nctxt %llu\n" -@@ -422,8 +536,8 @@ int show_stat(struct seq_file *p, void * - 
nr_context_switches(), - (unsigned long)jif, - total_forks, -- nr_running(), -- nr_iowait()); -+ __nr_running, -+ __nr_iowait); - - return 0; - } -@@ -520,7 +634,8 @@ static int cmdline_read_proc(char *page, - { - int len; - -- len = sprintf(page, "%s\n", saved_command_line); -+ len = sprintf(page, "%s\n", -+ ve_is_super(get_exec_env()) ? saved_command_line : ""); - return proc_calc_metrics(page, start, off, count, eof, len); - } - -diff -uprN linux-2.6.8.1.orig/fs/proc/proc_tty.c linux-2.6.8.1-ve022stab078/fs/proc/proc_tty.c ---- linux-2.6.8.1.orig/fs/proc/proc_tty.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/proc_tty.c 2006-05-11 13:05:40.000000000 +0400 -@@ -6,6 +6,7 @@ - - #include <asm/uaccess.h> - -+#include <linux/ve_owner.h> - #include <linux/init.h> - #include <linux/errno.h> - #include <linux/time.h> -@@ -111,24 +112,35 @@ static int show_tty_driver(struct seq_fi - /* iterator */ - static void *t_start(struct seq_file *m, loff_t *pos) - { -- struct list_head *p; -+ struct tty_driver *drv; -+ - loff_t l = *pos; -- list_for_each(p, &tty_drivers) -+ read_lock(&tty_driver_guard); -+ list_for_each_entry(drv, &tty_drivers, tty_drivers) { -+ if (!ve_accessible_strict(VE_OWNER_TTYDRV(drv), get_exec_env())) -+ continue; - if (!l--) -- return list_entry(p, struct tty_driver, tty_drivers); -+ return drv; -+ } - return NULL; - } - - static void *t_next(struct seq_file *m, void *v, loff_t *pos) - { -- struct list_head *p = ((struct tty_driver *)v)->tty_drivers.next; -+ struct tty_driver *drv; -+ - (*pos)++; -- return p==&tty_drivers ? 
NULL : -- list_entry(p, struct tty_driver, tty_drivers); -+ drv = (struct tty_driver *)v; -+ list_for_each_entry_continue(drv, &tty_drivers, tty_drivers) { -+ if (ve_accessible_strict(VE_OWNER_TTYDRV(drv), get_exec_env())) -+ return drv; -+ } -+ return NULL; - } - - static void t_stop(struct seq_file *m, void *v) - { -+ read_unlock(&tty_driver_guard); - } - - static struct seq_operations tty_drivers_op = { -diff -uprN linux-2.6.8.1.orig/fs/proc/root.c linux-2.6.8.1-ve022stab078/fs/proc/root.c ---- linux-2.6.8.1.orig/fs/proc/root.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/proc/root.c 2006-05-11 13:05:42.000000000 +0400 -@@ -30,12 +30,14 @@ static struct super_block *proc_get_sb(s - return get_sb_single(fs_type, flags, data, proc_fill_super); - } - --static struct file_system_type proc_fs_type = { -+struct file_system_type proc_fs_type = { - .name = "proc", - .get_sb = proc_get_sb, - .kill_sb = kill_anon_super, - }; - -+EXPORT_SYMBOL(proc_fs_type); -+ - extern int __init proc_init_inodecache(void); - void __init proc_root_init(void) - { -diff -uprN linux-2.6.8.1.orig/fs/qnx4/inode.c linux-2.6.8.1-ve022stab078/fs/qnx4/inode.c ---- linux-2.6.8.1.orig/fs/qnx4/inode.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/qnx4/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -78,7 +78,7 @@ static void qnx4_write_super(struct supe - unlock_kernel(); - } - --static void qnx4_write_inode(struct inode *inode, int unused) -+static int qnx4_write_inode(struct inode *inode, int unused) - { - struct qnx4_inode_entry *raw_inode; - int block, ino; -@@ -87,12 +87,12 @@ static void qnx4_write_inode(struct inod - - QNX4DEBUG(("qnx4: write inode 1.\n")); - if (inode->i_nlink == 0) { -- return; -+ return 0; - } - if (!ino) { - printk("qnx4: bad inode number on dev %s: %d is out of range\n", - inode->i_sb->s_id, ino); -- return; -+ return -EIO; - } - QNX4DEBUG(("qnx4: write inode 2.\n")); - block = ino / QNX4_INODES_PER_BLOCK; -@@ -101,7 
+101,7 @@ static void qnx4_write_inode(struct inod - printk("qnx4: major problem: unable to read inode from dev " - "%s\n", inode->i_sb->s_id); - unlock_kernel(); -- return; -+ return -EIO; - } - raw_inode = ((struct qnx4_inode_entry *) bh->b_data) + - (ino % QNX4_INODES_PER_BLOCK); -@@ -117,6 +117,7 @@ static void qnx4_write_inode(struct inod - mark_buffer_dirty(bh); - brelse(bh); - unlock_kernel(); -+ return 0; - } - - #endif -diff -uprN linux-2.6.8.1.orig/fs/quota.c linux-2.6.8.1-ve022stab078/fs/quota.c ---- linux-2.6.8.1.orig/fs/quota.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/quota.c 2006-05-11 13:05:43.000000000 +0400 -@@ -94,26 +94,29 @@ static int check_quotactl_valid(struct s - if (cmd == Q_GETQUOTA || cmd == Q_XGETQUOTA) { - if (((type == USRQUOTA && current->euid != id) || - (type == GRPQUOTA && !in_egroup_p(id))) && -- !capable(CAP_SYS_ADMIN)) -+ !capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - } - else if (cmd != Q_GETFMT && cmd != Q_SYNC && cmd != Q_GETINFO && cmd != Q_XGETQSTAT) -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - - return security_quotactl (cmd, type, id, sb); - } - --static struct super_block *get_super_to_sync(int type) -+void sync_dquots(struct super_block *sb, int type) - { -- struct list_head *head; - int cnt, dirty; -- --restart: -+ -+ if (sb) { -+ if (sb->s_qcop && sb->s_qcop->quota_sync) -+ sb->s_qcop->quota_sync(sb, type); -+ return; -+ } -+ - spin_lock(&sb_lock); -- list_for_each(head, &super_blocks) { -- struct super_block *sb = list_entry(head, struct super_block, s_list); -- -+restart: -+ list_for_each_entry(sb, &super_blocks, s_list) { - /* This test just improves performance so it needn't be reliable... 
*/ - for (cnt = 0, dirty = 0; cnt < MAXQUOTAS; cnt++) - if ((type == cnt || type == -1) && sb_has_quota_enabled(sb, cnt) -@@ -124,29 +127,14 @@ restart: - sb->s_count++; - spin_unlock(&sb_lock); - down_read(&sb->s_umount); -- if (!sb->s_root) { -- drop_super(sb); -+ if (sb->s_root && sb->s_qcop->quota_sync) -+ sb->s_qcop->quota_sync(sb, type); -+ up_read(&sb->s_umount); -+ spin_lock(&sb_lock); -+ if (__put_super_and_need_restart(sb)) - goto restart; -- } -- return sb; - } - spin_unlock(&sb_lock); -- return NULL; --} -- --void sync_dquots(struct super_block *sb, int type) --{ -- if (sb) { -- if (sb->s_qcop->quota_sync) -- sb->s_qcop->quota_sync(sb, type); -- } -- else { -- while ((sb = get_super_to_sync(type)) != 0) { -- if (sb->s_qcop->quota_sync) -- sb->s_qcop->quota_sync(sb, type); -- drop_super(sb); -- } -- } - } - - /* Copy parameters and call proper function */ -@@ -258,6 +246,250 @@ static int do_quotactl(struct super_bloc - return 0; - } - -+static struct super_block *quota_get_sb(const char __user *special) -+{ -+ struct super_block *sb; -+ struct block_device *bdev; -+ char *tmp; -+ -+ tmp = getname(special); -+ if (IS_ERR(tmp)) -+ return (struct super_block *)tmp; -+ bdev = lookup_bdev(tmp, FMODE_QUOTACTL); -+ putname(tmp); -+ if (IS_ERR(bdev)) -+ return (struct super_block *)bdev; -+ sb = get_super(bdev); -+ bdput(bdev); -+ if (!sb) -+ return ERR_PTR(-ENODEV); -+ return sb; -+} -+ -+#ifdef CONFIG_QUOTA_COMPAT -+ -+#define QC_QUOTAON 0x0100 /* enable quotas */ -+#define QC_QUOTAOFF 0x0200 /* disable quotas */ -+/* GETQUOTA, SETQUOTA and SETUSE which were at 0x0300-0x0500 has now other parameteres */ -+#define QC_SYNC 0x0600 /* sync disk copy of a filesystems quotas */ -+#define QC_SETQLIM 0x0700 /* set limits */ -+/* GETSTATS at 0x0800 is now longer... */ -+#define QC_GETINFO 0x0900 /* get info about quotas - graces, flags... 
*/ -+#define QC_SETINFO 0x0A00 /* set info about quotas */ -+#define QC_SETGRACE 0x0B00 /* set inode and block grace */ -+#define QC_SETFLAGS 0x0C00 /* set flags for quota */ -+#define QC_GETQUOTA 0x0D00 /* get limits and usage */ -+#define QC_SETQUOTA 0x0E00 /* set limits and usage */ -+#define QC_SETUSE 0x0F00 /* set usage */ -+/* 0x1000 used by old RSQUASH */ -+#define QC_GETSTATS 0x1100 /* get collected stats */ -+#define QC_GETQUOTI 0x2B00 /* get limits and usage by index */ -+ -+struct compat_dqblk { -+ unsigned int dqb_ihardlimit; -+ unsigned int dqb_isoftlimit; -+ unsigned int dqb_curinodes; -+ unsigned int dqb_bhardlimit; -+ unsigned int dqb_bsoftlimit; -+ qsize_t dqb_curspace; -+ __kernel_time_t dqb_btime; -+ __kernel_time_t dqb_itime; -+}; -+ -+struct compat_dqinfo { -+ unsigned int dqi_bgrace; -+ unsigned int dqi_igrace; -+ unsigned int dqi_flags; -+ unsigned int dqi_blocks; -+ unsigned int dqi_free_blk; -+ unsigned int dqi_free_entry; -+}; -+ -+struct compat_dqstats { -+ __u32 lookups; -+ __u32 drops; -+ __u32 reads; -+ __u32 writes; -+ __u32 cache_hits; -+ __u32 allocated_dquots; -+ __u32 free_dquots; -+ __u32 syncs; -+ __u32 version; -+}; -+ -+asmlinkage long sys_quotactl(unsigned int cmd, const char __user *special, qid_t id, void __user *addr); -+static long compat_quotactl(unsigned int cmds, unsigned int type, -+ const char __user *special, qid_t id, -+ void __user *addr) -+{ -+ struct super_block *sb; -+ long ret; -+ -+ sb = NULL; -+ switch (cmds) { -+ case QC_QUOTAON: -+ return sys_quotactl(QCMD(Q_QUOTAON, type), -+ special, id, addr); -+ -+ case QC_QUOTAOFF: -+ return sys_quotactl(QCMD(Q_QUOTAOFF, type), -+ special, id, addr); -+ -+ case QC_SYNC: -+ return sys_quotactl(QCMD(Q_SYNC, type), -+ special, id, addr); -+ -+ case QC_GETQUOTA: { -+ struct if_dqblk idq; -+ struct compat_dqblk cdq; -+ -+ sb = quota_get_sb(special); -+ ret = PTR_ERR(sb); -+ if (IS_ERR(sb)) -+ break; -+ ret = check_quotactl_valid(sb, type, Q_GETQUOTA, id); -+ if (ret) -+ 
break; -+ ret = sb->s_qcop->get_dqblk(sb, type, id, &idq); -+ if (ret) -+ break; -+ cdq.dqb_ihardlimit = idq.dqb_ihardlimit; -+ cdq.dqb_isoftlimit = idq.dqb_isoftlimit; -+ cdq.dqb_curinodes = idq.dqb_curinodes; -+ cdq.dqb_bhardlimit = idq.dqb_bhardlimit; -+ cdq.dqb_bsoftlimit = idq.dqb_bsoftlimit; -+ cdq.dqb_curspace = idq.dqb_curspace; -+ cdq.dqb_btime = idq.dqb_btime; -+ cdq.dqb_itime = idq.dqb_itime; -+ ret = 0; -+ if (copy_to_user(addr, &cdq, sizeof(cdq))) -+ ret = -EFAULT; -+ break; -+ } -+ -+ case QC_SETQUOTA: -+ case QC_SETUSE: -+ case QC_SETQLIM: { -+ struct if_dqblk idq; -+ struct compat_dqblk cdq; -+ -+ sb = quota_get_sb(special); -+ ret = PTR_ERR(sb); -+ if (IS_ERR(sb)) -+ break; -+ ret = check_quotactl_valid(sb, type, Q_SETQUOTA, id); -+ if (ret) -+ break; -+ ret = -EFAULT; -+ if (copy_from_user(&cdq, addr, sizeof(cdq))) -+ break; -+ idq.dqb_ihardlimit = cdq.dqb_ihardlimit; -+ idq.dqb_isoftlimit = cdq.dqb_isoftlimit; -+ idq.dqb_curinodes = cdq.dqb_curinodes; -+ idq.dqb_bhardlimit = cdq.dqb_bhardlimit; -+ idq.dqb_bsoftlimit = cdq.dqb_bsoftlimit; -+ idq.dqb_curspace = cdq.dqb_curspace; -+ idq.dqb_valid = 0; -+ if (cmds == QC_SETQUOTA || cmds == QC_SETQLIM) -+ idq.dqb_valid |= QIF_LIMITS; -+ if (cmds == QC_SETQUOTA || cmds == QC_SETUSE) -+ idq.dqb_valid |= QIF_USAGE; -+ ret = sb->s_qcop->set_dqblk(sb, type, id, &idq); -+ break; -+ } -+ -+ case QC_GETINFO: { -+ struct if_dqinfo iinf; -+ struct compat_dqinfo cinf; -+ -+ sb = quota_get_sb(special); -+ ret = PTR_ERR(sb); -+ if (IS_ERR(sb)) -+ break; -+ ret = check_quotactl_valid(sb, type, Q_GETQUOTA, id); -+ if (ret) -+ break; -+ ret = sb->s_qcop->get_info(sb, type, &iinf); -+ if (ret) -+ break; -+ cinf.dqi_bgrace = iinf.dqi_bgrace; -+ cinf.dqi_igrace = iinf.dqi_igrace; -+ cinf.dqi_flags = 0; -+ if (iinf.dqi_flags & DQF_INFO_DIRTY) -+ cinf.dqi_flags |= 0x0010; -+ cinf.dqi_blocks = 0; -+ cinf.dqi_free_blk = 0; -+ cinf.dqi_free_entry = 0; -+ ret = 0; -+ if (copy_to_user(addr, &cinf, sizeof(cinf))) -+ ret = 
-EFAULT; -+ break; -+ } -+ -+ case QC_SETINFO: -+ case QC_SETGRACE: -+ case QC_SETFLAGS: { -+ struct if_dqinfo iinf; -+ struct compat_dqinfo cinf; -+ -+ sb = quota_get_sb(special); -+ ret = PTR_ERR(sb); -+ if (IS_ERR(sb)) -+ break; -+ ret = check_quotactl_valid(sb, type, Q_SETINFO, id); -+ if (ret) -+ break; -+ ret = -EFAULT; -+ if (copy_from_user(&cinf, addr, sizeof(cinf))) -+ break; -+ iinf.dqi_bgrace = cinf.dqi_bgrace; -+ iinf.dqi_igrace = cinf.dqi_igrace; -+ iinf.dqi_flags = cinf.dqi_flags; -+ iinf.dqi_valid = 0; -+ if (cmds == QC_SETINFO || cmds == QC_SETGRACE) -+ iinf.dqi_valid |= IIF_BGRACE | IIF_IGRACE; -+ if (cmds == QC_SETINFO || cmds == QC_SETFLAGS) -+ iinf.dqi_valid |= IIF_FLAGS; -+ ret = sb->s_qcop->set_info(sb, type, &iinf); -+ break; -+ } -+ -+ case QC_GETSTATS: { -+ struct compat_dqstats stat; -+ -+ memset(&stat, 0, sizeof(stat)); -+ stat.version = 6*10000+5*100+0; -+ ret = 0; -+ if (copy_to_user(addr, &stat, sizeof(stat))) -+ ret = -EFAULT; -+ break; -+ } -+ -+ case QC_GETQUOTI: -+ sb = quota_get_sb(special); -+ ret = PTR_ERR(sb); -+ if (IS_ERR(sb)) -+ break; -+ ret = check_quotactl_valid(sb, type, Q_GETINFO, id); -+ if (ret) -+ break; -+ ret = -ENOSYS; -+ if (!sb->s_qcop->get_quoti) -+ break; -+ ret = sb->s_qcop->get_quoti(sb, type, id, addr); -+ break; -+ -+ default: -+ ret = -ENOSYS; -+ break; -+ } -+ if (sb && !IS_ERR(sb)) -+ drop_super(sb); -+ return ret; -+} -+ -+#endif -+ - /* - * This is the system call interface. This communicates with - * the user-level programs. 
Currently this only supports diskquota -@@ -268,25 +500,20 @@ asmlinkage long sys_quotactl(unsigned in - { - uint cmds, type; - struct super_block *sb = NULL; -- struct block_device *bdev; -- char *tmp; - int ret; - - cmds = cmd >> SUBCMDSHIFT; - type = cmd & SUBCMDMASK; - -+#ifdef CONFIG_QUOTA_COMPAT -+ if (cmds >= 0x0100 && cmds < 0x3000) -+ return compat_quotactl(cmds, type, special, id, addr); -+#endif -+ - if (cmds != Q_SYNC || special) { -- tmp = getname(special); -- if (IS_ERR(tmp)) -- return PTR_ERR(tmp); -- bdev = lookup_bdev(tmp); -- putname(tmp); -- if (IS_ERR(bdev)) -- return PTR_ERR(bdev); -- sb = get_super(bdev); -- bdput(bdev); -- if (!sb) -- return -ENODEV; -+ sb = quota_get_sb(special); -+ if (IS_ERR(sb)) -+ return PTR_ERR(sb); - } - - ret = check_quotactl_valid(sb, type, cmds, id); -diff -uprN linux-2.6.8.1.orig/fs/ramfs/inode.c linux-2.6.8.1-ve022stab078/fs/ramfs/inode.c ---- linux-2.6.8.1.orig/fs/ramfs/inode.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/ramfs/inode.c 2006-05-11 13:05:32.000000000 +0400 -@@ -128,7 +128,7 @@ static int ramfs_symlink(struct inode * - inode = ramfs_get_inode(dir->i_sb, S_IFLNK|S_IRWXUGO, 0); - if (inode) { - int l = strlen(symname)+1; -- error = page_symlink(inode, symname, l); -+ error = page_symlink(inode, symname, l, GFP_KERNEL); - if (!error) { - if (dir->i_mode & S_ISGID) - inode->i_gid = dir->i_gid; -diff -uprN linux-2.6.8.1.orig/fs/reiserfs/file.c linux-2.6.8.1-ve022stab078/fs/reiserfs/file.c ---- linux-2.6.8.1.orig/fs/reiserfs/file.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/reiserfs/file.c 2006-05-11 13:05:33.000000000 +0400 -@@ -535,7 +535,7 @@ error_exit: - - /* Unlock pages prepared by reiserfs_prepare_file_region_for_write */ - void reiserfs_unprepare_pages(struct page **prepared_pages, /* list of locked pages */ -- int num_pages /* amount of pages */) { -+ size_t num_pages /* amount of pages */) { - int i; // loop counter - - for (i=0; i < 
num_pages ; i++) { -@@ -566,7 +566,7 @@ int reiserfs_copy_from_user_to_file_regi - int offset; // offset in page - - for ( i = 0, offset = (pos & (PAGE_CACHE_SIZE-1)); i < num_pages ; i++,offset=0) { -- int count = min_t(int,PAGE_CACHE_SIZE-offset,write_bytes); // How much of bytes to write to this page -+ size_t count = min_t(size_t,PAGE_CACHE_SIZE-offset,write_bytes); // How much of bytes to write to this page - struct page *page=prepared_pages[i]; // Current page we process. - - fault_in_pages_readable( buf, count); -@@ -661,8 +661,8 @@ int reiserfs_submit_file_region_for_writ - struct reiserfs_transaction_handle *th, - struct inode *inode, - loff_t pos, /* Writing position offset */ -- int num_pages, /* Number of pages to write */ -- int write_bytes, /* number of bytes to write */ -+ size_t num_pages, /* Number of pages to write */ -+ size_t write_bytes, /* number of bytes to write */ - struct page **prepared_pages /* list of pages */ - ) - { -@@ -795,9 +795,9 @@ int reiserfs_check_for_tail_and_convert( - int reiserfs_prepare_file_region_for_write( - struct inode *inode /* Inode of the file */, - loff_t pos, /* position in the file */ -- int num_pages, /* number of pages to -+ size_t num_pages, /* number of pages to - prepare */ -- int write_bytes, /* Amount of bytes to be -+ size_t write_bytes, /* Amount of bytes to be - overwritten from - @pos */ - struct page **prepared_pages /* pointer to array -@@ -1176,10 +1176,9 @@ ssize_t reiserfs_file_write( struct file - while ( count > 0) { - /* This is the main loop in which we running until some error occures - or until we write all of the data. 
*/ -- int num_pages;/* amount of pages we are going to write this iteration */ -- int write_bytes; /* amount of bytes to write during this iteration */ -- int blocks_to_allocate; /* how much blocks we need to allocate for -- this iteration */ -+ size_t num_pages;/* amount of pages we are going to write this iteration */ -+ size_t write_bytes; /* amount of bytes to write during this iteration */ -+ size_t blocks_to_allocate; /* how much blocks we need to allocate for this iteration */ - - /* (pos & (PAGE_CACHE_SIZE-1)) is an idiom for offset into a page of pos*/ - num_pages = !!((pos+count) & (PAGE_CACHE_SIZE - 1)) + /* round up partial -@@ -1193,7 +1192,7 @@ ssize_t reiserfs_file_write( struct file - /* If we were asked to write more data than we want to or if there - is not that much space, then we shorten amount of data to write - for this iteration. */ -- num_pages = min_t(int, REISERFS_WRITE_PAGES_AT_A_TIME, reiserfs_can_fit_pages(inode->i_sb)); -+ num_pages = min_t(size_t, REISERFS_WRITE_PAGES_AT_A_TIME, reiserfs_can_fit_pages(inode->i_sb)); - /* Also we should not forget to set size in bytes accordingly */ - write_bytes = (num_pages << PAGE_CACHE_SHIFT) - - (pos & (PAGE_CACHE_SIZE-1)); -@@ -1219,7 +1218,7 @@ ssize_t reiserfs_file_write( struct file - // But overwriting files on absolutelly full volumes would not - // be very efficient. Well, people are not supposed to fill - // 100% of disk space anyway. -- write_bytes = min_t(int, count, inode->i_sb->s_blocksize - (pos & (inode->i_sb->s_blocksize - 1))); -+ write_bytes = min_t(size_t, count, inode->i_sb->s_blocksize - (pos & (inode->i_sb->s_blocksize - 1))); - num_pages = 1; - // No blocks were claimed before, so do it now. 
- reiserfs_claim_blocks_to_be_allocated(inode->i_sb, 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits)); -diff -uprN linux-2.6.8.1.orig/fs/reiserfs/inode.c linux-2.6.8.1-ve022stab078/fs/reiserfs/inode.c ---- linux-2.6.8.1.orig/fs/reiserfs/inode.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/reiserfs/inode.c 2006-05-11 13:05:35.000000000 +0400 -@@ -1504,7 +1504,7 @@ int reiserfs_encode_fh(struct dentry *de - ** to properly mark inodes for datasync and such, but only actually - ** does something when called for a synchronous update. - */ --void reiserfs_write_inode (struct inode * inode, int do_sync) { -+int reiserfs_write_inode (struct inode * inode, int do_sync) { - struct reiserfs_transaction_handle th ; - int jbegin_count = 1 ; - -@@ -1512,7 +1512,7 @@ void reiserfs_write_inode (struct inode - reiserfs_warning (inode->i_sb, - "clm-6005: writing inode %lu on readonly FS", - inode->i_ino) ; -- return ; -+ return -EROFS; - } - /* memory pressure can sometimes initiate write_inode calls with sync == 1, - ** these cases are just when the system needs ram, not when the -@@ -1526,6 +1526,7 @@ void reiserfs_write_inode (struct inode - journal_end_sync(&th, inode->i_sb, jbegin_count) ; - reiserfs_write_unlock(inode->i_sb); - } -+ return 0; - } - - /* FIXME: no need any more. right? 
*/ -diff -uprN linux-2.6.8.1.orig/fs/reiserfs/namei.c linux-2.6.8.1-ve022stab078/fs/reiserfs/namei.c ---- linux-2.6.8.1.orig/fs/reiserfs/namei.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/reiserfs/namei.c 2006-05-11 13:05:43.000000000 +0400 -@@ -799,6 +799,9 @@ static int reiserfs_rmdir (struct inode - struct reiserfs_dir_entry de; - - -+ inode = dentry->d_inode; -+ DQUOT_INIT(inode); -+ - /* we will be doing 2 balancings and update 2 stat data */ - jbegin_count = JOURNAL_PER_BALANCE_CNT * 2 + 2; - -@@ -814,8 +817,6 @@ static int reiserfs_rmdir (struct inode - goto end_rmdir; - } - -- inode = dentry->d_inode; -- - reiserfs_update_inode_transaction(inode) ; - reiserfs_update_inode_transaction(dir) ; - -@@ -878,6 +879,7 @@ static int reiserfs_unlink (struct inode - unsigned long savelink; - - inode = dentry->d_inode; -+ DQUOT_INIT(inode); - - /* in this transaction we can be doing at max two balancings and update - two stat datas */ -@@ -1146,6 +1148,8 @@ static int reiserfs_rename (struct inode - - old_inode = old_dentry->d_inode; - new_dentry_inode = new_dentry->d_inode; -+ if (new_dentry_inode) -+ DQUOT_INIT(new_dentry_inode); - - // make sure, that oldname still exists and points to an object we - // are going to rename -diff -uprN linux-2.6.8.1.orig/fs/reiserfs/xattr.c linux-2.6.8.1-ve022stab078/fs/reiserfs/xattr.c ---- linux-2.6.8.1.orig/fs/reiserfs/xattr.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/reiserfs/xattr.c 2006-05-11 13:05:35.000000000 +0400 -@@ -1429,9 +1429,26 @@ check_capabilities: - } - - int --reiserfs_permission (struct inode *inode, int mask, struct nameidata *nd) -+reiserfs_permission (struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm) - { -- return __reiserfs_permission (inode, mask, nd, 1); -+ int ret; -+ -+ if (exec_perm != NULL) -+ down(&inode->i_sem); -+ -+ ret = __reiserfs_permission (inode, mask, nd, 1); -+ -+ if (exec_perm != NULL) { -+ if (!ret) 
{ -+ exec_perm->set = 1; -+ exec_perm->mode = inode->i_mode; -+ exec_perm->uid = inode->i_uid; -+ exec_perm->gid = inode->i_gid; -+ } -+ up(&inode->i_sem); -+ } -+ return ret; - } - - int -diff -uprN linux-2.6.8.1.orig/fs/select.c linux-2.6.8.1-ve022stab078/fs/select.c ---- linux-2.6.8.1.orig/fs/select.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/select.c 2006-05-11 13:05:39.000000000 +0400 -@@ -24,6 +24,8 @@ - - #include <asm/uaccess.h> - -+#include <ub/ub_mem.h> -+ - #define ROUND_UP(x,y) (((x)+(y)-1)/(y)) - #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM) - -@@ -94,7 +96,8 @@ void __pollwait(struct file *filp, wait_ - if (!table || POLL_TABLE_FULL(table)) { - struct poll_table_page *new_table; - -- new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL); -+ new_table = (struct poll_table_page *) __get_free_page( -+ GFP_KERNEL_UBC); - if (!new_table) { - p->error = -ENOMEM; - __set_current_state(TASK_RUNNING); -@@ -275,7 +278,7 @@ EXPORT_SYMBOL(do_select); - - static void *select_bits_alloc(int size) - { -- return kmalloc(6 * size, GFP_KERNEL); -+ return ub_kmalloc(6 * size, GFP_KERNEL); - } - - static void select_bits_free(void *bits, int size) -@@ -484,7 +487,7 @@ asmlinkage long sys_poll(struct pollfd _ - err = -ENOMEM; - while(i!=0) { - struct poll_list *pp; -- pp = kmalloc(sizeof(struct poll_list)+ -+ pp = ub_kmalloc(sizeof(struct poll_list)+ - sizeof(struct pollfd)* - (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i), - GFP_KERNEL); -diff -uprN linux-2.6.8.1.orig/fs/seq_file.c linux-2.6.8.1-ve022stab078/fs/seq_file.c ---- linux-2.6.8.1.orig/fs/seq_file.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/seq_file.c 2006-05-11 13:05:40.000000000 +0400 -@@ -311,6 +311,8 @@ int seq_path(struct seq_file *m, - if (m->count < m->size) { - char *s = m->buf + m->count; - char *p = d_path(dentry, mnt, s, m->size - m->count); -+ if (IS_ERR(p) && PTR_ERR(p) != -ENAMETOOLONG) -+ return 0; - if 
(!IS_ERR(p)) { - while (s <= p) { - char c = *p++; -diff -uprN linux-2.6.8.1.orig/fs/simfs.c linux-2.6.8.1-ve022stab078/fs/simfs.c ---- linux-2.6.8.1.orig/fs/simfs.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/simfs.c 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,289 @@ -+/* -+ * fs/simfs.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/config.h> -+#include <linux/fs.h> -+#include <linux/file.h> -+#include <linux/init.h> -+#include <linux/namei.h> -+#include <linux/err.h> -+#include <linux/module.h> -+#include <linux/mount.h> -+#include <linux/vzquota.h> -+#include <linux/statfs.h> -+#include <linux/virtinfo.h> -+#include <linux/faudit.h> -+#include <linux/genhd.h> -+ -+#include <asm/unistd.h> -+#include <asm/uaccess.h> -+ -+#define SIMFS_GET_LOWER_FS_SB(sb) sb->s_root->d_sb -+ -+static struct super_operations sim_super_ops; -+ -+static int sim_getattr(struct vfsmount *mnt, struct dentry *dentry, -+ struct kstat *stat) -+{ -+ struct super_block *sb; -+ struct inode *inode; -+ -+ inode = dentry->d_inode; -+ if (!inode->i_op->getattr) { -+ generic_fillattr(inode, stat); -+ if (!stat->blksize) { -+ unsigned blocks; -+ -+ sb = inode->i_sb; -+ blocks = (stat->size + sb->s_blocksize-1) >> -+ sb->s_blocksize_bits; -+ stat->blocks = (sb->s_blocksize / 512) * blocks; -+ stat->blksize = sb->s_blocksize; -+ } -+ } else { -+ int err; -+ -+ err = inode->i_op->getattr(mnt, dentry, stat); -+ if (err) -+ return err; -+ } -+ -+ sb = mnt->mnt_sb; -+ if (sb->s_op == &sim_super_ops) -+ stat->dev = sb->s_dev; -+ return 0; -+} -+ -+static void quota_get_stat(struct super_block *sb, struct kstatfs *buf) -+{ -+ int err; -+ struct dq_stat qstat; -+ struct virt_info_quota q; -+ long free_file, adj_file; -+ s64 blk, free_blk, adj_blk; -+ int bsize_bits; -+ -+ q.super = sb; -+ q.qstat = &qstat; -+ err = virtinfo_notifier_call(VITYPE_QUOTA, 
VIRTINFO_QUOTA_GETSTAT, &q); -+ if (err != NOTIFY_OK) -+ return; -+ -+ bsize_bits = ffs(buf->f_bsize) - 1; -+ free_blk = (s64)(qstat.bsoftlimit - qstat.bcurrent) >> bsize_bits; -+ if (free_blk < 0) -+ free_blk = 0; -+ /* -+ * In the regular case, we always set buf->f_bfree and buf->f_blocks to -+ * the values reported by quota. In case of real disk space shortage, -+ * we adjust the values. We want this adjustment to look as if the -+ * total disk space were reduced, not as if the usage were increased. -+ * -- SAW -+ */ -+ adj_blk = 0; -+ if (buf->f_bfree < free_blk) -+ adj_blk = free_blk - buf->f_bfree; -+ buf->f_bfree = (long)(free_blk - adj_blk); -+ -+ if (free_blk < buf->f_bavail) -+ buf->f_bavail = (long)free_blk; /* min(f_bavail, free_blk) */ -+ -+ blk = (qstat.bsoftlimit >> bsize_bits) - adj_blk; -+ buf->f_blocks = blk > LONG_MAX ? LONG_MAX : blk; -+ -+ free_file = qstat.isoftlimit - qstat.icurrent; -+ if (free_file < 0) -+ free_file = 0; -+ if (buf->f_ffree == -1) -+ /* -+ * One filesystem uses -1 to represent the fact that it doesn't -+ * have a detached limit for inode number. -+ * May be, because -1 is a good pretendent for the maximum value -+ * of signed long type, may be, because it's just nice to have -+ * an exceptional case... 
Guess what that filesystem is :-) -+ * -- SAW -+ */ -+ buf->f_ffree = free_file; -+ adj_file = 0; -+ if (buf->f_ffree < free_file) -+ adj_file = free_file - buf->f_ffree; -+ buf->f_ffree = free_file - adj_file; -+ buf->f_files = qstat.isoftlimit - adj_file; -+} -+ -+static int sim_statfs(struct super_block *sb, struct kstatfs *buf) -+{ -+ int err; -+ struct super_block *lsb; -+ struct kstatfs statbuf; -+ -+ err = 0; -+ if (sb->s_op != &sim_super_ops) -+ return 0; -+ -+ lsb = SIMFS_GET_LOWER_FS_SB(sb); -+ -+ err = -ENOSYS; -+ if (lsb && lsb->s_op && lsb->s_op->statfs) -+ err = lsb->s_op->statfs(lsb, &statbuf); -+ if (err) -+ return err; -+ -+ quota_get_stat(sb, &statbuf); -+ buf->f_files = statbuf.f_files; -+ buf->f_ffree = statbuf.f_ffree; -+ buf->f_blocks = statbuf.f_blocks; -+ buf->f_bfree = statbuf.f_bfree; -+ buf->f_bavail = statbuf.f_bavail; -+ return 0; -+} -+ -+static int sim_systemcall(struct vnotifier_block *me, unsigned long n, -+ void *d, int old_ret) -+{ -+ int err; -+ -+ switch (n) { -+ case VIRTINFO_FAUDIT_STAT: { -+ struct faudit_stat_arg *arg; -+ -+ arg = (struct faudit_stat_arg *)d; -+ err = sim_getattr(arg->mnt, arg->dentry, arg->stat); -+ arg->err = err; -+ } -+ break; -+ case VIRTINFO_FAUDIT_STATFS: { -+ struct faudit_statfs_arg *arg; -+ -+ arg = (struct faudit_statfs_arg *)d; -+ err = sim_statfs(arg->sb, arg->stat); -+ arg->err = err; -+ } -+ break; -+ default: -+ return old_ret; -+ } -+ return (err ? 
NOTIFY_BAD : NOTIFY_OK); -+} -+ -+static struct inode *sim_quota_root(struct super_block *sb) -+{ -+ return sb->s_root->d_inode; -+} -+ -+void sim_put_super(struct super_block *sb) -+{ -+ struct virt_info_quota viq; -+ -+ viq.super = sb; -+ virtinfo_notifier_call(VITYPE_QUOTA, VIRTINFO_QUOTA_OFF, &viq); -+ bdput(sb->s_bdev); -+} -+ -+static struct super_operations sim_super_ops = { -+ .get_quota_root = sim_quota_root, -+ .put_super = sim_put_super, -+}; -+ -+static int sim_fill_super(struct super_block *s, void *data) -+{ -+ int err; -+ struct nameidata *nd; -+ -+ err = set_anon_super(s, NULL); -+ if (err) -+ goto out; -+ -+ err = 0; -+ nd = (struct nameidata *)data; -+ s->s_root = dget(nd->dentry); -+ s->s_op = &sim_super_ops; -+out: -+ return err; -+} -+ -+struct super_block *sim_get_sb(struct file_system_type *type, -+ int flags, const char *dev_name, void *opt) -+{ -+ int err; -+ struct nameidata nd; -+ struct super_block *sb; -+ struct block_device *bd; -+ struct virt_info_quota viq; -+ static struct hd_struct fake_hds; -+ -+ sb = ERR_PTR(-EINVAL); -+ if (opt == NULL) -+ goto out; -+ -+ err = path_lookup(opt, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &nd); -+ sb = ERR_PTR(err); -+ if (err) -+ goto out; -+ -+ sb = sget(type, NULL, sim_fill_super, &nd); -+ if (IS_ERR(sb)) -+ goto out_path; -+ -+ bd = bdget(sb->s_dev); -+ if (!bd) -+ goto out_killsb; -+ -+ sb->s_bdev = bd; -+ bd->bd_part = &fake_hds; -+ viq.super = sb; -+ virtinfo_notifier_call(VITYPE_QUOTA, VIRTINFO_QUOTA_ON, &viq); -+out_path: -+ path_release(&nd); -+out: -+ return sb; -+ -+out_killsb: -+ up_write(&sb->s_umount); -+ deactivate_super(sb); -+ sb = ERR_PTR(-ENODEV); -+ goto out_path; -+} -+ -+static struct file_system_type sim_fs_type = { -+ .owner = THIS_MODULE, -+ .name = "simfs", -+ .get_sb = sim_get_sb, -+ .kill_sb = kill_anon_super, -+}; -+ -+static struct vnotifier_block sim_syscalls = { -+ .notifier_call = sim_systemcall, -+}; -+ -+static int __init init_simfs(void) -+{ -+ int err; -+ -+ err = 
register_filesystem(&sim_fs_type); -+ if (err) -+ return err; -+ -+ virtinfo_notifier_register(VITYPE_FAUDIT, &sim_syscalls); -+ return 0; -+} -+ -+static void __exit exit_simfs(void) -+{ -+ virtinfo_notifier_unregister(VITYPE_FAUDIT, &sim_syscalls); -+ unregister_filesystem(&sim_fs_type); -+} -+ -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Open Virtuozzo Simulation of File System"); -+MODULE_LICENSE("GPL v2"); -+ -+module_init(init_simfs); -+module_exit(exit_simfs); -diff -uprN linux-2.6.8.1.orig/fs/smbfs/dir.c linux-2.6.8.1-ve022stab078/fs/smbfs/dir.c ---- linux-2.6.8.1.orig/fs/smbfs/dir.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/smbfs/dir.c 2006-05-11 13:05:34.000000000 +0400 -@@ -431,6 +431,11 @@ smb_lookup(struct inode *dir, struct den - if (dentry->d_name.len > SMB_MAXNAMELEN) - goto out; - -+ /* Do not allow lookup of names with backslashes in */ -+ error = -EINVAL; -+ if (memchr(dentry->d_name.name, '\\', dentry->d_name.len)) -+ goto out; -+ - lock_kernel(); - error = smb_proc_getattr(dentry, &finfo); - #ifdef SMBFS_PARANOIA -diff -uprN linux-2.6.8.1.orig/fs/smbfs/file.c linux-2.6.8.1-ve022stab078/fs/smbfs/file.c ---- linux-2.6.8.1.orig/fs/smbfs/file.c 2004-08-14 14:56:13.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/smbfs/file.c 2006-05-11 13:05:35.000000000 +0400 -@@ -387,7 +387,8 @@ smb_file_release(struct inode *inode, st - * privileges, so we need our own check for this. 
- */
- static int
--smb_file_permission(struct inode *inode, int mask, struct nameidata *nd)
-+smb_file_permission(struct inode *inode, int mask, struct nameidata *nd,
-+		struct exec_perm *exec_perm)
- {
- 	int mode = inode->i_mode;
- 	int error = 0;
-diff -uprN linux-2.6.8.1.orig/fs/smbfs/inode.c linux-2.6.8.1-ve022stab078/fs/smbfs/inode.c
---- linux-2.6.8.1.orig/fs/smbfs/inode.c	2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/smbfs/inode.c	2006-05-11 13:05:43.000000000 +0400
-@@ -233,7 +233,7 @@ smb_invalidate_inodes(struct smb_sb_info
- {
- 	VERBOSE("\n");
- 	shrink_dcache_sb(SB_of(server));
--	invalidate_inodes(SB_of(server));
-+	invalidate_inodes(SB_of(server), 0);
- }
- 
- /*
-diff -uprN linux-2.6.8.1.orig/fs/smbfs/sock.c linux-2.6.8.1-ve022stab078/fs/smbfs/sock.c
---- linux-2.6.8.1.orig/fs/smbfs/sock.c	2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/smbfs/sock.c	2006-05-11 13:05:44.000000000 +0400
-@@ -100,6 +100,7 @@ smb_close_socket(struct smb_sb_info *ser
- 
- 		VERBOSE("closing socket %p\n", sock);
- 		sock->sk->sk_data_ready = server->data_ready;
-+		sock->sk->sk_user_data = NULL;
- 		server->sock_file = NULL;
- 		fput(file);
- }
-diff -uprN linux-2.6.8.1.orig/fs/stat.c linux-2.6.8.1-ve022stab078/fs/stat.c
---- linux-2.6.8.1.orig/fs/stat.c	2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/stat.c	2006-05-11 13:05:40.000000000 +0400
-@@ -14,6 +14,7 @@
- #include <linux/fs.h>
- #include <linux/namei.h>
- #include <linux/security.h>
-+#include <linux/faudit.h>
- 
- #include <asm/uaccess.h>
- #include <asm/unistd.h>
-@@ -41,11 +42,19 @@ int vfs_getattr(struct vfsmount *mnt, st
- {
- 	struct inode *inode = dentry->d_inode;
- 	int retval;
-+	struct faudit_stat_arg arg;
- 
- 	retval = security_inode_getattr(mnt, dentry);
- 	if (retval)
- 		return retval;
- 
-+	arg.mnt = mnt;
-+	arg.dentry = dentry;
-+	arg.stat = stat;
-+	if (virtinfo_notifier_call(VITYPE_FAUDIT, VIRTINFO_FAUDIT_STAT, &arg)
-+			!= NOTIFY_DONE)
-+		return arg.err;
-+
- 	if (inode->i_op->getattr)
- 		return inode->i_op->getattr(mnt, dentry, stat);
- 
-diff -uprN linux-2.6.8.1.orig/fs/super.c linux-2.6.8.1-ve022stab078/fs/super.c
---- linux-2.6.8.1.orig/fs/super.c	2004-08-14 14:55:22.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/super.c	2006-05-11 13:05:43.000000000 +0400
-@@ -23,6 +23,7 @@
- #include <linux/config.h>
- #include <linux/module.h>
- #include <linux/slab.h>
-+#include <linux/ve_owner.h>
- #include <linux/init.h>
- #include <linux/smp_lock.h>
- #include <linux/acct.h>
-@@ -65,8 +66,10 @@ static struct super_block *alloc_super(v
- 	}
- 	INIT_LIST_HEAD(&s->s_dirty);
- 	INIT_LIST_HEAD(&s->s_io);
-+	INIT_LIST_HEAD(&s->s_inodes);
- 	INIT_LIST_HEAD(&s->s_files);
- 	INIT_LIST_HEAD(&s->s_instances);
-+	INIT_LIST_HEAD(&s->s_dshrinkers);
- 	INIT_HLIST_HEAD(&s->s_anon);
- 	init_rwsem(&s->s_umount);
- 	sema_init(&s->s_lock, 1);
-@@ -116,6 +119,27 @@ int __put_super(struct super_block *sb)
- 	return ret;
- }
- 
-+/*
-+ * Drop a superblock's refcount.
-+ * Returns non-zero if the superblock is about to be destroyed and
-+ * at least is already removed from super_blocks list, so if we are
-+ * making a loop through super blocks then we need to restart.
-+ * The caller must hold sb_lock.
-+ */
-+int __put_super_and_need_restart(struct super_block *sb)
-+{
-+	/* check for race with generic_shutdown_super() */
-+	if (list_empty(&sb->s_list)) {
-+		/* super block is removed, need to restart... */
-+		__put_super(sb);
-+		return 1;
-+	}
-+	/* can't be the last, since s_list is still in use */
-+	sb->s_count--;
-+	BUG_ON(sb->s_count == 0);
-+	return 0;
-+}
-+
- /**
-  *	put_super	- drop a temporary reference to superblock
-  *	@s: superblock in question
-@@ -205,14 +229,15 @@ void generic_shutdown_super(struct super
- 	if (root) {
- 		sb->s_root = NULL;
- 		shrink_dcache_parent(root);
--		shrink_dcache_anon(&sb->s_anon);
-+		shrink_dcache_anon(sb);
- 		dput(root);
-+		dcache_shrinker_wait_sb(sb);
- 		fsync_super(sb);
- 		lock_super(sb);
- 		lock_kernel();
- 		sb->s_flags &= ~MS_ACTIVE;
- 		/* bad name - it should be evict_inodes() */
--		invalidate_inodes(sb);
-+		invalidate_inodes(sb, 0);
- 
- 		if (sop->write_super && sb->s_dirt)
- 			sop->write_super(sb);
-@@ -220,16 +245,16 @@ void generic_shutdown_super(struct super
- 			sop->put_super(sb);
- 
- 		/* Forget any remaining inodes */
--		if (invalidate_inodes(sb)) {
--			printk("VFS: Busy inodes after unmount. "
--				"Self-destruct in 5 seconds. Have a nice day...\n");
--		}
-+		if (invalidate_inodes(sb, 1))
-+			printk("Self-destruct in 5 seconds. "
-+				"Have a nice day...\n");
- 
- 		unlock_kernel();
- 		unlock_super(sb);
- 	}
- 	spin_lock(&sb_lock);
--	list_del(&sb->s_list);
-+	/* should be initialized for __put_super_and_need_restart() */
-+	list_del_init(&sb->s_list);
- 	list_del(&sb->s_instances);
- 	spin_unlock(&sb_lock);
- 	up_write(&sb->s_umount);
-@@ -282,7 +307,7 @@ retry:
- 	}
- 	s->s_type = type;
- 	strlcpy(s->s_id, type->name, sizeof(s->s_id));
--	list_add(&s->s_list, super_blocks.prev);
-+	list_add_tail(&s->s_list, &super_blocks);
- 	list_add(&s->s_instances, &type->fs_supers);
- 	spin_unlock(&sb_lock);
- 	get_filesystem(type);
-@@ -315,20 +340,22 @@ static inline void write_super(struct su
-  */
- void sync_supers(void)
- {
--	struct super_block * sb;
--restart:
-+	struct super_block *sb;
-+
- 	spin_lock(&sb_lock);
--	sb = sb_entry(super_blocks.next);
--	while (sb != sb_entry(&super_blocks))
-+restart:
-+	list_for_each_entry(sb, &super_blocks, s_list) {
- 		if (sb->s_dirt) {
- 			sb->s_count++;
- 			spin_unlock(&sb_lock);
- 			down_read(&sb->s_umount);
- 			write_super(sb);
--			drop_super(sb);
--			goto restart;
--		} else
--			sb = sb_entry(sb->s_list.next);
-+			up_read(&sb->s_umount);
-+			spin_lock(&sb_lock);
-+			if (__put_super_and_need_restart(sb))
-+				goto restart;
-+		}
-+	}
- 	spin_unlock(&sb_lock);
- }
- 
-@@ -355,20 +382,16 @@ void sync_filesystems(int wait)
- 
- 	down(&mutex);		/* Could be down_interruptible */
- 	spin_lock(&sb_lock);
--	for (sb = sb_entry(super_blocks.next); sb != sb_entry(&super_blocks);
--			sb = sb_entry(sb->s_list.next)) {
-+	list_for_each_entry(sb, &super_blocks, s_list) {
- 		if (!sb->s_op->sync_fs)
- 			continue;
- 		if (sb->s_flags & MS_RDONLY)
- 			continue;
- 		sb->s_need_sync_fs = 1;
- 	}
--	spin_unlock(&sb_lock);
- 
- restart:
--	spin_lock(&sb_lock);
--	for (sb = sb_entry(super_blocks.next); sb != sb_entry(&super_blocks);
--			sb = sb_entry(sb->s_list.next)) {
-+	list_for_each_entry(sb, &super_blocks, s_list) {
- 		if (!sb->s_need_sync_fs)
- 			continue;
- 		sb->s_need_sync_fs = 0;
-@@ -379,8 +402,11 @@ restart:
- 		down_read(&sb->s_umount);
- 		if (sb->s_root && (wait || sb->s_dirt))
- 			sb->s_op->sync_fs(sb, wait);
--		drop_super(sb);
--		goto restart;
-+		up_read(&sb->s_umount);
-+		/* restart only when sb is no longer on the list */
-+		spin_lock(&sb_lock);
-+		if (__put_super_and_need_restart(sb))
-+			goto restart;
- 	}
- 	spin_unlock(&sb_lock);
- 	up(&mutex);
-@@ -396,20 +422,20 @@ restart:
- 
- struct super_block * get_super(struct block_device *bdev)
- {
--	struct list_head *p;
-+	struct super_block *sb;
-+
- 	if (!bdev)
- 		return NULL;
- rescan:
- 	spin_lock(&sb_lock);
--	list_for_each(p, &super_blocks) {
--		struct super_block *s = sb_entry(p);
--		if (s->s_bdev == bdev) {
--			s->s_count++;
-+	list_for_each_entry(sb, &super_blocks, s_list) {
-+		if (sb->s_bdev == bdev) {
-+			sb->s_count++;
- 			spin_unlock(&sb_lock);
--			down_read(&s->s_umount);
--			if (s->s_root)
--				return s;
--			drop_super(s);
-+			down_read(&sb->s_umount);
-+			if (sb->s_root)
-+				return sb;
-+			drop_super(sb);
- 			goto rescan;
- 		}
- 	}
-@@ -421,19 +447,18 @@ EXPORT_SYMBOL(get_super);
- 
- struct super_block * user_get_super(dev_t dev)
- {
--	struct list_head *p;
-+	struct super_block *sb;
- 
- rescan:
- 	spin_lock(&sb_lock);
--	list_for_each(p, &super_blocks) {
--		struct super_block *s = sb_entry(p);
--		if (s->s_dev == dev) {
--			s->s_count++;
-+	list_for_each_entry(sb, &super_blocks, s_list) {
-+		if (sb->s_dev == dev) {
-+			sb->s_count++;
- 			spin_unlock(&sb_lock);
--			down_read(&s->s_umount);
--			if (s->s_root)
--				return s;
--			drop_super(s);
-+			down_read(&sb->s_umount);
-+			if (sb->s_root)
-+				return sb;
-+			drop_super(sb);
- 			goto rescan;
- 		}
- 	}
-@@ -448,11 +473,20 @@ asmlinkage long sys_ustat(unsigned dev,
- 	struct super_block *s;
- 	struct ustat tmp;
- 	struct kstatfs sbuf;
--	int err = -EINVAL;
-+	dev_t kdev;
-+	int err;
-+
-+	kdev = new_decode_dev(dev);
-+#ifdef CONFIG_VE
-+	err = get_device_perms_ve(S_IFBLK, kdev, FMODE_READ);
-+	if (err)
-+		goto out;
-+#endif
- 
--	s = user_get_super(new_decode_dev(dev));
--	if (s == NULL)
--		goto out;
-+	err = -EINVAL;
-+	s = user_get_super(kdev);
-+	if (s == NULL)
-+		goto out;
- 	err = vfs_statfs(s, &sbuf);
- 	drop_super(s);
- 	if (err)
-@@ -566,6 +600,13 @@ void emergency_remount(void)
- static struct idr unnamed_dev_idr;
- static spinlock_t unnamed_dev_lock = SPIN_LOCK_UNLOCKED;/* protects the above */
- 
-+/* for compatibility with coreutils still unaware of new minor sizes */
-+int unnamed_dev_majors[] = {
-+	0, 144, 145, 146, 242, 243, 244, 245,
-+	246, 247, 248, 249, 250, 251, 252, 253
-+};
-+EXPORT_SYMBOL(unnamed_dev_majors);
-+
- int set_anon_super(struct super_block *s, void *data)
- {
- 	int dev;
-@@ -583,13 +624,13 @@ int set_anon_super(struct super_block *s
- 	else if (error)
- 		return -EAGAIN;
- 
--	if ((dev & MAX_ID_MASK) == (1 << MINORBITS)) {
-+	if ((dev & MAX_ID_MASK) >= (1 << MINORBITS)) {
- 		spin_lock(&unnamed_dev_lock);
- 		idr_remove(&unnamed_dev_idr, dev);
- 		spin_unlock(&unnamed_dev_lock);
- 		return -EMFILE;
- 	}
--	s->s_dev = MKDEV(0, dev & MINORMASK);
-+	s->s_dev = make_unnamed_dev(dev);
- 	return 0;
- }
- 
-@@ -597,8 +638,9 @@ EXPORT_SYMBOL(set_anon_super);
- 
- void kill_anon_super(struct super_block *sb)
- {
--	int slot = MINOR(sb->s_dev);
-+	int slot;
- 
-+	slot = unnamed_dev_idx(sb->s_dev);
- 	generic_shutdown_super(sb);
- 	spin_lock(&unnamed_dev_lock);
- 	idr_remove(&unnamed_dev_idr, slot);
-@@ -754,17 +796,14 @@ struct super_block *get_sb_single(struct
- EXPORT_SYMBOL(get_sb_single);
- 
- struct vfsmount *
--do_kern_mount(const char *fstype, int flags, const char *name, void *data)
-+do_kern_mount(struct file_system_type *type, int flags,
-+		const char *name, void *data)
- {
--	struct file_system_type *type = get_fs_type(fstype);
- 	struct super_block *sb = ERR_PTR(-ENOMEM);
- 	struct vfsmount *mnt;
- 	int error;
- 	char *secdata = NULL;
- 
--	if (!type)
--		return ERR_PTR(-ENODEV);
--
- 	mnt = alloc_vfsmnt(name);
- 	if (!mnt)
- 		goto out;
-@@ -795,7 +834,6 @@ do_kern_mount(const char *fstype, int fl
- 	mnt->mnt_parent = mnt;
- 	mnt->mnt_namespace = current->namespace;
- 	up_write(&sb->s_umount);
--	put_filesystem(type);
- 	return mnt;
- out_sb:
- 	up_write(&sb->s_umount);
-@@ -806,7 +844,6 @@ out_free_secdata:
- out_mnt:
- 	free_vfsmnt(mnt);
- out:
--	put_filesystem(type);
- 	return (struct vfsmount *)sb;
- }
- 
-@@ -814,7 +851,7 @@ EXPORT_SYMBOL_GPL(do_kern_mount);
- 
- struct vfsmount *kern_mount(struct file_system_type *type)
- {
--	return do_kern_mount(type->name, 0, type->name, NULL);
-+	return do_kern_mount(type, 0, type->name, NULL);
- }
- 
- EXPORT_SYMBOL(kern_mount);
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/bin.c linux-2.6.8.1-ve022stab078/fs/sysfs/bin.c
---- linux-2.6.8.1.orig/fs/sysfs/bin.c	2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/bin.c	2006-05-11 13:05:42.000000000 +0400
-@@ -162,6 +162,11 @@ int sysfs_create_bin_file(struct kobject
- 	struct dentry * parent;
- 	int error = 0;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
-+
- 	if (!kobj || !attr)
- 		return -EINVAL;
- 
-@@ -195,6 +200,10 @@ int sysfs_create_bin_file(struct kobject
- 
- int sysfs_remove_bin_file(struct kobject * kobj, struct bin_attribute * attr)
- {
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	sysfs_hash_and_remove(kobj->dentry,attr->attr.name);
- 	return 0;
- }
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/dir.c linux-2.6.8.1-ve022stab078/fs/sysfs/dir.c
---- linux-2.6.8.1.orig/fs/sysfs/dir.c	2004-08-14 14:55:19.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/dir.c	2006-05-11 13:05:42.000000000 +0400
-@@ -63,13 +63,17 @@ int sysfs_create_dir(struct kobject * ko
- 	struct dentry * parent;
- 	int error = 0;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	if (!kobj)
- 		return -EINVAL;
- 
- 	if (kobj->parent)
- 		parent = kobj->parent->dentry;
--	else if (sysfs_mount && sysfs_mount->mnt_sb)
--		parent = sysfs_mount->mnt_sb->s_root;
-+	else if (visible_sysfs_mount && visible_sysfs_mount->mnt_sb)
-+		parent = visible_sysfs_mount->mnt_sb->s_root;
- 	else
- 		return -EFAULT;
- 
-@@ -113,9 +117,14 @@ void sysfs_remove_subdir(struct dentry *
- void sysfs_remove_dir(struct kobject * kobj)
- {
- 	struct list_head * node;
--	struct dentry * dentry = dget(kobj->dentry);
-+	struct dentry * dentry;
- 
--	if (!dentry)
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return;
-+#endif
-+	dentry = dget(kobj->dentry);
-+	if (!dentry)
- 		return;
- 
- 	pr_debug("sysfs %s: removing dir\n",dentry->d_name.name);
-@@ -129,6 +138,7 @@ restart:
- 
- 		node = node->next;
- 		pr_debug(" o %s (%d): ",d->d_name.name,atomic_read(&d->d_count));
-+		spin_lock(&d->d_lock);
- 		if (!d_unhashed(d) && (d->d_inode)) {
- 			d = dget_locked(d);
- 			pr_debug("removing");
-@@ -137,6 +147,7 @@ restart:
- 			 * Unlink and unhash.
- 			 */
- 			__d_drop(d);
-+			spin_unlock(&d->d_lock);
- 			spin_unlock(&dcache_lock);
- 			/* release the target kobject in case of
- 			 * a symlink
-@@ -151,6 +162,7 @@ restart:
- 			/* re-acquired dcache_lock, need to restart */
- 			goto restart;
- 		}
-+		spin_unlock(&d->d_lock);
- 	}
- 	spin_unlock(&dcache_lock);
- 	up(&dentry->d_inode->i_sem);
-@@ -167,6 +179,10 @@ int sysfs_rename_dir(struct kobject * ko
- 	int error = 0;
- 	struct dentry * new_dentry, * parent;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	if (!strcmp(kobject_name(kobj), new_name))
- 		return -EINVAL;
- 
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/file.c linux-2.6.8.1-ve022stab078/fs/sysfs/file.c
---- linux-2.6.8.1.orig/fs/sysfs/file.c	2004-08-14 14:54:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/file.c	2006-05-11 13:05:42.000000000 +0400
-@@ -228,13 +228,14 @@ static ssize_t
- sysfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
- {
- 	struct sysfs_buffer * buffer = file->private_data;
-+	ssize_t len;
- 
--	count = fill_write_buffer(buffer,buf,count);
--	if (count > 0)
--		count = flush_write_buffer(file,buffer,count);
--	if (count > 0)
--		*ppos += count;
--	return count;
-+	len = fill_write_buffer(buffer, buf, count);
-+	if (len > 0)
-+		len = flush_write_buffer(file, buffer, len);
-+	if (len > 0)
-+		*ppos += len;
-+	return len;
- }
- 
- static int check_perm(struct inode * inode, struct file * file)
-@@ -375,6 +376,10 @@ int sysfs_add_file(struct dentry * dir,
- 
- int sysfs_create_file(struct kobject * kobj, const struct attribute * attr)
- {
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	if (kobj && attr)
- 		return sysfs_add_file(kobj->dentry,attr);
- 	return -EINVAL;
-@@ -395,6 +400,10 @@ int sysfs_update_file(struct kobject * k
- 	struct dentry * victim;
- 	int res = -ENOENT;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	down(&dir->d_inode->i_sem);
- 	victim = sysfs_get_dentry(dir, attr->name);
- 	if (!IS_ERR(victim)) {
-@@ -432,6 +441,10 @@ int sysfs_update_file(struct kobject * k
- 
- void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr)
- {
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return;
-+#endif
- 	sysfs_hash_and_remove(kobj->dentry,attr->name);
- }
- 
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/group.c linux-2.6.8.1-ve022stab078/fs/sysfs/group.c
---- linux-2.6.8.1.orig/fs/sysfs/group.c	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/group.c	2006-05-11 13:05:42.000000000 +0400
-@@ -45,6 +45,10 @@ int sysfs_create_group(struct kobject *
- 	struct dentry * dir;
- 	int error;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	if (grp->name) {
- 		error = sysfs_create_subdir(kobj,grp->name,&dir);
- 		if (error)
-@@ -65,6 +69,10 @@ void sysfs_remove_group(struct kobject *
- {
- 	struct dentry * dir;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return;
-+#endif
- 	if (grp->name)
- 		dir = sysfs_get_dentry(kobj->dentry,grp->name);
- 	else
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/inode.c linux-2.6.8.1-ve022stab078/fs/sysfs/inode.c
---- linux-2.6.8.1.orig/fs/sysfs/inode.c	2004-08-14 14:56:00.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/inode.c	2006-05-11 13:05:42.000000000 +0400
-@@ -8,10 +8,17 @@
- 
- #undef DEBUG
- 
-+#include <linux/config.h>
- #include <linux/pagemap.h>
- #include <linux/namei.h>
- #include <linux/backing-dev.h>
--extern struct super_block * sysfs_sb;
-+
-+#ifndef CONFIG_VE
-+extern struct super_block *sysfs_sb;
-+#define visible_sysfs_sb	sysfs_sb
-+#else
-+#define visible_sysfs_sb	(get_exec_env()->sysfs_sb)
-+#endif
- 
- static struct address_space_operations sysfs_aops = {
- 	.readpage	= simple_readpage,
-@@ -26,7 +33,7 @@ static struct backing_dev_info sysfs_bac
- 
- struct inode * sysfs_new_inode(mode_t mode)
- {
--	struct inode * inode = new_inode(sysfs_sb);
-+	struct inode * inode = new_inode(visible_sysfs_sb);
- 	if (inode) {
- 		inode->i_mode = mode;
- 		inode->i_uid = current->fsuid;
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/mount.c linux-2.6.8.1-ve022stab078/fs/sysfs/mount.c
---- linux-2.6.8.1.orig/fs/sysfs/mount.c	2004-08-14 14:56:00.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/mount.c	2006-05-11 13:05:42.000000000 +0400
-@@ -7,6 +7,7 @@
- #include <linux/fs.h>
- #include <linux/mount.h>
- #include <linux/pagemap.h>
-+#include <linux/module.h>
- #include <linux/init.h>
- 
- #include "sysfs.h"
-@@ -17,6 +18,15 @@
- struct vfsmount *sysfs_mount;
- struct super_block * sysfs_sb = NULL;
- 
-+void prepare_sysfs(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->sysfs_mnt = sysfs_mount;
-+	sysfs_mount = (struct vfsmount *)SYSFS_MAGIC;
-+	/* ve0.sysfs_sb is setup by sysfs_fill_super() */
-+#endif
-+}
-+
- static struct super_operations sysfs_ops = {
- 	.statfs		= simple_statfs,
- 	.drop_inode	= generic_delete_inode,
-@@ -31,7 +41,7 @@ static int sysfs_fill_super(struct super
- 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
- 	sb->s_magic = SYSFS_MAGIC;
- 	sb->s_op = &sysfs_ops;
--	sysfs_sb = sb;
-+	visible_sysfs_sb = sb;
- 
- 	inode = sysfs_new_inode(S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
- 	if (inode) {
-@@ -60,12 +70,14 @@ static struct super_block *sysfs_get_sb(
- 	return get_sb_single(fs_type, flags, data, sysfs_fill_super);
- }
- 
--static struct file_system_type sysfs_fs_type = {
-+struct file_system_type sysfs_fs_type = {
- 	.name		= "sysfs",
- 	.get_sb		= sysfs_get_sb,
- 	.kill_sb	= kill_litter_super,
- };
- 
-+EXPORT_SYMBOL(sysfs_fs_type);
-+
- int __init sysfs_init(void)
- {
- 	int err;
-@@ -79,5 +91,6 @@ int __init sysfs_init(void)
- 			sysfs_mount = NULL;
- 		}
- 	}
-+	prepare_sysfs();
- 	return err;
- }
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/symlink.c linux-2.6.8.1-ve022stab078/fs/sysfs/symlink.c
---- linux-2.6.8.1.orig/fs/sysfs/symlink.c	2004-08-14 14:55:31.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/symlink.c	2006-05-11 13:05:42.000000000 +0400
-@@ -65,6 +65,10 @@ int sysfs_create_link(struct kobject * k
- 	struct dentry * d;
- 	int error = 0;
- 
-+#ifdef CONFIG_VE
-+	if (!get_exec_env()->sysfs_sb)
-+		return 0;
-+#endif
- 	down(&dentry->d_inode->i_sem);
- 	d = sysfs_get_dentry(dentry,name);
- 	if (!IS_ERR(d)) {
-@@ -90,6 +94,10 @@ int sysfs_create_link(struct kobject * k
- 
- void sysfs_remove_link(struct kobject * kobj, char * name)
- {
-+#ifdef CONFIG_VE
-+	if(!get_exec_env()->sysfs_sb)
-+		return;
-+#endif
- 	sysfs_hash_and_remove(kobj->dentry,name);
- }
- 
-diff -uprN linux-2.6.8.1.orig/fs/sysfs/sysfs.h linux-2.6.8.1-ve022stab078/fs/sysfs/sysfs.h
---- linux-2.6.8.1.orig/fs/sysfs/sysfs.h	2004-08-14 14:55:59.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysfs/sysfs.h	2006-05-11 13:05:42.000000000 +0400
-@@ -1,5 +1,13 @@
- 
--extern struct vfsmount * sysfs_mount;
-+#ifndef CONFIG_VE
-+extern struct vfsmount *sysfs_mount;
-+extern struct super_block *sysfs_sb;
-+#define visible_sysfs_mount	sysfs_mount
-+#define visible_sysfs_sb	sysfs_sb
-+#else
-+#define visible_sysfs_mount	(get_exec_env()->sysfs_mnt)
-+#define visible_sysfs_sb	(get_exec_env()->sysfs_sb)
-+#endif
- 
- extern struct inode * sysfs_new_inode(mode_t mode);
- extern int sysfs_create(struct dentry *, int mode, int (*init)(struct inode *));
-diff -uprN linux-2.6.8.1.orig/fs/sysv/inode.c linux-2.6.8.1-ve022stab078/fs/sysv/inode.c
---- linux-2.6.8.1.orig/fs/sysv/inode.c	2004-08-14 14:54:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysv/inode.c	2006-05-11 13:05:35.000000000 +0400
-@@ -260,13 +260,14 @@ static struct buffer_head * sysv_update_
- 	return bh;
- }
- 
--void sysv_write_inode(struct inode * inode, int wait)
-+int sysv_write_inode(struct inode * inode, int wait)
- {
- 	struct buffer_head *bh;
- 	lock_kernel();
- 	bh = sysv_update_inode(inode);
- 	brelse(bh);
- 	unlock_kernel();
-+	return 0;
- }
- 
- int sysv_sync_inode(struct inode * inode)
-diff -uprN linux-2.6.8.1.orig/fs/sysv/namei.c linux-2.6.8.1-ve022stab078/fs/sysv/namei.c
---- linux-2.6.8.1.orig/fs/sysv/namei.c	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysv/namei.c	2006-05-11 13:05:32.000000000 +0400
-@@ -114,7 +114,7 @@ static int sysv_symlink(struct inode * d
- 		goto out;
- 
- 	sysv_set_inode(inode, 0);
--	err = page_symlink(inode, symname, l);
-+	err = page_symlink(inode, symname, l, GFP_KERNEL);
- 	if (err)
- 		goto out_fail;
- 
-diff -uprN linux-2.6.8.1.orig/fs/sysv/sysv.h linux-2.6.8.1-ve022stab078/fs/sysv/sysv.h
---- linux-2.6.8.1.orig/fs/sysv/sysv.h	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/sysv/sysv.h	2006-05-11 13:05:35.000000000 +0400
-@@ -134,7 +134,7 @@ extern unsigned long sysv_count_free_blo
- extern void sysv_truncate(struct inode *);
- 
- /* inode.c */
--extern void sysv_write_inode(struct inode *, int);
-+extern int sysv_write_inode(struct inode *, int);
- extern int sysv_sync_inode(struct inode *);
- extern int sysv_sync_file(struct file *, struct dentry *, int);
- extern void sysv_set_inode(struct inode *, dev_t);
-diff -uprN linux-2.6.8.1.orig/fs/udf/file.c linux-2.6.8.1-ve022stab078/fs/udf/file.c
---- linux-2.6.8.1.orig/fs/udf/file.c	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/udf/file.c	2006-05-11 13:05:35.000000000 +0400
-@@ -188,7 +188,7 @@ int udf_ioctl(struct inode *inode, struc
- {
- 	int result = -EINVAL;
- 
--	if ( permission(inode, MAY_READ, NULL) != 0 )
-+	if ( permission(inode, MAY_READ, NULL, NULL) != 0 )
- 	{
- 		udf_debug("no permission to access inode %lu\n",
- 						inode->i_ino);
-diff -uprN linux-2.6.8.1.orig/fs/udf/inode.c linux-2.6.8.1-ve022stab078/fs/udf/inode.c
---- linux-2.6.8.1.orig/fs/udf/inode.c	2004-08-14 14:56:24.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/udf/inode.c	2006-05-11 13:05:35.000000000 +0400
-@@ -1313,11 +1313,13 @@ udf_convert_permissions(struct fileEntry
-  *	Written, tested, and released.
-  */
- 
--void udf_write_inode(struct inode * inode, int sync)
-+int udf_write_inode(struct inode * inode, int sync)
- {
-+	int ret;
- 	lock_kernel();
--	udf_update_inode(inode, sync);
-+	ret = udf_update_inode(inode, sync);
- 	unlock_kernel();
-+	return ret;
- }
- 
- int udf_sync_inode(struct inode * inode)
-diff -uprN linux-2.6.8.1.orig/fs/udf/udfdecl.h linux-2.6.8.1-ve022stab078/fs/udf/udfdecl.h
---- linux-2.6.8.1.orig/fs/udf/udfdecl.h	2004-08-14 14:55:20.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/udf/udfdecl.h	2006-05-11 13:05:35.000000000 +0400
-@@ -100,7 +100,7 @@ extern void udf_read_inode(struct inode
- extern void udf_put_inode(struct inode *);
- extern void udf_delete_inode(struct inode *);
- extern void udf_clear_inode(struct inode *);
--extern void udf_write_inode(struct inode *, int);
-+extern int udf_write_inode(struct inode *, int);
- extern long udf_block_map(struct inode *, long);
- extern int8_t inode_bmap(struct inode *, int, lb_addr *, uint32_t *, lb_addr *, uint32_t *, uint32_t *, struct buffer_head **);
- extern int8_t udf_add_aext(struct inode *, lb_addr *, int *, lb_addr, uint32_t, struct buffer_head **, int);
-diff -uprN linux-2.6.8.1.orig/fs/ufs/inode.c linux-2.6.8.1-ve022stab078/fs/ufs/inode.c
---- linux-2.6.8.1.orig/fs/ufs/inode.c	2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/ufs/inode.c	2006-05-11 13:05:35.000000000 +0400
-@@ -788,11 +788,13 @@ static int ufs_update_inode(struct inode
- 	return 0;
- }
- 
--void ufs_write_inode (struct inode * inode, int wait)
-+int ufs_write_inode (struct inode * inode, int wait)
- {
-+	int ret;
- 	lock_kernel();
--	ufs_update_inode (inode, wait);
-+	ret = ufs_update_inode (inode, wait);
- 	unlock_kernel();
-+	return ret;
- }
- 
- int ufs_sync_inode (struct inode *inode)
-diff -uprN linux-2.6.8.1.orig/fs/ufs/namei.c linux-2.6.8.1-ve022stab078/fs/ufs/namei.c
---- linux-2.6.8.1.orig/fs/ufs/namei.c	2004-08-14 14:56:24.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/ufs/namei.c	2006-05-11 13:05:32.000000000 +0400
-@@ -156,7 +156,7 @@ static int ufs_symlink (struct inode * d
- 		/* slow symlink */
- 		inode->i_op = &page_symlink_inode_operations;
- 		inode->i_mapping->a_ops = &ufs_aops;
--		err = page_symlink(inode, symname, l);
-+		err = page_symlink(inode, symname, l, GFP_KERNEL);
- 		if (err)
- 			goto out_fail;
- 	} else {
-diff -uprN linux-2.6.8.1.orig/fs/umsdos/inode.c linux-2.6.8.1-ve022stab078/fs/umsdos/inode.c
---- linux-2.6.8.1.orig/fs/umsdos/inode.c	2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/umsdos/inode.c	2006-05-11 13:05:35.000000000 +0400
-@@ -312,11 +312,12 @@ out:
- /*
-  * Update the disk with the inode content
-  */
--void UMSDOS_write_inode (struct inode *inode, int wait)
-+int UMSDOS_write_inode (struct inode *inode, int wait)
- {
- 	struct iattr newattrs;
-+	int ret;
- 
--	fat_write_inode (inode, wait);
-+	ret = fat_write_inode (inode, wait);
- 	newattrs.ia_mtime = inode->i_mtime;
- 	newattrs.ia_atime = inode->i_atime;
- 	newattrs.ia_ctime = inode->i_ctime;
-@@ -330,6 +331,7 @@ void UMSDOS_write_inode (struct inode *i
- 	 * UMSDOS_notify_change (inode, &newattrs);
- 
- 	 * inode->i_state &= ~I_DIRTY; / * FIXME: this doesn't work.  We need to remove ourselves from list on dirty inodes. /mn/ */
-+	return ret;
- }
- 
- 
-diff -uprN linux-2.6.8.1.orig/fs/umsdos/namei.c linux-2.6.8.1-ve022stab078/fs/umsdos/namei.c
---- linux-2.6.8.1.orig/fs/umsdos/namei.c	2004-08-14 14:55:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/fs/umsdos/namei.c	2006-05-11 13:05:32.000000000 +0400
-@@ -499,7 +499,7 @@ static int umsdos_symlink_x (struct inod
- 	}
- 
- 	len = strlen (symname) + 1;
--	ret = page_symlink(dentry->d_inode, symname, len);
-+	ret = page_symlink(dentry->d_inode, symname, len, GFP_KERNEL);
- 	if (ret < 0)
- 		goto out_unlink;
- out:
-diff -uprN linux-2.6.8.1.orig/fs/vzdq_file.c linux-2.6.8.1-ve022stab078/fs/vzdq_file.c
---- linux-2.6.8.1.orig/fs/vzdq_file.c	1970-01-01 03:00:00.000000000 +0300
-+++ linux-2.6.8.1-ve022stab078/fs/vzdq_file.c	2006-05-11 13:05:44.000000000 +0400
-@@ -0,0 +1,851 @@
-+/*
-+ *
-+ * Copyright (C) 2005 SWsoft
-+ * All rights reserved.
-+ *
-+ * Licensing governed by "linux/COPYING.SWsoft" file.
-+ *
-+ * This file contains Virtuozzo quota files as proc entry implementation.
-+ * It is required for std quota tools to work correctly as they are expecting
-+ * aquota.user and aquota.group files.
-+ */
-+
-+#include <linux/ctype.h>
-+#include <linux/slab.h>
-+#include <linux/list.h>
-+#include <linux/module.h>
-+#include <linux/proc_fs.h>
-+#include <linux/sysctl.h>
-+#include <linux/mount.h>
-+#include <linux/namespace.h>
-+#include <linux/quotaio_v2.h>
-+#include <asm/uaccess.h>
-+
-+#include <linux/ve.h>
-+#include <linux/ve_proto.h>
-+#include <linux/vzdq_tree.h>
-+#include <linux/vzquota.h>
-+
-+/* ----------------------------------------------------------------------
-+ *
-+ * File read operation
-+ *
-+ * FIXME: functions in this section (as well as many functions in vzdq_ugid.c,
-+ * perhaps) abuse vz_quota_sem.
-+ * Taking a global semaphore for lengthy and user-controlled operations inside
-+ * VPSs is not a good idea in general.
-+ * In this case, the reasons for taking this semaphore are completely unclear,
-+ * especially taking into account that the only function that has comments
-+ * about the necessity to be called under this semaphore
-+ * (create_proc_quotafile) is actually called OUTSIDE it.
-+ *
-+ * --------------------------------------------------------------------- */
-+
-+#define DQBLOCK_SIZE		1024
-+#define DQUOTBLKNUM		21U
-+#define DQTREE_DEPTH		4
-+#define TREENUM_2_BLKNUM(num)	(((num) + 1) << 1)
-+#define ISINDBLOCK(num)		((num)%2 != 0)
-+#define FIRST_DATABLK		2 /* first even number */
-+#define LAST_IND_LEVEL		(DQTREE_DEPTH - 1)
-+#define CONVERT_LEVEL(level)	((level) * (QUOTAID_EBITS/QUOTAID_BBITS))
-+#define GETLEVINDX(ind, lev)	(((ind) >> QUOTAID_BBITS*(lev)) \
-+					& QUOTATREE_BMASK)
-+
-+#if (QUOTAID_EBITS / QUOTAID_BBITS) != (QUOTATREE_DEPTH / DQTREE_DEPTH)
-+#error xBITS and DQTREE_DEPTH does not correspond
-+#endif
-+
-+#define BLOCK_NOT_FOUND		1
-+
-+/* data for quota file -- one per proc entry */
-+struct quotatree_data {
-+	struct list_head	list;
-+	struct vz_quota_master	*qmblk;
-+	int			type;	/* type of the tree */
-+};
-+
-+/* serialized by vz_quota_sem */
-+static LIST_HEAD(qf_data_head);
-+
-+static const u_int32_t vzquota_magics[] = V2_INITQMAGICS;
-+static const u_int32_t vzquota_versions[] = V2_INITQVERSIONS;
-+
-+static inline loff_t get_depoff(int depth)
-+{
-+	loff_t res = 1;
-+	while (depth) {
-+		res += (1 << ((depth - 1)*QUOTAID_EBITS + 1));
-+		depth--;
-+	}
-+	return res;
-+}
-+
-+static inline loff_t get_blknum(loff_t num, int depth)
-+{
-+	loff_t res;
-+	res = (num << 1) + get_depoff(depth);
-+	return res;
-+}
-+
-+static int get_depth(loff_t num)
-+{
-+	int i;
-+	for (i = 0; i < DQTREE_DEPTH; i++) {
-+		if (num >= get_depoff(i) && (i == DQTREE_DEPTH - 1
-+					|| num < get_depoff(i + 1)))
-+			return i;
-+	}
-+	return -1;
-+}
-+
-+static inline loff_t get_offset(loff_t num)
-+{
-+	loff_t res, tmp;
-+
-+	tmp = get_depth(num);
-+	if (tmp < 0)
-+		return -1;
-+	num -= get_depoff(tmp);
-+	BUG_ON(num < 0);
-+	res = num >> 1;
-+
-+	return res;
-+}
-+
-+static inline loff_t get_quot_blk_num(struct quotatree_tree *tree, int level)
-+{
-+	/* return maximum available block num */
-+	return tree->levels[level].freenum;
-+}
-+
-+static inline loff_t get_block_num(struct quotatree_tree *tree)
-+{
-+	loff_t ind_blk_num, quot_blk_num, max_ind, max_quot;
-+
-+	quot_blk_num = get_quot_blk_num(tree, CONVERT_LEVEL(DQTREE_DEPTH) - 1);
-+	max_quot = TREENUM_2_BLKNUM(quot_blk_num);
-+	ind_blk_num = get_quot_blk_num(tree, CONVERT_LEVEL(DQTREE_DEPTH - 1));
-+	max_ind = (quot_blk_num) ? get_blknum(ind_blk_num, LAST_IND_LEVEL)
-+				 : get_blknum(ind_blk_num, 0);
-+
-+	return (max_ind > max_quot) ? max_ind + 1 : max_quot + 1;
-+}
-+
-+/* Write quota file header */
-+static int read_header(void *buf, struct quotatree_tree *tree,
-+		struct dq_info *dq_ugid_info, int type)
-+{
-+	struct v2_disk_dqheader *dqh;
-+	struct v2_disk_dqinfo *dq_disk_info;
-+
-+	dqh = buf;
-+	dq_disk_info = buf + sizeof(struct v2_disk_dqheader);
-+
-+	dqh->dqh_magic = vzquota_magics[type];
-+	dqh->dqh_version = vzquota_versions[type];
-+
-+	dq_disk_info->dqi_bgrace = dq_ugid_info[type].bexpire;
-+	dq_disk_info->dqi_igrace = dq_ugid_info[type].iexpire;
-+	dq_disk_info->dqi_flags = 0;	/* no flags */
-+	dq_disk_info->dqi_blocks = get_block_num(tree);
-+	dq_disk_info->dqi_free_blk = 0;	/* first block in the file */
-+	dq_disk_info->dqi_free_entry = FIRST_DATABLK;
-+
-+	return 0;
-+}
-+
-+static int get_block_child(int depth, struct quotatree_node *p, u_int32_t *buf)
-+{
-+	int i, j, lev_num;
-+
-+	lev_num = QUOTATREE_DEPTH/DQTREE_DEPTH - 1;
-+	for (i = 0; i < BLOCK_SIZE/sizeof(u_int32_t); i++) {
-+		struct quotatree_node *next, *parent;
-+
-+		parent = p;
-+		next = p;
-+		for (j = lev_num; j >= 0; j--) {
-+			if (!next->blocks[GETLEVINDX(i,j)]) {
-+				buf[i] = 0;
-+				goto bad_branch;
-+			}
-+			parent = next;
-+			next = next->blocks[GETLEVINDX(i,j)];
-+		}
-+		buf[i] = (depth == DQTREE_DEPTH - 1) ?
-+			TREENUM_2_BLKNUM(parent->num)
-+			: get_blknum(next->num, depth + 1);
-+
-+	bad_branch:
-+		;
-+	}
-+
-+	return 0;
-+}
-+
-+/*
-+ * Write index block to disk (or buffer)
-+ * @buf has length 256*sizeof(u_int32_t) bytes
-+ */
-+static int read_index_block(int num, u_int32_t *buf,
-+		struct quotatree_tree *tree)
-+{
-+	struct quotatree_node *p;
-+	u_int32_t index;
-+	loff_t off;
-+	int depth, res;
-+
-+	res = BLOCK_NOT_FOUND;
-+	index = 0;
-+	depth = get_depth(num);
-+	off = get_offset(num);
-+	if (depth < 0 || off < 0)
-+		return -EINVAL;
-+
-+	list_for_each_entry(p, &tree->levels[CONVERT_LEVEL(depth)].usedlh,
-+			list) {
-+		if (p->num >= off)
-+			res = 0;
-+		if (p->num != off)
-+			continue;
-+		get_block_child(depth, p, buf);
-+		break;
-+	}
-+
-+	return res;
-+}
-+
-+static inline void convert_quot_format(struct v2_disk_dqblk *dq,
-+		struct vz_quota_ugid *vzq)
-+{
-+	dq->dqb_id = vzq->qugid_id;
-+	dq->dqb_ihardlimit = vzq->qugid_stat.ihardlimit;
-+	dq->dqb_isoftlimit = vzq->qugid_stat.isoftlimit;
-+	dq->dqb_curinodes = vzq->qugid_stat.icurrent;
-+	dq->dqb_bhardlimit = vzq->qugid_stat.bhardlimit / QUOTABLOCK_SIZE;
-+	dq->dqb_bsoftlimit = vzq->qugid_stat.bsoftlimit / QUOTABLOCK_SIZE;
-+	dq->dqb_curspace = vzq->qugid_stat.bcurrent;
-+	dq->dqb_btime = vzq->qugid_stat.btime;
-+	dq->dqb_itime = vzq->qugid_stat.itime;
-+}
-+
-+static int read_dquot(loff_t num, void *buf, struct quotatree_tree *tree)
-+{
-+	int res, i, entries = 0;
-+	struct v2_disk_dqdbheader *dq_header;
-+	struct quotatree_node *p;
-+	struct v2_disk_dqblk *blk = buf + sizeof(struct v2_disk_dqdbheader);
-+
-+	res = BLOCK_NOT_FOUND;
-+	dq_header = buf;
-+	memset(dq_header, 0, sizeof(*dq_header));
-+
-+	list_for_each_entry(p, &(tree->levels[QUOTATREE_DEPTH - 1].usedlh),
-+			list) {
-+		if (TREENUM_2_BLKNUM(p->num) >= num)
-+			res = 0;
-+		if (TREENUM_2_BLKNUM(p->num) != num)
-+			continue;
-+
-+		for (i = 0; i < QUOTATREE_BSIZE; i++) {
-+			if (!p->blocks[i])
-+				continue;
-+			convert_quot_format(blk + entries,
-+					(struct vz_quota_ugid *)p->blocks[i]);
-+			entries++;
-+			res = 0;
-+		}
-+		break;
-+	}
-+	dq_header->dqdh_entries = entries;
-+
-+	return res;
-+}
-+
-+static int read_block(int num, void *buf, struct quotatree_tree *tree,
-+		struct dq_info *dq_ugid_info, int magic)
-+{
-+	int res;
-+
-+	memset(buf, 0, DQBLOCK_SIZE);
-+	if (!num)
-+		res = read_header(buf, tree, dq_ugid_info, magic);
-+	else if (ISINDBLOCK(num))
-+		res = read_index_block(num, (u_int32_t*)buf, tree);
-+	else
-+		res = read_dquot(num, buf, tree);
-+
-+	return res;
-+}
-+
-+/*
-+ * FIXME: this function can handle quota files up to 2GB only.
-+ */
-+static int read_proc_quotafile(char *page, char **start, off_t off, int count,
-+		int *eof, void *data)
-+{
-+	off_t blk_num, blk_off, buf_off;
-+	char *tmp;
-+	size_t buf_size;
-+	struct quotatree_data *qtd;
-+	struct quotatree_tree *tree;
-+	struct dq_info *dqi;
-+	int res;
-+
-+	tmp = kmalloc(DQBLOCK_SIZE, GFP_KERNEL);
-+	if (!tmp)
-+		return -ENOMEM;
-+
-+	qtd = data;
-+	down(&vz_quota_sem);
-+	down(&qtd->qmblk->dq_sem);
-+
-+	res = 0;
-+	tree = QUGID_TREE(qtd->qmblk, qtd->type);
-+	if (!tree) {
-+		*eof = 1;
-+		goto out_dq;
-+	}
-+
-+	dqi = &qtd->qmblk->dq_ugid_info[qtd->type];
-+
-+	buf_off = 0;
-+	buf_size = count;
-+	blk_num = off / DQBLOCK_SIZE;
-+	blk_off = off % DQBLOCK_SIZE;
-+
-+	while (buf_size > 0) {
-+		off_t len;
-+
-+		len = min((size_t)(DQBLOCK_SIZE-blk_off), buf_size);
-+		res = read_block(blk_num, tmp, tree, dqi, qtd->type);
-+		if (res < 0)
-+			goto out_err;
-+		if (res == BLOCK_NOT_FOUND) {
-+			*eof = 1;
-+			break;
-+		}
-+		memcpy(page + buf_off, tmp + blk_off, len);
-+
-+		blk_num++;
-+		buf_size -= len;
-+		blk_off = 0;
-+		buf_off += len;
-+	}
-+	res = buf_off;
-+
-+out_err:
-+	*start = NULL + count;
-+out_dq:
-+	up(&qtd->qmblk->dq_sem);
-+	up(&vz_quota_sem);
-+	kfree(tmp);
-+
-+	return res;
-+}
-+
-+
-+/* ----------------------------------------------------------------------
-+ *
-+ * /proc/vz/vzaquota/QID/aquota.* files
-+ *
-+ * FIXME: this code lacks serialization of read/readdir/lseek.
-+ * However, this problem should be fixed after the mainstream issue of what
-+ * appears to be non-atomic read and update of file position in sys_read.
-+ *
-+ * --------------------------------------------------------------------- */
-+
-+static inline unsigned long vzdq_aquot_getino(dev_t dev)
-+{
-+	return 0xec000000UL + dev;
-+}
-+
-+static inline dev_t vzdq_aquot_getidev(struct inode *inode)
-+{
-+	return (dev_t)(unsigned long)PROC_I(inode)->op.proc_get_link;
-+}
-+
-+static inline void vzdq_aquot_setidev(struct inode *inode, dev_t dev)
-+{
-+	PROC_I(inode)->op.proc_get_link = (void *)(unsigned long)dev;
-+}
-+
-+static ssize_t vzdq_aquotf_read(struct file *file,
-+		char __user *buf, size_t size, loff_t *ppos)
-+{
-+	char *page;
-+	size_t bufsize;
-+	ssize_t l, l2, copied;
-+	char *start;
-+	struct inode *inode;
-+	struct block_device *bdev;
-+	struct super_block *sb;
-+	struct quotatree_data data;
-+	int eof, err;
-+
-+	err = -ENOMEM;
-+	page = (char *)__get_free_page(GFP_KERNEL);
-+	if (page == NULL)
-+		goto out_err;
-+
-+	err = -ENODEV;
-+	inode = file->f_dentry->d_inode;
-+	bdev = bdget(vzdq_aquot_getidev(inode));
-+	if (bdev == NULL)
-+		goto out_err;
-+	sb = get_super(bdev);
-+	bdput(bdev);
-+	if (sb == NULL)
-+		goto out_err;
-+	data.qmblk = vzquota_find_qmblk(sb);
-+	data.type = PROC_I(inode)->type - 1;
-+	drop_super(sb);
-+	if (data.qmblk == NULL || data.qmblk == VZ_QUOTA_BAD)
-+		goto out_err;
-+
-+	copied = 0;
-+	l = l2 = 0;
-+	while (1) {
-+		bufsize = min(size, (size_t)PAGE_SIZE);
-+		if (bufsize <= 0)
-+			break;
-+
-+		l = read_proc_quotafile(page, &start, *ppos, bufsize,
-+				&eof, &data);
-+		if (l <= 0)
-+			break;
-+
-+		l2 = copy_to_user(buf, page, l);
-+		copied += l - l2;
-+		if (l2)
-+			break;
-+
-+		buf += l;
-+		size -= l;
-+		*ppos += (unsigned long)start;
-+		l = l2 = 0;
-+	}
-+
-+	qmblk_put(data.qmblk);
-+	free_page((unsigned long)page);
-+	if (copied)
-+		return copied;
-+	else if (l2)		/* last copy_to_user
failed */ -+ return -EFAULT; -+ else /* read error or EOF */ -+ return l; -+ -+out_err: -+ if (page != NULL) -+ free_page((unsigned long)page); -+ return err; -+} -+ -+static struct file_operations vzdq_aquotf_file_operations = { -+ .read = &vzdq_aquotf_read, -+}; -+ -+static struct inode_operations vzdq_aquotf_inode_operations = { -+}; -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * /proc/vz/vzaquota/QID directory -+ * -+ * --------------------------------------------------------------------- */ -+ -+static int vzdq_aquotq_readdir(struct file *file, void *data, filldir_t filler) -+{ -+ loff_t n; -+ int err; -+ -+ n = file->f_pos; -+ for (err = 0; !err; n++) { -+ switch (n) { -+ case 0: -+ err = (*filler)(data, ".", 1, n, -+ file->f_dentry->d_inode->i_ino, -+ DT_DIR); -+ break; -+ case 1: -+ err = (*filler)(data, "..", 2, n, -+ parent_ino(file->f_dentry), DT_DIR); -+ break; -+ case 2: -+ err = (*filler)(data, "aquota.user", 11, n, -+ file->f_dentry->d_inode->i_ino -+ + USRQUOTA + 1, -+ DT_REG); -+ break; -+ case 3: -+ err = (*filler)(data, "aquota.group", 12, n, -+ file->f_dentry->d_inode->i_ino -+ + GRPQUOTA + 1, -+ DT_REG); -+ break; -+ default: -+ goto out; -+ } -+ } -+out: -+ file->f_pos = n; -+ return err; -+} -+ -+struct vzdq_aquotq_lookdata { -+ dev_t dev; -+ int type; -+}; -+ -+static int vzdq_aquotq_looktest(struct inode *inode, void *data) -+{ -+ struct vzdq_aquotq_lookdata *d; -+ -+ d = data; -+ return inode->i_op == &vzdq_aquotf_inode_operations && -+ vzdq_aquot_getidev(inode) == d->dev && -+ PROC_I(inode)->type == d->type + 1; -+} -+ -+static int vzdq_aquotq_lookset(struct inode *inode, void *data) -+{ -+ struct vzdq_aquotq_lookdata *d; -+ -+ d = data; -+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; -+ inode->i_ino = vzdq_aquot_getino(d->dev) + d->type + 1; -+ inode->i_mode = S_IFREG | S_IRUSR; -+ inode->i_uid = 0; -+ inode->i_gid = 0; -+ inode->i_nlink = 1; -+ inode->i_op = 
&vzdq_aquotf_inode_operations; -+ inode->i_fop = &vzdq_aquotf_file_operations; -+ PROC_I(inode)->type = d->type + 1; -+ vzdq_aquot_setidev(inode, d->dev); -+ return 0; -+} -+ -+static struct dentry *vzdq_aquotq_lookup(struct inode *dir, -+ struct dentry *dentry, -+ struct nameidata *nd) -+{ -+ struct inode *inode; -+ struct vzdq_aquotq_lookdata d; -+ int k; -+ -+ if (dentry->d_name.len == 11) { -+ if (memcmp(dentry->d_name.name, "aquota.user", 11)) -+ goto out; -+ k = USRQUOTA; -+ } else if (dentry->d_name.len == 12) { -+ if (memcmp(dentry->d_name.name, "aquota.group", 11)) -+ goto out; -+ k = GRPQUOTA; -+ } else -+ goto out; -+ d.dev = vzdq_aquot_getidev(dir); -+ d.type = k; -+ inode = iget5_locked(dir->i_sb, dir->i_ino + k + 1, -+ vzdq_aquotq_looktest, vzdq_aquotq_lookset, &d); -+ if (inode == NULL) -+ goto out; -+ unlock_new_inode(inode); -+ d_add(dentry, inode); -+ return NULL; -+ -+out: -+ return ERR_PTR(-ENOENT); -+} -+ -+static struct file_operations vzdq_aquotq_file_operations = { -+ .read = &generic_read_dir, -+ .readdir = &vzdq_aquotq_readdir, -+}; -+ -+static struct inode_operations vzdq_aquotq_inode_operations = { -+ .lookup = &vzdq_aquotq_lookup, -+}; -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * /proc/vz/vzaquota directory -+ * -+ * --------------------------------------------------------------------- */ -+ -+struct vzdq_aquot_de { -+ struct list_head list; -+ struct vfsmount *mnt; -+}; -+ -+static int vzdq_aquot_buildmntlist(struct ve_struct *ve, -+ struct list_head *head) -+{ -+ struct vfsmount *rmnt, *mnt; -+ struct vzdq_aquot_de *p; -+ int err; -+ -+#ifdef CONFIG_VE -+ rmnt = mntget(ve->fs_rootmnt); -+#else -+ read_lock(¤t->fs->lock); -+ rmnt = mntget(current->fs->rootmnt); -+ read_unlock(¤t->fs->lock); -+#endif -+ mnt = rmnt; -+ down_read(&rmnt->mnt_namespace->sem); -+ while (1) { -+ list_for_each_entry(p, head, list) { -+ if (p->mnt->mnt_sb == mnt->mnt_sb) -+ goto skip; -+ } -+ -+ err = -ENOMEM; -+ p 
= kmalloc(sizeof(*p), GFP_KERNEL); -+ if (p == NULL) -+ goto out; -+ p->mnt = mntget(mnt); -+ list_add_tail(&p->list, head); -+ -+skip: -+ err = 0; -+ if (list_empty(&mnt->mnt_mounts)) { -+ while (1) { -+ if (mnt == rmnt) -+ goto out; -+ if (mnt->mnt_child.next != -+ &mnt->mnt_parent->mnt_mounts) -+ break; -+ mnt = mnt->mnt_parent; -+ } -+ mnt = list_entry(mnt->mnt_child.next, -+ struct vfsmount, mnt_child); -+ } else -+ mnt = list_first_entry(&mnt->mnt_mounts, -+ struct vfsmount, mnt_child); -+ } -+out: -+ up_read(&rmnt->mnt_namespace->sem); -+ mntput(rmnt); -+ return err; -+} -+ -+static void vzdq_aquot_releasemntlist(struct ve_struct *ve, -+ struct list_head *head) -+{ -+ struct vzdq_aquot_de *p; -+ -+ while (!list_empty(head)) { -+ p = list_first_entry(head, typeof(*p), list); -+ mntput(p->mnt); -+ list_del(&p->list); -+ kfree(p); -+ } -+} -+ -+static int vzdq_aquotd_readdir(struct file *file, void *data, filldir_t filler) -+{ -+ struct ve_struct *ve, *old_ve; -+ struct list_head mntlist; -+ struct vzdq_aquot_de *de; -+ struct super_block *sb; -+ struct vz_quota_master *qmblk; -+ loff_t i, n; -+ char buf[24]; -+ int l, err; -+ -+ i = 0; -+ n = file->f_pos; -+ ve = VE_OWNER_FSTYPE(file->f_dentry->d_sb->s_type); -+ old_ve = set_exec_env(ve); -+ -+ INIT_LIST_HEAD(&mntlist); -+#ifdef CONFIG_VE -+ /* -+ * The only reason of disabling readdir for the host system is that -+ * this readdir can be slow and CPU consuming with large number of VPSs -+ * (or just mount points). 
-+ */ -+ err = ve_is_super(ve); -+#else -+ err = 0; -+#endif -+ if (!err) { -+ err = vzdq_aquot_buildmntlist(ve, &mntlist); -+ if (err) -+ goto out_err; -+ } -+ -+ if (i >= n) { -+ if ((*filler)(data, ".", 1, i, -+ file->f_dentry->d_inode->i_ino, DT_DIR)) -+ goto out_fill; -+ } -+ i++; -+ -+ if (i >= n) { -+ if ((*filler)(data, "..", 2, i, -+ parent_ino(file->f_dentry), DT_DIR)) -+ goto out_fill; -+ } -+ i++; -+ -+ list_for_each_entry (de, &mntlist, list) { -+ sb = de->mnt->mnt_sb; -+#ifdef CONFIG_VE -+ if (get_device_perms_ve(S_IFBLK, sb->s_dev, FMODE_QUOTACTL)) -+ continue; -+#endif -+ qmblk = vzquota_find_qmblk(sb); -+ if (qmblk == NULL || qmblk == VZ_QUOTA_BAD) -+ continue; -+ -+ qmblk_put(qmblk); -+ i++; -+ if (i <= n) -+ continue; -+ -+ l = sprintf(buf, "%08x", new_encode_dev(sb->s_dev)); -+ if ((*filler)(data, buf, l, i - 1, -+ vzdq_aquot_getino(sb->s_dev), DT_DIR)) -+ break; -+ } -+ -+out_fill: -+ err = 0; -+ file->f_pos = i; -+out_err: -+ vzdq_aquot_releasemntlist(ve, &mntlist); -+ set_exec_env(old_ve); -+ return err; -+} -+ -+static int vzdq_aquotd_looktest(struct inode *inode, void *data) -+{ -+ return inode->i_op == &vzdq_aquotq_inode_operations && -+ vzdq_aquot_getidev(inode) == (dev_t)(unsigned long)data; -+} -+ -+static int vzdq_aquotd_lookset(struct inode *inode, void *data) -+{ -+ dev_t dev; -+ -+ dev = (dev_t)(unsigned long)data; -+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; -+ inode->i_ino = vzdq_aquot_getino(dev); -+ inode->i_mode = S_IFDIR | S_IRUSR | S_IXUSR; -+ inode->i_uid = 0; -+ inode->i_gid = 0; -+ inode->i_nlink = 2; -+ inode->i_op = &vzdq_aquotq_inode_operations; -+ inode->i_fop = &vzdq_aquotq_file_operations; -+ vzdq_aquot_setidev(inode, dev); -+ return 0; -+} -+ -+static struct dentry *vzdq_aquotd_lookup(struct inode *dir, -+ struct dentry *dentry, -+ struct nameidata *nd) -+{ -+ struct ve_struct *ve, *old_ve; -+ const unsigned char *s; -+ int l; -+ dev_t dev; -+ struct inode *inode; -+ -+ ve = 
VE_OWNER_FSTYPE(dir->i_sb->s_type); -+ old_ve = set_exec_env(ve); -+#ifdef CONFIG_VE -+ /* -+ * Lookup is much lighter than readdir, so it can be allowed for the -+ * host system. But it would be strange to be able to do lookup only -+ * without readdir... -+ */ -+ if (ve_is_super(ve)) -+ goto out; -+#endif -+ -+ dev = 0; -+ l = dentry->d_name.len; -+ if (l <= 0) -+ goto out; -+ for (s = dentry->d_name.name; l > 0; s++, l--) { -+ if (!isxdigit(*s)) -+ goto out; -+ if (dev & ~(~0UL >> 4)) -+ goto out; -+ dev <<= 4; -+ if (isdigit(*s)) -+ dev += *s - '0'; -+ else if (islower(*s)) -+ dev += *s - 'a' + 10; -+ else -+ dev += *s - 'A' + 10; -+ } -+ dev = new_decode_dev(dev); -+ -+#ifdef CONFIG_VE -+ if (get_device_perms_ve(S_IFBLK, dev, FMODE_QUOTACTL)) -+ goto out; -+#endif -+ -+ inode = iget5_locked(dir->i_sb, vzdq_aquot_getino(dev), -+ vzdq_aquotd_looktest, vzdq_aquotd_lookset, -+ (void *)(unsigned long)dev); -+ if (inode == NULL) -+ goto out; -+ unlock_new_inode(inode); -+ -+ d_add(dentry, inode); -+ set_exec_env(old_ve); -+ return NULL; -+ -+out: -+ set_exec_env(old_ve); -+ return ERR_PTR(-ENOENT); -+} -+ -+static struct file_operations vzdq_aquotd_file_operations = { -+ .read = &generic_read_dir, -+ .readdir = &vzdq_aquotd_readdir, -+}; -+ -+static struct inode_operations vzdq_aquotd_inode_operations = { -+ .lookup = &vzdq_aquotd_lookup, -+}; -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Initialization and deinitialization -+ * -+ * --------------------------------------------------------------------- */ -+ -+/* -+ * FIXME: creation of proc entries here is unsafe with respect to module -+ * unloading. 
-+ */ -+void vzaquota_init(void) -+{ -+ struct proc_dir_entry *de; -+ -+ de = create_proc_glob_entry("vz/vzaquota", -+ S_IFDIR | S_IRUSR | S_IXUSR, NULL); -+ if (de != NULL) { -+ de->proc_iops = &vzdq_aquotd_inode_operations; -+ de->proc_fops = &vzdq_aquotd_file_operations; -+ } else -+ printk("VZDQ: vz/vzaquota creation failed\n"); -+#if defined(CONFIG_SYSCTL) -+ de = create_proc_glob_entry("sys/fs/quota", -+ S_IFDIR | S_IRUSR | S_IXUSR, NULL); -+ if (de == NULL) -+ printk("VZDQ: sys/fs/quota creation failed\n"); -+#endif -+} -+ -+void vzaquota_fini(void) -+{ -+} -diff -uprN linux-2.6.8.1.orig/fs/vzdq_mgmt.c linux-2.6.8.1-ve022stab078/fs/vzdq_mgmt.c ---- linux-2.6.8.1.orig/fs/vzdq_mgmt.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/vzdq_mgmt.c 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,735 @@ -+/* -+ * Copyright (C) 2001, 2002, 2004, 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ */ -+ -+#include <linux/config.h> -+#include <linux/kernel.h> -+#include <linux/string.h> -+#include <linux/list.h> -+#include <asm/semaphore.h> -+#include <linux/sched.h> -+#include <linux/fs.h> -+#include <linux/dcache.h> -+#include <linux/mount.h> -+#include <linux/namei.h> -+#include <linux/writeback.h> -+#include <linux/gfp.h> -+#include <asm/uaccess.h> -+#include <linux/proc_fs.h> -+#include <linux/quota.h> -+#include <linux/vzctl_quota.h> -+#include <linux/vzquota.h> -+ -+ -+/* ---------------------------------------------------------------------- -+ * Switching quota on. 
-+ * --------------------------------------------------------------------- */ -+ -+/* -+ * check limits copied from user -+ */ -+int vzquota_check_sane_limits(struct dq_stat *qstat) -+{ -+ int err; -+ -+ err = -EINVAL; -+ -+ /* softlimit must be less then hardlimit */ -+ if (qstat->bsoftlimit > qstat->bhardlimit) -+ goto out; -+ -+ if (qstat->isoftlimit > qstat->ihardlimit) -+ goto out; -+ -+ err = 0; -+out: -+ return err; -+} -+ -+/* -+ * check usage values copied from user -+ */ -+int vzquota_check_sane_values(struct dq_stat *qstat) -+{ -+ int err; -+ -+ err = -EINVAL; -+ -+ /* expiration time must not be set if softlimit was not exceeded */ -+ if (qstat->bcurrent < qstat->bsoftlimit && qstat->btime != (time_t)0) -+ goto out; -+ -+ if (qstat->icurrent < qstat->isoftlimit && qstat->itime != (time_t)0) -+ goto out; -+ -+ err = vzquota_check_sane_limits(qstat); -+out: -+ return err; -+} -+ -+/* -+ * create new quota master block -+ * this function should: -+ * - copy limits and usage parameters from user buffer; -+ * - allock, initialize quota block and insert it to hash; -+ */ -+static int vzquota_create(unsigned int quota_id, struct vz_quota_stat *u_qstat) -+{ -+ int err; -+ struct vz_quota_stat qstat; -+ struct vz_quota_master *qmblk; -+ -+ down(&vz_quota_sem); -+ -+ err = -EFAULT; -+ if (copy_from_user(&qstat, u_qstat, sizeof(qstat))) -+ goto out; -+ -+ err = -EINVAL; -+ if (quota_id == 0) -+ goto out; -+ -+ if (vzquota_check_sane_values(&qstat.dq_stat)) -+ goto out; -+ err = 0; -+ qmblk = vzquota_alloc_master(quota_id, &qstat); -+ -+ if (IS_ERR(qmblk)) /* ENOMEM or EEXIST */ -+ err = PTR_ERR(qmblk); -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+/** -+ * vzquota_on - turn quota on -+ * -+ * This function should: -+ * - find and get refcnt of directory entry for quota root and corresponding -+ * mountpoint; -+ * - find corresponding quota block and mark it with given path; -+ * - check quota tree; -+ * - initialize quota for the tree root. 
-+ */ -+static int vzquota_on(unsigned int quota_id, const char *quota_root) -+{ -+ int err; -+ struct nameidata nd; -+ struct vz_quota_master *qmblk; -+ struct super_block *dqsb; -+ -+ dqsb = NULL; -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EBUSY; -+ if (qmblk->dq_state != VZDQ_STARTING) -+ goto out; -+ -+ err = user_path_walk(quota_root, &nd); -+ if (err) -+ goto out; -+ /* init path must be a directory */ -+ err = -ENOTDIR; -+ if (!S_ISDIR(nd.dentry->d_inode->i_mode)) -+ goto out_path; -+ -+ qmblk->dq_root_dentry = nd.dentry; -+ qmblk->dq_root_mnt = nd.mnt; -+ qmblk->dq_sb = nd.dentry->d_inode->i_sb; -+ err = vzquota_get_super(qmblk->dq_sb); -+ if (err) -+ goto out_super; -+ -+ /* -+ * Serialization with quota initialization and operations is performed -+ * through generation check: generation is memorized before qmblk is -+ * found and compared under inode_qmblk_lock with assignment. -+ * -+ * Note that the dentry tree is shrunk only for high-level logical -+ * serialization, purely as a courtesy to the user: to have consistent -+ * quota statistics, files should be closed etc. on quota on. -+ */ -+ err = vzquota_on_qmblk(qmblk->dq_sb, qmblk->dq_root_dentry->d_inode, -+ qmblk); -+ if (err) -+ goto out_init; -+ qmblk->dq_state = VZDQ_WORKING; -+ -+ up(&vz_quota_sem); -+ return 0; -+ -+out_init: -+ dqsb = qmblk->dq_sb; -+out_super: -+ /* clear for qmblk_put/quota_free_master */ -+ qmblk->dq_sb = NULL; -+ qmblk->dq_root_dentry = NULL; -+ qmblk->dq_root_mnt = NULL; -+out_path: -+ path_release(&nd); -+out: -+ if (dqsb) -+ vzquota_put_super(dqsb); -+ up(&vz_quota_sem); -+ return err; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * Switching quota off. 
-+ * --------------------------------------------------------------------- */ -+ -+/* -+ * destroy quota block by ID -+ */ -+static int vzquota_destroy(unsigned int quota_id) -+{ -+ int err; -+ struct vz_quota_master *qmblk; -+ struct dentry *dentry; -+ struct vfsmount *mnt; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EBUSY; -+ if (qmblk->dq_state == VZDQ_WORKING) -+ goto out; /* quota_off first */ -+ -+ list_del_init(&qmblk->dq_hash); -+ dentry = qmblk->dq_root_dentry; -+ qmblk->dq_root_dentry = NULL; -+ mnt = qmblk->dq_root_mnt; -+ qmblk->dq_root_mnt = NULL; -+ -+ if (qmblk->dq_sb) -+ vzquota_put_super(qmblk->dq_sb); -+ up(&vz_quota_sem); -+ -+ qmblk_put(qmblk); -+ dput(dentry); -+ mntput(mnt); -+ return 0; -+ -+out: -+ up(&vz_quota_sem); -+ return err; -+} -+ -+/** -+ * vzquota_off - turn quota off -+ */ -+ -+static int __vzquota_sync_list(struct list_head *lh, -+ struct vz_quota_master *qmblk, -+ enum writeback_sync_modes sync_mode) -+{ -+ struct writeback_control wbc; -+ LIST_HEAD(list); -+ struct vz_quota_ilink *qlnk; -+ struct inode *inode; -+ int err; -+ -+ memset(&wbc, 0, sizeof(wbc)); -+ wbc.sync_mode = sync_mode; -+ -+ err = 0; -+ while (!list_empty(lh) && !err) { -+ if (need_resched()) { -+ inode_qmblk_unlock(qmblk->dq_sb); -+ schedule(); -+ inode_qmblk_lock(qmblk->dq_sb); -+ } -+ -+ qlnk = list_first_entry(lh, struct vz_quota_ilink, list); -+ list_move(&qlnk->list, &list); -+ -+ inode = igrab(QLNK_INODE(qlnk)); -+ if (!inode) -+ continue; -+ -+ inode_qmblk_unlock(qmblk->dq_sb); -+ -+ wbc.nr_to_write = LONG_MAX; -+ err = sync_inode(inode, &wbc); -+ iput(inode); -+ -+ inode_qmblk_lock(qmblk->dq_sb); -+ } -+ -+ list_splice(&list, lh); -+ return err; -+} -+ -+static int vzquota_sync_list(struct list_head *lh, -+ struct vz_quota_master *qmblk) -+{ -+ int err; -+ -+ err = __vzquota_sync_list(lh, qmblk, WB_SYNC_NONE); -+ if (err) -+ return err; -+ -+ err = 
__vzquota_sync_list(lh, qmblk, WB_SYNC_ALL); -+ if (err) -+ return err; -+ -+ return 0; -+} -+ -+static int vzquota_sync_inodes(struct vz_quota_master *qmblk) -+{ -+ int err; -+ LIST_HEAD(qlnk_list); -+ -+ list_splice_init(&qmblk->dq_ilink_list, &qlnk_list); -+ err = vzquota_sync_list(&qlnk_list, qmblk); -+ if (!err && !list_empty(&qmblk->dq_ilink_list)) -+ err = -EBUSY; -+ list_splice(&qlnk_list, &qmblk->dq_ilink_list); -+ -+ return err; -+} -+ -+static int vzquota_off(unsigned int quota_id) -+{ -+ int err; -+ struct vz_quota_master *qmblk; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EALREADY; -+ if (qmblk->dq_state != VZDQ_WORKING) -+ goto out; -+ -+ inode_qmblk_lock(qmblk->dq_sb); /* protects dq_ilink_list also */ -+ err = vzquota_sync_inodes(qmblk); -+ if (err) -+ goto out_unlock; -+ inode_qmblk_unlock(qmblk->dq_sb); -+ -+ err = vzquota_off_qmblk(qmblk->dq_sb, qmblk); -+ if (err) -+ goto out; -+ -+ /* vzquota_destroy will free resources */ -+ qmblk->dq_state = VZDQ_STOPING; -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+ -+out_unlock: -+ inode_qmblk_unlock(qmblk->dq_sb); -+ goto out; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * Other VZQUOTA ioctl's. 
-+ * --------------------------------------------------------------------- */ -+ -+/* -+ * this function should: -+ * - set new limits/buffer under quota master block lock -+ * - if new softlimit less then usage, then set expiration time -+ * - no need to alloc ugid hash table - we'll do that on demand -+ */ -+int vzquota_update_limit(struct dq_stat *_qstat, -+ struct dq_stat *qstat) -+{ -+ int err; -+ -+ err = -EINVAL; -+ if (vzquota_check_sane_limits(qstat)) -+ goto out; -+ -+ err = 0; -+ -+ /* limits */ -+ _qstat->bsoftlimit = qstat->bsoftlimit; -+ _qstat->bhardlimit = qstat->bhardlimit; -+ /* -+ * If the soft limit is exceeded, administrator can override the moment -+ * when the grace period for limit exceeding ends. -+ * Specifying the moment may be useful if the soft limit is set to be -+ * lower than the current usage. In the latter case, if the grace -+ * period end isn't specified, the grace period will start from the -+ * moment of the first write operation. -+ * There is a race with the user level. Soft limit may be already -+ * exceeded before the limit change, and grace period end calculated by -+ * the kernel will be overriden. User level may check if the limit is -+ * already exceeded, but check and set calls are not atomic. -+ * This race isn't dangerous. Under normal cicrumstances, the -+ * difference between the grace period end calculated by the kernel and -+ * the user level should be not greater than as the difference between -+ * the moments of check and set calls, i.e. not bigger than the quota -+ * timer resolution - 1 sec. -+ */ -+ if (qstat->btime != (time_t)0 && -+ _qstat->bcurrent >= _qstat->bsoftlimit) -+ _qstat->btime = qstat->btime; -+ -+ _qstat->isoftlimit = qstat->isoftlimit; -+ _qstat->ihardlimit = qstat->ihardlimit; -+ if (qstat->itime != (time_t)0 && -+ _qstat->icurrent >= _qstat->isoftlimit) -+ _qstat->itime = qstat->itime; -+ -+out: -+ return err; -+} -+ -+/* -+ * set new quota limits. 
-+ * this function should: -+ * copy new limits from user level -+ * - find quota block -+ * - set new limits and flags. -+ */ -+static int vzquota_setlimit(unsigned int quota_id, -+ struct vz_quota_stat *u_qstat) -+{ -+ int err; -+ struct vz_quota_stat qstat; -+ struct vz_quota_master *qmblk; -+ -+ down(&vz_quota_sem); /* for hash list protection */ -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EFAULT; -+ if (copy_from_user(&qstat, u_qstat, sizeof(qstat))) -+ goto out; -+ -+ qmblk_data_write_lock(qmblk); -+ err = vzquota_update_limit(&qmblk->dq_stat, &qstat.dq_stat); -+ if (err == 0) -+ qmblk->dq_info = qstat.dq_info; -+ qmblk_data_write_unlock(qmblk); -+ -+out: -+ up(&vz_quota_sem); -+ return err; -+} -+ -+/* -+ * get quota limits. -+ * very simple - just return stat buffer to user -+ */ -+static int vzquota_getstat(unsigned int quota_id, -+ struct vz_quota_stat *u_qstat) -+{ -+ int err; -+ struct vz_quota_stat qstat; -+ struct vz_quota_master *qmblk; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ qmblk_data_read_lock(qmblk); -+ /* copy whole buffer under lock */ -+ memcpy(&qstat.dq_stat, &qmblk->dq_stat, sizeof(qstat.dq_stat)); -+ memcpy(&qstat.dq_info, &qmblk->dq_info, sizeof(qstat.dq_info)); -+ qmblk_data_read_unlock(qmblk); -+ -+ err = copy_to_user(u_qstat, &qstat, sizeof(qstat)); -+ if (err) -+ err = -EFAULT; -+ -+out: -+ up(&vz_quota_sem); -+ return err; -+} -+ -+/* -+ * This is a system call to turn per-VE disk quota on. 
-+ * Note this call is allowed to run ONLY from VE0 -+ */ -+long do_vzquotactl(int cmd, unsigned int quota_id, -+ struct vz_quota_stat *qstat, const char *ve_root) -+{ -+ int ret; -+ -+ ret = -EPERM; -+ /* access allowed only from root of VE0 */ -+ if (!capable(CAP_SYS_RESOURCE) || -+ !capable(CAP_SYS_ADMIN)) -+ goto out; -+ -+ switch (cmd) { -+ case VZ_DQ_CREATE: -+ ret = vzquota_create(quota_id, qstat); -+ break; -+ case VZ_DQ_DESTROY: -+ ret = vzquota_destroy(quota_id); -+ break; -+ case VZ_DQ_ON: -+ ret = vzquota_on(quota_id, ve_root); -+ break; -+ case VZ_DQ_OFF: -+ ret = vzquota_off(quota_id); -+ break; -+ case VZ_DQ_SETLIMIT: -+ ret = vzquota_setlimit(quota_id, qstat); -+ break; -+ case VZ_DQ_GETSTAT: -+ ret = vzquota_getstat(quota_id, qstat); -+ break; -+ -+ default: -+ ret = -EINVAL; -+ goto out; -+ } -+ -+out: -+ return ret; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * Proc filesystem routines -+ * ---------------------------------------------------------------------*/ -+ -+#if defined(CONFIG_PROC_FS) -+ -+#define QUOTA_UINT_LEN 15 -+#define QUOTA_TIME_LEN_FMT_UINT "%11u" -+#define QUOTA_NUM_LEN_FMT_UINT "%15u" -+#define QUOTA_NUM_LEN_FMT_ULL "%15Lu" -+#define QUOTA_TIME_LEN_FMT_STR "%11s" -+#define QUOTA_NUM_LEN_FMT_STR "%15s" -+#define QUOTA_PROC_MAX_LINE_LEN 2048 -+ -+/* -+ * prints /proc/ve_dq header line -+ */ -+static int print_proc_header(char * buffer) -+{ -+ return sprintf(buffer, -+ "%-11s" -+ QUOTA_NUM_LEN_FMT_STR -+ QUOTA_NUM_LEN_FMT_STR -+ QUOTA_NUM_LEN_FMT_STR -+ QUOTA_TIME_LEN_FMT_STR -+ QUOTA_TIME_LEN_FMT_STR -+ "\n", -+ "qid: path", -+ "usage", "softlimit", "hardlimit", "time", "expire"); -+} -+ -+/* -+ * prints proc master record id, dentry path -+ */ -+static int print_proc_master_id(char * buffer, char * path_buf, -+ struct vz_quota_master * qp) -+{ -+ char *path; -+ int over; -+ -+ path = NULL; -+ switch (qp->dq_state) { -+ case VZDQ_WORKING: -+ if (!path_buf) { -+ path = ""; -+ break; -+ 
} -+ path = d_path(qp->dq_root_dentry, -+ qp->dq_root_mnt, path_buf, PAGE_SIZE); -+ if (IS_ERR(path)) { -+ path = ""; -+ break; -+ } -+ /* do not print large path, truncate it */ -+ over = strlen(path) - -+ (QUOTA_PROC_MAX_LINE_LEN - 3 - 3 - -+ QUOTA_UINT_LEN); -+ if (over > 0) { -+ path += over - 3; -+ path[0] = path[1] = path[3] = '.'; -+ } -+ break; -+ case VZDQ_STARTING: -+ path = "-- started --"; -+ break; -+ case VZDQ_STOPING: -+ path = "-- stopped --"; -+ break; -+ } -+ -+ return sprintf(buffer, "%u: %s\n", qp->dq_id, path); -+} -+ -+/* -+ * prints struct vz_quota_stat data -+ */ -+static int print_proc_stat(char * buffer, struct dq_stat *qs, -+ struct dq_info *qi) -+{ -+ return sprintf(buffer, -+ "%11s" -+ QUOTA_NUM_LEN_FMT_ULL -+ QUOTA_NUM_LEN_FMT_ULL -+ QUOTA_NUM_LEN_FMT_ULL -+ QUOTA_TIME_LEN_FMT_UINT -+ QUOTA_TIME_LEN_FMT_UINT -+ "\n" -+ "%11s" -+ QUOTA_NUM_LEN_FMT_UINT -+ QUOTA_NUM_LEN_FMT_UINT -+ QUOTA_NUM_LEN_FMT_UINT -+ QUOTA_TIME_LEN_FMT_UINT -+ QUOTA_TIME_LEN_FMT_UINT -+ "\n", -+ "1k-blocks", -+ qs->bcurrent >> 10, -+ qs->bsoftlimit >> 10, -+ qs->bhardlimit >> 10, -+ (unsigned int)qs->btime, -+ (unsigned int)qi->bexpire, -+ "inodes", -+ qs->icurrent, -+ qs->isoftlimit, -+ qs->ihardlimit, -+ (unsigned int)qs->itime, -+ (unsigned int)qi->iexpire); -+} -+ -+ -+/* -+ * for /proc filesystem output -+ */ -+static int vzquota_read_proc(char *page, char **start, off_t off, int count, -+ int *eof, void *data) -+{ -+ int len, i; -+ off_t printed = 0; -+ char *p = page; -+ struct vz_quota_master *qp; -+ struct vz_quota_ilink *ql2; -+ struct list_head *listp; -+ char *path_buf; -+ -+ path_buf = (char*)__get_free_page(GFP_KERNEL); -+ if (path_buf == NULL) -+ return -ENOMEM; -+ -+ len = print_proc_header(p); -+ printed += len; -+ if (off < printed) /* keep header in output */ { -+ *start = p + off; -+ p += len; -+ } -+ -+ down(&vz_quota_sem); -+ -+ /* traverse master hash table for all records */ -+ for (i = 0; i < vzquota_hash_size; i++) { -+ 
list_for_each(listp, &vzquota_hash_table[i]) { -+ qp = list_entry(listp, -+ struct vz_quota_master, dq_hash); -+ -+ /* Skip other VE's information if not root of VE0 */ -+ if ((!capable(CAP_SYS_ADMIN) || -+ !capable(CAP_SYS_RESOURCE))) { -+ ql2 = INODE_QLNK(current->fs->root->d_inode); -+ if (ql2 == NULL || qp != ql2->qmblk) -+ continue; -+ } -+ /* -+ * Now print the next record -+ */ -+ len = 0; -+ /* we print quotaid and path only in VE0 */ -+ if (capable(CAP_SYS_ADMIN)) -+ len += print_proc_master_id(p+len,path_buf, qp); -+ len += print_proc_stat(p+len, &qp->dq_stat, -+ &qp->dq_info); -+ printed += len; -+ /* skip unnecessary lines */ -+ if (printed <= off) -+ continue; -+ p += len; -+ /* provide start offset */ -+ if (*start == NULL) -+ *start = p + (off - printed); -+ /* have we printed all requested size? */ -+ if (PAGE_SIZE - (p - page) < QUOTA_PROC_MAX_LINE_LEN || -+ (p - *start) >= count) -+ goto out; -+ } -+ } -+ -+ *eof = 1; /* checked all hash */ -+out: -+ up(&vz_quota_sem); -+ -+ len = 0; -+ if (*start != NULL) { -+ len = (p - *start); -+ if (len > count) -+ len = count; -+ } -+ -+ if (path_buf) -+ free_page((unsigned long) path_buf); -+ -+ return len; -+} -+ -+/* -+ * Register procfs read callback -+ */ -+int vzquota_proc_init(void) -+{ -+ struct proc_dir_entry *de; -+ -+ de = create_proc_entry("vz/vzquota", S_IFREG|S_IRUSR, NULL); -+ if (de == NULL) { -+ /* create "vz" subdirectory, if not exist */ -+ de = create_proc_entry("vz", S_IFDIR|S_IRUGO|S_IXUGO, NULL); -+ if (de == NULL) -+ goto out_err; -+ de = create_proc_entry("vzquota", S_IFREG|S_IRUSR, de); -+ if (de == NULL) -+ goto out_err; -+ } -+ de->read_proc = vzquota_read_proc; -+ de->data = NULL; -+ return 0; -+out_err: -+ return -EBUSY; -+} -+ -+void vzquota_proc_release(void) -+{ -+ /* Unregister procfs read callback */ -+ remove_proc_entry("vz/vzquota", NULL); -+} -+ -+#endif -diff -uprN linux-2.6.8.1.orig/fs/vzdq_ops.c linux-2.6.8.1-ve022stab078/fs/vzdq_ops.c ---- 
linux-2.6.8.1.orig/fs/vzdq_ops.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/vzdq_ops.c 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,563 @@ -+/* -+ * Copyright (C) 2001, 2002, 2004, 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ */ -+ -+#include <linux/config.h> -+#include <linux/kernel.h> -+#include <linux/types.h> -+#include <asm/semaphore.h> -+#include <linux/sched.h> -+#include <linux/fs.h> -+#include <linux/quota.h> -+#include <linux/vzquota.h> -+ -+ -+/* ---------------------------------------------------------------------- -+ * Quota superblock operations - helper functions. -+ * --------------------------------------------------------------------- */ -+ -+static inline void vzquota_incr_inodes(struct dq_stat *dqstat, -+ unsigned long number) -+{ -+ dqstat->icurrent += number; -+} -+ -+static inline void vzquota_incr_space(struct dq_stat *dqstat, -+ __u64 number) -+{ -+ dqstat->bcurrent += number; -+} -+ -+static inline void vzquota_decr_inodes(struct dq_stat *dqstat, -+ unsigned long number) -+{ -+ if (dqstat->icurrent > number) -+ dqstat->icurrent -= number; -+ else -+ dqstat->icurrent = 0; -+ if (dqstat->icurrent < dqstat->isoftlimit) -+ dqstat->itime = (time_t) 0; -+} -+ -+static inline void vzquota_decr_space(struct dq_stat *dqstat, -+ __u64 number) -+{ -+ if (dqstat->bcurrent > number) -+ dqstat->bcurrent -= number; -+ else -+ dqstat->bcurrent = 0; -+ if (dqstat->bcurrent < dqstat->bsoftlimit) -+ dqstat->btime = (time_t) 0; -+} -+ -+/* -+ * better printk() message or use /proc/vzquotamsg interface -+ * similar to /proc/kmsg -+ */ -+static inline void vzquota_warn(struct dq_info *dq_info, int dq_id, int flag, -+ const char *fmt) -+{ -+ if (dq_info->flags & flag) /* warning already printed for this -+ masterblock */ -+ return; -+ printk(fmt, dq_id); -+ dq_info->flags |= flag; -+} -+ -+/* -+ * ignore_hardlimit - -+ * -+ * Intended to allow superuser of VE0 to overwrite 
hardlimits. -+ * -+ * ignore_hardlimit() has a very bad feature: -+ * -+ * writepage() operation for writable mapping of a file with holes -+ * may trigger get_block() with wrong current and as a consequence, -+ * opens a possibility to overcommit hardlimits -+ */ -+/* for the reason above, it is disabled now */ -+static inline int ignore_hardlimit(struct dq_info *dqstat) -+{ -+#if 0 -+ return ve_is_super(get_exec_env()) && -+ capable(CAP_SYS_RESOURCE) && -+ (dqstat->options & VZ_QUOTA_OPT_RSQUASH); -+#else -+ return 0; -+#endif -+} -+ -+static int vzquota_check_inodes(struct dq_info *dq_info, -+ struct dq_stat *dqstat, -+ unsigned long number, int dq_id) -+{ -+ if (number == 0) -+ return QUOTA_OK; -+ -+ if (dqstat->icurrent + number > dqstat->ihardlimit && -+ !ignore_hardlimit(dq_info)) { -+ vzquota_warn(dq_info, dq_id, VZ_QUOTA_INODES, -+ "VZ QUOTA: file hardlimit reached for id=%d\n"); -+ return NO_QUOTA; -+ } -+ -+ if (dqstat->icurrent + number > dqstat->isoftlimit) { -+ if (dqstat->itime == (time_t)0) { -+ vzquota_warn(dq_info, dq_id, 0, -+ "VZ QUOTA: file softlimit exceeded " -+ "for id=%d\n"); -+ dqstat->itime = CURRENT_TIME_SECONDS + dq_info->iexpire; -+ } else if (CURRENT_TIME_SECONDS >= dqstat->itime && -+ !ignore_hardlimit(dq_info)) { -+ vzquota_warn(dq_info, dq_id, VZ_QUOTA_INODES, -+ "VZ QUOTA: file softlimit expired " -+ "for id=%d\n"); -+ return NO_QUOTA; -+ } -+ } -+ -+ return QUOTA_OK; -+} -+ -+static int vzquota_check_space(struct dq_info *dq_info, -+ struct dq_stat *dqstat, -+ __u64 number, int dq_id, char prealloc) -+{ -+ if (number == 0) -+ return QUOTA_OK; -+ -+ if (dqstat->bcurrent + number > dqstat->bhardlimit && -+ !ignore_hardlimit(dq_info)) { -+ if (!prealloc) -+ vzquota_warn(dq_info, dq_id, VZ_QUOTA_SPACE, -+ "VZ QUOTA: disk hardlimit reached " -+ "for id=%d\n"); -+ return NO_QUOTA; -+ } -+ -+ if (dqstat->bcurrent + number > dqstat->bsoftlimit) { -+ if (dqstat->btime == (time_t)0) { -+ if (!prealloc) { -+ vzquota_warn(dq_info, dq_id, 0, 
-+ "VZ QUOTA: disk softlimit exceeded " -+ "for id=%d\n"); -+ dqstat->btime = CURRENT_TIME_SECONDS -+ + dq_info->bexpire; -+ } else { -+ /* -+ * Original Linux quota doesn't allow -+ * preallocation to exceed softlimit so -+ * exceeding will be always printed -+ */ -+ return NO_QUOTA; -+ } -+ } else if (CURRENT_TIME_SECONDS >= dqstat->btime && -+ !ignore_hardlimit(dq_info)) { -+ if (!prealloc) -+ vzquota_warn(dq_info, dq_id, VZ_QUOTA_SPACE, -+ "VZ QUOTA: disk quota " -+ "softlimit expired " -+ "for id=%d\n"); -+ return NO_QUOTA; -+ } -+ } -+ -+ return QUOTA_OK; -+} -+ -+static int vzquota_check_ugid_inodes(struct vz_quota_master *qmblk, -+ struct vz_quota_ugid *qugid[], -+ int type, unsigned long number) -+{ -+ struct dq_info *dqinfo; -+ struct dq_stat *dqstat; -+ -+ if (qugid[type] == NULL) -+ return QUOTA_OK; -+ if (qugid[type] == VZ_QUOTA_UGBAD) -+ return NO_QUOTA; -+ -+ if (type == USRQUOTA && !(qmblk->dq_flags & VZDQ_USRQUOTA)) -+ return QUOTA_OK; -+ if (type == GRPQUOTA && !(qmblk->dq_flags & VZDQ_GRPQUOTA)) -+ return QUOTA_OK; -+ if (number == 0) -+ return QUOTA_OK; -+ -+ dqinfo = &qmblk->dq_ugid_info[type]; -+ dqstat = &qugid[type]->qugid_stat; -+ -+ if (dqstat->ihardlimit != 0 && -+ dqstat->icurrent + number > dqstat->ihardlimit) -+ return NO_QUOTA; -+ -+ if (dqstat->isoftlimit != 0 && -+ dqstat->icurrent + number > dqstat->isoftlimit) { -+ if (dqstat->itime == (time_t)0) -+ dqstat->itime = CURRENT_TIME_SECONDS + dqinfo->iexpire; -+ else if (CURRENT_TIME_SECONDS >= dqstat->itime) -+ return NO_QUOTA; -+ } -+ -+ return QUOTA_OK; -+} -+ -+static int vzquota_check_ugid_space(struct vz_quota_master *qmblk, -+ struct vz_quota_ugid *qugid[], -+ int type, __u64 number, char prealloc) -+{ -+ struct dq_info *dqinfo; -+ struct dq_stat *dqstat; -+ -+ if (qugid[type] == NULL) -+ return QUOTA_OK; -+ if (qugid[type] == VZ_QUOTA_UGBAD) -+ return NO_QUOTA; -+ -+ if (type == USRQUOTA && !(qmblk->dq_flags & VZDQ_USRQUOTA)) -+ return QUOTA_OK; -+ if (type == GRPQUOTA && 
!(qmblk->dq_flags & VZDQ_GRPQUOTA)) -+ return QUOTA_OK; -+ if (number == 0) -+ return QUOTA_OK; -+ -+ dqinfo = &qmblk->dq_ugid_info[type]; -+ dqstat = &qugid[type]->qugid_stat; -+ -+ if (dqstat->bhardlimit != 0 && -+ dqstat->bcurrent + number > dqstat->bhardlimit) -+ return NO_QUOTA; -+ -+ if (dqstat->bsoftlimit != 0 && -+ dqstat->bcurrent + number > dqstat->bsoftlimit) { -+ if (dqstat->btime == (time_t)0) { -+ if (!prealloc) -+ dqstat->btime = CURRENT_TIME_SECONDS -+ + dqinfo->bexpire; -+ else -+ /* -+ * Original Linux quota doesn't allow -+ * preallocation to exceed softlimit so -+ * exceeding will be always printed -+ */ -+ return NO_QUOTA; -+ } else if (CURRENT_TIME_SECONDS >= dqstat->btime) -+ return NO_QUOTA; -+ } -+ -+ return QUOTA_OK; -+} -+ -+/* ---------------------------------------------------------------------- -+ * Quota superblock operations -+ * --------------------------------------------------------------------- */ -+ -+/* -+ * S_NOQUOTA note. -+ * In the current kernel (2.6.8.1), S_NOQUOTA flag is set only for -+ * - quota file (absent in our case) -+ * - after explicit DQUOT_DROP (earlier than clear_inode) in functions like -+ * filesystem-specific new_inode, before the inode gets outside links. -+ * For the latter case, the only quota operation where care about S_NOQUOTA -+ * might be required is vzquota_drop, but there S_NOQUOTA has already been -+ * checked in DQUOT_DROP(). -+ * So, S_NOQUOTA may be ignored for now in the VZDQ code. -+ * -+ * The above note is not entirely correct. -+ * Both for ext2 and ext3 filesystems, DQUOT_FREE_INODE is called from -+ * delete_inode if new_inode fails (for example, because of inode quota -+ * limits), so S_NOQUOTA check is needed in free_inode. -+ * This seems to be the dark corner of the current quota API. -+ */ -+ -+/* -+ * Initialize quota operations for the specified inode. 
-+ */ -+static int vzquota_initialize(struct inode *inode, int type) -+{ -+ vzquota_inode_init_call(inode); -+ return 0; /* ignored by caller */ -+} -+ -+/* -+ * Release quota for the specified inode. -+ */ -+static int vzquota_drop(struct inode *inode) -+{ -+ vzquota_inode_drop_call(inode); -+ return 0; /* ignored by caller */ -+} -+ -+/* -+ * Allocate block callback. -+ * -+ * If (prealloc) disk quota exceeding warning is not printed. -+ * See Linux quota to know why. -+ * -+ * Return: -+ * QUOTA_OK == 0 on SUCCESS -+ * NO_QUOTA == 1 if allocation should fail -+ */ -+static int vzquota_alloc_space(struct inode *inode, -+ qsize_t number, int prealloc) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ int ret = QUOTA_OK; -+ -+ qmblk = vzquota_inode_data(inode, &data); -+ if (qmblk == VZ_QUOTA_BAD) -+ return NO_QUOTA; -+ if (qmblk != NULL) { -+#ifdef CONFIG_VZ_QUOTA_UGID -+ int cnt; -+ struct vz_quota_ugid * qugid[MAXQUOTAS]; -+#endif -+ -+ /* checking first */ -+ ret = vzquota_check_space(&qmblk->dq_info, &qmblk->dq_stat, -+ number, qmblk->dq_id, prealloc); -+ if (ret == NO_QUOTA) -+ goto no_quota; -+#ifdef CONFIG_VZ_QUOTA_UGID -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ qugid[cnt] = INODE_QLNK(inode)->qugid[cnt]; -+ ret = vzquota_check_ugid_space(qmblk, qugid, -+ cnt, number, prealloc); -+ if (ret == NO_QUOTA) -+ goto no_quota; -+ } -+ /* check ok, may increment */ -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ if (qugid[cnt] == NULL) -+ continue; -+ vzquota_incr_space(&qugid[cnt]->qugid_stat, number); -+ } -+#endif -+ vzquota_incr_space(&qmblk->dq_stat, number); -+ vzquota_data_unlock(inode, &data); -+ } -+ -+ inode_add_bytes(inode, number); -+ might_sleep(); -+ return QUOTA_OK; -+ -+no_quota: -+ vzquota_data_unlock(inode, &data); -+ return NO_QUOTA; -+} -+ -+/* -+ * Allocate inodes callback. 
-+ * -+ * Return: -+ * QUOTA_OK == 0 on SUCCESS -+ * NO_QUOTA == 1 if allocation should fail -+ */ -+static int vzquota_alloc_inode(const struct inode *inode, unsigned long number) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ int ret = QUOTA_OK; -+ -+ qmblk = vzquota_inode_data((struct inode *)inode, &data); -+ if (qmblk == VZ_QUOTA_BAD) -+ return NO_QUOTA; -+ if (qmblk != NULL) { -+#ifdef CONFIG_VZ_QUOTA_UGID -+ int cnt; -+ struct vz_quota_ugid *qugid[MAXQUOTAS]; -+#endif -+ -+ /* checking first */ -+ ret = vzquota_check_inodes(&qmblk->dq_info, &qmblk->dq_stat, -+ number, qmblk->dq_id); -+ if (ret == NO_QUOTA) -+ goto no_quota; -+#ifdef CONFIG_VZ_QUOTA_UGID -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ qugid[cnt] = INODE_QLNK(inode)->qugid[cnt]; -+ ret = vzquota_check_ugid_inodes(qmblk, qugid, -+ cnt, number); -+ if (ret == NO_QUOTA) -+ goto no_quota; -+ } -+ /* check ok, may increment */ -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ if (qugid[cnt] == NULL) -+ continue; -+ vzquota_incr_inodes(&qugid[cnt]->qugid_stat, number); -+ } -+#endif -+ vzquota_incr_inodes(&qmblk->dq_stat, number); -+ vzquota_data_unlock((struct inode *)inode, &data); -+ } -+ -+ might_sleep(); -+ return QUOTA_OK; -+ -+no_quota: -+ vzquota_data_unlock((struct inode *)inode, &data); -+ return NO_QUOTA; -+} -+ -+/* -+ * Free space callback. 
-+ */ -+static int vzquota_free_space(struct inode *inode, qsize_t number) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ -+ qmblk = vzquota_inode_data(inode, &data); -+ if (qmblk == VZ_QUOTA_BAD) -+ return NO_QUOTA; /* isn't checked by the caller */ -+ if (qmblk != NULL) { -+#ifdef CONFIG_VZ_QUOTA_UGID -+ int cnt; -+ struct vz_quota_ugid * qugid; -+#endif -+ -+ vzquota_decr_space(&qmblk->dq_stat, number); -+#ifdef CONFIG_VZ_QUOTA_UGID -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ qugid = INODE_QLNK(inode)->qugid[cnt]; -+ if (qugid == NULL || qugid == VZ_QUOTA_UGBAD) -+ continue; -+ vzquota_decr_space(&qugid->qugid_stat, number); -+ } -+#endif -+ vzquota_data_unlock(inode, &data); -+ } -+ inode_sub_bytes(inode, number); -+ might_sleep(); -+ return QUOTA_OK; -+} -+ -+/* -+ * Free inodes callback. -+ */ -+static int vzquota_free_inode(const struct inode *inode, unsigned long number) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ -+ if (IS_NOQUOTA(inode)) -+ return QUOTA_OK; -+ -+ qmblk = vzquota_inode_data((struct inode *)inode, &data); -+ if (qmblk == VZ_QUOTA_BAD) -+ return NO_QUOTA; -+ if (qmblk != NULL) { -+#ifdef CONFIG_VZ_QUOTA_UGID -+ int cnt; -+ struct vz_quota_ugid * qugid; -+#endif -+ -+ vzquota_decr_inodes(&qmblk->dq_stat, number); -+#ifdef CONFIG_VZ_QUOTA_UGID -+ for (cnt = 0; cnt < MAXQUOTAS; cnt++) { -+ qugid = INODE_QLNK(inode)->qugid[cnt]; -+ if (qugid == NULL || qugid == VZ_QUOTA_UGBAD) -+ continue; -+ vzquota_decr_inodes(&qugid->qugid_stat, number); -+ } -+#endif -+ vzquota_data_unlock((struct inode *)inode, &data); -+ } -+ might_sleep(); -+ return QUOTA_OK; -+} -+ -+#if defined(CONFIG_VZ_QUOTA_UGID) -+ -+/* -+ * helper function for quota_transfer -+ * check that we can add inode to this quota_id -+ */ -+static int vzquota_transfer_check(struct vz_quota_master *qmblk, -+ struct vz_quota_ugid *qugid[], -+ unsigned int type, __u64 size) -+{ -+ if (vzquota_check_ugid_space(qmblk, qugid, type, size, 0) 
!= QUOTA_OK || -+ vzquota_check_ugid_inodes(qmblk, qugid, type, 1) != QUOTA_OK) -+ return -1; -+ return 0; -+} -+ -+int vzquota_transfer_usage(struct inode *inode, -+ int mask, -+ struct vz_quota_ilink *qlnk) -+{ -+ struct vz_quota_ugid *qugid_old; -+ __u64 space; -+ int i; -+ -+ space = inode_get_bytes(inode); -+ for (i = 0; i < MAXQUOTAS; i++) { -+ if (!(mask & (1 << i))) -+ continue; -+ if (vzquota_transfer_check(qlnk->qmblk, qlnk->qugid, i, space)) -+ return -1; -+ } -+ -+ for (i = 0; i < MAXQUOTAS; i++) { -+ if (!(mask & (1 << i))) -+ continue; -+ qugid_old = INODE_QLNK(inode)->qugid[i]; -+ vzquota_decr_space(&qugid_old->qugid_stat, space); -+ vzquota_decr_inodes(&qugid_old->qugid_stat, 1); -+ vzquota_incr_space(&qlnk->qugid[i]->qugid_stat, space); -+ vzquota_incr_inodes(&qlnk->qugid[i]->qugid_stat, 1); -+ } -+ return 0; -+} -+ -+/* -+ * Transfer the inode between diffent user/group quotas. -+ */ -+static int vzquota_transfer(struct inode *inode, struct iattr *iattr) -+{ -+ return vzquota_inode_transfer_call(inode, iattr) ? -+ NO_QUOTA : QUOTA_OK; -+} -+ -+#else /* CONFIG_VZ_QUOTA_UGID */ -+ -+static int vzquota_transfer(struct inode *inode, struct iattr *iattr) -+{ -+ return QUOTA_OK; -+} -+ -+#endif -+ -+/* -+ * Called under following semaphores: -+ * old_d->d_inode->i_sb->s_vfs_rename_sem -+ * old_d->d_inode->i_sem -+ * new_d->d_inode->i_sem -+ * [not verified --SAW] -+ */ -+static int vzquota_rename(struct inode *inode, -+ struct inode *old_dir, struct inode *new_dir) -+{ -+ return vzquota_rename_check(inode, old_dir, new_dir) ? -+ NO_QUOTA : QUOTA_OK; -+} -+ -+/* -+ * Structure of superblock diskquota operations. 
-+ */ -+struct dquot_operations vz_quota_operations = { -+ initialize: vzquota_initialize, -+ drop: vzquota_drop, -+ alloc_space: vzquota_alloc_space, -+ alloc_inode: vzquota_alloc_inode, -+ free_space: vzquota_free_space, -+ free_inode: vzquota_free_inode, -+ transfer: vzquota_transfer, -+ rename: vzquota_rename -+}; -diff -uprN linux-2.6.8.1.orig/fs/vzdq_tree.c linux-2.6.8.1-ve022stab078/fs/vzdq_tree.c ---- linux-2.6.8.1.orig/fs/vzdq_tree.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/vzdq_tree.c 2006-05-11 13:05:44.000000000 +0400 -@@ -0,0 +1,286 @@ -+/* -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * This file contains Virtuozzo quota tree implementation -+ */ -+ -+#include <linux/errno.h> -+#include <linux/slab.h> -+#include <linux/vzdq_tree.h> -+ -+struct quotatree_tree *quotatree_alloc(void) -+{ -+ int l; -+ struct quotatree_tree *tree; -+ -+ tree = kmalloc(sizeof(struct quotatree_tree), GFP_KERNEL); -+ if (tree == NULL) -+ goto out; -+ -+ for (l = 0; l < QUOTATREE_DEPTH; l++) { -+ INIT_LIST_HEAD(&tree->levels[l].usedlh); -+ INIT_LIST_HEAD(&tree->levels[l].freelh); -+ tree->levels[l].freenum = 0; -+ } -+ tree->root = NULL; -+ tree->leaf_num = 0; -+out: -+ return tree; -+} -+ -+static struct quotatree_node * -+quotatree_follow(struct quotatree_tree *tree, quotaid_t id, int level, -+ struct quotatree_find_state *st) -+{ -+ void **block; -+ struct quotatree_node *parent; -+ int l, index; -+ -+ parent = NULL; -+ block = (void **)&tree->root; -+ l = 0; -+ while (l < level && *block != NULL) { -+ index = (id >> QUOTATREE_BSHIFT(l)) & QUOTATREE_BMASK; -+ parent = *block; -+ block = parent->blocks + index; -+ l++; -+ } -+ if (st != NULL) { -+ st->block = block; -+ st->level = l; -+ } -+ -+ return parent; -+} -+ -+void *quotatree_find(struct quotatree_tree *tree, quotaid_t id, -+ struct quotatree_find_state *st) -+{ -+ quotatree_follow(tree, id, 
QUOTATREE_DEPTH, st); -+ if (st->level == QUOTATREE_DEPTH) -+ return *st->block; -+ else -+ return NULL; -+} -+ -+void *quotatree_leaf_byindex(struct quotatree_tree *tree, unsigned int index) -+{ -+ int i, count; -+ struct quotatree_node *p; -+ void *leaf; -+ -+ if (QTREE_LEAFNUM(tree) <= index) -+ return NULL; -+ -+ count = 0; -+ list_for_each_entry(p, &QTREE_LEAFLVL(tree)->usedlh, list) { -+ for (i = 0; i < QUOTATREE_BSIZE; i++) { -+ leaf = p->blocks[i]; -+ if (leaf == NULL) -+ continue; -+ if (count == index) -+ return leaf; -+ count++; -+ } -+ } -+ return NULL; -+} -+ -+/* returns data leaf (vz_quota_ugid) after _existent_ ugid (@id) -+ * in the tree... */ -+void *quotatree_get_next(struct quotatree_tree *tree, quotaid_t id) -+{ -+ int off; -+ struct quotatree_node *parent, *p; -+ struct list_head *lh; -+ -+ /* get parent refering correct quota tree node of the last level */ -+ parent = quotatree_follow(tree, id, QUOTATREE_DEPTH, NULL); -+ if (!parent) -+ return NULL; -+ -+ off = (id & QUOTATREE_BMASK) + 1; /* next ugid */ -+ lh = &parent->list; -+ do { -+ p = list_entry(lh, struct quotatree_node, list); -+ for ( ; off < QUOTATREE_BSIZE; off++) -+ if (p->blocks[off]) -+ return p->blocks[off]; -+ off = 0; -+ lh = lh->next; -+ } while (lh != &QTREE_LEAFLVL(tree)->usedlh); -+ -+ return NULL; -+} -+ -+int quotatree_insert(struct quotatree_tree *tree, quotaid_t id, -+ struct quotatree_find_state *st, void *data) -+{ -+ struct quotatree_node *p; -+ int l, index; -+ -+ while (st->level < QUOTATREE_DEPTH) { -+ l = st->level; -+ if (!list_empty(&tree->levels[l].freelh)) { -+ p = list_entry(tree->levels[l].freelh.next, -+ struct quotatree_node, list); -+ list_del(&p->list); -+ } else { -+ p = kmalloc(sizeof(struct quotatree_node), GFP_NOFS | __GFP_NOFAIL); -+ if (p == NULL) -+ return -ENOMEM; -+ /* save block number in the l-level -+ * it uses for quota file generation */ -+ p->num = tree->levels[l].freenum++; -+ } -+ list_add(&p->list, &tree->levels[l].usedlh); -+ 
memset(p->blocks, 0, sizeof(p->blocks)); -+ *st->block = p; -+ -+ index = (id >> QUOTATREE_BSHIFT(l)) & QUOTATREE_BMASK; -+ st->block = p->blocks + index; -+ st->level++; -+ } -+ tree->leaf_num++; -+ *st->block = data; -+ -+ return 0; -+} -+ -+static struct quotatree_node * -+quotatree_remove_ptr(struct quotatree_tree *tree, quotaid_t id, -+ int level) -+{ -+ struct quotatree_node *parent; -+ struct quotatree_find_state st; -+ -+ parent = quotatree_follow(tree, id, level, &st); -+ if (st.level == QUOTATREE_DEPTH) -+ tree->leaf_num--; -+ *st.block = NULL; -+ return parent; -+} -+ -+void quotatree_remove(struct quotatree_tree *tree, quotaid_t id) -+{ -+ struct quotatree_node *p; -+ int level, i; -+ -+ p = quotatree_remove_ptr(tree, id, QUOTATREE_DEPTH); -+ for (level = QUOTATREE_DEPTH - 1; level >= QUOTATREE_CDEPTH; level--) { -+ for (i = 0; i < QUOTATREE_BSIZE; i++) -+ if (p->blocks[i] != NULL) -+ return; -+ list_move(&p->list, &tree->levels[level].freelh); -+ p = quotatree_remove_ptr(tree, id, level); -+ } -+} -+ -+#if 0 -+static void quotatree_walk(struct quotatree_tree *tree, -+ struct quotatree_node *node_start, -+ quotaid_t id_start, -+ int level_start, int level_end, -+ int (*callback)(struct quotatree_tree *, -+ quotaid_t id, -+ int level, -+ void *ptr, -+ void *data), -+ void *data) -+{ -+ struct quotatree_node *p; -+ int l, shift, index; -+ quotaid_t id; -+ struct quotatree_find_state st; -+ -+ p = node_start; -+ l = level_start; -+ shift = (QUOTATREE_DEPTH - l) * QUOTAID_BBITS; -+ id = id_start; -+ index = 0; -+ -+ /* -+ * Invariants: -+ * shift == (QUOTATREE_DEPTH - l) * QUOTAID_BBITS; -+ * id & ((1 << shift) - 1) == 0 -+ * p is l-level node corresponding to id -+ */ -+ do { -+ if (!p) -+ break; -+ -+ if (l < level_end) { -+ for (; index < QUOTATREE_BSIZE; index++) -+ if (p->blocks[index] != NULL) -+ break; -+ if (index < QUOTATREE_BSIZE) { -+ /* descend */ -+ p = p->blocks[index]; -+ l++; -+ shift -= QUOTAID_BBITS; -+ id += (quotaid_t)index << shift; -+ 
index = 0; -+ continue; -+ } -+ } -+ -+ if ((*callback)(tree, id, l, p, data)) -+ break; -+ -+ /* ascend and to the next node */ -+ p = quotatree_follow(tree, id, l, &st); -+ -+ index = ((id >> shift) & QUOTATREE_BMASK) + 1; -+ l--; -+ shift += QUOTAID_BBITS; -+ id &= ~(((quotaid_t)1 << shift) - 1); -+ } while (l >= level_start); -+} -+#endif -+ -+static void free_list(struct list_head *node_list) -+{ -+ struct quotatree_node *p, *tmp; -+ -+ list_for_each_entry_safe(p, tmp, node_list, list) { -+ list_del(&p->list); -+ kfree(p); -+ } -+} -+ -+static inline void quotatree_free_nodes(struct quotatree_tree *tree) -+{ -+ int i; -+ -+ for (i = 0; i < QUOTATREE_DEPTH; i++) { -+ free_list(&tree->levels[i].usedlh); -+ free_list(&tree->levels[i].freelh); -+ } -+} -+ -+static void quotatree_free_leafs(struct quotatree_tree *tree, -+ void (*dtor)(void *)) -+{ -+ int i; -+ struct quotatree_node *p; -+ -+ list_for_each_entry(p, &QTREE_LEAFLVL(tree)->usedlh, list) { -+ for (i = 0; i < QUOTATREE_BSIZE; i++) { -+ if (p->blocks[i] == NULL) -+ continue; -+ -+ dtor(p->blocks[i]); -+ } -+ } -+} -+ -+void quotatree_free(struct quotatree_tree *tree, void (*dtor)(void *)) -+{ -+ quotatree_free_leafs(tree, dtor); -+ quotatree_free_nodes(tree); -+ kfree(tree); -+} -diff -uprN linux-2.6.8.1.orig/fs/vzdq_ugid.c linux-2.6.8.1-ve022stab078/fs/vzdq_ugid.c ---- linux-2.6.8.1.orig/fs/vzdq_ugid.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/vzdq_ugid.c 2006-05-11 13:05:44.000000000 +0400 -@@ -0,0 +1,1130 @@ -+/* -+ * Copyright (C) 2002 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ * This file contains Virtuozzo UID/GID disk quota implementation -+ */ -+ -+#include <linux/config.h> -+#include <linux/string.h> -+#include <linux/slab.h> -+#include <linux/list.h> -+#include <linux/smp_lock.h> -+#include <linux/rcupdate.h> -+#include <asm/uaccess.h> -+#include <linux/proc_fs.h> -+#include <linux/init.h> -+#include <linux/module.h> -+#include <linux/quota.h> -+#include <linux/quotaio_v2.h> -+#include <linux/virtinfo.h> -+ -+#include <linux/vzctl.h> -+#include <linux/vzctl_quota.h> -+#include <linux/vzquota.h> -+ -+/* -+ * XXX -+ * may be something is needed for sb->s_dquot->info[]? -+ */ -+ -+#define USRQUOTA_MASK (1 << USRQUOTA) -+#define GRPQUOTA_MASK (1 << GRPQUOTA) -+#define QTYPE2MASK(type) (1 << (type)) -+ -+static kmem_cache_t *vz_quota_ugid_cachep; -+ -+/* guard to protect vz_quota_master from destroy in quota_on/off. Also protects -+ * list on the hash table */ -+extern struct semaphore vz_quota_sem; -+ -+inline struct vz_quota_ugid *vzquota_get_ugid(struct vz_quota_ugid *qugid) -+{ -+ if (qugid != VZ_QUOTA_UGBAD) -+ atomic_inc(&qugid->qugid_count); -+ return qugid; -+} -+ -+/* we don't limit users with zero limits */ -+static inline int vzquota_fake_stat(struct dq_stat *stat) -+{ -+ return stat->bhardlimit == 0 && stat->bsoftlimit == 0 && -+ stat->ihardlimit == 0 && stat->isoftlimit == 0; -+} -+ -+/* callback function for quotatree_free() */ -+static inline void vzquota_free_qugid(void *ptr) -+{ -+ kmem_cache_free(vz_quota_ugid_cachep, ptr); -+} -+ -+/* -+ * destroy ugid, if it have zero refcount, limits and usage -+ * must be called under qmblk->dq_sem -+ */ -+void vzquota_put_ugid(struct vz_quota_master *qmblk, -+ struct vz_quota_ugid *qugid) -+{ -+ if (qugid == VZ_QUOTA_UGBAD) -+ return; -+ qmblk_data_read_lock(qmblk); -+ if (atomic_dec_and_test(&qugid->qugid_count) && -+ (qmblk->dq_flags & VZDQUG_FIXED_SET) == 0 && -+ vzquota_fake_stat(&qugid->qugid_stat) && -+ qugid->qugid_stat.bcurrent == 0 && -+ qugid->qugid_stat.icurrent 
== 0) { -+ quotatree_remove(QUGID_TREE(qmblk, qugid->qugid_type), -+ qugid->qugid_id); -+ qmblk->dq_ugid_count--; -+ vzquota_free_qugid(qugid); -+ } -+ qmblk_data_read_unlock(qmblk); -+} -+ -+/* -+ * Get ugid block by its index, like it would present in array. -+ * In reality, this is not array - this is leafs chain of the tree. -+ * NULL if index is out of range. -+ * qmblk semaphore is required to protect the tree. -+ */ -+static inline struct vz_quota_ugid * -+vzquota_get_byindex(struct vz_quota_master *qmblk, unsigned int index, int type) -+{ -+ return quotatree_leaf_byindex(QUGID_TREE(qmblk, type), index); -+} -+ -+/* -+ * get next element from ugid "virtual array" -+ * ugid must be in current array and this array may not be changed between -+ * two accesses (quaranteed by "stopped" quota state and quota semaphore) -+ * qmblk semaphore is required to protect the tree -+ */ -+static inline struct vz_quota_ugid * -+vzquota_get_next(struct vz_quota_master *qmblk, struct vz_quota_ugid *qugid) -+{ -+ return quotatree_get_next(QUGID_TREE(qmblk, qugid->qugid_type), -+ qugid->qugid_id); -+} -+ -+/* -+ * requires dq_sem -+ */ -+struct vz_quota_ugid *__vzquota_find_ugid(struct vz_quota_master *qmblk, -+ unsigned int quota_id, int type, int flags) -+{ -+ struct vz_quota_ugid *qugid; -+ struct quotatree_tree *tree; -+ struct quotatree_find_state st; -+ -+ tree = QUGID_TREE(qmblk, type); -+ qugid = quotatree_find(tree, quota_id, &st); -+ if (qugid) -+ goto success; -+ -+ /* caller does not want alloc */ -+ if (flags & VZDQUG_FIND_DONT_ALLOC) -+ goto fail; -+ -+ if (flags & VZDQUG_FIND_FAKE) -+ goto doit; -+ -+ /* check limit */ -+ if (qmblk->dq_ugid_count >= qmblk->dq_ugid_max) -+ goto fail; -+ -+ /* see comment at VZDQUG_FIXED_SET define */ -+ if (qmblk->dq_flags & VZDQUG_FIXED_SET) -+ goto fail; -+ -+doit: -+ /* alloc new structure */ -+ qugid = kmem_cache_alloc(vz_quota_ugid_cachep, -+ SLAB_NOFS | __GFP_NOFAIL); -+ if (qugid == NULL) -+ goto fail; -+ -+ /* initialize 
new structure */ -+ qugid->qugid_id = quota_id; -+ memset(&qugid->qugid_stat, 0, sizeof(qugid->qugid_stat)); -+ qugid->qugid_type = type; -+ atomic_set(&qugid->qugid_count, 0); -+ -+ /* insert in tree */ -+ if (quotatree_insert(tree, quota_id, &st, qugid) < 0) -+ goto fail_insert; -+ qmblk->dq_ugid_count++; -+ -+success: -+ vzquota_get_ugid(qugid); -+ return qugid; -+ -+fail_insert: -+ vzquota_free_qugid(qugid); -+fail: -+ return VZ_QUOTA_UGBAD; -+} -+ -+/* -+ * takes dq_sem, may schedule -+ */ -+struct vz_quota_ugid *vzquota_find_ugid(struct vz_quota_master *qmblk, -+ unsigned int quota_id, int type, int flags) -+{ -+ struct vz_quota_ugid *qugid; -+ -+ down(&qmblk->dq_sem); -+ qugid = __vzquota_find_ugid(qmblk, quota_id, type, flags); -+ up(&qmblk->dq_sem); -+ -+ return qugid; -+} -+ -+/* -+ * destroy all ugid records on given quota master -+ */ -+void vzquota_kill_ugid(struct vz_quota_master *qmblk) -+{ -+ BUG_ON((qmblk->dq_gid_tree == NULL && qmblk->dq_uid_tree != NULL) || -+ (qmblk->dq_uid_tree == NULL && qmblk->dq_gid_tree != NULL)); -+ -+ if (qmblk->dq_uid_tree != NULL) { -+ quotatree_free(qmblk->dq_uid_tree, vzquota_free_qugid); -+ quotatree_free(qmblk->dq_gid_tree, vzquota_free_qugid); -+ } -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * Management interface to ugid quota for (super)users. -+ * --------------------------------------------------------------------- */ -+ -+/** -+ * vzquota_find_qmblk - helper to emulate quota on virtual filesystems -+ * -+ * This function finds a quota master block corresponding to the root of -+ * a virtual filesystem. -+ * Returns a quota master block with reference taken, or %NULL if not under -+ * quota, or %VZ_QUOTA_BAD if quota inconsistency is found (and all allocation -+ * operations will fail). -+ * -+ * Note: this function uses vzquota_inode_qmblk(). 
-+ * The latter is a rather confusing function: it returns qmblk that used to be -+ * on the inode some time ago (without guarantee that it still has any -+ * relations to the inode). So, vzquota_find_qmblk() leaves it up to the -+ * caller to think whether the inode could have changed its qmblk and what to -+ * do in that case. -+ * Currently, the callers appear to not care :( -+ */ -+struct vz_quota_master *vzquota_find_qmblk(struct super_block *sb) -+{ -+ struct inode *qrinode; -+ struct vz_quota_master *qmblk; -+ -+ qmblk = NULL; -+ qrinode = NULL; -+ if (sb->s_op->get_quota_root != NULL) -+ qrinode = sb->s_op->get_quota_root(sb); -+ if (qrinode != NULL) -+ qmblk = vzquota_inode_qmblk(qrinode); -+ return qmblk; -+} -+ -+static int vzquota_initialize2(struct inode *inode, int type) -+{ -+ return QUOTA_OK; -+} -+ -+static int vzquota_drop2(struct inode *inode) -+{ -+ return QUOTA_OK; -+} -+ -+static int vzquota_alloc_space2(struct inode *inode, -+ qsize_t number, int prealloc) -+{ -+ inode_add_bytes(inode, number); -+ return QUOTA_OK; -+} -+ -+static int vzquota_alloc_inode2(const struct inode *inode, unsigned long number) -+{ -+ return QUOTA_OK; -+} -+ -+static int vzquota_free_space2(struct inode *inode, qsize_t number) -+{ -+ inode_sub_bytes(inode, number); -+ return QUOTA_OK; -+} -+ -+static int vzquota_free_inode2(const struct inode *inode, unsigned long number) -+{ -+ return QUOTA_OK; -+} -+ -+static int vzquota_transfer2(struct inode *inode, struct iattr *iattr) -+{ -+ return QUOTA_OK; -+} -+ -+struct dquot_operations vz_quota_operations2 = { -+ initialize: vzquota_initialize2, -+ drop: vzquota_drop2, -+ alloc_space: vzquota_alloc_space2, -+ alloc_inode: vzquota_alloc_inode2, -+ free_space: vzquota_free_space2, -+ free_inode: vzquota_free_inode2, -+ transfer: vzquota_transfer2 -+}; -+ -+static int vz_quota_on(struct super_block *sb, int type, -+ int format_id, char *path) -+{ -+ struct vz_quota_master *qmblk; -+ int mask, mask2; -+ int err; -+ -+ qmblk = 
vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ -+ mask = 0; -+ mask2 = 0; -+ sb->dq_op = &vz_quota_operations2; -+ sb->s_qcop = &vz_quotactl_operations; -+ if (type == USRQUOTA) { -+ mask = DQUOT_USR_ENABLED; -+ mask2 = VZDQ_USRQUOTA; -+ } -+ if (type == GRPQUOTA) { -+ mask = DQUOT_GRP_ENABLED; -+ mask2 = VZDQ_GRPQUOTA; -+ } -+ err = -EBUSY; -+ if (qmblk->dq_flags & mask2) -+ goto out; -+ -+ err = 0; -+ qmblk->dq_flags |= mask2; -+ sb->s_dquot.flags |= mask; -+ -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+static int vz_quota_off(struct super_block *sb, int type) -+{ -+ struct vz_quota_master *qmblk; -+ int mask2; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ -+ mask2 = 0; -+ if (type == USRQUOTA) -+ mask2 = VZDQ_USRQUOTA; -+ if (type == GRPQUOTA) -+ mask2 = VZDQ_GRPQUOTA; -+ err = -EINVAL; -+ if (!(qmblk->dq_flags & mask2)) -+ goto out; -+ -+ qmblk->dq_flags &= ~mask2; -+ err = 0; -+ -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+static int vz_quota_sync(struct super_block *sb, int type) -+{ -+ return 0; /* vz quota is always uptodate */ -+} -+ -+static int vz_get_dqblk(struct super_block *sb, int type, -+ qid_t id, struct if_dqblk *di) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid *ugid; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ -+ err = 0; -+ ugid = vzquota_find_ugid(qmblk, id, type, VZDQUG_FIND_DONT_ALLOC); -+ if (ugid != VZ_QUOTA_UGBAD) { -+ qmblk_data_read_lock(qmblk); -+ di->dqb_bhardlimit = 
ugid->qugid_stat.bhardlimit >> 10; -+ di->dqb_bsoftlimit = ugid->qugid_stat.bsoftlimit >> 10; -+ di->dqb_curspace = ugid->qugid_stat.bcurrent; -+ di->dqb_ihardlimit = ugid->qugid_stat.ihardlimit; -+ di->dqb_isoftlimit = ugid->qugid_stat.isoftlimit; -+ di->dqb_curinodes = ugid->qugid_stat.icurrent; -+ di->dqb_btime = ugid->qugid_stat.btime; -+ di->dqb_itime = ugid->qugid_stat.itime; -+ qmblk_data_read_unlock(qmblk); -+ di->dqb_valid = QIF_ALL; -+ vzquota_put_ugid(qmblk, ugid); -+ } else { -+ memset(di, 0, sizeof(*di)); -+ di->dqb_valid = QIF_ALL; -+ } -+ -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+/* must be called under vz_quota_sem */ -+static int __vz_set_dqblk(struct vz_quota_master *qmblk, -+ int type, qid_t id, struct if_dqblk *di) -+{ -+ struct vz_quota_ugid *ugid; -+ -+ ugid = vzquota_find_ugid(qmblk, id, type, 0); -+ if (ugid == VZ_QUOTA_UGBAD) -+ return -ESRCH; -+ -+ qmblk_data_write_lock(qmblk); -+ /* -+ * Subtle compatibility breakage. -+ * -+ * Some old non-vz kernel quota didn't start grace period -+ * if the new soft limit happens to be below the usage. -+ * Non-vz kernel quota in 2.4.20 starts the grace period -+ * (if it hasn't been started). -+ * Current non-vz kernel performs even more complicated -+ * manipulations... -+ * -+ * Also, current non-vz kernels have inconsistency related to -+ * the grace time start. In regular operations the grace period -+ * is started if the usage is greater than the soft limit (and, -+ * strangely, is cancelled if the usage is less). -+ * However, set_dqblk starts the grace period if the usage is greater -+ * or equal to the soft limit. -+ * -+ * Here we try to mimic the behavior of the current non-vz kernel. 
-+ */ -+ if (di->dqb_valid & QIF_BLIMITS) { -+ ugid->qugid_stat.bhardlimit = -+ (__u64)di->dqb_bhardlimit << 10; -+ ugid->qugid_stat.bsoftlimit = -+ (__u64)di->dqb_bsoftlimit << 10; -+ if (di->dqb_bsoftlimit == 0 || -+ ugid->qugid_stat.bcurrent < ugid->qugid_stat.bsoftlimit) -+ ugid->qugid_stat.btime = 0; -+ else if (!(di->dqb_valid & QIF_BTIME)) -+ ugid->qugid_stat.btime = CURRENT_TIME_SECONDS -+ + qmblk->dq_ugid_info[type].bexpire; -+ else -+ ugid->qugid_stat.btime = di->dqb_btime; -+ } -+ if (di->dqb_valid & QIF_ILIMITS) { -+ ugid->qugid_stat.ihardlimit = di->dqb_ihardlimit; -+ ugid->qugid_stat.isoftlimit = di->dqb_isoftlimit; -+ if (di->dqb_isoftlimit == 0 || -+ ugid->qugid_stat.icurrent < ugid->qugid_stat.isoftlimit) -+ ugid->qugid_stat.itime = 0; -+ else if (!(di->dqb_valid & QIF_ITIME)) -+ ugid->qugid_stat.itime = CURRENT_TIME_SECONDS -+ + qmblk->dq_ugid_info[type].iexpire; -+ else -+ ugid->qugid_stat.itime = di->dqb_itime; -+ } -+ qmblk_data_write_unlock(qmblk); -+ vzquota_put_ugid(qmblk, ugid); -+ -+ return 0; -+} -+ -+static int vz_set_dqblk(struct super_block *sb, int type, -+ qid_t id, struct if_dqblk *di) -+{ -+ struct vz_quota_master *qmblk; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ err = __vz_set_dqblk(qmblk, type, id, di); -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+static int vz_get_dqinfo(struct super_block *sb, int type, -+ struct if_dqinfo *ii) -+{ -+ struct vz_quota_master *qmblk; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ -+ err = 0; -+ ii->dqi_bgrace = qmblk->dq_ugid_info[type].bexpire; -+ ii->dqi_igrace = qmblk->dq_ugid_info[type].iexpire; -+ ii->dqi_flags = 0; -+ ii->dqi_valid = 
IIF_ALL; -+ -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+/* must be called under vz_quota_sem */ -+static int __vz_set_dqinfo(struct vz_quota_master *qmblk, -+ int type, struct if_dqinfo *ii) -+{ -+ if (ii->dqi_valid & IIF_FLAGS) -+ if (ii->dqi_flags & DQF_MASK) -+ return -EINVAL; -+ -+ if (ii->dqi_valid & IIF_BGRACE) -+ qmblk->dq_ugid_info[type].bexpire = ii->dqi_bgrace; -+ if (ii->dqi_valid & IIF_IGRACE) -+ qmblk->dq_ugid_info[type].iexpire = ii->dqi_igrace; -+ return 0; -+} -+ -+static int vz_set_dqinfo(struct super_block *sb, int type, -+ struct if_dqinfo *ii) -+{ -+ struct vz_quota_master *qmblk; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ down(&vz_quota_sem); -+ err = -ESRCH; -+ if (qmblk == NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ err = __vz_set_dqinfo(qmblk, type, ii); -+out: -+ up(&vz_quota_sem); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ return err; -+} -+ -+#ifdef CONFIG_QUOTA_COMPAT -+ -+#define Q_GETQUOTI_SIZE 1024 -+ -+#define UGID2DQBLK(dst, src) \ -+ do { \ -+ (dst)->dqb_ihardlimit = (src)->qugid_stat.ihardlimit; \ -+ (dst)->dqb_isoftlimit = (src)->qugid_stat.isoftlimit; \ -+ (dst)->dqb_curinodes = (src)->qugid_stat.icurrent; \ -+ /* in 1K blocks */ \ -+ (dst)->dqb_bhardlimit = (src)->qugid_stat.bhardlimit >> 10; \ -+ /* in 1K blocks */ \ -+ (dst)->dqb_bsoftlimit = (src)->qugid_stat.bsoftlimit >> 10; \ -+ /* in bytes, 64 bit */ \ -+ (dst)->dqb_curspace = (src)->qugid_stat.bcurrent; \ -+ (dst)->dqb_btime = (src)->qugid_stat.btime; \ -+ (dst)->dqb_itime = (src)->qugid_stat.itime; \ -+ } while (0) -+ -+static int vz_get_quoti(struct super_block *sb, int type, qid_t idx, -+ struct v2_disk_dqblk *dqblk) -+{ -+ struct vz_quota_master *qmblk; -+ struct v2_disk_dqblk *data, *kbuf; -+ struct vz_quota_ugid *ugid; -+ int count; -+ int err; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ err = -ESRCH; -+ if (qmblk == 
NULL) -+ goto out; -+ err = -EIO; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out; -+ -+ err = -ENOMEM; -+ kbuf = vmalloc(Q_GETQUOTI_SIZE * sizeof(*kbuf)); -+ if (!kbuf) -+ goto out; -+ -+ down(&vz_quota_sem); -+ down(&qmblk->dq_sem); -+ for (ugid = vzquota_get_byindex(qmblk, idx, type), count = 0; -+ ugid != NULL && count < Q_GETQUOTI_SIZE; -+ count++) -+ { -+ data = kbuf + count; -+ qmblk_data_read_lock(qmblk); -+ UGID2DQBLK(data, ugid); -+ qmblk_data_read_unlock(qmblk); -+ data->dqb_id = ugid->qugid_id; -+ -+ /* Find next entry */ -+ ugid = vzquota_get_next(qmblk, ugid); -+ BUG_ON(ugid != NULL && ugid->qugid_type != type); -+ } -+ up(&qmblk->dq_sem); -+ up(&vz_quota_sem); -+ -+ err = count; -+ if (copy_to_user(dqblk, kbuf, count * sizeof(*kbuf))) -+ err = -EFAULT; -+ -+ vfree(kbuf); -+out: -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qmblk); -+ -+ return err; -+} -+ -+#endif -+ -+struct quotactl_ops vz_quotactl_operations = { -+ quota_on: vz_quota_on, -+ quota_off: vz_quota_off, -+ quota_sync: vz_quota_sync, -+ get_info: vz_get_dqinfo, -+ set_info: vz_set_dqinfo, -+ get_dqblk: vz_get_dqblk, -+ set_dqblk: vz_set_dqblk, -+#ifdef CONFIG_QUOTA_COMPAT -+ get_quoti: vz_get_quoti -+#endif -+}; -+ -+ -+/* ---------------------------------------------------------------------- -+ * Management interface for host system admins. 
-+ * --------------------------------------------------------------------- */ -+ -+static int quota_ugid_addstat(unsigned int quota_id, unsigned int ugid_size, -+ struct vz_quota_iface *u_ugid_buf) -+{ -+ struct vz_quota_master *qmblk; -+ int ret; -+ -+ down(&vz_quota_sem); -+ -+ ret = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ ret = -EBUSY; -+ if (qmblk->dq_state != VZDQ_STARTING) -+ goto out; /* working quota doesn't accept new ugids */ -+ -+ ret = 0; -+ /* start to add ugids */ -+ for (ret = 0; ret < ugid_size; ret++) { -+ struct vz_quota_iface ugid_buf; -+ struct vz_quota_ugid *ugid; -+ -+ if (copy_from_user(&ugid_buf, u_ugid_buf, sizeof(ugid_buf))) -+ break; -+ -+ if (ugid_buf.qi_type >= MAXQUOTAS) -+ break; /* bad quota type - this is the only check */ -+ -+ ugid = vzquota_find_ugid(qmblk, -+ ugid_buf.qi_id, ugid_buf.qi_type, 0); -+ if (ugid == VZ_QUOTA_UGBAD) { -+ qmblk->dq_flags |= VZDQUG_FIXED_SET; -+ break; /* limit reached */ -+ } -+ -+ /* update usage/limits -+ * we can copy the data without the lock, because the data -+ * cannot be modified in VZDQ_STARTING state */ -+ ugid->qugid_stat = ugid_buf.qi_stat; -+ -+ vzquota_put_ugid(qmblk, ugid); -+ -+ u_ugid_buf++; /* next user buffer */ -+ } -+out: -+ up(&vz_quota_sem); -+ -+ return ret; -+} -+ -+static int quota_ugid_setgrace(unsigned int quota_id, -+ struct dq_info u_dq_info[]) -+{ -+ struct vz_quota_master *qmblk; -+ struct dq_info dq_info[MAXQUOTAS]; -+ struct dq_info *target; -+ int err, type; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EBUSY; -+ if (qmblk->dq_state != VZDQ_STARTING) -+ goto out; /* working quota doesn't accept changing options */ -+ -+ err = -EFAULT; -+ if (copy_from_user(dq_info, u_dq_info, sizeof(dq_info))) -+ goto out; -+ -+ err = 0; -+ -+ /* update in qmblk */ -+ for (type = 0; type < MAXQUOTAS; type ++) { -+ target = 
&qmblk->dq_ugid_info[type]; -+ target->bexpire = dq_info[type].bexpire; -+ target->iexpire = dq_info[type].iexpire; -+ } -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+static int do_quota_ugid_getstat(struct vz_quota_master *qmblk, int index, int size, -+ struct vz_quota_iface *u_ugid_buf) -+{ -+ int type, count; -+ struct vz_quota_ugid *ugid; -+ -+ if (QTREE_LEAFNUM(qmblk->dq_uid_tree) + -+ QTREE_LEAFNUM(qmblk->dq_gid_tree) -+ <= index) -+ return 0; -+ -+ count = 0; -+ -+ type = index < QTREE_LEAFNUM(qmblk->dq_uid_tree) ? USRQUOTA : GRPQUOTA; -+ if (type == GRPQUOTA) -+ index -= QTREE_LEAFNUM(qmblk->dq_uid_tree); -+ -+ /* loop through ugid and then qgid quota */ -+repeat: -+ for (ugid = vzquota_get_byindex(qmblk, index, type); -+ ugid != NULL && count < size; -+ ugid = vzquota_get_next(qmblk, ugid), count++) -+ { -+ struct vz_quota_iface ugid_buf; -+ -+ /* form interface buffer and send in to user-level */ -+ qmblk_data_read_lock(qmblk); -+ memcpy(&ugid_buf.qi_stat, &ugid->qugid_stat, -+ sizeof(ugid_buf.qi_stat)); -+ qmblk_data_read_unlock(qmblk); -+ ugid_buf.qi_id = ugid->qugid_id; -+ ugid_buf.qi_type = ugid->qugid_type; -+ -+ memcpy(u_ugid_buf, &ugid_buf, sizeof(ugid_buf)); -+ u_ugid_buf++; /* next portion of user buffer */ -+ } -+ -+ if (type == USRQUOTA && count < size) { -+ type = GRPQUOTA; -+ index = 0; -+ goto repeat; -+ } -+ -+ return count; -+} -+ -+static int quota_ugid_getstat(unsigned int quota_id, -+ int index, int size, struct vz_quota_iface *u_ugid_buf) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_iface *k_ugid_buf; -+ int err; -+ -+ if (index < 0 || size < 0) -+ return -EINVAL; -+ -+ if (size > INT_MAX / sizeof(struct vz_quota_iface)) -+ return -EINVAL; -+ -+ k_ugid_buf = vmalloc(size * sizeof(struct vz_quota_iface)); -+ if (k_ugid_buf == NULL) -+ return -ENOMEM; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ down(&qmblk->dq_sem); -+ err = 
do_quota_ugid_getstat(qmblk, index, size, k_ugid_buf); -+ up(&qmblk->dq_sem); -+ if (err < 0) -+ goto out; -+ -+ if (copy_to_user(u_ugid_buf, k_ugid_buf, -+ size * sizeof(struct vz_quota_iface))) -+ err = -EFAULT; -+ -+out: -+ up(&vz_quota_sem); -+ vfree(k_ugid_buf); -+ return err; -+} -+ -+static int quota_ugid_getgrace(unsigned int quota_id, -+ struct dq_info u_dq_info[]) -+{ -+ struct vz_quota_master *qmblk; -+ struct dq_info dq_info[MAXQUOTAS]; -+ struct dq_info *target; -+ int err, type; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = 0; -+ /* update from qmblk */ -+ for (type = 0; type < MAXQUOTAS; type ++) { -+ target = &qmblk->dq_ugid_info[type]; -+ dq_info[type].bexpire = target->bexpire; -+ dq_info[type].iexpire = target->iexpire; -+ dq_info[type].flags = target->flags; -+ } -+ -+ if (copy_to_user(u_dq_info, dq_info, sizeof(dq_info))) -+ err = -EFAULT; -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+static int quota_ugid_getconfig(unsigned int quota_id, -+ struct vz_quota_ugid_stat *info) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid_stat kinfo; -+ int err; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = 0; -+ kinfo.limit = qmblk->dq_ugid_max; -+ kinfo.count = qmblk->dq_ugid_count; -+ kinfo.flags = qmblk->dq_flags; -+ -+ if (copy_to_user(info, &kinfo, sizeof(kinfo))) -+ err = -EFAULT; -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+static int quota_ugid_setconfig(unsigned int quota_id, -+ struct vz_quota_ugid_stat *info) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid_stat kinfo; -+ int err; -+ -+ down(&vz_quota_sem); -+ -+ err = -ENOENT; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EFAULT; -+ if (copy_from_user(&kinfo, info, sizeof(kinfo))) -+ goto out; -+ -+ err = 0; -+ qmblk->dq_ugid_max 
= kinfo.limit; -+ if (qmblk->dq_state == VZDQ_STARTING) { -+ qmblk->dq_flags = kinfo.flags; -+ if (qmblk->dq_flags & VZDQUG_ON) -+ qmblk->dq_flags |= VZDQ_USRQUOTA | VZDQ_GRPQUOTA; -+ } -+ -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+static int quota_ugid_setlimit(unsigned int quota_id, -+ struct vz_quota_ugid_setlimit *u_lim) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid_setlimit lim; -+ int err; -+ -+ down(&vz_quota_sem); -+ -+ err = -ESRCH; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EFAULT; -+ if (copy_from_user(&lim, u_lim, sizeof(lim))) -+ goto out; -+ -+ err = __vz_set_dqblk(qmblk, lim.type, lim.id, &lim.dqb); -+ -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+static int quota_ugid_setinfo(unsigned int quota_id, -+ struct vz_quota_ugid_setinfo *u_info) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid_setinfo info; -+ int err; -+ -+ down(&vz_quota_sem); -+ -+ err = -ESRCH; -+ qmblk = vzquota_find_master(quota_id); -+ if (qmblk == NULL) -+ goto out; -+ -+ err = -EFAULT; -+ if (copy_from_user(&info, u_info, sizeof(info))) -+ goto out; -+ -+ err = __vz_set_dqinfo(qmblk, info.type, &info.dqi); -+ -+out: -+ up(&vz_quota_sem); -+ -+ return err; -+} -+ -+/* -+ * This is a system call to maintain UGID quotas -+ * Note this call is allowed to run ONLY from VE0 -+ */ -+long do_vzquotaugidctl(struct vzctl_quotaugidctl *qub) -+{ -+ int ret; -+ -+ ret = -EPERM; -+ /* access allowed only from root of VE0 */ -+ if (!capable(CAP_SYS_RESOURCE) || -+ !capable(CAP_SYS_ADMIN)) -+ goto out; -+ -+ switch (qub->cmd) { -+ case VZ_DQ_UGID_GETSTAT: -+ ret = quota_ugid_getstat(qub->quota_id, -+ qub->ugid_index, qub->ugid_size, -+ (struct vz_quota_iface *)qub->addr); -+ break; -+ case VZ_DQ_UGID_ADDSTAT: -+ ret = quota_ugid_addstat(qub->quota_id, qub->ugid_size, -+ (struct vz_quota_iface *)qub->addr); -+ break; -+ case VZ_DQ_UGID_GETGRACE: -+ ret = quota_ugid_getgrace(qub->quota_id, -+ (struct 
dq_info *)qub->addr); -+ break; -+ case VZ_DQ_UGID_SETGRACE: -+ ret = quota_ugid_setgrace(qub->quota_id, -+ (struct dq_info *)qub->addr); -+ break; -+ case VZ_DQ_UGID_GETCONFIG: -+ ret = quota_ugid_getconfig(qub->quota_id, -+ (struct vz_quota_ugid_stat *)qub->addr); -+ break; -+ case VZ_DQ_UGID_SETCONFIG: -+ ret = quota_ugid_setconfig(qub->quota_id, -+ (struct vz_quota_ugid_stat *)qub->addr); -+ break; -+ case VZ_DQ_UGID_SETLIMIT: -+ ret = quota_ugid_setlimit(qub->quota_id, -+ (struct vz_quota_ugid_setlimit *) -+ qub->addr); -+ break; -+ case VZ_DQ_UGID_SETINFO: -+ ret = quota_ugid_setinfo(qub->quota_id, -+ (struct vz_quota_ugid_setinfo *) -+ qub->addr); -+ break; -+ default: -+ ret = -EINVAL; -+ goto out; -+ } -+out: -+ return ret; -+} -+ -+static void ugid_quota_on_sb(struct super_block *sb) -+{ -+ struct super_block *real_sb; -+ struct vz_quota_master *qmblk; -+ -+ if (!sb->s_op->get_quota_root) -+ return; -+ -+ real_sb = sb->s_op->get_quota_root(sb)->i_sb; -+ if (real_sb->dq_op != &vz_quota_operations) -+ return; -+ -+ sb->dq_op = &vz_quota_operations2; -+ sb->s_qcop = &vz_quotactl_operations; -+ INIT_LIST_HEAD(&sb->s_dquot.info[USRQUOTA].dqi_dirty_list); -+ INIT_LIST_HEAD(&sb->s_dquot.info[GRPQUOTA].dqi_dirty_list); -+ sb->s_dquot.info[USRQUOTA].dqi_format = &vz_quota_empty_v2_format; -+ sb->s_dquot.info[GRPQUOTA].dqi_format = &vz_quota_empty_v2_format; -+ -+ qmblk = vzquota_find_qmblk(sb); -+ if ((qmblk == NULL) || (qmblk == VZ_QUOTA_BAD)) -+ return; -+ down(&vz_quota_sem); -+ if (qmblk->dq_flags & VZDQ_USRQUOTA) -+ sb->s_dquot.flags |= DQUOT_USR_ENABLED; -+ if (qmblk->dq_flags & VZDQ_GRPQUOTA) -+ sb->s_dquot.flags |= DQUOT_GRP_ENABLED; -+ up(&vz_quota_sem); -+ qmblk_put(qmblk); -+} -+ -+static void ugid_quota_off_sb(struct super_block *sb) -+{ -+ /* can't make quota off on mounted super block */ -+ BUG_ON(sb->s_root != NULL); -+} -+ -+static int ugid_notifier_call(struct vnotifier_block *self, -+ unsigned long n, void *data, int old_ret) -+{ -+ struct 
virt_info_quota *viq; -+ -+ viq = (struct virt_info_quota *)data; -+ -+ switch (n) { -+ case VIRTINFO_QUOTA_ON: -+ ugid_quota_on_sb(viq->super); -+ break; -+ case VIRTINFO_QUOTA_OFF: -+ ugid_quota_off_sb(viq->super); -+ break; -+ case VIRTINFO_QUOTA_GETSTAT: -+ break; -+ default: -+ return old_ret; -+ } -+ return NOTIFY_OK; -+} -+ -+static struct vnotifier_block ugid_notifier_block = { -+ .notifier_call = ugid_notifier_call, -+}; -+ -+/* ---------------------------------------------------------------------- -+ * Init/exit. -+ * --------------------------------------------------------------------- */ -+ -+struct quota_format_type vz_quota_empty_v2_format = { -+ qf_fmt_id: QFMT_VFS_V0, -+ qf_ops: NULL, -+ qf_owner: THIS_MODULE -+}; -+ -+int vzquota_ugid_init() -+{ -+ int err; -+ -+ vz_quota_ugid_cachep = kmem_cache_create("vz_quota_ugid", -+ sizeof(struct vz_quota_ugid), -+ 0, SLAB_HWCACHE_ALIGN, -+ NULL, NULL); -+ if (vz_quota_ugid_cachep == NULL) -+ goto err_slab; -+ -+ err = register_quota_format(&vz_quota_empty_v2_format); -+ if (err) -+ goto err_reg; -+ -+ virtinfo_notifier_register(VITYPE_QUOTA, &ugid_notifier_block); -+ return 0; -+ -+err_reg: -+ kmem_cache_destroy(vz_quota_ugid_cachep); -+ return err; -+ -+err_slab: -+ printk(KERN_ERR "Cannot create VZ_QUOTA SLAB cache\n"); -+ return -ENOMEM; -+} -+ -+void vzquota_ugid_release() -+{ -+ virtinfo_notifier_unregister(VITYPE_QUOTA, &ugid_notifier_block); -+ unregister_quota_format(&vz_quota_empty_v2_format); -+ -+ if (kmem_cache_destroy(vz_quota_ugid_cachep)) -+ printk(KERN_ERR "VZQUOTA: kmem_cache_destroy failed\n"); -+} -diff -uprN linux-2.6.8.1.orig/fs/vzdquot.c linux-2.6.8.1-ve022stab078/fs/vzdquot.c ---- linux-2.6.8.1.orig/fs/vzdquot.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/fs/vzdquot.c 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,1706 @@ -+/* -+ * Copyright (C) 2001, 2002, 2004, 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ * This file contains the core of Virtuozzo disk quota implementation: -+ * maintenance of VZDQ information in inodes, -+ * external interfaces, -+ * module entry. -+ */ -+ -+#include <linux/config.h> -+#include <linux/kernel.h> -+#include <linux/string.h> -+#include <linux/list.h> -+#include <asm/atomic.h> -+#include <linux/spinlock.h> -+#include <asm/semaphore.h> -+#include <linux/slab.h> -+#include <linux/fs.h> -+#include <linux/dcache.h> -+#include <linux/quota.h> -+#include <linux/rcupdate.h> -+#include <linux/module.h> -+#include <asm/uaccess.h> -+#include <linux/vzctl.h> -+#include <linux/vzctl_quota.h> -+#include <linux/vzquota.h> -+#include <linux/virtinfo.h> -+#include <linux/vzdq_tree.h> -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Locking -+ * -+ * ---------------------------------------------------------------------- */ -+ -+/* -+ * Serializes on/off and all other do_vzquotactl operations. -+ * Protects qmblk hash. -+ */ -+struct semaphore vz_quota_sem; -+ -+/* -+ * Data access locks -+ * inode_qmblk -+ * protects qmblk pointers in all inodes and qlnk content in general -+ * (but not qmblk content); -+ * also protects related qmblk invalidation procedures; -+ * can't be per-inode because of vzquota_dtree_qmblk complications -+ * and problems with serialization with quota_on, -+ * but can be per-superblock; -+ * qmblk_data -+ * protects qmblk fields (such as current usage) -+ * quota_data -+ * protects charge/uncharge operations, thus, implies -+ * qmblk_data lock and, if CONFIG_VZ_QUOTA_UGID, inode_qmblk lock -+ * (to protect ugid pointers). 
-+ *
-+ * Lock order:
-+ *		inode_qmblk_lock -> dcache_lock
-+ *		inode_qmblk_lock -> qmblk_data
-+ */
-+static spinlock_t vzdq_qmblk_lock = SPIN_LOCK_UNLOCKED;
-+
-+inline void inode_qmblk_lock(struct super_block *sb)
-+{
-+	spin_lock(&vzdq_qmblk_lock);
-+}
-+
-+inline void inode_qmblk_unlock(struct super_block *sb)
-+{
-+	spin_unlock(&vzdq_qmblk_lock);
-+}
-+
-+inline void qmblk_data_read_lock(struct vz_quota_master *qmblk)
-+{
-+	spin_lock(&qmblk->dq_data_lock);
-+}
-+
-+inline void qmblk_data_read_unlock(struct vz_quota_master *qmblk)
-+{
-+	spin_unlock(&qmblk->dq_data_lock);
-+}
-+
-+inline void qmblk_data_write_lock(struct vz_quota_master *qmblk)
-+{
-+	spin_lock(&qmblk->dq_data_lock);
-+}
-+
-+inline void qmblk_data_write_unlock(struct vz_quota_master *qmblk)
-+{
-+	spin_unlock(&qmblk->dq_data_lock);
-+}
-+
-+
-+/* ----------------------------------------------------------------------
-+ *
-+ * Master hash table handling.
-+ *
-+ * SMP not safe, serialized by vz_quota_sem within quota syscalls
-+ *
-+ * --------------------------------------------------------------------- */
-+
-+static kmem_cache_t *vzquota_cachep;
-+
-+/*
-+ * Hash function.
-+ */ -+#define QHASH_BITS 6 -+#define VZ_QUOTA_HASH_SIZE (1 << QHASH_BITS) -+#define QHASH_MASK (VZ_QUOTA_HASH_SIZE - 1) -+ -+struct list_head vzquota_hash_table[VZ_QUOTA_HASH_SIZE]; -+int vzquota_hash_size = VZ_QUOTA_HASH_SIZE; -+ -+static inline int vzquota_hash_func(unsigned int qid) -+{ -+ return (((qid >> QHASH_BITS) ^ qid) & QHASH_MASK); -+} -+ -+/** -+ * vzquota_alloc_master - alloc and instantiate master quota record -+ * -+ * Returns: -+ * pointer to newly created record if SUCCESS -+ * -ENOMEM if out of memory -+ * -EEXIST if record with given quota_id already exist -+ */ -+struct vz_quota_master *vzquota_alloc_master(unsigned int quota_id, -+ struct vz_quota_stat *qstat) -+{ -+ int err; -+ struct vz_quota_master *qmblk; -+ -+ err = -EEXIST; -+ if (vzquota_find_master(quota_id) != NULL) -+ goto out; -+ -+ err = -ENOMEM; -+ qmblk = kmem_cache_alloc(vzquota_cachep, SLAB_KERNEL); -+ if (qmblk == NULL) -+ goto out; -+#ifdef CONFIG_VZ_QUOTA_UGID -+ qmblk->dq_uid_tree = quotatree_alloc(); -+ if (!qmblk->dq_uid_tree) -+ goto out_free; -+ -+ qmblk->dq_gid_tree = quotatree_alloc(); -+ if (!qmblk->dq_gid_tree) -+ goto out_free_tree; -+#endif -+ -+ qmblk->dq_state = VZDQ_STARTING; -+ init_MUTEX(&qmblk->dq_sem); -+ spin_lock_init(&qmblk->dq_data_lock); -+ -+ qmblk->dq_id = quota_id; -+ qmblk->dq_stat = qstat->dq_stat; -+ qmblk->dq_info = qstat->dq_info; -+ qmblk->dq_root_dentry = NULL; -+ qmblk->dq_root_mnt = NULL; -+ qmblk->dq_sb = NULL; -+ qmblk->dq_ugid_count = 0; -+ qmblk->dq_ugid_max = 0; -+ qmblk->dq_flags = 0; -+ memset(qmblk->dq_ugid_info, 0, sizeof(qmblk->dq_ugid_info)); -+ INIT_LIST_HEAD(&qmblk->dq_ilink_list); -+ -+ atomic_set(&qmblk->dq_count, 1); -+ -+ /* insert in hash chain */ -+ list_add(&qmblk->dq_hash, -+ &vzquota_hash_table[vzquota_hash_func(quota_id)]); -+ -+ /* success */ -+ return qmblk; -+ -+out_free_tree: -+ quotatree_free(qmblk->dq_uid_tree, NULL); -+out_free: -+ kmem_cache_free(vzquota_cachep, qmblk); -+out: -+ return ERR_PTR(err); -+} -+ 
-+static struct vz_quota_master *vzquota_alloc_fake(void) -+{ -+ struct vz_quota_master *qmblk; -+ -+ qmblk = kmem_cache_alloc(vzquota_cachep, SLAB_KERNEL); -+ if (qmblk == NULL) -+ return NULL; -+ memset(qmblk, 0, sizeof(*qmblk)); -+ qmblk->dq_state = VZDQ_STOPING; -+ qmblk->dq_flags = VZDQ_NOQUOT; -+ spin_lock_init(&qmblk->dq_data_lock); -+ INIT_LIST_HEAD(&qmblk->dq_ilink_list); -+ atomic_set(&qmblk->dq_count, 1); -+ return qmblk; -+} -+ -+/** -+ * vzquota_find_master - find master record with given id -+ * -+ * Returns qmblk without touching its refcounter. -+ * Called under vz_quota_sem. -+ */ -+struct vz_quota_master *vzquota_find_master(unsigned int quota_id) -+{ -+ int i; -+ struct vz_quota_master *qp; -+ -+ i = vzquota_hash_func(quota_id); -+ list_for_each_entry(qp, &vzquota_hash_table[i], dq_hash) { -+ if (qp->dq_id == quota_id) -+ return qp; -+ } -+ return NULL; -+} -+ -+/** -+ * vzquota_free_master - release resources taken by qmblk, freeing memory -+ * -+ * qmblk is assumed to be already taken out from the hash. -+ * Should be called outside vz_quota_sem. -+ */ -+void vzquota_free_master(struct vz_quota_master *qmblk) -+{ -+#ifdef CONFIG_VZ_QUOTA_UGID -+ vzquota_kill_ugid(qmblk); -+#endif -+ BUG_ON(!list_empty(&qmblk->dq_ilink_list)); -+ kmem_cache_free(vzquota_cachep, qmblk); -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Passing quota information through current -+ * -+ * Used in inode -> qmblk lookup at inode creation stage (since at that -+ * time there are no links between the inode being created and its parent -+ * directory). 
-+ * -+ * --------------------------------------------------------------------- */ -+ -+#define VZDQ_CUR_MAGIC 0x57d0fee2 -+ -+static inline int vzquota_cur_qmblk_check(void) -+{ -+ return current->magic == VZDQ_CUR_MAGIC; -+} -+ -+static inline struct inode *vzquota_cur_qmblk_fetch(void) -+{ -+ return current->ino; -+} -+ -+static inline void vzquota_cur_qmblk_set(struct inode *data) -+{ -+ struct task_struct *tsk; -+ -+ tsk = current; -+ tsk->magic = VZDQ_CUR_MAGIC; -+ tsk->ino = data; -+} -+ -+#if 0 -+static inline void vzquota_cur_qmblk_reset(void) -+{ -+ current->magic = 0; -+} -+#endif -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Superblock quota operations -+ * -+ * --------------------------------------------------------------------- */ -+ -+/* -+ * Kernel structure abuse. -+ * We use files[0] pointer as an int variable: -+ * reference counter of how many quota blocks uses this superblock. -+ * files[1] is used for generations structure which helps us to track -+ * when traversing of dentries is really required. -+ */ -+#define __VZ_QUOTA_NOQUOTA(sb) (*(struct vz_quota_master **)\ -+ &sb->s_dquot.files[1]) -+#define __VZ_QUOTA_TSTAMP(sb) ((struct timeval *)\ -+ &sb->s_dquot.dqio_sem) -+ -+#if defined(VZ_QUOTA_UNLOAD) -+ -+#define __VZ_QUOTA_SBREF(sb) (*(int *)&sb->s_dquot.files[0]) -+ -+struct dquot_operations *orig_dq_op; -+struct quotactl_ops *orig_dq_cop; -+ -+/** -+ * quota_get_super - account for new a quoted tree under the superblock -+ * -+ * One superblock can have multiple directory subtrees with different VZ -+ * quotas. We keep a counter of such subtrees and set VZ quota operations or -+ * reset the default ones. -+ * -+ * Called under vz_quota_sem (from quota_on). 
-+ */
-+int vzquota_get_super(struct super_block *sb)
-+{
-+	if (sb->dq_op != &vz_quota_operations) {
-+		down(&sb->s_dquot.dqonoff_sem);
-+		if (sb->s_dquot.flags & (DQUOT_USR_ENABLED|DQUOT_GRP_ENABLED)) {
-+			up(&sb->s_dquot.dqonoff_sem);
-+			return -EEXIST;
-+		}
-+		if (orig_dq_op == NULL && sb->dq_op != NULL)
-+			orig_dq_op = sb->dq_op;
-+		sb->dq_op = &vz_quota_operations;
-+		if (orig_dq_cop == NULL && sb->s_qcop != NULL)
-+			orig_dq_cop = sb->s_qcop;
-+		/* XXX this may race with sys_quotactl */
-+#ifdef CONFIG_VZ_QUOTA_UGID
-+		sb->s_qcop = &vz_quotactl_operations;
-+#else
-+		sb->s_qcop = NULL;
-+#endif
-+		do_gettimeofday(__VZ_QUOTA_TSTAMP(sb));
-+		memset(&sb->s_dquot.info, 0, sizeof(sb->s_dquot.info));
-+
-+		INIT_LIST_HEAD(&sb->s_dquot.info[USRQUOTA].dqi_dirty_list);
-+		INIT_LIST_HEAD(&sb->s_dquot.info[GRPQUOTA].dqi_dirty_list);
-+		sb->s_dquot.info[USRQUOTA].dqi_format = &vz_quota_empty_v2_format;
-+		sb->s_dquot.info[GRPQUOTA].dqi_format = &vz_quota_empty_v2_format;
-+		/*
-+		 * To get quotaops.h to call us we need to mark superblock
-+		 * as having quota. These flags mark the moment when
-+		 * our dq_op start to be called.
-+		 *
-+		 * The ordering of dq_op and s_dquot.flags assignment
-+		 * needs to be enforced, but other CPUs do not do rmb()
-+		 * between s_dquot.flags and dq_op accesses.
-+		 */
-+		wmb(); synchronize_kernel();
-+		sb->s_dquot.flags = DQUOT_USR_ENABLED|DQUOT_GRP_ENABLED;
-+		__module_get(THIS_MODULE);
-+		up(&sb->s_dquot.dqonoff_sem);
-+	}
-+	/* protected by vz_quota_sem */
-+	__VZ_QUOTA_SBREF(sb)++;
-+	return 0;
-+}
-+
-+/**
-+ * quota_put_super - release superblock when one quota tree goes away
-+ *
-+ * Called under vz_quota_sem.
-+ */
-+void vzquota_put_super(struct super_block *sb)
-+{
-+	int count;
-+
-+	count = --__VZ_QUOTA_SBREF(sb);
-+	if (count == 0) {
-+		down(&sb->s_dquot.dqonoff_sem);
-+		sb->s_dquot.flags = 0;
-+		wmb(); synchronize_kernel();
-+		sema_init(&sb->s_dquot.dqio_sem, 1);
-+		sb->s_qcop = orig_dq_cop;
-+		sb->dq_op = orig_dq_op;
-+		inode_qmblk_lock(sb);
-+		quota_gen_put(SB_QGEN(sb));
-+		SB_QGEN(sb) = NULL;
-+		/* release qlnk's without qmblk */
-+		remove_inode_quota_links_list(&non_vzquota_inodes_lh,
-+				sb, NULL);
-+		/*
-+		 * Races with quota initialization:
-+		 * after this inode_qmblk_unlock all inode's generations are
-+		 * invalidated, quota_inode_qmblk checks superblock operations.
-+		 */
-+		inode_qmblk_unlock(sb);
-+		/*
-+		 * Module refcounting: in theory, this is the best place
-+		 * to call module_put(THIS_MODULE).
-+		 * In reality, it can't be done because we can't be sure that
-+		 * other CPUs do not enter our code segment through dq_op
-+		 * cached a long time ago. Quotaops interface isn't supposed to
-+		 * go into modules currently (that is, into unloadable
-+		 * modules). By omitting module_put, our module isn't
-+		 * unloadable.
-+		 */
-+		up(&sb->s_dquot.dqonoff_sem);
-+	}
-+}
-+
-+#else
-+
-+struct vzquota_new_sop {
-+	struct super_operations new_op;
-+	struct super_operations *old_op;
-+};
-+
-+/**
-+ * vzquota_shutdown_super - callback on umount
-+ */
-+void vzquota_shutdown_super(struct super_block *sb)
-+{
-+	struct vz_quota_master *qmblk;
-+	struct vzquota_new_sop *sop;
-+
-+	qmblk = __VZ_QUOTA_NOQUOTA(sb);
-+	__VZ_QUOTA_NOQUOTA(sb) = NULL;
-+	if (qmblk != NULL)
-+		qmblk_put(qmblk);
-+	sop = container_of(sb->s_op, struct vzquota_new_sop, new_op);
-+	sb->s_op = sop->old_op;
-+	kfree(sop);
-+	(*sb->s_op->put_super)(sb);
-+}
-+
-+/**
-+ * vzquota_get_super - account for a new quota tree under the superblock
-+ *
-+ * One superblock can have multiple directory subtrees with different VZ
-+ * quotas.
-+ *
-+ * Called under vz_quota_sem (from vzquota_on).
-+ */ -+int vzquota_get_super(struct super_block *sb) -+{ -+ struct vz_quota_master *qnew; -+ struct vzquota_new_sop *sop; -+ int err; -+ -+ down(&sb->s_dquot.dqonoff_sem); -+ err = -EEXIST; -+ if ((sb->s_dquot.flags & (DQUOT_USR_ENABLED|DQUOT_GRP_ENABLED)) && -+ sb->dq_op != &vz_quota_operations) -+ goto out_up; -+ -+ /* -+ * This allocation code should be under sb->dq_op check below, but -+ * it doesn't really matter... -+ */ -+ if (__VZ_QUOTA_NOQUOTA(sb) == NULL) { -+ qnew = vzquota_alloc_fake(); -+ if (qnew == NULL) -+ goto out_up; -+ __VZ_QUOTA_NOQUOTA(sb) = qnew; -+ } -+ -+ if (sb->dq_op != &vz_quota_operations) { -+ sop = kmalloc(sizeof(*sop), GFP_KERNEL); -+ if (sop == NULL) { -+ vzquota_free_master(__VZ_QUOTA_NOQUOTA(sb)); -+ __VZ_QUOTA_NOQUOTA(sb) = NULL; -+ goto out_up; -+ } -+ memcpy(&sop->new_op, sb->s_op, sizeof(sop->new_op)); -+ sop->new_op.put_super = &vzquota_shutdown_super; -+ sop->old_op = sb->s_op; -+ sb->s_op = &sop->new_op; -+ -+ sb->dq_op = &vz_quota_operations; -+#ifdef CONFIG_VZ_QUOTA_UGID -+ sb->s_qcop = &vz_quotactl_operations; -+#else -+ sb->s_qcop = NULL; -+#endif -+ do_gettimeofday(__VZ_QUOTA_TSTAMP(sb)); -+ -+ memset(&sb->s_dquot.info, 0, sizeof(sb->s_dquot.info)); -+ /* these 2 list heads are checked in sync_dquots() */ -+ INIT_LIST_HEAD(&sb->s_dquot.info[USRQUOTA].dqi_dirty_list); -+ INIT_LIST_HEAD(&sb->s_dquot.info[GRPQUOTA].dqi_dirty_list); -+ sb->s_dquot.info[USRQUOTA].dqi_format = -+ &vz_quota_empty_v2_format; -+ sb->s_dquot.info[GRPQUOTA].dqi_format = -+ &vz_quota_empty_v2_format; -+ -+ /* -+ * To get quotaops.h to call us we need to mark superblock -+ * as having quota. These flags mark the moment when -+ * our dq_op start to be called. -+ * -+ * The ordering of dq_op and s_dquot.flags assignment -+ * needs to be enforced, but other CPUs do not do rmb() -+ * between s_dquot.flags and dq_op accesses. 
-+ */ -+ wmb(); synchronize_kernel(); -+ sb->s_dquot.flags = DQUOT_USR_ENABLED|DQUOT_GRP_ENABLED; -+ } -+ err = 0; -+ -+out_up: -+ up(&sb->s_dquot.dqonoff_sem); -+ return err; -+} -+ -+/** -+ * vzquota_put_super - one quota tree less on this superblock -+ * -+ * Called under vz_quota_sem. -+ */ -+void vzquota_put_super(struct super_block *sb) -+{ -+ /* -+ * Even if this put is the last one, -+ * sb->s_dquot.flags can't be cleared, because otherwise vzquota_drop -+ * won't be called and the remaining qmblk references won't be put. -+ */ -+} -+ -+#endif -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Helpers for inode -> qmblk link maintenance -+ * -+ * --------------------------------------------------------------------- */ -+ -+#define __VZ_QUOTA_EMPTY ((void *)0xbdbdbdbd) -+#define VZ_QUOTA_IS_NOQUOTA(qm, sb) ((qm)->dq_flags & VZDQ_NOQUOT) -+#define VZ_QUOTA_EMPTY_IOPS (&vfs_empty_iops) -+extern struct inode_operations vfs_empty_iops; -+ -+static int VZ_QUOTA_IS_ACTUAL(struct inode *inode) -+{ -+ struct vz_quota_master *qmblk; -+ -+ qmblk = INODE_QLNK(inode)->qmblk; -+ if (qmblk == VZ_QUOTA_BAD) -+ return 1; -+ if (qmblk == __VZ_QUOTA_EMPTY) -+ return 0; -+ if (qmblk->dq_flags & VZDQ_NOACT) -+ /* not actual (invalidated) qmblk */ -+ return 0; -+ return 1; -+} -+ -+static inline int vzquota_qlnk_is_empty(struct vz_quota_ilink *qlnk) -+{ -+ return qlnk->qmblk == __VZ_QUOTA_EMPTY; -+} -+ -+static inline void vzquota_qlnk_set_empty(struct vz_quota_ilink *qlnk) -+{ -+ qlnk->qmblk = __VZ_QUOTA_EMPTY; -+ qlnk->origin = VZ_QUOTAO_SETE; -+} -+ -+void vzquota_qlnk_init(struct vz_quota_ilink *qlnk) -+{ -+ memset(qlnk, 0, sizeof(*qlnk)); -+ INIT_LIST_HEAD(&qlnk->list); -+ vzquota_qlnk_set_empty(qlnk); -+ qlnk->origin = VZ_QUOTAO_INIT; -+} -+ -+void vzquota_qlnk_destroy(struct vz_quota_ilink *qlnk) -+{ -+ might_sleep(); -+ if (vzquota_qlnk_is_empty(qlnk)) -+ return; -+#if defined(CONFIG_VZ_QUOTA_UGID) -+ if (qlnk->qmblk != NULL && 
qlnk->qmblk != VZ_QUOTA_BAD) { -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid *quid, *qgid; -+ qmblk = qlnk->qmblk; -+ quid = qlnk->qugid[USRQUOTA]; -+ qgid = qlnk->qugid[GRPQUOTA]; -+ if (quid != NULL || qgid != NULL) { -+ down(&qmblk->dq_sem); -+ if (qgid != NULL) -+ vzquota_put_ugid(qmblk, qgid); -+ if (quid != NULL) -+ vzquota_put_ugid(qmblk, quid); -+ up(&qmblk->dq_sem); -+ } -+ } -+#endif -+ if (qlnk->qmblk != NULL && qlnk->qmblk != VZ_QUOTA_BAD) -+ qmblk_put(qlnk->qmblk); -+ qlnk->origin = VZ_QUOTAO_DESTR; -+} -+ -+/** -+ * vzquota_qlnk_swap - swap inode's and temporary vz_quota_ilink contents -+ * @qlt: temporary -+ * @qli: inode's -+ * -+ * Locking is provided by the caller (depending on the context). -+ * After swap, @qli is inserted into the corresponding dq_ilink_list, -+ * @qlt list is reinitialized. -+ */ -+static void vzquota_qlnk_swap(struct vz_quota_ilink *qlt, -+ struct vz_quota_ilink *qli) -+{ -+ struct vz_quota_master *qb; -+ struct vz_quota_ugid *qu; -+ int i; -+ -+ qb = qlt->qmblk; -+ qlt->qmblk = qli->qmblk; -+ qli->qmblk = qb; -+ list_del_init(&qli->list); -+ if (qb != __VZ_QUOTA_EMPTY && qb != VZ_QUOTA_BAD) -+ list_add(&qli->list, &qb->dq_ilink_list); -+ INIT_LIST_HEAD(&qlt->list); -+ qli->origin = VZ_QUOTAO_SWAP; -+ -+ for (i = 0; i < MAXQUOTAS; i++) { -+ qu = qlt->qugid[i]; -+ qlt->qugid[i] = qli->qugid[i]; -+ qli->qugid[i] = qu; -+ } -+} -+ -+/** -+ * vzquota_qlnk_reinit_locked - destroy qlnk content, called under locks -+ * -+ * Called under dcache_lock and inode_qmblk locks. -+ * Returns 1 if locks were dropped inside, 0 if atomic. 
-+ */ -+static int vzquota_qlnk_reinit_locked(struct vz_quota_ilink *qlnk, -+ struct inode *inode) -+{ -+ if (vzquota_qlnk_is_empty(qlnk)) -+ return 0; -+ if (qlnk->qmblk == VZ_QUOTA_BAD) { -+ vzquota_qlnk_set_empty(qlnk); -+ return 0; -+ } -+ spin_unlock(&dcache_lock); -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(qlnk); -+ vzquota_qlnk_init(qlnk); -+ inode_qmblk_lock(inode->i_sb); -+ spin_lock(&dcache_lock); -+ return 1; -+} -+ -+#if defined(CONFIG_VZ_QUOTA_UGID) -+/** -+ * vzquota_qlnk_reinit_attr - destroy and reinit qlnk content -+ * -+ * Similar to vzquota_qlnk_reinit_locked, called under different locks. -+ */ -+static int vzquota_qlnk_reinit_attr(struct vz_quota_ilink *qlnk, -+ struct inode *inode, -+ struct vz_quota_master *qmblk) -+{ -+ if (vzquota_qlnk_is_empty(qlnk)) -+ return 0; -+ /* may be optimized if qlnk->qugid all NULLs */ -+ qmblk_data_write_unlock(qmblk); -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(qlnk); -+ vzquota_qlnk_init(qlnk); -+ inode_qmblk_lock(inode->i_sb); -+ qmblk_data_write_lock(qmblk); -+ return 1; -+} -+#endif -+ -+/** -+ * vzquota_qlnk_fill - fill vz_quota_ilink content -+ * @qlnk: vz_quota_ilink to fill -+ * @inode: inode for which @qlnk is filled (i_sb, i_uid, i_gid) -+ * @qmblk: qmblk to which this @qlnk will belong -+ * -+ * Called under dcache_lock and inode_qmblk locks. -+ * Returns 1 if locks were dropped inside, 0 if atomic. -+ * @qlnk is expected to be empty. 
-+ */ -+static int vzquota_qlnk_fill(struct vz_quota_ilink *qlnk, -+ struct inode *inode, -+ struct vz_quota_master *qmblk) -+{ -+ if (qmblk != VZ_QUOTA_BAD) -+ qmblk_get(qmblk); -+ qlnk->qmblk = qmblk; -+ -+#if defined(CONFIG_VZ_QUOTA_UGID) -+ if (qmblk != VZ_QUOTA_BAD && -+ !VZ_QUOTA_IS_NOQUOTA(qmblk, inode->i_sb) && -+ (qmblk->dq_flags & VZDQUG_ON)) { -+ struct vz_quota_ugid *quid, *qgid; -+ -+ spin_unlock(&dcache_lock); -+ inode_qmblk_unlock(inode->i_sb); -+ -+ down(&qmblk->dq_sem); -+ quid = __vzquota_find_ugid(qmblk, inode->i_uid, USRQUOTA, 0); -+ qgid = __vzquota_find_ugid(qmblk, inode->i_gid, GRPQUOTA, 0); -+ up(&qmblk->dq_sem); -+ -+ inode_qmblk_lock(inode->i_sb); -+ spin_lock(&dcache_lock); -+ qlnk->qugid[USRQUOTA] = quid; -+ qlnk->qugid[GRPQUOTA] = qgid; -+ return 1; -+ } -+#endif -+ -+ return 0; -+} -+ -+#if defined(CONFIG_VZ_QUOTA_UGID) -+/** -+ * vzquota_qlnk_fill_attr - fill vz_quota_ilink content for uid, gid -+ * -+ * This function is a helper for vzquota_transfer, and differs from -+ * vzquota_qlnk_fill only by locking. 
-+ */ -+static int vzquota_qlnk_fill_attr(struct vz_quota_ilink *qlnk, -+ struct inode *inode, -+ struct iattr *iattr, -+ int mask, -+ struct vz_quota_master *qmblk) -+{ -+ qmblk_get(qmblk); -+ qlnk->qmblk = qmblk; -+ -+ if (mask) { -+ struct vz_quota_ugid *quid, *qgid; -+ -+ quid = qgid = NULL; /* to make gcc happy */ -+ if (!(mask & (1 << USRQUOTA))) -+ quid = vzquota_get_ugid(INODE_QLNK(inode)-> -+ qugid[USRQUOTA]); -+ if (!(mask & (1 << GRPQUOTA))) -+ qgid = vzquota_get_ugid(INODE_QLNK(inode)-> -+ qugid[GRPQUOTA]); -+ -+ qmblk_data_write_unlock(qmblk); -+ inode_qmblk_unlock(inode->i_sb); -+ -+ down(&qmblk->dq_sem); -+ if (mask & (1 << USRQUOTA)) -+ quid = __vzquota_find_ugid(qmblk, iattr->ia_uid, -+ USRQUOTA, 0); -+ if (mask & (1 << GRPQUOTA)) -+ qgid = __vzquota_find_ugid(qmblk, iattr->ia_gid, -+ GRPQUOTA, 0); -+ up(&qmblk->dq_sem); -+ -+ inode_qmblk_lock(inode->i_sb); -+ qmblk_data_write_lock(qmblk); -+ qlnk->qugid[USRQUOTA] = quid; -+ qlnk->qugid[GRPQUOTA] = qgid; -+ return 1; -+ } -+ -+ return 0; -+} -+#endif -+ -+/** -+ * __vzquota_inode_init - make sure inode's qlnk is initialized -+ * -+ * May be called if qlnk is already initialized, detects this situation itself. -+ * Called under inode_qmblk_lock. -+ */ -+static void __vzquota_inode_init(struct inode *inode, unsigned char origin) -+{ -+ if (inode->i_dquot[USRQUOTA] == NODQUOT) { -+ vzquota_qlnk_init(INODE_QLNK(inode)); -+ inode->i_dquot[USRQUOTA] = (void *)~(unsigned long)NODQUOT; -+ } -+ INODE_QLNK(inode)->origin = origin; -+} -+ -+/** -+ * vzquota_inode_drop - destroy VZ quota information in the inode -+ * -+ * Inode must not be externally accessible or dirty. 
-+ */ -+static void vzquota_inode_drop(struct inode *inode) -+{ -+ struct vz_quota_ilink qlnk; -+ -+ vzquota_qlnk_init(&qlnk); -+ inode_qmblk_lock(inode->i_sb); -+ vzquota_qlnk_swap(&qlnk, INODE_QLNK(inode)); -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_DRCAL; -+ inode->i_dquot[USRQUOTA] = NODQUOT; -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(&qlnk); -+} -+ -+/** -+ * vzquota_inode_qmblk_set - initialize inode's qlnk -+ * @inode: inode to be initialized -+ * @qmblk: quota master block to which this inode should belong (may be BAD) -+ * @qlnk: placeholder to store data to resolve locking issues -+ * -+ * Returns 1 if locks were dropped and rechecks possibly needed, 0 otherwise. -+ * Called under dcache_lock and inode_qmblk locks. -+ * @qlnk will be destroyed in the caller chain. -+ * -+ * It is not mandatory to restart parent checks since quota on/off currently -+ * shrinks dentry tree and checks that there are not outside references. -+ * But if at some time that shink is removed, restarts will be required. -+ * Additionally, the restarts prevent inconsistencies if the dentry tree -+ * changes (inode is moved). This is not a big deal, but anyway... 
-+ */ -+static int vzquota_inode_qmblk_set(struct inode *inode, -+ struct vz_quota_master *qmblk, -+ struct vz_quota_ilink *qlnk) -+{ -+ if (qmblk == NULL) { -+ printk(KERN_ERR "VZDQ: NULL in set, " -+ "orig %u, dev %s, inode %lu, fs %s\n", -+ INODE_QLNK(inode)->origin, -+ inode->i_sb->s_id, inode->i_ino, -+ inode->i_sb->s_type->name); -+ printk(KERN_ERR "current %d (%s), VE %d\n", -+ current->pid, current->comm, -+ VEID(get_exec_env())); -+ dump_stack(); -+ qmblk = VZ_QUOTA_BAD; -+ } -+ while (1) { -+ if (vzquota_qlnk_is_empty(qlnk) && -+ vzquota_qlnk_fill(qlnk, inode, qmblk)) -+ return 1; -+ if (qlnk->qmblk == qmblk) -+ break; -+ if (vzquota_qlnk_reinit_locked(qlnk, inode)) -+ return 1; -+ } -+ vzquota_qlnk_swap(qlnk, INODE_QLNK(inode)); -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_QSET; -+ return 0; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * vzquota_inode_qmblk (inode -> qmblk lookup) parts -+ * -+ * --------------------------------------------------------------------- */ -+ -+static int vzquota_dparents_check_attach(struct inode *inode) -+{ -+ if (!list_empty(&inode->i_dentry)) -+ return 0; -+ printk(KERN_ERR "VZDQ: no parent for " -+ "dev %s, inode %lu, fs %s\n", -+ inode->i_sb->s_id, -+ inode->i_ino, -+ inode->i_sb->s_type->name); -+ return -1; -+} -+ -+static struct inode *vzquota_dparents_check_actual(struct inode *inode) -+{ -+ struct dentry *de; -+ -+ list_for_each_entry(de, &inode->i_dentry, d_alias) { -+ if (de->d_parent == de) /* detached dentry, perhaps */ -+ continue; -+ /* first access to parent, make sure its qlnk initialized */ -+ __vzquota_inode_init(de->d_parent->d_inode, VZ_QUOTAO_ACT); -+ if (!VZ_QUOTA_IS_ACTUAL(de->d_parent->d_inode)) -+ return de->d_parent->d_inode; -+ } -+ return NULL; -+} -+ -+static struct vz_quota_master *vzquota_dparents_check_same(struct inode *inode) -+{ -+ struct dentry *de; -+ struct vz_quota_master *qmblk; -+ -+ qmblk = NULL; -+ list_for_each_entry(de, 
&inode->i_dentry, d_alias) { -+ if (de->d_parent == de) /* detached dentry, perhaps */ -+ continue; -+ if (qmblk == NULL) { -+ qmblk = INODE_QLNK(de->d_parent->d_inode)->qmblk; -+ continue; -+ } -+ if (INODE_QLNK(de->d_parent->d_inode)->qmblk != qmblk) { -+ printk(KERN_WARNING "VZDQ: multiple quotas for " -+ "dev %s, inode %lu, fs %s\n", -+ inode->i_sb->s_id, -+ inode->i_ino, -+ inode->i_sb->s_type->name); -+ qmblk = VZ_QUOTA_BAD; -+ break; -+ } -+ } -+ if (qmblk == NULL) { -+ printk(KERN_WARNING "VZDQ: not attached to tree, " -+ "dev %s, inode %lu, fs %s\n", -+ inode->i_sb->s_id, -+ inode->i_ino, -+ inode->i_sb->s_type->name); -+ qmblk = VZ_QUOTA_BAD; -+ } -+ return qmblk; -+} -+ -+static void vzquota_dbranch_actualize(struct inode *inode, -+ struct inode *refinode) -+{ -+ struct inode *pinode; -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ilink qlnk; -+ -+ vzquota_qlnk_init(&qlnk); -+ -+start: -+ if (inode == inode->i_sb->s_root->d_inode) { -+ /* filesystem root */ -+ atomic_inc(&inode->i_count); -+ do { -+ qmblk = __VZ_QUOTA_NOQUOTA(inode->i_sb); -+ } while (vzquota_inode_qmblk_set(inode, qmblk, &qlnk)); -+ goto out; -+ } -+ -+ if (!vzquota_dparents_check_attach(inode)) { -+ pinode = vzquota_dparents_check_actual(inode); -+ if (pinode != NULL) { -+ inode = pinode; -+ goto start; -+ } -+ } -+ -+ atomic_inc(&inode->i_count); -+ while (1) { -+ if (VZ_QUOTA_IS_ACTUAL(inode)) /* actualized without us */ -+ break; -+ /* -+ * Need to check parents again if we have slept inside -+ * vzquota_inode_qmblk_set() in the loop. -+ * If the state of parents is different, just return and repeat -+ * the actualizing process again from the inode passed to -+ * vzquota_inode_qmblk_recalc(). 
-+ */ -+ if (!vzquota_dparents_check_attach(inode)) { -+ if (vzquota_dparents_check_actual(inode) != NULL) -+ break; -+ qmblk = vzquota_dparents_check_same(inode); -+ } else -+ qmblk = VZ_QUOTA_BAD; -+ if (!vzquota_inode_qmblk_set(inode, qmblk, &qlnk)){/* success */ -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_ACT; -+ break; -+ } -+ } -+ -+out: -+ spin_unlock(&dcache_lock); -+ inode_qmblk_unlock(refinode->i_sb); -+ vzquota_qlnk_destroy(&qlnk); -+ iput(inode); -+ inode_qmblk_lock(refinode->i_sb); -+ spin_lock(&dcache_lock); -+} -+ -+static void vzquota_dtree_qmblk_recalc(struct inode *inode, -+ struct vz_quota_ilink *qlnk) -+{ -+ struct inode *pinode; -+ struct vz_quota_master *qmblk; -+ -+ if (inode == inode->i_sb->s_root->d_inode) { -+ /* filesystem root */ -+ do { -+ qmblk = __VZ_QUOTA_NOQUOTA(inode->i_sb); -+ } while (vzquota_inode_qmblk_set(inode, qmblk, qlnk)); -+ return; -+ } -+ -+start: -+ if (VZ_QUOTA_IS_ACTUAL(inode)) -+ return; -+ /* -+ * Here qmblk is (re-)initialized for all ancestors. -+ * This is not a very efficient procedure, but it guarantees that -+ * the quota tree is consistent (that is, the inode doesn't have two -+ * ancestors with different qmblk). -+ */ -+ if (!vzquota_dparents_check_attach(inode)) { -+ pinode = vzquota_dparents_check_actual(inode); -+ if (pinode != NULL) { -+ vzquota_dbranch_actualize(pinode, inode); -+ goto start; -+ } -+ qmblk = vzquota_dparents_check_same(inode); -+ } else -+ qmblk = VZ_QUOTA_BAD; -+ -+ if (vzquota_inode_qmblk_set(inode, qmblk, qlnk)) -+ goto start; -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_DTREE; -+} -+ -+static void vzquota_det_qmblk_recalc(struct inode *inode, -+ struct vz_quota_ilink *qlnk) -+{ -+ struct inode *parent; -+ struct vz_quota_master *qmblk; -+ char *msg; -+ int cnt; -+ time_t timeout; -+ -+ cnt = 0; -+ parent = NULL; -+start: -+ /* -+ * qmblk of detached inodes shouldn't be considered as not actual. -+ * They are not in any dentry tree, so quota on/off shouldn't affect -+ * them. 
-+ */ -+ if (!vzquota_qlnk_is_empty(INODE_QLNK(inode))) -+ return; -+ -+ timeout = 3; -+ qmblk = __VZ_QUOTA_NOQUOTA(inode->i_sb); -+ msg = "detached inode not in creation"; -+ if (inode->i_op != VZ_QUOTA_EMPTY_IOPS) -+ goto fail; -+ qmblk = VZ_QUOTA_BAD; -+ msg = "unexpected creation context"; -+ if (!vzquota_cur_qmblk_check()) -+ goto fail; -+ timeout = 0; -+ parent = vzquota_cur_qmblk_fetch(); -+ msg = "uninitialized parent"; -+ if (vzquota_qlnk_is_empty(INODE_QLNK(parent))) -+ goto fail; -+ msg = "parent not in tree"; -+ if (list_empty(&parent->i_dentry)) -+ goto fail; -+ msg = "parent has 0 refcount"; -+ if (!atomic_read(&parent->i_count)) -+ goto fail; -+ msg = "parent has different sb"; -+ if (parent->i_sb != inode->i_sb) -+ goto fail; -+ if (!VZ_QUOTA_IS_ACTUAL(parent)) { -+ vzquota_dbranch_actualize(parent, inode); -+ goto start; -+ } -+ -+ qmblk = INODE_QLNK(parent)->qmblk; -+set: -+ if (vzquota_inode_qmblk_set(inode, qmblk, qlnk)) -+ goto start; -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_DET; -+ return; -+ -+fail: -+ { -+ struct timeval tv, tvo; -+ do_gettimeofday(&tv); -+ memcpy(&tvo, __VZ_QUOTA_TSTAMP(inode->i_sb), sizeof(tvo)); -+ tv.tv_sec -= tvo.tv_sec; -+ if (tv.tv_usec < tvo.tv_usec) { -+ tv.tv_sec--; -+ tv.tv_usec += USEC_PER_SEC - tvo.tv_usec; -+ } else -+ tv.tv_usec -= tvo.tv_usec; -+ if (tv.tv_sec < timeout) -+ goto set; -+ printk(KERN_ERR "VZDQ: %s, orig %u," -+ " dev %s, inode %lu, fs %s\n", -+ msg, INODE_QLNK(inode)->origin, -+ inode->i_sb->s_id, inode->i_ino, -+ inode->i_sb->s_type->name); -+ if (!cnt++) { -+ printk(KERN_ERR "current %d (%s), VE %d," -+ " time %ld.%06ld\n", -+ current->pid, current->comm, -+ VEID(get_exec_env()), -+ tv.tv_sec, tv.tv_usec); -+ dump_stack(); -+ } -+ if (parent != NULL) -+ printk(KERN_ERR "VZDQ: parent of %lu is %lu\n", -+ inode->i_ino, parent->i_ino); -+ } -+ goto set; -+} -+ -+static void vzquota_inode_qmblk_recalc(struct inode *inode, -+ struct vz_quota_ilink *qlnk) -+{ -+ spin_lock(&dcache_lock); -+ if 
(!list_empty(&inode->i_dentry)) -+ vzquota_dtree_qmblk_recalc(inode, qlnk); -+ else -+ vzquota_det_qmblk_recalc(inode, qlnk); -+ spin_unlock(&dcache_lock); -+} -+ -+/** -+ * vzquota_inode_qmblk - obtain inode's qmblk -+ * -+ * Returns qmblk with refcounter taken, %NULL if not under -+ * VZ quota or %VZ_QUOTA_BAD. -+ * -+ * FIXME: This function should be removed when vzquota_find_qmblk / -+ * get_quota_root / vzquota_dstat code is cleaned up. -+ */ -+struct vz_quota_master *vzquota_inode_qmblk(struct inode *inode) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ilink qlnk; -+ -+ might_sleep(); -+ -+ if (inode->i_sb->dq_op != &vz_quota_operations) -+ return NULL; -+#if defined(VZ_QUOTA_UNLOAD) -+#error Make sure qmblk does not disappear -+#endif -+ -+ vzquota_qlnk_init(&qlnk); -+ inode_qmblk_lock(inode->i_sb); -+ __vzquota_inode_init(inode, VZ_QUOTAO_INICAL); -+ -+ if (vzquota_qlnk_is_empty(INODE_QLNK(inode)) || -+ !VZ_QUOTA_IS_ACTUAL(inode)) -+ vzquota_inode_qmblk_recalc(inode, &qlnk); -+ -+ qmblk = INODE_QLNK(inode)->qmblk; -+ if (qmblk != VZ_QUOTA_BAD) { -+ if (!VZ_QUOTA_IS_NOQUOTA(qmblk, inode->i_sb)) -+ qmblk_get(qmblk); -+ else -+ qmblk = NULL; -+ } -+ -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(&qlnk); -+ return qmblk; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Calls from quota operations -+ * -+ * --------------------------------------------------------------------- */ -+ -+/** -+ * vzquota_inode_init_call - call from DQUOT_INIT -+ */ -+void vzquota_inode_init_call(struct inode *inode) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ -+ /* initializes inode's quota inside */ -+ qmblk = vzquota_inode_data(inode, &data); -+ if (qmblk != NULL && qmblk != VZ_QUOTA_BAD) -+ vzquota_data_unlock(inode, &data); -+ -+ /* -+ * The check is needed for repeated new_inode() calls from a single -+ * ext3 call like create or mkdir in case of -ENOSPC. 
-+ */ -+ spin_lock(&dcache_lock); -+ if (!list_empty(&inode->i_dentry)) -+ vzquota_cur_qmblk_set(inode); -+ spin_unlock(&dcache_lock); -+} -+ -+/** -+ * vzquota_inode_drop_call - call from DQUOT_DROP -+ */ -+void vzquota_inode_drop_call(struct inode *inode) -+{ -+ vzquota_inode_drop(inode); -+} -+ -+/** -+ * vzquota_inode_data - initialize (if nec.) and lock inode quota ptrs -+ * @inode: the inode -+ * @data: storage space -+ * -+ * Returns: qmblk is NULL or VZ_QUOTA_BAD or actualized qmblk. -+ * On return if qmblk is neither NULL nor VZ_QUOTA_BAD: -+ * qmblk in inode's qlnk is the same as returned, -+ * ugid pointers inside inode's qlnk are valid, -+ * some locks are taken (and should be released by vzquota_data_unlock). -+ * If qmblk is NULL or VZ_QUOTA_BAD, locks are NOT taken. -+ */ -+struct vz_quota_master *vzquota_inode_data(struct inode *inode, -+ struct vz_quota_datast *data) -+{ -+ struct vz_quota_master *qmblk; -+ -+ might_sleep(); -+ -+ vzquota_qlnk_init(&data->qlnk); -+ inode_qmblk_lock(inode->i_sb); -+ __vzquota_inode_init(inode, VZ_QUOTAO_INICAL); -+ -+ if (vzquota_qlnk_is_empty(INODE_QLNK(inode)) || -+ !VZ_QUOTA_IS_ACTUAL(inode)) -+ vzquota_inode_qmblk_recalc(inode, &data->qlnk); -+ -+ qmblk = INODE_QLNK(inode)->qmblk; -+ if (qmblk != VZ_QUOTA_BAD) { -+ if (!VZ_QUOTA_IS_NOQUOTA(qmblk, inode->i_sb)) { -+ /* -+ * Note that in the current implementation, -+ * inode_qmblk_lock can theoretically be dropped here. -+ * This place is serialized with quota_off because -+ * quota_off fails when there are extra dentry -+ * references and syncs inodes before removing quota -+ * information from them. -+ * However, quota usage information should stop being -+ * updated immediately after vzquota_off. 
-+ */ -+ qmblk_data_write_lock(qmblk); -+ } else { -+ inode_qmblk_unlock(inode->i_sb); -+ qmblk = NULL; -+ } -+ } else { -+ inode_qmblk_unlock(inode->i_sb); -+ } -+ return qmblk; -+} -+ -+void vzquota_data_unlock(struct inode *inode, -+ struct vz_quota_datast *data) -+{ -+ qmblk_data_write_unlock(INODE_QLNK(inode)->qmblk); -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(&data->qlnk); -+} -+ -+#if defined(CONFIG_VZ_QUOTA_UGID) -+/** -+ * vzquota_inode_transfer_call - call from vzquota_transfer -+ */ -+int vzquota_inode_transfer_call(struct inode *inode, struct iattr *iattr) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_datast data; -+ struct vz_quota_ilink qlnew; -+ int mask; -+ int ret; -+ -+ might_sleep(); -+ vzquota_qlnk_init(&qlnew); -+start: -+ qmblk = vzquota_inode_data(inode, &data); -+ ret = NO_QUOTA; -+ if (qmblk == VZ_QUOTA_BAD) -+ goto out_destr; -+ ret = QUOTA_OK; -+ if (qmblk == NULL) -+ goto out_destr; -+ qmblk_get(qmblk); -+ -+ ret = QUOTA_OK; -+ if (!(qmblk->dq_flags & VZDQUG_ON)) -+ /* no ugid quotas */ -+ goto out_unlock; -+ -+ mask = 0; -+ if ((iattr->ia_valid & ATTR_UID) && iattr->ia_uid != inode->i_uid) -+ mask |= 1 << USRQUOTA; -+ if ((iattr->ia_valid & ATTR_GID) && iattr->ia_gid != inode->i_gid) -+ mask |= 1 << GRPQUOTA; -+ while (1) { -+ if (vzquota_qlnk_is_empty(&qlnew) && -+ vzquota_qlnk_fill_attr(&qlnew, inode, iattr, mask, qmblk)) -+ break; -+ if (qlnew.qmblk == INODE_QLNK(inode)->qmblk && -+ qlnew.qmblk == qmblk) -+ goto finish; -+ if (vzquota_qlnk_reinit_attr(&qlnew, inode, qmblk)) -+ break; -+ } -+ -+ /* prepare for restart */ -+ vzquota_data_unlock(inode, &data); -+ qmblk_put(qmblk); -+ goto start; -+ -+finish: -+ /* all references obtained successfully */ -+ ret = vzquota_transfer_usage(inode, mask, &qlnew); -+ if (!ret) { -+ vzquota_qlnk_swap(&qlnew, INODE_QLNK(inode)); -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_TRANS; -+ } -+out_unlock: -+ vzquota_data_unlock(inode, &data); -+ qmblk_put(qmblk); -+out_destr: -+ 
vzquota_qlnk_destroy(&qlnew); -+ return ret; -+} -+#endif -+ -+int vzquota_rename_check(struct inode *inode, -+ struct inode *old_dir, struct inode *new_dir) -+{ -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ilink qlnk1, qlnk2; -+ int c, ret; -+ -+ if (inode->i_sb != old_dir->i_sb || inode->i_sb != new_dir->i_sb) -+ return -1; -+ -+ might_sleep(); -+ -+ vzquota_qlnk_init(&qlnk1); -+ vzquota_qlnk_init(&qlnk2); -+ inode_qmblk_lock(inode->i_sb); -+ __vzquota_inode_init(inode, VZ_QUOTAO_INICAL); -+ __vzquota_inode_init(old_dir, VZ_QUOTAO_INICAL); -+ __vzquota_inode_init(new_dir, VZ_QUOTAO_INICAL); -+ -+ do { -+ c = 0; -+ if (vzquota_qlnk_is_empty(INODE_QLNK(inode)) || -+ !VZ_QUOTA_IS_ACTUAL(inode)) { -+ vzquota_inode_qmblk_recalc(inode, &qlnk1); -+ c++; -+ } -+ if (vzquota_qlnk_is_empty(INODE_QLNK(new_dir)) || -+ !VZ_QUOTA_IS_ACTUAL(new_dir)) { -+ vzquota_inode_qmblk_recalc(new_dir, &qlnk2); -+ c++; -+ } -+ } while (c); -+ -+ ret = 0; -+ qmblk = INODE_QLNK(inode)->qmblk; -+ if (qmblk != INODE_QLNK(new_dir)->qmblk) { -+ ret = -1; -+ if (qmblk != VZ_QUOTA_BAD && -+ !VZ_QUOTA_IS_NOQUOTA(qmblk, inode->i_sb) && -+ qmblk->dq_root_dentry->d_inode == inode && -+ VZ_QUOTA_IS_NOQUOTA(INODE_QLNK(new_dir)->qmblk, -+ inode->i_sb) && -+ VZ_QUOTA_IS_NOQUOTA(INODE_QLNK(old_dir)->qmblk, -+ inode->i_sb)) -+ /* quota root rename is allowed */ -+ ret = 0; -+ } -+ -+ inode_qmblk_unlock(inode->i_sb); -+ vzquota_qlnk_destroy(&qlnk2); -+ vzquota_qlnk_destroy(&qlnk1); -+ return ret; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * qmblk-related parts of on/off operations -+ * -+ * --------------------------------------------------------------------- */ -+ -+/** -+ * vzquota_check_dtree - check dentry tree if quota on/off is allowed -+ * -+ * This function doesn't allow quota to be turned on/off if some dentries in -+ * the tree have external references. 
-+ * In addition to technical reasons, it enforces user-space correctness: -+ * current usage (taken from or reported to the user space) can be meaningful -+ * and accurate only if the tree is not being modified. -+ * Side effect: additional vfsmount structures referencing the tree (bind -+ * mounts of tree nodes to some other places) are not allowed at on/off time. -+ */ -+int vzquota_check_dtree(struct vz_quota_master *qmblk, int off) -+{ -+ struct dentry *dentry; -+ int err, count; -+ -+ err = -EBUSY; -+ dentry = qmblk->dq_root_dentry; -+ -+ if (d_unhashed(dentry) && dentry != dentry->d_sb->s_root) -+ goto unhashed; -+ -+ /* attempt to shrink */ -+ if (!list_empty(&dentry->d_subdirs)) { -+ spin_unlock(&dcache_lock); -+ inode_qmblk_unlock(dentry->d_sb); -+ shrink_dcache_parent(dentry); -+ inode_qmblk_lock(dentry->d_sb); -+ spin_lock(&dcache_lock); -+ if (!list_empty(&dentry->d_subdirs)) -+ goto out; -+ -+ count = 1; -+ if (dentry == dentry->d_sb->s_root) -+ count += 2; /* sb and mnt refs */ -+ if (atomic_read(&dentry->d_count) < count) { -+ printk(KERN_ERR "%s: too small count %d vs %d.\n", -+ __FUNCTION__, -+ atomic_read(&dentry->d_count), count); -+ goto out; -+ } -+ if (atomic_read(&dentry->d_count) > count) -+ goto out; -+ } -+ -+ err = 0; -+out: -+ return err; -+ -+unhashed: -+ /* -+ * Quota root is removed. -+ * Allow to turn quota off, but not on. 
-+ */ -+ if (off) -+ err = 0; -+ goto out; -+} -+ -+int vzquota_on_qmblk(struct super_block *sb, struct inode *inode, -+ struct vz_quota_master *qmblk) -+{ -+ struct vz_quota_ilink qlnk; -+ struct vz_quota_master *qold, *qnew; -+ int err; -+ -+ might_sleep(); -+ -+ qold = NULL; -+ qnew = vzquota_alloc_fake(); -+ if (qnew == NULL) -+ return -ENOMEM; -+ -+ vzquota_qlnk_init(&qlnk); -+ inode_qmblk_lock(sb); -+ __vzquota_inode_init(inode, VZ_QUOTAO_INICAL); -+ -+ spin_lock(&dcache_lock); -+ while (1) { -+ err = vzquota_check_dtree(qmblk, 0); -+ if (err) -+ break; -+ if (!vzquota_inode_qmblk_set(inode, qmblk, &qlnk)) -+ break; -+ } -+ INODE_QLNK(inode)->origin = VZ_QUOTAO_ON; -+ spin_unlock(&dcache_lock); -+ -+ if (!err) { -+ qold = __VZ_QUOTA_NOQUOTA(sb); -+ qold->dq_flags |= VZDQ_NOACT; -+ __VZ_QUOTA_NOQUOTA(sb) = qnew; -+ } -+ -+ inode_qmblk_unlock(sb); -+ vzquota_qlnk_destroy(&qlnk); -+ if (qold != NULL) -+ qmblk_put(qold); -+ -+ return err; -+} -+ -+int vzquota_off_qmblk(struct super_block *sb, struct vz_quota_master *qmblk) -+{ -+ int ret; -+ -+ ret = 0; -+ inode_qmblk_lock(sb); -+ -+ spin_lock(&dcache_lock); -+ if (vzquota_check_dtree(qmblk, 1)) -+ ret = -EBUSY; -+ spin_unlock(&dcache_lock); -+ -+ if (!ret) -+ qmblk->dq_flags |= VZDQ_NOACT | VZDQ_NOQUOT; -+ inode_qmblk_unlock(sb); -+ return ret; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * External interfaces -+ * -+ * ---------------------------------------------------------------------*/ -+ -+static int vzquota_ioctl(struct inode *ino, struct file *file, -+ unsigned int cmd, unsigned long arg) -+{ -+ int err; -+ struct vzctl_quotactl qb; -+ struct vzctl_quotaugidctl qub; -+ -+ switch (cmd) { -+ case VZCTL_QUOTA_CTL: -+ err = -ENOTTY; -+ break; -+ case VZCTL_QUOTA_NEW_CTL: -+ err = -EFAULT; -+ if (copy_from_user(&qb, (void *)arg, sizeof(qb))) -+ break; -+ err = do_vzquotactl(qb.cmd, qb.quota_id, -+ qb.qstat, qb.ve_root); -+ break; -+#ifdef CONFIG_VZ_QUOTA_UGID 
-+ case VZCTL_QUOTA_UGID_CTL: -+ err = -EFAULT; -+ if (copy_from_user(&qub, (void *)arg, sizeof(qub))) -+ break; -+ err = do_vzquotaugidctl(&qub); -+ break; -+#endif -+ default: -+ err = -ENOTTY; -+ } -+ might_sleep(); /* debug */ -+ return err; -+} -+ -+static struct vzioctlinfo vzdqcalls = { -+ .type = VZDQCTLTYPE, -+ .func = vzquota_ioctl, -+ .owner = THIS_MODULE, -+}; -+ -+/** -+ * vzquota_dstat - get quota usage info for virtual superblock -+ */ -+static int vzquota_dstat(struct super_block *super, struct dq_stat *qstat) -+{ -+ struct vz_quota_master *qmblk; -+ -+ qmblk = vzquota_find_qmblk(super); -+ if (qmblk == NULL) -+ return -ENOENT; -+ if (qmblk == VZ_QUOTA_BAD) { -+ memset(qstat, 0, sizeof(*qstat)); -+ return 0; -+ } -+ -+ qmblk_data_read_lock(qmblk); -+ memcpy(qstat, &qmblk->dq_stat, sizeof(*qstat)); -+ qmblk_data_read_unlock(qmblk); -+ qmblk_put(qmblk); -+ return 0; -+} -+ -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Init/exit helpers -+ * -+ * ---------------------------------------------------------------------*/ -+ -+static int vzquota_cache_init(void) -+{ -+ int i; -+ -+ vzquota_cachep = kmem_cache_create("vz_quota_master", -+ sizeof(struct vz_quota_master), -+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL); -+ if (vzquota_cachep == NULL) { -+ printk(KERN_ERR "Cannot create VZ_QUOTA SLAB cache\n"); -+ goto nomem2; -+ } -+ for (i = 0; i < VZ_QUOTA_HASH_SIZE; i++) -+ INIT_LIST_HEAD(&vzquota_hash_table[i]); -+ -+ return 0; -+ -+nomem2: -+ return -ENOMEM; -+} -+ -+static void vzquota_cache_release(void) -+{ -+ int i; -+ -+ /* sanity check */ -+ for (i = 0; i < VZ_QUOTA_HASH_SIZE; i++) -+ if (!list_empty(&vzquota_hash_table[i])) -+ BUG(); -+ -+ /* release caches */ -+ if (kmem_cache_destroy(vzquota_cachep)) -+ printk(KERN_ERR -+ "VZQUOTA: vz_quota_master kmem_cache_destroy failed\n"); -+ vzquota_cachep = NULL; -+} -+ -+static int quota_notifier_call(struct vnotifier_block *self, -+ unsigned long n, void *data, int err) 
-+{ -+ struct virt_info_quota *viq; -+ struct super_block *sb; -+ -+ viq = (struct virt_info_quota *)data; -+ switch (n) { -+ case VIRTINFO_QUOTA_ON: -+ err = NOTIFY_BAD; -+ if (!try_module_get(THIS_MODULE)) -+ break; -+ sb = viq->super; -+ memset(&sb->s_dquot.info, 0, sizeof(sb->s_dquot.info)); -+ INIT_LIST_HEAD(&sb->s_dquot.info[USRQUOTA].dqi_dirty_list); -+ INIT_LIST_HEAD(&sb->s_dquot.info[GRPQUOTA].dqi_dirty_list); -+ err = NOTIFY_OK; -+ break; -+ case VIRTINFO_QUOTA_OFF: -+ module_put(THIS_MODULE); -+ err = NOTIFY_OK; -+ break; -+ case VIRTINFO_QUOTA_GETSTAT: -+ err = NOTIFY_BAD; -+ if (vzquota_dstat(viq->super, viq->qstat)) -+ break; -+ err = NOTIFY_OK; -+ break; -+ } -+ return err; -+} -+ -+struct vnotifier_block quota_notifier_block = { -+ .notifier_call = quota_notifier_call, -+ .priority = INT_MAX, -+}; -+ -+/* ---------------------------------------------------------------------- -+ * -+ * Init/exit procedures -+ * -+ * ---------------------------------------------------------------------*/ -+ -+static int __init vzquota_init(void) -+{ -+ int err; -+ -+ if ((err = vzquota_cache_init()) != 0) -+ goto out_cache; -+ -+ if ((err = vzquota_proc_init()) != 0) -+ goto out_proc; -+ -+#ifdef CONFIG_VZ_QUOTA_UGID -+ if ((err = vzquota_ugid_init()) != 0) -+ goto out_ugid; -+#endif -+ -+ init_MUTEX(&vz_quota_sem); -+ vzioctl_register(&vzdqcalls); -+ virtinfo_notifier_register(VITYPE_QUOTA, "a_notifier_block); -+#if defined(CONFIG_VZ_QUOTA_UGID) && defined(CONFIG_PROC_FS) -+ vzaquota_init(); -+#endif -+ -+ return 0; -+ -+#ifdef CONFIG_VZ_QUOTA_UGID -+out_ugid: -+ vzquota_proc_release(); -+#endif -+out_proc: -+ vzquota_cache_release(); -+out_cache: -+ return err; -+} -+ -+#if defined(VZ_QUOTA_UNLOAD) -+static void __exit vzquota_release(void) -+{ -+ virtinfo_notifier_unregister(VITYPE_QUOTA, "a_notifier_block); -+ vzioctl_unregister(&vzdqcalls); -+#ifdef CONFIG_VZ_QUOTA_UGID -+#ifdef CONFIG_PROC_FS -+ vzaquota_fini(); -+#endif -+ vzquota_ugid_release(); -+#endif -+ 
vzquota_proc_release(); -+ vzquota_cache_release(); -+} -+#endif -+ -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Virtuozzo Disk Quota"); -+MODULE_LICENSE("GPL v2"); -+ -+module_init(vzquota_init) -+#if defined(VZ_QUOTA_UNLOAD) -+module_exit(vzquota_release) -+#endif -diff -uprN linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_buf.c linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_buf.c ---- linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_buf.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_buf.c 2006-05-11 13:05:25.000000000 +0400 -@@ -1628,8 +1628,8 @@ pagebuf_daemon( - INIT_LIST_HEAD(&tmp); - do { - /* swsusp */ -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - set_current_state(TASK_INTERRUPTIBLE); - schedule_timeout((xfs_buf_timer_centisecs * HZ) / 100); -diff -uprN linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_iops.c linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_iops.c ---- linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_iops.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_iops.c 2006-05-11 13:05:35.000000000 +0400 -@@ -468,7 +468,8 @@ STATIC int - linvfs_permission( - struct inode *inode, - int mode, -- struct nameidata *nd) -+ struct nameidata *nd, -+ struct exec_perm *exec_perm) - { - vnode_t *vp = LINVFS_GET_VP(inode); - int error; -diff -uprN linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_super.c linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_super.c ---- linux-2.6.8.1.orig/fs/xfs/linux-2.6/xfs_super.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/fs/xfs/linux-2.6/xfs_super.c 2006-05-11 13:05:35.000000000 +0400 -@@ -356,7 +356,7 @@ destroy_inodecache( void ) - * at the point when it is unpinned after a log write, - * since this is when the inode itself becomes flushable. 
- */ --STATIC void -+STATIC int - linvfs_write_inode( - struct inode *inode, - int sync) -@@ -364,12 +364,14 @@ linvfs_write_inode( - vnode_t *vp = LINVFS_GET_VP(inode); - int error, flags = FLUSH_INODE; - -+ error = 0; - if (vp) { - vn_trace_entry(vp, __FUNCTION__, (inst_t *)__return_address); - if (sync) - flags |= FLUSH_SYNC; - VOP_IFLUSH(vp, flags, error); - } -+ return error; - } - - STATIC void -@@ -408,8 +410,8 @@ xfssyncd( - set_current_state(TASK_INTERRUPTIBLE); - schedule_timeout((xfs_syncd_centisecs * HZ) / 100); - /* swsusp */ -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - if (vfsp->vfs_flag & VFS_UMOUNT) - break; - if (vfsp->vfs_flag & VFS_RDONLY) -diff -uprN linux-2.6.8.1.orig/include/asm-generic/pgtable.h linux-2.6.8.1-ve022stab078/include/asm-generic/pgtable.h ---- linux-2.6.8.1.orig/include/asm-generic/pgtable.h 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-generic/pgtable.h 2006-05-11 13:05:30.000000000 +0400 -@@ -126,4 +126,8 @@ static inline void ptep_mkdirty(pte_t *p - #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) - #endif - -+#ifndef __HAVE_ARCH_LAZY_MMU_PROT_UPDATE -+#define lazy_mmu_prot_update(pte) do { } while (0) -+#endif -+ - #endif /* _ASM_GENERIC_PGTABLE_H */ -diff -uprN linux-2.6.8.1.orig/include/asm-generic/tlb.h linux-2.6.8.1-ve022stab078/include/asm-generic/tlb.h ---- linux-2.6.8.1.orig/include/asm-generic/tlb.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-generic/tlb.h 2006-05-11 13:05:39.000000000 +0400 -@@ -110,6 +110,9 @@ tlb_is_full_mm(struct mmu_gather *tlb) - * handling the additional races in SMP caused by other CPUs caching valid - * mappings in their TLBs. 
- */ -+#include <ub/ub_mem.h> -+#include <ub/ub_vmpages.h> -+ - static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) - { - tlb->need_flush = 1; -diff -uprN linux-2.6.8.1.orig/include/asm-i386/apic.h linux-2.6.8.1-ve022stab078/include/asm-i386/apic.h ---- linux-2.6.8.1.orig/include/asm-i386/apic.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/apic.h 2006-05-11 13:05:32.000000000 +0400 -@@ -79,7 +79,7 @@ extern void sync_Arb_IDs (void); - extern void init_bsp_APIC (void); - extern void setup_local_APIC (void); - extern void init_apic_mappings (void); --extern void smp_local_timer_interrupt (struct pt_regs * regs); -+extern asmlinkage void smp_local_timer_interrupt (struct pt_regs * regs); - extern void setup_boot_APIC_clock (void); - extern void setup_secondary_APIC_clock (void); - extern void setup_apic_nmi_watchdog (void); -diff -uprN linux-2.6.8.1.orig/include/asm-i386/atomic_kmap.h linux-2.6.8.1-ve022stab078/include/asm-i386/atomic_kmap.h ---- linux-2.6.8.1.orig/include/asm-i386/atomic_kmap.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/atomic_kmap.h 2006-05-11 13:05:38.000000000 +0400 -@@ -0,0 +1,96 @@ -+/* -+ * atomic_kmap.h: temporary virtual kernel memory mappings -+ * -+ * Copyright (C) 2003 Ingo Molnar <mingo@redhat.com> -+ */ -+ -+#ifndef _ASM_ATOMIC_KMAP_H -+#define _ASM_ATOMIC_KMAP_H -+ -+#ifdef __KERNEL__ -+ -+#include <linux/config.h> -+#include <asm/tlbflush.h> -+ -+#ifdef CONFIG_DEBUG_HIGHMEM -+#define HIGHMEM_DEBUG 1 -+#else -+#define HIGHMEM_DEBUG 0 -+#endif -+ -+extern pte_t *kmap_pte; -+#define kmap_prot PAGE_KERNEL -+#define kmap_prot_nocache PAGE_KERNEL_NOCACHE -+ -+#define PKMAP_BASE (0xff000000UL) -+#define NR_SHARED_PMDS ((0xffffffff-PKMAP_BASE+1)/PMD_SIZE) -+ -+static inline unsigned long __kmap_atomic_vaddr(enum km_type type) -+{ -+ enum fixed_addresses idx; -+ -+ idx = type + KM_TYPE_NR*smp_processor_id(); -+ return 
__fix_to_virt(FIX_KMAP_BEGIN + idx); -+} -+ -+static inline void *__kmap_atomic_noflush(struct page *page, enum km_type type) -+{ -+ enum fixed_addresses idx; -+ unsigned long vaddr; -+ -+ idx = type + KM_TYPE_NR*smp_processor_id(); -+ vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -+ /* -+ * NOTE: entries that rely on some secondary TLB-flush -+ * effect must not be global: -+ */ -+ set_pte(kmap_pte-idx, mk_pte(page, PAGE_KERNEL)); -+ -+ return (void*) vaddr; -+} -+ -+static inline void *__kmap_atomic(struct page *page, enum km_type type) -+{ -+ enum fixed_addresses idx; -+ unsigned long vaddr; -+ -+ idx = type + KM_TYPE_NR*smp_processor_id(); -+ vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -+#if HIGHMEM_DEBUG -+ BUG_ON(!pte_none(*(kmap_pte-idx))); -+#else -+ /* -+ * Performance optimization - do not flush if the new -+ * pte is the same as the old one: -+ */ -+ if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot))) -+ return (void *) vaddr; -+#endif -+ set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); -+ __flush_tlb_one(vaddr); -+ -+ return (void*) vaddr; -+} -+ -+static inline void __kunmap_atomic(void *kvaddr, enum km_type type) -+{ -+#if HIGHMEM_DEBUG -+ unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; -+ enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); -+ -+ BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx)); -+ /* -+ * force other mappings to Oops if they'll try to access -+ * this pte without first remap it -+ */ -+ pte_clear(kmap_pte-idx); -+ __flush_tlb_one(vaddr); -+#endif -+} -+ -+#define __kunmap_atomic_type(type) \ -+ __kunmap_atomic((void *)__kmap_atomic_vaddr(type), (type)) -+ -+#endif /* __KERNEL__ */ -+ -+#endif /* _ASM_ATOMIC_KMAP_H */ -diff -uprN linux-2.6.8.1.orig/include/asm-i386/bug.h linux-2.6.8.1-ve022stab078/include/asm-i386/bug.h ---- linux-2.6.8.1.orig/include/asm-i386/bug.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/bug.h 2006-05-11 13:05:24.000000000 +0400 -@@ 
-12,7 +12,10 @@ - #if 1 /* Set to zero for a slightly smaller kernel */ - #define BUG() \ - __asm__ __volatile__( "ud2\n" \ -+ "\t.byte 0x66\n"\ -+ "\t.byte 0xb8\n" /* mov $xxx, %ax */\ - "\t.word %c0\n" \ -+ "\t.byte 0xb8\n" /* mov $xxx, %eax */\ - "\t.long %c1\n" \ - : : "i" (__LINE__), "i" (__FILE__)) - #else -diff -uprN linux-2.6.8.1.orig/include/asm-i386/checksum.h linux-2.6.8.1-ve022stab078/include/asm-i386/checksum.h ---- linux-2.6.8.1.orig/include/asm-i386/checksum.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/checksum.h 2006-05-11 13:05:38.000000000 +0400 -@@ -25,7 +25,7 @@ asmlinkage unsigned int csum_partial(con - * better 64-bit) boundary - */ - --asmlinkage unsigned int csum_partial_copy_generic( const char *src, char *dst, int len, int sum, -+asmlinkage unsigned int direct_csum_partial_copy_generic( const char *src, char *dst, int len, int sum, - int *src_err_ptr, int *dst_err_ptr); - - /* -@@ -39,14 +39,19 @@ static __inline__ - unsigned int csum_partial_copy_nocheck ( const char *src, char *dst, - int len, int sum) - { -- return csum_partial_copy_generic ( src, dst, len, sum, NULL, NULL); -+ /* -+ * The direct function is OK for kernel-space => kernel-space copies: -+ */ -+ return direct_csum_partial_copy_generic ( src, dst, len, sum, NULL, NULL); - } - - static __inline__ - unsigned int csum_partial_copy_from_user ( const char __user *src, char *dst, - int len, int sum, int *err_ptr) - { -- return csum_partial_copy_generic ( (__force char *)src, dst, len, sum, err_ptr, NULL); -+ if (copy_from_user(dst, src, len)) -+ *err_ptr = -EFAULT; -+ return csum_partial(dst, len, sum); - } - - /* -@@ -172,13 +177,28 @@ static __inline__ unsigned short int csu - * Copy and checksum to user - */ - #define HAVE_CSUM_COPY_USER --static __inline__ unsigned int csum_and_copy_to_user(const char *src, -+static __inline__ unsigned int direct_csum_and_copy_to_user(const char *src, - char __user *dst, - int len, int sum, - int 
*err_ptr) - { - if (access_ok(VERIFY_WRITE, dst, len)) -- return csum_partial_copy_generic(src, (__force char *)dst, len, sum, NULL, err_ptr); -+ return direct_csum_partial_copy_generic(src, dst, len, sum, NULL, err_ptr); -+ -+ if (len) -+ *err_ptr = -EFAULT; -+ -+ return -1; /* invalid checksum */ -+} -+ -+static __inline__ unsigned int csum_and_copy_to_user(const char *src, char __user *dst, -+ int len, int sum, int *err_ptr) -+{ -+ if (access_ok(VERIFY_WRITE, dst, len)) { -+ if (copy_to_user(dst, src, len)) -+ *err_ptr = -EFAULT; -+ return csum_partial(src, len, sum); -+ } - - if (len) - *err_ptr = -EFAULT; -diff -uprN linux-2.6.8.1.orig/include/asm-i386/desc.h linux-2.6.8.1-ve022stab078/include/asm-i386/desc.h ---- linux-2.6.8.1.orig/include/asm-i386/desc.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/desc.h 2006-05-11 13:05:38.000000000 +0400 -@@ -21,6 +21,13 @@ struct Xgt_desc_struct { - - extern struct Xgt_desc_struct idt_descr, cpu_gdt_descr[NR_CPUS]; - -+extern void trap_init_virtual_IDT(void); -+extern void trap_init_virtual_GDT(void); -+ -+asmlinkage int system_call(void); -+asmlinkage void lcall7(void); -+asmlinkage void lcall27(void); -+ - #define load_TR_desc() __asm__ __volatile__("ltr %%ax"::"a" (GDT_ENTRY_TSS*8)) - #define load_LDT_desc() __asm__ __volatile__("lldt %%ax"::"a" (GDT_ENTRY_LDT*8)) - -@@ -30,6 +37,7 @@ extern struct Xgt_desc_struct idt_descr, - */ - extern struct desc_struct default_ldt[]; - extern void set_intr_gate(unsigned int irq, void * addr); -+extern void set_trap_gate(unsigned int n, void *addr); - - #define _set_tssldt_desc(n,addr,limit,type) \ - __asm__ __volatile__ ("movw %w3,0(%2)\n\t" \ -@@ -91,31 +99,8 @@ static inline void load_TLS(struct threa - #undef C - } - --static inline void clear_LDT(void) --{ -- int cpu = get_cpu(); -- -- set_ldt_desc(cpu, &default_ldt[0], 5); -- load_LDT_desc(); -- put_cpu(); --} -- --/* -- * load one particular LDT into the current CPU -- */ --static 
inline void load_LDT_nolock(mm_context_t *pc, int cpu) --{ -- void *segments = pc->ldt; -- int count = pc->size; -- -- if (likely(!count)) { -- segments = &default_ldt[0]; -- count = 5; -- } -- -- set_ldt_desc(cpu, segments, count); -- load_LDT_desc(); --} -+extern struct page *default_ldt_page; -+extern void load_LDT_nolock(mm_context_t *pc, int cpu); - - static inline void load_LDT(mm_context_t *pc) - { -@@ -124,6 +109,6 @@ static inline void load_LDT(mm_context_t - put_cpu(); - } - --#endif /* !__ASSEMBLY__ */ - -+#endif /* !__ASSEMBLY__ */ - #endif -diff -uprN linux-2.6.8.1.orig/include/asm-i386/elf.h linux-2.6.8.1-ve022stab078/include/asm-i386/elf.h ---- linux-2.6.8.1.orig/include/asm-i386/elf.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/elf.h 2006-05-11 13:05:45.000000000 +0400 -@@ -107,7 +107,7 @@ typedef struct user_fxsr_struct elf_fpxr - For the moment, we have only optimizations for the Intel generations, - but that could change... */ - --#define ELF_PLATFORM (system_utsname.machine) -+#define ELF_PLATFORM (ve_utsname.machine) - - /* - * Architecture-neutral AT_ values in 0-17, leave some room -@@ -140,8 +140,10 @@ extern void __kernel_vsyscall; - - #define ARCH_DLINFO \ - do { \ -+ if (sysctl_at_vsyscall) { \ - NEW_AUX_ENT(AT_SYSINFO, VSYSCALL_ENTRY); \ - NEW_AUX_ENT(AT_SYSINFO_EHDR, VSYSCALL_BASE); \ -+ } \ - } while (0) - - /* -diff -uprN linux-2.6.8.1.orig/include/asm-i386/fixmap.h linux-2.6.8.1-ve022stab078/include/asm-i386/fixmap.h ---- linux-2.6.8.1.orig/include/asm-i386/fixmap.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/fixmap.h 2006-05-11 13:05:38.000000000 +0400 -@@ -18,17 +18,17 @@ - #include <asm/acpi.h> - #include <asm/apicdef.h> - #include <asm/page.h> --#ifdef CONFIG_HIGHMEM - #include <linux/threads.h> - #include <asm/kmap_types.h> --#endif -+ -+#define __FIXADDR_TOP (0xfffff000UL) - - /* - * Here we define all the compile-time 'special' virtual - * 
addresses. The point is to have a constant address at - * compile time, but to set the physical address only -- * in the boot process. We allocate these special addresses -- * from the end of virtual memory (0xfffff000) backwards. -+ * in the boot process. We allocate these special addresses -+ * from the end of virtual memory (0xffffe000) backwards. - * Also this lets us do fail-safe vmalloc(), we - * can guarantee that these special addresses and - * vmalloc()-ed addresses never overlap. -@@ -41,11 +41,24 @@ - * TLB entries of such buffers will not be flushed across - * task switches. - */ -+ -+/* -+ * on UP currently we will have no trace of the fixmap mechanizm, -+ * no page table allocations, etc. This might change in the -+ * future, say framebuffers for the console driver(s) could be -+ * fix-mapped? -+ */ -+ -+#define TSS_SIZE sizeof(struct tss_struct) -+#define FIX_TSS_COUNT ((TSS_SIZE * NR_CPUS + PAGE_SIZE - 1)/ PAGE_SIZE) -+ - enum fixed_addresses { - FIX_HOLE, - FIX_VSYSCALL, - #ifdef CONFIG_X86_LOCAL_APIC - FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */ -+#else -+ FIX_VSTACK_HOLE_1, - #endif - #ifdef CONFIG_X86_IO_APIC - FIX_IO_APIC_BASE_0, -@@ -57,16 +70,22 @@ enum fixed_addresses { - FIX_LI_PCIA, /* Lithium PCI Bridge A */ - FIX_LI_PCIB, /* Lithium PCI Bridge B */ - #endif --#ifdef CONFIG_X86_F00F_BUG -- FIX_F00F_IDT, /* Virtual mapping for IDT */ --#endif -+ FIX_IDT, -+ FIX_GDT_1, -+ FIX_GDT_0, -+ FIX_TSS_LAST, -+ FIX_TSS_0 = FIX_TSS_LAST + FIX_TSS_COUNT - 1, -+ FIX_ENTRY_TRAMPOLINE_1, -+ FIX_ENTRY_TRAMPOLINE_0, - #ifdef CONFIG_X86_CYCLONE_TIMER - FIX_CYCLONE_TIMER, /*cyclone timer register*/ -+ FIX_VSTACK_HOLE_2, - #endif --#ifdef CONFIG_HIGHMEM -- FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */ -+ /* reserved pte's for temporary kernel mappings */ -+ __FIX_KMAP_BEGIN, -+ FIX_KMAP_BEGIN = __FIX_KMAP_BEGIN + (__FIX_KMAP_BEGIN & 1) + -+ ((__FIXADDR_TOP >> PAGE_SHIFT) & 1), - FIX_KMAP_END = 
FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1, --#endif - #ifdef CONFIG_ACPI_BOOT - FIX_ACPI_BEGIN, - FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1, -@@ -98,12 +117,15 @@ extern void __set_fixmap (enum fixed_add - __set_fixmap(idx, 0, __pgprot(0)) - - /* -- * used by vmalloc.c. -+ * used by vmalloc.c and various other places. - * - * Leave one empty page between vmalloc'ed areas and - * the start of the fixmap. -+ * -+ * IMPORTANT: we have to align FIXADDR_TOP so that the virtual stack -+ * is THREAD_SIZE aligned. - */ --#define FIXADDR_TOP (0xfffff000UL) -+#define FIXADDR_TOP __FIXADDR_TOP - #define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) - #define FIXADDR_START (FIXADDR_TOP - __FIXADDR_SIZE) - -diff -uprN linux-2.6.8.1.orig/include/asm-i386/highmem.h linux-2.6.8.1-ve022stab078/include/asm-i386/highmem.h ---- linux-2.6.8.1.orig/include/asm-i386/highmem.h 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/highmem.h 2006-05-11 13:05:38.000000000 +0400 -@@ -25,26 +25,19 @@ - #include <linux/threads.h> - #include <asm/kmap_types.h> - #include <asm/tlbflush.h> -+#include <asm/atomic_kmap.h> - - /* declarations for highmem.c */ - extern unsigned long highstart_pfn, highend_pfn; - --extern pte_t *kmap_pte; --extern pgprot_t kmap_prot; - extern pte_t *pkmap_page_table; -- --extern void kmap_init(void); -+extern void kmap_init(void) __init; - - /* - * Right now we initialize only a single pte table. It can be extended - * easily, subsequent pte tables have to be allocated in one physical - * chunk of RAM. 
- */ --#if NR_CPUS <= 32 --#define PKMAP_BASE (0xff800000UL) --#else --#define PKMAP_BASE (0xff600000UL) --#endif - #ifdef CONFIG_X86_PAE - #define LAST_PKMAP 512 - #else -@@ -60,6 +53,7 @@ extern void FASTCALL(kunmap_high(struct - void *kmap(struct page *page); - void kunmap(struct page *page); - void *kmap_atomic(struct page *page, enum km_type type); -+void *kmap_atomic_pte(pte_t *pte, enum km_type type); - void kunmap_atomic(void *kvaddr, enum km_type type); - struct page *kmap_atomic_to_page(void *ptr); - -diff -uprN linux-2.6.8.1.orig/include/asm-i386/hpet.h linux-2.6.8.1-ve022stab078/include/asm-i386/hpet.h ---- linux-2.6.8.1.orig/include/asm-i386/hpet.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/hpet.h 2006-05-11 13:05:29.000000000 +0400 -@@ -93,6 +93,7 @@ - extern unsigned long hpet_period; /* fsecs / HPET clock */ - extern unsigned long hpet_tick; /* hpet clks count per tick */ - extern unsigned long hpet_address; /* hpet memory map physical address */ -+extern int hpet_use_timer; - - extern int hpet_rtc_timer_init(void); - extern int hpet_enable(void); -diff -uprN linux-2.6.8.1.orig/include/asm-i386/irq.h linux-2.6.8.1-ve022stab078/include/asm-i386/irq.h ---- linux-2.6.8.1.orig/include/asm-i386/irq.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/irq.h 2006-05-11 13:05:28.000000000 +0400 -@@ -55,4 +55,10 @@ struct pt_regs; - asmlinkage int handle_IRQ_event(unsigned int, struct pt_regs *, - struct irqaction *); - -+#ifdef CONFIG_IRQBALANCE -+extern int irqbalance_disable(char *str); -+#endif -+extern int no_irq_affinity; -+extern int noirqdebug_setup(char *str); -+ - #endif /* _ASM_IRQ_H */ -diff -uprN linux-2.6.8.1.orig/include/asm-i386/kmap_types.h linux-2.6.8.1-ve022stab078/include/asm-i386/kmap_types.h ---- linux-2.6.8.1.orig/include/asm-i386/kmap_types.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/kmap_types.h 2006-05-11 
13:05:38.000000000 +0400 -@@ -2,30 +2,36 @@ - #define _ASM_KMAP_TYPES_H - - #include <linux/config.h> -- --#ifdef CONFIG_DEBUG_HIGHMEM --# define D(n) __KM_FENCE_##n , --#else --# define D(n) --#endif -+#include <linux/thread_info.h> - - enum km_type { --D(0) KM_BOUNCE_READ, --D(1) KM_SKB_SUNRPC_DATA, --D(2) KM_SKB_DATA_SOFTIRQ, --D(3) KM_USER0, --D(4) KM_USER1, --D(5) KM_BIO_SRC_IRQ, --D(6) KM_BIO_DST_IRQ, --D(7) KM_PTE0, --D(8) KM_PTE1, --D(9) KM_IRQ0, --D(10) KM_IRQ1, --D(11) KM_SOFTIRQ0, --D(12) KM_SOFTIRQ1, --D(13) KM_TYPE_NR --}; -+ /* -+ * IMPORTANT: don't move these 3 entries, be wary when adding entries, -+ * the 4G/4G virtual stack must be THREAD_SIZE aligned on each cpu. -+ */ -+ KM_BOUNCE_READ, -+ KM_VSTACK_BASE, -+ __KM_VSTACK_TOP = KM_VSTACK_BASE + STACK_PAGE_COUNT-1, -+ KM_VSTACK_TOP = __KM_VSTACK_TOP + (__KM_VSTACK_TOP % 2), - --#undef D -+ KM_LDT_PAGE15, -+ KM_LDT_PAGE0 = KM_LDT_PAGE15 + 16-1, -+ KM_USER_COPY, -+ KM_VSTACK_HOLE, -+ KM_SKB_SUNRPC_DATA, -+ KM_SKB_DATA_SOFTIRQ, -+ KM_USER0, -+ KM_USER1, -+ KM_BIO_SRC_IRQ, -+ KM_BIO_DST_IRQ, -+ KM_PTE0, -+ KM_PTE1, -+ KM_IRQ0, -+ KM_IRQ1, -+ KM_SOFTIRQ0, -+ KM_SOFTIRQ1, -+ __KM_TYPE_NR, -+ KM_TYPE_NR=__KM_TYPE_NR + (__KM_TYPE_NR % 2) -+}; - - #endif -diff -uprN linux-2.6.8.1.orig/include/asm-i386/mach-default/mach_ipi.h linux-2.6.8.1-ve022stab078/include/asm-i386/mach-default/mach_ipi.h ---- linux-2.6.8.1.orig/include/asm-i386/mach-default/mach_ipi.h 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/mach-default/mach_ipi.h 2006-05-11 13:05:32.000000000 +0400 -@@ -1,8 +1,8 @@ - #ifndef __ASM_MACH_IPI_H - #define __ASM_MACH_IPI_H - --inline void send_IPI_mask_bitmask(cpumask_t mask, int vector); --inline void __send_IPI_shortcut(unsigned int shortcut, int vector); -+void send_IPI_mask_bitmask(cpumask_t mask, int vector); -+void __send_IPI_shortcut(unsigned int shortcut, int vector); - - static inline void send_IPI_mask(cpumask_t mask, int vector) - { -diff -uprN 
linux-2.6.8.1.orig/include/asm-i386/mman.h linux-2.6.8.1-ve022stab078/include/asm-i386/mman.h ---- linux-2.6.8.1.orig/include/asm-i386/mman.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/mman.h 2006-05-11 13:05:39.000000000 +0400 -@@ -22,6 +22,7 @@ - #define MAP_NORESERVE 0x4000 /* don't check for reservations */ - #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ - #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -+#define MAP_EXECPRIO 0x80000 /* map from exec - try not to fail */ - - #define MS_ASYNC 1 /* sync memory asynchronously */ - #define MS_INVALIDATE 2 /* invalidate the caches */ -diff -uprN linux-2.6.8.1.orig/include/asm-i386/mmu.h linux-2.6.8.1-ve022stab078/include/asm-i386/mmu.h ---- linux-2.6.8.1.orig/include/asm-i386/mmu.h 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/mmu.h 2006-05-11 13:05:38.000000000 +0400 -@@ -8,10 +8,13 @@ - * - * cpu_vm_mask is used to optimize ldt flushing. 
- */ -+ -+#define MAX_LDT_PAGES 16 -+ - typedef struct { - int size; - struct semaphore sem; -- void *ldt; -+ struct page *ldt_pages[MAX_LDT_PAGES]; - } mm_context_t; - - #endif -diff -uprN linux-2.6.8.1.orig/include/asm-i386/mmu_context.h linux-2.6.8.1-ve022stab078/include/asm-i386/mmu_context.h ---- linux-2.6.8.1.orig/include/asm-i386/mmu_context.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/mmu_context.h 2006-05-11 13:05:38.000000000 +0400 -@@ -29,6 +29,10 @@ static inline void switch_mm(struct mm_s - { - int cpu = smp_processor_id(); - -+#ifdef CONFIG_X86_SWITCH_PAGETABLES -+ if (tsk->mm) -+ tsk->thread_info->user_pgd = (void *)__pa(tsk->mm->pgd); -+#endif - if (likely(prev != next)) { - /* stop flush ipis for the previous mm */ - cpu_clear(cpu, prev->cpu_vm_mask); -@@ -39,12 +43,14 @@ static inline void switch_mm(struct mm_s - cpu_set(cpu, next->cpu_vm_mask); - - /* Re-load page tables */ -+#if !defined(CONFIG_X86_SWITCH_PAGETABLES) - load_cr3(next->pgd); -+#endif - - /* - * load the LDT, if the LDT is different: - */ -- if (unlikely(prev->context.ldt != next->context.ldt)) -+ if (unlikely(prev->context.size + next->context.size)) - load_LDT_nolock(&next->context, cpu); - } - #ifdef CONFIG_SMP -@@ -56,7 +62,9 @@ static inline void switch_mm(struct mm_s - /* We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload %cr3. 
- */ -+#if !defined(CONFIG_X86_SWITCH_PAGETABLES) - load_cr3(next->pgd); -+#endif - load_LDT_nolock(&next->context, cpu); - } - } -@@ -67,6 +75,6 @@ static inline void switch_mm(struct mm_s - asm("movl %0,%%fs ; movl %0,%%gs": :"r" (0)) - - #define activate_mm(prev, next) \ -- switch_mm((prev),(next),NULL) -+ switch_mm((prev),(next),current) - - #endif -diff -uprN linux-2.6.8.1.orig/include/asm-i386/mtrr.h linux-2.6.8.1-ve022stab078/include/asm-i386/mtrr.h ---- linux-2.6.8.1.orig/include/asm-i386/mtrr.h 2004-08-14 14:55:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/mtrr.h 2006-05-11 13:05:32.000000000 +0400 -@@ -67,8 +67,6 @@ struct mtrr_gentry - - #ifdef __KERNEL__ - --extern char *mtrr_strings[]; -- - /* The following functions are for use by other drivers */ - # ifdef CONFIG_MTRR - extern int mtrr_add (unsigned long base, unsigned long size, -diff -uprN linux-2.6.8.1.orig/include/asm-i386/nmi.h linux-2.6.8.1-ve022stab078/include/asm-i386/nmi.h ---- linux-2.6.8.1.orig/include/asm-i386/nmi.h 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/nmi.h 2006-05-11 13:05:24.000000000 +0400 -@@ -17,6 +17,7 @@ typedef int (*nmi_callback_t)(struct pt_ - * set. Return 1 if the NMI was handled. - */ - void set_nmi_callback(nmi_callback_t callback); -+void set_nmi_ipi_callback(nmi_callback_t callback); - - /** - * unset_nmi_callback -@@ -24,5 +25,6 @@ void set_nmi_callback(nmi_callback_t cal - * Remove the handler previously set. 
- */ - void unset_nmi_callback(void); -+void unset_nmi_ipi_callback(void); - - #endif /* ASM_NMI_H */ -diff -uprN linux-2.6.8.1.orig/include/asm-i386/page.h linux-2.6.8.1-ve022stab078/include/asm-i386/page.h ---- linux-2.6.8.1.orig/include/asm-i386/page.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/page.h 2006-05-11 13:05:38.000000000 +0400 -@@ -1,6 +1,8 @@ - #ifndef _I386_PAGE_H - #define _I386_PAGE_H - -+#include <linux/config.h> -+ - /* PAGE_SHIFT determines the page size */ - #define PAGE_SHIFT 12 - #define PAGE_SIZE (1UL << PAGE_SHIFT) -@@ -9,11 +11,10 @@ - #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) - #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) - --#ifdef __KERNEL__ --#ifndef __ASSEMBLY__ -- - #include <linux/config.h> - -+#ifdef __KERNEL__ -+#ifndef __ASSEMBLY__ - #ifdef CONFIG_X86_USE_3DNOW - - #include <asm/mmx.h> -@@ -92,13 +93,28 @@ typedef struct { unsigned long pgprot; } - * - * If you want more physical memory than this then see the CONFIG_HIGHMEM4G - * and CONFIG_HIGHMEM64G options in the kernel configuration. -+ * -+ * Note: on PAE the kernel must never go below 32 MB, we use the -+ * first 8 entries of the 2-level boot pgd for PAE magic. - */ - -+#ifdef CONFIG_X86_4G_VM_LAYOUT -+#define __PAGE_OFFSET (0x02000000) -+#define TASK_SIZE (0xc0000000) -+#else -+#define __PAGE_OFFSET (0xc0000000) -+#define TASK_SIZE (0xc0000000) -+#endif -+ - /* - * This much address space is reserved for vmalloc() and iomap() - * as well as fixmap mappings. 
- */ --#define __VMALLOC_RESERVE (128 << 20) -+#ifdef CONFIG_X86_4G -+#define __VMALLOC_RESERVE (320 << 20) -+#else -+#define __VMALLOC_RESERVE (192 << 20) -+#endif - - #ifndef __ASSEMBLY__ - -@@ -118,16 +134,10 @@ static __inline__ int get_order(unsigned - - #endif /* __ASSEMBLY__ */ - --#ifdef __ASSEMBLY__ --#define __PAGE_OFFSET (0xC0000000) --#else --#define __PAGE_OFFSET (0xC0000000UL) --#endif -- -- - #define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) - #define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE) --#define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE) -+#define __MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE) -+#define MAXMEM ((unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)) - #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET) - #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) - #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) -diff -uprN linux-2.6.8.1.orig/include/asm-i386/pgtable.h linux-2.6.8.1-ve022stab078/include/asm-i386/pgtable.h ---- linux-2.6.8.1.orig/include/asm-i386/pgtable.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/pgtable.h 2006-05-11 13:05:38.000000000 +0400 -@@ -16,38 +16,41 @@ - #include <asm/processor.h> - #include <asm/fixmap.h> - #include <linux/threads.h> -+#include <linux/slab.h> - - #ifndef _I386_BITOPS_H - #include <asm/bitops.h> - #endif - --#include <linux/slab.h> --#include <linux/list.h> --#include <linux/spinlock.h> -- --/* -- * ZERO_PAGE is a global shared page that is always zero: used -- * for zero-mapped memory areas etc.. 
-- */ --#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) --extern unsigned long empty_zero_page[1024]; - extern pgd_t swapper_pg_dir[1024]; --extern kmem_cache_t *pgd_cache; --extern kmem_cache_t *pmd_cache; -+extern kmem_cache_t *pgd_cache, *pmd_cache, *kpmd_cache; - extern spinlock_t pgd_lock; - extern struct page *pgd_list; -- - void pmd_ctor(void *, kmem_cache_t *, unsigned long); -+void kpmd_ctor(void *, kmem_cache_t *, unsigned long); - void pgd_ctor(void *, kmem_cache_t *, unsigned long); - void pgd_dtor(void *, kmem_cache_t *, unsigned long); - void pgtable_cache_init(void); --void paging_init(void); -+extern void paging_init(void); -+void setup_identity_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end); -+ -+/* -+ * ZERO_PAGE is a global shared page that is always zero: used -+ * for zero-mapped memory areas etc.. -+ */ -+extern unsigned long empty_zero_page[1024]; -+#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) - - /* - * The Linux x86 paging architecture is 'compile-time dual-mode', it - * implements both the traditional 2-level x86 page tables and the - * newer 3-level PAE-mode page tables. 
- */ -+ -+extern void set_system_gate(unsigned int n, void *addr); -+extern void init_entry_mappings(void); -+extern void entry_trampoline_setup(void); -+ - #ifdef CONFIG_X86_PAE - # include <asm/pgtable-3level-defs.h> - #else -@@ -59,7 +62,12 @@ void paging_init(void); - #define PGDIR_SIZE (1UL << PGDIR_SHIFT) - #define PGDIR_MASK (~(PGDIR_SIZE-1)) - --#define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) -+#if defined(CONFIG_X86_PAE) && defined(CONFIG_X86_4G_VM_LAYOUT) -+# define USER_PTRS_PER_PGD 4 -+#else -+# define USER_PTRS_PER_PGD ((TASK_SIZE/PGDIR_SIZE) + ((TASK_SIZE % PGDIR_SIZE) + PGDIR_SIZE-1)/PGDIR_SIZE) -+#endif -+ - #define FIRST_USER_PGD_NR 0 - - #define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT) -@@ -274,6 +282,7 @@ static inline void ptep_mkdirty(pte_t *p - - #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - #define mk_pte_huge(entry) ((entry).pte_low |= _PAGE_PRESENT | _PAGE_PSE) -+#define mk_pte_phys(physpage, pgprot) pfn_pte((physpage) >> PAGE_SHIFT, pgprot) - - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) - { -@@ -421,4 +430,11 @@ extern pte_t *lookup_address(unsigned lo - #define __HAVE_ARCH_PTE_SAME - #include <asm-generic/pgtable.h> - -+/* -+ * The size of the low 1:1 mappings we use during bootup, -+ * SMP-boot and ACPI-sleep: -+ */ -+#define LOW_MAPPINGS_SIZE (16*1024*1024) -+ -+ - #endif /* _I386_PGTABLE_H */ -diff -uprN linux-2.6.8.1.orig/include/asm-i386/processor.h linux-2.6.8.1-ve022stab078/include/asm-i386/processor.h ---- linux-2.6.8.1.orig/include/asm-i386/processor.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/processor.h 2006-05-11 13:05:38.000000000 +0400 -@@ -84,8 +84,6 @@ struct cpuinfo_x86 { - - extern struct cpuinfo_x86 boot_cpu_data; - extern struct cpuinfo_x86 new_cpu_data; --extern struct tss_struct init_tss[NR_CPUS]; --extern struct tss_struct doublefault_tss; - - #ifdef CONFIG_SMP - extern struct cpuinfo_x86 cpu_data[]; -@@ -286,11 +284,6 @@ extern 
unsigned int machine_submodel_id; - extern unsigned int BIOS_revision; - extern unsigned int mca_pentium_flag; - --/* -- * User space process size: 3GB (default). -- */ --#define TASK_SIZE (PAGE_OFFSET) -- - /* This decides where the kernel will search for a free chunk of vm - * space during mmap's. - */ -@@ -302,7 +295,6 @@ extern unsigned int mca_pentium_flag; - #define IO_BITMAP_BITS 65536 - #define IO_BITMAP_BYTES (IO_BITMAP_BITS/8) - #define IO_BITMAP_LONGS (IO_BITMAP_BYTES/sizeof(long)) --#define IO_BITMAP_OFFSET offsetof(struct tss_struct,io_bitmap) - #define INVALID_IO_BITMAP_OFFSET 0x8000 - - struct i387_fsave_struct { -@@ -400,6 +392,11 @@ struct tss_struct { - - #define ARCH_MIN_TASKALIGN 16 - -+#define IO_BITMAP_OFFSET offsetof(struct tss_struct,io_bitmap) -+ -+extern struct tss_struct init_tss[NR_CPUS]; -+extern struct tss_struct doublefault_tss; -+ - struct thread_struct { - /* cached TLS descriptors. */ - struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; -@@ -446,7 +443,8 @@ struct thread_struct { - .io_bitmap = { [ 0 ... 
IO_BITMAP_LONGS] = ~0 }, \ - } - --static inline void load_esp0(struct tss_struct *tss, struct thread_struct *thread) -+static inline void -+load_esp0(struct tss_struct *tss, struct thread_struct *thread) - { - tss->esp0 = thread->esp0; - /* This can only happen when SEP is enabled, no need to test "SEP"arately */ -@@ -482,6 +480,23 @@ extern void prepare_to_copy(struct task_ - */ - extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags); - -+#ifdef CONFIG_X86_HIGH_ENTRY -+#define virtual_esp0(tsk) \ -+ ((unsigned long)(tsk)->thread_info->virtual_stack + ((tsk)->thread.esp0 - (unsigned long)(tsk)->thread_info->real_stack)) -+#else -+# define virtual_esp0(tsk) ((tsk)->thread.esp0) -+#endif -+ -+#define load_virtual_esp0(tss, task) \ -+ do { \ -+ tss->esp0 = virtual_esp0(task); \ -+ if (likely(cpu_has_sep) && unlikely(tss->ss1 != task->thread.sysenter_cs)) { \ -+ tss->ss1 = task->thread.sysenter_cs; \ -+ wrmsr(MSR_IA32_SYSENTER_CS, \ -+ task->thread.sysenter_cs, 0); \ -+ } \ -+ } while (0) -+ - extern unsigned long thread_saved_pc(struct task_struct *tsk); - void show_trace(struct task_struct *task, unsigned long *stack); - -diff -uprN linux-2.6.8.1.orig/include/asm-i386/setup.h linux-2.6.8.1-ve022stab078/include/asm-i386/setup.h ---- linux-2.6.8.1.orig/include/asm-i386/setup.h 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-i386/setup.h 2006-05-11 13:05:29.000000000 +0400 -@@ -55,7 +55,7 @@ extern unsigned char boot_params[PARAM_S - #define KERNEL_START (*(unsigned long *) (PARAM+0x214)) - #define INITRD_START (*(unsigned long *) (PARAM+0x218)) - #define INITRD_SIZE (*(unsigned long *) (PARAM+0x21c)) --#define EDID_INFO (*(struct edid_info *) (PARAM+0x440)) -+#define EDID_INFO (*(struct edid_info *) (PARAM+0x140)) - #define EDD_NR (*(unsigned char *) (PARAM+EDDNR)) - #define EDD_MBR_SIG_NR (*(unsigned char *) (PARAM+EDD_MBR_SIG_NR_BUF)) - #define EDD_MBR_SIGNATURE ((unsigned int *) (PARAM+EDD_MBR_SIG_BUF)) -diff 
-uprN linux-2.6.8.1.orig/include/asm-i386/string.h linux-2.6.8.1-ve022stab078/include/asm-i386/string.h
---- linux-2.6.8.1.orig/include/asm-i386/string.h 2004-08-14 14:56:24.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/string.h 2006-05-11 13:05:38.000000000 +0400
-@@ -60,6 +60,29 @@ __asm__ __volatile__(
- return dest;
- }
-
-+/*
-+ * This is a more generic variant of strncpy_count() suitable for
-+ * implementing string-access routines with all sorts of return
-+ * code semantics. It's used by mm/usercopy.c.
-+ */
-+static inline size_t strncpy_count(char * dest,const char *src,size_t count)
-+{
-+ __asm__ __volatile__(
-+
-+ "1:\tdecl %0\n\t"
-+ "js 2f\n\t"
-+ "lodsb\n\t"
-+ "stosb\n\t"
-+ "testb %%al,%%al\n\t"
-+ "jne 1b\n\t"
-+ "2:"
-+ "incl %0"
-+ : "=c" (count)
-+ :"S" (src),"D" (dest),"0" (count) : "memory");
-+
-+ return count;
-+}
-+
- #define __HAVE_ARCH_STRCAT
- static inline char * strcat(char * dest,const char * src)
- {
-@@ -117,7 +140,8 @@ __asm__ __volatile__(
- "orb $1,%%al\n"
- "3:"
- :"=a" (__res), "=&S" (d0), "=&D" (d1)
-- :"1" (cs),"2" (ct));
-+ :"1" (cs),"2" (ct)
-+ :"memory");
- return __res;
- }
-
-@@ -139,8 +163,9 @@ __asm__ __volatile__(
- "3:\tsbbl %%eax,%%eax\n\t"
- "orb $1,%%al\n"
- "4:"
-- :"=a" (__res), "=&S" (d0), "=&D" (d1), "=&c" (d2)
-- :"1" (cs),"2" (ct),"3" (count));
-+ :"=a" (__res), "=&S" (d0), "=&D" (d1), "=&c" (d2)
-+ :"1" (cs),"2" (ct),"3" (count)
-+ :"memory");
- return __res;
- }
-
-@@ -159,7 +184,9 @@ __asm__ __volatile__(
- "movl $1,%1\n"
- "2:\tmovl %1,%0\n\t"
- "decl %0"
-- :"=a" (__res), "=&S" (d0) : "1" (s),"0" (c));
-+ :"=a" (__res), "=&S" (d0)
-+ :"1" (s),"0" (c)
-+ :"memory");
- return __res;
- }
-
-@@ -176,7 +203,9 @@ __asm__ __volatile__(
- "leal -1(%%esi),%0\n"
- "2:\ttestb %%al,%%al\n\t"
- "jne 1b"
-- :"=g" (__res), "=&S" (d0), "=&a" (d1) :"0" (0),"1" (s),"2" (c));
-+ :"=g" (__res), "=&S" (d0), "=&a" (d1)
-+ :"0" (0),"1" (s),"2" (c)
-+ :"memory");
- return __res;
- }
-
-@@ -192,7 +221,9 @@ __asm__ __volatile__(
- "scasb\n\t"
- "notl %0\n\t"
- "decl %0"
-- :"=c" (__res), "=&D" (d0) :"1" (s),"a" (0), "0" (0xffffffffu));
-+ :"=c" (__res), "=&D" (d0)
-+ :"1" (s),"a" (0), "0" (0xffffffffu)
-+ :"memory");
- return __res;
- }
-
-@@ -303,7 +334,9 @@ __asm__ __volatile__(
- "je 1f\n\t"
- "movl $1,%0\n"
- "1:\tdecl %0"
-- :"=D" (__res), "=&c" (d0) : "a" (c),"0" (cs),"1" (count));
-+ :"=D" (__res), "=&c" (d0)
-+ :"a" (c),"0" (cs),"1" (count)
-+ :"memory");
- return __res;
- }
-
-@@ -339,7 +372,7 @@ __asm__ __volatile__(
- "je 2f\n\t"
- "stosb\n"
- "2:"
-- : "=&c" (d0), "=&D" (d1)
-+ :"=&c" (d0), "=&D" (d1)
- :"a" (c), "q" (count), "0" (count/4), "1" ((long) s)
- :"memory");
- return (s);
-@@ -362,7 +395,8 @@ __asm__ __volatile__(
- "jne 1b\n"
- "3:\tsubl %2,%0"
- :"=a" (__res), "=&d" (d0)
-- :"c" (s),"1" (count));
-+ :"c" (s),"1" (count)
-+ :"memory");
- return __res;
- }
- /* end of additional stuff */
-@@ -443,7 +477,8 @@ static inline void * memscan(void * addr
- "dec %%edi\n"
- "1:"
- : "=D" (addr), "=c" (size)
-- : "0" (addr), "1" (size), "a" (c));
-+ : "0" (addr), "1" (size), "a" (c)
-+ : "memory");
- return addr;
- }
-
-diff -uprN linux-2.6.8.1.orig/include/asm-i386/thread_info.h linux-2.6.8.1-ve022stab078/include/asm-i386/thread_info.h
---- linux-2.6.8.1.orig/include/asm-i386/thread_info.h 2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/thread_info.h 2006-05-11 13:05:39.000000000 +0400
-@@ -16,6 +16,15 @@
- #include <asm/processor.h>
- #endif
-
-+#define PREEMPT_ACTIVE 0x4000000
-+#ifdef CONFIG_4KSTACKS
-+#define THREAD_SIZE (4096)
-+#else
-+#define THREAD_SIZE (8192)
-+#endif
-+#define STACK_PAGE_COUNT (THREAD_SIZE/PAGE_SIZE)
-+#define STACK_WARN (THREAD_SIZE/8)
-+
- /*
- * low level task data that entry.S needs immediate access to
- * - this struct should fit entirely inside of one cache line
-@@ -37,6 +46,8 @@ struct thread_info {
- 0-0xBFFFFFFF for user-thead
- 0-0xFFFFFFFF for kernel-thread
- */
-+ void *real_stack, *virtual_stack, *user_pgd;
-+ void *stack_page[STACK_PAGE_COUNT];
- struct restart_block restart_block;
-
- unsigned long previous_esp; /* ESP of the previous stack in case
-@@ -51,14 +62,6 @@ struct thread_info {
-
- #endif
-
--#define PREEMPT_ACTIVE 0x4000000
--#ifdef CONFIG_4KSTACKS
--#define THREAD_SIZE (4096)
--#else
--#define THREAD_SIZE (8192)
--#endif
--
--#define STACK_WARN (THREAD_SIZE/8)
- /*
- * macros/functions for gaining access to the thread information structure
- *
-@@ -66,7 +69,7 @@ struct thread_info {
- */
- #ifndef __ASSEMBLY__
-
--#define INIT_THREAD_INFO(tsk) \
-+#define INIT_THREAD_INFO(tsk, thread_info) \
- { \
- .task = &tsk, \
- .exec_domain = &default_exec_domain, \
-@@ -77,6 +80,7 @@ struct thread_info {
- .restart_block = { \
- .fn = do_no_restart_syscall, \
- }, \
-+ .real_stack = &thread_info, \
- }
-
- #define init_thread_info (init_thread_union.thread_info)
-@@ -105,13 +109,13 @@ static inline unsigned long current_stac
- ({ \
- struct thread_info *ret; \
- \
-- ret = kmalloc(THREAD_SIZE, GFP_KERNEL); \
-+ ret = kmalloc(THREAD_SIZE, GFP_KERNEL_UBC); \
- if (ret) \
- memset(ret, 0, THREAD_SIZE); \
- ret; \
- })
- #else
--#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL)
-+#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL_UBC)
- #endif
-
- #define free_thread_info(info) kfree(info)
-@@ -143,8 +147,10 @@ static inline unsigned long current_stac
- #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
- #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */
- #define TIF_IRET 5 /* return with iret */
-+#define TIF_DB7 6 /* has debug registers */
- #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
- #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-+#define TIF_FREEZE 17 /* Freeze request, atomic version of PF_FREEZE */
-
- #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
- #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
-@@ -153,6 +159,7 @@ static inline unsigned long current_stac
- #define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP)
- #define _TIF_IRET (1<<TIF_IRET)
- #define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT)
-+#define _TIF_DB7 (1<<TIF_DB7)
- #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
-
- /* work to do on interrupt/exception return */
-diff -uprN linux-2.6.8.1.orig/include/asm-i386/timex.h linux-2.6.8.1-ve022stab078/include/asm-i386/timex.h
---- linux-2.6.8.1.orig/include/asm-i386/timex.h 2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/timex.h 2006-05-11 13:05:40.000000000 +0400
-@@ -41,7 +41,7 @@ extern cycles_t cacheflush_time;
- static inline cycles_t get_cycles (void)
- {
- #ifndef CONFIG_X86_TSC
-- return 0;
-+#error "CONFIG_X86_TCS is not set!"
- #else
- unsigned long long ret;
-
-diff -uprN linux-2.6.8.1.orig/include/asm-i386/tlbflush.h linux-2.6.8.1-ve022stab078/include/asm-i386/tlbflush.h
---- linux-2.6.8.1.orig/include/asm-i386/tlbflush.h 2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/tlbflush.h 2006-05-11 13:05:38.000000000 +0400
-@@ -85,22 +85,28 @@ extern unsigned long pgkern_mask;
-
- static inline void flush_tlb_mm(struct mm_struct *mm)
- {
-+#ifndef CONFIG_X86_SWITCH_PAGETABLES
- if (mm == current->active_mm)
- __flush_tlb();
-+#endif
- }
-
- static inline void flush_tlb_page(struct vm_area_struct *vma,
- unsigned long addr)
- {
-+#ifndef CONFIG_X86_SWITCH_PAGETABLES
- if (vma->vm_mm == current->active_mm)
- __flush_tlb_one(addr);
-+#endif
- }
-
- static inline void flush_tlb_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
- {
-+#ifndef CONFIG_X86_SWITCH_PAGETABLES
- if (vma->vm_mm == current->active_mm)
- __flush_tlb();
-+#endif
- }
-
- #else
-@@ -111,11 +117,10 @@ static inline void flush_tlb_range(struc
- __flush_tlb()
-
- extern void flush_tlb_all(void);
--extern void flush_tlb_current_task(void);
- extern void flush_tlb_mm(struct mm_struct *);
- extern void flush_tlb_page(struct vm_area_struct *, unsigned long);
-
--#define flush_tlb() flush_tlb_current_task()
-+#define flush_tlb() flush_tlb_all()
-
- static inline void flush_tlb_range(struct vm_area_struct * vma, unsigned long start, unsigned long end)
- {
-diff -uprN linux-2.6.8.1.orig/include/asm-i386/uaccess.h linux-2.6.8.1-ve022stab078/include/asm-i386/uaccess.h
---- linux-2.6.8.1.orig/include/asm-i386/uaccess.h 2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/uaccess.h 2006-05-11 13:05:38.000000000 +0400
-@@ -26,7 +26,7 @@
-
-
- #define KERNEL_DS MAKE_MM_SEG(0xFFFFFFFFUL)
--#define USER_DS MAKE_MM_SEG(PAGE_OFFSET)
-+#define USER_DS MAKE_MM_SEG(TASK_SIZE)
-
- #define get_ds() (KERNEL_DS)
- #define get_fs() (current_thread_info()->addr_limit)
-@@ -150,6 +150,55 @@ extern void __get_user_4(void);
- :"=a" (ret),"=d" (x) \
- :"0" (ptr))
-
-+extern int get_user_size(unsigned int size, void *val, const void *ptr);
-+extern int put_user_size(unsigned int size, const void *val, void *ptr);
-+extern int zero_user_size(unsigned int size, void *ptr);
-+extern int copy_str_fromuser_size(unsigned int size, void *val, const void *ptr);
-+extern int strlen_fromuser_size(unsigned int size, const void *ptr);
-+
-+/*
-+ * GCC 2.96 has stupid bug which forces us to use volatile or barrier below.
-+ * without volatile or barrier compiler generates ABSOLUTELY wrong code which
-+ * igonores XXX_size function return code, but generates EFAULT :)))
-+ * the bug was found in sys_utime()
-+ */
-+# define indirect_get_user(x,ptr) \
-+({ int __ret_gu,__val_gu; \
-+ __typeof__(ptr) __ptr_gu = (ptr); \
-+ __ret_gu = get_user_size(sizeof(*__ptr_gu), &__val_gu,__ptr_gu) ? -EFAULT : 0;\
-+ barrier(); \
-+ (x) = (__typeof__(*__ptr_gu))__val_gu; \
-+ __ret_gu; \
-+})
-+#define indirect_put_user(x,ptr) \
-+({ \
-+ int __ret_pu; \
-+ __typeof__(*(ptr)) *__ptr_pu = (ptr), __x_pu = (x); \
-+ __ret_pu = put_user_size(sizeof(*__ptr_pu), \
-+ &__x_pu, __ptr_pu) ? -EFAULT : 0; \
-+ barrier(); \
-+ __ret_pu; \
-+})
-+#define __indirect_put_user indirect_put_user
-+#define __indirect_get_user indirect_get_user
-+
-+#define indirect_copy_from_user(to,from,n) get_user_size(n,to,from)
-+#define indirect_copy_to_user(to,from,n) put_user_size(n,from,to)
-+
-+#define __indirect_copy_from_user indirect_copy_from_user
-+#define __indirect_copy_to_user indirect_copy_to_user
-+
-+#define indirect_strncpy_from_user(dst, src, count) \
-+ copy_str_fromuser_size(count, dst, src)
-+
-+extern int strlen_fromuser_size(unsigned int size, const void *ptr);
-+#define indirect_strnlen_user(str, n) strlen_fromuser_size(n, str)
-+#define indirect_strlen_user(str) indirect_strnlen_user(str, ~0UL >> 1)
-+
-+extern int zero_user_size(unsigned int size, void *ptr);
-+
-+#define indirect_clear_user(mem, len) zero_user_size(len, mem)
-+#define __indirect_clear_user clear_user
-
- /* Careful: we have to cast the result to the type of the pointer for sign reasons */
- /**
-@@ -169,7 +218,7 @@ extern void __get_user_4(void);
- * Returns zero on success, or -EFAULT on error.
- * On error, the variable @x is set to zero.
- */
--#define get_user(x,ptr) \
-+#define direct_get_user(x,ptr) \
- ({ int __ret_gu,__val_gu; \
- __chk_user_ptr(ptr); \
- switch(sizeof (*(ptr))) { \
-@@ -200,7 +249,7 @@ extern void __put_user_bad(void);
- *
- * Returns zero on success, or -EFAULT on error.
- */
--#define put_user(x,ptr) \
-+#define direct_put_user(x,ptr) \
- __put_user_check((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
-
-
-@@ -224,7 +273,7 @@ extern void __put_user_bad(void);
- * Returns zero on success, or -EFAULT on error.
- * On error, the variable @x is set to zero.
- */
--#define __get_user(x,ptr) \
-+#define __direct_get_user(x,ptr) \
- __get_user_nocheck((x),(ptr),sizeof(*(ptr)))
-
-
-@@ -247,7 +296,7 @@ extern void __put_user_bad(void);
- *
- * Returns zero on success, or -EFAULT on error.
- */
--#define __put_user(x,ptr) \
-+#define __direct_put_user(x,ptr) \
- __put_user_nocheck((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
-
- #define __put_user_nocheck(x,ptr,size) \
-@@ -400,7 +449,7 @@ unsigned long __copy_from_user_ll(void *
- * On success, this will be zero.
- */
- static inline unsigned long
--__copy_to_user(void __user *to, const void *from, unsigned long n)
-+__direct_copy_to_user(void __user *to, const void *from, unsigned long n)
- {
- if (__builtin_constant_p(n)) {
- unsigned long ret;
-@@ -438,7 +487,7 @@ __copy_to_user(void __user *to, const vo
- * data to the requested size using zero bytes.
- */
- static inline unsigned long
--__copy_from_user(void *to, const void __user *from, unsigned long n)
-+__direct_copy_from_user(void *to, const void __user *from, unsigned long n)
- {
- if (__builtin_constant_p(n)) {
- unsigned long ret;
-@@ -458,9 +507,55 @@ __copy_from_user(void *to, const void __
- return __copy_from_user_ll(to, from, n);
- }
-
--unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
--unsigned long copy_from_user(void *to,
-- const void __user *from, unsigned long n);
-+/**
-+ * copy_to_user: - Copy a block of data into user space.
-+ * @to: Destination address, in user space.
-+ * @from: Source address, in kernel space.
-+ * @n: Number of bytes to copy.
-+ *
-+ * Context: User context only. This function may sleep.
-+ *
-+ * Copy data from kernel space to user space.
-+ *
-+ * Returns number of bytes that could not be copied.
-+ * On success, this will be zero.
-+ */
-+static inline unsigned long
-+direct_copy_to_user(void __user *to, const void *from, unsigned long n)
-+{
-+ might_sleep();
-+ if (access_ok(VERIFY_WRITE, to, n))
-+ n = __direct_copy_to_user(to, from, n);
-+ return n;
-+}
-+
-+/**
-+ * copy_from_user: - Copy a block of data from user space.
-+ * @to: Destination address, in kernel space.
-+ * @from: Source address, in user space.
-+ * @n: Number of bytes to copy.
-+ *
-+ * Context: User context only. This function may sleep.
-+ *
-+ * Copy data from user space to kernel space.
-+ *
-+ * Returns number of bytes that could not be copied.
-+ * On success, this will be zero.
-+ *
-+ * If some data could not be copied, this function will pad the copied
-+ * data to the requested size using zero bytes.
-+ */
-+static inline unsigned long
-+direct_copy_from_user(void *to, const void __user *from, unsigned long n)
-+{
-+ might_sleep();
-+ if (access_ok(VERIFY_READ, from, n))
-+ n = __direct_copy_from_user(to, from, n);
-+ else
-+ memset(to, 0, n);
-+ return n;
-+}
-+
- long strncpy_from_user(char *dst, const char __user *src, long count);
- long __strncpy_from_user(char *dst, const char __user *src, long count);
-
-@@ -478,10 +573,68 @@ long __strncpy_from_user(char *dst, cons
- * If there is a limit on the length of a valid string, you may wish to
- * consider using strnlen_user() instead.
- */
--#define strlen_user(str) strnlen_user(str, ~0UL >> 1)
-
--long strnlen_user(const char __user *str, long n);
--unsigned long clear_user(void __user *mem, unsigned long len);
--unsigned long __clear_user(void __user *mem, unsigned long len);
-+long direct_strncpy_from_user(char *dst, const char *src, long count);
-+long __direct_strncpy_from_user(char *dst, const char *src, long count);
-+#define direct_strlen_user(str) direct_strnlen_user(str, ~0UL >> 1)
-+long direct_strnlen_user(const char *str, long n);
-+unsigned long direct_clear_user(void *mem, unsigned long len);
-+unsigned long __direct_clear_user(void *mem, unsigned long len);
-+
-+extern int indirect_uaccess;
-+
-+#ifdef CONFIG_X86_UACCESS_INDIRECT
-+
-+/*
-+ * Return code and zeroing semantics:
-+
-+ __clear_user 0 <-> bytes not done
-+ clear_user 0 <-> bytes not done
-+ __copy_to_user 0 <-> bytes not done
-+ copy_to_user 0 <-> bytes not done
-+ __copy_from_user 0 <-> bytes not done, zero rest
-+ copy_from_user 0 <-> bytes not done, zero rest
-+ __get_user 0 <-> -EFAULT
-+ get_user 0 <-> -EFAULT
-+ __put_user 0 <-> -EFAULT
-+ put_user 0 <-> -EFAULT
-+ strlen_user strlen + 1 <-> 0
-+ strnlen_user strlen + 1 (or n+1) <-> 0
-+ strncpy_from_user strlen (or n) <-> -EFAULT
-+
-+ */
-+
-+#define __clear_user(mem,len) __indirect_clear_user(mem,len)
-+#define clear_user(mem,len) indirect_clear_user(mem,len)
-+#define __copy_to_user(to,from,n) __indirect_copy_to_user(to,from,n)
-+#define copy_to_user(to,from,n) indirect_copy_to_user(to,from,n)
-+#define __copy_from_user(to,from,n) __indirect_copy_from_user(to,from,n)
-+#define copy_from_user(to,from,n) indirect_copy_from_user(to,from,n)
-+#define __get_user(val,ptr) __indirect_get_user(val,ptr)
-+#define get_user(val,ptr) indirect_get_user(val,ptr)
-+#define __put_user(val,ptr) __indirect_put_user(val,ptr)
-+#define put_user(val,ptr) indirect_put_user(val,ptr)
-+#define strlen_user(str) indirect_strlen_user(str)
-+#define strnlen_user(src,count) indirect_strnlen_user(src,count)
-+#define strncpy_from_user(dst,src,count) \
-+ indirect_strncpy_from_user(dst,src,count)
-+
-+#else
-+
-+#define __clear_user __direct_clear_user
-+#define clear_user direct_clear_user
-+#define __copy_to_user __direct_copy_to_user
-+#define copy_to_user direct_copy_to_user
-+#define __copy_from_user __direct_copy_from_user
-+#define copy_from_user direct_copy_from_user
-+#define __get_user __direct_get_user
-+#define get_user direct_get_user
-+#define __put_user __direct_put_user
-+#define put_user direct_put_user
-+#define strlen_user direct_strlen_user
-+#define strnlen_user direct_strnlen_user
-+#define strncpy_from_user direct_strncpy_from_user
-+
-+#endif /* CONFIG_X86_UACCESS_INDIRECT */
-
- #endif /* __i386_UACCESS_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-i386/unistd.h linux-2.6.8.1-ve022stab078/include/asm-i386/unistd.h
---- linux-2.6.8.1.orig/include/asm-i386/unistd.h 2004-08-14 14:55:35.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-i386/unistd.h 2006-05-11 13:05:43.000000000 +0400
-@@ -289,8 +289,18 @@
- #define __NR_mq_notify (__NR_mq_open+4)
- #define __NR_mq_getsetattr (__NR_mq_open+5)
- #define __NR_sys_kexec_load 283
--
--#define NR_syscalls 284
-+#define __NR_fairsched_mknod 500 /* FairScheduler syscalls */
-+#define __NR_fairsched_rmnod 501
-+#define __NR_fairsched_chwt 502
-+#define __NR_fairsched_mvpr 503
-+#define __NR_fairsched_rate 504
-+#define __NR_getluid 510
-+#define __NR_setluid 511
-+#define __NR_setublimit 512
-+#define __NR_ubstat 513
-+#define __NR_lchmod 516
-+#define __NR_lutime 517
-+#define NR_syscalls 517
-
- /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
-
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/machvec_init.h linux-2.6.8.1-ve022stab078/include/asm-ia64/machvec_init.h
---- linux-2.6.8.1.orig/include/asm-ia64/machvec_init.h 2004-08-14 14:54:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/machvec_init.h 2006-05-11 13:05:37.000000000 +0400
-@@ -1,4 +1,5 @@
- #include <asm/machvec.h>
-+#include <asm/io.h>
-
- extern ia64_mv_send_ipi_t ia64_send_ipi;
- extern ia64_mv_global_tlb_purge_t ia64_global_tlb_purge;
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/mman.h linux-2.6.8.1-ve022stab078/include/asm-ia64/mman.h
---- linux-2.6.8.1.orig/include/asm-ia64/mman.h 2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/mman.h 2006-05-11 13:05:39.000000000 +0400
-@@ -30,6 +30,7 @@
- #define MAP_NORESERVE 0x04000 /* don't check for reservations */
- #define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */
- #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-+#define MAP_EXECPRIO 0x80000 /* map from exec - try not to fail */
-
- #define MS_ASYNC 1 /* sync memory asynchronously */
- #define MS_INVALIDATE 2 /* invalidate the caches */
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/pgtable.h linux-2.6.8.1-ve022stab078/include/asm-ia64/pgtable.h
---- linux-2.6.8.1.orig/include/asm-ia64/pgtable.h 2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/pgtable.h 2006-05-11 13:05:30.000000000 +0400
-@@ -8,7 +8,7 @@
- * This hopefully works with any (fixed) IA-64 page-size, as defined
- * in <asm/page.h> (currently 8192).
- *
-- * Copyright (C) 1998-2004 Hewlett-Packard Co
-+ * Copyright (C) 1998-2005 Hewlett-Packard Co
- * David Mosberger-Tang <davidm@hpl.hp.com>
- */
-
-@@ -420,6 +420,8 @@ pte_same (pte_t a, pte_t b)
- return pte_val(a) == pte_val(b);
- }
-
-+#define update_mmu_cache(vma, address, pte) do { } while (0)
-+
- extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
- extern void paging_init (void);
-
-@@ -479,7 +481,7 @@ extern void hugetlb_free_pgtables(struct
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
--extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-+extern void lazy_mmu_prot_update (pte_t pte);
-
- #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
- /*
-@@ -549,7 +551,11 @@ do { \
-
- /* These tell get_user_pages() that the first gate page is accessible from user-level. */
- #define FIXADDR_USER_START GATE_ADDR
--#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)
-+#ifdef HAVE_BUGGY_SEGREL
-+# define FIXADDR_USER_END (GATE_ADDR + 2*PAGE_SIZE)
-+#else
-+# define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)
-+#endif
-
- #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
- #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-@@ -558,6 +564,7 @@ do { \
- #define __HAVE_ARCH_PTEP_MKDIRTY
- #define __HAVE_ARCH_PTE_SAME
- #define __HAVE_ARCH_PGD_OFFSET_GATE
-+#define __HAVE_ARCH_LAZY_MMU_PROT_UPDATE
- #include <asm-generic/pgtable.h>
-
- #endif /* _ASM_IA64_PGTABLE_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/processor.h linux-2.6.8.1-ve022stab078/include/asm-ia64/processor.h
---- linux-2.6.8.1.orig/include/asm-ia64/processor.h 2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/processor.h 2006-05-11 13:05:40.000000000 +0400
-@@ -310,7 +310,7 @@ struct thread_struct {
- regs->loadrs = 0; \
- regs->r8 = current->mm->dumpable; /* set "don't zap registers" flag */ \
- regs->r12 = new_sp - 16; /* allocate 16 byte scratch area */ \
-- if (unlikely(!current->mm->dumpable)) { \
-+ if (unlikely(!current->mm->dumpable || !current->mm->vps_dumpable)) { \
- /* \
- * Zap scratch regs to avoid leaking bits between processes with different \
- * uid/privileges. \
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/ptrace.h linux-2.6.8.1-ve022stab078/include/asm-ia64/ptrace.h
---- linux-2.6.8.1.orig/include/asm-ia64/ptrace.h 2004-08-14 14:55:35.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/ptrace.h 2006-05-11 13:05:30.000000000 +0400
-@@ -2,7 +2,7 @@
- #define _ASM_IA64_PTRACE_H
-
- /*
-- * Copyright (C) 1998-2003 Hewlett-Packard Co
-+ * Copyright (C) 1998-2004 Hewlett-Packard Co
- * David Mosberger-Tang <davidm@hpl.hp.com>
- * Stephane Eranian <eranian@hpl.hp.com>
- * Copyright (C) 2003 Intel Co
-@@ -110,7 +110,11 @@ struct pt_regs {
-
- unsigned long cr_ipsr; /* interrupted task's psr */
- unsigned long cr_iip; /* interrupted task's instruction pointer */
-- unsigned long cr_ifs; /* interrupted task's function state */
-+ /*
-+ * interrupted task's function state; if bit 63 is cleared, it
-+ * contains syscall's ar.pfs.pfm:
-+ */
-+ unsigned long cr_ifs;
-
- unsigned long ar_unat; /* interrupted task's NaT register (preserved) */
- unsigned long ar_pfs; /* prev function state */
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/system.h linux-2.6.8.1-ve022stab078/include/asm-ia64/system.h
---- linux-2.6.8.1.orig/include/asm-ia64/system.h 2004-08-14 14:55:19.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/system.h 2006-05-11 13:05:39.000000000 +0400
-@@ -279,7 +279,7 @@ do { \
- spin_lock(&(next)->switch_lock); \
- spin_unlock(&(rq)->lock); \
- } while (0)
--#define finish_arch_switch(rq, prev) spin_unlock_irq(&(prev)->switch_lock)
-+#define finish_arch_switch(rq, prev) spin_unlock(&(prev)->switch_lock)
- #define task_running(rq, p) ((rq)->curr == (p) || spin_is_locked(&(p)->switch_lock))
-
- #define ia64_platform_is(x) (strcmp(x, platform_name) == 0)
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/thread_info.h linux-2.6.8.1-ve022stab078/include/asm-ia64/thread_info.h
---- linux-2.6.8.1.orig/include/asm-ia64/thread_info.h 2004-08-14 14:56:24.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/thread_info.h 2006-05-11 13:05:25.000000000 +0400
-@@ -75,6 +75,7 @@ struct thread_info {
- #define TIF_SYSCALL_TRACE 3 /* syscall trace active */
- #define TIF_SYSCALL_AUDIT 4 /* syscall auditing active */
- #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-+#define TIF_FREEZE 17 /* Freeze request, atomic version of PF_FREEZE */
-
- #define TIF_WORK_MASK 0x7 /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE */
- #define TIF_ALLWORK_MASK 0x1f /* bits 0..4 are "work to do on user-return" bits */
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/timex.h linux-2.6.8.1-ve022stab078/include/asm-ia64/timex.h
---- linux-2.6.8.1.orig/include/asm-ia64/timex.h 2004-08-14 14:56:00.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/timex.h 2006-05-11 13:05:40.000000000 +0400
-@@ -10,11 +10,14 @@
- * Also removed cacheflush_time as it's entirely unused.
- */
-
--#include <asm/intrinsics.h>
--#include <asm/processor.h>
-+extern unsigned int cpu_khz;
-
- typedef unsigned long cycles_t;
-
-+#ifdef __KERNEL__
-+#include <asm/intrinsics.h>
-+#include <asm/processor.h>
-+
- /*
- * For performance reasons, we don't want to define CLOCK_TICK_TRATE as
- * local_cpu_data->itc_rate. Fortunately, we don't have to, either: according to George
-@@ -37,4 +40,5 @@ get_cycles (void)
- return ret;
- }
-
-+#endif /* __KERNEL__ */
- #endif /* _ASM_IA64_TIMEX_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-ia64/unistd.h linux-2.6.8.1-ve022stab078/include/asm-ia64/unistd.h
---- linux-2.6.8.1.orig/include/asm-ia64/unistd.h 2004-08-14 14:54:50.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-ia64/unistd.h 2006-05-11 13:05:43.000000000 +0400
-@@ -259,12 +259,23 @@
- #define __NR_mq_getsetattr 1267
- #define __NR_kexec_load 1268
- #define __NR_vserver 1269
-+#define __NR_fairsched_mknod 1500
-+#define __NR_fairsched_rmnod 1501
-+#define __NR_fairsched_chwt 1502
-+#define __NR_fairsched_mvpr 1503
-+#define __NR_fairsched_rate 1504
-+#define __NR_getluid 1505
-+#define __NR_setluid 1506
-+#define __NR_setublimit 1507
-+#define __NR_ubstat 1508
-+#define __NR_lchmod 1509
-+#define __NR_lutime 1510
-
- #ifdef __KERNEL__
-
- #include <linux/config.h>
-
--#define NR_syscalls 256 /* length of syscall table */
-+#define NR_syscalls (__NR_lutime - __NR_ni_syscall + 1) /* length of syscall table */
-
- #define __ARCH_WANT_SYS_RT_SIGACTION
-
-@@ -369,7 +380,7 @@ asmlinkage unsigned long sys_mmap2(
- int fd, long pgoff);
- struct pt_regs;
- struct sigaction;
--asmlinkage long sys_execve(char *filename, char **argv, char **envp,
-+long sys_execve(char *filename, char **argv, char **envp,
- struct pt_regs *regs);
- asmlinkage long sys_pipe(long arg0, long arg1, long arg2, long arg3,
- long arg4, long arg5, long arg6, long arg7, long stack);
-diff -uprN linux-2.6.8.1.orig/include/asm-mips/system.h linux-2.6.8.1-ve022stab078/include/asm-mips/system.h
---- linux-2.6.8.1.orig/include/asm-mips/system.h 2004-08-14 14:55:20.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-mips/system.h 2006-05-11 13:05:39.000000000 +0400
-@@ -496,7 +496,7 @@ do { \
- spin_lock(&(next)->switch_lock); \
- spin_unlock(&(rq)->lock); \
- } while (0)
--#define finish_arch_switch(rq, prev) spin_unlock_irq(&(prev)->switch_lock)
-+#define finish_arch_switch(rq, prev) spin_unlock(&(prev)->switch_lock)
- #define task_running(rq, p) ((rq)->curr == (p) || spin_is_locked(&(p)->switch_lock))
-
- #endif /* _ASM_SYSTEM_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-s390/system.h linux-2.6.8.1-ve022stab078/include/asm-s390/system.h
---- linux-2.6.8.1.orig/include/asm-s390/system.h 2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-s390/system.h 2006-05-11 13:05:39.000000000 +0400
-@@ -107,7 +107,7 @@ static inline void restore_access_regs(u
- #define task_running(rq, p) ((rq)->curr == (p))
- #define finish_arch_switch(rq, prev) do { \
- set_fs(current->thread.mm_segment); \
-- spin_unlock_irq(&(rq)->lock); \
-+ spin_unlock(&(rq)->lock); \
- } while (0)
-
- #define nop() __asm__ __volatile__ ("nop")
-diff -uprN linux-2.6.8.1.orig/include/asm-sparc/system.h linux-2.6.8.1-ve022stab078/include/asm-sparc/system.h
---- linux-2.6.8.1.orig/include/asm-sparc/system.h 2004-08-14 14:55:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-sparc/system.h 2006-05-11 13:05:39.000000000 +0400
-@@ -109,7 +109,7 @@ extern void fpsave(unsigned long *fpregs
- "save %sp, -0x40, %sp\n\t" \
- "restore; restore; restore; restore; restore; restore; restore"); \
- } while(0)
--#define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-+#define finish_arch_switch(rq, next) spin_unlock(&(rq)->lock)
- #define task_running(rq, p) ((rq)->curr == (p))
-
- /* Much care has gone into this code, do not touch it.
-diff -uprN linux-2.6.8.1.orig/include/asm-sparc64/system.h linux-2.6.8.1-ve022stab078/include/asm-sparc64/system.h
---- linux-2.6.8.1.orig/include/asm-sparc64/system.h 2004-08-14 14:54:46.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-sparc64/system.h 2006-05-11 13:05:39.000000000 +0400
-@@ -146,7 +146,7 @@ do { spin_lock(&(next)->switch_lock); \
- } while (0)
-
- #define finish_arch_switch(rq, prev) \
--do { spin_unlock_irq(&(prev)->switch_lock); \
-+do { spin_unlock(&(prev)->switch_lock); \
- } while (0)
-
- #define task_running(rq, p) \
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/a.out.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/a.out.h
---- linux-2.6.8.1.orig/include/asm-x86_64/a.out.h 2004-08-14 14:55:20.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/a.out.h 2006-05-11 13:05:29.000000000 +0400
-@@ -21,7 +21,7 @@ struct exec
-
- #ifdef __KERNEL__
- #include <linux/thread_info.h>
--#define STACK_TOP (test_thread_flag(TIF_IA32) ? IA32_PAGE_OFFSET : TASK_SIZE)
-+#define STACK_TOP TASK_SIZE
- #endif
-
- #endif /* __A_OUT_GNU_H__ */
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/cacheflush.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/cacheflush.h
---- linux-2.6.8.1.orig/include/asm-x86_64/cacheflush.h 2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/cacheflush.h 2006-05-11 13:05:30.000000000 +0400
-@@ -25,5 +25,6 @@
-
- void global_flush_tlb(void);
- int change_page_attr(struct page *page, int numpages, pgprot_t prot);
-+int change_page_attr_addr(unsigned long addr, int numpages, pgprot_t prot);
-
- #endif /* _X8664_CACHEFLUSH_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/calling.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/calling.h
---- linux-2.6.8.1.orig/include/asm-x86_64/calling.h 2004-08-14 14:55:35.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/calling.h 2006-05-11 13:05:33.000000000 +0400
-@@ -143,22 +143,6 @@
- RESTORE_ARGS 0,\addskip
- .endm
-
-- /* push in order ss, rsp, eflags, cs, rip */
-- .macro FAKE_STACK_FRAME child_rip
-- xorl %eax,%eax
-- subq $6*8,%rsp
-- movq %rax,5*8(%rsp) /* ss */
-- movq %rax,4*8(%rsp) /* rsp */
-- movq $(1<<9),3*8(%rsp) /* eflags */
-- movq $__KERNEL_CS,2*8(%rsp) /* cs */
-- movq \child_rip,1*8(%rsp) /* rip */
-- movq %rax,(%rsp) /* orig_rax */
-- .endm
--
-- .macro UNFAKE_STACK_FRAME
-- addq $8*6, %rsp
-- .endm
--
- .macro icebp
- .byte 0xf1
- .endm
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/desc.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/desc.h
---- linux-2.6.8.1.orig/include/asm-x86_64/desc.h 2004-08-14 14:56:24.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/desc.h 2006-05-11 13:05:29.000000000 +0400
-@@ -128,13 +128,13 @@ static inline void set_tss_desc(unsigned
- {
- set_tssldt_descriptor(&cpu_gdt_table[cpu][GDT_ENTRY_TSS], (unsigned long)addr,
- DESC_TSS,
-- sizeof(struct tss_struct));
-+ sizeof(struct tss_struct) - 1);
- }
-
- static inline void set_ldt_desc(unsigned cpu, void *addr, int size)
- {
- set_tssldt_descriptor(&cpu_gdt_table[cpu][GDT_ENTRY_LDT], (unsigned long)addr,
-- DESC_LDT, size * 8);
-+ DESC_LDT, size * 8 - 1);
- }
-
- static inline void set_seg_base(unsigned cpu, int entry, void *base)
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/hw_irq.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/hw_irq.h
---- linux-2.6.8.1.orig/include/asm-x86_64/hw_irq.h 2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/hw_irq.h 2006-05-11 13:05:29.000000000 +0400
-@@ -163,7 +163,7 @@ static inline void x86_do_profile (struc
- atomic_inc((atomic_t *)&prof_buffer[rip]);
- }
-
--#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP)
-+#if defined(CONFIG_X86_IO_APIC)
- static inline void hw_resend_irq(struct hw_interrupt_type *h, unsigned int i) {
- if (IO_APIC_IRQ(i))
- send_IPI_self(IO_APIC_VECTOR(i));
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/ia32.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/ia32.h
---- linux-2.6.8.1.orig/include/asm-x86_64/ia32.h 2004-08-14 14:56:13.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/ia32.h 2006-05-11 13:05:27.000000000 +0400
-@@ -84,7 +84,7 @@ typedef union sigval32 {
- unsigned int sival_ptr;
- } sigval_t32;
-
--typedef struct siginfo32 {
-+typedef struct compat_siginfo {
- int si_signo;
- int si_errno;
- int si_code;
-@@ -134,7 +134,7 @@ typedef struct siginfo32 {
- int _fd;
- } _sigpoll;
- } _sifields;
--} siginfo_t32;
-+} compat_siginfo_t;
-
- struct sigframe32
- {
-@@ -151,7 +151,7 @@ struct rt_sigframe32
- int sig;
- u32 pinfo;
- u32 puc;
-- struct siginfo32 info;
-+ struct compat_siginfo info;
- struct ucontext_ia32 uc;
- struct _fpstate_ia32 fpstate;
- };
-@@ -171,8 +171,6 @@ struct siginfo_t;
- int do_get_thread_area(struct thread_struct *t, struct user_desc __user *info);
- int do_set_thread_area(struct thread_struct *t, struct user_desc __user *info);
- int ia32_child_tls(struct task_struct *p, struct pt_regs *childregs);
--int ia32_copy_siginfo_from_user(siginfo_t *to, siginfo_t32 __user *from);
--int ia32_copy_siginfo_to_user(siginfo_t32 __user *to, siginfo_t *from);
- #endif
-
- #endif /* !CONFIG_IA32_SUPPORT */
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/irq.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/irq.h
---- linux-2.6.8.1.orig/include/asm-x86_64/irq.h 2004-08-14 14:54:46.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/irq.h 2006-05-11 13:05:28.000000000 +0400
-@@ -57,4 +57,6 @@ struct irqaction;
- struct pt_regs;
- int handle_IRQ_event(unsigned int, struct pt_regs *, struct irqaction *);
-
-+extern int no_irq_affinity;
-+
- #endif /* _ASM_IRQ_H */
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/mman.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/mman.h
---- linux-2.6.8.1.orig/include/asm-x86_64/mman.h 2004-08-14 14:55:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/mman.h 2006-05-11 13:05:39.000000000 +0400
-@@ -23,6 +23,7 @@
- #define MAP_NORESERVE 0x4000 /* don't check for reservations */
- #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
- #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-+#define MAP_EXECPRIO 0x80000 /* map from exec - try not to fail */
-
- #define MS_ASYNC 1 /* sync memory asynchronously */
- #define MS_INVALIDATE 2 /* invalidate the caches */
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/msr.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/msr.h
---- linux-2.6.8.1.orig/include/asm-x86_64/msr.h 2004-08-14 14:56:22.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/msr.h 2006-05-11 13:05:28.000000000 +0400
-@@ -208,6 +208,7 @@ extern inline unsigned int cpuid_edx(uns
- #define MSR_K8_TOP_MEM1 0xC001001A
- #define MSR_K8_TOP_MEM2 0xC001001D
- #define MSR_K8_SYSCFG 0xC0000010
-+#define MSR_K8_HWCR 0xC0010015
-
- /* K6 MSRs */
- #define MSR_K6_EFER 0xC0000080
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/mtrr.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/mtrr.h
---- linux-2.6.8.1.orig/include/asm-x86_64/mtrr.h 2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/mtrr.h 2006-05-11 13:05:32.000000000 +0400
-@@ -71,8 +71,6 @@ struct mtrr_gentry
-
- #ifdef __KERNEL__
-
--extern char *mtrr_strings[MTRR_NUM_TYPES];
--
- /* The following functions are for use by other drivers */
- # ifdef CONFIG_MTRR
- extern int mtrr_add (unsigned long base, unsigned long size,
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/pgalloc.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/pgalloc.h
---- linux-2.6.8.1.orig/include/asm-x86_64/pgalloc.h 2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/pgalloc.h 2006-05-11 13:05:39.000000000 +0400
-@@ -30,12 +30,12 @@ extern __inline__ void pmd_free(pmd_t *p
-
- static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
- {
-- return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
-+ return (pmd_t *)get_zeroed_page(GFP_KERNEL_UBC|__GFP_REPEAT);
- }
-
- static inline pgd_t *pgd_alloc (struct mm_struct *mm)
- {
-- return (pgd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
-+ return (pgd_t *)get_zeroed_page(GFP_KERNEL_UBC|__GFP_REPEAT);
- }
-
- static inline void pgd_free (pgd_t *pgd)
-@@ -51,7 +51,7 @@ static inline pte_t *pte_alloc_one_kerne
-
- static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
- {
-- void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
-+ void *p = (void *)get_zeroed_page(GFP_KERNEL_UBC|__GFP_REPEAT);
- if (!p)
- return NULL;
- return virt_to_page(p);
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/pgtable.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/pgtable.h
---- linux-2.6.8.1.orig/include/asm-x86_64/pgtable.h 2004-08-14 14:55:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/pgtable.h 2006-05-11 13:05:29.000000000 +0400
-@@ -384,7 +384,7 @@ extern inline pte_t pte_modify(pte_t pte
- }
-
- #define pte_index(address) \
-- ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
-+ (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
- #define pte_offset_kernel(dir, address) ((pte_t *) pmd_page_kernel(*(dir)) + \
- pte_index(address))
-
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/processor.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/processor.h
---- linux-2.6.8.1.orig/include/asm-x86_64/processor.h 2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/processor.h 2006-05-11 13:05:45.000000000 +0400
-@@ -76,7 +76,6 @@ struct cpuinfo_x86 {
- #define X86_VENDOR_UNKNOWN 0xff
-
- extern struct cpuinfo_x86 boot_cpu_data;
--extern struct tss_struct init_tss[NR_CPUS];
-
- #ifdef CONFIG_SMP
- extern struct cpuinfo_x86 cpu_data[];
-@@ -166,16 +165,16 @@ static inline void clear_in_cr4 (unsigne
- /*
- * User space process size: 512GB - 1GB (default).
- */
--#define TASK_SIZE (0x0000007fc0000000UL)
-+#define TASK_SIZE64 (0x0000007fc0000000UL)
-
- /* This decides where the kernel will search for a free chunk of vm
- * space during mmap's.
- */
--#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)
--#define TASK_UNMAPPED_32 PAGE_ALIGN(IA32_PAGE_OFFSET/3)
--#define TASK_UNMAPPED_64 PAGE_ALIGN(TASK_SIZE/3)
--#define TASK_UNMAPPED_BASE \
-- (test_thread_flag(TIF_IA32) ? TASK_UNMAPPED_32 : TASK_UNMAPPED_64)
-+#define IA32_PAGE_OFFSET 0xc0000000
-+#define TASK_SIZE (test_thread_flag(TIF_IA32) ? IA32_PAGE_OFFSET : TASK_SIZE64)
-+#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32) ? IA32_PAGE_OFFSET : TASK_SIZE64))
-+
-+#define TASK_UNMAPPED_BASE PAGE_ALIGN(TASK_SIZE/3)
-
- /*
- * Size of io_bitmap.
-@@ -183,7 +182,6 @@ static inline void clear_in_cr4 (unsigne
- #define IO_BITMAP_BITS 65536
- #define IO_BITMAP_BYTES (IO_BITMAP_BITS/8)
- #define IO_BITMAP_LONGS (IO_BITMAP_BYTES/sizeof(long))
--#define IO_BITMAP_OFFSET offsetof(struct tss_struct,io_bitmap)
- #define INVALID_IO_BITMAP_OFFSET 0x8000
-
- struct i387_fxsave_struct {
-@@ -229,6 +227,10 @@ struct tss_struct {
-
- #define ARCH_MIN_TASKALIGN 16
-
-+#define IO_BITMAP_OFFSET offsetof(struct tss_struct,io_bitmap)
-+
-+extern struct tss_struct init_tss[NR_CPUS];
-+
- struct thread_struct {
- unsigned long rsp0;
- unsigned long rsp;
-diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/segment.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/segment.h
---- linux-2.6.8.1.orig/include/asm-x86_64/segment.h 2004-08-14 14:56:22.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/segment.h 2006-05-11 13:05:45.000000000 +0400
-@@ -3,32 +3,31 @@
-
- #include <asm/cache.h>
-
--#define __KERNEL_CS 0x10
--#define __KERNEL_DS 0x18
--
--#define __KERNEL32_CS 0x38
--
-+#define __KERNEL_COMPAT32_CS 0x8
-+#define GDT_ENTRY_BOOT_CS 2
-+#define __BOOT_CS (GDT_ENTRY_BOOT_CS * 8)
-+#define GDT_ENTRY_BOOT_DS 3
-+#define 
__BOOT_DS (GDT_ENTRY_BOOT_DS * 8) -+#define GDT_ENTRY_TSS 4 /* needs two entries */ - /* - * we cannot use the same code segment descriptor for user and kernel - * -- not even in the long flat mode, because of different DPL /kkeil - * The segment offset needs to contain a RPL. Grr. -AK - * GDT layout to get 64bit syscall right (sysret hardcodes gdt offsets) - */ -- --#define __USER32_CS 0x23 /* 4*8+3 */ --#define __USER_DS 0x2b /* 5*8+3 */ --#define __USER_CS 0x33 /* 6*8+3 */ --#define __USER32_DS __USER_DS -+#define GDT_ENTRY_TLS_MIN 6 -+#define GDT_ENTRY_TLS_MAX 8 -+#define GDT_ENTRY_KERNELCS16 9 - #define __KERNEL16_CS (GDT_ENTRY_KERNELCS16 * 8) --#define __KERNEL_COMPAT32_CS 0x8 - --#define GDT_ENTRY_TLS 1 --#define GDT_ENTRY_TSS 8 /* needs two entries */ - #define GDT_ENTRY_LDT 10 --#define GDT_ENTRY_TLS_MIN 11 --#define GDT_ENTRY_TLS_MAX 13 --/* 14 free */ --#define GDT_ENTRY_KERNELCS16 15 -+#define __KERNEL32_CS 0x58 /* 11*8 */ -+#define __KERNEL_CS 0x60 /* 12*8 */ -+#define __KERNEL_DS 0x68 /* 13*8 */ -+#define __USER32_CS 0x73 /* 14*8+3 */ -+#define __USER_DS 0x7b /* 15*8+3 */ -+#define __USER32_DS __USER_DS -+#define __USER_CS 0x83 /* 16*8+3 */ - - #define GDT_ENTRY_TLS_ENTRIES 3 - -@@ -40,7 +39,7 @@ - #define FS_TLS_SEL ((GDT_ENTRY_TLS_MIN+FS_TLS)*8 + 3) - - #define IDT_ENTRIES 256 --#define GDT_ENTRIES 16 -+#define GDT_ENTRIES 32 - #define GDT_SIZE (GDT_ENTRIES * 8) - #define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8) - -diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/system.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/system.h ---- linux-2.6.8.1.orig/include/asm-x86_64/system.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/system.h 2006-05-11 13:05:30.000000000 +0400 -@@ -35,7 +35,7 @@ - "thread_return:\n\t" \ - "movq %%gs:%P[pda_pcurrent],%%rsi\n\t" \ - "movq %P[thread_info](%%rsi),%%r8\n\t" \ -- "btr %[tif_fork],%P[ti_flags](%%r8)\n\t" \ -+ LOCK "btr %[tif_fork],%P[ti_flags](%%r8)\n\t" \ - "movq %%rax,%%rdi\n\t" 
\ - "jc ret_from_fork\n\t" \ - RESTORE_CONTEXT \ -diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/thread_info.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/thread_info.h ---- linux-2.6.8.1.orig/include/asm-x86_64/thread_info.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/thread_info.h 2006-05-11 13:05:25.000000000 +0400 -@@ -106,6 +106,7 @@ static inline struct thread_info *stack_ - #define TIF_IA32 17 /* 32bit process */ - #define TIF_FORK 18 /* ret_from_fork */ - #define TIF_ABI_PENDING 19 -+#define TIF_FREEZE 20 /* Freeze request, atomic version of PF_FREEZE */ - - #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) - #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) -diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/unistd.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/unistd.h ---- linux-2.6.8.1.orig/include/asm-x86_64/unistd.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/unistd.h 2006-05-11 13:05:43.000000000 +0400 -@@ -554,8 +554,30 @@ __SYSCALL(__NR_mq_notify, sys_mq_notify) - __SYSCALL(__NR_mq_getsetattr, sys_mq_getsetattr) - #define __NR_kexec_load 246 - __SYSCALL(__NR_kexec_load, sys_ni_syscall) -+#define __NR_getluid 500 -+__SYSCALL(__NR_getluid, sys_getluid) -+#define __NR_setluid 501 -+__SYSCALL(__NR_setluid, sys_setluid) -+#define __NR_setublimit 502 -+__SYSCALL(__NR_setublimit, sys_setublimit) -+#define __NR_ubstat 503 -+__SYSCALL(__NR_ubstat, sys_ubstat) -+#define __NR_fairsched_mknod 504 /* FairScheduler syscalls */ -+__SYSCALL(__NR_fairsched_mknod, sys_fairsched_mknod) -+#define __NR_fairsched_rmnod 505 -+__SYSCALL(__NR_fairsched_rmnod, sys_fairsched_rmnod) -+#define __NR_fairsched_chwt 506 -+__SYSCALL(__NR_fairsched_chwt, sys_fairsched_chwt) -+#define __NR_fairsched_mvpr 507 -+__SYSCALL(__NR_fairsched_mvpr, sys_fairsched_mvpr) -+#define __NR_fairsched_rate 508 -+__SYSCALL(__NR_fairsched_rate, sys_fairsched_rate) -+#define __NR_lchmod 509 -+__SYSCALL(__NR_lchmod, 
sys_lchmod) -+#define __NR_lutime 510 -+__SYSCALL(__NR_lutime, sys_lutime) - --#define __NR_syscall_max __NR_kexec_load -+#define __NR_syscall_max __NR_lutime - #ifndef __NO_STUBS - - /* user-visible error numbers are in the range -1 - -4095 */ -diff -uprN linux-2.6.8.1.orig/include/asm-x86_64/vsyscall.h linux-2.6.8.1-ve022stab078/include/asm-x86_64/vsyscall.h ---- linux-2.6.8.1.orig/include/asm-x86_64/vsyscall.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/asm-x86_64/vsyscall.h 2006-05-11 13:05:37.000000000 +0400 -@@ -1,8 +1,6 @@ - #ifndef _ASM_X86_64_VSYSCALL_H_ - #define _ASM_X86_64_VSYSCALL_H_ - --#include <linux/seqlock.h> -- - enum vsyscall_num { - __NR_vgettimeofday, - __NR_vtime, -@@ -15,13 +13,15 @@ enum vsyscall_num { - - #ifdef __KERNEL__ - -+#include <linux/seqlock.h> -+ - #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16))) - #define __section_wall_jiffies __attribute__ ((unused, __section__ (".wall_jiffies"), aligned(16))) - #define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) - #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) - #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) - #define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) --#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(L1_CACHE_BYTES))) -+#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16))) - - #define VXTIME_TSC 1 - #define VXTIME_HPET 2 -diff -uprN linux-2.6.8.1.orig/include/linux/affs_fs.h linux-2.6.8.1-ve022stab078/include/linux/affs_fs.h ---- linux-2.6.8.1.orig/include/linux/affs_fs.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/affs_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -63,7 +63,7 @@ extern void affs_put_inode(struct ino - extern 
void affs_delete_inode(struct inode *inode); - extern void affs_clear_inode(struct inode *inode); - extern void affs_read_inode(struct inode *inode); --extern void affs_write_inode(struct inode *inode, int); -+extern int affs_write_inode(struct inode *inode, int); - extern int affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s32 type); - - /* super.c */ -diff -uprN linux-2.6.8.1.orig/include/linux/binfmts.h linux-2.6.8.1-ve022stab078/include/linux/binfmts.h ---- linux-2.6.8.1.orig/include/linux/binfmts.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/binfmts.h 2006-05-11 13:05:35.000000000 +0400 -@@ -2,6 +2,7 @@ - #define _LINUX_BINFMTS_H - - #include <linux/capability.h> -+#include <linux/fs.h> - - struct pt_regs; - -@@ -28,6 +29,7 @@ struct linux_binprm{ - int sh_bang; - struct file * file; - int e_uid, e_gid; -+ struct exec_perm perm; - kernel_cap_t cap_inheritable, cap_permitted, cap_effective; - void *security; - int argc, envc; -diff -uprN linux-2.6.8.1.orig/include/linux/bio.h linux-2.6.8.1-ve022stab078/include/linux/bio.h ---- linux-2.6.8.1.orig/include/linux/bio.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/bio.h 2006-05-11 13:05:31.000000000 +0400 -@@ -121,6 +121,7 @@ struct bio { - #define BIO_CLONED 4 /* doesn't own data */ - #define BIO_BOUNCED 5 /* bio is a bounce bio */ - #define BIO_USER_MAPPED 6 /* contains user pages */ -+#define BIO_EOPNOTSUPP 7 /* not supported */ - #define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag))) - - /* -@@ -160,6 +161,8 @@ struct bio { - #define bio_data(bio) (page_address(bio_page((bio))) + bio_offset((bio))) - #define bio_barrier(bio) ((bio)->bi_rw & (1 << BIO_RW_BARRIER)) - #define bio_sync(bio) ((bio)->bi_rw & (1 << BIO_RW_SYNC)) -+#define bio_failfast(bio) ((bio)->bi_rw & (1 << BIO_RW_FAILFAST)) -+#define bio_rw_ahead(bio) ((bio)->bi_rw & (1 << BIO_RW_AHEAD)) - - /* - * will die -diff -uprN 
linux-2.6.8.1.orig/include/linux/blkdev.h linux-2.6.8.1-ve022stab078/include/linux/blkdev.h ---- linux-2.6.8.1.orig/include/linux/blkdev.h 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/blkdev.h 2006-05-11 13:05:31.000000000 +0400 -@@ -195,6 +195,8 @@ enum rq_flag_bits { - __REQ_PM_SUSPEND, /* suspend request */ - __REQ_PM_RESUME, /* resume request */ - __REQ_PM_SHUTDOWN, /* shutdown request */ -+ __REQ_BAR_PREFLUSH, /* barrier pre-flush done */ -+ __REQ_BAR_POSTFLUSH, /* barrier post-flush */ - __REQ_NR_BITS, /* stops here */ - }; - -@@ -220,6 +222,8 @@ enum rq_flag_bits { - #define REQ_PM_SUSPEND (1 << __REQ_PM_SUSPEND) - #define REQ_PM_RESUME (1 << __REQ_PM_RESUME) - #define REQ_PM_SHUTDOWN (1 << __REQ_PM_SHUTDOWN) -+#define REQ_BAR_PREFLUSH (1 << __REQ_BAR_PREFLUSH) -+#define REQ_BAR_POSTFLUSH (1 << __REQ_BAR_POSTFLUSH) - - /* - * State information carried for REQ_PM_SUSPEND and REQ_PM_RESUME -@@ -248,6 +252,7 @@ typedef void (unplug_fn) (request_queue_ - struct bio_vec; - typedef int (merge_bvec_fn) (request_queue_t *, struct bio *, struct bio_vec *); - typedef void (activity_fn) (void *data, int rw); -+typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *); - - enum blk_queue_state { - Queue_down, -@@ -290,6 +295,7 @@ struct request_queue - unplug_fn *unplug_fn; - merge_bvec_fn *merge_bvec_fn; - activity_fn *activity_fn; -+ issue_flush_fn *issue_flush_fn; - - /* - * Auto-unplugging state -@@ -373,6 +379,7 @@ struct request_queue - #define QUEUE_FLAG_DEAD 5 /* queue being torn down */ - #define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */ - #define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */ -+#define QUEUE_FLAG_ORDERED 8 /* supports ordered writes */ - - #define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags) - #define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags) -@@ -390,6 +397,10 @@ struct request_queue - #define blk_pm_request(rq) \ - ((rq)->flags & 
(REQ_PM_SUSPEND | REQ_PM_RESUME)) - -+#define blk_barrier_rq(rq) ((rq)->flags & REQ_HARDBARRIER) -+#define blk_barrier_preflush(rq) ((rq)->flags & REQ_BAR_PREFLUSH) -+#define blk_barrier_postflush(rq) ((rq)->flags & REQ_BAR_POSTFLUSH) -+ - #define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist) - - #define rq_data_dir(rq) ((rq)->flags & 1) -@@ -560,6 +571,14 @@ extern void end_that_request_last(struct - extern int process_that_request_first(struct request *, unsigned int); - extern void end_request(struct request *req, int uptodate); - -+/* -+ * end_that_request_first/chunk() takes an uptodate argument. we account -+ * any value <= as an io error. 0 means -EIO for compatability reasons, -+ * any other < 0 value is the direct error type. An uptodate value of -+ * 1 indicates successful io completion -+ */ -+#define end_io_error(uptodate) (unlikely((uptodate) <= 0)) -+ - static inline void blkdev_dequeue_request(struct request *req) - { - BUG_ON(list_empty(&req->queuelist)); -@@ -588,6 +607,9 @@ extern void blk_queue_prep_rq(request_qu - extern void blk_queue_merge_bvec(request_queue_t *, merge_bvec_fn *); - extern void blk_queue_dma_alignment(request_queue_t *, int); - extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev); -+extern void blk_queue_ordered(request_queue_t *, int); -+extern void blk_queue_issue_flush_fn(request_queue_t *, issue_flush_fn *); -+extern int blkdev_scsi_issue_flush_fn(request_queue_t *, struct gendisk *, sector_t *); - - extern int blk_rq_map_sg(request_queue_t *, struct request *, struct scatterlist *); - extern void blk_dump_rq_flags(struct request *, char *); -@@ -616,6 +638,7 @@ extern long blk_congestion_wait(int rw, - - extern void blk_rq_bio_prep(request_queue_t *, struct request *, struct bio *); - extern void blk_rq_prep_restart(struct request *); -+extern int blkdev_issue_flush(struct block_device *, sector_t *); - - #define MAX_PHYS_SEGMENTS 128 - #define MAX_HW_SEGMENTS 128 -diff 
-uprN linux-2.6.8.1.orig/include/linux/buffer_head.h linux-2.6.8.1-ve022stab078/include/linux/buffer_head.h ---- linux-2.6.8.1.orig/include/linux/buffer_head.h 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/buffer_head.h 2006-05-11 13:05:31.000000000 +0400 -@@ -26,6 +26,7 @@ enum bh_state_bits { - BH_Delay, /* Buffer is not yet allocated on disk */ - BH_Boundary, /* Block is followed by a discontiguity */ - BH_Write_EIO, /* I/O error on write */ -+ BH_Ordered, /* ordered write */ - - BH_PrivateStart,/* not a state bit, but the first bit available - * for private allocation by other entities -@@ -110,7 +111,8 @@ BUFFER_FNS(Async_Read, async_read) - BUFFER_FNS(Async_Write, async_write) - BUFFER_FNS(Delay, delay) - BUFFER_FNS(Boundary, boundary) --BUFFER_FNS(Write_EIO,write_io_error) -+BUFFER_FNS(Write_EIO, write_io_error) -+BUFFER_FNS(Ordered, ordered) - - #define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK) - #define touch_buffer(bh) mark_page_accessed(bh->b_page) -@@ -173,7 +175,7 @@ void FASTCALL(unlock_buffer(struct buffe - void FASTCALL(__lock_buffer(struct buffer_head *bh)); - void ll_rw_block(int, int, struct buffer_head * bh[]); - void sync_dirty_buffer(struct buffer_head *bh); --void submit_bh(int, struct buffer_head *); -+int submit_bh(int, struct buffer_head *); - void write_boundary_block(struct block_device *bdev, - sector_t bblock, unsigned blocksize); - -diff -uprN linux-2.6.8.1.orig/include/linux/byteorder/big_endian.h linux-2.6.8.1-ve022stab078/include/linux/byteorder/big_endian.h ---- linux-2.6.8.1.orig/include/linux/byteorder/big_endian.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/byteorder/big_endian.h 2006-05-11 13:05:31.000000000 +0400 -@@ -8,48 +8,86 @@ - #define __BIG_ENDIAN_BITFIELD - #endif - -+#include <linux/types.h> - #include <linux/byteorder/swab.h> - - #define __constant_htonl(x) ((__u32)(x)) - #define __constant_ntohl(x) ((__u32)(x)) - #define 
__constant_htons(x) ((__u16)(x)) - #define __constant_ntohs(x) ((__u16)(x)) --#define __constant_cpu_to_le64(x) ___constant_swab64((x)) --#define __constant_le64_to_cpu(x) ___constant_swab64((x)) --#define __constant_cpu_to_le32(x) ___constant_swab32((x)) --#define __constant_le32_to_cpu(x) ___constant_swab32((x)) --#define __constant_cpu_to_le16(x) ___constant_swab16((x)) --#define __constant_le16_to_cpu(x) ___constant_swab16((x)) --#define __constant_cpu_to_be64(x) ((__u64)(x)) --#define __constant_be64_to_cpu(x) ((__u64)(x)) --#define __constant_cpu_to_be32(x) ((__u32)(x)) --#define __constant_be32_to_cpu(x) ((__u32)(x)) --#define __constant_cpu_to_be16(x) ((__u16)(x)) --#define __constant_be16_to_cpu(x) ((__u16)(x)) --#define __cpu_to_le64(x) __swab64((x)) --#define __le64_to_cpu(x) __swab64((x)) --#define __cpu_to_le32(x) __swab32((x)) --#define __le32_to_cpu(x) __swab32((x)) --#define __cpu_to_le16(x) __swab16((x)) --#define __le16_to_cpu(x) __swab16((x)) --#define __cpu_to_be64(x) ((__u64)(x)) --#define __be64_to_cpu(x) ((__u64)(x)) --#define __cpu_to_be32(x) ((__u32)(x)) --#define __be32_to_cpu(x) ((__u32)(x)) --#define __cpu_to_be16(x) ((__u16)(x)) --#define __be16_to_cpu(x) ((__u16)(x)) --#define __cpu_to_le64p(x) __swab64p((x)) --#define __le64_to_cpup(x) __swab64p((x)) --#define __cpu_to_le32p(x) __swab32p((x)) --#define __le32_to_cpup(x) __swab32p((x)) --#define __cpu_to_le16p(x) __swab16p((x)) --#define __le16_to_cpup(x) __swab16p((x)) --#define __cpu_to_be64p(x) (*(__u64*)(x)) --#define __be64_to_cpup(x) (*(__u64*)(x)) --#define __cpu_to_be32p(x) (*(__u32*)(x)) --#define __be32_to_cpup(x) (*(__u32*)(x)) --#define __cpu_to_be16p(x) (*(__u16*)(x)) --#define __be16_to_cpup(x) (*(__u16*)(x)) -+#define __constant_cpu_to_le64(x) ((__force __le64)___constant_swab64((x))) -+#define __constant_le64_to_cpu(x) ___constant_swab64((__force __u64)(__le64)(x)) -+#define __constant_cpu_to_le32(x) ((__force __le32)___constant_swab32((x))) -+#define 
__constant_le32_to_cpu(x) ___constant_swab32((__force __u32)(__le32)(x)) -+#define __constant_cpu_to_le16(x) ((__force __le16)___constant_swab16((x))) -+#define __constant_le16_to_cpu(x) ___constant_swab16((__force __u16)(__le16)(x)) -+#define __constant_cpu_to_be64(x) ((__force __be64)(__u64)(x)) -+#define __constant_be64_to_cpu(x) ((__force __u64)(__be64)(x)) -+#define __constant_cpu_to_be32(x) ((__force __be32)(__u32)(x)) -+#define __constant_be32_to_cpu(x) ((__force __u32)(__be32)(x)) -+#define __constant_cpu_to_be16(x) ((__force __be16)(__u16)(x)) -+#define __constant_be16_to_cpu(x) ((__force __u16)(__be16)(x)) -+#define __cpu_to_le64(x) ((__force __le64)___swab64((x))) -+#define __le64_to_cpu(x) ___swab64((__force __u64)(__le64)(x)) -+#define __cpu_to_le32(x) ((__force __le32)___swab32((x))) -+#define __le32_to_cpu(x) ___swab32((__force __u32)(__le32)(x)) -+#define __cpu_to_le16(x) ((__force __le16)___swab16((x))) -+#define __le16_to_cpu(x) ___swab16((__force __u16)(__le16)(x)) -+#define __cpu_to_be64(x) ((__force __be64)(__u64)(x)) -+#define __be64_to_cpu(x) ((__force __u64)(__be64)(x)) -+#define __cpu_to_be32(x) ((__force __be32)(__u32)(x)) -+#define __be32_to_cpu(x) ((__force __u32)(__be32)(x)) -+#define __cpu_to_be16(x) ((__force __be16)(__u16)(x)) -+#define __be16_to_cpu(x) ((__force __u16)(__be16)(x)) -+ -+static inline __le64 __cpu_to_le64p(const __u64 *p) -+{ -+ return (__force __le64)__swab64p(p); -+} -+static inline __u64 __le64_to_cpup(const __le64 *p) -+{ -+ return __swab64p((__u64 *)p); -+} -+static inline __le32 __cpu_to_le32p(const __u32 *p) -+{ -+ return (__force __le32)__swab32p(p); -+} -+static inline __u32 __le32_to_cpup(const __le32 *p) -+{ -+ return __swab32p((__u32 *)p); -+} -+static inline __le16 __cpu_to_le16p(const __u16 *p) -+{ -+ return (__force __le16)__swab16p(p); -+} -+static inline __u16 __le16_to_cpup(const __le16 *p) -+{ -+ return __swab16p((__u16 *)p); -+} -+static inline __be64 __cpu_to_be64p(const __u64 *p) -+{ -+ return 
(__force __be64)*p; -+} -+static inline __u64 __be64_to_cpup(const __be64 *p) -+{ -+ return (__force __u64)*p; -+} -+static inline __be32 __cpu_to_be32p(const __u32 *p) -+{ -+ return (__force __be32)*p; -+} -+static inline __u32 __be32_to_cpup(const __be32 *p) -+{ -+ return (__force __u32)*p; -+} -+static inline __be16 __cpu_to_be16p(const __u16 *p) -+{ -+ return (__force __be16)*p; -+} -+static inline __u16 __be16_to_cpup(const __be16 *p) -+{ -+ return (__force __u16)*p; -+} - #define __cpu_to_le64s(x) __swab64s((x)) - #define __le64_to_cpus(x) __swab64s((x)) - #define __cpu_to_le32s(x) __swab32s((x)) -diff -uprN linux-2.6.8.1.orig/include/linux/byteorder/little_endian.h linux-2.6.8.1-ve022stab078/include/linux/byteorder/little_endian.h ---- linux-2.6.8.1.orig/include/linux/byteorder/little_endian.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/byteorder/little_endian.h 2006-05-11 13:05:31.000000000 +0400 -@@ -8,48 +8,86 @@ - #define __LITTLE_ENDIAN_BITFIELD - #endif - -+#include <linux/types.h> - #include <linux/byteorder/swab.h> - - #define __constant_htonl(x) ___constant_swab32((x)) - #define __constant_ntohl(x) ___constant_swab32((x)) - #define __constant_htons(x) ___constant_swab16((x)) - #define __constant_ntohs(x) ___constant_swab16((x)) --#define __constant_cpu_to_le64(x) ((__u64)(x)) --#define __constant_le64_to_cpu(x) ((__u64)(x)) --#define __constant_cpu_to_le32(x) ((__u32)(x)) --#define __constant_le32_to_cpu(x) ((__u32)(x)) --#define __constant_cpu_to_le16(x) ((__u16)(x)) --#define __constant_le16_to_cpu(x) ((__u16)(x)) --#define __constant_cpu_to_be64(x) ___constant_swab64((x)) --#define __constant_be64_to_cpu(x) ___constant_swab64((x)) --#define __constant_cpu_to_be32(x) ___constant_swab32((x)) --#define __constant_be32_to_cpu(x) ___constant_swab32((x)) --#define __constant_cpu_to_be16(x) ___constant_swab16((x)) --#define __constant_be16_to_cpu(x) ___constant_swab16((x)) --#define __cpu_to_le64(x) ((__u64)(x)) 
--#define __le64_to_cpu(x) ((__u64)(x)) --#define __cpu_to_le32(x) ((__u32)(x)) --#define __le32_to_cpu(x) ((__u32)(x)) --#define __cpu_to_le16(x) ((__u16)(x)) --#define __le16_to_cpu(x) ((__u16)(x)) --#define __cpu_to_be64(x) __swab64((x)) --#define __be64_to_cpu(x) __swab64((x)) --#define __cpu_to_be32(x) __swab32((x)) --#define __be32_to_cpu(x) __swab32((x)) --#define __cpu_to_be16(x) __swab16((x)) --#define __be16_to_cpu(x) __swab16((x)) --#define __cpu_to_le64p(x) (*(__u64*)(x)) --#define __le64_to_cpup(x) (*(__u64*)(x)) --#define __cpu_to_le32p(x) (*(__u32*)(x)) --#define __le32_to_cpup(x) (*(__u32*)(x)) --#define __cpu_to_le16p(x) (*(__u16*)(x)) --#define __le16_to_cpup(x) (*(__u16*)(x)) --#define __cpu_to_be64p(x) __swab64p((x)) --#define __be64_to_cpup(x) __swab64p((x)) --#define __cpu_to_be32p(x) __swab32p((x)) --#define __be32_to_cpup(x) __swab32p((x)) --#define __cpu_to_be16p(x) __swab16p((x)) --#define __be16_to_cpup(x) __swab16p((x)) -+#define __constant_cpu_to_le64(x) ((__force __le64)(__u64)(x)) -+#define __constant_le64_to_cpu(x) ((__force __u64)(__le64)(x)) -+#define __constant_cpu_to_le32(x) ((__force __le32)(__u32)(x)) -+#define __constant_le32_to_cpu(x) ((__force __u32)(__le32)(x)) -+#define __constant_cpu_to_le16(x) ((__force __le16)(__u16)(x)) -+#define __constant_le16_to_cpu(x) ((__force __u16)(__le16)(x)) -+#define __constant_cpu_to_be64(x) ((__force __be64)___constant_swab64((x))) -+#define __constant_be64_to_cpu(x) ___constant_swab64((__force __u64)(__be64)(x)) -+#define __constant_cpu_to_be32(x) ((__force __be32)___constant_swab32((x))) -+#define __constant_be32_to_cpu(x) ___constant_swab32((__force __u32)(__be32)(x)) -+#define __constant_cpu_to_be16(x) ((__force __be16)___constant_swab16((x))) -+#define __constant_be16_to_cpu(x) ___constant_swab16((__force __u16)(__be16)(x)) -+#define __cpu_to_le64(x) ((__force __le64)(__u64)(x)) -+#define __le64_to_cpu(x) ((__force __u64)(__le64)(x)) -+#define __cpu_to_le32(x) ((__force 
__le32)(__u32)(x)) -+#define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) -+#define __cpu_to_le16(x) ((__force __le16)(__u16)(x)) -+#define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) -+#define __cpu_to_be64(x) ((__force __be64)___swab64((x))) -+#define __be64_to_cpu(x) ___swab64((__force __u64)(__be64)(x)) -+#define __cpu_to_be32(x) ((__force __be32)___swab32((x))) -+#define __be32_to_cpu(x) ___swab32((__force __u32)(__be32)(x)) -+#define __cpu_to_be16(x) ((__force __be16)___swab16((x))) -+#define __be16_to_cpu(x) ___swab16((__force __u16)(__be16)(x)) -+ -+static inline __le64 __cpu_to_le64p(const __u64 *p) -+{ -+ return (__force __le64)*p; -+} -+static inline __u64 __le64_to_cpup(const __le64 *p) -+{ -+ return (__force __u64)*p; -+} -+static inline __le32 __cpu_to_le32p(const __u32 *p) -+{ -+ return (__force __le32)*p; -+} -+static inline __u32 __le32_to_cpup(const __le32 *p) -+{ -+ return (__force __u32)*p; -+} -+static inline __le16 __cpu_to_le16p(const __u16 *p) -+{ -+ return (__force __le16)*p; -+} -+static inline __u16 __le16_to_cpup(const __le16 *p) -+{ -+ return (__force __u16)*p; -+} -+static inline __be64 __cpu_to_be64p(const __u64 *p) -+{ -+ return (__force __be64)__swab64p(p); -+} -+static inline __u64 __be64_to_cpup(const __be64 *p) -+{ -+ return __swab64p((__u64 *)p); -+} -+static inline __be32 __cpu_to_be32p(const __u32 *p) -+{ -+ return (__force __be32)__swab32p(p); -+} -+static inline __u32 __be32_to_cpup(const __be32 *p) -+{ -+ return __swab32p((__u32 *)p); -+} -+static inline __be16 __cpu_to_be16p(const __u16 *p) -+{ -+ return (__force __be16)__swab16p(p); -+} -+static inline __u16 __be16_to_cpup(const __be16 *p) -+{ -+ return __swab16p((__u16 *)p); -+} - #define __cpu_to_le64s(x) do {} while (0) - #define __le64_to_cpus(x) do {} while (0) - #define __cpu_to_le32s(x) do {} while (0) -diff -uprN linux-2.6.8.1.orig/include/linux/capability.h linux-2.6.8.1-ve022stab078/include/linux/capability.h ---- linux-2.6.8.1.orig/include/linux/capability.h 
2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/capability.h 2006-05-11 13:05:40.000000000 +0400 -@@ -147,12 +147,9 @@ typedef __u32 kernel_cap_t; - - #define CAP_NET_BROADCAST 11 - --/* Allow interface configuration */ - /* Allow administration of IP firewall, masquerading and accounting */ - /* Allow setting debug option on sockets */ - /* Allow modification of routing tables */ --/* Allow setting arbitrary process / process group ownership on -- sockets */ - /* Allow binding to any address for transparent proxying */ - /* Allow setting TOS (type of service) */ - /* Allow setting promiscuous mode */ -@@ -183,6 +180,7 @@ typedef __u32 kernel_cap_t; - #define CAP_SYS_MODULE 16 - - /* Allow ioperm/iopl access */ -+/* Allow O_DIRECT access */ - /* Allow sending USB messages to any device via /proc/bus/usb */ - - #define CAP_SYS_RAWIO 17 -@@ -201,24 +199,19 @@ typedef __u32 kernel_cap_t; - - /* Allow configuration of the secure attention key */ - /* Allow administration of the random device */ --/* Allow examination and configuration of disk quotas */ - /* Allow configuring the kernel's syslog (printk behaviour) */ - /* Allow setting the domainname */ - /* Allow setting the hostname */ - /* Allow calling bdflush() */ --/* Allow mount() and umount(), setting up new smb connection */ -+/* Allow setting up new smb connection */ - /* Allow some autofs root ioctls */ - /* Allow nfsservctl */ - /* Allow VM86_REQUEST_IRQ */ - /* Allow to read/write pci config on alpha */ - /* Allow irix_prctl on mips (setstacksize) */ - /* Allow flushing all cache on m68k (sys_cacheflush) */ --/* Allow removing semaphores */ --/* Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores -- and shared memory */ - /* Allow locking/unlocking of shared memory segment */ - /* Allow turning swap on/off */ --/* Allow forged pids on socket credentials passing */ - /* Allow setting readahead and flushing buffers on block devices */ - /* Allow setting 
geometry in floppy driver */ - /* Allow turning DMA on/off in xd driver */ -@@ -235,6 +228,8 @@ typedef __u32 kernel_cap_t; - /* Allow enabling/disabling tagged queuing on SCSI controllers and sending - arbitrary SCSI commands */ - /* Allow setting encryption key on loopback filesystem */ -+/* Modify data journaling mode on ext3 filesystem (uses journaling -+ resources) */ - - #define CAP_SYS_ADMIN 21 - -@@ -254,8 +249,6 @@ typedef __u32 kernel_cap_t; - /* Override resource limits. Set resource limits. */ - /* Override quota limits. */ - /* Override reserved space on ext2 filesystem */ --/* Modify data journaling mode on ext3 filesystem (uses journaling -- resources) */ - /* NOTE: ext2 honors fsuid when checking for resource overrides, so - you can override using fsuid too */ - /* Override size restrictions on IPC message queues */ -@@ -284,6 +277,36 @@ typedef __u32 kernel_cap_t; - - #define CAP_LEASE 28 - -+/* Allow access to all information. In the other case some structures will be -+ hiding to ensure different Virtual Environment non-interaction on the same -+ node */ -+#define CAP_SETVEID 29 -+ -+#define CAP_VE_ADMIN 30 -+ -+/* Replacement for CAP_NET_ADMIN: -+ delegated rights to the Virtual environment of its network administration. -+ For now the following rights have been delegated: -+ -+ Allow setting arbitrary process / process group ownership on sockets -+ Allow interface configuration -+*/ -+#define CAP_VE_NET_ADMIN CAP_VE_ADMIN -+ -+/* Replacement for CAP_SYS_ADMIN: -+ delegated rights to the Virtual environment of its administration. 
-+ For now the following rights have been delegated: -+*/ -+/* Allow mount/umount/remount */ -+/* Allow examination and configuration of disk quotas */ -+/* Allow removing semaphores */ -+/* Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores -+ and shared memory */ -+/* Allow locking/unlocking of shared memory segment */ -+/* Allow forged pids on socket credentials passing */ -+ -+#define CAP_VE_SYS_ADMIN CAP_VE_ADMIN -+ - #ifdef __KERNEL__ - /* - * Bounding set -@@ -348,9 +371,16 @@ static inline kernel_cap_t cap_invert(ke - #define cap_issubset(a,set) (!(cap_t(a) & ~cap_t(set))) - - #define cap_clear(c) do { cap_t(c) = 0; } while(0) -+ -+#ifndef CONFIG_VE - #define cap_set_full(c) do { cap_t(c) = ~0; } while(0) --#define cap_mask(c,mask) do { cap_t(c) &= cap_t(mask); } while(0) -+#else -+#define cap_set_full(c) \ -+ do {cap_t(c) = ve_is_super(get_exec_env()) ? ~0 : \ -+ get_exec_env()->cap_default; } while(0) -+#endif - -+#define cap_mask(c,mask) do { cap_t(c) &= cap_t(mask); } while(0) - #define cap_is_fs_cap(c) (CAP_TO_MASK(c) & CAP_FS_MASK) - - #endif /* __KERNEL__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/coda_linux.h linux-2.6.8.1-ve022stab078/include/linux/coda_linux.h ---- linux-2.6.8.1.orig/include/linux/coda_linux.h 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/coda_linux.h 2006-05-11 13:05:35.000000000 +0400 -@@ -38,7 +38,8 @@ extern struct file_operations coda_ioctl - int coda_open(struct inode *i, struct file *f); - int coda_flush(struct file *f); - int coda_release(struct inode *i, struct file *f); --int coda_permission(struct inode *inode, int mask, struct nameidata *nd); -+int coda_permission(struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm); - int coda_revalidate_inode(struct dentry *); - int coda_getattr(struct vfsmount *, struct dentry *, struct kstat *); - int coda_setattr(struct dentry *, struct iattr *); -diff -uprN 
linux-2.6.8.1.orig/include/linux/compat.h linux-2.6.8.1-ve022stab078/include/linux/compat.h ---- linux-2.6.8.1.orig/include/linux/compat.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/compat.h 2006-05-11 13:05:27.000000000 +0400 -@@ -130,5 +130,8 @@ asmlinkage long compat_sys_select(int n, - compat_ulong_t __user *outp, compat_ulong_t __user *exp, - struct compat_timeval __user *tvp); - -+struct compat_siginfo; -+int copy_siginfo_from_user32(siginfo_t *to, struct compat_siginfo __user *from); -+int copy_siginfo_to_user32(struct compat_siginfo __user *to, siginfo_t *from); - #endif /* CONFIG_COMPAT */ - #endif /* _LINUX_COMPAT_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/compat_ioctl.h linux-2.6.8.1-ve022stab078/include/linux/compat_ioctl.h ---- linux-2.6.8.1.orig/include/linux/compat_ioctl.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/compat_ioctl.h 2006-05-11 13:05:29.000000000 +0400 -@@ -16,6 +16,7 @@ COMPATIBLE_IOCTL(TCSETA) - COMPATIBLE_IOCTL(TCSETAW) - COMPATIBLE_IOCTL(TCSETAF) - COMPATIBLE_IOCTL(TCSBRK) -+ULONG_IOCTL(TCSBRKP) - COMPATIBLE_IOCTL(TCXONC) - COMPATIBLE_IOCTL(TCFLSH) - COMPATIBLE_IOCTL(TCGETS) -@@ -23,6 +24,8 @@ COMPATIBLE_IOCTL(TCSETS) - COMPATIBLE_IOCTL(TCSETSW) - COMPATIBLE_IOCTL(TCSETSF) - COMPATIBLE_IOCTL(TIOCLINUX) -+COMPATIBLE_IOCTL(TIOCSBRK) -+COMPATIBLE_IOCTL(TIOCCBRK) - /* Little t */ - COMPATIBLE_IOCTL(TIOCGETD) - COMPATIBLE_IOCTL(TIOCSETD) -diff -uprN linux-2.6.8.1.orig/include/linux/dcache.h linux-2.6.8.1-ve022stab078/include/linux/dcache.h ---- linux-2.6.8.1.orig/include/linux/dcache.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/dcache.h 2006-05-11 13:05:40.000000000 +0400 -@@ -80,6 +80,8 @@ struct dcookie_struct; - - #define DNAME_INLINE_LEN_MIN 36 - -+#include <ub/ub_dcache.h> -+ - struct dentry { - atomic_t d_count; - unsigned int d_flags; /* protected by d_lock */ -@@ -106,9 +108,15 @@ struct dentry { - struct 
rcu_head d_rcu; - struct dcookie_struct *d_cookie; /* cookie, if any */ - struct hlist_node d_hash; /* lookup hash list */ -+ /* It can't be at the end because of DNAME_INLINE_LEN */ -+ struct dentry_beancounter dentry_bc; - unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */ - }; - -+#define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname)) -+ -+#define dentry_bc(__d) (&(__d)->dentry_bc) -+ - struct dentry_operations { - int (*d_revalidate)(struct dentry *, struct nameidata *); - int (*d_hash) (struct dentry *, struct qstr *); -@@ -156,6 +164,9 @@ d_iput: no no no yes - - #define DCACHE_REFERENCED 0x0008 /* Recently used, don't discard. */ - #define DCACHE_UNHASHED 0x0010 -+#define DCACHE_VIRTUAL 0x0100 /* ve accessible */ -+ -+extern void mark_tree_virtual(struct vfsmount *m, struct dentry *d); - - extern spinlock_t dcache_lock; - -@@ -163,17 +174,16 @@ extern spinlock_t dcache_lock; - * d_drop - drop a dentry - * @dentry: dentry to drop - * -- * d_drop() unhashes the entry from the parent -- * dentry hashes, so that it won't be found through -- * a VFS lookup any more. Note that this is different -- * from deleting the dentry - d_delete will try to -- * mark the dentry negative if possible, giving a -- * successful _negative_ lookup, while d_drop will -+ * d_drop() unhashes the entry from the parent dentry hashes, so that it won't -+ * be found through a VFS lookup any more. Note that this is different from -+ * deleting the dentry - d_delete will try to mark the dentry negative if -+ * possible, giving a successful _negative_ lookup, while d_drop will - * just make the cache lookup fail. - * -- * d_drop() is used mainly for stuff that wants -- * to invalidate a dentry for some reason (NFS -- * timeouts or autofs deletes). -+ * d_drop() is used mainly for stuff that wants to invalidate a dentry for some -+ * reason (NFS timeouts or autofs deletes). -+ * -+ * __d_drop requires dentry->d_lock. 
- */ - - static inline void __d_drop(struct dentry *dentry) -@@ -187,7 +197,9 @@ static inline void __d_drop(struct dentr - static inline void d_drop(struct dentry *dentry) - { - spin_lock(&dcache_lock); -+ spin_lock(&dentry->d_lock); - __d_drop(dentry); -+ spin_unlock(&dentry->d_lock); - spin_unlock(&dcache_lock); - } - -@@ -208,7 +220,8 @@ extern struct dentry * d_alloc_anon(stru - extern struct dentry * d_splice_alias(struct inode *, struct dentry *); - extern void shrink_dcache_sb(struct super_block *); - extern void shrink_dcache_parent(struct dentry *); --extern void shrink_dcache_anon(struct hlist_head *); -+extern void shrink_dcache_anon(struct super_block *); -+extern void dcache_shrinker_wait_sb(struct super_block *sb); - extern int d_invalidate(struct dentry *); - - /* only used at mount-time */ -@@ -253,6 +266,7 @@ extern struct dentry * __d_lookup(struct - /* validate "insecure" dentry pointer */ - extern int d_validate(struct dentry *, struct dentry *); - -+extern int d_root_check(struct dentry *, struct vfsmount *); - extern char * d_path(struct dentry *, struct vfsmount *, char *, int); - - /* Allocation counts.. 
*/ -@@ -273,6 +287,10 @@ extern char * d_path(struct dentry *, st - static inline struct dentry *dget(struct dentry *dentry) - { - if (dentry) { -+#ifdef CONFIG_USER_RESOURCE -+ if (atomic_inc_and_test(&dentry_bc(dentry)->d_inuse)) -+ BUG(); -+#endif - BUG_ON(!atomic_read(&dentry->d_count)); - atomic_inc(&dentry->d_count); - } -@@ -315,6 +333,8 @@ extern struct dentry *lookup_create(stru - - extern int sysctl_vfs_cache_pressure; - -+extern int check_area_access_ve(struct dentry *, struct vfsmount *); -+extern int check_area_execute_ve(struct dentry *, struct vfsmount *); - #endif /* __KERNEL__ */ - - #endif /* __LINUX_DCACHE_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/devpts_fs.h linux-2.6.8.1-ve022stab078/include/linux/devpts_fs.h ---- linux-2.6.8.1.orig/include/linux/devpts_fs.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/devpts_fs.h 2006-05-11 13:05:40.000000000 +0400 -@@ -21,6 +21,13 @@ int devpts_pty_new(struct tty_struct *tt - struct tty_struct *devpts_get_tty(int number); /* get tty structure */ - void devpts_pty_kill(int number); /* unlink */ - -+struct devpts_config { -+ int setuid; -+ int setgid; -+ uid_t uid; -+ gid_t gid; -+ umode_t mode; -+}; - #else - - /* Dummy stubs in the no-pty case */ -diff -uprN linux-2.6.8.1.orig/include/linux/elfcore.h linux-2.6.8.1-ve022stab078/include/linux/elfcore.h ---- linux-2.6.8.1.orig/include/linux/elfcore.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/elfcore.h 2006-05-11 13:05:45.000000000 +0400 -@@ -6,6 +6,8 @@ - #include <linux/time.h> - #include <linux/user.h> - -+extern int sysctl_at_vsyscall; -+ - struct elf_siginfo - { - int si_signo; /* signal number */ -diff -uprN linux-2.6.8.1.orig/include/linux/eventpoll.h linux-2.6.8.1-ve022stab078/include/linux/eventpoll.h ---- linux-2.6.8.1.orig/include/linux/eventpoll.h 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/eventpoll.h 2006-05-11 
13:05:48.000000000 +0400 -@@ -85,6 +85,87 @@ static inline void eventpoll_release(str - eventpoll_release_file(file); - } - -+struct epoll_filefd { -+ struct file *file; -+ int fd; -+}; -+ -+/* -+ * This structure is stored inside the "private_data" member of the file -+ * structure and rapresent the main data sructure for the eventpoll -+ * interface. -+ */ -+struct eventpoll { -+ /* Protect the this structure access */ -+ rwlock_t lock; -+ -+ /* -+ * This semaphore is used to ensure that files are not removed -+ * while epoll is using them. This is read-held during the event -+ * collection loop and it is write-held during the file cleanup -+ * path, the epoll file exit code and the ctl operations. -+ */ -+ struct rw_semaphore sem; -+ -+ /* Wait queue used by sys_epoll_wait() */ -+ wait_queue_head_t wq; -+ -+ /* Wait queue used by file->poll() */ -+ wait_queue_head_t poll_wait; -+ -+ /* List of ready file descriptors */ -+ struct list_head rdllist; -+ -+ /* RB-Tree root used to store monitored fd structs */ -+ struct rb_root rbr; -+}; -+ -+/* -+ * Each file descriptor added to the eventpoll interface will -+ * have an entry of this type linked to the hash. -+ */ -+struct epitem { -+ /* RB-Tree node used to link this structure to the eventpoll rb-tree */ -+ struct rb_node rbn; -+ -+ /* List header used to link this structure to the eventpoll ready list */ -+ struct list_head rdllink; -+ -+ /* The file descriptor information this item refers to */ -+ struct epoll_filefd ffd; -+ -+ /* Number of active wait queue attached to poll operations */ -+ int nwait; -+ -+ /* List containing poll wait queues */ -+ struct list_head pwqlist; -+ -+ /* The "container" of this item */ -+ struct eventpoll *ep; -+ -+ /* The structure that describe the interested events and the source fd */ -+ struct epoll_event event; -+ -+ /* -+ * Used to keep track of the usage count of the structure. This avoids -+ * that the structure will desappear from underneath our processing. 
-+ */ -+ atomic_t usecnt; -+ -+ /* List header used to link this item to the "struct file" items list */ -+ struct list_head fllink; -+ -+ /* List header used to link the item to the transfer list */ -+ struct list_head txlink; -+ -+ /* -+ * This is used during the collection/transfer of events to userspace -+ * to pin items empty events set. -+ */ -+ unsigned int revents; -+}; -+ -+extern struct semaphore epsem; - - #else - -diff -uprN linux-2.6.8.1.orig/include/linux/ext2_fs.h linux-2.6.8.1-ve022stab078/include/linux/ext2_fs.h ---- linux-2.6.8.1.orig/include/linux/ext2_fs.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ext2_fs.h 2006-05-11 13:05:31.000000000 +0400 -@@ -135,14 +135,14 @@ static inline struct ext2_sb_info *EXT2_ - */ - struct ext2_group_desc - { -- __u32 bg_block_bitmap; /* Blocks bitmap block */ -- __u32 bg_inode_bitmap; /* Inodes bitmap block */ -- __u32 bg_inode_table; /* Inodes table block */ -- __u16 bg_free_blocks_count; /* Free blocks count */ -- __u16 bg_free_inodes_count; /* Free inodes count */ -- __u16 bg_used_dirs_count; /* Directories count */ -- __u16 bg_pad; -- __u32 bg_reserved[3]; -+ __le32 bg_block_bitmap; /* Blocks bitmap block */ -+ __le32 bg_inode_bitmap; /* Inodes bitmap block */ -+ __le32 bg_inode_table; /* Inodes table block */ -+ __le16 bg_free_blocks_count; /* Free blocks count */ -+ __le16 bg_free_inodes_count; /* Free inodes count */ -+ __le16 bg_used_dirs_count; /* Directories count */ -+ __le16 bg_pad; -+ __le32 bg_reserved[3]; - }; - - /* -@@ -209,49 +209,49 @@ struct ext2_group_desc - * Structure of an inode on the disk - */ - struct ext2_inode { -- __u16 i_mode; /* File mode */ -- __u16 i_uid; /* Low 16 bits of Owner Uid */ -- __u32 i_size; /* Size in bytes */ -- __u32 i_atime; /* Access time */ -- __u32 i_ctime; /* Creation time */ -- __u32 i_mtime; /* Modification time */ -- __u32 i_dtime; /* Deletion Time */ -- __u16 i_gid; /* Low 16 bits of Group Id */ -- __u16 i_links_count; 
/* Links count */ -- __u32 i_blocks; /* Blocks count */ -- __u32 i_flags; /* File flags */ -+ __le16 i_mode; /* File mode */ -+ __le16 i_uid; /* Low 16 bits of Owner Uid */ -+ __le32 i_size; /* Size in bytes */ -+ __le32 i_atime; /* Access time */ -+ __le32 i_ctime; /* Creation time */ -+ __le32 i_mtime; /* Modification time */ -+ __le32 i_dtime; /* Deletion Time */ -+ __le16 i_gid; /* Low 16 bits of Group Id */ -+ __le16 i_links_count; /* Links count */ -+ __le32 i_blocks; /* Blocks count */ -+ __le32 i_flags; /* File flags */ - union { - struct { -- __u32 l_i_reserved1; -+ __le32 l_i_reserved1; - } linux1; - struct { -- __u32 h_i_translator; -+ __le32 h_i_translator; - } hurd1; - struct { -- __u32 m_i_reserved1; -+ __le32 m_i_reserved1; - } masix1; - } osd1; /* OS dependent 1 */ -- __u32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */ -- __u32 i_generation; /* File version (for NFS) */ -- __u32 i_file_acl; /* File ACL */ -- __u32 i_dir_acl; /* Directory ACL */ -- __u32 i_faddr; /* Fragment address */ -+ __le32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */ -+ __le32 i_generation; /* File version (for NFS) */ -+ __le32 i_file_acl; /* File ACL */ -+ __le32 i_dir_acl; /* Directory ACL */ -+ __le32 i_faddr; /* Fragment address */ - union { - struct { - __u8 l_i_frag; /* Fragment number */ - __u8 l_i_fsize; /* Fragment size */ - __u16 i_pad1; -- __u16 l_i_uid_high; /* these 2 fields */ -- __u16 l_i_gid_high; /* were reserved2[0] */ -+ __le16 l_i_uid_high; /* these 2 fields */ -+ __le16 l_i_gid_high; /* were reserved2[0] */ - __u32 l_i_reserved2; - } linux2; - struct { - __u8 h_i_frag; /* Fragment number */ - __u8 h_i_fsize; /* Fragment size */ -- __u16 h_i_mode_high; -- __u16 h_i_uid_high; -- __u16 h_i_gid_high; -- __u32 h_i_author; -+ __le16 h_i_mode_high; -+ __le16 h_i_uid_high; -+ __le16 h_i_gid_high; -+ __le32 h_i_author; - } hurd2; - struct { - __u8 m_i_frag; /* Fragment number */ -@@ -335,31 +335,31 @@ struct ext2_inode { - * Structure of the super block - */ - 
struct ext2_super_block { -- __u32 s_inodes_count; /* Inodes count */ -- __u32 s_blocks_count; /* Blocks count */ -- __u32 s_r_blocks_count; /* Reserved blocks count */ -- __u32 s_free_blocks_count; /* Free blocks count */ -- __u32 s_free_inodes_count; /* Free inodes count */ -- __u32 s_first_data_block; /* First Data Block */ -- __u32 s_log_block_size; /* Block size */ -- __s32 s_log_frag_size; /* Fragment size */ -- __u32 s_blocks_per_group; /* # Blocks per group */ -- __u32 s_frags_per_group; /* # Fragments per group */ -- __u32 s_inodes_per_group; /* # Inodes per group */ -- __u32 s_mtime; /* Mount time */ -- __u32 s_wtime; /* Write time */ -- __u16 s_mnt_count; /* Mount count */ -- __s16 s_max_mnt_count; /* Maximal mount count */ -- __u16 s_magic; /* Magic signature */ -- __u16 s_state; /* File system state */ -- __u16 s_errors; /* Behaviour when detecting errors */ -- __u16 s_minor_rev_level; /* minor revision level */ -- __u32 s_lastcheck; /* time of last check */ -- __u32 s_checkinterval; /* max. 
time between checks */ -- __u32 s_creator_os; /* OS */ -- __u32 s_rev_level; /* Revision level */ -- __u16 s_def_resuid; /* Default uid for reserved blocks */ -- __u16 s_def_resgid; /* Default gid for reserved blocks */ -+ __le32 s_inodes_count; /* Inodes count */ -+ __le32 s_blocks_count; /* Blocks count */ -+ __le32 s_r_blocks_count; /* Reserved blocks count */ -+ __le32 s_free_blocks_count; /* Free blocks count */ -+ __le32 s_free_inodes_count; /* Free inodes count */ -+ __le32 s_first_data_block; /* First Data Block */ -+ __le32 s_log_block_size; /* Block size */ -+ __le32 s_log_frag_size; /* Fragment size */ -+ __le32 s_blocks_per_group; /* # Blocks per group */ -+ __le32 s_frags_per_group; /* # Fragments per group */ -+ __le32 s_inodes_per_group; /* # Inodes per group */ -+ __le32 s_mtime; /* Mount time */ -+ __le32 s_wtime; /* Write time */ -+ __le16 s_mnt_count; /* Mount count */ -+ __le16 s_max_mnt_count; /* Maximal mount count */ -+ __le16 s_magic; /* Magic signature */ -+ __le16 s_state; /* File system state */ -+ __le16 s_errors; /* Behaviour when detecting errors */ -+ __le16 s_minor_rev_level; /* minor revision level */ -+ __le32 s_lastcheck; /* time of last check */ -+ __le32 s_checkinterval; /* max. time between checks */ -+ __le32 s_creator_os; /* OS */ -+ __le32 s_rev_level; /* Revision level */ -+ __le16 s_def_resuid; /* Default uid for reserved blocks */ -+ __le16 s_def_resgid; /* Default gid for reserved blocks */ - /* - * These fields are for EXT2_DYNAMIC_REV superblocks only. - * -@@ -373,16 +373,16 @@ struct ext2_super_block { - * feature set, it must abort and not try to meddle with - * things it doesn't understand... 
- */ -- __u32 s_first_ino; /* First non-reserved inode */ -- __u16 s_inode_size; /* size of inode structure */ -- __u16 s_block_group_nr; /* block group # of this superblock */ -- __u32 s_feature_compat; /* compatible feature set */ -- __u32 s_feature_incompat; /* incompatible feature set */ -- __u32 s_feature_ro_compat; /* readonly-compatible feature set */ -+ __le32 s_first_ino; /* First non-reserved inode */ -+ __le16 s_inode_size; /* size of inode structure */ -+ __le16 s_block_group_nr; /* block group # of this superblock */ -+ __le32 s_feature_compat; /* compatible feature set */ -+ __le32 s_feature_incompat; /* incompatible feature set */ -+ __le32 s_feature_ro_compat; /* readonly-compatible feature set */ - __u8 s_uuid[16]; /* 128-bit uuid for volume */ - char s_volume_name[16]; /* volume name */ - char s_last_mounted[64]; /* directory where last mounted */ -- __u32 s_algorithm_usage_bitmap; /* For compression */ -+ __le32 s_algorithm_usage_bitmap; /* For compression */ - /* - * Performance hints. Directory preallocation should only - * happen if the EXT2_COMPAT_PREALLOC flag is on. -@@ -401,8 +401,8 @@ struct ext2_super_block { - __u8 s_def_hash_version; /* Default hash version to use */ - __u8 s_reserved_char_pad; - __u16 s_reserved_word_pad; -- __u32 s_default_mount_opts; -- __u32 s_first_meta_bg; /* First metablock block group */ -+ __le32 s_default_mount_opts; -+ __le32 s_first_meta_bg; /* First metablock block group */ - __u32 s_reserved[190]; /* Padding to the end of the block */ - }; - -@@ -504,9 +504,9 @@ struct ext2_super_block { - #define EXT2_NAME_LEN 255 - - struct ext2_dir_entry { -- __u32 inode; /* Inode number */ -- __u16 rec_len; /* Directory entry length */ -- __u16 name_len; /* Name length */ -+ __le32 inode; /* Inode number */ -+ __le16 rec_len; /* Directory entry length */ -+ __le16 name_len; /* Name length */ - char name[EXT2_NAME_LEN]; /* File name */ - }; - -@@ -517,8 +517,8 @@ struct ext2_dir_entry { - * file_type field. 
- */ - struct ext2_dir_entry_2 { -- __u32 inode; /* Inode number */ -- __u16 rec_len; /* Directory entry length */ -+ __le32 inode; /* Inode number */ -+ __le16 rec_len; /* Directory entry length */ - __u8 name_len; /* Name length */ - __u8 file_type; - char name[EXT2_NAME_LEN]; /* File name */ -diff -uprN linux-2.6.8.1.orig/include/linux/ext3_fs.h linux-2.6.8.1-ve022stab078/include/linux/ext3_fs.h ---- linux-2.6.8.1.orig/include/linux/ext3_fs.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ext3_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -129,14 +129,14 @@ struct statfs; - */ - struct ext3_group_desc - { -- __u32 bg_block_bitmap; /* Blocks bitmap block */ -- __u32 bg_inode_bitmap; /* Inodes bitmap block */ -- __u32 bg_inode_table; /* Inodes table block */ -- __u16 bg_free_blocks_count; /* Free blocks count */ -- __u16 bg_free_inodes_count; /* Free inodes count */ -- __u16 bg_used_dirs_count; /* Directories count */ -+ __le32 bg_block_bitmap; /* Blocks bitmap block */ -+ __le32 bg_inode_bitmap; /* Inodes bitmap block */ -+ __le32 bg_inode_table; /* Inodes table block */ -+ __le16 bg_free_blocks_count; /* Free blocks count */ -+ __le16 bg_free_inodes_count; /* Free inodes count */ -+ __le16 bg_used_dirs_count; /* Directories count */ - __u16 bg_pad; -- __u32 bg_reserved[3]; -+ __le32 bg_reserved[3]; - }; - - /* -@@ -196,6 +196,31 @@ struct ext3_group_desc - #define EXT3_STATE_JDATA 0x00000001 /* journaled data exists */ - #define EXT3_STATE_NEW 0x00000002 /* inode is newly created */ - -+ -+/* Used to pass group descriptor data when online resize is done */ -+struct ext3_new_group_input { -+ __u32 group; /* Group number for this data */ -+ __u32 block_bitmap; /* Absolute block number of block bitmap */ -+ __u32 inode_bitmap; /* Absolute block number of inode bitmap */ -+ __u32 inode_table; /* Absolute block number of inode table start */ -+ __u32 blocks_count; /* Total number of blocks in this group */ -+ __u16 reserved_blocks; 
/* Number of reserved blocks in this group */ -+ __u16 unused; -+}; -+ -+/* The struct ext3_new_group_input in kernel space, with free_blocks_count */ -+struct ext3_new_group_data { -+ __u32 group; -+ __u32 block_bitmap; -+ __u32 inode_bitmap; -+ __u32 inode_table; -+ __u32 blocks_count; -+ __u16 reserved_blocks; -+ __u16 unused; -+ __u32 free_blocks_count; -+}; -+ -+ - /* - * ioctl commands - */ -@@ -203,6 +228,8 @@ struct ext3_group_desc - #define EXT3_IOC_SETFLAGS _IOW('f', 2, long) - #define EXT3_IOC_GETVERSION _IOR('f', 3, long) - #define EXT3_IOC_SETVERSION _IOW('f', 4, long) -+#define EXT3_IOC_GROUP_EXTEND _IOW('f', 7, unsigned long) -+#define EXT3_IOC_GROUP_ADD _IOW('f', 8,struct ext3_new_group_input) - #define EXT3_IOC_GETVERSION_OLD _IOR('v', 1, long) - #define EXT3_IOC_SETVERSION_OLD _IOW('v', 2, long) - #ifdef CONFIG_JBD_DEBUG -@@ -213,17 +240,17 @@ struct ext3_group_desc - * Structure of an inode on the disk - */ - struct ext3_inode { -- __u16 i_mode; /* File mode */ -- __u16 i_uid; /* Low 16 bits of Owner Uid */ -- __u32 i_size; /* Size in bytes */ -- __u32 i_atime; /* Access time */ -- __u32 i_ctime; /* Creation time */ -- __u32 i_mtime; /* Modification time */ -- __u32 i_dtime; /* Deletion Time */ -- __u16 i_gid; /* Low 16 bits of Group Id */ -- __u16 i_links_count; /* Links count */ -- __u32 i_blocks; /* Blocks count */ -- __u32 i_flags; /* File flags */ -+ __le16 i_mode; /* File mode */ -+ __le16 i_uid; /* Low 16 bits of Owner Uid */ -+ __le32 i_size; /* Size in bytes */ -+ __le32 i_atime; /* Access time */ -+ __le32 i_ctime; /* Creation time */ -+ __le32 i_mtime; /* Modification time */ -+ __le32 i_dtime; /* Deletion Time */ -+ __le16 i_gid; /* Low 16 bits of Group Id */ -+ __le16 i_links_count; /* Links count */ -+ __le32 i_blocks; /* Blocks count */ -+ __le32 i_flags; /* File flags */ - union { - struct { - __u32 l_i_reserved1; -@@ -235,18 +262,18 @@ struct ext3_inode { - __u32 m_i_reserved1; - } masix1; - } osd1; /* OS dependent 1 */ -- __u32 
i_block[EXT3_N_BLOCKS];/* Pointers to blocks */ -- __u32 i_generation; /* File version (for NFS) */ -- __u32 i_file_acl; /* File ACL */ -- __u32 i_dir_acl; /* Directory ACL */ -- __u32 i_faddr; /* Fragment address */ -+ __le32 i_block[EXT3_N_BLOCKS];/* Pointers to blocks */ -+ __le32 i_generation; /* File version (for NFS) */ -+ __le32 i_file_acl; /* File ACL */ -+ __le32 i_dir_acl; /* Directory ACL */ -+ __le32 i_faddr; /* Fragment address */ - union { - struct { - __u8 l_i_frag; /* Fragment number */ - __u8 l_i_fsize; /* Fragment size */ - __u16 i_pad1; -- __u16 l_i_uid_high; /* these 2 fields */ -- __u16 l_i_gid_high; /* were reserved2[0] */ -+ __le16 l_i_uid_high; /* these 2 fields */ -+ __le16 l_i_gid_high; /* were reserved2[0] */ - __u32 l_i_reserved2; - } linux2; - struct { -@@ -363,31 +390,31 @@ struct ext3_inode { - * Structure of the super block - */ - struct ext3_super_block { --/*00*/ __u32 s_inodes_count; /* Inodes count */ -- __u32 s_blocks_count; /* Blocks count */ -- __u32 s_r_blocks_count; /* Reserved blocks count */ -- __u32 s_free_blocks_count; /* Free blocks count */ --/*10*/ __u32 s_free_inodes_count; /* Free inodes count */ -- __u32 s_first_data_block; /* First Data Block */ -- __u32 s_log_block_size; /* Block size */ -- __s32 s_log_frag_size; /* Fragment size */ --/*20*/ __u32 s_blocks_per_group; /* # Blocks per group */ -- __u32 s_frags_per_group; /* # Fragments per group */ -- __u32 s_inodes_per_group; /* # Inodes per group */ -- __u32 s_mtime; /* Mount time */ --/*30*/ __u32 s_wtime; /* Write time */ -- __u16 s_mnt_count; /* Mount count */ -- __s16 s_max_mnt_count; /* Maximal mount count */ -- __u16 s_magic; /* Magic signature */ -- __u16 s_state; /* File system state */ -- __u16 s_errors; /* Behaviour when detecting errors */ -- __u16 s_minor_rev_level; /* minor revision level */ --/*40*/ __u32 s_lastcheck; /* time of last check */ -- __u32 s_checkinterval; /* max. 
time between checks */ -- __u32 s_creator_os; /* OS */ -- __u32 s_rev_level; /* Revision level */ --/*50*/ __u16 s_def_resuid; /* Default uid for reserved blocks */ -- __u16 s_def_resgid; /* Default gid for reserved blocks */ -+/*00*/ __le32 s_inodes_count; /* Inodes count */ -+ __le32 s_blocks_count; /* Blocks count */ -+ __le32 s_r_blocks_count; /* Reserved blocks count */ -+ __le32 s_free_blocks_count; /* Free blocks count */ -+/*10*/ __le32 s_free_inodes_count; /* Free inodes count */ -+ __le32 s_first_data_block; /* First Data Block */ -+ __le32 s_log_block_size; /* Block size */ -+ __le32 s_log_frag_size; /* Fragment size */ -+/*20*/ __le32 s_blocks_per_group; /* # Blocks per group */ -+ __le32 s_frags_per_group; /* # Fragments per group */ -+ __le32 s_inodes_per_group; /* # Inodes per group */ -+ __le32 s_mtime; /* Mount time */ -+/*30*/ __le32 s_wtime; /* Write time */ -+ __le16 s_mnt_count; /* Mount count */ -+ __le16 s_max_mnt_count; /* Maximal mount count */ -+ __le16 s_magic; /* Magic signature */ -+ __le16 s_state; /* File system state */ -+ __le16 s_errors; /* Behaviour when detecting errors */ -+ __le16 s_minor_rev_level; /* minor revision level */ -+/*40*/ __le32 s_lastcheck; /* time of last check */ -+ __le32 s_checkinterval; /* max. time between checks */ -+ __le32 s_creator_os; /* OS */ -+ __le32 s_rev_level; /* Revision level */ -+/*50*/ __le16 s_def_resuid; /* Default uid for reserved blocks */ -+ __le16 s_def_resgid; /* Default gid for reserved blocks */ - /* - * These fields are for EXT3_DYNAMIC_REV superblocks only. - * -@@ -401,36 +428,36 @@ struct ext3_super_block { - * feature set, it must abort and not try to meddle with - * things it doesn't understand... 
- */ -- __u32 s_first_ino; /* First non-reserved inode */ -- __u16 s_inode_size; /* size of inode structure */ -- __u16 s_block_group_nr; /* block group # of this superblock */ -- __u32 s_feature_compat; /* compatible feature set */ --/*60*/ __u32 s_feature_incompat; /* incompatible feature set */ -- __u32 s_feature_ro_compat; /* readonly-compatible feature set */ -+ __le32 s_first_ino; /* First non-reserved inode */ -+ __le16 s_inode_size; /* size of inode structure */ -+ __le16 s_block_group_nr; /* block group # of this superblock */ -+ __le32 s_feature_compat; /* compatible feature set */ -+/*60*/ __le32 s_feature_incompat; /* incompatible feature set */ -+ __le32 s_feature_ro_compat; /* readonly-compatible feature set */ - /*68*/ __u8 s_uuid[16]; /* 128-bit uuid for volume */ - /*78*/ char s_volume_name[16]; /* volume name */ - /*88*/ char s_last_mounted[64]; /* directory where last mounted */ --/*C8*/ __u32 s_algorithm_usage_bitmap; /* For compression */ -+/*C8*/ __le32 s_algorithm_usage_bitmap; /* For compression */ - /* - * Performance hints. Directory preallocation should only - * happen if the EXT3_FEATURE_COMPAT_DIR_PREALLOC flag is on. - */ - __u8 s_prealloc_blocks; /* Nr of blocks to try to preallocate*/ - __u8 s_prealloc_dir_blocks; /* Nr to preallocate for dirs */ -- __u16 s_padding1; -+ __u16 s_reserved_gdt_blocks; /* Per group desc for online growth */ - /* - * Journaling support valid if EXT3_FEATURE_COMPAT_HAS_JOURNAL set. 
- */ - /*D0*/ __u8 s_journal_uuid[16]; /* uuid of journal superblock */ --/*E0*/ __u32 s_journal_inum; /* inode number of journal file */ -- __u32 s_journal_dev; /* device number of journal file */ -- __u32 s_last_orphan; /* start of list of inodes to delete */ -- __u32 s_hash_seed[4]; /* HTREE hash seed */ -+/*E0*/ __le32 s_journal_inum; /* inode number of journal file */ -+ __le32 s_journal_dev; /* device number of journal file */ -+ __le32 s_last_orphan; /* start of list of inodes to delete */ -+ __le32 s_hash_seed[4]; /* HTREE hash seed */ - __u8 s_def_hash_version; /* Default hash version to use */ - __u8 s_reserved_char_pad; - __u16 s_reserved_word_pad; -- __u32 s_default_mount_opts; -- __u32 s_first_meta_bg; /* First metablock block group */ -+ __le32 s_default_mount_opts; -+ __le32 s_first_meta_bg; /* First metablock block group */ - __u32 s_reserved[190]; /* Padding to the end of the block */ - }; - -@@ -545,9 +572,9 @@ static inline struct ext3_inode_info *EX - #define EXT3_NAME_LEN 255 - - struct ext3_dir_entry { -- __u32 inode; /* Inode number */ -- __u16 rec_len; /* Directory entry length */ -- __u16 name_len; /* Name length */ -+ __le32 inode; /* Inode number */ -+ __le16 rec_len; /* Directory entry length */ -+ __le16 name_len; /* Name length */ - char name[EXT3_NAME_LEN]; /* File name */ - }; - -@@ -558,8 +585,8 @@ struct ext3_dir_entry { - * file_type field. 
- */ - struct ext3_dir_entry_2 { -- __u32 inode; /* Inode number */ -- __u16 rec_len; /* Directory entry length */ -+ __le32 inode; /* Inode number */ -+ __le16 rec_len; /* Directory entry length */ - __u8 name_len; /* Name length */ - __u8 file_type; - char name[EXT3_NAME_LEN]; /* File name */ -@@ -684,6 +711,8 @@ extern int ext3_new_block (handle_t *, s - __u32 *, __u32 *, int *); - extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long, - unsigned long); -+extern void ext3_free_blocks_sb (handle_t *, struct super_block *, -+ unsigned long, unsigned long, int *); - extern unsigned long ext3_count_free_blocks (struct super_block *); - extern void ext3_check_blocks_bitmap (struct super_block *); - extern struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb, -@@ -723,7 +752,7 @@ extern struct buffer_head * ext3_getblk - extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); - - extern void ext3_read_inode (struct inode *); --extern void ext3_write_inode (struct inode *, int); -+extern int ext3_write_inode (struct inode *, int); - extern int ext3_setattr (struct dentry *, struct iattr *); - extern void ext3_put_inode (struct inode *); - extern void ext3_delete_inode (struct inode *); -@@ -745,6 +774,13 @@ extern int ext3_orphan_del(handle_t *, s - extern int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash, - __u32 start_minor_hash, __u32 *next_hash); - -+/* resize.c */ -+extern int ext3_group_add(struct super_block *sb, -+ struct ext3_new_group_data *input); -+extern int ext3_group_extend(struct super_block *sb, -+ struct ext3_super_block *es, -+ unsigned long n_blocks_count); -+ - /* super.c */ - extern void ext3_error (struct super_block *, const char *, const char *, ...) 
- __attribute__ ((format (printf, 3, 4))); -diff -uprN linux-2.6.8.1.orig/include/linux/ext3_fs_i.h linux-2.6.8.1-ve022stab078/include/linux/ext3_fs_i.h ---- linux-2.6.8.1.orig/include/linux/ext3_fs_i.h 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ext3_fs_i.h 2006-05-11 13:05:31.000000000 +0400 -@@ -22,7 +22,7 @@ - * second extended file system inode data in memory - */ - struct ext3_inode_info { -- __u32 i_data[15]; -+ __le32 i_data[15]; /* unconverted */ - __u32 i_flags; - #ifdef EXT3_FRAGMENTS - __u32 i_faddr; -diff -uprN linux-2.6.8.1.orig/include/linux/ext3_fs_sb.h linux-2.6.8.1-ve022stab078/include/linux/ext3_fs_sb.h ---- linux-2.6.8.1.orig/include/linux/ext3_fs_sb.h 2004-08-14 14:56:15.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ext3_fs_sb.h 2006-05-11 13:05:31.000000000 +0400 -@@ -53,7 +53,6 @@ struct ext3_sb_info { - u32 s_next_generation; - u32 s_hash_seed[4]; - int s_def_hash_version; -- u8 *s_debts; - struct percpu_counter s_freeblocks_counter; - struct percpu_counter s_freeinodes_counter; - struct percpu_counter s_dirs_counter; -diff -uprN linux-2.6.8.1.orig/include/linux/ext3_jbd.h linux-2.6.8.1-ve022stab078/include/linux/ext3_jbd.h ---- linux-2.6.8.1.orig/include/linux/ext3_jbd.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ext3_jbd.h 2006-05-11 13:05:31.000000000 +0400 -@@ -138,10 +138,13 @@ ext3_journal_release_buffer(handle_t *ha - journal_release_buffer(handle, bh, credits); - } - --static inline void --ext3_journal_forget(handle_t *handle, struct buffer_head *bh) -+static inline int -+__ext3_journal_forget(const char *where, handle_t *handle, struct buffer_head *bh) - { -- journal_forget(handle, bh); -+ int err = journal_forget(handle, bh); -+ if (err) -+ ext3_journal_abort_handle(where, __FUNCTION__, bh, handle,err); -+ return err; - } - - static inline int -@@ -187,10 +190,17 @@ __ext3_journal_dirty_metadata(const char - 
__ext3_journal_get_create_access(__FUNCTION__, (handle), (bh)) - #define ext3_journal_dirty_metadata(handle, bh) \ - __ext3_journal_dirty_metadata(__FUNCTION__, (handle), (bh)) -+#define ext3_journal_forget(handle, bh) \ -+ __ext3_journal_forget(__FUNCTION__, (handle), (bh)) - --handle_t *ext3_journal_start(struct inode *inode, int nblocks); -+handle_t *ext3_journal_start_sb(struct super_block *sb, int nblocks); - int __ext3_journal_stop(const char *where, handle_t *handle); - -+static inline handle_t *ext3_journal_start(struct inode *inode, int nblocks) -+{ -+ return ext3_journal_start_sb(inode->i_sb, nblocks); -+} -+ - #define ext3_journal_stop(handle) \ - __ext3_journal_stop(__FUNCTION__, (handle)) - -diff -uprN linux-2.6.8.1.orig/include/linux/fairsched.h linux-2.6.8.1-ve022stab078/include/linux/fairsched.h ---- linux-2.6.8.1.orig/include/linux/fairsched.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/fairsched.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,119 @@ -+#ifndef __LINUX_FAIRSCHED_H__ -+#define __LINUX_FAIRSCHED_H__ -+ -+/* -+ * Fair Scheduler -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#include <linux/cache.h> -+#include <linux/cpumask.h> -+#include <asm/timex.h> -+ -+#define FAIRSCHED_HAS_CPU_BINDING 0 -+ -+typedef struct { cycles_t t; } fschtag_t; -+typedef struct { unsigned long d; } fschdur_t; -+typedef struct { cycles_t v; } fschvalue_t; -+ -+struct vcpu_scheduler; -+ -+struct fairsched_node { -+ struct list_head runlist; -+ -+ /* -+ * Fair Scheduler fields -+ * -+ * nr_running >= nr_ready (!= if delayed) -+ */ -+ fschtag_t start_tag; -+ int nr_ready; -+ int nr_runnable; -+ int nr_pcpu; -+ -+ /* -+ * Rate limitator fields -+ */ -+ cycles_t last_updated_at; -+ fschvalue_t value; /* leaky function value */ -+ cycles_t delay; /* removed from schedule till */ -+ unsigned char delayed; -+ -+ /* -+ * Configuration -+ * -+ * Read-only most of the time. -+ */ -+ unsigned weight ____cacheline_aligned_in_smp; -+ /* fairness weight */ -+ unsigned char rate_limited; -+ unsigned rate; /* max CPU share */ -+ fschtag_t max_latency; -+ unsigned min_weight; -+ -+ struct list_head nodelist; -+ int id; -+#ifdef CONFIG_VE -+ struct ve_struct *owner_env; -+#endif -+ struct vcpu_scheduler *vsched; -+}; -+ -+#ifdef CONFIG_FAIRSCHED -+ -+#define FSCHWEIGHT_MAX ((1 << 16) - 1) -+#define FSCHRATE_SHIFT 10 -+ -+/* -+ * Fairsched nodes used in boot process. -+ */ -+extern struct fairsched_node fairsched_init_node; -+extern struct fairsched_node fairsched_idle_node; -+ -+/* -+ * For proc output. -+ */ -+extern unsigned fairsched_nr_cpus; -+extern void fairsched_cpu_online_map(int id, cpumask_t *mask); -+ -+/* I hope vsched_id is always equal to fairsched node id --SAW */ -+#define task_fairsched_node_id(p) task_vsched_id(p) -+ -+/* -+ * Core functions. 
-+ */ -+extern void fairsched_incrun(struct fairsched_node *node); -+extern void fairsched_decrun(struct fairsched_node *node); -+extern void fairsched_inccpu(struct fairsched_node *node); -+extern void fairsched_deccpu(struct fairsched_node *node); -+extern struct fairsched_node *fairsched_schedule( -+ struct fairsched_node *prev_node, -+ struct fairsched_node *cur_node, -+ int cur_node_active, -+ cycles_t time); -+ -+/* -+ * Management functions. -+ */ -+void fairsched_init_early(void); -+asmlinkage int sys_fairsched_mknod(unsigned int parent, unsigned int weight, -+ unsigned int newid); -+asmlinkage int sys_fairsched_rmnod(unsigned int id); -+asmlinkage int sys_fairsched_mvpr(pid_t pid, unsigned int nodeid); -+ -+#else /* CONFIG_FAIRSCHED */ -+ -+#define task_fairsched_node_id(p) 0 -+#define fairsched_incrun(p) do { } while (0) -+#define fairsched_decrun(p) do { } while (0) -+#define fairsched_deccpu(p) do { } while (0) -+#define fairsched_cpu_online_map(id, mask) do { *(mask) = cpu_online_map; } while (0) -+ -+#endif /* CONFIG_FAIRSCHED */ -+ -+#endif /* __LINUX_FAIRSCHED_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/faudit.h linux-2.6.8.1-ve022stab078/include/linux/faudit.h ---- linux-2.6.8.1.orig/include/linux/faudit.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/faudit.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,51 @@ -+/* -+ * include/linux/faudit.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __FAUDIT_H_ -+#define __FAUDIT_H_ -+ -+#include <linux/config.h> -+#include <linux/virtinfo.h> -+ -+struct vfsmount; -+struct dentry; -+struct super_block; -+struct kstatfs; -+struct kstat; -+struct pt_regs; -+ -+struct faudit_regs_arg { -+ int err; -+ struct pt_regs *regs; -+}; -+ -+struct faudit_stat_arg { -+ int err; -+ struct vfsmount *mnt; -+ struct dentry *dentry; -+ struct kstat *stat; -+}; -+ -+struct faudit_statfs_arg { -+ int err; -+ struct super_block *sb; -+ struct kstatfs *stat; -+}; -+ -+#define VIRTINFO_FAUDIT (0) -+#define VIRTINFO_FAUDIT_EXIT (VIRTINFO_FAUDIT + 0) -+#define VIRTINFO_FAUDIT_FORK (VIRTINFO_FAUDIT + 1) -+#define VIRTINFO_FAUDIT_CLONE (VIRTINFO_FAUDIT + 2) -+#define VIRTINFO_FAUDIT_VFORK (VIRTINFO_FAUDIT + 3) -+#define VIRTINFO_FAUDIT_EXECVE (VIRTINFO_FAUDIT + 4) -+#define VIRTINFO_FAUDIT_STAT (VIRTINFO_FAUDIT + 5) -+#define VIRTINFO_FAUDIT_STATFS (VIRTINFO_FAUDIT + 6) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/fb.h linux-2.6.8.1-ve022stab078/include/linux/fb.h ---- linux-2.6.8.1.orig/include/linux/fb.h 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/fb.h 2006-05-11 13:05:32.000000000 +0400 -@@ -725,7 +725,6 @@ extern void fb_destroy_modedb(struct fb_ - - /* drivers/video/modedb.c */ - #define VESA_MODEDB_SIZE 34 --extern const struct fb_videomode vesa_modes[]; - - /* drivers/video/fbcmap.c */ - extern int fb_alloc_cmap(struct fb_cmap *cmap, int len, int transp); -@@ -754,6 +753,8 @@ struct fb_videomode { - u32 flag; - }; - -+extern const struct fb_videomode vesa_modes[]; -+ - extern int fb_find_mode(struct fb_var_screeninfo *var, - struct fb_info *info, const char *mode_option, - const struct fb_videomode *db, -diff -uprN linux-2.6.8.1.orig/include/linux/fs.h linux-2.6.8.1-ve022stab078/include/linux/fs.h ---- linux-2.6.8.1.orig/include/linux/fs.h 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/fs.h 2006-05-11 13:05:43.000000000 
+0400 -@@ -7,6 +7,7 @@ - */ - - #include <linux/config.h> -+#include <linux/ve_owner.h> - #include <linux/linkage.h> - #include <linux/limits.h> - #include <linux/wait.h> -@@ -79,6 +80,7 @@ extern int leases_enable, dir_notify_ena - #define FMODE_LSEEK 4 - #define FMODE_PREAD 8 - #define FMODE_PWRITE FMODE_PREAD /* These go hand in hand */ -+#define FMODE_QUOTACTL 4 - - #define RW_MASK 1 - #define RWA_MASK 2 -@@ -88,6 +90,7 @@ extern int leases_enable, dir_notify_ena - #define SPECIAL 4 /* For non-blockdevice requests in request queue */ - #define READ_SYNC (READ | (1 << BIO_RW_SYNC)) - #define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNC)) -+#define WRITE_BARRIER ((1 << BIO_RW) | (1 << BIO_RW_BARRIER)) - - #define SEL_IN 1 - #define SEL_OUT 2 -@@ -96,6 +99,7 @@ extern int leases_enable, dir_notify_ena - /* public flags for file_system_type */ - #define FS_REQUIRES_DEV 1 - #define FS_BINARY_MOUNTDATA 2 -+#define FS_VIRTUALIZED 64 /* Can mount this fstype inside ve */ - #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */ - #define FS_ODD_RENAME 32768 /* Temporary stuff; will go away as soon - * as nfs_rename() will be cleaned up -@@ -118,7 +122,8 @@ extern int leases_enable, dir_notify_ena - #define MS_REC 16384 - #define MS_VERBOSE 32768 - #define MS_POSIXACL (1<<16) /* VFS does not apply the umask */ --#define MS_ONE_SECOND (1<<17) /* fs has 1 sec a/m/ctime resolution */ -+#define MS_ONE_SECOND (1<<17) /* fs has 1 sec time resolution (obsolete) */ -+#define MS_TIME_GRAN (1<<18) /* fs has s_time_gran field */ - #define MS_ACTIVE (1<<30) - #define MS_NOUSER (1<<31) - -@@ -292,6 +297,9 @@ struct iattr { - * Includes for diskquotas. - */ - #include <linux/quota.h> -+#if defined(CONFIG_VZ_QUOTA) || defined(CONFIG_VZ_QUOTA_MODULE) -+#include <linux/vzquota_qlnk.h> -+#endif - - /* - * oh the beauties of C type declarations. 
-@@ -419,6 +427,7 @@ static inline int mapping_writably_mappe - struct inode { - struct hlist_node i_hash; - struct list_head i_list; -+ struct list_head i_sb_list; - struct list_head i_dentry; - unsigned long i_ino; - atomic_t i_count; -@@ -448,6 +457,9 @@ struct inode { - #ifdef CONFIG_QUOTA - struct dquot *i_dquot[MAXQUOTAS]; - #endif -+#if defined(CONFIG_VZ_QUOTA) || defined(CONFIG_VZ_QUOTA_MODULE) -+ struct vz_quota_ilink i_qlnk; -+#endif - /* These three should probably be a union */ - struct list_head i_devices; - struct pipe_inode_info *i_pipe; -@@ -536,6 +548,12 @@ static inline unsigned imajor(struct ino - - extern struct block_device *I_BDEV(struct inode *inode); - -+struct exec_perm { -+ umode_t mode; -+ uid_t uid, gid; -+ int set; -+}; -+ - struct fown_struct { - rwlock_t lock; /* protects pid, uid, euid fields */ - int pid; /* pid or -pgrp where SIGIO should be sent */ -@@ -587,7 +605,10 @@ struct file { - spinlock_t f_ep_lock; - #endif /* #ifdef CONFIG_EPOLL */ - struct address_space *f_mapping; -+ struct ve_struct *owner_env; - }; -+DCL_VE_OWNER_PROTO(FILP, GENERIC, struct file, owner_env, -+ inline, (always_inline)) - extern spinlock_t files_lock; - #define file_list_lock() spin_lock(&files_lock); - #define file_list_unlock() spin_unlock(&files_lock); -@@ -639,6 +660,7 @@ struct file_lock { - struct file *fl_file; - unsigned char fl_flags; - unsigned char fl_type; -+ unsigned char fl_charged; - loff_t fl_start; - loff_t fl_end; - -@@ -750,10 +772,12 @@ struct super_block { - atomic_t s_active; - void *s_security; - -+ struct list_head s_inodes; /* all inodes */ - struct list_head s_dirty; /* dirty inodes */ - struct list_head s_io; /* parked for writeback */ - struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ - struct list_head s_files; -+ struct list_head s_dshrinkers; /* active dcache shrinkers */ - - struct block_device *s_bdev; - struct list_head s_instances; -@@ -771,8 +795,33 @@ struct super_block { - * even looking at 
it. You had been warned. - */ - struct semaphore s_vfs_rename_sem; /* Kludge */ -+ -+ /* Granuality of c/m/atime in ns. -+ Cannot be worse than a second */ -+#ifndef __GENKSYMS__ -+ u32 s_time_gran; -+#endif - }; - -+extern struct timespec current_fs_time(struct super_block *sb); -+ -+static inline u32 get_sb_time_gran(struct super_block *sb) -+{ -+ if (sb->s_flags & MS_TIME_GRAN) -+ return sb->s_time_gran; -+ if (sb->s_flags & MS_ONE_SECOND) -+ return 1000000000U; -+ return 1; -+} -+ -+static inline void set_sb_time_gran(struct super_block *sb, u32 time_gran) -+{ -+ sb->s_time_gran = time_gran; -+ sb->s_flags |= MS_TIME_GRAN; -+ if (time_gran == 1000000000U) -+ sb->s_flags |= MS_ONE_SECOND; -+} -+ - /* - * Snapshotting support. - */ -@@ -911,7 +960,8 @@ struct inode_operations { - int (*follow_link) (struct dentry *, struct nameidata *); - void (*put_link) (struct dentry *, struct nameidata *); - void (*truncate) (struct inode *); -- int (*permission) (struct inode *, int, struct nameidata *); -+ int (*permission) (struct inode *, int, struct nameidata *, -+ struct exec_perm *); - int (*setattr) (struct dentry *, struct iattr *); - int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); - int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); -@@ -940,7 +990,7 @@ struct super_operations { - void (*read_inode) (struct inode *); - - void (*dirty_inode) (struct inode *); -- void (*write_inode) (struct inode *, int); -+ int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); -@@ -955,6 +1005,8 @@ struct super_operations { - void (*umount_begin) (struct super_block *); - - int (*show_options)(struct seq_file *, struct vfsmount *); -+ -+ struct inode *(*get_quota_root)(struct super_block *); - }; - - /* Inode state bits. Protected by inode_lock. 
*/ -@@ -965,6 +1017,7 @@ struct super_operations { - #define I_FREEING 16 - #define I_CLEAR 32 - #define I_NEW 64 -+#define I_WILL_FREE 128 - - #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) - -@@ -1105,8 +1158,15 @@ struct file_system_type { - struct module *owner; - struct file_system_type * next; - struct list_head fs_supers; -+ struct ve_struct *owner_env; - }; - -+DCL_VE_OWNER_PROTO(FSTYPE, MODULE_NOCHECK, struct file_system_type, owner_env -+ , , ()) -+ -+void get_filesystem(struct file_system_type *fs); -+void put_filesystem(struct file_system_type *fs); -+ - struct super_block *get_sb_bdev(struct file_system_type *fs_type, - int flags, const char *dev_name, void *data, - int (*fill_super)(struct super_block *, void *, int)); -@@ -1129,6 +1189,7 @@ struct super_block *sget(struct file_sys - struct super_block *get_sb_pseudo(struct file_system_type *, char *, - struct super_operations *ops, unsigned long); - int __put_super(struct super_block *sb); -+int __put_super_and_need_restart(struct super_block *sb); - void unnamed_dev_init(void); - - /* Alas, no aliases. 
Too much hassle with bringing module.h everywhere */ -@@ -1143,8 +1204,11 @@ extern struct vfsmount *kern_mount(struc - extern int may_umount_tree(struct vfsmount *); - extern int may_umount(struct vfsmount *); - extern long do_mount(char *, char *, char *, unsigned long, void *); -+extern void umount_tree(struct vfsmount *); -+#define kern_umount mntput - - extern int vfs_statfs(struct super_block *, struct kstatfs *); -+extern int faudit_statfs(struct super_block *, struct kstatfs *); - - /* Return value for VFS lock functions - tells locks.c to lock conventionally - * REALLY kosha for root NFS and nfs_lock -@@ -1260,7 +1324,7 @@ extern int chrdev_open(struct inode *, s - #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */ - extern const char *__bdevname(dev_t, char *buffer); - extern const char *bdevname(struct block_device *bdev, char *buffer); --extern struct block_device *lookup_bdev(const char *); -+extern struct block_device *lookup_bdev(const char *, int mode); - extern struct block_device *open_bdev_excl(const char *, int, void *); - extern void close_bdev_excl(struct block_device *); - -@@ -1290,7 +1354,7 @@ extern int fs_may_remount_ro(struct supe - #define bio_data_dir(bio) ((bio)->bi_rw & 1) - - extern int check_disk_change(struct block_device *); --extern int invalidate_inodes(struct super_block *); -+extern int invalidate_inodes(struct super_block *, int); - extern int __invalidate_device(struct block_device *, int); - extern int invalidate_partition(struct gendisk *, int); - unsigned long invalidate_mapping_pages(struct address_space *mapping, -@@ -1317,8 +1381,9 @@ extern int do_remount_sb(struct super_bl - extern sector_t bmap(struct inode *, sector_t); - extern int setattr_mask(unsigned int); - extern int notify_change(struct dentry *, struct iattr *); --extern int permission(struct inode *, int, struct nameidata *); --extern int vfs_permission(struct inode *, int); -+extern int permission(struct inode *, int, struct 
nameidata *, -+ struct exec_perm *); -+extern int vfs_permission(struct inode *, int, struct exec_perm *); - extern int get_write_access(struct inode *); - extern int deny_write_access(struct file *); - static inline void put_write_access(struct inode * inode) -@@ -1335,8 +1400,9 @@ extern int do_pipe(int *); - extern int open_namei(const char *, int, int, struct nameidata *); - extern int may_open(struct nameidata *, int, int); - -+struct linux_binprm; - extern int kernel_read(struct file *, unsigned long, char *, unsigned long); --extern struct file * open_exec(const char *); -+extern struct file * open_exec(const char *, struct linux_binprm *); - - /* fs/dcache.c -- generic fs support functions */ - extern int is_subdir(struct dentry *, struct dentry *); -@@ -1482,7 +1548,7 @@ extern int page_readlink(struct dentry * - extern int page_follow_link(struct dentry *, struct nameidata *); - extern int page_follow_link_light(struct dentry *, struct nameidata *); - extern void page_put_link(struct dentry *, struct nameidata *); --extern int page_symlink(struct inode *inode, const char *symname, int len); -+extern int page_symlink(struct inode *inode, const char *symname, int len, int gfp_mask); - extern struct inode_operations page_symlink_inode_operations; - extern int generic_readlink(struct dentry *, char __user *, int); - extern void generic_fillattr(struct inode *, struct kstat *); -diff -uprN linux-2.6.8.1.orig/include/linux/gfp.h linux-2.6.8.1-ve022stab078/include/linux/gfp.h ---- linux-2.6.8.1.orig/include/linux/gfp.h 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/gfp.h 2006-05-11 13:05:39.000000000 +0400 -@@ -38,19 +38,25 @@ struct vm_area_struct; - #define __GFP_NO_GROW 0x2000 /* Slab internal usage */ - #define __GFP_COMP 0x4000 /* Add compound page metadata */ - --#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */ -+#define __GFP_UBC 0x08000 /* charge kmem in buddy and slab */ -+#define __GFP_SOFT_UBC 0x10000 /* 
use soft charging */ -+ -+#define __GFP_BITS_SHIFT 17 /* Room for 15 __GFP_FOO bits */ - #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) - - /* if you forget to add the bitmask here kernel will crash, period */ - #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \ - __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \ -- __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP) -+ __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \ -+ __GFP_UBC|__GFP_SOFT_UBC) - - #define GFP_ATOMIC (__GFP_HIGH) - #define GFP_NOIO (__GFP_WAIT) - #define GFP_NOFS (__GFP_WAIT | __GFP_IO) - #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS) -+#define GFP_KERNEL_UBC (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_UBC) - #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS) -+#define GFP_USER_UBC (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_UBC) - #define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM) - - /* Flag - indicates that the buffer will be suitable for DMA. Ignored on some -diff -uprN linux-2.6.8.1.orig/include/linux/highmem.h linux-2.6.8.1-ve022stab078/include/linux/highmem.h ---- linux-2.6.8.1.orig/include/linux/highmem.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/highmem.h 2006-05-11 13:05:38.000000000 +0400 -@@ -28,9 +28,10 @@ static inline void *kmap(struct page *pa - - #define kunmap(page) do { (void) (page); } while (0) - --#define kmap_atomic(page, idx) page_address(page) --#define kunmap_atomic(addr, idx) do { } while (0) --#define kmap_atomic_to_page(ptr) virt_to_page(ptr) -+#define kmap_atomic(page, idx) page_address(page) -+#define kmap_atomic_pte(pte, idx) page_address(pte_page(*pte)) -+#define kunmap_atomic(addr, idx) do { } while (0) -+#define kmap_atomic_to_page(ptr) virt_to_page(ptr) - - #endif /* CONFIG_HIGHMEM */ - -diff -uprN linux-2.6.8.1.orig/include/linux/inetdevice.h linux-2.6.8.1-ve022stab078/include/linux/inetdevice.h ---- linux-2.6.8.1.orig/include/linux/inetdevice.h 2004-08-14 14:55:47.000000000 
+0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/inetdevice.h 2006-05-11 13:05:40.000000000 +0400 -@@ -28,6 +28,11 @@ struct ipv4_devconf - }; - - extern struct ipv4_devconf ipv4_devconf; -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_ipv4_devconf (*(get_exec_env()->_ipv4_devconf)) -+#else -+#define ve_ipv4_devconf ipv4_devconf -+#endif - - struct in_device - { -@@ -53,28 +58,28 @@ struct in_device - }; - - #define IN_DEV_FORWARD(in_dev) ((in_dev)->cnf.forwarding) --#define IN_DEV_MFORWARD(in_dev) (ipv4_devconf.mc_forwarding && (in_dev)->cnf.mc_forwarding) --#define IN_DEV_RPFILTER(in_dev) (ipv4_devconf.rp_filter && (in_dev)->cnf.rp_filter) --#define IN_DEV_SOURCE_ROUTE(in_dev) (ipv4_devconf.accept_source_route && (in_dev)->cnf.accept_source_route) --#define IN_DEV_BOOTP_RELAY(in_dev) (ipv4_devconf.bootp_relay && (in_dev)->cnf.bootp_relay) -- --#define IN_DEV_LOG_MARTIANS(in_dev) (ipv4_devconf.log_martians || (in_dev)->cnf.log_martians) --#define IN_DEV_PROXY_ARP(in_dev) (ipv4_devconf.proxy_arp || (in_dev)->cnf.proxy_arp) --#define IN_DEV_SHARED_MEDIA(in_dev) (ipv4_devconf.shared_media || (in_dev)->cnf.shared_media) --#define IN_DEV_TX_REDIRECTS(in_dev) (ipv4_devconf.send_redirects || (in_dev)->cnf.send_redirects) --#define IN_DEV_SEC_REDIRECTS(in_dev) (ipv4_devconf.secure_redirects || (in_dev)->cnf.secure_redirects) -+#define IN_DEV_MFORWARD(in_dev) (ve_ipv4_devconf.mc_forwarding && (in_dev)->cnf.mc_forwarding) -+#define IN_DEV_RPFILTER(in_dev) (ve_ipv4_devconf.rp_filter && (in_dev)->cnf.rp_filter) -+#define IN_DEV_SOURCE_ROUTE(in_dev) (ve_ipv4_devconf.accept_source_route && (in_dev)->cnf.accept_source_route) -+#define IN_DEV_BOOTP_RELAY(in_dev) (ve_ipv4_devconf.bootp_relay && (in_dev)->cnf.bootp_relay) -+ -+#define IN_DEV_LOG_MARTIANS(in_dev) (ve_ipv4_devconf.log_martians || (in_dev)->cnf.log_martians) -+#define IN_DEV_PROXY_ARP(in_dev) (ve_ipv4_devconf.proxy_arp || (in_dev)->cnf.proxy_arp) -+#define 
IN_DEV_SHARED_MEDIA(in_dev) (ve_ipv4_devconf.shared_media || (in_dev)->cnf.shared_media) -+#define IN_DEV_TX_REDIRECTS(in_dev) (ve_ipv4_devconf.send_redirects || (in_dev)->cnf.send_redirects) -+#define IN_DEV_SEC_REDIRECTS(in_dev) (ve_ipv4_devconf.secure_redirects || (in_dev)->cnf.secure_redirects) - #define IN_DEV_IDTAG(in_dev) ((in_dev)->cnf.tag) - #define IN_DEV_MEDIUM_ID(in_dev) ((in_dev)->cnf.medium_id) - - #define IN_DEV_RX_REDIRECTS(in_dev) \ - ((IN_DEV_FORWARD(in_dev) && \ -- (ipv4_devconf.accept_redirects && (in_dev)->cnf.accept_redirects)) \ -+ (ve_ipv4_devconf.accept_redirects && (in_dev)->cnf.accept_redirects)) \ - || (!IN_DEV_FORWARD(in_dev) && \ -- (ipv4_devconf.accept_redirects || (in_dev)->cnf.accept_redirects))) -+ (ve_ipv4_devconf.accept_redirects || (in_dev)->cnf.accept_redirects))) - --#define IN_DEV_ARPFILTER(in_dev) (ipv4_devconf.arp_filter || (in_dev)->cnf.arp_filter) --#define IN_DEV_ARP_ANNOUNCE(in_dev) (max(ipv4_devconf.arp_announce, (in_dev)->cnf.arp_announce)) --#define IN_DEV_ARP_IGNORE(in_dev) (max(ipv4_devconf.arp_ignore, (in_dev)->cnf.arp_ignore)) -+#define IN_DEV_ARPFILTER(in_dev) (ve_ipv4_devconf.arp_filter || (in_dev)->cnf.arp_filter) -+#define IN_DEV_ARP_ANNOUNCE(in_dev) (max(ve_ipv4_devconf.arp_announce, (in_dev)->cnf.arp_announce)) -+#define IN_DEV_ARP_IGNORE(in_dev) (max(ve_ipv4_devconf.arp_ignore, (in_dev)->cnf.arp_ignore)) - - struct in_ifaddr - { -@@ -104,6 +109,7 @@ extern u32 inet_select_addr(const struc - extern u32 inet_confirm_addr(const struct net_device *dev, u32 dst, u32 local, int scope); - extern struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, u32 prefix, u32 mask); - extern void inet_forward_change(void); -+extern void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, int destroy); - - static __inline__ int inet_ifa_match(u32 addr, struct in_ifaddr *ifa) - { -@@ -167,6 +173,10 @@ in_dev_put(struct in_device *idev) - #define __in_dev_put(idev) atomic_dec(&(idev)->refcnt) - #define 
in_dev_hold(idev) atomic_inc(&(idev)->refcnt) - -+struct ve_struct; -+extern int devinet_sysctl_init(struct ve_struct *); -+extern void devinet_sysctl_fini(struct ve_struct *); -+extern void devinet_sysctl_free(struct ve_struct *); - #endif /* __KERNEL__ */ - - static __inline__ __u32 inet_make_mask(int logmask) -diff -uprN linux-2.6.8.1.orig/include/linux/initrd.h linux-2.6.8.1-ve022stab078/include/linux/initrd.h ---- linux-2.6.8.1.orig/include/linux/initrd.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/initrd.h 2006-05-11 13:05:37.000000000 +0400 -@@ -14,7 +14,7 @@ extern int rd_image_start; - extern int initrd_below_start_ok; - - /* free_initrd_mem always gets called with the next two as arguments.. */ --extern unsigned long initrd_start, initrd_end; -+extern unsigned long initrd_start, initrd_end, initrd_copy; - extern void free_initrd_mem(unsigned long, unsigned long); - - extern unsigned int real_root_dev; -diff -uprN linux-2.6.8.1.orig/include/linux/irq.h linux-2.6.8.1-ve022stab078/include/linux/irq.h ---- linux-2.6.8.1.orig/include/linux/irq.h 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/irq.h 2006-05-11 13:05:38.000000000 +0400 -@@ -77,4 +77,6 @@ extern hw_irq_controller no_irq_type; / - - #endif - -+void check_stack_overflow(void); -+ - #endif /* __irq_h */ -diff -uprN linux-2.6.8.1.orig/include/linux/jbd.h linux-2.6.8.1-ve022stab078/include/linux/jbd.h ---- linux-2.6.8.1.orig/include/linux/jbd.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/jbd.h 2006-05-11 13:05:43.000000000 +0400 -@@ -137,9 +137,9 @@ typedef struct journal_s journal_t; /* J - */ - typedef struct journal_header_s - { -- __u32 h_magic; -- __u32 h_blocktype; -- __u32 h_sequence; -+ __be32 h_magic; -+ __be32 h_blocktype; -+ __be32 h_sequence; - } journal_header_t; - - -@@ -148,8 +148,8 @@ typedef struct journal_header_s - */ - typedef struct journal_block_tag_s - { -- __u32 
t_blocknr; /* The on-disk block number */ -- __u32 t_flags; /* See below */ -+ __be32 t_blocknr; /* The on-disk block number */ -+ __be32 t_flags; /* See below */ - } journal_block_tag_t; - - /* -@@ -159,7 +159,7 @@ typedef struct journal_block_tag_s - typedef struct journal_revoke_header_s - { - journal_header_t r_header; -- int r_count; /* Count of bytes used in the block */ -+ __be32 r_count; /* Count of bytes used in the block */ - } journal_revoke_header_t; - - -@@ -180,35 +180,35 @@ typedef struct journal_superblock_s - - /* 0x000C */ - /* Static information describing the journal */ -- __u32 s_blocksize; /* journal device blocksize */ -- __u32 s_maxlen; /* total blocks in journal file */ -- __u32 s_first; /* first block of log information */ -+ __be32 s_blocksize; /* journal device blocksize */ -+ __be32 s_maxlen; /* total blocks in journal file */ -+ __be32 s_first; /* first block of log information */ - - /* 0x0018 */ - /* Dynamic information describing the current state of the log */ -- __u32 s_sequence; /* first commit ID expected in log */ -- __u32 s_start; /* blocknr of start of log */ -+ __be32 s_sequence; /* first commit ID expected in log */ -+ __be32 s_start; /* blocknr of start of log */ - - /* 0x0020 */ - /* Error value, as set by journal_abort(). 
*/ -- __s32 s_errno; -+ __be32 s_errno; - - /* 0x0024 */ - /* Remaining fields are only valid in a version-2 superblock */ -- __u32 s_feature_compat; /* compatible feature set */ -- __u32 s_feature_incompat; /* incompatible feature set */ -- __u32 s_feature_ro_compat; /* readonly-compatible feature set */ -+ __be32 s_feature_compat; /* compatible feature set */ -+ __be32 s_feature_incompat; /* incompatible feature set */ -+ __be32 s_feature_ro_compat; /* readonly-compatible feature set */ - /* 0x0030 */ - __u8 s_uuid[16]; /* 128-bit uuid for journal */ - - /* 0x0040 */ -- __u32 s_nr_users; /* Nr of filesystems sharing log */ -+ __be32 s_nr_users; /* Nr of filesystems sharing log */ - -- __u32 s_dynsuper; /* Blocknr of dynamic superblock copy*/ -+ __be32 s_dynsuper; /* Blocknr of dynamic superblock copy*/ - - /* 0x0048 */ -- __u32 s_max_transaction; /* Limit of journal blocks per trans.*/ -- __u32 s_max_trans_data; /* Limit of data blocks per trans. */ -+ __be32 s_max_transaction; /* Limit of journal blocks per trans.*/ -+ __be32 s_max_trans_data; /* Limit of data blocks per trans. */ - - /* 0x0050 */ - __u32 s_padding[44]; -@@ -242,14 +242,28 @@ typedef struct journal_superblock_s - #include <asm/bug.h> - - #define JBD_ASSERTIONS -+#define JBD_SOFT_ASSERTIONS - #ifdef JBD_ASSERTIONS -+#ifdef JBD_SOFT_ASSERTIONS -+#define J_BUG() \ -+do { \ -+ unsigned long stack; \ -+ printk("Stack=%p current=%p pid=%d ve=%d process='%s'\n", \ -+ &stack, current, current->pid, \ -+ get_exec_env()->veid, \ -+ current->comm); \ -+ dump_stack(); \ -+} while(0) -+#else -+#define J_BUG() BUG() -+#endif - #define J_ASSERT(assert) \ - do { \ - if (!(assert)) { \ - printk (KERN_EMERG \ - "Assertion failure in %s() at %s:%d: \"%s\"\n", \ - __FUNCTION__, __FILE__, __LINE__, # assert); \ -- BUG(); \ -+ J_BUG(); \ - } \ - } while (0) - -@@ -277,13 +291,15 @@ void buffer_assertion_failure(struct buf - #define J_EXPECT_JH(jh, expr, why...) 
J_ASSERT_JH(jh, expr) - #else - #define __journal_expect(expr, why...) \ -- do { \ -- if (!(expr)) { \ -+ ({ \ -+ int val = (expr); \ -+ if (!val) { \ - printk(KERN_ERR \ - "EXT3-fs unexpected failure: %s;\n",# expr); \ -- printk(KERN_ERR why); \ -+ printk(KERN_ERR why "\n"); \ - } \ -- } while (0) -+ val; \ -+ }) - #define J_EXPECT(expr, why...) __journal_expect(expr, ## why) - #define J_EXPECT_BH(bh, expr, why...) __journal_expect(expr, ## why) - #define J_EXPECT_JH(jh, expr, why...) __journal_expect(expr, ## why) -@@ -826,6 +842,12 @@ struct journal_s - struct jbd_revoke_table_s *j_revoke_table[2]; - - /* -+ * array of bhs for journal_commit_transaction -+ */ -+ struct buffer_head **j_wbuf; -+ int j_wbufsize; -+ -+ /* - * An opaque pointer to fs-private information. ext3 puts its - * superblock pointer here - */ -@@ -847,6 +869,7 @@ struct journal_s - */ - - /* Filing buffers */ -+extern void __journal_temp_unlink_buffer(struct journal_head *jh); - extern void journal_unfile_buffer(journal_t *, struct journal_head *); - extern void __journal_unfile_buffer(struct journal_head *); - extern void __journal_refile_buffer(struct journal_head *); -@@ -912,7 +935,7 @@ extern int journal_dirty_data (handle_t - extern int journal_dirty_metadata (handle_t *, struct buffer_head *); - extern void journal_release_buffer (handle_t *, struct buffer_head *, - int credits); --extern void journal_forget (handle_t *, struct buffer_head *); -+extern int journal_forget (handle_t *, struct buffer_head *); - extern void journal_sync_buffer (struct buffer_head *); - extern int journal_invalidatepage(journal_t *, - struct page *, unsigned long); -diff -uprN linux-2.6.8.1.orig/include/linux/jiffies.h linux-2.6.8.1-ve022stab078/include/linux/jiffies.h ---- linux-2.6.8.1.orig/include/linux/jiffies.h 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/jiffies.h 2006-05-11 13:05:39.000000000 +0400 -@@ -15,6 +15,7 @@ - */ - extern u64 jiffies_64; - extern unsigned 
long volatile jiffies; -+extern unsigned long cycles_per_jiffy, cycles_per_clock; - - #if (BITS_PER_LONG < 64) - u64 get_jiffies_64(void); -diff -uprN linux-2.6.8.1.orig/include/linux/kdev_t.h linux-2.6.8.1-ve022stab078/include/linux/kdev_t.h ---- linux-2.6.8.1.orig/include/linux/kdev_t.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/kdev_t.h 2006-05-11 13:05:40.000000000 +0400 -@@ -87,6 +87,57 @@ static inline unsigned sysv_minor(u32 de - return dev & 0x3ffff; - } - -+#define UNNAMED_MAJOR_COUNT 16 -+ -+#if UNNAMED_MAJOR_COUNT > 1 -+ -+extern int unnamed_dev_majors[UNNAMED_MAJOR_COUNT]; -+ -+static inline dev_t make_unnamed_dev(int idx) -+{ -+ /* -+ * Here we transfer bits from 8 to 8+log2(UNNAMED_MAJOR_COUNT) of the -+ * unnamed device index into major number. -+ */ -+ return MKDEV(unnamed_dev_majors[(idx >> 8) & (UNNAMED_MAJOR_COUNT - 1)], -+ idx & ~((UNNAMED_MAJOR_COUNT - 1) << 8)); -+} -+ -+static inline int unnamed_dev_idx(dev_t dev) -+{ -+ int i; -+ for (i = 0; i < UNNAMED_MAJOR_COUNT && -+ MAJOR(dev) != unnamed_dev_majors[i]; i++); -+ return MINOR(dev) | (i << 8); -+} -+ -+static inline int is_unnamed_dev(dev_t dev) -+{ -+ int i; -+ for (i = 0; i < UNNAMED_MAJOR_COUNT && -+ MAJOR(dev) != unnamed_dev_majors[i]; i++); -+ return i < UNNAMED_MAJOR_COUNT; -+} -+ -+#else /* UNNAMED_MAJOR_COUNT */ -+ -+static inline dev_t make_unnamed_dev(int idx) -+{ -+ return MKDEV(0, idx); -+} -+ -+static inline int unnamed_dev_idx(dev_t dev) -+{ -+ return MINOR(dev); -+} -+ -+static inline int is_unnamed_dev(dev_t dev) -+{ -+ return MAJOR(dev) == 0; -+} -+ -+#endif /* UNNAMED_MAJOR_COUNT */ -+ - - #else /* __KERNEL__ */ - -diff -uprN linux-2.6.8.1.orig/include/linux/kernel.h linux-2.6.8.1-ve022stab078/include/linux/kernel.h ---- linux-2.6.8.1.orig/include/linux/kernel.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/kernel.h 2006-05-11 13:05:49.000000000 +0400 -@@ -97,9 +97,18 @@ extern int 
__kernel_text_address(unsigne - extern int kernel_text_address(unsigned long addr); - extern int session_of_pgrp(int pgrp); - -+asmlinkage int vprintk(const char *fmt, va_list args) -+ __attribute__ ((format (printf, 1, 0))); - asmlinkage int printk(const char * fmt, ...) - __attribute__ ((format (printf, 1, 2))); - -+#define VE0_LOG 1 -+#define VE_LOG 2 -+#define VE_LOG_BOTH (VE0_LOG | VE_LOG) -+asmlinkage int ve_printk(int, const char * fmt, ...) -+ __attribute__ ((format (printf, 2, 3))); -+void prepare_printk(void); -+ - unsigned long int_sqrt(unsigned long); - - static inline int __attribute_pure__ long_log2(unsigned long x) -@@ -114,9 +123,14 @@ static inline int __attribute_pure__ lon - extern int printk_ratelimit(void); - extern int __printk_ratelimit(int ratelimit_jiffies, int ratelimit_burst); - -+extern int console_silence_loglevel; -+ - static inline void console_silent(void) - { -- console_loglevel = 0; -+ if (console_loglevel > console_silence_loglevel) { -+ printk("console shuts up ...\n"); -+ console_loglevel = 0; -+ } - } - - static inline void console_verbose(void) -@@ -126,10 +140,14 @@ static inline void console_verbose(void) - } - - extern void bust_spinlocks(int yes); -+extern void wake_up_klogd(void); - extern int oops_in_progress; /* If set, an oops, panic(), BUG() or die() is in progress */ - extern int panic_on_oops; -+extern int decode_call_traces; - extern int tainted; -+extern int kernel_text_csum_broken; - extern const char *print_tainted(void); -+extern int alloc_fail_warn; - - /* Values used for system_state */ - extern enum system_states { -diff -uprN linux-2.6.8.1.orig/include/linux/kmem_cache.h linux-2.6.8.1-ve022stab078/include/linux/kmem_cache.h ---- linux-2.6.8.1.orig/include/linux/kmem_cache.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/kmem_cache.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,195 @@ -+#ifndef __KMEM_CACHE_H__ -+#define __KMEM_CACHE_H__ -+ -+#include <linux/config.h> 
-+#include <linux/threads.h> -+#include <linux/smp.h> -+#include <linux/spinlock.h> -+#include <linux/list.h> -+#include <linux/mm.h> -+#include <asm/atomic.h> -+ -+/* -+ * SLAB_DEBUG - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL, -+ * SLAB_RED_ZONE & SLAB_POISON. -+ * 0 for faster, smaller code (especially in the critical paths). -+ * -+ * SLAB_STATS - 1 to collect stats for /proc/slabinfo. -+ * 0 for faster, smaller code (especially in the critical paths). -+ * -+ * SLAB_FORCED_DEBUG - 1 enables SLAB_RED_ZONE and SLAB_POISON (if possible) -+ */ -+ -+#ifdef CONFIG_DEBUG_SLAB -+#define SLAB_DEBUG 1 -+#define SLAB_STATS 1 -+#define SLAB_FORCED_DEBUG 1 -+#else -+#define SLAB_DEBUG 0 -+#define SLAB_STATS 0 /* must be off, see kmem_cache.h */ -+#define SLAB_FORCED_DEBUG 0 -+#endif -+ -+/* -+ * struct array_cache -+ * -+ * Per cpu structures -+ * Purpose: -+ * - LIFO ordering, to hand out cache-warm objects from _alloc -+ * - reduce the number of linked list operations -+ * - reduce spinlock operations -+ * -+ * The limit is stored in the per-cpu structure to reduce the data cache -+ * footprint. -+ * -+ */ -+struct array_cache { -+ unsigned int avail; -+ unsigned int limit; -+ unsigned int batchcount; -+ unsigned int touched; -+}; -+ -+/* bootstrap: The caches do not work without cpuarrays anymore, -+ * but the cpuarrays are allocated from the generic caches... -+ */ -+#define BOOT_CPUCACHE_ENTRIES 1 -+struct arraycache_init { -+ struct array_cache cache; -+ void * entries[BOOT_CPUCACHE_ENTRIES]; -+}; -+ -+/* -+ * The slab lists of all objects. -+ * Hopefully reduce the internal fragmentation -+ * NUMA: The spinlock could be moved from the kmem_cache_t -+ * into this structure, too. Figure out what causes -+ * fewer cross-node spinlock operations. 
-+ */ -+struct kmem_list3 { -+ struct list_head slabs_partial; /* partial list first, better asm code */ -+ struct list_head slabs_full; -+ struct list_head slabs_free; -+ unsigned long free_objects; -+ int free_touched; -+ unsigned long next_reap; -+ struct array_cache *shared; -+}; -+ -+#define LIST3_INIT(parent) \ -+ { \ -+ .slabs_full = LIST_HEAD_INIT(parent.slabs_full), \ -+ .slabs_partial = LIST_HEAD_INIT(parent.slabs_partial), \ -+ .slabs_free = LIST_HEAD_INIT(parent.slabs_free) \ -+ } -+#define list3_data(cachep) \ -+ (&(cachep)->lists) -+ -+/* NUMA: per-node */ -+#define list3_data_ptr(cachep, ptr) \ -+ list3_data(cachep) -+ -+/* -+ * kmem_cache_t -+ * -+ * manages a cache. -+ */ -+ -+struct kmem_cache_s { -+/* 1) per-cpu data, touched during every alloc/free */ -+ struct array_cache *array[NR_CPUS]; -+ unsigned int batchcount; -+ unsigned int limit; -+/* 2) touched by every alloc & free from the backend */ -+ struct kmem_list3 lists; -+ /* NUMA: kmem_3list_t *nodelists[MAX_NUMNODES] */ -+ unsigned int objsize; -+ unsigned int flags; /* constant flags */ -+ unsigned int num; /* # of objs per slab */ -+ unsigned int free_limit; /* upper limit of objects in the lists */ -+ spinlock_t spinlock; -+ -+/* 3) cache_grow/shrink */ -+ /* order of pgs per slab (2^n) */ -+ unsigned int gfporder; -+ -+ /* force GFP flags, e.g. 
GFP_DMA */ -+ unsigned int gfpflags; -+ -+ size_t colour; /* cache colouring range */ -+ unsigned int colour_off; /* colour offset */ -+ unsigned int colour_next; /* cache colouring */ -+ kmem_cache_t *slabp_cache; -+ unsigned int slab_size; -+ unsigned int dflags; /* dynamic flags */ -+ -+ /* constructor func */ -+ void (*ctor)(void *, kmem_cache_t *, unsigned long); -+ -+ /* de-constructor func */ -+ void (*dtor)(void *, kmem_cache_t *, unsigned long); -+ -+/* 4) cache creation/removal */ -+ const char *name; -+ struct list_head next; -+ -+/* 5) statistics */ -+#if SLAB_STATS -+ unsigned long num_active; -+ unsigned long num_allocations; -+ unsigned long high_mark; -+ unsigned long grown; -+ unsigned long reaped; -+ unsigned long errors; -+ unsigned long max_freeable; -+ atomic_t allochit; -+ atomic_t allocmiss; -+ atomic_t freehit; -+ atomic_t freemiss; -+#endif -+#if SLAB_DEBUG -+ int dbghead; -+ int reallen; -+#endif -+#ifdef CONFIG_USER_RESOURCE -+ unsigned int objuse; -+#endif -+}; -+ -+/* Macros for storing/retrieving the cachep and or slab from the -+ * global 'mem_map'. These are used to find the slab an obj belongs to. -+ * With kfree(), these are used to find the cache which an obj belongs to. 
-+ */ -+#define SET_PAGE_CACHE(pg,x) ((pg)->lru.next = (struct list_head *)(x)) -+#define GET_PAGE_CACHE(pg) ((kmem_cache_t *)(pg)->lru.next) -+#define SET_PAGE_SLAB(pg,x) ((pg)->lru.prev = (struct list_head *)(x)) -+#define GET_PAGE_SLAB(pg) ((struct slab *)(pg)->lru.prev) -+ -+#define CFLGS_OFF_SLAB (0x80000000UL) -+#define CFLGS_ENVIDS (0x04000000UL) -+#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB) -+#define ENVIDS(x) ((x)->flags & CFLGS_ENVIDS) -+ -+static inline unsigned int kmem_cache_memusage(kmem_cache_t *cache) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ return cache->objuse; -+#else -+ return 0; -+#endif -+} -+ -+static inline unsigned int kmem_obj_memusage(void *obj) -+{ -+ kmem_cache_t *cachep; -+ -+ cachep = GET_PAGE_CACHE(virt_to_page(obj)); -+ return kmem_cache_memusage(cachep); -+} -+ -+static inline void kmem_mark_nocharge(kmem_cache_t *cachep) -+{ -+ cachep->flags |= SLAB_NO_CHARGE; -+} -+ -+#endif /* __KMEM_CACHE_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/kmem_slab.h linux-2.6.8.1-ve022stab078/include/linux/kmem_slab.h ---- linux-2.6.8.1.orig/include/linux/kmem_slab.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/kmem_slab.h 2006-05-11 13:05:35.000000000 +0400 -@@ -0,0 +1,47 @@ -+#ifndef __KMEM_SLAB_H__ -+#define __KMEM_SLAB_H__ -+ -+/* -+ * kmem_bufctl_t: -+ * -+ * Bufctl's are used for linking objs within a slab -+ * linked offsets. -+ * -+ * This implementation relies on "struct page" for locating the cache & -+ * slab an object belongs to. -+ * This allows the bufctl structure to be small (one int), but limits -+ * the number of objects a slab (not a cache) can contain when off-slab -+ * bufctls are used. The limit is the size of the largest general cache -+ * that does not use off-slab slabs. -+ * For 32bit archs with 4 kB pages, is this 56. -+ * This is not serious, as it is only for large objects, when it is unwise -+ * to have too many per slab. 
-+ * Note: This limit can be raised by introducing a general cache whose size -+ * is less than 512 (PAGE_SIZE<<3), but greater than 256. -+ */ -+ -+#define BUFCTL_END (((kmem_bufctl_t)(~0U))-0) -+#define BUFCTL_FREE (((kmem_bufctl_t)(~0U))-1) -+#define SLAB_LIMIT (((kmem_bufctl_t)(~0U))-2) -+ -+/* -+ * struct slab -+ * -+ * Manages the objs in a slab. Placed either at the beginning of mem allocated -+ * for a slab, or allocated from an general cache. -+ * Slabs are chained into three list: fully used, partial, fully free slabs. -+ */ -+struct slab { -+ struct list_head list; -+ unsigned long colouroff; -+ void *s_mem; /* including colour offset */ -+ unsigned int inuse; /* num of objs active in slab */ -+ kmem_bufctl_t free; -+}; -+ -+static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp) -+{ -+ return (kmem_bufctl_t *)(slabp+1); -+} -+ -+#endif /* __KMEM_SLAB_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/list.h linux-2.6.8.1-ve022stab078/include/linux/list.h ---- linux-2.6.8.1.orig/include/linux/list.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/list.h 2006-05-11 13:05:40.000000000 +0400 -@@ -305,6 +305,9 @@ static inline void list_splice_init(stru - #define list_entry(ptr, type, member) \ - container_of(ptr, type, member) - -+#define list_first_entry(ptr, type, member) \ -+ container_of((ptr)->next, type, member) -+ - /** - * list_for_each - iterate over a list - * @pos: the &struct list_head to use as a loop counter. -@@ -397,6 +400,20 @@ static inline void list_splice_init(stru - prefetch(pos->member.next)) - - /** -+ * list_for_each_entry_continue_reverse - iterate backwards over list of given -+ * type continuing after existing point -+ * @pos: the type * to use as a loop counter. -+ * @head: the head for your list. -+ * @member: the name of the list_struct within the struct. 
-+ */ -+#define list_for_each_entry_continue_reverse(pos, head, member) \ -+ for (pos = list_entry(pos->member.prev, typeof(*pos), member), \ -+ prefetch(pos->member.prev); \ -+ &pos->member != (head); \ -+ pos = list_entry(pos->member.prev, typeof(*pos), member), \ -+ prefetch(pos->member.prev)) -+ -+/** - * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry - * @pos: the type * to use as a loop counter. - * @n: another type * to use as temporary storage -diff -uprN linux-2.6.8.1.orig/include/linux/major.h linux-2.6.8.1-ve022stab078/include/linux/major.h ---- linux-2.6.8.1.orig/include/linux/major.h 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/major.h 2006-05-11 13:05:40.000000000 +0400 -@@ -165,4 +165,7 @@ - - #define VIOTAPE_MAJOR 230 - -+#define UNNAMED_EXTRA_MAJOR 130 -+#define UNNAMED_EXTRA_MAJOR_COUNT 120 -+ - #endif -diff -uprN linux-2.6.8.1.orig/include/linux/mm.h linux-2.6.8.1-ve022stab078/include/linux/mm.h ---- linux-2.6.8.1.orig/include/linux/mm.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/mm.h 2006-05-11 13:05:40.000000000 +0400 -@@ -101,6 +101,8 @@ struct vm_area_struct { - #ifdef CONFIG_NUMA - struct mempolicy *vm_policy; /* NUMA policy for the VMA */ - #endif -+ /* rss counter by vma */ -+ unsigned long vm_rss; - }; - - /* -@@ -191,6 +193,9 @@ typedef unsigned long page_flags_t; - * moment. Note that we have no way to track which tasks are using - * a page. - */ -+struct user_beancounter; -+struct page_beancounter; -+ - struct page { - page_flags_t flags; /* Atomic flags, some possibly - * updated asynchronously */ -@@ -229,6 +234,10 @@ struct page { - void *virtual; /* Kernel virtual address (NULL if - not kmapped, ie. 
highmem) */ - #endif /* WANT_PAGE_VIRTUAL */ -+ union { -+ struct user_beancounter *page_ub; -+ struct page_beancounter *page_pbc; -+ } bc; - }; - - /* -@@ -496,7 +505,6 @@ int shmem_set_policy(struct vm_area_stru - struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, - unsigned long addr); - struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags); --void shmem_lock(struct file * file, int lock); - int shmem_zero_setup(struct vm_area_struct *); - - /* -@@ -624,7 +632,7 @@ extern struct vm_area_struct *vma_merge( - extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); - extern int split_vma(struct mm_struct *, - struct vm_area_struct *, unsigned long addr, int new_below); --extern void insert_vm_struct(struct mm_struct *, struct vm_area_struct *); -+extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); - extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *, - struct rb_node **, struct rb_node *); - extern struct vm_area_struct *copy_vma(struct vm_area_struct **, -@@ -709,6 +717,9 @@ extern struct vm_area_struct *find_exten - extern struct page * vmalloc_to_page(void *addr); - extern struct page * follow_page(struct mm_struct *mm, unsigned long address, - int write); -+extern struct page * follow_page_k(unsigned long address, int write); -+extern struct page * follow_page_pte(struct mm_struct *mm, -+ unsigned long address, int write, pte_t *pte); - extern int remap_page_range(struct vm_area_struct *vma, unsigned long from, - unsigned long to, unsigned long size, pgprot_t prot); - -@@ -724,5 +735,25 @@ extern struct vm_area_struct *get_gate_v - int in_gate_area(struct task_struct *task, unsigned long addr); - #endif - -+/* -+ * Common MM functions for inclusion in the VFS -+ * or in other stackable file systems. Some of these -+ * functions were in linux/mm/ C files. 
-+ * -+ */ -+static inline int sync_page(struct page *page) -+{ -+ struct address_space *mapping; -+ -+ /* -+ * FIXME, fercrissake. What is this barrier here for? -+ */ -+ smp_mb(); -+ mapping = page_mapping(page); -+ if (mapping && mapping->a_ops && mapping->a_ops->sync_page) -+ return mapping->a_ops->sync_page(page); -+ return 0; -+} -+ - #endif /* __KERNEL__ */ - #endif /* _LINUX_MM_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/mount.h linux-2.6.8.1-ve022stab078/include/linux/mount.h ---- linux-2.6.8.1.orig/include/linux/mount.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/mount.h 2006-05-11 13:05:40.000000000 +0400 -@@ -63,7 +63,7 @@ static inline void mntput(struct vfsmoun - - extern void free_vfsmnt(struct vfsmount *mnt); - extern struct vfsmount *alloc_vfsmnt(const char *name); --extern struct vfsmount *do_kern_mount(const char *fstype, int flags, -+extern struct vfsmount *do_kern_mount(struct file_system_type *type, int flags, - const char *name, void *data); - - struct nameidata; -diff -uprN linux-2.6.8.1.orig/include/linux/msdos_fs.h linux-2.6.8.1-ve022stab078/include/linux/msdos_fs.h ---- linux-2.6.8.1.orig/include/linux/msdos_fs.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/msdos_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -278,7 +278,7 @@ extern void fat_put_super(struct super_b - int fat_fill_super(struct super_block *sb, void *data, int silent, - struct inode_operations *fs_dir_inode_ops, int isvfat); - extern int fat_statfs(struct super_block *sb, struct kstatfs *buf); --extern void fat_write_inode(struct inode *inode, int wait); -+extern int fat_write_inode(struct inode *inode, int wait); - extern int fat_notify_change(struct dentry * dentry, struct iattr * attr); - - /* fat/misc.c */ -diff -uprN linux-2.6.8.1.orig/include/linux/namei.h linux-2.6.8.1-ve022stab078/include/linux/namei.h ---- linux-2.6.8.1.orig/include/linux/namei.h 2004-08-14 14:55:10.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/include/linux/namei.h 2006-05-11 13:05:40.000000000 +0400 -@@ -45,6 +45,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA - #define LOOKUP_CONTINUE 4 - #define LOOKUP_PARENT 16 - #define LOOKUP_NOALT 32 -+#define LOOKUP_NOAREACHECK 64 /* no area check on lookup */ -+#define LOOKUP_STRICT 128 /* no symlinks or other filesystems */ - /* - * Intent data - */ -diff -uprN linux-2.6.8.1.orig/include/linux/netdevice.h linux-2.6.8.1-ve022stab078/include/linux/netdevice.h ---- linux-2.6.8.1.orig/include/linux/netdevice.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netdevice.h 2006-05-11 13:05:42.000000000 +0400 -@@ -37,6 +37,7 @@ - #include <linux/config.h> - #include <linux/device.h> - #include <linux/percpu.h> -+#include <linux/ctype.h> - - struct divert_blk; - struct vlan_group; -@@ -245,6 +246,11 @@ struct netdev_boot_setup { - }; - #define NETDEV_BOOT_SETUP_MAX 8 - -+struct netdev_bc { -+ struct user_beancounter *exec_ub, *owner_ub; -+}; -+ -+#define netdev_bc(dev) (&(dev)->dev_bc) - - /* - * The DEVICE structure. -@@ -389,6 +395,7 @@ struct net_device - enum { NETREG_UNINITIALIZED=0, - NETREG_REGISTERING, /* called register_netdevice */ - NETREG_REGISTERED, /* completed register todo */ -+ NETREG_REGISTER_ERR, /* register todo failed */ - NETREG_UNREGISTERING, /* called unregister_netdevice */ - NETREG_UNREGISTERED, /* completed unregister todo */ - NETREG_RELEASED, /* called free_netdev */ -@@ -408,6 +415,8 @@ struct net_device - #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ - #define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */ - #define NETIF_F_LLTX 4096 /* LockLess TX */ -+#define NETIF_F_VIRTUAL 0x40000000 /* can be registered in ve */ -+#define NETIF_F_VENET 0x80000000 /* Device is VENET device */ - - /* Called after device is detached from network. 
*/ - void (*uninit)(struct net_device *dev); -@@ -477,11 +486,18 @@ struct net_device - struct divert_blk *divert; - #endif /* CONFIG_NET_DIVERT */ - -+ unsigned orig_mtu; /* MTU value before move to VE */ -+ struct ve_struct *owner_env; /* Owner VE of the interface */ -+ struct netdev_bc dev_bc; -+ - /* class/net/name entry */ - struct class_device class_dev; - struct net_device_stats* (*last_stats)(struct net_device *); - /* how much padding had been added by alloc_netdev() */ - int padded; -+ -+ /* List entry in global devices list to keep track of their names assignment */ -+ struct list_head dev_global_list_entry; - }; - - #define NETDEV_ALIGN 32 -@@ -514,8 +530,21 @@ struct packet_type { - - extern struct net_device loopback_dev; /* The loopback */ - extern struct net_device *dev_base; /* All devices */ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define visible_loopback_dev (*get_exec_env()->_loopback_dev) -+#define dev_base (get_exec_env()->_net_dev_base) -+#define visible_dev_head(x) (&(x)->_net_dev_head) -+#define visible_dev_index_head(x) (&(x)->_net_dev_index_head) -+#else -+#define visible_loopback_dev loopback_dev -+#define visible_dev_head(x) NULL -+#define visible_dev_index_head(x) NULL -+#endif - extern rwlock_t dev_base_lock; /* Device list lock */ - -+struct hlist_head *dev_name_hash(const char *name, struct ve_struct *env); -+struct hlist_head *dev_index_hash(int ifindex, struct ve_struct *env); -+ - extern int netdev_boot_setup_add(char *name, struct ifmap *map); - extern int netdev_boot_setup_check(struct net_device *dev); - extern unsigned long netdev_boot_base(const char *prefix, int unit); -@@ -540,6 +569,7 @@ extern int dev_alloc_name(struct net_de - extern int dev_open(struct net_device *dev); - extern int dev_close(struct net_device *dev); - extern int dev_queue_xmit(struct sk_buff *skb); -+extern int dev_set_mtu(struct net_device *dev, int new_mtu); - extern int register_netdevice(struct net_device *dev); - 
extern int unregister_netdevice(struct net_device *dev); - extern void free_netdev(struct net_device *dev); -@@ -547,7 +577,8 @@ extern void synchronize_net(void); - extern int register_netdevice_notifier(struct notifier_block *nb); - extern int unregister_netdevice_notifier(struct notifier_block *nb); - extern int call_netdevice_notifiers(unsigned long val, void *v); --extern int dev_new_index(void); -+extern int dev_new_index(struct net_device *dev); -+extern void dev_free_index(struct net_device *dev); - extern struct net_device *dev_get_by_index(int ifindex); - extern struct net_device *__dev_get_by_index(int ifindex); - extern int dev_restart(struct net_device *dev); -@@ -946,6 +977,18 @@ extern int skb_checksum_help(struct sk_b - extern char *net_sysctl_strdup(const char *s); - #endif - -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static inline int ve_is_dev_movable(struct net_device *dev) -+{ -+ return !(dev->features & NETIF_F_VIRTUAL); -+} -+#else -+static inline int ve_is_dev_movable(struct net_device *dev) -+{ -+ return 0; -+} -+#endif -+ - #endif /* __KERNEL__ */ - - #endif /* _LINUX_DEV_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter.h linux-2.6.8.1-ve022stab078/include/linux/netfilter.h ---- linux-2.6.8.1.orig/include/linux/netfilter.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter.h 2006-05-11 13:05:40.000000000 +0400 -@@ -25,6 +25,8 @@ - #define NFC_UNKNOWN 0x4000 - #define NFC_ALTERED 0x8000 - -+#define NFC_IPT_MASK (0x00FFFFFF) -+ - #ifdef __KERNEL__ - #include <linux/config.h> - #ifdef CONFIG_NETFILTER -@@ -93,6 +95,9 @@ struct nf_info - int nf_register_hook(struct nf_hook_ops *reg); - void nf_unregister_hook(struct nf_hook_ops *reg); - -+int visible_nf_register_hook(struct nf_hook_ops *reg); -+int visible_nf_unregister_hook(struct nf_hook_ops *reg); -+ - /* Functions to register get/setsockopt ranges (non-inclusive). You - need to check permissions yourself! 
*/ - int nf_register_sockopt(struct nf_sockopt_ops *reg); -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack.h 2006-05-11 13:05:45.000000000 +0400 -@@ -158,6 +158,10 @@ struct ip_conntrack_expect - - struct ip_conntrack_helper; - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/ve.h> -+#endif -+ - struct ip_conntrack - { - /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, -@@ -173,6 +177,10 @@ struct ip_conntrack - /* Timer function; drops refcnt when it goes off. */ - struct timer_list timeout; - -+#ifdef CONFIG_VE_IPTABLES -+ /* VE struct pointer for timers */ -+ struct ve_ip_conntrack *ct_env; -+#endif - /* If we're expecting another related connection, this will be - in expected linked list */ - struct list_head sibling_list; -@@ -212,6 +220,9 @@ struct ip_conntrack - /* get master conntrack via master expectation */ - #define master_ct(conntr) (conntr->master ? conntr->master->expectant : NULL) - -+/* add conntrack entry to hash tables */ -+extern void ip_conntrack_hash_insert(struct ip_conntrack *ct); -+ - /* Alter reply tuple (maybe alter helper). If it's already taken, - return 0 and don't do alteration. 
*/ - extern int -@@ -231,10 +242,17 @@ ip_conntrack_get(struct sk_buff *skb, en - /* decrement reference count on a conntrack */ - extern inline void ip_conntrack_put(struct ip_conntrack *ct); - -+/* allocate conntrack structure */ -+extern struct ip_conntrack *ip_conntrack_alloc(struct user_beancounter *ub); -+ - /* find unconfirmed expectation based on tuple */ - struct ip_conntrack_expect * - ip_conntrack_expect_find_get(const struct ip_conntrack_tuple *tuple); - -+/* insert expecation into lists */ -+void ip_conntrack_expect_insert(struct ip_conntrack_expect *new, -+ struct ip_conntrack *related_to); -+ - /* decrement reference count on an expectation */ - void ip_conntrack_expect_put(struct ip_conntrack_expect *exp); - -@@ -257,7 +275,7 @@ extern struct ip_conntrack ip_conntrack_ - - /* Returns new sk_buff, or NULL */ - struct sk_buff * --ip_ct_gather_frags(struct sk_buff *skb); -+ip_ct_gather_frags(struct sk_buff *skb, u_int32_t user); - - /* Delete all conntracks which match. */ - extern void -@@ -271,6 +289,7 @@ static inline int is_confirmed(struct ip - } - - extern unsigned int ip_conntrack_htable_size; -+extern int ip_conntrack_enable_ve0; - - /* eg. 
PROVIDES_CONNTRACK(ftp); */ - #define PROVIDES_CONNTRACK(name) \ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_core.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_core.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_core.h 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_core.h 2006-05-11 13:05:40.000000000 +0400 -@@ -47,8 +47,37 @@ static inline int ip_conntrack_confirm(s - return NF_ACCEPT; - } - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ip_conntrack_hash \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_hash) -+#define ve_ip_conntrack_expect_list \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_expect_list) -+#define ve_ip_conntrack_protocol_list \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_protocol_list) -+#define ve_ip_conntrack_helpers \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_helpers) -+#define ve_ip_conntrack_count \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_count) -+#define ve_ip_conntrack_max \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_max) -+#define ve_ip_conntrack_destroyed \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_destroyed) -+#else -+#define ve_ip_conntrack_hash ip_conntrack_hash -+#define ve_ip_conntrack_expect_list ip_conntrack_expect_list -+#define ve_ip_conntrack_protocol_list protocol_list -+#define ve_ip_conntrack_helpers helpers -+#define ve_ip_conntrack_count ip_conntrack_count -+#define ve_ip_conntrack_max ip_conntrack_max -+#define ve_ip_conntrack_destroyed ip_conntrack_destroyed -+#endif /* CONFIG_VE_IPTABLES */ -+ - extern struct list_head *ip_conntrack_hash; - extern struct list_head ip_conntrack_expect_list; -+extern atomic_t ip_conntrack_count; -+extern unsigned long ** tcp_timeouts; -+ - DECLARE_RWLOCK_EXTERN(ip_conntrack_lock); - DECLARE_RWLOCK_EXTERN(ip_conntrack_expect_tuple_lock); - #endif /* _IP_CONNTRACK_CORE_H */ -diff -uprN 
linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_ftp.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_ftp.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_ftp.h 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_ftp.h 2006-05-11 13:05:26.000000000 +0400 -@@ -4,11 +4,6 @@ - - #ifdef __KERNEL__ - --#include <linux/netfilter_ipv4/lockhelp.h> -- --/* Protects ftp part of conntracks */ --DECLARE_LOCK_EXTERN(ip_ftp_lock); -- - #define FTP_PORT 21 - - #endif /* __KERNEL__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_helper.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_helper.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_helper.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_helper.h 2006-05-11 13:05:40.000000000 +0400 -@@ -33,6 +33,9 @@ struct ip_conntrack_helper - extern int ip_conntrack_helper_register(struct ip_conntrack_helper *); - extern void ip_conntrack_helper_unregister(struct ip_conntrack_helper *); - -+extern int visible_ip_conntrack_helper_register(struct ip_conntrack_helper *); -+extern void visible_ip_conntrack_helper_unregister(struct ip_conntrack_helper *); -+ - extern struct ip_conntrack_helper *ip_ct_find_helper(const struct ip_conntrack_tuple *tuple); - - -@@ -46,4 +49,5 @@ extern int ip_conntrack_change_expect(st - struct ip_conntrack_tuple *newtuple); - extern void ip_conntrack_unexpect_related(struct ip_conntrack_expect *exp); - -+extern struct list_head helpers; - #endif /*_IP_CONNTRACK_HELPER_H*/ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_irc.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_irc.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_irc.h 2004-08-14 14:55:10.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_irc.h 2006-05-11 13:05:26.000000000 +0400 -@@ -33,13 +33,8 @@ struct ip_ct_irc_master { - - #ifdef __KERNEL__ - --#include <linux/netfilter_ipv4/lockhelp.h> -- - #define IRC_PORT 6667 - --/* Protects irc part of conntracks */ --DECLARE_LOCK_EXTERN(ip_irc_lock); -- - #endif /* __KERNEL__ */ - - #endif /* _IP_CONNTRACK_IRC_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_protocol.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_protocol.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_conntrack_protocol.h 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_conntrack_protocol.h 2006-05-11 13:05:40.000000000 +0400 -@@ -58,9 +58,35 @@ struct ip_conntrack_protocol - extern int ip_conntrack_protocol_register(struct ip_conntrack_protocol *proto); - extern void ip_conntrack_protocol_unregister(struct ip_conntrack_protocol *proto); - -+extern int visible_ip_conntrack_protocol_register( -+ struct ip_conntrack_protocol *proto); -+extern void visible_ip_conntrack_protocol_unregister( -+ struct ip_conntrack_protocol *proto); -+ -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ip_ct_tcp_timeouts \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_tcp_timeouts) -+#define ve_ip_ct_udp_timeout \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_udp_timeout) -+#define ve_ip_ct_udp_timeout_stream \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_udp_timeout_stream) -+#define ve_ip_ct_icmp_timeout \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_icmp_timeout) -+#define ve_ip_ct_generic_timeout \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_generic_timeout) -+#else -+#define ve_ip_ct_tcp_timeouts *tcp_timeouts -+#define ve_ip_ct_udp_timeout ip_ct_udp_timeout -+#define ve_ip_ct_udp_timeout_stream ip_ct_udp_timeout_stream -+#define ve_ip_ct_icmp_timeout ip_ct_icmp_timeout -+#define ve_ip_ct_generic_timeout 
ip_ct_generic_timeout -+#endif -+ - /* Existing built-in protocols */ - extern struct ip_conntrack_protocol ip_conntrack_protocol_tcp; - extern struct ip_conntrack_protocol ip_conntrack_protocol_udp; - extern struct ip_conntrack_protocol ip_conntrack_protocol_icmp; - extern int ip_conntrack_protocol_tcp_init(void); -+extern struct list_head protocol_list; - #endif /*_IP_CONNTRACK_PROTOCOL_H*/ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat.h 2006-05-11 13:05:49.000000000 +0400 -@@ -1,5 +1,6 @@ - #ifndef _IP_NAT_H - #define _IP_NAT_H -+#include <linux/config.h> - #include <linux/netfilter_ipv4.h> - #include <linux/netfilter_ipv4/ip_conntrack_tuple.h> - -@@ -55,6 +56,23 @@ struct ip_nat_multi_range - struct ip_nat_range range[1]; - }; - -+#ifdef CONFIG_COMPAT -+#include <net/compat.h> -+ -+struct compat_ip_nat_range -+{ -+ compat_uint_t flags; -+ u_int32_t min_ip, max_ip; -+ union ip_conntrack_manip_proto min, max; -+}; -+ -+struct compat_ip_nat_multi_range -+{ -+ compat_uint_t rangesize; -+ struct compat_ip_nat_range range[1]; -+}; -+#endif -+ - /* Worst case: local-out manip + 1 post-routing, and reverse dirn. 
*/ - #define IP_NAT_MAX_MANIPS (2*3) - -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_core.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_core.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_core.h 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_core.h 2006-05-11 13:05:45.000000000 +0400 -@@ -25,9 +25,20 @@ extern void replace_in_hashes(struct ip_ - struct ip_nat_info *info); - extern void place_in_hashes(struct ip_conntrack *conntrack, - struct ip_nat_info *info); -+extern int ip_nat_install_conntrack(struct ip_conntrack *conntrack, int helper); - - /* Built-in protocols. */ - extern struct ip_nat_protocol ip_nat_protocol_tcp; - extern struct ip_nat_protocol ip_nat_protocol_udp; - extern struct ip_nat_protocol ip_nat_protocol_icmp; -+ -+#ifdef CONFIG_VE_IPTABLES -+ -+#include <linux/sched.h> -+#define ve_ip_nat_protos \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_protos) -+#else -+#define ve_ip_nat_protos protos -+#endif /* CONFIG_VE_IPTABLES */ -+ - #endif /* _IP_NAT_CORE_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_helper.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_helper.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_helper.h 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_helper.h 2006-05-11 13:05:40.000000000 +0400 -@@ -38,10 +38,18 @@ struct ip_nat_helper - struct ip_nat_info *info); - }; - -+#ifdef CONFIG_VE_IPTABLES -+#define ve_ip_nat_helpers \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_helpers) -+#else - extern struct list_head helpers; -+#define ve_ip_nat_helpers helpers -+#endif - - extern int ip_nat_helper_register(struct ip_nat_helper *me); - extern void ip_nat_helper_unregister(struct ip_nat_helper *me); -+extern int visible_ip_nat_helper_register(struct ip_nat_helper *me); -+extern void visible_ip_nat_helper_unregister(struct 
ip_nat_helper *me); - - /* These return true or false. */ - extern int ip_nat_mangle_tcp_packet(struct sk_buff **skb, -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_protocol.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_protocol.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_protocol.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_protocol.h 2006-05-11 13:05:40.000000000 +0400 -@@ -51,6 +51,9 @@ struct ip_nat_protocol - extern int ip_nat_protocol_register(struct ip_nat_protocol *proto); - extern void ip_nat_protocol_unregister(struct ip_nat_protocol *proto); - -+extern int visible_ip_nat_protocol_register(struct ip_nat_protocol *proto); -+extern void visible_ip_nat_protocol_unregister(struct ip_nat_protocol *proto); -+ - extern int init_protocols(void) __init; - extern void cleanup_protocols(void); - extern struct ip_nat_protocol *find_nat_proto(u_int16_t protonum); -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_rule.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_rule.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_nat_rule.h 2004-08-14 14:56:15.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_nat_rule.h 2006-05-11 13:05:40.000000000 +0400 -@@ -6,7 +6,7 @@ - - #ifdef __KERNEL__ - --extern int ip_nat_rule_init(void) __init; -+extern int ip_nat_rule_init(void); - extern void ip_nat_rule_cleanup(void); - extern int ip_nat_rule_find(struct sk_buff **pskb, - unsigned int hooknum, -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_tables.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_tables.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ip_tables.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ip_tables.h 2006-05-11 13:05:49.000000000 +0400 -@@ -16,6 +16,7 @@ - #define _IPTABLES_H - - #ifdef 
__KERNEL__ -+#include <linux/config.h> - #include <linux/if.h> - #include <linux/types.h> - #include <linux/in.h> -@@ -341,6 +342,12 @@ static DECLARE_MUTEX(ipt_mutex); - #include <linux/init.h> - extern void ipt_init(void) __init; - -+#ifdef CONFIG_COMPAT -+#define COMPAT_TO_USER 1 -+#define COMPAT_FROM_USER -1 -+#define COMPAT_CALC_SIZE 0 -+#endif -+ - struct ipt_match - { - struct list_head list; -@@ -370,6 +377,9 @@ struct ipt_match - /* Called when entry of this type deleted. */ - void (*destroy)(void *matchinfo, unsigned int matchinfosize); - -+#ifdef CONFIG_COMPAT -+ int (*compat)(void *match, void **dstptr, int *size, int convert); -+#endif - /* Set this to THIS_MODULE. */ - struct module *me; - }; -@@ -404,6 +414,9 @@ struct ipt_target - const void *targinfo, - void *userdata); - -+#ifdef CONFIG_COMPAT -+ int (*compat)(void *target, void **dstptr, int *size, int convert); -+#endif - /* Set this to THIS_MODULE. */ - struct module *me; - }; -@@ -416,9 +429,15 @@ arpt_find_target_lock(const char *name, - extern int ipt_register_target(struct ipt_target *target); - extern void ipt_unregister_target(struct ipt_target *target); - -+extern int visible_ipt_register_target(struct ipt_target *target); -+extern void visible_ipt_unregister_target(struct ipt_target *target); -+ - extern int ipt_register_match(struct ipt_match *match); - extern void ipt_unregister_match(struct ipt_match *match); - -+extern int visible_ipt_register_match(struct ipt_match *match); -+extern void visible_ipt_unregister_match(struct ipt_match *match); -+ - /* Furniture shopping... 
*/ - struct ipt_table - { -@@ -453,5 +472,75 @@ extern unsigned int ipt_do_table(struct - void *userdata); - - #define IPT_ALIGN(s) (((s) + (__alignof__(struct ipt_entry)-1)) & ~(__alignof__(struct ipt_entry)-1)) -+ -+#ifdef CONFIG_COMPAT -+#include <net/compat.h> -+ -+struct compat_ipt_counters -+{ -+ u_int32_t cnt[4]; -+}; -+ -+struct compat_ipt_counters_info -+{ -+ char name[IPT_TABLE_MAXNAMELEN]; -+ compat_uint_t num_counters; -+ struct compat_ipt_counters counters[0]; -+}; -+ -+struct compat_ipt_getinfo -+{ -+ char name[IPT_TABLE_MAXNAMELEN]; -+ compat_uint_t valid_hooks; -+ compat_uint_t hook_entry[NF_IP_NUMHOOKS]; -+ compat_uint_t underflow[NF_IP_NUMHOOKS]; -+ compat_uint_t num_entries; -+ compat_uint_t size; -+}; -+ -+struct compat_ipt_entry -+{ -+ struct ipt_ip ip; -+ compat_uint_t nfcache; -+ u_int16_t target_offset; -+ u_int16_t next_offset; -+ compat_uint_t comefrom; -+ struct compat_ipt_counters counters; -+ unsigned char elems[0]; -+}; -+ -+struct compat_ipt_entry_match -+{ -+ union { -+ struct { -+ u_int16_t match_size; -+ char name[IPT_FUNCTION_MAXNAMELEN]; -+ } user; -+ u_int16_t match_size; -+ } u; -+ unsigned char data[0]; -+}; -+ -+struct compat_ipt_entry_target -+{ -+ union { -+ struct { -+ u_int16_t target_size; -+ char name[IPT_FUNCTION_MAXNAMELEN]; -+ } user; -+ u_int16_t target_size; -+ } u; -+ unsigned char data[0]; -+}; -+ -+#define COMPAT_IPT_ALIGN(s) (((s) + (__alignof__(struct compat_ipt_entry)-1)) \ -+ & ~(__alignof__(struct compat_ipt_entry)-1)) -+ -+extern int ipt_match_align_compat(void *match, void **dstptr, -+ int *size, int off, int convert); -+extern int ipt_target_align_compat(void *target, void **dstptr, -+ int *size, int off, int convert); -+ -+#endif /* CONFIG_COMPAT */ - #endif /*__KERNEL__*/ - #endif /* _IPTABLES_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_conntrack.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_conntrack.h ---- 
linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_conntrack.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_conntrack.h 2006-05-11 13:05:49.000000000 +0400 -@@ -5,6 +5,8 @@ - #ifndef _IPT_CONNTRACK_H - #define _IPT_CONNTRACK_H - -+#include <linux/config.h> -+ - #define IPT_CONNTRACK_STATE_BIT(ctinfo) (1 << ((ctinfo)%IP_CT_IS_REPLY+1)) - #define IPT_CONNTRACK_STATE_INVALID (1 << 0) - -@@ -36,4 +38,21 @@ struct ipt_conntrack_info - /* Inverse flags */ - u_int8_t invflags; - }; -+ -+#ifdef CONFIG_COMPAT -+struct compat_ipt_conntrack_info -+{ -+ compat_uint_t statemask, statusmask; -+ -+ struct ip_conntrack_tuple tuple[IP_CT_DIR_MAX]; -+ struct in_addr sipmsk[IP_CT_DIR_MAX], dipmsk[IP_CT_DIR_MAX]; -+ -+ compat_ulong_t expires_min, expires_max; -+ -+ /* Flags word */ -+ u_int8_t flags; -+ /* Inverse flags */ -+ u_int8_t invflags; -+}; -+#endif - #endif /*_IPT_CONNTRACK_H*/ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_helper.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_helper.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_helper.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_helper.h 2006-05-11 13:05:49.000000000 +0400 -@@ -1,8 +1,17 @@ - #ifndef _IPT_HELPER_H - #define _IPT_HELPER_H - -+#include <linux/config.h> -+ - struct ipt_helper_info { - int invert; - char name[30]; - }; -+ -+#ifdef CONFIG_COMPAT -+struct compat_ipt_helper_info { -+ compat_int_t invert; -+ char name[30]; -+}; -+#endif - #endif /* _IPT_HELPER_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_limit.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_limit.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_limit.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_limit.h 2006-05-11 13:05:49.000000000 +0400 -@@ -1,6 +1,8 @@ - #ifndef _IPT_RATE_H - #define 
_IPT_RATE_H - -+#include <linux/config.h> -+ - /* timings are in milliseconds. */ - #define IPT_LIMIT_SCALE 10000 - -@@ -18,4 +20,20 @@ struct ipt_rateinfo { - /* Ugly, ugly fucker. */ - struct ipt_rateinfo *master; - }; -+ -+#ifdef CONFIG_COMPAT -+struct compat_ipt_rateinfo { -+ u_int32_t avg; /* Average secs between packets * scale */ -+ u_int32_t burst; /* Period multiplier for upper limit. */ -+ -+ /* Used internally by the kernel */ -+ compat_ulong_t prev; -+ u_int32_t credit; -+ u_int32_t credit_cap, cost; -+ -+ /* Ugly, ugly fucker. */ -+ compat_uptr_t master; -+}; -+#endif -+ - #endif /*_IPT_RATE_H*/ -diff -uprN linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_state.h linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_state.h ---- linux-2.6.8.1.orig/include/linux/netfilter_ipv4/ipt_state.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netfilter_ipv4/ipt_state.h 2006-05-11 13:05:49.000000000 +0400 -@@ -1,6 +1,8 @@ - #ifndef _IPT_STATE_H - #define _IPT_STATE_H - -+#include <linux/config.h> -+ - #define IPT_STATE_BIT(ctinfo) (1 << ((ctinfo)%IP_CT_IS_REPLY+1)) - #define IPT_STATE_INVALID (1 << 0) - -@@ -10,4 +12,11 @@ struct ipt_state_info - { - unsigned int statemask; - }; -+ -+#ifdef CONFIG_COMPAT -+struct compat_ipt_state_info -+{ -+ compat_uint_t statemask; -+}; -+#endif - #endif /*_IPT_STATE_H*/ -diff -uprN linux-2.6.8.1.orig/include/linux/netlink.h linux-2.6.8.1-ve022stab078/include/linux/netlink.h ---- linux-2.6.8.1.orig/include/linux/netlink.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/netlink.h 2006-05-11 13:05:45.000000000 +0400 -@@ -100,6 +100,20 @@ enum { - - #include <linux/capability.h> - -+struct netlink_opt -+{ -+ u32 pid; -+ unsigned groups; -+ u32 dst_pid; -+ unsigned dst_groups; -+ unsigned long state; -+ int (*handler)(int unit, struct sk_buff *skb); -+ wait_queue_head_t wait; -+ struct netlink_callback *cb; -+ spinlock_t cb_lock; -+ void 
(*data_ready)(struct sock *sk, int bytes); -+}; -+ - struct netlink_skb_parms - { - struct ucred creds; /* Skb credentials */ -@@ -129,14 +143,13 @@ extern int netlink_unregister_notifier(s - /* finegrained unicast helpers: */ - struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid); - struct sock *netlink_getsockbyfilp(struct file *filp); --int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo); - void netlink_detachskb(struct sock *sk, struct sk_buff *skb); - int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); - - /* finegrained unicast helpers: */ - struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid); - struct sock *netlink_getsockbyfilp(struct file *filp); --int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo); -+int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo, struct sock *ssk); - void netlink_detachskb(struct sock *sk, struct sk_buff *skb); - int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); - -diff -uprN linux-2.6.8.1.orig/include/linux/nfcalls.h linux-2.6.8.1-ve022stab078/include/linux/nfcalls.h ---- linux-2.6.8.1.orig/include/linux/nfcalls.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/nfcalls.h 2006-05-11 13:05:42.000000000 +0400 -@@ -0,0 +1,224 @@ -+/* -+ * include/linux/nfcalls.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _LINUX_NFCALLS_H -+#define _LINUX_NFCALLS_H -+ -+#include <linux/rcupdate.h> -+ -+#ifdef CONFIG_MODULES -+extern struct module no_module; -+ -+#define DECL_KSYM_MODULE(name) \ -+ extern struct module *vz_mod_##name -+#define DECL_KSYM_CALL(type, name, args) \ -+ extern type (*vz_##name) args -+ -+#define INIT_KSYM_MODULE(name) \ -+ struct module *vz_mod_##name = &no_module; \ -+ EXPORT_SYMBOL(vz_mod_##name) -+#define INIT_KSYM_CALL(type, name, args) \ -+ type (*vz_##name) args; \ -+ EXPORT_SYMBOL(vz_##name) -+ -+#define __KSYMERRCALL(err, type, mod, name, args) \ -+({ \ -+ type ret = (type)err; \ -+ if (!__vzksym_module_get(vz_mod_##mod)) { \ -+ if (vz_##name) \ -+ ret = ((*vz_##name)args); \ -+ __vzksym_module_put(vz_mod_##mod); \ -+ } \ -+ ret; \ -+}) -+#define __KSYMSAFECALL_VOID(mod, name, args) \ -+do { \ -+ if (!__vzksym_module_get(vz_mod_##mod)) { \ -+ if (vz_##name) \ -+ ((*vz_##name)args); \ -+ __vzksym_module_put(vz_mod_##mod); \ -+ } \ -+} while (0) -+#else -+#define DECL_KSYM_CALL(type, name, args) \ -+ extern type name args -+#define INIT_KSYM_MODULE(name) -+#define INIT_KSYM_CALL(type, name, args) \ -+ type name args -+#define __KSYMERRCALL(err, type, mod, name, args) ((*name)args) -+#define __KSYMSAFECALL_VOID(mod, name, args) ((*name)args) -+#endif -+ -+#define KSYMERRCALL(err, mod, name, args) \ -+ __KSYMERRCALL(err, int, mod, name, args) -+#define KSYMSAFECALL(type, mod, name, args) \ -+ __KSYMERRCALL(0, type, mod, name, args) -+#define KSYMSAFECALL_VOID(mod, name, args) \ -+ __KSYMSAFECALL_VOID(mod, name, args) -+ -+#if defined(CONFIG_VE) && defined(CONFIG_MODULES) -+/* should be called _after_ KSYMRESOLVE's */ -+#define KSYMMODRESOLVE(name) \ -+ __vzksym_modresolve(&vz_mod_##name, THIS_MODULE) -+#define KSYMMODUNRESOLVE(name) \ -+ __vzksym_modunresolve(&vz_mod_##name) -+ -+#define KSYMRESOLVE(name) \ -+ vz_##name = &name -+#define KSYMUNRESOLVE(name) \ -+ vz_##name = NULL -+#else -+#define KSYMRESOLVE(name) do { } while 
(0) -+#define KSYMUNRESOLVE(name) do { } while (0) -+#define KSYMMODRESOLVE(name) do { } while (0) -+#define KSYMMODUNRESOLVE(name) do { } while (0) -+#endif -+ -+#ifdef CONFIG_MODULES -+static inline void __vzksym_modresolve(struct module **modp, struct module *mod) -+{ -+ /* -+ * we want to be sure, that pointer updates are visible first: -+ * 1. wmb() is here only for piece of sure -+ * (note, no rmb() in KSYMSAFECALL) -+ * 2. synchronize_kernel() guarantees that updates are visible -+ * on all cpus and allows us to remove rmb() in KSYMSAFECALL -+ */ -+ wmb(); synchronize_kernel(); -+ *modp = mod; -+ /* just to be sure, our changes are visible as soon as possible */ -+ wmb(); synchronize_kernel(); -+} -+ -+static inline void __vzksym_modunresolve(struct module **modp) -+{ -+ /* -+ * try_module_get() in KSYMSAFECALL should fail at this moment since -+ * THIS_MODULE in in unloading state (we should be called from fini), -+ * no need to syncronize pointers/ve_module updates. -+ */ -+ *modp = &no_module; -+ /* -+ * synchronize_kernel() guarantees here that we see -+ * updated module pointer before the module really gets away -+ */ -+ synchronize_kernel(); -+} -+ -+static inline int __vzksym_module_get(struct module *mod) -+{ -+ /* -+ * we want to avoid rmb(), so use synchronize_kernel() in KSYMUNRESOLVE -+ * and smp_read_barrier_depends() here... 
-+ */ -+ smp_read_barrier_depends(); /* for module loading */ -+ if (!try_module_get(mod)) -+ return -EBUSY; -+ -+ return 0; -+} -+ -+static inline void __vzksym_module_put(struct module *mod) -+{ -+ module_put(mod); -+} -+#endif -+ -+#if defined(CONFIG_VE_IPTABLES) -+#ifdef CONFIG_MODULES -+DECL_KSYM_MODULE(ip_tables); -+DECL_KSYM_MODULE(iptable_filter); -+DECL_KSYM_MODULE(iptable_mangle); -+DECL_KSYM_MODULE(ipt_limit); -+DECL_KSYM_MODULE(ipt_multiport); -+DECL_KSYM_MODULE(ipt_tos); -+DECL_KSYM_MODULE(ipt_TOS); -+DECL_KSYM_MODULE(ipt_REJECT); -+DECL_KSYM_MODULE(ipt_TCPMSS); -+DECL_KSYM_MODULE(ipt_tcpmss); -+DECL_KSYM_MODULE(ipt_ttl); -+DECL_KSYM_MODULE(ipt_LOG); -+DECL_KSYM_MODULE(ipt_length); -+DECL_KSYM_MODULE(ip_conntrack); -+DECL_KSYM_MODULE(ip_conntrack_ftp); -+DECL_KSYM_MODULE(ip_conntrack_irc); -+DECL_KSYM_MODULE(ipt_conntrack); -+DECL_KSYM_MODULE(ipt_state); -+DECL_KSYM_MODULE(ipt_helper); -+DECL_KSYM_MODULE(iptable_nat); -+DECL_KSYM_MODULE(ip_nat_ftp); -+DECL_KSYM_MODULE(ip_nat_irc); -+DECL_KSYM_MODULE(ipt_REDIRECT); -+#endif -+ -+struct sk_buff; -+ -+DECL_KSYM_CALL(int, init_netfilter, (void)); -+DECL_KSYM_CALL(int, init_iptables, (void)); -+DECL_KSYM_CALL(int, init_iptable_filter, (void)); -+DECL_KSYM_CALL(int, init_iptable_mangle, (void)); -+DECL_KSYM_CALL(int, init_iptable_limit, (void)); -+DECL_KSYM_CALL(int, init_iptable_multiport, (void)); -+DECL_KSYM_CALL(int, init_iptable_tos, (void)); -+DECL_KSYM_CALL(int, init_iptable_TOS, (void)); -+DECL_KSYM_CALL(int, init_iptable_REJECT, (void)); -+DECL_KSYM_CALL(int, init_iptable_TCPMSS, (void)); -+DECL_KSYM_CALL(int, init_iptable_tcpmss, (void)); -+DECL_KSYM_CALL(int, init_iptable_ttl, (void)); -+DECL_KSYM_CALL(int, init_iptable_LOG, (void)); -+DECL_KSYM_CALL(int, init_iptable_length, (void)); -+DECL_KSYM_CALL(int, init_iptable_conntrack, (void)); -+DECL_KSYM_CALL(int, init_iptable_ftp, (void)); -+DECL_KSYM_CALL(int, init_iptable_irc, (void)); -+DECL_KSYM_CALL(int, init_iptable_conntrack_match, (void)); 
-+DECL_KSYM_CALL(int, init_iptable_state, (void)); -+DECL_KSYM_CALL(int, init_iptable_helper, (void)); -+DECL_KSYM_CALL(int, init_iptable_nat, (void)); -+DECL_KSYM_CALL(int, init_iptable_nat_ftp, (void)); -+DECL_KSYM_CALL(int, init_iptable_nat_irc, (void)); -+DECL_KSYM_CALL(int, init_iptable_REDIRECT, (void)); -+DECL_KSYM_CALL(void, fini_iptable_nat_irc, (void)); -+DECL_KSYM_CALL(void, fini_iptable_nat_ftp, (void)); -+DECL_KSYM_CALL(void, fini_iptable_nat, (void)); -+DECL_KSYM_CALL(void, fini_iptable_helper, (void)); -+DECL_KSYM_CALL(void, fini_iptable_state, (void)); -+DECL_KSYM_CALL(void, fini_iptable_conntrack_match, (void)); -+DECL_KSYM_CALL(void, fini_iptable_irc, (void)); -+DECL_KSYM_CALL(void, fini_iptable_ftp, (void)); -+DECL_KSYM_CALL(void, fini_iptable_conntrack, (void)); -+DECL_KSYM_CALL(void, fini_iptable_length, (void)); -+DECL_KSYM_CALL(void, fini_iptable_LOG, (void)); -+DECL_KSYM_CALL(void, fini_iptable_ttl, (void)); -+DECL_KSYM_CALL(void, fini_iptable_tcpmss, (void)); -+DECL_KSYM_CALL(void, fini_iptable_TCPMSS, (void)); -+DECL_KSYM_CALL(void, fini_iptable_REJECT, (void)); -+DECL_KSYM_CALL(void, fini_iptable_TOS, (void)); -+DECL_KSYM_CALL(void, fini_iptable_tos, (void)); -+DECL_KSYM_CALL(void, fini_iptable_multiport, (void)); -+DECL_KSYM_CALL(void, fini_iptable_limit, (void)); -+DECL_KSYM_CALL(void, fini_iptable_filter, (void)); -+DECL_KSYM_CALL(void, fini_iptable_mangle, (void)); -+DECL_KSYM_CALL(void, fini_iptables, (void)); -+DECL_KSYM_CALL(void, fini_netfilter, (void)); -+DECL_KSYM_CALL(void, fini_iptable_REDIRECT, (void)); -+ -+DECL_KSYM_CALL(void, ipt_flush_table, (struct ipt_table *table)); -+#endif /* CONFIG_VE_IPTABLES */ -+ -+#ifdef CONFIG_VE_CALLS_MODULE -+DECL_KSYM_MODULE(vzmon); -+DECL_KSYM_CALL(int, real_get_device_perms_ve, -+ (int dev_type, dev_t dev, int access_mode)); -+DECL_KSYM_CALL(void, real_do_env_cleanup, (struct ve_struct *env)); -+DECL_KSYM_CALL(void, real_do_env_free, (struct ve_struct *env)); -+DECL_KSYM_CALL(void, 
real_update_load_avg_ve, (void)); -+#endif -+ -+#endif /* _LINUX_NFCALLS_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/nfs_fs.h linux-2.6.8.1-ve022stab078/include/linux/nfs_fs.h ---- linux-2.6.8.1.orig/include/linux/nfs_fs.h 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/nfs_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -267,7 +267,8 @@ extern struct inode *nfs_fhget(struct su - struct nfs_fattr *); - extern int nfs_refresh_inode(struct inode *, struct nfs_fattr *); - extern int nfs_getattr(struct vfsmount *, struct dentry *, struct kstat *); --extern int nfs_permission(struct inode *, int, struct nameidata *); -+extern int nfs_permission(struct inode *, int, struct nameidata *, -+ struct exec_perm *); - extern void nfs_set_mmcred(struct inode *, struct rpc_cred *); - extern int nfs_open(struct inode *, struct file *); - extern int nfs_release(struct inode *, struct file *); -diff -uprN linux-2.6.8.1.orig/include/linux/notifier.h linux-2.6.8.1-ve022stab078/include/linux/notifier.h ---- linux-2.6.8.1.orig/include/linux/notifier.h 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/notifier.h 2006-05-11 13:05:39.000000000 +0400 -@@ -27,8 +27,9 @@ extern int notifier_call_chain(struct no - - #define NOTIFY_DONE 0x0000 /* Don't care */ - #define NOTIFY_OK 0x0001 /* Suits me */ -+#define NOTIFY_FAIL 0x0002 /* Reject */ - #define NOTIFY_STOP_MASK 0x8000 /* Don't call further */ --#define NOTIFY_BAD (NOTIFY_STOP_MASK|0x0002) /* Bad/Veto action */ -+#define NOTIFY_BAD (NOTIFY_STOP_MASK|NOTIFY_FAIL) /* Bad/Veto action */ - - /* - * Declared notifiers so far. I can imagine quite a few more chains -diff -uprN linux-2.6.8.1.orig/include/linux/pagevec.h linux-2.6.8.1-ve022stab078/include/linux/pagevec.h ---- linux-2.6.8.1.orig/include/linux/pagevec.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/pagevec.h 2006-05-11 13:05:29.000000000 +0400 -@@ -5,14 +5,15 @@ - * pages. 
A pagevec is a multipage container which is used for that. - */ - --#define PAGEVEC_SIZE 16 -+/* 14 pointers + two long's align the pagevec structure to a power of two */ -+#define PAGEVEC_SIZE 14 - - struct page; - struct address_space; - - struct pagevec { -- unsigned nr; -- int cold; -+ unsigned long nr; -+ unsigned long cold; - struct page *pages[PAGEVEC_SIZE]; - }; - -diff -uprN linux-2.6.8.1.orig/include/linux/pci_ids.h linux-2.6.8.1-ve022stab078/include/linux/pci_ids.h ---- linux-2.6.8.1.orig/include/linux/pci_ids.h 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/pci_ids.h 2006-05-11 13:05:28.000000000 +0400 -@@ -2190,6 +2190,8 @@ - #define PCI_DEVICE_ID_INTEL_82855GM_HB 0x3580 - #define PCI_DEVICE_ID_INTEL_82855GM_IG 0x3582 - #define PCI_DEVICE_ID_INTEL_SMCH 0x3590 -+#define PCI_DEVICE_ID_INTEL_E7320_MCH 0x3592 -+#define PCI_DEVICE_ID_INTEL_E7525_MCH 0x359e - #define PCI_DEVICE_ID_INTEL_80310 0x530d - #define PCI_DEVICE_ID_INTEL_82371SB_0 0x7000 - #define PCI_DEVICE_ID_INTEL_82371SB_1 0x7010 -diff -uprN linux-2.6.8.1.orig/include/linux/pid.h linux-2.6.8.1-ve022stab078/include/linux/pid.h ---- linux-2.6.8.1.orig/include/linux/pid.h 2004-08-14 14:54:52.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/pid.h 2006-05-11 13:05:40.000000000 +0400 -@@ -1,6 +1,18 @@ - #ifndef _LINUX_PID_H - #define _LINUX_PID_H - -+#define VPID_BIT 10 -+#define VPID_DIV (1<<VPID_BIT) -+ -+#ifdef CONFIG_VE -+#define __is_virtual_pid(pid) ((pid) & VPID_DIV) -+#define is_virtual_pid(pid) \ -+ (__is_virtual_pid(pid) || ((pid)==1 && !ve_is_super(get_exec_env()))) -+#else -+#define __is_virtual_pid(pid) 0 -+#define is_virtual_pid(pid) 0 -+#endif -+ - enum pid_type - { - PIDTYPE_PID, -@@ -12,34 +24,24 @@ enum pid_type - - struct pid - { -+ /* Try to keep pid_chain in the same cacheline as nr for find_pid */ - int nr; -- atomic_t count; -- struct task_struct *task; -- struct list_head task_list; -- struct list_head hash_chain; --}; -- --struct 
pid_link --{ -- struct list_head pid_chain; -- struct pid *pidptr; -- struct pid pid; -+ struct hlist_node pid_chain; -+#ifdef CONFIG_VE -+ int vnr; -+#endif -+ /* list of pids with the same nr, only one of them is in the hash */ -+ struct list_head pid_list; - }; - - #define pid_task(elem, type) \ -- list_entry(elem, struct task_struct, pids[type].pid_chain) -+ list_entry(elem, struct task_struct, pids[type].pid_list) - - /* -- * attach_pid() and link_pid() must be called with the tasklist_lock -+ * attach_pid() and detach_pid() must be called with the tasklist_lock - * write-held. - */ - extern int FASTCALL(attach_pid(struct task_struct *task, enum pid_type type, int nr)); -- --extern void FASTCALL(link_pid(struct task_struct *task, struct pid_link *link, struct pid *pid)); -- --/* -- * detach_pid() must be called with the tasklist_lock write-held. -- */ - extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type)); - - /* -@@ -52,13 +54,89 @@ extern int alloc_pidmap(void); - extern void FASTCALL(free_pidmap(int)); - extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread); - --#define for_each_task_pid(who, type, task, elem, pid) \ -- if ((pid = find_pid(type, who))) \ -- for (elem = pid->task_list.next, \ -- prefetch(elem->next), \ -- task = pid_task(elem, type); \ -- elem != &pid->task_list; \ -- elem = elem->next, prefetch(elem->next), \ -- task = pid_task(elem, type)) -+#ifndef CONFIG_VE -+ -+#define vpid_to_pid(pid) (pid) -+#define __vpid_to_pid(pid) (pid) -+#define pid_type_to_vpid(pid, type) (pid) -+#define __pid_type_to_vpid(pid, type) (pid) -+ -+#define comb_vpid_to_pid(pid) (pid) -+#define comb_pid_to_vpid(pid) (pid) -+ -+#else -+ -+struct ve_struct; -+extern void free_vpid(int vpid, struct ve_struct *ve); -+extern int alloc_vpid(int pid, int vpid); -+extern int vpid_to_pid(int pid); -+extern int __vpid_to_pid(int pid); -+extern pid_t pid_type_to_vpid(int type, pid_t pid); -+extern pid_t _pid_type_to_vpid(int 
type, pid_t pid); -+ -+static inline int comb_vpid_to_pid(int vpid) -+{ -+ int pid = vpid; -+ -+ if (vpid > 0) { -+ pid = vpid_to_pid(vpid); -+ if (unlikely(pid < 0)) -+ return 0; -+ } else if (vpid < 0) { -+ pid = vpid_to_pid(-vpid); -+ if (unlikely(pid < 0)) -+ return 0; -+ pid = -pid; -+ } -+ return pid; -+} -+ -+static inline int comb_pid_to_vpid(int pid) -+{ -+ int vpid = pid; -+ -+ if (pid > 0) { -+ vpid = pid_type_to_vpid(PIDTYPE_PID, pid); -+ if (unlikely(vpid < 0)) -+ return 0; -+ } else if (pid < 0) { -+ vpid = pid_type_to_vpid(PIDTYPE_PGID, -pid); -+ if (unlikely(vpid < 0)) -+ return 0; -+ vpid = -vpid; -+ } -+ return vpid; -+} -+#endif -+ -+#define do_each_task_pid_all(who, type, task) \ -+ if ((task = find_task_by_pid_type_all(type, who))) { \ -+ prefetch((task)->pids[type].pid_list.next); \ -+ do { -+ -+#define while_each_task_pid_all(who, type, task) \ -+ } while (task = pid_task((task)->pids[type].pid_list.next,\ -+ type), \ -+ prefetch((task)->pids[type].pid_list.next), \ -+ hlist_unhashed(&(task)->pids[type].pid_chain)); \ -+ } \ -+ -+#ifndef CONFIG_VE -+#define __do_each_task_pid_ve(who, type, task, owner) \ -+ do_each_task_pid_all(who, type, task) -+#define __while_each_task_pid_ve(who, type, task, owner) \ -+ while_each_task_pid_all(who, type, task) -+#else /* CONFIG_VE */ -+#define __do_each_task_pid_ve(who, type, task, owner) \ -+ do_each_task_pid_all(who, type, task) \ -+ if (ve_accessible(VE_TASK_INFO(task)->owner_env, owner)) -+#define __while_each_task_pid_ve(who, type, task, owner) \ -+ while_each_task_pid_all(who, type, task) -+#endif /* CONFIG_VE */ -+ -+#define do_each_task_pid_ve(who, type, task) \ -+ __do_each_task_pid_ve(who, type, task, get_exec_env()); -+#define while_each_task_pid_ve(who, type, task) \ -+ __while_each_task_pid_ve(who, type, task, get_exec_env()); - - #endif /* _LINUX_PID_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/proc_fs.h linux-2.6.8.1-ve022stab078/include/linux/proc_fs.h ---- 
linux-2.6.8.1.orig/include/linux/proc_fs.h 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/proc_fs.h 2006-05-11 13:05:40.000000000 +0400 -@@ -66,8 +66,17 @@ struct proc_dir_entry { - write_proc_t *write_proc; - atomic_t count; /* use count */ - int deleted; /* delete flag */ -+ void *set; - }; - -+extern void de_put(struct proc_dir_entry *); -+static inline struct proc_dir_entry *de_get(struct proc_dir_entry *de) -+{ -+ if (de) -+ atomic_inc(&de->count); -+ return de; -+} -+ - struct kcore_list { - struct kcore_list *next; - unsigned long addr; -@@ -87,12 +96,15 @@ extern void proc_root_init(void); - extern void proc_misc_init(void); - - struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *); --struct dentry *proc_pid_unhash(struct task_struct *p); --void proc_pid_flush(struct dentry *proc_dentry); -+void proc_pid_unhash(struct task_struct *p, struct dentry * [2]); -+void proc_pid_flush(struct dentry *proc_dentry[2]); - int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir); - - extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode, - struct proc_dir_entry *parent); -+extern struct proc_dir_entry *create_proc_glob_entry(const char *name, -+ mode_t mode, -+ struct proc_dir_entry *parent); - extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent); - - extern struct vfsmount *proc_mnt; -@@ -169,6 +181,15 @@ static inline struct proc_dir_entry *pro - return create_proc_info_entry(name,mode,proc_net,get_info); - } - -+static inline struct proc_dir_entry *__proc_net_fops_create(const char *name, -+ mode_t mode, struct file_operations *fops, struct proc_dir_entry *p) -+{ -+ struct proc_dir_entry *res = create_proc_entry(name, mode, p); -+ if (res) -+ res->proc_fops = fops; -+ return res; -+} -+ - static inline struct proc_dir_entry *proc_net_fops_create(const char *name, - mode_t mode, struct file_operations *fops) - { -@@ -178,6 
+199,11 @@ static inline struct proc_dir_entry *pro - return res; - } - -+static inline void __proc_net_remove(const char *name) -+{ -+ remove_proc_entry(name, NULL); -+} -+ - static inline void proc_net_remove(const char *name) - { - remove_proc_entry(name,proc_net); -@@ -188,15 +214,20 @@ static inline void proc_net_remove(const - #define proc_root_driver NULL - #define proc_net NULL - -+#define __proc_net_fops_create(name, mode, fops, p) ({ (void)(mode), NULL; }) - #define proc_net_fops_create(name, mode, fops) ({ (void)(mode), NULL; }) - #define proc_net_create(name, mode, info) ({ (void)(mode), NULL; }) -+static inline void __proc_net_remove(const char *name) {} - static inline void proc_net_remove(const char *name) {} - --static inline struct dentry *proc_pid_unhash(struct task_struct *p) { return NULL; } --static inline void proc_pid_flush(struct dentry *proc_dentry) { } -+static inline void proc_pid_unhash(struct task_struct *p, struct dentry * [2]) -+ { return NULL; } -+static inline void proc_pid_flush(struct dentry *proc_dentry[2]) { } - - static inline struct proc_dir_entry *create_proc_entry(const char *name, - mode_t mode, struct proc_dir_entry *parent) { return NULL; } -+static inline struct proc_dir_entry *create_proc_glob_entry(const char *name, -+ mode_t mode, struct proc_dir_entry *parent) { return NULL; } - - #define remove_proc_entry(name, parent) do {} while (0) - -@@ -255,4 +286,9 @@ static inline struct proc_dir_entry *PDE - return PROC_I(inode)->pde; - } - -+#define LPDE(inode) (PROC_I((inode))->pde) -+#ifdef CONFIG_VE -+#define GPDE(inode) (*(struct proc_dir_entry **)(&(inode)->i_pipe)) -+#endif -+ - #endif /* _LINUX_PROC_FS_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/ptrace.h linux-2.6.8.1-ve022stab078/include/linux/ptrace.h ---- linux-2.6.8.1.orig/include/linux/ptrace.h 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ptrace.h 2006-05-11 13:05:34.000000000 +0400 -@@ -79,6 +79,7 @@ extern int 
ptrace_readdata(struct task_s - extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len); - extern int ptrace_attach(struct task_struct *tsk); - extern int ptrace_detach(struct task_struct *, unsigned int); -+extern void __ptrace_detach(struct task_struct *, unsigned int); - extern void ptrace_disable(struct task_struct *); - extern int ptrace_check_attach(struct task_struct *task, int kill); - extern int ptrace_request(struct task_struct *child, long request, long addr, long data); -diff -uprN linux-2.6.8.1.orig/include/linux/quota.h linux-2.6.8.1-ve022stab078/include/linux/quota.h ---- linux-2.6.8.1.orig/include/linux/quota.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/quota.h 2006-05-11 13:05:43.000000000 +0400 -@@ -37,7 +37,6 @@ - - #include <linux/errno.h> - #include <linux/types.h> --#include <linux/spinlock.h> - - #define __DQUOT_VERSION__ "dquot_6.5.1" - #define __DQUOT_NUM_VERSION__ 6*10000+5*100+1 -@@ -45,9 +44,6 @@ - typedef __kernel_uid32_t qid_t; /* Type in which we store ids in memory */ - typedef __u64 qsize_t; /* Type in which we store sizes */ - --extern spinlock_t dq_list_lock; --extern spinlock_t dq_data_lock; -- - /* Size of blocks in which are counted size limits */ - #define QUOTABLOCK_BITS 10 - #define QUOTABLOCK_SIZE (1 << QUOTABLOCK_BITS) -@@ -134,6 +130,12 @@ struct if_dqinfo { - - #ifdef __KERNEL__ - -+#include <linux/spinlock.h> -+ -+extern spinlock_t dq_list_lock; -+extern spinlock_t dq_data_lock; -+ -+ - #include <linux/dqblk_xfs.h> - #include <linux/dqblk_v1.h> - #include <linux/dqblk_v2.h> -@@ -240,6 +242,8 @@ struct quota_format_ops { - int (*release_dqblk)(struct dquot *dquot); /* Called when last reference to dquot is being dropped */ - }; - -+struct inode; -+struct iattr; - /* Operations working with dquots */ - struct dquot_operations { - int (*initialize) (struct inode *, int); -@@ -254,9 +258,11 @@ struct dquot_operations { - int (*release_dquot) 
(struct dquot *); /* Quota is going to be deleted from disk */ - int (*mark_dirty) (struct dquot *); /* Dquot is marked dirty */ - int (*write_info) (struct super_block *, int); /* Write of quota "superblock" */ -+ int (*rename) (struct inode *, struct inode *, struct inode *); - }; - - /* Operations handling requests from userspace */ -+struct v2_disk_dqblk; - struct quotactl_ops { - int (*quota_on)(struct super_block *, int, int, char *); - int (*quota_off)(struct super_block *, int); -@@ -269,6 +275,9 @@ struct quotactl_ops { - int (*set_xstate)(struct super_block *, unsigned int, int); - int (*get_xquota)(struct super_block *, int, qid_t, struct fs_disk_quota *); - int (*set_xquota)(struct super_block *, int, qid_t, struct fs_disk_quota *); -+#ifdef CONFIG_QUOTA_COMPAT -+ int (*get_quoti)(struct super_block *, int, unsigned int, struct v2_disk_dqblk *); -+#endif - }; - - struct quota_format_type { -diff -uprN linux-2.6.8.1.orig/include/linux/quotaops.h linux-2.6.8.1-ve022stab078/include/linux/quotaops.h ---- linux-2.6.8.1.orig/include/linux/quotaops.h 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/quotaops.h 2006-05-11 13:05:43.000000000 +0400 -@@ -170,6 +170,19 @@ static __inline__ int DQUOT_TRANSFER(str - return 0; - } - -+static __inline__ int DQUOT_RENAME(struct inode *inode, -+ struct inode *old_dir, struct inode *new_dir) -+{ -+ struct dquot_operations *q_op; -+ -+ q_op = inode->i_sb->dq_op; -+ if (q_op && q_op->rename) { -+ if (q_op->rename(inode, old_dir, new_dir) == NO_QUOTA) -+ return 1; -+ } -+ return 0; -+} -+ - /* The following two functions cannot be called inside a transaction */ - #define DQUOT_SYNC(sb) sync_dquots(sb, -1) - -@@ -197,6 +210,7 @@ static __inline__ int DQUOT_OFF(struct s - #define DQUOT_SYNC(sb) do { } while(0) - #define DQUOT_OFF(sb) do { } while(0) - #define DQUOT_TRANSFER(inode, iattr) (0) -+#define DQUOT_RENAME(inode, old_dir, new_dir) (0) - extern __inline__ int 
DQUOT_PREALLOC_SPACE_NODIRTY(struct inode *inode, qsize_t nr) - { - inode_add_bytes(inode, nr); -diff -uprN linux-2.6.8.1.orig/include/linux/reiserfs_fs.h linux-2.6.8.1-ve022stab078/include/linux/reiserfs_fs.h ---- linux-2.6.8.1.orig/include/linux/reiserfs_fs.h 2004-08-14 14:56:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/reiserfs_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -1944,7 +1944,7 @@ void reiserfs_read_locked_inode(struct i - int reiserfs_find_actor(struct inode * inode, void *p) ; - int reiserfs_init_locked_inode(struct inode * inode, void *p) ; - void reiserfs_delete_inode (struct inode * inode); --void reiserfs_write_inode (struct inode * inode, int) ; -+int reiserfs_write_inode (struct inode * inode, int) ; - struct dentry *reiserfs_get_dentry(struct super_block *, void *) ; - struct dentry *reiserfs_decode_fh(struct super_block *sb, __u32 *data, - int len, int fhtype, -diff -uprN linux-2.6.8.1.orig/include/linux/reiserfs_xattr.h linux-2.6.8.1-ve022stab078/include/linux/reiserfs_xattr.h ---- linux-2.6.8.1.orig/include/linux/reiserfs_xattr.h 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/reiserfs_xattr.h 2006-05-11 13:05:35.000000000 +0400 -@@ -42,7 +42,8 @@ int reiserfs_removexattr (struct dentry - int reiserfs_delete_xattrs (struct inode *inode); - int reiserfs_chown_xattrs (struct inode *inode, struct iattr *attrs); - int reiserfs_xattr_init (struct super_block *sb, int mount_flags); --int reiserfs_permission (struct inode *inode, int mask, struct nameidata *nd); -+int reiserfs_permission (struct inode *inode, int mask, struct nameidata *nd, -+ struct exec_perm *exec_perm); - int reiserfs_permission_locked (struct inode *inode, int mask, struct nameidata *nd); - - int reiserfs_xattr_del (struct inode *, const char *); -diff -uprN linux-2.6.8.1.orig/include/linux/sched.h linux-2.6.8.1-ve022stab078/include/linux/sched.h ---- linux-2.6.8.1.orig/include/linux/sched.h 2004-08-14 14:54:49.000000000 
+0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/sched.h 2006-05-11 13:05:49.000000000 +0400 -@@ -30,7 +30,12 @@ - #include <linux/pid.h> - #include <linux/percpu.h> - -+#include <ub/ub_task.h> -+ - struct exec_domain; -+struct task_beancounter; -+struct user_beancounter; -+struct ve_struct; - - /* - * cloning flags: -@@ -85,6 +90,9 @@ extern unsigned long avenrun[]; /* Load - load += n*(FIXED_1-exp); \ - load >>= FSHIFT; - -+#define LOAD_INT(x) ((x) >> FSHIFT) -+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) -+ - #define CT_TO_SECS(x) ((x) / HZ) - #define CT_TO_USECS(x) (((x) % HZ) * 1000000/HZ) - -@@ -92,10 +100,22 @@ extern int nr_threads; - extern int last_pid; - DECLARE_PER_CPU(unsigned long, process_counts); - extern int nr_processes(void); -+ -+extern unsigned long nr_sleeping(void); -+extern unsigned long nr_stopped(void); -+extern unsigned long nr_zombie; -+extern unsigned long nr_dead; - extern unsigned long nr_running(void); - extern unsigned long nr_uninterruptible(void); - extern unsigned long nr_iowait(void); - -+#ifdef CONFIG_VE -+struct ve_struct; -+extern unsigned long nr_running_ve(struct ve_struct *); -+extern unsigned long nr_iowait_ve(struct ve_struct *); -+extern unsigned long nr_uninterruptible_ve(struct ve_struct *); -+#endif -+ - #include <linux/time.h> - #include <linux/param.h> - #include <linux/resource.h> -@@ -107,8 +127,8 @@ extern unsigned long nr_iowait(void); - #define TASK_INTERRUPTIBLE 1 - #define TASK_UNINTERRUPTIBLE 2 - #define TASK_STOPPED 4 --#define TASK_ZOMBIE 8 --#define TASK_DEAD 16 -+#define EXIT_ZOMBIE 16 -+#define EXIT_DEAD 32 - - #define __set_task_state(tsk, state_value) \ - do { (tsk)->state = (state_value); } while (0) -@@ -154,6 +174,8 @@ extern cpumask_t nohz_cpu_mask; - - extern void show_state(void); - extern void show_regs(struct pt_regs *); -+extern void smp_show_regs(struct pt_regs *, void *); -+extern void show_vsched(void); - - /* - * TASK is a pointer to the task whose backtrace we want to see 
(or NULL for current -@@ -171,6 +193,8 @@ extern void update_process_times(int use - extern void scheduler_tick(int user_tick, int system); - extern unsigned long cache_decay_ticks; - -+int setscheduler(pid_t pid, int policy, struct sched_param __user *param); -+ - /* Attach to any functions which should be ignored in wchan output. */ - #define __sched __attribute__((__section__(".sched.text"))) - /* Is this address in the __sched functions? */ -@@ -215,6 +239,7 @@ struct mm_struct { - unsigned long saved_auxv[40]; /* for /proc/PID/auxv */ - - unsigned dumpable:1; -+ unsigned vps_dumpable:1; - cpumask_t cpu_vm_mask; - - /* Architecture-specific MM context */ -@@ -229,8 +254,12 @@ struct mm_struct { - struct kioctx *ioctx_list; - - struct kioctx default_kioctx; -+ -+ struct user_beancounter *mm_ub; - }; - -+#define mm_ub(__mm) ((__mm)->mm_ub) -+ - extern int mmlist_nr; - - struct sighand_struct { -@@ -239,6 +268,9 @@ struct sighand_struct { - spinlock_t siglock; - }; - -+#include <linux/ve.h> -+#include <linux/ve_task.h> -+ - /* - * NOTE! "signal_struct" does not have it's own - * locking, because a shared signal_struct always -@@ -386,6 +418,8 @@ int set_current_groups(struct group_info - - struct audit_context; /* See audit.c */ - struct mempolicy; -+struct vcpu_scheduler; -+struct vcpu_info; - - struct task_struct { - volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ -@@ -396,6 +430,14 @@ struct task_struct { - - int lock_depth; /* Lock depth */ - -+#ifdef CONFIG_SCHED_VCPU -+ struct vcpu_scheduler *vsched; -+ struct vcpu_info *vcpu; -+ -+ /* id's are saved to avoid locking (e.g. on vsched->id access) */ -+ int vsched_id; -+ int vcpu_id; -+#endif - int prio, static_prio; - struct list_head run_list; - prio_array_t *array; -@@ -410,6 +452,7 @@ struct task_struct { - unsigned int time_slice, first_time_slice; - - struct list_head tasks; -+ - /* - * ptrace_list/ptrace_children forms the list of my children - * that were stolen by a ptracer. 
-@@ -421,6 +464,7 @@ struct task_struct { - - /* task state */ - struct linux_binfmt *binfmt; -+ long exit_state; - int exit_code, exit_signal; - int pdeath_signal; /* The signal sent when the parent dies */ - /* ??? */ -@@ -444,7 +488,7 @@ struct task_struct { - struct task_struct *group_leader; /* threadgroup leader */ - - /* PID/PID hash table linkage. */ -- struct pid_link pids[PIDTYPE_MAX]; -+ struct pid pids[PIDTYPE_MAX]; - - wait_queue_head_t wait_chldexit; /* for wait4() */ - struct completion *vfork_done; /* for vfork() */ -@@ -523,10 +567,25 @@ struct task_struct { - unsigned long ptrace_message; - siginfo_t *last_siginfo; /* For ptrace use. */ - -+/* state tracking for suspend */ -+ sigset_t saved_sigset; -+ __u8 pn_state; -+ __u8 stopped_state:1, sigsuspend_state:1; -+ - #ifdef CONFIG_NUMA - struct mempolicy *mempolicy; - short il_next; /* could be shared with used_math */ - #endif -+#ifdef CONFIG_USER_RESOURCE -+ struct task_beancounter task_bc; -+#endif -+#ifdef CONFIG_VE -+ struct ve_task_info ve_task_info; -+#endif -+#if defined(CONFIG_VZ_QUOTA) || defined(CONFIG_VZ_QUOTA_MODULE) -+ unsigned long magic; -+ struct inode *ino; -+#endif - }; - - static inline pid_t process_group(struct task_struct *tsk) -@@ -534,6 +593,11 @@ static inline pid_t process_group(struct - return tsk->signal->pgrp; - } - -+static inline int pid_alive(struct task_struct *p) -+{ -+ return p->pids[PIDTYPE_PID].nr != 0; -+} -+ - extern void __put_task_struct(struct task_struct *tsk); - #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0) - #define put_task_struct(tsk) \ -@@ -555,7 +619,6 @@ do { if (atomic_dec_and_test(&(tsk)->usa - #define PF_MEMDIE 0x00001000 /* Killed for out-of-memory */ - #define PF_FLUSHER 0x00002000 /* responsible for disk writeback */ - --#define PF_FREEZE 0x00004000 /* this task should be frozen for suspend */ - #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */ - #define PF_FROZEN 0x00010000 /* frozen for system 
suspend */ - #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ -@@ -564,6 +627,57 @@ do { if (atomic_dec_and_test(&(tsk)->usa - #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ - #define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */ - -+#ifndef CONFIG_VE -+#define set_pn_state(tsk, state) do { } while(0) -+#define clear_pn_state(tsk) do { } while(0) -+#define set_sigsuspend_state(tsk, sig) do { } while(0) -+#define clear_sigsuspend_state(tsk) do { } while(0) -+#define set_stop_state(tsk) do { } while(0) -+#define clear_stop_state(tsk) do { } while(0) -+#else -+#define PN_STOP_TF 1 /* was not in 2.6.8 */ -+#define PN_STOP_TF_RT 2 /* was not in 2.6.8 */ -+#define PN_STOP_ENTRY 3 -+#define PN_STOP_FORK 4 -+#define PN_STOP_VFORK 5 -+#define PN_STOP_SIGNAL 6 -+#define PN_STOP_EXIT 7 -+#define PN_STOP_EXEC 8 -+#define PN_STOP_LEAVE 9 -+ -+static inline void set_pn_state(struct task_struct *tsk, int state) -+{ -+ tsk->pn_state = state; -+} -+ -+static inline void clear_pn_state(struct task_struct *tsk) -+{ -+ tsk->pn_state = 0; -+} -+ -+static inline void set_sigsuspend_state(struct task_struct *tsk, sigset_t saveset) -+{ -+ tsk->sigsuspend_state = 1; -+ tsk->saved_sigset = saveset; -+} -+ -+static inline void clear_sigsuspend_state(struct task_struct *tsk) -+{ -+ tsk->sigsuspend_state = 0; -+ siginitset(&tsk->saved_sigset, 0); -+} -+ -+static inline void set_stop_state(struct task_struct *tsk) -+{ -+ tsk->stopped_state = 1; -+} -+ -+static inline void clear_stop_state(struct task_struct *tsk) -+{ -+ tsk->stopped_state = 0; -+} -+#endif -+ - #ifdef CONFIG_SMP - #define SCHED_LOAD_SCALE 128UL /* increase resolution of load */ - -@@ -687,6 +801,20 @@ static inline int set_cpus_allowed(task_ - - extern unsigned long long sched_clock(void); - -+static inline unsigned long cycles_to_clocks(cycles_t cycles) -+{ -+ extern unsigned long cycles_per_clock; -+ do_div(cycles, cycles_per_clock); -+ return cycles; -+} -+ -+static 
inline u64 cycles_to_jiffies(cycles_t cycles) -+{ -+ extern unsigned long cycles_per_jiffy; -+ do_div(cycles, cycles_per_jiffy); -+ return cycles; -+} -+ - #ifdef CONFIG_SMP - extern void sched_balance_exec(void); - #else -@@ -699,6 +827,7 @@ extern int task_prio(const task_t *p); - extern int task_nice(const task_t *p); - extern int task_curr(const task_t *p); - extern int idle_cpu(int cpu); -+extern task_t *idle_task(int cpu); - - void yield(void); - -@@ -727,11 +856,243 @@ extern struct task_struct init_task; - - extern struct mm_struct init_mm; - --extern struct task_struct *find_task_by_pid(int pid); -+#define find_task_by_pid_all(nr) \ -+ find_task_by_pid_type_all(PIDTYPE_PID, nr) -+extern struct task_struct *find_task_by_pid_type_all(int type, int pid); - extern void set_special_pids(pid_t session, pid_t pgrp); - extern void __set_special_pids(pid_t session, pid_t pgrp); - -+#ifndef CONFIG_VE -+#define find_task_by_pid_ve find_task_by_pid_all -+ -+#define get_exec_env() NULL -+static inline struct ve_struct * set_exec_env(struct ve_struct *new_env) -+{ -+ return NULL; -+} -+#define ve_is_super(env) 1 -+#define ve_accessible(target, owner) 1 -+#define ve_accessible_strict(target, owner) 1 -+#define ve_accessible_veid(target, owner) 1 -+#define ve_accessible_strict_veid(target, owner) 1 -+ -+#define VEID(envid) 0 -+#define get_ve0() NULL -+ -+static inline pid_t virt_pid(struct task_struct *tsk) -+{ -+ return tsk->pid; -+} -+ -+static inline pid_t virt_tgid(struct task_struct *tsk) -+{ -+ return tsk->tgid; -+} -+ -+static inline pid_t virt_pgid(struct task_struct *tsk) -+{ -+ return tsk->signal->pgrp; -+} -+ -+static inline pid_t virt_sid(struct task_struct *tsk) -+{ -+ return tsk->signal->session; -+} -+ -+static inline pid_t get_task_pid_ve(struct task_struct *tsk, struct ve_struct *ve) -+{ -+ return tsk->pid; -+} -+ -+static inline pid_t get_task_pid(struct task_struct *tsk) -+{ -+ return tsk->pid; -+} -+ -+static inline pid_t get_task_tgid(struct 
task_struct *tsk) -+{ -+ return tsk->tgid; -+} -+ -+static inline pid_t get_task_pgid(struct task_struct *tsk) -+{ -+ return tsk->signal->pgrp; -+} -+ -+static inline pid_t get_task_sid(struct task_struct *tsk) -+{ -+ return tsk->signal->session; -+} -+ -+static inline void set_virt_pid(struct task_struct *tsk, pid_t pid) -+{ -+} -+ -+static inline void set_virt_tgid(struct task_struct *tsk, pid_t pid) -+{ -+} -+ -+static inline void set_virt_pgid(struct task_struct *tsk, pid_t pid) -+{ -+} -+ -+static inline void set_virt_sid(struct task_struct *tsk, pid_t pid) -+{ -+} -+ -+static inline pid_t get_task_ppid(struct task_struct *p) -+{ -+ if (!pid_alive(p)) -+ return 0; -+ return (p->pid > 1 ? p->group_leader->real_parent->pid : 0); -+} -+ -+#else /* CONFIG_VE */ -+ -+#include <asm/current.h> -+#include <linux/ve.h> -+ -+extern struct ve_struct ve0; -+ -+#define find_task_by_pid_ve(nr) \ -+ find_task_by_pid_type_ve(PIDTYPE_PID, nr) -+ -+extern struct task_struct *find_task_by_pid_type_ve(int type, int pid); -+ -+#define get_ve0() (&ve0) -+#define VEID(envid) ((envid)->veid) -+ -+#define get_exec_env() (VE_TASK_INFO(current)->exec_env) -+static inline struct ve_struct *set_exec_env(struct ve_struct *new_env) -+{ -+ struct ve_struct *old_env; -+ -+ old_env = VE_TASK_INFO(current)->exec_env; -+ VE_TASK_INFO(current)->exec_env = new_env; -+ -+ return old_env; -+} -+ -+#define ve_is_super(env) ((env) == get_ve0()) -+#define ve_accessible_strict(target, owner) ((target) == (owner)) -+static inline int ve_accessible(struct ve_struct *target, -+ struct ve_struct *owner) { -+ return ve_is_super(owner) || ve_accessible_strict(target, owner); -+} -+ -+#define ve_accessible_strict_veid(target, owner) ((target) == (owner)) -+static inline int ve_accessible_veid(envid_t target, envid_t owner) -+{ -+ return get_ve0()->veid == owner || -+ ve_accessible_strict_veid(target, owner); -+} -+ -+static inline pid_t virt_pid(struct task_struct *tsk) -+{ -+ return 
tsk->pids[PIDTYPE_PID].vnr; -+} -+ -+static inline pid_t virt_tgid(struct task_struct *tsk) -+{ -+ return tsk->pids[PIDTYPE_TGID].vnr; -+} -+ -+static inline pid_t virt_pgid(struct task_struct *tsk) -+{ -+ return tsk->pids[PIDTYPE_PGID].vnr; -+} -+ -+static inline pid_t virt_sid(struct task_struct *tsk) -+{ -+ return tsk->pids[PIDTYPE_SID].vnr; -+} -+ -+static inline pid_t get_task_pid_ve(struct task_struct *tsk, struct ve_struct *env) -+{ -+ return ve_is_super(env) ? tsk->pid : virt_pid(tsk); -+} -+ -+static inline pid_t get_task_pid(struct task_struct *tsk) -+{ -+ return get_task_pid_ve(tsk, get_exec_env()); -+} -+ -+static inline pid_t get_task_tgid(struct task_struct *tsk) -+{ -+ return ve_is_super(get_exec_env()) ? tsk->tgid : virt_tgid(tsk); -+} -+ -+static inline pid_t get_task_pgid(struct task_struct *tsk) -+{ -+ return ve_is_super(get_exec_env()) ? tsk->signal->pgrp : virt_pgid(tsk); -+} -+ -+static inline pid_t get_task_sid(struct task_struct *tsk) -+{ -+ return ve_is_super(get_exec_env()) ? tsk->signal->session : virt_sid(tsk); -+} -+ -+static inline void set_virt_pid(struct task_struct *tsk, pid_t pid) -+{ -+ tsk->pids[PIDTYPE_PID].vnr = pid; -+} -+ -+static inline void set_virt_tgid(struct task_struct *tsk, pid_t pid) -+{ -+ tsk->pids[PIDTYPE_TGID].vnr = pid; -+} -+ -+static inline void set_virt_pgid(struct task_struct *tsk, pid_t pid) -+{ -+ tsk->pids[PIDTYPE_PGID].vnr = pid; -+} -+ -+static inline void set_virt_sid(struct task_struct *tsk, pid_t pid) -+{ -+ tsk->pids[PIDTYPE_SID].vnr = pid; -+} -+ -+static inline pid_t get_task_ppid(struct task_struct *p) -+{ -+ struct task_struct *parent; -+ struct ve_struct *env; -+ -+ if (!pid_alive(p)) -+ return 0; -+ env = get_exec_env(); -+ if (get_task_pid_ve(p, env) == 1) -+ return 0; -+ parent = p->group_leader->real_parent; -+ return ve_accessible(VE_TASK_INFO(parent)->owner_env, env) ? 
-+ get_task_pid_ve(parent, env) : 1; -+} -+ -+void ve_sched_get_cpu_stat(struct ve_struct *envid, cycles_t *idle, -+ cycles_t *strv, unsigned int cpu); -+void ve_sched_attach(struct ve_struct *envid); -+ -+#endif /* CONFIG_VE */ -+ -+#if defined(CONFIG_SCHED_VCPU) && defined(CONFIG_VE) -+extern cycles_t ve_sched_get_idle_time(struct ve_struct *, int); -+extern cycles_t ve_sched_get_iowait_time(struct ve_struct *, int); -+#else -+#define ve_sched_get_idle_time(ve, cpu) 0 -+#define ve_sched_get_iowait_time(ve, cpu) 0 -+#endif -+ -+#ifdef CONFIG_SCHED_VCPU -+struct vcpu_scheduler; -+extern void fastcall vsched_cpu_online_map(struct vcpu_scheduler *sched, -+ cpumask_t *mask); -+#else -+#define vsched_cpu_online_map(vsched, mask) do { \ -+ *mask = cpu_online_map; \ -+ } while (0) -+#endif -+ - /* per-UID process charging. */ -+extern int set_user(uid_t new_ruid, int dumpclear); - extern struct user_struct * alloc_uid(uid_t); - static inline struct user_struct *get_uid(struct user_struct *u) - { -@@ -747,6 +1108,7 @@ extern unsigned long itimer_ticks; - extern unsigned long itimer_next; - extern void do_timer(struct pt_regs *); - -+extern void wake_up_init(void); - extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state)); - extern int FASTCALL(wake_up_process(struct task_struct * tsk)); - extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk)); -@@ -807,7 +1169,7 @@ extern struct sigqueue *sigqueue_alloc(v - extern void sigqueue_free(struct sigqueue *); - extern int send_sigqueue(int, struct sigqueue *, struct task_struct *); - extern int send_group_sigqueue(int, struct sigqueue *, struct task_struct *); --extern int do_sigaction(int, const struct k_sigaction *, struct k_sigaction *); -+extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *); - extern int do_sigaltstack(const stack_t __user *, stack_t __user *, unsigned long); - - /* These can be the second arg to send_sig_info/send_group_sig_info. 
*/ -@@ -885,7 +1247,10 @@ extern task_t *child_reaper; - - extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); - extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); --extern struct task_struct * copy_process(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); -+extern struct task_struct * copy_process(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, long pid); -+ -+extern void set_task_comm(struct task_struct *tsk, char *from); -+extern void get_task_comm(char *to, struct task_struct *tsk); - - #ifdef CONFIG_SMP - extern void wait_task_inactive(task_t * p); -@@ -908,31 +1273,105 @@ extern void wait_task_inactive(task_t * - add_parent(p, (p)->parent); \ - } while (0) - --#define next_task(p) list_entry((p)->tasks.next, struct task_struct, tasks) --#define prev_task(p) list_entry((p)->tasks.prev, struct task_struct, tasks) -+#define next_task_all(p) list_entry((p)->tasks.next, struct task_struct, tasks) -+#define prev_task_all(p) list_entry((p)->tasks.prev, struct task_struct, tasks) - --#define for_each_process(p) \ -- for (p = &init_task ; (p = next_task(p)) != &init_task ; ) -+#define for_each_process_all(p) \ -+ for (p = &init_task ; (p = next_task_all(p)) != &init_task ; ) - - /* - * Careful: do_each_thread/while_each_thread is a double loop so - * 'break' will not work as expected - use goto instead. 
- */ --#define do_each_thread(g, t) \ -- for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do -+#define do_each_thread_all(g, t) \ -+ for (g = t = &init_task ; (g = t = next_task_all(g)) != &init_task ; ) do -+ -+#define while_each_thread_all(g, t) \ -+ while ((t = next_thread(t)) != g) -+ -+#ifndef CONFIG_VE -+ -+#define SET_VE_LINKS(p) -+#define REMOVE_VE_LINKS(p) -+#define for_each_process_ve(p) for_each_process_all(p) -+#define do_each_thread_ve(g, t) do_each_thread_all(g, t) -+#define while_each_thread_ve(g, t) while_each_thread_all(g, t) -+#define first_task_ve() next_task_ve(&init_task) -+#define next_task_ve(p) \ -+ (next_task_all(p) != &init_task ? next_task_all(p) : NULL) -+ -+#else /* CONFIG_VE */ -+ -+#define SET_VE_LINKS(p) \ -+ do { \ -+ if (thread_group_leader(p)) \ -+ list_add_tail(&VE_TASK_INFO(p)->vetask_list, \ -+ &VE_TASK_INFO(p)->owner_env->vetask_lh); \ -+ } while (0) - --#define while_each_thread(g, t) \ -+#define REMOVE_VE_LINKS(p) \ -+ do { \ -+ if (thread_group_leader(p)) \ -+ list_del(&VE_TASK_INFO(p)->vetask_list); \ -+ } while(0) -+ -+static inline task_t* __first_task_ve(struct ve_struct *ve) -+{ -+ task_t *tsk; -+ -+ if (unlikely(ve_is_super(ve))) { -+ tsk = next_task_all(&init_task); -+ if (tsk == &init_task) -+ tsk = NULL; -+ } else { -+ /* probably can return ve->init_entry, but it's more clear */ -+ BUG_ON(list_empty(&ve->vetask_lh)); -+ tsk = VE_TASK_LIST_2_TASK(ve->vetask_lh.next); -+ } -+ return tsk; -+} -+ -+static inline task_t* __next_task_ve(struct ve_struct *ve, task_t *tsk) -+{ -+ if (unlikely(ve_is_super(ve))) { -+ tsk = next_task_all(tsk); -+ if (tsk == &init_task) -+ tsk = NULL; -+ } else { -+ struct list_head *tmp; -+ -+ BUG_ON(VE_TASK_INFO(tsk)->owner_env != ve); -+ tmp = VE_TASK_INFO(tsk)->vetask_list.next; -+ if (tmp == &ve->vetask_lh) -+ tsk = NULL; -+ else -+ tsk = VE_TASK_LIST_2_TASK(tmp); -+ } -+ return tsk; -+} -+ -+#define first_task_ve() __first_task_ve(get_exec_env()) -+#define 
next_task_ve(p) __next_task_ve(get_exec_env(), p) -+/* no one uses prev_task_ve(), copy next_task_ve() if needed */ -+ -+#define for_each_process_ve(p) \ -+ for (p = first_task_ve(); p != NULL ; p = next_task_ve(p)) -+ -+#define do_each_thread_ve(g, t) \ -+ for (g = t = first_task_ve() ; g != NULL; g = t = next_task_ve(g)) do -+ -+#define while_each_thread_ve(g, t) \ - while ((t = next_thread(t)) != g) - -+#endif /* CONFIG_VE */ -+ - extern task_t * FASTCALL(next_thread(const task_t *p)); - - #define thread_group_leader(p) (p->pid == p->tgid) - - static inline int thread_group_empty(task_t *p) - { -- struct pid *pid = p->pids[PIDTYPE_TGID].pidptr; -- -- return pid->task_list.next->next == &pid->task_list; -+ return list_empty(&p->pids[PIDTYPE_TGID].pid_list); - } - - #define delay_group_leader(p) \ -@@ -941,8 +1380,8 @@ static inline int thread_group_empty(tas - extern void unhash_process(struct task_struct *p); - - /* -- * Protects ->fs, ->files, ->mm, ->ptrace, ->group_info and synchronises with -- * wait4(). -+ * Protects ->fs, ->files, ->mm, ->ptrace, ->group_info, ->comm and -+ * synchronises with wait4(). - * - * Nests both inside and outside of read_lock(&tasklist_lock). 
- * It must not be nested with write_lock_irq(&tasklist_lock), -@@ -1065,28 +1504,61 @@ extern void signal_wake_up(struct task_s - */ - #ifdef CONFIG_SMP - --static inline unsigned int task_cpu(const struct task_struct *p) -+static inline unsigned int task_pcpu(const struct task_struct *p) - { - return p->thread_info->cpu; - } - --static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -+static inline void set_task_pcpu(struct task_struct *p, unsigned int cpu) - { - p->thread_info->cpu = cpu; - } - - #else - -+static inline unsigned int task_pcpu(const struct task_struct *p) -+{ -+ return 0; -+} -+ -+static inline void set_task_pcpu(struct task_struct *p, unsigned int cpu) -+{ -+} -+ -+#endif /* CONFIG_SMP */ -+ -+#ifdef CONFIG_SCHED_VCPU -+ -+static inline unsigned int task_vsched_id(const struct task_struct *p) -+{ -+ return p->vsched_id; -+} -+ - static inline unsigned int task_cpu(const struct task_struct *p) - { -+ return p->vcpu_id; -+} -+ -+extern void set_task_cpu(struct task_struct *p, unsigned int vcpu); -+ -+#else -+ -+static inline unsigned int task_vsched_id(const struct task_struct *p) -+{ - return 0; - } - -+static inline unsigned int task_cpu(const struct task_struct *p) -+{ -+ return task_pcpu(p); -+} -+ - static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) - { -+ set_task_pcpu(p, cpu); - } - --#endif /* CONFIG_SMP */ -+#endif /* CONFIG_SCHED_VCPU */ - - #endif /* __KERNEL__ */ - -diff -uprN linux-2.6.8.1.orig/include/linux/security.h linux-2.6.8.1-ve022stab078/include/linux/security.h ---- linux-2.6.8.1.orig/include/linux/security.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/security.h 2006-05-11 13:05:40.000000000 +0400 -@@ -61,7 +61,7 @@ static inline int cap_netlink_send (stru - - static inline int cap_netlink_recv (struct sk_buff *skb) - { -- if (!cap_raised (NETLINK_CB (skb).eff_cap, CAP_NET_ADMIN)) -+ if (!cap_raised (NETLINK_CB (skb).eff_cap, CAP_VE_NET_ADMIN)) - 
return -EPERM; - return 0; - } -diff -uprN linux-2.6.8.1.orig/include/linux/shm.h linux-2.6.8.1-ve022stab078/include/linux/shm.h ---- linux-2.6.8.1.orig/include/linux/shm.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/shm.h 2006-05-11 13:05:40.000000000 +0400 -@@ -72,6 +72,8 @@ struct shm_info { - }; - - #ifdef __KERNEL__ -+struct user_beancounter; -+ - struct shmid_kernel /* private to the kernel */ - { - struct kern_ipc_perm shm_perm; -@@ -84,8 +86,12 @@ struct shmid_kernel /* private to the ke - time_t shm_ctim; - pid_t shm_cprid; - pid_t shm_lprid; -+ struct user_beancounter *shmidk_ub; -+ struct ipc_ids *_shm_ids; - }; - -+#define shmid_ub(__shmid) (__shmid)->shmidk_ub -+ - /* shm_mode upper byte flags */ - #define SHM_DEST 01000 /* segment will be destroyed on last detach */ - #define SHM_LOCKED 02000 /* segment will not be swapped */ -diff -uprN linux-2.6.8.1.orig/include/linux/shmem_fs.h linux-2.6.8.1-ve022stab078/include/linux/shmem_fs.h ---- linux-2.6.8.1.orig/include/linux/shmem_fs.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/shmem_fs.h 2006-05-11 13:05:39.000000000 +0400 -@@ -8,6 +8,8 @@ - - #define SHMEM_NR_DIRECT 16 - -+struct user_beancounter; -+ - struct shmem_inode_info { - spinlock_t lock; - unsigned long next_index; -@@ -19,8 +21,11 @@ struct shmem_inode_info { - struct shared_policy policy; - struct list_head list; - struct inode vfs_inode; -+ struct user_beancounter *info_ub; - }; - -+#define shm_info_ub(__shmi) (__shmi)->info_ub -+ - struct shmem_sb_info { - unsigned long max_blocks; /* How many blocks are allowed */ - unsigned long free_blocks; /* How many are left for allocation */ -diff -uprN linux-2.6.8.1.orig/include/linux/signal.h linux-2.6.8.1-ve022stab078/include/linux/signal.h ---- linux-2.6.8.1.orig/include/linux/signal.h 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/signal.h 2006-05-11 13:05:39.000000000 +0400 -@@ -14,14 
+14,19 @@ - * Real Time signals may be queued. - */ - -+struct user_beancounter; -+ - struct sigqueue { - struct list_head list; - spinlock_t *lock; - int flags; - siginfo_t info; - struct user_struct *user; -+ struct user_beancounter *sig_ub; - }; - -+#define sig_ub(__q) ((__q)->sig_ub) -+ - /* flags values. */ - #define SIGQUEUE_PREALLOC 1 - -diff -uprN linux-2.6.8.1.orig/include/linux/skbuff.h linux-2.6.8.1-ve022stab078/include/linux/skbuff.h ---- linux-2.6.8.1.orig/include/linux/skbuff.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/skbuff.h 2006-05-11 13:05:40.000000000 +0400 -@@ -19,6 +19,7 @@ - #include <linux/compiler.h> - #include <linux/time.h> - #include <linux/cache.h> -+#include <linux/ve_owner.h> - - #include <asm/atomic.h> - #include <asm/types.h> -@@ -190,6 +191,8 @@ struct skb_shared_info { - * @tc_index: Traffic control index - */ - -+#include <ub/ub_sk.h> -+ - struct sk_buff { - /* These two members must be first. */ - struct sk_buff *next; -@@ -281,13 +284,18 @@ struct sk_buff { - *data, - *tail, - *end; -+ struct skb_beancounter skb_bc; -+ struct ve_struct *owner_env; - }; - -+DCL_VE_OWNER_PROTO(SKB, SLAB, struct sk_buff, owner_env, , (noinline, regparm(1))) -+ - #ifdef __KERNEL__ - /* - * Handling routines are only of interest to the kernel - */ - #include <linux/slab.h> -+#include <ub/ub_net.h> - - #include <asm/system.h> - -@@ -902,6 +910,8 @@ static inline int pskb_trim(struct sk_bu - */ - static inline void skb_orphan(struct sk_buff *skb) - { -+ ub_skb_uncharge(skb); -+ - if (skb->destructor) - skb->destructor(skb); - skb->destructor = NULL; -diff -uprN linux-2.6.8.1.orig/include/linux/slab.h linux-2.6.8.1-ve022stab078/include/linux/slab.h ---- linux-2.6.8.1.orig/include/linux/slab.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/slab.h 2006-05-11 13:05:39.000000000 +0400 -@@ -46,6 +46,27 @@ typedef struct kmem_cache_s kmem_cache_t - what is reclaimable later*/ - 
#define SLAB_PANIC 0x00040000UL /* panic if kmem_cache_create() fails */ - -+/* -+ * allocation rules: __GFP_UBC 0 -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ * cache (SLAB_UBC) charge charge -+ * (usual caches: mm, vma, task_struct, ...) -+ * -+ * cache (SLAB_UBC | SLAB_NO_CHARGE) charge --- -+ * (ub_kmalloc) (kmalloc) -+ * -+ * cache (no UB flags) BUG() --- -+ * (nonub caches, mempools) -+ * -+ * pages charge --- -+ * (ub_vmalloc, (vmalloc, -+ * poll, fdsets, ...) non-ub allocs) -+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -+ */ -+#define SLAB_UBC 0x20000000UL /* alloc space for ubs ... */ -+#define SLAB_NO_CHARGE 0x40000000UL /* ... but don't charge */ -+ -+ - /* flags passed to a constructor func */ - #define SLAB_CTOR_CONSTRUCTOR 0x001UL /* if not set, then deconstructor */ - #define SLAB_CTOR_ATOMIC 0x002UL /* tell constructor it can't sleep */ -@@ -97,6 +118,8 @@ found: - return __kmalloc(size, flags); - } - -+extern void *kzalloc(size_t, gfp_t); -+ - extern void kfree(const void *); - extern unsigned int ksize(const void *); - -diff -uprN linux-2.6.8.1.orig/include/linux/smp.h linux-2.6.8.1-ve022stab078/include/linux/smp.h ---- linux-2.6.8.1.orig/include/linux/smp.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/smp.h 2006-05-11 13:05:24.000000000 +0400 -@@ -54,6 +54,9 @@ extern void smp_cpus_done(unsigned int m - extern int smp_call_function (void (*func) (void *info), void *info, - int retry, int wait); - -+typedef void (*smp_nmi_function)(struct pt_regs *regs, void *info); -+extern int smp_nmi_call_function(smp_nmi_function func, void *info, int wait); -+ - /* - * Call a function on all processors - */ -@@ -100,6 +103,7 @@ void smp_prepare_boot_cpu(void); - #define hard_smp_processor_id() 0 - #define smp_threads_ready 1 - #define smp_call_function(func,info,retry,wait) ({ 0; }) -+#define smp_nmi_call_function(func, info, wait) ({ 0; }) - 
#define on_each_cpu(func,info,retry,wait) ({ func(info); 0; }) - static inline void smp_send_reschedule(int cpu) { } - #define num_booting_cpus() 1 -diff -uprN linux-2.6.8.1.orig/include/linux/socket.h linux-2.6.8.1-ve022stab078/include/linux/socket.h ---- linux-2.6.8.1.orig/include/linux/socket.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/socket.h 2006-05-11 13:05:42.000000000 +0400 -@@ -90,6 +90,10 @@ struct cmsghdr { - (struct cmsghdr *)(ctl) : \ - (struct cmsghdr *)NULL) - #define CMSG_FIRSTHDR(msg) __CMSG_FIRSTHDR((msg)->msg_control, (msg)->msg_controllen) -+#define CMSG_OK(mhdr, cmsg) ((cmsg)->cmsg_len >= sizeof(struct cmsghdr) && \ -+ (cmsg)->cmsg_len <= (unsigned long) \ -+ ((mhdr)->msg_controllen - \ -+ ((char *)(cmsg) - (char *)(mhdr)->msg_control))) - - /* - * This mess will go away with glibc -@@ -287,6 +291,7 @@ extern void memcpy_tokerneliovec(struct - extern int move_addr_to_user(void *kaddr, int klen, void __user *uaddr, int __user *ulen); - extern int move_addr_to_kernel(void __user *uaddr, int ulen, void *kaddr); - extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data); -+extern int vz_security_proto_check(int family, int type, int protocol); - - #endif - #endif /* not kernel and not glibc */ -diff -uprN linux-2.6.8.1.orig/include/linux/suspend.h linux-2.6.8.1-ve022stab078/include/linux/suspend.h ---- linux-2.6.8.1.orig/include/linux/suspend.h 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/suspend.h 2006-05-11 13:05:25.000000000 +0400 -@@ -59,7 +59,7 @@ static inline int software_suspend(void) - - - #ifdef CONFIG_PM --extern void refrigerator(unsigned long); -+extern void refrigerator(void); - extern int freeze_processes(void); - extern void thaw_processes(void); - -@@ -67,7 +67,7 @@ extern int pm_prepare_console(void); - extern void pm_restore_console(void); - - #else --static inline void refrigerator(unsigned long flag) {} -+static inline void 
refrigerator(void) {} - #endif /* CONFIG_PM */ - - #ifdef CONFIG_SMP -diff -uprN linux-2.6.8.1.orig/include/linux/swap.h linux-2.6.8.1-ve022stab078/include/linux/swap.h ---- linux-2.6.8.1.orig/include/linux/swap.h 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/swap.h 2006-05-11 13:05:45.000000000 +0400 -@@ -13,6 +13,7 @@ - #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ - #define SWAP_FLAG_PRIO_MASK 0x7fff - #define SWAP_FLAG_PRIO_SHIFT 0 -+#define SWAP_FLAG_READONLY 0x40000000 /* set if swap is read-only */ - - static inline int current_is_kswapd(void) - { -@@ -79,6 +80,7 @@ struct address_space; - struct sysinfo; - struct writeback_control; - struct zone; -+struct user_beancounter; - - /* - * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of -@@ -106,6 +108,7 @@ enum { - SWP_USED = (1 << 0), /* is slot in swap_info[] used? */ - SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */ - SWP_ACTIVE = (SWP_USED | SWP_WRITEOK), -+ SWP_READONLY = (1 << 2) - }; - - #define SWAP_CLUSTER_MAX 32 -@@ -118,6 +121,8 @@ enum { - * extent_list.prev points at the lowest-index extent. That list is - * sorted. 
- */ -+struct user_beancounter; -+ - struct swap_info_struct { - unsigned int flags; - spinlock_t sdev_lock; -@@ -132,6 +137,7 @@ struct swap_info_struct { - unsigned int highest_bit; - unsigned int cluster_next; - unsigned int cluster_nr; -+ struct user_beancounter **owner_map; - int prio; /* swap priority */ - int pages; - unsigned long max; -@@ -148,7 +154,8 @@ struct swap_list_t { - #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages) - - /* linux/mm/oom_kill.c */ --extern void out_of_memory(int gfp_mask); -+struct oom_freeing_stat; -+extern void out_of_memory(struct oom_freeing_stat *, int gfp_mask); - - /* linux/mm/memory.c */ - extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *); -@@ -210,7 +217,7 @@ extern long total_swap_pages; - extern unsigned int nr_swapfiles; - extern struct swap_info_struct swap_info[]; - extern void si_swapinfo(struct sysinfo *); --extern swp_entry_t get_swap_page(void); -+extern swp_entry_t get_swap_page(struct user_beancounter *); - extern int swap_duplicate(swp_entry_t); - extern int valid_swaphandles(swp_entry_t, unsigned long *); - extern void swap_free(swp_entry_t); -@@ -219,6 +226,7 @@ extern sector_t map_swap_page(struct swa - extern struct swap_info_struct *get_swap_info_struct(unsigned); - extern int can_share_swap_page(struct page *); - extern int remove_exclusive_swap_page(struct page *); -+extern int try_to_remove_exclusive_swap_page(struct page *); - struct backing_dev_info; - - extern struct swap_list_t swap_list; -@@ -259,7 +267,7 @@ static inline int remove_exclusive_swap_ - return 0; - } - --static inline swp_entry_t get_swap_page(void) -+static inline swp_entry_t get_swap_page(struct user_beancounter *ub) - { - swp_entry_t entry; - entry.val = 0; -diff -uprN linux-2.6.8.1.orig/include/linux/sysctl.h linux-2.6.8.1-ve022stab078/include/linux/sysctl.h ---- linux-2.6.8.1.orig/include/linux/sysctl.h 2004-08-14 14:55:33.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/include/linux/sysctl.h 2006-05-11 13:05:49.000000000 +0400 -@@ -24,6 +24,7 @@ - #include <linux/compiler.h> - - struct file; -+struct completion; - - #define CTL_MAXNAME 10 /* how many path components do we allow in a - call to sysctl? In other words, what is -@@ -133,6 +134,13 @@ enum - KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */ - KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */ - KERN_HZ_TIMER=65, /* int: hz timer on or off */ -+ KERN_SILENCE_LEVEL=66, /* int: Console silence loglevel */ -+ KERN_ALLOC_FAIL_WARN=67, /* int: whether we'll print "alloc failure" */ -+ KERN_FAIRSCHED_MAX_LATENCY=201, /* int: Max start_tag delta */ -+ KERN_VCPU_SCHED_TIMESLICE=202, -+ KERN_VCPU_TIMESLICE=203, -+ KERN_VIRT_PIDS=204, /* int: VE pids virtualization */ -+ KERN_VIRT_OSRELEASE=205,/* virtualization of utsname.release */ - }; - - -@@ -320,6 +328,7 @@ enum - NET_TCP_RMEM=85, - NET_TCP_APP_WIN=86, - NET_TCP_ADV_WIN_SCALE=87, -+ NET_TCP_USE_SG=245, - NET_IPV4_NONLOCAL_BIND=88, - NET_IPV4_ICMP_RATELIMIT=89, - NET_IPV4_ICMP_RATEMASK=90, -@@ -343,6 +352,7 @@ enum - - enum { - NET_IPV4_ROUTE_FLUSH=1, -+ NET_IPV4_ROUTE_SRC_CHECK=188, - NET_IPV4_ROUTE_MIN_DELAY=2, - NET_IPV4_ROUTE_MAX_DELAY=3, - NET_IPV4_ROUTE_GC_THRESH=4, -@@ -650,6 +660,12 @@ enum - FS_XFS=17, /* struct: control xfs parameters */ - FS_AIO_NR=18, /* current system-wide number of aio requests */ - FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */ -+ FS_AT_VSYSCALL=20, /* int: to announce vsyscall data */ -+}; -+ -+/* /proc/sys/debug */ -+enum { -+ DBG_DECODE_CALLTRACES = 1, /* int: decode call traces on oops */ - }; - - /* /proc/sys/fs/quota/ */ -@@ -780,6 +796,8 @@ extern int proc_doulongvec_minmax(ctl_ta - void __user *, size_t *, loff_t *); - extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int, - struct file *, void __user *, size_t *, loff_t *); -+extern int proc_doutsstring(ctl_table *table, int write, struct file *, -+ void __user *, 
size_t *, loff_t *); - - extern int do_sysctl (int __user *name, int nlen, - void __user *oldval, size_t __user *oldlenp, -@@ -833,6 +851,8 @@ extern ctl_handler sysctl_jiffies; - */ - - /* A sysctl table is an array of struct ctl_table: */ -+struct ve_struct; -+ - struct ctl_table - { - int ctl_name; /* Binary ID */ -@@ -846,6 +866,7 @@ struct ctl_table - struct proc_dir_entry *de; /* /proc control block */ - void *extra1; - void *extra2; -+ struct ve_struct *owner_env; - }; - - /* struct ctl_table_header is used to maintain dynamic lists of -@@ -854,12 +875,17 @@ struct ctl_table_header - { - ctl_table *ctl_table; - struct list_head ctl_entry; -+ int used; -+ struct completion *unregistering; - }; - - struct ctl_table_header * register_sysctl_table(ctl_table * table, - int insert_at_head); - void unregister_sysctl_table(struct ctl_table_header * table); - -+ctl_table *clone_sysctl_template(ctl_table *tmpl, int nr); -+void free_sysctl_clone(ctl_table *clone); -+ - #else /* __KERNEL__ */ - - #endif /* __KERNEL__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/sysrq.h linux-2.6.8.1-ve022stab078/include/linux/sysrq.h ---- linux-2.6.8.1.orig/include/linux/sysrq.h 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/sysrq.h 2006-05-11 13:05:24.000000000 +0400 -@@ -29,6 +29,12 @@ struct sysrq_key_op { - * are available -- else NULL's). 
- */ - -+#ifdef CONFIG_SYSRQ_DEBUG -+int sysrq_eat_all(void); -+#else -+#define sysrq_eat_all() (0) -+#endif -+ - void handle_sysrq(int, struct pt_regs *, struct tty_struct *); - void __handle_sysrq(int, struct pt_regs *, struct tty_struct *); - -diff -uprN linux-2.6.8.1.orig/include/linux/tcp.h linux-2.6.8.1-ve022stab078/include/linux/tcp.h ---- linux-2.6.8.1.orig/include/linux/tcp.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/tcp.h 2006-05-11 13:05:37.000000000 +0400 -@@ -201,6 +201,27 @@ struct tcp_sack_block { - __u32 end_seq; - }; - -+struct tcp_options_received { -+/* PAWS/RTTM data */ -+ long ts_recent_stamp;/* Time we stored ts_recent (for aging) */ -+ __u32 ts_recent; /* Time stamp to echo next */ -+ __u32 rcv_tsval; /* Time stamp value */ -+ __u32 rcv_tsecr; /* Time stamp echo reply */ -+ char saw_tstamp; /* Saw TIMESTAMP on last packet */ -+ char tstamp_ok; /* TIMESTAMP seen on SYN packet */ -+ char sack_ok; /* SACK seen on SYN packet */ -+ char wscale_ok; /* Wscale seen on SYN packet */ -+ __u8 snd_wscale; /* Window scaling received from sender */ -+ __u8 rcv_wscale; /* Window scaling to send to receiver */ -+/* SACKs data */ -+ __u8 dsack; /* D-SACK is scheduled */ -+ __u8 eff_sacks; /* Size of SACK array to send with next packet */ -+ __u8 num_sacks; /* Number of SACK blocks */ -+ __u8 __pad; -+ __u16 user_mss; /* mss requested by user in ioctl */ -+ __u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ -+}; -+ - struct tcp_opt { - int tcp_header_len; /* Bytes of tcp header to send */ - -@@ -251,22 +272,19 @@ struct tcp_opt { - __u32 pmtu_cookie; /* Last pmtu seen by socket */ - __u32 mss_cache; /* Cached effective mss, not including SACKS */ - __u16 mss_cache_std; /* Like mss_cache, but without TSO */ -- __u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ - __u16 ext_header_len; /* Network protocol overhead (IP/IPv6 options) */ - __u16 ext2_header_len;/* Options depending on route */ - 
__u8 ca_state; /* State of fast-retransmit machine */ - __u8 retransmits; /* Number of unrecovered RTO timeouts. */ -+ __u32 frto_highmark; /* snd_nxt when RTO occurred */ - - __u8 reordering; /* Packet reordering metric. */ - __u8 frto_counter; /* Number of new acks after RTO */ -- __u32 frto_highmark; /* snd_nxt when RTO occurred */ - - __u8 unused_pad; - __u8 defer_accept; /* User waits for some data after accept() */ -- /* one byte hole, try to pack */ - - /* RTT measurement */ -- __u8 backoff; /* backoff */ - __u32 srtt; /* smothed round trip time << 3 */ - __u32 mdev; /* medium deviation */ - __u32 mdev_max; /* maximal mdev for the last rtt period */ -@@ -277,7 +295,15 @@ struct tcp_opt { - __u32 packets_out; /* Packets which are "in flight" */ - __u32 left_out; /* Packets which leaved network */ - __u32 retrans_out; /* Retransmitted packets out */ -+ __u8 backoff; /* backoff */ -+/* -+ * Options received (usually on last packet, some only on SYN packets). -+ */ -+ __u8 nonagle; /* Disable Nagle algorithm? */ -+ __u8 keepalive_probes; /* num of allowed keep alive probes */ - -+ __u8 probes_out; /* unanswered 0 window probes */ -+ struct tcp_options_received rx_opt; - - /* - * Slow start and congestion control (see also Nagle, and Karn & Partridge) -@@ -303,40 +329,19 @@ struct tcp_opt { - __u32 write_seq; /* Tail(+1) of data held in tcp send buffer */ - __u32 pushed_seq; /* Last pushed seq, required to talk to windows */ - __u32 copied_seq; /* Head of yet unread data */ --/* -- * Options received (usually on last packet, some only on SYN packets). -- */ -- char tstamp_ok, /* TIMESTAMP seen on SYN packet */ -- wscale_ok, /* Wscale seen on SYN packet */ -- sack_ok; /* SACK seen on SYN packet */ -- char saw_tstamp; /* Saw TIMESTAMP on last packet */ -- __u8 snd_wscale; /* Window scaling received from sender */ -- __u8 rcv_wscale; /* Window scaling to send to receiver */ -- __u8 nonagle; /* Disable Nagle algorithm? 
*/ -- __u8 keepalive_probes; /* num of allowed keep alive probes */ -- --/* PAWS/RTTM data */ -- __u32 rcv_tsval; /* Time stamp value */ -- __u32 rcv_tsecr; /* Time stamp echo reply */ -- __u32 ts_recent; /* Time stamp to echo next */ -- long ts_recent_stamp;/* Time we stored ts_recent (for aging) */ - - /* SACKs data */ -- __u16 user_mss; /* mss requested by user in ioctl */ -- __u8 dsack; /* D-SACK is scheduled */ -- __u8 eff_sacks; /* Size of SACK array to send with next packet */ - struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */ - struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/ - - __u32 window_clamp; /* Maximal window to advertise */ - __u32 rcv_ssthresh; /* Current window clamp */ -- __u8 probes_out; /* unanswered 0 window probes */ -- __u8 num_sacks; /* Number of SACK blocks */ - __u16 advmss; /* Advertised MSS */ - - __u8 syn_retries; /* num of allowed syn retries */ - __u8 ecn_flags; /* ECN status bits. */ - __u16 prior_ssthresh; /* ssthresh saved at recovery start */ -+ __u16 __pad1; - __u32 lost_out; /* Lost packets */ - __u32 sacked_out; /* SACK'd packets */ - __u32 fackets_out; /* FACK'd packets */ -diff -uprN linux-2.6.8.1.orig/include/linux/time.h linux-2.6.8.1-ve022stab078/include/linux/time.h ---- linux-2.6.8.1.orig/include/linux/time.h 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/time.h 2006-05-11 13:05:32.000000000 +0400 -@@ -194,6 +194,18 @@ static inline unsigned int jiffies_to_ms - return (j * 1000) / HZ; - #endif - } -+ -+static inline unsigned int jiffies_to_usecs(const unsigned long j) -+{ -+#if HZ <= 1000 && !(1000 % HZ) -+ return (1000000 / HZ) * j; -+#elif HZ > 1000 && !(HZ % 1000) -+ return (j*1000 + (HZ - 1000))/(HZ / 1000); -+#else -+ return (j * 1000000) / HZ; -+#endif -+} -+ - static inline unsigned long msecs_to_jiffies(const unsigned int m) - { - #if HZ <= 1000 && !(1000 % HZ) -@@ -332,6 +344,7 @@ static inline unsigned long get_seconds( - struct timespec 
current_kernel_time(void); - - #define CURRENT_TIME (current_kernel_time()) -+#define CURRENT_TIME_SEC ((struct timespec) { xtime.tv_sec, 0 }) - - #endif /* __KERNEL__ */ - -@@ -349,6 +362,8 @@ struct itimerval; - extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue); - extern int do_getitimer(int which, struct itimerval *value); - -+extern struct timespec timespec_trunc(struct timespec t, unsigned gran); -+ - static inline void - set_normalized_timespec (struct timespec *ts, time_t sec, long nsec) - { -diff -uprN linux-2.6.8.1.orig/include/linux/tty.h linux-2.6.8.1-ve022stab078/include/linux/tty.h ---- linux-2.6.8.1.orig/include/linux/tty.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/tty.h 2006-05-11 13:05:40.000000000 +0400 -@@ -239,6 +239,8 @@ struct device; - * size each time the window is created or resized anyway. - * - TYT, 9/14/92 - */ -+struct user_beancounter; -+ - struct tty_struct { - int magic; - struct tty_driver *driver; -@@ -293,8 +295,12 @@ struct tty_struct { - spinlock_t read_lock; - /* If the tty has a pending do_SAK, queue it here - akpm */ - struct work_struct SAK_work; -+ struct ve_struct *owner_env; - }; - -+DCL_VE_OWNER_PROTO(TTY, TAIL_SOFT, struct tty_struct, owner_env, , ()) -+#define tty_ub(__tty) (slab_ub(__tty)) -+ - /* tty magic number */ - #define TTY_MAGIC 0x5401 - -@@ -319,6 +325,7 @@ struct tty_struct { - #define TTY_HW_COOK_IN 15 - #define TTY_PTY_LOCK 16 - #define TTY_NO_WRITE_SPLIT 17 -+#define TTY_CHARGED 18 - - #define TTY_WRITE_FLUSH(tty) tty_write_flush((tty)) - -diff -uprN linux-2.6.8.1.orig/include/linux/tty_driver.h linux-2.6.8.1-ve022stab078/include/linux/tty_driver.h ---- linux-2.6.8.1.orig/include/linux/tty_driver.h 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/tty_driver.h 2006-05-11 13:05:40.000000000 +0400 -@@ -115,6 +115,7 @@ - * character to the device. 
- */ - -+#include <linux/ve_owner.h> - #include <linux/fs.h> - #include <linux/list.h> - #include <linux/cdev.h> -@@ -214,9 +215,13 @@ struct tty_driver { - unsigned int set, unsigned int clear); - - struct list_head tty_drivers; -+ struct ve_struct *owner_env; - }; - -+DCL_VE_OWNER_PROTO(TTYDRV, TAIL_SOFT, struct tty_driver, owner_env, , ()) -+ - extern struct list_head tty_drivers; -+extern rwlock_t tty_driver_guard; - - struct tty_driver *alloc_tty_driver(int lines); - void put_tty_driver(struct tty_driver *driver); -diff -uprN linux-2.6.8.1.orig/include/linux/types.h linux-2.6.8.1-ve022stab078/include/linux/types.h ---- linux-2.6.8.1.orig/include/linux/types.h 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/types.h 2006-05-11 13:05:32.000000000 +0400 -@@ -132,6 +132,10 @@ typedef __s64 int64_t; - typedef unsigned long sector_t; - #endif - -+#ifdef __KERNEL__ -+typedef unsigned gfp_t; -+#endif -+ - /* - * The type of an index into the pagecache. Use a #define so asm/types.h - * can override it. 
-@@ -140,6 +144,19 @@ typedef unsigned long sector_t; - #define pgoff_t unsigned long - #endif - -+#ifdef __CHECKER__ -+#define __bitwise __attribute__((bitwise)) -+#else -+#define __bitwise -+#endif -+ -+typedef __u16 __bitwise __le16; -+typedef __u16 __bitwise __be16; -+typedef __u32 __bitwise __le32; -+typedef __u32 __bitwise __be32; -+typedef __u64 __bitwise __le64; -+typedef __u64 __bitwise __be64; -+ - #endif /* __KERNEL_STRICT_NAMES */ - - /* -diff -uprN linux-2.6.8.1.orig/include/linux/ufs_fs.h linux-2.6.8.1-ve022stab078/include/linux/ufs_fs.h ---- linux-2.6.8.1.orig/include/linux/ufs_fs.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/ufs_fs.h 2006-05-11 13:05:35.000000000 +0400 -@@ -899,7 +899,7 @@ extern struct inode * ufs_new_inode (str - extern u64 ufs_frag_map (struct inode *, sector_t); - extern void ufs_read_inode (struct inode *); - extern void ufs_put_inode (struct inode *); --extern void ufs_write_inode (struct inode *, int); -+extern int ufs_write_inode (struct inode *, int); - extern int ufs_sync_inode (struct inode *); - extern void ufs_delete_inode (struct inode *); - extern struct buffer_head * ufs_getfrag (struct inode *, unsigned, int, int *); -diff -uprN linux-2.6.8.1.orig/include/linux/ve.h linux-2.6.8.1-ve022stab078/include/linux/ve.h ---- linux-2.6.8.1.orig/include/linux/ve.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/ve.h 2006-05-11 13:05:48.000000000 +0400 -@@ -0,0 +1,311 @@ -+/* -+ * include/linux/ve.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _LINUX_VE_H -+#define _LINUX_VE_H -+ -+#include <linux/config.h> -+ -+#ifndef __ENVID_T_DEFINED__ -+typedef unsigned envid_t; -+#define __ENVID_T_DEFINED__ -+#endif -+ -+#include <linux/types.h> -+#include <linux/capability.h> -+#include <linux/utsname.h> -+#include <linux/sysctl.h> -+#include <linux/vzstat.h> -+#include <linux/kobject.h> -+ -+#ifdef VZMON_DEBUG -+# define VZTRACE(fmt,args...) \ -+ printk(KERN_DEBUG fmt, ##args) -+#else -+# define VZTRACE(fmt,args...) -+#endif /* VZMON_DEBUG */ -+ -+struct tty_driver; -+struct devpts_config; -+struct task_struct; -+struct new_utsname; -+struct file_system_type; -+struct icmp_mib; -+struct ip_mib; -+struct tcp_mib; -+struct udp_mib; -+struct linux_mib; -+struct fib_info; -+struct fib_rule; -+struct veip_struct; -+struct ve_monitor; -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+struct fib_table; -+struct devcnfv4_struct; -+#ifdef CONFIG_VE_IPTABLES -+struct ipt_filter_initial_table; -+struct ipt_nat_initial_table; -+struct ipt_table; -+struct ip_conntrack; -+struct nf_hook_ops; -+struct ve_ip_conntrack { -+ struct list_head *_ip_conntrack_hash; -+ struct list_head _ip_conntrack_expect_list; -+ struct list_head _ip_conntrack_protocol_list; -+ struct list_head _ip_conntrack_helpers; -+ int _ip_conntrack_max; -+ unsigned long _ip_ct_tcp_timeouts[10]; -+ unsigned long _ip_ct_udp_timeout; -+ unsigned long _ip_ct_udp_timeout_stream; -+ unsigned long _ip_ct_icmp_timeout; -+ unsigned long _ip_ct_generic_timeout; -+ atomic_t _ip_conntrack_count; -+ void (*_ip_conntrack_destroyed)(struct ip_conntrack *conntrack); -+#ifdef CONFIG_SYSCTL -+ struct ctl_table_header *_ip_ct_sysctl_header; -+ ctl_table *_ip_ct_net_table; -+ ctl_table *_ip_ct_ipv4_table; -+ ctl_table *_ip_ct_netfilter_table; -+ ctl_table *_ip_ct_sysctl_table; -+#endif /*CONFIG_SYSCTL*/ -+ -+ int _ip_conntrack_ftp_ports_c; -+ int _ip_conntrack_irc_ports_c; -+ -+ struct list_head _ip_nat_protos; -+ struct list_head 
_ip_nat_helpers; -+ struct list_head *_ip_nat_bysource; -+ struct ipt_nat_initial_table *_ip_nat_initial_table; -+ struct ipt_table *_ip_nat_table; -+ -+ int _ip_nat_ftp_ports_c; -+ int _ip_nat_irc_ports_c; -+ -+ /* resource accounting */ -+ struct user_beancounter *ub; -+}; -+#endif -+#endif -+ -+#define UIDHASH_BITS_VE 6 -+#define UIDHASH_SZ_VE (1 << UIDHASH_BITS_VE) -+ -+struct ve_cpu_stats { -+ cycles_t idle_time; -+ cycles_t iowait_time; -+ cycles_t strt_idle_time; -+ cycles_t used_time; -+ seqcount_t stat_lock; -+ int nr_running; -+ int nr_unint; -+ int nr_iowait; -+ u64 user; -+ u64 nice; -+ u64 system; -+} ____cacheline_aligned; -+ -+struct ve_struct { -+ struct ve_struct *prev; -+ struct ve_struct *next; -+ -+ envid_t veid; -+ struct task_struct *init_entry; -+ struct list_head vetask_lh; -+ kernel_cap_t cap_default; -+ atomic_t pcounter; -+ /* ref counter to ve from ipc */ -+ atomic_t counter; -+ unsigned int class_id; -+ struct veip_struct *veip; -+ struct rw_semaphore op_sem; -+ int is_running; -+ int is_locked; -+ int virt_pids; -+ /* see vzcalluser.h for VE_FEATURE_XXX definitions */ -+ __u64 features; -+ -+/* VE's root */ -+ struct vfsmount *fs_rootmnt; -+ struct dentry *fs_root; -+ -+/* sysctl */ -+ struct new_utsname *utsname; -+ struct list_head sysctl_lh; -+ struct ctl_table_header *kern_header; -+ struct ctl_table *kern_table; -+ struct ctl_table_header *quota_header; -+ struct ctl_table *quota_table; -+ struct file_system_type *proc_fstype; -+ struct vfsmount *proc_mnt; -+ struct proc_dir_entry *proc_root; -+ struct proc_dir_entry *proc_sys_root; -+ -+/* SYSV IPC */ -+ struct ipc_ids *_shm_ids; -+ struct ipc_ids *_msg_ids; -+ struct ipc_ids *_sem_ids; -+ int _used_sems; -+ int _shm_tot; -+ size_t _shm_ctlmax; -+ size_t _shm_ctlall; -+ int _shm_ctlmni; -+ int _msg_ctlmax; -+ int _msg_ctlmni; -+ int _msg_ctlmnb; -+ int _sem_ctls[4]; -+ -+/* BSD pty's */ -+ struct tty_driver *pty_driver; -+ struct tty_driver *pty_slave_driver; -+ -+#ifdef 
CONFIG_UNIX98_PTYS -+ struct tty_driver *ptm_driver; -+ struct tty_driver *pts_driver; -+ struct idr *allocated_ptys; -+#endif -+ struct file_system_type *devpts_fstype; -+ struct vfsmount *devpts_mnt; -+ struct dentry *devpts_root; -+ struct devpts_config *devpts_config; -+ -+ struct file_system_type *shmem_fstype; -+ struct vfsmount *shmem_mnt; -+#ifdef CONFIG_SYSFS -+ struct file_system_type *sysfs_fstype; -+ struct vfsmount *sysfs_mnt; -+ struct super_block *sysfs_sb; -+#endif -+ struct subsystem *class_subsys; -+ struct subsystem *class_obj_subsys; -+ struct class *net_class; -+ -+/* User uids hash */ -+ struct list_head uidhash_table[UIDHASH_SZ_VE]; -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ struct hlist_head _net_dev_head; -+ struct hlist_head _net_dev_index_head; -+ struct net_device *_net_dev_base, **_net_dev_tail; -+ int ifindex; -+ struct net_device *_loopback_dev; -+ struct net_device *_venet_dev; -+ struct ipv4_devconf *_ipv4_devconf; -+ struct ipv4_devconf *_ipv4_devconf_dflt; -+ struct ctl_table_header *forward_header; -+ struct ctl_table *forward_table; -+#endif -+ unsigned long rt_flush_required; -+ -+/* per VE CPU stats*/ -+ struct timespec start_timespec; -+ u64 start_jiffies; -+ cycles_t start_cycles; -+ unsigned long avenrun[3]; /* loadavg data */ -+ -+ cycles_t cpu_used_ve; -+ struct kstat_lat_pcpu_struct sched_lat_ve; -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ struct fib_info *_fib_info_list; -+ struct fib_rule *_local_rule; -+ struct fib_rule *_fib_rules; -+#ifdef CONFIG_IP_MULTIPLE_TABLES -+ /* XXX: why a magic constant? 
*/ -+ struct fib_table *_fib_tables[256]; /* RT_TABLE_MAX - for now */ -+#else -+ struct fib_table *_main_table; -+ struct fib_table *_local_table; -+#endif -+ struct icmp_mib *_icmp_statistics[2]; -+ struct ipstats_mib *_ip_statistics[2]; -+ struct tcp_mib *_tcp_statistics[2]; -+ struct udp_mib *_udp_statistics[2]; -+ struct linux_mib *_net_statistics[2]; -+ struct venet_stat *stat; -+#ifdef CONFIG_VE_IPTABLES -+/* core/netfilter.c virtualization */ -+ void *_nf_hooks; -+ struct ipt_filter_initial_table *_ipt_filter_initial_table; /* initial_table struct */ -+ struct ipt_table *_ve_ipt_filter_pf; /* packet_filter struct */ -+ struct nf_hook_ops *_ve_ipt_filter_io; /* ipt_ops struct */ -+ struct ipt_table *_ipt_mangle_table; -+ struct nf_hook_ops *_ipt_mangle_hooks; -+ struct list_head *_ipt_target; -+ struct list_head *_ipt_match; -+ struct list_head *_ipt_tables; -+ -+ struct ipt_target *_ipt_standard_target; -+ struct ipt_target *_ipt_error_target; -+ struct ipt_match *_tcp_matchstruct; -+ struct ipt_match *_udp_matchstruct; -+ struct ipt_match *_icmp_matchstruct; -+ -+ __u64 _iptables_modules; -+ struct ve_ip_conntrack *_ip_conntrack; -+#endif /* CONFIG_VE_IPTABLES */ -+#endif -+ wait_queue_head_t *_log_wait; -+ unsigned long *_log_start; -+ unsigned long *_log_end; -+ unsigned long *_logged_chars; -+ char *log_buf; -+#define VE_DEFAULT_LOG_BUF_LEN 4096 -+ -+ struct ve_cpu_stats ve_cpu_stats[NR_CPUS] ____cacheline_aligned; -+ unsigned long down_at; -+ struct list_head cleanup_list; -+ -+ unsigned long jiffies_fixup; -+ unsigned char disable_net; -+ unsigned char sparse_vpid; -+ struct ve_monitor *monitor; -+ struct proc_dir_entry *monitor_proc; -+}; -+ -+#define VE_CPU_STATS(ve, cpu) (&((ve)->ve_cpu_stats[(cpu)])) -+ -+extern int nr_ve; -+ -+#ifdef CONFIG_VE -+ -+int get_device_perms_ve(int dev_type, dev_t dev, int access_mode); -+void do_env_cleanup(struct ve_struct *envid); -+void do_update_load_avg_ve(void); -+void do_env_free(struct ve_struct *ptr); -+ 
-+#define ve_utsname (*get_exec_env()->utsname) -+ -+static inline struct ve_struct *get_ve(struct ve_struct *ptr) -+{ -+ if (ptr != NULL) -+ atomic_inc(&ptr->counter); -+ return ptr; -+} -+ -+static inline void put_ve(struct ve_struct *ptr) -+{ -+ if (ptr && atomic_dec_and_test(&ptr->counter)) { -+ if (atomic_read(&ptr->pcounter) > 0) -+ BUG(); -+ if (ptr->is_running) -+ BUG(); -+ do_env_free(ptr); -+ } -+} -+ -+#define ve_cpu_online_map(ve, mask) fairsched_cpu_online_map(ve->veid, mask) -+#else /* CONFIG_VE */ -+#define ve_utsname system_utsname -+#define get_ve(ve) (NULL) -+#define put_ve(ve) do { } while (0) -+#endif /* CONFIG_VE */ -+ -+#endif /* _LINUX_VE_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/ve_owner.h linux-2.6.8.1-ve022stab078/include/linux/ve_owner.h ---- linux-2.6.8.1.orig/include/linux/ve_owner.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/ve_owner.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,32 @@ -+/* -+ * include/linux/ve_proto.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __VE_OWNER_H__ -+#define __VE_OWNER_H__ -+ -+#include <linux/config.h> -+#include <linux/vmalloc.h> -+ -+ -+#define DCL_VE_OWNER(name, kind, type, member, attr1, attr2) -+ /* prototype declares static inline functions */ -+ -+#define DCL_VE_OWNER_PROTO(name, kind, type, member, attr1, attr2) \ -+type; \ -+static inline struct ve_struct *VE_OWNER_##name(type *obj) \ -+{ \ -+ return obj->member; \ -+} \ -+static inline void SET_VE_OWNER_##name(type *obj, struct ve_struct *ve) \ -+{ \ -+ obj->member = ve; \ -+} -+ -+#endif /* __VE_OWNER_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/ve_proto.h linux-2.6.8.1-ve022stab078/include/linux/ve_proto.h ---- linux-2.6.8.1.orig/include/linux/ve_proto.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/ve_proto.h 2006-05-11 13:05:42.000000000 +0400 -@@ -0,0 +1,73 @@ -+/* -+ * include/linux/ve_proto.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __VE_H__ -+#define __VE_H__ -+ -+#ifdef CONFIG_VE -+ -+extern struct semaphore ve_call_guard; -+extern rwlock_t ve_call_lock; -+ -+#ifdef CONFIG_SYSVIPC -+extern void prepare_ipc(void); -+extern int init_ve_ipc(struct ve_struct *); -+extern void fini_ve_ipc(struct ve_struct *); -+extern void ve_ipc_cleanup(void); -+#endif -+ -+extern struct tty_driver *get_pty_driver(void); -+extern struct tty_driver *get_pty_slave_driver(void); -+#ifdef CONFIG_UNIX98_PTYS -+extern struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */ -+extern struct tty_driver *pts_driver; /* Unix98 pty slaves; for /dev/ptmx */ -+#endif -+ -+extern rwlock_t tty_driver_guard; -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+void ip_fragment_cleanup(struct ve_struct *envid); -+void tcp_v4_kill_ve_sockets(struct ve_struct *envid); -+struct fib_table * fib_hash_init(int id); -+int move_addr_to_kernel(void *uaddr, int ulen, void *kaddr); -+extern int main_loopback_init(struct net_device*); -+int venet_init(void); -+#endif -+ -+extern struct ve_struct *ve_list_head; -+extern rwlock_t ve_list_guard; -+extern struct ve_struct *get_ve_by_id(envid_t); -+extern struct ve_struct *__find_ve_by_id(envid_t); -+ -+extern int do_setdevperms(envid_t veid, unsigned type, -+ dev_t dev, unsigned mask); -+ -+#define VE_HOOK_INIT 0 -+#define VE_HOOK_FINI 1 -+#define VE_MAX_HOOKS 2 -+ -+typedef int ve_hookfn(unsigned int hooknum, void *data); -+ -+struct ve_hook -+{ -+ struct list_head list; -+ ve_hookfn *hook; -+ ve_hookfn *undo; -+ struct module *owner; -+ int hooknum; -+ /* Functions are called in ascending priority. 
*/ -+ int priority; -+}; -+ -+extern int ve_hook_register(struct ve_hook *vh); -+extern void ve_hook_unregister(struct ve_hook *vh); -+ -+#endif -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/ve_task.h linux-2.6.8.1-ve022stab078/include/linux/ve_task.h ---- linux-2.6.8.1.orig/include/linux/ve_task.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/ve_task.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,34 @@ -+/* -+ * include/linux/ve_task.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __VE_TASK_H__ -+#define __VE_TASK_H__ -+ -+#include <linux/seqlock.h> -+ -+struct ve_task_info { -+/* virtualization */ -+ struct ve_struct *owner_env; -+ struct ve_struct *exec_env; -+ struct list_head vetask_list; -+ struct dentry *glob_proc_dentry; -+/* statistics: scheduling latency */ -+ cycles_t sleep_time; -+ cycles_t sched_time; -+ cycles_t sleep_stamp; -+ cycles_t wakeup_stamp; -+ seqcount_t wakeup_lock; -+}; -+ -+#define VE_TASK_INFO(task) (&(task)->ve_task_info) -+#define VE_TASK_LIST_2_TASK(lh) \ -+ list_entry(lh, struct task_struct, ve_task_info.vetask_list) -+ -+#endif /* __VE_TASK_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/venet.h linux-2.6.8.1-ve022stab078/include/linux/venet.h ---- linux-2.6.8.1.orig/include/linux/venet.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/venet.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,68 @@ -+/* -+ * include/linux/venet.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _VENET_H -+#define _VENET_H -+ -+#include <linux/list.h> -+#include <linux/spinlock.h> -+#include <linux/vzcalluser.h> -+ -+#define VEIP_HASH_SZ 512 -+ -+struct ve_struct; -+struct venet_stat; -+struct ip_entry_struct -+{ -+ __u32 ip; -+ struct ve_struct *active_env; -+ struct venet_stat *stat; -+ struct veip_struct *veip; -+ struct list_head ip_hash; -+ struct list_head ve_list; -+}; -+ -+struct veip_struct -+{ -+ struct list_head src_lh; -+ struct list_head dst_lh; -+ struct list_head ip_lh; -+ struct list_head list; -+ envid_t veid; -+}; -+ -+/* veip_hash_lock should be taken for write by caller */ -+void ip_entry_hash(struct ip_entry_struct *entry, struct veip_struct *veip); -+/* veip_hash_lock should be taken for write by caller */ -+void ip_entry_unhash(struct ip_entry_struct *entry); -+/* veip_hash_lock should be taken for read by caller */ -+struct ip_entry_struct *ip_entry_lookup(u32 addr); -+ -+/* veip_hash_lock should be taken for read by caller */ -+struct veip_struct *veip_find(envid_t veid); -+/* veip_hash_lock should be taken for write by caller */ -+struct veip_struct *veip_findcreate(envid_t veid); -+/* veip_hash_lock should be taken for write by caller */ -+void veip_put(struct veip_struct *veip); -+ -+int veip_start(struct ve_struct *ve); -+void veip_stop(struct ve_struct *ve); -+int veip_entry_add(struct ve_struct *ve, struct sockaddr_in *addr); -+int veip_entry_del(envid_t veid, struct sockaddr_in *addr); -+int venet_change_skb_owner(struct sk_buff *skb); -+ -+extern struct list_head ip_entry_hash_table[]; -+extern rwlock_t veip_hash_lock; -+ -+#ifdef CONFIG_PROC_FS -+int veip_seq_show(struct seq_file *m, void *v); -+#endif -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/veprintk.h linux-2.6.8.1-ve022stab078/include/linux/veprintk.h ---- linux-2.6.8.1.orig/include/linux/veprintk.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/veprintk.h 2006-05-11 13:05:42.000000000 +0400 -@@ 
-0,0 +1,38 @@ -+/* -+ * include/linux/veprintk.h -+ * -+ * Copyright (C) 2006 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __VE_PRINTK_H__ -+#define __VE_PRINTK_H__ -+ -+#ifdef CONFIG_VE -+ -+#define ve_log_wait (*(get_exec_env()->_log_wait)) -+#define ve_log_start (*(get_exec_env()->_log_start)) -+#define ve_log_end (*(get_exec_env()->_log_end)) -+#define ve_logged_chars (*(get_exec_env()->_logged_chars)) -+#define ve_log_buf (get_exec_env()->log_buf) -+#define ve_log_buf_len (ve_is_super(get_exec_env()) ? \ -+ log_buf_len : VE_DEFAULT_LOG_BUF_LEN) -+#define VE_LOG_BUF_MASK (ve_log_buf_len - 1) -+#define VE_LOG_BUF(idx) (ve_log_buf[(idx) & VE_LOG_BUF_MASK]) -+ -+#else -+ -+#define ve_log_wait log_wait -+#define ve_log_start log_start -+#define ve_log_end log_end -+#define ve_logged_chars logged_chars -+#define ve_log_buf log_buf -+#define ve_log_buf_len log_buf_len -+#define VE_LOG_BUF_MASK LOG_BUF_MASK -+#define VE_LOG_BUF(idx) LOG_BUF(idx) -+ -+#endif /* CONFIG_VE */ -+#endif /* __VE_PRINTK_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/virtinfo.h linux-2.6.8.1-ve022stab078/include/linux/virtinfo.h ---- linux-2.6.8.1.orig/include/linux/virtinfo.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/virtinfo.h 2006-05-11 13:05:49.000000000 +0400 -@@ -0,0 +1,86 @@ -+/* -+ * include/linux/virtinfo.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __LINUX_VIRTINFO_H -+#define __LINUX_VIRTINFO_H -+ -+#include <linux/config.h> -+#include <linux/kernel.h> -+#include <linux/page-flags.h> -+#include <linux/notifier.h> -+ -+struct vnotifier_block -+{ -+ int (*notifier_call)(struct vnotifier_block *self, -+ unsigned long, void *, int); -+ struct vnotifier_block *next; -+ int priority; -+}; -+ -+extern struct semaphore virtinfo_sem; -+void __virtinfo_notifier_register(int type, struct vnotifier_block *nb); -+void virtinfo_notifier_register(int type, struct vnotifier_block *nb); -+void virtinfo_notifier_unregister(int type, struct vnotifier_block *nb); -+int virtinfo_notifier_call(int type, unsigned long n, void *data); -+ -+struct meminfo { -+ struct sysinfo si; -+ unsigned long active, inactive; -+ unsigned long cache, swapcache; -+ unsigned long committed_space; -+ struct page_state ps; -+ unsigned long vmalloc_total, vmalloc_used, vmalloc_largest; -+}; -+ -+#define VIRTINFO_DOFORK 0 -+#define VIRTINFO_DOEXIT 1 -+#define VIRTINFO_DOEXECVE 2 -+#define VIRTINFO_DOFORKRET 3 -+#define VIRTINFO_DOFORKPOST 4 -+#define VIRTINFO_EXIT 5 -+#define VIRTINFO_EXITMMAP 6 -+#define VIRTINFO_EXECMMAP 7 -+#define VIRTINFO_ENOUGHMEM 8 -+#define VIRTINFO_OUTOFMEM 9 -+#define VIRTINFO_PAGEIN 10 -+#define VIRTINFO_MEMINFO 11 -+#define VIRTINFO_SYSINFO 12 -+#define VIRTINFO_NEWUBC 13 -+ -+enum virt_info_types { -+ VITYPE_GENERAL, -+ VITYPE_FAUDIT, -+ VITYPE_QUOTA, -+ VITYPE_SCP, -+ -+ VIRT_TYPES -+}; -+ -+#ifdef CONFIG_VZ_GENCALLS -+ -+static inline int virtinfo_gencall(unsigned long n, void *data) -+{ -+ int r; -+ -+ r = virtinfo_notifier_call(VITYPE_GENERAL, n, data); -+ if (r & NOTIFY_FAIL) -+ return -ENOBUFS; -+ if (r & NOTIFY_OK) -+ return -ERESTARTNOINTR; -+ return 0; -+} -+ -+#else -+ -+#define virtinfo_gencall(n, data) 0 -+ -+#endif -+ -+#endif /* __LINUX_VIRTINFO_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/vmalloc.h linux-2.6.8.1-ve022stab078/include/linux/vmalloc.h ---- 
linux-2.6.8.1.orig/include/linux/vmalloc.h 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/vmalloc.h 2006-05-11 13:05:40.000000000 +0400 -@@ -9,6 +9,10 @@ - #define VM_ALLOC 0x00000002 /* vmalloc() */ - #define VM_MAP 0x00000004 /* vmap()ed pages */ - -+/* align size to 2^n page boundary */ -+#define POWER2_PAGE_ALIGN(size) \ -+ ((typeof(size))(1UL << (PAGE_SHIFT + get_order(size)))) -+ - struct vm_struct { - void *addr; - unsigned long size; -@@ -26,6 +30,8 @@ extern void *vmalloc(unsigned long size) - extern void *vmalloc_exec(unsigned long size); - extern void *vmalloc_32(unsigned long size); - extern void *__vmalloc(unsigned long size, int gfp_mask, pgprot_t prot); -+extern void *vmalloc_best(unsigned long size); -+extern void *ub_vmalloc_best(unsigned long size); - extern void vfree(void *addr); - - extern void *vmap(struct page **pages, unsigned int count, -@@ -38,6 +44,9 @@ extern void vunmap(void *addr); - extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags); - extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, - unsigned long start, unsigned long end); -+extern struct vm_struct * get_vm_area_best(unsigned long size, -+ unsigned long flags); -+extern void vprintstat(void); - extern struct vm_struct *remove_vm_area(void *addr); - extern int map_vm_area(struct vm_struct *area, pgprot_t prot, - struct page ***pages); -diff -uprN linux-2.6.8.1.orig/include/linux/vsched.h linux-2.6.8.1-ve022stab078/include/linux/vsched.h ---- linux-2.6.8.1.orig/include/linux/vsched.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vsched.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,34 @@ -+/* -+ * include/linux/vsched.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __VSCHED_H__ -+#define __VSCHED_H__ -+ -+#include <linux/config.h> -+#include <linux/cache.h> -+#include <linux/fairsched.h> -+#include <linux/sched.h> -+ -+extern int vsched_create(int id, struct fairsched_node *node); -+extern int vsched_destroy(struct vcpu_scheduler *vsched); -+ -+extern int vsched_mvpr(struct task_struct *p, struct vcpu_scheduler *vsched); -+ -+extern int vcpu_online(int cpu); -+ -+#ifdef CONFIG_VE -+#ifdef CONFIG_FAIRSCHED -+extern unsigned long ve_scale_khz(unsigned long khz); -+#else -+#define ve_scale_khz(khz) (khz) -+#endif -+#endif -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/vzcalluser.h linux-2.6.8.1-ve022stab078/include/linux/vzcalluser.h ---- linux-2.6.8.1.orig/include/linux/vzcalluser.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzcalluser.h 2006-05-11 13:05:48.000000000 +0400 -@@ -0,0 +1,220 @@ -+/* -+ * include/linux/vzcalluser.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _LINUX_VZCALLUSER_H -+#define _LINUX_VZCALLUSER_H -+ -+#include <linux/types.h> -+#include <linux/ioctl.h> -+ -+#define KERN_VZ_PRIV_RANGE 51 -+ -+#ifndef __ENVID_T_DEFINED__ -+typedef unsigned envid_t; -+#define __ENVID_T_DEFINED__ -+#endif -+ -+/* -+ * VE management ioctls -+ */ -+ -+struct vzctl_old_env_create { -+ envid_t veid; -+ unsigned flags; -+#define VE_CREATE 1 /* Create VE, VE_ENTER added automatically */ -+#define VE_EXCLUSIVE 2 /* Fail if exists */ -+#define VE_ENTER 4 /* Enter existing VE */ -+#define VE_TEST 8 /* Test if VE exists */ -+#define VE_LOCK 16 /* Do not allow entering created VE */ -+#define VE_SKIPLOCK 32 /* Allow entering embrion VE */ -+ __u32 addr; -+}; -+ -+struct vzctl_mark_env_to_down { -+ envid_t veid; -+}; -+ -+struct vzctl_setdevperms { -+ envid_t veid; -+ unsigned type; -+#define VE_USE_MAJOR 010 /* Test MAJOR supplied in rule */ -+#define VE_USE_MINOR 030 /* Test MINOR supplied in rule */ -+#define VE_USE_MASK 030 /* Testing mask, VE_USE_MAJOR|VE_USE_MINOR */ -+ unsigned dev; -+ unsigned mask; -+}; -+ -+struct vzctl_ve_netdev { -+ envid_t veid; -+ int op; -+#define VE_NETDEV_ADD 1 -+#define VE_NETDEV_DEL 2 -+ char *dev_name; -+}; -+ -+/* these masks represent modules */ -+#define VE_IP_IPTABLES_MOD (1U<<0) -+#define VE_IP_FILTER_MOD (1U<<1) -+#define VE_IP_MANGLE_MOD (1U<<2) -+#define VE_IP_MATCH_LIMIT_MOD (1U<<3) -+#define VE_IP_MATCH_MULTIPORT_MOD (1U<<4) -+#define VE_IP_MATCH_TOS_MOD (1U<<5) -+#define VE_IP_TARGET_TOS_MOD (1U<<6) -+#define VE_IP_TARGET_REJECT_MOD (1U<<7) -+#define VE_IP_TARGET_TCPMSS_MOD (1U<<8) -+#define VE_IP_MATCH_TCPMSS_MOD (1U<<9) -+#define VE_IP_MATCH_TTL_MOD (1U<<10) -+#define VE_IP_TARGET_LOG_MOD (1U<<11) -+#define VE_IP_MATCH_LENGTH_MOD (1U<<12) -+#define VE_IP_CONNTRACK_MOD (1U<<14) -+#define VE_IP_CONNTRACK_FTP_MOD (1U<<15) -+#define VE_IP_CONNTRACK_IRC_MOD (1U<<16) -+#define VE_IP_MATCH_CONNTRACK_MOD (1U<<17) -+#define VE_IP_MATCH_STATE_MOD (1U<<18) -+#define 
VE_IP_MATCH_HELPER_MOD (1U<<19) -+#define VE_IP_NAT_MOD (1U<<20) -+#define VE_IP_NAT_FTP_MOD (1U<<21) -+#define VE_IP_NAT_IRC_MOD (1U<<22) -+#define VE_IP_TARGET_REDIRECT_MOD (1U<<23) -+ -+/* these masks represent modules with their dependences */ -+#define VE_IP_IPTABLES (VE_IP_IPTABLES_MOD) -+#define VE_IP_FILTER (VE_IP_FILTER_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MANGLE (VE_IP_MANGLE_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_LIMIT (VE_IP_MATCH_LIMIT_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_MULTIPORT (VE_IP_MATCH_MULTIPORT_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_TOS (VE_IP_MATCH_TOS_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_TARGET_TOS (VE_IP_TARGET_TOS_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_TARGET_REJECT (VE_IP_TARGET_REJECT_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_TARGET_TCPMSS (VE_IP_TARGET_TCPMSS_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_TCPMSS (VE_IP_MATCH_TCPMSS_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_TTL (VE_IP_MATCH_TTL_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_TARGET_LOG (VE_IP_TARGET_LOG_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_MATCH_LENGTH (VE_IP_MATCH_LENGTH_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_CONNTRACK (VE_IP_CONNTRACK_MOD \ -+ | VE_IP_IPTABLES) -+#define VE_IP_CONNTRACK_FTP (VE_IP_CONNTRACK_FTP_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_CONNTRACK_IRC (VE_IP_CONNTRACK_IRC_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_MATCH_CONNTRACK (VE_IP_MATCH_CONNTRACK_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_MATCH_STATE (VE_IP_MATCH_STATE_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_MATCH_HELPER (VE_IP_MATCH_HELPER_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_NAT (VE_IP_NAT_MOD \ -+ | VE_IP_CONNTRACK) -+#define VE_IP_NAT_FTP (VE_IP_NAT_FTP_MOD \ -+ | VE_IP_NAT | VE_IP_CONNTRACK_FTP) -+#define VE_IP_NAT_IRC (VE_IP_NAT_IRC_MOD \ -+ | VE_IP_NAT | VE_IP_CONNTRACK_IRC) -+#define VE_IP_TARGET_REDIRECT (VE_IP_TARGET_REDIRECT_MOD \ -+ | VE_IP_NAT) -+ -+/* safe iptables mask to be used by default */ -+#define 
VE_IP_DEFAULT \ -+ (VE_IP_IPTABLES | \ -+ VE_IP_FILTER | VE_IP_MANGLE | \ -+ VE_IP_MATCH_LIMIT | VE_IP_MATCH_MULTIPORT | \ -+ VE_IP_MATCH_TOS | VE_IP_TARGET_REJECT | \ -+ VE_IP_TARGET_TCPMSS | VE_IP_MATCH_TCPMSS | \ -+ VE_IP_MATCH_TTL | VE_IP_MATCH_LENGTH) -+ -+#define VE_IPT_CMP(x,y) (((x) & (y)) == (y)) -+ -+struct vzctl_env_create_cid { -+ envid_t veid; -+ unsigned flags; -+ __u32 class_id; -+}; -+ -+struct vzctl_env_create { -+ envid_t veid; -+ unsigned flags; -+ __u32 class_id; -+}; -+ -+struct env_create_param { -+ __u64 iptables_mask; -+}; -+#define VZCTL_ENV_CREATE_DATA_MINLEN sizeof(struct env_create_param) -+ -+struct env_create_param2 { -+ __u64 iptables_mask; -+ __u64 feature_mask; -+#define VE_FEATURE_SYSFS (1ULL << 0) -+ __u32 total_vcpus; /* 0 - don't care, same as in host */ -+}; -+#define VZCTL_ENV_CREATE_DATA_MAXLEN sizeof(struct env_create_param2) -+ -+typedef struct env_create_param2 env_create_param_t; -+ -+struct vzctl_env_create_data { -+ envid_t veid; -+ unsigned flags; -+ __u32 class_id; -+ env_create_param_t *data; -+ int datalen; -+}; -+ -+struct vz_load_avg { -+ int val_int; -+ int val_frac; -+}; -+ -+struct vz_cpu_stat { -+ unsigned long user_jif; -+ unsigned long nice_jif; -+ unsigned long system_jif; -+ unsigned long uptime_jif; -+ __u64 idle_clk; -+ __u64 strv_clk; -+ __u64 uptime_clk; -+ struct vz_load_avg avenrun[3]; /* loadavg data */ -+}; -+ -+struct vzctl_cpustatctl { -+ envid_t veid; -+ struct vz_cpu_stat *cpustat; -+}; -+ -+#define VZCTLTYPE '.' 
-+#define VZCTL_OLD_ENV_CREATE _IOW(VZCTLTYPE, 0, \ -+ struct vzctl_old_env_create) -+#define VZCTL_MARK_ENV_TO_DOWN _IOW(VZCTLTYPE, 1, \ -+ struct vzctl_mark_env_to_down) -+#define VZCTL_SETDEVPERMS _IOW(VZCTLTYPE, 2, \ -+ struct vzctl_setdevperms) -+#define VZCTL_ENV_CREATE_CID _IOW(VZCTLTYPE, 4, \ -+ struct vzctl_env_create_cid) -+#define VZCTL_ENV_CREATE _IOW(VZCTLTYPE, 5, \ -+ struct vzctl_env_create) -+#define VZCTL_GET_CPU_STAT _IOW(VZCTLTYPE, 6, \ -+ struct vzctl_cpustatctl) -+#define VZCTL_ENV_CREATE_DATA _IOW(VZCTLTYPE, 10, \ -+ struct vzctl_env_create_data) -+#define VZCTL_VE_NETDEV _IOW(VZCTLTYPE, 11, \ -+ struct vzctl_ve_netdev) -+ -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/vzctl.h linux-2.6.8.1-ve022stab078/include/linux/vzctl.h ---- linux-2.6.8.1.orig/include/linux/vzctl.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzctl.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,30 @@ -+/* -+ * include/linux/vzctl.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef _LINUX_VZCTL_H -+#define _LINUX_VZCTL_H -+ -+#include <linux/list.h> -+ -+struct module; -+struct inode; -+struct file; -+struct vzioctlinfo { -+ unsigned type; -+ int (*func)(struct inode *, struct file *, -+ unsigned int, unsigned long); -+ struct module *owner; -+ struct list_head list; -+}; -+ -+extern void vzioctl_register(struct vzioctlinfo *inf); -+extern void vzioctl_unregister(struct vzioctlinfo *inf); -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/vzctl_quota.h linux-2.6.8.1-ve022stab078/include/linux/vzctl_quota.h ---- linux-2.6.8.1.orig/include/linux/vzctl_quota.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzctl_quota.h 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,43 @@ -+/* -+ * include/linux/vzctl_quota.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. 
-+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __LINUX_VZCTL_QUOTA_H__ -+#define __LINUX_VZCTL_QUOTA_H__ -+ -+/* -+ * Quota management ioctl -+ */ -+ -+struct vz_quota_stat; -+struct vzctl_quotactl { -+ int cmd; -+ unsigned int quota_id; -+ struct vz_quota_stat *qstat; -+ char *ve_root; -+}; -+ -+struct vzctl_quotaugidctl { -+ int cmd; /* subcommand */ -+ unsigned int quota_id; /* quota id where it applies to */ -+ unsigned int ugid_index;/* for reading statistic. index of first -+ uid/gid record to read */ -+ unsigned int ugid_size; /* size of ugid_buf array */ -+ void *addr; /* user-level buffer */ -+}; -+ -+#define VZDQCTLTYPE '+' -+#define VZCTL_QUOTA_CTL _IOWR(VZDQCTLTYPE, 1, \ -+ struct vzctl_quotactl) -+#define VZCTL_QUOTA_NEW_CTL _IOWR(VZDQCTLTYPE, 2, \ -+ struct vzctl_quotactl) -+#define VZCTL_QUOTA_UGID_CTL _IOWR(VZDQCTLTYPE, 3, \ -+ struct vzctl_quotaugidctl) -+ -+#endif /* __LINUX_VZCTL_QUOTA_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/vzctl_venet.h linux-2.6.8.1-ve022stab078/include/linux/vzctl_venet.h ---- linux-2.6.8.1.orig/include/linux/vzctl_venet.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzctl_venet.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,36 @@ -+/* -+ * include/linux/vzctl_venet.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _VZCTL_VENET_H -+#define _VZCTL_VENET_H -+ -+#include <linux/types.h> -+#include <linux/ioctl.h> -+ -+#ifndef __ENVID_T_DEFINED__ -+typedef unsigned envid_t; -+#define __ENVID_T_DEFINED__ -+#endif -+ -+struct vzctl_ve_ip_map { -+ envid_t veid; -+ int op; -+#define VE_IP_ADD 1 -+#define VE_IP_DEL 2 -+ struct sockaddr *addr; -+ int addrlen; -+}; -+ -+#define VENETCTLTYPE '(' -+ -+#define VENETCTL_VE_IP_MAP _IOW(VENETCTLTYPE, 3, \ -+ struct vzctl_ve_ip_map) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/linux/vzdq_tree.h linux-2.6.8.1-ve022stab078/include/linux/vzdq_tree.h ---- linux-2.6.8.1.orig/include/linux/vzdq_tree.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzdq_tree.h 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,99 @@ -+/* -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * This file contains Virtuozzo disk quota tree definition -+ */ -+ -+#ifndef _VZDQ_TREE_H -+#define _VZDQ_TREE_H -+ -+#include <linux/list.h> -+#include <asm/string.h> -+ -+typedef unsigned int quotaid_t; -+#define QUOTAID_BITS 32 -+#define QUOTAID_BBITS 4 -+#define QUOTAID_EBITS 8 -+ -+#if QUOTAID_EBITS % QUOTAID_BBITS -+#error Quota bit assumption failure -+#endif -+ -+#define QUOTATREE_BSIZE (1 << QUOTAID_BBITS) -+#define QUOTATREE_BMASK (QUOTATREE_BSIZE - 1) -+#define QUOTATREE_DEPTH ((QUOTAID_BITS + QUOTAID_BBITS - 1) \ -+ / QUOTAID_BBITS) -+#define QUOTATREE_EDEPTH ((QUOTAID_BITS + QUOTAID_EBITS - 1) \ -+ / QUOTAID_EBITS) -+#define QUOTATREE_BSHIFT(lvl) ((QUOTATREE_DEPTH - (lvl) - 1) * QUOTAID_BBITS) -+ -+/* -+ * Depth of keeping unused node (not inclusive). -+ * 0 means release all nodes including root, -+ * QUOTATREE_DEPTH means never release nodes. -+ * Current value: release all nodes strictly after QUOTATREE_EDEPTH -+ * (measured in external shift units). 
-+ */ -+#define QUOTATREE_CDEPTH (QUOTATREE_DEPTH \ -+ - 2 * QUOTATREE_DEPTH / QUOTATREE_EDEPTH \ -+ + 1) -+ -+/* -+ * Levels 0..(QUOTATREE_DEPTH-1) are tree nodes. -+ * On level i the maximal number of nodes is 2^(i*QUOTAID_BBITS), -+ * and each node contains 2^QUOTAID_BBITS pointers. -+ * Level 0 is a (single) tree root node. -+ * -+ * Nodes of level (QUOTATREE_DEPTH-1) contain pointers to caller's data. -+ * Nodes of lower levels contain pointers to nodes. -+ * -+ * Double pointer in array of i-level node, pointing to a (i+1)-level node -+ * (such as inside quotatree_find_state) are marked by level (i+1), not i. -+ * Level 0 double pointer is a pointer to root inside tree struct. -+ * -+ * The tree is permanent, i.e. all index blocks allocated are keeped alive to -+ * preserve the blocks numbers in the quota file tree to keep its changes -+ * locally. -+ */ -+struct quotatree_node { -+ struct list_head list; -+ quotaid_t num; -+ void *blocks[QUOTATREE_BSIZE]; -+}; -+ -+struct quotatree_level { -+ struct list_head usedlh, freelh; -+ quotaid_t freenum; -+}; -+ -+struct quotatree_tree { -+ struct quotatree_level levels[QUOTATREE_DEPTH]; -+ struct quotatree_node *root; -+ unsigned int leaf_num; -+}; -+ -+struct quotatree_find_state { -+ void **block; -+ int level; -+}; -+ -+/* number of leafs (objects) and leaf level of the tree */ -+#define QTREE_LEAFNUM(tree) ((tree)->leaf_num) -+#define QTREE_LEAFLVL(tree) (&(tree)->levels[QUOTATREE_DEPTH - 1]) -+ -+struct quotatree_tree *quotatree_alloc(void); -+void *quotatree_find(struct quotatree_tree *tree, quotaid_t id, -+ struct quotatree_find_state *st); -+int quotatree_insert(struct quotatree_tree *tree, quotaid_t id, -+ struct quotatree_find_state *st, void *data); -+void quotatree_remove(struct quotatree_tree *tree, quotaid_t id); -+void quotatree_free(struct quotatree_tree *tree, void (*dtor)(void *)); -+void *quotatree_get_next(struct quotatree_tree *tree, quotaid_t id); -+void *quotatree_leaf_byindex(struct 
quotatree_tree *tree, unsigned int index); -+ -+#endif /* _VZDQ_TREE_H */ -+ -diff -uprN linux-2.6.8.1.orig/include/linux/vzquota.h linux-2.6.8.1-ve022stab078/include/linux/vzquota.h ---- linux-2.6.8.1.orig/include/linux/vzquota.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzquota.h 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,291 @@ -+/* -+ * -+ * Copyright (C) 2001-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * This file contains Virtuozzo disk quota implementation -+ */ -+ -+#ifndef _VZDQUOTA_H -+#define _VZDQUOTA_H -+ -+#include <linux/types.h> -+#include <linux/quota.h> -+ -+/* vzquotactl syscall commands */ -+#define VZ_DQ_CREATE 5 /* create quota master block */ -+#define VZ_DQ_DESTROY 6 /* destroy qmblk */ -+#define VZ_DQ_ON 7 /* mark dentry with already created qmblk */ -+#define VZ_DQ_OFF 8 /* remove mark, don't destroy qmblk */ -+#define VZ_DQ_SETLIMIT 9 /* set new limits */ -+#define VZ_DQ_GETSTAT 10 /* get usage statistic */ -+/* set of syscalls to maintain UGID quotas */ -+#define VZ_DQ_UGID_GETSTAT 1 /* get usage/limits for ugid(s) */ -+#define VZ_DQ_UGID_ADDSTAT 2 /* set usage/limits statistic for ugid(s) */ -+#define VZ_DQ_UGID_GETGRACE 3 /* get expire times */ -+#define VZ_DQ_UGID_SETGRACE 4 /* set expire times */ -+#define VZ_DQ_UGID_GETCONFIG 5 /* get ugid_max limit, cnt, flags of qmblk */ -+#define VZ_DQ_UGID_SETCONFIG 6 /* set ugid_max limit, flags of qmblk */ -+#define VZ_DQ_UGID_SETLIMIT 7 /* set ugid B/I limits */ -+#define VZ_DQ_UGID_SETINFO 8 /* set ugid info */ -+ -+/* common structure for vz and ugid quota */ -+struct dq_stat { -+ /* blocks limits */ -+ __u64 bhardlimit; /* absolute limit in bytes */ -+ __u64 bsoftlimit; /* preferred limit in bytes */ -+ time_t btime; /* time limit for excessive disk use */ -+ __u64 bcurrent; /* current bytes count */ -+ /* inodes limits */ -+ __u32 ihardlimit; /* absolute limit on allocated inodes */ 
-+ __u32 isoftlimit; /* preferred inode limit */ -+ time_t itime; /* time limit for excessive inode use */ -+ __u32 icurrent; /* current # allocated inodes */ -+}; -+ -+/* Values for dq_info->flags */ -+#define VZ_QUOTA_INODES 0x01 /* inodes limit warning printed */ -+#define VZ_QUOTA_SPACE 0x02 /* space limit warning printed */ -+ -+struct dq_info { -+ time_t bexpire; /* expire timeout for excessive disk use */ -+ time_t iexpire; /* expire timeout for excessive inode use */ -+ unsigned flags; /* see previos defines */ -+}; -+ -+struct vz_quota_stat { -+ struct dq_stat dq_stat; -+ struct dq_info dq_info; -+}; -+ -+/* UID/GID interface record - for user-kernel level exchange */ -+struct vz_quota_iface { -+ unsigned int qi_id; /* UID/GID this applies to */ -+ unsigned int qi_type; /* USRQUOTA|GRPQUOTA */ -+ struct dq_stat qi_stat; /* limits, options, usage stats */ -+}; -+ -+/* values for flags and dq_flags */ -+/* this flag is set if the userspace has been unable to provide usage -+ * information about all ugids -+ * if the flag is set, we don't allocate new UG quota blocks (their -+ * current usage is unknown) or free existing UG quota blocks (not to -+ * lose information that this block is ok) */ -+#define VZDQUG_FIXED_SET 0x01 -+/* permit to use ugid quota */ -+#define VZDQUG_ON 0x02 -+#define VZDQ_USRQUOTA 0x10 -+#define VZDQ_GRPQUOTA 0x20 -+#define VZDQ_NOACT 0x1000 /* not actual */ -+#define VZDQ_NOQUOT 0x2000 /* not under quota tree */ -+ -+struct vz_quota_ugid_stat { -+ unsigned int limit; /* max amount of ugid records */ -+ unsigned int count; /* amount of ugid records */ -+ unsigned int flags; -+}; -+ -+struct vz_quota_ugid_setlimit { -+ unsigned int type; /* quota type (USR/GRP) */ -+ unsigned int id; /* ugid */ -+ struct if_dqblk dqb; /* limits info */ -+}; -+ -+struct vz_quota_ugid_setinfo { -+ unsigned int type; /* quota type (USR/GRP) */ -+ struct if_dqinfo dqi; /* grace info */ -+}; -+ -+#ifdef __KERNEL__ -+#include <linux/list.h> -+#include 
<asm/atomic.h> -+#include <asm/semaphore.h> -+#include <linux/time.h> -+#include <linux/vzquota_qlnk.h> -+#include <linux/vzdq_tree.h> -+ -+/* One-second resolution for grace times */ -+#define CURRENT_TIME_SECONDS (get_seconds()) -+ -+/* Values for dq_info flags */ -+#define VZ_QUOTA_INODES 0x01 /* inodes limit warning printed */ -+#define VZ_QUOTA_SPACE 0x02 /* space limit warning printed */ -+ -+/* values for dq_state */ -+#define VZDQ_STARTING 0 /* created, not turned on yet */ -+#define VZDQ_WORKING 1 /* quota created, turned on */ -+#define VZDQ_STOPING 2 /* created, turned on and off */ -+ -+/* master quota record - one per veid */ -+struct vz_quota_master { -+ struct list_head dq_hash; /* next quota in hash list */ -+ atomic_t dq_count; /* inode reference count */ -+ unsigned int dq_flags; /* see VZDQUG_FIXED_SET */ -+ unsigned int dq_state; /* see values above */ -+ unsigned int dq_id; /* VEID this applies to */ -+ struct dq_stat dq_stat; /* limits, grace, usage stats */ -+ struct dq_info dq_info; /* grace times and flags */ -+ spinlock_t dq_data_lock; /* for dq_stat */ -+ -+ struct semaphore dq_sem; /* semaphore to protect -+ ugid tree */ -+ -+ struct list_head dq_ilink_list; /* list of vz_quota_ilink */ -+ struct quotatree_tree *dq_uid_tree; /* vz_quota_ugid tree for UIDs */ -+ struct quotatree_tree *dq_gid_tree; /* vz_quota_ugid tree for GIDs */ -+ unsigned int dq_ugid_count; /* amount of ugid records */ -+ unsigned int dq_ugid_max; /* max amount of ugid records */ -+ struct dq_info dq_ugid_info[MAXQUOTAS]; /* ugid grace times */ -+ -+ struct dentry *dq_root_dentry;/* dentry of fs tree */ -+ struct vfsmount *dq_root_mnt; /* vfsmnt of this dentry */ -+ struct super_block *dq_sb; /* superblock of our quota root */ -+}; -+ -+/* UID/GID quota record - one per pair (quota_master, uid or gid) */ -+struct vz_quota_ugid { -+ unsigned int qugid_id; /* UID/GID this applies to */ -+ struct dq_stat qugid_stat; /* limits, options, usage stats */ -+ int qugid_type; 
/* USRQUOTA|GRPQUOTA */ -+ atomic_t qugid_count; /* reference count */ -+}; -+ -+#define VZ_QUOTA_UGBAD ((struct vz_quota_ugid *)0xfeafea11) -+ -+struct vz_quota_datast { -+ struct vz_quota_ilink qlnk; -+}; -+ -+#define VIRTINFO_QUOTA_GETSTAT 0 -+#define VIRTINFO_QUOTA_ON 1 -+#define VIRTINFO_QUOTA_OFF 2 -+ -+struct virt_info_quota { -+ struct super_block *super; -+ struct dq_stat *qstat; -+}; -+ -+/* -+ * Interface to VZ quota core -+ */ -+#define INODE_QLNK(inode) (&(inode)->i_qlnk) -+#define QLNK_INODE(qlnk) container_of((qlnk), struct inode, i_qlnk) -+ -+#define VZ_QUOTA_BAD ((struct vz_quota_master *)0xefefefef) -+ -+#define VZ_QUOTAO_SETE 1 -+#define VZ_QUOTAO_INIT 2 -+#define VZ_QUOTAO_DESTR 3 -+#define VZ_QUOTAO_SWAP 4 -+#define VZ_QUOTAO_INICAL 5 -+#define VZ_QUOTAO_DRCAL 6 -+#define VZ_QUOTAO_QSET 7 -+#define VZ_QUOTAO_TRANS 8 -+#define VZ_QUOTAO_ACT 9 -+#define VZ_QUOTAO_DTREE 10 -+#define VZ_QUOTAO_DET 11 -+#define VZ_QUOTAO_ON 12 -+ -+extern struct semaphore vz_quota_sem; -+void inode_qmblk_lock(struct super_block *sb); -+void inode_qmblk_unlock(struct super_block *sb); -+void qmblk_data_read_lock(struct vz_quota_master *qmblk); -+void qmblk_data_read_unlock(struct vz_quota_master *qmblk); -+void qmblk_data_write_lock(struct vz_quota_master *qmblk); -+void qmblk_data_write_unlock(struct vz_quota_master *qmblk); -+ -+/* for quota operations */ -+void vzquota_inode_init_call(struct inode *inode); -+void vzquota_inode_drop_call(struct inode *inode); -+int vzquota_inode_transfer_call(struct inode *, struct iattr *); -+struct vz_quota_master *vzquota_inode_data(struct inode *inode, -+ struct vz_quota_datast *); -+void vzquota_data_unlock(struct inode *inode, struct vz_quota_datast *); -+int vzquota_rename_check(struct inode *inode, -+ struct inode *old_dir, struct inode *new_dir); -+struct vz_quota_master *vzquota_inode_qmblk(struct inode *inode); -+/* for second-level quota */ -+struct vz_quota_master *vzquota_find_qmblk(struct super_block *); -+/* for 
management operations */ -+struct vz_quota_master *vzquota_alloc_master(unsigned int quota_id, -+ struct vz_quota_stat *qstat); -+void vzquota_free_master(struct vz_quota_master *); -+struct vz_quota_master *vzquota_find_master(unsigned int quota_id); -+int vzquota_on_qmblk(struct super_block *sb, struct inode *inode, -+ struct vz_quota_master *qmblk); -+int vzquota_off_qmblk(struct super_block *sb, struct vz_quota_master *qmblk); -+int vzquota_get_super(struct super_block *sb); -+void vzquota_put_super(struct super_block *sb); -+ -+static inline struct vz_quota_master *qmblk_get(struct vz_quota_master *qmblk) -+{ -+ if (!atomic_read(&qmblk->dq_count)) -+ BUG(); -+ atomic_inc(&qmblk->dq_count); -+ return qmblk; -+} -+ -+static inline void __qmblk_put(struct vz_quota_master *qmblk) -+{ -+ atomic_dec(&qmblk->dq_count); -+} -+ -+static inline void qmblk_put(struct vz_quota_master *qmblk) -+{ -+ if (!atomic_dec_and_test(&qmblk->dq_count)) -+ return; -+ vzquota_free_master(qmblk); -+} -+ -+extern struct list_head vzquota_hash_table[]; -+extern int vzquota_hash_size; -+ -+/* -+ * Interface to VZ UGID quota -+ */ -+extern struct quotactl_ops vz_quotactl_operations; -+extern struct dquot_operations vz_quota_operations2; -+extern struct quota_format_type vz_quota_empty_v2_format; -+ -+#define QUGID_TREE(qmblk, type) (((type) == USRQUOTA) ? 
\ -+ qmblk->dq_uid_tree : \ -+ qmblk->dq_gid_tree) -+ -+#define VZDQUG_FIND_DONT_ALLOC 1 -+#define VZDQUG_FIND_FAKE 2 -+struct vz_quota_ugid *vzquota_find_ugid(struct vz_quota_master *qmblk, -+ unsigned int quota_id, int type, int flags); -+struct vz_quota_ugid *__vzquota_find_ugid(struct vz_quota_master *qmblk, -+ unsigned int quota_id, int type, int flags); -+struct vz_quota_ugid *vzquota_get_ugid(struct vz_quota_ugid *qugid); -+void vzquota_put_ugid(struct vz_quota_master *qmblk, -+ struct vz_quota_ugid *qugid); -+void vzquota_kill_ugid(struct vz_quota_master *qmblk); -+int vzquota_ugid_init(void); -+void vzquota_ugid_release(void); -+int vzquota_transfer_usage(struct inode *inode, int mask, -+ struct vz_quota_ilink *qlnk); -+ -+struct vzctl_quotaugidctl; -+long do_vzquotaugidctl(struct vzctl_quotaugidctl *qub); -+ -+/* -+ * Other VZ quota parts -+ */ -+extern struct dquot_operations vz_quota_operations; -+ -+long do_vzquotactl(int cmd, unsigned int quota_id, -+ struct vz_quota_stat *qstat, const char *ve_root); -+int vzquota_proc_init(void); -+void vzquota_proc_release(void); -+struct vz_quota_master *vzquota_find_qmblk(struct super_block *); -+extern struct semaphore vz_quota_sem; -+ -+void vzaquota_init(void); -+void vzaquota_fini(void); -+ -+#endif /* __KERNEL__ */ -+ -+#endif /* _VZDQUOTA_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/vzquota_qlnk.h linux-2.6.8.1-ve022stab078/include/linux/vzquota_qlnk.h ---- linux-2.6.8.1.orig/include/linux/vzquota_qlnk.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzquota_qlnk.h 2006-05-11 13:05:43.000000000 +0400 -@@ -0,0 +1,25 @@ -+/* -+ * include/linux/vzquota_qlnk.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _VZDQUOTA_QLNK_H -+#define _VZDQUOTA_QLNK_H -+ -+struct vz_quota_master; -+struct vz_quota_ugid; -+ -+/* inode link, used to track inodes using quota via dq_ilink_list */ -+struct vz_quota_ilink { -+ struct vz_quota_master *qmblk; -+ struct vz_quota_ugid *qugid[MAXQUOTAS]; -+ struct list_head list; -+ unsigned char origin; -+}; -+ -+#endif /* _VZDQUOTA_QLNK_H */ -diff -uprN linux-2.6.8.1.orig/include/linux/vzratelimit.h linux-2.6.8.1-ve022stab078/include/linux/vzratelimit.h ---- linux-2.6.8.1.orig/include/linux/vzratelimit.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzratelimit.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,28 @@ -+/* -+ * include/linux/vzratelimit.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __VZ_RATELIMIT_H__ -+#define __VZ_RATELIMIT_H__ -+ -+/* -+ * Generic ratelimiting stuff. -+ */ -+ -+struct vz_rate_info { -+ int burst; -+ int interval; /* jiffy_t per event */ -+ int bucket; /* kind of leaky bucket */ -+ unsigned long last; /* last event */ -+}; -+ -+/* Return true if rate limit permits. */ -+int vz_ratelimit(struct vz_rate_info *p); -+ -+#endif /* __VZ_RATELIMIT_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/vzstat.h linux-2.6.8.1-ve022stab078/include/linux/vzstat.h ---- linux-2.6.8.1.orig/include/linux/vzstat.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/linux/vzstat.h 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,176 @@ -+/* -+ * include/linux/vzstat.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __VZSTAT_H__ -+#define __VZSTAT_H__ -+ -+struct swap_cache_info_struct { -+ unsigned long add_total; -+ unsigned long del_total; -+ unsigned long find_success; -+ unsigned long find_total; -+ unsigned long noent_race; -+ unsigned long exist_race; -+ unsigned long remove_race; -+}; -+ -+struct kstat_lat_snap_struct { -+ cycles_t maxlat, totlat; -+ unsigned long count; -+}; -+struct kstat_lat_pcpu_snap_struct { -+ cycles_t maxlat, totlat; -+ unsigned long count; -+ seqcount_t lock; -+} ____cacheline_maxaligned_in_smp; -+ -+struct kstat_lat_struct { -+ struct kstat_lat_snap_struct cur, last; -+ cycles_t avg[3]; -+}; -+struct kstat_lat_pcpu_struct { -+ struct kstat_lat_pcpu_snap_struct cur[NR_CPUS]; -+ cycles_t max_snap; -+ struct kstat_lat_snap_struct last; -+ cycles_t avg[3]; -+}; -+ -+struct kstat_perf_snap_struct { -+ cycles_t wall_tottime, cpu_tottime; -+ cycles_t wall_maxdur, cpu_maxdur; -+ unsigned long count; -+}; -+struct kstat_perf_struct { -+ struct kstat_perf_snap_struct cur, last; -+}; -+ -+struct kstat_zone_avg { -+ unsigned long free_pages_avg[3], -+ nr_active_avg[3], -+ nr_inactive_avg[3]; -+}; -+ -+#define KSTAT_ALLOCSTAT_NR 5 -+ -+struct kernel_stat_glob { -+ unsigned long nr_unint_avg[3]; -+ -+ unsigned long alloc_fails[KSTAT_ALLOCSTAT_NR]; -+ struct kstat_lat_struct alloc_lat[KSTAT_ALLOCSTAT_NR]; -+ struct kstat_lat_pcpu_struct sched_lat; -+ struct kstat_lat_struct swap_in; -+ -+ struct kstat_perf_struct ttfp, cache_reap, -+ refill_inact, shrink_icache, shrink_dcache; -+ -+ struct kstat_zone_avg zone_avg[3]; /* MAX_NR_ZONES */ -+} ____cacheline_aligned; -+ -+extern struct kernel_stat_glob kstat_glob ____cacheline_aligned; -+extern spinlock_t kstat_glb_lock; -+ -+#define KSTAT_PERF_ENTER(name) \ -+ unsigned long flags; \ -+ cycles_t start, sleep_time; \ -+ \ -+ start = get_cycles(); \ -+ sleep_time = VE_TASK_INFO(current)->sleep_time; \ -+ -+#define KSTAT_PERF_LEAVE(name) \ -+ spin_lock_irqsave(&kstat_glb_lock, flags); \ -+ 
kstat_glob.name.cur.count++; \ -+ start = get_cycles() - start; \ -+ if (kstat_glob.name.cur.wall_maxdur < start) \ -+ kstat_glob.name.cur.wall_maxdur = start;\ -+ kstat_glob.name.cur.wall_tottime += start; \ -+ start -= VE_TASK_INFO(current)->sleep_time - \ -+ sleep_time; \ -+ if (kstat_glob.name.cur.cpu_maxdur < start) \ -+ kstat_glob.name.cur.cpu_maxdur = start; \ -+ kstat_glob.name.cur.cpu_tottime += start; \ -+ spin_unlock_irqrestore(&kstat_glb_lock, flags); \ -+ -+/* -+ * Add another statistics reading. -+ * Serialization is the caller's due. -+ */ -+static inline void KSTAT_LAT_ADD(struct kstat_lat_struct *p, -+ cycles_t dur) -+{ -+ p->cur.count++; -+ if (p->cur.maxlat < dur) -+ p->cur.maxlat = dur; -+ p->cur.totlat += dur; -+} -+ -+static inline void KSTAT_LAT_PCPU_ADD(struct kstat_lat_pcpu_struct *p, int cpu, -+ cycles_t dur) -+{ -+ struct kstat_lat_pcpu_snap_struct *cur; -+ -+ cur = &p->cur[cpu]; -+ write_seqcount_begin(&cur->lock); -+ cur->count++; -+ if (cur->maxlat < dur) -+ cur->maxlat = dur; -+ cur->totlat += dur; -+ write_seqcount_end(&cur->lock); -+} -+ -+/* -+ * Move current statistics to last, clear last. -+ * Serialization is the caller's due. 
-+ */ -+static inline void KSTAT_LAT_UPDATE(struct kstat_lat_struct *p) -+{ -+ cycles_t m; -+ memcpy(&p->last, &p->cur, sizeof(p->last)); -+ p->cur.maxlat = 0; -+ m = p->last.maxlat; -+ CALC_LOAD(p->avg[0], EXP_1, m) -+ CALC_LOAD(p->avg[1], EXP_5, m) -+ CALC_LOAD(p->avg[2], EXP_15, m) -+} -+ -+static inline void KSTAT_LAT_PCPU_UPDATE(struct kstat_lat_pcpu_struct *p) -+{ -+ unsigned i, cpu; -+ struct kstat_lat_pcpu_snap_struct snap, *cur; -+ cycles_t m; -+ -+ memset(&p->last, 0, sizeof(p->last)); -+ for (cpu = 0; cpu < NR_CPUS; cpu++) { -+ cur = &p->cur[cpu]; -+ do { -+ i = read_seqcount_begin(&cur->lock); -+ memcpy(&snap, cur, sizeof(snap)); -+ } while (read_seqcount_retry(&cur->lock, i)); -+ /* -+ * read above and this update of maxlat is not atomic, -+ * but this is OK, since it happens rarely and losing -+ * a couple of peaks is not essential. xemul -+ */ -+ cur->maxlat = 0; -+ -+ p->last.count += snap.count; -+ p->last.totlat += snap.totlat; -+ if (p->last.maxlat < snap.maxlat) -+ p->last.maxlat = snap.maxlat; -+ } -+ -+ m = (p->last.maxlat > p->max_snap ? p->last.maxlat : p->max_snap); -+ CALC_LOAD(p->avg[0], EXP_1, m); -+ CALC_LOAD(p->avg[1], EXP_5, m); -+ CALC_LOAD(p->avg[2], EXP_15, m); -+ /* reset max_snap to calculate it correctly next time */ -+ p->max_snap = 0; -+} -+ -+#endif /* __VZSTAT_H__ */ -diff -uprN linux-2.6.8.1.orig/include/linux/zlib.h linux-2.6.8.1-ve022stab078/include/linux/zlib.h ---- linux-2.6.8.1.orig/include/linux/zlib.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/linux/zlib.h 2006-05-11 13:05:34.000000000 +0400 -@@ -506,6 +506,11 @@ extern int zlib_deflateReset (z_streamp - stream state was inconsistent (such as zalloc or state being NULL). 
- */ - -+static inline unsigned long deflateBound(unsigned long s) -+{ -+ return s + ((s + 7) >> 3) + ((s + 63) >> 6) + 11; -+} -+ - extern int zlib_deflateParams (z_streamp strm, int level, int strategy); - /* - Dynamically update the compression level and compression strategy. The -diff -uprN linux-2.6.8.1.orig/include/net/af_unix.h linux-2.6.8.1-ve022stab078/include/net/af_unix.h ---- linux-2.6.8.1.orig/include/net/af_unix.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/af_unix.h 2006-05-11 13:05:40.000000000 +0400 -@@ -13,23 +13,37 @@ extern atomic_t unix_tot_inflight; - - static inline struct sock *first_unix_socket(int *i) - { -+ struct sock *s; -+ struct ve_struct *ve; -+ -+ ve = get_exec_env(); - for (*i = 0; *i <= UNIX_HASH_SIZE; (*i)++) { -- if (!hlist_empty(&unix_socket_table[*i])) -- return __sk_head(&unix_socket_table[*i]); -+ for (s = sk_head(&unix_socket_table[*i]); -+ s != NULL && !ve_accessible(VE_OWNER_SK(s), ve); -+ s = sk_next(s)); -+ if (s != NULL) -+ return s; - } - return NULL; - } - - static inline struct sock *next_unix_socket(int *i, struct sock *s) - { -- struct sock *next = sk_next(s); -- /* More in this chain? */ -- if (next) -- return next; -+ struct ve_struct *ve; -+ -+ ve = get_exec_env(); -+ for (s = sk_next(s); s != NULL; s = sk_next(s)) { -+ if (!ve_accessible(VE_OWNER_SK(s), ve)) -+ continue; -+ return s; -+ } - /* Look for next non-empty chain. 
*/ - for ((*i)++; *i <= UNIX_HASH_SIZE; (*i)++) { -- if (!hlist_empty(&unix_socket_table[*i])) -- return __sk_head(&unix_socket_table[*i]); -+ for (s = sk_head(&unix_socket_table[*i]); -+ s != NULL && !ve_accessible(VE_OWNER_SK(s), ve); -+ s = sk_next(s)); -+ if (s != NULL) -+ return s; - } - return NULL; - } -diff -uprN linux-2.6.8.1.orig/include/net/compat.h linux-2.6.8.1-ve022stab078/include/net/compat.h ---- linux-2.6.8.1.orig/include/net/compat.h 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/compat.h 2006-05-11 13:05:49.000000000 +0400 -@@ -23,6 +23,12 @@ struct compat_cmsghdr { - compat_int_t cmsg_type; - }; - -+#if defined(CONFIG_X86_64) -+#define is_current_32bits() (current_thread_info()->flags & _TIF_IA32) -+#else -+#define is_current_32bits() 0 -+#endif -+ - #else /* defined(CONFIG_COMPAT) */ - #define compat_msghdr msghdr /* to avoid compiler warnings */ - #endif /* defined(CONFIG_COMPAT) */ -@@ -33,7 +39,8 @@ extern asmlinkage long compat_sys_sendms - extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned); - extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *); - extern int put_cmsg_compat(struct msghdr*, int, int, int, void *); --extern int cmsghdr_from_user_compat_to_kern(struct msghdr *, unsigned char *, -- int); -+ -+struct sock; -+extern int cmsghdr_from_user_compat_to_kern(struct msghdr *, struct sock *, unsigned char *, int); - - #endif /* NET_COMPAT_H */ -diff -uprN linux-2.6.8.1.orig/include/net/flow.h linux-2.6.8.1-ve022stab078/include/net/flow.h ---- linux-2.6.8.1.orig/include/net/flow.h 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/flow.h 2006-05-11 13:05:40.000000000 +0400 -@@ -10,6 +10,7 @@ - #include <linux/in6.h> - #include <asm/atomic.h> - -+struct ve_struct; - struct flowi { - int oif; - int iif; -@@ -77,6 +78,9 @@ struct flowi { - #define fl_icmp_type uli_u.icmpt.type - #define fl_icmp_code 
uli_u.icmpt.code - #define fl_ipsec_spi uli_u.spi -+#ifdef CONFIG_VE -+ struct ve_struct *owner_env; -+#endif - } __attribute__((__aligned__(BITS_PER_LONG/8))); - - #define FLOW_DIR_IN 0 -diff -uprN linux-2.6.8.1.orig/include/net/icmp.h linux-2.6.8.1-ve022stab078/include/net/icmp.h ---- linux-2.6.8.1.orig/include/net/icmp.h 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/icmp.h 2006-05-11 13:05:40.000000000 +0400 -@@ -34,9 +34,14 @@ struct icmp_err { - - extern struct icmp_err icmp_err_convert[]; - DECLARE_SNMP_STAT(struct icmp_mib, icmp_statistics); --#define ICMP_INC_STATS(field) SNMP_INC_STATS(icmp_statistics, field) --#define ICMP_INC_STATS_BH(field) SNMP_INC_STATS_BH(icmp_statistics, field) --#define ICMP_INC_STATS_USER(field) SNMP_INC_STATS_USER(icmp_statistics, field) -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_icmp_statistics (get_exec_env()->_icmp_statistics) -+#else -+#define ve_icmp_statistics icmp_statistics -+#endif -+#define ICMP_INC_STATS(field) SNMP_INC_STATS(ve_icmp_statistics, field) -+#define ICMP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ve_icmp_statistics, field) -+#define ICMP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ve_icmp_statistics, field) - - extern void icmp_send(struct sk_buff *skb_in, int type, int code, u32 info); - extern int icmp_rcv(struct sk_buff *skb); -diff -uprN linux-2.6.8.1.orig/include/net/ip.h linux-2.6.8.1-ve022stab078/include/net/ip.h ---- linux-2.6.8.1.orig/include/net/ip.h 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/ip.h 2006-05-11 13:05:40.000000000 +0400 -@@ -151,15 +151,25 @@ struct ipv4_config - - extern struct ipv4_config ipv4_config; - DECLARE_SNMP_STAT(struct ipstats_mib, ip_statistics); --#define IP_INC_STATS(field) SNMP_INC_STATS(ip_statistics, field) --#define IP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ip_statistics, field) --#define IP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ip_statistics, field) -+#if 
defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_ip_statistics (get_exec_env()->_ip_statistics) -+#else -+#define ve_ip_statistics ip_statistics -+#endif -+#define IP_INC_STATS(field) SNMP_INC_STATS(ve_ip_statistics, field) -+#define IP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ve_ip_statistics, field) -+#define IP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ve_ip_statistics, field) - DECLARE_SNMP_STAT(struct linux_mib, net_statistics); --#define NET_INC_STATS(field) SNMP_INC_STATS(net_statistics, field) --#define NET_INC_STATS_BH(field) SNMP_INC_STATS_BH(net_statistics, field) --#define NET_INC_STATS_USER(field) SNMP_INC_STATS_USER(net_statistics, field) --#define NET_ADD_STATS_BH(field, adnd) SNMP_ADD_STATS_BH(net_statistics, field, adnd) --#define NET_ADD_STATS_USER(field, adnd) SNMP_ADD_STATS_USER(net_statistics, field, adnd) -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_net_statistics (get_exec_env()->_net_statistics) -+#else -+#define ve_net_statistics net_statistics -+#endif -+#define NET_INC_STATS(field) SNMP_INC_STATS(ve_net_statistics, field) -+#define NET_INC_STATS_BH(field) SNMP_INC_STATS_BH(ve_net_statistics, field) -+#define NET_INC_STATS_USER(field) SNMP_INC_STATS_USER(ve_net_statistics, field) -+#define NET_ADD_STATS_BH(field, adnd) SNMP_ADD_STATS_BH(ve_net_statistics, field, adnd) -+#define NET_ADD_STATS_USER(field, adnd) SNMP_ADD_STATS_USER(ve_net_statistics, field, adnd) - - extern int sysctl_local_port_range[2]; - extern int sysctl_ip_default_ttl; -@@ -253,8 +263,21 @@ extern int ip_call_ra_chain(struct sk_bu - /* - * Functions provided by ip_fragment.o - */ -- --struct sk_buff *ip_defrag(struct sk_buff *skb); -+ -+enum ip_defrag_users -+{ -+ IP_DEFRAG_LOCAL_DELIVER, -+ IP_DEFRAG_CALL_RA_CHAIN, -+ IP_DEFRAG_CONNTRACK_IN, -+ IP_DEFRAG_CONNTRACK_OUT, -+ IP_DEFRAG_NAT_OUT, -+ IP_DEFRAG_FW_COMPAT, -+ IP_DEFRAG_VS_IN, -+ IP_DEFRAG_VS_OUT, -+ IP_DEFRAG_VS_FWD -+}; -+ -+struct sk_buff 
*ip_defrag(struct sk_buff *skb, u32 user); - extern int ip_frag_nqueues; - extern atomic_t ip_frag_mem; - -diff -uprN linux-2.6.8.1.orig/include/net/ip_fib.h linux-2.6.8.1-ve022stab078/include/net/ip_fib.h ---- linux-2.6.8.1.orig/include/net/ip_fib.h 2004-08-14 14:56:15.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/ip_fib.h 2006-05-11 13:05:40.000000000 +0400 -@@ -139,10 +139,22 @@ struct fib_table - unsigned char tb_data[0]; - }; - -+struct fn_zone; -+struct fn_hash -+{ -+ struct fn_zone *fn_zones[33]; -+ struct fn_zone *fn_zone_list; -+}; -+ - #ifndef CONFIG_IP_MULTIPLE_TABLES - -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ip_fib_local_table get_exec_env()->_local_table -+#define ip_fib_main_table get_exec_env()->_main_table -+#else - extern struct fib_table *ip_fib_local_table; - extern struct fib_table *ip_fib_main_table; -+#endif - - static inline struct fib_table *fib_get_table(int id) - { -@@ -174,7 +186,12 @@ static inline void fib_select_default(co - #define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL]) - #define ip_fib_main_table (fib_tables[RT_TABLE_MAIN]) - -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define fib_tables get_exec_env()->_fib_tables -+#else - extern struct fib_table * fib_tables[RT_TABLE_MAX+1]; -+#endif -+ - extern int fib_lookup(const struct flowi *flp, struct fib_result *res); - extern struct fib_table *__fib_new_table(int id); - extern void fib_rule_put(struct fib_rule *r); -@@ -231,10 +248,19 @@ extern u32 __fib_res_prefsrc(struct fib - - /* Exported by fib_hash.c */ - extern struct fib_table *fib_hash_init(int id); -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+struct ve_struct; -+extern int init_ve_route(struct ve_struct *ve); -+extern void fini_ve_route(struct ve_struct *ve); -+#else -+#define init_ve_route(ve) (0) -+#define fini_ve_route(ve) do { } while (0) -+#endif - - #ifdef CONFIG_IP_MULTIPLE_TABLES - /* Exported by fib_rules.c */ 
-- -+extern int fib_rules_create(void); -+extern void fib_rules_destroy(void); - extern int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg); - extern int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg); - extern int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb); -diff -uprN linux-2.6.8.1.orig/include/net/scm.h linux-2.6.8.1-ve022stab078/include/net/scm.h ---- linux-2.6.8.1.orig/include/net/scm.h 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/scm.h 2006-05-11 13:05:40.000000000 +0400 -@@ -40,7 +40,7 @@ static __inline__ int scm_send(struct so - memset(scm, 0, sizeof(*scm)); - scm->creds.uid = current->uid; - scm->creds.gid = current->gid; -- scm->creds.pid = current->tgid; -+ scm->creds.pid = virt_tgid(current); - if (msg->msg_controllen <= 0) - return 0; - return __scm_send(sock, msg, scm); -diff -uprN linux-2.6.8.1.orig/include/net/sock.h linux-2.6.8.1-ve022stab078/include/net/sock.h ---- linux-2.6.8.1.orig/include/net/sock.h 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/sock.h 2006-05-11 13:05:40.000000000 +0400 -@@ -55,6 +55,8 @@ - #include <net/dst.h> - #include <net/checksum.h> - -+#include <ub/ub_net.h> -+ - /* - * This structure really needs to be cleaned up. 
- * Most of it is for TCP, and not used by any of -@@ -266,8 +268,12 @@ struct sock { - int (*sk_backlog_rcv)(struct sock *sk, - struct sk_buff *skb); - void (*sk_destruct)(struct sock *sk); -+ struct sock_beancounter sk_bc; -+ struct ve_struct *sk_owner_env; - }; - -+DCL_VE_OWNER_PROTO(SK, SLAB, struct sock, sk_owner_env, , (noinline, regparm(1))) -+ - /* - * Hashed lists helper routines - */ -@@ -488,7 +494,8 @@ do { if (!(__sk)->sk_backlog.tail) { - }) - - extern int sk_stream_wait_connect(struct sock *sk, long *timeo_p); --extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p); -+extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p, -+ unsigned long amount); - extern void sk_stream_wait_close(struct sock *sk, long timeo_p); - extern int sk_stream_error(struct sock *sk, int flags, int err); - extern void sk_stream_kill_queues(struct sock *sk); -@@ -672,8 +679,11 @@ static inline void sk_stream_writequeue_ - - static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb) - { -- return (int)skb->truesize <= sk->sk_forward_alloc || -- sk_stream_mem_schedule(sk, skb->truesize, 1); -+ if ((int)skb->truesize > sk->sk_forward_alloc && -+ !sk_stream_mem_schedule(sk, skb->truesize, 1)) -+ /* The situation is bad according to mainstream. 
Den */ -+ return 0; -+ return ub_tcprcvbuf_charge(sk, skb) == 0; - } - - /* Used by processes to "lock" a socket state, so that -@@ -724,6 +734,11 @@ extern struct sk_buff *sock_alloc_send - unsigned long size, - int noblock, - int *errcode); -+extern struct sk_buff *sock_alloc_send_skb2(struct sock *sk, -+ unsigned long size, -+ unsigned long size2, -+ int noblock, -+ int *errcode); - extern struct sk_buff *sock_alloc_send_pskb(struct sock *sk, - unsigned long header_len, - unsigned long data_len, -@@ -1073,6 +1088,10 @@ static inline int sock_queue_rcv_skb(str - goto out; - } - -+ err = ub_sockrcvbuf_charge(sk, skb); -+ if (err < 0) -+ goto out; -+ - /* It would be deadlock, if sock_queue_rcv_skb is used - with socket lock! We assume that users of this - function are lock free. -diff -uprN linux-2.6.8.1.orig/include/net/tcp.h linux-2.6.8.1-ve022stab078/include/net/tcp.h ---- linux-2.6.8.1.orig/include/net/tcp.h 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/tcp.h 2006-05-11 13:05:45.000000000 +0400 -@@ -30,6 +30,7 @@ - #include <linux/slab.h> - #include <linux/cache.h> - #include <linux/percpu.h> -+#include <linux/ve_owner.h> - #include <net/checksum.h> - #include <net/sock.h> - #include <net/snmp.h> -@@ -39,6 +40,10 @@ - #endif - #include <linux/seq_file.h> - -+ -+#define TCP_PAGE(sk) (sk->sk_sndmsg_page) -+#define TCP_OFF(sk) (sk->sk_sndmsg_off) -+ - /* This is for all connections with a full identity, no wildcards. - * New scheme, half the table is for TIME_WAIT, the other half is - * for the rest. I'll experiment with dynamic table growth later. -@@ -83,12 +88,16 @@ struct tcp_ehash_bucket { - * ports are created in O(1) time? I thought so. 
;-) -DaveM - */ - struct tcp_bind_bucket { -+ struct ve_struct *owner_env; - unsigned short port; - signed short fastreuse; - struct hlist_node node; - struct hlist_head owners; - }; - -+DCL_VE_OWNER_PROTO(TB, GENERIC, struct tcp_bind_bucket, owner_env, -+ inline, (always_inline)); -+ - #define tb_for_each(tb, node, head) hlist_for_each_entry(tb, node, head, node) - - struct tcp_bind_hashbucket { -@@ -158,16 +167,17 @@ extern kmem_cache_t *tcp_sk_cachep; - - extern kmem_cache_t *tcp_bucket_cachep; - extern struct tcp_bind_bucket *tcp_bucket_create(struct tcp_bind_hashbucket *head, -- unsigned short snum); -+ unsigned short snum, -+ struct ve_struct *env); - extern void tcp_bucket_destroy(struct tcp_bind_bucket *tb); - extern void tcp_bucket_unlock(struct sock *sk); - extern int tcp_port_rover; - extern struct sock *tcp_v4_lookup_listener(u32 addr, unsigned short hnum, int dif); - - /* These are AF independent. */ --static __inline__ int tcp_bhashfn(__u16 lport) -+static __inline__ int tcp_bhashfn(__u16 lport, unsigned veid) - { -- return (lport & (tcp_bhash_size - 1)); -+ return ((lport + (veid ^ (veid >> 16))) & (tcp_bhash_size - 1)); - } - - extern void tcp_bind_hash(struct sock *sk, struct tcp_bind_bucket *tb, -@@ -217,13 +227,19 @@ struct tcp_tw_bucket { - unsigned long tw_ttd; - struct tcp_bind_bucket *tw_tb; - struct hlist_node tw_death_node; -+ spinlock_t tw_lock; - #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) - struct in6_addr tw_v6_daddr; - struct in6_addr tw_v6_rcv_saddr; - int tw_v6_ipv6only; - #endif -+ envid_t tw_owner_env; - }; - -+#define TW_VEID(tw) ((tw)->tw_owner_env) -+#define SET_TW_VEID(tw, veid) ((tw)->tw_owner_env) = (veid) -+ -+ - static __inline__ void tw_add_node(struct tcp_tw_bucket *tw, - struct hlist_head *list) - { -@@ -304,7 +320,11 @@ static inline int tcp_v6_ipv6only(const - # define tcp_v6_ipv6only(__sk) 0 - #endif - -+#define TW_WSCALE_MASK 0x0f -+#define TW_WSCALE_SPEC 0x10 -+ - extern kmem_cache_t 
*tcp_timewait_cachep; -+#include <ub/ub_net.h> - - static inline void tcp_tw_put(struct tcp_tw_bucket *tw) - { -@@ -340,28 +360,38 @@ extern void tcp_tw_deschedule(struct tcp - #define TCP_V4_ADDR_COOKIE(__name, __saddr, __daddr) \ - __u64 __name = (((__u64)(__daddr))<<32)|((__u64)(__saddr)); - #endif /* __BIG_ENDIAN */ --#define TCP_IPV4_MATCH(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ -+#define TCP_IPV4_MATCH_ALLVE(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ - (((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \ - ((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports)) && \ - (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) --#define TCP_IPV4_TW_MATCH(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ -+#define TCP_IPV4_TW_MATCH_ALLVE(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ - (((*((__u64 *)&(tcptw_sk(__sk)->tw_daddr))) == (__cookie)) && \ - ((*((__u32 *)&(tcptw_sk(__sk)->tw_dport))) == (__ports)) && \ - (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - #else /* 32-bit arch */ - #define TCP_V4_ADDR_COOKIE(__name, __saddr, __daddr) --#define TCP_IPV4_MATCH(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ -+#define TCP_IPV4_MATCH_ALLVE(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ - ((inet_sk(__sk)->daddr == (__saddr)) && \ - (inet_sk(__sk)->rcv_saddr == (__daddr)) && \ - ((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports)) && \ - (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) --#define TCP_IPV4_TW_MATCH(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ -+#define TCP_IPV4_TW_MATCH_ALLVE(__sk, __cookie, __saddr, __daddr, __ports, __dif)\ - ((tcptw_sk(__sk)->tw_daddr == (__saddr)) && \ - (tcptw_sk(__sk)->tw_rcv_saddr == (__daddr)) && \ - ((*((__u32 *)&(tcptw_sk(__sk)->tw_dport))) == (__ports)) && \ - (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - #endif /* 64-bit arch */ - -+#define TCP_IPV4_MATCH(__sk, __cookie, __saddr, __daddr, __ports, 
__dif, __ve)\ -+ (TCP_IPV4_MATCH_ALLVE((__sk), (__cookie), (__saddr), (__daddr), \ -+ (__ports), (__dif)) \ -+ && ve_accessible_strict(VE_OWNER_SK((__sk)), (__ve))) -+ -+#define TCP_IPV4_TW_MATCH(__sk, __cookie, __saddr, __daddr, __ports, __dif, __ve)\ -+ (TCP_IPV4_TW_MATCH_ALLVE((__sk), (__cookie), (__saddr), (__daddr), \ -+ (__ports), (__dif)) \ -+ && ve_accessible_strict(TW_VEID(tcptw_sk(__sk)), VEID(__ve))) -+ - #define TCP_IPV6_MATCH(__sk, __saddr, __daddr, __ports, __dif) \ - (((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports)) && \ - ((__sk)->sk_family == AF_INET6) && \ -@@ -370,16 +400,16 @@ extern void tcp_tw_deschedule(struct tcp - (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - - /* These can have wildcards, don't try too hard. */ --static __inline__ int tcp_lhashfn(unsigned short num) -+static __inline__ int tcp_lhashfn(unsigned short num, unsigned veid) - { -- return num & (TCP_LHTABLE_SIZE - 1); -+ return ((num + (veid ^ (veid >> 16))) & (TCP_LHTABLE_SIZE - 1)); - } - - static __inline__ int tcp_sk_listen_hashfn(struct sock *sk) - { -- return tcp_lhashfn(inet_sk(sk)->num); -+ return tcp_lhashfn(inet_sk(sk)->num, VEID(VE_OWNER_SK(sk))); - } -- -+ - #define MAX_TCP_HEADER (128 + MAX_HEADER) - - /* -@@ -598,7 +628,9 @@ extern int sysctl_tcp_mem[3]; - extern int sysctl_tcp_wmem[3]; - extern int sysctl_tcp_rmem[3]; - extern int sysctl_tcp_app_win; -+#ifndef sysctl_tcp_adv_win_scale - extern int sysctl_tcp_adv_win_scale; -+#endif - extern int sysctl_tcp_tw_reuse; - extern int sysctl_tcp_frto; - extern int sysctl_tcp_low_latency; -@@ -613,6 +645,7 @@ extern int sysctl_tcp_bic_fast_convergen - extern int sysctl_tcp_bic_low_window; - extern int sysctl_tcp_default_win_scale; - extern int sysctl_tcp_moderate_rcvbuf; -+extern int sysctl_tcp_use_sg; - - extern atomic_t tcp_memory_allocated; - extern atomic_t tcp_sockets_allocated; -@@ -765,12 +798,17 @@ static inline int between(__u32 seq1, __ - extern struct proto tcp_prot; - - 
DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); --#define TCP_INC_STATS(field) SNMP_INC_STATS(tcp_statistics, field) --#define TCP_INC_STATS_BH(field) SNMP_INC_STATS_BH(tcp_statistics, field) --#define TCP_INC_STATS_USER(field) SNMP_INC_STATS_USER(tcp_statistics, field) --#define TCP_DEC_STATS(field) SNMP_DEC_STATS(tcp_statistics, field) --#define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) --#define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_tcp_statistics (get_exec_env()->_tcp_statistics) -+#else -+#define ve_tcp_statistics tcp_statistics -+#endif -+#define TCP_INC_STATS(field) SNMP_INC_STATS(ve_tcp_statistics, field) -+#define TCP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ve_tcp_statistics, field) -+#define TCP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ve_tcp_statistics, field) -+#define TCP_DEC_STATS(field) SNMP_DEC_STATS(ve_tcp_statistics, field) -+#define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(ve_tcp_statistics, field, val) -+#define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(ve_tcp_statistics, field, val) - - extern void tcp_put_port(struct sock *sk); - extern void tcp_inherit_port(struct sock *sk, struct sock *child); -@@ -837,9 +875,9 @@ static __inline__ void tcp_delack_init(s - memset(&tp->ack, 0, sizeof(tp->ack)); - } - --static inline void tcp_clear_options(struct tcp_opt *tp) -+static inline void tcp_clear_options(struct tcp_options_received *rx_opt) - { -- tp->tstamp_ok = tp->sack_ok = tp->wscale_ok = tp->snd_wscale = 0; -+ rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0; - } - - enum tcp_tw_status -@@ -888,7 +926,7 @@ extern int tcp_recvmsg(struct kiocb *i - extern int tcp_listen_start(struct sock *sk); - - extern void tcp_parse_options(struct sk_buff *skb, -- struct tcp_opt *tp, -+ struct tcp_options_received *opt_rx, - int estab); - - /* -@@ -1062,9 
+1100,9 @@ static __inline__ unsigned int tcp_curre - tp->ext2_header_len != dst->header_len) - mss_now = tcp_sync_mss(sk, mtu); - } -- if (tp->eff_sacks) -+ if (tp->rx_opt.eff_sacks) - mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + -- (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); -+ (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK)); - return mss_now; - } - -@@ -1097,7 +1135,7 @@ static __inline__ void __tcp_fast_path_o - - static __inline__ void tcp_fast_path_on(struct tcp_opt *tp) - { -- __tcp_fast_path_on(tp, tp->snd_wnd>>tp->snd_wscale); -+ __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale); - } - - static inline void tcp_fast_path_check(struct sock *sk, struct tcp_opt *tp) -@@ -1134,7 +1172,7 @@ extern u32 __tcp_select_window(struct so - * only use of the low 32-bits of jiffies and hide the ugly - * casts with the following macro. - */ --#define tcp_time_stamp ((__u32)(jiffies)) -+#define tcp_time_stamp ((__u32)(jiffies + get_exec_env()->jiffies_fixup)) - - /* This is what the send packet queueing engine uses to pass - * TCP per-packet control information to the transmission -@@ -1305,7 +1343,8 @@ static inline __u32 tcp_current_ssthresh - - static inline void tcp_sync_left_out(struct tcp_opt *tp) - { -- if (tp->sack_ok && tp->sacked_out >= tp->packets_out - tp->lost_out) -+ if (tp->rx_opt.sack_ok && -+ tp->sacked_out >= tp->packets_out - tp->lost_out) - tp->sacked_out = tp->packets_out - tp->lost_out; - tp->left_out = tp->sacked_out + tp->lost_out; - } -@@ -1615,39 +1654,39 @@ static __inline__ void tcp_done(struct s - tcp_destroy_sock(sk); - } - --static __inline__ void tcp_sack_reset(struct tcp_opt *tp) -+static __inline__ void tcp_sack_reset(struct tcp_options_received *rx_opt) - { -- tp->dsack = 0; -- tp->eff_sacks = 0; -- tp->num_sacks = 0; -+ rx_opt->dsack = 0; -+ rx_opt->eff_sacks = 0; -+ rx_opt->num_sacks = 0; - } - - static __inline__ void tcp_build_and_update_options(__u32 *ptr, struct tcp_opt *tp, __u32 tstamp) - { -- if (tp->tstamp_ok) { -+ if 
(tp->rx_opt.tstamp_ok) { - *ptr++ = __constant_htonl((TCPOPT_NOP << 24) | - (TCPOPT_NOP << 16) | - (TCPOPT_TIMESTAMP << 8) | - TCPOLEN_TIMESTAMP); - *ptr++ = htonl(tstamp); -- *ptr++ = htonl(tp->ts_recent); -+ *ptr++ = htonl(tp->rx_opt.ts_recent); - } -- if (tp->eff_sacks) { -- struct tcp_sack_block *sp = tp->dsack ? tp->duplicate_sack : tp->selective_acks; -+ if (tp->rx_opt.eff_sacks) { -+ struct tcp_sack_block *sp = tp->rx_opt.dsack ? tp->duplicate_sack : tp->selective_acks; - int this_sack; - - *ptr++ = __constant_htonl((TCPOPT_NOP << 24) | - (TCPOPT_NOP << 16) | - (TCPOPT_SACK << 8) | - (TCPOLEN_SACK_BASE + -- (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK))); -- for(this_sack = 0; this_sack < tp->eff_sacks; this_sack++) { -+ (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK))); -+ for(this_sack = 0; this_sack < tp->rx_opt.eff_sacks; this_sack++) { - *ptr++ = htonl(sp[this_sack].start_seq); - *ptr++ = htonl(sp[this_sack].end_seq); - } -- if (tp->dsack) { -- tp->dsack = 0; -- tp->eff_sacks--; -+ if (tp->rx_opt.dsack) { -+ tp->rx_opt.dsack = 0; -+ tp->rx_opt.eff_sacks--; - } - } - } -@@ -1851,17 +1890,17 @@ static inline void tcp_synq_drop(struct - } - - static __inline__ void tcp_openreq_init(struct open_request *req, -- struct tcp_opt *tp, -+ struct tcp_options_received *rx_opt, - struct sk_buff *skb) - { - req->rcv_wnd = 0; /* So that tcp_send_synack() knows! */ - req->rcv_isn = TCP_SKB_CB(skb)->seq; -- req->mss = tp->mss_clamp; -- req->ts_recent = tp->saw_tstamp ? tp->rcv_tsval : 0; -- req->tstamp_ok = tp->tstamp_ok; -- req->sack_ok = tp->sack_ok; -- req->snd_wscale = tp->snd_wscale; -- req->wscale_ok = tp->wscale_ok; -+ req->mss = rx_opt->mss_clamp; -+ req->ts_recent = rx_opt->saw_tstamp ? 
rx_opt->rcv_tsval : 0; -+ req->tstamp_ok = rx_opt->tstamp_ok; -+ req->sack_ok = rx_opt->sack_ok; -+ req->snd_wscale = rx_opt->snd_wscale; -+ req->wscale_ok = rx_opt->wscale_ok; - req->acked = 0; - req->ecn_ok = 0; - req->rmt_port = skb->h.th->source; -@@ -1910,11 +1949,11 @@ static inline int tcp_fin_time(struct tc - return fin_timeout; - } - --static inline int tcp_paws_check(struct tcp_opt *tp, int rst) -+static inline int tcp_paws_check(struct tcp_options_received *rx_opt, int rst) - { -- if ((s32)(tp->rcv_tsval - tp->ts_recent) >= 0) -+ if ((s32)(rx_opt->rcv_tsval - rx_opt->ts_recent) >= 0) - return 0; -- if (xtime.tv_sec >= tp->ts_recent_stamp + TCP_PAWS_24DAYS) -+ if (xtime.tv_sec >= rx_opt->ts_recent_stamp + TCP_PAWS_24DAYS) - return 0; - - /* RST segments are not recommended to carry timestamp, -@@ -1929,7 +1968,7 @@ static inline int tcp_paws_check(struct - - However, we can relax time bounds for RST segments to MSL. - */ -- if (rst && xtime.tv_sec >= tp->ts_recent_stamp + TCP_PAWS_MSL) -+ if (rst && xtime.tv_sec >= rx_opt->ts_recent_stamp + TCP_PAWS_MSL) - return 0; - return 1; - } -@@ -1941,6 +1980,8 @@ static inline void tcp_v4_setup_caps(str - if (sk->sk_no_largesend || dst->header_len) - sk->sk_route_caps &= ~NETIF_F_TSO; - } -+ if (!sysctl_tcp_use_sg) -+ sk->sk_route_caps &= ~NETIF_F_SG; - } - - #define TCP_CHECK_TIMER(sk) do { } while (0) -diff -uprN linux-2.6.8.1.orig/include/net/udp.h linux-2.6.8.1-ve022stab078/include/net/udp.h ---- linux-2.6.8.1.orig/include/net/udp.h 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/include/net/udp.h 2006-05-11 13:05:40.000000000 +0400 -@@ -40,13 +40,19 @@ extern rwlock_t udp_hash_lock; - - extern int udp_port_rover; - --static inline int udp_lport_inuse(u16 num) -+static inline int udp_hashfn(u16 num, unsigned veid) -+{ -+ return ((num + (veid ^ (veid >> 16))) & (UDP_HTABLE_SIZE - 1)); -+} -+ -+static inline int udp_lport_inuse(u16 num, struct ve_struct *env) - { - struct sock *sk; - struct 
hlist_node *node; - -- sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)]) -- if (inet_sk(sk)->num == num) -+ sk_for_each(sk, node, &udp_hash[udp_hashfn(num, VEID(env))]) -+ if (inet_sk(sk)->num == num && -+ ve_accessible_strict(VE_OWNER_SK(sk), env)) - return 1; - return 0; - } -@@ -73,9 +79,14 @@ extern int udp_ioctl(struct sock *sk, in - extern int udp_disconnect(struct sock *sk, int flags); - - DECLARE_SNMP_STAT(struct udp_mib, udp_statistics); --#define UDP_INC_STATS(field) SNMP_INC_STATS(udp_statistics, field) --#define UDP_INC_STATS_BH(field) SNMP_INC_STATS_BH(udp_statistics, field) --#define UDP_INC_STATS_USER(field) SNMP_INC_STATS_USER(udp_statistics, field) -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_udp_statistics (get_exec_env()->_udp_statistics) -+#else -+#define ve_udp_statistics udp_statistics -+#endif -+#define UDP_INC_STATS(field) SNMP_INC_STATS(ve_udp_statistics, field) -+#define UDP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ve_udp_statistics, field) -+#define UDP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ve_udp_statistics, field) - - /* /proc */ - struct udp_seq_afinfo { -diff -uprN linux-2.6.8.1.orig/include/ub/beancounter.h linux-2.6.8.1-ve022stab078/include/ub/beancounter.h ---- linux-2.6.8.1.orig/include/ub/beancounter.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/beancounter.h 2006-05-11 13:05:48.000000000 +0400 -@@ -0,0 +1,321 @@ -+/* -+ * include/ub/beancounter.h -+ * -+ * Copyright (C) 1999-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * Andrey Savochkin saw@sw-soft.com -+ * -+ */ -+ -+#ifndef _LINUX_BEANCOUNTER_H -+#define _LINUX_BEANCOUNTER_H -+ -+#include <linux/config.h> -+ -+/* -+ * Generic ratelimiting stuff. 
-+ */ -+ -+struct ub_rate_info { -+ int burst; -+ int interval; /* jiffy_t per event */ -+ int bucket; /* kind of leaky bucket */ -+ unsigned long last; /* last event */ -+}; -+ -+/* Return true if rate limit permits. */ -+int ub_ratelimit(struct ub_rate_info *); -+ -+ -+/* -+ * This magic is used to distinuish user beancounter and pages beancounter -+ * in struct page. page_ub and page_bc are placed in union and MAGIC -+ * ensures us that we don't use pbc as ubc in ub_page_uncharge(). -+ */ -+#define UB_MAGIC 0x62756275 -+ -+/* -+ * Resource list. -+ */ -+ -+#define UB_KMEMSIZE 0 /* Unswappable kernel memory size including -+ * struct task, page directories, etc. -+ */ -+#define UB_LOCKEDPAGES 1 /* Mlock()ed pages. */ -+#define UB_PRIVVMPAGES 2 /* Total number of pages, counting potentially -+ * private pages as private and used. -+ */ -+#define UB_SHMPAGES 3 /* IPC SHM segment size. */ -+#define UB_ZSHMPAGES 4 /* Anonymous shared memory. */ -+#define UB_NUMPROC 5 /* Number of processes. */ -+#define UB_PHYSPAGES 6 /* All resident pages, for swapout guarantee. */ -+#define UB_VMGUARPAGES 7 /* Guarantee for memory allocation, -+ * checked against PRIVVMPAGES. -+ */ -+#define UB_OOMGUARPAGES 8 /* Guarantees against OOM kill. -+ * Only limit is used, no accounting. -+ */ -+#define UB_NUMTCPSOCK 9 /* Number of TCP sockets. */ -+#define UB_NUMFLOCK 10 /* Number of file locks. */ -+#define UB_NUMPTY 11 /* Number of PTYs. */ -+#define UB_NUMSIGINFO 12 /* Number of siginfos. */ -+#define UB_TCPSNDBUF 13 /* Total size of tcp send buffers. */ -+#define UB_TCPRCVBUF 14 /* Total size of tcp receive buffers. */ -+#define UB_OTHERSOCKBUF 15 /* Total size of other socket -+ * send buffers (all buffers for PF_UNIX). -+ */ -+#define UB_DGRAMRCVBUF 16 /* Total size of other socket -+ * receive buffers. -+ */ -+#define UB_NUMOTHERSOCK 17 /* Number of other sockets. */ -+#define UB_DCACHESIZE 18 /* Size of busy dentry/inode cache. */ -+#define UB_NUMFILE 19 /* Number of open files. 
*/ -+ -+#define UB_RESOURCES 24 -+ -+#define UB_UNUSEDPRIVVM (UB_RESOURCES + 0) -+#define UB_TMPFSPAGES (UB_RESOURCES + 1) -+#define UB_SWAPPAGES (UB_RESOURCES + 2) -+#define UB_HELDPAGES (UB_RESOURCES + 3) -+ -+struct ubparm { -+ /* -+ * A barrier over which resource allocations are failed gracefully. -+ * If the amount of consumed memory is over the barrier further sbrk() -+ * or mmap() calls fail, the existing processes are not killed. -+ */ -+ unsigned long barrier; -+ /* hard resource limit */ -+ unsigned long limit; -+ /* consumed resources */ -+ unsigned long held; -+ /* maximum amount of consumed resources through the last period */ -+ unsigned long maxheld; -+ /* minimum amount of consumed resources through the last period */ -+ unsigned long minheld; -+ /* count of failed charges */ -+ unsigned long failcnt; -+}; -+ -+/* -+ * Kernel internal part. -+ */ -+ -+#ifdef __KERNEL__ -+ -+#include <ub/ub_debug.h> -+#include <linux/interrupt.h> -+#include <asm/atomic.h> -+#include <linux/spinlock.h> -+#include <linux/cache.h> -+#include <linux/threads.h> -+ -+/* -+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form. 
-+ */ -+#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1) -+ -+ -+/* -+ * Resource management structures -+ * Serialization issues: -+ * beancounter list management is protected via ub_hash_lock -+ * task pointers are set only for current task and only once -+ * refcount is managed atomically -+ * value and limit comparison and change are protected by per-ub spinlock -+ */ -+ -+struct page_beancounter; -+struct task_beancounter; -+struct sock_beancounter; -+ -+struct page_private { -+ unsigned long ubp_unused_privvmpages; -+ unsigned long ubp_tmpfs_respages; -+ unsigned long ubp_swap_pages; -+ unsigned long long ubp_held_pages; -+}; -+ -+struct sock_private { -+ unsigned long ubp_rmem_thres; -+ unsigned long ubp_wmem_pressure; -+ unsigned long ubp_maxadvmss; -+ unsigned long ubp_rmem_pressure; -+#define UB_RMEM_EXPAND 0 -+#define UB_RMEM_KEEP 1 -+#define UB_RMEM_SHRINK 2 -+ struct list_head ubp_other_socks; -+ struct list_head ubp_tcp_socks; -+ atomic_t ubp_orphan_count; -+}; -+ -+struct ub_perfstat { -+ unsigned long unmap; -+ unsigned long swapin; -+} ____cacheline_aligned_in_smp; -+ -+struct user_beancounter -+{ -+ unsigned long ub_magic; -+ atomic_t ub_refcount; -+ struct user_beancounter *ub_next; -+ spinlock_t ub_lock; -+ uid_t ub_uid; -+ -+ struct ub_rate_info ub_limit_rl; -+ int ub_oom_noproc; -+ -+ struct page_private ppriv; -+#define ub_unused_privvmpages ppriv.ubp_unused_privvmpages -+#define ub_tmpfs_respages ppriv.ubp_tmpfs_respages -+#define ub_swap_pages ppriv.ubp_swap_pages -+#define ub_held_pages ppriv.ubp_held_pages -+ struct sock_private spriv; -+#define ub_rmem_thres spriv.ubp_rmem_thres -+#define ub_maxadvmss spriv.ubp_maxadvmss -+#define ub_rmem_pressure spriv.ubp_rmem_pressure -+#define ub_wmem_pressure spriv.ubp_wmem_pressure -+#define ub_tcp_sk_list spriv.ubp_tcp_socks -+#define ub_other_sk_list spriv.ubp_other_socks -+#define ub_orphan_count spriv.ubp_orphan_count -+ -+ struct user_beancounter *parent; -+ void *private_data; 
-+ unsigned long ub_aflags; -+ -+ /* resources statistic and settings */ -+ struct ubparm ub_parms[UB_RESOURCES]; -+ /* resources statistic for last interval */ -+ struct ubparm ub_store[UB_RESOURCES]; -+ -+ struct ub_perfstat ub_perfstat[NR_CPUS]; -+ -+#ifdef CONFIG_UBC_DEBUG_KMEM -+ struct list_head ub_cclist; -+ long ub_pages_charged[NR_CPUS]; -+ long ub_vmalloc_charged[NR_CPUS]; -+#endif -+}; -+ -+enum severity { UB_HARD, UB_SOFT, UB_FORCE }; -+ -+#define UB_AFLAG_NOTIF_PAGEIN 0 -+ -+static inline int ub_barrier_hit(struct user_beancounter *ub, int resource) -+{ -+ return ub->ub_parms[resource].held > ub->ub_parms[resource].barrier; -+} -+ -+static inline int ub_hfbarrier_hit(struct user_beancounter *ub, int resource) -+{ -+ return (ub->ub_parms[resource].held > -+ ((ub->ub_parms[resource].barrier) >> 1)); -+} -+ -+#ifndef CONFIG_USER_RESOURCE -+ -+extern inline struct user_beancounter *get_beancounter_byuid -+ (uid_t uid, int create) { return NULL; } -+extern inline struct user_beancounter *get_beancounter -+ (struct user_beancounter *ub) { return NULL; } -+extern inline void put_beancounter(struct user_beancounter *ub) {;} -+ -+static inline void page_ubc_init(void) { }; -+static inline void beancounter_init(unsigned long mempages) { }; -+static inline void ub0_init(void) { }; -+ -+#else /* CONFIG_USER_RESOURCE */ -+ -+/* -+ * Charge/uncharge operations -+ */ -+ -+extern int __charge_beancounter_locked(struct user_beancounter *ub, -+ int resource, unsigned long val, enum severity strict); -+ -+extern void __uncharge_beancounter_locked(struct user_beancounter *ub, -+ int resource, unsigned long val); -+ -+extern void __put_beancounter(struct user_beancounter *ub); -+ -+extern void uncharge_warn(struct user_beancounter *ub, int resource, -+ unsigned long val, unsigned long held); -+ -+extern const char *ub_rnames[]; -+/* -+ * Put a beancounter reference -+ */ -+ -+static inline void put_beancounter(struct user_beancounter *ub) -+{ -+ if (unlikely(ub == NULL)) 
-+ return; -+ -+ __put_beancounter(ub); -+} -+ -+/* -+ * Create a new beancounter reference -+ */ -+extern struct user_beancounter *get_beancounter_byuid(uid_t uid, int create); -+ -+static inline -+struct user_beancounter *get_beancounter(struct user_beancounter *ub) -+{ -+ if (unlikely(ub == NULL)) -+ return NULL; -+ -+ atomic_inc(&ub->ub_refcount); -+ return ub; -+} -+ -+extern struct user_beancounter *get_subbeancounter_byid( -+ struct user_beancounter *, -+ int id, int create); -+extern struct user_beancounter *subbeancounter_findcreate( -+ struct user_beancounter *p, int id); -+ -+extern void beancounter_init(unsigned long); -+extern void page_ubc_init(void); -+extern struct user_beancounter ub0; -+extern void ub0_init(void); -+#define get_ub0() (&ub0) -+ -+extern void print_ub_uid(struct user_beancounter *ub, char *buf, int size); -+ -+/* -+ * Resource charging -+ * Change user's account and compare against limits -+ */ -+ -+static inline void ub_adjust_maxheld(struct user_beancounter *ub, int resource) -+{ -+ if (ub->ub_parms[resource].maxheld < ub->ub_parms[resource].held) -+ ub->ub_parms[resource].maxheld = ub->ub_parms[resource].held; -+ if (ub->ub_parms[resource].minheld > ub->ub_parms[resource].held) -+ ub->ub_parms[resource].minheld = ub->ub_parms[resource].held; -+} -+ -+#endif /* CONFIG_USER_RESOURCE */ -+ -+#include <ub/ub_decl.h> -+UB_DECLARE_FUNC(int, charge_beancounter(struct user_beancounter *ub, -+ int resource, unsigned long val, enum severity strict)); -+UB_DECLARE_VOID_FUNC(uncharge_beancounter(struct user_beancounter *ub, -+ int resource, unsigned long val)); -+ -+UB_DECLARE_VOID_FUNC(charge_beancounter_notop(struct user_beancounter *ub, -+ int resource, unsigned long val)); -+UB_DECLARE_VOID_FUNC(uncharge_beancounter_notop(struct user_beancounter *ub, -+ int resource, unsigned long val)); -+ -+#ifndef CONFIG_USER_RESOURCE_PROC -+static inline void beancounter_proc_init(void) { }; -+#else -+extern void beancounter_proc_init(void); -+#endif 
-+#endif /* __KERNEL__ */ -+#endif /* _LINUX_BEANCOUNTER_H */ -diff -uprN linux-2.6.8.1.orig/include/ub/ub_dcache.h linux-2.6.8.1-ve022stab078/include/ub/ub_dcache.h ---- linux-2.6.8.1.orig/include/ub/ub_dcache.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_dcache.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,56 @@ -+/* -+ * include/ub/ub_dcache.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_DCACHE_H_ -+#define __UB_DCACHE_H_ -+ -+#include <ub/ub_decl.h> -+ -+/* -+ * UB_DCACHESIZE accounting -+ */ -+ -+struct dentry_beancounter -+{ -+ /* -+ * d_inuse = -+ * <number of external refs> + -+ * <number of 'used' childs> -+ * -+ * d_inuse == -1 means that dentry is unused -+ * state change -1 => 0 causes charge -+ * state change 0 => -1 causes uncharge -+ */ -+ atomic_t d_inuse; -+ /* charged size, including name length if name is not inline */ -+ unsigned long d_ubsize; -+ struct user_beancounter *d_ub; -+}; -+ -+extern unsigned int inode_memusage(void); -+extern unsigned int dentry_memusage(void); -+ -+struct dentry; -+ -+UB_DECLARE_FUNC(int, ub_dentry_alloc(struct dentry *d)) -+UB_DECLARE_VOID_FUNC(ub_dentry_free(struct dentry *d)) -+UB_DECLARE_VOID_FUNC(ub_dentry_charge_nofail(struct dentry *d)) -+UB_DECLARE_VOID_FUNC(ub_dentry_uncharge(struct dentry *d)) -+ -+#ifdef CONFIG_USER_RESOURCE -+UB_DECLARE_FUNC(int, ub_dentry_charge(struct dentry *d)) -+#else -+#define ub_dentry_charge(d) ({ \ -+ spin_unlock(&d->d_lock); \ -+ rcu_read_unlock(); \ -+ 0; \ -+ }) -+#endif -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_debug.h linux-2.6.8.1-ve022stab078/include/ub/ub_debug.h ---- linux-2.6.8.1.orig/include/ub/ub_debug.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_debug.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,95 @@ -+/* -+ * include/ub/ub_debug.h -+ * -+ * Copyright (C) 2005 
SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_DEBUG_H_ -+#define __UB_DEBUG_H_ -+ -+/* -+ * general debugging -+ */ -+ -+#define UBD_ALLOC 0x1 -+#define UBD_CHARGE 0x2 -+#define UBD_LIMIT 0x4 -+#define UBD_TRACE 0x8 -+ -+/* -+ * ub_net debugging -+ */ -+ -+#define UBD_NET_SOCKET 0x10 -+#define UBD_NET_SLEEP 0x20 -+#define UBD_NET_SEND 0x40 -+#define UBD_NET_RECV 0x80 -+ -+/* -+ * Main routines -+ */ -+ -+#define UB_DEBUG (0) -+#define DEBUG_RESOURCE (0ULL) -+ -+#define ub_dbg_cond(__cond, __str, args...) \ -+ do { \ -+ if ((__cond) != 0) \ -+ printk(__str, ##args); \ -+ } while(0) -+ -+#define ub_debug(__section, __str, args...) \ -+ ub_dbg_cond(UB_DEBUG & (__section), __str, ##args) -+ -+#define ub_debug_resource(__resource, __str, args...) \ -+ ub_dbg_cond((UB_DEBUG & UBD_CHARGE) && \ -+ (DEBUG_RESOURCE & (1 << (__resource))), \ -+ __str, ##args) -+ -+#if UB_DEBUG & UBD_TRACE -+#define ub_debug_trace(__cond, __b, __r) \ -+ do { \ -+ static struct ub_rate_info ri = { __b, __r }; \ -+ if ((__cond) != 0 && ub_ratelimit(&ri)) \ -+ dump_stack(); \ -+ } while(0) -+#else -+#define ub_debug_trace(__cond, __burst, __rate) -+#endif -+ -+#include <linux/config.h> -+ -+#ifdef CONFIG_UBC_DEBUG_KMEM -+#include <linux/list.h> -+#include <linux/kmem_cache.h> -+ -+struct user_beancounter; -+struct ub_cache_counter { -+ struct list_head ulist; -+ struct ub_cache_counter *next; -+ struct user_beancounter *ub; -+ kmem_cache_t *cachep; -+ unsigned long counter; -+}; -+ -+extern spinlock_t cc_lock; -+extern void init_cache_counters(void); -+extern void ub_free_counters(struct user_beancounter *); -+extern void ub_kmemcache_free(kmem_cache_t *cachep); -+ -+struct vm_struct; -+extern void inc_vmalloc_charged(struct vm_struct *, int); -+extern void dec_vmalloc_charged(struct vm_struct *); -+#else -+#define init_cache_counters() do { } while (0) -+#define inc_vmalloc_charged(vm, f) do { } while (0) -+#define 
dec_vmalloc_charged(vm) do { } while (0) -+#define ub_free_counters(ub) do { } while (0) -+#define ub_kmemcache_free(cachep) do { } while (0) -+#endif -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_decl.h linux-2.6.8.1-ve022stab078/include/ub/ub_decl.h ---- linux-2.6.8.1.orig/include/ub/ub_decl.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_decl.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,40 @@ -+/* -+ * include/ub/ub_decl.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_DECL_H_ -+#define __UB_DECL_H_ -+ -+#include <linux/config.h> -+ -+/* -+ * Naming convension: -+ * ub_<section|object>_<operation> -+ */ -+ -+#ifdef CONFIG_USER_RESOURCE -+ -+#define UB_DECLARE_FUNC(ret_type, decl) extern ret_type decl; -+#define UB_DECLARE_VOID_FUNC(decl) extern void decl; -+ -+#else /* CONFIG_USER_RESOURCE */ -+ -+#define UB_DECLARE_FUNC(ret_type, decl) \ -+ static inline ret_type decl \ -+ { \ -+ return (ret_type)0; \ -+ } -+#define UB_DECLARE_VOID_FUNC(decl) \ -+ static inline void decl \ -+ { \ -+ } -+ -+#endif /* CONFIG_USER_RESOURCE */ -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_hash.h linux-2.6.8.1-ve022stab078/include/ub/ub_hash.h ---- linux-2.6.8.1.orig/include/ub/ub_hash.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_hash.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,41 @@ -+/* -+ * include/ub/ub_hash.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef _LINUX_UBHASH_H -+#define _LINUX_UBHASH_H -+ -+#ifdef __KERNEL__ -+ -+#define UB_HASH_SIZE 256 -+ -+struct ub_hash_slot { -+ struct user_beancounter *ubh_beans; -+}; -+ -+extern struct ub_hash_slot ub_hash[]; -+extern spinlock_t ub_hash_lock; -+ -+#ifdef CONFIG_USER_RESOURCE -+ -+/* -+ * Iterate over beancounters -+ * @__slot - hash slot -+ * @__ubp - beancounter ptr -+ * Can use break :) -+ */ -+#define for_each_beancounter(__slot, __ubp) \ -+ for (__slot = 0, __ubp = NULL; \ -+ __slot < UB_HASH_SIZE && __ubp == NULL; __slot++) \ -+ for (__ubp = ub_hash[__slot].ubh_beans; __ubp; \ -+ __ubp = __ubp->ub_next) -+ -+#endif /* CONFIG_USER_RESOURCE */ -+#endif /* __KERNEL__ */ -+#endif /* _LINUX_UBHASH_H */ -diff -uprN linux-2.6.8.1.orig/include/ub/ub_mem.h linux-2.6.8.1-ve022stab078/include/ub/ub_mem.h ---- linux-2.6.8.1.orig/include/ub/ub_mem.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_mem.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,90 @@ -+/* -+ * include/ub/ub_mem.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_SLAB_H_ -+#define __UB_SLAB_H_ -+ -+#include <linux/config.h> -+#include <linux/kmem_slab.h> -+#include <linux/vmalloc.h> -+#include <linux/gfp.h> -+#include <asm/pgtable.h> -+#include <ub/beancounter.h> -+#include <ub/ub_decl.h> -+ -+/* -+ * UB_KMEMSIZE accounting -+ * oom_killer related -+ */ -+ -+/* -+ * Memory freeing statistics to make correct OOM decision -+ */ -+ -+struct oom_freeing_stat -+{ -+ unsigned long oom_generation; /* current OOM gen */ -+ unsigned long freed; -+ unsigned long swapped; /* page referrence counters removed */ -+ unsigned long written; /* IO started */ -+ unsigned long slabs; /* slabs shrinked */ -+}; -+ -+extern int oom_generation; -+extern int oom_kill_counter; -+extern spinlock_t oom_generation_lock; -+ -+#ifdef CONFIG_UBC_DEBUG_ITEMS -+#define CHARGE_ORDER(__o) (1 << __o) -+#define CHARGE_SIZE(__s) 1 -+#else -+#define CHARGE_ORDER(__o) (PAGE_SIZE << (__o)) -+#define CHARGE_SIZE(__s) (__s) -+#endif -+ -+#define page_ub(__page) ((__page)->bc.page_ub) -+ -+struct mm_struct; -+struct page; -+ -+UB_DECLARE_FUNC(struct user_beancounter *, slab_ub(void *obj)) -+UB_DECLARE_FUNC(struct user_beancounter *, vmalloc_ub(void *obj)) -+UB_DECLARE_FUNC(struct user_beancounter *, mem_ub(void *obj)) -+ -+UB_DECLARE_FUNC(int, ub_page_charge(struct page *page, int order, int mask)) -+UB_DECLARE_VOID_FUNC(ub_page_uncharge(struct page *page, int order)) -+ -+UB_DECLARE_VOID_FUNC(ub_clear_oom(void)) -+UB_DECLARE_VOID_FUNC(ub_oomkill_task(struct mm_struct *mm, -+ struct user_beancounter *ub, long overdraft)) -+UB_DECLARE_FUNC(int, ub_slab_charge(void *objp, int flags)) -+UB_DECLARE_VOID_FUNC(ub_slab_uncharge(void *obj)) -+ -+#ifdef CONFIG_USER_RESOURCE -+/* Flags without __GFP_UBC must comply with vmalloc */ -+#define ub_vmalloc(size) __vmalloc(size, \ -+ GFP_KERNEL | __GFP_HIGHMEM | __GFP_UBC, PAGE_KERNEL) -+#define ub_kmalloc(size, flags) kmalloc(size, ((flags) | __GFP_UBC)) -+extern struct user_beancounter 
*ub_select_worst(long *); -+#else -+#define ub_vmalloc(size) vmalloc(size) -+#define ub_kmalloc(size, flags) kmalloc(size, flags) -+static inline struct user_beancounter *ub_select_worst(long *over) -+{ -+ *over = 0; -+ return NULL; -+} -+#endif -+ -+#define slab_ubcs(cachep, slabp) ((struct user_beancounter **)\ -+ (ALIGN((unsigned long)(slab_bufctl(slabp) + (cachep)->num),\ -+ sizeof(void *)))) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_misc.h linux-2.6.8.1-ve022stab078/include/ub/ub_misc.h ---- linux-2.6.8.1.orig/include/ub/ub_misc.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_misc.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,33 @@ -+/* -+ * include/ub/ub_misc.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_MISC_H_ -+#define __UB_MISC_H_ -+ -+#include <ub/ub_decl.h> -+ -+struct tty_struct; -+struct file; -+struct file_lock; -+ -+UB_DECLARE_FUNC(int, ub_file_charge(struct file *f)) -+UB_DECLARE_VOID_FUNC(ub_file_uncharge(struct file *f)) -+UB_DECLARE_FUNC(int, ub_flock_charge(struct file_lock *fl, int hard)) -+UB_DECLARE_VOID_FUNC(ub_flock_uncharge(struct file_lock *fl)) -+UB_DECLARE_FUNC(int, ub_siginfo_charge(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_siginfo_uncharge(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_FUNC(int, ub_task_charge(struct task_struct *parent, -+ struct task_struct *task)) -+UB_DECLARE_VOID_FUNC(ub_task_uncharge(struct task_struct *task)) -+UB_DECLARE_FUNC(int, ub_pty_charge(struct tty_struct *tty)) -+UB_DECLARE_VOID_FUNC(ub_pty_uncharge(struct tty_struct *tty)) -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_net.h linux-2.6.8.1-ve022stab078/include/ub/ub_net.h ---- linux-2.6.8.1.orig/include/ub/ub_net.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_net.h 2006-05-11 
13:05:39.000000000 +0400 -@@ -0,0 +1,141 @@ -+/* -+ * include/ub/ub_net.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_NET_H_ -+#define __UB_NET_H_ -+ -+/* -+ * UB_NUMXXXSOCK, UB_XXXBUF accounting -+ */ -+ -+#include <ub/ub_decl.h> -+#include <ub/ub_sk.h> -+ -+#define bid2sid(__bufid) \ -+ ((__bufid) == UB_TCPSNDBUF ? UB_NUMTCPSOCK : UB_NUMOTHERSOCK) -+ -+#define SOCK_MIN_UBCSPACE ((int)((2048 - sizeof(struct skb_shared_info)) & \ -+ ~(SMP_CACHE_BYTES-1))) -+#define SOCK_MIN_UBCSPACE_CH skb_charge_size(SOCK_MIN_UBCSPACE) -+ -+ -+#define IS_TCP_SOCK(__family, __type) \ -+ ((__family) == PF_INET && (__type) == SOCK_STREAM) -+ -+UB_DECLARE_FUNC(int, ub_sock_charge(struct sock *sk, int family, int type)) -+UB_DECLARE_FUNC(int, ub_tcp_sock_charge(struct sock *sk)) -+UB_DECLARE_FUNC(int, ub_other_sock_charge(struct sock *sk)) -+UB_DECLARE_VOID_FUNC(ub_sock_uncharge(struct sock *sk)) -+UB_DECLARE_VOID_FUNC(ub_skb_uncharge(struct sk_buff *skb)) -+UB_DECLARE_FUNC(int, ub_skb_alloc_bc(struct sk_buff *skb, int gfp_mask)) -+UB_DECLARE_VOID_FUNC(ub_skb_free_bc(struct sk_buff *skb)) -+UB_DECLARE_FUNC(int, ub_nlrcvbuf_charge(struct sk_buff *skb, struct sock *sk)) -+UB_DECLARE_FUNC(int, ub_sockrcvbuf_charge(struct sock *sk, struct sk_buff *skb)) -+UB_DECLARE_VOID_FUNC(ub_sock_snd_queue_add(struct sock *sk, int resource, -+ unsigned long size)) -+UB_DECLARE_FUNC(long, ub_sock_wait_for_space(struct sock *sk, long timeo, -+ unsigned long size)) -+ -+UB_DECLARE_FUNC(int, ub_tcprcvbuf_charge(struct sock *sk, struct sk_buff *skb)) -+UB_DECLARE_FUNC(int, ub_tcprcvbuf_charge_forced(struct sock *sk, -+ struct sk_buff *skb)) -+UB_DECLARE_FUNC(int, ub_tcpsndbuf_charge(struct sock *sk, struct sk_buff *skb)) -+UB_DECLARE_FUNC(int, ub_tcpsndbuf_charge_forced(struct sock *sk, -+ struct sk_buff *skb)) -+ -+/* Charge size */ -+static inline unsigned long skb_charge_datalen(unsigned long 
chargesize) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ unsigned long slabsize; -+ -+ chargesize -= sizeof(struct sk_buff); -+ slabsize = 64; -+ do { -+ slabsize <<= 1; -+ } while (slabsize <= chargesize); -+ -+ slabsize >>= 1; -+ return (slabsize - sizeof(struct skb_shared_info)) & -+ ~(SMP_CACHE_BYTES-1); -+#else -+ return 0; -+#endif -+} -+ -+static inline unsigned long skb_charge_size_gen(unsigned long size) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ unsigned int slabsize; -+ -+ size = SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info); -+ slabsize = 32; /* min size is 64 because of skb_shared_info */ -+ do { -+ slabsize <<= 1; -+ } while (slabsize < size); -+ -+ return slabsize + sizeof(struct sk_buff); -+#else -+ return 0; -+#endif -+ -+} -+ -+static inline unsigned long skb_charge_size_const(unsigned long size) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ unsigned int ret; -+ if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 64) -+ ret = 64 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 128) -+ ret = 128 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 256) -+ ret = 256 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 512) -+ ret = 512 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 1024) -+ ret = 1024 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 2048) -+ ret = 2048 + sizeof(struct sk_buff); -+ else if (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info) <= 4096) -+ ret = 4096 + sizeof(struct sk_buff); -+ else -+ ret = skb_charge_size_gen(size); -+ return ret; -+#else -+ return 0; -+#endif -+} -+ -+ -+#define skb_charge_size(__size) \ -+ (__builtin_constant_p(__size) ? 
\ -+ skb_charge_size_const(__size) : \ -+ skb_charge_size_gen(__size)) -+ -+UB_DECLARE_FUNC(int, skb_charge_fullsize(struct sk_buff *skb)) -+UB_DECLARE_VOID_FUNC(ub_skb_set_charge(struct sk_buff *skb, -+ struct sock *sk, unsigned long size, int res)) -+ -+/* Poll reserv */ -+UB_DECLARE_FUNC(int, ub_sock_makewres_other(struct sock *sk, unsigned long sz)) -+UB_DECLARE_FUNC(int, ub_sock_makewres_tcp(struct sock *sk, unsigned long size)) -+UB_DECLARE_FUNC(int, ub_sock_getwres_other(struct sock *sk, unsigned long size)) -+UB_DECLARE_FUNC(int, ub_sock_getwres_tcp(struct sock *sk, unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_sock_retwres_other(struct sock *sk, unsigned long size, -+ unsigned long ressize)) -+UB_DECLARE_VOID_FUNC(ub_sock_retwres_tcp(struct sock *sk, unsigned long size, -+ unsigned long ressize)) -+UB_DECLARE_VOID_FUNC(ub_sock_sndqueueadd_other(struct sock *sk, -+ unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_sock_sndqueueadd_tcp(struct sock *sk, unsigned long sz)) -+UB_DECLARE_VOID_FUNC(ub_sock_sndqueuedel(struct sock *sk)) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_orphan.h linux-2.6.8.1-ve022stab078/include/ub/ub_orphan.h ---- linux-2.6.8.1.orig/include/ub/ub_orphan.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_orphan.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,54 @@ -+/* -+ * include/ub/ub_orphan.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_ORPHAN_H_ -+#define __UB_ORPHAN_H_ -+ -+#include "ub/beancounter.h" -+#include "ub/ub_net.h" -+ -+ -+extern int ub_too_many_orphans(struct sock *sk, int count); -+static inline int tcp_too_many_orphans(struct sock *sk, int count) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ if (ub_too_many_orphans(sk, count)) -+ return 1; -+#endif -+ return (atomic_read(&tcp_orphan_count) > sysctl_tcp_max_orphans || -+ (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && -+ atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])); -+} -+ -+static inline atomic_t *tcp_get_orphan_count_ptr(struct sock *sk) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ if (sock_has_ubc(sk)) -+ return &sock_bc(sk)->ub->ub_orphan_count; -+#endif -+ return &tcp_orphan_count; -+} -+ -+static inline void tcp_inc_orphan_count(struct sock *sk) -+{ -+ atomic_inc(tcp_get_orphan_count_ptr(sk)); -+} -+ -+static inline void tcp_dec_orphan_count(struct sock *sk) -+{ -+ atomic_dec(tcp_get_orphan_count_ptr(sk)); -+} -+ -+static inline int tcp_get_orphan_count(struct sock *sk) -+{ -+ return atomic_read(tcp_get_orphan_count_ptr(sk)); -+} -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_page.h linux-2.6.8.1-ve022stab078/include/ub/ub_page.h ---- linux-2.6.8.1.orig/include/ub/ub_page.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_page.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,48 @@ -+/* -+ * include/ub/ub_page.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_PAGE_H_ -+#define __UB_PAGE_H_ -+ -+#include <linux/config.h> -+ -+/* -+ * Page_beancounters -+ */ -+ -+struct page; -+struct user_beancounter; -+ -+#define PB_MAGIC 0x62700001UL -+ -+struct page_beancounter { -+ unsigned long pb_magic; -+ struct page *page; -+ struct user_beancounter *ub; -+ struct page_beancounter *next_hash; -+ unsigned refcount; -+ struct list_head page_list; -+}; -+ -+#define PB_REFCOUNT_BITS 24 -+#define PB_SHIFT_GET(c) ((c) >> PB_REFCOUNT_BITS) -+#define PB_SHIFT_INC(c) ((c) += (1 << PB_REFCOUNT_BITS)) -+#define PB_SHIFT_DEC(c) ((c) -= (1 << PB_REFCOUNT_BITS)) -+#define PB_COUNT_GET(c) ((c) & ((1 << PB_REFCOUNT_BITS) - 1)) -+#define PB_COUNT_INC(c) ((c)++) -+#define PB_COUNT_DEC(c) ((c)--) -+#define PB_REFCOUNT_MAKE(s, c) (((s) << PB_REFCOUNT_BITS) + (c)) -+ -+#define page_pbc(__page) ((__page)->bc.page_pbc) -+ -+struct address_space; -+extern int is_shmem_mapping(struct address_space *); -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_sk.h linux-2.6.8.1-ve022stab078/include/ub/ub_sk.h ---- linux-2.6.8.1.orig/include/ub/ub_sk.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_sk.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,45 @@ -+/* -+ * include/ub/ub_sk.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_SK_H_ -+#define __UB_SK_H_ -+ -+#include <linux/config.h> -+#include <ub/ub_task.h> -+ -+struct sock; -+struct sk_buff; -+ -+struct skb_beancounter { -+ struct user_beancounter *ub; -+ unsigned long charged:27, resource:5; -+}; -+ -+struct sock_beancounter { -+ /* -+ * already charged for future sends, to make poll work; -+ * changes are protected by bc spinlock, read is under socket -+ * semaphore for sends and unprotected in poll -+ */ -+ unsigned long poll_reserv; -+ unsigned long ub_waitspc; /* space waiting for */ -+ unsigned long ub_wcharged; -+ struct list_head ub_sock_list; -+ struct user_beancounter *ub; -+}; -+ -+#define sock_bc(__sk) (&(__sk)->sk_bc) -+#define skb_bc(__skb) (&(__skb)->skb_bc) -+#define skbc_sock(__skbc) (container_of(__skbc, struct sock, sk_bc)) -+#define sock_has_ubc(__sk) (sock_bc(__sk)->ub != NULL) -+ -+#define set_sk_exec_ub(__sk) (set_exec_ub(sock_bc(sk)->ub)) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_stat.h linux-2.6.8.1-ve022stab078/include/ub/ub_stat.h ---- linux-2.6.8.1.orig/include/ub/ub_stat.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_stat.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,70 @@ -+/* -+ * include/ub/ub_stat.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_STAT_H_ -+#define __UB_STAT_H_ -+ -+/* sys_ubstat commands list */ -+#define UBSTAT_READ_ONE 0x010000 -+#define UBSTAT_READ_ALL 0x020000 -+#define UBSTAT_READ_FULL 0x030000 -+#define UBSTAT_UBLIST 0x040000 -+#define UBSTAT_UBPARMNUM 0x050000 -+#define UBSTAT_GETTIME 0x060000 -+ -+#define UBSTAT_CMD(func) ((func) & 0xF0000) -+#define UBSTAT_PARMID(func) ((func) & 0x0FFFF) -+ -+#define TIME_MAX_SEC (LONG_MAX / HZ) -+#define TIME_MAX_JIF (TIME_MAX_SEC * HZ) -+ -+typedef unsigned long ubstattime_t; -+ -+typedef struct { -+ ubstattime_t start_time; -+ ubstattime_t end_time; -+ ubstattime_t cur_time; -+} ubgettime_t; -+ -+typedef struct { -+ long maxinterval; -+ int signum; -+} ubnotifrq_t; -+ -+typedef struct { -+ unsigned long maxheld; -+ unsigned long failcnt; -+} ubstatparm_t; -+ -+typedef struct { -+ unsigned long barrier; -+ unsigned long limit; -+ unsigned long held; -+ unsigned long maxheld; -+ unsigned long minheld; -+ unsigned long failcnt; -+ unsigned long __unused1; -+ unsigned long __unused2; -+} ubstatparmf_t; -+ -+typedef struct { -+ ubstattime_t start_time; -+ ubstattime_t end_time; -+ ubstatparmf_t param[0]; -+} ubstatfull_t; -+ -+#ifdef __KERNEL__ -+struct ub_stat_notify { -+ struct list_head list; -+ struct task_struct *task; -+ int signum; -+}; -+#endif -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_task.h linux-2.6.8.1-ve022stab078/include/ub/ub_task.h ---- linux-2.6.8.1.orig/include/ub/ub_task.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_task.h 2006-05-11 13:05:49.000000000 +0400 -@@ -0,0 +1,50 @@ -+/* -+ * include/ub/ub_task.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_TASK_H_ -+#define __UB_TASK_H_ -+ -+#include <linux/config.h> -+ -+struct user_beancounter; -+ -+ -+#ifdef CONFIG_USER_RESOURCE -+ -+struct task_beancounter { -+ struct user_beancounter *exec_ub; -+ struct user_beancounter *task_ub; -+ struct user_beancounter *fork_sub; -+ void *task_fnode, *task_freserv; -+ unsigned long task_data[4]; -+}; -+ -+#define task_bc(__tsk) (&((__tsk)->task_bc)) -+ -+#define get_exec_ub() (task_bc(current)->exec_ub) -+#define get_task_ub(__task) (task_bc(__task)->task_ub) -+#define set_exec_ub(__newub) \ -+({ \ -+ struct user_beancounter *old; \ -+ struct task_beancounter *tbc; \ -+ tbc = task_bc(current); \ -+ old = tbc->exec_ub; \ -+ tbc->exec_ub = __newub; \ -+ old; \ -+}) -+ -+#else /* CONFIG_USER_RESOURCE */ -+ -+#define get_exec_ub() (NULL) -+#define get_task_ub(task) (NULL) -+#define set_exec_ub(__ub) (NULL) -+ -+#endif /* CONFIG_USER_RESOURCE */ -+#endif /* __UB_TASK_H_ */ -diff -uprN linux-2.6.8.1.orig/include/ub/ub_tcp.h linux-2.6.8.1-ve022stab078/include/ub/ub_tcp.h ---- linux-2.6.8.1.orig/include/ub/ub_tcp.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_tcp.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,79 @@ -+/* -+ * include/ub/ub_tcp.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#ifndef __UB_TCP_H_ -+#define __UB_TCP_H_ -+ -+/* -+ * UB_NUMXXXSOCK, UB_XXXBUF accounting -+ */ -+ -+#include <ub/ub_sk.h> -+#include <ub/beancounter.h> -+ -+static inline void ub_tcp_update_maxadvmss(struct sock *sk) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ if (!sock_has_ubc(sk)) -+ return; -+ if (sock_bc(sk)->ub->ub_maxadvmss >= tcp_sk(sk)->advmss) -+ return; -+ -+ sock_bc(sk)->ub->ub_maxadvmss = -+ skb_charge_size(MAX_HEADER + sizeof(struct iphdr) -+ + sizeof(struct tcphdr) + tcp_sk(sk)->advmss); -+#endif -+} -+ -+static inline int ub_tcp_rmem_allows_expand(struct sock *sk) -+{ -+ if (tcp_memory_pressure) -+ return 0; -+#ifdef CONFIG_USER_RESOURCE -+ if (sock_has_ubc(sk)) { -+ struct user_beancounter *ub; -+ -+ ub = sock_bc(sk)->ub; -+ if (ub->ub_rmem_pressure == UB_RMEM_EXPAND) -+ return 1; -+ if (ub->ub_rmem_pressure == UB_RMEM_SHRINK) -+ return 0; -+ return sk->sk_rcvbuf <= ub->ub_rmem_thres; -+ } -+#endif -+ return 1; -+} -+ -+static inline int ub_tcp_memory_pressure(struct sock *sk) -+{ -+ if (tcp_memory_pressure) -+ return 1; -+#ifdef CONFIG_USER_RESOURCE -+ if (sock_has_ubc(sk)) -+ return sock_bc(sk)->ub->ub_rmem_pressure != UB_RMEM_EXPAND; -+#endif -+ return 0; -+} -+ -+static inline int ub_tcp_shrink_rcvbuf(struct sock *sk) -+{ -+ if (tcp_memory_pressure) -+ return 1; -+#ifdef CONFIG_USER_RESOURCE -+ if (sock_has_ubc(sk)) -+ return sock_bc(sk)->ub->ub_rmem_pressure == UB_RMEM_SHRINK; -+#endif -+ return 0; -+} -+ -+UB_DECLARE_FUNC(int, ub_sock_tcp_chargepage(struct sock *sk)) -+UB_DECLARE_VOID_FUNC(ub_sock_tcp_detachpage(struct sock *sk)) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/include/ub/ub_vmpages.h linux-2.6.8.1-ve022stab078/include/ub/ub_vmpages.h ---- linux-2.6.8.1.orig/include/ub/ub_vmpages.h 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/include/ub/ub_vmpages.h 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,121 @@ -+/* -+ * include/ub/ub_vmpages.h -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. 
-+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#ifndef __UB_PAGES_H_ -+#define __UB_PAGES_H_ -+ -+#include <linux/linkage.h> -+#include <linux/config.h> -+#include <ub/beancounter.h> -+#include <ub/ub_decl.h> -+ -+/* -+ * UB_XXXPAGES -+ */ -+ -+/* -+ * Check whether vma has private or copy-on-write mapping. -+ * Should match checks in ub_protected_charge(). -+ */ -+#define VM_UB_PRIVATE(__flags, __file) \ -+ ( ((__flags) & VM_WRITE) ? \ -+ (__file) == NULL || !((__flags) & VM_SHARED) : \ -+ 0 \ -+ ) -+ -+#define UB_PAGE_WEIGHT_SHIFT 24 -+#define UB_PAGE_WEIGHT (1 << UB_PAGE_WEIGHT_SHIFT) -+ -+struct page_beancounter; -+ -+/* Mprotect charging result */ -+#define PRIVVM_ERROR -1 -+#define PRIVVM_NO_CHARGE 0 -+#define PRIVVM_TO_PRIVATE 1 -+#define PRIVVM_TO_SHARED 2 -+ -+#ifdef CONFIG_USER_RESOURCE -+extern int ub_protected_charge(struct user_beancounter *ub, unsigned long size, -+ unsigned long newflags, struct vm_area_struct *vma); -+#else -+static inline int ub_protected_charge(struct user_beancounter *ub, -+ unsigned long size, unsigned long flags, -+ struct vm_area_struct *vma) -+{ -+ return PRIVVM_NO_CHARGE; -+} -+#endif -+ -+UB_DECLARE_VOID_FUNC(ub_tmpfs_respages_inc(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_tmpfs_respages_dec(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_FUNC(int, ub_shmpages_charge(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_shmpages_uncharge(struct user_beancounter *ub, -+ unsigned long size)) -+UB_DECLARE_FUNC(int, ub_locked_mem_charge(struct user_beancounter *ub, long sz)) -+UB_DECLARE_VOID_FUNC(ub_locked_mem_uncharge(struct user_beancounter *ub, -+ long size)) -+UB_DECLARE_FUNC(int, ub_privvm_charge(struct user_beancounter *ub, -+ unsigned long flags, struct file *file, -+ unsigned long size)) -+UB_DECLARE_VOID_FUNC(ub_privvm_uncharge(struct user_beancounter *ub, -+ unsigned long flags, struct file *file, -+ unsigned 
long size)) -+UB_DECLARE_FUNC(int, ub_unused_privvm_inc(struct user_beancounter * ub, -+ long size, struct vm_area_struct *vma)) -+UB_DECLARE_VOID_FUNC(ub_unused_privvm_dec(struct user_beancounter *ub, long sz, -+ struct vm_area_struct *vma)) -+UB_DECLARE_VOID_FUNC(__ub_unused_privvm_dec(struct user_beancounter *ub, long sz)) -+UB_DECLARE_FUNC(int, ub_memory_charge(struct user_beancounter * ub, -+ unsigned long size, unsigned vm_flags, -+ struct file *vm_file, int strict)) -+UB_DECLARE_VOID_FUNC(ub_memory_uncharge(struct user_beancounter * ub, -+ unsigned long size, unsigned vm_flags, -+ struct file *vm_file)) -+UB_DECLARE_FUNC(unsigned long, pages_in_vma_range(struct vm_area_struct *vma, -+ unsigned long start, unsigned long end)) -+#define pages_in_vma(vma) \ -+ (pages_in_vma_range((vma), (vma)->vm_start, (vma)->vm_end)) -+ -+extern void fastcall __ub_update_physpages(struct user_beancounter *ub); -+extern void fastcall __ub_update_oomguarpages(struct user_beancounter *ub); -+extern void fastcall __ub_update_privvm(struct user_beancounter *ub); -+ -+#ifdef CONFIG_USER_SWAP_ACCOUNTING -+extern void ub_swapentry_inc(struct user_beancounter *ub); -+extern void ub_swapentry_dec(struct user_beancounter *ub); -+#endif -+ -+#ifdef CONFIG_USER_RSS_ACCOUNTING -+#define PB_DECLARE_FUNC(ret, decl) UB_DECLARE_FUNC(ret, decl) -+#define PB_DECLARE_VOID_FUNC(decl) UB_DECLARE_VOID_FUNC(decl) -+#else -+#define PB_DECLARE_FUNC(ret, decl) static inline ret decl {return (ret)0;} -+#define PB_DECLARE_VOID_FUNC(decl) static inline void decl { } -+#endif -+ -+PB_DECLARE_FUNC(int, pb_reserve_all(struct page_beancounter **pbc)) -+PB_DECLARE_FUNC(int, pb_alloc(struct page_beancounter **pbc)) -+PB_DECLARE_FUNC(int, pb_alloc_list(struct page_beancounter **pbc, int num, -+ struct mm_struct *mm)) -+PB_DECLARE_FUNC(int, pb_add_ref(struct page *page, struct user_beancounter *ub, -+ struct page_beancounter **pbc)) -+PB_DECLARE_VOID_FUNC(pb_free_list(struct page_beancounter **pb)) 
-+PB_DECLARE_VOID_FUNC(pb_free(struct page_beancounter **pb)) -+PB_DECLARE_VOID_FUNC(pb_add_list_ref(struct page *page, -+ struct user_beancounter *ub, -+ struct page_beancounter **pbc)) -+PB_DECLARE_VOID_FUNC(pb_remove_ref(struct page *page, -+ struct user_beancounter *ub)) -+PB_DECLARE_FUNC(struct user_beancounter *, pb_grab_page_ub(struct page *page)) -+ -+#endif -diff -uprN linux-2.6.8.1.orig/init/do_mounts_initrd.c linux-2.6.8.1-ve022stab078/init/do_mounts_initrd.c ---- linux-2.6.8.1.orig/init/do_mounts_initrd.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/init/do_mounts_initrd.c 2006-05-11 13:05:37.000000000 +0400 -@@ -10,7 +10,7 @@ - - #include "do_mounts.h" - --unsigned long initrd_start, initrd_end; -+unsigned long initrd_start, initrd_end, initrd_copy; - int initrd_below_start_ok; - unsigned int real_root_dev; /* do_proc_dointvec cannot handle kdev_t */ - static int __initdata old_fd, root_fd; -diff -uprN linux-2.6.8.1.orig/init/main.c linux-2.6.8.1-ve022stab078/init/main.c ---- linux-2.6.8.1.orig/init/main.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/init/main.c 2006-05-11 13:05:40.000000000 +0400 -@@ -49,6 +49,8 @@ - #include <asm/bugs.h> - #include <asm/setup.h> - -+#include <ub/beancounter.h> -+ - /* - * This is one of the first .c files built. Error out early - * if we have compiler trouble.. 
-@@ -85,6 +87,7 @@ extern void sbus_init(void); - extern void sysctl_init(void); - extern void signals_init(void); - extern void buffer_init(void); -+extern void fairsched_init_late(void); - extern void pidhash_init(void); - extern void pidmap_init(void); - extern void prio_tree_init(void); -@@ -101,6 +104,16 @@ extern void tc_init(void); - enum system_states system_state; - EXPORT_SYMBOL(system_state); - -+#ifdef CONFIG_VE -+extern void init_ve_system(void); -+#endif -+ -+void prepare_ve0_process(struct task_struct *tsk); -+void prepare_ve0_proc_root(void); -+void prepare_ve0_sysctl(void); -+void prepare_ve0_loopback(void); -+void prepare_virtual_fs(void); -+ - /* - * Boot command-line arguments - */ -@@ -184,6 +197,52 @@ unsigned long loops_per_jiffy = (1<<12); - - EXPORT_SYMBOL(loops_per_jiffy); - -+unsigned long cycles_per_jiffy, cycles_per_clock; -+ -+void calibrate_cycles(void) -+{ -+ unsigned long ticks; -+ cycles_t time; -+ -+ ticks = jiffies; -+ while (ticks == jiffies) -+ /* nothing */; -+ time = get_cycles(); -+ ticks = jiffies; -+ while (ticks == jiffies) -+ /* nothing */; -+ -+ time = get_cycles() - time; -+ cycles_per_jiffy = time; -+ if ((time >> 32) != 0) { -+ printk("CPU too fast! timings are incorrect\n"); -+ cycles_per_jiffy = -1; -+ } -+} -+ -+EXPORT_SYMBOL(cycles_per_jiffy); -+ -+void calc_cycles_per_jiffy(void) -+{ -+#if defined(__i386__) -+ extern unsigned long fast_gettimeoffset_quotient; -+ unsigned long low, high; -+ -+ if (fast_gettimeoffset_quotient != 0) { -+ __asm__("divl %2" -+ :"=a" (low), "=d" (high) -+ :"r" (fast_gettimeoffset_quotient), -+ "0" (0), "1" (1000000/HZ)); -+ -+ cycles_per_jiffy = low; -+ } -+#endif -+ if (cycles_per_jiffy == 0) -+ calibrate_cycles(); -+ -+ cycles_per_clock = cycles_per_jiffy * (HZ / CLOCKS_PER_SEC); -+} -+ - /* This is the number of bits of precision for the loops_per_jiffy. Each - bit takes on average 1.5/HZ seconds. 
This (like the original) is a little - better than 1% */ -@@ -228,6 +287,8 @@ void __devinit calibrate_delay(void) - printk("%lu.%02lu BogoMIPS\n", - loops_per_jiffy/(500000/HZ), - (loops_per_jiffy/(5000/HZ)) % 100); -+ -+ calc_cycles_per_jiffy(); - } - - static int __init debug_kernel(char *str) -@@ -397,7 +458,8 @@ static void __init smp_init(void) - - static void noinline rest_init(void) - { -- kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND); -+ kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND | CLONE_STOPPED); -+ wake_up_init(); - numa_default_policy(); - unlock_kernel(); - cpu_idle(); -@@ -438,7 +500,6 @@ void __init parse_early_param(void) - /* - * Activate the first processor. - */ -- - asmlinkage void __init start_kernel(void) - { - char * command_line; -@@ -448,6 +509,7 @@ asmlinkage void __init start_kernel(void - * enable them - */ - lock_kernel(); -+ ub0_init(); - page_address_init(); - printk(linux_banner); - setup_arch(&command_line); -@@ -459,6 +521,8 @@ asmlinkage void __init start_kernel(void - */ - smp_prepare_boot_cpu(); - -+ prepare_ve0_process(&init_task); -+ - /* - * Set up the scheduler prior starting any interrupts (such as the - * timer interrupt). 
Full topology setup happens at smp_init() -@@ -517,6 +581,7 @@ asmlinkage void __init start_kernel(void - #endif - fork_init(num_physpages); - proc_caches_init(); -+ beancounter_init(num_physpages); - buffer_init(); - unnamed_dev_init(); - security_scaffolding_startup(); -@@ -526,7 +591,10 @@ asmlinkage void __init start_kernel(void - /* rootfs populating might need page-writeback */ - page_writeback_init(); - #ifdef CONFIG_PROC_FS -+ prepare_ve0_proc_root(); -+ prepare_ve0_sysctl(); - proc_root_init(); -+ beancounter_proc_init(); - #endif - check_bugs(); - -@@ -538,6 +606,7 @@ asmlinkage void __init start_kernel(void - init_idle(current, smp_processor_id()); - - /* Do the rest non-__init'ed, we're now alive */ -+ page_ubc_init(); - rest_init(); - } - -@@ -598,6 +667,9 @@ static void __init do_initcalls(void) - */ - static void __init do_basic_setup(void) - { -+ prepare_ve0_loopback(); -+ init_ve_system(); -+ - driver_init(); - - #ifdef CONFIG_SYSCTL -@@ -614,7 +686,7 @@ static void __init do_basic_setup(void) - static void do_pre_smp_initcalls(void) - { - extern int spawn_ksoftirqd(void); --#ifdef CONFIG_SMP -+#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_VCPU) - extern int migration_init(void); - - migration_init(); -@@ -666,6 +738,12 @@ static int init(void * unused) - - fixup_cpu_present_map(); - smp_init(); -+ -+ /* -+ * This should be done after all cpus are known to -+ * be online. smp_init gives us confidence in it. 
-+ */ -+ fairsched_init_late(); - sched_init_smp(); - - /* -diff -uprN linux-2.6.8.1.orig/init/version.c linux-2.6.8.1-ve022stab078/init/version.c ---- linux-2.6.8.1.orig/init/version.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/init/version.c 2006-05-11 13:05:42.000000000 +0400 -@@ -28,6 +28,12 @@ struct new_utsname system_utsname = { - - EXPORT_SYMBOL(system_utsname); - -+struct new_utsname virt_utsname = { -+ /* we need only this field */ -+ .release = UTS_RELEASE, -+}; -+EXPORT_SYMBOL(virt_utsname); -+ - const char *linux_banner = - "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@" - LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n"; -diff -uprN linux-2.6.8.1.orig/ipc/compat.c linux-2.6.8.1-ve022stab078/ipc/compat.c ---- linux-2.6.8.1.orig/ipc/compat.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/compat.c 2006-05-11 13:05:40.000000000 +0400 -@@ -33,6 +33,8 @@ - #include <asm/semaphore.h> - #include <asm/uaccess.h> - -+#include <linux/ve_owner.h> -+ - #include "util.h" - - struct compat_msgbuf { -diff -uprN linux-2.6.8.1.orig/ipc/mqueue.c linux-2.6.8.1-ve022stab078/ipc/mqueue.c ---- linux-2.6.8.1.orig/ipc/mqueue.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/mqueue.c 2006-05-11 13:05:38.000000000 +0400 -@@ -631,7 +631,8 @@ static int oflag2acc[O_ACCMODE] = { MAY_ - if ((oflag & O_ACCMODE) == (O_RDWR | O_WRONLY)) - return ERR_PTR(-EINVAL); - -- if (permission(dentry->d_inode, oflag2acc[oflag & O_ACCMODE], NULL)) -+ if (permission(dentry->d_inode, oflag2acc[oflag & O_ACCMODE], -+ NULL, NULL)) - return ERR_PTR(-EACCES); - - filp = dentry_open(dentry, mqueue_mnt, oflag); -@@ -1008,7 +1009,7 @@ retry: - goto out; - } - -- ret = netlink_attachskb(sock, nc, 0, MAX_SCHEDULE_TIMEOUT); -+ ret = netlink_attachskb(sock, nc, 0, MAX_SCHEDULE_TIMEOUT, NULL); - if (ret == 1) - goto retry; - if (ret) { -diff -uprN linux-2.6.8.1.orig/ipc/msg.c linux-2.6.8.1-ve022stab078/ipc/msg.c ---- 
linux-2.6.8.1.orig/ipc/msg.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/msg.c 2006-05-11 13:05:45.000000000 +0400 -@@ -75,6 +75,16 @@ static int newque (key_t key, int msgflg - static int sysvipc_msg_read_proc(char *buffer, char **start, off_t offset, int length, int *eof, void *data); - #endif - -+void prepare_msg(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->_msg_ids = &msg_ids; -+ get_ve0()->_msg_ctlmax = msg_ctlmax; -+ get_ve0()->_msg_ctlmnb = msg_ctlmnb; -+ get_ve0()->_msg_ctlmni = msg_ctlmni; -+#endif -+} -+ - void __init msg_init (void) - { - ipc_init_ids(&msg_ids,msg_ctlmni); -@@ -84,6 +94,23 @@ void __init msg_init (void) - #endif - } - -+#ifdef CONFIG_VE -+# define msg_ids (*(get_exec_env()->_msg_ids)) -+# define msg_ctlmax (get_exec_env()->_msg_ctlmax) -+# define msg_ctlmnb (get_exec_env()->_msg_ctlmnb) -+# define msg_ctlmni (get_exec_env()->_msg_ctlmni) -+#endif -+ -+#ifdef CONFIG_VE -+void ve_msg_ipc_init (void) -+{ -+ msg_ctlmax = MSGMAX; -+ msg_ctlmnb = MSGMNB; -+ msg_ctlmni = MSGMNI; -+ ve_ipc_init_ids(&msg_ids, MSGMNI); -+} -+#endif -+ - static int newque (key_t key, int msgflg) - { - int id; -@@ -104,7 +131,7 @@ static int newque (key_t key, int msgflg - return retval; - } - -- id = ipc_addid(&msg_ids, &msq->q_perm, msg_ctlmni); -+ id = ipc_addid(&msg_ids, &msq->q_perm, msg_ctlmni, -1); - if(id == -1) { - security_msg_queue_free(msq); - ipc_rcu_free(msq, sizeof(*msq)); -@@ -441,7 +468,7 @@ asmlinkage long sys_msgctl (int msqid, i - ipcp = &msq->q_perm; - err = -EPERM; - if (current->euid != ipcp->cuid && -- current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) -+ current->euid != ipcp->uid && !capable(CAP_VE_SYS_ADMIN)) - /* We _could_ check for CAP_CHOWN above, but we don't */ - goto out_unlock_up; - -@@ -529,7 +556,7 @@ static inline int pipelined_send(struct - wake_up_process(msr->r_tsk); - } else { - msr->r_msg = msg; -- msq->q_lrpid = msr->r_tsk->pid; -+ msq->q_lrpid = virt_pid(msr->r_tsk); - msq->q_rtime = 
get_seconds(); - wake_up_process(msr->r_tsk); - return 1; -@@ -603,7 +630,7 @@ retry: - goto retry; - } - -- msq->q_lspid = current->tgid; -+ msq->q_lspid = virt_tgid(current); - msq->q_stime = get_seconds(); - - if(!pipelined_send(msq,msg)) { -@@ -697,7 +724,7 @@ retry: - list_del(&msg->m_list); - msq->q_qnum--; - msq->q_rtime = get_seconds(); -- msq->q_lrpid = current->tgid; -+ msq->q_lrpid = virt_tgid(current); - msq->q_cbytes -= msg->m_ts; - atomic_sub(msg->m_ts,&msg_bytes); - atomic_dec(&msg_hdrs); -@@ -828,3 +855,39 @@ done: - return len; - } - #endif -+ -+#ifdef CONFIG_VE -+void ve_msg_ipc_cleanup(void) -+{ -+ int i; -+ struct msg_queue *msq; -+ -+ down(&msg_ids.sem); -+ for (i = 0; i <= msg_ids.max_id; i++) { -+ msq = msg_lock(i); -+ if (msq == NULL) -+ continue; -+ freeque(msq, i); -+ } -+ up(&msg_ids.sem); -+} -+ -+int sysvipc_walk_msg(int (*func)(int i, struct msg_queue*, void *), void *arg) -+{ -+ int i; -+ int err = 0; -+ struct msg_queue * msq; -+ -+ down(&msg_ids.sem); -+ for(i = 0; i <= msg_ids.max_id; i++) { -+ if ((msq = msg_lock(i)) == NULL) -+ continue; -+ err = func(msg_buildid(i,msq->q_perm.seq), msq, arg); -+ msg_unlock(msq); -+ if (err) -+ break; -+ } -+ up(&msg_ids.sem); -+ return err; -+} -+#endif -diff -uprN linux-2.6.8.1.orig/ipc/msgutil.c linux-2.6.8.1-ve022stab078/ipc/msgutil.c ---- linux-2.6.8.1.orig/ipc/msgutil.c 2004-08-14 14:55:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/msgutil.c 2006-05-11 13:05:39.000000000 +0400 -@@ -17,6 +17,8 @@ - - #include "util.h" - -+#include <ub/ub_mem.h> -+ - struct msg_msgseg { - struct msg_msgseg* next; - /* the next part of the message follows immediately */ -@@ -36,7 +38,7 @@ struct msg_msg *load_msg(const void __us - if (alen > DATALEN_MSG) - alen = DATALEN_MSG; - -- msg = (struct msg_msg *)kmalloc(sizeof(*msg) + alen, GFP_KERNEL); -+ msg = (struct msg_msg *)ub_kmalloc(sizeof(*msg) + alen, GFP_KERNEL); - if (msg == NULL) - return ERR_PTR(-ENOMEM); - -@@ -56,7 +58,7 @@ struct msg_msg 
*load_msg(const void __us - alen = len; - if (alen > DATALEN_SEG) - alen = DATALEN_SEG; -- seg = (struct msg_msgseg *)kmalloc(sizeof(*seg) + alen, -+ seg = (struct msg_msgseg *)ub_kmalloc(sizeof(*seg) + alen, - GFP_KERNEL); - if (seg == NULL) { - err = -ENOMEM; -diff -uprN linux-2.6.8.1.orig/ipc/sem.c linux-2.6.8.1-ve022stab078/ipc/sem.c ---- linux-2.6.8.1.orig/ipc/sem.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/sem.c 2006-05-11 13:05:45.000000000 +0400 -@@ -74,6 +74,7 @@ - #include <asm/uaccess.h> - #include "util.h" - -+#include <ub/ub_mem.h> - - #define sem_lock(id) ((struct sem_array*)ipc_lock(&sem_ids,id)) - #define sem_unlock(sma) ipc_unlock(&(sma)->sem_perm) -@@ -82,9 +83,13 @@ - ipc_checkid(&sem_ids,&sma->sem_perm,semid) - #define sem_buildid(id, seq) \ - ipc_buildid(&sem_ids, id, seq) -+ -+int sem_ctls[4] = {SEMMSL, SEMMNS, SEMOPM, SEMMNI}; -+ - static struct ipc_ids sem_ids; -+static int used_sems; - --static int newary (key_t, int, int); -+static int newary (key_t, int, int, int); - static void freeary (struct sem_array *sma, int id); - #ifdef CONFIG_PROC_FS - static int sysvipc_sem_read_proc(char *buffer, char **start, off_t offset, int length, int *eof, void *data); -@@ -102,24 +107,51 @@ static int sysvipc_sem_read_proc(char *b - * - */ - --int sem_ctls[4] = {SEMMSL, SEMMNS, SEMOPM, SEMMNI}; - #define sc_semmsl (sem_ctls[0]) - #define sc_semmns (sem_ctls[1]) - #define sc_semopm (sem_ctls[2]) - #define sc_semmni (sem_ctls[3]) - --static int used_sems; -+void prepare_sem(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->_sem_ids = &sem_ids; -+ get_ve0()->_used_sems = used_sems; -+ get_ve0()->_sem_ctls[0] = sem_ctls[0]; -+ get_ve0()->_sem_ctls[1] = sem_ctls[1]; -+ get_ve0()->_sem_ctls[2] = sem_ctls[2]; -+ get_ve0()->_sem_ctls[3] = sem_ctls[3]; -+#endif -+} - - void __init sem_init (void) - { - used_sems = 0; -- ipc_init_ids(&sem_ids,sc_semmni); -+ ipc_init_ids(&sem_ids, SEMMNI); - - #ifdef CONFIG_PROC_FS - 
create_proc_read_entry("sysvipc/sem", 0, NULL, sysvipc_sem_read_proc, NULL); - #endif - } - -+#ifdef CONFIG_VE -+# define sem_ids (*(get_exec_env()->_sem_ids)) -+# define used_sems (get_exec_env()->_used_sems) -+# define sem_ctls (get_exec_env()->_sem_ctls) -+#endif -+ -+#ifdef CONFIG_VE -+void ve_sem_ipc_init (void) -+{ -+ used_sems = 0; -+ sem_ctls[0] = SEMMSL; -+ sem_ctls[1] = SEMMNS; -+ sem_ctls[2] = SEMOPM; -+ sem_ctls[3] = SEMMNI; -+ ve_ipc_init_ids(&sem_ids, SEMMNI); -+} -+#endif -+ - /* - * Lockless wakeup algorithm: - * Without the check/retry algorithm a lockless wakeup is possible: -@@ -154,7 +186,7 @@ void __init sem_init (void) - */ - #define IN_WAKEUP 1 - --static int newary (key_t key, int nsems, int semflg) -+static int newary (key_t key, int semid, int nsems, int semflg) - { - int id; - int retval; -@@ -183,7 +215,7 @@ static int newary (key_t key, int nsems, - return retval; - } - -- id = ipc_addid(&sem_ids, &sma->sem_perm, sc_semmni); -+ id = ipc_addid(&sem_ids, &sma->sem_perm, sc_semmni, semid); - if(id == -1) { - security_sem_free(sma); - ipc_rcu_free(sma, size); -@@ -212,12 +244,12 @@ asmlinkage long sys_semget (key_t key, i - down(&sem_ids.sem); - - if (key == IPC_PRIVATE) { -- err = newary(key, nsems, semflg); -+ err = newary(key, -1, nsems, semflg); - } else if ((id = ipc_findkey(&sem_ids, key)) == -1) { /* key not used */ - if (!(semflg & IPC_CREAT)) - err = -ENOENT; - else -- err = newary(key, nsems, semflg); -+ err = newary(key, -1, nsems, semflg); - } else if (semflg & IPC_CREAT && semflg & IPC_EXCL) { - err = -EEXIST; - } else { -@@ -715,7 +747,7 @@ static int semctl_main(int semid, int se - for (un = sma->undo; un; un = un->id_next) - un->semadj[semnum] = 0; - curr->semval = val; -- curr->sempid = current->tgid; -+ curr->sempid = virt_tgid(current); - sma->sem_ctime = get_seconds(); - /* maybe some queued-up processes were waiting for this */ - update_queue(sma); -@@ -793,7 +825,7 @@ static int semctl_down(int semid, int se - ipcp = 
&sma->sem_perm; - - if (current->euid != ipcp->cuid && -- current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) { -+ current->euid != ipcp->uid && !capable(CAP_VE_SYS_ADMIN)) { - err=-EPERM; - goto out_unlock; - } -@@ -914,7 +946,8 @@ static inline int get_undo_list(struct s - undo_list = current->sysvsem.undo_list; - if (!undo_list) { - size = sizeof(struct sem_undo_list); -- undo_list = (struct sem_undo_list *) kmalloc(size, GFP_KERNEL); -+ undo_list = (struct sem_undo_list *) ub_kmalloc(size, -+ GFP_KERNEL); - if (undo_list == NULL) - return -ENOMEM; - memset(undo_list, 0, size); -@@ -979,7 +1012,8 @@ static struct sem_undo *find_undo(int se - nsems = sma->sem_nsems; - sem_unlock(sma); - -- new = (struct sem_undo *) kmalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL); -+ new = (struct sem_undo *) ub_kmalloc(sizeof(struct sem_undo) + -+ sizeof(short)*nsems, GFP_KERNEL); - if (!new) - return ERR_PTR(-ENOMEM); - memset(new, 0, sizeof(struct sem_undo) + sizeof(short)*nsems); -@@ -1028,7 +1062,7 @@ asmlinkage long sys_semtimedop(int semid - if (nsops > sc_semopm) - return -E2BIG; - if(nsops > SEMOPM_FAST) { -- sops = kmalloc(sizeof(*sops)*nsops,GFP_KERNEL); -+ sops = ub_kmalloc(sizeof(*sops)*nsops, GFP_KERNEL); - if(sops==NULL) - return -ENOMEM; - } -@@ -1100,7 +1134,7 @@ retry_undos: - if (error) - goto out_unlock_free; - -- error = try_atomic_semop (sma, sops, nsops, un, current->tgid); -+ error = try_atomic_semop (sma, sops, nsops, un, virt_tgid(current)); - if (error <= 0) - goto update; - -@@ -1112,7 +1146,7 @@ retry_undos: - queue.sops = sops; - queue.nsops = nsops; - queue.undo = un; -- queue.pid = current->tgid; -+ queue.pid = virt_tgid(current); - queue.id = semid; - if (alter) - append_to_queue(sma ,&queue); -@@ -1271,7 +1305,7 @@ found: - sem->semval += u->semadj[i]; - if (sem->semval < 0) - sem->semval = 0; /* shouldn't happen */ -- sem->sempid = current->tgid; -+ sem->sempid = virt_tgid(current); - } - } - sma->sem_otime = get_seconds(); 
-@@ -1331,3 +1365,58 @@ done: - return len; - } - #endif -+ -+#ifdef CONFIG_VE -+void ve_sem_ipc_cleanup(void) -+{ -+ int i; -+ struct sem_array *sma; -+ -+ down(&sem_ids.sem); -+ for (i = 0; i <= sem_ids.max_id; i++) { -+ sma = sem_lock(i); -+ if (sma == NULL) -+ continue; -+ freeary(sma, i); -+ } -+ up(&sem_ids.sem); -+} -+ -+int sysvipc_setup_sem(key_t key, int semid, size_t size, int semflg) -+{ -+ int err = 0; -+ struct sem_array *sma; -+ -+ down(&sem_ids.sem); -+ sma = sem_lock(semid); -+ if (!sma) { -+ err = newary(key, semid, size, semflg); -+ if (err >= 0) -+ sma = sem_lock(semid); -+ } -+ if (sma) -+ sem_unlock(sma); -+ up(&sem_ids.sem); -+ -+ return err > 0 ? 0 : err; -+} -+ -+int sysvipc_walk_sem(int (*func)(int i, struct sem_array*, void *), void *arg) -+{ -+ int i; -+ int err = 0; -+ struct sem_array *sma; -+ -+ down(&sem_ids.sem); -+ for (i = 0; i <= sem_ids.max_id; i++) { -+ if ((sma = sem_lock(i)) == NULL) -+ continue; -+ err = func(sem_buildid(i,sma->sem_perm.seq), sma, arg); -+ sem_unlock(sma); -+ if (err) -+ break; -+ } -+ up(&sem_ids.sem); -+ return err; -+} -+#endif -diff -uprN linux-2.6.8.1.orig/ipc/shm.c linux-2.6.8.1-ve022stab078/ipc/shm.c ---- linux-2.6.8.1.orig/ipc/shm.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/shm.c 2006-05-11 13:05:45.000000000 +0400 -@@ -28,6 +28,9 @@ - #include <linux/security.h> - #include <asm/uaccess.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_vmpages.h> -+ - #include "util.h" - - #define shm_flags shm_perm.mode -@@ -43,7 +46,7 @@ static struct ipc_ids shm_ids; - #define shm_buildid(id, seq) \ - ipc_buildid(&shm_ids, id, seq) - --static int newseg (key_t key, int shmflg, size_t size); -+static int newseg (key_t key, int shmid, int shmflg, size_t size); - static void shm_open (struct vm_area_struct *shmd); - static void shm_close (struct vm_area_struct *shmd); - #ifdef CONFIG_PROC_FS -@@ -55,6 +58,28 @@ size_t shm_ctlall = SHMALL; - int shm_ctlmni = SHMMNI; - - static int 
shm_tot; /* total number of shared memory pages */ -+ -+void prepare_shm(void) -+{ -+#ifdef CONFIG_VE -+ int i; -+ struct shmid_kernel* shp; -+ -+ get_ve0()->_shm_ids = &shm_ids; -+ for (i = 0; i <= shm_ids.max_id; i++) { -+ shp = (struct shmid_kernel *)ipc_lock(&shm_ids, i); -+ if (shp != NULL) { -+ shp->_shm_ids = &shm_ids; -+ ipc_unlock(&shp->shm_perm); -+ } -+ } -+ -+ get_ve0()->_shm_ctlmax = shm_ctlmax; -+ get_ve0()->_shm_ctlall = shm_ctlall; -+ get_ve0()->_shm_ctlmni = shm_ctlmni; -+ get_ve0()->_shm_tot = shm_tot; -+#endif -+} - - void __init shm_init (void) - { -@@ -64,6 +89,42 @@ void __init shm_init (void) - #endif - } - -+#ifdef CONFIG_VE -+# define shm_ids (*(get_exec_env()->_shm_ids)) -+# define shm_ctlmax (get_exec_env()->_shm_ctlmax) -+# define shm_ctlall (get_exec_env()->_shm_ctlall) -+# define shm_ctlmni (get_exec_env()->_shm_ctlmni) -+/* renamed since there is a struct field named shm_tot */ -+# define shm_total (get_exec_env()->_shm_tot) -+#else -+# define shm_total shm_tot -+#endif -+ -+#ifdef CONFIG_VE -+void ve_shm_ipc_init (void) -+{ -+ shm_ctlmax = SHMMAX; -+ shm_ctlall = SHMALL; -+ shm_ctlmni = SHMMNI; -+ shm_total = 0; -+ ve_ipc_init_ids(&shm_ids, 1); -+} -+#endif -+ -+static struct shmid_kernel* shm_lock_by_sb(int id, struct super_block* sb) -+{ -+ struct ve_struct *fs_envid; -+ fs_envid = VE_OWNER_FSTYPE(sb->s_type); -+ return (struct shmid_kernel *)ipc_lock(fs_envid->_shm_ids, id); -+} -+ -+static inline int *shm_total_sb(struct super_block *sb) -+{ -+ struct ve_struct *fs_envid; -+ fs_envid = VE_OWNER_FSTYPE(sb->s_type); -+ return &fs_envid->_shm_tot; -+} -+ - static inline int shm_checkid(struct shmid_kernel *s, int id) - { - if (ipc_checkid(&shm_ids,&s->shm_perm,id)) -@@ -71,25 +132,25 @@ static inline int shm_checkid(struct shm - return 0; - } - --static inline struct shmid_kernel *shm_rmid(int id) -+static inline struct shmid_kernel *shm_rmid(struct ipc_ids *ids, int id) - { -- return (struct shmid_kernel *)ipc_rmid(&shm_ids,id); -+ 
return (struct shmid_kernel *)ipc_rmid(ids, id); - } - --static inline int shm_addid(struct shmid_kernel *shp) -+static inline int shm_addid(struct shmid_kernel *shp, int reqid) - { -- return ipc_addid(&shm_ids, &shp->shm_perm, shm_ctlmni+1); -+ return ipc_addid(&shm_ids, &shp->shm_perm, shm_ctlmni+1, reqid); - } - - - --static inline void shm_inc (int id) { -+static inline void shm_inc (int id, struct super_block * sb) { - struct shmid_kernel *shp; - -- if(!(shp = shm_lock(id))) -+ if(!(shp = shm_lock_by_sb(id, sb))) - BUG(); - shp->shm_atim = get_seconds(); -- shp->shm_lprid = current->tgid; -+ shp->shm_lprid = virt_tgid(current); - shp->shm_nattch++; - shm_unlock(shp); - } -@@ -97,7 +158,40 @@ static inline void shm_inc (int id) { - /* This is called by fork, once for every shm attach. */ - static void shm_open (struct vm_area_struct *shmd) - { -- shm_inc (shmd->vm_file->f_dentry->d_inode->i_ino); -+ shm_inc (shmd->vm_file->f_dentry->d_inode->i_ino, -+ shmd->vm_file->f_dentry->d_inode->i_sb); -+} -+ -+static int shmem_lock(struct shmid_kernel *shp, int lock) -+{ -+ struct inode *inode = shp->shm_file->f_dentry->d_inode; -+ struct shmem_inode_info *info = SHMEM_I(inode); -+ unsigned long size; -+ -+ if (!is_file_hugepages(shp->shm_file)) -+ return 0; -+ -+ spin_lock(&info->lock); -+ if (!!lock == !!(info->flags & VM_LOCKED)) -+ goto out; -+ -+ /* size will be re-calculated in pages inside (un)charge */ -+ size = shp->shm_segsz + PAGE_SIZE - 1; -+ -+ if (!lock) { -+ ub_locked_mem_uncharge(shmid_ub(shp), size); -+ info->flags &= ~VM_LOCKED; -+ } else if (ub_locked_mem_charge(shmid_ub(shp), size) < 0) -+ goto out_err; -+ else -+ info->flags |= VM_LOCKED; -+out: -+ spin_unlock(&info->lock); -+ return 0; -+ -+out_err: -+ spin_unlock(&info->lock); -+ return -ENOMEM; - } - - /* -@@ -110,13 +204,23 @@ static void shm_open (struct vm_area_str - */ - static void shm_destroy (struct shmid_kernel *shp) - { -- shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; -- 
shm_rmid (shp->id); -+ int numpages; -+ struct super_block *sb; -+ int *shm_totalp; -+ struct file *file; -+ -+ file = shp->shm_file; -+ numpages = (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; -+ sb = file->f_dentry->d_inode->i_sb; -+ shm_totalp = shm_total_sb(sb); -+ *shm_totalp -= numpages; -+ shm_rmid(shp->_shm_ids, shp->id); - shm_unlock(shp); -- if (!is_file_hugepages(shp->shm_file)) -- shmem_lock(shp->shm_file, 0); -- fput (shp->shm_file); -+ shmem_lock(shp, 0); -+ fput (file); - security_shm_free(shp); -+ put_beancounter(shmid_ub(shp)); -+ shmid_ub(shp) = NULL; - ipc_rcu_free(shp, sizeof(struct shmid_kernel)); - } - -@@ -130,13 +234,25 @@ static void shm_close (struct vm_area_st - { - struct file * file = shmd->vm_file; - int id = file->f_dentry->d_inode->i_ino; -+ struct super_block *sb; - struct shmid_kernel *shp; -+ struct ipc_ids* ids; -+#ifdef CONFIG_VE -+ struct ve_struct *fs_envid; -+#endif - -- down (&shm_ids.sem); -+ sb = file->f_dentry->d_inode->i_sb; -+#ifdef CONFIG_VE -+ fs_envid = get_ve(VE_OWNER_FSTYPE(sb->s_type)); -+ ids = fs_envid->_shm_ids; -+#else -+ ids = &shm_ids; -+#endif -+ down (&ids->sem); - /* remove from the list of attaches of the shm segment */ -- if(!(shp = shm_lock(id))) -+ if(!(shp = shm_lock_by_sb(id, sb))) - BUG(); -- shp->shm_lprid = current->tgid; -+ shp->shm_lprid = virt_tgid(current); - shp->shm_dtim = get_seconds(); - shp->shm_nattch--; - if(shp->shm_nattch == 0 && -@@ -144,14 +260,20 @@ static void shm_close (struct vm_area_st - shm_destroy (shp); - else - shm_unlock(shp); -- up (&shm_ids.sem); -+ up (&ids->sem); -+#ifdef CONFIG_VE -+ put_ve(fs_envid); -+#endif - } - - static int shm_mmap(struct file * file, struct vm_area_struct * vma) - { - file_accessed(file); - vma->vm_ops = &shm_vm_ops; -- shm_inc(file->f_dentry->d_inode->i_ino); -+ if (!(vma->vm_flags & VM_WRITE)) -+ vma->vm_flags &= ~VM_MAYWRITE; -+ shm_inc(file->f_dentry->d_inode->i_ino, -+ file->f_dentry->d_inode->i_sb); - return 0; - } - -@@ -169,19 
+291,19 @@ static struct vm_operations_struct shm_v - #endif - }; - --static int newseg (key_t key, int shmflg, size_t size) -+static int newseg (key_t key, int shmid, int shmflg, size_t size) - { - int error; - struct shmid_kernel *shp; - int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT; - struct file * file; -- char name[13]; -+ char name[26]; - int id; - - if (size < SHMMIN || size > shm_ctlmax) - return -EINVAL; - -- if (shm_tot + numpages >= shm_ctlall) -+ if (shm_total + numpages >= shm_ctlall) - return -ENOSPC; - - shp = ipc_rcu_alloc(sizeof(*shp)); -@@ -201,7 +323,11 @@ static int newseg (key_t key, int shmflg - if (shmflg & SHM_HUGETLB) - file = hugetlb_zero_setup(size); - else { -+#ifdef CONFIG_VE -+ sprintf (name, "VE%d.SYSV%08x", get_exec_env()->veid, key); -+#else - sprintf (name, "SYSV%08x", key); -+#endif - file = shmem_file_setup(name, size, VM_ACCOUNT); - } - error = PTR_ERR(file); -@@ -209,24 +335,26 @@ static int newseg (key_t key, int shmflg - goto no_file; - - error = -ENOSPC; -- id = shm_addid(shp); -+ id = shm_addid(shp, shmid); - if(id == -1) - goto no_id; - -- shp->shm_cprid = current->tgid; -+ shp->shm_cprid = virt_tgid(current); - shp->shm_lprid = 0; - shp->shm_atim = shp->shm_dtim = 0; - shp->shm_ctim = get_seconds(); - shp->shm_segsz = size; - shp->shm_nattch = 0; - shp->id = shm_buildid(id,shp->shm_perm.seq); -+ shp->_shm_ids = &shm_ids; - shp->shm_file = file; -+ shmid_ub(shp) = get_beancounter(get_exec_ub()); - file->f_dentry->d_inode->i_ino = shp->id; - if (shmflg & SHM_HUGETLB) - set_file_hugepages(file); - else - file->f_op = &shm_file_operations; -- shm_tot += numpages; -+ shm_total += numpages; - shm_unlock(shp); - return shp->id; - -@@ -245,12 +373,12 @@ asmlinkage long sys_shmget (key_t key, s - - down(&shm_ids.sem); - if (key == IPC_PRIVATE) { -- err = newseg(key, shmflg, size); -+ err = newseg(key, -1, shmflg, size); - } else if ((id = ipc_findkey(&shm_ids, key)) == -1) { - if (!(shmflg & IPC_CREAT)) - err = -ENOENT; - 
else -- err = newseg(key, shmflg, size); -+ err = newseg(key, -1, shmflg, size); - } else if ((shmflg & IPC_CREAT) && (shmflg & IPC_EXCL)) { - err = -EEXIST; - } else { -@@ -443,7 +571,7 @@ asmlinkage long sys_shmctl (int shmid, i - down(&shm_ids.sem); - shm_info.used_ids = shm_ids.in_use; - shm_get_stat (&shm_info.shm_rss, &shm_info.shm_swp); -- shm_info.shm_tot = shm_tot; -+ shm_info.shm_tot = shm_total; - shm_info.swap_attempts = 0; - shm_info.swap_successes = 0; - err = shm_ids.max_id; -@@ -526,12 +654,10 @@ asmlinkage long sys_shmctl (int shmid, i - goto out_unlock; - - if(cmd==SHM_LOCK) { -- if (!is_file_hugepages(shp->shm_file)) -- shmem_lock(shp->shm_file, 1); -- shp->shm_flags |= SHM_LOCKED; -+ if ((err = shmem_lock(shp, 1)) == 0) -+ shp->shm_flags |= SHM_LOCKED; - } else { -- if (!is_file_hugepages(shp->shm_file)) -- shmem_lock(shp->shm_file, 0); -+ shmem_lock(shp, 0); - shp->shm_flags &= ~SHM_LOCKED; - } - shm_unlock(shp); -@@ -560,7 +686,7 @@ asmlinkage long sys_shmctl (int shmid, i - - if (current->euid != shp->shm_perm.uid && - current->euid != shp->shm_perm.cuid && -- !capable(CAP_SYS_ADMIN)) { -+ !capable(CAP_VE_SYS_ADMIN)) { - err=-EPERM; - goto out_unlock_up; - } -@@ -597,7 +723,7 @@ asmlinkage long sys_shmctl (int shmid, i - err=-EPERM; - if (current->euid != shp->shm_perm.uid && - current->euid != shp->shm_perm.cuid && -- !capable(CAP_SYS_ADMIN)) { -+ !capable(CAP_VE_SYS_ADMIN)) { - goto out_unlock_up; - } - -@@ -818,6 +944,7 @@ asmlinkage long sys_shmdt(char __user *s - * could possibly have landed at. Also cast things to loff_t to - * prevent overflows and make comparisions vs. equal-width types. 
- */ -+ size = PAGE_ALIGN(size); - while (vma && (loff_t)(vma->vm_end - addr) <= size) { - next = vma->vm_next; - -@@ -894,3 +1021,72 @@ done: - return len; - } - #endif -+ -+#ifdef CONFIG_VE -+void ve_shm_ipc_cleanup(void) -+{ -+ int i; -+ -+ down(&shm_ids.sem); -+ for (i = 0; i <= shm_ids.max_id; i++) { -+ struct shmid_kernel *shp; -+ -+ if (!(shp = shm_lock(i))) -+ continue; -+ if (shp->shm_nattch) { -+ shp->shm_flags |= SHM_DEST; -+ shp->shm_perm.key = IPC_PRIVATE; -+ shm_unlock(shp); -+ } else -+ shm_destroy(shp); -+ } -+ up(&shm_ids.sem); -+} -+#endif -+ -+struct file * sysvipc_setup_shm(key_t key, int shmid, size_t size, int shmflg) -+{ -+ struct shmid_kernel *shp; -+ struct file *file; -+ -+ down(&shm_ids.sem); -+ shp = shm_lock(shmid); -+ if (!shp) { -+ int err; -+ -+ err = newseg(key, shmid, shmflg, size); -+ file = ERR_PTR(err); -+ if (err < 0) -+ goto out; -+ shp = shm_lock(shmid); -+ } -+ file = ERR_PTR(-EINVAL); -+ if (shp) { -+ file = shp->shm_file; -+ get_file(file); -+ shm_unlock(shp); -+ } -+out: -+ up(&shm_ids.sem); -+ -+ return file; -+} -+ -+int sysvipc_walk_shm(int (*func)(struct shmid_kernel*, void *), void *arg) -+{ -+ int i; -+ int err = 0; -+ struct shmid_kernel* shp; -+ -+ down(&shm_ids.sem); -+ for(i = 0; i <= shm_ids.max_id; i++) { -+ if ((shp = shm_lock(i)) == NULL) -+ continue; -+ err = func(shp, arg); -+ shm_unlock(shp); -+ if (err) -+ break; -+ } -+ up(&shm_ids.sem); -+ return err; -+} -diff -uprN linux-2.6.8.1.orig/ipc/util.c linux-2.6.8.1-ve022stab078/ipc/util.c ---- linux-2.6.8.1.orig/ipc/util.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/util.c 2006-05-11 13:05:48.000000000 +0400 -@@ -13,6 +13,7 @@ - */ - - #include <linux/config.h> -+#include <linux/module.h> - #include <linux/mm.h> - #include <linux/shm.h> - #include <linux/init.h> -@@ -27,8 +28,12 @@ - - #include <asm/unistd.h> - -+#include <ub/ub_mem.h> -+ - #include "util.h" - -+DCL_VE_OWNER(IPCIDS, STATIC_SOFT, struct ipc_ids, owner_env, inline, 
()) -+ - /** - * ipc_init - initialise IPC subsystem - * -@@ -55,7 +60,7 @@ __initcall(ipc_init); - * array itself. - */ - --void __init ipc_init_ids(struct ipc_ids* ids, int size) -+void ve_ipc_init_ids(struct ipc_ids* ids, int size) - { - int i; - sema_init(&ids->sem,1); -@@ -82,7 +87,25 @@ void __init ipc_init_ids(struct ipc_ids* - } - for(i=0;i<ids->size;i++) - ids->entries[i].p = NULL; -+#ifdef CONFIG_VE -+ SET_VE_OWNER_IPCIDS(ids, get_exec_env()); -+#endif -+} -+ -+void __init ipc_init_ids(struct ipc_ids* ids, int size) -+{ -+ ve_ipc_init_ids(ids, size); -+} -+ -+#ifdef CONFIG_VE -+static void ipc_free_ids(struct ipc_ids* ids) -+{ -+ if (ids == NULL) -+ return; -+ ipc_rcu_free(ids->entries, sizeof(struct ipc_id)*ids->size); -+ kfree(ids); - } -+#endif - - /** - * ipc_findkey - find a key in an ipc identifier set -@@ -165,10 +188,20 @@ static int grow_ary(struct ipc_ids* ids, - * Called with ipc_ids.sem held. - */ - --int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size) -+int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size, int reqid) - { - int id; - -+ if (reqid >= 0) { -+ id = reqid%SEQ_MULTIPLIER; -+ size = grow_ary(ids,id+1); -+ if (id >= size) -+ return -1; -+ if (ids->entries[id].p == NULL) -+ goto found; -+ return -1; -+ } -+ - size = grow_ary(ids,size); - - /* -@@ -181,6 +214,10 @@ int ipc_addid(struct ipc_ids* ids, struc - } - return -1; - found: -+#ifdef CONFIG_VE -+ if (ids->in_use == 0) -+ (void)get_ve(VE_OWNER_IPCIDS(ids)); -+#endif - ids->in_use++; - if (id > ids->max_id) - ids->max_id = id; -@@ -188,9 +225,13 @@ found: - new->cuid = new->uid = current->euid; - new->gid = new->cgid = current->egid; - -- new->seq = ids->seq++; -- if(ids->seq > ids->seq_max) -- ids->seq = 0; -+ if (reqid >= 0) { -+ new->seq = reqid/SEQ_MULTIPLIER; -+ } else { -+ new->seq = ids->seq++; -+ if(ids->seq > ids->seq_max) -+ ids->seq = 0; -+ } - - new->lock = SPIN_LOCK_UNLOCKED; - new->deleted = 0; -@@ -238,6 +279,10 @@ struct 
kern_ipc_perm* ipc_rmid(struct ip - } while (ids->entries[lid].p == NULL); - ids->max_id = lid; - } -+#ifdef CONFIG_VE -+ if (ids->in_use == 0) -+ put_ve(VE_OWNER_IPCIDS(ids)); -+#endif - p->deleted = 1; - return p; - } -@@ -254,9 +299,9 @@ void* ipc_alloc(int size) - { - void* out; - if(size > PAGE_SIZE) -- out = vmalloc(size); -+ out = ub_vmalloc(size); - else -- out = kmalloc(size, GFP_KERNEL); -+ out = ub_kmalloc(size, GFP_KERNEL); - return out; - } - -@@ -317,7 +362,7 @@ void* ipc_rcu_alloc(int size) - * workqueue if necessary (for vmalloc). - */ - if (rcu_use_vmalloc(size)) { -- out = vmalloc(sizeof(struct ipc_rcu_vmalloc) + size); -+ out = ub_vmalloc(sizeof(struct ipc_rcu_vmalloc) + size); - if (out) out += sizeof(struct ipc_rcu_vmalloc); - } else { - out = kmalloc(sizeof(struct ipc_rcu_kmalloc)+size, GFP_KERNEL); -@@ -524,6 +569,85 @@ int ipc_checkid(struct ipc_ids* ids, str - return 0; - } - -+#ifdef CONFIG_VE -+ -+void prepare_ipc(void) -+{ -+ /* -+ * Note: we don't need to call SET_VE_OWNER_IPCIDS inside, -+ * since we use static variables for ve0 (see STATIC_SOFT decl). -+ */ -+ prepare_msg(); -+ prepare_sem(); -+ prepare_shm(); -+} -+ -+int init_ve_ipc(struct ve_struct * envid) -+{ -+ struct ve_struct * saved_envid; -+ -+ envid->_msg_ids = kmalloc(sizeof(struct ipc_ids) + sizeof(void *), -+ GFP_KERNEL); -+ if (envid->_msg_ids == NULL) -+ goto out_nomem; -+ envid->_sem_ids = kmalloc(sizeof(struct ipc_ids) + sizeof(void *), -+ GFP_KERNEL); -+ if (envid->_sem_ids == NULL) -+ goto out_free_msg; -+ envid->_shm_ids = kmalloc(sizeof(struct ipc_ids) + sizeof(void *), -+ GFP_KERNEL); -+ if (envid->_shm_ids == NULL) -+ goto out_free_sem; -+ -+ /* -+ * Bad style, but save a lot of code (charging to proper VE) -+ * Here we temporary change VEID of the process involved in VE init. -+ * The same is effect for ve_ipc_cleanup in real_do_env_cleanup(). 
-+ */ -+ saved_envid = set_exec_env(envid); -+ -+ ve_msg_ipc_init(); -+ ve_sem_ipc_init(); -+ ve_shm_ipc_init(); -+ -+ (void)set_exec_env(saved_envid); -+ return 0; -+ -+out_free_sem: -+ kfree(envid->_sem_ids); -+out_free_msg: -+ kfree(envid->_msg_ids); -+out_nomem: -+ return -ENOMEM; -+} -+ -+void ve_ipc_cleanup(void) -+{ -+ ve_msg_ipc_cleanup(); -+ ve_sem_ipc_cleanup(); -+ ve_shm_ipc_cleanup(); -+} -+ -+void ve_ipc_free(struct ve_struct *envid) -+{ -+ ipc_free_ids(envid->_msg_ids); -+ ipc_free_ids(envid->_sem_ids); -+ ipc_free_ids(envid->_shm_ids); -+ envid->_msg_ids = envid->_sem_ids = envid->_shm_ids = NULL; -+} -+ -+void fini_ve_ipc(struct ve_struct *ptr) -+{ -+ ve_ipc_cleanup(); -+ ve_ipc_free(ptr); -+} -+ -+EXPORT_SYMBOL(init_ve_ipc); -+EXPORT_SYMBOL(ve_ipc_cleanup); -+EXPORT_SYMBOL(ve_ipc_free); -+EXPORT_SYMBOL(fini_ve_ipc); -+#endif /* CONFIG_VE */ -+ - #ifdef __ARCH_WANT_IPC_PARSE_VERSION - - -diff -uprN linux-2.6.8.1.orig/ipc/util.h linux-2.6.8.1-ve022stab078/ipc/util.h ---- linux-2.6.8.1.orig/ipc/util.h 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/ipc/util.h 2006-05-11 13:05:45.000000000 +0400 -@@ -15,6 +15,20 @@ void sem_init (void); - void msg_init (void); - void shm_init (void); - -+#ifdef CONFIG_VE -+ -+void ve_msg_ipc_init(void); -+void ve_sem_ipc_init(void); -+void ve_shm_ipc_init(void); -+void prepare_msg(void); -+void prepare_sem(void); -+void prepare_shm(void); -+void ve_msg_ipc_cleanup(void); -+void ve_sem_ipc_cleanup(void); -+void ve_shm_ipc_cleanup(void); -+ -+#endif -+ - struct ipc_ids { - int size; - int in_use; -@@ -23,17 +37,21 @@ struct ipc_ids { - unsigned short seq_max; - struct semaphore sem; - struct ipc_id* entries; -+ struct ve_struct *owner_env; - }; - -+DCL_VE_OWNER_PROTO(IPCIDS, STATIC_SOFT, struct ipc_ids, owner_env, inline, ()) -+ - struct ipc_id { - struct kern_ipc_perm* p; - }; - --void __init ipc_init_ids(struct ipc_ids* ids, int size); -+void ipc_init_ids(struct ipc_ids* ids, int size); -+void 
ve_ipc_init_ids(struct ipc_ids* ids, int size); - - /* must be called with ids->sem acquired.*/ - int ipc_findkey(struct ipc_ids* ids, key_t key); --int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size); -+int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size, int reqid); - - /* must be called with both locks acquired. */ - struct kern_ipc_perm* ipc_rmid(struct ipc_ids* ids, int id); -diff -uprN linux-2.6.8.1.orig/kernel/Kconfig.openvz linux-2.6.8.1-ve022stab078/kernel/Kconfig.openvz ---- linux-2.6.8.1.orig/kernel/Kconfig.openvz 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/Kconfig.openvz 2006-05-11 13:05:49.000000000 +0400 -@@ -0,0 +1,46 @@ -+# Copyright (C) 2005 SWsoft -+# All rights reserved. -+# Licensing governed by "linux/COPYING.SWsoft" file. -+ -+config VE -+ bool "Virtual Environment support" -+ depends on !SECURITY -+ default y -+ help -+ This option adds support of virtual Linux running on the original box -+ with fully supported virtual network driver, tty subsystem and -+ configurable access for hardware and other resources. -+ -+config VE_CALLS -+ tristate "VE calls interface" -+ depends on VE -+ default m -+ help -+ This option controls how to build vzmon code containing VE calls. -+ By default it's build in module vzmon.o -+ -+config VZ_GENCALLS -+ bool -+ default y -+ -+config VE_NETDEV -+ tristate "VE networking" -+ depends on VE -+ default m -+ help -+ This option controls whether to build VE networking code. -+ -+config VE_IPTABLES -+ bool "VE netfiltering" -+ depends on VE && VE_NETDEV && INET && NETFILTER -+ default y -+ help -+ This option controls whether to build VE netfiltering code. -+ -+config VZ_WDOG -+ tristate "VE watchdog module" -+ depends on VE -+ default m -+ help -+ This option controls building of vzwdog module, which dumps -+ a lot of useful system info on console periodically. 
-diff -uprN linux-2.6.8.1.orig/kernel/capability.c linux-2.6.8.1-ve022stab078/kernel/capability.c ---- linux-2.6.8.1.orig/kernel/capability.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/capability.c 2006-05-11 13:05:40.000000000 +0400 -@@ -23,6 +23,7 @@ EXPORT_SYMBOL(cap_bset); - * Locking rule: acquire this prior to tasklist_lock. - */ - spinlock_t task_capability_lock = SPIN_LOCK_UNLOCKED; -+EXPORT_SYMBOL(task_capability_lock); - - /* - * For sys_getproccap() and sys_setproccap(), any of the three -@@ -59,8 +60,8 @@ asmlinkage long sys_capget(cap_user_head - spin_lock(&task_capability_lock); - read_lock(&tasklist_lock); - -- if (pid && pid != current->pid) { -- target = find_task_by_pid(pid); -+ if (pid && pid != virt_pid(current)) { -+ target = find_task_by_pid_ve(pid); - if (!target) { - ret = -ESRCH; - goto out; -@@ -89,14 +90,16 @@ static inline void cap_set_pg(int pgrp, - kernel_cap_t *permitted) - { - task_t *g, *target; -- struct list_head *l; -- struct pid *pid; - -- for_each_task_pid(pgrp, PIDTYPE_PGID, g, l, pid) { -+ pgrp = vpid_to_pid(pgrp); -+ if (pgrp < 0) -+ return; -+ -+ do_each_task_pid_ve(pgrp, PIDTYPE_PGID, g) { - target = g; -- while_each_thread(g, target) -+ while_each_thread_ve(g, target) - security_capset_set(target, effective, inheritable, permitted); -- } -+ } while_each_task_pid_ve(pgrp, PIDTYPE_PGID, g); - } - - /* -@@ -109,11 +112,11 @@ static inline void cap_set_all(kernel_ca - { - task_t *g, *target; - -- do_each_thread(g, target) { -+ do_each_thread_ve(g, target) { - if (target == current || target->pid == 1) - continue; - security_capset_set(target, effective, inheritable, permitted); -- } while_each_thread(g, target); -+ } while_each_thread_ve(g, target); - } - - /* -@@ -159,8 +162,8 @@ asmlinkage long sys_capset(cap_user_head - spin_lock(&task_capability_lock); - read_lock(&tasklist_lock); - -- if (pid > 0 && pid != current->pid) { -- target = find_task_by_pid(pid); -+ if (pid > 0 && pid != 
virt_pid(current)) { -+ target = find_task_by_pid_ve(pid); - if (!target) { - ret = -ESRCH; - goto out; -diff -uprN linux-2.6.8.1.orig/kernel/compat.c linux-2.6.8.1-ve022stab078/kernel/compat.c ---- linux-2.6.8.1.orig/kernel/compat.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/compat.c 2006-05-11 13:05:27.000000000 +0400 -@@ -559,5 +559,84 @@ long compat_clock_nanosleep(clockid_t wh - return err; - } - -+void -+sigset_from_compat (sigset_t *set, compat_sigset_t *compat) -+{ -+ switch (_NSIG_WORDS) { -+ case 4: set->sig[3] = compat->sig[6] | (((long)compat->sig[7]) << 32 ); -+ case 3: set->sig[2] = compat->sig[4] | (((long)compat->sig[5]) << 32 ); -+ case 2: set->sig[1] = compat->sig[2] | (((long)compat->sig[3]) << 32 ); -+ case 1: set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 ); -+ } -+} -+ -+asmlinkage long -+compat_rt_sigtimedwait (compat_sigset_t __user *uthese, -+ struct compat_siginfo __user *uinfo, -+ struct compat_timespec __user *uts, compat_size_t sigsetsize) -+{ -+ compat_sigset_t s32; -+ sigset_t s; -+ int sig; -+ struct timespec t; -+ siginfo_t info; -+ long ret, timeout = 0; -+ -+ if (sigsetsize != sizeof(sigset_t)) -+ return -EINVAL; -+ -+ if (copy_from_user(&s32, uthese, sizeof(compat_sigset_t))) -+ return -EFAULT; -+ sigset_from_compat(&s, &s32); -+ sigdelsetmask(&s,sigmask(SIGKILL)|sigmask(SIGSTOP)); -+ signotset(&s); -+ -+ if (uts) { -+ if (get_compat_timespec (&t, uts)) -+ return -EFAULT; -+ if (t.tv_nsec >= 1000000000L || t.tv_nsec < 0 -+ || t.tv_sec < 0) -+ return -EINVAL; -+ } -+ -+ spin_lock_irq(¤t->sighand->siglock); -+ sig = dequeue_signal(current, &s, &info); -+ if (!sig) { -+ timeout = MAX_SCHEDULE_TIMEOUT; -+ if (uts) -+ timeout = timespec_to_jiffies(&t) -+ +(t.tv_sec || t.tv_nsec); -+ if (timeout) { -+ current->real_blocked = current->blocked; -+ sigandsets(¤t->blocked, ¤t->blocked, &s); -+ -+ recalc_sigpending(); -+ spin_unlock_irq(¤t->sighand->siglock); -+ -+ current->state = 
TASK_INTERRUPTIBLE; -+ timeout = schedule_timeout(timeout); -+ -+ spin_lock_irq(¤t->sighand->siglock); -+ sig = dequeue_signal(current, &s, &info); -+ current->blocked = current->real_blocked; -+ siginitset(¤t->real_blocked, 0); -+ recalc_sigpending(); -+ } -+ } -+ spin_unlock_irq(¤t->sighand->siglock); -+ -+ if (sig) { -+ ret = sig; -+ if (uinfo) { -+ if (copy_siginfo_to_user32(uinfo, &info)) -+ ret = -EFAULT; -+ } -+ }else { -+ ret = timeout?-EINTR:-EAGAIN; -+ } -+ return ret; -+ -+} -+ - /* timer_create is architecture specific because it needs sigevent conversion */ - -diff -uprN linux-2.6.8.1.orig/kernel/configs.c linux-2.6.8.1-ve022stab078/kernel/configs.c ---- linux-2.6.8.1.orig/kernel/configs.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/configs.c 2006-05-11 13:05:42.000000000 +0400 -@@ -89,8 +89,7 @@ static int __init ikconfig_init(void) - struct proc_dir_entry *entry; - - /* create the current config file */ -- entry = create_proc_entry("config.gz", S_IFREG | S_IRUGO, -- &proc_root); -+ entry = create_proc_entry("config.gz", S_IFREG | S_IRUGO, NULL); - if (!entry) - return -ENOMEM; - -diff -uprN linux-2.6.8.1.orig/kernel/cpu.c linux-2.6.8.1-ve022stab078/kernel/cpu.c ---- linux-2.6.8.1.orig/kernel/cpu.c 2004-08-14 14:56:13.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/cpu.c 2006-05-11 13:05:40.000000000 +0400 -@@ -43,13 +43,18 @@ void unregister_cpu_notifier(struct noti - EXPORT_SYMBOL(unregister_cpu_notifier); - - #ifdef CONFIG_HOTPLUG_CPU -+ -+#ifdef CONFIG_SCHED_VCPU -+#error "CONFIG_HOTPLUG_CPU isn't supported with CONFIG_SCHED_VCPU" -+#endif -+ - static inline void check_for_tasks(int cpu) - { - struct task_struct *p; - - write_lock_irq(&tasklist_lock); -- for_each_process(p) { -- if (task_cpu(p) == cpu && (p->utime != 0 || p->stime != 0)) -+ for_each_process_all(p) { -+ if (task_pcpu(p) == cpu && (p->utime != 0 || p->stime != 0)) - printk(KERN_WARNING "Task %s (pid = %d) is on cpu %d\ - (state = %ld, flags = 
%lx) \n", - p->comm, p->pid, cpu, p->state, p->flags); -@@ -104,6 +109,13 @@ static int take_cpu_down(void *unused) - return err; - } - -+#ifdef CONFIG_SCHED_VCPU -+#error VCPU vs. HOTPLUG: fix hotplug code below -+/* -+ * What should be fixed: -+ * - check for if (idle_cpu()) yield() -+ */ -+#endif - int cpu_down(unsigned int cpu) - { - int err; -diff -uprN linux-2.6.8.1.orig/kernel/exit.c linux-2.6.8.1-ve022stab078/kernel/exit.c ---- linux-2.6.8.1.orig/kernel/exit.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/exit.c 2006-05-11 13:05:49.000000000 +0400 -@@ -23,12 +23,17 @@ - #include <linux/mount.h> - #include <linux/proc_fs.h> - #include <linux/mempolicy.h> -+#include <linux/swap.h> -+#include <linux/fairsched.h> -+#include <linux/faudit.h> - - #include <asm/uaccess.h> - #include <asm/unistd.h> - #include <asm/pgtable.h> - #include <asm/mmu_context.h> - -+#include <ub/ub_mem.h> -+ - extern void sem_exit (void); - extern struct task_struct *child_reaper; - -@@ -47,20 +52,19 @@ static void __unhash_process(struct task - } - - REMOVE_LINKS(p); -+ REMOVE_VE_LINKS(p); - } - - void release_task(struct task_struct * p) - { - int zap_leader; - task_t *leader; -- struct dentry *proc_dentry; -+ struct dentry *proc_dentry[2]; - - repeat: -- BUG_ON(p->state < TASK_ZOMBIE); -- - atomic_dec(&p->user->processes); - spin_lock(&p->proc_lock); -- proc_dentry = proc_pid_unhash(p); -+ proc_pid_unhash(p, proc_dentry); - write_lock_irq(&tasklist_lock); - if (unlikely(p->ptrace)) - __ptrace_unlink(p); -@@ -68,6 +72,8 @@ repeat: - __exit_signal(p); - __exit_sighand(p); - __unhash_process(p); -+ nr_zombie--; -+ nr_dead++; - - /* - * If we are the last non-leader member of the thread -@@ -76,7 +82,7 @@ repeat: - */ - zap_leader = 0; - leader = p->group_leader; -- if (leader != p && thread_group_empty(leader) && leader->state == TASK_ZOMBIE) { -+ if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) { - BUG_ON(leader->exit_signal 
== -1); - do_notify_parent(leader, leader->exit_signal); - /* -@@ -101,6 +107,8 @@ repeat: - spin_unlock(&p->proc_lock); - proc_pid_flush(proc_dentry); - release_thread(p); -+ if (atomic_dec_and_test(&VE_TASK_INFO(p)->owner_env->pcounter)) -+ do_env_cleanup(VE_TASK_INFO(p)->owner_env); - put_task_struct(p); - - p = leader; -@@ -112,10 +120,10 @@ repeat: - - void unhash_process(struct task_struct *p) - { -- struct dentry *proc_dentry; -+ struct dentry *proc_dentry[2]; - - spin_lock(&p->proc_lock); -- proc_dentry = proc_pid_unhash(p); -+ proc_pid_unhash(p, proc_dentry); - write_lock_irq(&tasklist_lock); - __unhash_process(p); - write_unlock_irq(&tasklist_lock); -@@ -131,17 +139,18 @@ void unhash_process(struct task_struct * - int session_of_pgrp(int pgrp) - { - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - int sid = -1; - -+ WARN_ON(is_virtual_pid(pgrp)); -+ - read_lock(&tasklist_lock); -- for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) -+ do_each_task_pid_ve(pgrp, PIDTYPE_PGID, p) { - if (p->signal->session > 0) { - sid = p->signal->session; - goto out; - } -- p = find_task_by_pid(pgrp); -+ } while_each_task_pid_ve(pgrp, PIDTYPE_PGID, p); -+ p = find_task_by_pid_ve(pgrp); - if (p) - sid = p->signal->session; - out: -@@ -161,21 +170,21 @@ out: - static int will_become_orphaned_pgrp(int pgrp, task_t *ignored_task) - { - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - int ret = 1; - -- for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) { -+ WARN_ON(is_virtual_pid(pgrp)); -+ -+ do_each_task_pid_ve(pgrp, PIDTYPE_PGID, p) { - if (p == ignored_task -- || p->state >= TASK_ZOMBIE -- || p->real_parent->pid == 1) -+ || p->exit_state -+ || virt_pid(p->real_parent) == 1) - continue; - if (process_group(p->real_parent) != pgrp - && p->real_parent->signal->session == p->signal->session) { - ret = 0; - break; - } -- } -+ } while_each_task_pid_ve(pgrp, PIDTYPE_PGID, p); - return ret; /* (sighing) "Often!" 
*/ - } - -@@ -183,6 +192,8 @@ int is_orphaned_pgrp(int pgrp) - { - int retval; - -+ WARN_ON(is_virtual_pid(pgrp)); -+ - read_lock(&tasklist_lock); - retval = will_become_orphaned_pgrp(pgrp, NULL); - read_unlock(&tasklist_lock); -@@ -194,10 +205,10 @@ static inline int has_stopped_jobs(int p - { - int retval = 0; - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - -- for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) { -+ WARN_ON(is_virtual_pid(pgrp)); -+ -+ do_each_task_pid_ve(pgrp, PIDTYPE_PGID, p) { - if (p->state != TASK_STOPPED) - continue; - -@@ -213,7 +224,7 @@ static inline int has_stopped_jobs(int p - - retval = 1; - break; -- } -+ } while_each_task_pid_ve(pgrp, PIDTYPE_PGID, p); - return retval; - } - -@@ -260,6 +271,9 @@ void __set_special_pids(pid_t session, p - { - struct task_struct *curr = current; - -+ WARN_ON(is_virtual_pid(pgrp)); -+ WARN_ON(is_virtual_pid(session)); -+ - if (curr->signal->session != session) { - detach_pid(curr, PIDTYPE_SID); - curr->signal->session = session; -@@ -278,6 +292,7 @@ void set_special_pids(pid_t session, pid - __set_special_pids(session, pgrp); - write_unlock_irq(&tasklist_lock); - } -+EXPORT_SYMBOL(set_special_pids); - - /* - * Let kernel threads use this to say that they -@@ -342,7 +357,9 @@ void daemonize(const char *name, ...) - exit_mm(current); - - set_special_pids(1, 1); -+ down(&tty_sem); - current->signal->tty = NULL; -+ up(&tty_sem); - - /* Block and flush all signals */ - sigfillset(&blocked); -@@ -529,12 +546,8 @@ static inline void choose_new_parent(tas - * Make sure we're not reparenting to ourselves and that - * the parent is not a zombie. 
- */ -- if (p == reaper || reaper->state >= TASK_ZOMBIE) -- p->real_parent = child_reaper; -- else -- p->real_parent = reaper; -- if (p->parent == p->real_parent) -- BUG(); -+ BUG_ON(p == reaper || reaper->exit_state); -+ p->real_parent = reaper; - } - - static inline void reparent_thread(task_t *p, task_t *father, int traced) -@@ -566,7 +579,7 @@ static inline void reparent_thread(task_ - /* If we'd notified the old parent about this child's death, - * also notify the new parent. - */ -- if (p->state == TASK_ZOMBIE && p->exit_signal != -1 && -+ if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1 && - thread_group_empty(p)) - do_notify_parent(p, p->exit_signal); - } -@@ -597,12 +610,15 @@ static inline void reparent_thread(task_ - static inline void forget_original_parent(struct task_struct * father, - struct list_head *to_release) - { -- struct task_struct *p, *reaper = father; -+ struct task_struct *p, *tsk_reaper, *reaper = father; - struct list_head *_p, *_n; - -- reaper = father->group_leader; -- if (reaper == father) -- reaper = child_reaper; -+ do { -+ reaper = next_thread(reaper); -+ if (reaper == father) { -+ break; -+ } -+ } while (reaper->exit_state); - - /* - * There are only two places where our children can be: -@@ -621,14 +637,21 @@ static inline void forget_original_paren - /* if father isn't the real parent, then ptrace must be enabled */ - BUG_ON(father != p->real_parent && !ptrace); - -+ tsk_reaper = reaper; -+ if (tsk_reaper == father) -+#ifdef CONFIG_VE -+ tsk_reaper = VE_TASK_INFO(p)->owner_env->init_entry; -+ if (tsk_reaper == p) -+#endif -+ tsk_reaper = child_reaper; - if (father == p->real_parent) { -- /* reparent with a reaper, real father it's us */ -- choose_new_parent(p, reaper, child_reaper); -+ /* reparent with a tsk_reaper, real father it's us */ -+ choose_new_parent(p, tsk_reaper, child_reaper); - reparent_thread(p, father, 0); - } else { - /* reparent ptraced task to its real parent */ - __ptrace_unlink (p); -- if (p->state == 
TASK_ZOMBIE && p->exit_signal != -1 && -+ if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1 && - thread_group_empty(p)) - do_notify_parent(p, p->exit_signal); - } -@@ -639,12 +662,20 @@ static inline void forget_original_paren - * zombie forever since we prevented it from self-reap itself - * while it was being traced by us, to be able to see it in wait4. - */ -- if (unlikely(ptrace && p->state == TASK_ZOMBIE && p->exit_signal == -1)) -+ if (unlikely(ptrace && p->exit_state == EXIT_ZOMBIE && p->exit_signal == -1)) - list_add(&p->ptrace_list, to_release); - } - list_for_each_safe(_p, _n, &father->ptrace_children) { - p = list_entry(_p,struct task_struct,ptrace_list); -- choose_new_parent(p, reaper, child_reaper); -+ -+ tsk_reaper = reaper; -+ if (tsk_reaper == father) -+#ifdef CONFIG_VE -+ tsk_reaper = VE_TASK_INFO(p)->owner_env->init_entry; -+ if (tsk_reaper == p) -+#endif -+ tsk_reaper = child_reaper; -+ choose_new_parent(p, tsk_reaper, child_reaper); - reparent_thread(p, father, 1); - } - } -@@ -740,6 +771,9 @@ static void exit_notify(struct task_stru - && !capable(CAP_KILL)) - tsk->exit_signal = SIGCHLD; - -+ if (tsk->exit_signal != -1 && t == child_reaper) -+ /* We dont want people slaying init. */ -+ tsk->exit_signal = SIGCHLD; - - /* If something other than our normal parent is ptracing us, then - * send it a SIGCHLD instead of honoring exit_signal. 
exit_signal -@@ -752,11 +786,11 @@ static void exit_notify(struct task_stru - do_notify_parent(tsk, SIGCHLD); - } - -- state = TASK_ZOMBIE; -+ state = EXIT_ZOMBIE; - if (tsk->exit_signal == -1 && tsk->ptrace == 0) -- state = TASK_DEAD; -- tsk->state = state; -- tsk->flags |= PF_DEAD; -+ state = EXIT_DEAD; -+ tsk->exit_state = state; -+ nr_zombie++; - - /* - * Clear these here so that update_process_times() won't try to deliver -@@ -766,20 +800,7 @@ static void exit_notify(struct task_stru - tsk->it_prof_value = 0; - tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; - -- /* -- * In the preemption case it must be impossible for the task -- * to get runnable again, so use "_raw_" unlock to keep -- * preempt_count elevated until we schedule(). -- * -- * To avoid deadlock on SMP, interrupts must be unmasked. If we -- * don't, subsequently called functions (e.g, wait_task_inactive() -- * via release_task()) will spin, with interrupt flags -- * unwittingly blocked, until the other task sleeps. That task -- * may itself be waiting for smp_call_function() to answer and -- * complete, and with interrupts blocked that will never happen. -- */ -- _raw_write_unlock(&tasklist_lock); -- local_irq_enable(); -+ write_unlock_irq(&tasklist_lock); - - list_for_each_safe(_p, _n, &ptrace_dead) { - list_del_init(_p); -@@ -788,21 +809,110 @@ static void exit_notify(struct task_stru - } - - /* If the process is dead, release it - nobody will wait for it */ -- if (state == TASK_DEAD) -+ if (state == EXIT_DEAD) - release_task(tsk); - -+ /* PF_DEAD causes final put_task_struct after we schedule. */ -+ preempt_disable(); -+ tsk->flags |= PF_DEAD; - } - -+asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru); -+ -+#ifdef CONFIG_VE -+/* -+ * Handle exitting of init process, it's a special case for VE. 
-+ */ -+static void do_initproc_exit(void) -+{ -+ struct task_struct *tsk; -+ struct ve_struct *env; -+ struct siginfo info; -+ struct task_struct *g, *p; -+ long delay = 1L; -+ -+ tsk = current; -+ env = VE_TASK_INFO(current)->owner_env; -+ if (env->init_entry != tsk) -+ return; -+ -+ if (ve_is_super(env) && tsk->pid == 1) -+ panic("Attempted to kill init!"); -+ -+ memset(&info, 0, sizeof(info)); -+ info.si_errno = 0; -+ info.si_code = SI_KERNEL; -+ info.si_pid = virt_pid(tsk); -+ info.si_uid = current->uid; -+ info.si_signo = SIGKILL; -+ -+ /* -+ * Here the VE changes its state into "not running". -+ * op_sem taken for write is a barrier to all VE manipulations from -+ * ioctl: it waits for operations currently in progress and blocks all -+ * subsequent operations until is_running is set to 0 and op_sem is -+ * released. -+ */ -+ down_write(&env->op_sem); -+ env->is_running = 0; -+ up_write(&env->op_sem); -+ -+ /* send kill to all processes of VE */ -+ read_lock(&tasklist_lock); -+ do_each_thread_ve(g, p) { -+ force_sig_info(SIGKILL, &info, p); -+ } while_each_thread_ve(g, p); -+ read_unlock(&tasklist_lock); -+ -+ /* wait for all init childs exit */ -+ while (atomic_read(&env->pcounter) > 1) { -+ if (sys_wait4(-1, NULL, __WALL | WNOHANG, NULL) > 0) -+ continue; -+ /* it was ENOCHLD or no more children somehow */ -+ if (atomic_read(&env->pcounter) == 1) -+ break; -+ -+ /* clear all signals to avoid wakeups */ -+ if (signal_pending(tsk)) -+ flush_signals(tsk); -+ /* we have child without signal sent */ -+ __set_current_state(TASK_INTERRUPTIBLE); -+ schedule_timeout(delay); -+ delay = (delay < HZ) ? 
(delay << 1) : HZ; -+ read_lock(&tasklist_lock); -+ do_each_thread_ve(g, p) { -+ if (p != tsk) -+ force_sig_info(SIGKILL, &info, p); -+ } while_each_thread_ve(g, p); -+ read_unlock(&tasklist_lock); -+ } -+ env->init_entry = child_reaper; -+ write_lock_irq(&tasklist_lock); -+ REMOVE_LINKS(tsk); -+ tsk->parent = tsk->real_parent = child_reaper; -+ SET_LINKS(tsk); -+ write_unlock_irq(&tasklist_lock); -+} -+#endif -+ - asmlinkage NORET_TYPE void do_exit(long code) - { - struct task_struct *tsk = current; -+ struct mm_struct *mm; - -+ mm = tsk->mm; - if (unlikely(in_interrupt())) - panic("Aiee, killing interrupt handler!"); - if (unlikely(!tsk->pid)) - panic("Attempted to kill the idle task!"); -+#ifndef CONFIG_VE - if (unlikely(tsk->pid == 1)) - panic("Attempted to kill init!"); -+#else -+ do_initproc_exit(); -+#endif -+ virtinfo_gencall(VIRTINFO_DOEXIT, NULL); -+ - if (tsk->io_context) - exit_io_context(); - tsk->flags |= PF_EXITING; -@@ -817,7 +927,9 @@ asmlinkage NORET_TYPE void do_exit(long - - if (unlikely(current->ptrace & PT_TRACE_EXIT)) { - current->ptrace_message = code; -+ set_pn_state(current, PN_STOP_EXIT); - ptrace_notify((PTRACE_EVENT_EXIT << 8) | SIGTRAP); -+ clear_pn_state(current); - } - - acct_process(code); -@@ -838,10 +950,25 @@ asmlinkage NORET_TYPE void do_exit(long - - tsk->exit_code = code; - exit_notify(tsk); -+ -+ /* In order to allow OOM to happen from now on */ -+ spin_lock(&oom_generation_lock); -+ if (tsk->flags & PF_MEMDIE) { -+ if (!oom_kill_counter || !--oom_kill_counter) -+ oom_generation++; -+ printk("OOM killed process %s (pid=%d, ve=%d) (mm=%p) exited, free=%u.\n", -+ tsk->comm, tsk->pid, -+ VEID(VE_TASK_INFO(current)->owner_env), -+ mm, nr_free_pages()); -+ } -+ spin_unlock(&oom_generation_lock); -+ - #ifdef CONFIG_NUMA - mpol_free(tsk->mempolicy); - tsk->mempolicy = NULL; - #endif -+ -+ BUG_ON(!(current->flags & PF_DEAD)); - schedule(); - BUG(); - /* Avoid "noreturn function does return". 
*/ -@@ -860,26 +987,22 @@ EXPORT_SYMBOL(complete_and_exit); - - asmlinkage long sys_exit(int error_code) - { -+ virtinfo_notifier_call(VITYPE_FAUDIT, -+ VIRTINFO_FAUDIT_EXIT, &error_code); - do_exit((error_code&0xff)<<8); - } - - task_t fastcall *next_thread(const task_t *p) - { -- const struct pid_link *link = p->pids + PIDTYPE_TGID; -- const struct list_head *tmp, *head = &link->pidptr->task_list; -- -+ task_t *tsk; - #ifdef CONFIG_SMP -- if (!p->sighand) -- BUG(); -- if (!spin_is_locked(&p->sighand->siglock) && -- !rwlock_is_locked(&tasklist_lock)) -+ if (!rwlock_is_locked(&tasklist_lock) || p->pids[PIDTYPE_TGID].nr == 0) - BUG(); - #endif -- tmp = link->pid_chain.next; -- if (tmp == head) -- tmp = head->next; -- -- return pid_task(tmp, PIDTYPE_TGID); -+ tsk = pid_task(p->pids[PIDTYPE_TGID].pid_list.next, PIDTYPE_TGID); -+ /* all threads should belong to ONE ve! */ -+ BUG_ON(VE_TASK_INFO(tsk)->owner_env != VE_TASK_INFO(p)->owner_env); -+ return tsk; - } - - EXPORT_SYMBOL(next_thread); -@@ -929,21 +1052,26 @@ asmlinkage void sys_exit_group(int error - static int eligible_child(pid_t pid, int options, task_t *p) - { - if (pid > 0) { -- if (p->pid != pid) -+ if ((is_virtual_pid(pid) ? virt_pid(p) : p->pid) != pid) - return 0; - } else if (!pid) { - if (process_group(p) != process_group(current)) - return 0; - } else if (pid != -1) { -- if (process_group(p) != -pid) -- return 0; -+ if (__is_virtual_pid(-pid)) { -+ if (virt_pgid(p) != -pid) -+ return 0; -+ } else { -+ if (process_group(p) != -pid) -+ return 0; -+ } - } - - /* - * Do not consider detached threads that are - * not ptraced: - */ -- if (p->exit_signal == -1 && !p->ptrace) -+ if (unlikely(p->exit_signal == -1 && p->ptrace == 0)) - return 0; - - /* Wait for all children (clone and not) if __WALL is set; -@@ -968,7 +1096,7 @@ static int eligible_child(pid_t pid, int - } - - /* -- * Handle sys_wait4 work for one task in state TASK_ZOMBIE. We hold -+ * Handle sys_wait4 work for one task in state EXIT_ZOMBIE. 
We hold - * read_lock(&tasklist_lock) on entry. If we return zero, we still hold - * the lock and this task is uninteresting. If we return nonzero, we have - * released the lock and the system call should return. -@@ -982,9 +1110,9 @@ static int wait_task_zombie(task_t *p, u - * Try to move the task's state to DEAD - * only one thread is allowed to do this: - */ -- state = xchg(&p->state, TASK_DEAD); -- if (state != TASK_ZOMBIE) { -- BUG_ON(state != TASK_DEAD); -+ state = xchg(&p->exit_state, EXIT_DEAD); -+ if (state != EXIT_ZOMBIE) { -+ BUG_ON(state != EXIT_DEAD); - return 0; - } - if (unlikely(p->exit_signal == -1 && p->ptrace == 0)) -@@ -996,7 +1124,7 @@ static int wait_task_zombie(task_t *p, u - - /* - * Now we are sure this task is interesting, and no other -- * thread can reap it because we set its state to TASK_DEAD. -+ * thread can reap it because we set its state to EXIT_DEAD. - */ - read_unlock(&tasklist_lock); - -@@ -1008,16 +1136,18 @@ static int wait_task_zombie(task_t *p, u - retval = put_user(p->exit_code, stat_addr); - } - if (retval) { -- p->state = TASK_ZOMBIE; -+ // TODO: is this safe? -+ p->exit_state = EXIT_ZOMBIE; - return retval; - } -- retval = p->pid; -+ retval = get_task_pid(p); - if (p->real_parent != p->parent) { - write_lock_irq(&tasklist_lock); - /* Double-check with lock held. */ - if (p->real_parent != p->parent) { - __ptrace_unlink(p); -- p->state = TASK_ZOMBIE; -+ // TODO: is this safe? -+ p->exit_state = EXIT_ZOMBIE; - /* - * If this is not a detached task, notify the parent. If it's - * still not detached after that, don't release it now. -@@ -1072,13 +1202,13 @@ static int wait_task_stopped(task_t *p, - /* - * This uses xchg to be atomic with the thread resuming and setting - * it. It must also be done with the write lock held to prevent a -- * race with the TASK_ZOMBIE case. -+ * race with the EXIT_ZOMBIE case. 
- */ - exit_code = xchg(&p->exit_code, 0); - if (unlikely(p->state > TASK_STOPPED)) { - /* - * The task resumed and then died. Let the next iteration -- * catch it in TASK_ZOMBIE. Note that exit_code might -+ * catch it in EXIT_ZOMBIE. Note that exit_code might - * already be zero here if it resumed and did _exit(0). - * The task itself is dead and won't touch exit_code again; - * other processors in this function are locked out. -@@ -1107,7 +1237,7 @@ static int wait_task_stopped(task_t *p, - if (!retval && stat_addr) - retval = put_user((exit_code << 8) | 0x7f, stat_addr); - if (!retval) -- retval = p->pid; -+ retval = get_task_pid(p); - put_task_struct(p); - - BUG_ON(!retval); -@@ -1152,16 +1282,25 @@ repeat: - if (retval != 0) /* He released the lock. */ - goto end_wait4; - break; -- case TASK_ZOMBIE: -- /* -- * Eligible but we cannot release it yet: -- */ -- if (ret == 2) -- continue; -- retval = wait_task_zombie(p, stat_addr, ru); -- if (retval != 0) /* He released the lock. */ -- goto end_wait4; -- break; -+ default: -+ // case EXIT_DEAD: -+ if (p->exit_state == EXIT_DEAD) -+ continue; -+ // case EXIT_ZOMBIE: -+ if (p->exit_state == EXIT_ZOMBIE) { -+ /* -+ * Eligible but we cannot release -+ * it yet: -+ */ -+ if (ret == 2) -+ continue; -+ retval = wait_task_zombie( -+ p, stat_addr, ru); -+ /* He released the lock. 
*/ -+ if (retval != 0) -+ goto end_wait4; -+ break; -+ } - } - } - if (!flag) { -diff -uprN linux-2.6.8.1.orig/kernel/extable.c linux-2.6.8.1-ve022stab078/kernel/extable.c ---- linux-2.6.8.1.orig/kernel/extable.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/extable.c 2006-05-11 13:05:40.000000000 +0400 -@@ -49,6 +49,7 @@ static int core_kernel_text(unsigned lon - if (addr >= (unsigned long)_sinittext && - addr <= (unsigned long)_einittext) - return 1; -+ - return 0; - } - -diff -uprN linux-2.6.8.1.orig/kernel/fairsched.c linux-2.6.8.1-ve022stab078/kernel/fairsched.c ---- linux-2.6.8.1.orig/kernel/fairsched.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/fairsched.c 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,1286 @@ -+/* -+ * Fair Scheduler -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * Start-tag scheduling follows the theory presented in -+ * http://www.cs.utexas.edu/users/dmcl/papers/ps/SIGCOMM96.ps -+ */ -+ -+#include <linux/config.h> -+#include <linux/kernel.h> -+#include <asm/timex.h> -+#include <asm/atomic.h> -+#include <linux/spinlock.h> -+#include <asm/semaphore.h> -+#include <linux/init.h> -+#include <linux/slab.h> -+#include <ub/ub_mem.h> -+#include <linux/proc_fs.h> -+#include <linux/seq_file.h> -+#include <linux/fs.h> -+#include <linux/dcache.h> -+#include <linux/sysctl.h> -+#include <linux/module.h> -+#include <linux/sched.h> -+#include <linux/fairsched.h> -+#include <linux/vsched.h> -+ -+/* we need it for vsched routines in sched.c */ -+spinlock_t fairsched_lock = SPIN_LOCK_UNLOCKED; -+ -+#ifdef CONFIG_FAIRSCHED -+ -+#define FAIRSHED_DEBUG " debug" -+ -+ -+/*********************************************************************/ -+/* -+ * Special arithmetics -+ */ -+/*********************************************************************/ -+ -+#define CYCLES_SHIFT (8) -+#define SCYCLES_TIME(time) \ -+ 
((scycles_t) {((time) + (1 << CYCLES_SHIFT) - 1) >> CYCLES_SHIFT}) -+ -+#define CYCLES_ZERO (0) -+static inline int CYCLES_BEFORE(cycles_t x, cycles_t y) -+{ -+ return (__s64)(x-y) < 0; -+} -+static inline int CYCLES_AFTER(cycles_t x, cycles_t y) -+{ -+ return (__s64)(y-x) < 0; -+} -+static inline void CYCLES_DADD(cycles_t *x, fschdur_t y) {*x+=y.d;} -+ -+#define FSCHDUR_ZERO (0) -+#define TICK_DUR ((fschdur_t){cycles_per_jiffy}) -+static inline fschdur_t FSCHDURATION(cycles_t x, cycles_t y) -+{ -+ return (fschdur_t){x - y}; -+} -+static inline int FSCHDUR_CMP(fschdur_t x, fschdur_t y) -+{ -+ if (x.d < y.d) return -1; -+ if (x.d > y.d) return 1; -+ return 0; -+} -+static inline fschdur_t FSCHDUR_SUB(fschdur_t x, fschdur_t y) -+{ -+ return (fschdur_t){x.d - y.d}; -+} -+ -+#define FSCHTAG_ZERO ((fschtag_t){0}) -+static inline int FSCHTAG_CMP(fschtag_t x, fschtag_t y) -+{ -+ if (x.t < y.t) return -1; -+ if (x.t > y.t) return 1; -+ return 0; -+} -+static inline fschtag_t FSCHTAG_MAX(fschtag_t x, fschtag_t y) -+{ -+ return x.t >= y.t ? 
x : y; -+} -+static inline int FSCHTAG_DADD(fschtag_t *tag, fschdur_t dur, unsigned w) -+{ -+ cycles_t new_tag; -+ new_tag = tag->t + (cycles_t)dur.d * w; -+ if (new_tag < tag->t) -+ return -1; -+ /* DEBUG */ -+ if (new_tag >= (1ULL << 48)) -+ return -1; -+ tag->t = new_tag; -+ return 0; -+} -+static inline int FSCHTAG_ADD(fschtag_t *tag, fschtag_t y) -+{ -+ cycles_t new_tag; -+ new_tag = tag->t + y.t; -+ if (new_tag < tag->t) -+ return -1; -+ tag->t = new_tag; -+ return 0; -+} -+static inline fschtag_t FSCHTAG_SUB(fschtag_t x, fschtag_t y) -+{ -+ return (fschtag_t){x.t - y.t}; -+} -+ -+#define FSCHVALUE_ZERO ((fschvalue_t){0}) -+#define TICK_VALUE ((fschvalue_t){(cycles_t)cycles_per_jiffy << FSCHRATE_SHIFT}) -+static inline fschvalue_t FSCHVALUE(unsigned long t) -+{ -+ return (fschvalue_t){(cycles_t)t << FSCHRATE_SHIFT}; -+} -+static inline int FSCHVALUE_CMP(fschvalue_t x, fschvalue_t y) -+{ -+ if (x.v < y.v) return -1; -+ if (x.v > y.v) return 1; -+ return 0; -+} -+static inline void FSCHVALUE_DADD(fschvalue_t *val, fschdur_t dur, -+ unsigned rate) -+{ -+ val->v += (cycles_t)dur.d * rate; -+} -+static inline fschvalue_t FSCHVALUE_SUB(fschvalue_t x, fschvalue_t y) -+{ -+ return (fschvalue_t){x.v - y.v}; -+} -+static inline cycles_t FSCHVALUE_TO_DELAY(fschvalue_t val, unsigned rate) -+{ -+ unsigned long t; -+ /* -+ * Here we lose precision to make the division 32-bit on IA-32. -+ * The value is not greater than TICK_VALUE. -+ * (TICK_VALUE >> FSCHRATE_SHIFT) fits unsigned long. 
-+ */ -+ t = (val.v + (1 << FSCHRATE_SHIFT) - 1) >> FSCHRATE_SHIFT; -+ return (cycles_t)((t + rate - 1) / rate) << FSCHRATE_SHIFT; -+} -+ -+ -+/*********************************************************************/ -+/* -+ * Global data -+ */ -+/*********************************************************************/ -+ -+#define fsch_assert(x) \ -+ do { \ -+ static int count; \ -+ if (!(x) && count++ < 10) \ -+ printk("fsch_assert " #x " failed\n"); \ -+ } while (0) -+ -+/* -+ * Configurable parameters -+ */ -+unsigned fairsched_max_latency = 25; /* jiffies */ -+ -+/* -+ * Parameters initialized at startup -+ */ -+/* Number of online CPUs */ -+unsigned fairsched_nr_cpus; -+/* Token Bucket depth (burst size) */ -+static fschvalue_t max_value; -+ -+struct fairsched_node fairsched_init_node = { -+ .id = INT_MAX, -+#ifdef CONFIG_VE -+ .owner_env = get_ve0(), -+#endif -+ .weight = 1, -+}; -+EXPORT_SYMBOL(fairsched_init_node); -+ -+struct fairsched_node fairsched_idle_node = { -+ .id = -1, -+}; -+ -+static int fairsched_nr_nodes; -+static LIST_HEAD(fairsched_node_head); -+static LIST_HEAD(fairsched_running_head); -+static LIST_HEAD(fairsched_delayed_head); -+ -+DEFINE_PER_CPU(cycles_t, prev_schedule); -+static fschtag_t max_latency; -+ -+static DECLARE_MUTEX(fairsched_mutex); -+ -+/*********************************************************************/ -+/* -+ * Small helper routines -+ */ -+/*********************************************************************/ -+ -+/* this didn't proved to be very valuable statistics... */ -+#define fairsched_inc_ve_strv(node, cycles) do {} while(0) -+#define fairsched_dec_ve_strv(node, cycles) do {} while(0) -+ -+/*********************************************************************/ -+/* -+ * Runlist management -+ */ -+/*********************************************************************/ -+ -+/* -+ * Returns the start_tag of the first runnable node, or 0. 
-+ */ -+static inline fschtag_t virtual_time(void) -+{ -+ struct fairsched_node *p; -+ -+ if (!list_empty(&fairsched_running_head)) { -+ p = list_first_entry(&fairsched_running_head, -+ struct fairsched_node, runlist); -+ return p->start_tag; -+ } -+ return FSCHTAG_ZERO; -+} -+ -+static void fairsched_recompute_max_latency(void) -+{ -+ struct fairsched_node *p; -+ unsigned w; -+ fschtag_t tag; -+ -+ w = FSCHWEIGHT_MAX; -+ list_for_each_entry(p, &fairsched_node_head, nodelist) { -+ if (p->weight < w) -+ w = p->weight; -+ } -+ tag = FSCHTAG_ZERO; -+ (void) FSCHTAG_DADD(&tag, TICK_DUR, -+ fairsched_nr_cpus * fairsched_max_latency * w); -+ max_latency = tag; -+} -+ -+static void fairsched_reset_start_tags(void) -+{ -+ struct fairsched_node *cnode; -+ fschtag_t min_tag; -+ -+ min_tag = virtual_time(); -+ list_for_each_entry(cnode, &fairsched_node_head, nodelist) { -+ if (FSCHTAG_CMP(cnode->start_tag, min_tag) > 0) -+ cnode->start_tag = FSCHTAG_SUB(cnode->start_tag, -+ min_tag); -+ else -+ cnode->start_tag = FSCHTAG_ZERO; -+ } -+} -+ -+static void fairsched_running_insert(struct fairsched_node *node) -+{ -+ struct list_head *tmp; -+ struct fairsched_node *p; -+ fschtag_t start_tag_max; -+ -+ if (!list_empty(&fairsched_running_head)) { -+ start_tag_max = virtual_time(); -+ if (!FSCHTAG_ADD(&start_tag_max, max_latency) && -+ FSCHTAG_CMP(start_tag_max, node->start_tag) < 0) -+ node->start_tag = start_tag_max; -+ } -+ -+ list_for_each(tmp, &fairsched_running_head) { -+ p = list_entry(tmp, struct fairsched_node, runlist); -+ if (FSCHTAG_CMP(node->start_tag, p->start_tag) <= 0) -+ break; -+ } -+ /* insert node just before tmp */ -+ list_add_tail(&node->runlist, tmp); -+} -+ -+static inline void fairsched_running_insert_fromsleep( -+ struct fairsched_node *node) -+{ -+ node->start_tag = FSCHTAG_MAX(node->start_tag, virtual_time()); -+ fairsched_running_insert(node); -+} -+ -+ -+/*********************************************************************/ -+/* -+ * CPU limiting helper 
functions -+ * -+ * These functions compute rates, delays and manipulate with sleep -+ * lists and so on. -+ */ -+/*********************************************************************/ -+ -+/* -+ * Insert a node into the list of nodes removed from scheduling, -+ * sorted by the time at which the the node is allowed to run, -+ * historically called `delay'. -+ */ -+static void fairsched_delayed_insert(struct fairsched_node *node) -+{ -+ struct fairsched_node *p; -+ struct list_head *tmp; -+ -+ list_for_each(tmp, &fairsched_delayed_head) { -+ p = list_entry(tmp, struct fairsched_node, -+ runlist); -+ if (CYCLES_AFTER(p->delay, node->delay)) -+ break; -+ } -+ /* insert node just before tmp */ -+ list_add_tail(&node->runlist, tmp); -+} -+ -+static inline void nodevalue_add(struct fairsched_node *node, -+ fschdur_t duration, unsigned rate) -+{ -+ FSCHVALUE_DADD(&node->value, duration, rate); -+ if (FSCHVALUE_CMP(node->value, max_value) > 0) -+ node->value = max_value; -+} -+ -+/* -+ * The node has been selected to run. -+ * This function accounts in advance for the time that the node will run. -+ * The advance not used by the node will be credited back. -+ */ -+static void fairsched_ratelimit_charge_advance( -+ struct fairsched_node *node, -+ cycles_t time) -+{ -+ fsch_assert(!node->delayed); -+ fsch_assert(FSCHVALUE_CMP(node->value, TICK_VALUE) >= 0); -+ -+ /* -+ * Account for the time passed since last update. -+ * It might be needed if the node has become runnable because of -+ * a wakeup, but hasn't gone through other functions updating -+ * the bucket value. 
-+ */ -+ if (CYCLES_AFTER(time, node->last_updated_at)) { -+ nodevalue_add(node, FSCHDURATION(time, node->last_updated_at), -+ node->rate); -+ node->last_updated_at = time; -+ } -+ -+ /* charge for the full tick the node might be running */ -+ node->value = FSCHVALUE_SUB(node->value, TICK_VALUE); -+ if (FSCHVALUE_CMP(node->value, TICK_VALUE) < 0) { -+ list_del(&node->runlist); -+ node->delayed = 1; -+ node->delay = node->last_updated_at + FSCHVALUE_TO_DELAY( -+ FSCHVALUE_SUB(TICK_VALUE, node->value), -+ node->rate); -+ node->nr_ready = 0; -+ fairsched_delayed_insert(node); -+ } -+} -+ -+static void fairsched_ratelimit_credit_unused( -+ struct fairsched_node *node, -+ cycles_t time, fschdur_t duration) -+{ -+ /* account for the time passed since last update */ -+ if (CYCLES_AFTER(time, node->last_updated_at)) { -+ nodevalue_add(node, FSCHDURATION(time, node->last_updated_at), -+ node->rate); -+ node->last_updated_at = time; -+ } -+ -+ /* -+ * When the node was given this CPU, it was charged for 1 tick. -+ * Credit back the unused time. -+ */ -+ if (FSCHDUR_CMP(duration, TICK_DUR) < 0) -+ nodevalue_add(node, FSCHDUR_SUB(TICK_DUR, duration), -+ 1 << FSCHRATE_SHIFT); -+ -+ /* check if the node is allowed to run */ -+ if (FSCHVALUE_CMP(node->value, TICK_VALUE) < 0) { -+ /* -+ * The node was delayed and remain such. -+ * But since the bucket value has been updated, -+ * update the delay time and move the node in the list. -+ */ -+ fsch_assert(node->delayed); -+ node->delay = node->last_updated_at + FSCHVALUE_TO_DELAY( -+ FSCHVALUE_SUB(TICK_VALUE, node->value), -+ node->rate); -+ } else if (node->delayed) { -+ /* -+ * The node was delayed, but now it is allowed to run. -+ * We do not manipulate with lists, it will be done by the -+ * caller. 
-+ */ -+ node->nr_ready = node->nr_runnable; -+ node->delayed = 0; -+ } -+} -+ -+static void fairsched_delayed_wake(cycles_t time) -+{ -+ struct fairsched_node *p; -+ -+ while (!list_empty(&fairsched_delayed_head)) { -+ p = list_entry(fairsched_delayed_head.next, -+ struct fairsched_node, -+ runlist); -+ if (CYCLES_AFTER(p->delay, time)) -+ break; -+ -+ /* ok, the delay period is completed */ -+ /* account for the time passed since last update */ -+ if (CYCLES_AFTER(time, p->last_updated_at)) { -+ nodevalue_add(p, FSCHDURATION(time, p->last_updated_at), -+ p->rate); -+ p->last_updated_at = time; -+ } -+ -+ fsch_assert(FSCHVALUE_CMP(p->value, TICK_VALUE) >= 0); -+ p->nr_ready = p->nr_runnable; -+ p->delayed = 0; -+ list_del_init(&p->runlist); -+ if (p->nr_ready) -+ fairsched_running_insert_fromsleep(p); -+ } -+} -+ -+static struct fairsched_node *fairsched_find(unsigned int id); -+ -+void fairsched_cpu_online_map(int id, cpumask_t *mask) -+{ -+ struct fairsched_node *node; -+ -+ down(&fairsched_mutex); -+ node = fairsched_find(id); -+ if (node == NULL) -+ *mask = CPU_MASK_NONE; -+ else -+ vsched_cpu_online_map(node->vsched, mask); -+ up(&fairsched_mutex); -+} -+ -+ -+/*********************************************************************/ -+/* -+ * The heart of the algorithm: -+ * fairsched_incrun, fairsched_decrun, fairsched_schedule -+ * -+ * Note: old property nr_ready >= nr_pcpu doesn't hold anymore. -+ * However, nr_runnable, nr_ready and delayed are maintained in sync. -+ */ -+/*********************************************************************/ -+ -+/* -+ * Called on a wakeup inside the node. -+ */ -+void fairsched_incrun(struct fairsched_node *node) -+{ -+ if (!node->delayed && !node->nr_ready++) -+ /* the node wasn't on the running list, insert */ -+ fairsched_running_insert_fromsleep(node); -+ node->nr_runnable++; -+} -+ -+/* -+ * Called from inside schedule() when a sleeping state is entered. 
-+ */ -+void fairsched_decrun(struct fairsched_node *node) -+{ -+ if (!node->delayed && !--node->nr_ready) -+ /* nr_ready changed 1->0, remove from the running list */ -+ list_del_init(&node->runlist); -+ --node->nr_runnable; -+} -+ -+void fairsched_inccpu(struct fairsched_node *node) -+{ -+ node->nr_pcpu++; -+ fairsched_dec_ve_strv(node, cycles); -+} -+ -+static inline void __fairsched_deccpu(struct fairsched_node *node) -+{ -+ node->nr_pcpu--; -+ fairsched_inc_ve_strv(node, cycles); -+} -+ -+void fairsched_deccpu(struct fairsched_node *node) -+{ -+ if (node == &fairsched_idle_node) -+ return; -+ -+ __fairsched_deccpu(node); -+} -+ -+static void fairsched_account(struct fairsched_node *node, -+ cycles_t time) -+{ -+ fschdur_t duration; -+ -+ duration = FSCHDURATION(time, __get_cpu_var(prev_schedule)); -+#ifdef CONFIG_VE -+ CYCLES_DADD(&node->owner_env->cpu_used_ve, duration); -+#endif -+ -+ /* -+ * The duration is not greater than TICK_DUR since -+ * task->need_resched is always 1. -+ */ -+ if (FSCHTAG_DADD(&node->start_tag, duration, node->weight)) { -+ fairsched_reset_start_tags(); -+ (void) FSCHTAG_DADD(&node->start_tag, duration, -+ node->weight); -+ } -+ -+ list_del_init(&node->runlist); -+ if (node->rate_limited) -+ fairsched_ratelimit_credit_unused(node, time, duration); -+ if (!node->delayed) { -+ if (node->nr_ready) -+ fairsched_running_insert(node); -+ } else -+ fairsched_delayed_insert(node); -+} -+ -+/* -+ * Scheduling decision -+ * -+ * Updates CPU usage for the node releasing the CPU and selects a new node. 
-+ */ -+struct fairsched_node *fairsched_schedule( -+ struct fairsched_node *prev_node, -+ struct fairsched_node *cur_node, -+ int cur_node_active, -+ cycles_t time) -+{ -+ struct fairsched_node *p; -+ -+ if (prev_node != &fairsched_idle_node) -+ fairsched_account(prev_node, time); -+ __get_cpu_var(prev_schedule) = time; -+ -+ fairsched_delayed_wake(time); -+ -+ list_for_each_entry(p, &fairsched_running_head, runlist) { -+ if (p->nr_pcpu < p->nr_ready || -+ (cur_node_active && p == cur_node)) { -+ if (p->rate_limited) -+ fairsched_ratelimit_charge_advance(p, time); -+ return p; -+ } -+ } -+ return NULL; -+} -+ -+ -+/*********************************************************************/ -+/* -+ * System calls -+ * -+ * All do_xxx functions are called under fairsched semaphore and after -+ * capability check. -+ * -+ * The binary interfaces follow some other Fair Scheduler implementations -+ * (although some system call arguments are not needed for our implementation). -+ */ -+/*********************************************************************/ -+ -+static struct fairsched_node *fairsched_find(unsigned int id) -+{ -+ struct fairsched_node *p; -+ -+ list_for_each_entry(p, &fairsched_node_head, nodelist) { -+ if (p->id == id) -+ return p; -+ } -+ return NULL; -+} -+ -+static int do_fairsched_mknod(unsigned int parent, unsigned int weight, -+ unsigned int newid) -+{ -+ struct fairsched_node *node; -+ int retval; -+ -+ retval = -EINVAL; -+ if (weight < 1 || weight > FSCHWEIGHT_MAX) -+ goto out; -+ if (newid < 0 || newid > INT_MAX) -+ goto out; -+ -+ retval = -EBUSY; -+ if (fairsched_find(newid) != NULL) -+ goto out; -+ -+ retval = -ENOMEM; -+ node = kmalloc(sizeof(*node), GFP_KERNEL); -+ if (node == NULL) -+ goto out; -+ -+ memset(node, 0, sizeof(*node)); -+ node->weight = weight; -+ INIT_LIST_HEAD(&node->runlist); -+ node->id = newid; -+#ifdef CONFIG_VE -+ node->owner_env = get_exec_env(); -+#endif -+ -+ spin_lock_irq(&fairsched_lock); -+ list_add(&node->nodelist, 
&fairsched_node_head); -+ fairsched_nr_nodes++; -+ fairsched_recompute_max_latency(); -+ spin_unlock_irq(&fairsched_lock); -+ -+ retval = newid; -+out: -+ return retval; -+} -+ -+asmlinkage int sys_fairsched_mknod(unsigned int parent, unsigned int weight, -+ unsigned int newid) -+{ -+ int retval; -+ -+ if (!capable(CAP_SETVEID)) -+ return -EPERM; -+ -+ down(&fairsched_mutex); -+ retval = do_fairsched_mknod(parent, weight, newid); -+ up(&fairsched_mutex); -+ -+ return retval; -+} -+EXPORT_SYMBOL(sys_fairsched_mknod); -+ -+static int do_fairsched_rmnod(unsigned int id) -+{ -+ struct fairsched_node *node; -+ int retval; -+ -+ retval = -EINVAL; -+ node = fairsched_find(id); -+ if (node == NULL) -+ goto out; -+ if (node == &fairsched_init_node) -+ goto out; -+ -+ retval = vsched_destroy(node->vsched); -+ if (retval) -+ goto out; -+ -+ spin_lock_irq(&fairsched_lock); -+ list_del(&node->runlist); /* required for delayed nodes */ -+ list_del(&node->nodelist); -+ fairsched_nr_nodes--; -+ fairsched_recompute_max_latency(); -+ spin_unlock_irq(&fairsched_lock); -+ -+ kfree(node); -+ retval = 0; -+out: -+ return retval; -+} -+ -+asmlinkage int sys_fairsched_rmnod(unsigned int id) -+{ -+ int retval; -+ -+ if (!capable(CAP_SETVEID)) -+ return -EPERM; -+ -+ down(&fairsched_mutex); -+ retval = do_fairsched_rmnod(id); -+ up(&fairsched_mutex); -+ -+ return retval; -+} -+EXPORT_SYMBOL(sys_fairsched_rmnod); -+ -+int do_fairsched_chwt(unsigned int id, unsigned weight) -+{ -+ struct fairsched_node *node; -+ -+ if (id == 0) -+ return -EINVAL; -+ if (weight < 1 || weight > FSCHWEIGHT_MAX) -+ return -EINVAL; -+ -+ node = fairsched_find(id); -+ if (node == NULL) -+ return -ENOENT; -+ -+ spin_lock_irq(&fairsched_lock); -+ node->weight = weight; -+ fairsched_recompute_max_latency(); -+ spin_unlock_irq(&fairsched_lock); -+ -+ return 0; -+} -+ -+asmlinkage int sys_fairsched_chwt(unsigned int id, unsigned weight) -+{ -+ int retval; -+ -+ if (!capable(CAP_SETVEID)) -+ return -EPERM; -+ -+ 
down(&fairsched_mutex); -+ retval = do_fairsched_chwt(id, weight); -+ up(&fairsched_mutex); -+ -+ return retval; -+} -+ -+int do_fairsched_rate(unsigned int id, int op, unsigned rate) -+{ -+ struct fairsched_node *node; -+ cycles_t time; -+ int retval; -+ -+ if (id == 0) -+ return -EINVAL; -+ if (op == 0 && (rate < 1 || rate >= (1UL << 31))) -+ return -EINVAL; -+ -+ node = fairsched_find(id); -+ if (node == NULL) -+ return -ENOENT; -+ -+ retval = -EINVAL; -+ spin_lock_irq(&fairsched_lock); -+ time = get_cycles(); -+ switch (op) { -+ case 0: -+ node->rate = rate; -+ if (node->rate > (fairsched_nr_cpus << FSCHRATE_SHIFT)) -+ node->rate = -+ fairsched_nr_cpus << FSCHRATE_SHIFT; -+ node->rate_limited = 1; -+ node->value = max_value; -+ if (node->delayed) { -+ list_del(&node->runlist); -+ node->delay = time; -+ fairsched_delayed_insert(node); -+ node->last_updated_at = time; -+ fairsched_delayed_wake(time); -+ } -+ retval = node->rate; -+ break; -+ case 1: -+ node->rate = 0; /* This assignment is not needed -+ for the kernel code, and it should -+ not rely on rate being 0 when it's -+ unset. This is a band-aid for some -+ existing tools (don't know which one -+ exactly). --SAW */ -+ node->rate_limited = 0; -+ node->value = max_value; -+ if (node->delayed) { -+ list_del(&node->runlist); -+ node->delay = time; -+ fairsched_delayed_insert(node); -+ node->last_updated_at = time; -+ fairsched_delayed_wake(time); -+ } -+ retval = 0; -+ break; -+ case 2: -+ if (node->rate_limited) -+ retval = node->rate; -+ else -+ retval = -ENODATA; -+ break; -+ } -+ spin_unlock_irq(&fairsched_lock); -+ -+ return retval; -+} -+ -+asmlinkage int sys_fairsched_rate(unsigned int id, int op, unsigned rate) -+{ -+ int retval; -+ -+ if (!capable(CAP_SETVEID)) -+ return -EPERM; -+ -+ down(&fairsched_mutex); -+ retval = do_fairsched_rate(id, op, rate); -+ up(&fairsched_mutex); -+ -+ return retval; -+} -+ -+/* -+ * Called under fairsched_mutex. 
-+ */ -+static int __do_fairsched_mvpr(struct task_struct *p, -+ struct fairsched_node *node) -+{ -+ int retval; -+ -+ if (node->vsched == NULL) { -+ retval = vsched_create(node->id, node); -+ if (retval < 0) -+ return retval; -+ } -+ -+ /* no need to destroy vsched in case of mvpr failure */ -+ return vsched_mvpr(p, node->vsched); -+} -+ -+int do_fairsched_mvpr(pid_t pid, unsigned int nodeid) -+{ -+ struct task_struct *p; -+ struct fairsched_node *node; -+ int retval; -+ -+ retval = -ENOENT; -+ node = fairsched_find(nodeid); -+ if (node == NULL) -+ goto out; -+ -+ read_lock(&tasklist_lock); -+ retval = -ESRCH; -+ p = find_task_by_pid_all(pid); -+ if (p == NULL) -+ goto out_unlock; -+ get_task_struct(p); -+ read_unlock(&tasklist_lock); -+ -+ retval = __do_fairsched_mvpr(p, node); -+ put_task_struct(p); -+ return retval; -+ -+out_unlock: -+ read_unlock(&tasklist_lock); -+out: -+ return retval; -+} -+ -+asmlinkage int sys_fairsched_mvpr(pid_t pid, unsigned int nodeid) -+{ -+ int retval; -+ -+ if (!capable(CAP_SETVEID)) -+ return -EPERM; -+ -+ down(&fairsched_mutex); -+ retval = do_fairsched_mvpr(pid, nodeid); -+ up(&fairsched_mutex); -+ -+ return retval; -+} -+EXPORT_SYMBOL(sys_fairsched_mvpr); -+ -+ -+/*********************************************************************/ -+/* -+ * proc interface -+ */ -+/*********************************************************************/ -+ -+struct fairsched_node_dump { -+#ifdef CONFIG_VE -+ envid_t veid; -+#endif -+ int id; -+ unsigned weight; -+ unsigned rate; -+ unsigned rate_limited : 1, -+ delayed : 1; -+ fschtag_t start_tag; -+ fschvalue_t value; -+ cycles_t delay; -+ int nr_ready; -+ int nr_runnable; -+ int nr_pcpu; -+ int nr_tasks, nr_runtasks; -+}; -+ -+struct fairsched_dump { -+ int len, compat; -+ struct fairsched_node_dump nodes[0]; -+}; -+ -+static struct fairsched_dump *fairsched_do_dump(int compat) -+{ -+ int nr_nodes; -+ int len, i; -+ struct fairsched_dump *dump; -+ struct fairsched_node *node; -+ struct 
fairsched_node_dump *p; -+ unsigned long flags; -+ -+start: -+ nr_nodes = (ve_is_super(get_exec_env()) ? fairsched_nr_nodes + 16 : 1); -+ len = sizeof(*dump) + nr_nodes * sizeof(dump->nodes[0]); -+ dump = ub_vmalloc(len); -+ if (dump == NULL) -+ goto out; -+ -+ spin_lock_irqsave(&fairsched_lock, flags); -+ if (ve_is_super(get_exec_env()) && nr_nodes < fairsched_nr_nodes) -+ goto repeat; -+ p = dump->nodes; -+ list_for_each_entry_reverse(node, &fairsched_node_head, nodelist) { -+ if ((char *)p - (char *)dump >= len) -+ break; -+ p->nr_tasks = 0; -+ p->nr_runtasks = 0; -+#ifdef CONFIG_VE -+ if (!ve_accessible(node->owner_env, get_exec_env())) -+ continue; -+ p->veid = node->owner_env->veid; -+ if (compat) { -+ p->nr_tasks = atomic_read(&node->owner_env->pcounter); -+ for (i = 0; i < NR_CPUS; i++) -+ p->nr_runtasks += -+ VE_CPU_STATS(node->owner_env, i) -+ ->nr_running; -+ if (p->nr_runtasks < 0) -+ p->nr_runtasks = 0; -+ } -+#endif -+ p->id = node->id; -+ p->weight = node->weight; -+ p->rate = node->rate; -+ p->rate_limited = node->rate_limited; -+ p->delayed = node->delayed; -+ p->start_tag = node->start_tag; -+ p->value = node->value; -+ p->delay = node->delay; -+ p->nr_ready = node->nr_ready; -+ p->nr_runnable = node->nr_runnable; -+ p->nr_pcpu = node->nr_pcpu; -+ p++; -+ } -+ dump->len = p - dump->nodes; -+ dump->compat = compat; -+ spin_unlock_irqrestore(&fairsched_lock, flags); -+ -+out: -+ return dump; -+ -+repeat: -+ spin_unlock_irqrestore(&fairsched_lock, flags); -+ vfree(dump); -+ goto start; -+} -+ -+#define FAIRSCHED_PROC_HEADLINES 2 -+ -+#if defined(CONFIG_VE) -+/* -+ * File format is dictated by compatibility reasons. 
-+ */ -+static int fairsched_seq_show(struct seq_file *m, void *v) -+{ -+ struct fairsched_dump *dump; -+ struct fairsched_node_dump *p; -+ unsigned vid, nid, pid, r; -+ -+ dump = m->private; -+ p = (struct fairsched_node_dump *)((unsigned long)v & ~3UL); -+ if (p - dump->nodes < FAIRSCHED_PROC_HEADLINES) { -+ if (p == dump->nodes) -+ seq_printf(m, "Version: 2.6 debug\n"); -+ else if (p == dump->nodes + 1) -+ seq_printf(m, -+ " veid " -+ " id " -+ " parent " -+ "weight " -+ " rate " -+ "tasks " -+ " run " -+ "cpus" -+ " " -+ "flg " -+ "ready " -+ " start_tag " -+ " value " -+ " delay" -+ "\n"); -+ } else { -+ p -= FAIRSCHED_PROC_HEADLINES; -+ vid = nid = pid = 0; -+ r = (unsigned long)v & 3; -+ if (p == dump->nodes) { -+ if (r == 2) -+ nid = p->id; -+ } else { -+ if (!r) -+ nid = p->id; -+ else if (r == 1) -+ vid = pid = p->id; -+ else -+ vid = p->id, nid = 1; -+ } -+ seq_printf(m, -+ "%10u " -+ "%10u %10u %6u %5u %5u %5u %4u" -+ " " -+ " %c%c %5u %20Lu %20Lu %20Lu" -+ "\n", -+ vid, -+ nid, -+ pid, -+ p->weight, -+ p->rate, -+ p->nr_tasks, -+ p->nr_runtasks, -+ p->nr_pcpu, -+ p->rate_limited ? 'L' : '.', -+ p->delayed ? 'D' : '.', -+ p->nr_ready, -+ p->start_tag.t, -+ p->value.v, -+ p->delay -+ ); -+ } -+ -+ return 0; -+} -+ -+static void *fairsched_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ struct fairsched_dump *dump; -+ unsigned long l; -+ -+ dump = m->private; -+ if (*pos >= dump->len * 3 - 1 + FAIRSCHED_PROC_HEADLINES) -+ return NULL; -+ if (*pos < FAIRSCHED_PROC_HEADLINES) -+ return dump->nodes + *pos; -+ /* guess why... 
*/ -+ l = (unsigned long)(dump->nodes + -+ ((unsigned long)*pos + FAIRSCHED_PROC_HEADLINES * 2 + 1) / 3); -+ l |= ((unsigned long)*pos + FAIRSCHED_PROC_HEADLINES * 2 + 1) % 3; -+ return (void *)l; -+} -+static void *fairsched_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ ++*pos; -+ return fairsched_seq_start(m, pos); -+} -+#endif -+ -+static int fairsched2_seq_show(struct seq_file *m, void *v) -+{ -+ struct fairsched_dump *dump; -+ struct fairsched_node_dump *p; -+ -+ dump = m->private; -+ p = v; -+ if (p - dump->nodes < FAIRSCHED_PROC_HEADLINES) { -+ if (p == dump->nodes) -+ seq_printf(m, "Version: 2.7" FAIRSHED_DEBUG "\n"); -+ else if (p == dump->nodes + 1) -+ seq_printf(m, -+ " id " -+ "weight " -+ " rate " -+ " run " -+ "cpus" -+#ifdef FAIRSHED_DEBUG -+ " " -+ "flg " -+ "ready " -+ " start_tag " -+ " value " -+ " delay" -+#endif -+ "\n"); -+ } else { -+ p -= FAIRSCHED_PROC_HEADLINES; -+ seq_printf(m, -+ "%10u %6u %5u %5u %4u" -+#ifdef FAIRSHED_DEBUG -+ " " -+ " %c%c %5u %20Lu %20Lu %20Lu" -+#endif -+ "\n", -+ p->id, -+ p->weight, -+ p->rate, -+ p->nr_runnable, -+ p->nr_pcpu -+#ifdef FAIRSHED_DEBUG -+ , -+ p->rate_limited ? 'L' : '.', -+ p->delayed ? 
'D' : '.', -+ p->nr_ready, -+ p->start_tag.t, -+ p->value.v, -+ p->delay -+#endif -+ ); -+ } -+ -+ return 0; -+} -+ -+static void *fairsched2_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ struct fairsched_dump *dump; -+ -+ dump = m->private; -+ if (*pos >= dump->len + FAIRSCHED_PROC_HEADLINES) -+ return NULL; -+ return dump->nodes + *pos; -+} -+static void *fairsched2_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ ++*pos; -+ return fairsched2_seq_start(m, pos); -+} -+static void fairsched2_seq_stop(struct seq_file *m, void *v) -+{ -+} -+ -+#ifdef CONFIG_VE -+static struct seq_operations fairsched_seq_op = { -+ .start = fairsched_seq_start, -+ .next = fairsched_seq_next, -+ .stop = fairsched2_seq_stop, -+ .show = fairsched_seq_show -+}; -+#endif -+static struct seq_operations fairsched2_seq_op = { -+ .start = fairsched2_seq_start, -+ .next = fairsched2_seq_next, -+ .stop = fairsched2_seq_stop, -+ .show = fairsched2_seq_show -+}; -+static int fairsched_seq_open(struct inode *inode, struct file *file) -+{ -+ int ret; -+ struct seq_file *m; -+ int compat; -+ -+#ifdef CONFIG_VE -+ compat = (file->f_dentry->d_name.len == sizeof("fairsched") - 1); -+ ret = seq_open(file, compat ? 
&fairsched_seq_op : &fairsched2_seq_op); -+#else -+ compat = 0; -+ ret = seq_open(file, fairsched2_seq_op); -+#endif -+ if (ret) -+ return ret; -+ m = file->private_data; -+ m->private = fairsched_do_dump(compat); -+ if (m->private == NULL) { -+ seq_release(inode, file); -+ ret = -ENOMEM; -+ } -+ return ret; -+} -+static int fairsched_seq_release(struct inode *inode, struct file *file) -+{ -+ struct seq_file *m; -+ struct fairsched_dump *dump; -+ -+ m = file->private_data; -+ dump = m->private; -+ m->private = NULL; -+ vfree(dump); -+ seq_release(inode, file); -+ return 0; -+} -+static struct file_operations proc_fairsched_operations = { -+ .open = fairsched_seq_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = fairsched_seq_release -+}; -+ -+ -+/*********************************************************************/ -+/* -+ * Fairsched initialization -+ */ -+/*********************************************************************/ -+ -+int fsch_sysctl_latency(ctl_table *ctl, int write, struct file *filp, -+ void *buffer, size_t *lenp, loff_t *ppos) -+{ -+ int *valp = ctl->data; -+ int val = *valp; -+ int ret; -+ -+ ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos); -+ -+ if (!write || *valp == val) -+ return ret; -+ -+ spin_lock_irq(&fairsched_lock); -+ fairsched_recompute_max_latency(); -+ spin_unlock_irq(&fairsched_lock); -+ return ret; -+} -+ -+static void fairsched_calibrate(void) -+{ -+ fairsched_nr_cpus = num_online_cpus(); -+ max_value = FSCHVALUE(cycles_per_jiffy * (fairsched_nr_cpus + 1)); -+} -+ -+void __init fairsched_init_early(void) -+{ -+ printk(KERN_INFO "Virtuozzo Fair CPU scheduler\n"); -+ list_add(&fairsched_init_node.nodelist, &fairsched_node_head); -+ fairsched_nr_nodes++; -+} -+ -+/* -+ * Note: this function is execute late in the initialization sequence. -+ * We ourselves need calibrated cycles and initialized procfs... 
-+ * The consequence of this late initialization is that start tags are -+ * efficiently ignored and each node preempts others on insertion. -+ * But it isn't a problem (only init node can be runnable). -+ */ -+void __init fairsched_init_late(void) -+{ -+ struct proc_dir_entry *entry; -+ -+ if (get_cycles() == 0) -+ panic("FAIRSCHED: no TSC!\n"); -+ fairsched_calibrate(); -+ fairsched_recompute_max_latency(); -+ -+ entry = create_proc_glob_entry("fairsched", S_IRUGO, NULL); -+ if (entry) -+ entry->proc_fops = &proc_fairsched_operations; -+ entry = create_proc_glob_entry("fairsched2", S_IRUGO, NULL); -+ if (entry) -+ entry->proc_fops = &proc_fairsched_operations; -+} -+ -+ -+#else /* CONFIG_FAIRSCHED */ -+ -+ -+/*********************************************************************/ -+/* -+ * No Fairsched -+ */ -+/*********************************************************************/ -+ -+asmlinkage int sys_fairsched_mknod(unsigned int parent, unsigned int weight, -+ unsigned int newid) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage int sys_fairsched_rmnod(unsigned int id) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage int sys_fairsched_chwt(unsigned int id, unsigned int weight) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage int sys_fairsched_mvpr(pid_t pid, unsigned int nodeid) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage int sys_fairsched_rate(unsigned int id, int op, unsigned rate) -+{ -+ return -ENOSYS; -+} -+ -+void __init fairsched_init_late(void) -+{ -+} -+ -+#endif /* CONFIG_FAIRSCHED */ -diff -uprN linux-2.6.8.1.orig/kernel/fork.c linux-2.6.8.1-ve022stab078/kernel/fork.c ---- linux-2.6.8.1.orig/kernel/fork.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/fork.c 2006-05-11 13:05:49.000000000 +0400 -@@ -20,12 +20,14 @@ - #include <linux/vmalloc.h> - #include <linux/completion.h> - #include <linux/namespace.h> -+#include <linux/file.h> - #include <linux/personality.h> - #include <linux/mempolicy.h> - #include <linux/sem.h> - #include 
<linux/file.h> - #include <linux/binfmts.h> - #include <linux/mman.h> -+#include <linux/virtinfo.h> - #include <linux/fs.h> - #include <linux/cpu.h> - #include <linux/security.h> -@@ -36,6 +38,7 @@ - #include <linux/mount.h> - #include <linux/audit.h> - #include <linux/rmap.h> -+#include <linux/fairsched.h> - - #include <asm/pgtable.h> - #include <asm/pgalloc.h> -@@ -44,10 +47,14 @@ - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_misc.h> -+#include <ub/ub_vmpages.h> -+ - /* The idle threads do not count.. - * Protected by write_lock_irq(&tasklist_lock) - */ - int nr_threads; -+EXPORT_SYMBOL(nr_threads); - - int max_threads; - unsigned long total_forks; /* Handle normal Linux uptimes. */ -@@ -77,13 +84,14 @@ static kmem_cache_t *task_struct_cachep; - - static void free_task(struct task_struct *tsk) - { -+ ub_task_uncharge(tsk); - free_thread_info(tsk->thread_info); - free_task_struct(tsk); - } - - void __put_task_struct(struct task_struct *tsk) - { -- WARN_ON(!(tsk->state & (TASK_DEAD | TASK_ZOMBIE))); -+ WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE))); - WARN_ON(atomic_read(&tsk->usage)); - WARN_ON(tsk == current); - -@@ -92,6 +100,13 @@ void __put_task_struct(struct task_struc - security_task_free(tsk); - free_uid(tsk->user); - put_group_info(tsk->group_info); -+ -+#ifdef CONFIG_VE -+ put_ve(VE_TASK_INFO(tsk)->owner_env); -+ write_lock_irq(&tasklist_lock); -+ nr_dead--; -+ write_unlock_irq(&tasklist_lock); -+#endif - free_task(tsk); - } - -@@ -219,7 +234,7 @@ void __init fork_init(unsigned long memp - /* create a slab on which task_structs can be allocated */ - task_struct_cachep = - kmem_cache_create("task_struct", sizeof(struct task_struct), -- ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL); -+ ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_UBC, NULL, NULL); - #endif - - /* -@@ -250,19 +265,30 @@ static struct task_struct *dup_task_stru - return NULL; - - ti = alloc_thread_info(tsk); -- if (!ti) { -- free_task_struct(tsk); -- return 
NULL; -- } -+ if (ti == NULL) -+ goto out_free_task; - - *ti = *orig->thread_info; - *tsk = *orig; - tsk->thread_info = ti; - ti->task = tsk; - -+ /* Our parent has been killed by OOM killer... Go away */ -+ if (tsk->flags & PF_MEMDIE) -+ goto out_free_thread; -+ -+ if (ub_task_charge(orig, tsk) < 0) -+ goto out_free_thread; -+ - /* One for us, one for whoever does the "release_task()" (usually parent) */ - atomic_set(&tsk->usage,2); - return tsk; -+ -+out_free_thread: -+ free_thread_info(ti); -+out_free_task: -+ free_task_struct(tsk); -+ return NULL; - } - - #ifdef CONFIG_MMU -@@ -308,9 +334,14 @@ static inline int dup_mmap(struct mm_str - if (mpnt->vm_flags & VM_ACCOUNT) { - unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; - if (security_vm_enough_memory(len)) -- goto fail_nomem; -+ goto fail_nocharge; - charge = len; - } -+ -+ if (ub_privvm_charge(mm_ub(mm), mpnt->vm_flags, mpnt->vm_file, -+ mpnt->vm_end - mpnt->vm_start)) -+ goto fail_nocharge; -+ - tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); - if (!tmp) - goto fail_nomem; -@@ -323,6 +354,7 @@ static inline int dup_mmap(struct mm_str - tmp->vm_flags &= ~VM_LOCKED; - tmp->vm_mm = mm; - tmp->vm_next = NULL; -+ tmp->vm_rss = 0; - anon_vma_link(tmp); - vma_prio_tree_init(tmp); - file = tmp->vm_file; -@@ -372,6 +404,9 @@ out: - fail_nomem_policy: - kmem_cache_free(vm_area_cachep, tmp); - fail_nomem: -+ ub_privvm_uncharge(mm_ub(mm), mpnt->vm_flags, mpnt->vm_file, -+ mpnt->vm_end - mpnt->vm_start); -+fail_nocharge: - retval = -ENOMEM; - vm_unacct_memory(charge); - goto out; -@@ -398,12 +433,15 @@ static inline void mm_free_pgd(struct mm - spinlock_t mmlist_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; - int mmlist_nr; - -+EXPORT_SYMBOL(mmlist_lock); -+ - #define allocate_mm() (kmem_cache_alloc(mm_cachep, SLAB_KERNEL)) - #define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) - - #include <linux/init_task.h> - --static struct mm_struct * mm_init(struct mm_struct * mm) -+static struct 
mm_struct * mm_init(struct mm_struct * mm, -+ struct user_beancounter * ub) - { - atomic_set(&mm->mm_users, 1); - atomic_set(&mm->mm_count, 1); -@@ -414,11 +452,15 @@ static struct mm_struct * mm_init(struct - mm->ioctx_list = NULL; - mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm); - mm->free_area_cache = TASK_UNMAPPED_BASE; -+#ifdef CONFIG_USER_RESOURCE -+ mm_ub(mm) = get_beancounter(ub); -+#endif - - if (likely(!mm_alloc_pgd(mm))) { - mm->def_flags = 0; - return mm; - } -+ put_beancounter(mm_ub(mm)); - free_mm(mm); - return NULL; - } -@@ -433,7 +475,7 @@ struct mm_struct * mm_alloc(void) - mm = allocate_mm(); - if (mm) { - memset(mm, 0, sizeof(*mm)); -- mm = mm_init(mm); -+ mm = mm_init(mm, get_exec_ub()); - } - return mm; - } -@@ -448,6 +490,7 @@ void fastcall __mmdrop(struct mm_struct - BUG_ON(mm == &init_mm); - mm_free_pgd(mm); - destroy_context(mm); -+ put_beancounter(mm_ub(mm)); - free_mm(mm); - } - -@@ -462,6 +505,7 @@ void mmput(struct mm_struct *mm) - spin_unlock(&mmlist_lock); - exit_aio(mm); - exit_mmap(mm); -+ (void) virtinfo_gencall(VIRTINFO_EXITMMAP, mm); - mmdrop(mm); - } - } -@@ -562,7 +606,7 @@ static int copy_mm(unsigned long clone_f - - /* Copy the current MM stuff.. 
*/ - memcpy(mm, oldmm, sizeof(*mm)); -- if (!mm_init(mm)) -+ if (!mm_init(mm, get_task_ub(tsk))) - goto fail_nomem; - - if (init_new_context(tsk,mm)) -@@ -588,6 +632,7 @@ fail_nocontext: - * because it calls destroy_context() - */ - mm_free_pgd(mm); -+ put_beancounter(mm_ub(mm)); - free_mm(mm); - return retval; - } -@@ -853,7 +898,7 @@ asmlinkage long sys_set_tid_address(int - { - current->clear_child_tid = tidptr; - -- return current->pid; -+ return virt_pid(current); - } - - /* -@@ -869,7 +914,8 @@ struct task_struct *copy_process(unsigne - struct pt_regs *regs, - unsigned long stack_size, - int __user *parent_tidptr, -- int __user *child_tidptr) -+ int __user *child_tidptr, -+ long pid) - { - int retval; - struct task_struct *p = NULL; -@@ -929,19 +975,28 @@ struct task_struct *copy_process(unsigne - - p->did_exec = 0; - copy_flags(clone_flags, p); -- if (clone_flags & CLONE_IDLETASK) -+ if (clone_flags & CLONE_IDLETASK) { - p->pid = 0; -- else { -+ set_virt_pid(p, 0); -+ } else { - p->pid = alloc_pidmap(); - if (p->pid == -1) -+ goto bad_fork_cleanup_pid; -+#ifdef CONFIG_VE -+ set_virt_pid(p, alloc_vpid(p->pid, pid ? : -1)); -+ if (virt_pid(p) < 0) - goto bad_fork_cleanup; -+#endif - } - retval = -EFAULT; - if (clone_flags & CLONE_PARENT_SETTID) -- if (put_user(p->pid, parent_tidptr)) -+ if (put_user(virt_pid(p), parent_tidptr)) - goto bad_fork_cleanup; - - p->proc_dentry = NULL; -+#ifdef CONFIG_VE -+ VE_TASK_INFO(p)->glob_proc_dentry = NULL; -+#endif - - INIT_LIST_HEAD(&p->children); - INIT_LIST_HEAD(&p->sibling); -@@ -1017,6 +1072,7 @@ struct task_struct *copy_process(unsigne - /* ok, now we should be set up.. */ - p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL); - p->pdeath_signal = 0; -+ p->exit_state = 0; - - /* Perform scheduler related setup */ - sched_fork(p); -@@ -1026,12 +1082,26 @@ struct task_struct *copy_process(unsigne - * We dont wake it up yet. 
- */ - p->tgid = p->pid; -+ set_virt_tgid(p, virt_pid(p)); -+ set_virt_pgid(p, virt_pgid(current)); -+ set_virt_sid(p, virt_sid(current)); - p->group_leader = p; - INIT_LIST_HEAD(&p->ptrace_children); - INIT_LIST_HEAD(&p->ptrace_list); - - /* Need tasklist lock for parent etc handling! */ - write_lock_irq(&tasklist_lock); -+ -+ /* -+ * The task hasn't been attached yet, so cpus_allowed mask cannot -+ * have changed. The cpus_allowed mask of the parent may have -+ * changed after it was copied first time, and it may then move to -+ * another CPU - so we re-copy it here and set the child's CPU to -+ * the parent's CPU. This avoids alot of nasty races. -+ */ -+ p->cpus_allowed = current->cpus_allowed; -+ set_task_cpu(p, task_cpu(current)); -+ - /* - * Check for pending SIGKILL! The new thread should not be allowed - * to slip out of an OOM kill. (or normal SIGKILL.) -@@ -1043,7 +1113,7 @@ struct task_struct *copy_process(unsigne - } - - /* CLONE_PARENT re-uses the old parent */ -- if (clone_flags & CLONE_PARENT) -+ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) - p->real_parent = current->real_parent; - else - p->real_parent = current; -@@ -1063,6 +1133,7 @@ struct task_struct *copy_process(unsigne - goto bad_fork_cleanup_namespace; - } - p->tgid = current->tgid; -+ set_virt_tgid(p, virt_tgid(current)); - p->group_leader = current->group_leader; - - if (current->signal->group_stop_count > 0) { -@@ -1082,15 +1153,20 @@ struct task_struct *copy_process(unsigne - if (p->ptrace & PT_PTRACED) - __ptrace_link(p, current->parent); - -+#ifdef CONFIG_VE -+ SET_VE_LINKS(p); -+ atomic_inc(&VE_TASK_INFO(p)->owner_env->pcounter); -+ get_ve(VE_TASK_INFO(p)->owner_env); -+ seqcount_init(&VE_TASK_INFO(p)->wakeup_lock); -+#endif - attach_pid(p, PIDTYPE_PID, p->pid); -+ attach_pid(p, PIDTYPE_TGID, p->tgid); - if (thread_group_leader(p)) { -- attach_pid(p, PIDTYPE_TGID, p->tgid); - attach_pid(p, PIDTYPE_PGID, process_group(p)); - attach_pid(p, PIDTYPE_SID, p->signal->session); - if 
(p->pid) - __get_cpu_var(process_counts)++; -- } else -- link_pid(p, p->pids + PIDTYPE_TGID, &p->group_leader->pids[PIDTYPE_TGID].pid); -+ } - - nr_threads++; - write_unlock_irq(&tasklist_lock); -@@ -1126,6 +1202,11 @@ bad_fork_cleanup_policy: - mpol_free(p->mempolicy); - #endif - bad_fork_cleanup: -+#ifdef CONFIG_VE -+ if (virt_pid(p) != p->pid && virt_pid(p) > 0) -+ free_vpid(virt_pid(p), get_exec_env()); -+#endif -+bad_fork_cleanup_pid: - if (p->pid > 0) - free_pidmap(p->pid); - if (p->binfmt) -@@ -1163,12 +1244,13 @@ static inline int fork_traceflag (unsign - * It copies the process, and if successful kick-starts - * it and waits for it to finish using the VM if required. - */ --long do_fork(unsigned long clone_flags, -+long do_fork_pid(unsigned long clone_flags, - unsigned long stack_start, - struct pt_regs *regs, - unsigned long stack_size, - int __user *parent_tidptr, -- int __user *child_tidptr) -+ int __user *child_tidptr, -+ long pid0) - { - struct task_struct *p; - int trace = 0; -@@ -1180,12 +1262,16 @@ long do_fork(unsigned long clone_flags, - clone_flags |= CLONE_PTRACE; - } - -- p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr); -+ pid = virtinfo_gencall(VIRTINFO_DOFORK, (void *)clone_flags); -+ if (pid) -+ return pid; -+ -+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid0); - /* - * Do this prior waking up the new thread - the thread pointer - * might get invalid after that point, if the thread exits quickly. - */ -- pid = IS_ERR(p) ? PTR_ERR(p) : p->pid; -+ pid = IS_ERR(p) ? PTR_ERR(p) : virt_pid(p); - - if (!IS_ERR(p)) { - struct completion vfork; -@@ -1203,6 +1289,7 @@ long do_fork(unsigned long clone_flags, - set_tsk_thread_flag(p, TIF_SIGPENDING); - } - -+ virtinfo_gencall(VIRTINFO_DOFORKRET, p); - if (!(clone_flags & CLONE_STOPPED)) { - /* - * Do the wakeup last. 
On SMP we treat fork() and -@@ -1220,25 +1307,24 @@ long do_fork(unsigned long clone_flags, - else - wake_up_forked_process(p); - } else { -- int cpu = get_cpu(); -- - p->state = TASK_STOPPED; -- if (cpu_is_offline(task_cpu(p))) -- set_task_cpu(p, cpu); -- -- put_cpu(); - } - ++total_forks; - - if (unlikely (trace)) { - current->ptrace_message = pid; -+ set_pn_state(current, PN_STOP_FORK); - ptrace_notify ((trace << 8) | SIGTRAP); -+ clear_pn_state(current); - } - - if (clone_flags & CLONE_VFORK) { - wait_for_completion(&vfork); -- if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) -+ if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) { -+ set_pn_state(current, PN_STOP_VFORK); - ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP); -+ clear_pn_state(current); -+ } - } else - /* - * Let the child process run first, to avoid most of the -@@ -1246,9 +1332,24 @@ long do_fork(unsigned long clone_flags, - */ - set_need_resched(); - } -+ virtinfo_gencall(VIRTINFO_DOFORKPOST, (void *)(long)pid); - return pid; - } - -+EXPORT_SYMBOL(do_fork_pid); -+ -+long do_fork(unsigned long clone_flags, -+ unsigned long stack_start, -+ struct pt_regs *regs, -+ unsigned long stack_size, -+ int __user *parent_tidptr, -+ int __user *child_tidptr) -+{ -+ return do_fork_pid(clone_flags, stack_start, regs, stack_size, -+ parent_tidptr, child_tidptr, 0); -+} -+ -+ - /* SLAB cache for signal_struct structures (tsk->signal) */ - kmem_cache_t *signal_cachep; - -@@ -1267,24 +1368,26 @@ kmem_cache_t *vm_area_cachep; - /* SLAB cache for mm_struct structures (tsk->mm) */ - kmem_cache_t *mm_cachep; - -+#include <linux/kmem_cache.h> - void __init proc_caches_init(void) - { - sighand_cachep = kmem_cache_create("sighand_cache", - sizeof(struct sighand_struct), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - signal_cachep = kmem_cache_create("signal_cache", - sizeof(struct signal_struct), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); 
-+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - files_cachep = kmem_cache_create("files_cache", - sizeof(struct files_struct), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); -+ files_cachep->flags |= CFLGS_ENVIDS; - fs_cachep = kmem_cache_create("fs_cache", - sizeof(struct fs_struct), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - vm_area_cachep = kmem_cache_create("vm_area_struct", - sizeof(struct vm_area_struct), 0, -- SLAB_PANIC, NULL, NULL); -+ SLAB_PANIC|SLAB_UBC, NULL, NULL); - mm_cachep = kmem_cache_create("mm_struct", - sizeof(struct mm_struct), 0, -- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - } -diff -uprN linux-2.6.8.1.orig/kernel/futex.c linux-2.6.8.1-ve022stab078/kernel/futex.c ---- linux-2.6.8.1.orig/kernel/futex.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/futex.c 2006-05-11 13:05:33.000000000 +0400 -@@ -258,6 +258,18 @@ static void drop_key_refs(union futex_ke - } - } - -+static inline int get_futex_value_locked(int *dest, int __user *from) -+{ -+ int ret; -+ -+ inc_preempt_count(); -+ ret = __copy_from_user(dest, from, sizeof(int)); -+ dec_preempt_count(); -+ preempt_check_resched(); -+ -+ return ret ? -EFAULT : 0; -+} -+ - /* - * The hash bucket lock must be held when this is called. - * Afterwards, the futex_q must not be accessed. -@@ -329,6 +341,7 @@ static int futex_requeue(unsigned long u - int ret, drop_count = 0; - unsigned int nqueued; - -+ retry: - down_read(¤t->mm->mmap_sem); - - ret = get_futex_key(uaddr1, &key1); -@@ -355,9 +368,20 @@ static int futex_requeue(unsigned long u - before *uaddr1. 
*/ - smp_mb(); - -- if (get_user(curval, (int __user *)uaddr1) != 0) { -- ret = -EFAULT; -- goto out; -+ ret = get_futex_value_locked(&curval, (int __user *)uaddr1); -+ -+ if (unlikely(ret)) { -+ /* If we would have faulted, release mmap_sem, fault -+ * it in and start all over again. -+ */ -+ up_read(¤t->mm->mmap_sem); -+ -+ ret = get_user(curval, (int __user *)uaddr1); -+ -+ if (!ret) -+ goto retry; -+ -+ return ret; - } - if (curval != *valp) { - ret = -EAGAIN; -@@ -480,6 +504,7 @@ static int futex_wait(unsigned long uadd - int ret, curval; - struct futex_q q; - -+ retry: - down_read(¤t->mm->mmap_sem); - - ret = get_futex_key(uaddr, &q.key); -@@ -493,9 +518,23 @@ static int futex_wait(unsigned long uadd - * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up. - */ -- if (get_user(curval, (int __user *)uaddr) != 0) { -- ret = -EFAULT; -- goto out_unqueue; -+ -+ ret = get_futex_value_locked(&curval, (int __user *)uaddr); -+ -+ if (unlikely(ret)) { -+ /* If we would have faulted, release mmap_sem, fault it in and -+ * start all over again. -+ */ -+ up_read(¤t->mm->mmap_sem); -+ -+ if (!unqueue_me(&q)) /* There's a chance we got woken already */ -+ return 0; -+ -+ ret = get_user(curval, (int __user *)uaddr); -+ -+ if (!ret) -+ goto retry; -+ return ret; - } - if (curval != val) { - ret = -EWOULDBLOCK; -@@ -538,8 +577,8 @@ static int futex_wait(unsigned long uadd - return 0; - if (time == 0) - return -ETIMEDOUT; -- /* A spurious wakeup should never happen. */ -- WARN_ON(!signal_pending(current)); -+ /* We expect signal_pending(current), but another thread may -+ * have handled it for us already. */ - return -EINTR; - - out_unqueue: -diff -uprN linux-2.6.8.1.orig/kernel/kmod.c linux-2.6.8.1-ve022stab078/kernel/kmod.c ---- linux-2.6.8.1.orig/kernel/kmod.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/kmod.c 2006-05-11 13:05:40.000000000 +0400 -@@ -78,6 +78,10 @@ int request_module(const char *fmt, ...) 
- #define MAX_KMOD_CONCURRENT 50 /* Completely arbitrary value - KAO */ - static int kmod_loop_msg; - -+ /* Don't allow request_module() inside VE. */ -+ if (!ve_is_super(get_exec_env())) -+ return -EPERM; -+ - va_start(args, fmt); - ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args); - va_end(args); -@@ -260,6 +264,9 @@ int call_usermodehelper(char *path, char - }; - DECLARE_WORK(work, __call_usermodehelper, &sub_info); - -+ if (!ve_is_super(get_exec_env())) -+ return -EPERM; -+ - if (!khelper_wq) - return -EBUSY; - -diff -uprN linux-2.6.8.1.orig/kernel/kthread.c linux-2.6.8.1-ve022stab078/kernel/kthread.c ---- linux-2.6.8.1.orig/kernel/kthread.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/kthread.c 2006-05-11 13:05:40.000000000 +0400 -@@ -108,7 +108,7 @@ static void keventd_create_kthread(void - create->result = ERR_PTR(pid); - } else { - wait_for_completion(&create->started); -- create->result = find_task_by_pid(pid); -+ create->result = find_task_by_pid_all(pid); - } - complete(&create->done); - } -@@ -151,6 +151,7 @@ void kthread_bind(struct task_struct *k, - BUG_ON(k->state != TASK_INTERRUPTIBLE); - /* Must have done schedule() in kthread() before we set_task_cpu */ - wait_task_inactive(k); -+ /* The following lines look to be unprotected, possible race - vlad */ - set_task_cpu(k, cpu); - k->cpus_allowed = cpumask_of_cpu(cpu); - } -diff -uprN linux-2.6.8.1.orig/kernel/module.c linux-2.6.8.1-ve022stab078/kernel/module.c ---- linux-2.6.8.1.orig/kernel/module.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/module.c 2006-05-11 13:05:42.000000000 +0400 -@@ -2045,6 +2045,8 @@ static void *m_start(struct seq_file *m, - loff_t n = 0; - - down(&module_mutex); -+ if (!ve_is_super(get_exec_env())) -+ return NULL; - list_for_each(i, &modules) { - if (n++ == *pos) - break; -diff -uprN linux-2.6.8.1.orig/kernel/panic.c linux-2.6.8.1-ve022stab078/kernel/panic.c ---- linux-2.6.8.1.orig/kernel/panic.c 
2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/panic.c 2006-05-11 13:05:40.000000000 +0400 -@@ -23,6 +23,8 @@ - int panic_timeout; - int panic_on_oops; - int tainted; -+int kernel_text_csum_broken; -+EXPORT_SYMBOL(kernel_text_csum_broken); - - EXPORT_SYMBOL(panic_timeout); - -@@ -125,7 +127,8 @@ const char *print_tainted(void) - { - static char buf[20]; - if (tainted) { -- snprintf(buf, sizeof(buf), "Tainted: %c%c%c", -+ snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c", -+ kernel_text_csum_broken ? 'B' : ' ', - tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G', - tainted & TAINT_FORCED_MODULE ? 'F' : ' ', - tainted & TAINT_UNSAFE_SMP ? 'S' : ' '); -diff -uprN linux-2.6.8.1.orig/kernel/pid.c linux-2.6.8.1-ve022stab078/kernel/pid.c ---- linux-2.6.8.1.orig/kernel/pid.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/pid.c 2006-05-11 13:05:43.000000000 +0400 -@@ -26,8 +26,12 @@ - #include <linux/bootmem.h> - #include <linux/hash.h> - -+#ifdef CONFIG_VE -+static void __free_vpid(int vpid, struct ve_struct *ve); -+#endif -+ - #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift) --static struct list_head *pid_hash[PIDTYPE_MAX]; -+static struct hlist_head *pid_hash[PIDTYPE_MAX]; - static int pidhash_shift; - - int pid_max = PID_MAX_DEFAULT; -@@ -50,8 +54,14 @@ typedef struct pidmap { - void *page; - } pidmap_t; - -+#ifdef CONFIG_VE -+#define PIDMAP_NRFREE (BITS_PER_PAGE/2) -+#else -+#define PIDMAP_NRFREE BITS_PER_PAGE -+#endif -+ - static pidmap_t pidmap_array[PIDMAP_ENTRIES] = -- { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } }; -+ { [ 0 ... 
PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(PIDMAP_NRFREE), NULL } }; - - static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES; - -@@ -62,6 +72,8 @@ fastcall void free_pidmap(int pid) - pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE; - int offset = pid & BITS_PER_PAGE_MASK; - -+ BUG_ON(__is_virtual_pid(pid) || pid == 1); -+ - clear_bit(offset, map->page); - atomic_inc(&map->nr_free); - } -@@ -103,6 +115,8 @@ int alloc_pidmap(void) - pidmap_t *map; - - pid = last_pid + 1; -+ if (__is_virtual_pid(pid)) -+ pid += VPID_DIV; - if (pid >= pid_max) - pid = RESERVED_PIDS; - -@@ -133,6 +147,8 @@ next_map: - */ - scan_more: - offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset); -+ if (__is_virtual_pid(offset)) -+ offset += VPID_DIV; - if (offset >= BITS_PER_PAGE) - goto next_map; - if (test_and_set_bit(offset, map->page)) -@@ -146,92 +162,134 @@ failure: - return -1; - } - --fastcall struct pid *find_pid(enum pid_type type, int nr) -+struct pid * fastcall find_pid(enum pid_type type, int nr) - { -- struct list_head *elem, *bucket = &pid_hash[type][pid_hashfn(nr)]; -+ struct hlist_node *elem; - struct pid *pid; - -- __list_for_each(elem, bucket) { -- pid = list_entry(elem, struct pid, hash_chain); -+ hlist_for_each_entry(pid, elem, -+ &pid_hash[type][pid_hashfn(nr)], pid_chain) { - if (pid->nr == nr) - return pid; - } - return NULL; - } -- --void fastcall link_pid(task_t *task, struct pid_link *link, struct pid *pid) --{ -- atomic_inc(&pid->count); -- list_add_tail(&link->pid_chain, &pid->task_list); -- link->pidptr = pid; --} -+EXPORT_SYMBOL(find_pid); - - int fastcall attach_pid(task_t *task, enum pid_type type, int nr) - { -- struct pid *pid = find_pid(type, nr); -+ struct pid *pid, *task_pid; - -- if (pid) -- atomic_inc(&pid->count); -- else { -- pid = &task->pids[type].pid; -- pid->nr = nr; -- atomic_set(&pid->count, 1); -- INIT_LIST_HEAD(&pid->task_list); -- pid->task = task; -- get_task_struct(task); -- list_add(&pid->hash_chain, 
&pid_hash[type][pid_hashfn(nr)]); -+ task_pid = &task->pids[type]; -+ pid = find_pid(type, nr); -+ if (pid == NULL) { -+ hlist_add_head(&task_pid->pid_chain, -+ &pid_hash[type][pid_hashfn(nr)]); -+ INIT_LIST_HEAD(&task_pid->pid_list); -+ } else { -+ INIT_HLIST_NODE(&task_pid->pid_chain); -+ list_add_tail(&task_pid->pid_list, &pid->pid_list); - } -- list_add_tail(&task->pids[type].pid_chain, &pid->task_list); -- task->pids[type].pidptr = pid; -+ task_pid->nr = nr; - - return 0; - } - --static inline int __detach_pid(task_t *task, enum pid_type type) -+static fastcall int __detach_pid(task_t *task, enum pid_type type) - { -- struct pid_link *link = task->pids + type; -- struct pid *pid = link->pidptr; -- int nr; -+ struct pid *pid, *pid_next; -+ int nr = 0; -+ -+ pid = &task->pids[type]; -+ if (!hlist_unhashed(&pid->pid_chain)) { -+ hlist_del(&pid->pid_chain); -+ -+ if (list_empty(&pid->pid_list)) -+ nr = pid->nr; -+ else { -+ pid_next = list_entry(pid->pid_list.next, -+ struct pid, pid_list); -+ /* insert next pid from pid_list to hash */ -+ hlist_add_head(&pid_next->pid_chain, -+ &pid_hash[type][pid_hashfn(pid_next->nr)]); -+ } -+ } - -- list_del(&link->pid_chain); -- if (!atomic_dec_and_test(&pid->count)) -- return 0; -- -- nr = pid->nr; -- list_del(&pid->hash_chain); -- put_task_struct(pid->task); -+ list_del(&pid->pid_list); -+ pid->nr = 0; - - return nr; - } - --static void _detach_pid(task_t *task, enum pid_type type) --{ -- __detach_pid(task, type); --} -- - void fastcall detach_pid(task_t *task, enum pid_type type) - { -- int nr = __detach_pid(task, type); -+ int i; -+ int nr; - -+ nr = __detach_pid(task, type); - if (!nr) - return; - -- for (type = 0; type < PIDTYPE_MAX; ++type) -- if (find_pid(type, nr)) -+ for (i = 0; i < PIDTYPE_MAX; ++i) -+ if (find_pid(i, nr)) - return; -+ -+#ifdef CONFIG_VE -+ __free_vpid(task->pids[type].vnr, VE_TASK_INFO(task)->owner_env); -+#endif - free_pidmap(nr); - } - --task_t *find_task_by_pid(int nr) -+task_t 
*find_task_by_pid_type(int type, int nr) - { -- struct pid *pid = find_pid(PIDTYPE_PID, nr); -+ BUG(); -+ return NULL; -+} - -+EXPORT_SYMBOL(find_task_by_pid_type); -+ -+task_t *find_task_by_pid_type_all(int type, int nr) -+{ -+ struct pid *pid; -+ -+ BUG_ON(nr != -1 && is_virtual_pid(nr)); -+ -+ pid = find_pid(type, nr); - if (!pid) - return NULL; -- return pid_task(pid->task_list.next, PIDTYPE_PID); -+ -+ return pid_task(&pid->pid_list, type); - } - --EXPORT_SYMBOL(find_task_by_pid); -+EXPORT_SYMBOL(find_task_by_pid_type_all); -+ -+#ifdef CONFIG_VE -+ -+task_t *find_task_by_pid_type_ve(int type, int nr) -+{ -+ task_t *tsk; -+ int gnr = nr; -+ struct pid *pid; -+ -+ if (is_virtual_pid(nr)) { -+ gnr = __vpid_to_pid(nr); -+ if (unlikely(gnr == -1)) -+ return NULL; -+ } -+ -+ pid = find_pid(type, gnr); -+ if (!pid) -+ return NULL; -+ -+ tsk = pid_task(&pid->pid_list, type); -+ if (!ve_accessible(VE_TASK_INFO(tsk)->owner_env, get_exec_env())) -+ return NULL; -+ return tsk; -+} -+ -+EXPORT_SYMBOL(find_task_by_pid_type_ve); -+ -+#endif - - /* - * This function switches the PIDs if a non-leader thread calls -@@ -240,22 +298,26 @@ EXPORT_SYMBOL(find_task_by_pid); - */ - void switch_exec_pids(task_t *leader, task_t *thread) - { -- _detach_pid(leader, PIDTYPE_PID); -- _detach_pid(leader, PIDTYPE_TGID); -- _detach_pid(leader, PIDTYPE_PGID); -- _detach_pid(leader, PIDTYPE_SID); -+ __detach_pid(leader, PIDTYPE_PID); -+ __detach_pid(leader, PIDTYPE_TGID); -+ __detach_pid(leader, PIDTYPE_PGID); -+ __detach_pid(leader, PIDTYPE_SID); - -- _detach_pid(thread, PIDTYPE_PID); -- _detach_pid(thread, PIDTYPE_TGID); -+ __detach_pid(thread, PIDTYPE_PID); -+ __detach_pid(thread, PIDTYPE_TGID); - - leader->pid = leader->tgid = thread->pid; - thread->pid = thread->tgid; -+ set_virt_tgid(leader, virt_pid(thread)); -+ set_virt_pid(leader, virt_pid(thread)); -+ set_virt_pid(thread, virt_tgid(thread)); - - attach_pid(thread, PIDTYPE_PID, thread->pid); - attach_pid(thread, PIDTYPE_TGID, 
thread->tgid); - attach_pid(thread, PIDTYPE_PGID, thread->signal->pgrp); - attach_pid(thread, PIDTYPE_SID, thread->signal->session); - list_add_tail(&thread->tasks, &init_task.tasks); -+ SET_VE_LINKS(thread); - - attach_pid(leader, PIDTYPE_PID, leader->pid); - attach_pid(leader, PIDTYPE_TGID, leader->tgid); -@@ -263,6 +325,338 @@ void switch_exec_pids(task_t *leader, ta - attach_pid(leader, PIDTYPE_SID, leader->signal->session); - } - -+#ifdef CONFIG_VE -+ -+/* Virtual PID bits. -+ * -+ * At the moment all internal structures in kernel store real global pid. -+ * The only place, where virtual PID is used, is at user frontend. We -+ * remap virtual pids obtained from user to global ones (vpid_to_pid) and -+ * map globals to virtuals before showing them to user (virt_pid_type). -+ * -+ * We hold virtual PIDs inside struct pid, so map global -> virtual is easy. -+ */ -+ -+pid_t _pid_type_to_vpid(int type, pid_t pid) -+{ -+ struct pid * p; -+ -+ if (unlikely(is_virtual_pid(pid))) -+ return -1; -+ -+ read_lock(&tasklist_lock); -+ p = find_pid(type, pid); -+ if (p) { -+ pid = p->vnr; -+ } else { -+ pid = -1; -+ } -+ read_unlock(&tasklist_lock); -+ return pid; -+} -+ -+pid_t pid_type_to_vpid(int type, pid_t pid) -+{ -+ int vpid; -+ -+ if (unlikely(pid <= 0)) -+ return pid; -+ -+ BUG_ON(is_virtual_pid(pid)); -+ -+ if (ve_is_super(get_exec_env())) -+ return pid; -+ -+ vpid = _pid_type_to_vpid(type, pid); -+ if (unlikely(vpid == -1)) { -+ /* It is allowed: global pid can be used everywhere. -+ * This can happen, when kernel remembers stray pids: -+ * signal queues, locks etc. -+ */ -+ vpid = pid; -+ } -+ return vpid; -+} -+ -+/* To map virtual pids to global we maintain special hash table. -+ * -+ * Mapping entries are allocated when a process with non-trivial -+ * mapping is forked, which is possible only after VE migrated. -+ * Mappings are destroyed, when a global pid is removed from global -+ * pidmap, which means we do not need to refcount mappings. 
-+ */ -+ -+static struct hlist_head *vpid_hash; -+ -+struct vpid_mapping -+{ -+ int vpid; -+ int veid; -+ int pid; -+ struct hlist_node link; -+}; -+ -+static kmem_cache_t *vpid_mapping_cachep; -+ -+static inline int vpid_hashfn(int vnr, int veid) -+{ -+ return hash_long((unsigned long)(vnr+(veid<<16)), pidhash_shift); -+} -+ -+struct vpid_mapping *__lookup_vpid_mapping(int vnr, int veid) -+{ -+ struct hlist_node *elem; -+ struct vpid_mapping *map; -+ -+ hlist_for_each_entry(map, elem, -+ &vpid_hash[vpid_hashfn(vnr, veid)], link) { -+ if (map->vpid == vnr && map->veid == veid) -+ return map; -+ } -+ return NULL; -+} -+ -+/* __vpid_to_pid() is raw version of vpid_to_pid(). It is to be used -+ * only under tasklist_lock. In some places we must use only this version -+ * (f.e. __kill_pg_info is called under write lock!) -+ * -+ * Caller should pass virtual pid. This function returns an error, when -+ * seeing a global pid. -+ */ -+int __vpid_to_pid(int pid) -+{ -+ struct vpid_mapping *map; -+ -+ if (unlikely(!is_virtual_pid(pid) || ve_is_super(get_exec_env()))) -+ return -1; -+ -+ if (!get_exec_env()->sparse_vpid) { -+ if (pid != 1) -+ return pid - VPID_DIV; -+ return get_exec_env()->init_entry->pid; -+ } -+ -+ map = __lookup_vpid_mapping(pid, VEID(get_exec_env())); -+ if (map) -+ return map->pid; -+ return -1; -+} -+ -+int vpid_to_pid(int pid) -+{ -+ /* User gave bad pid. It is his problem. */ -+ if (unlikely(pid <= 0)) -+ return pid; -+ -+ if (!is_virtual_pid(pid)) -+ return pid; -+ -+ read_lock(&tasklist_lock); -+ pid = __vpid_to_pid(pid); -+ read_unlock(&tasklist_lock); -+ return pid; -+} -+ -+/* VEs which never migrated have trivial "arithmetic" mapping pid <-> vpid: -+ * -+ * vpid == 1 -> ve->init_task->pid -+ * else pid & ~VPID_DIV -+ * -+ * In this case VE has ve->sparse_vpid = 0 and we do not use vpid hash table. -+ * -+ * When VE migrates and we see non-trivial mapping the first time, we -+ * scan process table and populate mapping hash table. 
-+ */ -+ -+static int add_mapping(int pid, int vpid, int veid, struct hlist_head *cache) -+{ -+ if (pid > 0 && vpid > 0 && !__lookup_vpid_mapping(vpid, veid)) { -+ struct vpid_mapping *m; -+ if (hlist_empty(cache)) { -+ m = kmem_cache_alloc(vpid_mapping_cachep, GFP_ATOMIC); -+ if (unlikely(m == NULL)) -+ return -ENOMEM; -+ } else { -+ m = hlist_entry(cache->first, struct vpid_mapping, link); -+ hlist_del(&m->link); -+ } -+ m->pid = pid; -+ m->vpid = vpid; -+ m->veid = veid; -+ hlist_add_head(&m->link, -+ &vpid_hash[vpid_hashfn(vpid, veid)]); -+ } -+ return 0; -+} -+ -+static int switch_to_sparse_mapping(int pid) -+{ -+ struct ve_struct *env = get_exec_env(); -+ struct hlist_head cache; -+ task_t *g, *t; -+ int pcount; -+ int err; -+ -+ /* Transition happens under write_lock_irq, so we try to make -+ * it more reliable and fast preallocating mapping entries. -+ * pcounter may be not enough, we could have lots of orphaned -+ * process groups and sessions, which also require mappings. -+ */ -+ INIT_HLIST_HEAD(&cache); -+ pcount = atomic_read(&env->pcounter); -+ err = -ENOMEM; -+ while (pcount > 0) { -+ struct vpid_mapping *m; -+ m = kmem_cache_alloc(vpid_mapping_cachep, GFP_KERNEL); -+ if (!m) -+ goto out; -+ hlist_add_head(&m->link, &cache); -+ pcount--; -+ } -+ -+ write_lock_irq(&tasklist_lock); -+ err = 0; -+ if (env->sparse_vpid) -+ goto out_unlock; -+ -+ err = -ENOMEM; -+ do_each_thread_ve(g, t) { -+ if (t->pid == pid) -+ continue; -+ if (add_mapping(t->pid, virt_pid(t), VEID(env), &cache)) -+ goto out_unlock; -+ } while_each_thread_ve(g, t); -+ -+ for_each_process_ve(t) { -+ if (t->pid == pid) -+ continue; -+ -+ if (add_mapping(t->tgid, virt_tgid(t), VEID(env), &cache)) -+ goto out_unlock; -+ if (add_mapping(t->signal->pgrp, virt_pgid(t), VEID(env), &cache)) -+ goto out_unlock; -+ if (add_mapping(t->signal->session, virt_sid(t), VEID(env), &cache)) -+ goto out_unlock; -+ } -+ env->sparse_vpid = 1; -+ err = 0; -+ -+out_unlock: -+ if (err) { -+ int i; -+ -+ for 
(i=0; i<(1<<pidhash_shift); i++) { -+ struct hlist_node *elem, *next; -+ struct vpid_mapping *map; -+ -+ hlist_for_each_entry_safe(map, elem, next, &vpid_hash[i], link) { -+ if (map->veid == VEID(env)) { -+ hlist_del(elem); -+ hlist_add_head(elem, &cache); -+ } -+ } -+ } -+ } -+ write_unlock_irq(&tasklist_lock); -+ -+out: -+ while (!hlist_empty(&cache)) { -+ struct vpid_mapping *m; -+ m = hlist_entry(cache.first, struct vpid_mapping, link); -+ hlist_del(&m->link); -+ kmem_cache_free(vpid_mapping_cachep, m); -+ } -+ return err; -+} -+ -+int alloc_vpid(int pid, int virt_pid) -+{ -+ int result; -+ struct vpid_mapping *m; -+ struct ve_struct *env = get_exec_env(); -+ -+ if (ve_is_super(env) || !env->virt_pids) -+ return pid; -+ -+ if (!env->sparse_vpid) { -+ if (virt_pid == -1) -+ return pid + VPID_DIV; -+ -+ if (virt_pid == 1 || virt_pid == pid + VPID_DIV) -+ return virt_pid; -+ -+ if ((result = switch_to_sparse_mapping(pid)) < 0) -+ return result; -+ } -+ -+ m = kmem_cache_alloc(vpid_mapping_cachep, GFP_KERNEL); -+ if (!m) -+ return -ENOMEM; -+ -+ m->pid = pid; -+ m->veid = VEID(env); -+ -+ result = (virt_pid == -1) ? pid + VPID_DIV : virt_pid; -+ -+ write_lock_irq(&tasklist_lock); -+ if (unlikely(__lookup_vpid_mapping(result, m->veid))) { -+ if (virt_pid > 0) { -+ result = -EEXIST; -+ goto out; -+ } -+ -+ /* No luck. Now we search for some not-existing vpid. -+ * It is weak place. We do linear search. 
*/ -+ do { -+ result++; -+ if (!__is_virtual_pid(result)) -+ result += VPID_DIV; -+ if (result >= pid_max) -+ result = RESERVED_PIDS + VPID_DIV; -+ } while (__lookup_vpid_mapping(result, m->veid) != NULL); -+ -+ /* And set last_pid in hope future alloc_pidmap to avoid -+ * collisions after future alloc_pidmap() */ -+ last_pid = result - VPID_DIV; -+ } -+ if (result > 0) { -+ m->vpid = result; -+ hlist_add_head(&m->link, -+ &vpid_hash[vpid_hashfn(result, m->veid)]); -+ } -+out: -+ write_unlock_irq(&tasklist_lock); -+ if (result < 0) -+ kmem_cache_free(vpid_mapping_cachep, m); -+ return result; -+} -+EXPORT_SYMBOL(alloc_vpid); -+ -+static void __free_vpid(int vpid, struct ve_struct *ve) -+{ -+ struct vpid_mapping *m; -+ -+ if (!ve->sparse_vpid) -+ return; -+ -+ if (!__is_virtual_pid(vpid) && (vpid != 1 || ve_is_super(ve))) -+ return; -+ -+ m = __lookup_vpid_mapping(vpid, ve->veid); -+ BUG_ON(m == NULL); -+ hlist_del(&m->link); -+ kmem_cache_free(vpid_mapping_cachep, m); -+} -+ -+void free_vpid(int vpid, struct ve_struct *ve) -+{ -+ write_lock_irq(&tasklist_lock); -+ __free_vpid(vpid, ve); -+ write_unlock_irq(&tasklist_lock); -+} -+EXPORT_SYMBOL(free_vpid); -+#endif -+ - /* - * The pid hash table is scaled according to the amount of memory in the - * machine. 
From a minimum of 16 slots up to 4096 slots at one gigabyte or -@@ -283,12 +677,20 @@ void __init pidhash_init(void) - - for (i = 0; i < PIDTYPE_MAX; i++) { - pid_hash[i] = alloc_bootmem(pidhash_size * -- sizeof(struct list_head)); -+ sizeof(struct hlist_head)); - if (!pid_hash[i]) - panic("Could not alloc pidhash!\n"); - for (j = 0; j < pidhash_size; j++) -- INIT_LIST_HEAD(&pid_hash[i][j]); -+ INIT_HLIST_HEAD(&pid_hash[i][j]); - } -+ -+#ifdef CONFIG_VE -+ vpid_hash = alloc_bootmem(pidhash_size * sizeof(struct hlist_head)); -+ if (!vpid_hash) -+ panic("Could not alloc vpid_hash!\n"); -+ for (j = 0; j < pidhash_size; j++) -+ INIT_HLIST_HEAD(&vpid_hash[j]); -+#endif - } - - void __init pidmap_init(void) -@@ -305,4 +707,12 @@ void __init pidmap_init(void) - - for (i = 0; i < PIDTYPE_MAX; i++) - attach_pid(current, i, 0); -+ -+#ifdef CONFIG_VE -+ vpid_mapping_cachep = -+ kmem_cache_create("vpid_mapping", -+ sizeof(struct vpid_mapping), -+ __alignof__(struct vpid_mapping), -+ SLAB_PANIC|SLAB_UBC, NULL, NULL); -+#endif - } -diff -uprN linux-2.6.8.1.orig/kernel/posix-timers.c linux-2.6.8.1-ve022stab078/kernel/posix-timers.c ---- linux-2.6.8.1.orig/kernel/posix-timers.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/posix-timers.c 2006-05-11 13:05:40.000000000 +0400 -@@ -31,6 +31,7 @@ - * POSIX clocks & timers - */ - #include <linux/mm.h> -+#include <linux/module.h> - #include <linux/smp_lock.h> - #include <linux/interrupt.h> - #include <linux/slab.h> -@@ -223,7 +224,8 @@ static __init int init_posix_timers(void - register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic); - - posix_timers_cache = kmem_cache_create("posix_timers_cache", -- sizeof (struct k_itimer), 0, 0, NULL, NULL); -+ sizeof (struct k_itimer), 0, SLAB_UBC, -+ NULL, NULL); - idr_init(&posix_timers_id); - return 0; - } -@@ -394,6 +396,11 @@ exit: - static void timer_notify_task(struct k_itimer *timr) - { - int ret; -+ struct ve_struct *old_ve; -+ struct user_beancounter *old_ub; -+ 
-+ old_ve = set_exec_env(VE_TASK_INFO(timr->it_process)->owner_env); -+ old_ub = set_exec_ub(task_bc(timr->it_process)->task_ub); - - memset(&timr->sigq->info, 0, sizeof(siginfo_t)); - -@@ -440,6 +447,9 @@ static void timer_notify_task(struct k_i - */ - schedule_next_timer(timr); - } -+ -+ (void)set_exec_ub(old_ub); -+ (void)set_exec_env(old_ve); - } - - /* -@@ -499,7 +509,7 @@ static inline struct task_struct * good_ - struct task_struct *rtn = current->group_leader; - - if ((event->sigev_notify & SIGEV_THREAD_ID ) && -- (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) || -+ (!(rtn = find_task_by_pid_ve(event->sigev_notify_thread_id)) || - rtn->tgid != current->tgid || - (event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL)) - return NULL; -@@ -1228,6 +1238,7 @@ int do_posix_clock_monotonic_gettime(str - } - return 0; - } -+EXPORT_SYMBOL(do_posix_clock_monotonic_gettime); - - int do_posix_clock_monotonic_settime(struct timespec *tp) - { -diff -uprN linux-2.6.8.1.orig/kernel/power/pmdisk.c linux-2.6.8.1-ve022stab078/kernel/power/pmdisk.c ---- linux-2.6.8.1.orig/kernel/power/pmdisk.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/power/pmdisk.c 2006-05-11 13:05:39.000000000 +0400 -@@ -206,7 +206,7 @@ static int write_swap_page(unsigned long - swp_entry_t entry; - int error = 0; - -- entry = get_swap_page(); -+ entry = get_swap_page(mm_ub(&init_mm)); - if (swp_offset(entry) && - swapfile_used[swp_type(entry)] == SWAPFILE_SUSPEND) { - error = rw_swap_page_sync(WRITE, entry, -diff -uprN linux-2.6.8.1.orig/kernel/power/process.c linux-2.6.8.1-ve022stab078/kernel/power/process.c ---- linux-2.6.8.1.orig/kernel/power/process.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/power/process.c 2006-05-11 13:05:45.000000000 +0400 -@@ -23,15 +23,15 @@ static inline int freezeable(struct task - { - if ((p == current) || - (p->flags & PF_NOFREEZE) || -- (p->state == TASK_ZOMBIE) || -- (p->state == TASK_DEAD) || -+ 
(p->exit_state == EXIT_ZOMBIE) || -+ (p->exit_state == EXIT_DEAD) || - (p->state == TASK_STOPPED)) - return 0; - return 1; - } - - /* Refrigerator is place where frozen processes are stored :-). */ --void refrigerator(unsigned long flag) -+void refrigerator() - { - /* Hmm, should we be allowed to suspend when there are realtime - processes around? */ -@@ -39,14 +39,19 @@ void refrigerator(unsigned long flag) - save = current->state; - current->state = TASK_UNINTERRUPTIBLE; - pr_debug("%s entered refrigerator\n", current->comm); -- printk("="); -- current->flags &= ~PF_FREEZE; -+ /* printk("="); */ - - spin_lock_irq(¤t->sighand->siglock); -- recalc_sigpending(); /* We sent fake signal, clean it up */ -+ if (test_and_clear_thread_flag(TIF_FREEZE)) { -+ recalc_sigpending(); /* We sent fake signal, clean it up */ -+ current->flags |= PF_FROZEN; -+ } else { -+ /* Freeze request could be canceled before we entered -+ * refrigerator(). In this case we do nothing. */ -+ current->state = save; -+ } - spin_unlock_irq(¤t->sighand->siglock); - -- current->flags |= PF_FROZEN; - while (current->flags & PF_FROZEN) - schedule(); - pr_debug("%s left refrigerator\n", current->comm); -@@ -65,7 +70,7 @@ int freeze_processes(void) - do { - todo = 0; - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_all(g, p) { - unsigned long flags; - if (!freezeable(p)) - continue; -@@ -75,12 +80,12 @@ int freeze_processes(void) - - /* FIXME: smp problem here: we may not access other process' flags - without locking */ -- p->flags |= PF_FREEZE; - spin_lock_irqsave(&p->sighand->siglock, flags); -+ set_tsk_thread_flag(p, TIF_FREEZE); - signal_wake_up(p, 0); - spin_unlock_irqrestore(&p->sighand->siglock, flags); - todo++; -- } while_each_thread(g, p); -+ } while_each_thread_all(g, p); - read_unlock(&tasklist_lock); - yield(); /* Yield is okay here */ - if (time_after(jiffies, start_time + TIMEOUT)) { -@@ -90,7 +95,7 @@ int freeze_processes(void) - } - } while(todo); - -- printk( 
"|\n" ); -+ /* printk( "|\n" ); */ - BUG_ON(in_atomic()); - return 0; - } -@@ -101,15 +106,18 @@ void thaw_processes(void) - - printk( "Restarting tasks..." ); - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_all(g, p) { -+ unsigned long flags; - if (!freezeable(p)) - continue; -+ spin_lock_irqsave(&p->sighand->siglock, flags); - if (p->flags & PF_FROZEN) { - p->flags &= ~PF_FROZEN; - wake_up_process(p); - } else - printk(KERN_INFO " Strange, %s not stopped\n", p->comm ); -- } while_each_thread(g, p); -+ spin_unlock_irqrestore(&p->sighand->siglock, flags); -+ } while_each_thread_all(g, p); - - read_unlock(&tasklist_lock); - schedule(); -diff -uprN linux-2.6.8.1.orig/kernel/power/swsusp.c linux-2.6.8.1-ve022stab078/kernel/power/swsusp.c ---- linux-2.6.8.1.orig/kernel/power/swsusp.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/power/swsusp.c 2006-05-11 13:05:39.000000000 +0400 -@@ -317,7 +317,7 @@ static int write_suspend_image(void) - for (i=0; i<nr_copy_pages; i++) { - if (!(i%100)) - printk( "." ); -- entry = get_swap_page(); -+ entry = get_swap_page(mm_ub(&init_mm)); - if (!entry.val) - panic("\nNot enough swapspace when writing data" ); - -@@ -335,7 +335,7 @@ static int write_suspend_image(void) - cur = (union diskpage *)((char *) pagedir_nosave)+i; - BUG_ON ((char *) cur != (((char *) pagedir_nosave) + i*PAGE_SIZE)); - printk( "." 
); -- entry = get_swap_page(); -+ entry = get_swap_page(mm_ub(&init_mm)); - if (!entry.val) { - printk(KERN_CRIT "Not enough swapspace when writing pgdir\n" ); - panic("Don't know how to recover"); -@@ -358,7 +358,7 @@ static int write_suspend_image(void) - BUG_ON (sizeof(struct suspend_header) > PAGE_SIZE-sizeof(swp_entry_t)); - BUG_ON (sizeof(union diskpage) != PAGE_SIZE); - BUG_ON (sizeof(struct link) != PAGE_SIZE); -- entry = get_swap_page(); -+ entry = get_swap_page(mm_ub(&init_mm)); - if (!entry.val) - panic( "\nNot enough swapspace when writing header" ); - if (swapfile_used[swp_type(entry)] != SWAPFILE_SUSPEND) -diff -uprN linux-2.6.8.1.orig/kernel/printk.c linux-2.6.8.1-ve022stab078/kernel/printk.c ---- linux-2.6.8.1.orig/kernel/printk.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/printk.c 2006-05-11 13:05:42.000000000 +0400 -@@ -26,10 +26,13 @@ - #include <linux/module.h> - #include <linux/interrupt.h> /* For in_interrupt() */ - #include <linux/config.h> -+#include <linux/slab.h> - #include <linux/delay.h> - #include <linux/smp.h> - #include <linux/security.h> - #include <linux/bootmem.h> -+#include <linux/vzratelimit.h> -+#include <linux/veprintk.h> - - #include <asm/uaccess.h> - -@@ -53,6 +56,7 @@ int console_printk[4] = { - - EXPORT_SYMBOL(console_printk); - -+int console_silence_loglevel; - int oops_in_progress; - - /* -@@ -77,7 +81,7 @@ static int console_locked; - * It is also used in interesting ways to provide interlocking in - * release_console_sem(). 
- */ --static spinlock_t logbuf_lock = SPIN_LOCK_UNLOCKED; -+spinlock_t logbuf_lock = SPIN_LOCK_UNLOCKED; - - static char __log_buf[__LOG_BUF_LEN]; - static char *log_buf = __log_buf; -@@ -151,6 +155,43 @@ static int __init console_setup(char *st - - __setup("console=", console_setup); - -+static int __init setup_console_silencelevel(char *str) -+{ -+ int level; -+ -+ if (get_option(&str, &level) != 1) -+ return 0; -+ -+ console_silence_loglevel = level; -+ return 1; -+} -+ -+__setup("silencelevel=", setup_console_silencelevel); -+ -+static inline int ve_log_init(void) -+{ -+#ifdef CONFIG_VE -+ if (ve_log_buf != NULL) -+ return 0; -+ -+ if (ve_is_super(get_exec_env())) { -+ ve0._log_wait = &log_wait; -+ ve0._log_start = &log_start; -+ ve0._log_end = &log_end; -+ ve0._logged_chars = &logged_chars; -+ ve0.log_buf = log_buf; -+ return 0; -+ } -+ -+ ve_log_buf = kmalloc(ve_log_buf_len, GFP_ATOMIC); -+ if (!ve_log_buf) -+ return -ENOMEM; -+ -+ memset(ve_log_buf, 0, ve_log_buf_len); -+#endif -+ return 0; -+} -+ - /** - * add_preferred_console - add a device to the list of preferred consoles. 
- * -@@ -249,6 +290,10 @@ int do_syslog(int type, char __user * bu - char c; - int error = 0; - -+ if (!ve_is_super(get_exec_env()) && -+ (type == 6 || type == 7 || type == 8)) -+ goto out; -+ - error = security_syslog(type); - if (error) - return error; -@@ -268,14 +313,15 @@ int do_syslog(int type, char __user * bu - error = verify_area(VERIFY_WRITE,buf,len); - if (error) - goto out; -- error = wait_event_interruptible(log_wait, (log_start - log_end)); -+ error = wait_event_interruptible(ve_log_wait, -+ (ve_log_start - ve_log_end)); - if (error) - goto out; - i = 0; - spin_lock_irq(&logbuf_lock); -- while (!error && (log_start != log_end) && i < len) { -- c = LOG_BUF(log_start); -- log_start++; -+ while (!error && (ve_log_start != ve_log_end) && i < len) { -+ c = VE_LOG_BUF(ve_log_start); -+ ve_log_start++; - spin_unlock_irq(&logbuf_lock); - error = __put_user(c,buf); - buf++; -@@ -299,15 +345,17 @@ int do_syslog(int type, char __user * bu - error = verify_area(VERIFY_WRITE,buf,len); - if (error) - goto out; -+ if (ve_log_buf == NULL) -+ goto out; - count = len; -- if (count > log_buf_len) -- count = log_buf_len; -+ if (count > ve_log_buf_len) -+ count = ve_log_buf_len; - spin_lock_irq(&logbuf_lock); -- if (count > logged_chars) -- count = logged_chars; -+ if (count > ve_logged_chars) -+ count = ve_logged_chars; - if (do_clear) -- logged_chars = 0; -- limit = log_end; -+ ve_logged_chars = 0; -+ limit = ve_log_end; - /* - * __put_user() could sleep, and while we sleep - * printk() could overwrite the messages -@@ -316,9 +364,9 @@ int do_syslog(int type, char __user * bu - */ - for(i = 0; i < count && !error; i++) { - j = limit-1-i; -- if (j + log_buf_len < log_end) -+ if (j + ve_log_buf_len < ve_log_end) - break; -- c = LOG_BUF(j); -+ c = VE_LOG_BUF(j); - spin_unlock_irq(&logbuf_lock); - error = __put_user(c,&buf[count-1-i]); - spin_lock_irq(&logbuf_lock); -@@ -340,7 +388,7 @@ int do_syslog(int type, char __user * bu - } - break; - case 5: /* Clear ring buffer */ 
-- logged_chars = 0; -+ ve_logged_chars = 0; - break; - case 6: /* Disable logging to console */ - console_loglevel = minimum_console_loglevel; -@@ -358,10 +406,10 @@ int do_syslog(int type, char __user * bu - error = 0; - break; - case 9: /* Number of chars in the log buffer */ -- error = log_end - log_start; -+ error = ve_log_end - ve_log_start; - break; - case 10: /* Size of the log buffer */ -- error = log_buf_len; -+ error = ve_log_buf_len; - break; - default: - error = -EINVAL; -@@ -461,14 +509,14 @@ static void call_console_drivers(unsigne - - static void emit_log_char(char c) - { -- LOG_BUF(log_end) = c; -- log_end++; -- if (log_end - log_start > log_buf_len) -- log_start = log_end - log_buf_len; -- if (log_end - con_start > log_buf_len) -+ VE_LOG_BUF(ve_log_end) = c; -+ ve_log_end++; -+ if (ve_log_end - ve_log_start > ve_log_buf_len) -+ ve_log_start = ve_log_end - ve_log_buf_len; -+ if (ve_is_super(get_exec_env()) && log_end - con_start > log_buf_len) - con_start = log_end - log_buf_len; -- if (logged_chars < log_buf_len) -- logged_chars++; -+ if (ve_logged_chars < ve_log_buf_len) -+ ve_logged_chars++; - } - - /* -@@ -505,14 +553,14 @@ static void zap_locks(void) - * then changes console_loglevel may break. This is because console_loglevel - * is inspected when the actual printing occurs. - */ --asmlinkage int printk(const char *fmt, ...) -+asmlinkage int vprintk(const char *fmt, va_list args) - { -- va_list args; - unsigned long flags; - int printed_len; - char *p; - static char printk_buf[1024]; - static int log_level_unknown = 1; -+ int err, need_wake; - - if (unlikely(oops_in_progress)) - zap_locks(); -@@ -520,10 +568,14 @@ asmlinkage int printk(const char *fmt, . 
- /* This stops the holder of console_sem just where we want him */ - spin_lock_irqsave(&logbuf_lock, flags); - -+ err = ve_log_init(); -+ if (err) { -+ spin_unlock_irqrestore(&logbuf_lock, flags); -+ return err; -+ } -+ - /* Emit the output into the temporary buffer */ -- va_start(args, fmt); - printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args); -- va_end(args); - - /* - * Copy the output into log_buf. If the caller didn't provide -@@ -554,7 +606,12 @@ asmlinkage int printk(const char *fmt, . - spin_unlock_irqrestore(&logbuf_lock, flags); - goto out; - } -- if (!down_trylock(&console_sem)) { -+ if (!ve_is_super(get_exec_env())) { -+ need_wake = (ve_log_start != ve_log_end); -+ spin_unlock_irqrestore(&logbuf_lock, flags); -+ if (!oops_in_progress && need_wake) -+ wake_up_interruptible(&ve_log_wait); -+ } else if (!down_trylock(&console_sem)) { - console_locked = 1; - /* - * We own the drivers. We can drop the spinlock and let -@@ -574,8 +631,49 @@ asmlinkage int printk(const char *fmt, . - out: - return printed_len; - } -+ -+EXPORT_SYMBOL(vprintk); -+ -+asmlinkage int printk(const char *fmt, ...) -+{ -+ va_list args; -+ int i; -+ struct ve_struct *env; -+ -+ va_start(args, fmt); -+ env = set_exec_env(get_ve0()); -+ i = vprintk(fmt, args); -+ set_exec_env(env); -+ va_end(args); -+ return i; -+} -+ - EXPORT_SYMBOL(printk); - -+asmlinkage int ve_printk(int dst, const char *fmt, ...) -+{ -+ va_list args; -+ int printed_len; -+ -+ printed_len = 0; -+ if (ve_is_super(get_exec_env()) || (dst & VE0_LOG)) { -+ struct ve_struct *env; -+ va_start(args, fmt); -+ env = set_exec_env(get_ve0()); -+ printed_len = vprintk(fmt, args); -+ set_exec_env(env); -+ va_end(args); -+ } -+ if (!ve_is_super(get_exec_env()) && (dst & VE_LOG)) { -+ va_start(args, fmt); -+ printed_len = vprintk(fmt, args); -+ va_end(args); -+ } -+ return printed_len; -+} -+EXPORT_SYMBOL(ve_printk); -+ -+ - /** - * acquire_console_sem - lock the console system for exclusive use. 
- * -@@ -600,6 +698,12 @@ int is_console_locked(void) - } - EXPORT_SYMBOL(is_console_locked); - -+void wake_up_klogd(void) -+{ -+ if (!oops_in_progress && waitqueue_active(&log_wait)) -+ wake_up_interruptible(&log_wait); -+} -+ - /** - * release_console_sem - unlock the console system - * -@@ -635,8 +739,8 @@ void release_console_sem(void) - console_may_schedule = 0; - up(&console_sem); - spin_unlock_irqrestore(&logbuf_lock, flags); -- if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait)) -- wake_up_interruptible(&log_wait); -+ if (wake_klogd) -+ wake_up_klogd(); - } - EXPORT_SYMBOL(release_console_sem); - -@@ -895,3 +999,33 @@ int printk_ratelimit(void) - printk_ratelimit_burst); - } - EXPORT_SYMBOL(printk_ratelimit); -+ -+/* -+ * Rate limiting stuff. -+ */ -+int vz_ratelimit(struct vz_rate_info *p) -+{ -+ unsigned long cjif, djif; -+ unsigned long flags; -+ static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED; -+ long new_bucket; -+ -+ spin_lock_irqsave(&ratelimit_lock, flags); -+ cjif = jiffies; -+ djif = cjif - p->last; -+ if (djif < p->interval) { -+ if (p->bucket >= p->burst) { -+ spin_unlock_irqrestore(&ratelimit_lock, flags); -+ return 0; -+ } -+ p->bucket++; -+ } else { -+ new_bucket = p->bucket - (djif / (unsigned)p->interval); -+ if (new_bucket < 0) -+ new_bucket = 0; -+ p->bucket = new_bucket + 1; -+ } -+ p->last = cjif; -+ spin_unlock_irqrestore(&ratelimit_lock, flags); -+ return 1; -+} -diff -uprN linux-2.6.8.1.orig/kernel/ptrace.c linux-2.6.8.1-ve022stab078/kernel/ptrace.c ---- linux-2.6.8.1.orig/kernel/ptrace.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/ptrace.c 2006-05-11 13:05:40.000000000 +0400 -@@ -46,8 +46,8 @@ void __ptrace_link(task_t *child, task_t - */ - void __ptrace_unlink(task_t *child) - { -- if (!child->ptrace) -- BUG(); -+ BUG_ON(!child->ptrace); -+ - child->ptrace = 0; - if (list_empty(&child->ptrace_list)) - return; -@@ -85,7 +85,7 @@ int ptrace_attach(struct task_struct *ta - retval = 
-EPERM; - if (task->pid <= 1) - goto bad; -- if (task == current) -+ if (task->tgid == current->tgid) - goto bad; - if (!task->mm) - goto bad; -@@ -99,6 +99,8 @@ int ptrace_attach(struct task_struct *ta - rmb(); - if (!task->mm->dumpable && !capable(CAP_SYS_PTRACE)) - goto bad; -+ if (!task->mm->vps_dumpable && !ve_is_super(get_exec_env())) -+ goto bad; - /* the same process cannot be attached many times */ - if (task->ptrace & PT_PTRACED) - goto bad; -@@ -124,22 +126,27 @@ bad: - return retval; - } - -+void __ptrace_detach(struct task_struct *child, unsigned int data) -+{ -+ child->exit_code = data; -+ /* .. re-parent .. */ -+ __ptrace_unlink(child); -+ /* .. and wake it up. */ -+ if (child->exit_state != EXIT_ZOMBIE) -+ wake_up_process(child); -+} -+ - int ptrace_detach(struct task_struct *child, unsigned int data) - { - if ((unsigned long) data > _NSIG) -- return -EIO; -+ return -EIO; - - /* Architecture-specific hardware disable .. */ - ptrace_disable(child); - -- /* .. re-parent .. */ -- child->exit_code = data; -- - write_lock_irq(&tasklist_lock); -- __ptrace_unlink(child); -- /* .. and wake it up. 
*/ -- if (child->state != TASK_ZOMBIE) -- wake_up_process(child); -+ if (child->ptrace) -+ __ptrace_detach(child, data); - write_unlock_irq(&tasklist_lock); - - return 0; -diff -uprN linux-2.6.8.1.orig/kernel/sched.c linux-2.6.8.1-ve022stab078/kernel/sched.c ---- linux-2.6.8.1.orig/kernel/sched.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/sched.c 2006-05-11 13:05:49.000000000 +0400 -@@ -25,6 +25,7 @@ - #include <asm/uaccess.h> - #include <linux/highmem.h> - #include <linux/smp_lock.h> -+#include <linux/pagemap.h> - #include <asm/mmu_context.h> - #include <linux/interrupt.h> - #include <linux/completion.h> -@@ -40,6 +41,8 @@ - #include <linux/cpu.h> - #include <linux/percpu.h> - #include <linux/kthread.h> -+#include <linux/vsched.h> -+#include <linux/fairsched.h> - #include <asm/tlb.h> - - #include <asm/unistd.h> -@@ -132,7 +135,7 @@ - #ifdef CONFIG_SMP - #define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ -- num_online_cpus()) -+ vsched_num_online_vcpus(task_vsched(p))) - #else - #define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) -@@ -203,6 +206,7 @@ struct prio_array { - * (such as the load balancing or the thread migration code), lock - * acquire operations must be ordered by ascending &runqueue. 
- */ -+typedef struct vcpu_info *vcpu_t; - struct runqueue { - spinlock_t lock; - -@@ -217,7 +221,7 @@ struct runqueue { - unsigned long long nr_switches; - unsigned long expired_timestamp, nr_uninterruptible; - unsigned long long timestamp_last_tick; -- task_t *curr, *idle; -+ task_t *curr; - struct mm_struct *prev_mm; - prio_array_t *active, *expired, arrays[2]; - int best_expired_prio; -@@ -225,35 +229,623 @@ struct runqueue { - - #ifdef CONFIG_SMP - struct sched_domain *sd; -- - /* For active balancing */ - int active_balance; -- int push_cpu; -+#endif -+ vcpu_t push_cpu; - - task_t *migration_thread; - struct list_head migration_queue; --#endif - }; - --static DEFINE_PER_CPU(struct runqueue, runqueues); -+/* VCPU scheduler state description */ -+struct vcpu_info; -+struct vcpu_scheduler { -+ struct list_head idle_list; -+ struct list_head active_list; -+ struct list_head running_list; -+#ifdef CONFIG_FAIRSCHED -+ struct fairsched_node *node; -+#endif -+ struct vcpu_info *vcpu[NR_CPUS]; -+ int id; -+ cpumask_t vcpu_online_map, vcpu_running_map; -+ cpumask_t pcpu_running_map; -+ int num_online_vcpus; -+} ____cacheline_maxaligned_in_smp; -+ -+/* virtual CPU description */ -+struct vcpu_info { -+ struct runqueue rq; -+#ifdef CONFIG_SCHED_VCPU -+ unsigned active : 1, -+ running : 1; -+ struct list_head list; -+ struct vcpu_scheduler *vsched; -+ int last_pcpu; -+ u32 start_time; -+#endif -+ int id; -+} ____cacheline_maxaligned_in_smp; -+ -+/* physical CPU description */ -+struct pcpu_info { -+ struct vcpu_scheduler *vsched; -+ struct vcpu_info *vcpu; -+ task_t *idle; -+#ifdef CONFIG_SMP -+ struct sched_domain *sd; -+#endif -+ int id; -+} ____cacheline_maxaligned_in_smp; -+ -+struct pcpu_info pcpu_info[NR_CPUS]; -+ -+#define pcpu(nr) (&pcpu_info[nr]) -+#define this_pcpu() (pcpu(smp_processor_id())) - - #define for_each_domain(cpu, domain) \ -- for (domain = cpu_rq(cpu)->sd; domain; domain = domain->parent) -+ for (domain = vcpu_rq(cpu)->sd; domain; domain = 
domain->parent) -+ -+#ifdef CONFIG_SCHED_VCPU -+ -+u32 vcpu_sched_timeslice = 5; -+u32 vcpu_timeslice = 0; -+EXPORT_SYMBOL(vcpu_sched_timeslice); -+EXPORT_SYMBOL(vcpu_timeslice); -+ -+extern spinlock_t fairsched_lock; -+static struct vcpu_scheduler default_vsched, idle_vsched; -+static struct vcpu_info boot_vcpu; -+ -+#define vsched_default_vsched() (&default_vsched) -+#define vsched_default_vcpu(id) (default_vsched.vcpu[id]) -+ -+/* -+ * All macroses below could be used without locks, if there is no -+ * strict ordering requirements, because we assume, that: -+ * -+ * 1. VCPU could not disappear "on the fly" (FIXME) -+ * -+ * 2. p->vsched access is atomic. -+ */ -+ -+#define task_vsched(tsk) ((tsk)->vsched) -+#define this_vsched() (task_vsched(current)) -+ -+#define vsched_vcpu(vsched, id) ((vsched)->vcpu[id]) -+#define this_vcpu() (task_vcpu(current)) -+#define task_vcpu(p) ((p)->vcpu) -+ -+#define vsched_id(vsched) ((vsched)->id) -+#define vsched_vcpu_online_map(vsched) ((vsched)->vcpu_online_map) -+#define vsched_num_online_vcpus(vsched) ((vsched)->num_online_vcpus) -+#define vsched_pcpu_running_map(vsched) ((vsched)->pcpu_running_map) -+ -+#define vcpu_vsched(vcpu) ((vcpu)->vsched) -+#define vcpu_last_pcpu(vcpu) ((vcpu)->last_pcpu) -+#define vcpu_isset(vcpu, mask) (cpu_isset((vcpu)->id, mask)) -+#define vcpu_is_offline(vcpu) (!vcpu_isset(vcpu, \ -+ vcpu_vsched(vcpu)->vcpu_online_map)) -+ -+static int __add_vcpu(struct vcpu_scheduler *vsched, int id); -+ -+#else /* CONFIG_SCHED_VCPU */ -+ -+static DEFINE_PER_CPU(struct vcpu_info, vcpu_info); -+ -+#define task_vsched(p) NULL -+#define this_vcpu() (task_vcpu(current)) -+#define task_vcpu(p) (vcpu(task_cpu(p))) -+ -+#define vsched_vcpu(sched, id) (vcpu(id)) -+#define vsched_id(vsched) 0 -+#define vsched_default_vsched() NULL -+#define vsched_default_vcpu(id) (vcpu(id)) -+ -+#define vsched_vcpu_online_map(vsched) (cpu_online_map) -+#define vsched_num_online_vcpus(vsched) (num_online_cpus()) -+#define 
vsched_pcpu_running_map(vsched) (cpu_online_map) -+ -+#define vcpu(id) (&per_cpu(vcpu_info, id)) -+ -+#define vcpu_vsched(vcpu) NULL -+#define vcpu_last_pcpu(vcpu) ((vcpu)->id) -+#define vcpu_isset(vcpu, mask) (cpu_isset((vcpu)->id, mask)) -+#define vcpu_is_offline(vcpu) (cpu_is_offline((vcpu)->id)) -+ -+#endif /* CONFIG_SCHED_VCPU */ -+ -+#define this_rq() (vcpu_rq(this_vcpu())) -+#define task_rq(p) (vcpu_rq(task_vcpu(p))) -+#define vcpu_rq(vcpu) (&(vcpu)->rq) -+#define get_vcpu() ({ preempt_disable(); this_vcpu(); }) -+#define put_vcpu() ({ put_cpu(); }) -+#define rq_vcpu(__rq) (container_of((__rq), struct vcpu_info, rq)) -+ -+task_t *idle_task(int cpu) -+{ -+ return pcpu(cpu)->idle; -+} -+ -+#ifdef CONFIG_SMP -+static inline void update_rq_cpu_load(runqueue_t *rq) -+{ -+ unsigned long old_load, this_load; -+ -+ if (rq->nr_running == 0) { -+ rq->cpu_load = 0; -+ return; -+ } -+ -+ old_load = rq->cpu_load; -+ this_load = rq->nr_running * SCHED_LOAD_SCALE; -+ /* -+ * Round up the averaging division if load is increasing. This -+ * prevents us from getting stuck on 9 if the load is 10, for -+ * example. -+ */ -+ if (this_load > old_load) -+ old_load++; -+ rq->cpu_load = (old_load + this_load) / 2; -+} -+#else /* CONFIG_SMP */ -+static inline void update_rq_cpu_load(runqueue_t *rq) -+{ -+} -+#endif /* CONFIG_SMP */ -+ -+#ifdef CONFIG_SCHED_VCPU -+ -+void fastcall vsched_cpu_online_map(struct vcpu_scheduler *vsched, -+ cpumask_t *mask) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&fairsched_lock, flags); -+ *mask = vsched->vcpu_online_map; -+ spin_unlock_irqrestore(&fairsched_lock, flags); -+} -+ -+static inline void set_task_vsched(task_t *p, struct vcpu_scheduler *vsched) -+{ -+ /* NOTE: set_task_cpu() is required after every set_task_vsched()! 
*/ -+ p->vsched = vsched; -+ p->vsched_id = vsched_id(vsched); -+} -+ -+inline void set_task_cpu(struct task_struct *p, unsigned int vcpu_id) -+{ -+ p->vcpu = vsched_vcpu(task_vsched(p), vcpu_id); -+ p->vcpu_id = vcpu_id; -+} -+ -+static inline void set_task_vcpu(struct task_struct *p, vcpu_t vcpu) -+{ -+ p->vcpu = vcpu; -+ p->vcpu_id = vcpu->id; -+} -+ -+ -+#ifdef CONFIG_VE -+#define cycles_after(a, b) ((long long)(b) - (long long)(a) < 0) -+ -+cycles_t ve_sched_get_idle_time(struct ve_struct *ve, int cpu) -+{ -+ struct ve_cpu_stats *ve_stat; -+ unsigned v; -+ cycles_t strt, ret, cycles; -+ -+ ve_stat = VE_CPU_STATS(ve, cpu); -+ do { -+ v = read_seqcount_begin(&ve_stat->stat_lock); -+ ret = ve_stat->idle_time; -+ strt = ve_stat->strt_idle_time; -+ if (strt && nr_uninterruptible_ve(ve) == 0) { -+ cycles = get_cycles(); -+ if (cycles_after(cycles, strt)) -+ ret += cycles - strt; -+ } -+ } while (read_seqcount_retry(&ve_stat->stat_lock, v)); -+ return ret; -+} -+ -+cycles_t ve_sched_get_iowait_time(struct ve_struct *ve, int cpu) -+{ -+ struct ve_cpu_stats *ve_stat; -+ unsigned v; -+ cycles_t strt, ret, cycles; -+ -+ ve_stat = VE_CPU_STATS(ve, cpu); -+ do { -+ v = read_seqcount_begin(&ve_stat->stat_lock); -+ ret = ve_stat->iowait_time; -+ strt = ve_stat->strt_idle_time; -+ if (strt && nr_iowait_ve(ve) > 0) { -+ cycles = get_cycles(); -+ if (cycles_after(cycles, strt)) -+ ret += cycles - strt; -+ } -+ } while (read_seqcount_retry(&ve_stat->stat_lock, v)); -+ return ret; -+} -+ -+static inline void vcpu_save_ve_idle(struct ve_struct *ve, -+ unsigned int vcpu, cycles_t cycles) -+{ -+ struct ve_cpu_stats *ve_stat; -+ -+ ve_stat = VE_CPU_STATS(ve, vcpu); -+ -+ write_seqcount_begin(&ve_stat->stat_lock); -+ if (ve_stat->strt_idle_time) { -+ if (cycles_after(cycles, ve_stat->strt_idle_time)) { -+ if (nr_iowait_ve(ve) == 0) -+ ve_stat->idle_time += cycles - -+ ve_stat->strt_idle_time; -+ else -+ ve_stat->iowait_time += cycles - -+ ve_stat->strt_idle_time; -+ } -+ 
ve_stat->strt_idle_time = 0; -+ } -+ write_seqcount_end(&ve_stat->stat_lock); -+} -+ -+static inline void vcpu_strt_ve_idle(struct ve_struct *ve, -+ unsigned int vcpu, cycles_t cycles) -+{ -+ struct ve_cpu_stats *ve_stat; -+ -+ ve_stat = VE_CPU_STATS(ve, vcpu); -+ -+ write_seqcount_begin(&ve_stat->stat_lock); -+ ve_stat->strt_idle_time = cycles; -+ write_seqcount_end(&ve_stat->stat_lock); -+} -+ -+#else -+#define vcpu_save_ve_idle(ve, vcpu, cycles) do { } while (0) -+#define vcpu_strt_ve_idle(ve, vcpu, cycles) do { } while (0) -+#endif -+ -+/* this is called when rq->nr_running changes from 0 to 1 */ -+static void vcpu_attach(runqueue_t *rq) -+{ -+ struct vcpu_scheduler *vsched; -+ vcpu_t vcpu; -+ -+ vcpu = rq_vcpu(rq); -+ vsched = vcpu_vsched(vcpu); -+ -+ BUG_ON(vcpu->active); -+ spin_lock(&fairsched_lock); -+ vcpu->active = 1; -+ if (!vcpu->running) -+ list_move_tail(&vcpu->list, &vsched->active_list); -+ -+ fairsched_incrun(vsched->node); -+ spin_unlock(&fairsched_lock); -+} -+ -+/* this is called when rq->nr_running changes from 1 to 0 */ -+static void vcpu_detach(runqueue_t *rq) -+{ -+ struct vcpu_scheduler *vsched; -+ vcpu_t vcpu; -+ -+ vcpu = rq_vcpu(rq); -+ vsched = vcpu_vsched(vcpu); -+ BUG_ON(!vcpu->active); -+ -+ spin_lock(&fairsched_lock); -+ fairsched_decrun(vsched->node); -+ -+ vcpu->active = 0; -+ if (!vcpu->running) -+ list_move_tail(&vcpu->list, &vsched->idle_list); -+ spin_unlock(&fairsched_lock); -+} -+ -+static inline void __vcpu_get(vcpu_t vcpu) -+{ -+ struct pcpu_info *pcpu; -+ struct vcpu_scheduler *vsched; -+ -+ BUG_ON(!this_vcpu()->running); -+ -+ pcpu = this_pcpu(); -+ vsched = vcpu_vsched(vcpu); -+ -+ pcpu->vcpu = vcpu; -+ pcpu->vsched = vsched; -+ -+ fairsched_inccpu(vsched->node); -+ -+ list_move_tail(&vcpu->list, &vsched->running_list); -+ vcpu->start_time = jiffies; -+ vcpu->last_pcpu = pcpu->id; -+ vcpu->running = 1; -+ __set_bit(vcpu->id, vsched->vcpu_running_map.bits); -+ __set_bit(pcpu->id, vsched->pcpu_running_map.bits); -+#ifdef 
CONFIG_SMP -+ vcpu_rq(vcpu)->sd = pcpu->sd; -+#endif -+} -+ -+static void vcpu_put(vcpu_t vcpu) -+{ -+ struct vcpu_scheduler *vsched; -+ struct pcpu_info *cur_pcpu; -+ runqueue_t *rq; - --#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) --#define this_rq() (&__get_cpu_var(runqueues)) --#define task_rq(p) cpu_rq(task_cpu(p)) --#define cpu_curr(cpu) (cpu_rq(cpu)->curr) -+ vsched = vcpu_vsched(vcpu); -+ rq = vcpu_rq(vcpu); -+ cur_pcpu = this_pcpu(); -+ -+ BUG_ON(!vcpu->running); -+ -+ spin_lock(&fairsched_lock); -+ vcpu->running = 0; -+ list_move_tail(&vcpu->list, -+ vcpu->active ? &vsched->active_list : &vsched->idle_list); -+ fairsched_deccpu(vsched->node); -+ __clear_bit(vcpu->id, vsched->vcpu_running_map.bits); -+ if (vsched != this_vsched()) -+ __clear_bit(cur_pcpu->id, vsched->pcpu_running_map.bits); -+ -+ if (!vcpu->active) -+ rq->expired_timestamp = 0; -+ /* from this point task_running(prev_rq, prev) will be 0 */ -+ rq->curr = cur_pcpu->idle; -+ update_rq_cpu_load(rq); -+ spin_unlock(&fairsched_lock); -+} -+ -+static vcpu_t schedule_vcpu(vcpu_t cur_vcpu, cycles_t cycles) -+{ -+ struct vcpu_scheduler *vsched; -+ vcpu_t vcpu; -+ runqueue_t *rq; -+#ifdef CONFIG_FAIRSCHED -+ struct fairsched_node *node, *nodec; -+ -+ nodec = vcpu_vsched(cur_vcpu)->node; -+ node = nodec; -+#endif -+ -+ BUG_ON(!cur_vcpu->running); -+restart: -+ spin_lock(&fairsched_lock); -+#ifdef CONFIG_FAIRSCHED -+ node = fairsched_schedule(node, nodec, -+ cur_vcpu->active, -+ cycles); -+ if (unlikely(node == NULL)) -+ goto idle; -+ -+ vsched = node->vsched; -+#else -+ vsched = &default_vsched; -+#endif -+ /* FIXME: optimize vcpu switching, maybe we do not need to call -+ fairsched_schedule() at all if vcpu is still active and too -+ little time have passed so far */ -+ if (cur_vcpu->vsched == vsched && cur_vcpu->active && -+ jiffies - cur_vcpu->start_time < msecs_to_jiffies(vcpu_sched_timeslice)) { -+ vcpu = cur_vcpu; -+ goto done; -+ } -+ -+ if (list_empty(&vsched->active_list)) { -+ /* nothing 
except for this cpu can be scheduled */ -+ if (likely(cur_vcpu->vsched == vsched && cur_vcpu->active)) { -+ /* -+ * Current vcpu is the one we need. We have not -+ * put it yet, so it's not on the active_list. -+ */ -+ vcpu = cur_vcpu; -+ goto done; -+ } else -+ goto none; -+ } -+ -+ /* select vcpu and add to running list */ -+ vcpu = list_entry(vsched->active_list.next, struct vcpu_info, list); -+ __vcpu_get(vcpu); -+done: -+ spin_unlock(&fairsched_lock); -+ -+ rq = vcpu_rq(vcpu); -+ if (unlikely(vcpu != cur_vcpu)) { -+ spin_unlock(&vcpu_rq(cur_vcpu)->lock); -+ spin_lock(&rq->lock); -+ if (unlikely(!rq->nr_running)) { -+ /* race with balancing? */ -+ spin_unlock(&rq->lock); -+ vcpu_put(vcpu); -+ spin_lock(&vcpu_rq(cur_vcpu)->lock); -+ goto restart; -+ } -+ } -+ BUG_ON(!rq->nr_running); -+ return vcpu; -+ -+none: -+#ifdef CONFIG_FAIRSCHED -+ spin_unlock(&fairsched_lock); -+ -+ /* fairsched doesn't schedule more CPUs than we have active */ -+ BUG_ON(1); -+#else -+ goto idle; -+#endif -+ -+idle: -+ vcpu = task_vcpu(this_pcpu()->idle); -+ __vcpu_get(vcpu); -+ spin_unlock(&fairsched_lock); -+ spin_unlock(&vcpu_rq(cur_vcpu)->lock); -+ -+ spin_lock(&vcpu_rq(vcpu)->lock); -+ return vcpu; -+} -+ -+#else /* CONFIG_SCHED_VCPU */ -+ -+#define set_task_vsched(task, vsched) do { } while (0) -+ -+static inline void vcpu_attach(runqueue_t *rq) -+{ -+} -+ -+static inline void vcpu_detach(runqueue_t *rq) -+{ -+} -+ -+static inline void vcpu_put(vcpu_t vcpu) -+{ -+} -+ -+static inline vcpu_t schedule_vcpu(vcpu_t prev_vcpu, cycles_t cycles) -+{ -+ return prev_vcpu; -+} -+ -+static inline void set_task_vcpu(struct task_struct *p, vcpu_t vcpu) -+{ -+ set_task_pcpu(p, vcpu->id); -+} -+ -+#endif /* CONFIG_SCHED_VCPU */ -+ -+int vcpu_online(int cpu) -+{ -+ return cpu_isset(cpu, vsched_vcpu_online_map(this_vsched())); -+} - - /* - * Default context-switch locking: - */ - #ifndef prepare_arch_switch - # define prepare_arch_switch(rq, next) do { } while (0) --# define finish_arch_switch(rq, 
next) spin_unlock_irq(&(rq)->lock) -+# define finish_arch_switch(rq, next) spin_unlock(&(rq)->lock) - # define task_running(rq, p) ((rq)->curr == (p)) - #endif - -+struct kernel_stat_glob kstat_glob; -+spinlock_t kstat_glb_lock = SPIN_LOCK_UNLOCKED; -+EXPORT_SYMBOL(kstat_glob); -+EXPORT_SYMBOL(kstat_glb_lock); -+ -+#ifdef CONFIG_VE -+ -+#define ve_nr_running_inc(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_running++; \ -+ } while(0) -+#define ve_nr_running_dec(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_running--; \ -+ } while(0) -+#define ve_nr_iowait_inc(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_iowait++; \ -+ } while(0) -+#define ve_nr_iowait_dec(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_iowait--; \ -+ } while(0) -+#define ve_nr_unint_inc(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_unint++; \ -+ } while(0) -+#define ve_nr_unint_dec(env, cpu) \ -+ do { \ -+ VE_CPU_STATS((env), (cpu))->nr_unint--; \ -+ } while(0) -+ -+void ve_sched_attach(struct ve_struct *envid) -+{ -+ struct task_struct *tsk; -+ unsigned int vcpu; -+ -+ tsk = current; -+ preempt_disable(); -+ vcpu = task_cpu(tsk); -+ ve_nr_running_dec(VE_TASK_INFO(tsk)->owner_env, vcpu); -+ ve_nr_running_inc(envid, vcpu); -+ preempt_enable(); -+} -+EXPORT_SYMBOL(ve_sched_attach); -+ -+#else -+ -+#define ve_nr_running_inc(env, cpu) do { } while(0) -+#define ve_nr_running_dec(env, cpu) do { } while(0) -+#define ve_nr_iowait_inc(env, cpu) do { } while(0) -+#define ve_nr_iowait_dec(env, cpu) do { } while(0) -+#define ve_nr_unint_inc(env, cpu) do { } while(0) -+#define ve_nr_unint_dec(env, cpu) do { } while(0) -+ -+#endif -+ -+struct task_nrs_struct { -+ long nr_running; -+ long nr_uninterruptible; -+ long nr_stopped; -+ long nr_sleeping; -+ long nr_iowait; -+ long long nr_switches; -+} ____cacheline_aligned_in_smp; -+ -+static struct task_nrs_struct glob_tasks_nrs[NR_CPUS]; -+unsigned long nr_zombie = 0; /* protected by tasklist_lock */ -+unsigned long 
nr_dead = 0; -+EXPORT_SYMBOL(nr_zombie); -+EXPORT_SYMBOL(nr_dead); -+ -+#define nr_running_inc(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_running++; \ -+ ve_nr_running_inc(ve, vcpu); \ -+ } while (0) -+#define nr_running_dec(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_running--; \ -+ ve_nr_running_dec(ve, vcpu); \ -+ } while (0) -+ -+#define nr_unint_inc(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_uninterruptible++; \ -+ ve_nr_unint_inc(ve, vcpu); \ -+ } while (0) -+#define nr_unint_dec(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_uninterruptible--; \ -+ ve_nr_unint_dec(ve, vcpu); \ -+ } while (0) -+ -+#define nr_iowait_inc(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_iowait++; \ -+ ve_nr_iowait_inc(ve, vcpu); \ -+ } while (0) -+#define nr_iowait_dec(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_iowait--; \ -+ ve_nr_iowait_dec(ve, vcpu); \ -+ } while (0) -+ -+#define nr_stopped_inc(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_stopped++; \ -+ } while (0) -+#define nr_stopped_dec(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_stopped--; \ -+ } while (0) -+ -+#define nr_sleeping_inc(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_sleeping++; \ -+ } while (0) -+#define nr_sleeping_dec(cpu, vcpu, ve) do { \ -+ glob_tasks_nrs[cpu].nr_sleeping--; \ -+ } while (0) -+ - /* - * task_rq_lock - lock the runqueue a given task resides on and disable - * interrupts. Note the ordering: we can safely lookup the task_rq without -@@ -361,13 +953,39 @@ static int effective_prio(task_t *p) - return prio; - } - -+static inline void write_wakeup_stamp(struct task_struct *p, cycles_t cyc) -+{ -+ struct ve_task_info *ti; -+ -+ ti = VE_TASK_INFO(p); -+ write_seqcount_begin(&ti->wakeup_lock); -+ ti->wakeup_stamp = cyc; -+ write_seqcount_end(&ti->wakeup_lock); -+} -+ - /* - * __activate_task - move a task to the runqueue. 
- */ - static inline void __activate_task(task_t *p, runqueue_t *rq) - { -+ cycles_t cycles; -+ unsigned int vcpu; -+ struct ve_struct *ve; -+ -+ cycles = get_cycles(); -+ vcpu = task_cpu(p); -+ ve = VE_TASK_INFO(p)->owner_env; -+ -+ write_wakeup_stamp(p, cycles); -+ VE_TASK_INFO(p)->sleep_time += cycles; -+ nr_running_inc(smp_processor_id(), vcpu, ve); -+ - enqueue_task(p, rq->active); - rq->nr_running++; -+ if (rq->nr_running == 1) { -+ vcpu_save_ve_idle(ve, vcpu, cycles); -+ vcpu_attach(rq); -+ } - } - - /* -@@ -507,11 +1125,33 @@ static void activate_task(task_t *p, run - */ - static void deactivate_task(struct task_struct *p, runqueue_t *rq) - { -+ cycles_t cycles; -+ unsigned int cpu, vcpu; -+ struct ve_struct *ve; -+ -+ cycles = get_cycles(); -+ cpu = smp_processor_id(); -+ vcpu = rq_vcpu(rq)->id; -+ ve = VE_TASK_INFO(p)->owner_env; -+ -+ VE_TASK_INFO(p)->sleep_time -= cycles; - rq->nr_running--; -- if (p->state == TASK_UNINTERRUPTIBLE) -+ nr_running_dec(cpu, vcpu, ve); -+ if (p->state == TASK_UNINTERRUPTIBLE) { - rq->nr_uninterruptible++; -+ nr_unint_inc(cpu, vcpu, ve); -+ } -+ if (p->state == TASK_INTERRUPTIBLE) -+ nr_sleeping_inc(cpu, vcpu, ve); -+ if (p->state == TASK_STOPPED) -+ nr_stopped_inc(cpu, vcpu, ve); -+ /* nr_zombie is calced in exit.c */ - dequeue_task(p, p->array); - p->array = NULL; -+ if (rq->nr_running == 0) { -+ vcpu_strt_ve_idle(ve, vcpu, cycles); -+ vcpu_detach(rq); -+ } - } - - /* -@@ -522,6 +1162,7 @@ static void deactivate_task(struct task_ - * the target CPU. 
- */ - #ifdef CONFIG_SMP -+/* FIXME: need to add vsched arg */ - static void resched_task(task_t *p) - { - int need_resched, nrpolling; -@@ -532,8 +1173,9 @@ static void resched_task(task_t *p) - need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED); - nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); - -- if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id())) -- smp_send_reschedule(task_cpu(p)); -+ /* FIXME: think over */ -+ if (!need_resched && !nrpolling && (task_pcpu(p) != smp_processor_id())) -+ smp_send_reschedule(task_pcpu(p)); - preempt_enable(); - } - #else -@@ -549,10 +1191,29 @@ static inline void resched_task(task_t * - */ - inline int task_curr(const task_t *p) - { -- return cpu_curr(task_cpu(p)) == p; -+ return task_rq(p)->curr == p; -+} -+ -+/** -+ * idle_cpu - is a given cpu idle currently? -+ * @cpu: the processor in question. -+ */ -+inline int idle_cpu(int cpu) -+{ -+ return pcpu(cpu)->vsched == &idle_vsched; -+} -+ -+EXPORT_SYMBOL_GPL(idle_cpu); -+ -+static inline int idle_vcpu(vcpu_t cpu) -+{ -+#ifdef CONFIG_SCHED_VCPU -+ return !cpu->active; -+#else -+ return idle_cpu(cpu->id); -+#endif - } - --#ifdef CONFIG_SMP - enum request_type { - REQ_MOVE_TASK, - REQ_SET_DOMAIN, -@@ -564,7 +1225,7 @@ typedef struct { - - /* For REQ_MOVE_TASK */ - task_t *task; -- int dest_cpu; -+ vcpu_t dest_cpu; - - /* For REQ_SET_DOMAIN */ - struct sched_domain *sd; -@@ -576,7 +1237,7 @@ typedef struct { - * The task's runqueue lock must be held. - * Returns true if you have to wait for migration thread. - */ --static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req) -+static int migrate_task(task_t *p, vcpu_t dest_cpu, migration_req_t *req) - { - runqueue_t *rq = task_rq(p); - -@@ -584,8 +1245,13 @@ static int migrate_task(task_t *p, int d - * If the task is not on a runqueue (and not running), then - * it is sufficient to simply update the task's cpu field. 
- */ -+#ifdef CONFIG_SCHED_VCPU -+ BUG_ON(task_vsched(p) == &idle_vsched); -+ BUG_ON(vcpu_vsched(dest_cpu) == &idle_vsched); -+#endif - if (!p->array && !task_running(rq, p)) { -- set_task_cpu(p, dest_cpu); -+ set_task_vsched(p, vcpu_vsched(dest_cpu)); -+ set_task_vcpu(p, dest_cpu); - return 0; - } - -@@ -597,6 +1263,7 @@ static int migrate_task(task_t *p, int d - return 1; - } - -+#ifdef CONFIG_SMP - /* - * wait_task_inactive - wait for a thread to unschedule. - * -@@ -615,7 +1282,12 @@ void wait_task_inactive(task_t * p) - repeat: - rq = task_rq_lock(p, &flags); - /* Must be off runqueue entirely, not preempted. */ -- if (unlikely(p->array)) { -+ /* -+ * VCPU: we need to check task_running() here, since -+ * we drop rq->lock in the middle of schedule() and task -+ * can be deactivated, but still running until it calls vcpu_put() -+ */ -+ if (unlikely(p->array) || task_running(rq, p)) { - /* If it's preempted, we yield. It could be a while. */ - preempted = !task_running(rq, p); - task_rq_unlock(rq, &flags); -@@ -639,8 +1311,11 @@ void kick_process(task_t *p) - int cpu; - - preempt_disable(); -- cpu = task_cpu(p); -+ cpu = task_pcpu(p); - if ((cpu != smp_processor_id()) && task_curr(p)) -+ /* FIXME: ??? think over */ -+ /* should add something like get_pcpu(cpu)->vcpu->id == task_cpu(p), -+ but with serialization of vcpu access... */ - smp_send_reschedule(cpu); - preempt_enable(); - } -@@ -653,9 +1328,9 @@ EXPORT_SYMBOL_GPL(kick_process); - * We want to under-estimate the load of migration sources, to - * balance conservatively. 
- */ --static inline unsigned long source_load(int cpu) -+static inline unsigned long source_load(vcpu_t cpu) - { -- runqueue_t *rq = cpu_rq(cpu); -+ runqueue_t *rq = vcpu_rq(cpu); - unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE; - - return min(rq->cpu_load, load_now); -@@ -664,9 +1339,9 @@ static inline unsigned long source_load( - /* - * Return a high guess at the load of a migration-target cpu - */ --static inline unsigned long target_load(int cpu) -+static inline unsigned long target_load(vcpu_t cpu) - { -- runqueue_t *rq = cpu_rq(cpu); -+ runqueue_t *rq = vcpu_rq(cpu); - unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE; - - return max(rq->cpu_load, load_now); -@@ -682,32 +1357,38 @@ static inline unsigned long target_load( - * Returns the CPU we should wake onto. - */ - #if defined(ARCH_HAS_SCHED_WAKE_IDLE) --static int wake_idle(int cpu, task_t *p) -+static vcpu_t wake_idle(vcpu_t cpu, task_t *p) - { -- cpumask_t tmp; -- runqueue_t *rq = cpu_rq(cpu); -+ cpumask_t tmp, vtmp; -+ runqueue_t *rq = vcpu_rq(cpu); - struct sched_domain *sd; -+ struct vcpu_scheduler *vsched; - int i; - -- if (idle_cpu(cpu)) -+ if (idle_vcpu(cpu)) - return cpu; - - sd = rq->sd; - if (!(sd->flags & SD_WAKE_IDLE)) - return cpu; - -+ vsched = vcpu_vsched(cpu); - cpus_and(tmp, sd->span, cpu_online_map); -- cpus_and(tmp, tmp, p->cpus_allowed); -+ cpus_and(vtmp, vsched_vcpu_online_map(vsched), p->cpus_allowed); - -- for_each_cpu_mask(i, tmp) { -- if (idle_cpu(i)) -- return i; -+ for_each_cpu_mask(i, vtmp) { -+ vcpu_t vcpu; -+ vcpu = vsched_vcpu(vsched, i); -+ if (!cpu_isset(vcpu_last_pcpu(vcpu), tmp)) -+ continue; -+ if (idle_vcpu(vcpu)) -+ return vcpu; - } - - return cpu; - } - #else --static inline int wake_idle(int cpu, task_t *p) -+static inline vcpu_t wake_idle(vcpu_t cpu, task_t *p) - { - return cpu; - } -@@ -729,15 +1410,17 @@ static inline int wake_idle(int cpu, tas - */ - static int try_to_wake_up(task_t * p, unsigned int state, int sync) - { -- int cpu, 
this_cpu, success = 0; -+ vcpu_t cpu, this_cpu; -+ int success = 0; - unsigned long flags; - long old_state; - runqueue_t *rq; - #ifdef CONFIG_SMP - unsigned long load, this_load; - struct sched_domain *sd; -- int new_cpu; -+ vcpu_t new_cpu; - #endif -+ cpu = NULL; - - rq = task_rq_lock(p, &flags); - old_state = p->state; -@@ -747,8 +1430,8 @@ static int try_to_wake_up(task_t * p, un - if (p->array) - goto out_running; - -- cpu = task_cpu(p); -- this_cpu = smp_processor_id(); -+ cpu = task_vcpu(p); -+ this_cpu = this_vcpu(); - - #ifdef CONFIG_SMP - if (unlikely(task_running(rq, p))) -@@ -756,7 +1439,10 @@ static int try_to_wake_up(task_t * p, un - - new_cpu = cpu; - -- if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed))) -+ /* FIXME: add vsched->last_vcpu array to optimize wakeups in different vsched */ -+ if (vcpu_vsched(cpu) != vcpu_vsched(this_cpu)) -+ goto out_set_cpu; -+ if (cpu == this_cpu || unlikely(!vcpu_isset(this_cpu, p->cpus_allowed))) - goto out_set_cpu; - - load = source_load(cpu); -@@ -795,7 +1481,7 @@ static int try_to_wake_up(task_t * p, un - * Now sd has SD_WAKE_AFFINE and p is cache cold in sd - * or sd has SD_WAKE_BALANCE and there is an imbalance - */ -- if (cpu_isset(cpu, sd->span)) -+ if (cpu_isset(vcpu_last_pcpu(cpu), sd->span)) - goto out_set_cpu; - } - } -@@ -803,8 +1489,8 @@ static int try_to_wake_up(task_t * p, un - new_cpu = cpu; /* Could not wake to this_cpu. 
Wake to cpu instead */ - out_set_cpu: - new_cpu = wake_idle(new_cpu, p); -- if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) { -- set_task_cpu(p, new_cpu); -+ if (new_cpu != cpu && vcpu_isset(new_cpu, p->cpus_allowed)) { -+ set_task_vcpu(p, new_cpu); - task_rq_unlock(rq, &flags); - /* might preempt at this point */ - rq = task_rq_lock(p, &flags); -@@ -814,20 +1500,28 @@ out_set_cpu: - if (p->array) - goto out_running; - -- this_cpu = smp_processor_id(); -- cpu = task_cpu(p); -+ this_cpu = this_vcpu(); -+ cpu = task_vcpu(p); - } - - out_activate: - #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { - rq->nr_uninterruptible--; -+ nr_unint_dec(smp_processor_id(), task_cpu(p), -+ VE_TASK_INFO(p)->owner_env); - /* - * Tasks on involuntary sleep don't earn - * sleep_avg beyond just interactive state. - */ - p->activated = -1; - } -+ if (old_state == TASK_INTERRUPTIBLE) -+ nr_sleeping_dec(smp_processor_id(), task_cpu(p), -+ VE_TASK_INFO(p)->owner_env); -+ if (old_state == TASK_STOPPED) -+ nr_stopped_dec(smp_processor_id(), task_cpu(p), -+ VE_TASK_INFO(p)->owner_env); - - /* - * Sync wakeups (i.e. 
those types of wakeups where the waker -@@ -866,6 +1560,37 @@ int fastcall wake_up_state(task_t *p, un - } - - /* -+ * init is special, it is forked from swapper (idle_vsched) and should -+ * belong to default_vsched, so we have to change it's vsched/fairsched manually -+ */ -+void wake_up_init(void) -+{ -+ task_t *p; -+ runqueue_t *rq; -+ unsigned long flags; -+ -+ p = find_task_by_pid_all(1); -+ BUG_ON(p == NULL || p->state != TASK_STOPPED); -+ -+ /* we should change both fairsched node and vsched here */ -+ set_task_vsched(p, &default_vsched); -+ set_task_cpu(p, 0); -+ -+ /* -+ * can't call wake_up_forked_thread() directly here, -+ * since it assumes that a child belongs to the same vsched -+ */ -+ p->state = TASK_RUNNING; -+ p->sleep_avg = 0; -+ p->interactive_credit = 0; -+ p->prio = effective_prio(p); -+ -+ rq = task_rq_lock(p, &flags); -+ __activate_task(p, rq); -+ task_rq_unlock(rq, &flags); -+} -+ -+/* - * Perform scheduler related setup for a newly forked process p. - * p is forked by current. 
- */ -@@ -904,6 +1629,7 @@ void fastcall sched_fork(task_t *p) - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); -+ VE_TASK_INFO(p)->sleep_time -= get_cycles(); /*cosmetic: sleep till wakeup below*/ - if (!current->time_slice) { - /* - * This case is rare, it happens when the parent has only -@@ -931,6 +1657,7 @@ void fastcall wake_up_forked_process(tas - runqueue_t *rq = task_rq_lock(current, &flags); - - BUG_ON(p->state != TASK_RUNNING); -+ BUG_ON(task_vsched(current) != task_vsched(p)); - - /* - * We decrease the sleep average of forking parents -@@ -946,7 +1673,8 @@ void fastcall wake_up_forked_process(tas - p->interactive_credit = 0; - - p->prio = effective_prio(p); -- set_task_cpu(p, smp_processor_id()); -+ set_task_pcpu(p, task_pcpu(current)); -+ set_task_vcpu(p, this_vcpu()); - - if (unlikely(!current->array)) - __activate_task(p, rq); -@@ -956,6 +1684,8 @@ void fastcall wake_up_forked_process(tas - p->array = current->array; - p->array->nr_active++; - rq->nr_running++; -+ nr_running_inc(smp_processor_id(), task_cpu(p), -+ VE_TASK_INFO(p)->owner_env); - } - task_rq_unlock(rq, &flags); - } -@@ -974,18 +1704,16 @@ void fastcall sched_exit(task_t * p) - unsigned long flags; - runqueue_t *rq; - -- local_irq_save(flags); -- if (p->first_time_slice) { -- p->parent->time_slice += p->time_slice; -- if (unlikely(p->parent->time_slice > MAX_TIMESLICE)) -- p->parent->time_slice = MAX_TIMESLICE; -- } -- local_irq_restore(flags); - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. 
- */ - rq = task_rq_lock(p->parent, &flags); -+ if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) { -+ p->parent->time_slice += p->time_slice; -+ if (unlikely(p->parent->time_slice > MAX_TIMESLICE)) -+ p->parent->time_slice = MAX_TIMESLICE; -+ } - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / -@@ -1008,25 +1736,39 @@ void fastcall sched_exit(task_t * p) - */ - static void finish_task_switch(task_t *prev) - { -- runqueue_t *rq = this_rq(); -- struct mm_struct *mm = rq->prev_mm; -+ runqueue_t *rq; -+ struct mm_struct *mm; - unsigned long prev_task_flags; -+ vcpu_t prev_vcpu, vcpu; - -+ prev_vcpu = task_vcpu(prev); -+ vcpu = this_vcpu(); -+ rq = vcpu_rq(vcpu); -+ mm = rq->prev_mm; - rq->prev_mm = NULL; - - /* - * A task struct has one reference for the use as "current". -- * If a task dies, then it sets TASK_ZOMBIE in tsk->state and calls -- * schedule one last time. The schedule call will never return, -+ * If a task dies, then it sets EXIT_ZOMBIE in tsk->exit_state and -+ * calls schedule one last time. The schedule call will never return, - * and the scheduled task must drop that reference. -- * The test for TASK_ZOMBIE must occur while the runqueue locks are -+ * The test for EXIT_ZOMBIE must occur while the runqueue locks are - * still held, otherwise prev could be scheduled on another cpu, die - * there before we look at prev->state, and then the reference would - * be dropped twice. - * Manfred Spraul <manfred@colorfullife.com> - */ - prev_task_flags = prev->flags; -+ -+ /* -+ * no schedule() should happen until vcpu_put, -+ * and schedule_tail() calls us with preempt enabled... 
-+ */ - finish_arch_switch(rq, prev); -+ if (prev_vcpu != vcpu) -+ vcpu_put(prev_vcpu); -+ local_irq_enable(); -+ - if (mm) - mmdrop(mm); - if (unlikely(prev_task_flags & PF_DEAD)) -@@ -1042,7 +1784,7 @@ asmlinkage void schedule_tail(task_t *pr - finish_task_switch(prev); - - if (current->set_child_tid) -- put_user(current->pid, current->set_child_tid); -+ put_user(virt_pid(current), current->set_child_tid); - } - - /* -@@ -1083,44 +1825,109 @@ task_t * context_switch(runqueue_t *rq, - */ - unsigned long nr_running(void) - { -- unsigned long i, sum = 0; -- -- for_each_cpu(i) -- sum += cpu_rq(i)->nr_running; -+ int i; -+ long sum; - -- return sum; -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_running; -+ return (unsigned long)(sum < 0 ? 0 : sum); - } -+EXPORT_SYMBOL(nr_running); - - unsigned long nr_uninterruptible(void) - { -- unsigned long i, sum = 0; -- -- for_each_cpu(i) -- sum += cpu_rq(i)->nr_uninterruptible; -+ int i; -+ long sum; - -- return sum; -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_uninterruptible; -+ return (unsigned long)(sum < 0 ? 0 : sum); - } -+EXPORT_SYMBOL(nr_uninterruptible); - --unsigned long long nr_context_switches(void) -+unsigned long nr_sleeping(void) - { -- unsigned long long i, sum = 0; -+ int i; -+ long sum; - -- for_each_cpu(i) -- sum += cpu_rq(i)->nr_switches; -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_sleeping; -+ return (unsigned long)(sum < 0 ? 0 : sum); -+} -+EXPORT_SYMBOL(nr_sleeping); - -- return sum; -+unsigned long nr_stopped(void) -+{ -+ int i; -+ long sum; -+ -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_stopped; -+ return (unsigned long)(sum < 0 ? 
0 : sum); - } -+EXPORT_SYMBOL(nr_stopped); - - unsigned long nr_iowait(void) - { -- unsigned long i, sum = 0; -+ int i; -+ long sum; - -- for_each_cpu(i) -- sum += atomic_read(&cpu_rq(i)->nr_iowait); -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_iowait; -+ return (unsigned long)(sum < 0 ? 0 : sum); -+} -+ -+unsigned long long nr_context_switches(void) -+{ -+ int i; -+ long long sum; - -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += glob_tasks_nrs[i].nr_switches; - return sum; - } - -+#ifdef CONFIG_VE -+unsigned long nr_running_ve(struct ve_struct *ve) -+{ -+ int i; -+ long sum; -+ -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += VE_CPU_STATS(ve, i)->nr_running; -+ return (unsigned long)(sum < 0 ? 0 : sum); -+} -+ -+unsigned long nr_uninterruptible_ve(struct ve_struct *ve) -+{ -+ int i; -+ long sum; -+ -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += VE_CPU_STATS(ve, i)->nr_unint; -+ return (unsigned long)(sum < 0 ? 0 : sum); -+} -+ -+unsigned long nr_iowait_ve(struct ve_struct *ve) -+{ -+ int i; -+ long sum; -+ -+ sum = 0; -+ for (i = 0; i < NR_CPUS; i++) -+ sum += VE_CPU_STATS(ve, i)->nr_iowait; -+ return (unsigned long)(sum < 0 ? 0 : sum); -+} -+#endif -+ - /* - * double_rq_lock - safely lock two runqueues - * -@@ -1167,24 +1974,32 @@ enum idle_type - /* - * find_idlest_cpu - find the least busy runqueue. 
- */ --static int find_idlest_cpu(struct task_struct *p, int this_cpu, -+static vcpu_t find_idlest_cpu(struct task_struct *p, vcpu_t this_cpu, - struct sched_domain *sd) - { - unsigned long load, min_load, this_load; -- int i, min_cpu; -- cpumask_t mask; -+ int i; -+ vcpu_t min_cpu; -+ cpumask_t mask, vmask; -+ struct vcpu_scheduler *vsched; - -- min_cpu = UINT_MAX; -+ vsched = task_vsched(p); -+ min_cpu = NULL; - min_load = ULONG_MAX; - - cpus_and(mask, sd->span, cpu_online_map); -- cpus_and(mask, mask, p->cpus_allowed); -+ cpus_and(vmask, vsched_vcpu_online_map(vsched), p->cpus_allowed); - -- for_each_cpu_mask(i, mask) { -- load = target_load(i); -+ for_each_cpu_mask(i, vmask) { -+ vcpu_t vcpu; -+ vcpu = vsched_vcpu(vsched, i); - -+ if (!cpu_isset(vcpu_last_pcpu(vcpu), mask)) -+ continue; -+ -+ load = target_load(vcpu); - if (load < min_load) { -- min_cpu = i; -+ min_cpu = vcpu; - min_load = load; - - /* break out early on an idle CPU: */ -@@ -1193,6 +2008,9 @@ static int find_idlest_cpu(struct task_s - } - } - -+ if (min_cpu == NULL) -+ return this_cpu; -+ - /* add +1 to account for the new task */ - this_load = source_load(this_cpu) + SCHED_LOAD_SCALE; - -@@ -1220,9 +2038,9 @@ static int find_idlest_cpu(struct task_s - void fastcall wake_up_forked_thread(task_t * p) - { - unsigned long flags; -- int this_cpu = get_cpu(), cpu; -+ vcpu_t this_cpu = get_vcpu(), cpu; - struct sched_domain *tmp, *sd = NULL; -- runqueue_t *this_rq = cpu_rq(this_cpu), *rq; -+ runqueue_t *this_rq = vcpu_rq(this_cpu), *rq; - - /* - * Find the largest domain that this CPU is part of that -@@ -1238,7 +2056,7 @@ void fastcall wake_up_forked_thread(task - - local_irq_save(flags); - lock_again: -- rq = cpu_rq(cpu); -+ rq = vcpu_rq(cpu); - double_rq_lock(this_rq, rq); - - BUG_ON(p->state != TASK_RUNNING); -@@ -1248,7 +2066,7 @@ lock_again: - * the mask could have changed - just dont migrate - * in this case: - */ -- if (unlikely(!cpu_isset(cpu, p->cpus_allowed))) { -+ if 
(unlikely(!vcpu_isset(cpu, p->cpus_allowed))) { - cpu = this_cpu; - double_rq_unlock(this_rq, rq); - goto lock_again; -@@ -1267,7 +2085,7 @@ lock_again: - p->interactive_credit = 0; - - p->prio = effective_prio(p); -- set_task_cpu(p, cpu); -+ set_task_vcpu(p, cpu); - - if (cpu == this_cpu) { - if (unlikely(!current->array)) -@@ -1278,6 +2096,8 @@ lock_again: - p->array = current->array; - p->array->nr_active++; - rq->nr_running++; -+ nr_running_inc(smp_processor_id(), task_cpu(p), -+ VE_TASK_INFO(p)->owner_env); - } - } else { - /* Not the local CPU - must adjust timestamp */ -@@ -1290,8 +2110,9 @@ lock_again: - - double_rq_unlock(this_rq, rq); - local_irq_restore(flags); -- put_cpu(); -+ put_vcpu(); - } -+#endif - - /* - * If dest_cpu is allowed for this process, migrate the task to it. -@@ -1299,15 +2120,15 @@ lock_again: - * allow dest_cpu, which will force the cpu onto dest_cpu. Then - * the cpu_allowed mask is restored. - */ --static void sched_migrate_task(task_t *p, int dest_cpu) -+static void sched_migrate_task(task_t *p, vcpu_t dest_cpu) - { - migration_req_t req; - runqueue_t *rq; - unsigned long flags; - - rq = task_rq_lock(p, &flags); -- if (!cpu_isset(dest_cpu, p->cpus_allowed) -- || unlikely(cpu_is_offline(dest_cpu))) -+ if (!vcpu_isset(dest_cpu, p->cpus_allowed) -+ || unlikely(vcpu_is_offline(dest_cpu))) - goto out; - - /* force the process onto the specified CPU */ -@@ -1325,6 +2146,7 @@ out: - task_rq_unlock(rq, &flags); - } - -+#ifdef CONFIG_SMP - /* - * sched_balance_exec(): find the highest-level, exec-balance-capable - * domain and try to migrate the task to the least loaded CPU. 
-@@ -1335,10 +2157,10 @@ out: - void sched_balance_exec(void) - { - struct sched_domain *tmp, *sd = NULL; -- int new_cpu, this_cpu = get_cpu(); -+ vcpu_t new_cpu, this_cpu = get_vcpu(); - - /* Prefer the current CPU if there's only this task running */ -- if (this_rq()->nr_running <= 1) -+ if (vcpu_rq(this_cpu)->nr_running <= 1) - goto out; - - for_each_domain(this_cpu, tmp) -@@ -1354,7 +2176,7 @@ void sched_balance_exec(void) - } - } - out: -- put_cpu(); -+ put_vcpu(); - } - - /* -@@ -1378,12 +2200,26 @@ static void double_lock_balance(runqueue - */ - static inline - void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, -- runqueue_t *this_rq, prio_array_t *this_array, int this_cpu) -+ runqueue_t *this_rq, prio_array_t *this_array, vcpu_t this_cpu) - { -+ struct ve_struct *ve; -+ cycles_t cycles; -+ -+ cycles = get_cycles(); -+ ve = VE_TASK_INFO(p)->owner_env; -+ - dequeue_task(p, src_array); - src_rq->nr_running--; -- set_task_cpu(p, this_cpu); -+ if (src_rq->nr_running == 0) { -+ vcpu_detach(src_rq); -+ vcpu_strt_ve_idle(ve, rq_vcpu(src_rq)->id, cycles); -+ } -+ set_task_vcpu(p, this_cpu); - this_rq->nr_running++; -+ if (this_rq->nr_running == 1) { -+ vcpu_save_ve_idle(ve, this_cpu->id, cycles); -+ vcpu_attach(this_rq); -+ } - enqueue_task(p, this_array); - p->timestamp = (p->timestamp - src_rq->timestamp_last_tick) - + this_rq->timestamp_last_tick; -@@ -1399,7 +2235,7 @@ void pull_task(runqueue_t *src_rq, prio_ - * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? 
- */ - static inline --int can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu, -+int can_migrate_task(task_t *p, runqueue_t *rq, vcpu_t this_cpu, - struct sched_domain *sd, enum idle_type idle) - { - /* -@@ -1410,7 +2246,7 @@ int can_migrate_task(task_t *p, runqueue - */ - if (task_running(rq, p)) - return 0; -- if (!cpu_isset(this_cpu, p->cpus_allowed)) -+ if (!vcpu_isset(this_cpu, p->cpus_allowed)) - return 0; - - /* Aggressive migration if we've failed balancing */ -@@ -1430,7 +2266,7 @@ int can_migrate_task(task_t *p, runqueue - * - * Called with both runqueues locked. - */ --static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest, -+static int move_tasks(runqueue_t *this_rq, vcpu_t this_cpu, runqueue_t *busiest, - unsigned long max_nr_move, struct sched_domain *sd, - enum idle_type idle) - { -@@ -1506,12 +2342,17 @@ out: - * moved to restore balance via the imbalance parameter. - */ - static struct sched_group * --find_busiest_group(struct sched_domain *sd, int this_cpu, -+find_busiest_group(struct sched_domain *sd, vcpu_t this_cpu, - unsigned long *imbalance, enum idle_type idle) - { - struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups; - unsigned long max_load, avg_load, total_load, this_load, total_pwr; -+ struct vcpu_scheduler *vsched; -+ vcpu_t vcpu; -+ int this_pcpu; - -+ vsched = vcpu_vsched(this_cpu); -+ this_pcpu = vcpu_last_pcpu(this_cpu); - max_load = this_load = total_load = total_pwr = 0; - - do { -@@ -1520,20 +2361,21 @@ find_busiest_group(struct sched_domain * - int local_group; - int i, nr_cpus = 0; - -- local_group = cpu_isset(this_cpu, group->cpumask); -+ local_group = cpu_isset(this_pcpu, group->cpumask); - - /* Tally up the load of all CPUs in the group */ - avg_load = 0; -- cpus_and(tmp, group->cpumask, cpu_online_map); -+ cpus_and(tmp, group->cpumask, vsched_pcpu_running_map(vsched)); - if (unlikely(cpus_empty(tmp))) - goto nextgroup; - - for_each_cpu_mask(i, tmp) { -+ vcpu = pcpu(i)->vcpu; - 
/* Bias balancing toward cpus of our domain */ - if (local_group) -- load = target_load(i); -+ load = target_load(vcpu); - else -- load = source_load(i); -+ load = source_load(vcpu); - - nr_cpus++; - avg_load += load; -@@ -1562,6 +2404,8 @@ nextgroup: - - if (!busiest || this_load >= max_load) - goto out_balanced; -+ if (!this) -+ this = busiest; /* this->cpu_power is needed below */ - - avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr; - -@@ -1645,36 +2489,71 @@ out_balanced: - /* - * find_busiest_queue - find the busiest runqueue among the cpus in group. - */ --static runqueue_t *find_busiest_queue(struct sched_group *group) -+static vcpu_t find_busiest_queue(vcpu_t this_cpu, -+ struct sched_group *group, enum idle_type idle) - { - cpumask_t tmp; -+ vcpu_t vcpu; -+ struct vcpu_scheduler *vsched; - unsigned long load, max_load = 0; -- runqueue_t *busiest = NULL; -+ vcpu_t busiest = NULL; - int i; - -+ vsched = vcpu_vsched(this_cpu); - cpus_and(tmp, group->cpumask, cpu_online_map); - for_each_cpu_mask(i, tmp) { -- load = source_load(i); -+ vcpu = pcpu(i)->vcpu; -+ if (vcpu_vsched(vcpu) != vsched && idle != IDLE) -+ continue; -+ load = source_load(vcpu); -+ if (load > max_load) { -+ max_load = load; -+ busiest = vcpu; -+ } -+ } -+ -+#ifdef CONFIG_SCHED_VCPU -+ cpus_andnot(tmp, vsched->vcpu_online_map, vsched->vcpu_running_map); -+ for_each_cpu_mask(i, tmp) { -+ vcpu = vsched_vcpu(vsched, i); -+ load = source_load(vcpu); - - if (load > max_load) { - max_load = load; -- busiest = cpu_rq(i); -+ busiest = vcpu; - } - } -+#endif - - return busiest; - } - -+#ifdef CONFIG_SCHED_VCPU -+vcpu_t find_idle_vcpu(struct vcpu_scheduler *vsched) -+{ -+ vcpu_t vcpu; -+ -+ vcpu = NULL; -+ spin_lock(&fairsched_lock); -+ if (!list_empty(&vsched->idle_list)) -+ vcpu = list_entry(vsched->idle_list.next, -+ struct vcpu_info, list); -+ spin_unlock(&fairsched_lock); -+ return vcpu; -+} -+#endif -+ - /* - * Check this_cpu to ensure it is balanced within domain. 
Attempt to move - * tasks if there is an imbalance. - * - * Called with this_rq unlocked. - */ --static int load_balance(int this_cpu, runqueue_t *this_rq, -+static int load_balance(vcpu_t this_cpu, runqueue_t *this_rq, - struct sched_domain *sd, enum idle_type idle) - { - struct sched_group *group; -+ vcpu_t busiest_vcpu; - runqueue_t *busiest; - unsigned long imbalance; - int nr_moved; -@@ -1685,18 +2564,34 @@ static int load_balance(int this_cpu, ru - if (!group) - goto out_balanced; - -- busiest = find_busiest_queue(group); -- if (!busiest) -+ busiest_vcpu = find_busiest_queue(this_cpu, group, idle); -+ if (!busiest_vcpu) - goto out_balanced; -+ -+#ifdef CONFIG_SCHED_VCPU -+ if (vcpu_vsched(this_cpu) != vcpu_vsched(busiest_vcpu)) { -+ spin_unlock(&this_rq->lock); -+ this_cpu = find_idle_vcpu(vcpu_vsched(busiest_vcpu)); -+ if (!this_cpu) -+ goto out_tune; -+ this_rq = vcpu_rq(this_cpu); -+ spin_lock(&this_rq->lock); -+ /* -+ * The check below is not mandatory, the lock may -+ * be dropped below in double_lock_balance. -+ */ -+ if (this_rq->nr_running) -+ goto out_balanced; -+ } -+#endif -+ busiest = vcpu_rq(busiest_vcpu); - /* - * This should be "impossible", but since load - * balancing is inherently racy and statistical, - * it could happen in theory. - */ -- if (unlikely(busiest == this_rq)) { -- WARN_ON(1); -+ if (unlikely(busiest == this_rq)) - goto out_balanced; -- } - - nr_moved = 0; - if (busiest->nr_running > 1) { -@@ -1746,6 +2641,7 @@ static int load_balance(int this_cpu, ru - out_balanced: - spin_unlock(&this_rq->lock); - -+out_tune: - /* tune up the balancing interval */ - if (sd->balance_interval < sd->max_interval) - sd->balance_interval *= 2; -@@ -1760,50 +2656,54 @@ out_balanced: - * Called from schedule when this_rq is about to become idle (NEWLY_IDLE). - * this_rq is locked. 
- */ --static int load_balance_newidle(int this_cpu, runqueue_t *this_rq, -+static int load_balance_newidle(vcpu_t this_cpu, runqueue_t *this_rq, - struct sched_domain *sd) - { - struct sched_group *group; -- runqueue_t *busiest = NULL; -+ vcpu_t busiest_vcpu; -+ runqueue_t *busiest; - unsigned long imbalance; -- int nr_moved = 0; - - group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE); - if (!group) - goto out; - -- busiest = find_busiest_queue(group); -- if (!busiest || busiest == this_rq) -+ busiest_vcpu = find_busiest_queue(this_cpu, group, NEWLY_IDLE); -+ if (!busiest_vcpu || busiest_vcpu == this_cpu) - goto out; -+ busiest = vcpu_rq(busiest_vcpu); - - /* Attempt to move tasks */ - double_lock_balance(this_rq, busiest); - -- nr_moved = move_tasks(this_rq, this_cpu, busiest, -- imbalance, sd, NEWLY_IDLE); -+ move_tasks(this_rq, this_cpu, busiest, -+ imbalance, sd, NEWLY_IDLE); - - spin_unlock(&busiest->lock); - - out: -- return nr_moved; -+ return 0; - } - - /* - * idle_balance is called by schedule() if this_cpu is about to become - * idle. Attempts to pull tasks from other CPUs. -+ * -+ * Returns whether to continue with another runqueue -+ * instead of switching to idle. - */ --static inline void idle_balance(int this_cpu, runqueue_t *this_rq) -+static int idle_balance(vcpu_t this_cpu, runqueue_t *this_rq) - { - struct sched_domain *sd; - - for_each_domain(this_cpu, sd) { - if (sd->flags & SD_BALANCE_NEWIDLE) { -- if (load_balance_newidle(this_cpu, this_rq, sd)) { -+ if (load_balance_newidle(this_cpu, this_rq, sd)) - /* We've pulled tasks over so stop searching */ -- break; -- } -+ return 1; - } - } -+ return 0; - } - - /* -@@ -1813,34 +2713,52 @@ static inline void idle_balance(int this - * logical imbalance. - * - * Called with busiest locked. -+ * -+ * In human terms: balancing of CPU load by moving tasks between CPUs is -+ * performed by 2 methods, push and pull. 
-+ * In certain places when CPU is found to be idle, it performs pull from busy -+ * CPU to current (idle) CPU. -+ * active_load_balance implements push method, with migration thread getting -+ * scheduled on a busy CPU (hence, making all running processes on this CPU sit -+ * in the queue) and selecting where to push and which task. - */ --static void active_load_balance(runqueue_t *busiest, int busiest_cpu) -+static void active_load_balance(runqueue_t *busiest, vcpu_t busiest_cpu) - { - struct sched_domain *sd; - struct sched_group *group, *busy_group; -+ struct vcpu_scheduler *vsched; - int i; - - if (busiest->nr_running <= 1) - return; - -+ /* -+ * Our main candidate where to push our tasks is busiest->push_cpu. -+ * First, find the domain that spans over both that candidate CPU and -+ * the current one. -+ * -+ * FIXME: make sure that push_cpu doesn't disappear before we get here. -+ */ - for_each_domain(busiest_cpu, sd) -- if (cpu_isset(busiest->push_cpu, sd->span)) -+ if (cpu_isset(vcpu_last_pcpu(busiest->push_cpu), sd->span)) - break; - if (!sd) { - WARN_ON(1); - return; - } - -+ /* Remember the group containing the current CPU (to ignore it). 
*/ - group = sd->groups; -- while (!cpu_isset(busiest_cpu, group->cpumask)) -+ while (!cpu_isset(vcpu_last_pcpu(busiest_cpu), group->cpumask)) - group = group->next; - busy_group = group; - -+ vsched = vcpu_vsched(busiest_cpu); - group = sd->groups; - do { - cpumask_t tmp; - runqueue_t *rq; -- int push_cpu = 0; -+ vcpu_t vcpu, push_cpu; - - if (group == busy_group) - goto next_group; -@@ -1849,13 +2767,21 @@ static void active_load_balance(runqueue - if (!cpus_weight(tmp)) - goto next_group; - -+ push_cpu = NULL; - for_each_cpu_mask(i, tmp) { -- if (!idle_cpu(i)) -+ vcpu = pcpu(i)->vcpu; -+ if (vcpu_vsched(vcpu) != vsched) -+ continue; -+ if (!idle_vcpu(vcpu)) - goto next_group; -- push_cpu = i; -+ push_cpu = vcpu; - } -+#ifdef CONFIG_SCHED_VCPU -+ if (push_cpu == NULL) -+ goto next_group; -+#endif - -- rq = cpu_rq(push_cpu); -+ rq = vcpu_rq(push_cpu); - - /* - * This condition is "impossible", but since load -@@ -1871,6 +2797,28 @@ static void active_load_balance(runqueue - next_group: - group = group->next; - } while (group != sd->groups); -+ -+#ifdef CONFIG_SCHED_VCPU -+ if (busiest->nr_running > 2) { /* 1 for migration thread, 1 for task */ -+ cpumask_t tmp; -+ runqueue_t *rq; -+ vcpu_t vcpu; -+ -+ cpus_andnot(tmp, vsched->vcpu_online_map, -+ vsched->vcpu_running_map); -+ for_each_cpu_mask(i, tmp) { -+ vcpu = vsched_vcpu(vsched, i); -+ if (!idle_vcpu(vcpu)) -+ continue; -+ rq = vcpu_rq(vcpu); -+ double_lock_balance(busiest, rq); -+ move_tasks(rq, vcpu, busiest, 1, sd, IDLE); -+ spin_unlock(&rq->lock); -+ if (busiest->nr_running <= 2) -+ break; -+ } -+ } -+#endif - } - - /* -@@ -1883,27 +2831,18 @@ next_group: - */ - - /* Don't have all balancing operations going off at once */ --#define CPU_OFFSET(cpu) (HZ * cpu / NR_CPUS) -+#define CPU_OFFSET(cpu) (HZ * (cpu) / NR_CPUS) - --static void rebalance_tick(int this_cpu, runqueue_t *this_rq, -+static void rebalance_tick(vcpu_t this_cpu, runqueue_t *this_rq, - enum idle_type idle) - { -- unsigned long old_load, 
this_load; -- unsigned long j = jiffies + CPU_OFFSET(this_cpu); -+ unsigned long j; - struct sched_domain *sd; - - /* Update our load */ -- old_load = this_rq->cpu_load; -- this_load = this_rq->nr_running * SCHED_LOAD_SCALE; -- /* -- * Round up the averaging division if load is increasing. This -- * prevents us from getting stuck on 9 if the load is 10, for -- * example. -- */ -- if (this_load > old_load) -- old_load++; -- this_rq->cpu_load = (old_load + this_load) / 2; -+ update_rq_cpu_load(this_rq); - -+ j = jiffies + CPU_OFFSET(smp_processor_id()); - for_each_domain(this_cpu, sd) { - unsigned long interval = sd->balance_interval; - -@@ -1914,7 +2853,6 @@ static void rebalance_tick(int this_cpu, - interval = msecs_to_jiffies(interval); - if (unlikely(!interval)) - interval = 1; -- - if (j - sd->last_balance >= interval) { - if (load_balance(this_cpu, this_rq, sd, idle)) { - /* We've pulled tasks over so no longer idle */ -@@ -1928,26 +2866,30 @@ static void rebalance_tick(int this_cpu, - /* - * on UP we do not need to balance between CPUs: - */ --static inline void rebalance_tick(int cpu, runqueue_t *rq, enum idle_type idle) -+static inline void rebalance_tick(vcpu_t cpu, runqueue_t *rq, enum idle_type idle) - { - } --static inline void idle_balance(int cpu, runqueue_t *rq) -+static inline void idle_balance(vcpu_t cpu, runqueue_t *rq) - { - } - #endif - --static inline int wake_priority_sleeper(runqueue_t *rq) -+static inline int wake_priority_sleeper(runqueue_t *rq, task_t *idle) - { -+#ifndef CONFIG_SCHED_VCPU -+ /* FIXME: can we implement SMT priority sleeping for this? */ - #ifdef CONFIG_SCHED_SMT - /* - * If an SMT sibling task has been put to sleep for priority - * reasons reschedule the idle task to see if it can now run. 
- */ - if (rq->nr_running) { -- resched_task(rq->idle); -+ /* FIXME */ -+ resched_task(idle); - return 1; - } - #endif -+#endif - return 0; - } - -@@ -1971,6 +2913,25 @@ EXPORT_PER_CPU_SYMBOL(kstat); - STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \ - ((rq)->curr->static_prio > (rq)->best_expired_prio)) - -+#ifdef CONFIG_VE -+#define update_ve_nice(p, tick) do { \ -+ VE_CPU_STATS(VE_TASK_INFO(p)->owner_env, \ -+ task_cpu(p))->nice += tick; \ -+ } while (0) -+#define update_ve_user(p, tick) do { \ -+ VE_CPU_STATS(VE_TASK_INFO(p)->owner_env, \ -+ task_cpu(p))->user += tick; \ -+ } while (0) -+#define update_ve_system(p, tick) do { \ -+ VE_CPU_STATS(VE_TASK_INFO(p)->owner_env, \ -+ task_cpu(p))->system += tick; \ -+ } while (0) -+#else -+#define update_ve_nice(p, tick) do { } while (0) -+#define update_ve_user(p, tick) do { } while (0) -+#define update_ve_system(p, tick) do { } while (0) -+#endif -+ - /* - * This function gets called by the timer code, with HZ frequency. - * We call it with interrupts disabled. 
-@@ -1981,12 +2942,17 @@ EXPORT_PER_CPU_SYMBOL(kstat); - void scheduler_tick(int user_ticks, int sys_ticks) - { - int cpu = smp_processor_id(); -+ vcpu_t vcpu; - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; -- runqueue_t *rq = this_rq(); -+ runqueue_t *rq; - task_t *p = current; - -+ vcpu = this_vcpu(); -+ rq = vcpu_rq(vcpu); - rq->timestamp_last_tick = sched_clock(); - -+ set_tsk_need_resched(p); //FIXME -+ - if (rcu_pending(cpu)) - rcu_check_callbacks(cpu, user_ticks); - -@@ -1998,22 +2964,25 @@ void scheduler_tick(int user_ticks, int - cpustat->softirq += sys_ticks; - sys_ticks = 0; - } -- -- if (p == rq->idle) { -+ if (p == pcpu(cpu)->idle) { - if (atomic_read(&rq->nr_iowait) > 0) - cpustat->iowait += sys_ticks; - else - cpustat->idle += sys_ticks; -- if (wake_priority_sleeper(rq)) -+ if (wake_priority_sleeper(rq, pcpu(cpu)->idle)) - goto out; -- rebalance_tick(cpu, rq, IDLE); -+ rebalance_tick(vcpu, rq, IDLE); - return; - } -- if (TASK_NICE(p) > 0) -+ if (TASK_NICE(p) > 0) { - cpustat->nice += user_ticks; -- else -+ update_ve_nice(p, user_ticks); -+ } else { - cpustat->user += user_ticks; -+ update_ve_user(p, user_ticks); -+ } - cpustat->system += sys_ticks; -+ update_ve_system(p, sys_ticks); - - /* Task might have expired already, but not scheduled off yet */ - if (p->array != rq->active) { -@@ -2076,9 +3045,22 @@ void scheduler_tick(int user_ticks, int - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. - */ -+ unsigned long ts_gran; -+ -+ ts_gran = TIMESLICE_GRANULARITY(p); -+ if (ts_gran == 0) { -+ printk("BUG!!! 
Zero granulatity!\n" -+ "Task %d/%s, VE %d, sleep_avg %lu, cpus %d\n", -+ p->pid, p->comm, -+ VE_TASK_INFO(p)->owner_env->veid, -+ p->sleep_avg, -+ vsched_num_online_vcpus(task_vsched(p))); -+ ts_gran = 1; -+ } -+ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - -- p->time_slice) % TIMESLICE_GRANULARITY(p)) && -- (p->time_slice >= TIMESLICE_GRANULARITY(p)) && -+ p->time_slice) % ts_gran) && -+ (p->time_slice >= ts_gran) && - (p->array == rq->active)) { - - dequeue_task(p, rq->active); -@@ -2090,11 +3072,12 @@ void scheduler_tick(int user_ticks, int - out_unlock: - spin_unlock(&rq->lock); - out: -- rebalance_tick(cpu, rq, NOT_IDLE); -+ rebalance_tick(vcpu, rq, NOT_IDLE); - } - --#ifdef CONFIG_SCHED_SMT --static inline void wake_sleeping_dependent(int cpu, runqueue_t *rq) -+#if defined(CONFIG_SCHED_SMT) && !defined(CONFIG_SCHED_VCPU) -+/* FIXME: SMT scheduling */ -+static void wake_sleeping_dependent(int cpu, runqueue_t *rq) - { - int i; - struct sched_domain *sd = rq->sd; -@@ -2110,18 +3093,18 @@ static inline void wake_sleeping_depende - if (i == cpu) - continue; - -- smt_rq = cpu_rq(i); -+ smt_rq = vcpu_rq(vcpu(i)); - - /* - * If an SMT sibling task is sleeping due to priority - * reasons wake it up now. 
- */ -- if (smt_rq->curr == smt_rq->idle && smt_rq->nr_running) -- resched_task(smt_rq->idle); -+ if (smt_rq->curr == pcpu(i)->idle && smt_rq->nr_running) -+ resched_task(pcpu(i)->idle); - } - } - --static inline int dependent_sleeper(int cpu, runqueue_t *rq, task_t *p) -+static int dependent_sleeper(int cpu, runqueue_t *rq, task_t *p) - { - struct sched_domain *sd = rq->sd; - cpumask_t sibling_map; -@@ -2138,7 +3121,7 @@ static inline int dependent_sleeper(int - if (i == cpu) - continue; - -- smt_rq = cpu_rq(i); -+ smt_rq = vcpu_rq(vcpu(i)); - smt_curr = smt_rq->curr; - - /* -@@ -2162,7 +3145,7 @@ static inline int dependent_sleeper(int - if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) > - task_timeslice(smt_curr) || rt_task(p)) && - smt_curr->mm && p->mm && !rt_task(smt_curr)) || -- (smt_curr == smt_rq->idle && smt_rq->nr_running)) -+ (smt_curr == pcpu(i)->idle && smt_rq->nr_running)) - resched_task(smt_curr); - } - return ret; -@@ -2178,6 +3161,24 @@ static inline int dependent_sleeper(int - } - #endif - -+static void update_sched_lat(struct task_struct *t, cycles_t cycles) -+{ -+ int cpu; -+ cycles_t ve_wstamp; -+ -+ /* safe due to runqueue lock */ -+ ve_wstamp = VE_TASK_INFO(t)->wakeup_stamp; -+ cpu = smp_processor_id(); -+ if (ve_wstamp && cycles > ve_wstamp) { -+ KSTAT_LAT_PCPU_ADD(&kstat_glob.sched_lat, -+ cpu, cycles - ve_wstamp); -+#ifdef CONFIG_VE -+ KSTAT_LAT_PCPU_ADD(&VE_TASK_INFO(t)->exec_env->sched_lat_ve, -+ cpu, cycles - ve_wstamp); -+#endif -+ } -+} -+ - /* - * schedule() is the main scheduler function. - */ -@@ -2190,30 +3191,34 @@ asmlinkage void __sched schedule(void) - struct list_head *queue; - unsigned long long now; - unsigned long run_time; -- int cpu, idx; -+ int idx; -+ vcpu_t vcpu; -+ cycles_t cycles; - - /* - * Test if we are atomic. Since do_exit() needs to call into - * schedule() atomically, we ignore that path for now. - * Otherwise, whine if we are scheduling when we should not be. 
- */ -- if (likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) { -+ if (likely(!current->exit_state)) { - if (unlikely(in_atomic())) { - printk(KERN_ERR "bad: scheduling while atomic!\n"); - dump_stack(); - } - } -- - need_resched: -+ cycles = get_cycles(); - preempt_disable(); - prev = current; - rq = this_rq(); - - release_kernel_lock(prev); - now = sched_clock(); -- if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG)) -+ if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { - run_time = now - prev->timestamp; -- else -+ if (unlikely((long long)(now - prev->timestamp) < 0)) -+ run_time = 0; -+ } else - run_time = NS_MAX_SLEEP_AVG; - - /* -@@ -2226,6 +3231,8 @@ need_resched: - - spin_lock_irq(&rq->lock); - -+ if (unlikely(current->flags & PF_DEAD)) -+ current->state = EXIT_DEAD; - /* - * if entering off of a kernel preemption go straight - * to picking the next task. -@@ -2233,24 +3240,40 @@ need_resched: - switch_count = &prev->nivcsw; - if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { - switch_count = &prev->nvcsw; -- if (unlikely((prev->state & TASK_INTERRUPTIBLE) && -- unlikely(signal_pending(prev)))) -+ if (unlikely(((prev->state & TASK_INTERRUPTIBLE) && -+ unlikely(signal_pending(prev))) || -+ ((prev->state & TASK_STOPPED) && -+ sigismember(&prev->pending.signal, SIGKILL)))) - prev->state = TASK_RUNNING; - else - deactivate_task(prev, rq); - } - -- cpu = smp_processor_id(); -+ prev->sleep_avg -= run_time; -+ if ((long)prev->sleep_avg <= 0) { -+ prev->sleep_avg = 0; -+ if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev))) -+ prev->interactive_credit--; -+ } -+ -+ vcpu = rq_vcpu(rq); -+ if (rq->nr_running && -+ jiffies - vcpu->start_time < msecs_to_jiffies(vcpu_timeslice)) -+ goto same_vcpu; -+ -+ if (unlikely(!rq->nr_running)) -+ idle_balance(vcpu, rq); -+ vcpu = schedule_vcpu(vcpu, cycles); -+ rq = vcpu_rq(vcpu); -+ - if (unlikely(!rq->nr_running)) { -- idle_balance(cpu, rq); -- if (!rq->nr_running) { -- next = rq->idle; -- 
rq->expired_timestamp = 0; -- wake_sleeping_dependent(cpu, rq); -- goto switch_tasks; -- } -+ next = this_pcpu()->idle; -+ rq->expired_timestamp = 0; -+ wake_sleeping_dependent(vcpu->id, rq); -+ goto switch_tasks; - } - -+same_vcpu: - array = rq->active; - if (unlikely(!array->nr_active)) { - /* -@@ -2266,14 +3289,15 @@ need_resched: - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, task_t, run_list); -- -- if (dependent_sleeper(cpu, rq, next)) { -- next = rq->idle; -+ if (dependent_sleeper(vcpu->id, rq, next)) { -+ /* FIXME: switch to idle if CONFIG_SCHED_VCPU */ -+ next = this_pcpu()->idle; - goto switch_tasks; - } -- - if (!rt_task(next) && next->activated > 0) { - unsigned long long delta = now - next->timestamp; -+ if (unlikely((long long)delta < 0)) -+ delta = 0; - - if (next->activated == 1) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; -@@ -2284,37 +3308,68 @@ need_resched: - enqueue_task(next, array); - } - next->activated = 0; -+ - switch_tasks: - prefetch(next); - clear_tsk_need_resched(prev); -- RCU_qsctr(task_cpu(prev))++; -+ RCU_qsctr(task_pcpu(prev))++; - -- prev->sleep_avg -= run_time; -- if ((long)prev->sleep_avg <= 0) { -- prev->sleep_avg = 0; -- if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev))) -- prev->interactive_credit--; -- } -+ /* updated w/o rq->lock, which is ok due to after-read-checks */ - prev->timestamp = now; - - if (likely(prev != next)) { -+ /* current physical CPU id should be valid after switch */ -+ set_task_vcpu(next, vcpu); -+ set_task_pcpu(next, task_pcpu(prev)); - next->timestamp = now; - rq->nr_switches++; -+ glob_tasks_nrs[smp_processor_id()].nr_switches++; - rq->curr = next; - ++*switch_count; - -+ VE_TASK_INFO(prev)->sleep_stamp = cycles; -+ if (prev->state == TASK_RUNNING && prev != this_pcpu()->idle) -+ write_wakeup_stamp(prev, cycles); -+ update_sched_lat(next, cycles); -+ -+ /* because next & prev are protected with -+ * runqueue lock we may not worry 
about -+ * wakeup_stamp and sched_time protection -+ * (same thing in 'else' branch below) -+ */ -+ if (prev != this_pcpu()->idle) { -+#ifdef CONFIG_VE -+ VE_CPU_STATS(VE_TASK_INFO(prev)->owner_env, -+ smp_processor_id())->used_time += -+ cycles - VE_TASK_INFO(prev)->sched_time; -+#endif -+ VE_TASK_INFO(prev)->sched_time = 0; -+ } -+ VE_TASK_INFO(next)->sched_time = cycles; -+ write_wakeup_stamp(next, 0); -+ - prepare_arch_switch(rq, next); - prev = context_switch(rq, prev, next); - barrier(); - - finish_task_switch(prev); -- } else -+ } else { -+ if (prev != this_pcpu()->idle) { -+#ifdef CONFIG_VE -+ VE_CPU_STATS(VE_TASK_INFO(prev)->owner_env, -+ smp_processor_id())->used_time += -+ cycles - VE_TASK_INFO(prev)->sched_time; -+#endif -+ VE_TASK_INFO(prev)->sched_time = cycles; -+ } - spin_unlock_irq(&rq->lock); -+ } - - reacquire_kernel_lock(current); - preempt_enable_no_resched(); - if (test_thread_flag(TIF_NEED_RESCHED)) - goto need_resched; -+ return; - } - - EXPORT_SYMBOL(schedule); -@@ -2675,23 +3730,12 @@ int task_nice(const task_t *p) - EXPORT_SYMBOL(task_nice); - - /** -- * idle_cpu - is a given cpu idle currently? -- * @cpu: the processor in question. -- */ --int idle_cpu(int cpu) --{ -- return cpu_curr(cpu) == cpu_rq(cpu)->idle; --} -- --EXPORT_SYMBOL_GPL(idle_cpu); -- --/** - * find_process_by_pid - find a process with a matching PID value. - * @pid: the pid in question. - */ - static inline task_t *find_process_by_pid(pid_t pid) - { -- return pid ? find_task_by_pid(pid) : current; -+ return pid ? find_task_by_pid_ve(pid) : current; - } - - /* Actually do priority change: must hold rq lock. */ -@@ -2709,7 +3753,7 @@ static void __setscheduler(struct task_s - /* - * setscheduler - change the scheduling policy and/or RT priority of a thread. 
- */ --static int setscheduler(pid_t pid, int policy, struct sched_param __user *param) -+int setscheduler(pid_t pid, int policy, struct sched_param __user *param) - { - struct sched_param lp; - int retval = -EINVAL; -@@ -2764,7 +3808,7 @@ static int setscheduler(pid_t pid, int p - - retval = -EPERM; - if ((policy == SCHED_FIFO || policy == SCHED_RR) && -- !capable(CAP_SYS_NICE)) -+ !capable(CAP_SYS_ADMIN)) - goto out_unlock; - if ((current->euid != p->euid) && (current->euid != p->uid) && - !capable(CAP_SYS_NICE)) -@@ -2802,6 +3846,7 @@ out_unlock_tasklist: - out_nounlock: - return retval; - } -+EXPORT_SYMBOL(setscheduler); - - /** - * sys_sched_setscheduler - set/change the scheduler policy and RT priority -@@ -3065,9 +4110,14 @@ EXPORT_SYMBOL(yield); - void __sched io_schedule(void) - { - struct runqueue *rq = this_rq(); -+ struct ve_struct *ve; -+ -+ ve = VE_TASK_INFO(current)->owner_env; - - atomic_inc(&rq->nr_iowait); -+ nr_iowait_inc(smp_processor_id(), task_cpu(current), ve); - schedule(); -+ nr_iowait_dec(smp_processor_id(), task_cpu(current), ve); - atomic_dec(&rq->nr_iowait); - } - -@@ -3077,9 +4127,14 @@ long __sched io_schedule_timeout(long ti - { - struct runqueue *rq = this_rq(); - long ret; -+ struct ve_struct *ve; -+ -+ ve = VE_TASK_INFO(current)->owner_env; - - atomic_inc(&rq->nr_iowait); -+ nr_iowait_inc(smp_processor_id(), task_cpu(current), ve); - ret = schedule_timeout(timeout); -+ nr_iowait_dec(smp_processor_id(), task_cpu(current), ve); - atomic_dec(&rq->nr_iowait); - return ret; - } -@@ -3199,16 +4254,13 @@ static void show_task(task_t * p) - printk(stat_nam[state]); - else - printk("?"); -+ if (state) -+ printk(" %012Lx", (unsigned long long) -+ (VE_TASK_INFO(p)->sleep_stamp >> 16)); - #if (BITS_PER_LONG == 32) -- if (state == TASK_RUNNING) -- printk(" running "); -- else -- printk(" %08lX ", thread_saved_pc(p)); -+ printk(" %08lX ", (unsigned long)p); - #else -- if (state == TASK_RUNNING) -- printk(" running task "); -- else -- printk(" 
%016lx ", thread_saved_pc(p)); -+ printk(" %016lx ", (unsigned long)p); - #endif - #ifdef CONFIG_DEBUG_STACK_USAGE - { -@@ -3247,39 +4299,82 @@ void show_state(void) - #if (BITS_PER_LONG == 32) - printk("\n" - " sibling\n"); -- printk(" task PC pid father child younger older\n"); -+ printk(" task taskaddr pid father child younger older\n"); - #else - printk("\n" - " sibling\n"); -- printk(" task PC pid father child younger older\n"); -+ printk(" task taskaddr pid father child younger older\n"); - #endif - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_all(g, p) { - /* - * reset the NMI-timeout, listing all files on a slow - * console might take alot of time: - */ - touch_nmi_watchdog(); - show_task(p); -- } while_each_thread(g, p); -+ } while_each_thread_all(g, p); - - read_unlock(&tasklist_lock); - } - -+static void init_rq(struct runqueue *rq); -+ -+static void init_vcpu(vcpu_t vcpu, int id) -+{ -+ memset(vcpu, 0, sizeof(struct vcpu_info)); -+ vcpu->id = id; -+#ifdef CONFIG_SCHED_VCPU -+ vcpu->last_pcpu = id; -+#endif -+ init_rq(vcpu_rq(vcpu)); -+} -+ - void __devinit init_idle(task_t *idle, int cpu) - { -- runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle)); -+ struct vcpu_scheduler *vsched; -+ vcpu_t vcpu; -+ runqueue_t *idle_rq, *rq; - unsigned long flags; - -+#ifdef CONFIG_SCHED_VCPU -+ if (__add_vcpu(&idle_vsched, cpu)) -+ panic("Can't create idle vcpu %d\n", cpu); -+ -+ /* Also create vcpu for default_vsched */ -+ if (cpu > 0 && __add_vcpu(&default_vsched, cpu) != 0) -+ panic("Can't create default vcpu %d\n", cpu); -+ cpu_set(cpu, idle_vsched.pcpu_running_map); -+#endif -+ vsched = &idle_vsched; -+ vcpu = vsched_vcpu(vsched, cpu); -+ -+ idle_rq = vcpu_rq(vcpu); -+ rq = vcpu_rq(task_vcpu(idle)); -+ - local_irq_save(flags); - double_rq_lock(idle_rq, rq); - -- idle_rq->curr = idle_rq->idle = idle; -+ pcpu(cpu)->idle = idle; -+ idle_rq->curr = idle; - deactivate_task(idle, rq); - idle->array = NULL; - idle->prio = MAX_PRIO; - 
idle->state = TASK_RUNNING; -- set_task_cpu(idle, cpu); -+ set_task_pcpu(idle, cpu); -+#ifdef CONFIG_SCHED_VCPU -+ /* the following code is very close to vcpu_get */ -+ spin_lock(&fairsched_lock); -+ pcpu(cpu)->vcpu = vcpu; -+ pcpu(cpu)->vsched = vcpu->vsched; -+ list_move_tail(&vcpu->list, &vsched->running_list); -+ __set_bit(cpu, vsched->vcpu_running_map.bits); -+ __set_bit(cpu, vsched->pcpu_running_map.bits); -+ vcpu->running = 1; -+ spin_unlock(&fairsched_lock); -+#endif -+ set_task_vsched(idle, vsched); -+ set_task_vcpu(idle, vcpu); - double_rq_unlock(idle_rq, rq); - set_tsk_need_resched(idle); - local_irq_restore(flags); -@@ -3301,7 +4396,7 @@ void __devinit init_idle(task_t *idle, i - */ - cpumask_t nohz_cpu_mask = CPU_MASK_NONE; - --#ifdef CONFIG_SMP -+#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_VCPU) - /* - * This is how migration works: - * -@@ -3327,15 +4422,18 @@ cpumask_t nohz_cpu_mask = CPU_MASK_NONE; - * task must not exit() & deallocate itself prematurely. The - * call is not atomic; no spinlocks may be held. - */ -+#ifdef CONFIG_SMP - int set_cpus_allowed(task_t *p, cpumask_t new_mask) - { - unsigned long flags; - int ret = 0; - migration_req_t req; - runqueue_t *rq; -+ struct vcpu_scheduler *vsched; - -+ vsched = task_vsched(p); - rq = task_rq_lock(p, &flags); -- if (!cpus_intersects(new_mask, cpu_online_map)) { -+ if (!cpus_intersects(new_mask, vsched_vcpu_online_map(vsched))) { - ret = -EINVAL; - goto out; - } -@@ -3345,7 +4443,8 @@ int set_cpus_allowed(task_t *p, cpumask_ - if (cpu_isset(task_cpu(p), new_mask)) - goto out; - -- if (migrate_task(p, any_online_cpu(new_mask), &req)) { -+ if (migrate_task(p, vsched_vcpu(vsched, any_online_cpu(new_mask)), -+ &req)) { - /* Need help from migration thread: drop lock and wait. */ - task_rq_unlock(rq, &flags); - wake_up_process(rq->migration_thread); -@@ -3359,6 +4458,7 @@ out: - } - - EXPORT_SYMBOL_GPL(set_cpus_allowed); -+#endif - - /* - * Move (not current) task off this cpu, onto dest cpu. 
We're doing -@@ -3369,25 +4469,30 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed); - * So we race with normal scheduler movements, but that's OK, as long - * as the task is no longer on this CPU. - */ --static void __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu) -+static void __migrate_task(struct task_struct *p, vcpu_t src_cpu, vcpu_t dest_cpu) - { - runqueue_t *rq_dest, *rq_src; - -- if (unlikely(cpu_is_offline(dest_cpu))) -+ if (unlikely(vcpu_is_offline(dest_cpu))) - return; - -- rq_src = cpu_rq(src_cpu); -- rq_dest = cpu_rq(dest_cpu); -+#ifdef CONFIG_SCHED_VCPU -+ BUG_ON(vcpu_vsched(src_cpu) == &idle_vsched); -+#endif -+ rq_src = vcpu_rq(src_cpu); -+ rq_dest = vcpu_rq(dest_cpu); - - double_rq_lock(rq_src, rq_dest); - /* Already moved. */ -- if (task_cpu(p) != src_cpu) -+ if (task_vcpu(p) != src_cpu) - goto out; - /* Affinity changed (again). */ -- if (!cpu_isset(dest_cpu, p->cpus_allowed)) -+ if (!vcpu_isset(dest_cpu, p->cpus_allowed)) - goto out; - -- set_task_cpu(p, dest_cpu); -+ BUG_ON(task_running(rq_src, p)); -+ set_task_vsched(p, vcpu_vsched(dest_cpu)); -+ set_task_vcpu(p, dest_cpu); - if (p->array) { - /* - * Sync timestamp with rq_dest's before activating. 
-@@ -3415,9 +4520,9 @@ out: - static int migration_thread(void * data) - { - runqueue_t *rq; -- int cpu = (long)data; -+ vcpu_t cpu = (vcpu_t)data; - -- rq = cpu_rq(cpu); -+ rq = vcpu_rq(cpu); - BUG_ON(rq->migration_thread != current); - - set_current_state(TASK_INTERRUPTIBLE); -@@ -3425,21 +4530,21 @@ static int migration_thread(void * data) - struct list_head *head; - migration_req_t *req; - -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - - spin_lock_irq(&rq->lock); - -- if (cpu_is_offline(cpu)) { -+ if (vcpu_is_offline(cpu)) { - spin_unlock_irq(&rq->lock); - goto wait_to_die; - } -- -+#ifdef CONFIG_SMP - if (rq->active_balance) { - active_load_balance(rq, cpu); - rq->active_balance = 0; - } -- -+#endif - head = &rq->migration_queue; - - if (list_empty(head)) { -@@ -3453,12 +4558,14 @@ static int migration_thread(void * data) - - if (req->type == REQ_MOVE_TASK) { - spin_unlock(&rq->lock); -- __migrate_task(req->task, smp_processor_id(), -+ __migrate_task(req->task, this_vcpu(), - req->dest_cpu); - local_irq_enable(); -+#ifdef CONFIG_SMP - } else if (req->type == REQ_SET_DOMAIN) { - rq->sd = req->sd; - spin_unlock_irq(&rq->lock); -+#endif - } else { - spin_unlock_irq(&rq->lock); - WARN_ON(1); -@@ -3480,10 +4587,10 @@ wait_to_die: - return 0; - } - --#ifdef CONFIG_HOTPLUG_CPU - /* migrate_all_tasks - function to migrate all tasks from the dead cpu. 
*/ --static void migrate_all_tasks(int src_cpu) -+static void migrate_all_tasks(vcpu_t src_vcpu) - { -+#if defined(CONFIG_HOTPLUG_CPU) && !defined(CONFIG_SCHED_VCPU) - struct task_struct *tsk, *t; - int dest_cpu; - unsigned int node; -@@ -3491,14 +4598,14 @@ static void migrate_all_tasks(int src_cp - write_lock_irq(&tasklist_lock); - - /* watch out for per node tasks, let's stay on this node */ -- node = cpu_to_node(src_cpu); -+ node = cpu_to_node(src_vcpu); - -- do_each_thread(t, tsk) { -+ do_each_thread_all(t, tsk) { - cpumask_t mask; - if (tsk == current) - continue; - -- if (task_cpu(tsk) != src_cpu) -+ if (task_vcpu(tsk) != src_vcpu) - continue; - - /* Figure out where this task should go (attempting to -@@ -3520,22 +4627,43 @@ static void migrate_all_tasks(int src_cp - if (tsk->mm && printk_ratelimit()) - printk(KERN_INFO "process %d (%s) no " - "longer affine to cpu%d\n", -- tsk->pid, tsk->comm, src_cpu); -+ tsk->pid, tsk->comm, src_vcpu->id); - } -- -- __migrate_task(tsk, src_cpu, dest_cpu); -- } while_each_thread(t, tsk); -+ __migrate_task(tsk, src_vcpu, -+ vsched_vcpu(vcpu_vsched(src_vcpu), dest_cpu)); -+ } while_each_thread_all(t, tsk); - - write_unlock_irq(&tasklist_lock); -+#elif defined(CONFIG_SCHED_VCPU) -+ struct task_struct *tsk, *t; -+ -+ /* -+ * FIXME: should migrate tasks from scr_vcpu to others if dynamic -+ * VCPU add/del is implemented. Right now just does sanity checks. -+ */ -+ read_lock(&tasklist_lock); -+ do_each_thread_all(t, tsk) { -+ if (task_vcpu(tsk) != src_vcpu) -+ continue; -+ if (tsk == vcpu_rq(src_vcpu)->migration_thread) -+ continue; -+ -+ printk("VSCHED: task %s (%d) was left on src VCPU %d:%d\n", -+ tsk->comm, tsk->pid, -+ vcpu_vsched(src_vcpu)->id, src_vcpu->id); -+ } while_each_thread_all(t, tsk); -+ read_unlock(&tasklist_lock); -+#endif - } - -+#ifdef CONFIG_HOTPLUG_CPU - /* Schedules idle task to be the next runnable task on current CPU. 
- * It does so by boosting its priority to highest possible and adding it to - * the _front_ of runqueue. Used by CPU offline code. - */ - void sched_idle_next(void) - { -- int cpu = smp_processor_id(); -+ int cpu = this_vcpu(); - runqueue_t *rq = this_rq(); - struct task_struct *p = rq->idle; - unsigned long flags; -@@ -3550,60 +4678,100 @@ void sched_idle_next(void) - - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); - /* Add idle task to _front_ of it's priority queue */ -+#ifdef CONFIG_SCHED_VCPU -+#error "FIXME: VCPU vs. HOTPLUG: fix the code below" -+#endif - __activate_idle_task(p, rq); - - spin_unlock_irqrestore(&rq->lock, flags); - } - #endif /* CONFIG_HOTPLUG_CPU */ - -+static void migration_thread_bind(struct task_struct *k, vcpu_t cpu) -+{ -+ BUG_ON(k->state != TASK_INTERRUPTIBLE); -+ /* Must have done schedule() in kthread() before we set_task_cpu */ -+ wait_task_inactive(k); -+ -+ set_task_vsched(k, vcpu_vsched(cpu)); -+ set_task_vcpu(k, cpu); -+ k->cpus_allowed = cpumask_of_cpu(cpu->id); -+} -+ -+static void migration_thread_stop(runqueue_t *rq) -+{ -+ struct task_struct *thread; -+ -+ thread = rq->migration_thread; -+ if (thread == NULL) -+ return; -+ -+ get_task_struct(thread); -+ kthread_stop(thread); -+ -+ /* We MUST ensure, that the do_exit of the migration thread is -+ * completed and it will never scheduled again before vsched_destroy. -+ * The task with flag PF_DEAD if unscheduled will never receive -+ * CPU again. */ -+ while (!(thread->flags & PF_DEAD) || task_running(rq, thread)) -+ yield(); -+ put_task_struct(thread); -+ -+ rq->migration_thread = NULL; -+} -+ - /* - * migration_call - callback that gets triggered when a CPU is added. - * Here we can start up the necessary migration thread for the new CPU. 
- */ --static int migration_call(struct notifier_block *nfb, unsigned long action, -+static int vmigration_call(struct notifier_block *nfb, unsigned long action, - void *hcpu) - { -- int cpu = (long)hcpu; -+ vcpu_t cpu = (vcpu_t)hcpu; - struct task_struct *p; - struct runqueue *rq; - unsigned long flags; - - switch (action) { - case CPU_UP_PREPARE: -- p = kthread_create(migration_thread, hcpu, "migration/%d",cpu); -+ p = kthread_create(migration_thread, hcpu, "migration/%d/%d", -+ vsched_id(vcpu_vsched(cpu)), cpu->id); - if (IS_ERR(p)) - return NOTIFY_BAD; - p->flags |= PF_NOFREEZE; -- kthread_bind(p, cpu); -- /* Must be high prio: stop_machine expects to yield to it. */ -+ -+ migration_thread_bind(p, cpu); - rq = task_rq_lock(p, &flags); -+ /* Must be high prio: stop_machine expects to yield to it. */ - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); - task_rq_unlock(rq, &flags); -- cpu_rq(cpu)->migration_thread = p; -+ vcpu_rq(cpu)->migration_thread = p; - break; - case CPU_ONLINE: - /* Strictly unneccessary, as first user will wake it. */ -- wake_up_process(cpu_rq(cpu)->migration_thread); -+ wake_up_process(vcpu_rq(cpu)->migration_thread); - break; --#ifdef CONFIG_HOTPLUG_CPU -+ -+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_SCHED_VCPU) -+#error "FIXME: CPU down code doesn't work yet with VCPUs" -+#endif - case CPU_UP_CANCELED: - /* Unbind it from offline cpu so it can run. Fall thru. 
*/ -- kthread_bind(cpu_rq(cpu)->migration_thread,smp_processor_id()); -- kthread_stop(cpu_rq(cpu)->migration_thread); -- cpu_rq(cpu)->migration_thread = NULL; -+ migration_thread_bind(vcpu_rq(cpu)->migration_thread, this_vcpu()); -+ migration_thread_stop(vcpu_rq(cpu)); - break; - case CPU_DEAD: - migrate_all_tasks(cpu); -- rq = cpu_rq(cpu); -- kthread_stop(rq->migration_thread); -- rq->migration_thread = NULL; -+ rq = vcpu_rq(cpu); -+ migration_thread_stop(rq); -+#ifdef CONFIG_HOTPLUG_CPU - /* Idle task back to normal (off runqueue, low prio) */ - rq = task_rq_lock(rq->idle, &flags); - deactivate_task(rq->idle, rq); - rq->idle->static_prio = MAX_PRIO; - __setscheduler(rq->idle, SCHED_NORMAL, 0); - task_rq_unlock(rq, &flags); -- BUG_ON(rq->nr_running != 0); -+#endif - - /* No need to migrate the tasks: it was best-effort if - * they didn't do lock_cpu_hotplug(). Just wake up -@@ -3619,11 +4787,17 @@ static int migration_call(struct notifie - } - spin_unlock_irq(&rq->lock); - break; --#endif - } - return NOTIFY_OK; - } - -+static int migration_call(struct notifier_block *nfb, unsigned long action, -+ void *hcpu) -+{ -+ /* we need to translate pcpu to vcpu */ -+ return vmigration_call(nfb, action, vsched_default_vcpu((long)hcpu)); -+} -+ - /* Register at highest priority so that task migration (migrate_all_tasks) - * happens before everything else. 
- */ -@@ -3664,13 +4838,14 @@ void cpu_attach_domain(struct sched_doma - { - migration_req_t req; - unsigned long flags; -- runqueue_t *rq = cpu_rq(cpu); -+ runqueue_t *rq = vcpu_rq(vsched_default_vcpu(cpu)); - int local = 1; - - lock_cpu_hotplug(); - - spin_lock_irqsave(&rq->lock, flags); - -+ pcpu(cpu)->sd = sd; - if (cpu == smp_processor_id() || !cpu_online(cpu)) { - rq->sd = sd; - } else { -@@ -3815,11 +4990,10 @@ void sched_domain_debug(void) - int i; - - for_each_cpu(i) { -- runqueue_t *rq = cpu_rq(i); - struct sched_domain *sd; - int level = 0; - -- sd = rq->sd; -+ sd = pcpu(i)->sd; - - printk(KERN_DEBUG "CPU%d: %s\n", - i, (cpu_online(i) ? " online" : "offline")); -@@ -3836,7 +5010,8 @@ void sched_domain_debug(void) - printk(KERN_DEBUG); - for (j = 0; j < level + 1; j++) - printk(" "); -- printk("domain %d: span %s\n", level, str); -+ printk("domain %d: span %s flags 0x%x\n", -+ level, str, sd->flags); - - if (!cpu_isset(i, sd->span)) - printk(KERN_DEBUG "ERROR domain->span does not contain CPU%d\n", i); -@@ -3907,16 +5082,13 @@ int in_sched_functions(unsigned long add - && addr < (unsigned long)__sched_text_end; - } - --void __init sched_init(void) --{ -- runqueue_t *rq; -- int i, j, k; -- - #ifdef CONFIG_SMP -- /* Set up an initial dummy domain for early boot */ -- static struct sched_domain sched_domain_init; -- static struct sched_group sched_group_init; -+static struct sched_domain sched_domain_init; -+static struct sched_group sched_group_init; - -+/* Set up an initial dummy domain for early boot */ -+static void init_sd(void) -+{ - memset(&sched_domain_init, 0, sizeof(struct sched_domain)); - sched_domain_init.span = CPU_MASK_ALL; - sched_domain_init.groups = &sched_group_init; -@@ -3928,45 +5100,570 @@ void __init sched_init(void) - sched_group_init.cpumask = CPU_MASK_ALL; - sched_group_init.next = &sched_group_init; - sched_group_init.cpu_power = SCHED_LOAD_SCALE; -+} -+#else -+static void inline init_sd(void) -+{ -+} - #endif - -- for (i = 0; i < 
NR_CPUS; i++) { -- prio_array_t *array; -+static void init_rq(struct runqueue *rq) -+{ -+ int j, k; -+ prio_array_t *array; - -- rq = cpu_rq(i); -- spin_lock_init(&rq->lock); -- rq->active = rq->arrays; -- rq->expired = rq->arrays + 1; -- rq->best_expired_prio = MAX_PRIO; -+ spin_lock_init(&rq->lock); -+ rq->active = &rq->arrays[0]; -+ rq->expired = &rq->arrays[1]; -+ rq->best_expired_prio = MAX_PRIO; - - #ifdef CONFIG_SMP -- rq->sd = &sched_domain_init; -- rq->cpu_load = 0; -- rq->active_balance = 0; -- rq->push_cpu = 0; -- rq->migration_thread = NULL; -- INIT_LIST_HEAD(&rq->migration_queue); --#endif -- atomic_set(&rq->nr_iowait, 0); -- -- for (j = 0; j < 2; j++) { -- array = rq->arrays + j; -- for (k = 0; k < MAX_PRIO; k++) { -- INIT_LIST_HEAD(array->queue + k); -- __clear_bit(k, array->bitmap); -- } -- // delimiter for bitsearch -- __set_bit(MAX_PRIO, array->bitmap); -+ rq->sd = &sched_domain_init; -+ rq->cpu_load = 0; -+ rq->active_balance = 0; -+#endif -+ rq->push_cpu = 0; -+ rq->migration_thread = NULL; -+ INIT_LIST_HEAD(&rq->migration_queue); -+ atomic_set(&rq->nr_iowait, 0); -+ -+ for (j = 0; j < 2; j++) { -+ array = rq->arrays + j; -+ for (k = 0; k < MAX_PRIO; k++) { -+ INIT_LIST_HEAD(array->queue + k); -+ __clear_bit(k, array->bitmap); -+ } -+ // delimiter for bitsearch -+ __set_bit(MAX_PRIO, array->bitmap); -+ } -+} -+ -+#if defined(CONFIG_SCHED_VCPU) || defined(CONFIG_FAIRSCHED) -+/* both rq and vsched lock should be taken */ -+static void __install_vcpu(struct vcpu_scheduler *vsched, vcpu_t vcpu) -+{ -+ int id; -+ -+ id = vcpu->id; -+ vcpu->vsched = vsched; -+ vsched->vcpu[id] = vcpu; -+ vcpu->last_pcpu = id; -+ wmb(); -+ /* FIXME: probably locking should be reworked, e.g. 
-+ we don't have corresponding rmb(), so we need to update mask -+ only after quiscent state */ -+ /* init_boot_vcpu() should be remade if RCU is used here */ -+ list_add(&vcpu->list, &vsched->idle_list); -+ cpu_set(id, vsched->vcpu_online_map); -+ vsched->num_online_vcpus++; -+} -+ -+static int install_vcpu(vcpu_t vcpu, struct vcpu_scheduler *vsched) -+{ -+ runqueue_t *rq; -+ unsigned long flags; -+ int res = 0; -+ -+ rq = vcpu_rq(vcpu); -+ spin_lock_irqsave(&rq->lock, flags); -+ spin_lock(&fairsched_lock); -+ -+ if (vsched->vcpu[vcpu->id] != NULL) -+ res = -EBUSY; -+ else -+ __install_vcpu(vsched, vcpu); -+ -+ spin_unlock(&fairsched_lock); -+ spin_unlock_irqrestore(&rq->lock, flags); -+ return res; -+} -+ -+static int __add_vcpu(struct vcpu_scheduler *vsched, int id) -+{ -+ vcpu_t vcpu; -+ int res; -+ -+ res = -ENOMEM; -+ vcpu = kmalloc(sizeof(struct vcpu_info), GFP_KERNEL); -+ if (vcpu == NULL) -+ goto out; -+ -+ init_vcpu(vcpu, id); -+ vcpu_rq(vcpu)->curr = this_pcpu()->idle; -+ res = install_vcpu(vcpu, vsched); -+ if (res < 0) -+ goto out_free; -+ return 0; -+ -+out_free: -+ kfree(vcpu); -+out: -+ return res; -+} -+ -+void vsched_init(struct vcpu_scheduler *vsched, int id) -+{ -+ memset(vsched, 0, sizeof(*vsched)); -+ -+ INIT_LIST_HEAD(&vsched->idle_list); -+ INIT_LIST_HEAD(&vsched->active_list); -+ INIT_LIST_HEAD(&vsched->running_list); -+ vsched->num_online_vcpus = 0; -+ vsched->vcpu_online_map = CPU_MASK_NONE; -+ vsched->vcpu_running_map = CPU_MASK_NONE; -+ vsched->pcpu_running_map = CPU_MASK_NONE; -+ vsched->id = id; -+} -+ -+#ifdef CONFIG_FAIRSCHED -+ -+/* No locks supposed to be held */ -+static void vsched_del_vcpu(vcpu_t vcpu); -+static int vsched_add_vcpu(struct vcpu_scheduler *vsched) -+{ -+ int res, err; -+ vcpu_t vcpu; -+ int id; -+ static DECLARE_MUTEX(id_mutex); -+ -+ down(&id_mutex); -+ id = find_first_zero_bit(vsched->vcpu_online_map.bits, NR_CPUS); -+ if (id >= NR_CPUS) { -+ err = -EBUSY; -+ goto out_up; -+ } -+ -+ err = __add_vcpu(vsched, 
id); -+ if (err < 0) -+ goto out_up; -+ -+ vcpu = vsched_vcpu(vsched, id); -+ err = -ENOMEM; -+ -+ res = vmigration_call(&migration_notifier, CPU_UP_PREPARE, vcpu); -+ if (res != NOTIFY_OK) -+ goto out_del_up; -+ -+ res = vmigration_call(&migration_notifier, CPU_ONLINE, vcpu); -+ if (res != NOTIFY_OK) -+ goto out_cancel_del_up; -+ -+ err = 0; -+ -+out_up: -+ up(&id_mutex); -+ return err; -+ -+out_cancel_del_up: -+ vmigration_call(&migration_notifier, CPU_UP_CANCELED, vcpu); -+out_del_up: -+ vsched_del_vcpu(vcpu); -+ goto out_up; -+} -+ -+static void vsched_del_vcpu(vcpu_t vcpu) -+{ -+ struct vcpu_scheduler *vsched; -+ runqueue_t *rq; -+ -+ vsched = vcpu_vsched(vcpu); -+ rq = vcpu_rq(vcpu); -+ -+ spin_lock_irq(&rq->lock); -+ spin_lock(&fairsched_lock); -+ cpu_clear(vcpu->id, vsched->vcpu_online_map); -+ vsched->num_online_vcpus--; -+ spin_unlock(&fairsched_lock); -+ spin_unlock_irq(&rq->lock); -+ -+ /* -+ * all tasks should migrate from this VCPU somewhere, -+ * also, since this moment VCPU is offline, so migration_thread -+ * won't accept any new tasks... -+ */ -+ vmigration_call(&migration_notifier, CPU_DEAD, vcpu); -+ BUG_ON(rq->nr_running != 0); -+ -+ /* vcpu_put() is called after deactivate_task. 
This loop makes sure -+ * that vcpu_put() was finished and vcpu can be freed */ -+ while ((volatile int)vcpu->running) -+ cpu_relax(); -+ -+ BUG_ON(vcpu->active); /* should be in idle_list */ -+ -+ spin_lock_irq(&fairsched_lock); -+ list_del(&vcpu->list); -+ vsched_vcpu(vsched, vcpu->id) = NULL; -+ spin_unlock_irq(&fairsched_lock); -+ -+ kfree(vcpu); -+} -+ -+int vsched_mvpr(struct task_struct *p, struct vcpu_scheduler *vsched) -+{ -+ vcpu_t dest_vcpu; -+ int id; -+ int res; -+ -+ res = 0; -+ while(1) { -+ /* FIXME: we suppose here that vcpu can't dissapear on the fly */ -+ for(id = first_cpu(vsched->vcpu_online_map); id < NR_CPUS; -+ id++) { -+ if ((vsched->vcpu[id] != NULL) && -+ !vcpu_isset(vsched->vcpu[id], p->cpus_allowed)) -+ continue; -+ else -+ break; -+ } -+ if (id >= NR_CPUS) { -+ res = -EINVAL; -+ goto out; -+ } -+ -+ dest_vcpu = vsched_vcpu(vsched, id); -+ while(1) { -+ sched_migrate_task(p, dest_vcpu); -+ if (task_vsched_id(p) == vsched_id(vsched)) -+ goto out; -+ if (!vcpu_isset(vsched->vcpu[id], p->cpus_allowed)) -+ break; -+ } -+ } -+out: -+ return res; -+} -+ -+void vsched_fairsched_link(struct vcpu_scheduler *vsched, -+ struct fairsched_node *node) -+{ -+ vsched->node = node; -+ node->vsched = vsched; -+} -+ -+void vsched_fairsched_unlink(struct vcpu_scheduler *vsched, -+ struct fairsched_node *node) -+{ -+ vsched->node = NULL; -+ node->vsched = NULL; -+} -+ -+int vsched_create(int id, struct fairsched_node *node) -+{ -+ struct vcpu_scheduler *vsched; -+ int i, res; -+ -+ vsched = kmalloc(sizeof(*vsched), GFP_KERNEL); -+ if (vsched == NULL) -+ return -ENOMEM; -+ -+ vsched_init(vsched, node->id); -+ vsched_fairsched_link(vsched, node); -+ -+ for(i = 0; i < num_online_cpus(); i++) { -+ res = vsched_add_vcpu(vsched); -+ if (res < 0) -+ goto err_add; -+ } -+ return 0; -+ -+err_add: -+ vsched_destroy(vsched); -+ return res; -+} -+ -+int vsched_destroy(struct vcpu_scheduler *vsched) -+{ -+ vcpu_t vcpu; -+ -+ if (vsched == NULL) -+ return 0; -+ -+ 
spin_lock_irq(&fairsched_lock); -+ while(1) { -+ if (!list_empty(&vsched->running_list)) -+ vcpu = list_entry(vsched->running_list.next, -+ struct vcpu_info, list); -+ else if (!list_empty(&vsched->active_list)) -+ vcpu = list_entry(vsched->active_list.next, -+ struct vcpu_info, list); -+ else if (!list_empty(&vsched->idle_list)) -+ vcpu = list_entry(vsched->idle_list.next, -+ struct vcpu_info, list); -+ else -+ break; -+ spin_unlock_irq(&fairsched_lock); -+ vsched_del_vcpu(vcpu); -+ spin_lock_irq(&fairsched_lock); -+ } -+ if (vsched->num_online_vcpus) -+ goto err_busy; -+ spin_unlock_irq(&fairsched_lock); -+ -+ vsched_fairsched_unlink(vsched, vsched->node); -+ kfree(vsched); -+ return 0; -+ -+err_busy: -+ printk(KERN_ERR "BUG in vsched_destroy, vsched id %d\n", -+ vsched->id); -+ spin_unlock_irq(&fairsched_lock); -+ return -EBUSY; -+ -+} -+#endif /* defined(CONFIG_FAIRSCHED) */ -+#endif /* defined(CONFIG_SCHED_VCPU) || defined(CONFIG_FAIRSCHED) */ -+ -+#ifdef CONFIG_VE -+/* -+ * This function is used to show fake CPU information. -+ * -+ * I'm still quite unsure that faking CPU speed is such a good idea, -+ * but someone (Kirill?) has made this decision. -+ * What I'm absolutely sure is that it's a part of virtualization, -+ * not a scheduler. 20050727 SAW -+ */ -+#ifdef CONFIG_FAIRSCHED -+unsigned long ve_scale_khz(unsigned long khz) -+{ -+ struct fairsched_node *node; -+ int cpus; -+ unsigned long rate; -+ -+ cpus = fairsched_nr_cpus; -+ rate = cpus << FSCHRATE_SHIFT; -+ -+ /* -+ * Ideally fairsched node should be taken from the current ve_struct. -+ * However, to simplify the code and locking, it is taken from current -+ * (currently fairsched_node can be changed only for a sleeping task). -+ * That means that VE0 processes moved to some special node will get -+ * fake CPU speed, but that shouldn't be a big problem. 
-+ */ -+ preempt_disable(); -+ node = current->vsched->node; -+ if (node->rate_limited) -+ rate = node->rate; -+ preempt_enable(); -+ -+ return ((unsigned long long)khz * (rate / cpus)) >> FSCHRATE_SHIFT; -+} -+#endif -+#endif /* CONFIG_VE */ -+ -+static void init_boot_vcpu(void) -+{ -+ int res; -+ -+ /* -+ * We setup boot_vcpu and it's runqueue until init_idle() happens -+ * on cpu0. This is required since timer interrupts can happen -+ * between sched_init() and init_idle(). -+ */ -+ init_vcpu(&boot_vcpu, 0); -+ vcpu_rq(&boot_vcpu)->curr = current; -+ res = install_vcpu(&boot_vcpu, &default_vsched); -+ if (res < 0) -+ panic("Can't install boot vcpu"); -+ -+ this_pcpu()->vcpu = &boot_vcpu; -+ this_pcpu()->vsched = boot_vcpu.vsched; -+} -+ -+static void init_pcpu(int id) -+{ -+ struct pcpu_info *pcpu; -+ -+ pcpu = pcpu(id); -+ pcpu->id = id; -+#ifdef CONFIG_SMP -+ pcpu->sd = &sched_domain_init; -+#endif -+ -+#ifndef CONFIG_SCHED_VCPU -+ init_vcpu(vcpu(id), id); -+#endif -+} -+ -+static void init_pcpus(void) -+{ -+ int i; -+ for (i = 0; i < NR_CPUS; i++) -+ init_pcpu(i); -+} -+ -+#ifdef CONFIG_SCHED_VCPU -+static void show_vcpu_list(struct vcpu_scheduler *vsched, struct list_head *lh) -+{ -+ cpumask_t m; -+ vcpu_t vcpu; -+ int i; -+ -+ cpus_clear(m); -+ list_for_each_entry(vcpu, lh, list) -+ cpu_set(vcpu->id, m); -+ -+ for (i = 0; i < NR_CPUS; i++) -+ if (cpu_isset(i, m)) -+ printk("%d ", i); -+} -+ -+#define PRINT(s, sz, fmt...) 
\ -+ do { \ -+ int __out; \ -+ __out = scnprintf(*s, *sz, fmt); \ -+ *s += __out; \ -+ *sz -= __out; \ -+ } while(0) -+ -+static void show_rq_array(prio_array_t *array, char *header, char **s, int *sz) -+{ -+ struct list_head *list; -+ task_t *p; -+ int k, h; -+ -+ h = 0; -+ for (k = 0; k < MAX_PRIO; k++) { -+ list = array->queue + k; -+ if (list_empty(list)) -+ continue; -+ -+ if (!h) { -+ PRINT(s, sz, header); -+ h = 1; - } -+ -+ PRINT(s, sz, " prio %d (", k); -+ list_for_each_entry(p, list, run_list) -+ PRINT(s, sz, "%s[%d] ", p->comm, p->pid); -+ PRINT(s, sz, ")"); - } -+ if (h) -+ PRINT(s, sz, "\n"); -+} -+ -+static void show_vcpu(vcpu_t vcpu) -+{ -+ runqueue_t *rq; -+ char buf[1024], *s; -+ unsigned long flags; -+ int sz; -+ -+ if (vcpu == NULL) -+ return; -+ -+ rq = vcpu_rq(vcpu); -+ spin_lock_irqsave(&rq->lock, flags); -+ printk(" vcpu %d: last_pcpu %d, state %s%s\n", -+ vcpu->id, vcpu->last_pcpu, -+ vcpu->active ? "A" : "", -+ vcpu->running ? "R" : ""); -+ -+ printk(" rq: running %lu, load %lu, sw %Lu, sd %p\n", -+ rq->nr_running, -+#ifdef CONFIG_SMP -+ rq->cpu_load, -+#else -+ 0LU, -+#endif -+ rq->nr_switches, -+#ifdef CONFIG_SMP -+ rq->sd -+#else -+ NULL -+#endif -+ ); -+ -+ s = buf; -+ sz = sizeof(buf) - 1; -+ -+ show_rq_array(rq->active, " active:", &s, &sz); -+ show_rq_array(rq->expired, " expired:", &s, &sz); -+ spin_unlock_irqrestore(&rq->lock, flags); -+ -+ *s = 0; -+ printk(buf); -+} -+ -+static inline void fairsched_show_node(struct vcpu_scheduler *vsched) -+{ -+#ifdef CONFIG_FAIRSCHED -+ struct fairsched_node *node; -+ -+ node = vsched->node; -+ printk("fsnode: ready %d run %d cpu %d vsched %p, pcpu %d\n", -+ node->nr_ready, node->nr_runnable, node->nr_pcpu, -+ node->vsched, smp_processor_id()); -+#endif -+} -+ -+static void __show_vsched(struct vcpu_scheduler *vsched) -+{ -+ char mask[NR_CPUS + 1]; -+ int i; -+ unsigned long flags; -+ -+ spin_lock_irqsave(&fairsched_lock, flags); -+ printk("vsched id=%d\n", vsched_id(vsched)); -+ 
fairsched_show_node(vsched); -+ -+ printk(" idle cpus "); -+ show_vcpu_list(vsched, &vsched->idle_list); -+ printk("; active cpus "); -+ show_vcpu_list(vsched, &vsched->active_list); -+ printk("; running cpus "); -+ show_vcpu_list(vsched, &vsched->running_list); -+ printk("\n"); -+ -+ cpumask_scnprintf(mask, NR_CPUS, vsched->vcpu_online_map); -+ printk(" num_online_cpus=%d, mask=%s (w=%d)\n", -+ vsched->num_online_vcpus, mask, -+ cpus_weight(vsched->vcpu_online_map)); -+ spin_unlock_irqrestore(&fairsched_lock, flags); -+ -+ for (i = 0; i < NR_CPUS; i++) -+ show_vcpu(vsched->vcpu[i]); -+} -+ -+void show_vsched(void) -+{ -+ oops_in_progress = 1; -+ __show_vsched(&idle_vsched); -+ __show_vsched(&default_vsched); -+ oops_in_progress = 0; -+} -+#endif /* CONFIG_SCHED_VCPU */ -+ -+void __init sched_init(void) -+{ -+ runqueue_t *rq; -+ -+ init_sd(); -+ init_pcpus(); -+#if defined(CONFIG_SCHED_VCPU) -+ vsched_init(&idle_vsched, -1); -+ vsched_init(&default_vsched, 0); -+#if defined(CONFIG_FAIRSCHED) -+ fairsched_init_early(); -+ vsched_fairsched_link(&idle_vsched, &fairsched_idle_node); -+ vsched_fairsched_link(&default_vsched, &fairsched_init_node); -+#endif -+ init_boot_vcpu(); -+#else -+#if defined(CONFIG_FAIRSCHED) -+ fairsched_init_early(); -+#endif -+#endif - /* - * We have to do a little magic to get the first - * thread right in SMP mode. - */ -+ set_task_vsched(current, &default_vsched); -+ set_task_cpu(current, smp_processor_id()); -+ /* FIXME: remove or is it required for UP? 
--set in vsched_init() */ - rq = this_rq(); - rq->curr = current; -- rq->idle = current; -- set_task_cpu(current, smp_processor_id()); -+ this_pcpu()->idle = current; - wake_up_forked_process(current); - - /* -@@ -4043,3 +5740,7 @@ void __sched __preempt_write_lock(rwlock - - EXPORT_SYMBOL(__preempt_write_lock); - #endif /* defined(CONFIG_SMP) && defined(CONFIG_PREEMPT) */ -+ -+EXPORT_SYMBOL(ve_sched_get_idle_time); -+EXPORT_SYMBOL(nr_running_ve); -+EXPORT_SYMBOL(nr_uninterruptible_ve); -diff -uprN linux-2.6.8.1.orig/kernel/signal.c linux-2.6.8.1-ve022stab078/kernel/signal.c ---- linux-2.6.8.1.orig/kernel/signal.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/signal.c 2006-05-11 13:05:45.000000000 +0400 -@@ -12,6 +12,7 @@ - - #include <linux/config.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/module.h> - #include <linux/smp_lock.h> - #include <linux/init.h> -@@ -26,6 +27,9 @@ - #include <asm/unistd.h> - #include <asm/siginfo.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_misc.h> -+ - /* - * SLAB caches for signal bits. - */ -@@ -214,6 +218,7 @@ static inline int has_pending_signals(si - fastcall void recalc_sigpending_tsk(struct task_struct *t) - { - if (t->signal->group_stop_count > 0 || -+ test_tsk_thread_flag(t,TIF_FREEZE) || - PENDING(&t->pending, &t->blocked) || - PENDING(&t->signal->shared_pending, &t->blocked)) - set_tsk_thread_flag(t, TIF_SIGPENDING); -@@ -267,13 +272,26 @@ static struct sigqueue *__sigqueue_alloc - struct sigqueue *q = NULL; - - if (atomic_read(¤t->user->sigpending) < -- current->rlim[RLIMIT_SIGPENDING].rlim_cur) -+ current->rlim[RLIMIT_SIGPENDING].rlim_cur) { - q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC); -+ if (q != NULL) { -+ /* -+ * Note: use of get_exec_ub() here vs get_task_ub() -+ * in send_signal() is not intentional. 
SAW 2005/03/09 -+ */ -+ if (ub_siginfo_charge(get_exec_ub(), -+ kmem_cache_memusage(sigqueue_cachep))) { -+ kfree(q); -+ q = NULL; -+ } -+ } -+ } - if (q) { - INIT_LIST_HEAD(&q->list); - q->flags = 0; - q->lock = NULL; - q->user = get_uid(current->user); -+ sig_ub(q) = get_beancounter(get_exec_ub()); - atomic_inc(&q->user->sigpending); - } - return(q); -@@ -283,6 +301,8 @@ static inline void __sigqueue_free(struc - { - if (q->flags & SIGQUEUE_PREALLOC) - return; -+ ub_siginfo_uncharge(sig_ub(q), kmem_cache_memusage(sigqueue_cachep)); -+ put_beancounter(sig_ub(q)); - atomic_dec(&q->user->sigpending); - free_uid(q->user); - kmem_cache_free(sigqueue_cachep, q); -@@ -500,7 +520,16 @@ static int __dequeue_signal(struct sigpe - { - int sig = 0; - -- sig = next_signal(pending, mask); -+ /* SIGKILL must have priority, otherwise it is quite easy -+ * to create an unkillable process, sending sig < SIGKILL -+ * to self */ -+ if (unlikely(sigismember(&pending->signal, SIGKILL))) { -+ if (!sigismember(mask, SIGKILL)) -+ sig = SIGKILL; -+ } -+ -+ if (likely(!sig)) -+ sig = next_signal(pending, mask); - if (sig) { - if (current->notifier) { - if (sigismember(current->notifier_mask, sig)) { -@@ -721,12 +750,21 @@ static int send_signal(int sig, struct s - pass on the info struct. 
*/ - - if (atomic_read(&t->user->sigpending) < -- t->rlim[RLIMIT_SIGPENDING].rlim_cur) -+ t->rlim[RLIMIT_SIGPENDING].rlim_cur) { - q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC); -+ if (q != NULL) { -+ if (ub_siginfo_charge(get_task_ub(t), -+ kmem_cache_memusage(sigqueue_cachep))) { -+ kfree(q); -+ q = NULL; -+ } -+ } -+ } - - if (q) { - q->flags = 0; - q->user = get_uid(t->user); -+ sig_ub(q) = get_beancounter(get_task_ub(t)); - atomic_inc(&q->user->sigpending); - list_add_tail(&q->list, &signals->list); - switch ((unsigned long) info) { -@@ -734,7 +772,7 @@ static int send_signal(int sig, struct s - q->info.si_signo = sig; - q->info.si_errno = 0; - q->info.si_code = SI_USER; -- q->info.si_pid = current->pid; -+ q->info.si_pid = virt_pid(current); - q->info.si_uid = current->uid; - break; - case 1: -@@ -855,7 +893,7 @@ force_sig_specific(int sig, struct task_ - */ - #define wants_signal(sig, p, mask) \ - (!sigismember(&(p)->blocked, sig) \ -- && !((p)->state & mask) \ -+ && !(((p)->state | (p)->exit_state) & mask) \ - && !((p)->flags & PF_EXITING) \ - && (task_curr(p) || !signal_pending(p))) - -@@ -993,7 +1031,7 @@ __group_send_sig_info(int sig, struct si - * Don't bother zombies and stopped tasks (but - * SIGKILL will punch through stopped state) - */ -- mask = TASK_DEAD | TASK_ZOMBIE; -+ mask = EXIT_DEAD | EXIT_ZOMBIE; - if (sig != SIGKILL) - mask |= TASK_STOPPED; - -@@ -1026,7 +1064,7 @@ void zap_other_threads(struct task_struc - /* - * Don't bother with already dead threads - */ -- if (t->state & (TASK_ZOMBIE|TASK_DEAD)) -+ if (t->exit_state & (EXIT_ZOMBIE|EXIT_DEAD)) - continue; - - /* -@@ -1072,20 +1110,23 @@ int group_send_sig_info(int sig, struct - int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp) - { - struct task_struct *p; -- struct list_head *l; -- struct pid *pid; - int retval, success; - - if (pgrp <= 0) - return -EINVAL; - -+ /* Use __vpid_to_pid(). This function is used under write_lock -+ * tasklist_lock. 
*/ -+ if (is_virtual_pid(pgrp)) -+ pgrp = __vpid_to_pid(pgrp); -+ - success = 0; - retval = -ESRCH; -- for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) { -+ do_each_task_pid_ve(pgrp, PIDTYPE_PGID, p) { - int err = group_send_sig_info(sig, info, p); - success |= !err; - retval = err; -- } -+ } while_each_task_pid_ve(pgrp, PIDTYPE_PGID, p); - return success ? 0 : retval; - } - -@@ -1112,22 +1153,22 @@ int - kill_sl_info(int sig, struct siginfo *info, pid_t sid) - { - int err, retval = -EINVAL; -- struct pid *pid; -- struct list_head *l; - struct task_struct *p; - - if (sid <= 0) - goto out; - -+ sid = vpid_to_pid(sid); -+ - retval = -ESRCH; - read_lock(&tasklist_lock); -- for_each_task_pid(sid, PIDTYPE_SID, p, l, pid) { -+ do_each_task_pid_ve(sid, PIDTYPE_SID, p) { - if (!p->signal->leader) - continue; - err = group_send_sig_info(sig, info, p); - if (retval) - retval = err; -- } -+ } while_each_task_pid_ve(sid, PIDTYPE_SID, p); - read_unlock(&tasklist_lock); - out: - return retval; -@@ -1140,7 +1181,7 @@ kill_proc_info(int sig, struct siginfo * - struct task_struct *p; - - read_lock(&tasklist_lock); -- p = find_task_by_pid(pid); -+ p = find_task_by_pid_ve(pid); - error = -ESRCH; - if (p) - error = group_send_sig_info(sig, info, p); -@@ -1165,8 +1206,8 @@ static int kill_something_info(int sig, - struct task_struct * p; - - read_lock(&tasklist_lock); -- for_each_process(p) { -- if (p->pid > 1 && p->tgid != current->tgid) { -+ for_each_process_ve(p) { -+ if (virt_pid(p) > 1 && p->tgid != current->tgid) { - int err = group_send_sig_info(sig, info, p); - ++count; - if (err != -EPERM) -@@ -1377,7 +1418,7 @@ send_group_sigqueue(int sig, struct sigq - * Don't bother zombies and stopped tasks (but - * SIGKILL will punch through stopped state) - */ -- mask = TASK_DEAD | TASK_ZOMBIE; -+ mask = EXIT_DEAD | EXIT_ZOMBIE; - if (sig != SIGKILL) - mask |= TASK_STOPPED; - -@@ -1436,12 +1477,22 @@ void do_notify_parent(struct task_struct - if (sig == -1) - BUG(); - -- 
BUG_ON(tsk->group_leader != tsk && tsk->group_leader->state != TASK_ZOMBIE && !tsk->ptrace); -+ BUG_ON(tsk->group_leader != tsk && -+ tsk->group_leader->exit_state != EXIT_ZOMBIE && -+ tsk->group_leader->exit_state != EXIT_DEAD && -+ !tsk->ptrace); - BUG_ON(tsk->group_leader == tsk && !thread_group_empty(tsk) && !tsk->ptrace); - -+#ifdef CONFIG_VE -+ /* Allow to send only SIGCHLD from VE */ -+ if (sig != SIGCHLD && -+ VE_TASK_INFO(tsk)->owner_env != VE_TASK_INFO(tsk->parent)->owner_env) -+ sig = SIGCHLD; -+#endif -+ - info.si_signo = sig; - info.si_errno = 0; -- info.si_pid = tsk->pid; -+ info.si_pid = get_task_pid_ve(tsk, VE_TASK_INFO(tsk->parent)->owner_env); - info.si_uid = tsk->uid; - - /* FIXME: find out whether or not this is supposed to be c*time. */ -@@ -1475,7 +1526,7 @@ void do_notify_parent(struct task_struct - - psig = tsk->parent->sighand; - spin_lock_irqsave(&psig->siglock, flags); -- if (sig == SIGCHLD && tsk->state != TASK_STOPPED && -+ if (!tsk->ptrace && sig == SIGCHLD && tsk->state != TASK_STOPPED && - (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN || - (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) { - /* -@@ -1530,7 +1581,7 @@ do_notify_parent_cldstop(struct task_str - - info.si_signo = SIGCHLD; - info.si_errno = 0; -- info.si_pid = tsk->pid; -+ info.si_pid = get_task_pid_ve(tsk, VE_TASK_INFO(parent)->owner_env); - info.si_uid = tsk->uid; - - /* FIXME: find out whether or not this is supposed to be c*time. */ -@@ -1575,7 +1626,9 @@ finish_stop(int stop_count) - read_unlock(&tasklist_lock); - } - -+ set_stop_state(current); - schedule(); -+ clear_stop_state(current); - /* - * Now we don't run again until continued. - */ -@@ -1756,10 +1809,12 @@ relock: - /* Let the debugger run. 
*/
- current->exit_code = signr;
- current->last_siginfo = info;
-+ set_pn_state(current, PN_STOP_SIGNAL);
- set_current_state(TASK_STOPPED);
- spin_unlock_irq(&current->sighand->siglock);
- notify_parent(current, SIGCHLD);
- schedule();
-+ clear_pn_state(current);
-
- current->last_siginfo = NULL;
-
-@@ -1779,7 +1834,7 @@ relock:
- info->si_signo = signr;
- info->si_errno = 0;
- info->si_code = SI_USER;
-- info->si_pid = current->parent->pid;
-+ info->si_pid = virt_pid(current->parent);
- info->si_uid = current->parent->uid;
- }
-
-@@ -1803,8 +1858,14 @@ relock:
- continue;
-
- /* Init gets no signals it doesn't want. */
-- if (current->pid == 1)
-+ if (virt_pid(current) == 1) {
-+ /* Allow SIGKILL for non-root VE */
-+#ifdef CONFIG_VE
-+ if (current->pid == 1 ||
-+ signr != SIGKILL)
-+#endif
- continue;
-+ }
-
- if (sig_kernel_stop(signr)) {
- /*
-@@ -2174,7 +2235,7 @@ sys_kill(int pid, int sig)
- info.si_signo = sig;
- info.si_errno = 0;
- info.si_code = SI_USER;
-- info.si_pid = current->tgid;
-+ info.si_pid = virt_tgid(current);
- info.si_uid = current->uid;
-
- return kill_something_info(sig, &info, pid);
-@@ -2203,13 +2264,13 @@ asmlinkage long sys_tgkill(int tgid, int
- info.si_signo = sig;
- info.si_errno = 0;
- info.si_code = SI_TKILL;
-- info.si_pid = current->tgid;
-+ info.si_pid = virt_tgid(current);
- info.si_uid = current->uid;
-
- read_lock(&tasklist_lock);
-- p = find_task_by_pid(pid);
-+ p = find_task_by_pid_ve(pid);
- error = -ESRCH;
-- if (p && (p->tgid == tgid)) {
-+ if (p && (virt_tgid(p) == tgid)) {
- error = check_kill_permission(sig, &info, p);
- /*
- * The null signal is a permissions and process existence
-@@ -2243,11 +2304,11 @@ sys_tkill(int pid, int sig)
- info.si_signo = sig;
- info.si_errno = 0;
- info.si_code = SI_TKILL;
-- info.si_pid = current->tgid;
-+ info.si_pid = virt_tgid(current);
- info.si_uid = current->uid;
-
- read_lock(&tasklist_lock);
-- p = find_task_by_pid(pid);
-+ p = find_task_by_pid_ve(pid);
- error = -ESRCH;
- if (p) {
- error = check_kill_permission(sig, &info, p);
-@@ -2285,7 +2346,7 @@ sys_rt_sigqueueinfo(int pid, int sig, si
- }
-
- int
--do_sigaction(int sig, const struct k_sigaction *act, struct k_sigaction *oact)
-+do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
- {
- struct k_sigaction *k;
-
-@@ -2308,6 +2369,8 @@ do_sigaction(int sig, const struct k_sig
- *oact = *k;
-
- if (act) {
-+ sigdelsetmask(&act->sa.sa_mask,
-+ sigmask(SIGKILL) | sigmask(SIGSTOP));
- /*
- * POSIX 3.3.1.3:
- * "Setting a signal action to SIG_IGN for a signal that is
-@@ -2333,8 +2396,6 @@ do_sigaction(int sig, const struct k_sig
- read_lock(&tasklist_lock);
- spin_lock_irq(&t->sighand->siglock);
- *k = *act;
-- sigdelsetmask(&k->sa.sa_mask,
-- sigmask(SIGKILL) | sigmask(SIGSTOP));
- rm_from_queue(sigmask(sig), &t->signal->shared_pending);
- do {
- rm_from_queue(sigmask(sig), &t->pending);
-@@ -2347,8 +2408,6 @@ do_sigaction(int sig, const struct k_sig
- }
-
- *k = *act;
-- sigdelsetmask(&k->sa.sa_mask,
-- sigmask(SIGKILL) | sigmask(SIGSTOP));
- }
-
- spin_unlock_irq(&current->sighand->siglock);
-@@ -2554,6 +2613,7 @@ sys_signal(int sig, __sighandler_t handl
-
- new_sa.sa.sa_handler = handler;
- new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK;
-+ sigemptyset(&new_sa.sa.sa_mask);
-
- ret = do_sigaction(sig, &new_sa, &old_sa);
-
-@@ -2579,5 +2639,5 @@ void __init signals_init(void)
- kmem_cache_create("sigqueue",
- sizeof(struct sigqueue),
- __alignof__(struct sigqueue),
-- SLAB_PANIC, NULL, NULL);
-+ SLAB_PANIC|SLAB_UBC, NULL, NULL);
- }
-diff -uprN linux-2.6.8.1.orig/kernel/softirq.c linux-2.6.8.1-ve022stab078/kernel/softirq.c
---- linux-2.6.8.1.orig/kernel/softirq.c 2004-08-14 14:54:52.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/kernel/softirq.c 2006-05-11 13:05:40.000000000 +0400
-@@ -15,8 +15,10 @@
- #include <linux/percpu.h>
- #include <linux/cpu.h>
- #include <linux/kthread.h>
-+#include <linux/sysctl.h>
-
- #include <asm/irq.h>
-+#include <ub/beancounter.h>
- /*
- - No
shared variables, all the data are CPU local. - - If a softirq needs serialization, let it serialize itself -@@ -43,6 +45,8 @@ EXPORT_SYMBOL(irq_stat); - static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp; - - static DEFINE_PER_CPU(struct task_struct *, ksoftirqd); -+static DEFINE_PER_CPU(struct task_struct *, ksoftirqd_wakeup); -+static int ksoftirqd_stat[NR_CPUS]; - - /* - * we cannot loop indefinitely here to avoid userspace starvation, -@@ -53,7 +57,7 @@ static DEFINE_PER_CPU(struct task_struct - static inline void wakeup_softirqd(void) - { - /* Interrupts are disabled: no need to stop preemption */ -- struct task_struct *tsk = __get_cpu_var(ksoftirqd); -+ struct task_struct *tsk = __get_cpu_var(ksoftirqd_wakeup); - - if (tsk && tsk->state != TASK_RUNNING) - wake_up_process(tsk); -@@ -75,10 +79,13 @@ asmlinkage void __do_softirq(void) - struct softirq_action *h; - __u32 pending; - int max_restart = MAX_SOFTIRQ_RESTART; -+ struct user_beancounter *old_exec_ub; -+ struct ve_struct *envid; - - pending = local_softirq_pending(); - - local_bh_disable(); -+ envid = set_exec_env(get_ve0()); - restart: - /* Reset the pending bitmask before enabling irqs */ - local_softirq_pending() = 0; -@@ -87,6 +94,8 @@ restart: - - h = softirq_vec; - -+ old_exec_ub = set_exec_ub(get_ub0()); -+ - do { - if (pending & 1) - h->action(h); -@@ -94,6 +103,8 @@ restart: - pending >>= 1; - } while (pending); - -+ (void)set_exec_ub(old_exec_ub); -+ - local_irq_disable(); - - pending = local_softirq_pending(); -@@ -103,6 +114,7 @@ restart: - if (pending) - wakeup_softirqd(); - -+ (void)set_exec_env(envid); - __local_bh_enable(); - } - -@@ -451,6 +463,52 @@ static int __devinit cpu_callback(struct - return NOTIFY_OK; - } - -+static int proc_ksoftirqd(ctl_table *ctl, int write, struct file *filp, -+ void __user *buffer, size_t *lenp, loff_t *ppos) -+{ -+ int ret, cpu; -+ -+ ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos); -+ if (!write) -+ return ret; -+ -+ 
for_each_online_cpu(cpu) { -+ per_cpu(ksoftirqd_wakeup, cpu) = -+ ksoftirqd_stat[cpu] ? per_cpu(ksoftirqd, cpu) : NULL; -+ } -+ return ret; -+} -+ -+static int sysctl_ksoftirqd(ctl_table *table, int *name, int nlen, -+ void *oldval, size_t *oldlenp, void *newval, size_t newlen, -+ void **context) -+{ -+ return -EINVAL; -+} -+ -+static ctl_table debug_table[] = { -+ { -+ .ctl_name = 1246, -+ .procname = "ksoftirqd", -+ .data = ksoftirqd_stat, -+ .maxlen = sizeof(ksoftirqd_stat), -+ .mode = 0644, -+ .proc_handler = &proc_ksoftirqd, -+ .strategy = &sysctl_ksoftirqd -+ }, -+ {0} -+}; -+ -+static ctl_table root_table[] = { -+ { -+ .ctl_name = CTL_DEBUG, -+ .procname = "debug", -+ .mode = 0555, -+ .child = debug_table -+ }, -+ {0} -+}; -+ - static struct notifier_block __devinitdata cpu_nfb = { - .notifier_call = cpu_callback - }; -@@ -461,5 +519,6 @@ __init int spawn_ksoftirqd(void) - cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu); - cpu_callback(&cpu_nfb, CPU_ONLINE, cpu); - register_cpu_notifier(&cpu_nfb); -+ register_sysctl_table(root_table, 0); - return 0; - } -diff -uprN linux-2.6.8.1.orig/kernel/stop_machine.c linux-2.6.8.1-ve022stab078/kernel/stop_machine.c ---- linux-2.6.8.1.orig/kernel/stop_machine.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/stop_machine.c 2006-05-11 13:05:39.000000000 +0400 -@@ -6,6 +6,7 @@ - #include <linux/syscalls.h> - #include <asm/atomic.h> - #include <asm/semaphore.h> -+#include <asm/uaccess.h> - - /* Since we effect priority and affinity (both of which are visible - * to, and settable by outside processes) we do indirection via a -@@ -81,16 +82,20 @@ static int stop_machine(void) - { - int i, ret = 0; - struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; -+ mm_segment_t old_fs = get_fs(); - - /* One high-prio thread per cpu. We'll do this one. 
*/
-- sys_sched_setscheduler(current->pid, SCHED_FIFO, &param);
-+ set_fs(KERNEL_DS);
-+ sys_sched_setscheduler(current->pid, SCHED_FIFO,
-+ (struct sched_param __user *)&param);
-+ set_fs(old_fs);
-
- atomic_set(&stopmachine_thread_ack, 0);
- stopmachine_num_threads = 0;
- stopmachine_state = STOPMACHINE_WAIT;
-
- for_each_online_cpu(i) {
-- if (i == smp_processor_id())
-+ if (i == task_cpu(current))
- continue;
- ret = kernel_thread(stopmachine, (void *)(long)i,CLONE_KERNEL);
- if (ret < 0)
-@@ -109,13 +114,12 @@ static int stop_machine(void)
- return ret;
- }
-
-- /* Don't schedule us away at this point, please. */
-- local_irq_disable();
--
- /* Now they are all started, make them hold the CPUs, ready. */
-+ preempt_disable();
- stopmachine_set_state(STOPMACHINE_PREPARE);
-
- /* Make them disable irqs. */
-+ local_irq_disable();
- stopmachine_set_state(STOPMACHINE_DISABLE_IRQ);
-
- return 0;
-@@ -125,6 +129,7 @@ static void restart_machine(void)
- {
- stopmachine_set_state(STOPMACHINE_EXIT);
- local_irq_enable();
-+ preempt_enable_no_resched();
- }
-
- struct stop_machine_data
-diff -uprN linux-2.6.8.1.orig/kernel/sys.c linux-2.6.8.1-ve022stab078/kernel/sys.c
---- linux-2.6.8.1.orig/kernel/sys.c 2004-08-14 14:54:49.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/kernel/sys.c 2006-05-11 13:05:47.000000000 +0400
-@@ -12,6 +12,7 @@
- #include <linux/mman.h>
- #include <linux/smp_lock.h>
- #include <linux/notifier.h>
-+#include <linux/virtinfo.h>
- #include <linux/reboot.h>
- #include <linux/prctl.h>
- #include <linux/init.h>
-@@ -23,6 +24,7 @@
- #include <linux/security.h>
- #include <linux/dcookies.h>
- #include <linux/suspend.h>
-+#include <linux/tty.h>
-
- #include <asm/uaccess.h>
- #include <asm/io.h>
-@@ -213,6 +215,102 @@ int unregister_reboot_notifier(struct no
-
- EXPORT_SYMBOL(unregister_reboot_notifier);
-
-+DECLARE_MUTEX(virtinfo_sem);
-+EXPORT_SYMBOL(virtinfo_sem);
-+static struct vnotifier_block *virtinfo_chain[VIRT_TYPES];
-+
-+void
__virtinfo_notifier_register(int type, struct vnotifier_block *nb) -+{ -+ struct vnotifier_block **p; -+ -+ for (p = &virtinfo_chain[type]; -+ *p != NULL && nb->priority < (*p)->priority; -+ p = &(*p)->next); -+ nb->next = *p; -+ smp_wmb(); -+ *p = nb; -+} -+ -+EXPORT_SYMBOL(__virtinfo_notifier_register); -+ -+void virtinfo_notifier_register(int type, struct vnotifier_block *nb) -+{ -+ down(&virtinfo_sem); -+ __virtinfo_notifier_register(type, nb); -+ up(&virtinfo_sem); -+} -+ -+EXPORT_SYMBOL(virtinfo_notifier_register); -+ -+struct virtinfo_cnt_struct { -+ volatile unsigned long exit[NR_CPUS]; -+ volatile unsigned long entry; -+}; -+static DEFINE_PER_CPU(struct virtinfo_cnt_struct, virtcnt); -+ -+void virtinfo_notifier_unregister(int type, struct vnotifier_block *nb) -+{ -+ struct vnotifier_block **p; -+ int entry_cpu, exit_cpu; -+ unsigned long cnt, ent; -+ -+ down(&virtinfo_sem); -+ for (p = &virtinfo_chain[type]; *p != nb; p = &(*p)->next); -+ *p = nb->next; -+ smp_mb(); -+ -+ for_each_cpu_mask(entry_cpu, cpu_possible_map) { -+ while (1) { -+ cnt = 0; -+ for_each_cpu_mask(exit_cpu, cpu_possible_map) -+ cnt += -+ per_cpu(virtcnt, entry_cpu).exit[exit_cpu]; -+ smp_rmb(); -+ ent = per_cpu(virtcnt, entry_cpu).entry; -+ if (cnt == ent) -+ break; -+ __set_current_state(TASK_UNINTERRUPTIBLE); -+ schedule_timeout(HZ / 100); -+ } -+ } -+ up(&virtinfo_sem); -+} -+ -+EXPORT_SYMBOL(virtinfo_notifier_unregister); -+ -+int virtinfo_notifier_call(int type, unsigned long n, void *data) -+{ -+ int ret; -+ int entry_cpu, exit_cpu; -+ struct vnotifier_block *nb; -+ -+ entry_cpu = get_cpu(); -+ per_cpu(virtcnt, entry_cpu).entry++; -+ smp_wmb(); -+ put_cpu(); -+ -+ nb = virtinfo_chain[type]; -+ ret = NOTIFY_DONE; -+ while (nb) -+ { -+ ret = nb->notifier_call(nb, n, data, ret); -+ if(ret & NOTIFY_STOP_MASK) { -+ ret &= ~NOTIFY_STOP_MASK; -+ break; -+ } -+ nb = nb->next; -+ } -+ -+ exit_cpu = get_cpu(); -+ smp_wmb(); -+ per_cpu(virtcnt, entry_cpu).exit[exit_cpu]++; -+ put_cpu(); -+ 
-+ return ret; -+} -+ -+EXPORT_SYMBOL(virtinfo_notifier_call); -+ - asmlinkage long sys_ni_syscall(void) - { - return -ENOSYS; -@@ -310,8 +408,6 @@ asmlinkage long sys_setpriority(int whic - { - struct task_struct *g, *p; - struct user_struct *user; -- struct pid *pid; -- struct list_head *l; - int error = -EINVAL; - - if (which > 2 || which < 0) -@@ -328,16 +424,19 @@ asmlinkage long sys_setpriority(int whic - switch (which) { - case PRIO_PROCESS: - if (!who) -- who = current->pid; -- p = find_task_by_pid(who); -+ who = virt_pid(current); -+ p = find_task_by_pid_ve(who); - if (p) - error = set_one_prio(p, niceval, error); - break; - case PRIO_PGRP: - if (!who) - who = process_group(current); -- for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) -+ else -+ who = vpid_to_pid(who); -+ do_each_task_pid_ve(who, PIDTYPE_PGID, p) { - error = set_one_prio(p, niceval, error); -+ } while_each_task_pid_ve(who, PIDTYPE_PGID, p); - break; - case PRIO_USER: - if (!who) -@@ -348,10 +447,10 @@ asmlinkage long sys_setpriority(int whic - if (!user) - goto out_unlock; - -- do_each_thread(g, p) -+ do_each_thread_ve(g, p) { - if (p->uid == who) - error = set_one_prio(p, niceval, error); -- while_each_thread(g, p); -+ } while_each_thread_ve(g, p); - if (who) - free_uid(user); /* For find_user() */ - break; -@@ -371,8 +470,6 @@ out: - asmlinkage long sys_getpriority(int which, int who) - { - struct task_struct *g, *p; -- struct list_head *l; -- struct pid *pid; - struct user_struct *user; - long niceval, retval = -ESRCH; - -@@ -383,8 +480,8 @@ asmlinkage long sys_getpriority(int whic - switch (which) { - case PRIO_PROCESS: - if (!who) -- who = current->pid; -- p = find_task_by_pid(who); -+ who = virt_pid(current); -+ p = find_task_by_pid_ve(who); - if (p) { - niceval = 20 - task_nice(p); - if (niceval > retval) -@@ -394,11 +491,13 @@ asmlinkage long sys_getpriority(int whic - case PRIO_PGRP: - if (!who) - who = process_group(current); -- for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) { -+ 
else -+ who = vpid_to_pid(who); -+ do_each_task_pid_ve(who, PIDTYPE_PGID, p) { - niceval = 20 - task_nice(p); - if (niceval > retval) - retval = niceval; -- } -+ } while_each_task_pid_ve(who, PIDTYPE_PGID, p); - break; - case PRIO_USER: - if (!who) -@@ -409,13 +508,13 @@ asmlinkage long sys_getpriority(int whic - if (!user) - goto out_unlock; - -- do_each_thread(g, p) -+ do_each_thread_ve(g, p) { - if (p->uid == who) { - niceval = 20 - task_nice(p); - if (niceval > retval) - retval = niceval; - } -- while_each_thread(g, p); -+ } while_each_thread_ve(g, p); - if (who) - free_uid(user); /* for find_user() */ - break; -@@ -451,6 +550,35 @@ asmlinkage long sys_reboot(int magic1, i - magic2 != LINUX_REBOOT_MAGIC2C)) - return -EINVAL; - -+#ifdef CONFIG_VE -+ if (!ve_is_super(get_exec_env())) -+ switch (cmd) { -+ case LINUX_REBOOT_CMD_RESTART: -+ case LINUX_REBOOT_CMD_HALT: -+ case LINUX_REBOOT_CMD_POWER_OFF: -+ case LINUX_REBOOT_CMD_RESTART2: { -+ struct siginfo info; -+ -+ info.si_errno = 0; -+ info.si_code = SI_KERNEL; -+ info.si_pid = virt_pid(current); -+ info.si_uid = current->uid; -+ info.si_signo = SIGKILL; -+ -+ /* Sending to real init is safe */ -+ send_sig_info(SIGKILL, &info, -+ get_exec_env()->init_entry); -+ } -+ -+ case LINUX_REBOOT_CMD_CAD_ON: -+ case LINUX_REBOOT_CMD_CAD_OFF: -+ return 0; -+ -+ default: -+ return -EINVAL; -+ } -+#endif -+ - lock_kernel(); - switch (cmd) { - case LINUX_REBOOT_CMD_RESTART: -@@ -641,7 +769,7 @@ asmlinkage long sys_setgid(gid_t gid) - return 0; - } - --static int set_user(uid_t new_ruid, int dumpclear) -+int set_user(uid_t new_ruid, int dumpclear) - { - struct user_struct *new_user; - -@@ -666,6 +794,7 @@ static int set_user(uid_t new_ruid, int - current->uid = new_ruid; - return 0; - } -+EXPORT_SYMBOL(set_user); - - /* - * Unprivileged users may change the real uid to the effective uid -@@ -954,7 +1083,12 @@ asmlinkage long sys_times(struct tms __u - if (copy_to_user(tbuf, &tmp, sizeof(struct tms))) - return -EFAULT; - } 
-+#ifndef CONFIG_VE - return (long) jiffies_64_to_clock_t(get_jiffies_64()); -+#else -+ return (long) jiffies_64_to_clock_t(get_jiffies_64() - -+ get_exec_env()->init_entry->start_time); -+#endif - } - - /* -@@ -974,21 +1108,24 @@ asmlinkage long sys_setpgid(pid_t pid, p - { - struct task_struct *p; - int err = -EINVAL; -+ pid_t _pgid; - - if (!pid) -- pid = current->pid; -+ pid = virt_pid(current); - if (!pgid) - pgid = pid; - if (pgid < 0) - return -EINVAL; - -+ _pgid = vpid_to_pid(pgid); -+ - /* From this point forward we keep holding onto the tasklist lock - * so that our parent does not change from under us. -DaveM - */ - write_lock_irq(&tasklist_lock); - - err = -ESRCH; -- p = find_task_by_pid(pid); -+ p = find_task_by_pid_ve(pid); - if (!p) - goto out; - -@@ -1013,26 +1150,35 @@ asmlinkage long sys_setpgid(pid_t pid, p - if (p->signal->leader) - goto out; - -- if (pgid != pid) { -+ pgid = virt_pid(p); -+ if (_pgid != p->pid) { - struct task_struct *p; -- struct pid *pid; -- struct list_head *l; - -- for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid) -- if (p->signal->session == current->signal->session) -+ do_each_task_pid_ve(_pgid, PIDTYPE_PGID, p) { -+ if (p->signal->session == current->signal->session) { -+ pgid = virt_pgid(p); - goto ok_pgid; -+ } -+ } while_each_task_pid_ve(_pgid, PIDTYPE_PGID, p); - goto out; - } - - ok_pgid: -- err = security_task_setpgid(p, pgid); -+ err = security_task_setpgid(p, _pgid); - if (err) - goto out; - -- if (process_group(p) != pgid) { -+ if (process_group(p) != _pgid) { - detach_pid(p, PIDTYPE_PGID); -- p->signal->pgrp = pgid; -- attach_pid(p, PIDTYPE_PGID, pgid); -+ p->signal->pgrp = _pgid; -+ set_virt_pgid(p, pgid); -+ attach_pid(p, PIDTYPE_PGID, _pgid); -+ if (atomic_read(&p->signal->count) != 1) { -+ task_t *t; -+ for (t = next_thread(p); t != p; t = next_thread(t)) { -+ set_virt_pgid(t, pgid); -+ } -+ } - } - - err = 0; -@@ -1045,19 +1191,19 @@ out: - asmlinkage long sys_getpgid(pid_t pid) - { - if (!pid) { -- return 
process_group(current);
-+ return virt_pgid(current);
- } else {
- int retval;
- struct task_struct *p;
-
- read_lock(&tasklist_lock);
-- p = find_task_by_pid(pid);
-+ p = find_task_by_pid_ve(pid);
-
- retval = -ESRCH;
- if (p) {
- retval = security_task_getpgid(p);
- if (!retval)
-- retval = process_group(p);
-+ retval = virt_pgid(p);
- }
- read_unlock(&tasklist_lock);
- return retval;
-@@ -1069,7 +1215,7 @@ asmlinkage long sys_getpgid(pid_t pid)
- asmlinkage long sys_getpgrp(void)
- {
- /* SMP - assuming writes are word atomic this is fine */
-- return process_group(current);
-+ return virt_pgid(current);
- }
-
- #endif
-@@ -1077,19 +1223,19 @@ asmlinkage long sys_getpgrp(void)
- asmlinkage long sys_getsid(pid_t pid)
- {
- if (!pid) {
-- return current->signal->session;
-+ return virt_sid(current);
- } else {
- int retval;
- struct task_struct *p;
-
- read_lock(&tasklist_lock);
-- p = find_task_by_pid(pid);
-+ p = find_task_by_pid_ve(pid);
-
- retval = -ESRCH;
- if(p) {
- retval = security_task_getsid(p);
- if (!retval)
-- retval = p->signal->session;
-+ retval = virt_sid(p);
- }
- read_unlock(&tasklist_lock);
- return retval;
-@@ -1104,6 +1250,7 @@ asmlinkage long sys_setsid(void)
- if (!thread_group_leader(current))
- return -EINVAL;
-
-+ down(&tty_sem);
- write_lock_irq(&tasklist_lock);
-
- pid = find_pid(PIDTYPE_PGID, current->pid);
-@@ -1112,11 +1259,22 @@ asmlinkage long sys_setsid(void)
-
- current->signal->leader = 1;
- __set_special_pids(current->pid, current->pid);
-+ set_virt_pgid(current, virt_pid(current));
-+ set_virt_sid(current, virt_pid(current));
- current->signal->tty = NULL;
- current->signal->tty_old_pgrp = 0;
-- err = process_group(current);
-+ if (atomic_read(&current->signal->count) != 1) {
-+ task_t *t;
-+ for (t = next_thread(current); t != current; t = next_thread(t)) {
-+ set_virt_pgid(t, virt_pid(current));
-+ set_virt_sid(t, virt_pid(current));
-+ }
-+ }
-+
-+ err = virt_pgid(current);
- out:
- write_unlock_irq(&tasklist_lock);
-+
up(&tty_sem); - return err; - } - -@@ -1393,7 +1551,7 @@ asmlinkage long sys_newuname(struct new_ - int errno = 0; - - down_read(&uts_sem); -- if (copy_to_user(name,&system_utsname,sizeof *name)) -+ if (copy_to_user(name,&ve_utsname,sizeof *name)) - errno = -EFAULT; - up_read(&uts_sem); - return errno; -@@ -1404,15 +1562,15 @@ asmlinkage long sys_sethostname(char __u - int errno; - char tmp[__NEW_UTS_LEN]; - -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - if (len < 0 || len > __NEW_UTS_LEN) - return -EINVAL; - down_write(&uts_sem); - errno = -EFAULT; - if (!copy_from_user(tmp, name, len)) { -- memcpy(system_utsname.nodename, tmp, len); -- system_utsname.nodename[len] = 0; -+ memcpy(ve_utsname.nodename, tmp, len); -+ ve_utsname.nodename[len] = 0; - errno = 0; - } - up_write(&uts_sem); -@@ -1428,11 +1586,11 @@ asmlinkage long sys_gethostname(char __u - if (len < 0) - return -EINVAL; - down_read(&uts_sem); -- i = 1 + strlen(system_utsname.nodename); -+ i = 1 + strlen(ve_utsname.nodename); - if (i > len) - i = len; - errno = 0; -- if (copy_to_user(name, system_utsname.nodename, i)) -+ if (copy_to_user(name, ve_utsname.nodename, i)) - errno = -EFAULT; - up_read(&uts_sem); - return errno; -@@ -1449,7 +1607,7 @@ asmlinkage long sys_setdomainname(char _ - int errno; - char tmp[__NEW_UTS_LEN]; - -- if (!capable(CAP_SYS_ADMIN)) -+ if (!capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - if (len < 0 || len > __NEW_UTS_LEN) - return -EINVAL; -@@ -1457,8 +1615,8 @@ asmlinkage long sys_setdomainname(char _ - down_write(&uts_sem); - errno = -EFAULT; - if (!copy_from_user(tmp, name, len)) { -- memcpy(system_utsname.domainname, tmp, len); -- system_utsname.domainname[len] = 0; -+ memcpy(ve_utsname.domainname, tmp, len); -+ ve_utsname.domainname[len] = 0; - errno = 0; - } - up_write(&uts_sem); -diff -uprN linux-2.6.8.1.orig/kernel/sysctl.c linux-2.6.8.1-ve022stab078/kernel/sysctl.c ---- linux-2.6.8.1.orig/kernel/sysctl.c 2004-08-14 14:54:49.000000000 
+0400 -+++ linux-2.6.8.1-ve022stab078/kernel/sysctl.c 2006-05-11 13:05:49.000000000 +0400 -@@ -25,6 +25,8 @@ - #include <linux/slab.h> - #include <linux/sysctl.h> - #include <linux/proc_fs.h> -+#include <linux/ve_owner.h> -+#include <linux/ve.h> - #include <linux/ctype.h> - #include <linux/utsname.h> - #include <linux/capability.h> -@@ -57,6 +59,7 @@ extern int sysctl_overcommit_ratio; - extern int max_threads; - extern int sysrq_enabled; - extern int core_uses_pid; -+extern int sysctl_at_vsyscall; - extern char core_pattern[]; - extern int cad_pid; - extern int pid_max; -@@ -64,6 +67,10 @@ extern int sysctl_lower_zone_protection; - extern int min_free_kbytes; - extern int printk_ratelimit_jiffies; - extern int printk_ratelimit_burst; -+#ifdef CONFIG_VE -+int glob_virt_pids = 1; -+EXPORT_SYMBOL(glob_virt_pids); -+#endif - - /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ - static int maxolduid = 65535; -@@ -89,6 +96,10 @@ extern int msg_ctlmnb; - extern int msg_ctlmni; - extern int sem_ctls[]; - #endif -+#ifdef CONFIG_SCHED_VCPU -+extern u32 vcpu_sched_timeslice; -+extern u32 vcpu_timeslice; -+#endif - - #ifdef __sparc__ - extern char reboot_command []; -@@ -109,6 +120,7 @@ extern int sysctl_userprocess_debug; - #endif - - extern int sysctl_hz_timer; -+int decode_call_traces = 1; - - #if defined(CONFIG_PPC32) && defined(CONFIG_6xx) - extern unsigned long powersave_nap; -@@ -120,10 +132,14 @@ int proc_dol2crvec(ctl_table *table, int - extern int acct_parm[]; - #endif - -+#ifdef CONFIG_FAIRSCHED -+extern int fairsched_max_latency; -+int fsch_sysctl_latency(ctl_table *ctl, int write, struct file *filp, -+ void __user *buffer, size_t *lenp, loff_t *ppos); -+#endif -+ - static int parse_table(int __user *, int, void __user *, size_t __user *, void __user *, size_t, - ctl_table *, void **); --static int proc_doutsstring(ctl_table *table, int write, struct file *filp, -- void __user *buffer, size_t *lenp, loff_t *ppos); - - static ctl_table 
root_table[]; - static struct ctl_table_header root_table_header = -@@ -143,6 +159,8 @@ extern ctl_table random_table[]; - extern ctl_table pty_table[]; - #endif - -+extern int ve_area_access_check; /* fs/namei.c */ -+ - /* /proc declarations: */ - - #ifdef CONFIG_PROC_FS -@@ -159,8 +177,10 @@ struct file_operations proc_sys_file_ope - - extern struct proc_dir_entry *proc_sys_root; - --static void register_proc_table(ctl_table *, struct proc_dir_entry *); -+static void register_proc_table(ctl_table *, struct proc_dir_entry *, void *); - static void unregister_proc_table(ctl_table *, struct proc_dir_entry *); -+ -+extern struct new_utsname virt_utsname; - #endif - - /* The default sysctl tables: */ -@@ -260,6 +280,15 @@ static ctl_table kern_table[] = { - .strategy = &sysctl_string, - }, - { -+ .ctl_name = KERN_VIRT_OSRELEASE, -+ .procname = "virt_osrelease", -+ .data = virt_utsname.release, -+ .maxlen = sizeof(virt_utsname.release), -+ .mode = 0644, -+ .proc_handler = &proc_doutsstring, -+ .strategy = &sysctl_string, -+ }, -+ { - .ctl_name = KERN_PANIC, - .procname = "panic", - .data = &panic_timeout, -@@ -579,6 +608,24 @@ static ctl_table kern_table[] = { - .proc_handler = &proc_dointvec, - }, - #endif -+#ifdef CONFIG_SCHED_VCPU -+ { -+ .ctl_name = KERN_VCPU_SCHED_TIMESLICE, -+ .procname = "vcpu_sched_timeslice", -+ .data = &vcpu_sched_timeslice, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+ { -+ .ctl_name = KERN_VCPU_TIMESLICE, -+ .procname = "vcpu_timeslice", -+ .data = &vcpu_timeslice, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+#endif - { - .ctl_name = KERN_PIDMAX, - .procname = "pid_max", -@@ -587,6 +634,16 @@ static ctl_table kern_table[] = { - .mode = 0644, - .proc_handler = &proc_dointvec, - }, -+#ifdef CONFIG_VE -+ { -+ .ctl_name = KERN_VIRT_PIDS, -+ .procname = "virt_pids", -+ .data = &glob_virt_pids, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = 
&proc_dointvec, -+ }, -+#endif - { - .ctl_name = KERN_PANIC_ON_OOPS, - .procname = "panic_on_oops", -@@ -620,6 +677,32 @@ static ctl_table kern_table[] = { - .mode = 0444, - .proc_handler = &proc_dointvec, - }, -+ { -+ .ctl_name = KERN_SILENCE_LEVEL, -+ .procname = "silence-level", -+ .data = &console_silence_loglevel, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec -+ }, -+ { -+ .ctl_name = KERN_ALLOC_FAIL_WARN, -+ .procname = "alloc_fail_warn", -+ .data = &alloc_fail_warn, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec -+ }, -+#ifdef CONFIG_FAIRSCHED -+ { -+ .ctl_name = KERN_FAIRSCHED_MAX_LATENCY, -+ .procname = "fairsched-max-latency", -+ .data = &fairsched_max_latency, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &fsch_sysctl_latency -+ }, -+#endif - { .ctl_name = 0 } - }; - -@@ -899,10 +982,26 @@ static ctl_table fs_table[] = { - .mode = 0644, - .proc_handler = &proc_dointvec, - }, -+ { -+ .ctl_name = FS_AT_VSYSCALL, -+ .procname = "vsyscall", -+ .data = &sysctl_at_vsyscall, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec -+ }, - { .ctl_name = 0 } - }; - - static ctl_table debug_table[] = { -+ { -+ .ctl_name = DBG_DECODE_CALLTRACES, -+ .procname = "decode_call_traces", -+ .data = &decode_call_traces, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec -+ }, - { .ctl_name = 0 } - }; - -@@ -912,10 +1011,51 @@ static ctl_table dev_table[] = { - - extern void init_irq_proc (void); - -+static spinlock_t sysctl_lock = SPIN_LOCK_UNLOCKED; -+ -+/* called under sysctl_lock */ -+static int use_table(struct ctl_table_header *p) -+{ -+ if (unlikely(p->unregistering)) -+ return 0; -+ p->used++; -+ return 1; -+} -+ -+/* called under sysctl_lock */ -+static void unuse_table(struct ctl_table_header *p) -+{ -+ if (!--p->used) -+ if (unlikely(p->unregistering)) -+ complete(p->unregistering); -+} -+ -+/* called under sysctl_lock, will reacquire if has 
to wait */ -+static void start_unregistering(struct ctl_table_header *p) -+{ -+ /* -+ * if p->used is 0, nobody will ever touch that entry again; -+ * we'll eliminate all paths to it before dropping sysctl_lock -+ */ -+ if (unlikely(p->used)) { -+ struct completion wait; -+ init_completion(&wait); -+ p->unregistering = &wait; -+ spin_unlock(&sysctl_lock); -+ wait_for_completion(&wait); -+ spin_lock(&sysctl_lock); -+ } -+ /* -+ * do not remove from the list until nobody holds it; walking the -+ * list in do_sysctl() relies on that. -+ */ -+ list_del_init(&p->ctl_entry); -+} -+ - void __init sysctl_init(void) - { - #ifdef CONFIG_PROC_FS -- register_proc_table(root_table, proc_sys_root); -+ register_proc_table(root_table, proc_sys_root, &root_table_header); - init_irq_proc(); - #endif - } -@@ -924,6 +1064,8 @@ int do_sysctl(int __user *name, int nlen - void __user *newval, size_t newlen) - { - struct list_head *tmp; -+ int error = -ENOTDIR; -+ struct ve_struct *ve; - - if (nlen <= 0 || nlen >= CTL_MAXNAME) - return -ENOTDIR; -@@ -932,21 +1074,35 @@ int do_sysctl(int __user *name, int nlen - if (!oldlenp || get_user(old_len, oldlenp)) - return -EFAULT; - } -- tmp = &root_table_header.ctl_entry; -+ ve = get_exec_env(); -+ spin_lock(&sysctl_lock); -+ tmp = ve->sysctl_lh.next; - do { -- struct ctl_table_header *head = -- list_entry(tmp, struct ctl_table_header, ctl_entry); -+ struct ctl_table_header *head; - void *context = NULL; -- int error = parse_table(name, nlen, oldval, oldlenp, -+ -+ if (tmp == &ve->sysctl_lh) -+ /* second pass over global variables */ -+ tmp = &root_table_header.ctl_entry; -+ -+ head = list_entry(tmp, struct ctl_table_header, ctl_entry); -+ if (!use_table(head)) -+ continue; -+ -+ spin_unlock(&sysctl_lock); -+ -+ error = parse_table(name, nlen, oldval, oldlenp, - newval, newlen, head->ctl_table, - &context); -- if (context) -- kfree(context); -+ kfree(context); -+ -+ spin_lock(&sysctl_lock); -+ unuse_table(head); - if (error != -ENOTDIR) -- return 
error; -- tmp = tmp->next; -- } while (tmp != &root_table_header.ctl_entry); -- return -ENOTDIR; -+ break; -+ } while ((tmp = tmp->next) != &root_table_header.ctl_entry); -+ spin_unlock(&sysctl_lock); -+ return error; - } - - asmlinkage long sys_sysctl(struct __sysctl_args __user *args) -@@ -983,10 +1139,14 @@ static int test_perm(int mode, int op) - static inline int ctl_perm(ctl_table *table, int op) - { - int error; -+ int mode = table->mode; -+ - error = security_sysctl(table, op); - if (error) - return error; -- return test_perm(table->mode, op); -+ if (!ve_accessible(table->owner_env, get_exec_env())) -+ mode &= ~0222; /* disable write access */ -+ return test_perm(mode, op); - } - - static int parse_table(int __user *name, int nlen, -@@ -1152,21 +1312,62 @@ struct ctl_table_header *register_sysctl - int insert_at_head) - { - struct ctl_table_header *tmp; -+ struct list_head *lh; -+ - tmp = kmalloc(sizeof(struct ctl_table_header), GFP_KERNEL); - if (!tmp) - return NULL; - tmp->ctl_table = table; - INIT_LIST_HEAD(&tmp->ctl_entry); -+ tmp->used = 0; -+ tmp->unregistering = NULL; -+ spin_lock(&sysctl_lock); -+#ifdef CONFIG_VE -+ lh = &get_exec_env()->sysctl_lh; -+#else -+ lh = &root_table_header.ctl_entry; -+#endif - if (insert_at_head) -- list_add(&tmp->ctl_entry, &root_table_header.ctl_entry); -+ list_add(&tmp->ctl_entry, lh); - else -- list_add_tail(&tmp->ctl_entry, &root_table_header.ctl_entry); -+ list_add_tail(&tmp->ctl_entry, lh); -+ spin_unlock(&sysctl_lock); - #ifdef CONFIG_PROC_FS -- register_proc_table(table, proc_sys_root); -+#ifdef CONFIG_VE -+ register_proc_table(table, get_exec_env()->proc_sys_root, tmp); -+#else -+ register_proc_table(table, proc_sys_root, tmp); -+#endif - #endif - return tmp; - } - -+void free_sysctl_clone(ctl_table *clone) -+{ -+ kfree(clone); -+} -+ -+ctl_table *clone_sysctl_template(ctl_table *tmpl, int nr) -+{ -+ int i; -+ ctl_table *clone; -+ -+ clone = kmalloc(nr * sizeof(ctl_table), GFP_KERNEL); -+ if (clone == NULL) -+ 
return NULL; -+ -+ memcpy(clone, tmpl, nr * sizeof(ctl_table)); -+ for (i = 0; i < nr; i++) { -+ if (tmpl[i].ctl_name == 0) -+ continue; -+ clone[i].owner_env = get_exec_env(); -+ if (tmpl[i].child == NULL) -+ continue; -+ clone[i].child = clone + (tmpl[i].child - tmpl); -+ } -+ return clone; -+} -+ - /** - * unregister_sysctl_table - unregister a sysctl table hierarchy - * @header: the header returned from register_sysctl_table -@@ -1176,10 +1377,17 @@ struct ctl_table_header *register_sysctl - */ - void unregister_sysctl_table(struct ctl_table_header * header) - { -- list_del(&header->ctl_entry); -+ might_sleep(); -+ spin_lock(&sysctl_lock); -+ start_unregistering(header); - #ifdef CONFIG_PROC_FS -+#ifdef CONFIG_VE -+ unregister_proc_table(header->ctl_table, get_exec_env()->proc_sys_root); -+#else - unregister_proc_table(header->ctl_table, proc_sys_root); - #endif -+#endif -+ spin_unlock(&sysctl_lock); - kfree(header); - } - -@@ -1190,7 +1398,7 @@ void unregister_sysctl_table(struct ctl_ - #ifdef CONFIG_PROC_FS - - /* Scan the sysctl entries in table and add them all into /proc */ --static void register_proc_table(ctl_table * table, struct proc_dir_entry *root) -+static void register_proc_table(ctl_table * table, struct proc_dir_entry *root, void *set) - { - struct proc_dir_entry *de; - int len; -@@ -1226,13 +1434,14 @@ static void register_proc_table(ctl_tabl - de = create_proc_entry(table->procname, mode, root); - if (!de) - continue; -+ de->set = set; - de->data = (void *) table; - if (table->proc_handler) - de->proc_fops = &proc_sys_file_operations; - } - table->de = de; - if (de->mode & S_IFDIR) -- register_proc_table(table->child, de); -+ register_proc_table(table->child, de, set); - } - } - -@@ -1257,12 +1466,15 @@ static void unregister_proc_table(ctl_ta - continue; - } - -- /* Don't unregister proc entries that are still being used.. 
*/ -- if (atomic_read(&de->count)) -- continue; -- -+ de->data = NULL; - table->de = NULL; -+ /* -+ * sys_sysctl can't find us, since we are removed from list. -+ * proc won't touch either, since de->data is NULL. -+ */ -+ spin_unlock(&sysctl_lock); - remove_proc_entry(table->procname, root); -+ spin_lock(&sysctl_lock); - } - } - -@@ -1270,27 +1482,38 @@ static ssize_t do_rw_proc(int write, str - size_t count, loff_t *ppos) - { - int op; -- struct proc_dir_entry *de; -+ struct proc_dir_entry *de = PDE(file->f_dentry->d_inode); - struct ctl_table *table; - size_t res; -- ssize_t error; -- -- de = PDE(file->f_dentry->d_inode); -- if (!de || !de->data) -- return -ENOTDIR; -- table = (struct ctl_table *) de->data; -- if (!table || !table->proc_handler) -- return -ENOTDIR; -- op = (write ? 002 : 004); -- if (ctl_perm(table, op)) -- return -EPERM; -+ ssize_t error = -ENOTDIR; - -- res = count; -- -- error = (*table->proc_handler) (table, write, file, buf, &res, ppos); -- if (error) -- return error; -- return res; -+ spin_lock(&sysctl_lock); -+ if (de && de->data && use_table(de->set)) { -+ /* -+ * at that point we know that sysctl was not unregistered -+ * and won't be until we finish -+ */ -+ spin_unlock(&sysctl_lock); -+ table = (struct ctl_table *) de->data; -+ if (!table || !table->proc_handler) -+ goto out; -+ error = -EPERM; -+ op = (write ? 002 : 004); -+ if (ctl_perm(table, op)) -+ goto out; -+ -+ /* careful: calling conventions are nasty here */ -+ res = count; -+ error = (*table->proc_handler)(table, write, file, -+ buf, &res, ppos); -+ if (!error) -+ error = res; -+ out: -+ spin_lock(&sysctl_lock); -+ unuse_table(de->set); -+ } -+ spin_unlock(&sysctl_lock); -+ return error; - } - - static int proc_opensys(struct inode *inode, struct file *file) -@@ -1390,7 +1613,7 @@ int proc_dostring(ctl_table *table, int - * to observe. Should this be in kernel/sys.c ???? 
- */ - --static int proc_doutsstring(ctl_table *table, int write, struct file *filp, -+int proc_doutsstring(ctl_table *table, int write, struct file *filp, - void __user *buffer, size_t *lenp, loff_t *ppos) - { - int r; -@@ -1914,7 +2137,7 @@ int proc_dostring(ctl_table *table, int - return -ENOSYS; - } - --static int proc_doutsstring(ctl_table *table, int write, struct file *filp, -+int proc_doutsstring(ctl_table *table, int write, struct file *filp, - void __user *buffer, size_t *lenp, loff_t *ppos) - { - return -ENOSYS; -@@ -1967,7 +2190,6 @@ int proc_doulongvec_ms_jiffies_minmax(ct - - #endif /* CONFIG_PROC_FS */ - -- - /* - * General sysctl support routines - */ -@@ -2169,6 +2391,14 @@ void unregister_sysctl_table(struct ctl_ - { - } - -+ctl_table * clone_sysctl_template(ctl_table *tmpl, int nr) -+{ -+ return NULL; -+} -+ -+void free_sysctl_clone(ctl_table *tmpl) -+{ -+} - #endif /* CONFIG_SYSCTL */ - - /* -@@ -2180,9 +2410,12 @@ EXPORT_SYMBOL(proc_dointvec_jiffies); - EXPORT_SYMBOL(proc_dointvec_minmax); - EXPORT_SYMBOL(proc_dointvec_userhz_jiffies); - EXPORT_SYMBOL(proc_dostring); -+EXPORT_SYMBOL(proc_doutsstring); - EXPORT_SYMBOL(proc_doulongvec_minmax); - EXPORT_SYMBOL(proc_doulongvec_ms_jiffies_minmax); - EXPORT_SYMBOL(register_sysctl_table); -+EXPORT_SYMBOL(clone_sysctl_template); -+EXPORT_SYMBOL(free_sysctl_clone); - EXPORT_SYMBOL(sysctl_intvec); - EXPORT_SYMBOL(sysctl_jiffies); - EXPORT_SYMBOL(sysctl_string); -diff -uprN linux-2.6.8.1.orig/kernel/time.c linux-2.6.8.1-ve022stab078/kernel/time.c ---- linux-2.6.8.1.orig/kernel/time.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/time.c 2006-05-11 13:05:32.000000000 +0400 -@@ -30,6 +30,7 @@ - #include <linux/smp_lock.h> - #include <asm/uaccess.h> - #include <asm/unistd.h> -+#include <linux/fs.h> - - /* - * The timezone where the local system is located. 
Used as a default by some -@@ -421,6 +422,50 @@ struct timespec current_kernel_time(void - - EXPORT_SYMBOL(current_kernel_time); - -+/** -+ * current_fs_time - Return FS time -+ * @sb: Superblock. -+ * -+ * Return the current time truncated to the time granuality supported by -+ * the fs. -+ */ -+struct timespec current_fs_time(struct super_block *sb) -+{ -+ struct timespec now = current_kernel_time(); -+ return timespec_trunc(now, get_sb_time_gran(sb)); -+} -+EXPORT_SYMBOL(current_fs_time); -+ -+/** -+ * timespec_trunc - Truncate timespec to a granuality -+ * @t: Timespec -+ * @gran: Granuality in ns. -+ * -+ * Truncate a timespec to a granuality. gran must be smaller than a second. -+ * Always rounds down. -+ * -+ * This function should be only used for timestamps returned by -+ * current_kernel_time() or CURRENT_TIME, not with do_gettimeofday() because -+ * it doesn't handle the better resolution of the later. -+ */ -+struct timespec timespec_trunc(struct timespec t, unsigned gran) -+{ -+ /* -+ * Division is pretty slow so avoid it for common cases. -+ * Currently current_kernel_time() never returns better than -+ * jiffies resolution. Exploit that. 
-+ */ -+ if (gran <= jiffies_to_usecs(1) * 1000) { -+ /* nothing */ -+ } else if (gran == 1000000000) { -+ t.tv_nsec = 0; -+ } else { -+ t.tv_nsec -= t.tv_nsec % gran; -+ } -+ return t; -+} -+EXPORT_SYMBOL(timespec_trunc); -+ - #if (BITS_PER_LONG < 64) - u64 get_jiffies_64(void) - { -diff -uprN linux-2.6.8.1.orig/kernel/timer.c linux-2.6.8.1-ve022stab078/kernel/timer.c ---- linux-2.6.8.1.orig/kernel/timer.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/timer.c 2006-05-11 13:05:49.000000000 +0400 -@@ -31,6 +31,7 @@ - #include <linux/time.h> - #include <linux/jiffies.h> - #include <linux/cpu.h> -+#include <linux/virtinfo.h> - - #include <asm/uaccess.h> - #include <asm/unistd.h> -@@ -299,6 +300,10 @@ repeat: - goto repeat; - } - list_del(&timer->entry); -+ smp_wmb(); /* the list del must have taken effect before timer->base -+ * change is visible to other CPUs, or a concurrent mod_timer -+ * would cause a race with list_add -+ */ - timer->base = NULL; - spin_unlock_irqrestore(&base->lock, flags); - -@@ -444,6 +449,7 @@ repeat: - if (!list_empty(head)) { - void (*fn)(unsigned long); - unsigned long data; -+ struct ve_struct *envid; - - timer = list_entry(head->next,struct timer_list,entry); - fn = timer->function; -@@ -451,11 +457,16 @@ repeat: - - list_del(&timer->entry); - set_running_timer(base, timer); -- smp_wmb(); -+ smp_wmb(); /* the list del must have taken effect before timer->base -+ * change is visible to other CPUs, or a concurrent mod_timer -+ * would cause a race with list_add -+ */ - timer->base = NULL; -+ envid = set_exec_env(get_ve0()); - spin_unlock_irq(&base->lock); - fn(data); - spin_lock_irq(&base->lock); -+ (void)set_exec_env(envid); - goto repeat; - } - } -@@ -776,13 +787,12 @@ static void update_wall_time(unsigned lo - do { - ticks--; - update_wall_time_one_tick(); -+ if (xtime.tv_nsec >= 1000000000) { -+ xtime.tv_nsec -= 1000000000; -+ xtime.tv_sec++; -+ second_overflow(); -+ } - } while (ticks); -- -- if 
(xtime.tv_nsec >= 1000000000) { -- xtime.tv_nsec -= 1000000000; -- xtime.tv_sec++; -- second_overflow(); -- } - } - - static inline void do_process_times(struct task_struct *p, -@@ -869,6 +879,22 @@ static unsigned long count_active_tasks( - */ - unsigned long avenrun[3]; - -+static void calc_load_ve(void) -+{ -+ unsigned long flags, nr_unint; -+ -+ nr_unint = nr_uninterruptible() * FIXED_1; -+ spin_lock_irqsave(&kstat_glb_lock, flags); -+ CALC_LOAD(kstat_glob.nr_unint_avg[0], EXP_1, nr_unint); -+ CALC_LOAD(kstat_glob.nr_unint_avg[1], EXP_5, nr_unint); -+ CALC_LOAD(kstat_glob.nr_unint_avg[2], EXP_15, nr_unint); -+ spin_unlock_irqrestore(&kstat_glb_lock, flags); -+ -+#ifdef CONFIG_VE -+ do_update_load_avg_ve(); -+#endif -+} -+ - /* - * calc_load - given tick count, update the avenrun load estimates. - * This is called while holding a write_lock on xtime_lock. -@@ -885,6 +911,7 @@ static inline void calc_load(unsigned lo - CALC_LOAD(avenrun[0], EXP_1, active_tasks); - CALC_LOAD(avenrun[1], EXP_5, active_tasks); - CALC_LOAD(avenrun[2], EXP_15, active_tasks); -+ calc_load_ve(); - } - } - -@@ -996,7 +1023,7 @@ asmlinkage unsigned long sys_alarm(unsig - */ - asmlinkage long sys_getpid(void) - { -- return current->tgid; -+ return virt_tgid(current); - } - - /* -@@ -1018,28 +1045,15 @@ asmlinkage long sys_getpid(void) - asmlinkage long sys_getppid(void) - { - int pid; -- struct task_struct *me = current; -- struct task_struct *parent; - -- parent = me->group_leader->real_parent; -- for (;;) { -- pid = parent->tgid; --#ifdef CONFIG_SMP --{ -- struct task_struct *old = parent; -- -- /* -- * Make sure we read the pid before re-reading the -- * parent pointer: -- */ -- rmb(); -- parent = me->group_leader->real_parent; -- if (old != parent) -- continue; --} --#endif -- break; -- } -+ /* Some smart code used to be here. It was wrong. -+ * ->real_parent could be released before dereference and -+ * we accessed freed kernel memory, which faults with debugging on. 
-+ * Keep it simple and stupid. -+ */ -+ read_lock(&tasklist_lock); -+ pid = virt_tgid(current->group_leader->real_parent); -+ read_unlock(&tasklist_lock); - return pid; - } - -@@ -1157,7 +1171,7 @@ EXPORT_SYMBOL(schedule_timeout); - /* Thread ID - the internal kernel "pid" */ - asmlinkage long sys_gettid(void) - { -- return current->pid; -+ return virt_pid(current); - } - - static long __sched nanosleep_restart(struct restart_block *restart) -@@ -1227,11 +1241,12 @@ asmlinkage long sys_sysinfo(struct sysin - unsigned long mem_total, sav_total; - unsigned int mem_unit, bitcount; - unsigned long seq; -+ unsigned long *__avenrun; -+ struct timespec tp; - - memset((char *)&val, 0, sizeof(struct sysinfo)); - - do { -- struct timespec tp; - seq = read_seqbegin(&xtime_lock); - - /* -@@ -1249,18 +1264,34 @@ asmlinkage long sys_sysinfo(struct sysin - tp.tv_nsec = tp.tv_nsec - NSEC_PER_SEC; - tp.tv_sec++; - } -- val.uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0); -- -- val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT); -- val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT); -- val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT); -+ } while (read_seqretry(&xtime_lock, seq)); - -+ if (ve_is_super(get_exec_env())) { -+ val.uptime = tp.tv_sec + (tp.tv_nsec ? 
1 : 0); -+ __avenrun = &avenrun[0]; - val.procs = nr_threads; -- } while (read_seqretry(&xtime_lock, seq)); -+ } -+#ifdef CONFIG_VE -+ else { -+ struct ve_struct *ve; -+ ve = get_exec_env(); -+ __avenrun = &ve->avenrun[0]; -+ val.procs = atomic_read(&ve->pcounter); -+ val.uptime = tp.tv_sec - ve->start_timespec.tv_sec; -+ } -+#endif -+ val.loads[0] = __avenrun[0] << (SI_LOAD_SHIFT - FSHIFT); -+ val.loads[1] = __avenrun[1] << (SI_LOAD_SHIFT - FSHIFT); -+ val.loads[2] = __avenrun[2] << (SI_LOAD_SHIFT - FSHIFT); - - si_meminfo(&val); - si_swapinfo(&val); - -+#ifdef CONFIG_USER_RESOURCE -+ if (virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_SYSINFO, &val) -+ & NOTIFY_FAIL) -+ return -ENOMSG; -+#endif - /* - * If the sum of all the available memory (i.e. ram + swap) - * is less than can be stored in a 32 bit unsigned long then -diff -uprN linux-2.6.8.1.orig/kernel/ub/Kconfig linux-2.6.8.1-ve022stab078/kernel/ub/Kconfig ---- linux-2.6.8.1.orig/kernel/ub/Kconfig 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/Kconfig 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,89 @@ -+# -+# User resources part (UBC) -+# -+# Copyright (C) 2005 SWsoft -+# All rights reserved. -+# -+# Licensing governed by "linux/COPYING.SWsoft" file. -+ -+menu "User resources" -+ -+config USER_RESOURCE -+ bool "Enable user resource accounting" -+ default y -+ help -+ This patch provides accounting and allows to configure -+ limits for user's consumption of exhaustible system resources. -+ The most important resource controlled by this patch is unswappable -+ memory (either mlock'ed or used by internal kernel structures and -+ buffers). The main goal of this patch is to protect processes -+ from running short of important resources because of an accidental -+ misbehavior of processes or malicious activity aiming to ``kill'' -+ the system. 
It's worth to mention that resource limits configured -+ by setrlimit(2) do not give an acceptable level of protection -+ because they cover only small fraction of resources and work on a -+ per-process basis. Per-process accounting doesn't prevent malicious -+ users from spawning a lot of resource-consuming processes. -+ -+config USER_RSS_ACCOUNTING -+ bool "Account physical memory usage" -+ default y -+ depends on USER_RESOURCE -+ help -+ This allows to estimate per beancounter physical memory usage. -+ Implemented alghorithm accounts shared pages of memory as well, -+ dividing them by number of beancounter which use the page. -+ -+config USER_SWAP_ACCOUNTING -+ bool "Account swap usage" -+ default y -+ depends on USER_RESOURCE -+ help -+ This allows accounting of swap usage. -+ -+config USER_RESOURCE_PROC -+ bool "Report resource usage in /proc" -+ default y -+ depends on USER_RESOURCE -+ help -+ Allows a system administrator to inspect resource accounts and limits. -+ -+config UBC_DEBUG -+ bool "User resources debug features" -+ default n -+ depends on USER_RESOURCE -+ help -+ Enables to setup debug features for user resource accounting -+ -+config UBC_DEBUG_KMEM -+ bool "Debug kmemsize with cache counters" -+ default n -+ depends on UBC_DEBUG -+ help -+ Adds /proc/user_beancounters_debug entry to get statistics -+ about cache usage of each beancounter -+ -+config UBC_KEEP_UNUSED -+ bool "Keep unused beancounter alive" -+ default y -+ depends on UBC_DEBUG -+ help -+ If on, unused beancounters are kept on the hash and maxheld value -+ can be looked through. -+ -+config UBC_DEBUG_ITEMS -+ bool "Account resources in items rather than in bytes" -+ default y -+ depends on UBC_DEBUG -+ help -+ When true some of the resources (e.g. kmemsize) are accounted -+ in items instead of bytes. -+ -+config UBC_UNLIMITED -+ bool "Use unlimited ubc settings" -+ default y -+ depends on UBC_DEBUG -+ help -+ When ON all limits and barriers are set to max values. 
-+ -+endmenu -diff -uprN linux-2.6.8.1.orig/kernel/ub/Makefile linux-2.6.8.1-ve022stab078/kernel/ub/Makefile ---- linux-2.6.8.1.orig/kernel/ub/Makefile 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/Makefile 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,20 @@ -+# -+# User resources part (UBC) -+# -+# Copyright (C) 2005 SWsoft -+# All rights reserved. -+# -+# Licensing governed by "linux/COPYING.SWsoft" file. -+ -+obj-y := ub_sys.o -+obj-$(CONFIG_USER_RESOURCE) += beancounter.o -+obj-$(CONFIG_USER_RESOURCE) += ub_dcache.o -+obj-$(CONFIG_USER_RESOURCE) += ub_mem.o -+obj-$(CONFIG_USER_RESOURCE) += ub_misc.o -+obj-$(CONFIG_USER_RESOURCE) += ub_net.o -+obj-$(CONFIG_USER_RESOURCE) += ub_pages.o -+obj-$(CONFIG_USER_RESOURCE) += ub_stat.o -+obj-$(CONFIG_USER_RESOURCE) += ub_oom.o -+ -+obj-$(CONFIG_USER_RSS_ACCOUNTING) += ub_page_bc.o -+obj-$(CONFIG_USER_RESOURCE_PROC) += ub_proc.o -diff -uprN linux-2.6.8.1.orig/kernel/ub/beancounter.c linux-2.6.8.1-ve022stab078/kernel/ub/beancounter.c ---- linux-2.6.8.1.orig/kernel/ub/beancounter.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/beancounter.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,675 @@ -+/* -+ * linux/kernel/ub/beancounter.c -+ * -+ * Copyright (C) 1998 Alan Cox -+ * 1998-2000 Andrey V. Savochkin <saw@saw.sw.com.sg> -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * TODO: -+ * - more intelligent limit check in mremap(): currently the new size is -+ * charged and _then_ old size is uncharged -+ * (almost done: !move_vma case is completely done, -+ * move_vma in its current implementation requires too many conditions to -+ * do things right, because it may be not only expansion, but shrinking -+ * also, plus do_munmap will require an additional parameter...) 
-+ * - problem: bad pmd page handling -+ * - consider /proc redesign -+ * - TCP/UDP ports -+ * + consider whether __charge_beancounter_locked should be inline -+ * -+ * Changes: -+ * 1999/08/17 Marcelo Tosatti <marcelo@conectiva.com.br> -+ * - Set "barrier" and "limit" parts of limits atomically. -+ * 1999/10/06 Marcelo Tosatti <marcelo@conectiva.com.br> -+ * - setublimit system call. -+ */ -+ -+#include <linux/slab.h> -+#include <linux/module.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_hash.h> -+#include <ub/ub_vmpages.h> -+ -+static kmem_cache_t *ub_cachep; -+static struct user_beancounter default_beancounter; -+struct user_beancounter ub0; -+ -+const char *ub_rnames[] = { -+ "kmemsize", /* 0 */ -+ "lockedpages", -+ "privvmpages", -+ "shmpages", -+ "dummy", -+ "numproc", /* 5 */ -+ "physpages", -+ "vmguarpages", -+ "oomguarpages", -+ "numtcpsock", -+ "numflock", /* 10 */ -+ "numpty", -+ "numsiginfo", -+ "tcpsndbuf", -+ "tcprcvbuf", -+ "othersockbuf", /* 15 */ -+ "dgramrcvbuf", -+ "numothersock", -+ "dcachesize", -+ "numfile", -+ "dummy", /* 20 */ -+ "dummy", -+ "dummy", -+ "numiptent", -+ "unused_privvmpages", /* UB_RESOURCES */ -+ "tmpfs_respages", -+ "swap_pages", -+ "held_pages", -+}; -+ -+static void init_beancounter_struct(struct user_beancounter *ub); -+static void init_beancounter_store(struct user_beancounter *ub); -+static void init_beancounter_nolimits(struct user_beancounter *ub); -+ -+void print_ub_uid(struct user_beancounter *ub, char *buf, int size) -+{ -+ if (ub->parent != NULL) -+ snprintf(buf, size, "%u.%u", ub->parent->ub_uid, ub->ub_uid); -+ else -+ snprintf(buf, size, "%u", ub->ub_uid); -+} -+EXPORT_SYMBOL(print_ub_uid); -+ -+#define ub_hash_fun(x) ((((x) >> 8) ^ (x)) & (UB_HASH_SIZE - 1)) -+#define ub_subhash_fun(p, id) ub_hash_fun((p)->ub_uid + (id) * 17) -+struct ub_hash_slot ub_hash[UB_HASH_SIZE]; -+spinlock_t ub_hash_lock; -+EXPORT_SYMBOL(ub_hash); -+EXPORT_SYMBOL(ub_hash_lock); -+ -+/* -+ * Per user resource beancounting. 
Resources are tied to their luid. -+ * The resource structure itself is tagged both to the process and -+ * the charging resources (a socket doesn't want to have to search for -+ * things at irq time for example). Reference counters keep things in -+ * hand. -+ * -+ * The case where a user creates resource, kills all his processes and -+ * then starts new ones is correctly handled this way. The refcounters -+ * will mean the old entry is still around with resource tied to it. -+ */ -+struct user_beancounter *get_beancounter_byuid(uid_t uid, int create) -+{ -+ struct user_beancounter *new_ub, *ub; -+ unsigned long flags; -+ struct ub_hash_slot *slot; -+ -+ slot = &ub_hash[ub_hash_fun(uid)]; -+ new_ub = NULL; -+ -+retry: -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ ub = slot->ubh_beans; -+ while (ub != NULL && (ub->ub_uid != uid || ub->parent != NULL)) -+ ub = ub->ub_next; -+ -+ if (ub != NULL) { -+ /* found */ -+ get_beancounter(ub); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ if (new_ub != NULL) -+ kmem_cache_free(ub_cachep, new_ub); -+ return ub; -+ } -+ -+ if (!create) { -+ /* no ub found */ -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return NULL; -+ } -+ -+ if (new_ub != NULL) { -+ /* install new ub */ -+ new_ub->ub_next = slot->ubh_beans; -+ slot->ubh_beans = new_ub; -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return new_ub; -+ } -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ -+ /* alloc new ub */ -+ new_ub = (struct user_beancounter *)kmem_cache_alloc(ub_cachep, -+ GFP_KERNEL); -+ if (new_ub == NULL) -+ return NULL; -+ -+ ub_debug(UBD_ALLOC, "Creating ub %p in slot %p\n", new_ub, slot); -+ memcpy(new_ub, &default_beancounter, sizeof(*new_ub)); -+ init_beancounter_struct(new_ub); -+ new_ub->ub_uid = uid; -+ goto retry; -+} -+EXPORT_SYMBOL(get_beancounter_byuid); -+ -+struct user_beancounter *get_subbeancounter_byid(struct user_beancounter *p, -+ int id, int create) -+{ -+ struct user_beancounter *new_ub, *ub; -+ unsigned long 
flags; -+ struct ub_hash_slot *slot; -+ -+ slot = &ub_hash[ub_subhash_fun(p, id)]; -+ new_ub = NULL; -+ -+retry: -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ ub = slot->ubh_beans; -+ while (ub != NULL && (ub->parent != p || ub->ub_uid != id)) -+ ub = ub->ub_next; -+ -+ if (ub != NULL) { -+ /* found */ -+ get_beancounter(ub); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ if (new_ub != NULL) { -+ put_beancounter(new_ub->parent); -+ kmem_cache_free(ub_cachep, new_ub); -+ } -+ return ub; -+ } -+ -+ if (!create) { -+ /* no ub found */ -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return NULL; -+ } -+ -+ if (new_ub != NULL) { -+ /* install new ub */ -+ get_beancounter(new_ub); -+ new_ub->ub_next = slot->ubh_beans; -+ slot->ubh_beans = new_ub; -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return new_ub; -+ } -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ -+ /* alloc new ub */ -+ new_ub = (struct user_beancounter *)kmem_cache_alloc(ub_cachep, -+ GFP_KERNEL); -+ if (new_ub == NULL) -+ return NULL; -+ -+ ub_debug(UBD_ALLOC, "Creating sub %p in slot %p\n", new_ub, slot); -+ memset(new_ub, 0, sizeof(*new_ub)); -+ init_beancounter_nolimits(new_ub); -+ init_beancounter_store(new_ub); -+ init_beancounter_struct(new_ub); -+ atomic_set(&new_ub->ub_refcount, 0); -+ new_ub->ub_uid = id; -+ new_ub->parent = get_beancounter(p); -+ goto retry; -+} -+EXPORT_SYMBOL(get_subbeancounter_byid); -+ -+struct user_beancounter *subbeancounter_findcreate(struct user_beancounter *p, -+ int id) -+{ -+ struct user_beancounter *ub; -+ unsigned long flags; -+ struct ub_hash_slot *slot; -+ -+ slot = &ub_hash[ub_subhash_fun(p, id)]; -+ -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ ub = slot->ubh_beans; -+ while (ub != NULL && (ub->parent != p || ub->ub_uid != id)) -+ ub = ub->ub_next; -+ -+ if (ub != NULL) { -+ /* found */ -+ get_beancounter(ub); -+ goto done; -+ } -+ -+ /* alloc new ub */ -+ /* Can be called from non-atomic contexts. 
Den */ -+ ub = (struct user_beancounter *)kmem_cache_alloc(ub_cachep, GFP_ATOMIC); -+ if (ub == NULL) -+ goto done; -+ -+ ub_debug(UBD_ALLOC, "Creating sub %p in slot %p\n", ub, slot); -+ memset(ub, 0, sizeof(*ub)); -+ init_beancounter_nolimits(ub); -+ init_beancounter_store(ub); -+ init_beancounter_struct(ub); -+ atomic_set(&ub->ub_refcount, 0); -+ ub->ub_uid = id; -+ ub->parent = get_beancounter(p); -+ -+ /* install new ub */ -+ get_beancounter(ub); -+ ub->ub_next = slot->ubh_beans; -+ slot->ubh_beans = ub; -+ -+done: -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return ub; -+} -+EXPORT_SYMBOL(subbeancounter_findcreate); -+#ifndef CONFIG_UBC_KEEP_UNUSED -+ -+static int verify_res(struct user_beancounter *ub, int resource, -+ unsigned long held) -+{ -+ char id[64]; -+ -+ if (likely(held == 0)) -+ return 1; -+ -+ print_ub_uid(ub, id, sizeof(id)); -+ printk(KERN_WARNING "Ub %s helds %lu in %s on put\n", -+ id, held, ub_rnames[resource]); -+ return 0; -+} -+ -+static inline void verify_held(struct user_beancounter *ub) -+{ -+ int i, clean; -+ -+ clean = 1; -+ for (i = 0; i < UB_RESOURCES; i++) -+ clean &= verify_res(ub, i, ub->ub_parms[i].held); -+ -+ clean &= verify_res(ub, UB_UNUSEDPRIVVM, ub->ub_unused_privvmpages); -+ clean &= verify_res(ub, UB_TMPFSPAGES, ub->ub_tmpfs_respages); -+ clean &= verify_res(ub, UB_SWAPPAGES, ub->ub_swap_pages); -+ clean &= verify_res(ub, UB_HELDPAGES, (unsigned long)ub->ub_held_pages); -+ -+ ub_debug_trace(!clean, 5, 60*HZ); -+} -+ -+static void __unhash_beancounter(struct user_beancounter *ub) -+{ -+ struct user_beancounter **ubptr; -+ struct ub_hash_slot *slot; -+ -+ if (ub->parent != NULL) -+ slot = &ub_hash[ub_subhash_fun(ub->parent, ub->ub_uid)]; -+ else -+ slot = &ub_hash[ub_hash_fun(ub->ub_uid)]; -+ ubptr = &slot->ubh_beans; -+ -+ while (*ubptr != NULL) { -+ if (*ubptr == ub) { -+ verify_held(ub); -+ *ubptr = ub->ub_next; -+ return; -+ } -+ ubptr = &((*ubptr)->ub_next); -+ } -+ printk(KERN_ERR "Invalid beancounter %p, 
luid=%d on free, slot %p\n", -+ ub, ub->ub_uid, slot); -+} -+#endif -+ -+void __put_beancounter(struct user_beancounter *ub) -+{ -+ unsigned long flags; -+ struct user_beancounter *parent; -+ -+again: -+ parent = ub->parent; -+ ub_debug(UBD_ALLOC, "__put bc %p (cnt %d) for %.20s pid %d " -+ "cur %08lx cpu %d.\n", -+ ub, atomic_read(&ub->ub_refcount), -+ current->comm, current->pid, -+ (unsigned long)current, smp_processor_id()); -+ -+ /* equevalent to atomic_dec_and_lock_irqsave() */ -+ local_irq_save(flags); -+ if (likely(!atomic_dec_and_lock(&ub->ub_refcount, &ub_hash_lock))) { -+ if (unlikely(atomic_read(&ub->ub_refcount) < 0)) -+ printk(KERN_ERR "UB: Bad ub refcount: ub=%p, " -+ "luid=%d, ref=%d\n", -+ ub, ub->ub_uid, -+ atomic_read(&ub->ub_refcount)); -+ local_irq_restore(flags); -+ return; -+ } -+ -+ if (unlikely(ub == get_ub0())) { -+ printk(KERN_ERR "Trying to put ub0\n"); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ return; -+ } -+ -+#ifndef CONFIG_UBC_KEEP_UNUSED -+ __unhash_beancounter(ub); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ ub_free_counters(ub); -+ kmem_cache_free(ub_cachep, ub); -+#else -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+#endif -+ ub = parent; -+ if (ub != NULL) -+ goto again; -+} -+EXPORT_SYMBOL(__put_beancounter); -+ -+/* -+ * Generic resource charging stuff -+ */ -+ -+int __charge_beancounter_locked(struct user_beancounter *ub, -+ int resource, unsigned long val, enum severity strict) -+{ -+ ub_debug_resource(resource, "Charging %lu for %d of %p with %lu\n", -+ val, resource, ub, ub->ub_parms[resource].held); -+ /* -+ * ub_value <= UB_MAXVALUE, value <= UB_MAXVALUE, and only one addition -+ * at the moment is possible so an overflow is impossible. 
-+ */ -+ ub->ub_parms[resource].held += val; -+ -+ switch (strict) { -+ case UB_HARD: -+ if (ub->ub_parms[resource].held > -+ ub->ub_parms[resource].barrier) -+ break; -+ case UB_SOFT: -+ if (ub->ub_parms[resource].held > -+ ub->ub_parms[resource].limit) -+ break; -+ case UB_FORCE: -+ ub_adjust_maxheld(ub, resource); -+ return 0; -+ default: -+ BUG(); -+ } -+ -+ if (strict == UB_SOFT && ub_ratelimit(&ub->ub_limit_rl)) -+ printk(KERN_INFO "Fatal resource shortage: %s, UB %d.\n", -+ ub_rnames[resource], ub->ub_uid); -+ ub->ub_parms[resource].failcnt++; -+ ub->ub_parms[resource].held -= val; -+ return -ENOMEM; -+} -+ -+int charge_beancounter(struct user_beancounter *ub, -+ int resource, unsigned long val, enum severity strict) -+{ -+ int retval; -+ struct user_beancounter *p, *q; -+ unsigned long flags; -+ -+ retval = -EINVAL; -+ if (val > UB_MAXVALUE) -+ goto out; -+ -+ local_irq_save(flags); -+ for (p = ub; p != NULL; p = p->parent) { -+ spin_lock(&p->ub_lock); -+ retval = __charge_beancounter_locked(p, resource, val, strict); -+ spin_unlock(&p->ub_lock); -+ if (retval) -+ goto unroll; -+ } -+out_restore: -+ local_irq_restore(flags); -+out: -+ return retval; -+ -+unroll: -+ for (q = ub; q != p; q = q->parent) { -+ spin_lock(&q->ub_lock); -+ __uncharge_beancounter_locked(q, resource, val); -+ spin_unlock(&q->ub_lock); -+ } -+ goto out_restore; -+} -+ -+EXPORT_SYMBOL(charge_beancounter); -+ -+void charge_beancounter_notop(struct user_beancounter *ub, -+ int resource, unsigned long val) -+{ -+ struct user_beancounter *p; -+ unsigned long flags; -+ -+ local_irq_save(flags); -+ for (p = ub; p->parent != NULL; p = p->parent) { -+ spin_lock(&p->ub_lock); -+ __charge_beancounter_locked(p, resource, val, UB_FORCE); -+ spin_unlock(&p->ub_lock); -+ } -+ local_irq_restore(flags); -+} -+ -+EXPORT_SYMBOL(charge_beancounter_notop); -+ -+void uncharge_warn(struct user_beancounter *ub, int resource, -+ unsigned long val, unsigned long held) -+{ -+ char id[64]; -+ -+ print_ub_uid(ub, 
id, sizeof(id)); -+ printk(KERN_ERR "Uncharging too much %lu h %lu, res %s ub %s\n", -+ val, held, ub_rnames[resource], id); -+ ub_debug_trace(1, 10, 10*HZ); -+} -+ -+void __uncharge_beancounter_locked(struct user_beancounter *ub, -+ int resource, unsigned long val) -+{ -+ ub_debug_resource(resource, "Uncharging %lu for %d of %p with %lu\n", -+ val, resource, ub, ub->ub_parms[resource].held); -+ if (ub->ub_parms[resource].held < val) { -+ uncharge_warn(ub, resource, -+ val, ub->ub_parms[resource].held); -+ val = ub->ub_parms[resource].held; -+ } -+ ub->ub_parms[resource].held -= val; -+} -+ -+void uncharge_beancounter(struct user_beancounter *ub, -+ int resource, unsigned long val) -+{ -+ unsigned long flags; -+ struct user_beancounter *p; -+ -+ for (p = ub; p != NULL; p = p->parent) { -+ spin_lock_irqsave(&p->ub_lock, flags); -+ __uncharge_beancounter_locked(p, resource, val); -+ spin_unlock_irqrestore(&p->ub_lock, flags); -+ } -+} -+ -+EXPORT_SYMBOL(uncharge_beancounter); -+ -+void uncharge_beancounter_notop(struct user_beancounter *ub, -+ int resource, unsigned long val) -+{ -+ struct user_beancounter *p; -+ unsigned long flags; -+ -+ local_irq_save(flags); -+ for (p = ub; p->parent != NULL; p = p->parent) { -+ spin_lock(&p->ub_lock); -+ __uncharge_beancounter_locked(p, resource, val); -+ spin_unlock(&p->ub_lock); -+ } -+ local_irq_restore(flags); -+} -+ -+EXPORT_SYMBOL(uncharge_beancounter_notop); -+ -+ -+/* -+ * Rate limiting stuff. 
-+ */ -+int ub_ratelimit(struct ub_rate_info *p) -+{ -+ unsigned long cjif, djif; -+ unsigned long flags; -+ static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED; -+ long new_bucket; -+ -+ spin_lock_irqsave(&ratelimit_lock, flags); -+ cjif = jiffies; -+ djif = cjif - p->last; -+ if (djif < p->interval) { -+ if (p->bucket >= p->burst) { -+ spin_unlock_irqrestore(&ratelimit_lock, flags); -+ return 0; -+ } -+ p->bucket++; -+ } else { -+ new_bucket = p->bucket - (djif / (unsigned)p->interval); -+ if (new_bucket < 0) -+ new_bucket = 0; -+ p->bucket = new_bucket + 1; -+ } -+ p->last = cjif; -+ spin_unlock_irqrestore(&ratelimit_lock, flags); -+ return 1; -+} -+EXPORT_SYMBOL(ub_ratelimit); -+ -+ -+/* -+ * Initialization -+ * -+ * struct user_beancounter contains -+ * - limits and other configuration settings, -+ * with a copy stored for accounting purposes, -+ * - structural fields: lists, spinlocks and so on. -+ * -+ * Before these parts are initialized, the structure should be memset -+ * to 0 or copied from a known clean structure. That takes care of a lot -+ * of fields not initialized explicitly. -+ */ -+ -+static void init_beancounter_struct(struct user_beancounter *ub) -+{ -+ ub->ub_magic = UB_MAGIC; -+ atomic_set(&ub->ub_refcount, 1); -+ spin_lock_init(&ub->ub_lock); -+ INIT_LIST_HEAD(&ub->ub_tcp_sk_list); -+ INIT_LIST_HEAD(&ub->ub_other_sk_list); -+#ifdef CONFIG_UBC_DEBUG_KMEM -+ INIT_LIST_HEAD(&ub->ub_cclist); -+#endif -+} -+ -+static void init_beancounter_store(struct user_beancounter *ub) -+{ -+ int k; -+ -+ for (k = 0; k < UB_RESOURCES; k++) { -+ memcpy(&ub->ub_store[k], &ub->ub_parms[k], -+ sizeof(struct ubparm)); -+ } -+} -+ -+static void init_beancounter_nolimits(struct user_beancounter *ub) -+{ -+ int k; -+ -+ for (k = 0; k < UB_RESOURCES; k++) { -+ ub->ub_parms[k].limit = UB_MAXVALUE; -+ /* FIXME: whether this is right for physpages and guarantees? */ -+ ub->ub_parms[k].barrier = UB_MAXVALUE; -+ } -+ -+ /* FIXME: set unlimited rate? 
*/ -+ ub->ub_limit_rl.burst = 4; -+ ub->ub_limit_rl.interval = 300*HZ; -+} -+ -+static void init_beancounter_syslimits(struct user_beancounter *ub, -+ unsigned long mp) -+{ -+ extern int max_threads; -+ int k; -+ -+ ub->ub_parms[UB_KMEMSIZE].limit = -+ mp > (192*1024*1024 >> PAGE_SHIFT) ? -+ 32*1024*1024 : (mp << PAGE_SHIFT) / 6; -+ ub->ub_parms[UB_LOCKEDPAGES].limit = 8; -+ ub->ub_parms[UB_PRIVVMPAGES].limit = UB_MAXVALUE; -+ ub->ub_parms[UB_SHMPAGES].limit = 64; -+ ub->ub_parms[UB_NUMPROC].limit = max_threads / 2; -+ ub->ub_parms[UB_NUMTCPSOCK].limit = 1024; -+ ub->ub_parms[UB_TCPSNDBUF].limit = 1024*4*1024; /* 4k per socket */ -+ ub->ub_parms[UB_TCPRCVBUF].limit = 1024*6*1024; /* 6k per socket */ -+ ub->ub_parms[UB_NUMOTHERSOCK].limit = 256; -+ ub->ub_parms[UB_DGRAMRCVBUF].limit = 256*4*1024; /* 4k per socket */ -+ ub->ub_parms[UB_OTHERSOCKBUF].limit = 256*8*1024; /* 8k per socket */ -+ ub->ub_parms[UB_NUMFLOCK].limit = 1024; -+ ub->ub_parms[UB_NUMPTY].limit = 16; -+ ub->ub_parms[UB_NUMSIGINFO].limit = 1024; -+ ub->ub_parms[UB_DCACHESIZE].limit = 1024*1024; -+ ub->ub_parms[UB_NUMFILE].limit = 1024; -+ -+ for (k = 0; k < UB_RESOURCES; k++) -+ ub->ub_parms[k].barrier = ub->ub_parms[k].limit; -+ -+ ub->ub_limit_rl.burst = 4; -+ ub->ub_limit_rl.interval = 300*HZ; -+} -+ -+void __init ub0_init(void) -+{ -+ struct user_beancounter *ub; -+ -+ init_cache_counters(); -+ ub = get_ub0(); -+ memset(ub, 0, sizeof(*ub)); -+ ub->ub_uid = 0; -+ init_beancounter_nolimits(ub); -+ init_beancounter_store(ub); -+ init_beancounter_struct(ub); -+ -+ memset(task_bc(current), 0, sizeof(struct task_beancounter)); -+ (void)set_exec_ub(get_ub0()); -+ task_bc(current)->fork_sub = get_beancounter(get_ub0()); -+ mm_ub(&init_mm) = get_beancounter(ub); -+} -+ -+void __init ub_hash_init(void) -+{ -+ struct ub_hash_slot *slot; -+ -+ spin_lock_init(&ub_hash_lock); -+ /* insert ub0 into the hash */ -+ slot = &ub_hash[ub_hash_fun(get_ub0()->ub_uid)]; -+ slot->ubh_beans = get_ub0(); -+} -+ -+void 
__init beancounter_init(unsigned long mempages) -+{ -+ extern int skbc_cache_init(void); -+ int res; -+ -+ res = skbc_cache_init(); -+ ub_cachep = kmem_cache_create("user_beancounters", -+ sizeof(struct user_beancounter), -+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL); -+ if (res < 0 || ub_cachep == NULL) -+ panic("Can't create ubc caches\n"); -+ -+ memset(&default_beancounter, 0, sizeof(default_beancounter)); -+#ifdef CONFIG_UBC_UNLIMITED -+ init_beancounter_nolimits(&default_beancounter); -+#else -+ init_beancounter_syslimits(&default_beancounter, mempages); -+#endif -+ init_beancounter_store(&default_beancounter); -+ init_beancounter_struct(&default_beancounter); -+ -+ ub_hash_init(); -+} -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_dcache.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_dcache.c ---- linux-2.6.8.1.orig/kernel/ub/ub_dcache.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_dcache.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,333 @@ -+/* -+ * kernel/ub/ub_dcache.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/config.h> -+#include <linux/dcache.h> -+#include <linux/slab.h> -+#include <linux/kmem_cache.h> -+#include <linux/err.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> -+#include <ub/ub_dcache.h> -+ -+/* -+ * Locking -+ * traverse dcache_lock d_lock -+ * ub_dentry_charge + + + -+ * ub_dentry_uncharge + - + -+ * ub_dentry_charge_nofail + + - -+ * -+ * d_inuse is atomic so that we can inc dentry's parent d_inuse in -+ * ub_dentry_charge with only the dentry's d_lock held. -+ * -+ * Race in uncharge vs charge_nofail is handled with dcache_lock. -+ * Race in charge vs charge_nofail is inessential since they both inc d_inuse. -+ * Race in uncharge vs charge is handled by altering d_inuse under d_lock.
-+ * -+ * Race with d_move is handled this way: -+ * - charge_nofail and uncharge are protected by dcache_lock; -+ * - charge works only with dentry and dentry->d_parent->d_inuse, so -+ * it's enough to lock only the dentry. -+ */ -+ -+/* -+ * Beancounting -+ * UB argument must NOT be NULL -+ */ -+ -+static int do_charge_dcache(struct user_beancounter *ub, unsigned long size, -+ enum severity sv) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (__charge_beancounter_locked(ub, UB_KMEMSIZE, CHARGE_SIZE(size), sv)) -+ goto out_mem; -+ if (__charge_beancounter_locked(ub, UB_DCACHESIZE, size, sv)) -+ goto out_dcache; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return 0; -+ -+out_dcache: -+ __uncharge_beancounter_locked(ub, UB_KMEMSIZE, CHARGE_SIZE(size)); -+out_mem: -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return -ENOMEM; -+} -+ -+static void do_uncharge_dcache(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ __uncharge_beancounter_locked(ub, UB_KMEMSIZE, CHARGE_SIZE(size)); -+ __uncharge_beancounter_locked(ub, UB_DCACHESIZE, size); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+static int charge_dcache(struct user_beancounter *ub, unsigned long size, -+ enum severity sv) -+{ -+ struct user_beancounter *p, *q; -+ -+ for (p = ub; p != NULL; p = p->parent) { -+ if (do_charge_dcache(p, size, sv)) -+ goto unroll; -+ } -+ return 0; -+ -+unroll: -+ for (q = ub; q != p; q = q->parent) -+ do_uncharge_dcache(q, size); -+ return -ENOMEM; -+} -+ -+void uncharge_dcache(struct user_beancounter *ub, unsigned long size) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_uncharge_dcache(ub, size); -+} -+ -+static inline void charge_dcache_forced(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ charge_dcache(ub, size, UB_FORCE); -+} -+ -+static inline void d_forced_charge(struct dentry_beancounter *d_bc) -+{ -+ d_bc->d_ub = 
get_beancounter(get_exec_ub()); -+ if (d_bc->d_ub == NULL) -+ return; -+ -+ charge_dcache_forced(d_bc->d_ub, d_bc->d_ubsize); -+} -+ -+static inline void d_uncharge(struct dentry_beancounter *d_bc) -+{ -+ if (d_bc->d_ub == NULL) -+ return; -+ -+ uncharge_dcache(d_bc->d_ub, d_bc->d_ubsize); -+ put_beancounter(d_bc->d_ub); -+ d_bc->d_ub = NULL; -+} -+ -+/* -+ * Alloc / free dentry_beancounter -+ */ -+ -+static inline int d_alloc_beancounter(struct dentry *d) -+{ -+ return 0; -+} -+ -+static inline void d_free_beancounter(struct dentry_beancounter *d_bc) -+{ -+} -+ -+static inline unsigned long d_charge_size(struct dentry *dentry) -+{ -+ /* dentry's d_name is already set to appropriate value (see d_alloc) */ -+ return inode_memusage() + dentry_memusage() + -+ (dname_external(dentry) ? -+ kmem_obj_memusage((void *)dentry->d_name.name) : 0); -+} -+ -+/* -+ * dentry mark in use operation -+ * d_lock is held -+ */ -+ -+static int d_inc_inuse(struct dentry *dentry) -+{ -+ struct user_beancounter *ub; -+ struct dentry_beancounter *d_bc; -+ -+ if (dentry != dentry->d_parent) { -+ struct dentry *parent; -+ -+ /* -+ * Increment d_inuse of parent. -+ * It can't change since dentry->d_lock is held. 
-+ */ -+ parent = dentry->d_parent; -+ if (atomic_inc_and_test(&dentry_bc(parent)->d_inuse)) -+ BUG(); -+ } -+ -+ d_bc = dentry_bc(dentry); -+ ub = get_beancounter(get_exec_ub()); -+ -+ if (ub != NULL && charge_dcache(ub, d_bc->d_ubsize, UB_SOFT)) -+ goto out_err; -+ -+ d_bc->d_ub = ub; -+ return 0; -+ -+out_err: -+ put_beancounter(ub); -+ d_bc->d_ub = NULL; -+ return -ENOMEM; -+} -+ -+/* -+ * no locks -+ */ -+int ub_dentry_alloc(struct dentry *dentry) -+{ -+ int err; -+ struct dentry_beancounter *d_bc; -+ -+ err = d_alloc_beancounter(dentry); -+ if (err < 0) -+ return err; -+ -+ d_bc = dentry_bc(dentry); -+ d_bc->d_ub = get_beancounter(get_exec_ub()); -+ atomic_set(&d_bc->d_inuse, 0); /* see comment in ub_dcache.h */ -+ d_bc->d_ubsize = d_charge_size(dentry); -+ -+ err = 0; -+ if (d_bc->d_ub != NULL && -+ charge_dcache(d_bc->d_ub, d_bc->d_ubsize, UB_HARD)) { -+ put_beancounter(d_bc->d_ub); -+ d_free_beancounter(d_bc); -+ err = -ENOMEM; -+ } -+ -+ return err; -+} -+ -+void ub_dentry_free(struct dentry *dentry) -+{ -+} -+ -+/* -+ * Charge / uncharge functions. -+ * -+ * We take d_lock to protect dentry_bc from concurrent access -+ * when simultaneous __d_lookup and d_put happen on one dentry. -+ */ -+ -+/* -+ * no dcache_lock, d_lock and rcu_read_lock are held -+ * drops d_lock, rcu_read_lock and returns error if any -+ */ -+int ub_dentry_charge(struct dentry *dentry) -+{ -+ int err; -+ -+ err = 0; -+ if (atomic_inc_and_test(&dentry_bc(dentry)->d_inuse)) -+ err = d_inc_inuse(dentry); -+ -+ /* -+ * d_lock and rcu_read_lock are dropped here -+ * (see also __d_lookup) -+ */ -+ spin_unlock(&dentry->d_lock); -+ rcu_read_unlock(); -+ -+ if (!err) -+ return 0; -+ -+ /* -+ * d_invalidate is required for real_lookup -+ * since it tries to create new dentry on -+ * d_lookup failure.
-+ */ -+ if (!d_invalidate(dentry)) -+ return err; -+ -+ /* didn't succeed, force dentry to be charged */ -+ d_forced_charge(dentry_bc(dentry)); -+ return 0; -+} -+ -+/* -+ * dcache_lock is held -+ * no d_locks, sequentially takes and drops from dentry upward -+ */ -+void ub_dentry_uncharge(struct dentry *dentry) -+{ -+ struct dentry_beancounter *d_bc; -+ struct dentry *parent; -+ -+ /* go up until status is changed and root is not reached */ -+ while (1) { -+ d_bc = dentry_bc(dentry); -+ -+ /* -+ * We need d_lock here to handle -+ * the race with ub_dentry_charge -+ */ -+ spin_lock(&dentry->d_lock); -+ if (!atomic_add_negative(-1, &d_bc->d_inuse)) { -+ spin_unlock(&dentry->d_lock); -+ break; -+ } -+ -+ /* state transition 0 => -1 */ -+ d_uncharge(d_bc); -+ parent = dentry->d_parent; -+ spin_unlock(&dentry->d_lock); -+ -+ /* -+ * dcache_lock is held (see comment in __dget_locked) -+ * so we can safely move upwards. -+ */ -+ if (dentry == parent) -+ break; -+ dentry = parent; -+ } -+} -+ -+/* -+ * forced version. for dget in clean cache, when error is not an option -+ * -+ * dcache_lock is held -+ * no d_locks -+ */ -+void ub_dentry_charge_nofail(struct dentry *dentry) -+{ -+ struct dentry_beancounter *d_bc; -+ struct dentry *parent; -+ -+ /* go up until status is changed and root is not reached */ -+ while (1) { -+ d_bc = dentry_bc(dentry); -+ if (!atomic_inc_and_test(&d_bc->d_inuse)) -+ break; -+ -+ /* -+ * state transition -1 => 0 -+ * -+ * No need to lock dentry before atomic_inc -+ * like we do in ub_dentry_uncharge. -+ * We can't race with ub_dentry_uncharge due -+ * to dcache_lock. The only possible race with -+ * ub_dentry_charge is OK since they both -+ * do atomic_inc. -+ */ -+ d_forced_charge(d_bc); -+ /* -+ * dcache_lock is held (see comment in __dget_locked) -+ * so we can safely move upwards.
-+ */ -+ parent = dentry->d_parent; -+ -+ if (dentry == parent) -+ break; -+ dentry = parent; -+ } -+} -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_mem.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_mem.c ---- linux-2.6.8.1.orig/kernel/ub/ub_mem.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_mem.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,377 @@ -+/* -+ * kernel/ub/ub_mem.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/slab.h> -+#include <linux/kmem_cache.h> -+#include <linux/kmem_slab.h> -+#include <linux/highmem.h> -+#include <linux/vmalloc.h> -+#include <linux/mm.h> -+#include <linux/gfp.h> -+#include <linux/swap.h> -+#include <linux/spinlock.h> -+#include <linux/sched.h> -+#include <linux/module.h> -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> -+#include <ub/ub_hash.h> -+ -+/* -+ * Initialization -+ */ -+ -+extern void __init page_beancounters_init(void); -+ -+void __init page_ubc_init(void) -+{ -+#ifdef CONFIG_USER_RSS_ACCOUNTING -+ page_beancounters_init(); -+#endif -+} -+ -+/* -+ * Slab accounting -+ */ -+ -+#ifdef CONFIG_UBC_DEBUG_KMEM -+ -+#define CC_HASH_SIZE 1024 -+static struct ub_cache_counter *cc_hash[CC_HASH_SIZE]; -+spinlock_t cc_lock; -+ -+static void __free_cache_counters(struct user_beancounter *ub, -+ kmem_cache_t *cachep) -+{ -+ struct ub_cache_counter *cc, **pprev, *del; -+ int i; -+ unsigned long flags; -+ -+ del = NULL; -+ spin_lock_irqsave(&cc_lock, flags); -+ for (i = 0; i < CC_HASH_SIZE; i++) { -+ pprev = &cc_hash[i]; -+ cc = cc_hash[i]; -+ while (cc != NULL) { -+ if (cc->ub != ub && cc->cachep != cachep) { -+ pprev = &cc->next; -+ cc = cc->next; -+ continue; -+ } -+ -+ list_del(&cc->ulist); -+ *pprev = cc->next; -+ cc->next = del; -+ del = cc; -+ cc = *pprev; -+ } -+ } -+ spin_unlock_irqrestore(&cc_lock, flags); -+ -+ while (del != NULL) { -+ cc = del->next; -+ kfree(del); -+ del = cc; 
-+ } -+} -+ -+void ub_free_counters(struct user_beancounter *ub) -+{ -+ __free_cache_counters(ub, NULL); -+} -+ -+void ub_kmemcache_free(kmem_cache_t *cachep) -+{ -+ __free_cache_counters(NULL, cachep); -+} -+ -+void __init init_cache_counters(void) -+{ -+ memset(cc_hash, 0, CC_HASH_SIZE * sizeof(cc_hash[0])); -+ spin_lock_init(&cc_lock); -+} -+ -+#define cc_hash_fun(ub, cachep) ( \ -+ (((unsigned long)(ub) >> L1_CACHE_SHIFT) ^ \ -+ ((unsigned long)(ub) >> (BITS_PER_LONG / 2)) ^ \ -+ ((unsigned long)(cachep) >> L1_CACHE_SHIFT) ^ \ -+ ((unsigned long)(cachep) >> (BITS_PER_LONG / 2)) \ -+ ) & (CC_HASH_SIZE - 1)) -+ -+static int change_slab_charged(struct user_beancounter *ub, void *objp, -+ unsigned long val, int mask) -+{ -+ struct ub_cache_counter *cc, *new_cnt, **pprev; -+ kmem_cache_t *cachep; -+ unsigned long flags; -+ -+ cachep = GET_PAGE_CACHE(virt_to_page(objp)); -+ new_cnt = NULL; -+ -+again: -+ spin_lock_irqsave(&cc_lock, flags); -+ cc = cc_hash[cc_hash_fun(ub, cachep)]; -+ while (cc) { -+ if (cc->ub == ub && cc->cachep == cachep) -+ goto found; -+ cc = cc->next; -+ } -+ -+ if (new_cnt != NULL) -+ goto insert; -+ -+ spin_unlock_irqrestore(&cc_lock, flags); -+ -+ new_cnt = kmalloc(sizeof(*new_cnt), mask & ~__GFP_UBC); -+ if (new_cnt == NULL) -+ return -ENOMEM; -+ -+ new_cnt->counter = 0; -+ new_cnt->ub = ub; -+ new_cnt->cachep = cachep; -+ goto again; -+ -+insert: -+ pprev = &cc_hash[cc_hash_fun(ub, cachep)]; -+ new_cnt->next = *pprev; -+ *pprev = new_cnt; -+ list_add(&new_cnt->ulist, &ub->ub_cclist); -+ cc = new_cnt; -+ new_cnt = NULL; -+ -+found: -+ cc->counter += val; -+ spin_unlock_irqrestore(&cc_lock, flags); -+ if (new_cnt) -+ kfree(new_cnt); -+ return 0; -+} -+ -+static inline int inc_slab_charged(struct user_beancounter *ub, -+ void *objp, int mask) -+{ -+ return change_slab_charged(ub, objp, 1, mask); -+} -+ -+static inline void dec_slab_charged(struct user_beancounter *ub, void *objp) -+{ -+ if (change_slab_charged(ub, objp, -1, 0) < 0) -+ BUG(); 
-+} -+ -+#include <linux/vmalloc.h> -+ -+static inline int inc_pages_charged(struct user_beancounter *ub, -+ struct page *pg, int order) -+{ -+ int cpu; -+ -+ cpu = get_cpu(); -+ ub->ub_pages_charged[cpu]++; -+ put_cpu(); -+ return 0; -+} -+ -+static inline void dec_pages_charged(struct user_beancounter *ub, -+ struct page *pg, int order) -+{ -+ int cpu; -+ -+ cpu = get_cpu(); -+ ub->ub_pages_charged[cpu]--; -+ put_cpu(); -+} -+ -+void inc_vmalloc_charged(struct vm_struct *vm, int flags) -+{ -+ int cpu; -+ struct user_beancounter *ub; -+ -+ if (!(flags & __GFP_UBC)) -+ return; -+ -+ ub = get_exec_ub(); -+ if (ub == NULL) -+ return; -+ -+ cpu = get_cpu(); -+ ub->ub_vmalloc_charged[cpu] += vm->nr_pages; -+ put_cpu(); -+} -+ -+void dec_vmalloc_charged(struct vm_struct *vm) -+{ -+ int cpu; -+ struct user_beancounter *ub; -+ -+ ub = page_ub(vm->pages[0]); -+ if (ub == NULL) -+ return; -+ -+ cpu = get_cpu(); -+ ub->ub_vmalloc_charged[cpu] -= vm->nr_pages; -+ put_cpu(); -+} -+ -+#else -+#define inc_slab_charged(ub, o, m) (0) -+#define dec_slab_charged(ub, o) do { } while (0) -+#define inc_pages_charged(ub, pg, o) (0) -+#define dec_pages_charged(ub, pg, o) do { } while (0) -+#endif -+ -+static inline struct user_beancounter **slab_ub_ref(void *objp) -+{ -+ struct page *pg; -+ kmem_cache_t *cachep; -+ struct slab *slabp; -+ int objnr; -+ -+ pg = virt_to_page(objp); -+ cachep = GET_PAGE_CACHE(pg); -+ BUG_ON(!(cachep->flags & SLAB_UBC)); -+ slabp = GET_PAGE_SLAB(pg); -+ objnr = (objp - slabp->s_mem) / cachep->objsize; -+ return slab_ubcs(cachep, slabp) + objnr; -+} -+ -+struct user_beancounter *slab_ub(void *objp) -+{ -+ struct user_beancounter **ub_ref; -+ -+ ub_ref = slab_ub_ref(objp); -+ return *ub_ref; -+} -+ -+EXPORT_SYMBOL(slab_ub); -+ -+int ub_slab_charge(void *objp, int flags) -+{ -+ unsigned int size; -+ struct user_beancounter *ub; -+ -+ ub = get_beancounter(get_exec_ub()); -+ if (ub == NULL) -+ return 0; -+ -+ size = CHARGE_SIZE(kmem_obj_memusage(objp)); -+ if 
(charge_beancounter(ub, UB_KMEMSIZE, size, -+ (flags & __GFP_SOFT_UBC ? UB_SOFT : UB_HARD))) -+ goto out_err; -+ -+ if (inc_slab_charged(ub, objp, flags) < 0) { -+ uncharge_beancounter(ub, UB_KMEMSIZE, size); -+ goto out_err; -+ } -+ *slab_ub_ref(objp) = ub; -+ return 0; -+ -+out_err: -+ put_beancounter(ub); -+ return -ENOMEM; -+} -+ -+void ub_slab_uncharge(void *objp) -+{ -+ unsigned int size; -+ struct user_beancounter **ub_ref; -+ -+ ub_ref = slab_ub_ref(objp); -+ if (*ub_ref == NULL) -+ return; -+ -+ dec_slab_charged(*ub_ref, objp); -+ size = CHARGE_SIZE(kmem_obj_memusage(objp)); -+ uncharge_beancounter(*ub_ref, UB_KMEMSIZE, size); -+ put_beancounter(*ub_ref); -+ *ub_ref = NULL; -+} -+ -+/* -+ * Pages accounting -+ */ -+ -+inline int ub_page_charge(struct page *page, int order, int mask) -+{ -+ struct user_beancounter *ub; -+ -+ ub = NULL; -+ if (!(mask & __GFP_UBC)) -+ goto out; -+ -+ ub = get_beancounter(get_exec_ub()); -+ if (ub == NULL) -+ goto out; -+ -+ if (charge_beancounter(ub, UB_KMEMSIZE, CHARGE_ORDER(order), -+ (mask & __GFP_SOFT_UBC ? 
UB_SOFT : UB_HARD))) -+ goto err; -+ if (inc_pages_charged(ub, page, order) < 0) { -+ uncharge_beancounter(ub, UB_KMEMSIZE, CHARGE_ORDER(order)); -+ goto err; -+ } -+out: -+ BUG_ON(page_ub(page) != NULL); -+ page_ub(page) = ub; -+ return 0; -+ -+err: -+ BUG_ON(page_ub(page) != NULL); -+ put_beancounter(ub); -+ return -ENOMEM; -+} -+ -+inline void ub_page_uncharge(struct page *page, int order) -+{ -+ struct user_beancounter *ub; -+ -+ ub = page_ub(page); -+ if (ub == NULL) -+ return; -+ -+ dec_pages_charged(ub, page, order); -+ BUG_ON(ub->ub_magic != UB_MAGIC); -+ uncharge_beancounter(ub, UB_KMEMSIZE, CHARGE_ORDER(order)); -+ put_beancounter(ub); -+ page_ub(page) = NULL; -+} -+ -+/* -+ * takes init_mm.page_table_lock -+ * some outer lock to protect pages from vmalloced area must be held -+ */ -+struct user_beancounter *vmalloc_ub(void *obj) -+{ -+ struct page *pg; -+ -+ spin_lock(&init_mm.page_table_lock); -+ pg = follow_page_k((unsigned long)obj, 0); -+ spin_unlock(&init_mm.page_table_lock); -+ if (pg == NULL) -+ return NULL; -+ -+ return page_ub(pg); -+} -+ -+EXPORT_SYMBOL(vmalloc_ub); -+ -+struct user_beancounter *mem_ub(void *obj) -+{ -+ struct user_beancounter *ub; -+ -+ if ((unsigned long)obj >= VMALLOC_START && -+ (unsigned long)obj < VMALLOC_END) -+ ub = vmalloc_ub(obj); -+ else -+ ub = slab_ub(obj); -+ -+ return ub; -+} -+ -+EXPORT_SYMBOL(mem_ub); -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_misc.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_misc.c ---- linux-2.6.8.1.orig/kernel/ub/ub_misc.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_misc.c 2006-05-11 13:05:49.000000000 +0400 -@@ -0,0 +1,227 @@ -+/* -+ * kernel/ub/ub_misc.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#include <linux/tty.h> -+#include <linux/tty_driver.h> -+#include <linux/signal.h> -+#include <linux/slab.h> -+#include <linux/fs.h> -+#include <linux/sched.h> -+#include <linux/module.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> -+ -+/* -+ * Task stuff -+ */ -+ -+static void init_task_sub(struct task_struct *parent, -+ struct task_struct *tsk, -+ struct task_beancounter *old_bc) -+{ -+ struct task_beancounter *new_bc; -+ struct user_beancounter *sub; -+ -+ new_bc = task_bc(tsk); -+ sub = old_bc->fork_sub; -+ new_bc->fork_sub = get_beancounter(sub); -+ new_bc->task_fnode = NULL; -+ new_bc->task_freserv = old_bc->task_freserv; -+ old_bc->task_freserv = NULL; -+ memset(&new_bc->task_data, 0, sizeof(new_bc->task_data)); -+} -+ -+int ub_task_charge(struct task_struct *parent, struct task_struct *task) -+{ -+ struct task_beancounter *old_bc; -+ struct task_beancounter *new_bc; -+ struct user_beancounter *ub; -+ -+ old_bc = task_bc(parent); -+ ub = old_bc->fork_sub; -+ -+ if (charge_beancounter(ub, UB_NUMPROC, 1, UB_HARD) < 0) -+ return -ENOMEM; -+ -+ new_bc = task_bc(task); -+ new_bc->task_ub = get_beancounter(ub); -+ new_bc->exec_ub = get_beancounter(ub); -+ init_task_sub(parent, task, old_bc); -+ return 0; -+} -+ -+void ub_task_uncharge(struct task_struct *task) -+{ -+ struct task_beancounter *task_bc; -+ -+ task_bc = task_bc(task); -+ if (task_bc->task_ub != NULL) -+ uncharge_beancounter(task_bc->task_ub, UB_NUMPROC, 1); -+ -+ put_beancounter(task_bc->exec_ub); -+ put_beancounter(task_bc->task_ub); -+ put_beancounter(task_bc->fork_sub); -+ /* can't be freed elsewhere, failures possible in the middle of fork */ -+ if (task_bc->task_freserv != NULL) -+ kfree(task_bc->task_freserv); -+ -+ task_bc->exec_ub = (struct user_beancounter *)0xdeadbcbc; -+} -+ -+/* -+ * Files and file locks.
-+ */ -+ -+int ub_file_charge(struct file *f) -+{ -+ struct user_beancounter *ub; -+ -+ /* No need to get_beancounter here since it's already got in slab */ -+ ub = slab_ub(f); -+ if (ub == NULL) -+ return 0; -+ -+ return charge_beancounter(ub, UB_NUMFILE, 1, UB_HARD); -+} -+ -+void ub_file_uncharge(struct file *f) -+{ -+ struct user_beancounter *ub; -+ -+ /* Ub will be put in slab */ -+ ub = slab_ub(f); -+ if (ub == NULL) -+ return; -+ -+ uncharge_beancounter(ub, UB_NUMFILE, 1); -+} -+ -+int ub_flock_charge(struct file_lock *fl, int hard) -+{ -+ struct user_beancounter *ub; -+ int err; -+ -+ /* No need to get_beancounter here since it's already got in slab */ -+ ub = slab_ub(fl); -+ if (ub == NULL) -+ return 0; -+ -+ err = charge_beancounter(ub, UB_NUMFLOCK, 1, hard ? UB_HARD : UB_SOFT); -+ if (!err) -+ fl->fl_charged = 1; -+ return err; -+} -+ -+void ub_flock_uncharge(struct file_lock *fl) -+{ -+ struct user_beancounter *ub; -+ -+ /* Ub will be put in slab */ -+ ub = slab_ub(fl); -+ if (ub == NULL || !fl->fl_charged) -+ return; -+ -+ uncharge_beancounter(ub, UB_NUMFLOCK, 1); -+ fl->fl_charged = 0; -+} -+ -+/* -+ * Signal handling -+ */ -+ -+static int do_ub_siginfo_charge(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (__charge_beancounter_locked(ub, UB_KMEMSIZE, size, UB_HARD)) -+ goto out_kmem; -+ -+ if (__charge_beancounter_locked(ub, UB_NUMSIGINFO, 1, UB_HARD)) -+ goto out_num; -+ -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return 0; -+ -+out_num: -+ __uncharge_beancounter_locked(ub, UB_KMEMSIZE, size); -+out_kmem: -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return -ENOMEM; -+} -+ -+static void do_ub_siginfo_uncharge(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ __uncharge_beancounter_locked(ub, UB_KMEMSIZE, size); -+ __uncharge_beancounter_locked(ub, UB_NUMSIGINFO, 1); -+ 
spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+int ub_siginfo_charge(struct user_beancounter *ub, unsigned long size) -+{ -+ struct user_beancounter *p, *q; -+ -+ size = CHARGE_SIZE(size); -+ for (p = ub; p != NULL; p = p->parent) { -+ if (do_ub_siginfo_charge(p, size)) -+ goto unroll; -+ } -+ return 0; -+ -+unroll: -+ for (q = ub; q != p; q = q->parent) -+ do_ub_siginfo_uncharge(q, size); -+ return -ENOMEM; -+} -+ -+void ub_siginfo_uncharge(struct user_beancounter *ub, unsigned long size) -+{ -+ size = CHARGE_SIZE(size); -+ for (; ub != NULL; ub = ub->parent) -+ do_ub_siginfo_uncharge(ub, size); -+} -+ -+/* -+ * PTYs -+ */ -+ -+int ub_pty_charge(struct tty_struct *tty) -+{ -+ struct user_beancounter *ub; -+ int retval; -+ -+ ub = tty_ub(tty); -+ retval = 0; -+ if (ub && tty->driver->subtype == PTY_TYPE_MASTER && -+ !test_bit(TTY_CHARGED, &tty->flags)) { -+ retval = charge_beancounter(ub, UB_NUMPTY, 1, UB_HARD); -+ if (!retval) -+ set_bit(TTY_CHARGED, &tty->flags); -+ } -+ return retval; -+} -+ -+void ub_pty_uncharge(struct tty_struct *tty) -+{ -+ struct user_beancounter *ub; -+ -+ ub = tty_ub(tty); -+ if (ub && tty->driver->subtype == PTY_TYPE_MASTER && -+ test_bit(TTY_CHARGED, &tty->flags)) { -+ uncharge_beancounter(ub, UB_NUMPTY, 1); -+ clear_bit(TTY_CHARGED, &tty->flags); -+ } -+} -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_net.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_net.c ---- linux-2.6.8.1.orig/kernel/ub/ub_net.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_net.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,1041 @@ -+/* -+ * linux/kernel/ub/ub_net.c -+ * -+ * Copyright (C) 1998-2004 Andrey V. Savochkin <saw@saw.sw.com.sg> -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ * TODO: -+ * - sizeof(struct inode) charge -+ * = tcp_mem_schedule() feedback based on ub limits -+ * + measures so that one socket won't exhaust all send buffers, -+ * see bug in bugzilla -+ * = sk->socket check for NULL in snd_wakeups -+ * (tcp_write_space checks for NULL itself) -+ * + in tcp_close(), orphaned socket abortion should be based on ubc -+ * resources (same in tcp_out_of_resources) -+ * Beancounter should also have separate orphaned socket counter... -+ * + for rcv, in-order segment should be accepted -+ * if only barrier is exceeded -+ * = tcp_rmem_schedule() feedback based on ub limits -+ * - repair forward_alloc mechanism for receive buffers -+ * Its idea is that some buffer space is pre-charged so that receive fast -+ * path doesn't need to take spinlocks and do other heavy stuff -+ * + tcp_prune_queue actions based on ub limits -+ * + window adjustments depending on available buffers for receive -+ * - window adjustments depending on available buffers for send -+ * + race around usewreserv -+ * + avoid allocating new page for each tiny-gram, see letter from ANK -+ * + rename ub_sock_lock -+ * + sk->sleep wait queue probably can be used for all wakeups, and -+ * sk->ub_wait is unnecessary -+ * + for UNIX sockets, the current algorithm will lead to -+ * UB_UNIX_MINBUF-sized messages only for non-blocking case -+ * - charge for af_packet sockets -+ * + all datagram sockets should be charged to NUMUNIXSOCK -+ * - we do not charge for skb copies and clones staying in device queues -+ * + live-lock if number of sockets is big and buffer limits are small -+ * [diff-ubc-dbllim3] -+ * - check that multiple readers/writers on the same socket won't cause fatal -+ * consequences -+ * - check allocation/charge orders -+ * + There is a potential problem with callback_lock. In *snd_wakeup we take -+ * beancounter first, in sock_def_error_report - callback_lock first. -+ * then beancounter.
This is not a problem if callback_lock taken -+ * readonly, but anyway... -+ * - SKB_CHARGE_SIZE doesn't include the space wasted by slab allocator -+ * General kernel problems: -+ * - in tcp_sendmsg(), if allocation fails, non-blocking sockets with ASYNC -+ * notification won't get signals -+ * - datagram_poll looks racy -+ * -+ */ -+ -+#include <linux/net.h> -+#include <linux/slab.h> -+#include <linux/kmem_cache.h> -+#include <linux/gfp.h> -+#include <linux/err.h> -+#include <linux/socket.h> -+#include <linux/module.h> -+#include <linux/sched.h> -+ -+#include <net/sock.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_net.h> -+#include <ub/ub_debug.h> -+ -+ -+/* Skb truesize definition. Bad place. Den */ -+ -+static inline int skb_chargesize_head(struct sk_buff *skb) -+{ -+ return skb_charge_size(skb->end - skb->head + -+ sizeof(struct skb_shared_info)); -+} -+ -+int skb_charge_fullsize(struct sk_buff *skb) -+{ -+ int chargesize; -+ struct sk_buff *skbfrag; -+ -+ chargesize = skb_chargesize_head(skb) + -+ PAGE_SIZE * skb_shinfo(skb)->nr_frags; -+ if (likely(skb_shinfo(skb)->frag_list == NULL)) -+ return chargesize; -+ for (skbfrag = skb_shinfo(skb)->frag_list; -+ skbfrag != NULL; -+ skbfrag = skbfrag->next) { -+ chargesize += skb_charge_fullsize(skbfrag); -+ } -+ return chargesize; -+} -+EXPORT_SYMBOL(skb_charge_fullsize); -+ -+static int ub_sock_makewreserv_locked(struct sock *sk, -+ int bufid, int sockid, unsigned long size); -+ -+int ub_too_many_orphans(struct sock *sk, int count) -+{ -+ struct user_beancounter *ub; -+ -+ if (sock_has_ubc(sk)) { -+ for (ub = sock_bc(sk)->ub; ub->parent != NULL; ub = ub->parent); -+ if (count >= ub->ub_parms[UB_NUMTCPSOCK].barrier >> 2) -+ return 1; -+ } -+ return 0; -+} -+ -+/* -+ * Queueing -+ */ -+ -+static void ub_sock_snd_wakeup(struct user_beancounter *ub) -+{ -+ struct list_head *p; -+ struct sock_beancounter *skbc; -+ struct sock *sk; -+ struct user_beancounter *cub; -+ unsigned long added; -+ -+ while 
(!list_empty(&ub->ub_other_sk_list)) { -+ p = ub->ub_other_sk_list.next; -+ skbc = list_entry(p, struct sock_beancounter, ub_sock_list); -+ sk = skbc_sock(skbc); -+ ub_debug(UBD_NET_SLEEP, "Found sock to wake up\n"); -+ added = -skbc->poll_reserv; -+ if (ub_sock_makewreserv_locked(sk, UB_OTHERSOCKBUF, -+ UB_NUMOTHERSOCK, skbc->ub_waitspc)) -+ break; -+ added += skbc->poll_reserv; -+ -+ /* -+ * See comments in ub_tcp_snd_wakeup. -+ * Locking note: both unix_write_space and -+ * sock_def_write_space take callback_lock themselves. -+ * We take it here just to be on the safe side and to -+ * act the same way as ub_tcp_snd_wakeup does. -+ */ -+ sk->sk_write_space(sk); -+ -+ list_del_init(&skbc->ub_sock_list); -+ -+ if (skbc->ub != ub && added) { -+ cub = get_beancounter(skbc->ub); -+ spin_unlock(&ub->ub_lock); -+ charge_beancounter_notop(cub, UB_OTHERSOCKBUF, added); -+ put_beancounter(cub); -+ spin_lock(&ub->ub_lock); -+ } -+ } -+} -+ -+static void ub_tcp_snd_wakeup(struct user_beancounter *ub) -+{ -+ struct list_head *p; -+ struct sock *sk; -+ struct sock_beancounter *skbc; -+ struct socket *sock; -+ struct user_beancounter *cub; -+ unsigned long added; -+ -+ while (!list_empty(&ub->ub_tcp_sk_list)) { -+ p = ub->ub_tcp_sk_list.next; -+ skbc = list_entry(p, struct sock_beancounter, ub_sock_list); -+ sk = skbc_sock(skbc); -+ -+ added = 0; -+ sock = sk->sk_socket; -+ if (sock == NULL) -+ /* sk being destroyed */ -+ goto cont; -+ -+ ub_debug(UBD_NET_SLEEP, -+ "Checking queue, waiting %lu, reserv %lu\n", -+ skbc->ub_waitspc, skbc->poll_reserv); -+ added = -skbc->poll_reserv; -+ if (ub_sock_makewreserv_locked(sk, UB_TCPSNDBUF, -+ UB_NUMTCPSOCK, skbc->ub_waitspc)) -+ break; -+ added += skbc->poll_reserv; -+ -+ /* -+ * Send async notifications and wake up. -+ * Locking note: we get callback_lock here because -+ * tcp_write_space is over-optimistic about calling context -+ * (socket lock is presumed). So we get the lock here although -+ * it belongs to the callback. 
-+ */ -+ sk->sk_write_space(sk); -+ -+cont: -+ list_del_init(&skbc->ub_sock_list); -+ -+ if (skbc->ub != ub && added) { -+ cub = get_beancounter(skbc->ub); -+ spin_unlock(&ub->ub_lock); -+ charge_beancounter_notop(cub, UB_TCPSNDBUF, added); -+ put_beancounter(cub); -+ spin_lock(&ub->ub_lock); -+ } -+ } -+} -+ -+void ub_sock_snd_queue_add(struct sock *sk, int res, unsigned long size) -+{ -+ unsigned long flags; -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ unsigned long added_reserv; -+ -+ if (!sock_has_ubc(sk)) -+ return; -+ -+ skbc = sock_bc(sk); -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub_debug(UBD_NET_SLEEP, "attempt to charge for %lu\n", size); -+ added_reserv = -skbc->poll_reserv; -+ if (!ub_sock_makewreserv_locked(sk, res, bid2sid(res), size)) { -+ /* -+ * It looks a bit hackish, but it is compatible with both -+ * wait_for_xx_ubspace and poll. -+ * This __set_current_state is equivalent to a wakeup event -+ * right after spin_unlock_irqrestore. 
-+ */ -+ __set_current_state(TASK_RUNNING); -+ added_reserv += skbc->poll_reserv; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ if (added_reserv) -+ charge_beancounter_notop(skbc->ub, res, added_reserv); -+ return; -+ } -+ -+ ub_debug(UBD_NET_SLEEP, "Adding sk to queue\n"); -+ skbc->ub_waitspc = size; -+ if (!list_empty(&skbc->ub_sock_list)) { -+ ub_debug(UBD_NET_SOCKET, -+ "re-adding socket to beancounter %p.\n", ub); -+ goto out; -+ } -+ -+ switch (res) { -+ case UB_TCPSNDBUF: -+ list_add_tail(&skbc->ub_sock_list, -+ &ub->ub_tcp_sk_list); -+ break; -+ case UB_OTHERSOCKBUF: -+ list_add_tail(&skbc->ub_sock_list, -+ &ub->ub_other_sk_list); -+ break; -+ default: -+ BUG(); -+ } -+out: -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+ -+/* -+ * Helpers -+ */ -+ -+void ub_skb_set_charge(struct sk_buff *skb, struct sock *sk, -+ unsigned long size, int resource) -+{ -+ if (!sock_has_ubc(sk)) -+ return; -+ -+ if (sock_bc(sk)->ub == NULL) -+ BUG(); -+ skb_bc(skb)->ub = sock_bc(sk)->ub; -+ skb_bc(skb)->charged = size; -+ skb_bc(skb)->resource = resource; -+ -+ /* Ugly. Ugly. 
Skb in sk writequeue can live without ref to sk */
-+ if (skb->sk == NULL)
-+ skb->sk = sk;
-+}
-+
-+static inline void ub_skb_set_uncharge(struct sk_buff *skb)
-+{
-+ skb_bc(skb)->ub = NULL;
-+ skb_bc(skb)->charged = 0;
-+ skb_bc(skb)->resource = 0;
-+}
-+
-+static inline void __uncharge_sockbuf(struct sock_beancounter *skbc,
-+ struct user_beancounter *ub, int resource, unsigned long size)
-+{
-+ if (ub != NULL)
-+ __uncharge_beancounter_locked(ub, resource, size);
-+
-+ if (skbc != NULL) {
-+ if (skbc->ub_wcharged > size)
-+ skbc->ub_wcharged -= size;
-+ else
-+ skbc->ub_wcharged = 0;
-+ }
-+}
-+
-+static void ub_update_rmem_thres(struct sock_beancounter *skub)
-+{
-+ struct user_beancounter *ub;
-+
-+ if (skub && skub->ub) {
-+ for (ub = skub->ub; ub->parent != NULL; ub = ub->parent);
-+ ub->ub_rmem_thres = ub->ub_parms[UB_TCPRCVBUF].barrier /
-+ (ub->ub_parms[UB_NUMTCPSOCK].held + 1);
-+ }
-+}
-+inline int ub_skb_alloc_bc(struct sk_buff *skb, int gfp_mask)
-+{
-+ memset(skb_bc(skb), 0, sizeof(struct skb_beancounter));
-+ return 0;
-+}
-+
-+inline void ub_skb_free_bc(struct sk_buff *skb)
-+{
-+}
-+
-+
-+/*
-+ * Charge socket number
-+ */
-+
-+static inline int sk_alloc_beancounter(struct sock *sk)
-+{
-+ struct sock_beancounter *skbc;
-+
-+ skbc = sock_bc(sk);
-+ memset(skbc, 0, sizeof(struct sock_beancounter));
-+ return 0;
-+}
-+
-+static inline void sk_free_beancounter(struct sock *sk)
-+{
-+}
-+
-+static int __sock_charge(struct sock *sk, int res)
-+{
-+ struct sock_beancounter *skbc;
-+ struct user_beancounter *ub;
-+
-+ ub = get_exec_ub();
-+ if (ub == NULL)
-+ return 0;
-+ if (sk_alloc_beancounter(sk) < 0)
-+ return -ENOMEM;
-+
-+ skbc = sock_bc(sk);
-+ INIT_LIST_HEAD(&skbc->ub_sock_list);
-+
-+ if (charge_beancounter(ub, res, 1, UB_HARD) < 0)
-+ goto out_limit;
-+
-+ /* TCP listen sock or process keeps reference to UB */
-+ skbc->ub = get_beancounter(ub);
-+ return 0;
-+
-+out_limit:
-+ sk_free_beancounter(sk);
-+ return -ENOMEM;
-+}
-+
-+int 
ub_tcp_sock_charge(struct sock *sk) -+{ -+ int ret; -+ -+ ret = __sock_charge(sk, UB_NUMTCPSOCK); -+ ub_update_rmem_thres(sock_bc(sk)); -+ -+ return ret; -+} -+ -+int ub_other_sock_charge(struct sock *sk) -+{ -+ return __sock_charge(sk, UB_NUMOTHERSOCK); -+} -+ -+EXPORT_SYMBOL(ub_other_sock_charge); -+ -+int ub_sock_charge(struct sock *sk, int family, int type) -+{ -+ return (IS_TCP_SOCK(family, type) ? -+ ub_tcp_sock_charge(sk) : ub_other_sock_charge(sk)); -+} -+ -+/* -+ * Uncharge socket number -+ */ -+ -+void ub_sock_uncharge(struct sock *sk) -+{ -+ int is_tcp_sock; -+ unsigned long flags; -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ unsigned long reserv; -+ -+ if (!sock_has_ubc(sk)) -+ return; -+ -+ is_tcp_sock = IS_TCP_SOCK(sk->sk_family, sk->sk_type); -+ skbc = sock_bc(sk); -+ ub_debug(UBD_NET_SOCKET, "Calling ub_sock_uncharge on %p\n", sk); -+ -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (!list_empty(&skbc->ub_sock_list)) { -+ ub_debug(UBD_NET_SOCKET, -+ "ub_sock_uncharge: removing from ub(%p) queue.\n", -+ skbc); -+ list_del_init(&skbc->ub_sock_list); -+ } -+ -+ reserv = skbc->poll_reserv; -+ __uncharge_beancounter_locked(ub, -+ (is_tcp_sock ? UB_TCPSNDBUF : UB_OTHERSOCKBUF), -+ reserv); -+ __uncharge_beancounter_locked(ub, -+ (is_tcp_sock ? UB_NUMTCPSOCK : UB_NUMOTHERSOCK), 1); -+ -+ /* The check sk->sk_family != PF_NETLINK is made as the skb is -+ * queued to the kernel end of socket while changed to the user one. -+ * Den */ -+ if (skbc->ub_wcharged > reserv && -+ sk->sk_family != PF_NETLINK) { -+ skbc->ub_wcharged -= reserv; -+ printk(KERN_WARNING -+ "ub_sock_uncharge: wch=%lu for ub %p (%d).\n", -+ skbc->ub_wcharged, skbc->ub, skbc->ub->ub_uid); -+ } else -+ skbc->ub_wcharged = 0; -+ skbc->poll_reserv = 0; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ uncharge_beancounter_notop(skbc->ub, -+ (is_tcp_sock ? 
UB_TCPSNDBUF : UB_OTHERSOCKBUF), -+ reserv); -+ uncharge_beancounter_notop(skbc->ub, -+ (is_tcp_sock ? UB_NUMTCPSOCK : UB_NUMOTHERSOCK), 1); -+ -+ put_beancounter(skbc->ub); -+ sk_free_beancounter(sk); -+} -+ -+/* -+ * Send - receive buffers -+ */ -+ -+/* Special case for netlink_dump - (un)charges precalculated size */ -+int ub_nlrcvbuf_charge(struct sk_buff *skb, struct sock *sk) -+{ -+ int ret; -+ unsigned long chargesize; -+ -+ if (!sock_has_ubc(sk)) -+ return 0; -+ -+ chargesize = skb_charge_fullsize(skb); -+ ret = charge_beancounter(sock_bc(sk)->ub, -+ UB_DGRAMRCVBUF, chargesize, UB_HARD); -+ if (ret < 0) -+ return ret; -+ ub_skb_set_charge(skb, sk, chargesize, UB_DGRAMRCVBUF); -+ return ret; -+} -+ -+/* -+ * Poll reserv accounting -+ */ -+static int ub_sock_makewreserv_locked(struct sock *sk, -+ int bufid, int sockid, unsigned long size) -+{ -+ unsigned long wcharge_added; -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ -+ if (!sock_has_ubc(sk)) -+ goto out; -+ -+ skbc = sock_bc(sk); -+ if (skbc->poll_reserv >= size) /* no work to be done */ -+ goto out; -+ -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ ub->ub_parms[bufid].held += size - skbc->poll_reserv; -+ -+ wcharge_added = 0; -+ /* -+ * Logic: -+ * 1) when used memory hits barrier, we set wmem_pressure; -+ * wmem_pressure is reset under barrier/2; -+ * between barrier/2 and barrier we limit per-socket buffer growth; -+ * 2) each socket is guaranteed to get (limit-barrier)/maxsockets -+ * calculated on the base of memory eaten after the barrier is hit -+ */ -+ skbc = sock_bc(sk); -+ if (!ub_hfbarrier_hit(ub, bufid)) { -+ if (ub->ub_wmem_pressure) -+ ub_debug(UBD_NET_SEND, "makewres: pressure -> 0 " -+ "sk %p sz %lu pr %lu hd %lu wc %lu sb %d.\n", -+ sk, size, skbc->poll_reserv, -+ ub->ub_parms[bufid].held, -+ skbc->ub_wcharged, sk->sk_sndbuf); -+ ub->ub_wmem_pressure = 0; -+ } -+ if (ub_barrier_hit(ub, bufid)) { -+ if (!ub->ub_wmem_pressure) -+ ub_debug(UBD_NET_SEND, 
"makewres: pressure -> 1 " -+ "sk %p sz %lu pr %lu hd %lu wc %lu sb %d.\n", -+ sk, size, skbc->poll_reserv, -+ ub->ub_parms[bufid].held, -+ skbc->ub_wcharged, sk->sk_sndbuf); -+ ub->ub_wmem_pressure = 1; -+ wcharge_added = size - skbc->poll_reserv; -+ skbc->ub_wcharged += wcharge_added; -+ if (skbc->ub_wcharged * ub->ub_parms[sockid].limit + -+ ub->ub_parms[bufid].barrier > -+ ub->ub_parms[bufid].limit) -+ goto unroll; -+ } -+ if (ub->ub_parms[bufid].held > ub->ub_parms[bufid].limit) -+ goto unroll; -+ -+ ub_adjust_maxheld(ub, bufid); -+ skbc->poll_reserv = size; -+out: -+ return 0; -+ -+unroll: -+ ub_debug(UBD_NET_SEND, -+ "makewres: deny " -+ "sk %p sz %lu pr %lu hd %lu wc %lu sb %d.\n", -+ sk, size, skbc->poll_reserv, ub->ub_parms[bufid].held, -+ skbc->ub_wcharged, sk->sk_sndbuf); -+ skbc->ub_wcharged -= wcharge_added; -+ ub->ub_parms[bufid].failcnt++; -+ ub->ub_parms[bufid].held -= size - skbc->poll_reserv; -+ return -ENOMEM; -+} -+ -+int ub_sock_make_wreserv(struct sock *sk, int bufid, unsigned long size) -+{ -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ unsigned long flags; -+ unsigned long added_reserv; -+ int err; -+ -+ skbc = sock_bc(sk); -+ -+ /* -+ * This function provides that there is sufficient reserve upon return -+ * only if sk has only one user. We can check poll_reserv without -+ * serialization and avoid locking if the reserve already exists. 
-+ */ -+ if (!sock_has_ubc(sk) || skbc->poll_reserv >= size) -+ return 0; -+ -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ added_reserv = -skbc->poll_reserv; -+ err = ub_sock_makewreserv_locked(sk, bufid, bid2sid(bufid), size); -+ added_reserv += skbc->poll_reserv; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ if (added_reserv) -+ charge_beancounter_notop(skbc->ub, bufid, added_reserv); -+ -+ return err; -+} -+ -+int ub_sock_get_wreserv(struct sock *sk, int bufid, unsigned long size) -+{ -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ unsigned long flags; -+ unsigned long added_reserv; -+ int err; -+ -+ if (!sock_has_ubc(sk)) -+ return 0; -+ -+ skbc = sock_bc(sk); -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ added_reserv = -skbc->poll_reserv; -+ err = ub_sock_makewreserv_locked(sk, bufid, bid2sid(bufid), size); -+ added_reserv += skbc->poll_reserv; -+ if (!err) -+ skbc->poll_reserv -= size; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ if (added_reserv) -+ charge_beancounter_notop(skbc->ub, bufid, added_reserv); -+ -+ return err; -+} -+ -+void ub_sock_ret_wreserv(struct sock *sk, int bufid, -+ unsigned long size, unsigned long ressize) -+{ -+ struct sock_beancounter *skbc; -+ struct user_beancounter *ub; -+ unsigned long extra; -+ unsigned long flags; -+ -+ if (!sock_has_ubc(sk)) -+ return; -+ -+ extra = 0; -+ skbc = sock_bc(sk); -+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ skbc->poll_reserv += size; -+ if (skbc->poll_reserv > ressize) { -+ extra = skbc->poll_reserv - ressize; -+ __uncharge_beancounter_locked(ub, bufid, extra); -+ -+ if (skbc->ub_wcharged > skbc->poll_reserv - ressize) -+ skbc->ub_wcharged -= skbc->poll_reserv - ressize; -+ else -+ skbc->ub_wcharged = 0; -+ skbc->poll_reserv = ressize; -+ } -+ -+ ub_tcp_snd_wakeup(ub); -+ 
spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ if (extra) -+ uncharge_beancounter_notop(skbc->ub, bufid, extra); -+} -+ -+long ub_sock_wait_for_space(struct sock *sk, long timeo, unsigned long size) -+{ -+ DECLARE_WAITQUEUE(wait, current); -+ -+ add_wait_queue(sk->sk_sleep, &wait); -+ for (;;) { -+ if (signal_pending(current)) -+ break; -+ set_current_state(TASK_INTERRUPTIBLE); -+ if (!ub_sock_make_wreserv(sk, UB_OTHERSOCKBUF, size)) -+ break; -+ -+ if (sk->sk_shutdown & SEND_SHUTDOWN) -+ break; -+ if (sk->sk_err) -+ break; -+ ub_sock_snd_queue_add(sk, UB_OTHERSOCKBUF, size); -+ timeo = schedule_timeout(timeo); -+ } -+ __set_current_state(TASK_RUNNING); -+ remove_wait_queue(sk->sk_sleep, &wait); -+ return timeo; -+} -+ -+int ub_sock_makewres_other(struct sock *sk, unsigned long size) -+{ -+ return ub_sock_make_wreserv(sk, UB_OTHERSOCKBUF, size); -+} -+ -+int ub_sock_makewres_tcp(struct sock *sk, unsigned long size) -+{ -+ return ub_sock_make_wreserv(sk, UB_TCPSNDBUF, size); -+} -+ -+int ub_sock_getwres_other(struct sock *sk, unsigned long size) -+{ -+ return ub_sock_get_wreserv(sk, UB_OTHERSOCKBUF, size); -+} -+ -+int ub_sock_getwres_tcp(struct sock *sk, unsigned long size) -+{ -+ return ub_sock_get_wreserv(sk, UB_TCPSNDBUF, size); -+} -+ -+void ub_sock_retwres_other(struct sock *sk, unsigned long size, -+ unsigned long ressize) -+{ -+ ub_sock_ret_wreserv(sk, UB_OTHERSOCKBUF, size, ressize); -+} -+ -+void ub_sock_retwres_tcp(struct sock *sk, unsigned long size, -+ unsigned long ressize) -+{ -+ ub_sock_ret_wreserv(sk, UB_TCPSNDBUF, size, ressize); -+} -+ -+void ub_sock_sndqueueadd_other(struct sock *sk, unsigned long sz) -+{ -+ ub_sock_snd_queue_add(sk, UB_OTHERSOCKBUF, sz); -+} -+ -+void ub_sock_sndqueueadd_tcp(struct sock *sk, unsigned long sz) -+{ -+ ub_sock_snd_queue_add(sk, UB_TCPSNDBUF, sz); -+} -+ -+void ub_sock_sndqueuedel(struct sock *sk) -+{ -+ struct sock_beancounter *skbc; -+ unsigned long flags; -+ -+ if (!sock_has_ubc(sk)) -+ return; -+ skbc = 
sock_bc(sk); -+ -+ /* race with write_space callback of other socket */ -+ spin_lock_irqsave(&skbc->ub->ub_lock, flags); -+ list_del_init(&skbc->ub_sock_list); -+ spin_unlock_irqrestore(&skbc->ub->ub_lock, flags); -+} -+ -+/* -+ * UB_DGRAMRCVBUF -+ */ -+ -+int ub_sockrcvbuf_charge(struct sock *sk, struct sk_buff *skb) -+{ -+ unsigned long chargesize; -+ -+ if (!sock_has_ubc(sk)) -+ return 0; -+ -+ chargesize = skb_charge_fullsize(skb); -+ if (charge_beancounter(sock_bc(sk)->ub, UB_DGRAMRCVBUF, -+ chargesize, UB_HARD)) -+ return -ENOMEM; -+ -+ ub_skb_set_charge(skb, sk, chargesize, UB_DGRAMRCVBUF); -+ return 0; -+} -+ -+EXPORT_SYMBOL(ub_sockrcvbuf_charge); -+ -+static void ub_sockrcvbuf_uncharge(struct sk_buff *skb) -+{ -+ uncharge_beancounter(skb_bc(skb)->ub, UB_DGRAMRCVBUF, -+ skb_bc(skb)->charged); -+ ub_skb_set_uncharge(skb); -+} -+ -+/* -+ * UB_TCPRCVBUF -+ */ -+static int charge_tcprcvbuf(struct sock *sk, struct sk_buff *skb, -+ enum severity strict) -+{ -+ int retval; -+ unsigned long flags; -+ struct user_beancounter *ub; -+ unsigned long chargesize; -+ -+ if (!sock_has_ubc(sk)) -+ return 0; -+ -+ /* -+ * Memory pressure reactions: -+ * 1) set UB_RMEM_KEEP (clearing UB_RMEM_EXPAND) -+ * 2) set UB_RMEM_SHRINK and tcp_clamp_window() -+ * tcp_collapse_queues() if rmem_alloc > rcvbuf -+ * 3) drop OFO, tcp_purge_ofo() -+ * 4) drop all. -+ * Currently, we do #2 and #3 at once (which means that current -+ * collapsing of OFO queue in tcp_collapse_queues() is a waste of time, -+ * for example...) -+ * On memory pressure we jump from #0 to #3, and when the pressure -+ * subsides, to #1. 
-+ */ -+ retval = 0; -+ chargesize = skb_charge_fullsize(skb); -+ -+ for (ub = sock_bc(sk)->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_parms[UB_TCPRCVBUF].held += chargesize; -+ if (ub->ub_parms[UB_TCPRCVBUF].held > -+ ub->ub_parms[UB_TCPRCVBUF].barrier && -+ strict != UB_FORCE) -+ goto excess; -+ ub_adjust_maxheld(ub, UB_TCPRCVBUF); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+out: -+ if (retval == 0) { -+ charge_beancounter_notop(sock_bc(sk)->ub, UB_TCPRCVBUF, -+ chargesize); -+ ub_skb_set_charge(skb, sk, chargesize, UB_TCPRCVBUF); -+ } -+ return retval; -+ -+excess: -+ ub->ub_rmem_pressure = UB_RMEM_SHRINK; -+ if (strict == UB_HARD) -+ retval = -ENOMEM; -+ if (ub->ub_parms[UB_TCPRCVBUF].held > ub->ub_parms[UB_TCPRCVBUF].limit) -+ retval = -ENOMEM; -+ /* -+ * We try to leave numsock*maxadvmss as a reserve for sockets not -+ * queueing any data yet (if the difference between the barrier and the -+ * limit is enough for this reserve). 
-+ */ -+ if (ub->ub_parms[UB_TCPRCVBUF].held + -+ ub->ub_parms[UB_NUMTCPSOCK].limit * ub->ub_maxadvmss -+ > ub->ub_parms[UB_TCPRCVBUF].limit && -+ atomic_read(&sk->sk_rmem_alloc)) -+ retval = -ENOMEM; -+ if (retval) { -+ ub->ub_parms[UB_TCPRCVBUF].held -= chargesize; -+ ub->ub_parms[UB_TCPRCVBUF].failcnt++; -+ } -+ ub_adjust_maxheld(ub, UB_TCPRCVBUF); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ goto out; -+} -+ -+int ub_tcprcvbuf_charge(struct sock *sk, struct sk_buff *skb) -+{ -+ return charge_tcprcvbuf(sk, skb, UB_HARD); -+} -+ -+int ub_tcprcvbuf_charge_forced(struct sock *sk, struct sk_buff *skb) -+{ -+ return charge_tcprcvbuf(sk, skb, UB_FORCE); -+} -+ -+static void ub_tcprcvbuf_uncharge(struct sk_buff *skb) -+{ -+ unsigned long flags; -+ unsigned long held, bar; -+ int prev_pres; -+ struct user_beancounter *ub; -+ -+ for (ub = skb_bc(skb)->ub; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (ub->ub_parms[UB_TCPRCVBUF].held < skb_bc(skb)->charged) { -+ printk(KERN_ERR "Uncharging %d for tcprcvbuf of %p with %lu\n", -+ skb_bc(skb)->charged, -+ ub, ub->ub_parms[UB_TCPRCVBUF].held); -+ /* ass-saving bung */ -+ skb_bc(skb)->charged = ub->ub_parms[UB_TCPRCVBUF].held; -+ } -+ ub->ub_parms[UB_TCPRCVBUF].held -= skb_bc(skb)->charged; -+ held = ub->ub_parms[UB_TCPRCVBUF].held; -+ bar = ub->ub_parms[UB_TCPRCVBUF].barrier; -+ prev_pres = ub->ub_rmem_pressure; -+ if (held <= bar - (bar >> 2)) -+ ub->ub_rmem_pressure = UB_RMEM_EXPAND; -+ else if (held <= bar) -+ ub->ub_rmem_pressure = UB_RMEM_KEEP; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ uncharge_beancounter_notop(skb_bc(skb)->ub, UB_TCPRCVBUF, -+ skb_bc(skb)->charged); -+ ub_skb_set_uncharge(skb); -+} -+ -+ -+/* -+ * UB_OTHERSOCKBUF -+ */ -+ -+static void ub_socksndbuf_uncharge(struct sk_buff *skb) -+{ -+ unsigned long flags; -+ struct user_beancounter *ub, *cub; -+ struct sock_beancounter *sk_bc; -+ -+ /* resource was set. 
no check for ub required */
-+ cub = skb_bc(skb)->ub;
-+ for (ub = cub; ub->parent != NULL; ub = ub->parent);
-+ skb_bc(skb)->ub = NULL;
-+ if (skb->sk != NULL)
-+ sk_bc = sock_bc(skb->sk);
-+ else
-+ sk_bc = NULL;
-+ spin_lock_irqsave(&ub->ub_lock, flags);
-+ __uncharge_sockbuf(sk_bc, ub, UB_OTHERSOCKBUF,
-+ skb_bc(skb)->charged);
-+ ub_sock_snd_wakeup(ub);
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+
-+ uncharge_beancounter_notop(cub, UB_OTHERSOCKBUF, skb_bc(skb)->charged);
-+ ub_skb_set_uncharge(skb);
-+}
-+
-+static void ub_tcpsndbuf_uncharge(struct sk_buff *skb)
-+{
-+ unsigned long flags;
-+ struct user_beancounter *ub, *cub;
-+
-+ /* resource may be unset if called manually */
-+ cub = skb_bc(skb)->ub;
-+ if (cub == NULL)
-+ return;
-+ for (ub = cub; ub->parent != NULL; ub = ub->parent);
-+ skb_bc(skb)->ub = NULL;
-+ spin_lock_irqsave(&ub->ub_lock, flags);
-+ __uncharge_sockbuf(sock_bc(skb->sk), ub, UB_TCPSNDBUF,
-+ skb_bc(skb)->charged);
-+ ub_tcp_snd_wakeup(ub);
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+
-+ uncharge_beancounter_notop(cub, UB_TCPSNDBUF, skb_bc(skb)->charged);
-+ ub_skb_set_uncharge(skb);
-+}
-+
-+void ub_skb_uncharge(struct sk_buff *skb)
-+{
-+ switch (skb_bc(skb)->resource) {
-+ case UB_TCPSNDBUF:
-+ ub_tcpsndbuf_uncharge(skb);
-+ break;
-+ case UB_TCPRCVBUF:
-+ ub_tcprcvbuf_uncharge(skb);
-+ break;
-+ case UB_DGRAMRCVBUF:
-+ ub_sockrcvbuf_uncharge(skb);
-+ break;
-+ case UB_OTHERSOCKBUF:
-+ ub_socksndbuf_uncharge(skb);
-+ break;
-+ }
-+}
-+
-+EXPORT_SYMBOL(ub_skb_uncharge); /* due to skb_orphan()/conntracks */
-+
-+/*
-+ * TCP send buffers accounting. 
Paged part
-+ */
-+int ub_sock_tcp_chargepage(struct sock *sk)
-+{
-+ struct sock_beancounter *skbc;
-+ struct user_beancounter *ub;
-+ unsigned long added;
-+ unsigned long flags;
-+ int err;
-+
-+ if (!sock_has_ubc(sk))
-+ return 0;
-+
-+ skbc = sock_bc(sk);
-+
-+ for (ub = skbc->ub; ub->parent != NULL; ub = ub->parent);
-+ spin_lock_irqsave(&ub->ub_lock, flags);
-+ /* Try to charge full page */
-+ err = ub_sock_makewreserv_locked(sk, UB_TCPSNDBUF, UB_NUMTCPSOCK,
-+ PAGE_SIZE);
-+ if (err == 0) {
-+ skbc->poll_reserv -= PAGE_SIZE;
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+ charge_beancounter_notop(skbc->ub, UB_TCPSNDBUF, PAGE_SIZE);
-+ return 0;
-+ }
-+
-+ /* Try to charge page enough to satisfy sys_select. The possible
-+ overdraft for the rest of the page is generally better than
-+ requesting full page in tcp_poll. This should not happen
-+ frequently. Den */
-+ added = -skbc->poll_reserv;
-+ err = ub_sock_makewreserv_locked(sk, UB_TCPSNDBUF, UB_NUMTCPSOCK,
-+ SOCK_MIN_UBCSPACE);
-+ if (err < 0) {
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+ return err;
-+ }
-+ __charge_beancounter_locked(ub, UB_TCPSNDBUF,
-+ PAGE_SIZE - skbc->poll_reserv,
-+ UB_FORCE);
-+ added += PAGE_SIZE;
-+ skbc->poll_reserv = 0;
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+
-+ charge_beancounter_notop(skbc->ub, UB_TCPSNDBUF, added);
-+
-+ return 0;
-+
-+}
-+
-+void ub_sock_tcp_detachpage(struct sock *sk)
-+{
-+ struct sk_buff *skb;
-+
-+ if (!sock_has_ubc(sk))
-+ return;
-+
-+ /* The page is just detached from socket. 
The last skb in queue
-+ with paged part holds reference to it */
-+ skb = skb_peek_tail(&sk->sk_write_queue);
-+ if (skb == NULL) {
-+ /* If the queue is empty - all data is sent and page is about
-+ to be freed */
-+ uncharge_beancounter(sock_bc(sk)->ub, UB_TCPSNDBUF, PAGE_SIZE);
-+ return;
-+ }
-+ /* Last skb is a good approximation for a last skb with paged part */
-+ skb_bc(skb)->charged += PAGE_SIZE;
-+}
-+
-+static int charge_tcpsndbuf(struct sock *sk, struct sk_buff *skb,
-+ enum severity strict)
-+{
-+ int ret;
-+ unsigned long chargesize;
-+
-+ if (!sock_has_ubc(sk))
-+ return 0;
-+
-+ chargesize = skb_charge_fullsize(skb);
-+ ret = charge_beancounter(sock_bc(sk)->ub, UB_TCPSNDBUF, chargesize,
-+ strict);
-+ if (ret < 0)
-+ return ret;
-+ ub_skb_set_charge(skb, sk, chargesize, UB_TCPSNDBUF);
-+ sock_bc(sk)->ub_wcharged += chargesize;
-+ return ret;
-+}
-+
-+int ub_tcpsndbuf_charge(struct sock *sk, struct sk_buff *skb)
-+{
-+ return charge_tcpsndbuf(sk, skb, UB_HARD);
-+}
-+
-+int ub_tcpsndbuf_charge_forced(struct sock *sk, struct sk_buff *skb)
-+{
-+ return charge_tcpsndbuf(sk, skb, UB_FORCE);
-+}
-+
-+/*
-+ * Initialization stuff
-+ */
-+int __init skbc_cache_init(void)
-+{
-+ return 0;
-+}
-diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_oom.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_oom.c
---- linux-2.6.8.1.orig/kernel/ub/ub_oom.c 1970-01-01 03:00:00.000000000 +0300
-+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_oom.c 2006-05-11 13:05:48.000000000 +0400
-@@ -0,0 +1,93 @@
-+/*
-+ * kernel/ub/ub_oom.c
-+ *
-+ * Copyright (C) 2005 SWsoft
-+ * All rights reserved.
-+ *
-+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ *
-+ */
-+
-+#include <linux/sched.h>
-+#include <linux/spinlock.h>
-+#include <linux/mm.h>
-+#include <linux/swap.h>
-+
-+#include <asm/page.h>
-+
-+#include <ub/beancounter.h>
-+#include <ub/ub_misc.h>
-+#include <ub/ub_hash.h>
-+
-+static inline long ub_current_overdraft(struct user_beancounter *ub)
-+{
-+ return ub->ub_parms[UB_OOMGUARPAGES].held +
-+ ((ub->ub_parms[UB_KMEMSIZE].held
-+ + ub->ub_parms[UB_TCPSNDBUF].held
-+ + ub->ub_parms[UB_TCPRCVBUF].held
-+ + ub->ub_parms[UB_OTHERSOCKBUF].held
-+ + ub->ub_parms[UB_DGRAMRCVBUF].held)
-+ >> PAGE_SHIFT) - ub->ub_parms[UB_OOMGUARPAGES].barrier;
-+}
-+
-+/*
-+ * Select a user_beancounter in which to find a task to be killed.
-+ * Picks the beancounter with the biggest excess of resource usage
-+ * so that a process belonging to it can be killed later, or returns
-+ * NULL if there are no beancounters with such an excess.
-+ */
-+
-+struct user_beancounter *ub_select_worst(long *ub_maxover)
-+{
-+ struct user_beancounter *ub, *walkp;
-+ unsigned long flags;
-+ int i;
-+
-+ *ub_maxover = 0;
-+ ub = NULL;
-+ spin_lock_irqsave(&ub_hash_lock, flags);
-+
-+ for_each_beancounter(i, walkp) {
-+ long ub_overdraft;
-+
-+ if (walkp->parent != NULL)
-+ continue;
-+ if (walkp->ub_oom_noproc)
-+ continue;
-+ ub_overdraft = ub_current_overdraft(walkp);
-+ if (ub_overdraft > *ub_maxover) {
-+ ub = walkp;
-+ *ub_maxover = ub_overdraft;
-+ }
-+ }
-+ if (ub != NULL) {
-+ get_beancounter(ub);
-+ ub->ub_oom_noproc = 1;
-+ }
-+ spin_unlock_irqrestore(&ub_hash_lock, flags);
-+
-+ return ub;
-+}
-+
-+void ub_oomkill_task(struct mm_struct * mm, struct user_beancounter *ub,
-+ long maxover)
-+{
-+ static struct ub_rate_info ri = { 5, 60*HZ };
-+
-+ /* increment is serialized with oom_generation_lock */
-+ mm_ub(mm)->ub_parms[UB_OOMGUARPAGES].failcnt++;
-+
-+ if (ub_ratelimit(&ri))
-+ show_mem();
-+}
-+
-+void ub_clear_oom(void)
-+{
-+ unsigned long flags;
-+ int i;
-+ struct user_beancounter *walkp;
-+
-+ spin_lock_irqsave(&ub_hash_lock, flags);
-+ 
for_each_beancounter(i, walkp)
-+ walkp->ub_oom_noproc = 0;
-+ spin_unlock_irqrestore(&ub_hash_lock, flags);
-+}
-diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_page_bc.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_page_bc.c
---- linux-2.6.8.1.orig/kernel/ub/ub_page_bc.c 1970-01-01 03:00:00.000000000 +0300
-+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_page_bc.c 2006-05-11 13:05:39.000000000 +0400
-@@ -0,0 +1,403 @@
-+/*
-+ * kernel/ub/ub_page_bc.c
-+ *
-+ * Copyright (C) 2005 SWsoft
-+ * All rights reserved.
-+ *
-+ * Licensing governed by "linux/COPYING.SWsoft" file.
-+ *
-+ */
-+
-+#include <linux/spinlock.h>
-+#include <linux/slab.h>
-+#include <linux/mm.h>
-+#include <linux/gfp.h>
-+#include <linux/vmalloc.h>
-+
-+#include <ub/beancounter.h>
-+#include <ub/ub_hash.h>
-+#include <ub/ub_vmpages.h>
-+#include <ub/ub_page.h>
-+
-+static kmem_cache_t *pb_cachep;
-+static spinlock_t pb_lock = SPIN_LOCK_UNLOCKED;
-+static struct page_beancounter **pb_hash_table;
-+static unsigned int pb_hash_mask;
-+
-+/*
-+ * Auxiliary stuff
-+ */
-+
-+static inline struct page_beancounter *next_page_pb(struct page_beancounter *p)
-+{
-+ return list_entry(p->page_list.next, struct page_beancounter,
-+ page_list);
-+}
-+
-+static inline struct page_beancounter *prev_page_pb(struct page_beancounter *p)
-+{
-+ return list_entry(p->page_list.prev, struct page_beancounter,
-+ page_list);
-+}
-+
-+/*
-+ * Held pages manipulation
-+ */
-+static inline void set_held_pages(struct user_beancounter *bc)
-+{
-+ /* all three depend on ub_held_pages */
-+ __ub_update_physpages(bc);
-+ __ub_update_oomguarpages(bc);
-+ __ub_update_privvm(bc);
-+}
-+
-+static inline void do_dec_held_pages(struct user_beancounter *ub, int value)
-+{
-+ unsigned long flags;
-+
-+ spin_lock_irqsave(&ub->ub_lock, flags);
-+ ub->ub_held_pages -= value;
-+ set_held_pages(ub);
-+ spin_unlock_irqrestore(&ub->ub_lock, flags);
-+}
-+
-+static void dec_held_pages(struct user_beancounter *ub, int value)
-+{
-+ for (; ub != NULL; ub = 
ub->parent) -+ do_dec_held_pages(ub, value); -+} -+ -+static inline void do_inc_held_pages(struct user_beancounter *ub, int value) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_held_pages += value; -+ set_held_pages(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+static void inc_held_pages(struct user_beancounter *ub, int value) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_inc_held_pages(ub, value); -+} -+ -+/* -+ * Alloc - free -+ */ -+ -+inline int pb_alloc(struct page_beancounter **pbc) -+{ -+ *pbc = kmem_cache_alloc(pb_cachep, GFP_KERNEL); -+ if (*pbc != NULL) -+ (*pbc)->pb_magic = PB_MAGIC; -+ return (*pbc == NULL); -+} -+ -+inline void pb_free(struct page_beancounter **pb) -+{ -+ if (*pb != NULL) { -+ kmem_cache_free(pb_cachep, *pb); -+ *pb = NULL; -+ } -+} -+ -+void pb_free_list(struct page_beancounter **p_pb) -+{ -+ struct page_beancounter *list = *p_pb, *pb; -+ while (list) { -+ pb = list; -+ list = list->next_hash; -+ pb_free(&pb); -+ } -+ *p_pb = NULL; -+} -+ -+/* -+ * head -> <new objs> -> <old objs> -> ... -+ */ -+static int __alloc_list(struct page_beancounter **head, int num) -+{ -+ struct page_beancounter *pb; -+ -+ while (num > 0) { -+ if (pb_alloc(&pb)) -+ return -1; -+ pb->next_hash = *head; -+ *head = pb; -+ num--; -+ } -+ -+ return num; -+} -+ -+/* -+ * Ensure that the list contains at least num elements. -+ * p_pb points to an initialized list, may be of the zero length. -+ * -+ * mm->page_table_lock should be held -+ */ -+int pb_alloc_list(struct page_beancounter **p_pb, int num, -+ struct mm_struct *mm) -+{ -+ struct page_beancounter *list; -+ -+ for (list = *p_pb; list != NULL && num; list = list->next_hash, num--); -+ if (!num) -+ return 0; -+ -+ spin_unlock(&mm->page_table_lock); -+ /* -+ * *p_pb(after) *p_pb (before) -+ * \ \ -+ * <new objs> -...-> <old objs> -> ... 
-+ */ -+ if (__alloc_list(p_pb, num) < 0) -+ goto nomem; -+ spin_lock(&mm->page_table_lock); -+ return 0; -+ -+nomem: -+ spin_lock(&mm->page_table_lock); -+ pb_free_list(p_pb); -+ return -ENOMEM; -+} -+ -+/* -+ * Hash routines -+ */ -+ -+static inline int pb_hash(struct user_beancounter *ub, struct page *page) -+{ -+ return (((unsigned long)ub << 16) + ((unsigned long)ub >> 16) + -+ (page_to_pfn(page) >> 7)) & pb_hash_mask; -+} -+ -+/* pb_lock should be held */ -+static inline void insert_pb(struct page_beancounter *p, struct page *page, -+ struct user_beancounter *ub, int hash) -+{ -+ p->page = page; -+ p->ub = get_beancounter(ub); -+ p->next_hash = pb_hash_table[hash]; -+ pb_hash_table[hash] = p; -+} -+ -+/* -+ * Heart -+ */ -+ -+int pb_reserve_all(struct page_beancounter **pbs) -+{ -+ int i, need_alloc; -+ unsigned long flags; -+ struct user_beancounter *ub; -+ -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ need_alloc = 0; -+ for_each_beancounter(i, ub) -+ need_alloc++; -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ -+ if (!__alloc_list(pbs, need_alloc)) -+ return 0; -+ -+ pb_free_list(pbs); -+ return -ENOMEM; -+} -+ -+int pb_add_ref(struct page *page, struct user_beancounter *bc, -+ struct page_beancounter **p_pb) -+{ -+ int hash; -+ struct page_beancounter *p; -+ int shift; -+ struct page_beancounter *head; -+ -+ if (bc == NULL || is_shmem_mapping(page->mapping)) -+ return 0; -+ -+ hash = pb_hash(bc, page); -+ -+ spin_lock(&pb_lock); -+ for (p = pb_hash_table[hash]; -+ p != NULL && (p->page != page || p->ub != bc); -+ p = p->next_hash); -+ if (p != NULL) { -+ /* -+ * This page is already associated with this beancounter, -+ * increment the usage counter. 
-+ */ -+ PB_COUNT_INC(p->refcount); -+ spin_unlock(&pb_lock); -+ return 0; -+ } -+ -+ p = *p_pb; -+ if (p == NULL) { -+ spin_unlock(&pb_lock); -+ return -1; -+ } -+ -+ *p_pb = NULL; -+ insert_pb(p, page, bc, hash); -+ head = page_pbc(page); -+ -+ if (head != NULL) { -+ /* -+ * Move the first element to the end of the list. -+ * List head (pb_head) is set to the next entry. -+ * Note that this code works even if head is the only element -+ * on the list (because it's cyclic). -+ */ -+ BUG_ON(head->pb_magic != PB_MAGIC); -+ page_pbc(page) = next_page_pb(head); -+ PB_SHIFT_INC(head->refcount); -+ shift = PB_SHIFT_GET(head->refcount); -+ /* -+ * Update user beancounter, the share of head has been changed. -+ * Note that the shift counter is taken after increment. -+ */ -+ dec_held_pages(head->ub, UB_PAGE_WEIGHT >> shift); -+ /* add the new page beancounter to the end of the list */ -+ list_add_tail(&p->page_list, &page_pbc(page)->page_list); -+ } else { -+ page_pbc(page) = p; -+ shift = 0; -+ INIT_LIST_HEAD(&p->page_list); -+ } -+ -+ p->refcount = PB_REFCOUNT_MAKE(shift, 1); -+ spin_unlock(&pb_lock); -+ -+ /* update user beancounter for the new page beancounter */ -+ inc_held_pages(bc, UB_PAGE_WEIGHT >> shift); -+ return 0; -+} -+ -+void pb_remove_ref(struct page *page, struct user_beancounter *bc) -+{ -+ int hash; -+ struct page_beancounter *p, **q; -+ int shift, shiftt; -+ -+ if (bc == NULL || is_shmem_mapping(page->mapping)) -+ return; -+ -+ hash = pb_hash(bc, page); -+ -+ spin_lock(&pb_lock); -+ BUG_ON(page_pbc(page) != NULL && page_pbc(page)->pb_magic != PB_MAGIC); -+ for (q = pb_hash_table + hash, p = *q; -+ p != NULL && (p->page != page || p->ub != bc); -+ q = &p->next_hash, p = *q); -+ if (p == NULL) -+ goto out_unlock; -+ -+ PB_COUNT_DEC(p->refcount); -+ if (PB_COUNT_GET(p->refcount)) -+ /* -+ * More references from the same user beancounter exist. -+ * Nothing needs to be done. 
-+ */ -+ goto out_unlock; -+ -+ /* remove from the hash list */ -+ *q = p->next_hash; -+ -+ shift = PB_SHIFT_GET(p->refcount); -+ -+ dec_held_pages(p->ub, UB_PAGE_WEIGHT >> shift); -+ -+ if (page_pbc(page) == p) { -+ if (list_empty(&p->page_list)) -+ goto out_free; -+ page_pbc(page) = next_page_pb(p); -+ } -+ list_del(&p->page_list); -+ put_beancounter(p->ub); -+ pb_free(&p); -+ -+ /* Now balance the list. Move the tail and adjust its shift counter. */ -+ p = prev_page_pb(page_pbc(page)); -+ shiftt = PB_SHIFT_GET(p->refcount); -+ page_pbc(page) = p; -+ PB_SHIFT_DEC(p->refcount); -+ -+ inc_held_pages(p->ub, UB_PAGE_WEIGHT >> shiftt); -+ -+ /* -+ * If the shift counter of the moved beancounter is different from the -+ * removed one's, repeat the procedure for one more tail beancounter -+ */ -+ if (shiftt > shift) { -+ p = prev_page_pb(page_pbc(page)); -+ page_pbc(page) = p; -+ PB_SHIFT_DEC(p->refcount); -+ inc_held_pages(p->ub, UB_PAGE_WEIGHT >> shiftt); -+ } -+ spin_unlock(&pb_lock); -+ return; -+ -+out_free: -+ page_pbc(page) = NULL; -+ put_beancounter(p->ub); -+ pb_free(&p); -+out_unlock: -+ spin_unlock(&pb_lock); -+ return; -+} -+ -+void pb_add_list_ref(struct page *page, struct user_beancounter *bc, -+ struct page_beancounter **p_pb) -+{ -+ struct page_beancounter *list, *pb; -+ -+ pb = *p_pb; -+ if (pb == NULL) { -+ /* Typical case due to caller constraints */ -+ if (pb_add_ref(page, bc, &pb)) -+ BUG(); -+ return; -+ } -+ -+ list = pb->next_hash; -+ if (pb_add_ref(page, bc, &pb)) -+ BUG(); -+ if (pb != NULL) { -+ pb->next_hash = list; -+ list = pb; -+ } -+ *p_pb = list; -+} -+ -+struct user_beancounter *pb_grab_page_ub(struct page *page) -+{ -+ struct page_beancounter *pb; -+ struct user_beancounter *ub; -+ -+ spin_lock(&pb_lock); -+ pb = page_pbc(page); -+ ub = (pb == NULL ? 
ERR_PTR(-EINVAL) : -+ get_beancounter(pb->ub)); -+ spin_unlock(&pb_lock); -+ return ub; -+} -+ -+void __init page_beancounters_init(void) -+{ -+ unsigned long hash_size; -+ -+ pb_cachep = kmem_cache_create("page_beancounter", -+ sizeof(struct page_beancounter), 0, -+ SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL); -+ hash_size = num_physpages >> 2; -+ for (pb_hash_mask = 1; -+ (hash_size & pb_hash_mask) != hash_size; -+ pb_hash_mask = (pb_hash_mask << 1) + 1); -+ hash_size = pb_hash_mask + 1; -+ printk(KERN_INFO "Page beancounter hash is %lu entries.\n", hash_size); -+ pb_hash_table = vmalloc(hash_size * sizeof(struct page_beancounter *)); -+ memset(pb_hash_table, 0, hash_size * sizeof(struct page_beancounter *)); -+} -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_pages.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_pages.c ---- linux-2.6.8.1.orig/kernel/ub/ub_pages.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_pages.c 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,483 @@ -+/* -+ * kernel/ub/ub_pages.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#include <linux/mm.h> -+#include <linux/highmem.h> -+#include <linux/virtinfo.h> -+#include <linux/module.h> -+ -+#include <asm/page.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_vmpages.h> -+ -+void fastcall __ub_update_physpages(struct user_beancounter *ub) -+{ -+ ub->ub_parms[UB_PHYSPAGES].held = ub->ub_tmpfs_respages -+ + (ub->ub_held_pages >> UB_PAGE_WEIGHT_SHIFT); -+ ub_adjust_maxheld(ub, UB_PHYSPAGES); -+} -+ -+void fastcall __ub_update_oomguarpages(struct user_beancounter *ub) -+{ -+ ub->ub_parms[UB_OOMGUARPAGES].held = -+ ub->ub_parms[UB_PHYSPAGES].held + ub->ub_swap_pages; -+ ub_adjust_maxheld(ub, UB_OOMGUARPAGES); -+} -+ -+void fastcall __ub_update_privvm(struct user_beancounter *ub) -+{ -+ ub->ub_parms[UB_PRIVVMPAGES].held = -+ (ub->ub_held_pages >> UB_PAGE_WEIGHT_SHIFT) -+ + ub->ub_unused_privvmpages -+ + ub->ub_parms[UB_SHMPAGES].held; -+ ub_adjust_maxheld(ub, UB_PRIVVMPAGES); -+} -+ -+static inline unsigned long pages_in_pte(pte_t *pte) -+{ -+ struct page *pg; -+ -+ if (!pte_present(*pte)) -+ return 0; -+ -+ pg = pte_page(*pte); -+ if (!pfn_valid(page_to_pfn(pg))) -+ return 0; -+ if (PageReserved(pg)) -+ return 0; -+ return 1; -+} -+ -+static inline unsigned long pages_in_pmd(pmd_t *pmd, -+ unsigned long start, unsigned long end) -+{ -+ unsigned long pages, pmd_end, address; -+ pte_t *pte; -+ -+ pages = 0; -+ if (pmd_none(*pmd)) -+ goto out; -+ if (pmd_bad(*pmd)) { -+ pmd_ERROR(*pmd); -+ pmd_clear(pmd); -+ goto out; -+ } -+ -+ pte = pte_offset_map(pmd, start); -+ pmd_end = (start + PMD_SIZE) & PMD_MASK; -+ if (pmd_end && (end > pmd_end)) -+ end = pmd_end; -+ -+ address = start; -+ do { -+ pages += pages_in_pte(pte); -+ address += PAGE_SIZE; -+ pte++; -+ } while (address && (address < end)); -+ pte_unmap(pte-1); -+out: -+ return pages; -+} -+ -+static inline unsigned long pages_in_pgd(pgd_t *pgd, -+ unsigned long start, unsigned long end) -+{ -+ unsigned long pages, pgd_end, address; -+ pmd_t *pmd; -+ -+ pages = 0; -+ if 
(pgd_none(*pgd)) -+ goto out; -+ if (pgd_bad(*pgd)) { -+ pgd_ERROR(*pgd); -+ pgd_clear(pgd); -+ goto out; -+ } -+ -+ pmd = pmd_offset(pgd, start); -+ pgd_end = (start + PGDIR_SIZE) & PGDIR_MASK; -+ if (pgd_end && (end > pgd_end)) -+ end = pgd_end; -+ -+ address = start; -+ do { -+ pages += pages_in_pmd(pmd, address, end); -+ address = (address + PMD_SIZE) & PMD_MASK; -+ pmd++; -+ } while (address && (address < end)); -+out: -+ return pages; -+} -+ -+/* -+ * Calculate number of pages presenting in the address space within single -+ * vm_area. mm->page_table_lock must be already held. -+ */ -+unsigned long pages_in_vma_range(struct vm_area_struct *vma, -+ unsigned long start, unsigned long end) -+{ -+ unsigned long address, pages; -+ pgd_t *pgd; -+ -+ pages = 0; -+ address = start; -+ pgd = pgd_offset(vma->vm_mm, start); -+ do { -+ pages += pages_in_pgd(pgd, address, end); -+ address = (address + PGDIR_SIZE) & PGDIR_MASK; -+ pgd++; -+ } while (address && (address < end)); -+ -+ return pages; -+} -+ -+int ub_unused_privvm_inc(struct user_beancounter *ub, long size, -+ struct vm_area_struct *vma) -+{ -+ unsigned long flags; -+ -+ if (ub == NULL || !VM_UB_PRIVATE(vma->vm_flags, vma->vm_file)) -+ return 0; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_unused_privvmpages += size; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ return 0; -+} -+ -+static void __unused_privvm_dec_locked(struct user_beancounter *ub, -+ long size) -+{ -+ /* catch possible overflow */ -+ if (ub->ub_unused_privvmpages < size) { -+ uncharge_warn(ub, UB_UNUSEDPRIVVM, -+ size, ub->ub_unused_privvmpages); -+ size = ub->ub_unused_privvmpages; -+ } -+ ub->ub_unused_privvmpages -= size; -+ __ub_update_privvm(ub); -+} -+ -+void __ub_unused_privvm_dec(struct user_beancounter *ub, long size) -+{ -+ unsigned long flags; -+ -+ if (ub == NULL) -+ return; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, 
flags); -+ __unused_privvm_dec_locked(ub, size); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+void ub_unused_privvm_dec(struct user_beancounter *ub, long size, -+ struct vm_area_struct *vma) -+{ -+ if (VM_UB_PRIVATE(vma->vm_flags, vma->vm_file)) -+ __ub_unused_privvm_dec(ub, size); -+} -+ -+static inline int __charge_privvm_locked(struct user_beancounter *ub, -+ unsigned long s, enum severity strict) -+{ -+ if (__charge_beancounter_locked(ub, UB_PRIVVMPAGES, s, strict) < 0) -+ return -ENOMEM; -+ -+ ub->ub_unused_privvmpages += s; -+ return 0; -+} -+ -+int ub_privvm_charge(struct user_beancounter *ub, unsigned long vm_flags, -+ struct file *vm_file, unsigned long size) -+{ -+ int retval; -+ unsigned long flags; -+ -+ if (ub == NULL || !VM_UB_PRIVATE(vm_flags, vm_file)) -+ return 0; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ retval = __charge_privvm_locked(ub, size >> PAGE_SHIFT, UB_SOFT); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return retval; -+} -+ -+void ub_privvm_uncharge(struct user_beancounter *ub, unsigned long vm_flags, -+ struct file *vm_file, unsigned long size) -+{ -+ unsigned long flags; -+ -+ if (ub == NULL || !VM_UB_PRIVATE(vm_flags, vm_file)) -+ return; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ __unused_privvm_dec_locked(ub, size >> PAGE_SHIFT); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+int ub_protected_charge(struct user_beancounter *ub, unsigned long size, -+ unsigned long newflags, struct vm_area_struct *vma) -+{ -+ unsigned long flags; -+ struct file *file; -+ -+ if (ub == NULL) -+ return PRIVVM_NO_CHARGE; -+ -+ flags = vma->vm_flags; -+ if (!((newflags ^ flags) & VM_WRITE)) -+ return PRIVVM_NO_CHARGE; -+ -+ file = vma->vm_file; -+ if (!VM_UB_PRIVATE(newflags | VM_WRITE, file)) -+ return PRIVVM_NO_CHARGE; -+ -+ if (flags & VM_WRITE) -+ return PRIVVM_TO_SHARED; -+ -+ for (; ub->parent != NULL; ub = 
ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (__charge_privvm_locked(ub, size, UB_SOFT) < 0) -+ goto err; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return PRIVVM_TO_PRIVATE; -+ -+err: -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return PRIVVM_ERROR; -+} -+ -+int ub_locked_mem_charge(struct user_beancounter *ub, long size) -+{ -+ if (ub == NULL) -+ return 0; -+ -+ return charge_beancounter(ub, UB_LOCKEDPAGES, -+ size >> PAGE_SHIFT, UB_HARD); -+} -+ -+void ub_locked_mem_uncharge(struct user_beancounter *ub, long size) -+{ -+ if (ub == NULL) -+ return; -+ -+ uncharge_beancounter(ub, UB_LOCKEDPAGES, size >> PAGE_SHIFT); -+} -+ -+int ub_shmpages_charge(struct user_beancounter *ub, unsigned long size) -+{ -+ int ret; -+ unsigned long flags; -+ -+ ret = 0; -+ if (ub == NULL) -+ return 0; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ret = __charge_beancounter_locked(ub, UB_SHMPAGES, size, UB_HARD); -+ if (ret == 0) -+ __ub_update_privvm(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ return ret; -+} -+ -+void ub_shmpages_uncharge(struct user_beancounter *ub, unsigned long size) -+{ -+ unsigned long flags; -+ -+ if (ub == NULL) -+ return; -+ -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ __uncharge_beancounter_locked(ub, UB_SHMPAGES, size); -+ __ub_update_privvm(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+int ub_memory_charge(struct user_beancounter *ub, unsigned long size, -+ unsigned vm_flags, struct file *vm_file, int sv) -+{ -+ struct user_beancounter *ubl; -+ unsigned long flags; -+ -+ if (ub == NULL) -+ return 0; -+ -+ size >>= PAGE_SHIFT; -+ -+ if (size > UB_MAXVALUE) -+ return -EINVAL; -+ -+ BUG_ON(sv != UB_SOFT && sv != UB_HARD); -+ -+ if ((vm_flags & VM_LOCKED) && -+ charge_beancounter(ub, UB_LOCKEDPAGES, size, sv)) -+ goto out_err; -+ if (VM_UB_PRIVATE(vm_flags, vm_file)) { -+ for (ubl = ub; ubl->parent != 
NULL; ubl = ubl->parent); -+ spin_lock_irqsave(&ubl->ub_lock, flags); -+ if (__charge_privvm_locked(ubl, size, sv)) -+ goto out_private; -+ spin_unlock_irqrestore(&ubl->ub_lock, flags); -+ } -+ return 0; -+ -+out_private: -+ spin_unlock_irqrestore(&ubl->ub_lock, flags); -+ if (vm_flags & VM_LOCKED) -+ uncharge_beancounter(ub, UB_LOCKEDPAGES, size); -+out_err: -+ return -ENOMEM; -+} -+ -+void ub_memory_uncharge(struct user_beancounter *ub, unsigned long size, -+ unsigned vm_flags, struct file *vm_file) -+{ -+ unsigned long flags; -+ -+ if (ub == NULL) -+ return; -+ -+ size >>= PAGE_SHIFT; -+ -+ if (vm_flags & VM_LOCKED) -+ uncharge_beancounter(ub, UB_LOCKEDPAGES, size); -+ if (VM_UB_PRIVATE(vm_flags, vm_file)) { -+ for (; ub->parent != NULL; ub = ub->parent); -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ __unused_privvm_dec_locked(ub, size); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ } -+} -+ -+static inline void do_ub_tmpfs_respages_inc(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_tmpfs_respages += size; -+ __ub_update_physpages(ub); -+ __ub_update_oomguarpages(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+void ub_tmpfs_respages_inc(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_ub_tmpfs_respages_inc(ub, size); -+} -+ -+static inline void do_ub_tmpfs_respages_dec(struct user_beancounter *ub, -+ unsigned long size) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ /* catch possible overflow */ -+ if (ub->ub_tmpfs_respages < size) { -+ uncharge_warn(ub, UB_TMPFSPAGES, -+ size, ub->ub_tmpfs_respages); -+ size = ub->ub_tmpfs_respages; -+ } -+ ub->ub_tmpfs_respages -= size; -+ /* update values what is the most interesting */ -+ __ub_update_physpages(ub); -+ __ub_update_oomguarpages(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+void ub_tmpfs_respages_dec(struct 
user_beancounter *ub, -+ unsigned long size) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_ub_tmpfs_respages_dec(ub, size); -+} -+ -+#ifdef CONFIG_USER_SWAP_ACCOUNTING -+static inline void do_ub_swapentry_inc(struct user_beancounter *ub) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_swap_pages++; -+ __ub_update_oomguarpages(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+void ub_swapentry_inc(struct user_beancounter *ub) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_ub_swapentry_inc(ub); -+} -+EXPORT_SYMBOL(ub_swapentry_inc); -+ -+static inline void do_ub_swapentry_dec(struct user_beancounter *ub) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ if (ub->ub_swap_pages < 1) -+ uncharge_warn(ub, UB_SWAPPAGES, 1, ub->ub_swap_pages); -+ else -+ ub->ub_swap_pages -= 1; -+ __ub_update_oomguarpages(ub); -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+} -+ -+void ub_swapentry_dec(struct user_beancounter *ub) -+{ -+ for (; ub != NULL; ub = ub->parent) -+ do_ub_swapentry_dec(ub); -+} -+#endif -+ -+static int vmguar_enough_memory(struct vnotifier_block *self, -+ unsigned long event, void *arg, int old_ret) -+{ -+ struct user_beancounter *ub; -+ -+ if (event != VIRTINFO_ENOUGHMEM) -+ return old_ret; -+ -+ for (ub = mm_ub(current->mm); ub->parent != NULL; ub = ub->parent); -+ if (ub->ub_parms[UB_PRIVVMPAGES].held > -+ ub->ub_parms[UB_VMGUARPAGES].barrier) -+ return old_ret; -+ -+ return NOTIFY_OK; -+} -+ -+static struct vnotifier_block vmguar_notifier_block = { -+ .notifier_call = vmguar_enough_memory -+}; -+ -+static int __init init_vmguar_notifier(void) -+{ -+ virtinfo_notifier_register(VITYPE_GENERAL, &vmguar_notifier_block); -+ return 0; -+} -+ -+static void __exit fini_vmguar_notifier(void) -+{ -+ virtinfo_notifier_unregister(VITYPE_GENERAL, &vmguar_notifier_block); -+} -+ -+module_init(init_vmguar_notifier); -+module_exit(fini_vmguar_notifier); -diff -uprN 
linux-2.6.8.1.orig/kernel/ub/ub_proc.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_proc.c ---- linux-2.6.8.1.orig/kernel/ub/ub_proc.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_proc.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,380 @@ -+/* -+ * linux/fs/proc/proc_ub.c -+ * -+ * Copyright (C) 1998-2000 Andrey V. Savochkin <saw@saw.sw.com.sg> -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ * TODO: -+ * -+ * Changes: -+ */ -+ -+#include <linux/errno.h> -+#include <linux/sched.h> -+#include <linux/kernel.h> -+#include <linux/mm.h> -+#include <linux/proc_fs.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_hash.h> -+#include <ub/ub_debug.h> -+ -+#include <asm/page.h> -+#include <asm/uaccess.h> -+ -+/* -+ * we have 8 format strings depending on: -+ * 1. BITS_PER_LONG -+ * 2. CONFIG_UBC_KEEP_UNUSED -+ * 3. resource number (see out_proc_beancounter) -+ */ -+ -+#ifdef CONFIG_UBC_KEEP_UNUSED -+#define REF_FORMAT "%5.5s %4i: %-12s " -+#define UID_HEAD_STR "uid ref" -+#else -+#define REF_FORMAT "%10.10s: %-12s " -+#define UID_HEAD_STR "uid" -+#endif -+#define REF2_FORMAT "%10s %-12s " -+ -+#if BITS_PER_LONG == 32 -+#define RES_FORMAT "%10lu %10lu %10lu %10lu %10lu" -+#define HEAD_FORMAT "%10s %10s %10s %10s %10s" -+#define UB_PROC_LINE_TEXT (10+2+12+1+10+1+10+1+10+1+10+1+10) -+#else -+#define RES_FORMAT "%20lu %20lu %20lu %20lu %20lu" -+#define HEAD_FORMAT "%20s %20s %20s %20s %20s" -+#define UB_PROC_LINE_TEXT (10+2+12+1+20+1+20+1+20+1+20+1+20) -+#endif -+ -+#define UB_PROC_LINE_LEN (UB_PROC_LINE_TEXT + 1) -+ -+static void out_proc_version(char *buf) -+{ -+ int len; -+ -+ len = sprintf(buf, "Version: 2.5"); -+ memset(buf + len, ' ', UB_PROC_LINE_TEXT - len); -+ buf[UB_PROC_LINE_TEXT] = '\n'; -+} -+ -+static void out_proc_head(char *buf) -+{ -+ sprintf(buf, REF2_FORMAT HEAD_FORMAT, -+ UID_HEAD_STR, "resource", "held", "maxheld", -+ "barrier", "limit", 
"failcnt"); -+ buf[UB_PROC_LINE_TEXT] = '\n'; -+} -+ -+static void out_proc_beancounter(char *buf, struct user_beancounter *ub, int r) -+{ -+ if (r == 0) { -+ char tmpbuf[64]; -+ print_ub_uid(ub, tmpbuf, sizeof(tmpbuf)); -+ sprintf(buf, REF_FORMAT RES_FORMAT, -+ tmpbuf, -+#ifdef CONFIG_UBC_KEEP_UNUSED -+ atomic_read(&ub->ub_refcount), -+#endif -+ ub_rnames[r], ub->ub_parms[r].held, -+ ub->ub_parms[r].maxheld, ub->ub_parms[r].barrier, -+ ub->ub_parms[r].limit, ub->ub_parms[r].failcnt); -+ } else -+ sprintf(buf, REF2_FORMAT RES_FORMAT, -+ "", ub_rnames[r], -+ ub->ub_parms[r].held, ub->ub_parms[r].maxheld, -+ ub->ub_parms[r].barrier, ub->ub_parms[r].limit, -+ ub->ub_parms[r].failcnt); -+ -+ buf[UB_PROC_LINE_TEXT] = '\n'; -+} -+ -+static int ub_accessible(struct user_beancounter *ub, -+ struct user_beancounter *exec_ub, -+ struct file *file) -+{ -+ struct user_beancounter *p, *q; -+ -+ for (p = exec_ub; p->parent != NULL; p = p->parent); -+ for (q = ub; q->parent != NULL; q = q->parent); -+ if (p != get_ub0() && q != p) -+ return 0; -+ if (ub->parent == NULL) -+ return 1; -+ return file->private_data == NULL ? 
0 : 1; -+} -+ -+static ssize_t ub_proc_read(struct file *file, char *usrbuf, size_t len, -+ loff_t *poff) -+{ -+ ssize_t retval; -+ char *buf; -+ unsigned long flags; -+ int i, resource; -+ struct ub_hash_slot *slot; -+ struct user_beancounter *ub; -+ struct user_beancounter *exec_ub = get_exec_ub(); -+ loff_t n, off; -+ int rem, produced, job, tocopy; -+ const int is_capable = -+ (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH)); -+ -+ retval = -ENOBUFS; -+ buf = (char *)__get_free_page(GFP_KERNEL); -+ if (buf == NULL) -+ goto out; -+ -+ retval = 0; -+ if (!is_capable) -+ goto out_free; -+ -+ off = *poff; -+ if (off < 0) /* can't happen, just in case */ -+ goto inval; -+ -+again: -+ i = 0; -+ slot = ub_hash; -+ n = off; /* The amount of data to skip */ -+ produced = 0; -+ if (n < (UB_PROC_LINE_LEN * 2)) { -+ if (n < UB_PROC_LINE_LEN) { -+ out_proc_version(buf); -+ produced += UB_PROC_LINE_LEN; -+ n += UB_PROC_LINE_LEN; -+ } -+ out_proc_head(buf + produced); -+ produced += UB_PROC_LINE_LEN; -+ n += UB_PROC_LINE_LEN; -+ } -+ n -= (2 * UB_PROC_LINE_LEN); -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ while (1) { -+ for (ub = slot->ubh_beans; -+ ub != NULL && n >= (UB_RESOURCES * UB_PROC_LINE_LEN); -+ ub = ub->ub_next) -+ if (is_capable && ub_accessible(ub, exec_ub, file)) -+ n -= (UB_RESOURCES * UB_PROC_LINE_LEN); -+ if (ub != NULL || ++i >= UB_HASH_SIZE) -+ break; -+ ++slot; -+ } -+ rem = n; /* the amount of the data in the buffer to skip */ -+ job = PAGE_SIZE - UB_PROC_LINE_LEN + 1; /* end of buffer data */ -+ if (len < job - rem) -+ job = rem + len; -+ while (ub != NULL && produced < job) { -+ if (is_capable && ub_accessible(ub, exec_ub, file)) -+ for (resource = 0; -+ produced < job && resource < UB_RESOURCES; -+ resource++, produced += UB_PROC_LINE_LEN) -+ { -+ out_proc_beancounter(buf + produced, -+ ub, resource); -+ } -+ if (produced >= job) -+ break; -+ /* Find the next beancounter to produce more data.
*/ -+ ub = ub->ub_next; -+ while (ub == NULL && ++i < UB_HASH_SIZE) { -+ ++slot; -+ ub = slot->ubh_beans; -+ } -+ } -+ -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ ub_debug(UBD_ALLOC, KERN_DEBUG "UB_PROC: produced %d, job %d, rem %d\n", -+ produced, job, rem); -+ -+ /* -+ * Temporary buffer `buf' contains `produced' bytes. -+ * Extract no more than `len' bytes at offset `rem'. -+ */ -+ if (produced <= rem) -+ goto out_free; -+ tocopy = produced - rem; -+ if (len < tocopy) -+ tocopy = len; -+ if (!tocopy) -+ goto out_free; -+ if (copy_to_user(usrbuf, buf + rem, tocopy)) -+ goto fault; -+ off += tocopy; /* can't overflow */ -+ *poff = off; -+ len -= tocopy; -+ retval += tocopy; -+ if (!len) -+ goto out_free; -+ usrbuf += tocopy; -+ goto again; -+ -+fault: -+ retval = -EFAULT; -+out_free: -+ free_page((unsigned long)buf); -+out: -+ return retval; -+ -+inval: -+ retval = -EINVAL; -+ goto out_free; -+} -+ -+static int ub_proc_open(struct inode *inode, struct file *file) -+{ -+ file->private_data = strcmp(file->f_dentry->d_name.name, -+ "user_beancounters") ? 
-+ (void *)-1 : NULL; -+ return 0; -+} -+ -+static struct file_operations ub_file_operations = { -+ .read = &ub_proc_read, -+ .open = &ub_proc_open -+}; -+ -+#ifdef CONFIG_UBC_DEBUG_KMEM -+#include <linux/seq_file.h> -+#include <linux/kmem_cache.h> -+ -+static void *ubd_start(struct seq_file *m, loff_t *pos) -+{ -+ loff_t n = *pos; -+ struct user_beancounter *ub; -+ long slot; -+ -+ spin_lock_irq(&ub_hash_lock); -+ for (slot = 0; slot < UB_HASH_SIZE; slot++) -+ for (ub = ub_hash[slot].ubh_beans; ub; ub = ub->ub_next) { -+ if (n == 0) { -+ m->private = (void *)slot; -+ return (void *)ub; -+ } -+ n--; -+ } -+ return NULL; -+} -+ -+static void *ubd_next(struct seq_file *m, void *p, loff_t *pos) -+{ -+ struct user_beancounter *ub; -+ long slot; -+ -+ ub = (struct user_beancounter *)p; -+ slot = (long)m->private; -+ -+ ++*pos; -+ ub = ub->ub_next; -+ while (1) { -+ for (; ub; ub = ub->ub_next) { -+ m->private = (void *)slot; -+ return (void *)ub; -+ } -+ slot++; -+ if (slot == UB_HASH_SIZE) -+ break; -+ ub = ub_hash[slot].ubh_beans; -+ } -+ return NULL; -+} -+ -+static void ubd_stop(struct seq_file *m, void *p) -+{ -+ spin_unlock_irq(&ub_hash_lock); -+} -+ -+#define PROC_LINE_FMT "\t%-17s\t%5lu\t%5lu\n" -+ -+static int ubd_show(struct seq_file *m, void *p) -+{ -+ struct user_beancounter *ub; -+ struct ub_cache_counter *cc; -+ long pages, vmpages; -+ int i; -+ char id[64]; -+ -+ ub = (struct user_beancounter *)p; -+ print_ub_uid(ub, id, sizeof(id)); -+ seq_printf(m, "%s:\n", id); -+ -+ pages = vmpages = 0; -+ for (i = 0; i < NR_CPUS; i++) { -+ pages += ub->ub_pages_charged[i]; -+ vmpages += ub->ub_vmalloc_charged[i]; -+ } -+ if (pages < 0) -+ pages = 0; -+ if (vmpages < 0) -+ vmpages = 0; -+ seq_printf(m, PROC_LINE_FMT, "pages", pages, PAGE_SIZE); -+ seq_printf(m, PROC_LINE_FMT, "vmalloced", vmpages, PAGE_SIZE); -+ -+ seq_printf(m, PROC_LINE_FMT, ub_rnames[UB_UNUSEDPRIVVM], -+ ub->ub_unused_privvmpages, PAGE_SIZE); -+ seq_printf(m, PROC_LINE_FMT, 
ub_rnames[UB_TMPFSPAGES], -+ ub->ub_tmpfs_respages, PAGE_SIZE); -+ seq_printf(m, PROC_LINE_FMT, ub_rnames[UB_SWAPPAGES], -+ ub->ub_swap_pages, PAGE_SIZE); -+ /* interrupts are disabled by locking ub_hash_lock */ -+ spin_lock(&cc_lock); -+ list_for_each_entry (cc, &ub->ub_cclist, ulist) { -+ kmem_cache_t *cachep; -+ -+ cachep = cc->cachep; -+ seq_printf(m, PROC_LINE_FMT, -+ cachep->name, -+ cc->counter, -+ (unsigned long)cachep->objuse); -+ } -+ spin_unlock(&cc_lock); -+ return 0; -+} -+ -+static struct seq_operations kmemdebug_op = { -+ .start = ubd_start, -+ .next = ubd_next, -+ .stop = ubd_stop, -+ .show = ubd_show, -+}; -+ -+static int kmem_debug_open(struct inode *inode, struct file *file) -+{ -+ return seq_open(file, &kmemdebug_op); -+} -+ -+static struct file_operations kmem_debug_ops = { -+ .open = kmem_debug_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = seq_release, -+}; -+#endif -+ -+void __init beancounter_proc_init(void) -+{ -+ struct proc_dir_entry *entry; -+ -+ entry = create_proc_entry("user_beancounters", S_IRUGO, NULL); -+ if (entry) -+ entry->proc_fops = &ub_file_operations; -+ else -+ panic("Can't create /proc/user_beancounters entry!\n"); -+ -+ entry = create_proc_entry("user_beancounters_sub", S_IRUGO, NULL); -+ if (entry) -+ entry->proc_fops = &ub_file_operations; -+ else -+ panic("Can't create /proc/user_beancounters_sub entry!\n"); -+ -+#ifdef CONFIG_UBC_DEBUG_KMEM -+ entry = create_proc_entry("user_beancounters_debug", S_IRUGO, NULL); -+ if (entry) -+ entry->proc_fops = &kmem_debug_ops; -+ else -+ panic("Can't create /proc/user_beancounters_debug entry!\n"); -+#endif -+} -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_stat.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_stat.c ---- linux-2.6.8.1.orig/kernel/ub/ub_stat.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_stat.c 2006-05-11 13:05:39.000000000 +0400 -@@ -0,0 +1,465 @@ -+/* -+ * kernel/ub/ub_stat.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All
rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/config.h> -+#include <linux/timer.h> -+#include <linux/sched.h> -+#include <linux/init.h> -+#include <linux/jiffies.h> -+#include <linux/list.h> -+#include <linux/errno.h> -+#include <linux/suspend.h> -+ -+#include <asm/uaccess.h> -+#include <asm/param.h> -+ -+#include <ub/beancounter.h> -+#include <ub/ub_hash.h> -+#include <ub/ub_stat.h> -+ -+static spinlock_t ubs_notify_lock = SPIN_LOCK_UNLOCKED; -+static LIST_HEAD(ubs_notify_list); -+static long ubs_min_interval; -+static ubstattime_t ubs_start_time, ubs_end_time; -+static struct timer_list ubs_timer; -+ -+static int ubstat_get_list(void *buf, long size) -+{ -+ int retval; -+ unsigned long flags; -+ int slotnr; -+ struct ub_hash_slot *slot; -+ struct user_beancounter *ub, *last_ub; -+ long *page, *ptr, *end; -+ int len; -+ -+ page = (long *)__get_free_page(GFP_KERNEL); -+ if (page == NULL) -+ return -ENOMEM; -+ -+ retval = 0; -+ slotnr = 0; -+ slot = ub_hash; -+ last_ub = NULL; -+ while (1) { -+ ptr = page; -+ end = page + PAGE_SIZE / sizeof(*ptr); -+ -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ if (last_ub == NULL) -+ ub = slot->ubh_beans; -+ else -+ ub = last_ub->ub_next; -+ while (1) { -+ for (; ub != NULL; ub = ub->ub_next) { -+ if (ub->parent != NULL) -+ continue; -+ *ptr++ = ub->ub_uid; -+ if (ptr == end) -+ break; -+ } -+ if (ptr == end) -+ break; -+ ++slot; -+ if (++slotnr >= UB_HASH_SIZE) -+ break; -+ ub = slot->ubh_beans; -+ } -+ if (ptr == page) -+ goto out_unlock; -+ if (ub != NULL) -+ get_beancounter(ub); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ -+ if (last_ub != NULL) -+ put_beancounter(last_ub); -+ last_ub = ub; /* last visited beancounter in the slot */ -+ -+ len = min_t(long, (ptr - page) * sizeof(*ptr), size); -+ if (copy_to_user(buf, page, len)) { -+ retval = -EFAULT; -+ break; -+ } -+ retval += len; -+ if (len < PAGE_SIZE) -+ break; -+ buf += len; -+ size -= len; -+ } 
-+out: -+ if (last_ub != NULL) -+ put_beancounter(last_ub); -+ free_page((unsigned long)page); -+ return retval; -+ -+out_unlock: -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+ goto out; -+} -+ -+static int ubstat_gettime(void *buf, long size) -+{ -+ ubgettime_t data; -+ int retval; -+ -+ spin_lock(&ubs_notify_lock); -+ data.start_time = ubs_start_time; -+ data.end_time = ubs_end_time; -+ data.cur_time = ubs_start_time + (jiffies - ubs_start_time * HZ) / HZ; -+ spin_unlock(&ubs_notify_lock); -+ -+ retval = min_t(long, sizeof(data), size); -+ if (copy_to_user(buf, &data, retval)) -+ retval = -EFAULT; -+ return retval; -+} -+ -+static int ubstat_do_read_one(struct user_beancounter *ub, int res, void *kbuf) -+{ -+ struct { -+ ubstattime_t start_time; -+ ubstattime_t end_time; -+ ubstatparm_t param[1]; -+ } *data; -+ -+ data = kbuf; -+ data->start_time = ubs_start_time; -+ data->end_time = ubs_end_time; -+ -+ data->param[0].maxheld = ub->ub_store[res].maxheld; -+ data->param[0].failcnt = ub->ub_store[res].failcnt; -+ -+ return sizeof(*data); -+} -+ -+static int ubstat_do_read_all(struct user_beancounter *ub, void *kbuf, int size) -+{ -+ int wrote; -+ struct { -+ ubstattime_t start_time; -+ ubstattime_t end_time; -+ ubstatparm_t param[UB_RESOURCES]; -+ } *data; -+ int resource; -+ -+ data = kbuf; -+ data->start_time = ubs_start_time; -+ data->end_time = ubs_end_time; -+ wrote = sizeof(data->start_time) + sizeof(data->end_time); -+ -+ for (resource = 0; resource < UB_RESOURCES; resource++) { -+ if (size < wrote + sizeof(data->param[resource])) -+ break; -+ data->param[resource].maxheld = ub->ub_store[resource].maxheld; -+ data->param[resource].failcnt = ub->ub_store[resource].failcnt; -+ wrote += sizeof(data->param[resource]); -+ } -+ -+ return wrote; -+} -+ -+static int ubstat_do_read_full(struct user_beancounter *ub, void *kbuf, -+ int size) -+{ -+ int wrote; -+ struct { -+ ubstattime_t start_time; -+ ubstattime_t end_time; -+ ubstatparmf_t param[UB_RESOURCES]; -+ 
} *data; -+ int resource; -+ -+ data = kbuf; -+ data->start_time = ubs_start_time; -+ data->end_time = ubs_end_time; -+ wrote = sizeof(data->start_time) + sizeof(data->end_time); -+ -+ for (resource = 0; resource < UB_RESOURCES; resource++) { -+ if (size < wrote + sizeof(data->param[resource])) -+ break; -+ /* The beginning of ubstatparmf_t matches struct ubparm. */ -+ memcpy(&data->param[resource], &ub->ub_store[resource], -+ sizeof(ub->ub_store[resource])); -+ data->param[resource].__unused1 = 0; -+ data->param[resource].__unused2 = 0; -+ wrote += sizeof(data->param[resource]); -+ } -+ return wrote; -+} -+ -+static int ubstat_get_stat(struct user_beancounter *ub, long cmd, -+ void *buf, long size) -+{ -+ void *kbuf; -+ int retval; -+ -+ kbuf = (void *)__get_free_page(GFP_KERNEL); -+ if (kbuf == NULL) -+ return -ENOMEM; -+ -+ spin_lock(&ubs_notify_lock); -+ switch (UBSTAT_CMD(cmd)) { -+ case UBSTAT_READ_ONE: -+ retval = -EINVAL; -+ if (UBSTAT_PARMID(cmd) >= UB_RESOURCES) -+ break; -+ retval = ubstat_do_read_one(ub, -+ UBSTAT_PARMID(cmd), kbuf); -+ break; -+ case UBSTAT_READ_ALL: -+ retval = ubstat_do_read_all(ub, kbuf, PAGE_SIZE); -+ break; -+ case UBSTAT_READ_FULL: -+ retval = ubstat_do_read_full(ub, kbuf, PAGE_SIZE); -+ break; -+ default: -+ retval = -EINVAL; -+ } -+ spin_unlock(&ubs_notify_lock); -+ -+ if (retval > 0) { -+ retval = min_t(long, retval, size); -+ if (copy_to_user(buf, kbuf, retval)) -+ retval = -EFAULT; -+ } -+ -+ free_page((unsigned long)kbuf); -+ return retval; -+} -+ -+static int ubstat_handle_notifrq(ubnotifrq_t *req) -+{ -+ int retval; -+ struct ub_stat_notify *new_notify; -+ struct list_head *entry; -+ struct task_struct *tsk_to_free; -+ -+ new_notify = kmalloc(sizeof(*new_notify), GFP_KERNEL); -+ if (new_notify == NULL) -+ return -ENOMEM; -+ -+ tsk_to_free = NULL; -+ INIT_LIST_HEAD(&new_notify->list); -+ -+ spin_lock(&ubs_notify_lock); -+ list_for_each(entry, &ubs_notify_list) { -+ struct ub_stat_notify *notify; -+ -+ notify =
list_entry(entry, struct ub_stat_notify, list); -+ if (notify->task == current) { -+ kfree(new_notify); -+ new_notify = notify; -+ break; -+ } -+ } -+ -+ retval = -EINVAL; -+ if (req->maxinterval < 1) -+ goto out_unlock; -+ if (req->maxinterval > TIME_MAX_SEC) -+ req->maxinterval = TIME_MAX_SEC; -+ if (req->maxinterval < ubs_min_interval) { -+ unsigned long dif; -+ -+ ubs_min_interval = req->maxinterval; -+ dif = (ubs_timer.expires - jiffies + HZ - 1) / HZ; -+ if (dif > req->maxinterval) -+ mod_timer(&ubs_timer, -+ ubs_timer.expires - -+ (dif - req->maxinterval) * HZ); -+ } -+ -+ if (entry != &ubs_notify_list) { -+ list_del(&new_notify->list); -+ tsk_to_free = new_notify->task; -+ } -+ if (req->signum) { -+ new_notify->task = current; -+ get_task_struct(new_notify->task); -+ new_notify->signum = req->signum; -+ list_add(&new_notify->list, &ubs_notify_list); -+ } else -+ kfree(new_notify); -+ retval = 0; -+out_unlock: -+ spin_unlock(&ubs_notify_lock); -+ if (tsk_to_free != NULL) -+ put_task_struct(tsk_to_free); -+ return retval; -+} -+ -+/* -+ * former sys_ubstat -+ */ -+long do_ubstat(int func, unsigned long arg1, unsigned long arg2, void *buf, -+ long size) -+{ -+ int retval; -+ struct user_beancounter *ub; -+ -+ if (func == UBSTAT_UBPARMNUM) -+ return UB_RESOURCES; -+ if (func == UBSTAT_UBLIST) -+ return ubstat_get_list(buf, size); -+ if (!(capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))) -+ return -EPERM; -+ -+ if (func == UBSTAT_GETTIME) { -+ retval = ubstat_gettime(buf, size); -+ goto notify; -+ } -+ -+ ub = get_exec_ub(); -+ if (ub != NULL && ub->ub_uid == arg1) -+ get_beancounter(ub); -+ else /* FIXME must be if (ve_is_super) */ -+ ub = get_beancounter_byuid(arg1, 0); -+ -+ if (ub == NULL) -+ return -ESRCH; -+ -+ retval = ubstat_get_stat(ub, func, buf, size); -+ put_beancounter(ub); -+notify: -+ /* Handle request for notification */ -+ if (retval >= 0) { -+ ubnotifrq_t notifrq; -+ int err; -+ -+ err = -EFAULT; -+ if (!copy_from_user(&notifrq, (void
*)arg2, sizeof(notifrq))) -+ err = ubstat_handle_notifrq(&notifrq); -+ if (err) -+ retval = err; -+ } -+ -+ return retval; -+} -+ -+static void ubstat_save_onestat(struct user_beancounter *ub) -+{ -+ int resource; -+ -+ /* called with local irq disabled */ -+ spin_lock(&ub->ub_lock); -+ for (resource = 0; resource < UB_RESOURCES; resource++) { -+ memcpy(&ub->ub_store[resource], &ub->ub_parms[resource], -+ sizeof(struct ubparm)); -+ ub->ub_parms[resource].minheld = -+ ub->ub_parms[resource].maxheld = -+ ub->ub_parms[resource].held; -+ } -+ spin_unlock(&ub->ub_lock); -+} -+ -+static void ubstat_save_statistics(void) -+{ -+ unsigned long flags; -+ int i; -+ struct user_beancounter *ub; -+ -+ spin_lock_irqsave(&ub_hash_lock, flags); -+ for_each_beancounter(i, ub) -+ ubstat_save_onestat(ub); -+ spin_unlock_irqrestore(&ub_hash_lock, flags); -+} -+ -+static void ubstatd_timeout(unsigned long __data) -+{ -+ struct task_struct *p; -+ -+ p = (struct task_struct *) __data; -+ wake_up_process(p); -+} -+ -+/* -+ * Safe wrapper for send_sig. It prevents a race with release_task -+ * for sighand. -+ * Should be called under tasklist_lock.
-+ */ -+static void task_send_sig(struct ub_stat_notify *notify) -+{ -+ if (likely(notify->task->sighand != NULL)) -+ send_sig(notify->signum, notify->task, 1); -+} -+ -+static inline void do_notifies(void) -+{ -+ LIST_HEAD(notif_free_list); -+ struct ub_stat_notify *notify; -+ struct ub_stat_notify *tmp; -+ -+ spin_lock(&ubs_notify_lock); -+ ubs_start_time = ubs_end_time; -+ /* -+ * the expression below relies on time being unsigned long and -+ * arithmetic promotion rules -+ */ -+ ubs_end_time += (ubs_timer.expires - ubs_start_time * HZ) / HZ; -+ mod_timer(&ubs_timer, ubs_timer.expires + ubs_min_interval * HZ); -+ ubs_min_interval = TIME_MAX_SEC; -+ /* save statistics accumulated for the interval */ -+ ubstat_save_statistics(); -+ /* send signals */ -+ read_lock(&tasklist_lock); -+ while (!list_empty(&ubs_notify_list)) { -+ notify = list_entry(ubs_notify_list.next, -+ struct ub_stat_notify, list); -+ task_send_sig(notify); -+ list_del(¬ify->list); -+ list_add(¬ify->list, ¬if_free_list); -+ } -+ read_unlock(&tasklist_lock); -+ spin_unlock(&ubs_notify_lock); -+ -+ list_for_each_entry_safe(notify, tmp, ¬if_free_list, list) { -+ put_task_struct(notify->task); -+ kfree(notify); -+ } -+} -+ -+/* -+ * Kernel thread -+ */ -+static int ubstatd(void *unused) -+{ -+ /* daemonize call will take care of signals */ -+ daemonize("ubstatd"); -+ -+ ubs_timer.data = (unsigned long)current; -+ ubs_timer.function = ubstatd_timeout; -+ add_timer(&ubs_timer); -+ -+ while (1) { -+ set_task_state(current, TASK_INTERRUPTIBLE); -+ if (time_after(ubs_timer.expires, jiffies)) { -+ schedule(); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); -+ continue; -+ } -+ -+ __set_task_state(current, TASK_RUNNING); -+ do_notifies(); -+ } -+} -+ -+static int __init ubstatd_init(void) -+{ -+ init_timer(&ubs_timer); -+ ubs_timer.expires = TIME_MAX_JIF; -+ ubs_min_interval = TIME_MAX_SEC; -+ ubs_start_time = ubs_end_time = 0; -+ -+ kernel_thread(ubstatd, NULL, 0); -+ return 0; -+} -+ 
-+module_init(ubstatd_init); -diff -uprN linux-2.6.8.1.orig/kernel/ub/ub_sys.c linux-2.6.8.1-ve022stab078/kernel/ub/ub_sys.c ---- linux-2.6.8.1.orig/kernel/ub/ub_sys.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ub/ub_sys.c 2006-05-11 13:05:48.000000000 +0400 -@@ -0,0 +1,168 @@ -+/* -+ * kernel/ub/ub_sys.c -+ * -+ * Copyright (C) 2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/config.h> -+#include <linux/virtinfo.h> -+#include <asm/uaccess.h> -+ -+#include <ub/beancounter.h> -+ -+#ifndef CONFIG_USER_RESOURCE -+asmlinkage long sys_getluid(void) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage long sys_setluid(uid_t uid) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource, -+ unsigned long *limits) -+{ -+ return -ENOSYS; -+} -+ -+asmlinkage long sys_ubstat(int func, unsigned long arg1, unsigned long arg2, -+ void *buf, long size) -+{ -+ return -ENOSYS; -+} -+#else /* CONFIG_USER_RESOURCE */ -+ -+/* -+ * The (rather boring) getluid syscall -+ */ -+asmlinkage long sys_getluid(void) -+{ -+ struct user_beancounter *ub; -+ -+ ub = get_exec_ub(); -+ if (ub == NULL) -+ return -EINVAL; -+ -+ return ub->ub_uid; -+} -+ -+/* -+ * The setluid syscall -+ */ -+asmlinkage long sys_setluid(uid_t uid) -+{ -+ struct user_beancounter *ub; -+ struct task_beancounter *task_bc; -+ int error; -+ -+ task_bc = task_bc(current); -+ -+ /* You may not disown a setluid */ -+ error = -EINVAL; -+ if (uid == (uid_t)-1) -+ goto out; -+ -+ /* You may only set an ub as root */ -+ error = -EPERM; -+ if (!capable(CAP_SETUID)) -+ goto out; -+ -+ /* -+ * The ub once set is irrevocable to all -+ * unless it's set from ve0. 
-+ */ -+ if (!ve_is_super(get_exec_env())) -+ goto out; -+ -+ /* Ok - set up a beancounter entry for this user */ -+ error = -ENOBUFS; -+ ub = get_beancounter_byuid(uid, 1); -+ if (ub == NULL) -+ goto out; -+ -+ ub_debug(UBD_ALLOC | UBD_LIMIT, "setluid, bean %p (count %d) " -+ "for %.20s pid %d\n", -+ ub, atomic_read(&ub->ub_refcount), -+ current->comm, current->pid); -+ /* install bc */ -+ error = virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_NEWUBC, ub); -+ if (!(error & NOTIFY_FAIL)) { -+ put_beancounter(task_bc->exec_ub); -+ task_bc->exec_ub = ub; -+ if (!(error & NOTIFY_OK)) { -+ put_beancounter(task_bc->fork_sub); -+ task_bc->fork_sub = get_beancounter(ub); -+ } -+ error = 0; -+ } else -+ error = -ENOBUFS; -+out: -+ return error; -+} -+ -+/* -+ * The setbeanlimit syscall -+ */ -+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource, -+ unsigned long *limits) -+{ -+ int error; -+ unsigned long flags; -+ struct user_beancounter *ub; -+ unsigned long new_limits[2]; -+ -+ error = -EPERM; -+ if(!capable(CAP_SYS_RESOURCE)) -+ goto out; -+ -+ if (!ve_is_super(get_exec_env())) -+ goto out; -+ -+ error = -EINVAL; -+ if (resource >= UB_RESOURCES) -+ goto out; -+ -+ error = -EFAULT; -+ if (copy_from_user(&new_limits, limits, sizeof(new_limits))) -+ goto out; -+ -+ error = -EINVAL; -+ if (new_limits[0] > UB_MAXVALUE || new_limits[1] > UB_MAXVALUE) -+ goto out; -+ -+ error = -ENOENT; -+ ub = get_beancounter_byuid(uid, 0); -+ if (ub == NULL) { -+ ub_debug(UBD_LIMIT, "No login bc for uid %d\n", uid); -+ goto out; -+ } -+ -+ spin_lock_irqsave(&ub->ub_lock, flags); -+ ub->ub_parms[resource].barrier = new_limits[0]; -+ ub->ub_parms[resource].limit = new_limits[1]; -+ spin_unlock_irqrestore(&ub->ub_lock, flags); -+ -+ put_beancounter(ub); -+ -+ error = 0; -+out: -+ return error; -+} -+ -+extern long do_ubstat(int func, unsigned long arg1, unsigned long arg2, -+ void *buf, long size); -+asmlinkage long sys_ubstat(int func, unsigned long arg1, unsigned long arg2, -+ 
void *buf, long size) -+{ -+ if (!ve_is_super(get_exec_env())) -+ return -EPERM; -+ -+ return do_ubstat(func, arg1, arg2, buf, size); -+} -+#endif -diff -uprN linux-2.6.8.1.orig/kernel/user.c linux-2.6.8.1-ve022stab078/kernel/user.c ---- linux-2.6.8.1.orig/kernel/user.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/kernel/user.c 2006-05-11 13:05:40.000000000 +0400 -@@ -21,7 +21,20 @@ - #define UIDHASH_SZ (1 << UIDHASH_BITS) - #define UIDHASH_MASK (UIDHASH_SZ - 1) - #define __uidhashfn(uid) (((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK) --#define uidhashentry(uid) (uidhash_table + __uidhashfn((uid))) -+#define __uidhashentry(uid) (uidhash_table + __uidhashfn((uid))) -+ -+#ifdef CONFIG_VE -+#define UIDHASH_MASK_VE (UIDHASH_SZ_VE - 1) -+#define __uidhashfn_ve(uid) (((uid >> UIDHASH_BITS_VE) ^ uid) & \ -+ UIDHASH_MASK_VE) -+#define __uidhashentry_ve(uid, envid) ((envid)->uidhash_table + \ -+ __uidhashfn_ve(uid)) -+#define uidhashentry_ve(uid) (ve_is_super(get_exec_env()) ? 
\ -+ __uidhashentry(uid) : \ -+ __uidhashentry_ve(uid, get_exec_env())) -+#else -+#define uidhashentry_ve(uid) __uidhashentry(uid) -+#endif - - static kmem_cache_t *uid_cachep; - static struct list_head uidhash_table[UIDHASH_SZ]; -@@ -77,7 +90,7 @@ struct user_struct *find_user(uid_t uid) - struct user_struct *ret; - - spin_lock(&uidhash_lock); -- ret = uid_hash_find(uid, uidhashentry(uid)); -+ ret = uid_hash_find(uid, uidhashentry_ve(uid)); - spin_unlock(&uidhash_lock); - return ret; - } -@@ -93,7 +106,7 @@ void free_uid(struct user_struct *up) - - struct user_struct * alloc_uid(uid_t uid) - { -- struct list_head *hashent = uidhashentry(uid); -+ struct list_head *hashent = uidhashentry_ve(uid); - struct user_struct *up; - - spin_lock(&uidhash_lock); -@@ -154,14 +167,14 @@ static int __init uid_cache_init(void) - int n; - - uid_cachep = kmem_cache_create("uid_cache", sizeof(struct user_struct), -- 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); -+ 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_UBC, NULL, NULL); - - for(n = 0; n < UIDHASH_SZ; ++n) - INIT_LIST_HEAD(uidhash_table + n); - - /* Insert the root user immediately (init already runs as root) */ - spin_lock(&uidhash_lock); -- uid_hash_insert(&root_user, uidhashentry(0)); -+ uid_hash_insert(&root_user, __uidhashentry(0)); - spin_unlock(&uidhash_lock); - - return 0; -diff -uprN linux-2.6.8.1.orig/kernel/ve.c linux-2.6.8.1-ve022stab078/kernel/ve.c ---- linux-2.6.8.1.orig/kernel/ve.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/ve.c 2006-05-11 13:05:42.000000000 +0400 -@@ -0,0 +1,178 @@ -+/* -+ * linux/kernel/ve.c -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+/* -+ * 've.c' helper file performing VE sub-system initialization -+ */ -+ -+#include <linux/sched.h> -+#include <linux/delay.h> -+#include <linux/capability.h> -+#include <linux/ve.h> -+#include <linux/smp_lock.h> -+#include <linux/init.h> -+ -+#include <linux/errno.h> -+#include <linux/unistd.h> -+#include <linux/slab.h> -+#include <linux/sys.h> -+#include <linux/kdev_t.h> -+#include <linux/termios.h> -+#include <linux/tty_driver.h> -+#include <linux/netdevice.h> -+#include <linux/utsname.h> -+#include <linux/proc_fs.h> -+#include <linux/kernel_stat.h> -+#include <linux/module.h> -+#include <linux/rcupdate.h> -+#include <linux/ve_proto.h> -+#include <linux/ve_owner.h> -+ -+#include <linux/nfcalls.h> -+ -+unsigned long vz_rstamp = 0x37e0f59d; -+ -+#ifdef CONFIG_MODULES -+struct module no_module = { .state = MODULE_STATE_GOING }; -+EXPORT_SYMBOL(no_module); -+#endif -+ -+#ifdef CONFIG_VE -+ -+DCL_VE_OWNER(SKB, SLAB, struct sk_buff, owner_env, , (noinline, regparm(1))) -+DCL_VE_OWNER(SK, SLAB, struct sock, sk_owner_env, , (noinline, regparm(1))) -+DCL_VE_OWNER(TW, SLAB, struct tcp_tw_bucket, tw_owner_env, , (noinline, regparm(1))) -+DCL_VE_OWNER(FILP, GENERIC, struct file, owner_env, inline, (always_inline)) -+DCL_VE_OWNER(FSTYPE, MODULE, struct file_system_type, owner_env, , ()) -+ -+#if defined(CONFIG_VE_IPTABLES) -+INIT_KSYM_MODULE(ip_tables); -+INIT_KSYM_MODULE(iptable_filter); -+INIT_KSYM_MODULE(iptable_mangle); -+INIT_KSYM_MODULE(ipt_limit); -+INIT_KSYM_MODULE(ipt_multiport); -+INIT_KSYM_MODULE(ipt_tos); -+INIT_KSYM_MODULE(ipt_TOS); -+INIT_KSYM_MODULE(ipt_REJECT); -+INIT_KSYM_MODULE(ipt_TCPMSS); -+INIT_KSYM_MODULE(ipt_tcpmss); -+INIT_KSYM_MODULE(ipt_ttl); -+INIT_KSYM_MODULE(ipt_LOG); -+INIT_KSYM_MODULE(ipt_length); -+INIT_KSYM_MODULE(ip_conntrack); -+INIT_KSYM_MODULE(ip_conntrack_ftp); -+INIT_KSYM_MODULE(ip_conntrack_irc); -+INIT_KSYM_MODULE(ipt_conntrack); -+INIT_KSYM_MODULE(ipt_state); -+INIT_KSYM_MODULE(ipt_helper); 
-+INIT_KSYM_MODULE(iptable_nat); -+INIT_KSYM_MODULE(ip_nat_ftp); -+INIT_KSYM_MODULE(ip_nat_irc); -+INIT_KSYM_MODULE(ipt_REDIRECT); -+ -+INIT_KSYM_CALL(int, init_netfilter, (void)); -+INIT_KSYM_CALL(int, init_iptables, (void)); -+INIT_KSYM_CALL(int, init_iptable_filter, (void)); -+INIT_KSYM_CALL(int, init_iptable_mangle, (void)); -+INIT_KSYM_CALL(int, init_iptable_limit, (void)); -+INIT_KSYM_CALL(int, init_iptable_multiport, (void)); -+INIT_KSYM_CALL(int, init_iptable_tos, (void)); -+INIT_KSYM_CALL(int, init_iptable_TOS, (void)); -+INIT_KSYM_CALL(int, init_iptable_REJECT, (void)); -+INIT_KSYM_CALL(int, init_iptable_TCPMSS, (void)); -+INIT_KSYM_CALL(int, init_iptable_tcpmss, (void)); -+INIT_KSYM_CALL(int, init_iptable_ttl, (void)); -+INIT_KSYM_CALL(int, init_iptable_LOG, (void)); -+INIT_KSYM_CALL(int, init_iptable_length, (void)); -+INIT_KSYM_CALL(int, init_iptable_conntrack, (void)); -+INIT_KSYM_CALL(int, init_iptable_ftp, (void)); -+INIT_KSYM_CALL(int, init_iptable_irc, (void)); -+INIT_KSYM_CALL(int, init_iptable_conntrack_match, (void)); -+INIT_KSYM_CALL(int, init_iptable_state, (void)); -+INIT_KSYM_CALL(int, init_iptable_helper, (void)); -+INIT_KSYM_CALL(int, init_iptable_nat, (void)); -+INIT_KSYM_CALL(int, init_iptable_nat_ftp, (void)); -+INIT_KSYM_CALL(int, init_iptable_nat_irc, (void)); -+INIT_KSYM_CALL(int, init_iptable_REDIRECT, (void)); -+INIT_KSYM_CALL(void, fini_iptable_nat_irc, (void)); -+INIT_KSYM_CALL(void, fini_iptable_nat_ftp, (void)); -+INIT_KSYM_CALL(void, fini_iptable_nat, (void)); -+INIT_KSYM_CALL(void, fini_iptable_helper, (void)); -+INIT_KSYM_CALL(void, fini_iptable_state, (void)); -+INIT_KSYM_CALL(void, fini_iptable_conntrack_match, (void)); -+INIT_KSYM_CALL(void, fini_iptable_irc, (void)); -+INIT_KSYM_CALL(void, fini_iptable_ftp, (void)); -+INIT_KSYM_CALL(void, fini_iptable_conntrack, (void)); -+INIT_KSYM_CALL(void, fini_iptable_length, (void)); -+INIT_KSYM_CALL(void, fini_iptable_LOG, (void)); -+INIT_KSYM_CALL(void, fini_iptable_ttl, 
(void)); -+INIT_KSYM_CALL(void, fini_iptable_tcpmss, (void)); -+INIT_KSYM_CALL(void, fini_iptable_TCPMSS, (void)); -+INIT_KSYM_CALL(void, fini_iptable_REJECT, (void)); -+INIT_KSYM_CALL(void, fini_iptable_TOS, (void)); -+INIT_KSYM_CALL(void, fini_iptable_tos, (void)); -+INIT_KSYM_CALL(void, fini_iptable_multiport, (void)); -+INIT_KSYM_CALL(void, fini_iptable_limit, (void)); -+INIT_KSYM_CALL(void, fini_iptable_filter, (void)); -+INIT_KSYM_CALL(void, fini_iptable_mangle, (void)); -+INIT_KSYM_CALL(void, fini_iptables, (void)); -+INIT_KSYM_CALL(void, fini_netfilter, (void)); -+INIT_KSYM_CALL(void, fini_iptable_REDIRECT, (void)); -+ -+INIT_KSYM_CALL(void, ipt_flush_table, (struct ipt_table *table)); -+#endif -+ -+#if defined(CONFIG_VE_CALLS_MODULE) || defined(CONFIG_VE_CALLS) -+INIT_KSYM_MODULE(vzmon); -+INIT_KSYM_CALL(int, real_get_device_perms_ve, -+ (int dev_type, dev_t dev, int access_mode)); -+INIT_KSYM_CALL(void, real_do_env_cleanup, (struct ve_struct *env)); -+INIT_KSYM_CALL(void, real_do_env_free, (struct ve_struct *env)); -+INIT_KSYM_CALL(void, real_update_load_avg_ve, (void)); -+ -+int get_device_perms_ve(int dev_type, dev_t dev, int access_mode) -+{ -+ return KSYMSAFECALL(int, vzmon, real_get_device_perms_ve, -+ (dev_type, dev, access_mode)); -+} -+EXPORT_SYMBOL(get_device_perms_ve); -+ -+void do_env_cleanup(struct ve_struct *env) -+{ -+ KSYMSAFECALL_VOID(vzmon, real_do_env_cleanup, (env)); -+} -+ -+void do_env_free(struct ve_struct *env) -+{ -+ KSYMSAFECALL_VOID(vzmon, real_do_env_free, (env)); -+} -+EXPORT_SYMBOL(do_env_free); -+ -+void do_update_load_avg_ve(void) -+{ -+ KSYMSAFECALL_VOID(vzmon, real_update_load_avg_ve, ()); -+} -+#endif -+ -+extern struct ipv4_devconf ipv4_devconf; -+extern struct ipv4_devconf *get_ipv4_devconf_dflt_addr(void); -+ -+struct ve_struct ve0 = { -+ .utsname = &system_utsname, -+ .vetask_lh = LIST_HEAD_INIT(ve0.vetask_lh), -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ ._net_dev_tail = &ve0._net_dev_base, 
-+ .ifindex = -1, -+#endif -+}; -+ -+EXPORT_SYMBOL(ve0); -+ -+#endif /* CONFIG_VE */ -diff -uprN linux-2.6.8.1.orig/kernel/vecalls.c linux-2.6.8.1-ve022stab078/kernel/vecalls.c ---- linux-2.6.8.1.orig/kernel/vecalls.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/vecalls.c 2006-05-11 13:05:48.000000000 +0400 -@@ -0,0 +1,3202 @@ -+/* -+ * linux/kernel/vecalls.c -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ */ -+ -+/* -+ * 'vecalls.c' is file with basic VE support. It provides basic primities -+ * along with initialization script -+ */ -+ -+#include <linux/sched.h> -+#include <linux/delay.h> -+#include <linux/capability.h> -+#include <linux/ve.h> -+#include <linux/smp_lock.h> -+#include <linux/init.h> -+#include <linux/list.h> -+#include <linux/ve_owner.h> -+#include <linux/errno.h> -+#include <linux/unistd.h> -+#include <linux/slab.h> -+#include <linux/vmalloc.h> -+#include <linux/sys.h> -+#include <linux/fs.h> -+#include <linux/namespace.h> -+#include <linux/termios.h> -+#include <linux/tty_driver.h> -+#include <linux/netdevice.h> -+#include <linux/wait.h> -+#include <linux/inetdevice.h> -+#include <linux/utsname.h> -+#include <linux/sysctl.h> -+#include <linux/proc_fs.h> -+#include <linux/seq_file.h> -+#include <linux/kernel_stat.h> -+#include <linux/module.h> -+#include <linux/suspend.h> -+#include <linux/rcupdate.h> -+#include <linux/in.h> -+#include <linux/major.h> -+#include <linux/kdev_t.h> -+#include <linux/idr.h> -+#include <linux/inetdevice.h> -+#include <net/pkt_sched.h> -+#include <linux/divert.h> -+#include <ub/beancounter.h> -+ -+#include <net/route.h> -+#include <net/ip_fib.h> -+ -+#include <linux/ve_proto.h> -+#include <linux/venet.h> -+#include <linux/vzctl.h> -+#include <linux/vzcalluser.h> -+#include <linux/fairsched.h> -+ -+#include <linux/nfcalls.h> -+ -+struct ve_struct *ve_list_head = NULL; -+int nr_ve = 1; /* One VE always exists. 
Compatibility with vestat */ -+rwlock_t ve_list_guard = RW_LOCK_UNLOCKED; -+static rwlock_t devperms_hash_guard = RW_LOCK_UNLOCKED; -+ -+extern int glob_virt_pids; -+ -+static int do_env_enter(struct ve_struct *ve, unsigned int flags); -+int real_env_create(envid_t veid, unsigned flags, u32 class_id, -+ env_create_param_t *data, int datalen); -+static void do_clean_devperms(envid_t veid); -+static int alloc_ve_tty_drivers(struct ve_struct* ve); -+static void free_ve_tty_drivers(struct ve_struct* ve); -+static int register_ve_tty_drivers(struct ve_struct* ve); -+static void unregister_ve_tty_drivers(struct ve_struct* ve); -+static int init_ve_tty_drivers(struct ve_struct *); -+static void fini_ve_tty_drivers(struct ve_struct *); -+static void clear_termios(struct tty_driver* driver ); -+static void ve_mapped_devs_cleanup(struct ve_struct *ve); -+ -+static int ve_get_cpu_stat(envid_t veid, struct vz_cpu_stat *buf); -+ -+static void vecalls_exit(void); -+ -+struct ve_struct *__find_ve_by_id(envid_t veid) -+{ -+ struct ve_struct *ve; -+ for (ve = ve_list_head; -+ ve != NULL && ve->veid != veid; -+ ve = ve->next); -+ return ve; -+} -+ -+struct ve_struct *get_ve_by_id(envid_t veid) -+{ -+ struct ve_struct *ve; -+ read_lock(&ve_list_guard); -+ ve = __find_ve_by_id(veid); -+ get_ve(ve); -+ read_unlock(&ve_list_guard); -+ return ve; -+} -+ -+/* -+ * real_put_ve() MUST be used instead of put_ve() inside vecalls. 
-+ */ -+void real_do_env_free(struct ve_struct *ve); -+static inline void real_put_ve(struct ve_struct *ve) -+{ -+ if (ve && atomic_dec_and_test(&ve->counter)) { -+ if (atomic_read(&ve->pcounter) > 0) -+ BUG(); -+ if (ve->is_running) -+ BUG(); -+ real_do_env_free(ve); -+ } -+} -+ -+extern struct file_system_type devpts_fs_type; -+extern struct file_system_type sysfs_fs_type; -+extern struct file_system_type tmpfs_fs_type; -+extern struct file_system_type proc_fs_type; -+ -+extern spinlock_t task_capability_lock; -+extern void ve_ipc_free(struct ve_struct * ve); -+extern void ip_fragment_cleanup(struct ve_struct *ve); -+ -+static int ve_get_cpu_stat(envid_t veid, struct vz_cpu_stat *buf) -+{ -+ struct ve_struct *ve; -+ struct vz_cpu_stat *vstat; -+ int retval; -+ int i, cpu; -+ unsigned long tmp; -+ -+ if (!ve_is_super(get_exec_env()) && (veid != get_exec_env()->veid)) -+ return -EPERM; -+ if (veid == 0) -+ return -ESRCH; -+ -+ vstat = kmalloc(sizeof(*vstat), GFP_KERNEL); -+ if (!vstat) -+ return -ENOMEM; -+ memset(vstat, 0, sizeof(*vstat)); -+ -+ retval = -ESRCH; -+ read_lock(&ve_list_guard); -+ ve = __find_ve_by_id(veid); -+ if (ve == NULL) -+ goto out_unlock; -+ for (cpu = 0; cpu < NR_CPUS; cpu++) { -+ vstat->user_jif += VE_CPU_STATS(ve, cpu)->user; -+ vstat->nice_jif += VE_CPU_STATS(ve, cpu)->nice; -+ vstat->system_jif += VE_CPU_STATS(ve, cpu)->system; -+ vstat->idle_clk += ve_sched_get_idle_time(ve, cpu); -+ } -+ vstat->uptime_clk = get_cycles() - ve->start_cycles; -+ vstat->uptime_jif = jiffies - ve->start_jiffies; -+ for (i = 0; i < 3; i++) { -+ tmp = ve->avenrun[i] + (FIXED_1/200); -+ vstat->avenrun[i].val_int = LOAD_INT(tmp); -+ vstat->avenrun[i].val_frac = LOAD_FRAC(tmp); -+ } -+ read_unlock(&ve_list_guard); -+ -+ retval = 0; -+ if (copy_to_user(buf, vstat, sizeof(*vstat))) -+ retval = -EFAULT; -+out_free: -+ kfree(vstat); -+ return retval; -+ -+out_unlock: -+ read_unlock(&ve_list_guard); -+ goto out_free; -+} -+ 
-+/********************************************************************** -+ * Devices permissions routines, -+ * character and block devices separately -+ **********************************************************************/ -+ -+/* Rules applied in the following order: -+ MAJOR!=0, MINOR!=0 -+ MAJOR!=0, MINOR==0 -+ MAJOR==0, MINOR==0 -+*/ -+struct devperms_struct -+{ -+ dev_t dev; /* device id */ -+ unsigned char mask; -+ unsigned type; -+ envid_t veid; -+ -+ struct devperms_struct *devhash_next; -+ struct devperms_struct **devhash_pprev; -+}; -+ -+static struct devperms_struct original_perms[] = -+{{ -+ MKDEV(0,0), /*device*/ -+ S_IROTH | S_IWOTH, -+ S_IFCHR, /*type*/ -+ 0, /*veid*/ -+ NULL, NULL -+}, -+{ -+ MKDEV(0,0), /*device*/ -+ S_IXGRP | S_IROTH | S_IWOTH, -+ S_IFBLK, /*type*/ -+ 0, /*veid*/ -+ NULL, NULL -+}}; -+ -+static struct devperms_struct default_major_perms[] = { -+ {MKDEV(UNIX98_PTY_MASTER_MAJOR, 0), S_IROTH | S_IWOTH, S_IFCHR}, -+ {MKDEV(UNIX98_PTY_SLAVE_MAJOR, 0), S_IROTH | S_IWOTH, S_IFCHR}, -+ {MKDEV(PTY_MASTER_MAJOR, 0), S_IROTH | S_IWOTH, S_IFCHR}, -+ {MKDEV(PTY_SLAVE_MAJOR, 0), S_IROTH | S_IWOTH, S_IFCHR}, -+}; -+static struct devperms_struct default_minor_perms[] = { -+ {MKDEV(MEM_MAJOR, 3), S_IROTH | S_IWOTH, S_IFCHR}, /* null */ -+ {MKDEV(MEM_MAJOR, 5), S_IROTH | S_IWOTH, S_IFCHR}, /* zero */ -+ {MKDEV(MEM_MAJOR, 7), S_IROTH | S_IWOTH, S_IFCHR}, /* full */ -+ {MKDEV(TTYAUX_MAJOR, 0), S_IROTH | S_IWOTH, S_IFCHR},/* tty */ -+ {MKDEV(TTYAUX_MAJOR, 2), S_IROTH | S_IWOTH, S_IFCHR},/* ptmx */ -+ {MKDEV(MEM_MAJOR, 8), S_IROTH, S_IFCHR}, /* random */ -+ {MKDEV(MEM_MAJOR, 9), S_IROTH, S_IFCHR}, /* urandom */ -+}; -+ -+static struct devperms_struct default_deny_perms = { -+ MKDEV(0, 0), 0, S_IFCHR -+}; -+ -+static inline struct devperms_struct *find_default_devperms(int type, -+ dev_t dev) -+{ -+ int i; -+ -+ /* XXX all defaults perms are S_IFCHR */ -+ if (type != S_IFCHR) -+ return &default_deny_perms; -+ -+ for (i = 0; -+ i < 
sizeof(default_minor_perms)/sizeof(struct devperms_struct); -+ i++) -+ if (MAJOR(dev) == MAJOR(default_minor_perms[i].dev) && -+ MINOR(dev) == MINOR(default_minor_perms[i].dev)) -+ return &default_minor_perms[i]; -+ for (i = 0; -+ i < sizeof(default_major_perms)/sizeof(struct devperms_struct); -+ i++) -+ if (MAJOR(dev) == MAJOR(default_major_perms[i].dev)) -+ return &default_major_perms[i]; -+ -+ return &default_deny_perms; -+} -+ -+#define DEVPERMS_HASH_SZ 512 -+struct devperms_struct *devperms_hash[DEVPERMS_HASH_SZ]; -+ -+#define devperms_hashfn(id,dev) \ -+ ( (id << 5) ^ (id >> 5) ^ (MAJOR(dev)) ^ MINOR(dev) ) & \ -+ (DEVPERMS_HASH_SZ - 1) -+ -+static inline void hash_devperms(struct devperms_struct *p) -+{ -+ struct devperms_struct **htable = -+ &devperms_hash[devperms_hashfn(p->veid,p->dev)]; -+ -+ if ((p->devhash_next = *htable) != NULL) -+ (*htable)->devhash_pprev = &p->devhash_next; -+ *htable = p; -+ p->devhash_pprev = htable; -+} -+ -+static inline void unhash_devperms(struct devperms_struct *p) -+{ -+ if (p->devhash_next) -+ p->devhash_next->devhash_pprev = p->devhash_pprev; -+ *p->devhash_pprev = p->devhash_next; -+} -+ -+static int __init init_devperms_hash(void) -+{ -+ write_lock_irq(&devperms_hash_guard); -+ memset(devperms_hash, 0, sizeof(devperms_hash)); -+ hash_devperms(original_perms); -+ hash_devperms(original_perms+1); -+ write_unlock_irq(&devperms_hash_guard); -+ return 0; -+} -+ -+static inline void fini_devperms_hash(void) -+{ -+} -+ -+static inline struct devperms_struct *find_devperms(envid_t veid, -+ int type, -+ dev_t dev) -+{ -+ struct devperms_struct *p, **htable = -+ &devperms_hash[devperms_hashfn(veid,dev)]; -+ -+ for (p = *htable; p && !(p->type==type && -+ MAJOR(dev)==MAJOR(p->dev) && -+ MINOR(dev)==MINOR(p->dev) && -+ p->veid==veid); -+ p = p->devhash_next) -+ ; -+ return p; -+} -+ -+ -+static void do_clean_devperms(envid_t veid) -+{ -+ int i; -+ struct devperms_struct* ve; -+ -+ write_lock_irq(&devperms_hash_guard); -+ for (i = 
0; i < DEVPERMS_HASH_SZ; i++) -+ for (ve = devperms_hash[i]; ve;) { -+ struct devperms_struct *next = ve->devhash_next; -+ if (ve->veid == veid) { -+ unhash_devperms(ve); -+ kfree(ve); -+ } -+ -+ ve = next; -+ } -+ write_unlock_irq(&devperms_hash_guard); -+} -+ -+/* -+ * Mode is a mask of -+ * FMODE_READ for read access (configurable by S_IROTH) -+ * FMODE_WRITE for write access (configurable by S_IWOTH) -+ * FMODE_QUOTACTL for quotactl access (configurable by S_IXGRP) -+ */ -+int real_get_device_perms_ve(int dev_type, dev_t dev, int access_mode) -+{ -+ struct devperms_struct *perms; -+ struct ve_struct *ve; -+ envid_t veid; -+ -+ perms = NULL; -+ ve = get_exec_env(); -+ veid = ve->veid; -+ -+ read_lock(&devperms_hash_guard); -+ -+ perms = find_devperms(veid, dev_type|VE_USE_MINOR, dev); -+ if (perms) -+ goto end; -+ -+ perms = find_devperms(veid, dev_type|VE_USE_MAJOR, MKDEV(MAJOR(dev),0)); -+ if (perms) -+ goto end; -+ -+ perms = find_devperms(veid, dev_type, MKDEV(0,0)); -+ if (perms) -+ goto end; -+ -+ perms = find_default_devperms(dev_type, dev); -+ -+end: -+ read_unlock(&devperms_hash_guard); -+ -+ access_mode = "\000\004\002\006\010\014\012\016"[access_mode]; -+ return perms ? -+ (((perms->mask & access_mode) == access_mode) ? 
0 : -EACCES) : -+ -ENODEV; -+} -+ -+int do_setdevperms(envid_t veid, unsigned type, dev_t dev, unsigned mask) -+{ -+ struct devperms_struct *perms; -+ -+ write_lock_irq(&devperms_hash_guard); -+ perms = find_devperms(veid, type, dev); -+ if (!perms) { -+ struct devperms_struct *perms_new; -+ write_unlock_irq(&devperms_hash_guard); -+ -+ perms_new = kmalloc(sizeof(struct devperms_struct), GFP_KERNEL); -+ if (!perms_new) -+ return -ENOMEM; -+ -+ write_lock_irq(&devperms_hash_guard); -+ perms = find_devperms(veid, type, dev); -+ if (perms) { -+ kfree(perms_new); -+ perms_new = perms; -+ } -+ -+ switch (type & VE_USE_MASK) { -+ case 0: -+ dev = 0; -+ break; -+ case VE_USE_MAJOR: -+ dev = MKDEV(MAJOR(dev),0); -+ break; -+ } -+ -+ perms_new->veid = veid; -+ perms_new->dev = dev; -+ perms_new->type = type; -+ perms_new->mask = mask & S_IALLUGO; -+ hash_devperms(perms_new); -+ } else -+ perms->mask = mask & S_IALLUGO; -+ write_unlock_irq(&devperms_hash_guard); -+ return 0; -+} -+EXPORT_SYMBOL(do_setdevperms); -+ -+int real_setdevperms(envid_t veid, unsigned type, dev_t dev, unsigned mask) -+{ -+ struct ve_struct *ve; -+ int err; -+ -+ if (!capable(CAP_SETVEID) || veid == 0) -+ return -EPERM; -+ -+ if ((ve = get_ve_by_id(veid)) == NULL) -+ return -ESRCH; -+ -+ down_read(&ve->op_sem); -+ err = -ESRCH; -+ if (ve->is_running) -+ err = do_setdevperms(veid, type, dev, mask); -+ up_read(&ve->op_sem); -+ real_put_ve(ve); -+ return err; -+} -+ -+void real_update_load_avg_ve(void) -+{ -+ struct ve_struct *ve; -+ unsigned long nr_active; -+ -+ read_lock(&ve_list_guard); -+ for (ve = ve_list_head; ve != NULL; ve = ve->next) { -+ nr_active = nr_running_ve(ve) + nr_uninterruptible_ve(ve); -+ nr_active *= FIXED_1; -+ CALC_LOAD(ve->avenrun[0], EXP_1, nr_active); -+ CALC_LOAD(ve->avenrun[1], EXP_5, nr_active); -+ CALC_LOAD(ve->avenrun[2], EXP_15, nr_active); -+ } -+ read_unlock(&ve_list_guard); -+} -+ -+ -+/********************************************************************** -+ 
********************************************************************** -+ * -+ * FS-related helpers to VE start/stop -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+/* -+ * DEVPTS needs a virtualization: each environment should see each own list of -+ * pseudo-terminals. -+ * To implement it we need to have separate devpts superblocks for each -+ * VE, and each VE should mount its own one. -+ * Thus, separate vfsmount structures are required. -+ * To minimize intrusion into vfsmount lookup code, separate file_system_type -+ * structures are created. -+ * -+ * In addition to this, patch fo character device itself is required, as file -+ * system itself is used only for MINOR/MAJOR lookup. -+ */ -+static int register_ve_fs_type(struct ve_struct *ve, -+ struct file_system_type *template, -+ struct file_system_type **p_fs_type, struct vfsmount **p_mnt) -+{ -+ struct vfsmount *mnt; -+ struct file_system_type *local_fs_type; -+ int ret; -+ -+ VZTRACE("register_ve_fs_type(\"%s\")\n", template->name); -+ -+ local_fs_type = kmalloc(sizeof(*local_fs_type) + sizeof(void *), -+ GFP_KERNEL); -+ if (local_fs_type == NULL) -+ return -ENOMEM; -+ -+ memset(local_fs_type, 0, sizeof(*local_fs_type)); -+ local_fs_type->name = template->name; -+ local_fs_type->fs_flags = template->fs_flags; -+ local_fs_type->get_sb = template->get_sb; -+ local_fs_type->kill_sb = template->kill_sb; -+ local_fs_type->owner = template->owner; -+ /* -+ * 1. we do not have refcounter on fstype -+ * 2. fstype holds reference to ve using get_ve()/put_ve(). 
-+ * so we free fstype when freeing ve and we are sure it's ok to free it -+ */ -+ SET_VE_OWNER_FSTYPE(local_fs_type, ve); -+ get_filesystem(local_fs_type); /* get_ve() inside */ -+ -+ ret = register_filesystem(local_fs_type); /* does not get */ -+ if (ret) -+ goto reg_err; -+ -+ mnt = kern_mount(local_fs_type); -+ if (IS_ERR(mnt)) -+ goto mnt_err; -+ -+ /* Usage counters after succesful execution kern_mount: -+ * local_fs_type - +1 (get_fs_type,get_sb_single,put_filesystem) -+ * mnt - +1 == 1 (alloc_vfsmnt) -+ */ -+ -+ *p_fs_type = local_fs_type; -+ *p_mnt = mnt; -+ return 0; -+ -+mnt_err: -+ ret = PTR_ERR(mnt); -+ unregister_filesystem(local_fs_type); /* does not put */ -+ -+reg_err: -+ put_filesystem(local_fs_type); -+ kfree(local_fs_type); -+ printk(KERN_DEBUG -+ "register_ve_fs_type(\"%s\") err=%d\n", template->name, ret); -+ return ret; -+} -+ -+static void umount_ve_fs_type(struct file_system_type *local_fs_type) -+{ -+ struct vfsmount *mnt; -+ struct list_head *p, *q; -+ LIST_HEAD(kill); -+ -+ down_write(¤t->namespace->sem); -+ spin_lock(&vfsmount_lock); -+ list_for_each_safe(p, q, ¤t->namespace->list) { -+ mnt = list_entry(p, struct vfsmount, mnt_list); -+ if (mnt->mnt_sb->s_type != local_fs_type) -+ continue; -+ list_del(p); -+ list_add(p, &kill); -+ } -+ -+ while (!list_empty(&kill)) { -+ mnt = list_entry(kill.next, struct vfsmount, mnt_list); -+ umount_tree(mnt); -+ } -+ spin_unlock(&vfsmount_lock); -+ up_write(¤t->namespace->sem); -+} -+ -+static void unregister_ve_fs_type(struct file_system_type *local_fs_type, -+ struct vfsmount *local_fs_mount) -+{ -+ if (local_fs_mount == NULL || -+ local_fs_type == NULL) { -+ if (local_fs_mount != NULL || -+ local_fs_type != NULL) -+ BUG(); -+ return; -+ } -+ -+ VZTRACE("unregister_ve_fs_type(\"%s\")\n", local_fs_type->name); -+ -+ unregister_filesystem(local_fs_type); -+ umount_ve_fs_type(local_fs_type); -+ kern_umount(local_fs_mount); /* alias to mntput, drop our ref */ -+ put_filesystem(local_fs_type); -+} -+ 
-+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * FS-related helpers to VE start/stop -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+#ifdef CONFIG_SYSCTL -+static ctl_table ve_sysctl_tables[] = { -+ /* kernel */ -+ { -+ .ctl_name = CTL_KERN, -+ .procname = "kernel", -+ .mode = 0555, -+ .child = &ve_sysctl_tables[2], -+ }, -+ { .ctl_name = 0 }, -+ /* kernel/[vars] */ -+ { -+ .ctl_name = KERN_NODENAME, -+ .procname = "hostname", -+ .maxlen = 64, -+ .mode = 0644, -+ .proc_handler = &proc_doutsstring, -+ .strategy = &sysctl_string, -+ }, -+ { -+ .ctl_name = KERN_DOMAINNAME, -+ .procname = "domainname", -+ .maxlen = 64, -+ .mode = 0644, -+ .proc_handler = &proc_doutsstring, -+ .strategy = &sysctl_string, -+ }, -+ { -+ .ctl_name = KERN_SHMMAX, -+ .procname = "shmmax", -+ .maxlen = sizeof(size_t), -+ .mode = 0644, -+ .proc_handler = &proc_doulongvec_minmax, -+ }, -+ { -+ .ctl_name = KERN_SHMALL, -+ .procname = "shmall", -+ .maxlen = sizeof(size_t), -+ .mode = 0644, -+ .proc_handler = &proc_doulongvec_minmax, -+ }, -+ { -+ .ctl_name = KERN_SHMMNI, -+ .procname = "shmmni", -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+ { -+ .ctl_name = KERN_MSGMAX, -+ .procname = "msgmax", -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+ { -+ .ctl_name = KERN_MSGMNI, -+ .procname = "msgmni", -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+ { -+ .ctl_name = KERN_MSGMNB, -+ .procname = "msgmnb", -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, -+ { -+ .ctl_name = KERN_SEM, -+ .procname = "sem", -+ .maxlen = 4 * sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec -+ }, -+ { .ctl_name = 0, } -+}; -+ -+static int 
register_ve_sysctltables(struct ve_struct *ve) -+{ -+ struct ctl_table_header *header; -+ ctl_table *root, *table; -+ -+ VZTRACE("register_ve_sysctltables\n"); -+ -+ root = clone_sysctl_template(ve_sysctl_tables, -+ sizeof(ve_sysctl_tables) / sizeof(ctl_table)); -+ if (root == NULL) -+ goto out; -+ -+ table = root->child; -+ table[0].data = &ve->utsname->nodename; -+ table[1].data = &ve->utsname->domainname; -+ table[2].data = &ve->_shm_ctlmax; -+ table[3].data = &ve->_shm_ctlall; -+ table[4].data = &ve->_shm_ctlmni; -+ table[5].data = &ve->_msg_ctlmax; -+ table[6].data = &ve->_msg_ctlmni; -+ table[7].data = &ve->_msg_ctlmnb; -+ table[8].data = &ve->_sem_ctls[0]; -+ -+ /* insert at head to override kern entries */ -+ header = register_sysctl_table(root, 1); -+ if (header == NULL) -+ goto out_free; -+ -+ ve->kern_header = header; -+ ve->kern_table = root; -+ return 0; -+ -+out_free: -+ free_sysctl_clone(root); -+out: -+ return -ENOMEM; -+} -+ -+static inline void unregister_ve_sysctltables(struct ve_struct *ve) -+{ -+ unregister_sysctl_table(ve->kern_header); -+} -+ -+static inline void free_ve_sysctltables(struct ve_struct *ve) -+{ -+ free_sysctl_clone(ve->kern_table); -+} -+#endif -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * VE start: subsystems -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#include <net/ip.h> -+#include <net/tcp.h> -+#include <net/udp.h> -+#include <net/icmp.h> -+ -+extern struct new_utsname virt_utsname; -+ -+static int init_ve_utsname(struct ve_struct *ve) -+{ -+ ve->utsname = kmalloc(sizeof(*ve->utsname), GFP_KERNEL); -+ if (ve->utsname == NULL) -+ return -ENOMEM; -+ -+ down_read(&uts_sem); /* protect the source */ -+ memcpy(ve->utsname, 
&system_utsname, sizeof(*ve->utsname)); -+ memcpy(ve->utsname->release, virt_utsname.release, -+ sizeof(virt_utsname.release)); -+ up_read(&uts_sem); -+ -+ return 0; -+} -+ -+static void free_ve_utsname(struct ve_struct *ve) -+{ -+ kfree(ve->utsname); -+ ve->utsname = NULL; -+} -+ -+static int init_fini_ve_mibs(struct ve_struct *ve, int fini) -+{ -+ if (fini) -+ goto fini; -+ if (!(ve->_net_statistics[0] = alloc_percpu(struct linux_mib))) -+ goto out1; -+ if (!(ve->_net_statistics[1] = alloc_percpu(struct linux_mib))) -+ goto out2; -+ if (!(ve->_ip_statistics[0] = alloc_percpu(struct ipstats_mib))) -+ goto out3; -+ if (!(ve->_ip_statistics[1] = alloc_percpu(struct ipstats_mib))) -+ goto out4; -+ if (!(ve->_icmp_statistics[0] = alloc_percpu(struct icmp_mib))) -+ goto out5; -+ if (!(ve->_icmp_statistics[1] = alloc_percpu(struct icmp_mib))) -+ goto out6; -+ if (!(ve->_tcp_statistics[0] = alloc_percpu(struct tcp_mib))) -+ goto out7; -+ if (!(ve->_tcp_statistics[1] = alloc_percpu(struct tcp_mib))) -+ goto out8; -+ if (!(ve->_udp_statistics[0] = alloc_percpu(struct udp_mib))) -+ goto out9; -+ if (!(ve->_udp_statistics[1] = alloc_percpu(struct udp_mib))) -+ goto out10; -+ return 0; -+fini: -+ free_percpu(ve->_udp_statistics[1]); -+out10: -+ free_percpu(ve->_udp_statistics[0]); -+out9: -+ free_percpu(ve->_tcp_statistics[1]); -+out8: -+ free_percpu(ve->_tcp_statistics[0]); -+out7: -+ free_percpu(ve->_icmp_statistics[1]); -+out6: -+ free_percpu(ve->_icmp_statistics[0]); -+out5: -+ free_percpu(ve->_ip_statistics[1]); -+out4: -+ free_percpu(ve->_ip_statistics[0]); -+out3: -+ free_percpu(ve->_net_statistics[1]); -+out2: -+ free_percpu(ve->_net_statistics[0]); -+out1: -+ return -ENOMEM; -+} -+ -+static inline int init_ve_mibs(struct ve_struct *ve) -+{ -+ return init_fini_ve_mibs(ve, 0); -+} -+ -+static inline void fini_ve_mibs(struct ve_struct *ve) -+{ -+ (void)init_fini_ve_mibs(ve, 1); -+} -+ -+extern struct net_device templ_loopback_dev; -+static void veloop_setup(struct 
net_device *dev) -+{ -+ int padded; -+ padded = dev->padded; -+ memcpy(dev, &templ_loopback_dev, sizeof(struct net_device)); -+ dev->padded = padded; -+} -+ -+static int init_ve_netdev(void) -+{ -+ struct ve_struct *ve; -+ struct net_device_stats *stats; -+ int err; -+ -+ ve = get_exec_env(); -+ INIT_HLIST_HEAD(&ve->_net_dev_head); -+ ve->_net_dev_base = NULL; -+ ve->_net_dev_tail = &ve->_net_dev_base; -+ -+ ve->_loopback_dev = alloc_netdev(0, templ_loopback_dev.name, -+ veloop_setup); -+ if (ve->_loopback_dev == NULL) -+ return -ENOMEM; -+ if (loopback_dev.get_stats != NULL) { -+ stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL); -+ if (stats != NULL) { -+ memset(stats, 0, sizeof(struct net_device_stats)); -+ ve->_loopback_dev->priv = stats; -+ ve->_loopback_dev->get_stats = loopback_dev.get_stats; -+ ve->_loopback_dev->destructor = loopback_dev.destructor; -+ } -+ } -+ err = register_netdev(ve->_loopback_dev); -+ if (err) { -+ if (ve->_loopback_dev->priv != NULL) -+ kfree(ve->_loopback_dev->priv); -+ free_netdev(ve->_loopback_dev); -+ } -+ return err; -+} -+ -+static void fini_ve_netdev(void) -+{ -+ struct ve_struct *ve; -+ struct net_device *dev; -+ -+ ve = get_exec_env(); -+ while (1) { -+ rtnl_lock(); -+ /* -+ * loopback is special, it can be referenced in fib's, -+ * so it must be freed the last. Doing so is -+ * sufficient to guarantee absence of such references. 
-+ */ -+ if (dev_base == ve->_loopback_dev) -+ dev = dev_base->next; -+ else -+ dev = dev_base; -+ if (dev == NULL) -+ break; -+ unregister_netdevice(dev); -+ rtnl_unlock(); -+ free_netdev(dev); -+ } -+ unregister_netdevice(ve->_loopback_dev); -+ rtnl_unlock(); -+ free_netdev(ve->_loopback_dev); -+ ve->_loopback_dev = NULL; -+} -+#else -+#define init_ve_mibs(ve) (0) -+#define fini_ve_mibs(ve) do { } while (0) -+#define init_ve_netdev() (0) -+#define fini_ve_netdev() do { } while (0) -+#endif -+ -+static int prepare_proc_root(struct ve_struct *ve) -+{ -+ struct proc_dir_entry *de; -+ -+ de = kmalloc(sizeof(struct proc_dir_entry) + 6, GFP_KERNEL); -+ if (de == NULL) -+ return -ENOMEM; -+ memset(de, 0, sizeof(struct proc_dir_entry)); -+ memcpy(de + 1, "/proc", 6); -+ de->name = (char *)(de + 1); -+ de->namelen = 5; -+ de->mode = S_IFDIR | S_IRUGO | S_IXUGO; -+ de->nlink = 2; -+ atomic_set(&de->count, 1); -+ -+ ve->proc_root = de; -+ return 0; -+} -+ -+#ifdef CONFIG_PROC_FS -+static int init_ve_proc(struct ve_struct *ve) -+{ -+ int err; -+ struct proc_dir_entry *de; -+ -+ err = prepare_proc_root(ve); -+ if (err) -+ goto out_root; -+ -+ err = register_ve_fs_type(ve, &proc_fs_type, -+ &ve->proc_fstype, &ve->proc_mnt); -+ if (err) -+ goto out_reg; -+ -+ /* create /proc/vz in VE local proc tree */ -+ err = -ENOMEM; -+ de = create_proc_entry("vz", S_IFDIR|S_IRUGO|S_IXUGO, NULL); -+ if (!de) -+ goto out_vz; -+ -+ return 0; -+ -+out_vz: -+ unregister_ve_fs_type(ve->proc_fstype, ve->proc_mnt); -+ ve->proc_mnt = NULL; -+out_reg: -+ /* proc_fstype and proc_root are freed in real_put_ve -> free_ve_proc */ -+ ; -+out_root: -+ return err; -+} -+ -+static void fini_ve_proc(struct ve_struct *ve) -+{ -+ remove_proc_entry("vz", NULL); -+ unregister_ve_fs_type(ve->proc_fstype, ve->proc_mnt); -+ ve->proc_mnt = NULL; -+} -+ -+static void free_ve_proc(struct ve_struct *ve) -+{ -+ /* proc filesystem frees proc_dir_entries on remove_proc_entry() only, -+ so we check that everything was 
removed and not lost */ -+ if (ve->proc_root && ve->proc_root->subdir) { -+ struct proc_dir_entry *p = ve->proc_root; -+ printk(KERN_WARNING "VPS: %d: proc entry /proc", ve->veid); -+ while ((p = p->subdir) != NULL) -+ printk("/%s", p->name); -+ printk(" is not removed!\n"); -+ } -+ -+ kfree(ve->proc_root); -+ kfree(ve->proc_fstype); -+ -+ ve->proc_fstype = NULL; -+ ve->proc_root = NULL; -+} -+#else -+#define init_ve_proc(ve) (0) -+#define fini_ve_proc(ve) do { } while (0) -+#define free_ve_proc(ve) do { } while (0) -+#endif -+ -+#ifdef CONFIG_SYSCTL -+static int init_ve_sysctl(struct ve_struct *ve) -+{ -+ int err; -+ -+#ifdef CONFIG_PROC_FS -+ err = -ENOMEM; -+ ve->proc_sys_root = proc_mkdir("sys", 0); -+ if (ve->proc_sys_root == NULL) -+ goto out_proc; -+#endif -+ INIT_LIST_HEAD(&ve->sysctl_lh); -+ err = register_ve_sysctltables(ve); -+ if (err) -+ goto out_reg; -+ -+ err = devinet_sysctl_init(ve); -+ if (err) -+ goto out_dev; -+ -+ return 0; -+ -+out_dev: -+ unregister_ve_sysctltables(ve); -+ free_ve_sysctltables(ve); -+out_reg: -+#ifdef CONFIG_PROC_FS -+ remove_proc_entry("sys", NULL); -+out_proc: -+#endif -+ return err; -+} -+ -+static void fini_ve_sysctl(struct ve_struct *ve) -+{ -+ devinet_sysctl_fini(ve); -+ unregister_ve_sysctltables(ve); -+ remove_proc_entry("sys", NULL); -+} -+ -+static void free_ve_sysctl(struct ve_struct *ve) -+{ -+ devinet_sysctl_free(ve); -+ free_ve_sysctltables(ve); -+} -+#else -+#define init_ve_sysctl(ve) (0) -+#define fini_ve_sysctl(ve) do { } while (0) -+#define free_ve_sysctl(ve) do { } while (0) -+#endif -+ -+#ifdef CONFIG_UNIX98_PTYS -+#include <linux/devpts_fs.h> -+ -+static int init_ve_devpts(struct ve_struct *ve) -+{ -+ int err; -+ -+ err = -ENOMEM; -+ ve->devpts_config = kmalloc(sizeof(struct devpts_config), GFP_KERNEL); -+ if (ve->devpts_config == NULL) -+ goto out; -+ memset(ve->devpts_config, 0, sizeof(struct devpts_config)); -+ ve->devpts_config->mode = 0600; -+ err = register_ve_fs_type(ve, &devpts_fs_type, -+ 
&ve->devpts_fstype, &ve->devpts_mnt); -+ if (err) { -+ kfree(ve->devpts_config); -+ ve->devpts_config = NULL; -+ } -+out: -+ return err; -+} -+ -+static void fini_ve_devpts(struct ve_struct *ve) -+{ -+ unregister_ve_fs_type(ve->devpts_fstype, ve->devpts_mnt); -+ /* devpts_fstype is freed in real_put_ve -> free_ve_filesystems */ -+ ve->devpts_mnt = NULL; -+ kfree(ve->devpts_config); -+ ve->devpts_config = NULL; -+} -+#else -+#define init_ve_devpts(ve) (0) -+#define fini_ve_devpts(ve) do { } while (0) -+#endif -+ -+static int init_ve_shmem(struct ve_struct *ve) -+{ -+ return register_ve_fs_type(ve, -+ &tmpfs_fs_type, -+ &ve->shmem_fstype, -+ &ve->shmem_mnt); -+} -+ -+static void fini_ve_shmem(struct ve_struct *ve) -+{ -+ unregister_ve_fs_type(ve->shmem_fstype, ve->shmem_mnt); -+ /* shmem_fstype is freed in real_put_ve -> free_ve_filesystems */ -+ ve->shmem_mnt = NULL; -+} -+ -+static int init_ve_sysfs(struct ve_struct *ve) -+{ -+ struct subsystem *subsys; -+ struct class *nc; -+ int err; -+ extern struct subsystem class_obj_subsys; -+ extern struct subsystem class_subsys; -+ extern struct class net_class; -+ -+#ifdef CONFIG_SYSFS -+ err = 0; -+ if (ve->features & VE_FEATURE_SYSFS) -+ err = register_ve_fs_type(ve, -+ &sysfs_fs_type, -+ &ve->sysfs_fstype, -+ &ve->sysfs_mnt); -+ if (err != 0) -+ goto out_fs_type; -+#endif -+ err = -ENOMEM; -+ subsys = kmalloc(sizeof(*subsys), GFP_KERNEL); -+ if (subsys == NULL) -+ goto out_class_obj; -+ /* ick, this is ugly, the things we go through to keep from showing up -+ * in sysfs... 
*/ -+ memset(subsys, 0, sizeof(*subsys)); -+ memcpy(&subsys->kset.kobj.name, &class_obj_subsys.kset.kobj.name, -+ sizeof(subsys->kset.kobj.name)); -+ subsys->kset.ktype = class_obj_subsys.kset.ktype; -+ subsys->kset.hotplug_ops = class_obj_subsys.kset.hotplug_ops; -+ subsystem_init(subsys); -+ if (!subsys->kset.subsys) -+ subsys->kset.subsys = subsys; -+ ve->class_obj_subsys = subsys; -+ -+ err = -ENOMEM; -+ subsys = kmalloc(sizeof(*subsys), GFP_KERNEL); -+ if (subsys == NULL) -+ goto out_class_subsys; -+ /* ick, this is ugly, the things we go through to keep from showing up -+ * in sysfs... */ -+ memset(subsys, 0, sizeof(*subsys)); -+ memcpy(&subsys->kset.kobj.name, &class_subsys.kset.kobj.name, -+ sizeof(subsys->kset.kobj.name)); -+ subsys->kset.ktype = class_subsys.kset.ktype; -+ subsys->kset.hotplug_ops = class_subsys.kset.hotplug_ops; -+ ve->class_subsys = subsys; -+ err = subsystem_register(subsys); -+ if (err != 0) -+ goto out_register; -+ -+ err = -ENOMEM; -+ nc = kmalloc(sizeof(*nc), GFP_KERNEL); -+ if (nc == NULL) -+ goto out_nc; -+ memset(nc, 0, sizeof(*nc)); -+ nc->name = net_class.name; -+ nc->release = net_class.release; -+ nc->hotplug = net_class.hotplug; -+ err = class_register(nc); -+ if (err != 0) -+ goto out_class_register; -+ ve->net_class = nc; -+ -+ return err; -+ -+out_class_register: -+ kfree(nc); -+out_nc: -+ subsystem_unregister(subsys); -+out_register: -+ kfree(ve->class_subsys); -+out_class_subsys: -+ kfree(ve->class_obj_subsys); -+out_class_obj: -+#ifdef CONFIG_SYSFS -+ unregister_ve_fs_type(ve->sysfs_fstype, ve->sysfs_mnt); -+ /* sysfs_fstype is freed in real_put_ve -> free_ve_filesystems */ -+out_fs_type: -+#endif -+ ve->class_subsys = NULL; -+ ve->class_obj_subsys = NULL; -+ return err; -+} -+ -+static void fini_ve_sysfs(struct ve_struct *ve) -+{ -+ class_unregister(ve->net_class); -+ subsystem_unregister(ve->class_subsys); -+ -+ kfree(ve->net_class); -+ kfree(ve->class_subsys); -+ kfree(ve->class_obj_subsys); -+ -+ ve->net_class = 
NULL; -+ ve->class_subsys = NULL; -+ ve->class_obj_subsys = NULL; -+#ifdef CONFIG_SYSFS -+ unregister_ve_fs_type(ve->sysfs_fstype, ve->sysfs_mnt); -+ ve->sysfs_mnt = NULL; -+ /* sysfs_fstype is freed in real_put_ve -> free_ve_filesystems */ -+#endif -+} -+ -+static void free_ve_filesystems(struct ve_struct *ve) -+{ -+#ifdef CONFIG_SYSFS -+ kfree(ve->sysfs_fstype); -+ ve->sysfs_fstype = NULL; -+#endif -+ kfree(ve->shmem_fstype); -+ ve->shmem_fstype = NULL; -+ -+ kfree(ve->devpts_fstype); -+ ve->devpts_fstype = NULL; -+ -+ free_ve_proc(ve); -+} -+ -+static int init_printk(struct ve_struct *ve) -+{ -+ struct ve_prep_printk { -+ wait_queue_head_t log_wait; -+ unsigned long log_start; -+ unsigned long log_end; -+ unsigned long logged_chars; -+ } *tmp; -+ -+ tmp = kmalloc(sizeof(struct ve_prep_printk), GFP_KERNEL); -+ if (!tmp) -+ return -ENOMEM; -+ memset(tmp, 0, sizeof(struct ve_prep_printk)); -+ init_waitqueue_head(&tmp->log_wait); -+ ve->_log_wait = &tmp->log_wait; -+ ve->_log_start = &tmp->log_start; -+ ve->_log_end = &tmp->log_end; -+ ve->_logged_chars = &tmp->logged_chars; -+ /* ve->log_buf will be initialized later by ve_log_init() */ -+ return 0; -+} -+ -+static void fini_printk(struct ve_struct *ve) -+{ -+ /* -+ * there is no spinlock protection here because nobody can use -+ * log_buf at the moments when this code is called. -+ */ -+ kfree(ve->log_buf); -+ kfree(ve->_log_wait); -+} -+ -+static void fini_venet(struct ve_struct *ve) -+{ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ tcp_v4_kill_ve_sockets(ve); -+#endif -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ ve_mapped_devs_cleanup(ve); -+#endif -+} -+ -+static int init_ve_sched(struct ve_struct *ve) -+{ -+#ifdef CONFIG_FAIRSCHED -+ int err; -+ -+ /* -+ * We refuse to switch to an already existing node since nodes -+ * keep a pointer to their ve_struct... 
-+ */ -+ err = sys_fairsched_mknod(0, 1, ve->veid); -+ if (err < 0) { -+ printk(KERN_WARNING "Can't create fairsched node %d\n", -+ ve->veid); -+ return err; -+ } -+ err = sys_fairsched_mvpr(current->pid, ve->veid); -+ if (err) { -+ printk(KERN_WARNING "Can't switch to fairsched node %d\n", -+ ve->veid); -+ if (sys_fairsched_rmnod(ve->veid)) -+ printk(KERN_ERR "Can't clean fairsched node %d\n", -+ ve->veid); -+ return err; -+ } -+#endif -+ ve_sched_attach(ve); -+ return 0; -+} -+ -+static void fini_ve_sched(struct ve_struct *ve) -+{ -+#ifdef CONFIG_FAIRSCHED -+ if (task_vsched_id(current) == ve->veid) -+ if (sys_fairsched_mvpr(current->pid, fairsched_init_node.id)) -+ printk(KERN_WARNING "Can't leave fairsched node %d\n", -+ ve->veid); -+ if (sys_fairsched_rmnod(ve->veid)) -+ printk(KERN_ERR "Can't remove fairsched node %d\n", -+ ve->veid); -+#endif -+} -+ -+static int init_ve_struct(struct ve_struct *ve, envid_t veid, -+ u32 class_id, env_create_param_t *data, -+ struct task_struct *init_tsk) -+{ -+ int n; -+ -+ memset(ve, 0, sizeof(*ve)); -+ (void)get_ve(ve); -+ ve->veid = veid; -+ ve->class_id = class_id; -+ ve->init_entry = init_tsk; -+ ve->features = data->feature_mask; -+ INIT_LIST_HEAD(&ve->vetask_lh); -+ init_rwsem(&ve->op_sem); -+ ve->ifindex = -1; -+ -+ for(n = 0; n < UIDHASH_SZ_VE; ++n) -+ INIT_LIST_HEAD(&ve->uidhash_table[n]); -+ -+ do_posix_clock_monotonic_gettime(&ve->start_timespec); -+ ve->start_jiffies = jiffies; -+ ve->start_cycles = get_cycles(); -+ ve->virt_pids = glob_virt_pids; -+ -+ return 0; -+} -+ -+static void set_ve_root(struct ve_struct *ve, struct task_struct *tsk) -+{ -+ read_lock(&tsk->fs->lock); -+ ve->fs_rootmnt = tsk->fs->rootmnt; -+ ve->fs_root = tsk->fs->root; -+ read_unlock(&tsk->fs->lock); -+ mark_tree_virtual(ve->fs_rootmnt, ve->fs_root); -+} -+ -+static void set_ve_caps(struct ve_struct *ve, struct task_struct *tsk) -+{ -+ /* required for real_setdevperms from register_ve_<fs> above */ -+ memcpy(&ve->cap_default, 
&tsk->cap_effective, sizeof(kernel_cap_t)); -+ cap_lower(ve->cap_default, CAP_SETVEID); -+} -+ -+static int ve_list_add(struct ve_struct *ve) -+{ -+ write_lock_irq(&ve_list_guard); -+ if (__find_ve_by_id(ve->veid) != NULL) -+ goto err_exists; -+ -+ ve->prev = NULL; -+ ve->next = ve_list_head; -+ if (ve_list_head) -+ ve_list_head->prev = ve; -+ ve_list_head = ve; -+ nr_ve++; -+ write_unlock_irq(&ve_list_guard); -+ return 0; -+ -+err_exists: -+ write_unlock_irq(&ve_list_guard); -+ return -EEXIST; -+} -+ -+static void ve_list_del(struct ve_struct *ve) -+{ -+ write_lock_irq(&ve_list_guard); -+ if (ve->prev) -+ ve->prev->next = ve->next; -+ else -+ ve_list_head = ve->next; -+ if (ve->next) -+ ve->next->prev = ve->prev; -+ nr_ve--; -+ write_unlock_irq(&ve_list_guard); -+} -+ -+static void set_task_ve_caps(struct task_struct *tsk, struct ve_struct *ve) -+{ -+ spin_lock(&task_capability_lock); -+ cap_mask(tsk->cap_effective, ve->cap_default); -+ cap_mask(tsk->cap_inheritable, ve->cap_default); -+ cap_mask(tsk->cap_permitted, ve->cap_default); -+ spin_unlock(&task_capability_lock); -+} -+ -+static void move_task(struct task_struct *tsk, struct ve_struct *new, -+ struct ve_struct *old) -+{ -+ /* this prohibits ptracing of tasks entered to VPS from the host system */ -+ tsk->mm->vps_dumpable = 0; -+ /* setup capabilities before enter */ -+ set_task_ve_caps(tsk, new); -+ -+ write_lock_irq(&tasklist_lock); -+ VE_TASK_INFO(tsk)->owner_env = new; -+ VE_TASK_INFO(tsk)->exec_env = new; -+ REMOVE_VE_LINKS(tsk); -+ SET_VE_LINKS(tsk); -+ -+ atomic_dec(&old->pcounter); -+ atomic_inc(&new->pcounter); -+ real_put_ve(old); -+ get_ve(new); -+ write_unlock_irq(&tasklist_lock); -+} -+ -+#if (defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE)) && \ -+ defined(CONFIG_NETFILTER) && defined(CONFIG_VE_IPTABLES) -+extern int init_netfilter(void); -+extern void fini_netfilter(void); -+#define init_ve_netfilter() init_netfilter() -+#define fini_ve_netfilter() fini_netfilter() -+#else 
-+#define init_ve_netfilter() (0) -+#define fini_ve_netfilter() do { } while (0) -+#endif -+ -+#define KSYMIPTINIT(mask, ve, full_mask, mod, name, args) \ -+({ \ -+ int ret = 0; \ -+ if (VE_IPT_CMP(mask, full_mask) && \ -+ VE_IPT_CMP((ve)->_iptables_modules, \ -+ full_mask & ~(full_mask##_MOD))) { \ -+ ret = KSYMERRCALL(1, mod, name, args); \ -+ if (ret == 0) \ -+ (ve)->_iptables_modules |= \ -+ full_mask##_MOD; \ -+ if (ret == 1) \ -+ ret = 0; \ -+ } \ -+ ret; \ -+}) -+ -+#define KSYMIPTFINI(mask, full_mask, mod, name, args) \ -+({ \ -+ if (VE_IPT_CMP(mask, full_mask##_MOD)) \ -+ KSYMSAFECALL_VOID(mod, name, args); \ -+}) -+ -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static int do_ve_iptables(struct ve_struct *ve, __u64 init_mask, -+ int init_or_cleanup) -+{ -+ int err; -+ -+ err = 0; -+ if (!init_or_cleanup) -+ goto cleanup; -+ -+ /* init part */ -+#if defined(CONFIG_IP_NF_IPTABLES) || \ -+ defined(CONFIG_IP_NF_IPTABLES_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_IPTABLES, -+ ip_tables, init_iptables, ()); -+ if (err < 0) -+ goto err_iptables; -+#endif -+#if defined(CONFIG_IP_NF_CONNTRACK) || \ -+ defined(CONFIG_IP_NF_CONNTRACK_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_CONNTRACK, -+ ip_conntrack, init_iptable_conntrack, ()); -+ if (err < 0) -+ goto err_iptable_conntrack; -+#endif -+#if defined(CONFIG_IP_NF_FTP) || \ -+ defined(CONFIG_IP_NF_FTP_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_CONNTRACK_FTP, -+ ip_conntrack_ftp, init_iptable_ftp, ()); -+ if (err < 0) -+ goto err_iptable_ftp; -+#endif -+#if defined(CONFIG_IP_NF_IRC) || \ -+ defined(CONFIG_IP_NF_IRC_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_CONNTRACK_IRC, -+ ip_conntrack_irc, init_iptable_irc, ()); -+ if (err < 0) -+ goto err_iptable_irc; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_CONNTRACK) || \ -+ defined(CONFIG_IP_NF_MATCH_CONNTRACK_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_CONNTRACK, -+ ipt_conntrack, init_iptable_conntrack_match, 
()); -+ if (err < 0) -+ goto err_iptable_conntrack_match; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_STATE) || \ -+ defined(CONFIG_IP_NF_MATCH_STATE_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_STATE, -+ ipt_state, init_iptable_state, ()); -+ if (err < 0) -+ goto err_iptable_state; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_HELPER) || \ -+ defined(CONFIG_IP_NF_MATCH_HELPER_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_HELPER, -+ ipt_helper, init_iptable_helper, ()); -+ if (err < 0) -+ goto err_iptable_helper; -+#endif -+#if defined(CONFIG_IP_NF_NAT) || \ -+ defined(CONFIG_IP_NF_NAT_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_NAT, -+ iptable_nat, init_iptable_nat, ()); -+ if (err < 0) -+ goto err_iptable_nat; -+#endif -+#if defined(CONFIG_IP_NF_NAT_FTP) || \ -+ defined(CONFIG_IP_NF_NAT_FTP_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_NAT_FTP, -+ ip_nat_ftp, init_iptable_nat_ftp, ()); -+ if (err < 0) -+ goto err_iptable_nat_ftp; -+#endif -+#if defined(CONFIG_IP_NF_NAT_IRC) || \ -+ defined(CONFIG_IP_NF_NAT_IRC_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_NAT_IRC, -+ ip_nat_irc, init_iptable_nat_irc, ()); -+ if (err < 0) -+ goto err_iptable_nat_irc; -+#endif -+#if defined(CONFIG_IP_NF_FILTER) || \ -+ defined(CONFIG_IP_NF_FILTER_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_FILTER, -+ iptable_filter, init_iptable_filter, ()); -+ if (err < 0) -+ goto err_iptable_filter; -+#endif -+#if defined(CONFIG_IP_NF_MANGLE) || \ -+ defined(CONFIG_IP_NF_MANGLE_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MANGLE, -+ iptable_mangle, init_iptable_mangle, ()); -+ if (err < 0) -+ goto err_iptable_mangle; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_LIMIT) || \ -+ defined(CONFIG_IP_NF_MATCH_LIMIT_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_LIMIT, -+ ipt_limit, init_iptable_limit, ()); -+ if (err < 0) -+ goto err_iptable_limit; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_MULTIPORT) || \ -+ defined(CONFIG_IP_NF_MATCH_MULTIPORT_MODULE) -+ 
err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_MULTIPORT, -+ ipt_multiport, init_iptable_multiport, ()); -+ if (err < 0) -+ goto err_iptable_multiport; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TOS) || \ -+ defined(CONFIG_IP_NF_MATCH_TOS_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_TOS, -+ ipt_tos, init_iptable_tos, ()); -+ if (err < 0) -+ goto err_iptable_tos; -+#endif -+#if defined(CONFIG_IP_NF_TARGET_TOS) || \ -+ defined(CONFIG_IP_NF_TARGET_TOS_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_TARGET_TOS, -+ ipt_TOS, init_iptable_TOS, ()); -+ if (err < 0) -+ goto err_iptable_TOS; -+#endif -+#if defined(CONFIG_IP_NF_TARGET_REJECT) || \ -+ defined(CONFIG_IP_NF_TARGET_REJECT_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_TARGET_REJECT, -+ ipt_REJECT, init_iptable_REJECT, ()); -+ if (err < 0) -+ goto err_iptable_REJECT; -+#endif -+#if defined(CONFIG_IP_NF_TARGET_TCPMSS) || \ -+ defined(CONFIG_IP_NF_TARGET_TCPMSS_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_TARGET_TCPMSS, -+ ipt_TCPMSS, init_iptable_TCPMSS, ()); -+ if (err < 0) -+ goto err_iptable_TCPMSS; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TCPMSS) || \ -+ defined(CONFIG_IP_NF_MATCH_TCPMSS_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_TCPMSS, -+ ipt_tcpmss, init_iptable_tcpmss, ()); -+ if (err < 0) -+ goto err_iptable_tcpmss; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TTL) || \ -+ defined(CONFIG_IP_NF_MATCH_TTL_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_TTL, -+ ipt_ttl, init_iptable_ttl, ()); -+ if (err < 0) -+ goto err_iptable_ttl; -+#endif -+#if defined(CONFIG_IP_NF_TARGET_LOG) || \ -+ defined(CONFIG_IP_NF_TARGET_LOG_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_TARGET_LOG, -+ ipt_LOG, init_iptable_LOG, ()); -+ if (err < 0) -+ goto err_iptable_LOG; -+#endif -+#if defined(CONFIG_IP_NF_MATCH_LENGTH) || \ -+ defined(CONFIG_IP_NF_MATCH_LENGTH_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_MATCH_LENGTH, -+ ipt_length, init_iptable_length, ()); -+ if (err < 0) -+ goto 
err_iptable_length; -+#endif -+#if defined(CONFIG_IP_NF_TARGET_REDIRECT) || \ -+ defined(CONFIG_IP_NF_TARGET_REDIRECT_MODULE) -+ err = KSYMIPTINIT(init_mask, ve, VE_IP_TARGET_REDIRECT, -+ ipt_REDIRECT, init_iptable_REDIRECT, ()); -+ if (err < 0) -+ goto err_iptable_REDIRECT; -+#endif -+ return 0; -+ -+/* ------------------------------------------------------------------------- */ -+ -+cleanup: -+#if defined(CONFIG_IP_NF_TARGET_REDIRECT) || \ -+ defined(CONFIG_IP_NF_TARGET_REDIRECT_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_TARGET_REDIRECT, -+ ipt_REDIRECT, fini_iptable_REDIRECT, ()); -+err_iptable_REDIRECT: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_LENGTH) || \ -+ defined(CONFIG_IP_NF_MATCH_LENGTH_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_LENGTH, -+ ipt_length, fini_iptable_length, ()); -+err_iptable_length: -+#endif -+#if defined(CONFIG_IP_NF_TARGET_LOG) || \ -+ defined(CONFIG_IP_NF_TARGET_LOG_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_TARGET_LOG, -+ ipt_LOG, fini_iptable_LOG, ()); -+err_iptable_LOG: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TTL) || \ -+ defined(CONFIG_IP_NF_MATCH_TTL_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_TTL, -+ ipt_ttl, fini_iptable_ttl, ()); -+err_iptable_ttl: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TCPMSS) || \ -+ defined(CONFIG_IP_NF_MATCH_TCPMSS_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_TCPMSS, -+ ipt_tcpmss, fini_iptable_tcpmss, ()); -+err_iptable_tcpmss: -+#endif -+#if defined(CONFIG_IP_NF_TARGET_TCPMSS) || \ -+ defined(CONFIG_IP_NF_TARGET_TCPMSS_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_TARGET_TCPMSS, -+ ipt_TCPMSS, fini_iptable_TCPMSS, ()); -+err_iptable_TCPMSS: -+#endif -+#if defined(CONFIG_IP_NF_TARGET_REJECT) || \ -+ defined(CONFIG_IP_NF_TARGET_REJECT_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_TARGET_REJECT, -+ ipt_REJECT, fini_iptable_REJECT, ()); -+err_iptable_REJECT: -+#endif -+#if defined(CONFIG_IP_NF_TARGET_TOS) || \ -+ 
defined(CONFIG_IP_NF_TARGET_TOS_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_TARGET_TOS, -+ ipt_TOS, fini_iptable_TOS, ()); -+err_iptable_TOS: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_TOS) || \ -+ defined(CONFIG_IP_NF_MATCH_TOS_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_TOS, -+ ipt_tos, fini_iptable_tos, ()); -+err_iptable_tos: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_MULTIPORT) || \ -+ defined(CONFIG_IP_NF_MATCH_MULTIPORT_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_MULTIPORT, -+ ipt_multiport, fini_iptable_multiport, ()); -+err_iptable_multiport: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_LIMIT) || \ -+ defined(CONFIG_IP_NF_MATCH_LIMIT_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_LIMIT, -+ ipt_limit, fini_iptable_limit, ()); -+err_iptable_limit: -+#endif -+#if defined(CONFIG_IP_NF_MANGLE) || \ -+ defined(CONFIG_IP_NF_MANGLE_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MANGLE, -+ iptable_mangle, fini_iptable_mangle, ()); -+err_iptable_mangle: -+#endif -+#if defined(CONFIG_IP_NF_FILTER) || \ -+ defined(CONFIG_IP_NF_FILTER_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_FILTER, -+ iptable_filter, fini_iptable_filter, ()); -+err_iptable_filter: -+#endif -+#if defined(CONFIG_IP_NF_NAT_IRC) || \ -+ defined(CONFIG_IP_NF_NAT_IRC_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_NAT_IRC, -+ ip_nat_irc, fini_iptable_nat_irc, ()); -+err_iptable_nat_irc: -+#endif -+#if defined(CONFIG_IP_NF_NAT_FTP) || \ -+ defined(CONFIG_IP_NF_NAT_FTP_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_NAT_FTP, -+ ip_nat_ftp, fini_iptable_nat_ftp, ()); -+err_iptable_nat_ftp: -+#endif -+#if defined(CONFIG_IP_NF_NAT) || \ -+ defined(CONFIG_IP_NF_NAT_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_NAT, -+ iptable_nat, fini_iptable_nat, ()); -+err_iptable_nat: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_HELPER) || \ -+ defined(CONFIG_IP_NF_MATCH_HELPER_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_HELPER, -+ ipt_helper, 
fini_iptable_helper, ()); -+err_iptable_helper: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_STATE) || \ -+ defined(CONFIG_IP_NF_MATCH_STATE_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_STATE, -+ ipt_state, fini_iptable_state, ()); -+err_iptable_state: -+#endif -+#if defined(CONFIG_IP_NF_MATCH_CONNTRACK) || \ -+ defined(CONFIG_IP_NF_MATCH_CONNTRACK_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MATCH_CONNTRACK, -+ ipt_conntrack, fini_iptable_conntrack_match, ()); -+err_iptable_conntrack_match: -+#endif -+#if defined(CONFIG_IP_NF_IRC) || \ -+ defined(CONFIG_IP_NF_IRC_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_CONNTRACK_IRC, -+ ip_conntrack_irc, fini_iptable_irc, ()); -+err_iptable_irc: -+#endif -+#if defined(CONFIG_IP_NF_FTP) || \ -+ defined(CONFIG_IP_NF_FTP_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_CONNTRACK_FTP, -+ ip_conntrack_ftp, fini_iptable_ftp, ()); -+err_iptable_ftp: -+#endif -+#if defined(CONFIG_IP_NF_CONNTRACK) || \ -+ defined(CONFIG_IP_NF_CONNTRACK_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_CONNTRACK, -+ ip_conntrack, fini_iptable_conntrack, ()); -+err_iptable_conntrack: -+#endif -+#if defined(CONFIG_IP_NF_IPTABLES) || \ -+ defined(CONFIG_IP_NF_IPTABLES_MODULE) -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_IPTABLES, -+ ip_tables, fini_iptables, ()); -+err_iptables: -+#endif -+ ve->_iptables_modules = 0; -+ -+ return err; -+} -+#else -+#define do_ve_iptables(ve, initmask, init) (0) -+#endif -+ -+static inline int init_ve_iptables(struct ve_struct *ve, __u64 init_mask) -+{ -+ return do_ve_iptables(ve, init_mask, 1); -+} -+ -+static inline void fini_ve_iptables(struct ve_struct *ve, __u64 init_mask) -+{ -+ (void)do_ve_iptables(ve, init_mask, 0); -+} -+ -+static void flush_ve_iptables(struct ve_struct *ve) -+{ -+ /* -+ * flush all rule tables first, -+ * this helps us to avoid refs to freed objs -+ */ -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_MANGLE, ip_tables, -+ ipt_flush_table, (ve->_ipt_mangle_table)); -+ 
KSYMIPTFINI(ve->_iptables_modules, VE_IP_FILTER, ip_tables, -+ ipt_flush_table, (ve->_ve_ipt_filter_pf)); -+ KSYMIPTFINI(ve->_iptables_modules, VE_IP_NAT, ip_tables, -+ ipt_flush_table, (ve->_ip_conntrack->_ip_nat_table)); -+} -+ -+static struct list_head ve_hooks[VE_MAX_HOOKS]; -+static DECLARE_RWSEM(ve_hook_sem); -+ -+int ve_hook_register(struct ve_hook *vh) -+{ -+ struct list_head *lh; -+ struct ve_hook *tmp; -+ -+ down_write(&ve_hook_sem); -+ list_for_each(lh, &ve_hooks[vh->hooknum]) { -+ tmp = list_entry(lh, struct ve_hook, list); -+ if (vh->priority < tmp->priority) -+ break; -+ } -+ list_add_tail(&vh->list, lh); -+ up_write(&ve_hook_sem); -+ return 0; -+} -+EXPORT_SYMBOL(ve_hook_register); -+ -+void ve_hook_unregister(struct ve_hook *vh) -+{ -+ down_write(&ve_hook_sem); -+ list_del(&vh->list); -+ up_write(&ve_hook_sem); -+} -+EXPORT_SYMBOL(ve_hook_unregister); -+ -+static int ve_hook_iterate(unsigned int hooknum, void *data) -+{ -+ struct ve_hook *vh; -+ int err; -+ -+ err = 0; -+ down_read(&ve_hook_sem); -+ list_for_each_entry(vh, &ve_hooks[hooknum], list) { -+ if (!try_module_get(vh->owner)) -+ continue; -+ err = vh->hook(hooknum, data); -+ module_put(vh->owner); -+ if (err) -+ break; -+ } -+ -+ if (err) { -+ list_for_each_entry_continue_reverse(vh, -+ &ve_hooks[hooknum], list) { -+ if (!try_module_get(vh->owner)) -+ continue; -+ if (vh->undo) -+ vh->undo(hooknum, data); -+ module_put(vh->owner); -+ } -+ } -+ up_read(&ve_hook_sem); -+ return err; -+} -+ -+static void ve_hook_iterate_cleanup(unsigned int hooknum, void *data) -+{ -+ struct ve_hook *vh; -+ -+ down_read(&ve_hook_sem); -+ list_for_each_entry_reverse(vh, &ve_hooks[hooknum], list) { -+ if (!try_module_get(vh->owner)) -+ continue; -+ (void)vh->hook(hooknum, data); -+ module_put(vh->owner); -+ } -+ up_read(&ve_hook_sem); -+} -+ -+static int do_env_create(envid_t veid, unsigned int flags, u32 class_id, -+ env_create_param_t *data, int datalen) -+{ -+ struct task_struct *tsk; -+ struct ve_struct 
*old; -+ struct ve_struct *old_exec; -+ struct ve_struct *ve; -+ __u64 init_mask; -+ int err; -+ -+ tsk = current; -+ old = VE_TASK_INFO(tsk)->owner_env; -+ -+ if (!thread_group_leader(tsk)) -+ return -EINVAL; -+ -+ if (tsk->signal->tty) { -+ printk("ERR: VE init has controlling terminal\n"); -+ return -EINVAL; -+ } -+ if (tsk->signal->pgrp != tsk->pid || tsk->signal->session != tsk->pid) { -+ int may_setsid; -+ read_lock(&tasklist_lock); -+ may_setsid = (find_pid(PIDTYPE_PGID, tsk->pid) == NULL); -+ read_unlock(&tasklist_lock); -+ if (!may_setsid) { -+ printk("ERR: VE init is process group leader\n"); -+ return -EINVAL; -+ } -+ } -+ -+ -+ VZTRACE("%s: veid=%d classid=%d pid=%d\n", -+ __FUNCTION__, veid, class_id, current->pid); -+ -+ err = -ENOMEM; -+ ve = kmalloc(sizeof(struct ve_struct), GFP_KERNEL); -+ if (ve == NULL) -+ goto err_struct; -+ -+ init_ve_struct(ve, veid, class_id, data, tsk); -+ __module_get(THIS_MODULE); -+ down_write(&ve->op_sem); -+ if (flags & VE_LOCK) -+ ve->is_locked = 1; -+ if ((err = ve_list_add(ve)) < 0) -+ goto err_exist; -+ -+ /* this should be done before context switching */ -+ if ((err = init_printk(ve)) < 0) -+ goto err_log_wait; -+ -+ old_exec = set_exec_env(ve); -+ -+ if ((err = init_ve_sched(ve)) < 0) -+ goto err_sched; -+ -+ /* move user to VE */ -+ if ((err = set_user(0, 0)) < 0) -+ goto err_set_user; -+ -+ set_ve_root(ve, tsk); -+ -+ if ((err = init_ve_utsname(ve))) -+ goto err_utsname; -+ -+ if ((err = init_ve_mibs(ve))) -+ goto err_mibs; -+ -+ if ((err = init_ve_proc(ve))) -+ goto err_proc; -+ -+ if ((err = init_ve_sysctl(ve))) -+ goto err_sysctl; -+ -+ if ((err = init_ve_sysfs(ve))) -+ goto err_sysfs; -+ -+ if ((err = init_ve_netdev())) -+ goto err_dev; -+ -+ if ((err = init_ve_tty_drivers(ve)) < 0) -+ goto err_tty; -+ -+ if ((err = init_ve_shmem(ve))) -+ goto err_shmem; -+ -+ if ((err = init_ve_devpts(ve))) -+ goto err_devpts; -+ -+ /* init SYSV IPC variables */ -+ if ((err = init_ve_ipc(ve)) < 0) -+ goto err_ipc; -+ -+ 
set_ve_caps(ve, tsk); -+ -+ /* It is safe to initialize netfilter here as routing initialization and -+ interface setup will be done below. This means that NO skb can be -+ passed inside. Den */ -+ /* iptables ve initialization for non ve0; -+ ve0 init is in module_init */ -+ if ((err = init_ve_netfilter()) < 0) -+ goto err_netfilter; -+ -+ init_mask = data ? data->iptables_mask : VE_IP_DEFAULT; -+ if ((err = init_ve_iptables(ve, init_mask)) < 0) -+ goto err_iptables; -+ -+ if ((err = init_ve_route(ve)) < 0) -+ goto err_route; -+ -+ if ((err = alloc_vpid(tsk->pid, 1)) < 0) -+ goto err_vpid; -+ -+ if ((err = ve_hook_iterate(VE_HOOK_INIT, (void *)ve)) < 0) -+ goto err_ve_hook; -+ -+ /* finally: set vpids and move inside */ -+ move_task(tsk, ve, old); -+ -+ set_virt_pid(tsk, 1); -+ set_virt_tgid(tsk, 1); -+ -+ set_special_pids(tsk->pid, tsk->pid); -+ current->signal->tty_old_pgrp = 0; -+ set_virt_pgid(tsk, 1); -+ set_virt_sid(tsk, 1); -+ -+ ve->is_running = 1; -+ up_write(&ve->op_sem); -+ -+ printk(KERN_INFO "VPS: %d: started\n", veid); -+ return veid; -+ -+err_ve_hook: -+ free_vpid(1, ve); -+err_vpid: -+ fini_venet(ve); -+ fini_ve_route(ve); -+err_route: -+ fini_ve_iptables(ve, init_mask); -+err_iptables: -+ fini_ve_netfilter(); -+err_netfilter: -+ fini_ve_ipc(ve); -+err_ipc: -+ fini_ve_devpts(ve); -+err_devpts: -+ fini_ve_shmem(ve); -+err_shmem: -+ fini_ve_tty_drivers(ve); -+err_tty: -+ fini_ve_netdev(); -+err_dev: -+ fini_ve_sysfs(ve); -+err_sysfs: -+ fini_ve_sysctl(ve); -+err_sysctl: -+ fini_ve_proc(ve); -+err_proc: -+ do_clean_devperms(ve->veid); /* register procfs adds devperms */ -+ fini_ve_mibs(ve); -+err_mibs: -+ /* free_ve_utsname() is called inside real_put_ve() */ ; -+err_utsname: -+ /* It is safe to restore current->envid here because -+ * ve_fairsched_detach does not use current->envid. */ -+ /* Really fairsched code uses current->envid in sys_fairsched_mknod -+ * only. It is correct if sys_fairsched_mknod is called from -+ * userspace. 
If sys_fairsched_mknod is called from -+ * ve_fairsched_attach, then node->envid and node->parent_node->envid -+ * are explicitly set to valid value after the call. */ -+ /* FIXME */ -+ VE_TASK_INFO(tsk)->owner_env = old; -+ VE_TASK_INFO(tsk)->exec_env = old_exec; -+ /* move user back */ -+ if (set_user(0, 0) < 0) -+ printk(KERN_WARNING"Can't restore UID\n"); -+ -+err_set_user: -+ fini_ve_sched(ve); -+err_sched: -+ (void)set_exec_env(old_exec); -+ -+ /* we can jump here having incorrect envid */ -+ VE_TASK_INFO(tsk)->owner_env = old; -+ fini_printk(ve); -+err_log_wait: -+ ve_list_del(ve); -+ up_write(&ve->op_sem); -+ -+ real_put_ve(ve); -+err_struct: -+ printk(KERN_INFO "VPS: %d: failed to start with err=%d\n", veid, err); -+ return err; -+ -+err_exist: -+ kfree(ve); -+ goto err_struct; -+} -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * VE start/stop callbacks -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+int real_env_create(envid_t veid, unsigned flags, u32 class_id, -+ env_create_param_t *data, int datalen) -+{ -+ int status; -+ struct ve_struct *ve; -+ -+ if (!flags) { -+ status = get_exec_env()->veid; -+ goto out; -+ } -+ -+ status = -EPERM; -+ if (!capable(CAP_SETVEID)) -+ goto out; -+ -+ status = -EINVAL; -+ if ((flags & VE_TEST) && (flags & (VE_ENTER|VE_CREATE))) -+ goto out; -+ -+ status = -EINVAL; -+ ve = get_ve_by_id(veid); -+ if (ve) { -+ if (flags & VE_TEST) { -+ status = 0; -+ goto out_put; -+ } -+ if (flags & VE_EXCLUSIVE) { -+ status = -EACCES; -+ goto out_put; -+ } -+ if (flags & VE_CREATE) { -+ flags &= ~VE_CREATE; -+ flags |= VE_ENTER; -+ } -+ } else { -+ if (flags & (VE_TEST|VE_ENTER)) { -+ status = -ESRCH; -+ goto out; -+ } -+ } -+ -+ if (flags & VE_CREATE) { -+ status = do_env_create(veid, flags, class_id, data, 
datalen); -+ goto out; -+ } else if (flags & VE_ENTER) -+ status = do_env_enter(ve, flags); -+ -+ /* else: returning EINVAL */ -+ -+out_put: -+ real_put_ve(ve); -+out: -+ return status; -+} -+ -+static int do_env_enter(struct ve_struct *ve, unsigned int flags) -+{ -+ struct task_struct *tsk = current; -+ int err; -+ -+ VZTRACE("%s: veid=%d\n", __FUNCTION__, ve->veid); -+ -+ err = -EBUSY; -+ down_read(&ve->op_sem); -+ if (!ve->is_running) -+ goto out_up; -+ if (ve->is_locked && !(flags & VE_SKIPLOCK)) -+ goto out_up; -+ -+#ifdef CONFIG_FAIRSCHED -+ err = sys_fairsched_mvpr(current->pid, ve->veid); -+ if (err) -+ goto out_up; -+#endif -+ -+ ve_sched_attach(ve); -+ move_task(current, ve, VE_TASK_INFO(tsk)->owner_env); -+ err = VE_TASK_INFO(tsk)->owner_env->veid; -+ -+out_up: -+ up_read(&ve->op_sem); -+ return err; -+} -+ -+static void env_cleanup(struct ve_struct *ve) -+{ -+ struct ve_struct *old_ve; -+ -+ VZTRACE("real_do_env_cleanup\n"); -+ -+ down_read(&ve->op_sem); -+ old_ve = set_exec_env(ve); -+ -+ ve_hook_iterate_cleanup(VE_HOOK_FINI, (void *)ve); -+ -+ fini_venet(ve); -+ fini_ve_route(ve); -+ -+ /* no new packets in flight beyond this point */ -+ synchronize_net(); -+ /* skb hold dst_entry, and in turn lies in the ip fragment queue */ -+ ip_fragment_cleanup(ve); -+ -+ fini_ve_netdev(); -+ -+ /* kill iptables */ -+ /* No skb belonging to VE can exist at this point as unregister_netdev -+ is an operation awaiting until ALL skb's gone */ -+ flush_ve_iptables(ve); -+ fini_ve_iptables(ve, ve->_iptables_modules); -+ fini_ve_netfilter(); -+ -+ ve_ipc_cleanup(); -+ -+ fini_ve_sched(ve); -+ do_clean_devperms(ve->veid); -+ -+ fini_ve_devpts(ve); -+ fini_ve_shmem(ve); -+ fini_ve_sysfs(ve); -+ unregister_ve_tty_drivers(ve); -+ fini_ve_sysctl(ve); -+ fini_ve_proc(ve); -+ -+ fini_ve_mibs(ve); -+ -+ (void)set_exec_env(old_ve); -+ fini_printk(ve); /* no printk can happen in ve context anymore */ -+ -+ ve_list_del(ve); -+ up_read(&ve->op_sem); -+ -+ real_put_ve(ve); -+} -+ 
-+static struct list_head ve_cleanup_list; -+static spinlock_t ve_cleanup_lock; -+ -+static DECLARE_COMPLETION(vzmond_complete); -+static struct task_struct *vzmond_thread; -+static volatile int stop_vzmond; -+ -+void real_do_env_cleanup(struct ve_struct *ve) -+{ -+ spin_lock(&ve_cleanup_lock); -+ list_add_tail(&ve->cleanup_list, &ve_cleanup_list); -+ spin_unlock(&ve_cleanup_lock); -+ wake_up_process(vzmond_thread); -+} -+ -+static void do_pending_env_cleanups(void) -+{ -+ struct ve_struct *ve; -+ -+ spin_lock(&ve_cleanup_lock); -+ while (1) { -+ if (list_empty(&ve_cleanup_list) || need_resched()) -+ break; -+ ve = list_first_entry(&ve_cleanup_list, struct ve_struct, -+ cleanup_list); -+ list_del(&ve->cleanup_list); -+ spin_unlock(&ve_cleanup_lock); -+ env_cleanup(ve); -+ spin_lock(&ve_cleanup_lock); -+ } -+ spin_unlock(&ve_cleanup_lock); -+} -+ -+static int have_pending_cleanups(void) -+{ -+ return !list_empty(&ve_cleanup_list); -+} -+ -+static int vzmond(void *arg) -+{ -+ daemonize("vzmond"); -+ vzmond_thread = current; -+ set_current_state(TASK_INTERRUPTIBLE); -+ -+ while (!stop_vzmond) { -+ schedule(); -+ if (signal_pending(current)) -+ flush_signals(current); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); -+ -+ do_pending_env_cleanups(); -+ set_current_state(TASK_INTERRUPTIBLE); -+ if (have_pending_cleanups()) -+ __set_current_state(TASK_RUNNING); -+ } -+ -+ __set_task_state(current, TASK_RUNNING); -+ complete_and_exit(&vzmond_complete, 0); -+} -+ -+static int __init init_vzmond(void) -+{ -+ INIT_LIST_HEAD(&ve_cleanup_list); -+ spin_lock_init(&ve_cleanup_lock); -+ stop_vzmond = 0; -+ return kernel_thread(vzmond, NULL, 0); -+} -+ -+static void fini_vzmond(void) -+{ -+ stop_vzmond = 1; -+ wake_up_process(vzmond_thread); -+ wait_for_completion(&vzmond_complete); -+ WARN_ON(!list_empty(&ve_cleanup_list)); -+} -+ -+void real_do_env_free(struct ve_struct *ve) -+{ -+ VZTRACE("real_do_env_free\n"); -+ -+ ve_ipc_free(ve); /* free SYSV IPC resources */ -+ 
free_ve_tty_drivers(ve); -+ free_ve_utsname(ve); -+ free_ve_sysctl(ve); /* free per ve sysctl data */ -+ free_ve_filesystems(ve); -+ printk(KERN_INFO "VPS: %d: stopped\n", VEID(ve)); -+ kfree(ve); -+ -+ module_put(THIS_MODULE); -+} -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * VE TTY handling -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+DCL_VE_OWNER(TTYDRV, TAIL_SOFT, struct tty_driver, owner_env, , ()) -+ -+static struct tty_driver *alloc_ve_tty_driver(struct tty_driver *base, -+ struct ve_struct *ve) -+{ -+ size_t size; -+ struct tty_driver *driver; -+ -+ driver = kmalloc(sizeof(struct tty_driver), GFP_KERNEL); -+ if (!driver) -+ goto out; -+ -+ memcpy(driver, base, sizeof(struct tty_driver)); -+ -+ driver->driver_state = NULL; -+ -+ size = base->num * 3 * sizeof(void *); -+ if (!(driver->flags & TTY_DRIVER_DEVPTS_MEM)) { -+ void **p; -+ p = kmalloc(size, GFP_KERNEL); -+ if (!p) -+ goto out_free; -+ memset(p, 0, size); -+ driver->ttys = (struct tty_struct **)p; -+ driver->termios = (struct termios **)(p + driver->num); -+ driver->termios_locked = (struct termios **)(p + driver->num * 2); -+ } else { -+ driver->ttys = NULL; -+ driver->termios = NULL; -+ driver->termios_locked = NULL; -+ } -+ -+ SET_VE_OWNER_TTYDRV(driver, ve); -+ driver->flags |= TTY_DRIVER_INSTALLED; -+ -+ return driver; -+ -+out_free: -+ kfree(driver); -+out: -+ return NULL; -+} -+ -+static void free_ve_tty_driver(struct tty_driver *driver) -+{ -+ if (!driver) -+ return; -+ -+ clear_termios(driver); -+ kfree(driver->ttys); -+ kfree(driver); -+} -+ -+static int alloc_ve_tty_drivers(struct ve_struct* ve) -+{ -+#ifdef CONFIG_LEGACY_PTYS -+ extern struct tty_driver *get_pty_driver(void); -+ extern struct tty_driver *get_pty_slave_driver(void); -+ -+ /* Traditional BSD 
devices */ -+ ve->pty_driver = alloc_ve_tty_driver(get_pty_driver(), ve); -+ if (!ve->pty_driver) -+ goto out_mem; -+ -+ ve->pty_slave_driver = alloc_ve_tty_driver( -+ get_pty_slave_driver(), ve); -+ if (!ve->pty_slave_driver) -+ goto out_mem; -+ -+ ve->pty_driver->other = ve->pty_slave_driver; -+ ve->pty_slave_driver->other = ve->pty_driver; -+#endif -+ -+#ifdef CONFIG_UNIX98_PTYS -+ ve->ptm_driver = alloc_ve_tty_driver(ptm_driver, ve); -+ if (!ve->ptm_driver) -+ goto out_mem; -+ -+ ve->pts_driver = alloc_ve_tty_driver(pts_driver, ve); -+ if (!ve->pts_driver) -+ goto out_mem; -+ -+ ve->ptm_driver->other = ve->pts_driver; -+ ve->pts_driver->other = ve->ptm_driver; -+ -+ ve->allocated_ptys = kmalloc(sizeof(*ve->allocated_ptys), GFP_KERNEL); -+ if (!ve->allocated_ptys) -+ goto out_mem; -+ idr_init(ve->allocated_ptys); -+#endif -+ return 0; -+ -+out_mem: -+ free_ve_tty_drivers(ve); -+ return -ENOMEM; -+} -+ -+static void free_ve_tty_drivers(struct ve_struct* ve) -+{ -+#ifdef CONFIG_LEGACY_PTYS -+ free_ve_tty_driver(ve->pty_driver); -+ free_ve_tty_driver(ve->pty_slave_driver); -+ ve->pty_driver = ve->pty_slave_driver = NULL; -+#endif -+#ifdef CONFIG_UNIX98_PTYS -+ free_ve_tty_driver(ve->ptm_driver); -+ free_ve_tty_driver(ve->pts_driver); -+ kfree(ve->allocated_ptys); -+ ve->ptm_driver = ve->pts_driver = NULL; -+ ve->allocated_ptys = NULL; -+#endif -+} -+ -+static inline void __register_tty_driver(struct tty_driver *driver) -+{ -+ list_add(&driver->tty_drivers, &tty_drivers); -+} -+ -+static inline void __unregister_tty_driver(struct tty_driver *driver) -+{ -+ if (!driver) -+ return; -+ list_del(&driver->tty_drivers); -+} -+ -+static int register_ve_tty_drivers(struct ve_struct* ve) -+{ -+ write_lock_irq(&tty_driver_guard); -+#ifdef CONFIG_UNIX98_PTYS -+ __register_tty_driver(ve->ptm_driver); -+ __register_tty_driver(ve->pts_driver); -+#endif -+#ifdef CONFIG_LEGACY_PTYS -+ __register_tty_driver(ve->pty_driver); -+ __register_tty_driver(ve->pty_slave_driver); -+#endif -+ 
write_unlock_irq(&tty_driver_guard); -+ -+ return 0; -+} -+ -+static void unregister_ve_tty_drivers(struct ve_struct* ve) -+{ -+ VZTRACE("unregister_ve_tty_drivers\n"); -+ -+ write_lock_irq(&tty_driver_guard); -+ __unregister_tty_driver(ve->pty_driver); -+ __unregister_tty_driver(ve->pty_slave_driver); -+#ifdef CONFIG_UNIX98_PTYS -+ __unregister_tty_driver(ve->ptm_driver); -+ __unregister_tty_driver(ve->pts_driver); -+#endif -+ write_unlock_irq(&tty_driver_guard); -+} -+ -+static int init_ve_tty_drivers(struct ve_struct *ve) -+{ -+ int err; -+ -+ if ((err = alloc_ve_tty_drivers(ve))) -+ goto err_ttyalloc; -+ if ((err = register_ve_tty_drivers(ve))) -+ goto err_ttyreg; -+ return 0; -+ -+err_ttyreg: -+ free_ve_tty_drivers(ve); -+err_ttyalloc: -+ return err; -+} -+ -+static void fini_ve_tty_drivers(struct ve_struct *ve) -+{ -+ unregister_ve_tty_drivers(ve); -+ free_ve_tty_drivers(ve); -+} -+ -+/* -+ * Free the termios and termios_locked structures because -+ * we don't want to get memory leaks when modular tty -+ * drivers are removed from the kernel. 
-+ */ -+static void clear_termios(struct tty_driver *driver) -+{ -+ int i; -+ struct termios *tp; -+ -+ if (driver->termios == NULL) -+ return; -+ for (i = 0; i < driver->num; i++) { -+ tp = driver->termios[i]; -+ if (tp) { -+ driver->termios[i] = NULL; -+ kfree(tp); -+ } -+ tp = driver->termios_locked[i]; -+ if (tp) { -+ driver->termios_locked[i] = NULL; -+ kfree(tp); -+ } -+ } -+} -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * Pieces of VE network -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#include <asm/uaccess.h> -+#include <net/sock.h> -+#include <linux/netlink.h> -+#include <linux/rtnetlink.h> -+#include <net/route.h> -+#include <net/ip_fib.h> -+#endif -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static void ve_del_ip_addrs(struct net_device *dev) -+{ -+ struct in_device *in_dev; -+ -+ in_dev = in_dev_get(dev); -+ if (in_dev == NULL) -+ return; -+ -+ while (in_dev->ifa_list != NULL) { -+ inet_del_ifa(in_dev, &in_dev->ifa_list, 1); -+ } -+ in_dev_put(in_dev); -+} -+ -+static int ve_netdev_cleanup(struct net_device *dev, int to_ve) -+{ -+ int err; -+ -+ err = 0; -+ ve_del_ip_addrs(dev); -+ if ((dev->flags & IFF_UP) != 0) -+ err = dev_close(dev); -+ synchronize_net(); -+ dev_shutdown(dev); -+ dev_mc_discard(dev); -+ free_divert_blk(dev); -+ synchronize_net(); -+ -+ if (to_ve) -+ dev->orig_mtu = dev->mtu; -+ else { -+ int rc = dev_set_mtu(dev, dev->orig_mtu); -+ if (err == 0) -+ err = rc; -+ } -+ -+ return err; -+} -+ -+static void __ve_dev_move(struct net_device *dev, struct ve_struct *ve_src, -+ struct ve_struct *ve_dst, struct user_beancounter *exec_ub) -+{ -+ struct net_device **dp, *d; -+ struct user_beancounter *ub; -+ -+ for (d = 
ve_src->_net_dev_base, dp = NULL; d != NULL; -+ dp = &d->next, d = d->next) { -+ if (d == dev) { -+ hlist_del(&dev->name_hlist); -+ hlist_del(&dev->index_hlist); -+ if (ve_src->_net_dev_tail == &dev->next) -+ ve_src->_net_dev_tail = dp; -+ if (dp) -+ *dp = dev->next; -+ dev->next = NULL; -+ break; -+ } -+ } -+ *ve_dst->_net_dev_tail = dev; -+ ve_dst->_net_dev_tail = &dev->next; -+ hlist_add_head(&dev->name_hlist, dev_name_hash(dev->name, ve_dst)); -+ hlist_add_head(&dev->index_hlist, dev_index_hash(dev->ifindex, ve_dst)); -+ dev->owner_env = ve_dst; -+ -+ ub = netdev_bc(dev)->exec_ub; -+ netdev_bc(dev)->exec_ub = get_beancounter(exec_ub); -+ put_beancounter(ub); -+} -+ -+static int ve_dev_add(envid_t veid, char *dev_name) -+{ -+ int err; -+ struct net_device *dev; -+ struct ve_struct *ve; -+ struct hlist_node *p; -+ -+ dev = NULL; -+ err = -ESRCH; -+ -+ ve = get_ve_by_id(veid); -+ if (ve == NULL) -+ goto out; -+ -+ rtnl_lock(); -+ -+ read_lock(&dev_base_lock); -+ hlist_for_each(p, dev_name_hash(dev_name, get_ve0())) { -+ struct net_device *d = hlist_entry(p, struct net_device, -+ name_hlist); -+ if (strncmp(d->name, dev_name, IFNAMSIZ) == 0) { -+ dev = d; -+ break; -+ } -+ } -+ read_unlock(&dev_base_lock); -+ if (dev == NULL) -+ goto out_unlock; -+ -+ err = -EPERM; -+ if (!ve_is_dev_movable(dev)) -+ goto out_unlock; -+ -+ err = -EINVAL; -+ if (dev->flags & (IFF_SLAVE|IFF_MASTER)) -+ goto out_unlock; -+ -+ ve_netdev_cleanup(dev, 1); -+ -+ write_lock_bh(&dev_base_lock); -+ __ve_dev_move(dev, get_ve0(), ve, get_exec_ub()); -+ write_unlock_bh(&dev_base_lock); -+ -+ err = 0; -+ -+out_unlock: -+ rtnl_unlock(); -+ real_put_ve(ve); -+ -+ if (dev == NULL) -+ printk(KERN_WARNING "Device %s not found\n", dev_name); -+ -+out: -+ return err; -+} -+ -+static int ve_dev_del(envid_t veid, char *dev_name) -+{ -+ int err; -+ struct net_device *dev; -+ struct ve_struct *ve, *old_exec; -+ struct hlist_node *p; -+ -+ dev = NULL; -+ err = -ESRCH; -+ -+ ve = get_ve_by_id(veid); -+ if (ve 
== NULL) -+ goto out; -+ -+ rtnl_lock(); -+ -+ read_lock(&dev_base_lock); -+ hlist_for_each(p, dev_name_hash(dev_name, ve)) { -+ struct net_device *d = hlist_entry(p, struct net_device, -+ name_hlist); -+ if (strncmp(d->name, dev_name, IFNAMSIZ) == 0) { -+ dev = d; -+ break; -+ } -+ } -+ read_unlock(&dev_base_lock); -+ if (dev == NULL) -+ goto out_unlock; -+ -+ err = -EPERM; -+ if (!ve_is_dev_movable(dev)) -+ goto out_unlock; -+ -+ old_exec = set_exec_env(ve); -+ ve_netdev_cleanup(dev, 0); -+ (void)set_exec_env(old_exec); -+ -+ write_lock_bh(&dev_base_lock); -+ __ve_dev_move(dev, ve, get_ve0(), netdev_bc(dev)->owner_ub); -+ write_unlock_bh(&dev_base_lock); -+ -+ err = 0; -+ -+out_unlock: -+ rtnl_unlock(); -+ real_put_ve(ve); -+ -+ if (dev == NULL) -+ printk(KERN_WARNING "Device %s not found\n", dev_name); -+ -+out: -+ return err; -+} -+ -+int real_ve_dev_map(envid_t veid, int op, char *dev_name) -+{ -+ int err; -+ err = -EPERM; -+ if (!capable(CAP_SETVEID)) -+ goto out; -+ switch (op) -+ { -+ case VE_NETDEV_ADD: -+ err = ve_dev_add(veid, dev_name); -+ break; -+ case VE_NETDEV_DEL: -+ err = ve_dev_del(veid, dev_name); -+ break; -+ default: -+ err = -EINVAL; -+ break; -+ } -+out: -+ return err; -+} -+ -+static void ve_mapped_devs_cleanup(struct ve_struct *ve) -+{ -+ struct net_device *dev; -+ -+ rtnl_lock(); -+ write_lock_bh(&dev_base_lock); -+restart: -+ for (dev = ve->_net_dev_base; dev != NULL; dev = dev->next) -+ { -+ if ((dev->features & NETIF_F_VENET) || -+ (dev == ve->_loopback_dev)) /* Skip loopback dev */ -+ continue; -+ write_unlock_bh(&dev_base_lock); -+ ve_netdev_cleanup(dev, 0); -+ write_lock_bh(&dev_base_lock); -+ __ve_dev_move(dev, ve, get_ve0(), netdev_bc(dev)->owner_ub); -+ goto restart; -+ } -+ write_unlock_bh(&dev_base_lock); -+ rtnl_unlock(); -+} -+#endif -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * VE information via /proc -+ * 
-+ ********************************************************************** -+ **********************************************************************/ -+#ifdef CONFIG_PROC_FS -+static int devperms_seq_show(struct seq_file *m, void *v) -+{ -+ struct devperms_struct *dp; -+ char dev_s[32], type_c; -+ unsigned use, type; -+ dev_t dev; -+ -+ dp = (struct devperms_struct *)v; -+ if (dp == (struct devperms_struct *)1L) { -+ seq_printf(m, "Version: 2.7\n"); -+ return 0; -+ } -+ -+ use = dp->type & VE_USE_MASK; -+ type = dp->type & S_IFMT; -+ dev = dp->dev; -+ -+ if ((use | VE_USE_MINOR) == use) -+ snprintf(dev_s, sizeof(dev_s), "%d:%d", MAJOR(dev), MINOR(dev)); -+ else if ((use | VE_USE_MAJOR) == use) -+ snprintf(dev_s, sizeof(dev_s), "%d:*", MAJOR(dp->dev)); -+ else -+ snprintf(dev_s, sizeof(dev_s), "*:*"); -+ -+ if (type == S_IFCHR) -+ type_c = 'c'; -+ else if (type == S_IFBLK) -+ type_c = 'b'; -+ else -+ type_c = '?'; -+ -+ seq_printf(m, "%10u %c %03o %s\n", dp->veid, type_c, dp->mask, dev_s); -+ return 0; -+} -+ -+static void *devperms_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ loff_t cpos; -+ long slot; -+ struct devperms_struct *dp; -+ -+ cpos = *pos; -+ read_lock(&devperms_hash_guard); -+ if (cpos-- == 0) -+ return (void *)1L; -+ -+ for (slot = 0; slot < DEVPERMS_HASH_SZ; slot++) -+ for (dp = devperms_hash[slot]; dp; dp = dp->devhash_next) -+ if (cpos-- == 0) { -+ m->private = (void *)slot; -+ return dp; -+ } -+ return NULL; -+} -+ -+static void *devperms_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ long slot; -+ struct devperms_struct *dp; -+ -+ dp = (struct devperms_struct *)v; -+ -+ if (dp == (struct devperms_struct *)1L) -+ slot = 0; -+ else if (dp->devhash_next == NULL) -+ slot = (long)m->private + 1; -+ else { -+ (*pos)++; -+ return dp->devhash_next; -+ } -+ -+ for (; slot < DEVPERMS_HASH_SZ; slot++) -+ if (devperms_hash[slot]) { -+ (*pos)++; -+ m->private = (void *)slot; -+ return devperms_hash[slot]; -+ } -+ return NULL; -+} -+ -+static 
void devperms_seq_stop(struct seq_file *m, void *v) -+{ -+ read_unlock(&devperms_hash_guard); -+} -+ -+static struct seq_operations devperms_seq_op = { -+ .start = devperms_seq_start, -+ .next = devperms_seq_next, -+ .stop = devperms_seq_stop, -+ .show = devperms_seq_show, -+}; -+ -+static int devperms_open(struct inode *inode, struct file *file) -+{ -+ return seq_open(file, &devperms_seq_op); -+} -+ -+static struct file_operations proc_devperms_ops = { -+ .open = devperms_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = seq_release, -+}; -+ -+#if BITS_PER_LONG == 32 -+#define VESTAT_LINE_WIDTH (6 * 11 + 6 * 21) -+#define VESTAT_LINE_FMT "%10u %10lu %10lu %10lu %10lu %20Lu %20Lu %20Lu %20Lu %20Lu %20Lu %10lu\n" -+#define VESTAT_HEAD_FMT "%10s %10s %10s %10s %10s %20s %20s %20s %20s %20s %20s %10s\n" -+#else -+#define VESTAT_LINE_WIDTH (12 * 21) -+#define VESTAT_LINE_FMT "%20u %20lu %20lu %20lu %20lu %20Lu %20Lu %20Lu %20Lu %20Lu %20Lu %20lu\n" -+#define VESTAT_HEAD_FMT "%20s %20s %20s %20s %20s %20s %20s %20s %20s %20s %20s %20s\n" -+#endif -+ -+static int vestat_seq_show(struct seq_file *m, void *v) -+{ -+ struct ve_struct *ve = (struct ve_struct *)v; -+ struct ve_struct *curve; -+ int cpu; -+ unsigned long user_ve, nice_ve, system_ve, uptime; -+ cycles_t uptime_cycles, idle_time, strv_time, used; -+ -+ curve = get_exec_env(); -+ if (ve == ve_list_head || -+ (!ve_is_super(curve) && ve == curve)) { -+ /* print header */ -+ seq_printf(m, "%-*s\n", -+ VESTAT_LINE_WIDTH - 1, -+ "Version: 2.2"); -+ seq_printf(m, VESTAT_HEAD_FMT, "VEID", -+ "user", "nice", "system", -+ "uptime", "idle", -+ "strv", "uptime", "used", -+ "maxlat", "totlat", "numsched"); -+ } -+ -+ if (ve == get_ve0()) -+ return 0; -+ -+ user_ve = nice_ve = system_ve = 0; -+ idle_time = strv_time = used = 0; -+ -+ for (cpu = 0; cpu < NR_CPUS; cpu++) { -+ user_ve += VE_CPU_STATS(ve, cpu)->user; -+ nice_ve += VE_CPU_STATS(ve, cpu)->nice; -+ system_ve += VE_CPU_STATS(ve, cpu)->system; -+ used 
+= VE_CPU_STATS(ve, cpu)->used_time; -+ idle_time += ve_sched_get_idle_time(ve, cpu); -+ } -+ uptime_cycles = get_cycles() - ve->start_cycles; -+ uptime = jiffies - ve->start_jiffies; -+ -+ seq_printf(m, VESTAT_LINE_FMT, ve->veid, -+ user_ve, nice_ve, system_ve, -+ uptime, idle_time, -+ strv_time, uptime_cycles, used, -+ ve->sched_lat_ve.last.maxlat, -+ ve->sched_lat_ve.last.totlat, -+ ve->sched_lat_ve.last.count); -+ return 0; -+} -+ -+static void *ve_seq_start(struct seq_file *m, loff_t *pos) -+{ -+ struct ve_struct *ve, *curve; -+ loff_t l; -+ -+ curve = get_exec_env(); -+ read_lock(&ve_list_guard); -+ if (!ve_is_super(curve)) { -+ if (*pos != 0) -+ return NULL; -+ return curve; -+ } -+ for (ve = ve_list_head, l = *pos; -+ ve != NULL && l > 0; -+ ve = ve->next, l--); -+ return ve; -+} -+ -+static void *ve_seq_next(struct seq_file *m, void *v, loff_t *pos) -+{ -+ struct ve_struct *ve = (struct ve_struct *)v; -+ -+ if (!ve_is_super(get_exec_env())) -+ return NULL; -+ (*pos)++; -+ return ve->next; -+} -+ -+static void ve_seq_stop(struct seq_file *m, void *v) -+{ -+ read_unlock(&ve_list_guard); -+} -+ -+static struct seq_operations vestat_seq_op = { -+ start: ve_seq_start, -+ next: ve_seq_next, -+ stop: ve_seq_stop, -+ show: vestat_seq_show -+}; -+ -+static int vestat_open(struct inode *inode, struct file *file) -+{ -+ return seq_open(file, &vestat_seq_op); -+} -+ -+static struct file_operations proc_vestat_operations = { -+ open: vestat_open, -+ read: seq_read, -+ llseek: seq_lseek, -+ release: seq_release -+}; -+ -+static int __init init_vecalls_proc(void) -+{ -+ struct proc_dir_entry *de; -+ -+ de = create_proc_glob_entry("vz/vestat", -+ S_IFREG|S_IRUSR, NULL); -+ if (de == NULL) { -+ /* create "vz" subdirectory, if not exist */ -+ (void) create_proc_glob_entry("vz", -+ S_IFDIR|S_IRUGO|S_IXUGO, NULL); -+ de = create_proc_glob_entry("vz/vestat", -+ S_IFREG|S_IRUSR, NULL); -+ } -+ if (de) -+ de->proc_fops = &proc_vestat_operations; -+ else -+ printk(KERN_WARNING -+ 
"VZMON: can't make vestat proc entry\n"); -+ -+ de = create_proc_entry("vz/devperms", S_IFREG | S_IRUSR, NULL); -+ if (de) -+ de->proc_fops = &proc_devperms_ops; -+ else -+ printk(KERN_WARNING -+ "VZMON: can't make devperms proc entry\n"); -+ return 0; -+} -+ -+static void fini_vecalls_proc(void) -+{ -+ remove_proc_entry("vz/devperms", NULL); -+ remove_proc_entry("vz/vestat", NULL); -+} -+#else -+#define init_vecalls_proc() (0) -+#define fini_vecalls_proc() do { } while (0) -+#endif /* CONFIG_PROC_FS */ -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * User ctl -+ * -+ ********************************************************************** -+ **********************************************************************/ -+ -+int vzcalls_ioctl(struct inode *, struct file *, unsigned int, unsigned long); -+static struct vzioctlinfo vzcalls = { -+ type: VZCTLTYPE, -+ func: vzcalls_ioctl, -+ owner: THIS_MODULE, -+}; -+ -+int vzcalls_ioctl(struct inode *ino, struct file *file, unsigned int cmd, -+ unsigned long arg) -+{ -+ int err; -+ -+ err = -ENOTTY; -+ switch(cmd) { -+ case VZCTL_MARK_ENV_TO_DOWN: { -+ /* Compatibility issue */ -+ err = 0; -+ } -+ break; -+ case VZCTL_SETDEVPERMS: { -+ /* Device type was mistakenly declared as dev_t -+ * in the old user-kernel interface. -+ * That's wrong, dev_t is a kernel internal type. -+ * I use `unsigned' not having anything better in mind. 
-+ * 2001/08/11 SAW */ -+ struct vzctl_setdevperms s; -+ err = -EFAULT; -+ if (copy_from_user(&s, (void *)arg, sizeof(s))) -+ break; -+ err = real_setdevperms(s.veid, s.type, -+ new_decode_dev(s.dev), s.mask); -+ } -+ break; -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ case VZCTL_VE_NETDEV: { -+ struct vzctl_ve_netdev d; -+ char *s; -+ err = -EFAULT; -+ if (copy_from_user(&d, (void *)arg, sizeof(d))) -+ break; -+ err = -ENOMEM; -+ s = kmalloc(IFNAMSIZ+1, GFP_KERNEL); -+ if (s == NULL) -+ break; -+ strncpy_from_user(s, d.dev_name, IFNAMSIZ); -+ s[IFNAMSIZ] = 0; -+ err = real_ve_dev_map(d.veid, d.op, s); -+ kfree(s); -+ } -+ break; -+#endif -+ case VZCTL_ENV_CREATE: { -+ struct vzctl_env_create s; -+ err = -EFAULT; -+ if (copy_from_user(&s, (void *)arg, sizeof(s))) -+ break; -+ err = real_env_create(s.veid, s.flags, s.class_id, -+ NULL, 0); -+ } -+ break; -+ case VZCTL_ENV_CREATE_DATA: { -+ struct vzctl_env_create_data s; -+ env_create_param_t *data; -+ err = -EFAULT; -+ if (copy_from_user(&s, (void *)arg, sizeof(s))) -+ break; -+ err=-EINVAL; -+ if (s.datalen < VZCTL_ENV_CREATE_DATA_MINLEN || -+ s.datalen > VZCTL_ENV_CREATE_DATA_MAXLEN || -+ s.data == 0) -+ break; -+ err = -ENOMEM; -+ data = kmalloc(sizeof(*data), GFP_KERNEL); -+ if (!data) -+ break; -+ memset(data, 0, sizeof(*data)); -+ err = -EFAULT; -+ if (copy_from_user(data, (void *)s.data, s.datalen)) -+ goto free_data; -+ err = real_env_create(s.veid, s.flags, s.class_id, -+ data, s.datalen); -+free_data: -+ kfree(data); -+ } -+ break; -+ case VZCTL_GET_CPU_STAT: { -+ struct vzctl_cpustatctl s; -+ err = -EFAULT; -+ if (copy_from_user(&s, (void *)arg, sizeof(s))) -+ break; -+ err = ve_get_cpu_stat(s.veid, s.cpustat); -+ } -+ break; -+ } -+ return err; -+} -+EXPORT_SYMBOL(real_env_create); -+ -+ -+/********************************************************************** -+ ********************************************************************** -+ * -+ * Init/exit stuff -+ * -+ 
********************************************************************** -+ **********************************************************************/ -+ -+#ifdef CONFIG_VE_CALLS_MODULE -+static int __init init_vecalls_symbols(void) -+{ -+ KSYMRESOLVE(real_get_device_perms_ve); -+ KSYMRESOLVE(real_do_env_cleanup); -+ KSYMRESOLVE(real_do_env_free); -+ KSYMRESOLVE(real_update_load_avg_ve); -+ KSYMMODRESOLVE(vzmon); -+ return 0; -+} -+ -+static void fini_vecalls_symbols(void) -+{ -+ KSYMMODUNRESOLVE(vzmon); -+ KSYMUNRESOLVE(real_get_device_perms_ve); -+ KSYMUNRESOLVE(real_do_env_cleanup); -+ KSYMUNRESOLVE(real_do_env_free); -+ KSYMUNRESOLVE(real_update_load_avg_ve); -+} -+#else -+#define init_vecalls_symbols() (0) -+#define fini_vecalls_symbols() do { } while (0) -+#endif -+ -+static inline __init int init_vecalls_ioctls(void) -+{ -+ vzioctl_register(&vzcalls); -+ return 0; -+} -+ -+static inline void fini_vecalls_ioctls(void) -+{ -+ vzioctl_unregister(&vzcalls); -+} -+ -+static int __init vecalls_init(void) -+{ -+ int err; -+ int i; -+ -+ ve_list_head = get_ve0(); -+ -+ err = init_vzmond(); -+ if (err < 0) -+ goto out_vzmond; -+ -+ err = init_devperms_hash(); -+ if (err < 0) -+ goto out_perms; -+ -+ err = init_vecalls_symbols(); -+ if (err < 0) -+ goto out_sym; -+ -+ err = init_vecalls_proc(); -+ if (err < 0) -+ goto out_proc; -+ -+ err = init_vecalls_ioctls(); -+ if (err < 0) -+ goto out_ioctls; -+ -+ for (i = 0; i < VE_MAX_HOOKS; i++) -+ INIT_LIST_HEAD(&ve_hooks[i]); -+ -+ return 0; -+ -+out_ioctls: -+ fini_vecalls_proc(); -+out_proc: -+ fini_vecalls_symbols(); -+out_sym: -+ fini_devperms_hash(); -+out_perms: -+ fini_vzmond(); -+out_vzmond: -+ return err; -+} -+ -+static void vecalls_exit(void) -+{ -+ fini_vecalls_ioctls(); -+ fini_vecalls_proc(); -+ fini_vecalls_symbols(); -+ fini_devperms_hash(); -+ fini_vzmond(); -+} -+ -+EXPORT_SYMBOL(get_ve_by_id); -+EXPORT_SYMBOL(__find_ve_by_id); -+EXPORT_SYMBOL(ve_list_guard); -+EXPORT_SYMBOL(ve_list_head); 
-+EXPORT_SYMBOL(nr_ve); -+ -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Virtuozzo Control"); -+MODULE_LICENSE("GPL v2"); -+ -+module_init(vecalls_init) -+module_exit(vecalls_exit) -diff -uprN linux-2.6.8.1.orig/kernel/veowner.c linux-2.6.8.1-ve022stab078/kernel/veowner.c ---- linux-2.6.8.1.orig/kernel/veowner.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/veowner.c 2006-05-11 13:05:42.000000000 +0400 -@@ -0,0 +1,300 @@ -+/* -+ * kernel/veowner.c -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/sched.h> -+#include <linux/ve.h> -+#include <linux/ve_owner.h> -+#include <linux/ve_proto.h> -+#include <linux/ipc.h> -+#include <linux/fs.h> -+#include <linux/proc_fs.h> -+#include <linux/file.h> -+#include <linux/mm.h> -+#include <linux/delay.h> -+#include <linux/vmalloc.h> -+#include <linux/init.h> -+#include <linux/module.h> -+#include <linux/list.h> -+#include <asm/system.h> -+#include <asm/io.h> -+ -+#include <net/tcp.h> -+ -+void prepare_ve0_process(struct task_struct *tsk) -+{ -+ set_virt_pid(tsk, tsk->pid); -+ set_virt_tgid(tsk, tsk->tgid); -+ if (tsk->signal) { -+ set_virt_pgid(tsk, tsk->signal->pgrp); -+ set_virt_sid(tsk, tsk->signal->session); -+ } -+ VE_TASK_INFO(tsk)->exec_env = get_ve0(); -+ VE_TASK_INFO(tsk)->owner_env = get_ve0(); -+ VE_TASK_INFO(tsk)->sleep_time = 0; -+ VE_TASK_INFO(tsk)->wakeup_stamp = 0; -+ VE_TASK_INFO(tsk)->sched_time = 0; -+ seqcount_init(&VE_TASK_INFO(tsk)->wakeup_lock); -+ -+ if (tsk->pid) { -+ SET_VE_LINKS(tsk); -+ atomic_inc(&get_ve0()->pcounter); -+ } -+} -+ -+void prepare_ve0_loopback(void) -+{ -+ get_ve0()->_loopback_dev = &loopback_dev; -+} -+ -+/* -+ * ------------------------------------------------------------------------ -+ * proc entries -+ * ------------------------------------------------------------------------ -+ */ -+ -+static void proc_move(struct 
proc_dir_entry *ddir, -+ struct proc_dir_entry *sdir, -+ const char *name) -+{ -+ struct proc_dir_entry **p, *q; -+ int len; -+ -+ len = strlen(name); -+ for (p = &sdir->subdir, q = *p; q != NULL; p = &q->next, q = *p) -+ if (proc_match(len, name, q)) -+ break; -+ if (q == NULL) -+ return; -+ *p = q->next; -+ q->parent = ddir; -+ q->next = ddir->subdir; -+ ddir->subdir = q; -+} -+static void prepare_proc_misc(void) -+{ -+ static char *table[] = { -+ "loadavg", -+ "uptime", -+ "meminfo", -+ "version", -+ "stat", -+ "filesystems", -+ "locks", -+ "swaps", -+ "mounts", -+ "cpuinfo", -+ "net", -+ "sysvipc", -+ "sys", -+ "fs", -+ "vz", -+ "user_beancounters", -+ "cmdline", -+ "vmstat", -+ "modules", -+ "kmsg", -+ NULL, -+ }; -+ char **p; -+ -+ for (p = table; *p != NULL; p++) -+ proc_move(&proc_root, ve0.proc_root, *p); -+} -+int prepare_proc(void) -+{ -+ struct ve_struct *envid; -+ struct proc_dir_entry *de; -+ struct proc_dir_entry *ve_root; -+ -+ envid = set_exec_env(&ve0); -+ ve_root = ve0.proc_root->subdir; -+ /* move the whole tree to be visible in VE0 only */ -+ ve0.proc_root->subdir = proc_root.subdir; -+ for (de = ve0.proc_root->subdir; de->next != NULL; de = de->next) -+ de->parent = ve0.proc_root; -+ de->parent = ve0.proc_root; -+ de->next = ve_root; -+ -+ /* move back into the global scope some specific entries */ -+ proc_root.subdir = NULL; -+ prepare_proc_misc(); -+ proc_mkdir("net", 0); -+ proc_mkdir("vz", 0); -+#ifdef CONFIG_SYSVIPC -+ proc_mkdir("sysvipc", 0); -+#endif -+ proc_root_fs = proc_mkdir("fs", 0); -+ /* XXX proc_tty_init(); */ -+ -+ /* XXX process inodes */ -+ -+ (void)set_exec_env(envid); -+ -+ (void)create_proc_glob_entry("vz", S_IFDIR|S_IRUGO|S_IXUGO, NULL); -+ return 0; -+} -+ -+static struct proc_dir_entry ve0_proc_root = { -+ .name = "/proc", -+ .namelen = 5, -+ .mode = S_IFDIR | S_IRUGO | S_IXUGO, -+ .nlink = 2 -+}; -+ -+void prepare_ve0_proc_root(void) -+{ -+ ve0.proc_root = &ve0_proc_root; -+} -+ -+/* -+ * 
------------------------------------------------------------------------ -+ * Virtualized sysctl -+ * ------------------------------------------------------------------------ -+ */ -+ -+static int semmin[4] = { 1, 1, 1, 1 }; -+static int semmax[4] = { 8000, INT_MAX, 1000, IPCMNI }; -+static ctl_table kern_table[] = { -+ {KERN_NODENAME, "hostname", system_utsname.nodename, 64, -+ 0644, NULL, &proc_doutsstring, &sysctl_string}, -+ {KERN_DOMAINNAME, "domainname", system_utsname.domainname, 64, -+ 0644, NULL, &proc_doutsstring, &sysctl_string}, -+#ifdef CONFIG_SYSVIPC -+#define get_ve0_field(fname) &ve0._##fname -+ {KERN_SHMMAX, "shmmax", get_ve0_field(shm_ctlmax), sizeof (size_t), -+ 0644, NULL, &proc_doulongvec_minmax }, -+ {KERN_SHMALL, "shmall", get_ve0_field(shm_ctlall), sizeof (size_t), -+ 0644, NULL, &proc_doulongvec_minmax }, -+ {KERN_SHMMNI, "shmmni", get_ve0_field(shm_ctlmni), sizeof (int), -+ 0644, NULL, &proc_dointvec_minmax, NULL, -+ NULL, &semmin[0], &semmax[3] }, -+ {KERN_MSGMAX, "msgmax", get_ve0_field(msg_ctlmax), sizeof (int), -+ 0644, NULL, &proc_dointvec }, -+ {KERN_MSGMNI, "msgmni", get_ve0_field(msg_ctlmni), sizeof (int), -+ 0644, NULL, &proc_dointvec_minmax, NULL, -+ NULL, &semmin[0], &semmax[3] }, -+ {KERN_MSGMNB, "msgmnb", get_ve0_field(msg_ctlmnb), sizeof (int), -+ 0644, NULL, &proc_dointvec }, -+ {KERN_SEM, "sem", get_ve0_field(sem_ctls), 4*sizeof (int), -+ 0644, NULL, &proc_dointvec }, -+#endif -+ {0} -+}; -+static ctl_table root_table[] = { -+ {CTL_KERN, "kernel", NULL, 0, 0555, kern_table}, -+ {0} -+}; -+extern int ip_rt_src_check; -+extern int ve_area_access_check; -+static ctl_table ipv4_route_table[] = { -+ { -+ ctl_name: NET_IPV4_ROUTE_SRC_CHECK, -+ procname: "src_check", -+ data: &ip_rt_src_check, -+ maxlen: sizeof(int), -+ mode: 0644, -+ proc_handler: &proc_dointvec, -+ }, -+ { 0 } -+}; -+static ctl_table ipv4_table[] = { -+ {NET_IPV4_ROUTE, "route", NULL, 0, 0555, ipv4_route_table}, -+ { 0 } -+}; -+static ctl_table net_table[] = { 
-+ {NET_IPV4, "ipv4", NULL, 0, 0555, ipv4_table}, -+ { 0 } -+}; -+static ctl_table fs_table[] = { -+ { -+ ctl_name: 226, -+ procname: "ve-area-access-check", -+ data: &ve_area_access_check, -+ maxlen: sizeof(int), -+ mode: 0644, -+ proc_handler: &proc_dointvec, -+ }, -+ { 0 } -+}; -+static ctl_table root_table2[] = { -+ {CTL_NET, "net", NULL, 0, 0555, net_table}, -+ {CTL_FS, "fs", NULL, 0, 0555, fs_table}, -+ { 0 } -+}; -+int prepare_sysctl(void) -+{ -+ struct ve_struct *envid; -+ -+ envid = set_exec_env(&ve0); -+ ve0.kern_header = register_sysctl_table(root_table, 1); -+ register_sysctl_table(root_table2, 0); -+ (void)set_exec_env(envid); -+ return 0; -+} -+ -+void prepare_ve0_sysctl(void) -+{ -+ INIT_LIST_HEAD(&ve0.sysctl_lh); -+#ifdef CONFIG_SYSCTL -+ ve0.proc_sys_root = proc_mkdir("sys", 0); -+#endif -+} -+ -+/* -+ * ------------------------------------------------------------------------ -+ * XXX init_ve_system -+ * ------------------------------------------------------------------------ -+ */ -+ -+extern struct ipv4_devconf *get_ipv4_devconf_dflt_addr(void); -+ -+void init_ve_system(void) -+{ -+ struct task_struct *init_entry, *p, *tsk; -+ struct ve_struct *ptr; -+ unsigned long flags; -+ int i; -+ -+ ptr = get_ve0(); -+ (void)get_ve(ptr); -+ atomic_set(&ptr->pcounter, 1); -+ -+ /* Don't forget about idle tasks */ -+ write_lock_irqsave(&tasklist_lock, flags); -+ for (i = 0; i < NR_CPUS; i++) { -+ tsk = idle_task(i); -+ if (tsk == NULL) -+ continue; -+ -+ prepare_ve0_process(tsk); -+ } -+ do_each_thread_all(p, tsk) { -+ prepare_ve0_process(tsk); -+ } while_each_thread_all(p, tsk); -+ write_unlock_irqrestore(&tasklist_lock, flags); -+ -+ init_entry = child_reaper; -+ ptr->init_entry = init_entry; -+ /* XXX: why? 
*/ -+ cap_set_full(ptr->cap_default); -+ -+ ptr->_ipv4_devconf = &ipv4_devconf; -+ ptr->_ipv4_devconf_dflt = get_ipv4_devconf_dflt_addr(); -+ -+ read_lock(&init_entry->fs->lock); -+ ptr->fs_rootmnt = init_entry->fs->rootmnt; -+ ptr->fs_root = init_entry->fs->root; -+ read_unlock(&init_entry->fs->lock); -+ -+ /* common prepares */ -+ prepare_proc(); -+ prepare_sysctl(); -+ prepare_ipc(); -+} -diff -uprN linux-2.6.8.1.orig/kernel/vzdev.c linux-2.6.8.1-ve022stab078/kernel/vzdev.c ---- linux-2.6.8.1.orig/kernel/vzdev.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/vzdev.c 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,97 @@ -+/* -+ * kernel/vzdev.c -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. -+ * -+ */ -+ -+#include <linux/fs.h> -+#include <linux/list.h> -+#include <linux/init.h> -+#include <linux/module.h> -+#include <linux/vzctl.h> -+#include <linux/slab.h> -+#include <linux/vmalloc.h> -+#include <linux/vzcalluser.h> -+#include <asm/uaccess.h> -+#include <asm/pgalloc.h> -+ -+#define VZCTL_MAJOR 126 -+#define VZCTL_NAME "vzctl" -+ -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Virtuozzo Interface"); -+MODULE_LICENSE("GPL v2"); -+ -+static LIST_HEAD(ioctls); -+static spinlock_t ioctl_lock = SPIN_LOCK_UNLOCKED; -+ -+int vzctl_ioctl(struct inode *ino, struct file *file, unsigned int cmd, -+ unsigned long arg) -+{ -+ int err; -+ struct list_head *p; -+ struct vzioctlinfo *inf; -+ -+ err = -ENOTTY; -+ spin_lock(&ioctl_lock); -+ list_for_each(p, &ioctls) { -+ inf = list_entry(p, struct vzioctlinfo, list); -+ if (inf->type != _IOC_TYPE(cmd)) -+ continue; -+ -+ err = try_module_get(inf->owner) ? 
0 : -EBUSY; -+ spin_unlock(&ioctl_lock); -+ if (!err) { -+ err = (*inf->func)(ino, file, cmd, arg); -+ module_put(inf->owner); -+ } -+ return err; -+ } -+ spin_unlock(&ioctl_lock); -+ return err; -+} -+ -+void vzioctl_register(struct vzioctlinfo *inf) -+{ -+ spin_lock(&ioctl_lock); -+ list_add(&inf->list, &ioctls); -+ spin_unlock(&ioctl_lock); -+} -+ -+void vzioctl_unregister(struct vzioctlinfo *inf) -+{ -+ spin_lock(&ioctl_lock); -+ list_del_init(&inf->list); -+ spin_unlock(&ioctl_lock); -+} -+ -+EXPORT_SYMBOL(vzioctl_register); -+EXPORT_SYMBOL(vzioctl_unregister); -+ -+/* -+ * Init/exit stuff. -+ */ -+static struct file_operations vzctl_fops = { -+ .owner = THIS_MODULE, -+ .ioctl = vzctl_ioctl, -+}; -+ -+static void __exit vzctl_exit(void) -+{ -+ unregister_chrdev(VZCTL_MAJOR, VZCTL_NAME); -+} -+ -+static int __init vzctl_init(void) -+{ -+ int ret; -+ -+ ret = register_chrdev(VZCTL_MAJOR, VZCTL_NAME, &vzctl_fops); -+ return ret; -+} -+ -+module_init(vzctl_init) -+module_exit(vzctl_exit); -diff -uprN linux-2.6.8.1.orig/kernel/vzwdog.c linux-2.6.8.1-ve022stab078/kernel/vzwdog.c ---- linux-2.6.8.1.orig/kernel/vzwdog.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/kernel/vzwdog.c 2006-05-11 13:05:40.000000000 +0400 -@@ -0,0 +1,278 @@ -+/* -+ * kernel/vzwdog.c -+ * -+ * Copyright (C) 2000-2005 SWsoft -+ * All rights reserved. -+ * -+ * Licensing governed by "linux/COPYING.SWsoft" file. 
-+ * -+ */ -+ -+#include <linux/sched.h> -+#include <linux/fs.h> -+#include <linux/list.h> -+#include <linux/ctype.h> -+#include <linux/kobject.h> -+#include <linux/genhd.h> -+#include <linux/module.h> -+#include <linux/init.h> -+#include <linux/kernel.h> -+#include <linux/kernel_stat.h> -+#include <linux/smp_lock.h> -+#include <linux/errno.h> -+#include <linux/suspend.h> -+#include <linux/ve.h> -+#include <linux/vzstat.h> -+ -+/* Staff regading kernel thread polling VE validity */ -+static int sleep_timeout = 60; -+static pid_t wdog_thread_pid; -+static int wdog_thread_continue = 1; -+static DECLARE_COMPLETION(license_thread_exited); -+ -+extern void show_mem(void); -+extern struct ve_struct *ve_list_head; -+ -+#if 0 -+static char page[PAGE_SIZE]; -+ -+static void parse_irq_list(int len) -+{ -+ int i, k, skip; -+ for (i = 0; i < len; ) { -+ k = i; -+ while (i < len && page[i] != '\n' && page[i] != ':') -+ i++; -+ skip = 0; -+ if (i < len && page[i] != '\n') { -+ i++; /* skip ':' */ -+ while (i < len && (page[i] == ' ' || page[i] == '0')) -+ i++; -+ skip = (i < len && (page[i] < '0' || page[i] > '9')); -+ while (i < len && page[i] != '\n') -+ i++; -+ } -+ if (!skip) -+ printk("\n%.*s", i - k, page + k); -+ if (i < len) -+ i++; /* skip '\n' */ -+ } -+} -+#endif -+ -+static void show_irq_list(void) -+{ -+#if 0 -+ i = KSYMSAFECALL(int, get_irq_list, (page)); -+ parse_irq_list(i); /* Safe, zero was returned if unassigned */ -+#endif -+} -+ -+static void show_alloc_latency(void) -+{ -+ static const char *alloc_descr[KSTAT_ALLOCSTAT_NR] = { -+ "A0", -+ "L0", -+ "H0", -+ "L1", -+ "H1" -+ }; -+ int i; -+ -+ printk("lat: "); -+ for (i = 0; i < KSTAT_ALLOCSTAT_NR; i++) { -+ struct kstat_lat_struct *p; -+ cycles_t maxlat, avg0, avg1, avg2; -+ -+ p = &kstat_glob.alloc_lat[i]; -+ spin_lock_irq(&kstat_glb_lock); -+ maxlat = p->last.maxlat; -+ avg0 = p->avg[0]; -+ avg1 = p->avg[1]; -+ avg2 = p->avg[2]; -+ spin_unlock_irq(&kstat_glb_lock); -+ -+ printk("%s %Lu (%Lu %Lu %Lu)", -+ 
alloc_descr[i], -+ maxlat, -+ avg0, -+ avg1, -+ avg2); -+ } -+ printk("\n"); -+} -+ -+static void show_schedule_latency(void) -+{ -+ struct kstat_lat_pcpu_struct *p; -+ cycles_t maxlat, totlat, avg0, avg1, avg2; -+ unsigned long count; -+ -+ p = &kstat_glob.sched_lat; -+ spin_lock_irq(&kstat_glb_lock); -+ maxlat = p->last.maxlat; -+ totlat = p->last.totlat; -+ count = p->last.count; -+ avg0 = p->avg[0]; -+ avg1 = p->avg[1]; -+ avg2 = p->avg[2]; -+ spin_unlock_irq(&kstat_glb_lock); -+ -+ printk("sched lat: %Lu/%Lu/%lu (%Lu %Lu %Lu)\n", -+ maxlat, -+ totlat, -+ count, -+ avg0, -+ avg1, -+ avg2); -+} -+ -+static void show_header(void) -+{ -+ struct timeval tv; -+ -+ do_gettimeofday(&tv); -+ printk("*** VZWDOG 1.14: time %lu.%06lu uptime %Lu CPU %d ***\n", -+ tv.tv_sec, tv.tv_usec, -+ get_jiffies_64(), smp_processor_id()); -+ printk("*** cycles_per_jiffy %lu jiffies_per_second %u ***\n", -+ cycles_per_jiffy, HZ); -+} -+ -+static void show_pgdatinfo(void) -+{ -+ pg_data_t *pgdat; -+ -+ printk("pgdat:"); -+ for_each_pgdat(pgdat) { -+ printk(" %d: %lu,%lu,%lu,%p", -+ pgdat->node_id, -+ pgdat->node_start_pfn, -+ pgdat->node_present_pages, -+ pgdat->node_spanned_pages, -+ pgdat->node_mem_map); -+ } -+ printk("\n"); -+} -+ -+extern struct subsystem *get_block_subsys(void); -+static void show_diskio(void) -+{ -+ struct gendisk *gd; -+ struct subsystem *block_subsys; -+ char buf[BDEVNAME_SIZE]; -+ -+ printk("disk_io: "); -+ -+ block_subsys = get_block_subsys(); -+ down_read(&block_subsys->rwsem); -+ list_for_each_entry(gd, &block_subsys->kset.list, kobj.entry) { -+ char *name; -+ name = disk_name(gd, 0, buf); -+ if ((strlen(name) > 4) && (strncmp(name, "loop", 4) == 0) && -+ isdigit(name[4])) -+ continue; -+ if ((strlen(name) > 3) && (strncmp(name, "ram", 3) == 0) && -+ isdigit(name[3])) -+ continue; -+ printk("(%u,%u) %s r(%u %u %u) w(%u %u %u)\n", -+ gd->major, gd->first_minor, -+ name, -+ disk_stat_read(gd, reads), -+ disk_stat_read(gd, read_sectors), -+ disk_stat_read(gd, 
read_merges), -+ disk_stat_read(gd, writes), -+ disk_stat_read(gd, write_sectors), -+ disk_stat_read(gd, write_merges)); -+ } -+ up_read(&block_subsys->rwsem); -+ -+ printk("\n"); -+} -+ -+static void show_nrprocs(void) -+{ -+ unsigned long _nr_running, _nr_sleeping, -+ _nr_unint, _nr_zombie, _nr_dead, _nr_stopped; -+ -+ _nr_running = nr_running(); -+ _nr_unint = nr_uninterruptible(); -+ _nr_sleeping = nr_sleeping(); -+ _nr_zombie = nr_zombie; -+ _nr_dead = nr_dead; -+ _nr_stopped = nr_stopped(); -+ -+ printk("VEnum: %d, proc R %lu, S %lu, D %lu, " -+ "Z %lu, X %lu, T %lu (tot %d)\n", -+ nr_ve, _nr_running, _nr_sleeping, _nr_unint, -+ _nr_zombie, _nr_dead, _nr_stopped, nr_threads); -+} -+ -+static void wdog_print(void) -+{ -+ show_header(); -+ show_irq_list(); -+ show_pgdatinfo(); -+ show_mem(); -+ show_diskio(); -+ show_schedule_latency(); -+ show_alloc_latency(); -+ show_nrprocs(); -+} -+ -+static int wdog_loop(void* data) -+{ -+ struct task_struct *tsk = current; -+ DECLARE_WAIT_QUEUE_HEAD(thread_wait_queue); -+ -+ /* -+ * This thread doesn't need any user-level access, -+ * so get rid of all our resources -+ */ -+ daemonize("wdogd"); -+ -+ spin_lock_irq(&tsk->sighand->siglock); -+ sigfillset(&tsk->blocked); -+ sigdelset(&tsk->blocked, SIGHUP); -+ recalc_sigpending(); -+ spin_unlock_irq(&tsk->sighand->siglock); -+ -+ while (wdog_thread_continue) { -+ wdog_print(); -+ interruptible_sleep_on_timeout(&thread_wait_queue, -+ sleep_timeout*HZ); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); -+ /* clear all signals */ -+ if (signal_pending(tsk)) -+ flush_signals(tsk); -+ } -+ -+ complete_and_exit(&license_thread_exited, 0); -+} -+ -+static int __init wdog_init(void) -+{ -+ wdog_thread_pid = kernel_thread(wdog_loop, NULL, 0); -+ if (wdog_thread_pid < 0) -+ return wdog_thread_pid; -+ -+ return 0; -+} -+ -+static void __exit wdog_exit(void) -+{ -+ wdog_thread_continue = 0; -+ if (wdog_thread_pid > 0) { -+ kill_proc(wdog_thread_pid, SIGHUP, 1); -+ 
wait_for_completion(&license_thread_exited); -+ } -+} -+ -+MODULE_PARM(sleep_timeout, "i"); -+MODULE_AUTHOR("SWsoft <info@sw-soft.com>"); -+MODULE_DESCRIPTION("Virtuozzo WDOG"); -+MODULE_LICENSE("GPL v2"); -+ -+module_init(wdog_init) -+module_exit(wdog_exit) -diff -uprN linux-2.6.8.1.orig/lib/bust_spinlocks.c linux-2.6.8.1-ve022stab078/lib/bust_spinlocks.c ---- linux-2.6.8.1.orig/lib/bust_spinlocks.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/lib/bust_spinlocks.c 2006-05-11 13:05:24.000000000 +0400 -@@ -14,26 +14,15 @@ - #include <linux/wait.h> - #include <linux/vt_kern.h> - -- - void bust_spinlocks(int yes) - { - if (yes) { - oops_in_progress = 1; - } else { -- int loglevel_save = console_loglevel; - #ifdef CONFIG_VT - unblank_screen(); - #endif - oops_in_progress = 0; -- /* -- * OK, the message is on the console. Now we call printk() -- * without oops_in_progress set so that printk() will give klogd -- * and the blanked console a poke. Hold onto your hats... -- */ -- console_loglevel = 15; /* NMI oopser may have shut the console up */ -- printk(" "); -- console_loglevel = loglevel_save; -+ wake_up_klogd(); - } - } -- -- -diff -uprN linux-2.6.8.1.orig/lib/inflate.c linux-2.6.8.1-ve022stab078/lib/inflate.c ---- linux-2.6.8.1.orig/lib/inflate.c 2004-08-14 14:55:31.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/lib/inflate.c 2006-05-11 13:05:34.000000000 +0400 -@@ -322,7 +322,7 @@ DEBG("huft1 "); - { - *t = (struct huft *)NULL; - *m = 0; -- return 0; -+ return 2; - } - - DEBG("huft2 "); -@@ -370,6 +370,7 @@ DEBG("huft5 "); - if ((j = *p++) != 0) - v[x[j]++] = i; - } while (++i < n); -+ n = x[g]; /* set n to length of v */ - - DEBG("h6 "); - -@@ -406,12 +407,13 @@ DEBG1("1 "); - DEBG1("2 "); - f -= a + 1; /* deduct codes from patterns left */ - xp = c + k; -- while (++j < z) /* try smaller tables up to z bits */ -- { -- if ((f <<= 1) <= *++xp) -- break; /* enough codes to use up j bits */ -- f -= *xp; /* else deduct codes from patterns */ -- 
} -+ if (j < z) -+ while (++j < z) /* try smaller tables up to z bits */ -+ { -+ if ((f <<= 1) <= *++xp) -+ break; /* enough codes to use up j bits */ -+ f -= *xp; /* else deduct codes from patterns */ -+ } - } - DEBG1("3 "); - z = 1 << j; /* table entries for j-bit table */ -diff -uprN linux-2.6.8.1.orig/lib/rwsem-spinlock.c linux-2.6.8.1-ve022stab078/lib/rwsem-spinlock.c ---- linux-2.6.8.1.orig/lib/rwsem-spinlock.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/lib/rwsem-spinlock.c 2006-05-11 13:05:25.000000000 +0400 -@@ -140,12 +140,12 @@ void fastcall __sched __down_read(struct - - rwsemtrace(sem, "Entering __down_read"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irq(&sem->wait_lock); - - if (sem->activity >= 0 && list_empty(&sem->wait_list)) { - /* granted */ - sem->activity++; -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irq(&sem->wait_lock); - goto out; - } - -@@ -160,7 +160,7 @@ void fastcall __sched __down_read(struct - list_add_tail(&waiter.list, &sem->wait_list); - - /* we don't need to touch the semaphore struct anymore */ -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irq(&sem->wait_lock); - - /* wait to be given the lock */ - for (;;) { -@@ -181,10 +181,12 @@ void fastcall __sched __down_read(struct - */ - int fastcall __down_read_trylock(struct rw_semaphore *sem) - { -+ unsigned long flags; - int ret = 0; -+ - rwsemtrace(sem, "Entering __down_read_trylock"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - if (sem->activity >= 0 && list_empty(&sem->wait_list)) { - /* granted */ -@@ -192,7 +194,7 @@ int fastcall __down_read_trylock(struct - ret = 1; - } - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving __down_read_trylock"); - return ret; -@@ -209,12 +211,12 @@ void fastcall __sched __down_write(struc - - rwsemtrace(sem, "Entering __down_write"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irq(&sem->wait_lock); - - if 
(sem->activity == 0 && list_empty(&sem->wait_list)) { - /* granted */ - sem->activity = -1; -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irq(&sem->wait_lock); - goto out; - } - -@@ -229,7 +231,7 @@ void fastcall __sched __down_write(struc - list_add_tail(&waiter.list, &sem->wait_list); - - /* we don't need to touch the semaphore struct anymore */ -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irq(&sem->wait_lock); - - /* wait to be given the lock */ - for (;;) { -@@ -250,10 +252,12 @@ void fastcall __sched __down_write(struc - */ - int fastcall __down_write_trylock(struct rw_semaphore *sem) - { -+ unsigned long flags; - int ret = 0; -+ - rwsemtrace(sem, "Entering __down_write_trylock"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - if (sem->activity == 0 && list_empty(&sem->wait_list)) { - /* granted */ -@@ -261,7 +265,7 @@ int fastcall __down_write_trylock(struct - ret = 1; - } - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving __down_write_trylock"); - return ret; -@@ -272,14 +276,16 @@ int fastcall __down_write_trylock(struct - */ - void fastcall __up_read(struct rw_semaphore *sem) - { -+ unsigned long flags; -+ - rwsemtrace(sem, "Entering __up_read"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - if (--sem->activity == 0 && !list_empty(&sem->wait_list)) - sem = __rwsem_wake_one_writer(sem); - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving __up_read"); - } -@@ -289,15 +295,17 @@ void fastcall __up_read(struct rw_semaph - */ - void fastcall __up_write(struct rw_semaphore *sem) - { -+ unsigned long flags; -+ - rwsemtrace(sem, "Entering __up_write"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - sem->activity = 0; - if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem, 1); - -- spin_unlock(&sem->wait_lock); -+ 
spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving __up_write"); - } -@@ -308,15 +316,17 @@ void fastcall __up_write(struct rw_semap - */ - void fastcall __downgrade_write(struct rw_semaphore *sem) - { -+ unsigned long flags; -+ - rwsemtrace(sem, "Entering __downgrade_write"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - sem->activity = 1; - if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem, 0); - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving __downgrade_write"); - } -diff -uprN linux-2.6.8.1.orig/lib/rwsem.c linux-2.6.8.1-ve022stab078/lib/rwsem.c ---- linux-2.6.8.1.orig/lib/rwsem.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/lib/rwsem.c 2006-05-11 13:05:25.000000000 +0400 -@@ -150,7 +150,7 @@ rwsem_down_failed_common(struct rw_semap - set_task_state(tsk, TASK_UNINTERRUPTIBLE); - - /* set up my own style of waitqueue */ -- spin_lock(&sem->wait_lock); -+ spin_lock_irq(&sem->wait_lock); - waiter->task = tsk; - get_task_struct(tsk); - -@@ -163,7 +163,7 @@ rwsem_down_failed_common(struct rw_semap - if (!(count & RWSEM_ACTIVE_MASK)) - sem = __rwsem_do_wake(sem, 0); - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irq(&sem->wait_lock); - - /* wait to be given the lock */ - for (;;) { -@@ -219,15 +219,17 @@ rwsem_down_write_failed(struct rw_semaph - */ - struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem) - { -+ unsigned long flags; -+ - rwsemtrace(sem, "Entering rwsem_wake"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - /* do nothing if list empty */ - if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem, 0); - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving rwsem_wake"); - -@@ -241,15 +243,17 @@ struct rw_semaphore fastcall *rwsem_wake - */ - struct rw_semaphore fastcall 
*rwsem_downgrade_wake(struct rw_semaphore *sem) - { -+ unsigned long flags; -+ - rwsemtrace(sem, "Entering rwsem_downgrade_wake"); - -- spin_lock(&sem->wait_lock); -+ spin_lock_irqsave(&sem->wait_lock, flags); - - /* do nothing if list empty */ - if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem, 1); - -- spin_unlock(&sem->wait_lock); -+ spin_unlock_irqrestore(&sem->wait_lock, flags); - - rwsemtrace(sem, "Leaving rwsem_downgrade_wake"); - return sem; -diff -uprN linux-2.6.8.1.orig/mm/Makefile linux-2.6.8.1-ve022stab078/mm/Makefile ---- linux-2.6.8.1.orig/mm/Makefile 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/Makefile 2006-05-11 13:05:38.000000000 +0400 -@@ -13,5 +13,6 @@ obj-y := bootmem.o filemap.o mempool.o - $(mmu-y) - - obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o -+obj-$(CONFIG_X86_4G) += usercopy.o - obj-$(CONFIG_HUGETLBFS) += hugetlb.o - obj-$(CONFIG_NUMA) += mempolicy.o -diff -uprN linux-2.6.8.1.orig/mm/filemap.c linux-2.6.8.1-ve022stab078/mm/filemap.c ---- linux-2.6.8.1.orig/mm/filemap.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/filemap.c 2006-05-11 13:05:40.000000000 +0400 -@@ -127,20 +127,6 @@ void remove_from_page_cache(struct page - spin_unlock_irq(&mapping->tree_lock); - } - --static inline int sync_page(struct page *page) --{ -- struct address_space *mapping; -- -- /* -- * FIXME, fercrissake. What is this barrier here for? -- */ -- smp_mb(); -- mapping = page_mapping(page); -- if (mapping && mapping->a_ops && mapping->a_ops->sync_page) -- return mapping->a_ops->sync_page(page); -- return 0; --} -- - /** - * filemap_fdatawrite - start writeback against all of a mapping's dirty pages - * @mapping: address space structure to write -@@ -828,6 +814,8 @@ int file_read_actor(read_descriptor_t *d - if (size > count) - size = count; - -+ left = size; -+#ifndef CONFIG_X86_UACCESS_INDIRECT - /* - * Faults on the destination of a read are common, so do it before - * taking the kmap. 
-@@ -836,20 +824,21 @@ int file_read_actor(read_descriptor_t *d - kaddr = kmap_atomic(page, KM_USER0); - left = __copy_to_user(desc->arg.buf, kaddr + offset, size); - kunmap_atomic(kaddr, KM_USER0); -- if (left == 0) -- goto success; - } -+#endif - -- /* Do it the slow way */ -- kaddr = kmap(page); -- left = __copy_to_user(desc->arg.buf, kaddr + offset, size); -- kunmap(page); -- -- if (left) { -- size -= left; -- desc->error = -EFAULT; -+ if (left != 0) { -+ /* Do it the slow way */ -+ kaddr = kmap(page); -+ left = __copy_to_user(desc->arg.buf, kaddr + offset, size); -+ kunmap(page); -+ -+ if (left) { -+ size -= left; -+ desc->error = -EFAULT; -+ } - } --success: -+ - desc->count = count - size; - desc->written += size; - desc->arg.buf += size; -@@ -926,8 +915,8 @@ __generic_file_aio_read(struct kiocb *io - desc.error = 0; - do_generic_file_read(filp,ppos,&desc,file_read_actor); - retval += desc.written; -- if (!retval) { -- retval = desc.error; -+ if (desc.error) { -+ retval = retval ?: desc.error; - break; - } - } -@@ -1629,9 +1618,13 @@ filemap_copy_from_user(struct page *page - char *kaddr; - int left; - -+#ifndef CONFIG_X86_UACCESS_INDIRECT - kaddr = kmap_atomic(page, KM_USER0); - left = __copy_from_user(kaddr + offset, buf, bytes); - kunmap_atomic(kaddr, KM_USER0); -+#else -+ left = bytes; -+#endif - - if (left != 0) { - /* Do it the slow way */ -@@ -1682,10 +1675,14 @@ filemap_copy_from_user_iovec(struct page - char *kaddr; - size_t copied; - -+#ifndef CONFIG_X86_UACCESS_INDIRECT - kaddr = kmap_atomic(page, KM_USER0); - copied = __filemap_copy_from_user_iovec(kaddr + offset, iov, - base, bytes); - kunmap_atomic(kaddr, KM_USER0); -+#else -+ copied = 0; -+#endif - if (copied != bytes) { - kaddr = kmap(page); - copied = __filemap_copy_from_user_iovec(kaddr + offset, iov, -diff -uprN linux-2.6.8.1.orig/mm/fremap.c linux-2.6.8.1-ve022stab078/mm/fremap.c ---- linux-2.6.8.1.orig/mm/fremap.c 2004-08-14 14:54:47.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/mm/fremap.c 2006-05-11 13:05:39.000000000 +0400 -@@ -19,6 +19,8 @@ - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_vmpages.h> -+ - static inline void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) - { -@@ -37,8 +39,11 @@ static inline void zap_pte(struct mm_str - if (pte_dirty(pte)) - set_page_dirty(page); - page_remove_rmap(page); -+ pb_remove_ref(page, mm_ub(mm)); - page_cache_release(page); - mm->rss--; -+ vma->vm_rss--; -+ ub_unused_privvm_inc(mm_ub(mm), 1, vma); - } - } - } else { -@@ -62,7 +67,10 @@ int install_page(struct mm_struct *mm, s - pgd_t *pgd; - pmd_t *pmd; - pte_t pte_val; -+ struct page_beancounter *pbc; - -+ if (pb_alloc(&pbc)) -+ goto err_pb; - pgd = pgd_offset(mm, addr); - spin_lock(&mm->page_table_lock); - -@@ -87,6 +95,9 @@ int install_page(struct mm_struct *mm, s - zap_pte(mm, vma, addr, pte); - - mm->rss++; -+ vma->vm_rss++; -+ pb_add_ref(page, mm_ub(mm), &pbc); -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); - flush_icache_page(vma, page); - set_pte(pte, mk_pte(page, prot)); - page_add_file_rmap(page); -@@ -97,6 +108,8 @@ int install_page(struct mm_struct *mm, s - err = 0; - err_unlock: - spin_unlock(&mm->page_table_lock); -+ pb_free(&pbc); -+err_pb: - return err; - } - EXPORT_SYMBOL(install_page); -diff -uprN linux-2.6.8.1.orig/mm/highmem.c linux-2.6.8.1-ve022stab078/mm/highmem.c ---- linux-2.6.8.1.orig/mm/highmem.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/highmem.c 2006-05-11 13:05:28.000000000 +0400 -@@ -284,7 +284,7 @@ static void copy_to_high_bio_irq(struct - struct bio_vec *tovec, *fromvec; - int i; - -- bio_for_each_segment(tovec, to, i) { -+ __bio_for_each_segment(tovec, to, i, 0) { - fromvec = from->bi_io_vec + i; - - /* -@@ -316,7 +316,7 @@ static void bounce_end_io(struct bio *bi - /* - * free up bounce indirect pages used - */ -- bio_for_each_segment(bvec, bio, i) { -+ __bio_for_each_segment(bvec, 
bio, i, 0) { - org_vec = bio_orig->bi_io_vec + i; - if (bvec->bv_page == org_vec->bv_page) - continue; -@@ -423,7 +423,7 @@ static void __blk_queue_bounce(request_q - * at least one page was bounced, fill in possible non-highmem - * pages - */ -- bio_for_each_segment(from, *bio_orig, i) { -+ __bio_for_each_segment(from, *bio_orig, i, 0) { - to = bio_iovec_idx(bio, i); - if (!to->bv_page) { - to->bv_page = from->bv_page; -diff -uprN linux-2.6.8.1.orig/mm/memory.c linux-2.6.8.1-ve022stab078/mm/memory.c ---- linux-2.6.8.1.orig/mm/memory.c 2004-08-14 14:55:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/memory.c 2006-05-11 13:05:49.000000000 +0400 -@@ -40,6 +40,7 @@ - #include <linux/mm.h> - #include <linux/hugetlb.h> - #include <linux/mman.h> -+#include <linux/virtinfo.h> - #include <linux/swap.h> - #include <linux/highmem.h> - #include <linux/pagemap.h> -@@ -56,6 +57,9 @@ - #include <linux/swapops.h> - #include <linux/elf.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_vmpages.h> -+ - #ifndef CONFIG_DISCONTIGMEM - /* use the per-pgdat data instead for discontigmem - mbligh */ - unsigned long max_mapnr; -@@ -117,7 +121,8 @@ static inline void free_one_pmd(struct m - pte_free_tlb(tlb, page); - } - --static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir) -+static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir, -+ int pgd_idx) - { - int j; - pmd_t * pmd; -@@ -131,8 +136,11 @@ static inline void free_one_pgd(struct m - } - pmd = pmd_offset(dir, 0); - pgd_clear(dir); -- for (j = 0; j < PTRS_PER_PMD ; j++) -+ for (j = 0; j < PTRS_PER_PMD ; j++) { -+ if (pgd_idx * PGDIR_SIZE + j * PMD_SIZE >= TASK_SIZE) -+ break; - free_one_pmd(tlb, pmd+j); -+ } - pmd_free_tlb(tlb, pmd); - } - -@@ -145,11 +153,13 @@ static inline void free_one_pgd(struct m - void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr) - { - pgd_t * page_dir = tlb->mm->pgd; -+ int pgd_idx = first; - - page_dir += first; - do { -- free_one_pgd(tlb, 
page_dir); -+ free_one_pgd(tlb, page_dir, pgd_idx); - page_dir++; -+ pgd_idx++; - } while (--nr); - } - -@@ -205,6 +215,8 @@ out: - } - #define PTE_TABLE_MASK ((PTRS_PER_PTE-1) * sizeof(pte_t)) - #define PMD_TABLE_MASK ((PTRS_PER_PMD-1) * sizeof(pmd_t)) -+#define pb_list_size(addr) \ -+ (PTRS_PER_PTE - ((addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) - - /* - * copy one vm_area from one task to the other. Assumes the page tables -@@ -217,13 +229,15 @@ out: - * dst->page_table_lock is held on entry and exit, - * but may be dropped within pmd_alloc() and pte_alloc_map(). - */ --int copy_page_range(struct mm_struct *dst, struct mm_struct *src, -- struct vm_area_struct *vma) -+int __copy_page_range(struct vm_area_struct *vma, struct mm_struct *src, -+ unsigned long address, size_t size) - { -+ struct mm_struct *dst = vma->vm_mm; - pgd_t * src_pgd, * dst_pgd; -- unsigned long address = vma->vm_start; -- unsigned long end = vma->vm_end; -+ unsigned long end = address + size; - unsigned long cow; -+ struct page_beancounter *pbc; -+ int need_pbc; - - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst, src, vma); -@@ -231,6 +245,8 @@ int copy_page_range(struct mm_struct *ds - cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; - src_pgd = pgd_offset(src, address)-1; - dst_pgd = pgd_offset(dst, address)-1; -+ pbc = NULL; -+ need_pbc = (mm_ub(dst) != mm_ub(src)); - - for (;;) { - pmd_t * src_pmd, * dst_pmd; -@@ -272,6 +288,10 @@ skip_copy_pte_range: - goto cont_copy_pmd_range; - } - -+ if (need_pbc && -+ pb_alloc_list(&pbc, pb_list_size(address), dst)) -+ goto nomem; -+ - dst_pte = pte_alloc_map(dst, dst_pmd, address); - if (!dst_pte) - goto nomem; -@@ -326,6 +346,9 @@ skip_copy_pte_range: - pte = pte_mkold(pte); - get_page(page); - dst->rss++; -+ vma->vm_rss++; -+ ub_unused_privvm_dec(mm_ub(dst), 1, vma); -+ pb_add_list_ref(page, mm_ub(dst), &pbc); - set_pte(dst_pte, pte); - page_dup_rmap(page); - cont_copy_pte_range_noset: -@@ -350,11 +373,21 @@ 
cont_copy_pmd_range: - out_unlock: - spin_unlock(&src->page_table_lock); - out: -+ pb_free_list(&pbc); - return 0; - nomem: -+ pb_free_list(&pbc); - return -ENOMEM; - } - -+int copy_page_range(struct mm_struct *dst, struct mm_struct *src, -+ struct vm_area_struct *vma) -+{ -+ if (vma->vm_mm != dst) -+ BUG(); -+ return __copy_page_range(vma, src, vma->vm_start, vma->vm_end-vma->vm_start); -+} -+ - static void zap_pte_range(struct mmu_gather *tlb, - pmd_t *pmd, unsigned long address, - unsigned long size, struct zap_details *details) -@@ -420,6 +453,7 @@ static void zap_pte_range(struct mmu_gat - mark_page_accessed(page); - tlb->freed++; - page_remove_rmap(page); -+ pb_remove_ref(page, mm_ub(tlb->mm)); - tlb_remove_page(tlb, page); - continue; - } -@@ -441,7 +475,7 @@ static void zap_pmd_range(struct mmu_gat - unsigned long size, struct zap_details *details) - { - pmd_t * pmd; -- unsigned long end; -+ unsigned long end, pgd_boundary; - - if (pgd_none(*dir)) - return; -@@ -452,8 +486,9 @@ static void zap_pmd_range(struct mmu_gat - } - pmd = pmd_offset(dir, address); - end = address + size; -- if (end > ((address + PGDIR_SIZE) & PGDIR_MASK)) -- end = ((address + PGDIR_SIZE) & PGDIR_MASK); -+ pgd_boundary = ((address + PGDIR_SIZE) & PGDIR_MASK); -+ if (pgd_boundary && (end > pgd_boundary)) -+ end = pgd_boundary; - do { - zap_pte_range(tlb, pmd, address, end - address, details); - address = (address + PMD_SIZE) & PMD_MASK; -@@ -461,20 +496,63 @@ static void zap_pmd_range(struct mmu_gat - } while (address && (address < end)); - } - -+static void warn_bad_zap(struct vm_area_struct *vma, unsigned long freed) -+{ -+#ifdef CONFIG_USER_RESOURCE -+ static struct ub_rate_info ri = { -+ .burst = 10, -+ .interval = 40 * HZ, -+ }; -+ struct user_beancounter *ub; -+ char ubuid[64] = "No UB"; -+ -+ if (!ub_ratelimit(&ri)) -+ return; -+ -+ ub = mm_ub(vma->vm_mm); -+ if (ub) -+ print_ub_uid(ub, ubuid, sizeof(ubuid)); -+ -+#else -+ const char ubuid[] = "0"; -+#endif -+ -+ 
printk(KERN_WARNING -+ "%s vm_rss: process pid %d comm %.20s flags %lx, " -+ "vma %p %08lx-%08lx %p rss %lu freed %lu\n flags %lx, " -+ "ub %s\n", -+ vma->vm_rss > freed ? "Positive" : "Negative", -+ current->pid, current->comm, current->flags, -+ vma, vma->vm_start, vma->vm_end, vma->vm_file, -+ vma->vm_rss, freed, vma->vm_flags, ubuid); -+ dump_stack(); -+} -+ - static void unmap_page_range(struct mmu_gather *tlb, - struct vm_area_struct *vma, unsigned long address, - unsigned long end, struct zap_details *details) - { -+ unsigned long freed; - pgd_t * dir; - - BUG_ON(address >= end); - dir = pgd_offset(vma->vm_mm, address); - tlb_start_vma(tlb, vma); -+ freed = tlb->freed; - do { - zap_pmd_range(tlb, dir, address, end - address, details); - address = (address + PGDIR_SIZE) & PGDIR_MASK; - dir++; - } while (address && (address < end)); -+ freed = tlb->freed - freed; -+ if (freed) { -+ ub_unused_privvm_inc(mm_ub(tlb->mm), freed, vma); -+ if (vma->vm_rss < freed) { -+ warn_bad_zap(vma, freed); -+ freed = vma->vm_rss; -+ } -+ vma->vm_rss -= freed; -+ } - tlb_end_vma(tlb, vma); - } - -@@ -596,6 +674,7 @@ void zap_page_range(struct vm_area_struc - unsigned long nr_accounted = 0; - - if (is_vm_hugetlb_page(vma)) { -+ /* ub acct is performed in unmap_hugepage_range */ - zap_hugepage_range(vma, address, size); - return; - } -@@ -604,6 +683,8 @@ void zap_page_range(struct vm_area_struc - spin_lock(&mm->page_table_lock); - tlb = tlb_gather_mmu(mm, 0); - unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details); -+ if (vma->vm_rss && address == vma->vm_start && end == vma->vm_end) -+ warn_bad_zap(vma, 0); - tlb_finish_mmu(tlb, address, end); - spin_unlock(&mm->page_table_lock); - } -@@ -612,21 +693,98 @@ void zap_page_range(struct vm_area_struc - * Do a quick page-table lookup for a single page. - * mm->page_table_lock must be held. 
- */ --struct page * --follow_page(struct mm_struct *mm, unsigned long address, int write) -+static struct page * -+pgd_follow_page(struct mm_struct *mm, pgd_t *pgd, unsigned long address, -+ int write) - { -- pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep, pte; - unsigned long pfn; - struct page *page; - -+ pmd = pmd_offset(pgd, address); -+ if (pmd_none(*pmd)) -+ goto out; -+ if (pmd_huge(*pmd)) -+ return follow_huge_pmd(mm, address, pmd, write); -+ if (unlikely(pmd_bad(*pmd))) -+ goto out; -+ -+ ptep = pte_offset_map(pmd, address); -+ if (!ptep) -+ goto out; -+ -+ pte = *ptep; -+ pte_unmap(ptep); -+ if (pte_present(pte)) { -+ if (write && !pte_write(pte)) -+ goto out; -+ pfn = pte_pfn(pte); -+ if (pfn_valid(pfn)) { -+ page = pfn_to_page(pfn); -+ if (write && !pte_dirty(pte) && !PageDirty(page)) -+ set_page_dirty(page); -+ mark_page_accessed(page); -+ return page; -+ } -+ } -+ -+out: -+ return NULL; -+} -+ -+struct page * -+follow_page(struct mm_struct *mm, unsigned long address, int write) -+{ -+ pgd_t *pgd; -+ struct page *page; -+ - page = follow_huge_addr(mm, address, write); - if (! IS_ERR(page)) - return page; - - pgd = pgd_offset(mm, address); - if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) -+ return NULL; -+ -+ return pgd_follow_page(mm, pgd, address, write); -+} -+ -+struct page * -+follow_page_k(unsigned long address, int write) -+{ -+ pgd_t *pgd; -+ struct page *page; -+ -+ page = follow_huge_addr(&init_mm, address, write); -+ if (! 
IS_ERR(page)) -+ return page; -+ -+ pgd = pgd_offset_k(address); -+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) -+ return NULL; -+ -+ return pgd_follow_page(&init_mm, pgd, address, write); -+} -+ -+struct page * -+follow_page_pte(struct mm_struct *mm, unsigned long address, int write, -+ pte_t *page_pte) -+{ -+ pgd_t *pgd; -+ pmd_t *pmd; -+ pte_t *ptep, pte; -+ unsigned long pfn; -+ struct page *page; -+ -+ -+ memset(page_pte, 0, sizeof(*page_pte)); -+ page = follow_huge_addr(mm, address, write); -+ if (!IS_ERR(page)) -+ return page; -+ -+ pgd = pgd_offset(mm, address); -+ if (pgd_none(*pgd) || pgd_bad(*pgd)) - goto out; - - pmd = pmd_offset(pgd, address); -@@ -634,7 +792,7 @@ follow_page(struct mm_struct *mm, unsign - goto out; - if (pmd_huge(*pmd)) - return follow_huge_pmd(mm, address, pmd, write); -- if (unlikely(pmd_bad(*pmd))) -+ if (pmd_bad(*pmd)) - goto out; - - ptep = pte_offset_map(pmd, address); -@@ -643,16 +801,23 @@ follow_page(struct mm_struct *mm, unsign - - pte = *ptep; - pte_unmap(ptep); -- if (pte_present(pte)) { -+ if (pte_present(pte) && pte_read(pte)) { - if (write && !pte_write(pte)) - goto out; -+ if (write && !pte_dirty(pte)) { -+ struct page *page = pte_page(pte); -+ if (!PageDirty(page)) -+ set_page_dirty(page); -+ } - pfn = pte_pfn(pte); - if (pfn_valid(pfn)) { -- page = pfn_to_page(pfn); -- if (write && !pte_dirty(pte) && !PageDirty(page)) -- set_page_dirty(page); -+ struct page *page = pfn_to_page(pfn); -+ - mark_page_accessed(page); - return page; -+ } else { -+ *page_pte = pte; -+ return NULL; - } - } - -@@ -660,6 +825,7 @@ out: - return NULL; - } - -+ - /* - * Given a physical address, is there a useful struct page pointing to - * it? 
This may become more complex in the future if we start dealing -@@ -674,6 +840,7 @@ static inline struct page *get_page_map( - } - - -+#ifndef CONFIG_X86_4G - static inline int - untouched_anonymous_page(struct mm_struct* mm, struct vm_area_struct *vma, - unsigned long address) -@@ -698,6 +865,7 @@ untouched_anonymous_page(struct mm_struc - /* There is a pte slot for 'address' in 'mm'. */ - return 0; - } -+#endif - - - int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, -@@ -727,16 +895,16 @@ int get_user_pages(struct task_struct *t - pte_t *pte; - if (write) /* user gate pages are read-only */ - return i ? : -EFAULT; -- pgd = pgd_offset_gate(mm, pg); -- if (!pgd) -- return i ? : -EFAULT; -+ if (pg > TASK_SIZE) -+ pgd = pgd_offset_k(pg); -+ else -+ pgd = pgd_offset_gate(mm, pg); -+ BUG_ON(pgd_none(*pgd)); - pmd = pmd_offset(pgd, pg); -- if (!pmd) -+ if (pmd_none(*pmd)) - return i ? : -EFAULT; - pte = pte_offset_map(pmd, pg); -- if (!pte) -- return i ? : -EFAULT; -- if (!pte_present(*pte)) { -+ if (pte_none(*pte)) { - pte_unmap(pte); - return i ? : -EFAULT; - } -@@ -773,12 +941,21 @@ int get_user_pages(struct task_struct *t - * insanly big anonymously mapped areas that - * nobody touched so far. This is important - * for doing a core dump for these mappings. -+ * -+ * disable this for 4:4 - it prevents -+ * follow_page() from ever seeing these pages. -+ * -+ * (The 'fix' is dubious anyway, there's -+ * nothing that this code avoids which couldnt -+ * be triggered from userspace anyway.) - */ -+#ifndef CONFIG_X86_4G - if (!lookup_write && - untouched_anonymous_page(mm,vma,start)) { - map = ZERO_PAGE(start); - break; - } -+#endif - spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { - case VM_FAULT_MINOR: -@@ -968,6 +1145,15 @@ int remap_page_range(struct vm_area_stru - if (from >= end) - BUG(); - -+ /* -+ * Physically remapped pages are special. 
Tell the -+ * rest of the world about it: -+ * VM_IO tells people not to look at these pages -+ * (accesses can have side effects). -+ * VM_RESERVED tells swapout not to try to touch -+ * this region. -+ */ -+ vma->vm_flags |= VM_IO | VM_RESERVED; - spin_lock(&mm->page_table_lock); - do { - pmd_t *pmd = pmd_alloc(mm, dir, from); -@@ -1016,6 +1202,7 @@ static inline void break_cow(struct vm_a - vma); - ptep_establish(vma, address, page_table, entry); - update_mmu_cache(vma, address, entry); -+ lazy_mmu_prot_update(entry); - } - - /* -@@ -1042,6 +1229,7 @@ static int do_wp_page(struct mm_struct * - unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte) - { - struct page *old_page, *new_page; -+ struct page_beancounter *pbc; - unsigned long pfn = pte_pfn(pte); - pte_t entry; - -@@ -1068,6 +1256,7 @@ static int do_wp_page(struct mm_struct * - vma); - ptep_set_access_flags(vma, address, page_table, entry, 1); - update_mmu_cache(vma, address, entry); -+ lazy_mmu_prot_update(entry); - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; -@@ -1082,6 +1271,9 @@ static int do_wp_page(struct mm_struct * - page_cache_get(old_page); - spin_unlock(&mm->page_table_lock); - -+ if (pb_alloc(&pbc)) -+ goto out; -+ - if (unlikely(anon_vma_prepare(vma))) - goto no_new_page; - new_page = alloc_page_vma(GFP_HIGHUSER, vma, address); -@@ -1095,10 +1287,16 @@ static int do_wp_page(struct mm_struct * - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, address); - if (likely(pte_same(*page_table, pte))) { -- if (PageReserved(old_page)) -+ if (PageReserved(old_page)) { - ++mm->rss; -- else -+ ++vma->vm_rss; -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); -+ } else { - page_remove_rmap(old_page); -+ pb_remove_ref(old_page, mm_ub(mm)); -+ } -+ -+ pb_add_ref(new_page, mm_ub(mm), &pbc); - break_cow(vma, new_page, address, page_table); - lru_cache_add_active(new_page); - page_add_anon_rmap(new_page, vma, address); -@@ -1113,6 +1311,8 @@ 
static int do_wp_page(struct mm_struct * - return VM_FAULT_MINOR; - - no_new_page: -+ pb_free(&pbc); -+out: - page_cache_release(old_page); - return VM_FAULT_OOM; - } -@@ -1322,12 +1522,21 @@ static int do_swap_page(struct mm_struct - pte_t *page_table, pmd_t *pmd, pte_t orig_pte, int write_access) - { - struct page *page; -+ struct page_beancounter *pbc; - swp_entry_t entry = pte_to_swp_entry(orig_pte); - pte_t pte; -- int ret = VM_FAULT_MINOR; -+ int ret; -+ cycles_t start; - - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); -+ start = get_cycles(); -+ pbc = NULL; -+ ret = VM_FAULT_OOM; -+ if (pb_alloc(&pbc)) -+ goto out_nopbc; -+ -+ ret = VM_FAULT_MINOR; - page = lookup_swap_cache(entry); - if (!page) { - swapin_readahead(entry, address, vma); -@@ -1363,21 +1572,25 @@ static int do_swap_page(struct mm_struct - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, address); - if (unlikely(!pte_same(*page_table, orig_pte))) { -- pte_unmap(page_table); -- spin_unlock(&mm->page_table_lock); -- unlock_page(page); -- page_cache_release(page); - ret = VM_FAULT_MINOR; -- goto out; -+ goto out_nomap; -+ } -+ -+ if (unlikely(!PageUptodate(page))) { -+ ret = VM_FAULT_SIGBUS; -+ goto out_nomap; - } - - /* The page isn't present yet, go ahead with the fault. 
*/ - - swap_free(entry); -- if (vm_swap_full()) -- remove_exclusive_swap_page(page); -+ try_to_remove_exclusive_swap_page(page); - - mm->rss++; -+ vma->vm_rss++; -+ mm_ub(mm)->ub_perfstat[smp_processor_id()].swapin++; -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); -+ pb_add_ref(page, mm_ub(mm), &pbc); - pte = mk_pte(page, vma->vm_page_prot); - if (write_access && can_share_swap_page(page)) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); -@@ -1398,10 +1611,23 @@ static int do_swap_page(struct mm_struct - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, address, pte); -+ lazy_mmu_prot_update(pte); - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - out: -+ pb_free(&pbc); -+ spin_lock_irq(&kstat_glb_lock); -+ KSTAT_LAT_ADD(&kstat_glob.swap_in, get_cycles() - start); -+ spin_unlock_irq(&kstat_glb_lock); -+out_nopbc: - return ret; -+ -+out_nomap: -+ pte_unmap(page_table); -+ spin_unlock(&mm->page_table_lock); -+ unlock_page(page); -+ page_cache_release(page); -+ goto out; - } - - /* -@@ -1416,16 +1642,20 @@ do_anonymous_page(struct mm_struct *mm, - { - pte_t entry; - struct page * page = ZERO_PAGE(addr); -+ struct page_beancounter *pbc; - - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); - - /* ..except if it's a write access */ -+ pbc = NULL; - if (write_access) { - /* Allocate our own private page. 
*/ - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - -+ if (pb_alloc(&pbc)) -+ goto no_mem; - if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); -@@ -1443,6 +1673,9 @@ do_anonymous_page(struct mm_struct *mm, - goto out; - } - mm->rss++; -+ vma->vm_rss++; -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); -+ pb_add_ref(page, mm_ub(mm), &pbc); - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); -@@ -1456,10 +1689,13 @@ do_anonymous_page(struct mm_struct *mm, - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); -+ lazy_mmu_prot_update(entry); - spin_unlock(&mm->page_table_lock); - out: -+ pb_free(&pbc); - return VM_FAULT_MINOR; - no_mem: -+ pb_free(&pbc); - return VM_FAULT_OOM; - } - -@@ -1480,6 +1716,7 @@ do_no_page(struct mm_struct *mm, struct - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) - { - struct page * new_page; -+ struct page_beancounter *pbc; - struct address_space *mapping = NULL; - pte_t entry; - int sequence = 0; -@@ -1492,6 +1729,9 @@ do_no_page(struct mm_struct *mm, struct - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - -+ if (pb_alloc(&pbc)) -+ return VM_FAULT_OOM; -+ - if (vma->vm_file) { - mapping = vma->vm_file->f_mapping; - sequence = atomic_read(&mapping->truncate_count); -@@ -1501,10 +1741,14 @@ retry: - new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret); - - /* no page was available -- either SIGBUS or OOM */ -- if (new_page == NOPAGE_SIGBUS) -+ if (new_page == NOPAGE_SIGBUS) { -+ pb_free(&pbc); - return VM_FAULT_SIGBUS; -- if (new_page == NOPAGE_OOM) -+ } -+ if (new_page == NOPAGE_OOM) { -+ pb_free(&pbc); - return VM_FAULT_OOM; -+ } - - /* - * Should we do an early C-O-W break? -@@ -1550,8 +1794,12 @@ retry: - */ - /* Only go through if we didn't race with anybody else... 
*/ - if (pte_none(*page_table)) { -- if (!PageReserved(new_page)) -+ if (!PageReserved(new_page)) { - ++mm->rss; -+ ++vma->vm_rss; -+ ub_unused_privvm_dec(mm_ub(mm), 1, vma); -+ pb_add_ref(new_page, mm_ub(mm), &pbc); -+ } - flush_icache_page(vma, new_page); - entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) -@@ -1573,8 +1821,10 @@ retry: - - /* no need to invalidate: a not-present page shouldn't be cached */ - update_mmu_cache(vma, address, entry); -+ lazy_mmu_prot_update(entry); - spin_unlock(&mm->page_table_lock); - out: -+ pb_free(&pbc); - return ret; - oom: - page_cache_release(new_page); -@@ -1667,6 +1917,7 @@ static inline int handle_pte_fault(struc - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); -+ lazy_mmu_prot_update(entry); - pte_unmap(pte); - spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; -@@ -1681,6 +1932,18 @@ int handle_mm_fault(struct mm_struct *mm - pgd_t *pgd; - pmd_t *pmd; - -+#if CONFIG_VZ_GENCALLS -+ if (test_bit(UB_AFLAG_NOTIF_PAGEIN, &mm_ub(mm)->ub_aflags)) { -+ int ret; -+ ret = virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_PAGEIN, -+ (void *)1); -+ if (ret & NOTIFY_FAIL) -+ return VM_FAULT_SIGBUS; -+ if (ret & NOTIFY_OK) -+ return VM_FAULT_MINOR; /* retry */ -+ } -+#endif -+ - __set_current_state(TASK_RUNNING); - pgd = pgd_offset(mm, address); - -diff -uprN linux-2.6.8.1.orig/mm/mempolicy.c linux-2.6.8.1-ve022stab078/mm/mempolicy.c ---- linux-2.6.8.1.orig/mm/mempolicy.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mempolicy.c 2006-05-11 13:05:34.000000000 +0400 -@@ -136,6 +136,8 @@ static int get_nodes(unsigned long *node - bitmap_zero(nodes, MAX_NUMNODES); - if (maxnode == 0 || !nmask) - return 0; -+ if (maxnode > PAGE_SIZE*8 /*BITS_PER_BYTE*/) -+ return -EINVAL; - - nlongs = BITS_TO_LONGS(maxnode); - if ((maxnode % BITS_PER_LONG) == 0) -@@ -210,6 +212,10 @@ static struct mempolicy *mpol_new(int mo 
- switch (mode) { - case MPOL_INTERLEAVE: - bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES); -+ if (bitmap_weight(nodes, MAX_NUMNODES) == 0) { -+ kmem_cache_free(policy_cache, policy); -+ return ERR_PTR(-EINVAL); -+ } - break; - case MPOL_PREFERRED: - policy->v.preferred_node = find_first_bit(nodes, MAX_NUMNODES); -@@ -388,7 +394,7 @@ asmlinkage long sys_set_mempolicy(int mo - struct mempolicy *new; - DECLARE_BITMAP(nodes, MAX_NUMNODES); - -- if (mode > MPOL_MAX) -+ if (mode < 0 || mode > MPOL_MAX) - return -EINVAL; - err = get_nodes(nodes, nmask, maxnode, mode); - if (err) -@@ -508,9 +514,13 @@ asmlinkage long sys_get_mempolicy(int __ - } else - pval = pol->policy; - -- err = -EFAULT; -+ if (vma) { -+ up_read(¤t->mm->mmap_sem); -+ vma = NULL; -+ } -+ - if (policy && put_user(pval, policy)) -- goto out; -+ return -EFAULT; - - err = 0; - if (nmask) { -diff -uprN linux-2.6.8.1.orig/mm/mempool.c linux-2.6.8.1-ve022stab078/mm/mempool.c ---- linux-2.6.8.1.orig/mm/mempool.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mempool.c 2006-05-11 13:05:39.000000000 +0400 -@@ -10,6 +10,7 @@ - - #include <linux/mm.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/module.h> - #include <linux/mempool.h> - #include <linux/blkdev.h> -@@ -72,6 +73,9 @@ mempool_t * mempool_create(int min_nr, m - pool->alloc = alloc_fn; - pool->free = free_fn; - -+ if (alloc_fn == mempool_alloc_slab) -+ kmem_mark_nocharge((kmem_cache_t *)pool_data); -+ - /* - * First pre-allocate the guaranteed number of buffers. 
- */ -@@ -112,6 +116,7 @@ int mempool_resize(mempool_t *pool, int - unsigned long flags; - - BUG_ON(new_min_nr <= 0); -+ gfp_mask &= ~__GFP_UBC; - - spin_lock_irqsave(&pool->lock, flags); - if (new_min_nr < pool->min_nr) { -@@ -194,6 +199,9 @@ void * mempool_alloc(mempool_t *pool, in - DEFINE_WAIT(wait); - int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO); - -+ gfp_mask &= ~__GFP_UBC; -+ gfp_nowait &= ~__GFP_UBC; -+ - repeat_alloc: - element = pool->alloc(gfp_nowait|__GFP_NOWARN, pool->pool_data); - if (likely(element != NULL)) -diff -uprN linux-2.6.8.1.orig/mm/mlock.c linux-2.6.8.1-ve022stab078/mm/mlock.c ---- linux-2.6.8.1.orig/mm/mlock.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mlock.c 2006-05-11 13:05:39.000000000 +0400 -@@ -8,6 +8,8 @@ - #include <linux/mman.h> - #include <linux/mm.h> - -+#include <ub/ub_vmpages.h> -+ - - static int mlock_fixup(struct vm_area_struct * vma, - unsigned long start, unsigned long end, unsigned int newflags) -@@ -19,17 +21,23 @@ static int mlock_fixup(struct vm_area_st - if (newflags == vma->vm_flags) - goto out; - -+ if (newflags & VM_LOCKED) { -+ ret = ub_locked_mem_charge(mm_ub(mm), end - start); -+ if (ret < 0) -+ goto out; -+ } -+ - if (start != vma->vm_start) { - if (split_vma(mm, vma, start, 1)) { - ret = -EAGAIN; -- goto out; -+ goto out_uncharge; - } - } - - if (end != vma->vm_end) { - if (split_vma(mm, vma, end, 0)) { - ret = -EAGAIN; -- goto out; -+ goto out_uncharge; - } - } - -@@ -47,9 +55,17 @@ static int mlock_fixup(struct vm_area_st - if (newflags & VM_LOCKED) { - pages = -pages; - ret = make_pages_present(start, end); -+ } else { -+ /* uncharge this memory, since it was unlocked */ -+ ub_locked_mem_uncharge(mm_ub(mm), end - start); - } - - vma->vm_mm->locked_vm -= pages; -+ return ret; -+ -+out_uncharge: -+ if (newflags & VM_LOCKED) -+ ub_locked_mem_uncharge(mm_ub(mm), end - start); - out: - return ret; - } -diff -uprN linux-2.6.8.1.orig/mm/mmap.c 
linux-2.6.8.1-ve022stab078/mm/mmap.c ---- linux-2.6.8.1.orig/mm/mmap.c 2004-08-14 14:55:35.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mmap.c 2006-05-11 13:05:40.000000000 +0400 -@@ -28,6 +28,8 @@ - #include <asm/cacheflush.h> - #include <asm/tlb.h> - -+#include <ub/ub_vmpages.h> -+ - /* - * WARNING: the debugging will use recursive algorithms so never enable this - * unless you know what you are doing. -@@ -90,6 +92,8 @@ static void remove_vm_struct(struct vm_a - { - struct file *file = vma->vm_file; - -+ ub_memory_uncharge(mm_ub(vma->vm_mm), vma->vm_end - vma->vm_start, -+ vma->vm_flags, vma->vm_file); - if (file) { - struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); -@@ -105,6 +109,7 @@ static void remove_vm_struct(struct vm_a - kmem_cache_free(vm_area_cachep, vma); - } - -+static unsigned long __do_brk(unsigned long, unsigned long, int); - /* - * sys_brk() for the most part doesn't need the global kernel - * lock, except when an application is doing something nasty -@@ -144,7 +149,7 @@ asmlinkage unsigned long sys_brk(unsigne - goto out; - - /* Ok, looks good - let it rip. */ -- if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk) -+ if (__do_brk(oldbrk, newbrk-oldbrk, UB_HARD) != oldbrk) - goto out; - set_brk: - mm->brk = brk; -@@ -607,6 +612,7 @@ struct vm_area_struct *vma_merge(struct - { - pgoff_t pglen = (end - addr) >> PAGE_SHIFT; - struct vm_area_struct *area, *next; -+ unsigned long extra_rss; - - /* - * We later require that vma->vm_flags == vm_flags, -@@ -620,8 +626,12 @@ struct vm_area_struct *vma_merge(struct - else - next = mm->mmap; - area = next; -- if (next && next->vm_end == end) /* cases 6, 7, 8 */ -+ extra_rss = 0; -+ spin_lock(&mm->page_table_lock); -+ if (next && next->vm_end == end) { /* cases 6, 7, 8 */ - next = next->vm_next; -+ extra_rss = area->vm_rss; /* asterix below */ -+ } - - /* - * Can it merge with the predecessor? 
-@@ -640,11 +650,28 @@ struct vm_area_struct *vma_merge(struct - is_mergeable_anon_vma(prev->anon_vma, - next->anon_vma)) { - /* cases 1, 6 */ -+ /* case 1 : prev->vm_rss += next->vm_rss -+ * case 6*: prev->vm_rss += area->vm_rss + next->vm_rss -+ */ -+ prev->vm_rss += next->vm_rss + extra_rss; -+ spin_unlock(&mm->page_table_lock); - vma_adjust(prev, prev->vm_start, - next->vm_end, prev->vm_pgoff, NULL); -- } else /* cases 2, 5, 7 */ -+ } else { /* cases 2, 5, 7 */ -+ /* case 2 : nothing -+ * case 5 : prev->vm_rss += pages_in(addr, end) -+ * next->vm_rss -= pages_in(addr, end) -+ * case 7*: prev->vm_rss += area->vm_rss -+ */ -+ if (next && addr == next->vm_start) { /* case 5 */ -+ extra_rss = pages_in_vma_range(next, addr, end); -+ next->vm_rss -= extra_rss; -+ } -+ prev->vm_rss += extra_rss; -+ spin_unlock(&mm->page_table_lock); - vma_adjust(prev, prev->vm_start, - end, prev->vm_pgoff, NULL); -+ } - return prev; - } - -@@ -655,15 +682,29 @@ struct vm_area_struct *vma_merge(struct - mpol_equal(policy, vma_policy(next)) && - can_vma_merge_before(next, vm_flags, - anon_vma, file, pgoff+pglen)) { -- if (prev && addr < prev->vm_end) /* case 4 */ -+ if (prev && addr < prev->vm_end) { /* case 4 */ -+ /* case 4 : prev->vm_rss -= pages_in(addr, end) -+ * next->vm_rss += pages_in(addr, end) -+ */ -+ extra_rss = pages_in_vma_range(prev, addr, end); -+ prev->vm_rss -= extra_rss; -+ next->vm_rss += extra_rss; -+ spin_unlock(&mm->page_table_lock); - vma_adjust(prev, prev->vm_start, - addr, prev->vm_pgoff, NULL); -- else /* cases 3, 8 */ -+ } else { /* cases 3, 8 */ -+ /* case 3 : nothing -+ * case 8*: next->vm_rss += area->vm_rss -+ */ -+ next->vm_rss += extra_rss; -+ spin_unlock(&mm->page_table_lock); - vma_adjust(area, addr, next->vm_end, - next->vm_pgoff - pglen, NULL); -+ } - return area; - } - -+ spin_unlock(&mm->page_table_lock); - return NULL; - } - -@@ -785,6 +826,12 @@ unsigned long do_mmap_pgoff(struct file - if (mm->map_count > sysctl_max_map_count) - return -ENOMEM; 
- -+ if (file && (prot & PROT_EXEC)) { -+ error = check_area_execute_ve(file->f_dentry, file->f_vfsmnt); -+ if (error) -+ return error; -+ } -+ - /* Obtain the address to map to. we verify (or select) it and ensure - * that it represents a valid section of the address space. - */ -@@ -897,6 +944,11 @@ munmap_back: - } - } - -+ error = -ENOMEM; -+ if (ub_memory_charge(mm_ub(mm), len, vm_flags, file, -+ (flags & MAP_EXECPRIO ? UB_SOFT : UB_HARD))) -+ goto uncharge_error; -+ - /* - * Can we just expand an old private anonymous mapping? - * The VM_SHARED test is necessary because shmem_zero_setup -@@ -912,7 +964,8 @@ munmap_back: - * specific mapper. the address has already been validated, but - * not unmapped, but the maps are removed from the list. - */ -- vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); -+ vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL | -+ (flags & MAP_EXECPRIO ? __GFP_SOFT_UBC : 0)); - if (!vma) { - error = -ENOMEM; - goto unacct_error; -@@ -923,6 +976,7 @@ munmap_back: - vma->vm_start = addr; - vma->vm_end = addr + len; - vma->vm_flags = vm_flags; -+ vma->vm_rss = 0; - vma->vm_page_prot = protection_map[vm_flags & 0x0f]; - vma->vm_pgoff = pgoff; - -@@ -1001,6 +1055,8 @@ unmap_and_free_vma: - free_vma: - kmem_cache_free(vm_area_cachep, vma); - unacct_error: -+ ub_memory_uncharge(mm_ub(mm), len, vm_flags, file); -+uncharge_error: - if (charged) - vm_unacct_memory(charged); - return error; -@@ -1210,15 +1266,28 @@ int expand_stack(struct vm_area_struct * - address &= PAGE_MASK; - grow = (address - vma->vm_end) >> PAGE_SHIFT; - -+ /* Somebody else might have raced and expanded it already */ -+ if (address <= vma->vm_end) -+ goto raced; -+ - /* Overcommit.. 
*/ - if (security_vm_enough_memory(grow)) { - anon_vma_unlock(vma); - return -ENOMEM; - } - -+ if ((vma->vm_flags & VM_LOCKED) && -+ ((vma->vm_mm->locked_vm + grow) << PAGE_SHIFT) > -+ current->rlim[RLIMIT_MEMLOCK].rlim_cur) -+ goto nomem; -+ - if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur || - ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > -- current->rlim[RLIMIT_AS].rlim_cur) { -+ current->rlim[RLIMIT_AS].rlim_cur || -+ ub_memory_charge(mm_ub(vma->vm_mm), -+ address - vma->vm_end, -+ vma->vm_flags, vma->vm_file, UB_SOFT)) { -+nomem: - anon_vma_unlock(vma); - vm_unacct_memory(grow); - return -ENOMEM; -@@ -1227,6 +1296,7 @@ int expand_stack(struct vm_area_struct * - vma->vm_mm->total_vm += grow; - if (vma->vm_flags & VM_LOCKED) - vma->vm_mm->locked_vm += grow; -+raced: - anon_vma_unlock(vma); - return 0; - } -@@ -1271,15 +1341,28 @@ int expand_stack(struct vm_area_struct * - address &= PAGE_MASK; - grow = (vma->vm_start - address) >> PAGE_SHIFT; - -+ /* Somebody else might have raced and expanded it already */ -+ if (address >= vma->vm_start) -+ goto raced; -+ - /* Overcommit.. 
*/ - if (security_vm_enough_memory(grow)) { - anon_vma_unlock(vma); - return -ENOMEM; - } - -+ if ((vma->vm_flags & VM_LOCKED) && -+ ((vma->vm_mm->locked_vm + grow) << PAGE_SHIFT) > -+ current->rlim[RLIMIT_MEMLOCK].rlim_cur) -+ goto nomem; -+ - if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur || - ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > -- current->rlim[RLIMIT_AS].rlim_cur) { -+ current->rlim[RLIMIT_AS].rlim_cur || -+ ub_memory_charge(mm_ub(vma->vm_mm), -+ vma->vm_start - address, -+ vma->vm_flags, vma->vm_file, UB_SOFT)) { -+nomem: - anon_vma_unlock(vma); - vm_unacct_memory(grow); - return -ENOMEM; -@@ -1289,6 +1372,7 @@ int expand_stack(struct vm_area_struct * - vma->vm_mm->total_vm += grow; - if (vma->vm_flags & VM_LOCKED) - vma->vm_mm->locked_vm += grow; -+raced: - anon_vma_unlock(vma); - return 0; - } -@@ -1517,6 +1601,11 @@ int split_vma(struct mm_struct * mm, str - else - vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new); - -+ spin_lock(&mm->page_table_lock); -+ new->vm_rss = pages_in_vma(new); -+ vma->vm_rss = pages_in_vma(vma); -+ spin_unlock(&mm->page_table_lock); -+ - return 0; - } - -@@ -1611,7 +1700,7 @@ asmlinkage long sys_munmap(unsigned long - * anonymous maps. eventually we may be able to do some - * brk-specific accounting here. - */ --unsigned long do_brk(unsigned long addr, unsigned long len) -+static unsigned long __do_brk(unsigned long addr, unsigned long len, int lowpri) - { - struct mm_struct * mm = current->mm; - struct vm_area_struct * vma, * prev; -@@ -1637,6 +1726,12 @@ unsigned long do_brk(unsigned long addr, - } - - /* -+ * mm->mmap_sem is required to protect against another thread -+ * changing the mappings in case we sleep. -+ */ -+ WARN_ON(down_read_trylock(&mm->mmap_sem)); -+ -+ /* - * Clear old maps. 
this also does some error checking for us - */ - munmap_back: -@@ -1660,6 +1755,10 @@ unsigned long do_brk(unsigned long addr, - - flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; - -+ if (ub_memory_charge(mm_ub(mm), len, flags, NULL, lowpri)) -+ goto out_unacct; -+ -+ - /* Can we just expand an old private anonymous mapping? */ - if (vma_merge(mm, prev, addr, addr + len, flags, - NULL, NULL, pgoff, NULL)) -@@ -1668,8 +1767,11 @@ unsigned long do_brk(unsigned long addr, - /* - * create a vma struct for an anonymous mapping - */ -- vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); -+ vma = kmem_cache_alloc(vm_area_cachep, -+ SLAB_KERNEL | (lowpri ? 0 : __GFP_SOFT_UBC)); - if (!vma) { -+ ub_memory_uncharge(mm_ub(mm), len, flags, NULL); -+out_unacct: - vm_unacct_memory(len >> PAGE_SHIFT); - return -ENOMEM; - } -@@ -1680,6 +1782,7 @@ unsigned long do_brk(unsigned long addr, - vma->vm_end = addr + len; - vma->vm_pgoff = pgoff; - vma->vm_flags = flags; -+ vma->vm_rss = 0; - vma->vm_page_prot = protection_map[flags & 0x0f]; - vma_link(mm, vma, prev, rb_link, rb_parent); - out: -@@ -1691,6 +1794,11 @@ out: - return addr; - } - -+unsigned long do_brk(unsigned long addr, unsigned long len) -+{ -+ return __do_brk(addr, len, UB_SOFT); -+} -+ - EXPORT_SYMBOL(do_brk); - - /* Release all mmaps. */ -@@ -1740,7 +1848,7 @@ void exit_mmap(struct mm_struct *mm) - * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. 
- */ --void insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) -+int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) - { - struct vm_area_struct * __vma, * prev; - struct rb_node ** rb_link, * rb_parent; -@@ -1763,8 +1871,9 @@ void insert_vm_struct(struct mm_struct * - } - __vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent); - if (__vma && __vma->vm_start < vma->vm_end) -- BUG(); -+ return -ENOMEM; - vma_link(mm, vma, prev, rb_link, rb_parent); -+ return 0; - } - - /* -@@ -1812,6 +1921,7 @@ struct vm_area_struct *copy_vma(struct v - new_vma->vm_start = addr; - new_vma->vm_end = addr + len; - new_vma->vm_pgoff = pgoff; -+ new_vma->vm_rss = 0; - if (new_vma->vm_file) - get_file(new_vma->vm_file); - if (new_vma->vm_ops && new_vma->vm_ops->open) -diff -uprN linux-2.6.8.1.orig/mm/mprotect.c linux-2.6.8.1-ve022stab078/mm/mprotect.c ---- linux-2.6.8.1.orig/mm/mprotect.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mprotect.c 2006-05-11 13:05:39.000000000 +0400 -@@ -24,6 +24,8 @@ - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_vmpages.h> -+ - static inline void - change_pte_range(pmd_t *pmd, unsigned long address, - unsigned long size, pgprot_t newprot) -@@ -51,8 +53,9 @@ change_pte_range(pmd_t *pmd, unsigned lo - * bits by wiping the pte and then setting the new pte - * into place. 
- */ -- entry = ptep_get_and_clear(pte); -- set_pte(pte, pte_modify(entry, newprot)); -+ entry = pte_modify(ptep_get_and_clear(pte), newprot); -+ set_pte(pte, entry); -+ lazy_mmu_prot_update(entry); - } - address += PAGE_SIZE; - pte++; -@@ -114,6 +117,8 @@ mprotect_fixup(struct vm_area_struct *vm - { - struct mm_struct * mm = vma->vm_mm; - unsigned long charged = 0; -+ unsigned long vma_rss; -+ int prot_dir; - pgprot_t newprot; - pgoff_t pgoff; - int error; -@@ -123,6 +128,17 @@ mprotect_fixup(struct vm_area_struct *vm - return 0; - } - -+ spin_lock(&mm->page_table_lock); -+ vma_rss = pages_in_vma_range(vma, start, end); -+ spin_unlock(&mm->page_table_lock); -+ charged = ((end - start) >> PAGE_SHIFT); -+ -+ prot_dir = ub_protected_charge(mm_ub(mm), charged - vma_rss, -+ newflags, vma); -+ error = -ENOMEM; -+ if (prot_dir == PRIVVM_ERROR) -+ goto fail_nocharge; -+ - /* - * If we make a private mapping writable we increase our commit; - * but (without finer accounting) cannot reduce our commit if we -@@ -133,9 +149,8 @@ mprotect_fixup(struct vm_area_struct *vm - */ - if (newflags & VM_WRITE) { - if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED|VM_HUGETLB))) { -- charged = (end - start) >> PAGE_SHIFT; - if (security_vm_enough_memory(charged)) -- return -ENOMEM; -+ goto fail_noacct; - newflags |= VM_ACCOUNT; - } - } -@@ -178,10 +193,16 @@ success: - vma->vm_flags = newflags; - vma->vm_page_prot = newprot; - change_protection(vma, start, end, newprot); -+ if (prot_dir == PRIVVM_TO_SHARED) -+ __ub_unused_privvm_dec(mm_ub(mm), charged - vma_rss); - return 0; - - fail: - vm_unacct_memory(charged); -+fail_noacct: -+ if (prot_dir == PRIVVM_TO_PRIVATE) -+ __ub_unused_privvm_dec(mm_ub(mm), charged - vma_rss); -+fail_nocharge: - return error; - } - -diff -uprN linux-2.6.8.1.orig/mm/mremap.c linux-2.6.8.1-ve022stab078/mm/mremap.c ---- linux-2.6.8.1.orig/mm/mremap.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/mremap.c 2006-05-11 13:05:39.000000000 
+0400 -@@ -21,6 +21,8 @@ - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_vmpages.h> -+ - static pte_t *get_one_pte_map_nested(struct mm_struct *mm, unsigned long addr) - { - pgd_t *pgd; -@@ -81,6 +83,7 @@ static inline pte_t *alloc_one_pte_map(s - - static int - move_one_page(struct vm_area_struct *vma, unsigned long old_addr, -+ struct vm_area_struct *new_vma, - unsigned long new_addr) - { - struct address_space *mapping = NULL; -@@ -129,6 +132,8 @@ move_one_page(struct vm_area_struct *vma - pte_t pte; - pte = ptep_clear_flush(vma, old_addr, src); - set_pte(dst, pte); -+ vma->vm_rss--; -+ new_vma->vm_rss++; - } else - error = -ENOMEM; - pte_unmap_nested(src); -@@ -143,6 +148,7 @@ move_one_page(struct vm_area_struct *vma - } - - static unsigned long move_page_tables(struct vm_area_struct *vma, -+ struct vm_area_struct *new_vma, - unsigned long new_addr, unsigned long old_addr, - unsigned long len) - { -@@ -156,7 +162,8 @@ static unsigned long move_page_tables(st - * only a few pages.. This also makes error recovery easier. - */ - for (offset = 0; offset < len; offset += PAGE_SIZE) { -- if (move_one_page(vma, old_addr+offset, new_addr+offset) < 0) -+ if (move_one_page(vma, old_addr+offset, -+ new_vma, new_addr+offset) < 0) - break; - cond_resched(); - } -@@ -175,26 +182,29 @@ static unsigned long move_vma(struct vm_ - unsigned long excess = 0; - int split = 0; - -+ if (ub_memory_charge(mm_ub(mm), new_len, vma->vm_flags, -+ vma->vm_file, UB_HARD)) -+ return -ENOMEM; - /* - * We'd prefer to avoid failure later on in do_munmap: - * which may split one vma into three before unmapping. 
- */ - if (mm->map_count >= sysctl_max_map_count - 3) -- return -ENOMEM; -+ goto out_nomem; - - new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT); - new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff); - if (!new_vma) -- return -ENOMEM; -+ goto out_nomem; - -- moved_len = move_page_tables(vma, new_addr, old_addr, old_len); -+ moved_len = move_page_tables(vma, new_vma, new_addr, old_addr, old_len); - if (moved_len < old_len) { - /* - * On error, move entries back from new area to old, - * which will succeed since page tables still there, - * and then proceed to unmap new area instead of old. - */ -- move_page_tables(new_vma, old_addr, new_addr, moved_len); -+ move_page_tables(new_vma, vma, old_addr, new_addr, moved_len); - vma = new_vma; - old_len = new_len; - old_addr = new_addr; -@@ -231,7 +241,12 @@ static unsigned long move_vma(struct vm_ - new_addr + new_len); - } - -- return new_addr; -+ if (new_addr != -ENOMEM) -+ return new_addr; -+ -+out_nomem: -+ ub_memory_uncharge(mm_ub(mm), new_len, vma->vm_flags, vma->vm_file); -+ return -ENOMEM; - } - - /* -@@ -354,6 +369,12 @@ unsigned long do_mremap(unsigned long ad - if (max_addr - addr >= new_len) { - int pages = (new_len - old_len) >> PAGE_SHIFT; - -+ ret = ub_memory_charge(mm_ub(vma->vm_mm), -+ new_len - old_len, vma->vm_flags, -+ vma->vm_file, UB_HARD); -+ if (ret != 0) -+ goto out; -+ - vma_adjust(vma, vma->vm_start, - addr + new_len, vma->vm_pgoff, NULL); - -diff -uprN linux-2.6.8.1.orig/mm/oom_kill.c linux-2.6.8.1-ve022stab078/mm/oom_kill.c ---- linux-2.6.8.1.orig/mm/oom_kill.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/oom_kill.c 2006-05-11 13:05:48.000000000 +0400 -@@ -15,12 +15,22 @@ - * kernel subsystems and hints as to where to find out what things do. 
- */ - -+#include <linux/bitops.h> - #include <linux/mm.h> - #include <linux/sched.h> -+#include <linux/virtinfo.h> -+#include <linux/module.h> - #include <linux/swap.h> - #include <linux/timex.h> - #include <linux/jiffies.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> -+ -+spinlock_t oom_generation_lock = SPIN_LOCK_UNLOCKED; -+int oom_kill_counter; -+int oom_generation; -+ - /* #define DEBUG */ - - /** -@@ -106,23 +116,47 @@ static int badness(struct task_struct *p - * - * (not docbooked, we don't want this one cluttering up the manual) - */ --static struct task_struct * select_bad_process(void) -+static struct task_struct * select_bad_process(struct user_beancounter *ub) - { -+ int points; - int maxpoints = 0; - struct task_struct *g, *p; - struct task_struct *chosen = NULL; -+ struct user_beancounter *mub; -+ -+ do_each_thread_all(g, p) { -+ if (!p->pid) -+ continue; -+ if (!p->mm) -+ continue; -+ -+#if 0 -+ /* -+ * swapoff check. -+ * Pro: do not let opportunistic swapoff kill the whole system; -+ * if the system enter OOM state, better stop swapoff. -+ * Contra: essential services must survive without swap -+ * (otherwise, the system is grossly misconfigured), -+ * and disabling swapoff completely, with cryptic diagnostic -+ * "interrupted system call", looks like a bad idea. 
-+ * 2006/02/28 SAW -+ */ -+ if (!(p->flags & PF_MEMDIE) && (p->flags & PF_SWAPOFF)) -+ return p; -+#endif - -- do_each_thread(g, p) -- if (p->pid) { -- int points = badness(p); -- if (points > maxpoints) { -- chosen = p; -- maxpoints = points; -- } -- if (p->flags & PF_SWAPOFF) -- return p; -+ for (mub = mm_ub(p->mm); mub != NULL; mub = mub->parent) -+ if (mub == ub) -+ break; -+ if (mub != ub) /* wrong beancounter */ -+ continue; -+ -+ points = badness(p); -+ if (points > maxpoints) { -+ chosen = p; -+ maxpoints = points; - } -- while_each_thread(g, p); -+ } while_each_thread_all(g, p); - return chosen; - } - -@@ -141,7 +175,8 @@ static void __oom_kill_task(task_t *p) - return; - } - task_unlock(p); -- printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm); -+ printk(KERN_ERR "Out of Memory: Killing process %d (%.20s), flags=%lx, " -+ "mm=%p.\n", p->pid, p->comm, p->flags, p->mm); - - /* - * We give our sacrificial lamb high priority and access to -@@ -149,7 +184,10 @@ static void __oom_kill_task(task_t *p) - * exit() and clear out its resources quickly... - */ - p->time_slice = HZ; -- p->flags |= PF_MEMALLOC | PF_MEMDIE; -+ /* flag should be set atomically since p != current */ -+ set_bit(generic_ffs(PF_MEMDIE) - 1, &p->flags); -+ /* oom_generation_lock must be held */ -+ oom_kill_counter++; - - /* This process has hardware access, be more careful. 
*/ - if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) { -@@ -159,53 +197,55 @@ static void __oom_kill_task(task_t *p) - } - } - --static struct mm_struct *oom_kill_task(task_t *p) --{ -- struct mm_struct *mm = get_task_mm(p); -- if (!mm || mm == &init_mm) -- return NULL; -- __oom_kill_task(p); -- return mm; --} -- -- - /** -- * oom_kill - kill the "best" process when we run out of memory -+ * oom_kill - do a complete job of killing a process - * -- * If we run out of memory, we have the choice between either -- * killing a random task (bad), letting the system crash (worse) -- * OR try to be smart about which process to kill. Note that we -- * don't have to be perfect here, we just have to be good. -+ * Returns TRUE if selected process is unkillable. -+ * Called with oom_generation_lock and tasklist_lock held, drops them. - */ --static void oom_kill(void) -+static int oom_kill(struct task_struct *p, -+ struct user_beancounter *ub, long ub_maxover) - { - struct mm_struct *mm; -- struct task_struct *g, *p, *q; -- -- read_lock(&tasklist_lock); --retry: -- p = select_bad_process(); -- -- /* Found nothing?!?! Either we hang forever, or we panic. */ -- if (!p) { -- show_free_areas(); -- panic("Out of memory and no killable processes...\n"); -+ struct task_struct *g, *q; -+ uid_t ub_uid; -+ int suicide; -+ -+ mm = get_task_mm(p); -+ if (mm == &init_mm) { -+ mmput(mm); -+ mm = NULL; - } -+ if (mm == NULL) -+ return -1; -+ -+ /* -+ * The following message showing mm, its size, and free space -+ * should be printed regardless of CONFIG_USER_RESOURCE. -+ */ -+ ub_uid = (ub ? ub->ub_uid : -1); -+ printk(KERN_INFO"MM to kill %p (UB=%d, UBover=%ld, VM=%lu, free=%u).\n", -+ mm, ub_uid, ub_maxover, -+ mm->total_vm, nr_free_pages()); - -- mm = oom_kill_task(p); -- if (!mm) -- goto retry; - /* - * kill all processes that share the ->mm (i.e. 
all threads), - * but are in a different thread group - */ -- do_each_thread(g, q) -- if (q->mm == mm && q->tgid != p->tgid) -+ suicide = 0; -+ __oom_kill_task(p); -+ if (p == current) -+ suicide = 1; -+ do_each_thread_all(g, q) { -+ if (q->mm == mm && q->tgid != p->tgid) { - __oom_kill_task(q); -- while_each_thread(g, q); -- if (!p->mm) -- printk(KERN_INFO "Fixed up OOM kill of mm-less task\n"); -+ if (q == current) -+ suicide = 1; -+ } -+ } while_each_thread_all(g, q); - read_unlock(&tasklist_lock); -+ spin_unlock(&oom_generation_lock); -+ ub_oomkill_task(mm, ub, ub_maxover); /* nonblocking but long */ - mmput(mm); - - /* -@@ -213,81 +253,132 @@ retry: - * killing itself before someone else gets the chance to ask - * for more memory. - */ -- yield(); -- return; -+ if (!suicide) -+ yield(); -+ -+ return 0; - } - - /** -- * out_of_memory - is the system out of memory? -+ * oom_select_and_kill - kill the "best" process when we run out of memory -+ * -+ * If we run out of memory, we have the choice between either -+ * killing a random task (bad), letting the system crash (worse) -+ * OR try to be smart about which process to kill. Note that we -+ * don't have to be perfect here, we just have to be good. -+ * -+ * Called with oom_generation_lock held, drops it. - */ --void out_of_memory(int gfp_mask) -+static void oom_select_and_kill(void) - { -- /* -- * oom_lock protects out_of_memory()'s static variables. -- * It's a global lock; this is not performance-critical. -- */ -- static spinlock_t oom_lock = SPIN_LOCK_UNLOCKED; -- static unsigned long first, last, count, lastkill; -- unsigned long now, since; -- -- spin_lock(&oom_lock); -- now = jiffies; -- since = now - last; -- last = now; -+ struct user_beancounter *ub; -+ struct task_struct *p; -+ long ub_maxover; -+ int r; - -- /* -- * If it's been a long time since last failure, -- * we're not oom. 
-- */ -- if (since > 5*HZ) -- goto reset; -+ ub_clear_oom(); - -- /* -- * If we haven't tried for at least one second, -- * we're not really oom. -- */ -- since = now - first; -- if (since < HZ) -- goto out_unlock; -+ read_lock(&tasklist_lock); -+retry: -+ ub = ub_select_worst(&ub_maxover); -+ p = select_bad_process(ub); - -- /* -- * If we have gotten only a few failures, -- * we're not really oom. -- */ -- if (++count < 10) -- goto out_unlock; -+ /* Found nothing?!?! Either we hang forever, or we panic. */ -+ if (!p) { -+ if (!ub) { -+ show_free_areas(); -+ panic("Out of memory and no killable processes...\n"); -+ } - -- /* -- * If we just killed a process, wait a while -- * to give that task a chance to exit. This -- * avoids killing multiple processes needlessly. -- */ -- since = now - lastkill; -- if (since < HZ*5) -- goto out_unlock; -+ goto retry; -+ } - -- /* -- * Ok, really out of memory. Kill something. -- */ -- lastkill = now; -+ r = oom_kill(p, ub, ub_maxover); -+ put_beancounter(ub); -+ if (r) -+ goto retry; -+} - -- printk("oom-killer: gfp_mask=0x%x\n", gfp_mask); -- show_free_areas(); -+void oom_select_and_kill_sc(struct user_beancounter *scope) -+{ -+ struct user_beancounter *ub; -+ struct task_struct *p; - -- /* oom_kill() sleeps */ -- spin_unlock(&oom_lock); -- oom_kill(); -- spin_lock(&oom_lock); -+ ub_clear_oom(); -+ ub = get_beancounter(scope); - --reset: -- /* -- * We dropped the lock above, so check to be sure the variable -- * first only ever increases to prevent false OOM's. 
-- */ -- if (time_after(now, first)) -- first = now; -- count = 0; -+ read_lock(&tasklist_lock); -+retry: -+ p = select_bad_process(ub); -+ if (!p) { -+ read_unlock(&tasklist_lock); -+ return; -+ } -+ -+ if (oom_kill(p, ub, 0)) -+ goto retry; -+ -+ put_beancounter(ub); -+} -+ -+static void do_out_of_memory(struct oom_freeing_stat *stat) -+{ -+ spin_lock(&oom_generation_lock); -+ if (oom_generation != stat->oom_generation) { -+ /* OOM-killed process has exited */ -+ spin_unlock(&oom_generation_lock); -+ return; -+ } -+ if (oom_kill_counter) { -+ /* OOM in progress */ -+ spin_unlock(&oom_generation_lock); -+ __set_current_state(TASK_UNINTERRUPTIBLE); -+ schedule_timeout(5 * HZ); -+ -+ spin_lock(&oom_generation_lock); -+ if (oom_generation != stat->oom_generation) { -+ spin_unlock(&oom_generation_lock); -+ return; -+ } -+ /* -+ * Some process is stuck exiting. -+ * No choice other than to kill something else. -+ */ -+ oom_kill_counter = 0; -+ } -+ oom_select_and_kill(); -+} -+ -+void do_out_of_memory_sc(struct user_beancounter *ub) -+{ -+ spin_lock(&oom_generation_lock); -+ oom_select_and_kill_sc(ub); -+} -+EXPORT_SYMBOL(do_out_of_memory_sc); -+ -+/** -+ * out_of_memory - is the system out of memory? -+ */ -+void out_of_memory(struct oom_freeing_stat *stat, int gfp_mask) -+{ -+ if (nr_swap_pages > 0) { -+ /* some pages have been freed */ -+ if (stat->freed) -+ return; -+ /* some IO was started */ -+ if (stat->written) -+ return; -+ /* some pages have been swapped out, ref. 
counter removed */ -+ if (stat->swapped) -+ return; -+ /* some slabs were shrinked */ -+ if (stat->slabs) -+ return; -+ } -+ -+ if (virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_OUTOFMEM, stat) -+ & (NOTIFY_OK | NOTIFY_FAIL)) -+ return; - --out_unlock: -- spin_unlock(&oom_lock); -+ do_out_of_memory(stat); - } -diff -uprN linux-2.6.8.1.orig/mm/page_alloc.c linux-2.6.8.1-ve022stab078/mm/page_alloc.c ---- linux-2.6.8.1.orig/mm/page_alloc.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/page_alloc.c 2006-05-11 13:05:44.000000000 +0400 -@@ -31,9 +31,12 @@ - #include <linux/topology.h> - #include <linux/sysctl.h> - #include <linux/cpu.h> -+#include <linux/kernel_stat.h> - - #include <asm/tlbflush.h> - -+#include <ub/ub_mem.h> -+ - DECLARE_BITMAP(node_online_map, MAX_NUMNODES); - struct pglist_data *pgdat_list; - unsigned long totalram_pages; -@@ -41,7 +44,9 @@ unsigned long totalhigh_pages; - long nr_swap_pages; - int numnodes = 1; - int sysctl_lower_zone_protection = 0; -+int alloc_fail_warn = 0; - -+EXPORT_SYMBOL(pgdat_list); - EXPORT_SYMBOL(totalram_pages); - EXPORT_SYMBOL(nr_swap_pages); - -@@ -284,6 +289,7 @@ void __free_pages_ok(struct page *page, - free_pages_check(__FUNCTION__, page + i); - list_add(&page->lru, &list); - kernel_map_pages(page, 1<<order, 0); -+ ub_page_uncharge(page, order); - free_pages_bulk(page_zone(page), 1, &list, order); - } - -@@ -513,6 +519,7 @@ static void fastcall free_hot_cold_page( - inc_page_state(pgfree); - free_pages_check(__FUNCTION__, page); - pcp = &zone->pageset[get_cpu()].pcp[cold]; -+ ub_page_uncharge(page, 0); - local_irq_save(flags); - if (pcp->count >= pcp->high) - pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0); -@@ -578,6 +585,26 @@ buffered_rmqueue(struct zone *zone, int - return page; - } - -+static void __alloc_collect_stats(unsigned int gfp_mask, -+ unsigned int order, struct page *page, cycles_t time) -+{ -+ int ind; -+ unsigned long flags; -+ -+ time = get_cycles() - time; -+ 
if (!(gfp_mask & __GFP_WAIT)) -+ ind = 0; -+ else if (!(gfp_mask & __GFP_HIGHMEM)) -+ ind = (order > 0 ? 2 : 1); -+ else -+ ind = (order > 0 ? 4 : 3); -+ spin_lock_irqsave(&kstat_glb_lock, flags); -+ KSTAT_LAT_ADD(&kstat_glob.alloc_lat[ind], time); -+ if (!page) -+ kstat_glob.alloc_fails[ind]++; -+ spin_unlock_irqrestore(&kstat_glb_lock, flags); -+} -+ - /* - * This is the 'heart' of the zoned buddy allocator. - * -@@ -607,6 +634,7 @@ __alloc_pages(unsigned int gfp_mask, uns - int i; - int alloc_type; - int do_retry; -+ cycles_t start_time; - - might_sleep_if(wait); - -@@ -614,6 +642,7 @@ __alloc_pages(unsigned int gfp_mask, uns - if (zones[0] == NULL) /* no zones in the zonelist */ - return NULL; - -+ start_time = get_cycles(); - alloc_type = zone_idx(zones[0]); - - /* Go through the zonelist once, looking for a zone with enough free */ -@@ -678,6 +707,10 @@ rebalance: - goto got_pg; - } - } -+ if (gfp_mask & __GFP_NOFAIL) { -+ blk_congestion_wait(WRITE, HZ/50); -+ goto rebalance; -+ } - goto nopage; - } - -@@ -730,15 +763,24 @@ rebalance: - } - - nopage: -- if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { -+ if (alloc_fail_warn && !(gfp_mask & __GFP_NOWARN) -+ && printk_ratelimit()) { - printk(KERN_WARNING "%s: page allocation failure." 
- " order:%d, mode:0x%x\n", - p->comm, order, gfp_mask); - dump_stack(); - } -+ __alloc_collect_stats(gfp_mask, order, NULL, start_time); - return NULL; - got_pg: - kernel_map_pages(page, 1 << order, 1); -+ __alloc_collect_stats(gfp_mask, order, page, start_time); -+ -+ if (ub_page_charge(page, order, gfp_mask)) { -+ __free_pages(page, order); -+ page = NULL; -+ } -+ - return page; - } - -@@ -887,6 +929,17 @@ unsigned int nr_free_highpages (void) - } - #endif - -+unsigned int nr_free_lowpages (void) -+{ -+ pg_data_t *pgdat; -+ unsigned int pages = 0; -+ -+ for_each_pgdat(pgdat) -+ pages += pgdat->node_zones[ZONE_NORMAL].free_pages; -+ -+ return pages; -+} -+ - #ifdef CONFIG_NUMA - static void show_node(struct zone *zone) - { -@@ -1710,7 +1763,10 @@ static void *vmstat_start(struct seq_fil - m->private = ps; - if (!ps) - return ERR_PTR(-ENOMEM); -- get_full_page_state(ps); -+ if (ve_is_super(get_exec_env())) -+ get_full_page_state(ps); -+ else -+ memset(ps, 0, sizeof(*ps)); - ps->pgpgin /= 2; /* sectors -> kbytes */ - ps->pgpgout /= 2; - return (unsigned long *)ps + *pos; -diff -uprN linux-2.6.8.1.orig/mm/pdflush.c linux-2.6.8.1-ve022stab078/mm/pdflush.c ---- linux-2.6.8.1.orig/mm/pdflush.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/pdflush.c 2006-05-11 13:05:25.000000000 +0400 -@@ -106,8 +106,8 @@ static int __pdflush(struct pdflush_work - spin_unlock_irq(&pdflush_lock); - - schedule(); -- if (current->flags & PF_FREEZE) { -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) { -+ refrigerator(); - spin_lock_irq(&pdflush_lock); - continue; - } -diff -uprN linux-2.6.8.1.orig/mm/prio_tree.c linux-2.6.8.1-ve022stab078/mm/prio_tree.c ---- linux-2.6.8.1.orig/mm/prio_tree.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/prio_tree.c 2006-05-11 13:05:38.000000000 +0400 -@@ -81,6 +81,8 @@ static inline unsigned long prio_tree_ma - return index_bits_to_maxindex[bits - 1]; - } - -+static void 
prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *); -+ - /* - * Extend a priority search tree so that it can store a node with heap_index - * max_heap_index. In the worst case, this algorithm takes O((log n)^2). -@@ -90,8 +92,6 @@ static inline unsigned long prio_tree_ma - static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root, - struct prio_tree_node *node, unsigned long max_heap_index) - { -- static void prio_tree_remove(struct prio_tree_root *, -- struct prio_tree_node *); - struct prio_tree_node *first = NULL, *prev, *last = NULL; - - if (max_heap_index > prio_tree_maxindex(root->index_bits)) -@@ -245,7 +245,7 @@ static struct prio_tree_node *prio_tree_ - mask >>= 1; - - if (!mask) { -- mask = 1UL << (root->index_bits - 1); -+ mask = 1UL << (BITS_PER_LONG - 1); - size_flag = 1; - } - } -@@ -336,7 +336,7 @@ static struct prio_tree_node *prio_tree_ - iter->mask = ULONG_MAX; - } else { - iter->size_level = 1; -- iter->mask = 1UL << (root->index_bits - 1); -+ iter->mask = 1UL << (BITS_PER_LONG - 1); - } - } - return iter->cur; -@@ -380,7 +380,7 @@ static struct prio_tree_node *prio_tree_ - iter->mask = ULONG_MAX; - } else { - iter->size_level = 1; -- iter->mask = 1UL << (root->index_bits - 1); -+ iter->mask = 1UL << (BITS_PER_LONG - 1); - } - } - return iter->cur; -diff -uprN linux-2.6.8.1.orig/mm/rmap.c linux-2.6.8.1-ve022stab078/mm/rmap.c ---- linux-2.6.8.1.orig/mm/rmap.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/rmap.c 2006-05-11 13:05:39.000000000 +0400 -@@ -33,6 +33,8 @@ - - #include <asm/tlbflush.h> - -+#include <ub/ub_vmpages.h> -+ - //#define RMAP_DEBUG /* can be enabled only for debugging */ - - kmem_cache_t *anon_vma_cachep; -@@ -160,7 +162,8 @@ static void anon_vma_ctor(void *data, km - void __init anon_vma_init(void) - { - anon_vma_cachep = kmem_cache_create("anon_vma", -- sizeof(struct anon_vma), 0, SLAB_PANIC, anon_vma_ctor, NULL); -+ sizeof(struct anon_vma), 0, SLAB_PANIC | SLAB_UBC, -+ 
anon_vma_ctor, NULL); - } - - /* this needs the page->flags PG_maplock held */ -@@ -369,8 +372,8 @@ void page_add_anon_rmap(struct page *pag - inc_page_state(nr_mapped); - } else { - BUG_ON(!PageAnon(page)); -- BUG_ON(page->index != index); -- BUG_ON(page->mapping != (struct address_space *) anon_vma); -+ WARN_ON(page->index != index); -+ WARN_ON(page->mapping != (struct address_space *) anon_vma); - } - page->mapcount++; - page_map_unlock(page); -@@ -513,6 +516,10 @@ static int try_to_unmap_one(struct page - } - - mm->rss--; -+ vma->vm_rss--; -+ mm_ub(mm)->ub_perfstat[smp_processor_id()].unmap++; -+ ub_unused_privvm_inc(mm_ub(mm), 1, vma); -+ pb_remove_ref(page, mm_ub(mm)); - BUG_ON(!page->mapcount); - page->mapcount--; - page_cache_release(page); -@@ -553,12 +560,13 @@ static int try_to_unmap_cluster(unsigned - struct mm_struct *mm = vma->vm_mm; - pgd_t *pgd; - pmd_t *pmd; -- pte_t *pte; -+ pte_t *pte, *original_pte; - pte_t pteval; - struct page *page; - unsigned long address; - unsigned long end; - unsigned long pfn; -+ unsigned long old_rss; - - /* - * We need the page_table_lock to protect us from page faults, -@@ -582,7 +590,8 @@ static int try_to_unmap_cluster(unsigned - if (!pmd_present(*pmd)) - goto out_unlock; - -- for (pte = pte_offset_map(pmd, address); -+ old_rss = mm->rss; -+ for (original_pte = pte = pte_offset_map(pmd, address); - address < end; pte++, address += PAGE_SIZE) { - - if (!pte_present(*pte)) -@@ -613,12 +622,17 @@ static int try_to_unmap_cluster(unsigned - set_page_dirty(page); - - page_remove_rmap(page); -- page_cache_release(page); - mm->rss--; -+ vma->vm_rss--; -+ mm_ub(mm)->ub_perfstat[smp_processor_id()].unmap++; -+ pb_remove_ref(page, mm_ub(mm)); -+ page_cache_release(page); - (*mapcount)--; - } -+ if (old_rss > mm->rss) -+ ub_unused_privvm_inc(mm_ub(mm), old_rss - mm->rss, vma); - -- pte_unmap(pte); -+ pte_unmap(original_pte); - - out_unlock: - spin_unlock(&mm->page_table_lock); -diff -uprN linux-2.6.8.1.orig/mm/shmem.c 
linux-2.6.8.1-ve022stab078/mm/shmem.c ---- linux-2.6.8.1.orig/mm/shmem.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/shmem.c 2006-05-11 13:05:42.000000000 +0400 -@@ -45,6 +45,9 @@ - #include <asm/div64.h> - #include <asm/pgtable.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_vmpages.h> -+ - /* This magic number is used in glibc for posix shared memory */ - #define TMPFS_MAGIC 0x01021994 - -@@ -204,7 +207,7 @@ static void shmem_free_block(struct inod - * - * It has to be called with the spinlock held. - */ --static void shmem_recalc_inode(struct inode *inode) -+static void shmem_recalc_inode(struct inode *inode, unsigned long swp_freed) - { - struct shmem_inode_info *info = SHMEM_I(inode); - long freed; -@@ -217,6 +220,9 @@ static void shmem_recalc_inode(struct in - sbinfo->free_blocks += freed; - inode->i_blocks -= freed*BLOCKS_PER_PAGE; - spin_unlock(&sbinfo->stat_lock); -+ if (freed > swp_freed) -+ ub_tmpfs_respages_dec(shm_info_ub(info), -+ freed - swp_freed); - shmem_unacct_blocks(info->flags, freed); - } - } -@@ -321,6 +327,11 @@ static void shmem_swp_set(struct shmem_i - info->swapped += incdec; - if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT) - kmap_atomic_to_page(entry)->nr_swapped += incdec; -+ -+ if (incdec == 1) -+ ub_tmpfs_respages_dec(shm_info_ub(info), 1); -+ else -+ ub_tmpfs_respages_inc(shm_info_ub(info), 1); - } - - /* -@@ -337,14 +348,24 @@ static swp_entry_t *shmem_swp_alloc(stru - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); - struct page *page = NULL; - swp_entry_t *entry; -+ unsigned long ub_val; - - if (sgp != SGP_WRITE && - ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) - return ERR_PTR(-EINVAL); - -+ ub_val = 0; -+ if (info->next_index <= index) { -+ ub_val = index + 1 - info->next_index; -+ if (ub_shmpages_charge(shm_info_ub(info), ub_val)) -+ return ERR_PTR(-ENOSPC); -+ } -+ - while (!(entry = shmem_swp_entry(info, index, &page))) { -- if (sgp == SGP_READ) -- return 
shmem_swp_map(ZERO_PAGE(0)); -+ if (sgp == SGP_READ) { -+ entry = shmem_swp_map(ZERO_PAGE(0)); -+ goto out; -+ } - /* - * Test free_blocks against 1 not 0, since we have 1 data - * page (and perhaps indirect index pages) yet to allocate: -@@ -353,14 +374,16 @@ static swp_entry_t *shmem_swp_alloc(stru - spin_lock(&sbinfo->stat_lock); - if (sbinfo->free_blocks <= 1) { - spin_unlock(&sbinfo->stat_lock); -- return ERR_PTR(-ENOSPC); -+ entry = ERR_PTR(-ENOSPC); -+ goto out; - } - sbinfo->free_blocks--; - inode->i_blocks += BLOCKS_PER_PAGE; - spin_unlock(&sbinfo->stat_lock); - - spin_unlock(&info->lock); -- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping)); -+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | -+ __GFP_UBC); - if (page) { - clear_highpage(page); - page->nr_swapped = 0; -@@ -368,25 +391,36 @@ static swp_entry_t *shmem_swp_alloc(stru - spin_lock(&info->lock); - - if (!page) { -- shmem_free_block(inode); -- return ERR_PTR(-ENOMEM); -+ entry = ERR_PTR(-ENOMEM); -+ goto out_block; - } - if (sgp != SGP_WRITE && - ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { - entry = ERR_PTR(-EINVAL); -- break; -+ goto out_page; - } -- if (info->next_index <= index) -+ if (info->next_index <= index) { -+ ub_val = 0; - info->next_index = index + 1; -+ } - } - if (page) { - /* another task gave its page, or truncated the file */ - shmem_free_block(inode); - shmem_dir_free(page); - } -- if (info->next_index <= index && !IS_ERR(entry)) -+ if (info->next_index <= index) - info->next_index = index + 1; - return entry; -+ -+out_page: -+ shmem_dir_free(page); -+out_block: -+ shmem_free_block(inode); -+out: -+ if (ub_val) -+ ub_shmpages_uncharge(shm_info_ub(info), ub_val); -+ return entry; - } - - /* -@@ -423,13 +457,16 @@ static void shmem_truncate(struct inode - swp_entry_t *ptr; - int offset; - int freed; -+ unsigned long swp_freed; - -+ swp_freed = 0; - inode->i_ctime = inode->i_mtime = CURRENT_TIME; - idx = (inode->i_size + PAGE_CACHE_SIZE - 1) 
>> PAGE_CACHE_SHIFT; - if (idx >= info->next_index) - return; - - spin_lock(&info->lock); -+ ub_shmpages_uncharge(shm_info_ub(info), info->next_index - idx); - info->flags |= SHMEM_TRUNCATE; - limit = info->next_index; - info->next_index = idx; -@@ -438,7 +475,9 @@ static void shmem_truncate(struct inode - size = limit; - if (size > SHMEM_NR_DIRECT) - size = SHMEM_NR_DIRECT; -- info->swapped -= shmem_free_swp(ptr+idx, ptr+size); -+ freed = shmem_free_swp(ptr+idx, ptr+size); -+ swp_freed += freed; -+ info->swapped -= freed; - } - if (!info->i_indirect) - goto done2; -@@ -508,6 +547,7 @@ static void shmem_truncate(struct inode - shmem_swp_unmap(ptr); - info->swapped -= freed; - subdir->nr_swapped -= freed; -+ swp_freed += freed; - BUG_ON(subdir->nr_swapped > offset); - } - if (offset) -@@ -544,7 +584,7 @@ done2: - spin_lock(&info->lock); - } - info->flags &= ~SHMEM_TRUNCATE; -- shmem_recalc_inode(inode); -+ shmem_recalc_inode(inode, swp_freed); - spin_unlock(&info->lock); - } - -@@ -609,6 +649,8 @@ static void shmem_delete_inode(struct in - spin_lock(&sbinfo->stat_lock); - sbinfo->free_inodes++; - spin_unlock(&sbinfo->stat_lock); -+ put_beancounter(shm_info_ub(info)); -+ shm_info_ub(info) = NULL; - clear_inode(inode); - } - -@@ -752,12 +794,11 @@ static int shmem_writepage(struct page * - info = SHMEM_I(inode); - if (info->flags & VM_LOCKED) - goto redirty; -- swap = get_swap_page(); -+ swap = get_swap_page(shm_info_ub(info)); - if (!swap.val) - goto redirty; - - spin_lock(&info->lock); -- shmem_recalc_inode(inode); - if (index >= info->next_index) { - BUG_ON(!(info->flags & SHMEM_TRUNCATE)); - goto unlock; -@@ -890,7 +931,6 @@ repeat: - goto failed; - - spin_lock(&info->lock); -- shmem_recalc_inode(inode); - entry = shmem_swp_alloc(info, idx, sgp); - if (IS_ERR(entry)) { - spin_unlock(&info->lock); -@@ -1051,6 +1091,7 @@ repeat: - clear_highpage(filepage); - flush_dcache_page(filepage); - SetPageUptodate(filepage); -+ ub_tmpfs_respages_inc(shm_info_ub(info), 1); - } 
- done: - if (!*pagep) { -@@ -1082,6 +1123,8 @@ struct page *shmem_nopage(struct vm_area - idx = (address - vma->vm_start) >> PAGE_SHIFT; - idx += vma->vm_pgoff; - idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; -+ if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode)) -+ return NOPAGE_SIGBUS; - - error = shmem_getpage(inode, idx, &page, SGP_CACHE, type); - if (error) -@@ -1151,19 +1194,6 @@ shmem_get_policy(struct vm_area_struct * - } - #endif - --void shmem_lock(struct file *file, int lock) --{ -- struct inode *inode = file->f_dentry->d_inode; -- struct shmem_inode_info *info = SHMEM_I(inode); -- -- spin_lock(&info->lock); -- if (lock) -- info->flags |= VM_LOCKED; -- else -- info->flags &= ~VM_LOCKED; -- spin_unlock(&info->lock); --} -- - static int shmem_mmap(struct file *file, struct vm_area_struct *vma) - { - file_accessed(file); -@@ -1198,6 +1228,7 @@ shmem_get_inode(struct super_block *sb, - inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; - info = SHMEM_I(inode); - memset(info, 0, (char *)inode - (char *)info); -+ shm_info_ub(info) = get_beancounter(get_exec_ub()); - spin_lock_init(&info->lock); - mpol_shared_policy_init(&info->policy); - switch (mode & S_IFMT) { -@@ -1317,6 +1348,7 @@ shmem_file_write(struct file *file, cons - break; - - left = bytes; -+#ifndef CONFIG_X86_UACCESS_INDIRECT - if (PageHighMem(page)) { - volatile unsigned char dummy; - __get_user(dummy, buf); -@@ -1326,6 +1358,7 @@ shmem_file_write(struct file *file, cons - left = __copy_from_user(kaddr + offset, buf, bytes); - kunmap_atomic(kaddr, KM_USER0); - } -+#endif - if (left) { - kaddr = kmap(page); - left = __copy_from_user(kaddr + offset, buf, bytes); -@@ -1960,20 +1993,42 @@ static struct vm_operations_struct shmem - #endif - }; - -+int is_shmem_mapping(struct address_space *map) -+{ -+ return (map != NULL && map->a_ops == &shmem_aops); -+} -+ - static struct super_block *shmem_get_sb(struct file_system_type *fs_type, - int flags, const char *dev_name, void *data) - { 
- return get_sb_nodev(fs_type, flags, data, shmem_fill_super); - } - --static struct file_system_type tmpfs_fs_type = { -+struct file_system_type tmpfs_fs_type = { - .owner = THIS_MODULE, - .name = "tmpfs", - .get_sb = shmem_get_sb, - .kill_sb = kill_litter_super, - }; -+ -+EXPORT_SYMBOL(tmpfs_fs_type); -+ - static struct vfsmount *shm_mnt; - -+#ifndef CONFIG_VE -+#define visible_shm_mnt shm_mnt -+#else -+#define visible_shm_mnt (get_exec_env()->shmem_mnt) -+#endif -+ -+void prepare_shmmnt(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->shmem_mnt = shm_mnt; -+ shm_mnt = (struct vfsmount *)0x10111213; -+#endif -+} -+ - static int __init init_tmpfs(void) - { - int error; -@@ -1999,6 +2054,7 @@ static int __init init_tmpfs(void) - - /* The internal instance should not do size checking */ - shmem_set_size(SHMEM_SB(shm_mnt->mnt_sb), ULONG_MAX, ULONG_MAX); -+ prepare_shmmnt(); - return 0; - - out1: -@@ -2011,6 +2067,32 @@ out3: - } - module_init(init_tmpfs) - -+static inline int shm_charge_ahead(struct inode *inode) -+{ -+ struct shmem_inode_info *info = SHMEM_I(inode); -+ unsigned long idx; -+ swp_entry_t *entry; -+ -+ if (!inode->i_size) -+ return 0; -+ idx = (inode->i_size - 1) >> PAGE_CACHE_SHIFT; -+ /* -+ * Just touch info to allocate space for entry and -+ * make all UBC checks -+ */ -+ spin_lock(&info->lock); -+ entry = shmem_swp_alloc(info, idx, SGP_CACHE); -+ if (IS_ERR(entry)) -+ goto err; -+ shmem_swp_unmap(entry); -+ spin_unlock(&info->lock); -+ return 0; -+ -+err: -+ spin_unlock(&info->lock); -+ return PTR_ERR(entry); -+} -+ - /* - * shmem_file_setup - get an unlinked file living in tmpfs - * -@@ -2026,8 +2108,8 @@ struct file *shmem_file_setup(char *name - struct dentry *dentry, *root; - struct qstr this; - -- if (IS_ERR(shm_mnt)) -- return (void *)shm_mnt; -+ if (IS_ERR(visible_shm_mnt)) -+ return (void *)visible_shm_mnt; - - if (size > SHMEM_MAX_BYTES) - return ERR_PTR(-EINVAL); -@@ -2039,7 +2121,7 @@ struct file *shmem_file_setup(char *name - this.name = 
name; - this.len = strlen(name); - this.hash = 0; /* will go */ -- root = shm_mnt->mnt_root; -+ root = visible_shm_mnt->mnt_root; - dentry = d_alloc(root, &this); - if (!dentry) - goto put_memory; -@@ -2058,7 +2140,10 @@ struct file *shmem_file_setup(char *name - d_instantiate(dentry, inode); - inode->i_size = size; - inode->i_nlink = 0; /* It is unlinked */ -- file->f_vfsmnt = mntget(shm_mnt); -+ error = shm_charge_ahead(inode); -+ if (error) -+ goto close_file; -+ file->f_vfsmnt = mntget(visible_shm_mnt); - file->f_dentry = dentry; - file->f_mapping = inode->i_mapping; - file->f_op = &shmem_file_operations; -@@ -2090,6 +2175,8 @@ int shmem_zero_setup(struct vm_area_stru - - if (vma->vm_file) - fput(vma->vm_file); -+ else if (vma->vm_flags & VM_WRITE) /* should match VM_UB_PRIVATE */ -+ __ub_unused_privvm_dec(mm_ub(vma->vm_mm), size >> PAGE_SHIFT); - vma->vm_file = file; - vma->vm_ops = &shmem_vm_ops; - return 0; -diff -uprN linux-2.6.8.1.orig/mm/slab.c linux-2.6.8.1-ve022stab078/mm/slab.c ---- linux-2.6.8.1.orig/mm/slab.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/slab.c 2006-05-11 13:05:41.000000000 +0400 -@@ -91,32 +91,21 @@ - #include <linux/cpu.h> - #include <linux/sysctl.h> - #include <linux/module.h> -+#include <linux/kmem_slab.h> -+#include <linux/kmem_cache.h> -+#include <linux/kernel_stat.h> -+#include <linux/ve_owner.h> - - #include <asm/uaccess.h> - #include <asm/cacheflush.h> - #include <asm/tlbflush.h> - --/* -- * DEBUG - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL, -- * SLAB_RED_ZONE & SLAB_POISON. -- * 0 for faster, smaller code (especially in the critical paths). -- * -- * STATS - 1 to collect stats for /proc/slabinfo. -- * 0 for faster, smaller code (especially in the critical paths). 
-- * -- * FORCED_DEBUG - 1 enables SLAB_RED_ZONE and SLAB_POISON (if possible) -- */ -- --#ifdef CONFIG_DEBUG_SLAB --#define DEBUG 1 --#define STATS 1 --#define FORCED_DEBUG 1 --#else --#define DEBUG 0 --#define STATS 0 --#define FORCED_DEBUG 0 --#endif -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> - -+#define DEBUG SLAB_DEBUG -+#define STATS SLAB_STATS -+#define FORCED_DEBUG SLAB_FORCED_DEBUG - - /* Shouldn't this be in a header file somewhere? */ - #define BYTES_PER_WORD sizeof(void *) -@@ -139,182 +128,20 @@ - SLAB_POISON | SLAB_HWCACHE_ALIGN | \ - SLAB_NO_REAP | SLAB_CACHE_DMA | \ - SLAB_MUST_HWCACHE_ALIGN | SLAB_STORE_USER | \ -- SLAB_RECLAIM_ACCOUNT | SLAB_PANIC) -+ SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \ -+ SLAB_UBC | SLAB_NO_CHARGE) - #else - # define CREATE_MASK (SLAB_HWCACHE_ALIGN | SLAB_NO_REAP | \ - SLAB_CACHE_DMA | SLAB_MUST_HWCACHE_ALIGN | \ -- SLAB_RECLAIM_ACCOUNT | SLAB_PANIC) -+ SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \ -+ SLAB_UBC | SLAB_NO_CHARGE) - #endif - --/* -- * kmem_bufctl_t: -- * -- * Bufctl's are used for linking objs within a slab -- * linked offsets. -- * -- * This implementation relies on "struct page" for locating the cache & -- * slab an object belongs to. -- * This allows the bufctl structure to be small (one int), but limits -- * the number of objects a slab (not a cache) can contain when off-slab -- * bufctls are used. The limit is the size of the largest general cache -- * that does not use off-slab slabs. -- * For 32bit archs with 4 kB pages, is this 56. -- * This is not serious, as it is only for large objects, when it is unwise -- * to have too many per slab. -- * Note: This limit can be raised by introducing a general cache whose size -- * is less than 512 (PAGE_SIZE<<3), but greater than 256. -- */ -- --#define BUFCTL_END (((kmem_bufctl_t)(~0U))-0) --#define BUFCTL_FREE (((kmem_bufctl_t)(~0U))-1) --#define SLAB_LIMIT (((kmem_bufctl_t)(~0U))-2) -- - /* Max number of objs-per-slab for caches which use off-slab slabs. 
- * Needed to avoid a possible looping condition in cache_grow(). - */ - static unsigned long offslab_limit; - --/* -- * struct slab -- * -- * Manages the objs in a slab. Placed either at the beginning of mem allocated -- * for a slab, or allocated from an general cache. -- * Slabs are chained into three list: fully used, partial, fully free slabs. -- */ --struct slab { -- struct list_head list; -- unsigned long colouroff; -- void *s_mem; /* including colour offset */ -- unsigned int inuse; /* num of objs active in slab */ -- kmem_bufctl_t free; --}; -- --/* -- * struct array_cache -- * -- * Per cpu structures -- * Purpose: -- * - LIFO ordering, to hand out cache-warm objects from _alloc -- * - reduce the number of linked list operations -- * - reduce spinlock operations -- * -- * The limit is stored in the per-cpu structure to reduce the data cache -- * footprint. -- * -- */ --struct array_cache { -- unsigned int avail; -- unsigned int limit; -- unsigned int batchcount; -- unsigned int touched; --}; -- --/* bootstrap: The caches do not work without cpuarrays anymore, -- * but the cpuarrays are allocated from the generic caches... -- */ --#define BOOT_CPUCACHE_ENTRIES 1 --struct arraycache_init { -- struct array_cache cache; -- void * entries[BOOT_CPUCACHE_ENTRIES]; --}; -- --/* -- * The slab lists of all objects. -- * Hopefully reduce the internal fragmentation -- * NUMA: The spinlock could be moved from the kmem_cache_t -- * into this structure, too. Figure out what causes -- * fewer cross-node spinlock operations. 
-- */ --struct kmem_list3 { -- struct list_head slabs_partial; /* partial list first, better asm code */ -- struct list_head slabs_full; -- struct list_head slabs_free; -- unsigned long free_objects; -- int free_touched; -- unsigned long next_reap; -- struct array_cache *shared; --}; -- --#define LIST3_INIT(parent) \ -- { \ -- .slabs_full = LIST_HEAD_INIT(parent.slabs_full), \ -- .slabs_partial = LIST_HEAD_INIT(parent.slabs_partial), \ -- .slabs_free = LIST_HEAD_INIT(parent.slabs_free) \ -- } --#define list3_data(cachep) \ -- (&(cachep)->lists) -- --/* NUMA: per-node */ --#define list3_data_ptr(cachep, ptr) \ -- list3_data(cachep) -- --/* -- * kmem_cache_t -- * -- * manages a cache. -- */ -- --struct kmem_cache_s { --/* 1) per-cpu data, touched during every alloc/free */ -- struct array_cache *array[NR_CPUS]; -- unsigned int batchcount; -- unsigned int limit; --/* 2) touched by every alloc & free from the backend */ -- struct kmem_list3 lists; -- /* NUMA: kmem_3list_t *nodelists[MAX_NUMNODES] */ -- unsigned int objsize; -- unsigned int flags; /* constant flags */ -- unsigned int num; /* # of objs per slab */ -- unsigned int free_limit; /* upper limit of objects in the lists */ -- spinlock_t spinlock; -- --/* 3) cache_grow/shrink */ -- /* order of pgs per slab (2^n) */ -- unsigned int gfporder; -- -- /* force GFP flags, e.g. 
GFP_DMA */ -- unsigned int gfpflags; -- -- size_t colour; /* cache colouring range */ -- unsigned int colour_off; /* colour offset */ -- unsigned int colour_next; /* cache colouring */ -- kmem_cache_t *slabp_cache; -- unsigned int slab_size; -- unsigned int dflags; /* dynamic flags */ -- -- /* constructor func */ -- void (*ctor)(void *, kmem_cache_t *, unsigned long); -- -- /* de-constructor func */ -- void (*dtor)(void *, kmem_cache_t *, unsigned long); -- --/* 4) cache creation/removal */ -- const char *name; -- struct list_head next; -- --/* 5) statistics */ --#if STATS -- unsigned long num_active; -- unsigned long num_allocations; -- unsigned long high_mark; -- unsigned long grown; -- unsigned long reaped; -- unsigned long errors; -- unsigned long max_freeable; -- atomic_t allochit; -- atomic_t allocmiss; -- atomic_t freehit; -- atomic_t freemiss; --#endif --#if DEBUG -- int dbghead; -- int reallen; --#endif --}; -- --#define CFLGS_OFF_SLAB (0x80000000UL) --#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB) -- - #define BATCHREFILL_LIMIT 16 - /* Optimization question: fewer reaps means less - * probability for unnessary cpucache drain/refill cycles. -@@ -446,15 +273,6 @@ static void **dbg_userword(kmem_cache_t - #define BREAK_GFP_ORDER_LO 0 - static int slab_break_gfp_order = BREAK_GFP_ORDER_LO; - --/* Macros for storing/retrieving the cachep and or slab from the -- * global 'mem_map'. These are used to find the slab an obj belongs to. -- * With kfree(), these are used to find the cache which an obj belongs to. -- */ --#define SET_PAGE_CACHE(pg,x) ((pg)->lru.next = (struct list_head *)(x)) --#define GET_PAGE_CACHE(pg) ((kmem_cache_t *)(pg)->lru.next) --#define SET_PAGE_SLAB(pg,x) ((pg)->lru.prev = (struct list_head *)(x)) --#define GET_PAGE_SLAB(pg) ((struct slab *)(pg)->lru.prev) -- - /* These are the default caches for kmalloc. Custom caches can have other sizes. 
*/ - struct cache_sizes malloc_sizes[] = { - #define CACHE(x) { .cs_size = (x) }, -@@ -543,13 +361,24 @@ static void cache_estimate (unsigned lon - size_t wastage = PAGE_SIZE<<gfporder; - size_t extra = 0; - size_t base = 0; -+ size_t ub_align, ub_extra; -+ -+ ub_align = 1; -+ ub_extra = 0; - - if (!(flags & CFLGS_OFF_SLAB)) { - base = sizeof(struct slab); - extra = sizeof(kmem_bufctl_t); -+#ifdef CONFIG_USER_RESOURCE -+ if (flags & SLAB_UBC) { -+ ub_extra = sizeof(void *); -+ ub_align = sizeof(void *); -+ } -+#endif - } - i = 0; -- while (i*size + ALIGN(base+i*extra, align) <= wastage) -+ while (i * size + ALIGN(ALIGN(base + i * extra, ub_align) + -+ i * ub_extra, align) <= wastage) - i++; - if (i > 0) - i--; -@@ -558,8 +387,8 @@ static void cache_estimate (unsigned lon - i = SLAB_LIMIT; - - *num = i; -- wastage -= i*size; -- wastage -= ALIGN(base+i*extra, align); -+ wastage -= i * size + ALIGN(ALIGN(base + i * extra, ub_align) + -+ i * ub_extra, align); - *left_over = wastage; - } - -@@ -747,17 +576,18 @@ void __init kmem_cache_init(void) - * allow tighter packing of the smaller caches. */ - sizes->cs_cachep = kmem_cache_create(names->name, - sizes->cs_size, ARCH_KMALLOC_MINALIGN, -- (ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL); -+ (ARCH_KMALLOC_FLAGS | SLAB_PANIC | -+ SLAB_UBC | SLAB_NO_CHARGE), -+ NULL, NULL); - - /* Inc off-slab bufctl limit until the ceiling is hit. 
*/ -- if (!(OFF_SLAB(sizes->cs_cachep))) { -- offslab_limit = sizes->cs_size-sizeof(struct slab); -- offslab_limit /= sizeof(kmem_bufctl_t); -- } -+ if (!(OFF_SLAB(sizes->cs_cachep))) -+ offslab_limit = sizes->cs_size; - - sizes->cs_dmacachep = kmem_cache_create(names->name_dma, - sizes->cs_size, ARCH_KMALLOC_MINALIGN, -- (ARCH_KMALLOC_FLAGS | SLAB_CACHE_DMA | SLAB_PANIC), -+ (ARCH_KMALLOC_FLAGS | SLAB_CACHE_DMA | SLAB_PANIC | -+ SLAB_UBC | SLAB_NO_CHARGE), - NULL, NULL); - - sizes++; -@@ -1115,7 +945,7 @@ kmem_cache_create (const char *name, siz - unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long), - void (*dtor)(void*, kmem_cache_t *, unsigned long)) - { -- size_t left_over, slab_size; -+ size_t left_over, slab_size, ub_size, ub_align; - kmem_cache_t *cachep = NULL; - - /* -@@ -1249,6 +1079,7 @@ kmem_cache_create (const char *name, siz - */ - do { - unsigned int break_flag = 0; -+ unsigned long off_slab_size; - cal_wastage: - cache_estimate(cachep->gfporder, size, align, flags, - &left_over, &cachep->num); -@@ -1258,12 +1089,22 @@ cal_wastage: - break; - if (!cachep->num) - goto next; -- if (flags & CFLGS_OFF_SLAB && -- cachep->num > offslab_limit) { -+ if (flags & CFLGS_OFF_SLAB) { -+ off_slab_size = sizeof(struct slab) + -+ cachep->num * sizeof(kmem_bufctl_t); -+#ifdef CONFIG_USER_RESOURCE -+ if (flags & SLAB_UBC) -+ off_slab_size = ALIGN(off_slab_size, -+ sizeof(void *)) + -+ cachep->num * sizeof(void *); -+#endif -+ - /* This num of objs will cause problems. 
*/ -- cachep->gfporder--; -- break_flag++; -- goto cal_wastage; -+ if (off_slab_size > offslab_limit) { -+ cachep->gfporder--; -+ break_flag++; -+ goto cal_wastage; -+ } - } - - /* -@@ -1286,8 +1127,19 @@ next: - cachep = NULL; - goto opps; - } -- slab_size = ALIGN(cachep->num*sizeof(kmem_bufctl_t) -- + sizeof(struct slab), align); -+ -+ ub_size = 0; -+ ub_align = 1; -+#ifdef CONFIG_USER_RESOURCE -+ if (flags & SLAB_UBC) { -+ ub_size = sizeof(void *); -+ ub_align = sizeof(void *); -+ } -+#endif -+ -+ slab_size = ALIGN(ALIGN(cachep->num * sizeof(kmem_bufctl_t) + -+ sizeof(struct slab), ub_align) + -+ cachep->num * ub_size, align); - - /* - * If the slab has been placed off-slab, and we have enough space then -@@ -1300,7 +1152,9 @@ next: - - if (flags & CFLGS_OFF_SLAB) { - /* really off slab. No need for manual alignment */ -- slab_size = cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab); -+ slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t) + -+ sizeof(struct slab), ub_align) + -+ cachep->num * ub_size; - } - - cachep->colour_off = cache_line_size(); -@@ -1337,10 +1191,13 @@ next: - * the cache that's used by kmalloc(24), otherwise - * the creation of further caches will BUG(). - */ -- cachep->array[smp_processor_id()] = &initarray_generic.cache; -+ cachep->array[smp_processor_id()] = -+ &initarray_generic.cache; - g_cpucache_up = PARTIAL; - } else { -- cachep->array[smp_processor_id()] = kmalloc(sizeof(struct arraycache_init),GFP_KERNEL); -+ cachep->array[smp_processor_id()] = -+ kmalloc(sizeof(struct arraycache_init), -+ GFP_KERNEL); - } - BUG_ON(!ac_data(cachep)); - ac_data(cachep)->avail = 0; -@@ -1354,7 +1211,7 @@ next: - } - - cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 + -- ((unsigned long)cachep)%REAPTIMEOUT_LIST3; -+ ((unsigned long)cachep)%REAPTIMEOUT_LIST3; - - /* Need the semaphore to access the chain. 
*/ - down(&cache_chain_sem); -@@ -1367,16 +1224,24 @@ next: - list_for_each(p, &cache_chain) { - kmem_cache_t *pc = list_entry(p, kmem_cache_t, next); - char tmp; -- /* This happens when the module gets unloaded and doesn't -- destroy its slab cache and noone else reuses the vmalloc -- area of the module. Print a warning. */ -- if (__get_user(tmp,pc->name)) { -- printk("SLAB: cache with size %d has lost its name\n", -- pc->objsize); -+ -+ /* -+ * This happens when the module gets unloaded and -+ * doesn't destroy its slab cache and noone else reuses -+ * the vmalloc area of the module. Print a warning. -+ */ -+#ifdef CONFIG_X86_UACCESS_INDIRECT -+ if (__direct_get_user(tmp,pc->name)) { -+#else -+ if (__get_user(tmp,pc->name)) { -+#endif -+ printk("SLAB: cache with size %d has lost its " -+ "name\n", pc->objsize); - continue; - } - if (!strcmp(pc->name,name)) { -- printk("kmem_cache_create: duplicate cache %s\n",name); -+ printk("kmem_cache_create: duplicate " -+ "cache %s\n",name); - up(&cache_chain_sem); - unlock_cpu_hotplug(); - BUG(); -@@ -1389,6 +1254,16 @@ next: - list_add(&cachep->next, &cache_chain); - up(&cache_chain_sem); - unlock_cpu_hotplug(); -+ -+#ifdef CONFIG_USER_RESOURCE -+ cachep->objuse = ((PAGE_SIZE << cachep->gfporder) + cachep->num - 1) / -+ cachep->num; -+ if (OFF_SLAB(cachep)) -+ cachep->objuse += -+ (cachep->slabp_cache->objuse + cachep->num - 1) -+ / cachep->num; -+#endif -+ - opps: - if (!cachep && (flags & SLAB_PANIC)) - panic("kmem_cache_create(): failed to create slab `%s'\n", -@@ -1572,6 +1447,7 @@ int kmem_cache_destroy (kmem_cache_t * c - /* NUMA: free the list3 structures */ - kfree(cachep->lists.shared); - cachep->lists.shared = NULL; -+ ub_kmemcache_free(cachep); - kmem_cache_free(&cache_cache, cachep); - - unlock_cpu_hotplug(); -@@ -1586,28 +1462,30 @@ static struct slab* alloc_slabmgmt (kmem - void *objp, int colour_off, int local_flags) - { - struct slab *slabp; -- -+ - if (OFF_SLAB(cachep)) { - /* Slab management obj is 
off-slab. */ -- slabp = kmem_cache_alloc(cachep->slabp_cache, local_flags); -+ slabp = kmem_cache_alloc(cachep->slabp_cache, -+ local_flags & (~__GFP_UBC)); - if (!slabp) - return NULL; - } else { - slabp = objp+colour_off; - colour_off += cachep->slab_size; - } -+ - slabp->inuse = 0; - slabp->colouroff = colour_off; - slabp->s_mem = objp+colour_off; - -+#ifdef CONFIG_USER_RESOURCE -+ if (cachep->flags & SLAB_UBC) -+ memset(slab_ubcs(cachep, slabp), 0, cachep->num * -+ sizeof(struct user_beancounter *)); -+#endif - return slabp; - } - --static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp) --{ -- return (kmem_bufctl_t *)(slabp+1); --} -- - static void cache_init_objs (kmem_cache_t * cachep, - struct slab * slabp, unsigned long ctor_flags) - { -@@ -1735,7 +1613,7 @@ static int cache_grow (kmem_cache_t * ca - - - /* Get mem for the objs. */ -- if (!(objp = kmem_getpages(cachep, flags, -1))) -+ if (!(objp = kmem_getpages(cachep, flags & (~__GFP_UBC), -1))) - goto failed; - - /* Get slab management. 
*/ -@@ -2038,6 +1916,16 @@ cache_alloc_debugcheck_after(kmem_cache_ - #define cache_alloc_debugcheck_after(a,b,objp,d) (objp) - #endif - -+static inline int should_charge(kmem_cache_t *cachep, int flags, void *objp) -+{ -+ if (objp == NULL) -+ return 0; -+ if (!(cachep->flags & SLAB_UBC)) -+ return 0; -+ if ((cachep->flags & SLAB_NO_CHARGE) && !(flags & __GFP_UBC)) -+ return 0; -+ return 1; -+} - - static inline void * __cache_alloc (kmem_cache_t *cachep, int flags) - { -@@ -2058,8 +1946,18 @@ static inline void * __cache_alloc (kmem - objp = cache_alloc_refill(cachep, flags); - } - local_irq_restore(save_flags); -+ -+ if (should_charge(cachep, flags, objp) && -+ ub_slab_charge(objp, flags) < 0) -+ goto out_err; -+ - objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0)); - return objp; -+ -+out_err: -+ objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0)); -+ kmem_cache_free(cachep, objp); -+ return NULL; - } - - /* -@@ -2182,6 +2080,9 @@ static inline void __cache_free (kmem_ca - check_irq_off(); - objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0)); - -+ if (cachep->flags & SLAB_UBC) -+ ub_slab_uncharge(objp); -+ - if (likely(ac->avail < ac->limit)) { - STATS_INC_FREEHIT(cachep); - ac_entry(ac)[ac->avail++] = objp; -@@ -2434,6 +2335,20 @@ void kmem_cache_free (kmem_cache_t *cach - EXPORT_SYMBOL(kmem_cache_free); - - /** -+ * kzalloc - allocate memory. The memory is set to zero. -+ * @size: how many bytes of memory are required. -+ * @flags: the type of memory to allocate. -+ */ -+void *kzalloc(size_t size, gfp_t flags) -+{ -+ void *ret = kmalloc(size, flags); -+ if (ret) -+ memset(ret, 0, size); -+ return ret; -+} -+EXPORT_SYMBOL(kzalloc); -+ -+/** - * kfree - free previously allocated memory - * @objp: pointer returned by kmalloc. 
- * -@@ -2475,6 +2390,7 @@ free_percpu(const void *objp) - continue; - kfree(p->ptrs[i]); - } -+ kfree(p); - } - - EXPORT_SYMBOL(free_percpu); -@@ -2693,6 +2609,7 @@ static void cache_reap (void) - if (down_trylock(&cache_chain_sem)) - return; - -+ {KSTAT_PERF_ENTER(cache_reap) - list_for_each(walk, &cache_chain) { - kmem_cache_t *searchp; - struct list_head* p; -@@ -2755,6 +2672,7 @@ next: - } - check_irq_on(); - up(&cache_chain_sem); -+ KSTAT_PERF_LEAVE(cache_reap)} - } - - /* -diff -uprN linux-2.6.8.1.orig/mm/swap.c linux-2.6.8.1-ve022stab078/mm/swap.c ---- linux-2.6.8.1.orig/mm/swap.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/swap.c 2006-05-11 13:05:32.000000000 +0400 -@@ -351,7 +351,9 @@ void pagevec_strip(struct pagevec *pvec) - struct page *page = pvec->pages[i]; - - if (PagePrivate(page) && !TestSetPageLocked(page)) { -- try_to_release_page(page, 0); -+ /* need to recheck after lock */ -+ if (page_has_buffers(page)) -+ try_to_release_page(page, 0); - unlock_page(page); - } - } -diff -uprN linux-2.6.8.1.orig/mm/swap_state.c linux-2.6.8.1-ve022stab078/mm/swap_state.c ---- linux-2.6.8.1.orig/mm/swap_state.c 2004-08-14 14:55:20.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/swap_state.c 2006-05-11 13:05:42.000000000 +0400 -@@ -14,9 +14,15 @@ - #include <linux/pagemap.h> - #include <linux/buffer_head.h> - #include <linux/backing-dev.h> -+#include <linux/kernel_stat.h> - - #include <asm/pgtable.h> - -+#include <ub/beancounter.h> -+#include <ub/ub_mem.h> -+#include <ub/ub_page.h> -+#include <ub/ub_vmpages.h> -+ - /* - * swapper_space is a fiction, retained to simplify the path through - * vmscan's shrink_list, to make sync_page look nicer, and to allow -@@ -42,23 +48,20 @@ struct address_space swapper_space = { - }; - EXPORT_SYMBOL(swapper_space); - -+/* can't remove variable swap_cache_info due to dynamic kernel */ - #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) - --static struct { -- unsigned long add_total; -- 
unsigned long del_total; -- unsigned long find_success; -- unsigned long find_total; -- unsigned long noent_race; -- unsigned long exist_race; --} swap_cache_info; -+struct swap_cache_info_struct swap_cache_info; -+EXPORT_SYMBOL(swap_cache_info); - - void show_swap_cache_info(void) - { -- printk("Swap cache: add %lu, delete %lu, find %lu/%lu, race %lu+%lu\n", -+ printk("Swap cache: add %lu, delete %lu, find %lu/%lu, " -+ "race %lu+%lu+%lu\n", - swap_cache_info.add_total, swap_cache_info.del_total, - swap_cache_info.find_success, swap_cache_info.find_total, -- swap_cache_info.noent_race, swap_cache_info.exist_race); -+ swap_cache_info.noent_race, swap_cache_info.exist_race, -+ swap_cache_info.remove_race); - } - - /* -@@ -148,7 +151,14 @@ int add_to_swap(struct page * page) - BUG(); - - for (;;) { -- entry = get_swap_page(); -+ struct user_beancounter *ub; -+ -+ ub = pb_grab_page_ub(page); -+ if (IS_ERR(ub)) -+ return 0; -+ -+ entry = get_swap_page(ub); -+ put_beancounter(ub); - if (!entry.val) - return 0; - -@@ -264,10 +274,13 @@ int move_from_swap_cache(struct page *pa - */ - static inline void free_swap_cache(struct page *page) - { -- if (PageSwapCache(page) && !TestSetPageLocked(page)) { -+ if (!PageSwapCache(page)) -+ return; -+ if (!TestSetPageLocked(page)) { - remove_exclusive_swap_page(page); - unlock_page(page); -- } -+ } else -+ INC_CACHE_INFO(remove_race); - } - - /* -diff -uprN linux-2.6.8.1.orig/mm/swapfile.c linux-2.6.8.1-ve022stab078/mm/swapfile.c ---- linux-2.6.8.1.orig/mm/swapfile.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/swapfile.c 2006-05-11 13:05:45.000000000 +0400 -@@ -30,6 +30,8 @@ - #include <asm/tlbflush.h> - #include <linux/swapops.h> - -+#include <ub/ub_vmpages.h> -+ - spinlock_t swaplock = SPIN_LOCK_UNLOCKED; - unsigned int nr_swapfiles; - long total_swap_pages; -@@ -147,7 +149,7 @@ static inline int scan_swap_map(struct s - return 0; - } - --swp_entry_t get_swap_page(void) -+swp_entry_t get_swap_page(struct 
user_beancounter *ub) - { - struct swap_info_struct * p; - unsigned long offset; -@@ -164,7 +166,7 @@ swp_entry_t get_swap_page(void) - - while (1) { - p = &swap_info[type]; -- if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) { -+ if ((p->flags & (SWP_ACTIVE|SWP_READONLY)) == SWP_ACTIVE) { - swap_device_lock(p); - offset = scan_swap_map(p); - swap_device_unlock(p); -@@ -177,6 +179,12 @@ swp_entry_t get_swap_page(void) - } else { - swap_list.next = type; - } -+#if CONFIG_USER_SWAP_ACCOUNTING -+ if (p->owner_map[offset] != NULL) -+ BUG(); -+ ub_swapentry_inc(ub); -+ p->owner_map[offset] = get_beancounter(ub); -+#endif - goto out; - } - } -@@ -248,6 +256,11 @@ static int swap_entry_free(struct swap_i - count--; - p->swap_map[offset] = count; - if (!count) { -+#if CONFIG_USER_SWAP_ACCOUNTING -+ ub_swapentry_dec(p->owner_map[offset]); -+ put_beancounter(p->owner_map[offset]); -+ p->owner_map[offset] = NULL; -+#endif - if (offset < p->lowest_bit) - p->lowest_bit = offset; - if (offset > p->highest_bit) -@@ -288,7 +301,8 @@ static int exclusive_swap_page(struct pa - p = swap_info_get(entry); - if (p) { - /* Is the only swap cache user the cache itself? */ -- if (p->swap_map[swp_offset(entry)] == 1) { -+ if ((p->flags & (SWP_ACTIVE|SWP_READONLY)) == SWP_ACTIVE && -+ p->swap_map[swp_offset(entry)] == 1) { - /* Recheck the page count with the swapcache lock held.. 
*/ - spin_lock_irq(&swapper_space.tree_lock); - if (page_count(page) == 2) -@@ -379,6 +393,54 @@ int remove_exclusive_swap_page(struct pa - return retval; - } - -+int try_to_remove_exclusive_swap_page(struct page *page) -+{ -+ int retval; -+ struct swap_info_struct * p; -+ swp_entry_t entry; -+ -+ BUG_ON(PagePrivate(page)); -+ BUG_ON(!PageLocked(page)); -+ -+ if (!PageSwapCache(page)) -+ return 0; -+ if (PageWriteback(page)) -+ return 0; -+ if (page_count(page) != 2) /* 2: us + cache */ -+ return 0; -+ -+ entry.val = page->private; -+ p = swap_info_get(entry); -+ if (!p) -+ return 0; -+ if (!vm_swap_full() && -+ (p->flags & (SWP_ACTIVE|SWP_READONLY)) == SWP_ACTIVE) { -+ swap_info_put(p); -+ return 0; -+ } -+ -+ /* Is the only swap cache user the cache itself? */ -+ retval = 0; -+ if (p->swap_map[swp_offset(entry)] == 1) { -+ /* Recheck the page count with the swapcache lock held.. */ -+ spin_lock_irq(&swapper_space.tree_lock); -+ if ((page_count(page) == 2) && !PageWriteback(page)) { -+ __delete_from_swap_cache(page); -+ SetPageDirty(page); -+ retval = 1; -+ } -+ spin_unlock_irq(&swapper_space.tree_lock); -+ } -+ swap_info_put(p); -+ -+ if (retval) { -+ swap_free(entry); -+ page_cache_release(page); -+ } -+ -+ return retval; -+} -+ - /* - * Free the swap entry like above, but also try to - * free the page cache entry if it is the last user. 
-@@ -428,9 +490,12 @@ void free_swap_and_cache(swp_entry_t ent - /* vma->vm_mm->page_table_lock is held */ - static void - unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir, -- swp_entry_t entry, struct page *page) -+ swp_entry_t entry, struct page *page, struct page_beancounter **ppbs) - { - vma->vm_mm->rss++; -+ vma->vm_rss++; -+ ub_unused_privvm_dec(mm_ub(vma->vm_mm), 1, vma); -+ pb_add_list_ref(page, mm_ub(vma->vm_mm), ppbs); - get_page(page); - set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot))); - page_add_anon_rmap(page, vma, address); -@@ -440,7 +505,7 @@ unuse_pte(struct vm_area_struct *vma, un - /* vma->vm_mm->page_table_lock is held */ - static unsigned long unuse_pmd(struct vm_area_struct * vma, pmd_t *dir, - unsigned long address, unsigned long size, unsigned long offset, -- swp_entry_t entry, struct page *page) -+ swp_entry_t entry, struct page *page, struct page_beancounter **ppbs) - { - pte_t * pte; - unsigned long end; -@@ -465,7 +530,8 @@ static unsigned long unuse_pmd(struct vm - * Test inline before going to call unuse_pte. 
- */ - if (unlikely(pte_same(*pte, swp_pte))) { -- unuse_pte(vma, offset + address, pte, entry, page); -+ unuse_pte(vma, offset + address, pte, entry, page, -+ ppbs); - pte_unmap(pte); - - /* -@@ -486,8 +552,8 @@ static unsigned long unuse_pmd(struct vm - - /* vma->vm_mm->page_table_lock is held */ - static unsigned long unuse_pgd(struct vm_area_struct * vma, pgd_t *dir, -- unsigned long address, unsigned long size, -- swp_entry_t entry, struct page *page) -+ unsigned long address, unsigned long size, swp_entry_t entry, -+ struct page *page, struct page_beancounter **ppbs) - { - pmd_t * pmd; - unsigned long offset, end; -@@ -510,7 +576,7 @@ static unsigned long unuse_pgd(struct vm - BUG(); - do { - foundaddr = unuse_pmd(vma, pmd, address, end - address, -- offset, entry, page); -+ offset, entry, page, ppbs); - if (foundaddr) - return foundaddr; - address = (address + PMD_SIZE) & PMD_MASK; -@@ -521,7 +587,7 @@ static unsigned long unuse_pgd(struct vm - - /* vma->vm_mm->page_table_lock is held */ - static unsigned long unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir, -- swp_entry_t entry, struct page *page) -+ swp_entry_t entry, struct page *page, struct page_beancounter **ppbs) - { - unsigned long start = vma->vm_start, end = vma->vm_end; - unsigned long foundaddr; -@@ -530,7 +596,7 @@ static unsigned long unuse_vma(struct vm - BUG(); - do { - foundaddr = unuse_pgd(vma, pgdir, start, end - start, -- entry, page); -+ entry, page, ppbs); - if (foundaddr) - return foundaddr; - start = (start + PGDIR_SIZE) & PGDIR_MASK; -@@ -540,7 +606,8 @@ static unsigned long unuse_vma(struct vm - } - - static int unuse_process(struct mm_struct * mm, -- swp_entry_t entry, struct page* page) -+ swp_entry_t entry, struct page* page, -+ struct page_beancounter **ppbs) - { - struct vm_area_struct* vma; - unsigned long foundaddr = 0; -@@ -561,7 +628,7 @@ static int unuse_process(struct mm_struc - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (!is_vm_hugetlb_page(vma)) { - pgd_t 
* pgd = pgd_offset(mm, vma->vm_start); -- foundaddr = unuse_vma(vma, pgd, entry, page); -+ foundaddr = unuse_vma(vma, pgd, entry, page, ppbs); - if (foundaddr) - break; - } -@@ -629,6 +696,7 @@ static int try_to_unuse(unsigned int typ - int retval = 0; - int reset_overflow = 0; - int shmem; -+ struct page_beancounter *pb_list; - - /* - * When searching mms for an entry, a good strategy is to -@@ -687,6 +755,13 @@ static int try_to_unuse(unsigned int typ - break; - } - -+ pb_list = NULL; -+ if (pb_reserve_all(&pb_list)) { -+ page_cache_release(page); -+ retval = -ENOMEM; -+ break; -+ } -+ - /* - * Don't hold on to start_mm if it looks like exiting. - */ -@@ -709,6 +784,20 @@ static int try_to_unuse(unsigned int typ - lock_page(page); - wait_on_page_writeback(page); - -+ /* If read failed we cannot map not-uptodate page to -+ * user space. Actually, we are in serious troubles, -+ * we do not even know what process to kill. So, the only -+ * variant remains: to stop swapoff() and allow someone -+ * to kill processes to zap invalid pages. -+ */ -+ if (unlikely(!PageUptodate(page))) { -+ pb_free_list(&pb_list); -+ unlock_page(page); -+ page_cache_release(page); -+ retval = -EIO; -+ break; -+ } -+ - /* - * Remove all references to entry, without blocking. 
- * Whenever we reach init_mm, there's no address space -@@ -720,8 +809,10 @@ static int try_to_unuse(unsigned int typ - if (start_mm == &init_mm) - shmem = shmem_unuse(entry, page); - else -- retval = unuse_process(start_mm, entry, page); -+ retval = unuse_process(start_mm, entry, page, -+ &pb_list); - } -+ - if (*swap_map > 1) { - int set_start_mm = (*swap_map >= swcount); - struct list_head *p = &start_mm->mmlist; -@@ -749,7 +840,8 @@ static int try_to_unuse(unsigned int typ - set_start_mm = 1; - shmem = shmem_unuse(entry, page); - } else -- retval = unuse_process(mm, entry, page); -+ retval = unuse_process(mm, entry, page, -+ &pb_list); - if (set_start_mm && *swap_map < swcount) { - mmput(new_start_mm); - atomic_inc(&mm->mm_users); -@@ -763,6 +855,8 @@ static int try_to_unuse(unsigned int typ - mmput(start_mm); - start_mm = new_start_mm; - } -+ -+ pb_free_list(&pb_list); - if (retval) { - unlock_page(page); - page_cache_release(page); -@@ -1078,6 +1172,7 @@ asmlinkage long sys_swapoff(const char _ - { - struct swap_info_struct * p = NULL; - unsigned short *swap_map; -+ struct user_beancounter **owner_map; - struct file *swap_file, *victim; - struct address_space *mapping; - struct inode *inode; -@@ -1085,6 +1180,10 @@ asmlinkage long sys_swapoff(const char _ - int i, type, prev; - int err; - -+ /* VE admin check is just to be on the safe side, the admin may affect -+ * swaps only if he has access to special, i.e. if he has been granted -+ * access to the block device or if the swap file is in the area -+ * visible to him. 
*/ - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - -@@ -1168,12 +1267,15 @@ asmlinkage long sys_swapoff(const char _ - p->max = 0; - swap_map = p->swap_map; - p->swap_map = NULL; -+ owner_map = p->owner_map; -+ p->owner_map = NULL; - p->flags = 0; - destroy_swap_extents(p); - swap_device_unlock(p); - swap_list_unlock(); - up(&swapon_sem); - vfree(swap_map); -+ vfree(owner_map); - inode = mapping->host; - if (S_ISBLK(inode->i_mode)) { - struct block_device *bdev = I_BDEV(inode); -@@ -1310,6 +1412,7 @@ asmlinkage long sys_swapon(const char __ - struct page *page = NULL; - struct inode *inode = NULL; - int did_down = 0; -+ struct user_beancounter **owner_map; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; -@@ -1347,6 +1450,7 @@ asmlinkage long sys_swapon(const char __ - p->highest_bit = 0; - p->cluster_nr = 0; - p->inuse_pages = 0; -+ p->owner_map = NULL; - p->sdev_lock = SPIN_LOCK_UNLOCKED; - p->next = -1; - if (swap_flags & SWAP_FLAG_PREFER) { -@@ -1513,6 +1617,15 @@ asmlinkage long sys_swapon(const char __ - error = -EINVAL; - goto bad_swap; - } -+#if CONFIG_USER_SWAP_ACCOUNTING -+ p->owner_map = vmalloc(maxpages * sizeof(struct user_beancounter *)); -+ if (!p->owner_map) { -+ error = -ENOMEM; -+ goto bad_swap; -+ } -+ memset(p->owner_map, 0, -+ maxpages * sizeof(struct user_beancounter *)); -+#endif - p->swap_map[0] = SWAP_MAP_BAD; - p->max = maxpages; - p->pages = nr_good_pages; -@@ -1525,6 +1638,8 @@ asmlinkage long sys_swapon(const char __ - swap_list_lock(); - swap_device_lock(p); - p->flags = SWP_ACTIVE; -+ if (swap_flags & SWAP_FLAG_READONLY) -+ p->flags |= SWP_READONLY; - nr_swap_pages += nr_good_pages; - total_swap_pages += nr_good_pages; - printk(KERN_INFO "Adding %dk swap on %s. 
Priority:%d extents:%d\n", -@@ -1558,6 +1673,7 @@ bad_swap: - bad_swap_2: - swap_list_lock(); - swap_map = p->swap_map; -+ owner_map = p->owner_map; - p->swap_file = NULL; - p->swap_map = NULL; - p->flags = 0; -@@ -1567,6 +1683,8 @@ bad_swap_2: - destroy_swap_extents(p); - if (swap_map) - vfree(swap_map); -+ if (owner_map) -+ vfree(owner_map); - if (swap_file) - filp_close(swap_file, NULL); - out: -diff -uprN linux-2.6.8.1.orig/mm/truncate.c linux-2.6.8.1-ve022stab078/mm/truncate.c ---- linux-2.6.8.1.orig/mm/truncate.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/truncate.c 2006-05-11 13:05:28.000000000 +0400 -@@ -79,6 +79,12 @@ invalidate_complete_page(struct address_ - spin_unlock_irq(&mapping->tree_lock); - return 0; - } -+ -+ BUG_ON(PagePrivate(page)); -+ if (page_count(page) != 2) { -+ spin_unlock_irq(&mapping->tree_lock); -+ return 0; -+ } - __remove_from_page_cache(page); - spin_unlock_irq(&mapping->tree_lock); - ClearPageUptodate(page); -@@ -268,7 +274,11 @@ void invalidate_inode_pages2(struct addr - clear_page_dirty(page); - ClearPageUptodate(page); - } else { -- invalidate_complete_page(mapping, page); -+ if (!invalidate_complete_page(mapping, -+ page)) { -+ clear_page_dirty(page); -+ ClearPageUptodate(page); -+ } - } - } - unlock_page(page); -diff -uprN linux-2.6.8.1.orig/mm/usercopy.c linux-2.6.8.1-ve022stab078/mm/usercopy.c ---- linux-2.6.8.1.orig/mm/usercopy.c 1970-01-01 03:00:00.000000000 +0300 -+++ linux-2.6.8.1-ve022stab078/mm/usercopy.c 2006-05-11 13:05:38.000000000 +0400 -@@ -0,0 +1,310 @@ -+/* -+ * linux/mm/usercopy.c -+ * -+ * (C) Copyright 2003 Ingo Molnar -+ * -+ * Generic implementation of all the user-VM access functions, without -+ * relying on being able to access the VM directly. 
-+ */ -+ -+#include <linux/module.h> -+#include <linux/sched.h> -+#include <linux/errno.h> -+#include <linux/mm.h> -+#include <linux/highmem.h> -+#include <linux/pagemap.h> -+#include <linux/smp_lock.h> -+#include <linux/ptrace.h> -+#include <linux/interrupt.h> -+ -+#include <asm/pgtable.h> -+#include <asm/uaccess.h> -+#include <asm/atomic_kmap.h> -+ -+/* -+ * Get kernel address of the user page and pin it. -+ */ -+static inline struct page *pin_page(unsigned long addr, int write, -+ pte_t *pte) -+{ -+ struct mm_struct *mm = current->mm ? : &init_mm; -+ struct page *page = NULL; -+ int ret; -+ -+ if (addr >= current_thread_info()->addr_limit.seg) -+ return (struct page *)-1UL; -+ /* -+ * Do a quick atomic lookup first - this is the fastpath. -+ */ -+retry: -+ page = follow_page_pte(mm, addr, write, pte); -+ if (likely(page != NULL)) { -+ if (!PageReserved(page)) -+ get_page(page); -+ return page; -+ } -+ if (pte_present(*pte)) -+ return NULL; -+ /* -+ * No luck - bad address or need to fault in the page: -+ */ -+ -+ /* Release the lock so get_user_pages can sleep */ -+ spin_unlock(&mm->page_table_lock); -+ -+ /* -+ * In the context of filemap_copy_from_user(), we are not allowed -+ * to sleep. We must fail this usercopy attempt and allow -+ * filemap_copy_from_user() to recover: drop its atomic kmap and use -+ * a sleeping kmap instead. -+ */ -+ if (in_atomic()) { -+ spin_lock(&mm->page_table_lock); -+ return NULL; -+ } -+ -+ down_read(&mm->mmap_sem); -+ ret = get_user_pages(current, mm, addr, 1, write, 0, NULL, NULL); -+ up_read(&mm->mmap_sem); -+ spin_lock(&mm->page_table_lock); -+ -+ if (ret <= 0) -+ return NULL; -+ -+ /* -+ * Go try the follow_page again. -+ */ -+ goto retry; -+} -+ -+static inline void unpin_page(struct page *page) -+{ -+ put_page(page); -+} -+ -+/* -+ * Access another process' address space. 
-+ * Source/target buffer must be kernel space, -+ * Do not walk the page table directly, use get_user_pages -+ */ -+static int rw_vm(unsigned long addr, void *buf, int len, int write) -+{ -+ struct mm_struct *mm = current->mm ? : &init_mm; -+ -+ if (!len) -+ return 0; -+ -+ spin_lock(&mm->page_table_lock); -+ -+ /* ignore errors, just check how much was sucessfully transfered */ -+ while (len) { -+ struct page *page = NULL; -+ pte_t pte; -+ int bytes, offset; -+ void *maddr; -+ -+ page = pin_page(addr, write, &pte); -+ if ((page == (struct page *)-1UL) || -+ (!page && !pte_present(pte))) -+ break; -+ -+ bytes = len; -+ offset = addr & (PAGE_SIZE-1); -+ if (bytes > PAGE_SIZE-offset) -+ bytes = PAGE_SIZE-offset; -+ -+ if (page) -+ maddr = kmap_atomic(page, KM_USER_COPY); -+ else -+ /* we will map with user pte -+ */ -+ maddr = kmap_atomic_pte(&pte, KM_USER_COPY); -+ -+#define HANDLE_TYPE(type) \ -+ case sizeof(type): *(type *)(maddr+offset) = *(type *)(buf); break; -+ -+ if (write) { -+ switch (bytes) { -+ HANDLE_TYPE(char); -+ HANDLE_TYPE(int); -+ HANDLE_TYPE(long long); -+ default: -+ memcpy(maddr + offset, buf, bytes); -+ } -+ } else { -+#undef HANDLE_TYPE -+#define HANDLE_TYPE(type) \ -+ case sizeof(type): *(type *)(buf) = *(type *)(maddr+offset); break; -+ switch (bytes) { -+ HANDLE_TYPE(char); -+ HANDLE_TYPE(int); -+ HANDLE_TYPE(long long); -+ default: -+ memcpy(buf, maddr + offset, bytes); -+ } -+#undef HANDLE_TYPE -+ } -+ kunmap_atomic(maddr, KM_USER_COPY); -+ if (page) -+ unpin_page(page); -+ len -= bytes; -+ buf += bytes; -+ addr += bytes; -+ } -+ spin_unlock(&mm->page_table_lock); -+ -+ return len; -+} -+ -+static int str_vm(unsigned long addr, void *buf0, int len, int copy) -+{ -+ struct mm_struct *mm = current->mm ? 
: &init_mm; -+ struct page *page; -+ void *buf = buf0; -+ -+ if (!len) -+ return len; -+ -+ spin_lock(&mm->page_table_lock); -+ -+ /* ignore errors, just check how much was sucessfully transfered */ -+ while (len) { -+ int bytes, offset, left, copied; -+ pte_t pte; -+ char *maddr; -+ -+ page = pin_page(addr, copy == 2, &pte); -+ if ((page == (struct page *)-1UL) || -+ (!page && !pte_present(pte))) { -+ spin_unlock(&mm->page_table_lock); -+ return -EFAULT; -+ } -+ bytes = len; -+ offset = addr & (PAGE_SIZE-1); -+ if (bytes > PAGE_SIZE-offset) -+ bytes = PAGE_SIZE-offset; -+ -+ if (page) -+ maddr = kmap_atomic(page, KM_USER_COPY); -+ else -+ /* we will map with user pte -+ */ -+ maddr = kmap_atomic_pte(&pte, KM_USER_COPY); -+ if (copy == 2) { -+ memset(maddr + offset, 0, bytes); -+ copied = bytes; -+ left = 0; -+ } else if (copy == 1) { -+ left = strncpy_count(buf, maddr + offset, bytes); -+ copied = bytes - left; -+ } else { -+ copied = strnlen(maddr + offset, bytes); -+ left = bytes - copied; -+ } -+ BUG_ON(bytes < 0 || copied < 0); -+ kunmap_atomic(maddr, KM_USER_COPY); -+ if (page) -+ unpin_page(page); -+ len -= copied; -+ buf += copied; -+ addr += copied; -+ if (left) -+ break; -+ } -+ spin_unlock(&mm->page_table_lock); -+ -+ return len; -+} -+ -+/* -+ * Copies memory from userspace (ptr) into kernelspace (val). -+ * -+ * returns # of bytes not copied. -+ */ -+int get_user_size(unsigned int size, void *val, const void *ptr) -+{ -+ int ret; -+ -+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) -+ ret = __direct_copy_from_user(val, ptr, size); -+ else -+ ret = rw_vm((unsigned long)ptr, val, size, 0); -+ if (ret) -+ /* -+ * Zero the rest: -+ */ -+ memset(val + size - ret, 0, ret); -+ return ret; -+} -+ -+/* -+ * Copies memory from kernelspace (val) into userspace (ptr). -+ * -+ * returns # of bytes not copied. 
-+ */ -+int put_user_size(unsigned int size, const void *val, void *ptr) -+{ -+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) -+ return __direct_copy_to_user(ptr, val, size); -+ else -+ return rw_vm((unsigned long)ptr, (void *)val, size, 1); -+} -+ -+int copy_str_fromuser_size(unsigned int size, void *val, const void *ptr) -+{ -+ int copied, left; -+ -+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { -+ left = strncpy_count(val, ptr, size); -+ copied = size - left; -+ BUG_ON(copied < 0); -+ -+ return copied; -+ } -+ left = str_vm((unsigned long)ptr, val, size, 1); -+ if (left < 0) -+ return left; -+ copied = size - left; -+ BUG_ON(copied < 0); -+ -+ return copied; -+} -+ -+int strlen_fromuser_size(unsigned int size, const void *ptr) -+{ -+ int copied, left; -+ -+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { -+ copied = strnlen(ptr, size) + 1; -+ BUG_ON(copied < 0); -+ -+ return copied; -+ } -+ left = str_vm((unsigned long)ptr, NULL, size, 0); -+ if (left < 0) -+ return 0; -+ copied = size - left + 1; -+ BUG_ON(copied < 0); -+ -+ return copied; -+} -+ -+int zero_user_size(unsigned int size, void *ptr) -+{ -+ int left; -+ -+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { -+ memset(ptr, 0, size); -+ return 0; -+ } -+ left = str_vm((unsigned long)ptr, NULL, size, 2); -+ if (left < 0) -+ return size; -+ return left; -+} -+ -+EXPORT_SYMBOL(get_user_size); -+EXPORT_SYMBOL(put_user_size); -+EXPORT_SYMBOL(zero_user_size); -+EXPORT_SYMBOL(copy_str_fromuser_size); -+EXPORT_SYMBOL(strlen_fromuser_size); -diff -uprN linux-2.6.8.1.orig/mm/vmalloc.c linux-2.6.8.1-ve022stab078/mm/vmalloc.c ---- linux-2.6.8.1.orig/mm/vmalloc.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/vmalloc.c 2006-05-11 13:05:41.000000000 +0400 -@@ -19,6 +19,7 @@ - #include <asm/uaccess.h> - #include <asm/tlbflush.h> - -+#include <ub/ub_debug.h> - - rwlock_t vmlist_lock = RW_LOCK_UNLOCKED; - struct vm_struct *vmlist; -@@ -246,6 +247,66 @@ struct vm_struct *get_vm_area(unsigned 
l - return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END); - } - -+struct vm_struct * get_vm_area_best(unsigned long size, unsigned long flags) -+{ -+ unsigned long addr, best_addr, delta, best_delta; -+ struct vm_struct **p, **best_p, *tmp, *area; -+ -+ area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL); -+ if (!area) -+ return NULL; -+ -+ size += PAGE_SIZE; /* one-page gap at the end */ -+ addr = VMALLOC_START; -+ best_addr = 0UL; -+ best_p = NULL; -+ best_delta = PAGE_ALIGN(VMALLOC_END) - VMALLOC_START; -+ -+ write_lock(&vmlist_lock); -+ for (p = &vmlist; (tmp = *p) ; p = &tmp->next) { -+ if ((size + addr) < addr) -+ break; -+ delta = (unsigned long) tmp->addr - (size + addr); -+ if (delta < best_delta) { -+ best_delta = delta; -+ best_addr = addr; -+ best_p = p; -+ } -+ addr = tmp->size + (unsigned long) tmp->addr; -+ if (addr > VMALLOC_END-size) -+ break; -+ } -+ -+ if (!tmp) { -+ /* check free area after list end */ -+ delta = (unsigned long) PAGE_ALIGN(VMALLOC_END) - (size + addr); -+ if (delta < best_delta) { -+ best_delta = delta; -+ best_addr = addr; -+ best_p = p; -+ } -+ } -+ if (best_addr) { -+ area->flags = flags; -+ /* allocate at the end of this area */ -+ area->addr = (void *)(best_addr + best_delta); -+ area->size = size; -+ area->next = *best_p; -+ area->pages = NULL; -+ area->nr_pages = 0; -+ area->phys_addr = 0; -+ *best_p = area; -+ /* check like in __vunmap */ -+ WARN_ON((PAGE_SIZE - 1) & (unsigned long)area->addr); -+ } else { -+ kfree(area); -+ area = NULL; -+ } -+ write_unlock(&vmlist_lock); -+ -+ return area; -+} -+ - /** - * remove_vm_area - find and remove a contingous kernel virtual area - * -@@ -298,6 +359,7 @@ void __vunmap(void *addr, int deallocate - if (deallocate_pages) { - int i; - -+ dec_vmalloc_charged(area); - for (i = 0; i < area->nr_pages; i++) { - if (unlikely(!area->pages[i])) - BUG(); -@@ -390,17 +452,20 @@ EXPORT_SYMBOL(vmap); - * allocator with @gfp_mask flags. 
Map them into contiguous - * kernel virtual space, using a pagetable protection of @prot. - */ --void *__vmalloc(unsigned long size, int gfp_mask, pgprot_t prot) -+void *____vmalloc(unsigned long size, int gfp_mask, pgprot_t prot, int best) - { - struct vm_struct *area; - struct page **pages; -- unsigned int nr_pages, array_size, i; -+ unsigned int nr_pages, array_size, i, j; - - size = PAGE_ALIGN(size); - if (!size || (size >> PAGE_SHIFT) > num_physpages) - return NULL; - -- area = get_vm_area(size, VM_ALLOC); -+ if (best) -+ area = get_vm_area_best(size, VM_ALLOC); -+ else -+ area = get_vm_area(size, VM_ALLOC); - if (!area) - return NULL; - -@@ -409,31 +474,38 @@ void *__vmalloc(unsigned long size, int - - area->nr_pages = nr_pages; - area->pages = pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM)); -- if (!area->pages) { -- remove_vm_area(area->addr); -- kfree(area); -- return NULL; -- } -+ if (!area->pages) -+ goto fail_area; - memset(area->pages, 0, array_size); - - for (i = 0; i < area->nr_pages; i++) { - area->pages[i] = alloc_page(gfp_mask); -- if (unlikely(!area->pages[i])) { -- /* Successfully allocated i pages, free them in __vunmap() */ -- area->nr_pages = i; -+ if (unlikely(!area->pages[i])) - goto fail; -- } - } - - if (map_vm_area(area, prot, &pages)) - goto fail; -+ -+ inc_vmalloc_charged(area, gfp_mask); - return area->addr; - - fail: -- vfree(area->addr); -+ for (j = 0; j < i; j++) -+ __free_page(area->pages[j]); -+ kfree(area->pages); -+fail_area: -+ remove_vm_area(area->addr); -+ kfree(area); -+ - return NULL; - } - -+void *__vmalloc(unsigned long size, int gfp_mask, pgprot_t prot) -+{ -+ return ____vmalloc(size, gfp_mask, prot, 0); -+} -+ - EXPORT_SYMBOL(__vmalloc); - - /** -@@ -454,6 +526,20 @@ void *vmalloc(unsigned long size) - - EXPORT_SYMBOL(vmalloc); - -+void *vmalloc_best(unsigned long size) -+{ -+ return ____vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, 1); -+} -+ -+EXPORT_SYMBOL(vmalloc_best); -+ -+void 
*ub_vmalloc_best(unsigned long size) -+{ -+ return ____vmalloc(size, GFP_KERNEL_UBC | __GFP_HIGHMEM, PAGE_KERNEL, 1); -+} -+ -+EXPORT_SYMBOL(ub_vmalloc_best); -+ - /** - * vmalloc_exec - allocate virtually contiguous, executable memory - * -@@ -565,3 +651,37 @@ finished: - read_unlock(&vmlist_lock); - return buf - buf_start; - } -+ -+void vprintstat(void) -+{ -+ struct vm_struct *p, *last_p = NULL; -+ unsigned long addr, size, free_size, max_free_size; -+ int num; -+ -+ addr = VMALLOC_START; -+ size = max_free_size = 0; -+ num = 0; -+ -+ read_lock(&vmlist_lock); -+ for (p = vmlist; p; p = p->next) { -+ free_size = (unsigned long)p->addr - addr; -+ if (free_size > max_free_size) -+ max_free_size = free_size; -+ addr = (unsigned long)p->addr + p->size; -+ size += p->size; -+ ++num; -+ last_p = p; -+ } -+ if (last_p) { -+ free_size = VMALLOC_END - -+ ((unsigned long)last_p->addr + last_p->size); -+ if (free_size > max_free_size) -+ max_free_size = free_size; -+ } -+ read_unlock(&vmlist_lock); -+ -+ printk("VMALLOC Used: %luKB Total: %luKB Entries: %d\n" -+ " Max_Free: %luKB Start: %lx End: %lx\n", -+ size/1024, (VMALLOC_END - VMALLOC_START)/1024, num, -+ max_free_size/1024, VMALLOC_START, VMALLOC_END); -+} -diff -uprN linux-2.6.8.1.orig/mm/vmscan.c linux-2.6.8.1-ve022stab078/mm/vmscan.c ---- linux-2.6.8.1.orig/mm/vmscan.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/mm/vmscan.c 2006-05-11 13:05:41.000000000 +0400 -@@ -38,6 +38,8 @@ - - #include <linux/swapops.h> - -+#include <ub/ub_mem.h> -+ - /* possible outcome of pageout() */ - typedef enum { - /* failed to write page out, page is locked */ -@@ -72,6 +74,8 @@ struct scan_control { - unsigned int gfp_mask; - - int may_writepage; -+ -+ struct oom_freeing_stat oom_stat; - }; - - /* -@@ -174,14 +178,16 @@ EXPORT_SYMBOL(remove_shrinker); - * are eligible for the caller's allocation attempt. It is used for balancing - * slab reclaim versus page reclaim. 
- */ --static int shrink_slab(unsigned long scanned, unsigned int gfp_mask, -+static int shrink_slab(struct scan_control *sc, unsigned int gfp_mask, - unsigned long lru_pages) - { - struct shrinker *shrinker; -+ unsigned long scanned; - - if (down_trylock(&shrinker_sem)) - return 0; - -+ scanned = sc->nr_scanned; - list_for_each_entry(shrinker, &shrinker_list, list) { - unsigned long long delta; - -@@ -205,6 +211,7 @@ static int shrink_slab(unsigned long sca - shrinker->nr -= this_scan; - if (shrink_ret == -1) - break; -+ sc->oom_stat.slabs += shrink_ret; - cond_resched(); - } - } -@@ -389,6 +396,7 @@ static int shrink_list(struct list_head - page_map_unlock(page); - if (!add_to_swap(page)) - goto activate_locked; -+ sc->oom_stat.swapped++; - page_map_lock(page); - } - #endif /* CONFIG_SWAP */ -@@ -430,6 +438,7 @@ static int shrink_list(struct list_head - case PAGE_ACTIVATE: - goto activate_locked; - case PAGE_SUCCESS: -+ sc->oom_stat.written++; - if (PageWriteback(page) || PageDirty(page)) - goto keep; - /* -@@ -589,6 +598,7 @@ static void shrink_cache(struct zone *zo - else - mod_page_state_zone(zone, pgscan_direct, nr_scan); - nr_freed = shrink_list(&page_list, sc); -+ sc->oom_stat.freed += nr_freed; - if (current_is_kswapd()) - mod_page_state(kswapd_steal, nr_freed); - mod_page_state_zone(zone, pgsteal, nr_freed); -@@ -653,6 +663,7 @@ refill_inactive_zone(struct zone *zone, - long distress; - long swap_tendency; - -+ KSTAT_PERF_ENTER(refill_inact) - lru_add_drain(); - pgmoved = 0; - spin_lock_irq(&zone->lru_lock); -@@ -793,6 +804,8 @@ refill_inactive_zone(struct zone *zone, - - mod_page_state_zone(zone, pgrefill, pgscanned); - mod_page_state(pgdeactivate, pgdeactivate); -+ -+ KSTAT_PERF_LEAVE(refill_inact); - } - - /* -@@ -902,6 +915,10 @@ int try_to_free_pages(struct zone **zone - unsigned long lru_pages = 0; - int i; - -+ KSTAT_PERF_ENTER(ttfp); -+ -+ memset(&sc.oom_stat, 0, sizeof(struct oom_freeing_stat)); -+ sc.oom_stat.oom_generation = oom_generation; - 
sc.gfp_mask = gfp_mask; - sc.may_writepage = 0; - -@@ -920,7 +937,7 @@ int try_to_free_pages(struct zone **zone - sc.nr_reclaimed = 0; - sc.priority = priority; - shrink_caches(zones, &sc); -- shrink_slab(sc.nr_scanned, gfp_mask, lru_pages); -+ shrink_slab(&sc, gfp_mask, lru_pages); - if (reclaim_state) { - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; -@@ -949,10 +966,11 @@ int try_to_free_pages(struct zone **zone - blk_congestion_wait(WRITE, HZ/10); - } - if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) -- out_of_memory(gfp_mask); -+ out_of_memory(&sc.oom_stat, gfp_mask); - out: - for (i = 0; zones[i] != 0; i++) - zones[i]->prev_priority = zones[i]->temp_priority; -+ KSTAT_PERF_LEAVE(ttfp); - return ret; - } - -@@ -1062,7 +1080,7 @@ scan: - sc.priority = priority; - shrink_zone(zone, &sc); - reclaim_state->reclaimed_slab = 0; -- shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); -+ shrink_slab(&sc, GFP_KERNEL, lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - total_reclaimed += sc.nr_reclaimed; - if (zone->all_unreclaimable) -@@ -1142,8 +1160,8 @@ static int kswapd(void *p) - tsk->flags |= PF_MEMALLOC|PF_KSWAPD; - - for ( ; ; ) { -- if (current->flags & PF_FREEZE) -- refrigerator(PF_FREEZE); -+ if (test_thread_flag(TIF_FREEZE)) -+ refrigerator(); - prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); - schedule(); - finish_wait(&pgdat->kswapd_wait, &wait); -@@ -1223,7 +1241,7 @@ static int __init kswapd_init(void) - swap_setup(); - for_each_pgdat(pgdat) - pgdat->kswapd -- = find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL)); -+ = find_task_by_pid_all(kernel_thread(kswapd, pgdat, CLONE_KERNEL)); - total_memory = nr_free_pagecache_pages(); - hotcpu_notifier(cpu_callback, 0); - return 0; -diff -uprN linux-2.6.8.1.orig/net/bluetooth/af_bluetooth.c linux-2.6.8.1-ve022stab078/net/bluetooth/af_bluetooth.c ---- linux-2.6.8.1.orig/net/bluetooth/af_bluetooth.c 2004-08-14 
14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/bluetooth/af_bluetooth.c 2006-05-11 13:05:34.000000000 +0400 -@@ -64,7 +64,7 @@ static kmem_cache_t *bt_sock_cache; - - int bt_sock_register(int proto, struct net_proto_family *ops) - { -- if (proto >= BT_MAX_PROTO) -+ if (proto < 0 || proto >= BT_MAX_PROTO) - return -EINVAL; - - if (bt_proto[proto]) -@@ -77,7 +77,7 @@ EXPORT_SYMBOL(bt_sock_register); - - int bt_sock_unregister(int proto) - { -- if (proto >= BT_MAX_PROTO) -+ if (proto < 0 || proto >= BT_MAX_PROTO) - return -EINVAL; - - if (!bt_proto[proto]) -@@ -92,7 +92,7 @@ static int bt_sock_create(struct socket - { - int err = 0; - -- if (proto >= BT_MAX_PROTO) -+ if (proto < 0 || proto >= BT_MAX_PROTO) - return -EINVAL; - - #if defined(CONFIG_KMOD) -diff -uprN linux-2.6.8.1.orig/net/compat.c linux-2.6.8.1-ve022stab078/net/compat.c ---- linux-2.6.8.1.orig/net/compat.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/compat.c 2006-05-11 13:05:49.000000000 +0400 -@@ -90,20 +90,11 @@ int verify_compat_iovec(struct msghdr *k - } else - kern_msg->msg_name = NULL; - -- if(kern_msg->msg_iovlen > UIO_FASTIOV) { -- kern_iov = kmalloc(kern_msg->msg_iovlen * sizeof(struct iovec), -- GFP_KERNEL); -- if(!kern_iov) -- return -ENOMEM; -- } -- - tot_len = iov_from_user_compat_to_kern(kern_iov, - (struct compat_iovec __user *)kern_msg->msg_iov, - kern_msg->msg_iovlen); - if(tot_len >= 0) - kern_msg->msg_iov = kern_iov; -- else if(kern_msg->msg_iovlen > UIO_FASTIOV) -- kfree(kern_iov); - - return tot_len; - } -@@ -123,6 +114,12 @@ int verify_compat_iovec(struct msghdr *k - (struct compat_cmsghdr __user *)((msg)->msg_control) : \ - (struct compat_cmsghdr __user *)NULL) - -+#define CMSG_COMPAT_OK(ucmlen, ucmsg, mhdr) \ -+ ((ucmlen) >= sizeof(struct compat_cmsghdr) && \ -+ (ucmlen) <= (unsigned long) \ -+ ((mhdr)->msg_controllen - \ -+ ((char *)(ucmsg) - (char *)(mhdr)->msg_control))) -+ - static inline struct compat_cmsghdr __user 
*cmsg_compat_nxthdr(struct msghdr *msg, - struct compat_cmsghdr __user *cmsg, int cmsg_len) - { -@@ -137,13 +134,14 @@ static inline struct compat_cmsghdr __us - * thus placement) of cmsg headers and length are different for - * 32-bit apps. -DaveM - */ --int cmsghdr_from_user_compat_to_kern(struct msghdr *kmsg, -+int cmsghdr_from_user_compat_to_kern(struct msghdr *kmsg, struct sock *sk, - unsigned char *stackbuf, int stackbuf_size) - { - struct compat_cmsghdr __user *ucmsg; - struct cmsghdr *kcmsg, *kcmsg_base; - compat_size_t ucmlen; - __kernel_size_t kcmlen, tmp; -+ int err = -EFAULT; - - kcmlen = 0; - kcmsg_base = kcmsg = (struct cmsghdr *)stackbuf; -@@ -153,15 +151,12 @@ int cmsghdr_from_user_compat_to_kern(str - return -EFAULT; - - /* Catch bogons. */ -- if(CMSG_COMPAT_ALIGN(ucmlen) < -- CMSG_COMPAT_ALIGN(sizeof(struct compat_cmsghdr))) -- return -EINVAL; -- if((unsigned long)(((char __user *)ucmsg - (char __user *)kmsg->msg_control) -- + ucmlen) > kmsg->msg_controllen) -+ if (!CMSG_COMPAT_OK(ucmlen, ucmsg, kmsg)) - return -EINVAL; - - tmp = ((ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))) + - CMSG_ALIGN(sizeof(struct cmsghdr))); -+ tmp = CMSG_ALIGN(tmp); - kcmlen += tmp; - ucmsg = cmsg_compat_nxthdr(kmsg, ucmsg, ucmlen); - } -@@ -173,30 +168,34 @@ int cmsghdr_from_user_compat_to_kern(str - * until we have successfully copied over all of the data - * from the user. - */ -- if(kcmlen > stackbuf_size) -- kcmsg_base = kcmsg = kmalloc(kcmlen, GFP_KERNEL); -- if(kcmsg == NULL) -+ if (kcmlen > stackbuf_size) -+ kcmsg_base = kcmsg = sock_kmalloc(sk, kcmlen, GFP_KERNEL); -+ if (kcmsg == NULL) - return -ENOBUFS; - - /* Now copy them over neatly. 
*/ - memset(kcmsg, 0, kcmlen); - ucmsg = CMSG_COMPAT_FIRSTHDR(kmsg); - while(ucmsg != NULL) { -- __get_user(ucmlen, &ucmsg->cmsg_len); -+ if (__get_user(ucmlen, &ucmsg->cmsg_len)) -+ goto Efault; -+ if (!CMSG_COMPAT_OK(ucmlen, ucmsg, kmsg)) -+ goto Einval; - tmp = ((ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))) + - CMSG_ALIGN(sizeof(struct cmsghdr))); -+ if ((char *)kcmsg_base + kcmlen - (char *)kcmsg < CMSG_ALIGN(tmp)) -+ goto Einval; - kcmsg->cmsg_len = tmp; -- __get_user(kcmsg->cmsg_level, &ucmsg->cmsg_level); -- __get_user(kcmsg->cmsg_type, &ucmsg->cmsg_type); -- -- /* Copy over the data. */ -- if(copy_from_user(CMSG_DATA(kcmsg), -- CMSG_COMPAT_DATA(ucmsg), -- (ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))))) -- goto out_free_efault; -+ tmp = CMSG_ALIGN(tmp); -+ if (__get_user(kcmsg->cmsg_level, &ucmsg->cmsg_level) || -+ __get_user(kcmsg->cmsg_type, &ucmsg->cmsg_type) || -+ copy_from_user(CMSG_DATA(kcmsg), -+ CMSG_COMPAT_DATA(ucmsg), -+ (ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))))) -+ goto Efault; - - /* Advance. */ -- kcmsg = (struct cmsghdr *)((char *)kcmsg + CMSG_ALIGN(tmp)); -+ kcmsg = (struct cmsghdr *)((char *)kcmsg + tmp); - ucmsg = cmsg_compat_nxthdr(kmsg, ucmsg, ucmlen); - } - -@@ -205,10 +204,12 @@ int cmsghdr_from_user_compat_to_kern(str - kmsg->msg_controllen = kcmlen; - return 0; - --out_free_efault: -- if(kcmsg_base != (struct cmsghdr *)stackbuf) -- kfree(kcmsg_base); -- return -EFAULT; -+Einval: -+ err = -EINVAL; -+Efault: -+ if (kcmsg_base != (struct cmsghdr *)stackbuf) -+ sock_kfree_s(sk, kcmsg_base, kcmlen); -+ return err; - } - - int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data) -@@ -303,107 +304,6 @@ void scm_detach_fds_compat(struct msghdr - } - - /* -- * For now, we assume that the compatibility and native version -- * of struct ipt_entry are the same - sfr. 
FIXME -- */ --struct compat_ipt_replace { -- char name[IPT_TABLE_MAXNAMELEN]; -- u32 valid_hooks; -- u32 num_entries; -- u32 size; -- u32 hook_entry[NF_IP_NUMHOOKS]; -- u32 underflow[NF_IP_NUMHOOKS]; -- u32 num_counters; -- compat_uptr_t counters; /* struct ipt_counters * */ -- struct ipt_entry entries[0]; --}; -- --static int do_netfilter_replace(int fd, int level, int optname, -- char __user *optval, int optlen) --{ -- struct compat_ipt_replace __user *urepl; -- struct ipt_replace __user *repl_nat; -- char name[IPT_TABLE_MAXNAMELEN]; -- u32 origsize, tmp32, num_counters; -- unsigned int repl_nat_size; -- int ret; -- int i; -- compat_uptr_t ucntrs; -- -- urepl = (struct compat_ipt_replace __user *)optval; -- if (get_user(origsize, &urepl->size)) -- return -EFAULT; -- -- /* Hack: Causes ipchains to give correct error msg --RR */ -- if (optlen != sizeof(*urepl) + origsize) -- return -ENOPROTOOPT; -- -- /* XXX Assumes that size of ipt_entry is the same both in -- * native and compat environments. 
-- */ -- repl_nat_size = sizeof(*repl_nat) + origsize; -- repl_nat = compat_alloc_user_space(repl_nat_size); -- -- ret = -EFAULT; -- if (put_user(origsize, &repl_nat->size)) -- goto out; -- -- if (!access_ok(VERIFY_READ, urepl, optlen) || -- !access_ok(VERIFY_WRITE, repl_nat, optlen)) -- goto out; -- -- if (__copy_from_user(name, urepl->name, sizeof(urepl->name)) || -- __copy_to_user(repl_nat->name, name, sizeof(repl_nat->name))) -- goto out; -- -- if (__get_user(tmp32, &urepl->valid_hooks) || -- __put_user(tmp32, &repl_nat->valid_hooks)) -- goto out; -- -- if (__get_user(tmp32, &urepl->num_entries) || -- __put_user(tmp32, &repl_nat->num_entries)) -- goto out; -- -- if (__get_user(num_counters, &urepl->num_counters) || -- __put_user(num_counters, &repl_nat->num_counters)) -- goto out; -- -- if (__get_user(ucntrs, &urepl->counters) || -- __put_user(compat_ptr(ucntrs), &repl_nat->counters)) -- goto out; -- -- if (__copy_in_user(&repl_nat->entries[0], -- &urepl->entries[0], -- origsize)) -- goto out; -- -- for (i = 0; i < NF_IP_NUMHOOKS; i++) { -- if (__get_user(tmp32, &urepl->hook_entry[i]) || -- __put_user(tmp32, &repl_nat->hook_entry[i]) || -- __get_user(tmp32, &urepl->underflow[i]) || -- __put_user(tmp32, &repl_nat->underflow[i])) -- goto out; -- } -- -- /* -- * Since struct ipt_counters just contains two u_int64_t members -- * we can just do the access_ok check here and pass the (converted) -- * pointer into the standard syscall. We hope that the pointer is -- * not misaligned ... -- */ -- if (!access_ok(VERIFY_WRITE, compat_ptr(ucntrs), -- num_counters * sizeof(struct ipt_counters))) -- goto out; -- -- -- ret = sys_setsockopt(fd, level, optname, -- (char __user *)repl_nat, repl_nat_size); -- --out: -- return ret; --} -- --/* - * A struct sock_filter is architecture independent. 
- */ - struct compat_sock_fprog { -@@ -455,15 +355,11 @@ static int do_set_sock_timeout(int fd, i - asmlinkage long compat_sys_setsockopt(int fd, int level, int optname, - char __user *optval, int optlen) - { -- if (optname == IPT_SO_SET_REPLACE) -- return do_netfilter_replace(fd, level, optname, -- optval, optlen); - if (optname == SO_ATTACH_FILTER) - return do_set_attach_filter(fd, level, optname, - optval, optlen); - if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO) - return do_set_sock_timeout(fd, level, optname, optval, optlen); -- - return sys_setsockopt(fd, level, optname, optval, optlen); - } - -@@ -499,7 +395,8 @@ static int do_get_sock_timeout(int fd, i - asmlinkage long compat_sys_getsockopt(int fd, int level, int optname, - char __user *optval, int __user *optlen) - { -- if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO) -+ if (level == SOL_SOCKET && -+ (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO)) - return do_get_sock_timeout(fd, level, optname, optval, optlen); - return sys_getsockopt(fd, level, optname, optval, optlen); - } -diff -uprN linux-2.6.8.1.orig/net/core/datagram.c linux-2.6.8.1-ve022stab078/net/core/datagram.c ---- linux-2.6.8.1.orig/net/core/datagram.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/datagram.c 2006-05-11 13:05:39.000000000 +0400 -@@ -54,6 +54,8 @@ - #include <net/sock.h> - #include <net/checksum.h> - -+#include <ub/ub_net.h> -+ - - /* - * Is a socket 'connection oriented' ? -@@ -454,6 +456,7 @@ unsigned int datagram_poll(struct file * - { - struct sock *sk = sock->sk; - unsigned int mask; -+ int no_ubc_space; - - poll_wait(file, sk->sk_sleep, wait); - mask = 0; -@@ -461,8 +464,14 @@ unsigned int datagram_poll(struct file * - /* exceptional events? 
*/ - if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue)) - mask |= POLLERR; -- if (sk->sk_shutdown == SHUTDOWN_MASK) -+ if (sk->sk_shutdown == SHUTDOWN_MASK) { -+ no_ubc_space = 0; - mask |= POLLHUP; -+ } else { -+ no_ubc_space = ub_sock_makewres_other(sk, SOCK_MIN_UBCSPACE_CH); -+ if (no_ubc_space) -+ ub_sock_sndqueueadd_other(sk, SOCK_MIN_UBCSPACE_CH); -+ } - - /* readable? */ - if (!skb_queue_empty(&sk->sk_receive_queue) || -@@ -479,7 +488,7 @@ unsigned int datagram_poll(struct file * - } - - /* writable? */ -- if (sock_writeable(sk)) -+ if (!no_ubc_space && sock_writeable(sk)) - mask |= POLLOUT | POLLWRNORM | POLLWRBAND; - else - set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); -diff -uprN linux-2.6.8.1.orig/net/core/dev.c linux-2.6.8.1-ve022stab078/net/core/dev.c ---- linux-2.6.8.1.orig/net/core/dev.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/dev.c 2006-05-11 13:05:42.000000000 +0400 -@@ -113,6 +113,7 @@ - #include <net/iw_handler.h> - #endif /* CONFIG_NET_RADIO */ - #include <asm/current.h> -+#include <ub/beancounter.h> - - /* This define, if set, will randomly drop a packet when congestion - * is more than moderate. It helps fairness in the multi-interface -@@ -182,25 +183,40 @@ static struct timer_list samp_timer = TI - * unregister_netdevice(), which must be called with the rtnl - * semaphore held. 
- */ -+#if defined(CONFIG_VE) -+#define dev_tail (get_exec_env()->_net_dev_tail) -+#else - struct net_device *dev_base; - struct net_device **dev_tail = &dev_base; --rwlock_t dev_base_lock = RW_LOCK_UNLOCKED; -- - EXPORT_SYMBOL(dev_base); -+#endif -+ -+rwlock_t dev_base_lock = RW_LOCK_UNLOCKED; - EXPORT_SYMBOL(dev_base_lock); - -+#ifdef CONFIG_VE -+#define MAX_UNMOVABLE_NETDEVICES (8*4096) -+static uint8_t unmovable_ifindex_list[MAX_UNMOVABLE_NETDEVICES/8]; -+static LIST_HEAD(dev_global_list); -+#endif -+ - #define NETDEV_HASHBITS 8 - static struct hlist_head dev_name_head[1<<NETDEV_HASHBITS]; - static struct hlist_head dev_index_head[1<<NETDEV_HASHBITS]; - --static inline struct hlist_head *dev_name_hash(const char *name) -+struct hlist_head *dev_name_hash(const char *name, struct ve_struct *env) - { -- unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ)); -+ unsigned hash; -+ if (!ve_is_super(env)) -+ return visible_dev_head(env); -+ hash = full_name_hash(name, strnlen(name, IFNAMSIZ)); - return &dev_name_head[hash & ((1<<NETDEV_HASHBITS)-1)]; - } - --static inline struct hlist_head *dev_index_hash(int ifindex) -+struct hlist_head *dev_index_hash(int ifindex, struct ve_struct *env) - { -+ if (!ve_is_super(env)) -+ return visible_dev_index_head(env); - return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)]; - } - -@@ -488,7 +504,7 @@ struct net_device *__dev_get_by_name(con - { - struct hlist_node *p; - -- hlist_for_each(p, dev_name_hash(name)) { -+ hlist_for_each(p, dev_name_hash(name, get_exec_env())) { - struct net_device *dev - = hlist_entry(p, struct net_device, name_hlist); - if (!strncmp(dev->name, name, IFNAMSIZ)) -@@ -520,6 +536,28 @@ struct net_device *dev_get_by_name(const - return dev; - } - -+/** -+ * __dev_global_get_by_name - find a device by its name in dev_global_list -+ * @name: name to find -+ * -+ * Find an interface by name. Must be called under RTNL semaphore -+ * If the name is found a pointer to the device -+ * is returned. 
If the name is not found then %NULL is returned. The -+ * reference counters are not incremented so the caller must be -+ * careful with locks. -+ */ -+ -+struct net_device *__dev_global_get_by_name(const char *name) -+{ -+ struct net_device *dev; -+ /* It's called relatively rarely */ -+ list_for_each_entry(dev, &dev_global_list, dev_global_list_entry) { -+ if (strncmp(dev->name, name, IFNAMSIZ) == 0) -+ return dev; -+ } -+ return NULL; -+} -+ - /* - Return value is changed to int to prevent illegal usage in future. - It is still legal to use to check for device existence. -@@ -564,7 +602,7 @@ struct net_device *__dev_get_by_index(in - { - struct hlist_node *p; - -- hlist_for_each(p, dev_index_hash(ifindex)) { -+ hlist_for_each(p, dev_index_hash(ifindex, get_exec_env())) { - struct net_device *dev - = hlist_entry(p, struct net_device, index_hlist); - if (dev->ifindex == ifindex) -@@ -720,6 +758,23 @@ int dev_valid_name(const char *name) - * of the unit assigned or a negative errno code. 
- */ - -+static inline void __dev_check_name(const char *dev_name, const char *name, -+ long *inuse, const int max_netdevices) -+{ -+ int i = 0; -+ char buf[IFNAMSIZ]; -+ -+ if (!sscanf(dev_name, name, &i)) -+ return; -+ if (i < 0 || i >= max_netdevices) -+ return; -+ -+ /* avoid cases where sscanf is not exact inverse of printf */ -+ snprintf(buf, sizeof(buf), name, i); -+ if (!strncmp(buf, dev_name, IFNAMSIZ)) -+ set_bit(i, inuse); -+} -+ - int dev_alloc_name(struct net_device *dev, const char *name) - { - int i = 0; -@@ -744,16 +799,18 @@ int dev_alloc_name(struct net_device *de - if (!inuse) - return -ENOMEM; - -- for (d = dev_base; d; d = d->next) { -- if (!sscanf(d->name, name, &i)) -- continue; -- if (i < 0 || i >= max_netdevices) -- continue; -- -- /* avoid cases where sscanf is not exact inverse of printf */ -- snprintf(buf, sizeof(buf), name, i); -- if (!strncmp(buf, d->name, IFNAMSIZ)) -- set_bit(i, inuse); -+ if (ve_is_super(get_exec_env())) { -+ list_for_each_entry(d, &dev_global_list, -+ dev_global_list_entry) { -+ __dev_check_name(d->name, name, inuse, -+ max_netdevices); -+ } -+ } -+ else { -+ for (d = dev_base; d; d = d->next) { -+ __dev_check_name(d->name, name, inuse, -+ max_netdevices); -+ } - } - - i = find_first_zero_bit(inuse, max_netdevices); -@@ -761,7 +818,11 @@ int dev_alloc_name(struct net_device *de - } - - snprintf(buf, sizeof(buf), name, i); -- if (!__dev_get_by_name(buf)) { -+ if (ve_is_super(get_exec_env())) -+ d = __dev_global_get_by_name(buf); -+ else -+ d = __dev_get_by_name(buf); -+ if (d == NULL) { - strlcpy(dev->name, buf, IFNAMSIZ); - return i; - } -@@ -794,13 +855,15 @@ int dev_change_name(struct net_device *d - if (!dev_valid_name(newname)) - return -EINVAL; - -+ /* Rename of devices in VE is prohibited by CAP_NET_ADMIN */ -+ - if (strchr(newname, '%')) { - err = dev_alloc_name(dev, newname); - if (err < 0) - return err; - strcpy(newname, dev->name); - } -- else if (__dev_get_by_name(newname)) -+ else if 
(__dev_global_get_by_name(newname)) - return -EEXIST; - else - strlcpy(dev->name, newname, IFNAMSIZ); -@@ -808,7 +871,8 @@ int dev_change_name(struct net_device *d - err = class_device_rename(&dev->class_dev, dev->name); - if (!err) { - hlist_del(&dev->name_hlist); -- hlist_add_head(&dev->name_hlist, dev_name_hash(dev->name)); -+ hlist_add_head(&dev->name_hlist, dev_name_hash(dev->name, -+ get_exec_env())); - notifier_call_chain(&netdev_chain, NETDEV_CHANGENAME, dev); - } - -@@ -1338,6 +1402,25 @@ int dev_queue_xmit(struct sk_buff *skb) - skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); - #endif - if (q->enqueue) { -+ struct user_beancounter *ub; -+ -+ ub = netdev_bc(dev)->exec_ub; -+ /* the skb CAN be already charged if it transmitted via -+ * something like bonding device */ -+ if (ub && (skb_bc(skb)->resource == 0)) { -+ unsigned long chargesize; -+ chargesize = skb_charge_fullsize(skb); -+ if (charge_beancounter(ub, UB_OTHERSOCKBUF, -+ chargesize, UB_SOFT)) { -+ rcu_read_unlock(); -+ rc = -ENOMEM; -+ goto out_kfree_skb; -+ } -+ skb_bc(skb)->ub = ub; -+ skb_bc(skb)->charged = chargesize; -+ skb_bc(skb)->resource = UB_OTHERSOCKBUF; -+ } -+ - /* Grab device queue */ - spin_lock_bh(&dev->queue_lock); - -@@ -1761,6 +1844,7 @@ int netif_receive_skb(struct sk_buff *sk - struct packet_type *ptype, *pt_prev; - int ret = NET_RX_DROP; - unsigned short type; -+ struct ve_struct *old_env; - - #ifdef CONFIG_NETPOLL_RX - if (skb->dev->netpoll_rx && skb->dev->poll && netpoll_rx(skb)) { -@@ -1779,6 +1863,15 @@ int netif_receive_skb(struct sk_buff *sk - skb->h.raw = skb->nh.raw = skb->data; - skb->mac_len = skb->nh.raw - skb->mac.raw; - -+ /* -+ * Skb might be alloced in another VE context, than its device works. -+ * So, set the correct owner_env. 
-+ */ -+ skb->owner_env = skb->dev->owner_env; -+ BUG_ON(skb->owner_env == NULL); -+ -+ old_env = set_exec_env(VE_OWNER_SKB(skb)); -+ - pt_prev = NULL; - #ifdef CONFIG_NET_CLS_ACT - if (skb->tc_verd & TC_NCLS) { -@@ -1844,6 +1937,7 @@ ncls: - - out: - rcu_read_unlock(); -+ (void)set_exec_env(old_env); - return ret; - } - -@@ -2240,7 +2334,8 @@ static int __init dev_proc_init(void) - - if (!proc_net_fops_create("dev", S_IRUGO, &dev_seq_fops)) - goto out; -- if (!proc_net_fops_create("softnet_stat", S_IRUGO, &softnet_seq_fops)) -+ if (!__proc_net_fops_create("net/softnet_stat", S_IRUGO, -+ &softnet_seq_fops, NULL)) - goto out_dev; - if (wireless_proc_init()) - goto out_softnet; -@@ -2248,7 +2343,7 @@ static int __init dev_proc_init(void) - out: - return rc; - out_softnet: -- proc_net_remove("softnet_stat"); -+ __proc_net_remove("net/softnet_stat"); - out_dev: - proc_net_remove("dev"); - goto out; -@@ -2314,6 +2409,9 @@ void dev_set_promiscuity(struct net_devi - dev->flags |= IFF_PROMISC; - if ((dev->promiscuity += inc) == 0) - dev->flags &= ~IFF_PROMISC; -+ /* Promiscous mode on these devices does not mean anything */ -+ if (dev->flags & (IFF_LOOPBACK|IFF_POINTOPOINT)) -+ return; - if (dev->flags ^ old_flags) { - dev_mc_upload(dev); - printk(KERN_INFO "device %s %s promiscuous mode\n", -@@ -2485,6 +2583,8 @@ static int dev_ifsioc(struct ifreq *ifr, - return dev_set_mtu(dev, ifr->ifr_mtu); - - case SIOCGIFHWADDR: -+ memset(ifr->ifr_hwaddr.sa_data, 0, -+ sizeof(ifr->ifr_hwaddr.sa_data)); - memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr, - min(sizeof ifr->ifr_hwaddr.sa_data, (size_t) dev->addr_len)); - ifr->ifr_hwaddr.sa_family = dev->type; -@@ -2720,9 +2820,28 @@ int dev_ioctl(unsigned int cmd, void __u - * - require strict serialization. 
- * - do not return a value - */ -+ case SIOCSIFMTU: -+ if (!capable(CAP_NET_ADMIN) && -+ !capable(CAP_VE_NET_ADMIN)) -+ return -EPERM; -+ dev_load(ifr.ifr_name); -+ rtnl_lock(); -+ if (!ve_is_super(get_exec_env())) { -+ struct net_device *dev; -+ ret = -ENODEV; -+ if ((dev = __dev_get_by_name(ifr.ifr_name)) == NULL) -+ goto out_set_mtu_unlock; -+ ret = -EPERM; -+ if (ifr.ifr_mtu > dev->orig_mtu) -+ goto out_set_mtu_unlock; -+ } -+ ret = dev_ifsioc(&ifr, cmd); -+out_set_mtu_unlock: -+ rtnl_unlock(); -+ return ret; -+ - case SIOCSIFFLAGS: - case SIOCSIFMETRIC: -- case SIOCSIFMTU: - case SIOCSIFMAP: - case SIOCSIFHWADDR: - case SIOCSIFSLAVE: -@@ -2798,25 +2917,75 @@ int dev_ioctl(unsigned int cmd, void __u - } - } - -- - /** - * dev_new_index - allocate an ifindex - * - * Returns a suitable unique value for a new device interface -- * number. The caller must hold the rtnl semaphore or the -+ * number. The caller must hold the rtnl semaphore or the - * dev_base_lock to be sure it remains unique. 
-+ * -+ * Note: dev->name must be valid on entrance - */ --int dev_new_index(void) -+static int dev_ve_new_index(void) - { -- static int ifindex; -+#ifdef CONFIG_VE -+ int *ifindex = &get_exec_env()->ifindex; -+ int delta = 2; -+#else -+ static int s_ifindex; -+ int *ifindex = &s_ifindex; -+ int delta = 1; -+#endif - for (;;) { -- if (++ifindex <= 0) -- ifindex = 1; -- if (!__dev_get_by_index(ifindex)) -- return ifindex; -+ *ifindex += delta; -+ if (*ifindex <= 0) -+ *ifindex = 1; -+ if (!__dev_get_by_index(*ifindex)) -+ return *ifindex; - } - } - -+static int dev_glb_new_index(void) -+{ -+#ifdef CONFIG_VE -+ int i; -+ -+ i = find_first_zero_bit((long*)unmovable_ifindex_list, -+ MAX_UNMOVABLE_NETDEVICES); -+ -+ if (i == MAX_UNMOVABLE_NETDEVICES) -+ return -EMFILE; -+ -+ __set_bit(i, (long*)unmovable_ifindex_list); -+ return (i + 1) * 2; -+#endif -+} -+ -+static void dev_glb_free_index(struct net_device *dev) -+{ -+#ifdef CONFIG_VE -+ int bit; -+ -+ bit = dev->ifindex / 2 - 1; -+ BUG_ON(bit >= MAX_UNMOVABLE_NETDEVICES); -+ __clear_bit(bit, (long*)unmovable_ifindex_list); -+#endif -+} -+ -+int dev_new_index(struct net_device *dev) -+{ -+ if (ve_is_super(get_exec_env()) && ve_is_dev_movable(dev)) -+ return dev_glb_new_index(); -+ -+ return dev_ve_new_index(); -+} -+ -+void dev_free_index(struct net_device *dev) -+{ -+ if ((dev->ifindex % 2) == 0) -+ dev_glb_free_index(dev); -+} -+ - static int dev_boot_phase = 1; - - /* Delayed registration/unregisteration */ -@@ -2860,6 +3029,10 @@ int register_netdevice(struct net_device - /* When net_device's are persistent, this will be fatal. 
*/ - BUG_ON(dev->reg_state != NETREG_UNINITIALIZED); - -+ ret = -EPERM; -+ if (!ve_is_super(get_exec_env()) && ve_is_dev_movable(dev)) -+ goto out; -+ - spin_lock_init(&dev->queue_lock); - spin_lock_init(&dev->xmit_lock); - dev->xmit_lock_owner = -1; -@@ -2879,27 +3052,32 @@ int register_netdevice(struct net_device - if (ret) { - if (ret > 0) - ret = -EIO; -- goto out_err; -+ goto out_free_div; - } - } - - if (!dev_valid_name(dev->name)) { - ret = -EINVAL; -- goto out_err; -+ goto out_free_div; -+ } -+ -+ dev->ifindex = dev_new_index(dev); -+ if (dev->ifindex < 0) { -+ ret = dev->ifindex; -+ goto out_free_div; - } - -- dev->ifindex = dev_new_index(); - if (dev->iflink == -1) - dev->iflink = dev->ifindex; - - /* Check for existence of name */ -- head = dev_name_hash(dev->name); -+ head = dev_name_hash(dev->name, get_exec_env()); - hlist_for_each(p, head) { - struct net_device *d - = hlist_entry(p, struct net_device, name_hlist); - if (!strncmp(d->name, dev->name, IFNAMSIZ)) { - ret = -EEXIST; -- goto out_err; -+ goto out_free_ind; - } - } - -@@ -2929,12 +3107,19 @@ int register_netdevice(struct net_device - set_bit(__LINK_STATE_PRESENT, &dev->state); - - dev->next = NULL; -+ dev->owner_env = get_exec_env(); -+ dev->orig_mtu = dev->mtu; -+ netdev_bc(dev)->owner_ub = get_beancounter(get_exec_ub()); -+ netdev_bc(dev)->exec_ub = get_beancounter(get_exec_ub()); - dev_init_scheduler(dev); -+ if (ve_is_super(get_exec_env())) -+ list_add_tail(&dev->dev_global_list_entry, &dev_global_list); - write_lock_bh(&dev_base_lock); - *dev_tail = dev; - dev_tail = &dev->next; - hlist_add_head(&dev->name_hlist, head); -- hlist_add_head(&dev->index_hlist, dev_index_hash(dev->ifindex)); -+ hlist_add_head(&dev->index_hlist, dev_index_hash(dev->ifindex, -+ get_exec_env())); - dev_hold(dev); - dev->reg_state = NETREG_REGISTERING; - write_unlock_bh(&dev_base_lock); -@@ -2948,7 +3133,9 @@ int register_netdevice(struct net_device - - out: - return ret; --out_err: -+out_free_ind: -+ 
dev_free_index(dev); -+out_free_div: - free_divert_blk(dev); - goto out; - } -@@ -3032,6 +3219,7 @@ void netdev_run_todo(void) - { - struct list_head list = LIST_HEAD_INIT(list); - int err; -+ struct ve_struct *current_env; - - - /* Need to guard against multiple cpu's getting out of order. */ -@@ -3050,22 +3238,30 @@ void netdev_run_todo(void) - list_splice_init(&net_todo_list, &list); - spin_unlock(&net_todo_list_lock); - -+ current_env = get_exec_env(); - while (!list_empty(&list)) { - struct net_device *dev - = list_entry(list.next, struct net_device, todo_list); - list_del(&dev->todo_list); - -+ (void)set_exec_env(dev->owner_env); - switch(dev->reg_state) { - case NETREG_REGISTERING: - err = netdev_register_sysfs(dev); -- if (err) -+ if (err) { - printk(KERN_ERR "%s: failed sysfs registration (%d)\n", - dev->name, err); -+ dev->reg_state = NETREG_REGISTER_ERR; -+ break; -+ } - dev->reg_state = NETREG_REGISTERED; - break; - - case NETREG_UNREGISTERING: - netdev_unregister_sysfs(dev); -+ /* fall through */ -+ -+ case NETREG_REGISTER_ERR: - dev->reg_state = NETREG_UNREGISTERED; - - netdev_wait_allrefs(dev); -@@ -3076,6 +3272,10 @@ void netdev_run_todo(void) - BUG_TRAP(!dev->ip6_ptr); - BUG_TRAP(!dev->dn_ptr); - -+ put_beancounter(netdev_bc(dev)->exec_ub); -+ put_beancounter(netdev_bc(dev)->owner_ub); -+ netdev_bc(dev)->exec_ub = NULL; -+ netdev_bc(dev)->owner_ub = NULL; - - /* It must be the very last action, - * after this 'dev' may point to freed up memory. -@@ -3090,6 +3290,7 @@ void netdev_run_todo(void) - break; - } - } -+ (void)set_exec_env(current_env); - - out: - up(&net_todo_run_mutex); -@@ -3156,7 +3357,8 @@ int unregister_netdevice(struct net_devi - return -ENODEV; - } - -- BUG_ON(dev->reg_state != NETREG_REGISTERED); -+ BUG_ON(dev->reg_state != NETREG_REGISTERED && -+ dev->reg_state != NETREG_REGISTER_ERR); - - /* If device is running, close it first. 
*/ - if (dev->flags & IFF_UP) -@@ -3172,6 +3374,8 @@ int unregister_netdevice(struct net_devi - dev_tail = dp; - *dp = d->next; - write_unlock_bh(&dev_base_lock); -+ if (ve_is_super(get_exec_env())) -+ list_del(&dev->dev_global_list_entry); - break; - } - } -@@ -3181,7 +3385,8 @@ int unregister_netdevice(struct net_devi - return -ENODEV; - } - -- dev->reg_state = NETREG_UNREGISTERING; -+ if (dev->reg_state != NETREG_REGISTER_ERR) -+ dev->reg_state = NETREG_UNREGISTERING; - - synchronize_net(); - -@@ -3205,6 +3410,8 @@ int unregister_netdevice(struct net_devi - /* Notifier chain MUST detach us from master device. */ - BUG_TRAP(!dev->master); - -+ dev_free_index(dev); -+ - free_divert_blk(dev); - - /* Finish processing unregister after unlock */ -@@ -3352,6 +3559,8 @@ EXPORT_SYMBOL(dev_get_by_name); - EXPORT_SYMBOL(dev_getbyhwaddr); - EXPORT_SYMBOL(dev_ioctl); - EXPORT_SYMBOL(dev_new_index); -+EXPORT_SYMBOL(dev_name_hash); -+EXPORT_SYMBOL(dev_index_hash); - EXPORT_SYMBOL(dev_open); - EXPORT_SYMBOL(dev_queue_xmit); - EXPORT_SYMBOL(dev_queue_xmit_nit); -diff -uprN linux-2.6.8.1.orig/net/core/dev_mcast.c linux-2.6.8.1-ve022stab078/net/core/dev_mcast.c ---- linux-2.6.8.1.orig/net/core/dev_mcast.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/dev_mcast.c 2006-05-11 13:05:42.000000000 +0400 -@@ -297,3 +297,4 @@ void __init dev_mcast_init(void) - EXPORT_SYMBOL(dev_mc_add); - EXPORT_SYMBOL(dev_mc_delete); - EXPORT_SYMBOL(dev_mc_upload); -+EXPORT_SYMBOL(dev_mc_discard); -diff -uprN linux-2.6.8.1.orig/net/core/dst.c linux-2.6.8.1-ve022stab078/net/core/dst.c ---- linux-2.6.8.1.orig/net/core/dst.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/dst.c 2006-05-11 13:05:41.000000000 +0400 -@@ -47,6 +47,7 @@ static struct timer_list dst_gc_timer = - static void dst_run_gc(unsigned long dummy) - { - int delayed = 0; -+ int work_performed; - struct dst_entry * dst, **dstp; - - if (!spin_trylock(&dst_lock)) { -@@ -54,9 +55,9 
@@ static void dst_run_gc(unsigned long dum - return; - } - -- - del_timer(&dst_gc_timer); - dstp = &dst_garbage_list; -+ work_performed = 0; - while ((dst = *dstp) != NULL) { - if (atomic_read(&dst->__refcnt)) { - dstp = &dst->next; -@@ -64,6 +65,7 @@ static void dst_run_gc(unsigned long dum - continue; - } - *dstp = dst->next; -+ work_performed = 1; - - dst = dst_destroy(dst); - if (dst) { -@@ -88,9 +90,14 @@ static void dst_run_gc(unsigned long dum - dst_gc_timer_inc = DST_GC_MAX; - goto out; - } -- if ((dst_gc_timer_expires += dst_gc_timer_inc) > DST_GC_MAX) -- dst_gc_timer_expires = DST_GC_MAX; -- dst_gc_timer_inc += DST_GC_INC; -+ if (!work_performed) { -+ if ((dst_gc_timer_expires += dst_gc_timer_inc) > DST_GC_MAX) -+ dst_gc_timer_expires = DST_GC_MAX; -+ dst_gc_timer_inc += DST_GC_INC; -+ } else { -+ dst_gc_timer_inc = DST_GC_INC; -+ dst_gc_timer_expires = DST_GC_MIN; -+ } - dst_gc_timer.expires = jiffies + dst_gc_timer_expires; - #if RT_CACHE_DEBUG >= 2 - printk("dst_total: %d/%d %ld\n", -@@ -231,13 +238,13 @@ static void dst_ifdown(struct dst_entry - - do { - if (unregister) { -- dst->dev = &loopback_dev; -- dev_hold(&loopback_dev); -+ dst->dev = &visible_loopback_dev; -+ dev_hold(&visible_loopback_dev); - dev_put(dev); - if (dst->neighbour && dst->neighbour->dev == dev) { -- dst->neighbour->dev = &loopback_dev; -+ dst->neighbour->dev = &visible_loopback_dev; - dev_put(dev); -- dev_hold(&loopback_dev); -+ dev_hold(&visible_loopback_dev); - } - } - -@@ -255,12 +262,15 @@ static int dst_dev_event(struct notifier - switch (event) { - case NETDEV_UNREGISTER: - case NETDEV_DOWN: -- spin_lock_bh(&dst_lock); -+ local_bh_disable(); -+ dst_run_gc(0); -+ spin_lock(&dst_lock); - for (dst = dst_garbage_list; dst; dst = dst->next) { - if (dst->dev == dev) - dst_ifdown(dst, event != NETDEV_DOWN); - } -- spin_unlock_bh(&dst_lock); -+ spin_unlock(&dst_lock); -+ local_bh_enable(); - break; - } - return NOTIFY_DONE; -diff -uprN linux-2.6.8.1.orig/net/core/filter.c 
linux-2.6.8.1-ve022stab078/net/core/filter.c ---- linux-2.6.8.1.orig/net/core/filter.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/filter.c 2006-05-11 13:05:39.000000000 +0400 -@@ -33,6 +33,7 @@ - #include <linux/timer.h> - #include <asm/system.h> - #include <asm/uaccess.h> -+#include <asm/unaligned.h> - #include <linux/filter.h> - - /* No hurry in this branch */ -@@ -169,7 +170,7 @@ int sk_run_filter(struct sk_buff *skb, s - k = fentry->k; - load_w: - if (k >= 0 && (unsigned int)(k+sizeof(u32)) <= len) { -- A = ntohl(*(u32*)&data[k]); -+ A = ntohl(get_unaligned((u32*)&data[k])); - continue; - } - if (k < 0) { -@@ -179,7 +180,7 @@ int sk_run_filter(struct sk_buff *skb, s - break; - ptr = load_pointer(skb, k); - if (ptr) { -- A = ntohl(*(u32*)ptr); -+ A = ntohl(get_unaligned((u32*)ptr)); - continue; - } - } else { -@@ -194,7 +195,7 @@ int sk_run_filter(struct sk_buff *skb, s - k = fentry->k; - load_h: - if (k >= 0 && (unsigned int)(k + sizeof(u16)) <= len) { -- A = ntohs(*(u16*)&data[k]); -+ A = ntohs(get_unaligned((u16*)&data[k])); - continue; - } - if (k < 0) { -@@ -204,7 +205,7 @@ int sk_run_filter(struct sk_buff *skb, s - break; - ptr = load_pointer(skb, k); - if (ptr) { -- A = ntohs(*(u16*)ptr); -+ A = ntohs(get_unaligned((u16*)ptr)); - continue; - } - } else { -@@ -398,7 +399,7 @@ int sk_attach_filter(struct sock_fprog * - if (fprog->filter == NULL || fprog->len > BPF_MAXINSNS) - return -EINVAL; - -- fp = sock_kmalloc(sk, fsize+sizeof(*fp), GFP_KERNEL); -+ fp = sock_kmalloc(sk, fsize+sizeof(*fp), GFP_KERNEL_UBC); - if (!fp) - return -ENOMEM; - if (copy_from_user(fp->insns, fprog->filter, fsize)) { -diff -uprN linux-2.6.8.1.orig/net/core/neighbour.c linux-2.6.8.1-ve022stab078/net/core/neighbour.c ---- linux-2.6.8.1.orig/net/core/neighbour.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/neighbour.c 2006-05-11 13:05:41.000000000 +0400 -@@ -652,6 +652,11 @@ static void neigh_timer_handler(unsigned - 
struct neighbour *neigh = (struct neighbour *)arg; - unsigned state; - int notify = 0; -+ struct ve_struct *env; -+ struct user_beancounter *ub; -+ -+ env = set_exec_env(neigh->dev->owner_env); -+ ub = set_exec_ub(netdev_bc(neigh->dev)->exec_ub); - - write_lock(&neigh->lock); - -@@ -706,6 +711,8 @@ static void neigh_timer_handler(unsigned - - neigh->ops->solicit(neigh, skb_peek(&neigh->arp_queue)); - atomic_inc(&neigh->probes); -+ (void)set_exec_ub(ub); -+ set_exec_env(env); - return; - - out: -@@ -715,6 +722,8 @@ out: - neigh_app_notify(neigh); - #endif - neigh_release(neigh); -+ (void)set_exec_ub(ub); -+ set_exec_env(env); - } - - int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) -@@ -1068,6 +1077,12 @@ static void neigh_proxy_process(unsigned - skb = skb->next; - if (tdif <= 0) { - struct net_device *dev = back->dev; -+ struct ve_struct *env; -+ struct user_beancounter *ub; -+ -+ env = set_exec_env(dev->owner_env); -+ ub = set_exec_ub(netdev_bc(dev)->exec_ub); -+ - __skb_unlink(back, &tbl->proxy_queue); - if (tbl->proxy_redo && netif_running(dev)) - tbl->proxy_redo(back); -@@ -1075,6 +1090,9 @@ static void neigh_proxy_process(unsigned - kfree_skb(back); - - dev_put(dev); -+ -+ (void)set_exec_ub(ub); -+ set_exec_env(env); - } else if (!sched_next || tdif < sched_next) - sched_next = tdif; - } -@@ -1222,6 +1240,9 @@ int neigh_delete(struct sk_buff *skb, st - struct net_device *dev = NULL; - int err = -ENODEV; - -+ if (!ve_is_super(get_exec_env())) -+ return -EACCES; -+ - if (ndm->ndm_ifindex && - (dev = dev_get_by_index(ndm->ndm_ifindex)) == NULL) - goto out; -@@ -1272,6 +1293,9 @@ int neigh_add(struct sk_buff *skb, struc - struct net_device *dev = NULL; - int err = -ENODEV; - -+ if (!ve_is_super(get_exec_env())) -+ return -EACCES; -+ - if (ndm->ndm_ifindex && - (dev = dev_get_by_index(ndm->ndm_ifindex)) == NULL) - goto out; -@@ -1418,6 +1442,9 @@ int neigh_dump_info(struct sk_buff *skb, - struct neigh_table *tbl; - int t, family, s_t; - -+ if 
(!ve_is_super(get_exec_env())) -+ return -EACCES; -+ - read_lock(&neigh_tbl_lock); - family = ((struct rtgenmsg *)NLMSG_DATA(cb->nlh))->rtgen_family; - s_t = cb->args[0]; -@@ -1636,11 +1663,17 @@ int neigh_sysctl_register(struct net_dev - int p_id, int pdev_id, char *p_name, - proc_handler *handler) - { -- struct neigh_sysctl_table *t = kmalloc(sizeof(*t), GFP_KERNEL); -+ struct neigh_sysctl_table *t; - const char *dev_name_source = NULL; - char *dev_name = NULL; - int err = 0; - -+ /* This function is called from VExx only from devinet_init, -+ and it is does not matter what is returned */ -+ if (!ve_is_super(get_exec_env())) -+ return 0; -+ -+ t = kmalloc(sizeof(*t), GFP_KERNEL); - if (!t) - return -ENOBUFS; - memcpy(t, &neigh_sysctl_template, sizeof(*t)); -@@ -1710,6 +1743,8 @@ int neigh_sysctl_register(struct net_dev - - void neigh_sysctl_unregister(struct neigh_parms *p) - { -+ if (!ve_is_super(get_exec_env())) -+ return; - if (p->sysctl_table) { - struct neigh_sysctl_table *t = p->sysctl_table; - p->sysctl_table = NULL; -diff -uprN linux-2.6.8.1.orig/net/core/net-sysfs.c linux-2.6.8.1-ve022stab078/net/core/net-sysfs.c ---- linux-2.6.8.1.orig/net/core/net-sysfs.c 2004-08-14 14:56:14.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/net-sysfs.c 2006-05-11 13:05:42.000000000 +0400 -@@ -370,18 +370,26 @@ static void netdev_release(struct class_ - struct net_device *dev - = container_of(cd, struct net_device, class_dev); - -- BUG_ON(dev->reg_state != NETREG_RELEASED); -+ BUG_ON(dev->reg_state != NETREG_RELEASED && -+ dev->reg_state != NETREG_REGISTERING); - - kfree((char *)dev - dev->padded); - } - --static struct class net_class = { -+struct class net_class = { - .name = "net", - .release = netdev_release, - #ifdef CONFIG_HOTPLUG - .hotplug = netdev_hotplug, - #endif - }; -+EXPORT_SYMBOL(net_class); -+ -+#ifndef CONFIG_VE -+#define visible_net_class net_class -+#else -+#define visible_net_class (*get_exec_env()->net_class) -+#endif - - void 
netdev_unregister_sysfs(struct net_device * net) - { -@@ -406,7 +414,7 @@ int netdev_register_sysfs(struct net_dev - struct class_device_attribute *attr; - int ret; - -- class_dev->class = &net_class; -+ class_dev->class = &visible_net_class; - class_dev->class_data = net; - net->last_stats = net->get_stats; - -@@ -440,12 +448,21 @@ out_cleanup: - out_unreg: - printk(KERN_WARNING "%s: sysfs attribute registration failed %d\n", - net->name, ret); -- class_device_unregister(class_dev); -+ /* put is called in free_netdev() */ -+ class_device_del(class_dev); - out: - return ret; - } - -+void prepare_sysfs_netdev(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->net_class = &net_class; -+#endif -+} -+ - int netdev_sysfs_init(void) - { -+ prepare_sysfs_netdev(); - return class_register(&net_class); - } -diff -uprN linux-2.6.8.1.orig/net/core/netfilter.c linux-2.6.8.1-ve022stab078/net/core/netfilter.c ---- linux-2.6.8.1.orig/net/core/netfilter.c 2004-08-14 14:55:32.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/netfilter.c 2006-05-11 13:05:41.000000000 +0400 -@@ -49,6 +49,13 @@ struct list_head nf_hooks[NPROTO][NF_MAX - static LIST_HEAD(nf_sockopts); - static spinlock_t nf_hook_lock = SPIN_LOCK_UNLOCKED; - -+#ifdef CONFIG_VE_IPTABLES -+#define ve_nf_hooks \ -+ ((struct list_head (*)[NF_MAX_HOOKS])(get_exec_env()->_nf_hooks)) -+#else -+#define ve_nf_hooks nf_hooks -+#endif -+ - /* - * A queue handler may be registered for each protocol. Each is protected by - * long term mutex. 
The handler must provide an an outfn() to accept packets -@@ -65,7 +72,7 @@ int nf_register_hook(struct nf_hook_ops - struct list_head *i; - - spin_lock_bh(&nf_hook_lock); -- list_for_each(i, &nf_hooks[reg->pf][reg->hooknum]) { -+ list_for_each(i, &ve_nf_hooks[reg->pf][reg->hooknum]) { - if (reg->priority < ((struct nf_hook_ops *)i)->priority) - break; - } -@@ -76,6 +83,32 @@ int nf_register_hook(struct nf_hook_ops - return 0; - } - -+int visible_nf_register_hook(struct nf_hook_ops *reg) -+{ -+ int ret = 0; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct nf_hook_ops *tmp; -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct nf_hook_ops), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, reg, sizeof(struct nf_hook_ops)); -+ reg = tmp; -+ } -+ -+ ret = nf_register_hook(reg); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) -+ kfree(reg); -+nomem: -+ return ret; -+} -+ - void nf_unregister_hook(struct nf_hook_ops *reg) - { - spin_lock_bh(&nf_hook_lock); -@@ -85,6 +118,28 @@ void nf_unregister_hook(struct nf_hook_o - synchronize_net(); - } - -+int visible_nf_unregister_hook(struct nf_hook_ops *reg) -+{ -+ struct nf_hook_ops *i; -+ -+ spin_lock_bh(&nf_hook_lock); -+ list_for_each_entry(i, &ve_nf_hooks[reg->pf][reg->hooknum], list) { -+ if (reg->hook == i->hook) { -+ reg = i; -+ break; -+ } -+ } -+ spin_unlock_bh(&nf_hook_lock); -+ if (reg != i) -+ return -ENOENT; -+ -+ nf_unregister_hook(reg); -+ -+ if (!ve_is_super(get_exec_env())) -+ kfree(reg); -+ return 0; -+} -+ - /* Do exclusive ranges overlap? 
*/ - static inline int overlap(int min1, int max1, int min2, int max2) - { -@@ -292,6 +347,12 @@ static int nf_sockopt(struct sock *sk, i - struct nf_sockopt_ops *ops; - int ret; - -+#ifdef CONFIG_VE_IPTABLES -+ if (!get_exec_env()->_nf_hooks || -+ !get_exec_env()->_ipt_standard_target) -+ return -ENOPROTOOPT; -+#endif -+ - if (down_interruptible(&nf_sockopt_mutex) != 0) - return -EINTR; - -@@ -515,9 +576,9 @@ int nf_hook_slow(int pf, unsigned int ho - skb->nf_debug |= (1 << hook); - #endif - -- elem = &nf_hooks[pf][hook]; -+ elem = &ve_nf_hooks[pf][hook]; - next_hook: -- verdict = nf_iterate(&nf_hooks[pf][hook], &skb, hook, indev, -+ verdict = nf_iterate(&ve_nf_hooks[pf][hook], &skb, hook, indev, - outdev, &elem, okfn, hook_thresh); - if (verdict == NF_QUEUE) { - NFDEBUG("nf_hook: Verdict = QUEUE.\n"); -@@ -563,12 +624,12 @@ void nf_reinject(struct sk_buff *skb, st - /* Drop reference to owner of hook which queued us. */ - module_put(info->elem->owner); - -- list_for_each_rcu(i, &nf_hooks[info->pf][info->hook]) { -+ list_for_each_rcu(i, &ve_nf_hooks[info->pf][info->hook]) { - if (i == elem) - break; - } - -- if (elem == &nf_hooks[info->pf][info->hook]) { -+ if (elem == &ve_nf_hooks[info->pf][info->hook]) { - /* The module which sent it to userspace is gone. */ - NFDEBUG("%s: module disappeared, dropping packet.\n", - __FUNCTION__); -@@ -583,7 +644,7 @@ void nf_reinject(struct sk_buff *skb, st - - if (verdict == NF_ACCEPT) { - next_hook: -- verdict = nf_iterate(&nf_hooks[info->pf][info->hook], -+ verdict = nf_iterate(&ve_nf_hooks[info->pf][info->hook], - &skb, info->hook, - info->indev, info->outdev, &elem, - info->okfn, INT_MIN); -@@ -808,26 +869,69 @@ EXPORT_SYMBOL(nf_log_packet); - with it. 
*/ - void (*ip_ct_attach)(struct sk_buff *, struct nf_ct_info *); - --void __init netfilter_init(void) -+void init_nf_hooks(struct list_head (*nh)[NF_MAX_HOOKS]) - { - int i, h; - - for (i = 0; i < NPROTO; i++) { - for (h = 0; h < NF_MAX_HOOKS; h++) -- INIT_LIST_HEAD(&nf_hooks[i][h]); -+ INIT_LIST_HEAD(&nh[i][h]); - } - } - -+int init_netfilter(void) -+{ -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *envid; -+ -+ envid = get_exec_env(); -+ envid->_nf_hooks = kmalloc(sizeof(nf_hooks), GFP_KERNEL); -+ if (envid->_nf_hooks == NULL) -+ return -ENOMEM; -+ -+ /* FIXME: charge ubc */ -+ -+ init_nf_hooks(envid->_nf_hooks); -+ return 0; -+#else -+ init_nf_hooks(nf_hooks); -+ return 0; -+#endif -+} -+ -+#ifdef CONFIG_VE_IPTABLES -+void fini_netfilter(void) -+{ -+ struct ve_struct *envid; -+ -+ envid = get_exec_env(); -+ if (envid->_nf_hooks != NULL) -+ kfree(envid->_nf_hooks); -+ envid->_nf_hooks = NULL; -+ -+ /* FIXME: uncharge ubc */ -+} -+#endif -+ -+void __init netfilter_init(void) -+{ -+ init_netfilter(); -+} -+ - EXPORT_SYMBOL(ip_ct_attach); - EXPORT_SYMBOL(ip_route_me_harder); - EXPORT_SYMBOL(nf_getsockopt); - EXPORT_SYMBOL(nf_hook_slow); - EXPORT_SYMBOL(nf_hooks); - EXPORT_SYMBOL(nf_register_hook); -+EXPORT_SYMBOL(visible_nf_register_hook); - EXPORT_SYMBOL(nf_register_queue_handler); - EXPORT_SYMBOL(nf_register_sockopt); - EXPORT_SYMBOL(nf_reinject); - EXPORT_SYMBOL(nf_setsockopt); - EXPORT_SYMBOL(nf_unregister_hook); -+EXPORT_SYMBOL(visible_nf_unregister_hook); - EXPORT_SYMBOL(nf_unregister_queue_handler); - EXPORT_SYMBOL(nf_unregister_sockopt); -+EXPORT_SYMBOL(init_netfilter); -+EXPORT_SYMBOL(fini_netfilter); -diff -uprN linux-2.6.8.1.orig/net/core/rtnetlink.c linux-2.6.8.1-ve022stab078/net/core/rtnetlink.c ---- linux-2.6.8.1.orig/net/core/rtnetlink.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/rtnetlink.c 2006-05-11 13:05:42.000000000 +0400 -@@ -294,6 +294,8 @@ static int rtnetlink_dump_all(struct sk_ - if (rtnetlink_links[idx] 
== NULL || - rtnetlink_links[idx][type].dumpit == NULL) - continue; -+ if (vz_security_proto_check(idx, 0, 0)) -+ continue; - if (idx > s_idx) - memset(&cb->args[0], 0, sizeof(cb->args)); - if (rtnetlink_links[idx][type].dumpit(skb, cb)) -@@ -362,7 +364,7 @@ rtnetlink_rcv_msg(struct sk_buff *skb, s - return 0; - - family = ((struct rtgenmsg*)NLMSG_DATA(nlh))->rtgen_family; -- if (family >= NPROTO) { -+ if (family >= NPROTO || vz_security_proto_check(family, 0, 0)) { - *errp = -EAFNOSUPPORT; - return -1; - } -@@ -488,7 +490,13 @@ static void rtnetlink_rcv(struct sock *s - return; - - while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { -- if (rtnetlink_rcv_skb(skb)) { -+ int ret; -+ struct ve_struct *old_env; -+ -+ old_env = set_exec_env(VE_OWNER_SKB(skb)); -+ ret = rtnetlink_rcv_skb(skb); -+ (void)set_exec_env(old_env); -+ if (ret) { - if (skb->len) - skb_queue_head(&sk->sk_receive_queue, - skb); -diff -uprN linux-2.6.8.1.orig/net/core/scm.c linux-2.6.8.1-ve022stab078/net/core/scm.c ---- linux-2.6.8.1.orig/net/core/scm.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/scm.c 2006-05-11 13:05:41.000000000 +0400 -@@ -34,6 +34,7 @@ - #include <net/compat.h> - #include <net/scm.h> - -+#include <ub/ub_mem.h> - - /* - * Only allow a user to send credentials, that they could set with -@@ -42,7 +43,9 @@ - - static __inline__ int scm_check_creds(struct ucred *creds) - { -- if ((creds->pid == current->tgid || capable(CAP_SYS_ADMIN)) && -+ if ((creds->pid == virt_tgid(current) || -+ creds->pid == current->tgid || -+ capable(CAP_VE_SYS_ADMIN)) && - ((creds->uid == current->uid || creds->uid == current->euid || - creds->uid == current->suid) || capable(CAP_SETUID)) && - ((creds->gid == current->gid || creds->gid == current->egid || -@@ -69,7 +72,7 @@ static int scm_fp_copy(struct cmsghdr *c - - if (!fpl) - { -- fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL); -+ fpl = ub_kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL); - if (!fpl) - 
return -ENOMEM; - *fplp = fpl; -@@ -127,9 +130,7 @@ int __scm_send(struct socket *sock, stru - for too short ancillary data object at all! Oops. - OK, let's add it... - */ -- if (cmsg->cmsg_len < sizeof(struct cmsghdr) || -- (unsigned long)(((char*)cmsg - (char*)msg->msg_control) -- + cmsg->cmsg_len) > msg->msg_controllen) -+ if (!CMSG_OK(msg, cmsg)) - goto error; - - if (cmsg->cmsg_level != SOL_SOCKET) -@@ -277,7 +278,7 @@ struct scm_fp_list *scm_fp_dup(struct sc - if (!fpl) - return NULL; - -- new_fpl = kmalloc(sizeof(*fpl), GFP_KERNEL); -+ new_fpl = ub_kmalloc(sizeof(*fpl), GFP_KERNEL); - if (new_fpl) { - for (i=fpl->count-1; i>=0; i--) - get_file(fpl->fp[i]); -diff -uprN linux-2.6.8.1.orig/net/core/skbuff.c linux-2.6.8.1-ve022stab078/net/core/skbuff.c ---- linux-2.6.8.1.orig/net/core/skbuff.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/skbuff.c 2006-05-11 13:05:41.000000000 +0400 -@@ -48,6 +48,7 @@ - #include <linux/in.h> - #include <linux/inet.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/netdevice.h> - #ifdef CONFIG_NET_CLS_ACT - #include <net/pkt_sched.h> -@@ -68,6 +69,8 @@ - #include <asm/uaccess.h> - #include <asm/system.h> - -+#include <ub/ub_net.h> -+ - static kmem_cache_t *skbuff_head_cache; - - /* -@@ -136,6 +139,9 @@ struct sk_buff *alloc_skb(unsigned int s - if (!skb) - goto out; - -+ if (ub_skb_alloc_bc(skb, gfp_mask)) -+ goto nobc; -+ - /* Get the DATA. Size must match skb_add_mtu(). 
*/ - size = SKB_DATA_ALIGN(size); - data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask); -@@ -149,6 +155,7 @@ struct sk_buff *alloc_skb(unsigned int s - skb->data = data; - skb->tail = data; - skb->end = data + size; -+ SET_VE_OWNER_SKB(skb, get_exec_env()); - - atomic_set(&(skb_shinfo(skb)->dataref), 1); - skb_shinfo(skb)->nr_frags = 0; -@@ -158,6 +165,8 @@ struct sk_buff *alloc_skb(unsigned int s - out: - return skb; - nodata: -+ ub_skb_free_bc(skb); -+nobc: - kmem_cache_free(skbuff_head_cache, skb); - skb = NULL; - goto out; -@@ -208,6 +217,7 @@ void skb_release_data(struct sk_buff *sk - void kfree_skbmem(struct sk_buff *skb) - { - skb_release_data(skb); -+ ub_skb_free_bc(skb); - kmem_cache_free(skbuff_head_cache, skb); - } - -@@ -232,6 +242,7 @@ void __kfree_skb(struct sk_buff *skb) - #ifdef CONFIG_XFRM - secpath_put(skb->sp); - #endif -+ ub_skb_uncharge(skb); - if(skb->destructor) { - if (in_irq()) - printk(KERN_WARNING "Warning: kfree_skb on " -@@ -277,6 +288,11 @@ struct sk_buff *skb_clone(struct sk_buff - if (!n) - return NULL; - -+ if (ub_skb_alloc_bc(n, gfp_mask)) { -+ kmem_cache_free(skbuff_head_cache, n); -+ return NULL; -+ } -+ - #define C(x) n->x = skb->x - - n->next = n->prev = NULL; -@@ -305,6 +321,7 @@ struct sk_buff *skb_clone(struct sk_buff - C(priority); - C(protocol); - C(security); -+ SET_VE_OWNER_SKB(n, VE_OWNER_SKB(skb)); - n->destructor = NULL; - #ifdef CONFIG_NETFILTER - C(nfmark); -@@ -372,6 +389,7 @@ static void copy_skb_header(struct sk_bu - new->stamp = old->stamp; - new->destructor = NULL; - new->security = old->security; -+ SET_VE_OWNER_SKB(new, VE_OWNER_SKB((struct sk_buff *)old)); - #ifdef CONFIG_NETFILTER - new->nfmark = old->nfmark; - new->nfcache = old->nfcache; -@@ -1434,6 +1452,7 @@ void __init skb_init(void) - NULL, NULL); - if (!skbuff_head_cache) - panic("cannot create skbuff cache"); -+ skbuff_head_cache->flags |= CFLGS_ENVIDS; - } - - EXPORT_SYMBOL(___pskb_trim); -diff -uprN linux-2.6.8.1.orig/net/core/sock.c 
linux-2.6.8.1-ve022stab078/net/core/sock.c ---- linux-2.6.8.1.orig/net/core/sock.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/sock.c 2006-05-11 13:05:41.000000000 +0400 -@@ -106,6 +106,7 @@ - #include <linux/net.h> - #include <linux/mm.h> - #include <linux/slab.h> -+#include <linux/kmem_cache.h> - #include <linux/interrupt.h> - #include <linux/poll.h> - #include <linux/tcp.h> -@@ -121,6 +122,9 @@ - #include <net/xfrm.h> - #include <linux/ipsec.h> - -+#include <ub/ub_net.h> -+#include <ub/beancounter.h> -+ - #include <linux/filter.h> - - #ifdef CONFIG_INET -@@ -169,7 +173,7 @@ static void sock_warn_obsolete_bsdism(co - static char warncomm[16]; - if (strcmp(warncomm, current->comm) && warned < 5) { - strcpy(warncomm, current->comm); -- printk(KERN_WARNING "process `%s' is using obsolete " -+ ve_printk(VE_LOG, KERN_WARNING "process `%s' is using obsolete " - "%s SO_BSDCOMPAT\n", warncomm, name); - warned++; - } -@@ -621,6 +625,7 @@ struct sock *sk_alloc(int family, int pr - zero_it == 1 ? 
sizeof(struct sock) : zero_it); - sk->sk_family = family; - sock_lock_init(sk); -+ SET_VE_OWNER_SK(sk, get_exec_env()); - } - sk->sk_slab = slab; - -@@ -653,6 +658,7 @@ void sk_free(struct sock *sk) - __FUNCTION__, atomic_read(&sk->sk_omem_alloc)); - - security_sk_free(sk); -+ ub_sock_uncharge(sk); - kmem_cache_free(sk->sk_slab, sk); - module_put(owner); - } -@@ -663,6 +669,7 @@ void __init sk_init(void) - SLAB_HWCACHE_ALIGN, NULL, NULL); - if (!sk_cachep) - printk(KERN_CRIT "sk_init: Cannot create sock SLAB cache!"); -+ sk_cachep->flags |= CFLGS_ENVIDS; - - if (num_physpages <= 4096) { - sysctl_wmem_max = 32767; -@@ -819,6 +826,7 @@ static long sock_wait_for_wmem(struct so - struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len, - unsigned long data_len, int noblock, int *errcode) - { -+#if 0 - struct sk_buff *skb; - unsigned int gfp_mask; - long timeo; -@@ -895,13 +903,87 @@ interrupted: - err = sock_intr_errno(timeo); - failure: - *errcode = err; -+#endif -+ return NULL; -+} -+ -+struct sk_buff *sock_alloc_send_skb2(struct sock *sk, unsigned long size, -+ unsigned long size2, int noblock, -+ int *errcode) -+{ -+ struct sk_buff *skb; -+ unsigned int gfp_mask; -+ long timeo; -+ int err; -+ -+ gfp_mask = sk->sk_allocation; -+ if (gfp_mask & __GFP_WAIT) -+ gfp_mask |= __GFP_REPEAT; -+ -+ timeo = sock_sndtimeo(sk, noblock); -+ while (1) { -+ err = sock_error(sk); -+ if (err != 0) -+ goto failure; -+ -+ err = -EPIPE; -+ if (sk->sk_shutdown & SEND_SHUTDOWN) -+ goto failure; -+ -+ if (ub_sock_getwres_other(sk, skb_charge_size(size))) { -+ if (size2 < size) { -+ size = size2; -+ continue; -+ } -+ set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); -+ err = -EAGAIN; -+ if (!timeo) -+ goto failure; -+ if (signal_pending(current)) -+ goto interrupted; -+ timeo = ub_sock_wait_for_space(sk, timeo, -+ skb_charge_size(size)); -+ continue; -+ } -+ -+ if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) { -+ skb = alloc_skb(size, sk->sk_allocation); -+ if 
(skb) -+ /* Full success... */ -+ break; -+ ub_sock_retwres_other(sk, skb_charge_size(size), -+ SOCK_MIN_UBCSPACE_CH); -+ err = -ENOBUFS; -+ goto failure; -+ } -+ ub_sock_retwres_other(sk, -+ skb_charge_size(size), -+ SOCK_MIN_UBCSPACE_CH); -+ set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); -+ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); -+ err = -EAGAIN; -+ if (!timeo) -+ goto failure; -+ if (signal_pending(current)) -+ goto interrupted; -+ timeo = sock_wait_for_wmem(sk, timeo); -+ } -+ -+ ub_skb_set_charge(skb, sk, skb_charge_size(size), UB_OTHERSOCKBUF); -+ skb_set_owner_w(skb, sk); -+ return skb; -+ -+interrupted: -+ err = sock_intr_errno(timeo); -+failure: -+ *errcode = err; - return NULL; - } - - struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size, - int noblock, int *errcode) - { -- return sock_alloc_send_pskb(sk, size, 0, noblock, errcode); -+ return sock_alloc_send_skb2(sk, size, size, noblock, errcode); - } - - void __lock_sock(struct sock *sk) -diff -uprN linux-2.6.8.1.orig/net/core/stream.c linux-2.6.8.1-ve022stab078/net/core/stream.c ---- linux-2.6.8.1.orig/net/core/stream.c 2004-08-14 14:54:46.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/core/stream.c 2006-05-11 13:05:39.000000000 +0400 -@@ -109,8 +109,9 @@ EXPORT_SYMBOL(sk_stream_wait_close); - * sk_stream_wait_memory - Wait for more memory for a socket - * @sk - socket to wait for memory - * @timeo_p - for how long -+ * @amount - amount of memory to wait for (in UB space!) 
- */ --int sk_stream_wait_memory(struct sock *sk, long *timeo_p) -+int sk_stream_wait_memory(struct sock *sk, long *timeo_p, unsigned long amount) - { - int err = 0; - long vm_wait = 0; -@@ -132,14 +133,19 @@ int sk_stream_wait_memory(struct sock *s - if (signal_pending(current)) - goto do_interrupted; - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); -- if (sk_stream_memory_free(sk) && !vm_wait) -- break; -+ if (amount == 0) { -+ if (sk_stream_memory_free(sk) && !vm_wait) -+ break; -+ } else -+ ub_sock_sndqueueadd_tcp(sk, amount); - - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - sk->sk_write_pending++; - sk_wait_event(sk, ¤t_timeo, sk_stream_memory_free(sk) && - vm_wait); - sk->sk_write_pending--; -+ if (amount > 0) -+ ub_sock_sndqueuedel(sk); - - if (vm_wait) { - vm_wait -= current_timeo; -diff -uprN linux-2.6.8.1.orig/net/ipv4/af_inet.c linux-2.6.8.1-ve022stab078/net/ipv4/af_inet.c ---- linux-2.6.8.1.orig/net/ipv4/af_inet.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/af_inet.c 2006-05-11 13:05:41.000000000 +0400 -@@ -113,6 +113,8 @@ - #include <linux/mroute.h> - #endif - -+#include <ub/ub_net.h> -+ - DEFINE_SNMP_STAT(struct linux_mib, net_statistics); - - #ifdef INET_REFCNT_DEBUG -@@ -299,6 +301,13 @@ static int inet_create(struct socket *so - err = -EPROTONOSUPPORT; - if (!protocol) - goto out_sk_free; -+ err = -ENOBUFS; -+ if (ub_sock_charge(sk, PF_INET, sock->type)) -+ goto out_sk_free; -+ /* if charge was successful, sock_init_data() MUST be called to -+ * set sk->sk_type. 
otherwise sk will be uncharged to wrong resource -+ */ -+ - err = 0; - sock->ops = answer->ops; - sk->sk_prot = answer->prot; -@@ -377,6 +386,9 @@ int inet_release(struct socket *sock) - - if (sk) { - long timeout; -+ struct ve_struct *saved_env; -+ -+ saved_env = set_exec_env(VE_OWNER_SK(sk)); - - /* Applications forget to leave groups before exiting */ - ip_mc_drop_socket(sk); -@@ -394,6 +406,8 @@ int inet_release(struct socket *sock) - timeout = sk->sk_lingertime; - sock->sk = NULL; - sk->sk_prot->close(sk, timeout); -+ -+ set_exec_env(saved_env); - } - return 0; - } -@@ -981,20 +995,20 @@ static struct net_protocol icmp_protocol - - static int __init init_ipv4_mibs(void) - { -- net_statistics[0] = alloc_percpu(struct linux_mib); -- net_statistics[1] = alloc_percpu(struct linux_mib); -- ip_statistics[0] = alloc_percpu(struct ipstats_mib); -- ip_statistics[1] = alloc_percpu(struct ipstats_mib); -- icmp_statistics[0] = alloc_percpu(struct icmp_mib); -- icmp_statistics[1] = alloc_percpu(struct icmp_mib); -- tcp_statistics[0] = alloc_percpu(struct tcp_mib); -- tcp_statistics[1] = alloc_percpu(struct tcp_mib); -- udp_statistics[0] = alloc_percpu(struct udp_mib); -- udp_statistics[1] = alloc_percpu(struct udp_mib); -+ ve_net_statistics[0] = alloc_percpu(struct linux_mib); -+ ve_net_statistics[1] = alloc_percpu(struct linux_mib); -+ ve_ip_statistics[0] = alloc_percpu(struct ipstats_mib); -+ ve_ip_statistics[1] = alloc_percpu(struct ipstats_mib); -+ ve_icmp_statistics[0] = alloc_percpu(struct icmp_mib); -+ ve_icmp_statistics[1] = alloc_percpu(struct icmp_mib); -+ ve_tcp_statistics[0] = alloc_percpu(struct tcp_mib); -+ ve_tcp_statistics[1] = alloc_percpu(struct tcp_mib); -+ ve_udp_statistics[0] = alloc_percpu(struct udp_mib); -+ ve_udp_statistics[1] = alloc_percpu(struct udp_mib); - if (! 
-- (net_statistics[0] && net_statistics[1] && ip_statistics[0] -- && ip_statistics[1] && tcp_statistics[0] && tcp_statistics[1] -- && udp_statistics[0] && udp_statistics[1])) -+ (ve_net_statistics[0] && ve_net_statistics[1] && ve_ip_statistics[0] -+ && ve_ip_statistics[1] && ve_tcp_statistics[0] && ve_tcp_statistics[1] -+ && ve_udp_statistics[0] && ve_udp_statistics[1])) - return -ENOMEM; - - (void) tcp_mib_init(); -diff -uprN linux-2.6.8.1.orig/net/ipv4/arp.c linux-2.6.8.1-ve022stab078/net/ipv4/arp.c ---- linux-2.6.8.1.orig/net/ipv4/arp.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/arp.c 2006-05-11 13:05:41.000000000 +0400 -@@ -695,6 +695,9 @@ void arp_send(int type, int ptype, u32 d - - static void parp_redo(struct sk_buff *skb) - { -+#if defined(CONFIG_NETFILTER) && defined(CONFIG_NETFILTER_DEBUG) -+ skb->nf_debug = 0; -+#endif - arp_rcv(skb, skb->dev, NULL); - } - -@@ -980,7 +983,7 @@ int arp_req_set(struct arpreq *r, struct - return 0; - } - if (dev == NULL) { -- ipv4_devconf.proxy_arp = 1; -+ ve_ipv4_devconf.proxy_arp = 1; - return 0; - } - if (__in_dev_get(dev)) { -@@ -1066,7 +1069,7 @@ int arp_req_delete(struct arpreq *r, str - return pneigh_delete(&arp_tbl, &ip, dev); - if (mask == 0) { - if (dev == NULL) { -- ipv4_devconf.proxy_arp = 0; -+ ve_ipv4_devconf.proxy_arp = 0; - return 0; - } - if (__in_dev_get(dev)) { -@@ -1115,6 +1118,8 @@ int arp_ioctl(unsigned int cmd, void __u - if (!capable(CAP_NET_ADMIN)) - return -EPERM; - case SIOCGARP: -+ if (!ve_is_super(get_exec_env())) -+ return -EACCES; - err = copy_from_user(&r, arg, sizeof(struct arpreq)); - if (err) - return -EFAULT; -@@ -1486,8 +1491,12 @@ static int arp_seq_open(struct inode *in - { - struct seq_file *seq; - int rc = -ENOMEM; -- struct arp_iter_state *s = kmalloc(sizeof(*s), GFP_KERNEL); -- -+ struct arp_iter_state *s; -+ -+ if (!ve_is_super(get_exec_env())) -+ return -EPERM; -+ -+ s = kmalloc(sizeof(*s), GFP_KERNEL); - if (!s) - goto out; - -diff -uprN 
linux-2.6.8.1.orig/net/ipv4/devinet.c linux-2.6.8.1-ve022stab078/net/ipv4/devinet.c ---- linux-2.6.8.1.orig/net/ipv4/devinet.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/devinet.c 2006-05-11 13:05:42.000000000 +0400 -@@ -77,10 +77,21 @@ static struct ipv4_devconf ipv4_devconf_ - .accept_source_route = 1, - }; - -+struct ipv4_devconf *get_ipv4_devconf_dflt_addr(void) -+{ -+ return &ipv4_devconf_dflt; -+} -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ve_ipv4_devconf_dflt (*(get_exec_env()->_ipv4_devconf_dflt)) -+#else -+#define ve_ipv4_devconf_dflt ipv4_devconf_dflt -+#endif -+ - static void rtmsg_ifa(int event, struct in_ifaddr *); - - static struct notifier_block *inetaddr_chain; --static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, -+void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, - int destroy); - #ifdef CONFIG_SYSCTL - static void devinet_sysctl_register(struct in_device *in_dev, -@@ -221,7 +232,7 @@ int inet_addr_onlink(struct in_device *i - return 0; - } - --static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, -+void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, - int destroy) - { - struct in_ifaddr *ifa1 = *ifap; -@@ -537,7 +548,7 @@ int devinet_ioctl(unsigned int cmd, void - - case SIOCSIFFLAGS: - ret = -EACCES; -- if (!capable(CAP_NET_ADMIN)) -+ if (!capable(CAP_VE_NET_ADMIN)) - goto out; - break; - case SIOCSIFADDR: /* Set interface address (and family) */ -@@ -545,7 +556,7 @@ int devinet_ioctl(unsigned int cmd, void - case SIOCSIFDSTADDR: /* Set the destination address */ - case SIOCSIFNETMASK: /* Set the netmask for the interface */ - ret = -EACCES; -- if (!capable(CAP_NET_ADMIN)) -+ if (!capable(CAP_VE_NET_ADMIN)) - goto out; - ret = -EINVAL; - if (sin->sin_family != AF_INET) -@@ -965,7 +976,7 @@ static int inetdev_event(struct notifier - case NETDEV_UP: - if (dev->mtu < 68) - break; -- if (dev == 
&loopback_dev) { -+ if (dev == &visible_loopback_dev) { - struct in_ifaddr *ifa; - if ((ifa = inet_alloc_ifa()) != NULL) { - ifa->ifa_local = -@@ -1130,10 +1141,10 @@ static struct rtnetlink_link inet_rtnetl - void inet_forward_change(void) - { - struct net_device *dev; -- int on = ipv4_devconf.forwarding; -+ int on = ve_ipv4_devconf.forwarding; - -- ipv4_devconf.accept_redirects = !on; -- ipv4_devconf_dflt.forwarding = on; -+ ve_ipv4_devconf.accept_redirects = !on; -+ ve_ipv4_devconf_dflt.forwarding = on; - - read_lock(&dev_base_lock); - for (dev = dev_base; dev; dev = dev->next) { -@@ -1158,9 +1169,9 @@ static int devinet_sysctl_forward(ctl_ta - int ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos); - - if (write && *valp != val) { -- if (valp == &ipv4_devconf.forwarding) -+ if (valp == &ve_ipv4_devconf.forwarding) - inet_forward_change(); -- else if (valp != &ipv4_devconf_dflt.forwarding) -+ else if (valp != &ve_ipv4_devconf_dflt.forwarding) - rt_cache_flush(0); - } - -@@ -1422,30 +1433,22 @@ static struct devinet_sysctl_table { - }, - }; - --static void devinet_sysctl_register(struct in_device *in_dev, -- struct ipv4_devconf *p) -+static struct devinet_sysctl_table *__devinet_sysctl_register(char *dev_name, -+ int ifindex, struct ipv4_devconf *p) - { - int i; -- struct net_device *dev = in_dev ? 
in_dev->dev : NULL; -- struct devinet_sysctl_table *t = kmalloc(sizeof(*t), GFP_KERNEL); -- char *dev_name = NULL; -+ struct devinet_sysctl_table *t; - -+ t = kmalloc(sizeof(*t), GFP_KERNEL); - if (!t) -- return; -+ goto out; -+ - memcpy(t, &devinet_sysctl, sizeof(*t)); - for (i = 0; i < ARRAY_SIZE(t->devinet_vars) - 1; i++) { - t->devinet_vars[i].data += (char *)p - (char *)&ipv4_devconf; - t->devinet_vars[i].de = NULL; - } - -- if (dev) { -- dev_name = dev->name; -- t->devinet_dev[0].ctl_name = dev->ifindex; -- } else { -- dev_name = "default"; -- t->devinet_dev[0].ctl_name = NET_PROTO_CONF_DEFAULT; -- } -- - /* - * Make a copy of dev_name, because '.procname' is regarded as const - * by sysctl and we wouldn't want anyone to change it under our feet -@@ -1453,8 +1456,9 @@ static void devinet_sysctl_register(stru - */ - dev_name = net_sysctl_strdup(dev_name); - if (!dev_name) -- goto free; -+ goto out_free_table; - -+ t->devinet_dev[0].ctl_name = ifindex; - t->devinet_dev[0].procname = dev_name; - t->devinet_dev[0].child = t->devinet_vars; - t->devinet_dev[0].de = NULL; -@@ -1467,17 +1471,38 @@ static void devinet_sysctl_register(stru - - t->sysctl_header = register_sysctl_table(t->devinet_root_dir, 0); - if (!t->sysctl_header) -- goto free_procname; -+ goto out_free_procname; - -- p->sysctl = t; -- return; -+ return t; - - /* error path */ -- free_procname: -+out_free_procname: - kfree(dev_name); -- free: -+out_free_table: - kfree(t); -- return; -+out: -+ printk(KERN_DEBUG "Can't register net/ipv4/conf sysctls.\n"); -+ return NULL; -+} -+ -+static void devinet_sysctl_register(struct in_device *in_dev, -+ struct ipv4_devconf *p) -+{ -+ struct net_device *dev; -+ char *dev_name; -+ int ifindex; -+ -+ dev = in_dev ? 
in_dev->dev : NULL; -+ -+ if (dev) { -+ dev_name = dev->name; -+ ifindex = dev->ifindex; -+ } else { -+ dev_name = "default"; -+ ifindex = NET_PROTO_CONF_DEFAULT; -+ } -+ -+ p->sysctl = __devinet_sysctl_register(dev_name, ifindex, p); - } - - static void devinet_sysctl_unregister(struct ipv4_devconf *p) -@@ -1490,7 +1515,189 @@ static void devinet_sysctl_unregister(st - kfree(t); - } - } -+ -+extern int visible_ipv4_sysctl_forward(ctl_table *ctl, int write, struct file * filp, -+ void __user *buffer, size_t *lenp, loff_t *ppos); -+extern int visible_ipv4_sysctl_forward_strategy(ctl_table *table, int *name, int nlen, -+ void *oldval, size_t *oldlenp, -+ void *newval, size_t newlen, -+ void **context); -+ -+extern void *get_flush_delay_addr(void); -+extern int visible_ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, struct file * filp, -+ void __user *buffer, size_t *lenp, loff_t *ppos); -+extern int visible_ipv4_sysctl_rtcache_flush_strategy(ctl_table *table, -+ int __user *name, -+ int nlen, -+ void __user *oldval, -+ size_t __user *oldlenp, -+ void __user *newval, -+ size_t newlen, -+ void **context); -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static ctl_table net_sysctl_tables[] = { -+ /* 0: net */ -+ { -+ .ctl_name = CTL_NET, -+ .procname = "net", -+ .mode = 0555, -+ .child = &net_sysctl_tables[2], -+ }, -+ { .ctl_name = 0, }, -+ /* 2: net/ipv4 */ -+ { -+ .ctl_name = NET_IPV4, -+ .procname = "ipv4", -+ .mode = 0555, -+ .child = &net_sysctl_tables[4], -+ }, -+ { .ctl_name = 0, }, -+ /* 4, 5: net/ipv4/[vars] */ -+ { -+ .ctl_name = NET_IPV4_FORWARD, -+ .procname = "ip_forward", -+ .data = &ipv4_devconf.forwarding, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &visible_ipv4_sysctl_forward, -+ .strategy = &visible_ipv4_sysctl_forward_strategy, -+ }, -+ { -+ .ctl_name = NET_IPV4_ROUTE, -+ .procname = "route", -+ .maxlen = 0, -+ .mode = 0555, -+ .child = &net_sysctl_tables[7], -+ }, -+ { .ctl_name = 0 }, -+ /* 7: 
net/ipv4/route/flush */ -+ { -+ .ctl_name = NET_IPV4_ROUTE_FLUSH, -+ .procname = "flush", -+ .data = NULL, /* setuped below */ -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &visible_ipv4_sysctl_rtcache_flush, -+ .strategy = &visible_ipv4_sysctl_rtcache_flush_strategy, -+ }, -+ { .ctl_name = 0 }, -+}; -+ -+static int ip_forward_sysctl_register(struct ve_struct *ve, -+ struct ipv4_devconf *p) -+{ -+ struct ctl_table_header *hdr; -+ ctl_table *root; -+ -+ root = clone_sysctl_template(net_sysctl_tables, -+ sizeof(net_sysctl_tables) / sizeof(ctl_table)); -+ if (root == NULL) -+ goto out; -+ -+ root[4].data = &p->forwarding; -+ root[7].data = get_flush_delay_addr(); -+ -+ hdr = register_sysctl_table(root, 1); -+ if (hdr == NULL) -+ goto out_free; -+ -+ ve->forward_header = hdr; -+ ve->forward_table = root; -+ return 0; -+ -+out_free: -+ free_sysctl_clone(root); -+out: -+ return -ENOMEM; -+} -+ -+static inline void ip_forward_sysctl_unregister(struct ve_struct *ve) -+{ -+ unregister_sysctl_table(ve->forward_header); -+ ve->forward_header = NULL; -+} -+ -+static inline void ip_forward_sysctl_free(struct ve_struct *ve) -+{ -+ free_sysctl_clone(ve->forward_table); -+ ve->forward_table = NULL; -+} - #endif -+#endif -+ -+int devinet_sysctl_init(struct ve_struct *ve) -+{ -+ int err = 0; -+#ifdef CONFIG_SYSCTL -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ struct ipv4_devconf *conf, *conf_def; -+ -+ err = -ENOMEM; -+ -+ conf = kmalloc(sizeof(*conf), GFP_KERNEL); -+ if (!conf) -+ goto err1; -+ -+ memcpy(conf, &ipv4_devconf, sizeof(*conf)); -+ conf->sysctl = __devinet_sysctl_register("all", -+ NET_PROTO_CONF_ALL, conf); -+ if (!conf->sysctl) -+ goto err2; -+ -+ conf_def = kmalloc(sizeof(*conf_def), GFP_KERNEL); -+ if (!conf_def) -+ goto err3; -+ -+ memcpy(conf_def, &ipv4_devconf_dflt, sizeof(*conf_def)); -+ conf_def->sysctl = __devinet_sysctl_register("default", -+ NET_PROTO_CONF_DEFAULT, conf_def); -+ if (!conf_def->sysctl) -+ goto err4; -+ 
-+ err = ip_forward_sysctl_register(ve, conf); -+ if (err) -+ goto err5; -+ -+ ve->_ipv4_devconf = conf; -+ ve->_ipv4_devconf_dflt = conf_def; -+ return 0; -+ -+err5: -+ devinet_sysctl_unregister(conf_def); -+err4: -+ kfree(conf_def); -+err3: -+ devinet_sysctl_unregister(conf); -+err2: -+ kfree(conf); -+err1: -+#endif -+#endif -+ return err; -+} -+ -+void devinet_sysctl_fini(struct ve_struct *ve) -+{ -+#ifdef CONFIG_SYSCTL -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ ip_forward_sysctl_unregister(ve); -+ devinet_sysctl_unregister(ve->_ipv4_devconf); -+ devinet_sysctl_unregister(ve->_ipv4_devconf_dflt); -+#endif -+#endif -+} -+ -+void devinet_sysctl_free(struct ve_struct *ve) -+{ -+#ifdef CONFIG_SYSCTL -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+ ip_forward_sysctl_free(ve); -+ kfree(ve->_ipv4_devconf); -+ kfree(ve->_ipv4_devconf_dflt); -+#endif -+#endif -+} - - void __init devinet_init(void) - { -@@ -1500,14 +1707,19 @@ void __init devinet_init(void) - #ifdef CONFIG_SYSCTL - devinet_sysctl.sysctl_header = - register_sysctl_table(devinet_sysctl.devinet_root_dir, 0); -- devinet_sysctl_register(NULL, &ipv4_devconf_dflt); -+ __devinet_sysctl_register("default", NET_PROTO_CONF_DEFAULT, -+ &ipv4_devconf_dflt); - #endif - } - - EXPORT_SYMBOL(devinet_ioctl); - EXPORT_SYMBOL(in_dev_finish_destroy); - EXPORT_SYMBOL(inet_select_addr); -+EXPORT_SYMBOL(inet_del_ifa); - EXPORT_SYMBOL(inetdev_by_index); - EXPORT_SYMBOL(inetdev_lock); -+EXPORT_SYMBOL(devinet_sysctl_init); -+EXPORT_SYMBOL(devinet_sysctl_fini); -+EXPORT_SYMBOL(devinet_sysctl_free); - EXPORT_SYMBOL(register_inetaddr_notifier); - EXPORT_SYMBOL(unregister_inetaddr_notifier); -diff -uprN linux-2.6.8.1.orig/net/ipv4/fib_frontend.c linux-2.6.8.1-ve022stab078/net/ipv4/fib_frontend.c ---- linux-2.6.8.1.orig/net/ipv4/fib_frontend.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/fib_frontend.c 2006-05-11 13:05:41.000000000 +0400 -@@ -51,14 +51,46 
@@ - - #define RT_TABLE_MIN RT_TABLE_MAIN - -+#undef ip_fib_local_table -+#undef ip_fib_main_table - struct fib_table *ip_fib_local_table; - struct fib_table *ip_fib_main_table; -+void prepare_fib_tables(void) -+{ -+#ifdef CONFIG_VE -+ get_ve0()->_local_table = ip_fib_local_table; -+ ip_fib_local_table = (struct fib_table *)0x12345678; -+ get_ve0()->_main_table = ip_fib_main_table; -+ ip_fib_main_table = (struct fib_table *)0x12345678; -+#endif -+} -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define ip_fib_local_table get_exec_env()->_local_table -+#define ip_fib_main_table get_exec_env()->_main_table -+#endif - - #else - - #define RT_TABLE_MIN 1 - -+#undef fib_tables - struct fib_table *fib_tables[RT_TABLE_MAX+1]; -+void prepare_fib_tables(void) -+{ -+#ifdef CONFIG_VE -+ int i; -+ -+ BUG_ON(sizeof(fib_tables) != -+ sizeof(((struct ve_struct *)0)->_fib_tables)); -+ memcpy(get_ve0()->_fib_tables, fib_tables, sizeof(fib_tables)); -+ for (i = 0; i <= RT_TABLE_MAX; i++) -+ fib_tables[i] = (void *)0x12366678; -+#endif -+} -+ -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+#define fib_tables get_exec_env()->_fib_tables -+#endif - - struct fib_table *__fib_new_table(int id) - { -@@ -248,7 +280,7 @@ int ip_rt_ioctl(unsigned int cmd, void _ - switch (cmd) { - case SIOCADDRT: /* Add a route */ - case SIOCDELRT: /* Delete a route */ -- if (!capable(CAP_NET_ADMIN)) -+ if (!capable(CAP_VE_NET_ADMIN)) - return -EPERM; - if (copy_from_user(&r, arg, sizeof(struct rtentry))) - return -EFAULT; -@@ -595,6 +627,7 @@ struct notifier_block fib_netdev_notifie - - void __init ip_fib_init(void) - { -+ prepare_fib_tables(); - #ifndef CONFIG_IP_MULTIPLE_TABLES - ip_fib_local_table = fib_hash_init(RT_TABLE_LOCAL); - ip_fib_main_table = fib_hash_init(RT_TABLE_MAIN); -diff -uprN linux-2.6.8.1.orig/net/ipv4/fib_hash.c linux-2.6.8.1-ve022stab078/net/ipv4/fib_hash.c ---- linux-2.6.8.1.orig/net/ipv4/fib_hash.c 2004-08-14 14:55:32.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/net/ipv4/fib_hash.c 2006-05-11 13:05:41.000000000 +0400 -@@ -35,6 +35,7 @@ - #include <linux/skbuff.h> - #include <linux/netlink.h> - #include <linux/init.h> -+#include <linux/ve.h> - - #include <net/ip.h> - #include <net/protocol.h> -@@ -101,12 +102,6 @@ struct fn_zone - can be cheaper than memory lookup, so that FZ_* macros are used. - */ - --struct fn_hash --{ -- struct fn_zone *fn_zones[33]; -- struct fn_zone *fn_zone_list; --}; -- - static __inline__ fn_hash_idx_t fn_hash(fn_key_t key, struct fn_zone *fz) - { - u32 h = ntohl(key.datum)>>(32 - fz->fz_order); -@@ -701,7 +696,14 @@ FTprint("tb(%d)_delete: %d %08x/%d %d\n" - f = *del_fp; - rtmsg_fib(RTM_DELROUTE, f, z, tb->tb_id, n, req); - -- if (matched != 1) { -+ if (matched != 1 || -+ /* -+ * Don't try to be excessively smart if it's not one of -+ * the host system tables, it would be a waste of -+ * memory. -+ */ -+ !ve_is_super(get_exec_env())) -+ { - write_lock_bh(&fib_hash_lock); - *del_fp = f->fn_next; - write_unlock_bh(&fib_hash_lock); -@@ -766,6 +768,92 @@ static int fn_hash_flush(struct fib_tabl - return found; - } - -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static __inline__ void -+fib_destroy_list(struct fib_node ** fp, int z, struct fn_hash *table) -+{ -+ struct fib_node *f; -+ -+ while ((f = *fp) != NULL) { -+ write_lock_bh(&fib_hash_lock); -+ *fp = f->fn_next; -+ write_unlock_bh(&fib_hash_lock); -+ -+ fn_free_node(f); -+ } -+} -+ -+void fib_hash_destroy(struct fib_table *tb) -+{ -+ struct fn_hash *table = (struct fn_hash*)tb->tb_data; -+ struct fn_zone *fz; -+ -+ for (fz = table->fn_zone_list; fz; fz = fz->fz_next) { -+ int i; -+ for (i=fz->fz_divisor-1; i>=0; i--) -+ fib_destroy_list(&fz->fz_hash[i], fz->fz_order, table); -+ fz->fz_nent = 0; -+ } -+} -+ -+/* -+ * Initialization of virtualized networking subsystem. 
-+ */ -+int init_ve_route(struct ve_struct *ve) -+{ -+#ifdef CONFIG_IP_MULTIPLE_TABLES -+ if (fib_rules_create()) -+ return -ENOMEM; -+ ve->_fib_tables[RT_TABLE_LOCAL] = fib_hash_init(RT_TABLE_LOCAL); -+ if (!ve->_fib_tables[RT_TABLE_LOCAL]) -+ goto out_destroy; -+ ve->_fib_tables[RT_TABLE_MAIN] = fib_hash_init(RT_TABLE_MAIN); -+ if (!ve->_fib_tables[RT_TABLE_MAIN]) -+ goto out_destroy_local; -+ -+ return 0; -+ -+out_destroy_local: -+ fib_hash_destroy(ve->_fib_tables[RT_TABLE_LOCAL]); -+out_destroy: -+ fib_rules_destroy(); -+ ve->_local_rule = NULL; -+ return -ENOMEM; -+#else -+ ve->_local_table = fib_hash_init(RT_TABLE_LOCAL); -+ if (!ve->_local_table) -+ return -ENOMEM; -+ ve->_main_table = fib_hash_init(RT_TABLE_MAIN); -+ if (!ve->_main_table) { -+ fib_hash_destroy(ve->_local_table); -+ return -ENOMEM; -+ } -+ return 0; -+#endif -+} -+ -+void fini_ve_route(struct ve_struct *ve) -+{ -+#ifdef CONFIG_IP_MULTIPLE_TABLES -+ int i; -+ for (i=0; i<RT_TABLE_MAX+1; i++) -+ { -+ if (!ve->_fib_tables[i]) -+ continue; -+ fib_hash_destroy(ve->_fib_tables[i]); -+ } -+ fib_rules_destroy(); -+ ve->_local_rule = NULL; -+#else -+ fib_hash_destroy(ve->_local_table); -+ fib_hash_destroy(ve->_main_table); -+#endif -+} -+ -+EXPORT_SYMBOL(init_ve_route); -+EXPORT_SYMBOL(fini_ve_route); -+#endif -+ - - static __inline__ int - fn_hash_dump_bucket(struct sk_buff *skb, struct netlink_callback *cb, -@@ -863,7 +951,7 @@ static void rtmsg_fib(int event, struct - netlink_unicast(rtnl, skb, pid, MSG_DONTWAIT); - } - --#ifdef CONFIG_IP_MULTIPLE_TABLES -+#if defined(CONFIG_IP_MULTIPLE_TABLES) || defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) - struct fib_table * fib_hash_init(int id) - #else - struct fib_table * __init fib_hash_init(int id) -@@ -973,13 +1061,23 @@ out: - return iter->node; - } - -+static struct fib_node *fib_get_idx(struct seq_file *seq, loff_t pos) -+{ -+ struct fib_node *fn = fib_get_first(seq); -+ -+ if (fn) -+ while (pos && (fn = fib_get_next(seq))) -+ --pos; 
-+	return pos ? NULL : fn;
-+}
-+
- static void *fib_seq_start(struct seq_file *seq, loff_t *pos)
- {
- 	void *v = NULL;
-
- 	read_lock(&fib_hash_lock);
- 	if (ip_fib_main_table)
--		v = *pos ? fib_get_next(seq) : SEQ_START_TOKEN;
-+		v = *pos ? fib_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
- 	return v;
- }
-
-diff -uprN linux-2.6.8.1.orig/net/ipv4/fib_rules.c linux-2.6.8.1-ve022stab078/net/ipv4/fib_rules.c
---- linux-2.6.8.1.orig/net/ipv4/fib_rules.c	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/fib_rules.c	2006-05-11 13:05:41.000000000 +0400
-@@ -38,6 +38,7 @@
- #include <linux/proc_fs.h>
- #include <linux/skbuff.h>
- #include <linux/netlink.h>
-+#include <linux/rtnetlink.h>
- #include <linux/init.h>
-
- #include <net/ip.h>
-@@ -101,6 +102,87 @@ static struct fib_rule local_rule = {
- static struct fib_rule *fib_rules = &local_rule;
- static rwlock_t fib_rules_lock = RW_LOCK_UNLOCKED;
-
-+void prepare_fib_rules(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->_local_rule = &local_rule;
-+	get_ve0()->_fib_rules = fib_rules;
-+	fib_rules = (void *)0x12345678;
-+#endif
-+}
-+
-+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE)
-+#define ve_local_rule	(get_exec_env()->_local_rule)
-+#define ve_fib_rules	(get_exec_env()->_fib_rules)
-+#else
-+#define ve_local_rule	(&local_rule)
-+#define ve_fib_rules	fib_rules
-+#endif
-+
-+#if defined(CONFIG_VE_CALLS) || defined(CONFIG_VE_CALLS_MODULE)
-+int fib_rules_create()
-+{
-+	struct fib_rule *default_rule, *main_rule, *loc_rule;
-+
-+	default_rule = kmalloc(sizeof(struct fib_rule), GFP_KERNEL);
-+	if (default_rule == NULL)
-+		goto out_def;
-+	memset(default_rule, 0, sizeof(struct fib_rule));
-+	atomic_set(&default_rule->r_clntref, 1);
-+	default_rule->r_preference = 0x7FFF;
-+	default_rule->r_table = RT_TABLE_DEFAULT;
-+	default_rule->r_action = RTN_UNICAST;
-+
-+	main_rule = kmalloc(sizeof(struct fib_rule), GFP_KERNEL);
-+	if (main_rule == NULL)
-+		goto out_main;
-+	memset(main_rule, 0, sizeof(struct fib_rule));
-+	atomic_set(&main_rule->r_clntref, 1);
-+	main_rule->r_preference = 0x7FFE;
-+	main_rule->r_table = RT_TABLE_MAIN;
-+	main_rule->r_action = RTN_UNICAST;
-+	main_rule->r_next = default_rule;
-+
-+	loc_rule = kmalloc(sizeof(struct fib_rule), GFP_KERNEL);
-+	if (loc_rule == NULL)
-+		goto out_loc;
-+	memset(loc_rule, 0, sizeof(struct fib_rule));
-+	atomic_set(&loc_rule->r_clntref, 1);
-+	loc_rule->r_preference = 0;
-+	loc_rule->r_table = RT_TABLE_LOCAL;
-+	loc_rule->r_action = RTN_UNICAST;
-+	loc_rule->r_next = main_rule;
-+
-+	ve_local_rule = loc_rule;
-+	ve_fib_rules = loc_rule;
-+
-+	return 0;
-+
-+out_loc:
-+	kfree(main_rule);
-+out_main:
-+	kfree(default_rule);
-+out_def:
-+	return -1;
-+}
-+
-+void fib_rules_destroy()
-+{
-+	struct fib_rule *r;
-+
-+	rtnl_lock();
-+	write_lock_bh(&fib_rules_lock);
-+	while(ve_fib_rules != NULL) {
-+		r = ve_fib_rules;
-+		ve_fib_rules = ve_fib_rules->r_next;
-+		r->r_dead = 1;
-+		fib_rule_put(r);
-+	}
-+	write_unlock_bh(&fib_rules_lock);
-+	rtnl_unlock();
-+}
-+#endif
-+
- int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
- {
- 	struct rtattr **rta = arg;
-@@ -108,7 +190,7 @@ int inet_rtm_delrule(struct sk_buff *skb
- 	struct fib_rule *r, **rp;
- 	int err = -ESRCH;
-
--	for (rp=&fib_rules; (r=*rp) != NULL; rp=&r->r_next) {
-+	for (rp=&ve_fib_rules; (r=*rp) != NULL; rp=&r->r_next) {
- 		if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), &r->r_src, 4) == 0) &&
- 		    rtm->rtm_src_len == r->r_src_len &&
- 		    rtm->rtm_dst_len == r->r_dst_len &&
-@@ -122,7 +204,7 @@ int inet_rtm_delrule(struct sk_buff *skb
- 		    (!rta[RTA_IIF-1] || strcmp(RTA_DATA(rta[RTA_IIF-1]), r->r_ifname) == 0) &&
- 		    (!rtm->rtm_table || (r && rtm->rtm_table == r->r_table))) {
- 			err = -EPERM;
--			if (r == &local_rule)
-+			if (r == ve_local_rule)
- 				break;
-
- 			write_lock_bh(&fib_rules_lock);
-@@ -186,6 +268,7 @@ int inet_rtm_newrule(struct sk_buff *skb
- 	new_r = kmalloc(sizeof(*new_r), GFP_KERNEL);
- 	if (!new_r)
- 		return -ENOMEM;
-+
- 	memset(new_r, 0, sizeof(*new_r));
- 	if (rta[RTA_SRC-1])
- 		memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4);
-@@ -221,11 +304,11 @@ int inet_rtm_newrule(struct sk_buff *skb
- 		memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4);
- #endif
-
--	rp = &fib_rules;
-+	rp = &ve_fib_rules;
- 	if (!new_r->r_preference) {
--		r = fib_rules;
-+		r = ve_fib_rules;
- 		if (r && (r = r->r_next) != NULL) {
--			rp = &fib_rules->r_next;
-+			rp = &ve_fib_rules->r_next;
- 			if (r->r_preference)
- 				new_r->r_preference = r->r_preference - 1;
- 		}
-@@ -285,7 +368,7 @@ static void fib_rules_detach(struct net_
- {
- 	struct fib_rule *r;
-
--	for (r=fib_rules; r; r=r->r_next) {
-+	for (r=ve_fib_rules; r; r=r->r_next) {
- 		if (r->r_ifindex == dev->ifindex) {
- 			write_lock_bh(&fib_rules_lock);
- 			r->r_ifindex = -1;
-@@ -298,7 +381,7 @@ static void fib_rules_attach(struct net_
- {
- 	struct fib_rule *r;
-
--	for (r=fib_rules; r; r=r->r_next) {
-+	for (r=ve_fib_rules; r; r=r->r_next) {
- 		if (r->r_ifindex == -1 && strcmp(dev->name, r->r_ifname) == 0) {
- 			write_lock_bh(&fib_rules_lock);
- 			r->r_ifindex = dev->ifindex;
-@@ -319,7 +402,7 @@ int fib_lookup(const struct flowi *flp,
- 	FRprintk("Lookup: %u.%u.%u.%u <- %u.%u.%u.%u ",
- 		NIPQUAD(flp->fl4_dst), NIPQUAD(flp->fl4_src));
- 	read_lock(&fib_rules_lock);
--	for (r = fib_rules; r; r=r->r_next) {
-+	for (r = ve_fib_rules; r; r=r->r_next) {
- 		if (((saddr^r->r_src) & r->r_srcmask) ||
- 		    ((daddr^r->r_dst) & r->r_dstmask) ||
- #ifdef CONFIG_IP_ROUTE_TOS
-@@ -449,7 +532,7 @@ int inet_dump_rules(struct sk_buff *skb,
- 	struct fib_rule *r;
-
- 	read_lock(&fib_rules_lock);
--	for (r=fib_rules, idx=0; r; r = r->r_next, idx++) {
-+	for (r=ve_fib_rules, idx=0; r; r = r->r_next, idx++) {
- 		if (idx < s_idx)
- 			continue;
- 		if (inet_fill_rule(skb, r, cb) < 0)
-@@ -463,5 +546,6 @@ int inet_dump_rules(struct sk_buff *skb,
-
- void __init fib_rules_init(void)
- {
-+	prepare_fib_rules();
- 	register_netdevice_notifier(&fib_rules_notifier);
- }
-diff -uprN linux-2.6.8.1.orig/net/ipv4/fib_semantics.c linux-2.6.8.1-ve022stab078/net/ipv4/fib_semantics.c
---- linux-2.6.8.1.orig/net/ipv4/fib_semantics.c	2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/fib_semantics.c	2006-05-11 13:05:41.000000000 +0400
-@@ -32,6 +32,7 @@
- #include <linux/netdevice.h>
- #include <linux/if_arp.h>
- #include <linux/proc_fs.h>
-+#include <linux/ve.h>
- #include <linux/skbuff.h>
- #include <linux/netlink.h>
- #include <linux/init.h>
-@@ -49,6 +50,18 @@ static struct fib_info *fib_info_list;
- static rwlock_t fib_info_lock = RW_LOCK_UNLOCKED;
- int fib_info_cnt;
-
-+void prepare_fib_info(void)
-+{
-+#ifdef CONFIG_VE
-+	get_ve0()->_fib_info_list = fib_info_list;
-+	fib_info_list = (void *)0x12345678;
-+#endif
-+}
-+
-+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE)
-+#define fib_info_list	(get_exec_env()->_fib_info_list)
-+#endif
-+
- #define for_fib_info() { struct fib_info *fi; \
- 	for (fi = fib_info_list; fi; fi = fi->fib_next)
-
-@@ -155,7 +168,6 @@ void free_fib_info(struct fib_info *fi)
- 			dev_put(nh->nh_dev);
- 		nh->nh_dev = NULL;
- 	} endfor_nexthops(fi);
--	fib_info_cnt--;
- 	kfree(fi);
- }
-
-@@ -483,11 +495,13 @@ fib_create_info(const struct rtmsg *r, s
- 	}
- #endif
-
--	fi = kmalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
-+
- 	err = -ENOBUFS;
-+
-+	fi = kmalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
- 	if (fi == NULL)
- 		goto failure;
--	fib_info_cnt++;
-+
- 	memset(fi, 0, sizeof(*fi)+nhs*sizeof(struct fib_nh));
-
- 	fi->fib_protocol = r->rtm_protocol;
-diff -uprN linux-2.6.8.1.orig/net/ipv4/icmp.c linux-2.6.8.1-ve022stab078/net/ipv4/icmp.c
---- linux-2.6.8.1.orig/net/ipv4/icmp.c	2004-08-14 14:56:22.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/icmp.c	2006-05-11 13:05:27.000000000 +0400
-@@ -346,12 +346,12 @@ static void icmp_push_reply(struct icmp_
- {
- 	struct sk_buff *skb;
-
--	ip_append_data(icmp_socket->sk, icmp_glue_bits, icmp_param,
--		icmp_param->data_len+icmp_param->head_len,
--		icmp_param->head_len,
--		ipc, rt, MSG_DONTWAIT);
--
--	if ((skb = skb_peek(&icmp_socket->sk->sk_write_queue)) != NULL) {
-+	if (ip_append_data(icmp_socket->sk, icmp_glue_bits, icmp_param,
-+			icmp_param->data_len+icmp_param->head_len,
-+			icmp_param->head_len,
-+			ipc, rt, MSG_DONTWAIT) < 0)
-+		ip_flush_pending_frames(icmp_socket->sk);
-+	else if ((skb = skb_peek(&icmp_socket->sk->sk_write_queue)) != NULL) {
- 		struct icmphdr *icmph = skb->h.icmph;
- 		unsigned int csum = 0;
- 		struct sk_buff *skb1;
-diff -uprN linux-2.6.8.1.orig/net/ipv4/igmp.c linux-2.6.8.1-ve022stab078/net/ipv4/igmp.c
---- linux-2.6.8.1.orig/net/ipv4/igmp.c	2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/igmp.c	2006-05-11 13:05:42.000000000 +0400
-@@ -889,7 +889,10 @@ int igmp_rcv(struct sk_buff *skb)
- 		/* Is it our report looped back? */
- 		if (((struct rtable*)skb->dst)->fl.iif == 0)
- 			break;
--		igmp_heard_report(in_dev, ih->group);
-+		/* don't rely on MC router hearing unicast reports */
-+		if (skb->pkt_type == PACKET_MULTICAST ||
-+		    skb->pkt_type == PACKET_BROADCAST)
-+			igmp_heard_report(in_dev, ih->group);
- 		break;
- 	case IGMP_PIM:
- #ifdef CONFIG_IP_PIMSM_V1
-@@ -1776,12 +1779,12 @@ int ip_mc_source(int add, int omode, str
- 			goto done;
- 		rv = !0;
- 		for (i=0; i<psl->sl_count; i++) {
--			rv = memcmp(&psl->sl_addr, &mreqs->imr_multiaddr,
-+			rv = memcmp(&psl->sl_addr[i], &mreqs->imr_sourceaddr,
- 				sizeof(__u32));
--			if (rv >= 0)
-+			if (rv == 0)
- 				break;
- 		}
--		if (!rv)	/* source not found */
-+		if (rv)	/* source not found */
- 			goto done;
-
- 		/* update the interface filter */
-@@ -1823,9 +1826,9 @@ int ip_mc_source(int add, int omode, str
- 		}
- 		rv = 1;	/* > 0 for insert logic below if sl_count is 0 */
- 		for (i=0; i<psl->sl_count; i++) {
--			rv = memcmp(&psl->sl_addr, &mreqs->imr_multiaddr,
-+			rv = memcmp(&psl->sl_addr[i], &mreqs->imr_sourceaddr,
- 				sizeof(__u32));
--			if (rv >= 0)
-+			if (rv == 0)
- 				break;
- 		}
- 		if (rv == 0)	/* address already there is an error */
-@@ -2297,7 +2300,8 @@ static inline struct ip_sf_list *igmp_mc
- 	struct ip_mc_list *im = NULL;
- 	struct igmp_mcf_iter_state *state = igmp_mcf_seq_private(seq);
-
--	for (state->dev = dev_base, state->idev = NULL, state->im = NULL;
-+	for (state->dev = dev_base,
-+	     state->idev = NULL, state->im = NULL;
- 	     state->dev;
- 	     state->dev = state->dev->next) {
- 		struct in_device *idev;
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_forward.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_forward.c
---- linux-2.6.8.1.orig/net/ipv4/ip_forward.c	2004-08-14 14:54:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_forward.c	2006-05-11 13:05:41.000000000 +0400
-@@ -91,6 +91,23 @@ int ip_forward(struct sk_buff *skb)
- 	if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
- 		goto sr_failed;
-
-+	/*
-+	 * We try to optimize forwarding of VE packets:
-+	 * do not decrement TTL (and so save skb_cow)
-+	 * during forwarding of outgoing pkts from VE.
-+	 * For incoming pkts we still do ttl decr,
-+	 * since such skb is not cloned and does not require
-+	 * actual cow. So, there is at least one place
-+	 * in pkts path with mandatory ttl decr, that is
-+	 * sufficient to prevent routing loops.
-+	 */
-+	if (
-+#ifdef CONFIG_IP_ROUTE_NAT
-+	    (rt->rt_flags & RTCF_NAT) == 0 &&	/* no NAT mangling expected */
-+#endif						/* and */
-+	    (skb->dev->features & NETIF_F_VENET))	/* src is VENET device */
-+		goto no_ttl_decr;
-+
- 	/* We are about to mangle packet. Copy it! */
- 	if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
- 		goto drop;
-@@ -99,6 +116,8 @@ int ip_forward(struct sk_buff *skb)
- 	/* Decrease ttl after skb cow done */
- 	ip_decrease_ttl(iph);
-
-+no_ttl_decr:
-+
- 	/*
- 	 * We now generate an ICMP HOST REDIRECT giving the route
- 	 * we calculated.
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_fragment.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_fragment.c
---- linux-2.6.8.1.orig/net/ipv4/ip_fragment.c	2004-08-14 14:55:09.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_fragment.c	2006-05-11 13:05:41.000000000 +0400
-@@ -42,6 +42,7 @@
- #include <linux/udp.h>
- #include <linux/inet.h>
- #include <linux/netfilter_ipv4.h>
-+#include <linux/ve_owner.h>
-
- /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
-  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
-@@ -73,6 +74,7 @@ struct ipfrag_skb_cb
- struct ipq {
- 	struct ipq	*next;		/* linked list pointers */
- 	struct list_head lru_list;	/* lru list member */
-+	u32		user;
- 	u32		saddr;
- 	u32		daddr;
- 	u16		id;
-@@ -91,8 +93,12 @@ struct ipq {
- 	struct ipq	**pprev;
- 	int		iif;
- 	struct timeval	stamp;
-+	struct ve_struct *owner_env;
- };
-
-+DCL_VE_OWNER_PROTO(IPQ, TAIL_SOFT, struct ipq, owner_env, inline, (always_inline))
-+DCL_VE_OWNER(IPQ, TAIL_SOFT, struct ipq, owner_env, inline, (always_inline))
-+
- /* Hash table. */
-
- #define IPQ_HASHSZ	64
-@@ -104,6 +110,20 @@ static u32 ipfrag_hash_rnd;
- static LIST_HEAD(ipq_lru_list);
- int ip_frag_nqueues = 0;
-
-+void prepare_ipq(void)
-+{
-+	struct ipq *qp;
-+	unsigned int hash;
-+
-+	write_lock(&ipfrag_lock);
-+	for (hash = 0; hash < IPQ_HASHSZ; hash++) {
-+		for(qp = ipq_hash[hash]; qp; qp = qp->next) {
-+			SET_VE_OWNER_IPQ(qp, get_ve0());
-+		}
-+	}
-+	write_unlock(&ipfrag_lock);
-+}
-+
- static __inline__ void __ipq_unlink(struct ipq *qp)
- {
- 	if(qp->next)
-@@ -183,7 +203,8 @@ static __inline__ void frag_free_queue(s
-
- static __inline__ struct ipq *frag_alloc_queue(void)
- {
--	struct ipq *qp = kmalloc(sizeof(struct ipq), GFP_ATOMIC);
-+	struct ipq *qp = kmalloc(sizeof(struct ipq) + sizeof(void *),
-+			GFP_ATOMIC);
-
- 	if(!qp)
- 		return NULL;
-@@ -273,6 +294,9 @@ static void ip_evictor(void)
- static void ip_expire(unsigned long arg)
- {
- 	struct ipq *qp = (struct ipq *) arg;
-+	struct ve_struct *envid;
-+
-+	envid = set_exec_env(VE_OWNER_IPQ(qp));
-
- 	spin_lock(&qp->lock);
-
-@@ -295,6 +319,8 @@ static void ip_expire(unsigned long arg)
- out:
- 	spin_unlock(&qp->lock);
- 	ipq_put(qp);
-+
-+	(void)set_exec_env(envid);
- }
-
- /* Creation primitives. */
-@@ -313,7 +339,9 @@ static struct ipq *ip_frag_intern(unsign
- 		if(qp->id == qp_in->id &&
- 		   qp->saddr == qp_in->saddr &&
- 		   qp->daddr == qp_in->daddr &&
--		   qp->protocol == qp_in->protocol) {
-+		   qp->protocol == qp_in->protocol &&
-+		   qp->user == qp_in->user &&
-+		   qp->owner_env == get_exec_env()) {
- 			atomic_inc(&qp->refcnt);
- 			write_unlock(&ipfrag_lock);
- 			qp_in->last_in |= COMPLETE;
-@@ -340,7 +368,7 @@ static struct ipq *ip_frag_intern(unsign
- }
-
- /* Add an entry to the 'ipq' queue for a newly received IP datagram. */
--static struct ipq *ip_frag_create(unsigned hash, struct iphdr *iph)
-+static struct ipq *ip_frag_create(unsigned hash, struct iphdr *iph, u32 user)
- {
- 	struct ipq *qp;
-
-@@ -352,6 +380,7 @@ static struct ipq *ip_frag_create(unsign
- 	qp->id = iph->id;
- 	qp->saddr = iph->saddr;
- 	qp->daddr = iph->daddr;
-+	qp->user = user;
- 	qp->len = 0;
- 	qp->meat = 0;
- 	qp->fragments = NULL;
-@@ -364,6 +393,8 @@ static struct ipq *ip_frag_create(unsign
- 	qp->lock = SPIN_LOCK_UNLOCKED;
- 	atomic_set(&qp->refcnt, 1);
-
-+	SET_VE_OWNER_IPQ(qp, get_exec_env());
-+
- 	return ip_frag_intern(hash, qp);
-
- out_nomem:
-@@ -374,7 +405,7 @@ out_nomem:
- /* Find the correct entry in the "incomplete datagrams" queue for
-  * this IP datagram, and create new one, if nothing is found.
-  */
--static inline struct ipq *ip_find(struct iphdr *iph)
-+static inline struct ipq *ip_find(struct iphdr *iph, u32 user)
- {
- 	__u16 id = iph->id;
- 	__u32 saddr = iph->saddr;
-@@ -388,7 +419,9 @@ static inline struct ipq *ip_find(struct
- 		if(qp->id == id &&
- 		   qp->saddr == saddr &&
- 		   qp->daddr == daddr &&
--		   qp->protocol == protocol) {
-+		   qp->protocol == protocol &&
-+		   qp->user == user &&
-+		   qp->owner_env == get_exec_env()) {
- 			atomic_inc(&qp->refcnt);
- 			read_unlock(&ipfrag_lock);
- 			return qp;
-@@ -396,7 +429,7 @@ static inline struct ipq *ip_find(struct
- 	}
- 	read_unlock(&ipfrag_lock);
-
--	return ip_frag_create(hash, iph);
-+	return ip_frag_create(hash, iph, user);
- }
-
- /* Add new segment to existing queue. */
-@@ -630,7 +663,7 @@ out_fail:
- }
-
- /* Process an incoming IP datagram fragment. */
--struct sk_buff *ip_defrag(struct sk_buff *skb)
-+struct sk_buff *ip_defrag(struct sk_buff *skb, u32 user)
- {
- 	struct iphdr *iph = skb->nh.iph;
- 	struct ipq *qp;
-@@ -645,7 +678,7 @@ struct sk_buff *ip_defrag(struct sk_buff
- 	dev = skb->dev;
-
- 	/* Lookup (or create) queue header */
--	if ((qp = ip_find(iph)) != NULL) {
-+	if ((qp = ip_find(iph, user)) != NULL) {
- 		struct sk_buff *ret = NULL;
-
- 		spin_lock(&qp->lock);
-@@ -656,6 +689,9 @@ struct sk_buff *ip_defrag(struct sk_buff
- 		    qp->meat == qp->len)
- 			ret = ip_frag_reasm(qp, dev);
-
-+		if (ret)
-+			SET_VE_OWNER_SKB(ret, VE_OWNER_SKB(skb));
-+
- 		spin_unlock(&qp->lock);
- 		ipq_put(qp);
- 		return ret;
-@@ -666,6 +702,48 @@ struct sk_buff *ip_defrag(struct sk_buff
- 	return NULL;
- }
-
-+#ifdef CONFIG_VE
-+/* XXX */
-+void ip_fragment_cleanup(struct ve_struct *envid)
-+{
-+	int i, progress;
-+
-+	/* All operations with fragment queues are performed from NET_RX/TX
-+	 * soft interrupts or from timer context. --Den */
-+	local_bh_disable();
-+	do {
-+		progress = 0;
-+		for (i = 0; i < IPQ_HASHSZ; i++) {
-+			struct ipq *qp;
-+			if (ipq_hash[i] == NULL)
-+				continue;
-+inner_restart:
-+			read_lock(&ipfrag_lock);
-+			for (qp = ipq_hash[i]; qp; qp = qp->next) {
-+				if (!ve_accessible_strict(
-+						VE_OWNER_IPQ(qp),
-+						envid))
-+					continue;
-+				atomic_inc(&qp->refcnt);
-+				read_unlock(&ipfrag_lock);
-+
-+				spin_lock(&qp->lock);
-+				if (!(qp->last_in&COMPLETE))
-+					ipq_kill(qp);
-+				spin_unlock(&qp->lock);
-+
-+				ipq_put(qp);
-+				progress = 1;
-+				goto inner_restart;
-+			}
-+			read_unlock(&ipfrag_lock);
-+		}
-+	} while(progress);
-+	local_bh_enable();
-+}
-+EXPORT_SYMBOL(ip_fragment_cleanup);
-+#endif
-+
- void ipfrag_init(void)
- {
- 	ipfrag_hash_rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_input.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_input.c
---- linux-2.6.8.1.orig/net/ipv4/ip_input.c	2004-08-14 14:54:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_input.c	2006-05-11 13:05:25.000000000 +0400
-@@ -172,7 +172,7 @@ int ip_call_ra_chain(struct sk_buff *skb
- 		    (!sk->sk_bound_dev_if ||
- 		     sk->sk_bound_dev_if == skb->dev->ifindex)) {
- 			if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
--				skb = ip_defrag(skb);
-+				skb = ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN);
- 				if (skb == NULL) {
- 					read_unlock(&ip_ra_lock);
- 					return 1;
-@@ -274,7 +274,7 @@ int ip_local_deliver(struct sk_buff *skb
- 	 */
-
- 	if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
--		skb = ip_defrag(skb);
-+		skb = ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER);
- 		if (!skb)
- 			return 0;
- 	}
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_options.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_options.c
---- linux-2.6.8.1.orig/net/ipv4/ip_options.c	2004-08-14 14:55:48.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_options.c	2006-05-11 13:05:33.000000000 +0400
-@@ -515,6 +515,8 @@ int ip_options_get(struct ip_options **o
- 		kfree(opt);
- 		return -EINVAL;
- 	}
-+	if (*optp)
-+		kfree(*optp);
- 	*optp = opt;
- 	return 0;
- }
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_output.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_output.c
---- linux-2.6.8.1.orig/net/ipv4/ip_output.c	2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_output.c	2006-05-11 13:05:44.000000000 +0400
-@@ -405,6 +405,7 @@ static void ip_copy_metadata(struct sk_b
- 	to->priority = from->priority;
- 	to->protocol = from->protocol;
- 	to->security = from->security;
-+	dst_release(to->dst);
- 	to->dst = dst_clone(from->dst);
- 	to->dev = from->dev;
-
-@@ -519,6 +520,7 @@ int ip_fragment(struct sk_buff *skb, int
- 		/* Prepare header of the next frame,
- 		 * before previous one went down. */
- 		if (frag) {
-+			frag->ip_summed = CHECKSUM_NONE;
- 			frag->h.raw = frag->data;
- 			frag->nh.raw = __skb_push(frag, hlen);
- 			memcpy(frag->nh.raw, iph, hlen);
-@@ -1147,11 +1149,7 @@ int ip_push_pending_frames(struct sock *
- 	iph->tos = inet->tos;
- 	iph->tot_len = htons(skb->len);
- 	iph->frag_off = df;
--	if (!df) {
--		__ip_select_ident(iph, &rt->u.dst, 0);
--	} else {
--		iph->id = htons(inet->id++);
--	}
-+	ip_select_ident(iph, &rt->u.dst, sk);
- 	iph->ttl = ttl;
- 	iph->protocol = sk->sk_protocol;
- 	iph->saddr = rt->rt_src;
-@@ -1242,13 +1240,14 @@ void ip_send_reply(struct sock *sk, stru
- 		char			data[40];
- 	} replyopts;
- 	struct ipcm_cookie ipc;
--	u32 daddr;
-+	u32 saddr, daddr;
- 	struct rtable *rt = (struct rtable*)skb->dst;
-
- 	if (ip_options_echo(&replyopts.opt, skb))
- 		return;
-
--	daddr = ipc.addr = rt->rt_src;
-+	saddr = skb->nh.iph->daddr;
-+	daddr = ipc.addr = skb->nh.iph->saddr;
- 	ipc.opt = NULL;
-
- 	if (replyopts.opt.optlen) {
-@@ -1261,7 +1260,7 @@ void ip_send_reply(struct sock *sk, stru
- 	{
- 		struct flowi fl = { .nl_u = { .ip4_u =
- 				    { .daddr = daddr,
--				      .saddr = rt->rt_spec_dst,
-+				      .saddr = saddr,
- 				      .tos = RT_TOS(skb->nh.iph->tos) } },
- 				    /* Not quite clean, but right. */
- 				    .uli_u = { .ports =
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ip_sockglue.c linux-2.6.8.1-ve022stab078/net/ipv4/ip_sockglue.c
---- linux-2.6.8.1.orig/net/ipv4/ip_sockglue.c	2004-08-14 14:55:09.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ip_sockglue.c	2006-05-11 13:05:34.000000000 +0400
-@@ -146,11 +146,8 @@ int ip_cmsg_send(struct msghdr *msg, str
- 	struct cmsghdr *cmsg;
-
- 	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
--		if (cmsg->cmsg_len < sizeof(struct cmsghdr) ||
--		    (unsigned long)(((char*)cmsg - (char*)msg->msg_control)
--				    + cmsg->cmsg_len) > msg->msg_controllen) {
-+		if (!CMSG_OK(msg, cmsg))
- 			return -EINVAL;
--		}
- 		if (cmsg->cmsg_level != SOL_IP)
- 			continue;
- 		switch (cmsg->cmsg_type) {
-@@ -851,6 +848,9 @@ mc_msf_out:
-
- 	case IP_IPSEC_POLICY:
- 	case IP_XFRM_POLICY:
-+		err = -EPERM;
-+		if (!capable(CAP_NET_ADMIN))
-+			break;
- 		err = xfrm_user_policy(sk, optname, optval, optlen);
- 		break;
-
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ipmr.c linux-2.6.8.1-ve022stab078/net/ipv4/ipmr.c
---- linux-2.6.8.1.orig/net/ipv4/ipmr.c	2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ipmr.c	2006-05-11 13:05:41.000000000 +0400
-@@ -828,7 +828,7 @@ static void mrtsock_destruct(struct sock
- {
- 	rtnl_lock();
- 	if (sk == mroute_socket) {
--		ipv4_devconf.mc_forwarding--;
-+		ve_ipv4_devconf.mc_forwarding--;
-
- 		write_lock_bh(&mrt_lock);
- 		mroute_socket=NULL;
-@@ -879,7 +879,7 @@ int ip_mroute_setsockopt(struct sock *sk
- 			mroute_socket=sk;
- 			write_unlock_bh(&mrt_lock);
-
--			ipv4_devconf.mc_forwarding++;
-+			ve_ipv4_devconf.mc_forwarding++;
- 		}
- 		rtnl_unlock();
- 		return ret;
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ipvs/ip_vs_conn.c linux-2.6.8.1-ve022stab078/net/ipv4/ipvs/ip_vs_conn.c
---- linux-2.6.8.1.orig/net/ipv4/ipvs/ip_vs_conn.c	2004-08-14 14:56:15.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ipvs/ip_vs_conn.c	2006-05-11 13:05:39.000000000 +0400
-@@ -876,7 +876,8 @@ int ip_vs_conn_init(void)
- 	/* Allocate ip_vs_conn slab cache */
- 	ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
- 					      sizeof(struct ip_vs_conn), 0,
--					      SLAB_HWCACHE_ALIGN, NULL, NULL);
-+					      SLAB_HWCACHE_ALIGN | SLAB_UBC,
-+					      NULL, NULL);
- 	if (!ip_vs_conn_cachep) {
- 		vfree(ip_vs_conn_tab);
- 		return -ENOMEM;
-diff -uprN linux-2.6.8.1.orig/net/ipv4/ipvs/ip_vs_core.c linux-2.6.8.1-ve022stab078/net/ipv4/ipvs/ip_vs_core.c
---- linux-2.6.8.1.orig/net/ipv4/ipvs/ip_vs_core.c	2004-08-14 14:54:46.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/ipvs/ip_vs_core.c	2006-05-11 13:05:41.000000000 +0400
-@@ -541,9 +541,9 @@ u16 ip_vs_checksum_complete(struct sk_bu
- }
-
- static inline struct sk_buff *
--ip_vs_gather_frags(struct sk_buff *skb)
-+ip_vs_gather_frags(struct sk_buff *skb, u_int32_t user)
- {
--	skb = ip_defrag(skb);
-+	skb = ip_defrag(skb, user);
- 	if (skb)
- 		ip_send_check(skb->nh.iph);
- 	return skb;
-@@ -617,7 +617,7 @@ static int ip_vs_out_icmp(struct sk_buff
-
- 	/* reassemble IP fragments */
- 	if (skb->nh.iph->frag_off & __constant_htons(IP_MF|IP_OFFSET)) {
--		skb = ip_vs_gather_frags(skb);
-+		skb = ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT);
- 		if (!skb)
- 			return NF_STOLEN;
- 		*pskb = skb;
-@@ -759,7 +759,7 @@ ip_vs_out(unsigned int hooknum, struct s
- 	/* reassemble IP fragments */
- 	if (unlikely(iph->frag_off & __constant_htons(IP_MF|IP_OFFSET) &&
- 		     !pp->dont_defrag)) {
--		skb = ip_vs_gather_frags(skb);
-+		skb = ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT);
- 		if (!skb)
- 			return NF_STOLEN;
- 		iph = skb->nh.iph;
-@@ -862,7 +862,8 @@ check_for_ip_vs_out(struct sk_buff **psk
-  * forward to the right destination host if relevant.
-  * Currently handles error types - unreachable, quench, ttl exceeded.
-  */
--static int ip_vs_in_icmp(struct sk_buff **pskb, int *related)
-+static int
-+ip_vs_in_icmp(struct sk_buff **pskb, int *related, unsigned int hooknum)
- {
- 	struct sk_buff *skb = *pskb;
- 	struct iphdr *iph;
-@@ -876,7 +877,9 @@ static int ip_vs_in_icmp(struct sk_buff
-
- 	/* reassemble IP fragments */
- 	if (skb->nh.iph->frag_off & __constant_htons(IP_MF|IP_OFFSET)) {
--		skb = ip_vs_gather_frags(skb);
-+		skb = ip_vs_gather_frags(skb,
-+				hooknum == NF_IP_LOCAL_IN ?
-+				IP_DEFRAG_VS_IN : IP_DEFRAG_VS_FWD);
- 		if (!skb)
- 			return NF_STOLEN;
- 		*pskb = skb;
-@@ -972,6 +975,10 @@ ip_vs_in(unsigned int hooknum, struct sk
- 	 * Big tappo: only PACKET_HOST (neither loopback nor mcasts)
- 	 * ... don't know why 1st test DOES NOT include 2nd (?)
- 	 */
-+	/*
-+	 * VZ: the question above is right.
-+	 * The second test is superfluous.
-+	 */
- 	if (unlikely(skb->pkt_type != PACKET_HOST
- 		     || skb->dev == &loopback_dev || skb->sk)) {
- 		IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
-@@ -990,7 +997,7 @@ ip_vs_in(unsigned int hooknum, struct sk
-
- 	iph = skb->nh.iph;
- 	if (unlikely(iph->protocol == IPPROTO_ICMP)) {
--		int related, verdict = ip_vs_in_icmp(pskb, &related);
-+		int related, verdict = ip_vs_in_icmp(pskb, &related, hooknum);
-
- 		if (related)
- 			return verdict;
-@@ -1085,7 +1092,7 @@ ip_vs_forward_icmp(unsigned int hooknum,
- 	if ((*pskb)->nh.iph->protocol != IPPROTO_ICMP)
- 		return NF_ACCEPT;
-
--	return ip_vs_in_icmp(pskb, &r);
-+	return ip_vs_in_icmp(pskb, &r, hooknum);
- }
-
-
-diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_core.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_core.c
---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_core.c	2004-08-14 14:54:46.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_core.c	2006-05-11 13:05:45.000000000 +0400
-@@ -47,6 +47,7 @@
- #include <linux/netfilter_ipv4/ip_conntrack_helper.h>
- #include <linux/netfilter_ipv4/ip_conntrack_core.h>
- #include <linux/netfilter_ipv4/listhelp.h>
-+#include <ub/ub_mem.h>
-
- #define IP_CONNTRACK_VERSION	"2.1"
-
-@@ -62,10 +63,10 @@ DECLARE_RWLOCK(ip_conntrack_expect_tuple
- void (*ip_conntrack_destroyed)(struct ip_conntrack *conntrack) = NULL;
- LIST_HEAD(ip_conntrack_expect_list);
- LIST_HEAD(protocol_list);
--static LIST_HEAD(helpers);
-+LIST_HEAD(helpers);
- unsigned int ip_conntrack_htable_size = 0;
- int ip_conntrack_max;
--static atomic_t ip_conntrack_count = ATOMIC_INIT(0);
-+atomic_t ip_conntrack_count = ATOMIC_INIT(0);
- struct list_head *ip_conntrack_hash;
- static kmem_cache_t *ip_conntrack_cachep;
- struct ip_conntrack ip_conntrack_untracked;
-@@ -83,7 +84,7 @@ struct ip_conntrack_protocol *__ip_ct_fi
- 	struct ip_conntrack_protocol *p;
-
- 	MUST_BE_READ_LOCKED(&ip_conntrack_lock);
--	p = LIST_FIND(&protocol_list, proto_cmpfn,
-+	p = LIST_FIND(&ve_ip_conntrack_protocol_list, proto_cmpfn,
- 		      struct ip_conntrack_protocol *, protocol);
- 	if (!p)
- 		p = &ip_conntrack_generic_protocol;
-@@ -126,6 +127,28 @@ hash_conntrack(const struct ip_conntrack
- 		   ip_conntrack_hash_rnd) % ip_conntrack_htable_size);
- }
-
-+#ifdef CONFIG_VE_IPTABLES
-+/* this function gives us an ability to safely restore
-+ * connection in case of failure */
-+void ip_conntrack_hash_insert(struct ip_conntrack *ct)
-+{
-+	u_int32_t hash, repl_hash;
-+
-+	if (!ip_conntrack_hash_rnd_initted) {
-+		get_random_bytes(&ip_conntrack_hash_rnd, 4);
-+		ip_conntrack_hash_rnd_initted = 1;
-+	}
-+
-+	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-+	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
-+	list_add(&ct->tuplehash[IP_CT_DIR_ORIGINAL].list,
-+			&ve_ip_conntrack_hash[hash]);
-+	list_add(&ct->tuplehash[IP_CT_DIR_REPLY].list,
-+			&ve_ip_conntrack_hash[repl_hash]);
-+}
-+EXPORT_SYMBOL(ip_conntrack_hash_insert);
-+#endif
-+
- int
- get_tuple(const struct iphdr *iph,
- 	  const struct sk_buff *skb,
-@@ -195,7 +218,7 @@ __ip_ct_expect_find(const struct ip_conn
- {
- 	MUST_BE_READ_LOCKED(&ip_conntrack_lock);
- 	MUST_BE_READ_LOCKED(&ip_conntrack_expect_tuple_lock);
--	return LIST_FIND(&ip_conntrack_expect_list, expect_cmp,
-+	return LIST_FIND(&ve_ip_conntrack_expect_list, expect_cmp,
- 			 struct ip_conntrack_expect *, tuple);
- }
-
-@@ -278,7 +301,11 @@ static void remove_expectations(struct i
- 			continue;
- 		}
-
-+#ifdef CONFIG_VE_IPTABLES
-+		IP_NF_ASSERT(list_inlist(&(ct->ct_env)->_ip_conntrack_expect_list, exp));
-+#else
- 		IP_NF_ASSERT(list_inlist(&ip_conntrack_expect_list, exp));
-+#endif
- 		IP_NF_ASSERT(exp->expectant == ct);
-
- 		/* delete expectation from global and private lists */
-@@ -296,8 +323,15 @@ clean_from_lists(struct ip_conntrack *ct
-
- 	ho = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
- 	hr = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
-+#ifdef CONFIG_VE_IPTABLES
-+	LIST_DELETE(&((ct->ct_env)->_ip_conntrack_hash)[ho],
-+			&ct->tuplehash[IP_CT_DIR_ORIGINAL]);
-+	LIST_DELETE(&((ct->ct_env)->_ip_conntrack_hash)[hr],
-+			&ct->tuplehash[IP_CT_DIR_REPLY]);
-+#else
- 	LIST_DELETE(&ip_conntrack_hash[ho], &ct->tuplehash[IP_CT_DIR_ORIGINAL]);
- 	LIST_DELETE(&ip_conntrack_hash[hr], &ct->tuplehash[IP_CT_DIR_REPLY]);
-+#endif
-
- 	/* Destroy all un-established, pending expectations */
- 	remove_expectations(ct, 1);
-@@ -320,8 +354,13 @@ destroy_conntrack(struct nf_conntrack *n
- 	if (proto && proto->destroy)
- 		proto->destroy(ct);
-
-+#ifdef CONFIG_VE_IPTABLES
-+	if (ct->ct_env->_ip_conntrack_destroyed)
-+		ct->ct_env->_ip_conntrack_destroyed(ct);
-+#else
- 	if (ip_conntrack_destroyed)
- 		ip_conntrack_destroyed(ct);
-+#endif
-
- 	WRITE_LOCK(&ip_conntrack_lock);
- 	/* Make sure don't leave any orphaned expectations lying around */
-@@ -343,9 +382,13 @@ destroy_conntrack(struct nf_conntrack *n
- 	if (master)
- 		ip_conntrack_put(master);
-
-+#ifdef CONFIG_VE_IPTABLES
-+	atomic_dec(&(ct->ct_env->_ip_conntrack_count));
-+#else
-+	atomic_dec(&ip_conntrack_count);
-+#endif
- 	DEBUGP("destroy_conntrack: returning ct=%p to slab\n", ct);
- 	kmem_cache_free(ip_conntrack_cachep, ct);
--	atomic_dec(&ip_conntrack_count);
- }
-
- static void death_by_timeout(unsigned long ul_conntrack)
-@@ -376,7 +419,7 @@ __ip_conntrack_find(const struct ip_conn
- 	unsigned int hash = hash_conntrack(tuple);
-
- 	MUST_BE_READ_LOCKED(&ip_conntrack_lock);
--	h = LIST_FIND(&ip_conntrack_hash[hash],
-+	h = LIST_FIND(&ve_ip_conntrack_hash[hash],
- 		     conntrack_tuple_cmp,
- 		     struct ip_conntrack_tuple_hash *,
- 		     tuple, ignored_conntrack);
-@@ -454,17 +497,23 @@ __ip_conntrack_confirm(struct nf_ct_info
- 	/* See if there's one in the list already, including reverse:
- 	   NAT could have grabbed it without realizing, since we're
- 	   not in the hash. If there is, we lost race. */
--	if (!LIST_FIND(&ip_conntrack_hash[hash],
-+	if (!LIST_FIND(&ve_ip_conntrack_hash[hash],
- 		       conntrack_tuple_cmp,
- 		       struct ip_conntrack_tuple_hash *,
- 		       &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, NULL)
--	    && !LIST_FIND(&ip_conntrack_hash[repl_hash],
-+	    && !LIST_FIND(&ve_ip_conntrack_hash[repl_hash],
- 			  conntrack_tuple_cmp,
- 			  struct ip_conntrack_tuple_hash *,
- 			  &ct->tuplehash[IP_CT_DIR_REPLY].tuple, NULL)) {
--		list_prepend(&ip_conntrack_hash[hash],
-+		/*
-+		 * Just to avoid one ct to be inserted in 2 or more
-+		 * ve_ip_conntrack_hash'es... Otherwise it can crash.
-+		 */
-+		if (is_confirmed(ct))
-+			goto ok;
-+		list_prepend(&ve_ip_conntrack_hash[hash],
- 			     &ct->tuplehash[IP_CT_DIR_ORIGINAL]);
--		list_prepend(&ip_conntrack_hash[repl_hash],
-+		list_prepend(&ve_ip_conntrack_hash[repl_hash],
- 			     &ct->tuplehash[IP_CT_DIR_REPLY]);
- 		/* Timer relative to confirmation time, not original
- 		   setting time, otherwise we'd get timer wrap in
-@@ -473,6 +522,7 @@ __ip_conntrack_confirm(struct nf_ct_info
- 		add_timer(&ct->timeout);
- 		atomic_inc(&ct->ct_general.use);
- 		set_bit(IPS_CONFIRMED_BIT, &ct->status);
-+ok:
- 		WRITE_UNLOCK(&ip_conntrack_lock);
- 		return NF_ACCEPT;
- 	}
-@@ -611,11 +661,45 @@ static inline int helper_cmp(const struc
-
- struct ip_conntrack_helper *ip_ct_find_helper(const struct ip_conntrack_tuple *tuple)
- {
--	return LIST_FIND(&helpers, helper_cmp,
-+	return LIST_FIND(&ve_ip_conntrack_helpers, helper_cmp,
- 			 struct ip_conntrack_helper *,
- 			 tuple);
- }
-
-+struct ip_conntrack *
-+ip_conntrack_alloc(struct user_beancounter *ub)
-+{
-+	int i;
-+	struct ip_conntrack *conntrack;
-+	struct user_beancounter *old_ub;
-+
-+	old_ub = set_exec_ub(ub);
-+	conntrack = kmem_cache_alloc(ip_conntrack_cachep, GFP_ATOMIC);
-+	(void)set_exec_ub(old_ub);
-+	if (unlikely(!conntrack)) {
-+		DEBUGP("Can't allocate conntrack.\n");
-+		return NULL;
-+	}
-+
-+	memset(conntrack, 0, sizeof(*conntrack));
-+	atomic_set(&conntrack->ct_general.use, 1);
-+	conntrack->ct_general.destroy = destroy_conntrack;
-+	for (i=0; i < IP_CT_NUMBER; i++)
-+		conntrack->infos[i].master = &conntrack->ct_general;
-+
-+	/* Don't set timer yet: wait for confirmation */
-+	init_timer(&conntrack->timeout);
-+	conntrack->timeout.data = (unsigned long)conntrack;
-+	conntrack->timeout.function = death_by_timeout;
-+#ifdef CONFIG_VE_IPTABLES
-+	conntrack->ct_env = (get_exec_env())->_ip_conntrack;
-+#endif
-+
-+	INIT_LIST_HEAD(&conntrack->sibling_list);
-+	return conntrack;
-+}
-+EXPORT_SYMBOL(ip_conntrack_alloc);
-+
- /* Allocate a new conntrack: we return -ENOMEM if classification
-    failed due to stress. Otherwise it really is unclassifiable. */
- static struct ip_conntrack_tuple_hash *
-@@ -625,10 +709,11 @@ init_conntrack(const struct ip_conntrack
- {
- 	struct ip_conntrack *conntrack;
- 	struct ip_conntrack_tuple repl_tuple;
-+	struct ip_conntrack_tuple_hash *ret;
- 	size_t hash;
- 	struct ip_conntrack_expect *expected;
--	int i;
- 	static unsigned int drop_next;
-+	struct user_beancounter *ub;
-
- 	if (!ip_conntrack_hash_rnd_initted) {
- 		get_random_bytes(&ip_conntrack_hash_rnd, 4);
-@@ -637,19 +722,19 @@ init_conntrack(const struct ip_conntrack
-
- 	hash = hash_conntrack(tuple);
-
--	if (ip_conntrack_max &&
--	    atomic_read(&ip_conntrack_count) >= ip_conntrack_max) {
-+	if (ve_ip_conntrack_max &&
-+	    atomic_read(&ve_ip_conntrack_count) >= ve_ip_conntrack_max) {
- 		/* Try dropping from random chain, or else from the
- 		   chain about to put into (in case they're trying to
- 		   bomb one hash chain). */
- 		unsigned int next = (drop_next++)%ip_conntrack_htable_size;
-
--		if (!early_drop(&ip_conntrack_hash[next])
--		    && !early_drop(&ip_conntrack_hash[hash])) {
-+		if (!early_drop(&ve_ip_conntrack_hash[next])
-+		    && !early_drop(&ve_ip_conntrack_hash[hash])) {
- 			if (net_ratelimit())
--				printk(KERN_WARNING
--				       "ip_conntrack: table full, dropping"
--				       " packet.\n");
-+				ve_printk(VE_LOG_BOTH, KERN_WARNING
-+				       "ip_conntrack: VPS %d: table full, dropping"
-+				       " packet.\n", VEID(get_exec_env()));
- 			return ERR_PTR(-ENOMEM);
- 		}
- 	}
-@@ -659,37 +744,33 @@ init_conntrack(const struct ip_conntrack
- 		return NULL;
- 	}
-
--	conntrack = kmem_cache_alloc(ip_conntrack_cachep, GFP_ATOMIC);
--	if (!conntrack) {
--		DEBUGP("Can't allocate conntrack.\n");
--		return ERR_PTR(-ENOMEM);
--	}
-+#ifdef CONFIG_USER_RESOURCE
-+	if (skb->dev != NULL)		/* received skb */
-+		ub = netdev_bc(skb->dev)->exec_ub;
-+	else if (skb->sk != NULL)	/* sent skb */
-+		ub = sock_bc(skb->sk)->ub;
-+	else
-+#endif
-+		ub = NULL;
-+
-+	ret = ERR_PTR(-ENOMEM);
-+	conntrack = ip_conntrack_alloc(ub);
-+	if (!conntrack)
-+		goto out;
-
--	memset(conntrack, 0, sizeof(*conntrack));
--	atomic_set(&conntrack->ct_general.use, 1);
--	conntrack->ct_general.destroy = destroy_conntrack;
- 	conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *tuple;
- 	conntrack->tuplehash[IP_CT_DIR_ORIGINAL].ctrack = conntrack;
- 	conntrack->tuplehash[IP_CT_DIR_REPLY].tuple = repl_tuple;
- 	conntrack->tuplehash[IP_CT_DIR_REPLY].ctrack = conntrack;
--	for (i=0; i < IP_CT_NUMBER; i++)
--		conntrack->infos[i].master = &conntrack->ct_general;
-
--	if (!protocol->new(conntrack, skb)) {
--		kmem_cache_free(ip_conntrack_cachep, conntrack);
--		return NULL;
--	}
--	/* Don't set timer yet: wait for confirmation */
--	init_timer(&conntrack->timeout);
--	conntrack->timeout.data = (unsigned long)conntrack;
--	conntrack->timeout.function = death_by_timeout;
--
--	INIT_LIST_HEAD(&conntrack->sibling_list);
-+	ret = NULL;
-+	if (!protocol->new(conntrack, skb))
-+		goto free_ct;
-
- 	WRITE_LOCK(&ip_conntrack_lock);
- 	/* Need finding and deleting of expected ONLY if we win race */
- 	READ_LOCK(&ip_conntrack_expect_tuple_lock);
--	expected = LIST_FIND(&ip_conntrack_expect_list, expect_cmp,
-+	expected = LIST_FIND(&ve_ip_conntrack_expect_list, expect_cmp,
- 			     struct ip_conntrack_expect *, tuple);
- 	READ_UNLOCK(&ip_conntrack_expect_tuple_lock);
-
-@@ -718,16 +799,21 @@ init_conntrack(const struct ip_conntrack
- 		__set_bit(IPS_EXPECTED_BIT, &conntrack->status);
- 		conntrack->master = expected;
- 		expected->sibling = conntrack;
--		LIST_DELETE(&ip_conntrack_expect_list, expected);
-+		LIST_DELETE(&ve_ip_conntrack_expect_list, expected);
- 		expected->expectant->expecting--;
- 		nf_conntrack_get(&master_ct(conntrack)->infos[0]);
- 	}
--	atomic_inc(&ip_conntrack_count);
-+	atomic_inc(&ve_ip_conntrack_count);
- 	WRITE_UNLOCK(&ip_conntrack_lock);
-
- 	if (expected && expected->expectfn)
- 		expected->expectfn(conntrack);
- 	return &conntrack->tuplehash[IP_CT_DIR_ORIGINAL];
-+
-+free_ct:
-+	kmem_cache_free(ip_conntrack_cachep, conntrack);
-+out:
-+	return ret;
- }
-
- /* On success, returns conntrack ptr, sets skb->nfct and ctinfo */ -@@ -937,7 +1023,7 @@ ip_conntrack_expect_alloc(void) - return new; - } - --static void -+void - ip_conntrack_expect_insert(struct ip_conntrack_expect *new, - struct ip_conntrack *related_to) - { -@@ -949,7 +1035,7 @@ ip_conntrack_expect_insert(struct ip_con - /* add to expected list for this connection */ - list_add_tail(&new->expected_list, &related_to->sibling_list); - /* add to global list of expectations */ -- list_prepend(&ip_conntrack_expect_list, &new->list); -+ list_prepend(&ve_ip_conntrack_expect_list, &new->list); - /* add and start timer if required */ - if (related_to->helper->timeout) { - init_timer(&new->timeout); -@@ -961,6 +1047,7 @@ ip_conntrack_expect_insert(struct ip_con - } - related_to->expecting++; - } -+EXPORT_SYMBOL(ip_conntrack_expect_insert); - - /* Add a related connection. */ - int ip_conntrack_expect_related(struct ip_conntrack_expect *expect, -@@ -977,7 +1064,7 @@ int ip_conntrack_expect_related(struct i - DEBUGP("tuple: "); DUMP_TUPLE(&expect->tuple); - DEBUGP("mask: "); DUMP_TUPLE(&expect->mask); - -- old = LIST_FIND(&ip_conntrack_expect_list, resent_expect, -+ old = LIST_FIND(&ve_ip_conntrack_expect_list, resent_expect, - struct ip_conntrack_expect *, &expect->tuple, - &expect->mask); - if (old) { -@@ -1043,7 +1130,7 @@ int ip_conntrack_expect_related(struct i - */ - unexpect_related(old); - ret = -EPERM; -- } else if (LIST_FIND(&ip_conntrack_expect_list, expect_clash, -+ } else if (LIST_FIND(&ve_ip_conntrack_expect_list, expect_clash, - struct ip_conntrack_expect *, &expect->tuple, - &expect->mask)) { - WRITE_UNLOCK(&ip_conntrack_lock); -@@ -1077,7 +1164,7 @@ int ip_conntrack_change_expect(struct ip - /* Never seen before */ - DEBUGP("change expect: never seen before\n"); - if (!ip_ct_tuple_equal(&expect->tuple, newtuple) -- && LIST_FIND(&ip_conntrack_expect_list, expect_clash, -+ && LIST_FIND(&ve_ip_conntrack_expect_list, expect_clash, - struct 
ip_conntrack_expect *, newtuple, &expect->mask)) { - /* Force NAT to find an unused tuple */ - ret = -1; -@@ -1128,12 +1215,42 @@ int ip_conntrack_alter_reply(struct ip_c - int ip_conntrack_helper_register(struct ip_conntrack_helper *me) - { - WRITE_LOCK(&ip_conntrack_lock); -- list_prepend(&helpers, me); -+ list_prepend(&ve_ip_conntrack_helpers, me); - WRITE_UNLOCK(&ip_conntrack_lock); - - return 0; - } - -+int visible_ip_conntrack_helper_register(struct ip_conntrack_helper *me) -+{ -+ int ret; -+ struct module *mod = me->me; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct ip_conntrack_helper *tmp; -+ __module_get(mod); -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ip_conntrack_helper), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, me, sizeof(struct ip_conntrack_helper)); -+ me = tmp; -+ } -+ -+ ret = ip_conntrack_helper_register(me); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())){ -+ kfree(me); -+nomem: -+ module_put(mod); -+ } -+ return ret; -+} -+ - static inline int unhelp(struct ip_conntrack_tuple_hash *i, - const struct ip_conntrack_helper *me) - { -@@ -1152,11 +1269,11 @@ void ip_conntrack_helper_unregister(stru - - /* Need write lock here, to delete helper. */ - WRITE_LOCK(&ip_conntrack_lock); -- LIST_DELETE(&helpers, me); -+ LIST_DELETE(&ve_ip_conntrack_helpers, me); - - /* Get rid of expecteds, set helpers to NULL. 
*/ - for (i = 0; i < ip_conntrack_htable_size; i++) -- LIST_FIND_W(&ip_conntrack_hash[i], unhelp, -+ LIST_FIND_W(&ve_ip_conntrack_hash[i], unhelp, - struct ip_conntrack_tuple_hash *, me); - WRITE_UNLOCK(&ip_conntrack_lock); - -@@ -1164,6 +1281,29 @@ void ip_conntrack_helper_unregister(stru - synchronize_net(); - } - -+void visible_ip_conntrack_helper_unregister(struct ip_conntrack_helper *me) -+{ -+ struct ip_conntrack_helper *i; -+ -+ READ_LOCK(&ip_conntrack_lock); -+ list_for_each_entry(i, &ve_ip_conntrack_helpers, list) { -+ if (i->name == me->name) { -+ me = i; -+ break; -+ } -+ } -+ READ_UNLOCK(&ip_conntrack_lock); -+ if (me != i) -+ return; -+ -+ ip_conntrack_helper_unregister(me); -+ -+ if (!ve_is_super(get_exec_env())) { -+ module_put(me->me); -+ kfree(me); -+ } -+} -+ - /* Refresh conntrack for this many jiffies. */ - void ip_ct_refresh(struct ip_conntrack *ct, unsigned long extra_jiffies) - { -@@ -1185,7 +1325,7 @@ void ip_ct_refresh(struct ip_conntrack * - - /* Returns new sk_buff, or NULL */ - struct sk_buff * --ip_ct_gather_frags(struct sk_buff *skb) -+ip_ct_gather_frags(struct sk_buff *skb, u_int32_t user) - { - struct sock *sk = skb->sk; - #ifdef CONFIG_NETFILTER_DEBUG -@@ -1197,7 +1337,7 @@ ip_ct_gather_frags(struct sk_buff *skb) - } - - local_bh_disable(); -- skb = ip_defrag(skb); -+ skb = ip_defrag(skb, user); - local_bh_enable(); - - if (!skb) { -@@ -1257,7 +1397,7 @@ get_next_corpse(int (*kill)(const struct - - READ_LOCK(&ip_conntrack_lock); - for (; !h && *bucket < ip_conntrack_htable_size; (*bucket)++) { -- h = LIST_FIND(&ip_conntrack_hash[*bucket], do_kill, -+ h = LIST_FIND(&ve_ip_conntrack_hash[*bucket], do_kill, - struct ip_conntrack_tuple_hash *, kill, data); - } - if (h) -@@ -1295,6 +1435,9 @@ getorigdst(struct sock *sk, int optval, - struct ip_conntrack_tuple_hash *h; - struct ip_conntrack_tuple tuple; - -+ if (!get_exec_env()->_ip_conntrack) -+ return -ENOPROTOOPT; -+ - IP_CT_TUPLE_U_BLANK(&tuple); - tuple.src.ip = inet->rcv_saddr; - 
tuple.src.u.tcp.port = inet->sport; -@@ -1354,6 +1497,9 @@ static int kill_all(const struct ip_conn - supposed to kill the mall. */ - void ip_conntrack_cleanup(void) - { -+#ifdef CONFIG_VE -+ struct ve_struct *env; -+#endif - ip_ct_attach = NULL; - /* This makes sure all current packets have passed through - netfilter framework. Roll on, two-stage module -@@ -1362,22 +1508,45 @@ void ip_conntrack_cleanup(void) - - i_see_dead_people: - ip_ct_selective_cleanup(kill_all, NULL); -- if (atomic_read(&ip_conntrack_count) != 0) { -+ if (atomic_read(&ve_ip_conntrack_count) != 0) { - schedule(); - goto i_see_dead_people; - } - -+#ifdef CONFIG_VE_IPTABLES -+ env = get_exec_env(); -+ if (ve_is_super(env)) { -+ kmem_cache_destroy(ip_conntrack_cachep); -+ nf_unregister_sockopt(&so_getorigdst); -+ } else { -+ visible_ip_conntrack_protocol_unregister( -+ &ip_conntrack_protocol_icmp); -+ visible_ip_conntrack_protocol_unregister( -+ &ip_conntrack_protocol_udp); -+ visible_ip_conntrack_protocol_unregister( -+ &ip_conntrack_protocol_tcp); -+ } -+ vfree(ve_ip_conntrack_hash); -+ ve_ip_conntrack_hash = NULL; -+ INIT_LIST_HEAD(&ve_ip_conntrack_expect_list); -+ INIT_LIST_HEAD(&ve_ip_conntrack_protocol_list); -+ INIT_LIST_HEAD(&ve_ip_conntrack_helpers); -+ ve_ip_conntrack_max = 0; -+ atomic_set(&ve_ip_conntrack_count, 0); -+ kfree(env->_ip_conntrack); -+ env->_ip_conntrack = NULL; -+#else - kmem_cache_destroy(ip_conntrack_cachep); - vfree(ip_conntrack_hash); - nf_unregister_sockopt(&so_getorigdst); -+#endif /*CONFIG_VE_IPTABLES*/ - } - - static int hashsize; - MODULE_PARM(hashsize, "i"); - --int __init ip_conntrack_init(void) -+static int ip_conntrack_cache_create(void) - { -- unsigned int i; - int ret; - - /* Idea from tcp.c: use 1/16384 of memory. 
On i386: 32MB -@@ -1393,33 +1562,135 @@ int __init ip_conntrack_init(void) - if (ip_conntrack_htable_size < 16) - ip_conntrack_htable_size = 16; - } -- ip_conntrack_max = 8 * ip_conntrack_htable_size; -+ ve_ip_conntrack_max = 8 * ip_conntrack_htable_size; - - printk("ip_conntrack version %s (%u buckets, %d max)" - " - %Zd bytes per conntrack\n", IP_CONNTRACK_VERSION, -- ip_conntrack_htable_size, ip_conntrack_max, -+ ip_conntrack_htable_size, ve_ip_conntrack_max, - sizeof(struct ip_conntrack)); - - ret = nf_register_sockopt(&so_getorigdst); - if (ret != 0) { - printk(KERN_ERR "Unable to register netfilter socket option\n"); -- return ret; -- } -- -- ip_conntrack_hash = vmalloc(sizeof(struct list_head) -- * ip_conntrack_htable_size); -- if (!ip_conntrack_hash) { -- printk(KERN_ERR "Unable to create ip_conntrack_hash\n"); -- goto err_unreg_sockopt; -+ goto out_sockopt; - } - -+ ret = -ENOMEM; - ip_conntrack_cachep = kmem_cache_create("ip_conntrack", -- sizeof(struct ip_conntrack), 0, -- SLAB_HWCACHE_ALIGN, NULL, NULL); -+ sizeof(struct ip_conntrack), 0, -+ SLAB_HWCACHE_ALIGN | SLAB_UBC, -+ NULL, NULL); - if (!ip_conntrack_cachep) { - printk(KERN_ERR "Unable to create ip_conntrack slab cache\n"); -- goto err_free_hash; -+ goto err_unreg_sockopt; - } -+ -+ return 0; -+ -+err_unreg_sockopt: -+ nf_unregister_sockopt(&so_getorigdst); -+out_sockopt: -+ return ret; -+} -+ -+/* From ip_conntrack_proto_tcp.c */ -+extern unsigned long ip_ct_tcp_timeout_syn_sent; -+extern unsigned long ip_ct_tcp_timeout_syn_recv; -+extern unsigned long ip_ct_tcp_timeout_established; -+extern unsigned long ip_ct_tcp_timeout_fin_wait; -+extern unsigned long ip_ct_tcp_timeout_close_wait; -+extern unsigned long ip_ct_tcp_timeout_last_ack; -+extern unsigned long ip_ct_tcp_timeout_time_wait; -+extern unsigned long ip_ct_tcp_timeout_close; -+ -+/* From ip_conntrack_proto_udp.c */ -+extern unsigned long ip_ct_udp_timeout; -+extern unsigned long ip_ct_udp_timeout_stream; -+ -+/* From 
ip_conntrack_proto_icmp.c */ -+extern unsigned long ip_ct_icmp_timeout; -+ -+/* From ip_conntrack_proto_icmp.c */ -+extern unsigned long ip_ct_generic_timeout; -+ -+int ip_conntrack_init(void) -+{ -+ unsigned int i; -+ int ret; -+ -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *env; -+ -+ env = get_exec_env(); -+ ret = -ENOMEM; -+ env->_ip_conntrack = -+ kmalloc(sizeof(struct ve_ip_conntrack), GFP_KERNEL); -+ if (!env->_ip_conntrack) -+ goto out; -+ memset(env->_ip_conntrack, 0, sizeof(struct ve_ip_conntrack)); -+ if (ve_is_super(env)) { -+ ret = ip_conntrack_cache_create(); -+ if (ret) -+ goto cache_fail; -+ } else -+ ve_ip_conntrack_max = 8 * ip_conntrack_htable_size; -+#else /* CONFIG_VE_IPTABLES */ -+ ret = ip_conntrack_cache_create(); -+ if (ret) -+ goto out; -+#endif -+ -+ ret = -ENOMEM; -+ ve_ip_conntrack_hash = ub_vmalloc(sizeof(struct list_head) -+ * ip_conntrack_htable_size); -+ if (!ve_ip_conntrack_hash) { -+ printk(KERN_ERR "Unable to create ip_conntrack_hash\n"); -+ goto err_free_cache; -+ } -+ -+#ifdef CONFIG_VE_IPTABLES -+ INIT_LIST_HEAD(&ve_ip_conntrack_expect_list); -+ INIT_LIST_HEAD(&ve_ip_conntrack_protocol_list); -+ INIT_LIST_HEAD(&ve_ip_conntrack_helpers); -+ -+ ve_ip_conntrack_max = ip_conntrack_max; -+ ve_ip_ct_tcp_timeouts[1] = ip_ct_tcp_timeout_established; -+ ve_ip_ct_tcp_timeouts[2] = ip_ct_tcp_timeout_syn_sent; -+ ve_ip_ct_tcp_timeouts[3] = ip_ct_tcp_timeout_syn_recv; -+ ve_ip_ct_tcp_timeouts[4] = ip_ct_tcp_timeout_fin_wait; -+ ve_ip_ct_tcp_timeouts[5] = ip_ct_tcp_timeout_time_wait; -+ ve_ip_ct_tcp_timeouts[6] = ip_ct_tcp_timeout_close; -+ ve_ip_ct_tcp_timeouts[7] = ip_ct_tcp_timeout_close_wait; -+ ve_ip_ct_tcp_timeouts[8] = ip_ct_tcp_timeout_last_ack; -+ ve_ip_ct_udp_timeout = ip_ct_udp_timeout; -+ ve_ip_ct_udp_timeout_stream = ip_ct_udp_timeout_stream; -+ ve_ip_ct_icmp_timeout = ip_ct_icmp_timeout; -+ ve_ip_ct_generic_timeout = ip_ct_generic_timeout; -+ -+ if (!ve_is_super(env)) { -+ ret = visible_ip_conntrack_protocol_register( -+ 
&ip_conntrack_protocol_tcp); -+ if (ret) -+ goto tcp_fail; -+ ret = visible_ip_conntrack_protocol_register( -+ &ip_conntrack_protocol_udp); -+ if (ret) -+ goto udp_fail; -+ ret = visible_ip_conntrack_protocol_register( -+ &ip_conntrack_protocol_icmp); -+ if (ret) -+ goto icmp_fail; -+ } else { -+ WRITE_LOCK(&ip_conntrack_lock); -+ list_append(&ve_ip_conntrack_protocol_list, -+ &ip_conntrack_protocol_tcp); -+ list_append(&ve_ip_conntrack_protocol_list, -+ &ip_conntrack_protocol_udp); -+ list_append(&ve_ip_conntrack_protocol_list, -+ &ip_conntrack_protocol_icmp); -+ WRITE_UNLOCK(&ip_conntrack_lock); -+ } -+#else - /* Don't NEED lock here, but good form anyway. */ - WRITE_LOCK(&ip_conntrack_lock); - /* Sew in builtin protocols. */ -@@ -1427,12 +1698,18 @@ int __init ip_conntrack_init(void) - list_append(&protocol_list, &ip_conntrack_protocol_udp); - list_append(&protocol_list, &ip_conntrack_protocol_icmp); - WRITE_UNLOCK(&ip_conntrack_lock); -+#endif /* CONFIG_VE_IPTABLES */ - - for (i = 0; i < ip_conntrack_htable_size; i++) -- INIT_LIST_HEAD(&ip_conntrack_hash[i]); -+ INIT_LIST_HEAD(&ve_ip_conntrack_hash[i]); - -+#ifdef CONFIG_VE_IPTABLES -+ if (ve_is_super(env)) -+ ip_ct_attach = ip_conntrack_attach; -+#else - /* For use by ipt_REJECT */ - ip_ct_attach = ip_conntrack_attach; -+#endif - - /* Set up fake conntrack: - - to never be deleted, not in any hashes */ -@@ -1445,12 +1722,29 @@ int __init ip_conntrack_init(void) - ip_conntrack_untracked.infos[IP_CT_RELATED + IP_CT_IS_REPLY].master = - &ip_conntrack_untracked.ct_general; - -- return ret; -+ return 0; - --err_free_hash: -- vfree(ip_conntrack_hash); --err_unreg_sockopt: -+#ifdef CONFIG_VE_IPTABLES -+icmp_fail: -+ visible_ip_conntrack_protocol_unregister(&ip_conntrack_protocol_udp); -+udp_fail: -+ visible_ip_conntrack_protocol_unregister(&ip_conntrack_protocol_tcp); -+tcp_fail: -+ vfree(ve_ip_conntrack_hash); -+ ve_ip_conntrack_hash = NULL; -+err_free_cache: -+ if (ve_is_super(env)) { -+ 
kmem_cache_destroy(ip_conntrack_cachep); -+ nf_unregister_sockopt(&so_getorigdst); -+ } -+cache_fail: -+ kfree(env->_ip_conntrack); -+ env->_ip_conntrack = NULL; -+#else -+err_free_cache: -+ kmem_cache_destroy(ip_conntrack_cachep); - nf_unregister_sockopt(&so_getorigdst); -- -- return -ENOMEM; -+#endif /* CONFIG_VE_IPTABLES */ -+out: -+ return ret; - } -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_ftp.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_ftp.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_ftp.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_ftp.c 2006-05-11 13:05:41.000000000 +0400 -@@ -15,6 +15,7 @@ - #include <linux/ctype.h> - #include <net/checksum.h> - #include <net/tcp.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/lockhelp.h> - #include <linux/netfilter_ipv4/ip_conntrack_helper.h> -@@ -27,17 +28,25 @@ MODULE_DESCRIPTION("ftp connection track - /* This is slow, but it's simple. --RR */ - static char ftp_buffer[65536]; - --DECLARE_LOCK(ip_ftp_lock); -+static DECLARE_LOCK(ip_ftp_lock); - struct module *ip_conntrack_ftp = THIS_MODULE; - - #define MAX_PORTS 8 - static int ports[MAX_PORTS]; --static int ports_c; - MODULE_PARM(ports, "1-" __MODULE_STRING(MAX_PORTS) "i"); - - static int loose; - MODULE_PARM(loose, "i"); - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ports_c \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_ftp_ports_c) -+#else -+static int ports_c = 0; -+#define ve_ports_c ports_c -+#endif -+ - #if 0 - #define DEBUGP printk - #else -@@ -375,6 +384,7 @@ static int help(struct sk_buff *skb, - problem (DMZ machines opening holes to internal - networks, or the packet filter itself). 
*/ - if (!loose) { -+ ip_conntrack_expect_put(exp); - ret = NF_ACCEPT; - goto out; - } -@@ -404,15 +414,43 @@ static int help(struct sk_buff *skb, - static struct ip_conntrack_helper ftp[MAX_PORTS]; - static char ftp_names[MAX_PORTS][10]; - --/* Not __exit: called from init() */ --static void fini(void) -+void fini_iptable_ftp(void) - { - int i; -- for (i = 0; i < ports_c; i++) { -+ -+ for (i = 0; i < ve_ports_c; i++) { - DEBUGP("ip_ct_ftp: unregistering helper for port %d\n", - ports[i]); -- ip_conntrack_helper_unregister(&ftp[i]); -+ visible_ip_conntrack_helper_unregister(&ftp[i]); -+ } -+ ve_ports_c = 0; -+} -+ -+int init_iptable_ftp(void) -+{ -+ int i, ret; -+ -+ ve_ports_c = 0; -+ for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { -+ DEBUGP("ip_ct_ftp: registering helper for port %d\n", -+ ports[i]); -+ ret = visible_ip_conntrack_helper_register(&ftp[i]); -+ if (ret) { -+ fini_iptable_ftp(); -+ return ret; -+ } -+ ve_ports_c++; - } -+ return 0; -+} -+ -+/* Not __exit: called from init() */ -+static void fini(void) -+{ -+ KSYMMODUNRESOLVE(ip_conntrack_ftp); -+ KSYMUNRESOLVE(init_iptable_ftp); -+ KSYMUNRESOLVE(fini_iptable_ftp); -+ fini_iptable_ftp(); - } - - static int __init init(void) -@@ -423,6 +461,7 @@ static int __init init(void) - if (ports[0] == 0) - ports[0] = FTP_PORT; - -+ ve_ports_c = 0; - for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { - ftp[i].tuple.src.u.tcp.port = htons(ports[i]); - ftp[i].tuple.dst.protonum = IPPROTO_TCP; -@@ -443,19 +482,22 @@ static int __init init(void) - - DEBUGP("ip_ct_ftp: registering helper for port %d\n", - ports[i]); -- ret = ip_conntrack_helper_register(&ftp[i]); -+ ret = visible_ip_conntrack_helper_register(&ftp[i]); - - if (ret) { - fini(); - return ret; - } -- ports_c++; -+ ve_ports_c++; - } -+ -+ KSYMRESOLVE(init_iptable_ftp); -+ KSYMRESOLVE(fini_iptable_ftp); -+ KSYMMODRESOLVE(ip_conntrack_ftp); - return 0; - } - - PROVIDES_CONNTRACK(ftp); --EXPORT_SYMBOL(ip_ftp_lock); - - module_init(init); - module_exit(fini); 
-diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_irc.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_irc.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_irc.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_irc.c 2006-05-11 13:05:41.000000000 +0400 -@@ -28,6 +28,7 @@ - #include <linux/ip.h> - #include <net/checksum.h> - #include <net/tcp.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/lockhelp.h> - #include <linux/netfilter_ipv4/ip_conntrack_helper.h> -@@ -35,11 +36,11 @@ - - #define MAX_PORTS 8 - static int ports[MAX_PORTS]; --static int ports_c; - static int max_dcc_channels = 8; - static unsigned int dcc_timeout = 300; - /* This is slow, but it's simple. --RR */ - static char irc_buffer[65536]; -+static DECLARE_LOCK(irc_buffer_lock); - - MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>"); - MODULE_DESCRIPTION("IRC (DCC) connection tracking helper"); -@@ -54,9 +55,17 @@ MODULE_PARM_DESC(dcc_timeout, "timeout o - static char *dccprotos[] = { "SEND ", "CHAT ", "MOVE ", "TSEND ", "SCHAT " }; - #define MINMATCHLEN 5 - --DECLARE_LOCK(ip_irc_lock); - struct module *ip_conntrack_irc = THIS_MODULE; - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ports_c \ -+ (get_exec_env()->_ip_conntrack->_ip_conntrack_irc_ports_c) -+#else -+static int ports_c = 0; -+#define ve_ports_c ports_c -+#endif -+ - #if 0 - #define DEBUGP(format, args...) printk(KERN_DEBUG "%s:%s:" format, \ - __FILE__, __FUNCTION__ , ## args) -@@ -134,7 +143,7 @@ static int help(struct sk_buff *skb, - if (dataoff >= skb->len) - return NF_ACCEPT; - -- LOCK_BH(&ip_irc_lock); -+ LOCK_BH(&irc_buffer_lock); - skb_copy_bits(skb, dataoff, irc_buffer, skb->len - dataoff); - - data = irc_buffer; -@@ -227,7 +236,7 @@ static int help(struct sk_buff *skb, - } /* while data < ... 
*/ - - out: -- UNLOCK_BH(&ip_irc_lock); -+ UNLOCK_BH(&irc_buffer_lock); - return NF_ACCEPT; - } - -@@ -236,6 +245,37 @@ static char irc_names[MAX_PORTS][10]; - - static void fini(void); - -+void fini_iptable_irc(void) -+{ -+ int i; -+ -+ for (i = 0; i < ve_ports_c; i++) { -+ DEBUGP("unregistering port %d\n", -+ ports[i]); -+ visible_ip_conntrack_helper_unregister(&irc_helpers[i]); -+ } -+ ve_ports_c = 0; -+} -+ -+int init_iptable_irc(void) -+{ -+ int i, ret; -+ -+ ve_ports_c = 0; -+ for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { -+ DEBUGP("port #%d: %d\n", i, ports[i]); -+ ret = visible_ip_conntrack_helper_register(&irc_helpers[i]); -+ if (ret) { -+ printk("ip_conntrack_irc: ERROR registering port %d\n", -+ ports[i]); -+ fini_iptable_irc(); -+ return -EBUSY; -+ } -+ ve_ports_c++; -+ } -+ return 0; -+} -+ - static int __init init(void) - { - int i, ret; -@@ -255,6 +295,7 @@ static int __init init(void) - if (ports[0] == 0) - ports[0] = IRC_PORT; - -+ ve_ports_c = 0; - for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { - hlpr = &irc_helpers[i]; - hlpr->tuple.src.u.tcp.port = htons(ports[i]); -@@ -276,7 +317,7 @@ static int __init init(void) - - DEBUGP("port #%d: %d\n", i, ports[i]); - -- ret = ip_conntrack_helper_register(hlpr); -+ ret = visible_ip_conntrack_helper_register(hlpr); - - if (ret) { - printk("ip_conntrack_irc: ERROR registering port %d\n", -@@ -284,8 +325,12 @@ static int __init init(void) - fini(); - return -EBUSY; - } -- ports_c++; -+ ve_ports_c++; - } -+ -+ KSYMRESOLVE(init_iptable_irc); -+ KSYMRESOLVE(fini_iptable_irc); -+ KSYMMODRESOLVE(ip_conntrack_irc); - return 0; - } - -@@ -293,16 +338,13 @@ static int __init init(void) - * it is needed by the init function */ - static void fini(void) - { -- int i; -- for (i = 0; i < ports_c; i++) { -- DEBUGP("unregistering port %d\n", -- ports[i]); -- ip_conntrack_helper_unregister(&irc_helpers[i]); -- } -+ KSYMMODUNRESOLVE(ip_conntrack_irc); -+ KSYMUNRESOLVE(init_iptable_irc); -+ KSYMUNRESOLVE(fini_iptable_irc); 
-+ fini_iptable_irc(); - } - - PROVIDES_CONNTRACK(irc); --EXPORT_SYMBOL(ip_irc_lock); - - module_init(init); - module_exit(fini); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_proto_tcp.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_proto_tcp.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_proto_tcp.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_proto_tcp.c 2006-05-11 13:05:41.000000000 +0400 -@@ -66,7 +66,7 @@ unsigned long ip_ct_tcp_timeout_last_ack - unsigned long ip_ct_tcp_timeout_time_wait = 2 MINS; - unsigned long ip_ct_tcp_timeout_close = 10 SECS; - --static unsigned long * tcp_timeouts[] -+unsigned long * tcp_timeouts[] - = { NULL, /* TCP_CONNTRACK_NONE */ - &ip_ct_tcp_timeout_established, /* TCP_CONNTRACK_ESTABLISHED, */ - &ip_ct_tcp_timeout_syn_sent, /* TCP_CONNTRACK_SYN_SENT, */ -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_standalone.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_standalone.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_conntrack_standalone.c 2004-08-14 14:55:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_conntrack_standalone.c 2006-05-11 13:05:41.000000000 +0400 -@@ -25,6 +25,7 @@ - #endif - #include <net/checksum.h> - #include <net/ip.h> -+#include <linux/nfcalls.h> - - #define ASSERT_READ_LOCK(x) MUST_BE_READ_LOCKED(&ip_conntrack_lock) - #define ASSERT_WRITE_LOCK(x) MUST_BE_WRITE_LOCKED(&ip_conntrack_lock) -@@ -43,6 +44,9 @@ - - MODULE_LICENSE("GPL"); - -+int ip_conntrack_enable_ve0 = 0; -+MODULE_PARM(ip_conntrack_enable_ve0, "i"); -+ - static int kill_proto(const struct ip_conntrack *i, void *data) - { - return (i->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum == -@@ -153,7 +157,7 @@ list_conntracks(char *buffer, char **sta - READ_LOCK(&ip_conntrack_lock); - /* Traverse hash; print originals then reply. 
*/ - for (i = 0; i < ip_conntrack_htable_size; i++) { -- if (LIST_FIND(&ip_conntrack_hash[i], conntrack_iterate, -+ if (LIST_FIND(&ve_ip_conntrack_hash[i], conntrack_iterate, - struct ip_conntrack_tuple_hash *, - buffer, offset, &upto, &len, length)) - goto finished; -@@ -161,7 +165,7 @@ list_conntracks(char *buffer, char **sta - - /* Now iterate through expecteds. */ - READ_LOCK(&ip_conntrack_expect_tuple_lock); -- list_for_each(e, &ip_conntrack_expect_list) { -+ list_for_each(e, &ve_ip_conntrack_expect_list) { - unsigned int last_len; - struct ip_conntrack_expect *expect - = (struct ip_conntrack_expect *)e; -@@ -208,7 +212,10 @@ static unsigned int ip_conntrack_defrag( - - /* Gather fragments. */ - if ((*pskb)->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) { -- *pskb = ip_ct_gather_frags(*pskb); -+ *pskb = ip_ct_gather_frags(*pskb, -+ hooknum == NF_IP_PRE_ROUTING ? -+ IP_DEFRAG_CONNTRACK_IN : -+ IP_DEFRAG_CONNTRACK_OUT); - if (!*pskb) - return NF_STOLEN; - } -@@ -334,7 +341,25 @@ extern unsigned long ip_ct_icmp_timeout; - /* From ip_conntrack_proto_icmp.c */ - extern unsigned long ip_ct_generic_timeout; - -+#ifdef CONFIG_VE -+#define ve_ip_ct_sysctl_header \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_sysctl_header) -+#define ve_ip_ct_net_table \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_net_table) -+#define ve_ip_ct_ipv4_table \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_ipv4_table) -+#define ve_ip_ct_netfilter_table \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_netfilter_table) -+#define ve_ip_ct_sysctl_table \ -+ (get_exec_env()->_ip_conntrack->_ip_ct_sysctl_table) -+#else - static struct ctl_table_header *ip_ct_sysctl_header; -+#define ve_ip_ct_sysctl_header ip_ct_sysctl_header -+#define ve_ip_ct_net_table ip_ct_net_table -+#define ve_ip_ct_ipv4_table ip_ct_ipv4_table -+#define ve_ip_ct_netfilter_table ip_ct_netfilter_table -+#define ve_ip_ct_sysctl_table ip_ct_sysctl_table -+#endif - - static ctl_table ip_ct_sysctl_table[] = { - { -@@ -491,7 +516,89 @@ static 
ctl_table ip_ct_net_table[] = { - }, - { .ctl_name = 0 } - }; --#endif -+ -+#ifdef CONFIG_VE -+static void ip_conntrack_sysctl_cleanup(void) -+{ -+ if (!ve_is_super(get_exec_env())) { -+ kfree(ve_ip_ct_net_table); -+ kfree(ve_ip_ct_ipv4_table); -+ kfree(ve_ip_ct_netfilter_table); -+ kfree(ve_ip_ct_sysctl_table); -+ } -+ ve_ip_ct_net_table = NULL; -+ ve_ip_ct_ipv4_table = NULL; -+ ve_ip_ct_netfilter_table = NULL; -+ ve_ip_ct_sysctl_table = NULL; -+} -+ -+#define ALLOC_ENVCTL(field,k,label) \ -+ if ( !(field = kmalloc(k*sizeof(ctl_table), GFP_KERNEL)) ) \ -+ goto label; -+static int ip_conntrack_sysctl_init(void) -+{ -+ int i, ret = 0; -+ -+ ret = -ENOMEM; -+ if (ve_is_super(get_exec_env())) { -+ ve_ip_ct_net_table = ip_ct_net_table; -+ ve_ip_ct_ipv4_table = ip_ct_ipv4_table; -+ ve_ip_ct_netfilter_table = ip_ct_netfilter_table; -+ ve_ip_ct_sysctl_table = ip_ct_sysctl_table; -+ } else { -+ /* allocate structures in ve_struct */ -+ ALLOC_ENVCTL(ve_ip_ct_net_table, 2, out); -+ ALLOC_ENVCTL(ve_ip_ct_ipv4_table, 2, nomem_1); -+ ALLOC_ENVCTL(ve_ip_ct_netfilter_table, 3, nomem_2); -+ ALLOC_ENVCTL(ve_ip_ct_sysctl_table, 15, nomem_3); -+ -+ memcpy(ve_ip_ct_net_table, ip_ct_net_table, -+ 2*sizeof(ctl_table)); -+ memcpy(ve_ip_ct_ipv4_table, ip_ct_ipv4_table, -+ 2*sizeof(ctl_table)); -+ memcpy(ve_ip_ct_netfilter_table, ip_ct_netfilter_table, -+ 3*sizeof(ctl_table)); -+ memcpy(ve_ip_ct_sysctl_table, ip_ct_sysctl_table, -+ 15*sizeof(ctl_table)); -+ -+ ve_ip_ct_net_table[0].child = ve_ip_ct_ipv4_table; -+ ve_ip_ct_ipv4_table[0].child = ve_ip_ct_netfilter_table; -+ ve_ip_ct_netfilter_table[0].child = ve_ip_ct_sysctl_table; -+ } -+ ve_ip_ct_sysctl_table[0].data = &ve_ip_conntrack_max; -+ /* skip ve_ip_ct_sysctl_table[1].data as it is read-only and common -+ * for all environments */ -+ ve_ip_ct_sysctl_table[2].data = &ve_ip_ct_tcp_timeouts[2]; -+ ve_ip_ct_sysctl_table[3].data = &ve_ip_ct_tcp_timeouts[3]; -+ ve_ip_ct_sysctl_table[4].data = &ve_ip_ct_tcp_timeouts[1]; -+ 
ve_ip_ct_sysctl_table[5].data = &ve_ip_ct_tcp_timeouts[4]; -+ ve_ip_ct_sysctl_table[6].data = &ve_ip_ct_tcp_timeouts[7]; -+ ve_ip_ct_sysctl_table[7].data = &ve_ip_ct_tcp_timeouts[8]; -+ ve_ip_ct_sysctl_table[8].data = &ve_ip_ct_tcp_timeouts[5]; -+ ve_ip_ct_sysctl_table[9].data = &ve_ip_ct_tcp_timeouts[6]; -+ ve_ip_ct_sysctl_table[10].data = &ve_ip_ct_udp_timeout; -+ ve_ip_ct_sysctl_table[11].data = &ve_ip_ct_udp_timeout_stream; -+ ve_ip_ct_sysctl_table[12].data = &ve_ip_ct_icmp_timeout; -+ ve_ip_ct_sysctl_table[13].data = &ve_ip_ct_generic_timeout; -+ for (i = 0; i < 14; i++) -+ ve_ip_ct_sysctl_table[i].owner_env = get_exec_env(); -+ return 0; -+ -+nomem_3: -+ kfree(ve_ip_ct_netfilter_table); -+ ve_ip_ct_netfilter_table = NULL; -+nomem_2: -+ kfree(ve_ip_ct_ipv4_table); -+ ve_ip_ct_ipv4_table = NULL; -+nomem_1: -+ kfree(ve_ip_ct_net_table); -+ ve_ip_ct_net_table = NULL; -+out: -+ return ret; -+} -+#endif /*CONFIG_VE*/ -+#endif /*CONFIG_SYSCTL*/ -+ - static int init_or_cleanup(int init) - { - struct proc_dir_entry *proc; -@@ -499,77 +606,115 @@ static int init_or_cleanup(int init) - - if (!init) goto cleanup; - -+ ret = -ENOENT; -+ if (!ve_is_super(get_exec_env())) -+ __module_get(THIS_MODULE); -+ - ret = ip_conntrack_init(); - if (ret < 0) -- goto cleanup_nothing; -+ goto cleanup_unget; -+ -+ if (ve_is_super(get_exec_env()) && !ip_conntrack_enable_ve0) -+ return 0; - -- proc = proc_net_create("ip_conntrack", 0440, list_conntracks); -+ ret = -ENOENT; -+ proc = proc_mkdir("net", NULL); - if (!proc) goto cleanup_init; -+ proc = create_proc_info_entry("net/ip_conntrack", 0440, -+ NULL, list_conntracks); -+ if (!proc) goto cleanup_proc2; - proc->owner = THIS_MODULE; - -- ret = nf_register_hook(&ip_conntrack_defrag_ops); -+ ret = visible_nf_register_hook(&ip_conntrack_defrag_ops); - if (ret < 0) { - printk("ip_conntrack: can't register pre-routing defrag hook.\n"); - goto cleanup_proc; - } -- ret = nf_register_hook(&ip_conntrack_defrag_local_out_ops); -+ ret = 
visible_nf_register_hook(&ip_conntrack_defrag_local_out_ops); - if (ret < 0) { - printk("ip_conntrack: can't register local_out defrag hook.\n"); - goto cleanup_defragops; - } -- ret = nf_register_hook(&ip_conntrack_in_ops); -+ ret = visible_nf_register_hook(&ip_conntrack_in_ops); - if (ret < 0) { - printk("ip_conntrack: can't register pre-routing hook.\n"); - goto cleanup_defraglocalops; - } -- ret = nf_register_hook(&ip_conntrack_local_out_ops); -+ ret = visible_nf_register_hook(&ip_conntrack_local_out_ops); - if (ret < 0) { - printk("ip_conntrack: can't register local out hook.\n"); - goto cleanup_inops; - } -- ret = nf_register_hook(&ip_conntrack_out_ops); -+ ret = visible_nf_register_hook(&ip_conntrack_out_ops); - if (ret < 0) { - printk("ip_conntrack: can't register post-routing hook.\n"); - goto cleanup_inandlocalops; - } -- ret = nf_register_hook(&ip_conntrack_local_in_ops); -+ ret = visible_nf_register_hook(&ip_conntrack_local_in_ops); - if (ret < 0) { - printk("ip_conntrack: can't register local in hook.\n"); - goto cleanup_inoutandlocalops; - } - #ifdef CONFIG_SYSCTL -- ip_ct_sysctl_header = register_sysctl_table(ip_ct_net_table, 0); -- if (ip_ct_sysctl_header == NULL) { -+#ifdef CONFIG_VE -+ ret = ip_conntrack_sysctl_init(); -+ if (ret < 0) -+ goto cleanup_sysctl; -+#endif -+ ret = -ENOMEM; -+ ve_ip_ct_sysctl_header = register_sysctl_table(ve_ip_ct_net_table, 0); -+ if (ve_ip_ct_sysctl_header == NULL) { - printk("ip_conntrack: can't register to sysctl.\n"); -- goto cleanup; -+ goto cleanup_sysctl2; - } - #endif -+ return 0; - -- return ret; -- -- cleanup: -+cleanup: -+ if (ve_is_super(get_exec_env()) && !ip_conntrack_enable_ve0) -+ goto cleanup_init; - #ifdef CONFIG_SYSCTL -- unregister_sysctl_table(ip_ct_sysctl_header); -+ unregister_sysctl_table(ve_ip_ct_sysctl_header); -+cleanup_sysctl2: -+#ifdef CONFIG_VE -+ ip_conntrack_sysctl_cleanup(); -+cleanup_sysctl: -+#endif - #endif -- nf_unregister_hook(&ip_conntrack_local_in_ops); -+ 
visible_nf_unregister_hook(&ip_conntrack_local_in_ops); - cleanup_inoutandlocalops: -- nf_unregister_hook(&ip_conntrack_out_ops); -+ visible_nf_unregister_hook(&ip_conntrack_out_ops); - cleanup_inandlocalops: -- nf_unregister_hook(&ip_conntrack_local_out_ops); -+ visible_nf_unregister_hook(&ip_conntrack_local_out_ops); - cleanup_inops: -- nf_unregister_hook(&ip_conntrack_in_ops); -+ visible_nf_unregister_hook(&ip_conntrack_in_ops); - cleanup_defraglocalops: -- nf_unregister_hook(&ip_conntrack_defrag_local_out_ops); -+ visible_nf_unregister_hook(&ip_conntrack_defrag_local_out_ops); - cleanup_defragops: -- nf_unregister_hook(&ip_conntrack_defrag_ops); -+ visible_nf_unregister_hook(&ip_conntrack_defrag_ops); - cleanup_proc: -- proc_net_remove("ip_conntrack"); -+ remove_proc_entry("net/ip_conntrack", NULL); -+ cleanup_proc2: -+ if (!ve_is_super(get_exec_env())) -+ remove_proc_entry("net", NULL); - cleanup_init: - ip_conntrack_cleanup(); -- cleanup_nothing: -+ cleanup_unget: -+ if (!ve_is_super(get_exec_env())) -+ module_put(THIS_MODULE); - return ret; - } - -+int init_iptable_conntrack(void) -+{ -+ return init_or_cleanup(1); -+} -+ -+void fini_iptable_conntrack(void) -+{ -+ init_or_cleanup(0); -+} -+ - /* FIXME: Allow NULL functions and sub in pointers to generic for - them. 
--RR */ - int ip_conntrack_protocol_register(struct ip_conntrack_protocol *proto) -@@ -578,7 +723,7 @@ int ip_conntrack_protocol_register(struc - struct list_head *i; - - WRITE_LOCK(&ip_conntrack_lock); -- list_for_each(i, &protocol_list) { -+ list_for_each(i, &ve_ip_conntrack_protocol_list) { - if (((struct ip_conntrack_protocol *)i)->proto - == proto->proto) { - ret = -EBUSY; -@@ -586,20 +731,47 @@ int ip_conntrack_protocol_register(struc - } - } - -- list_prepend(&protocol_list, proto); -+ list_prepend(&ve_ip_conntrack_protocol_list, proto); - - out: - WRITE_UNLOCK(&ip_conntrack_lock); - return ret; - } - -+int visible_ip_conntrack_protocol_register(struct ip_conntrack_protocol *proto) -+{ -+ int ret = 0; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct ip_conntrack_protocol *tmp; -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ip_conntrack_protocol), -+ GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, proto, sizeof(struct ip_conntrack_protocol)); -+ proto = tmp; -+ } -+ -+ ret = ip_conntrack_protocol_register(proto); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) -+ kfree(proto); -+nomem: -+ return ret; -+} -+ - void ip_conntrack_protocol_unregister(struct ip_conntrack_protocol *proto) - { - WRITE_LOCK(&ip_conntrack_lock); - - /* ip_ct_find_proto() returns proto_generic in case there is no protocol - * helper. So this should be enough - HW */ -- LIST_DELETE(&protocol_list, proto); -+ LIST_DELETE(&ve_ip_conntrack_protocol_list, proto); - WRITE_UNLOCK(&ip_conntrack_lock); - - /* Somebody could be still looking at the proto in bh. 
*/ -@@ -609,17 +781,53 @@ void ip_conntrack_protocol_unregister(st - ip_ct_selective_cleanup(kill_proto, &proto->proto); - } - -+void visible_ip_conntrack_protocol_unregister( -+ struct ip_conntrack_protocol *proto) -+{ -+#ifdef CONFIG_VE -+ struct ip_conntrack_protocol *i; -+ -+ READ_LOCK(&ip_conntrack_lock); -+ list_for_each_entry(i, &ve_ip_conntrack_protocol_list, list) { -+ if (i->proto == proto->proto) { -+ proto = i; -+ break; -+ } -+ } -+ READ_UNLOCK(&ip_conntrack_lock); -+ if (proto != i) -+ return; -+#endif -+ -+ ip_conntrack_protocol_unregister(proto); -+ -+ if (!ve_is_super(get_exec_env())) -+ kfree(proto); -+} -+ - static int __init init(void) - { -- return init_or_cleanup(1); -+ int err; -+ -+ err = init_iptable_conntrack(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_conntrack); -+ KSYMRESOLVE(fini_iptable_conntrack); -+ KSYMMODRESOLVE(ip_conntrack); -+ return 0; - } - - static void __exit fini(void) - { -- init_or_cleanup(0); -+ KSYMMODUNRESOLVE(ip_conntrack); -+ KSYMUNRESOLVE(init_iptable_conntrack); -+ KSYMUNRESOLVE(fini_iptable_conntrack); -+ fini_iptable_conntrack(); - } - --module_init(init); -+subsys_initcall(init); - module_exit(fini); - - /* Some modules need us, but don't depend directly on any symbol. 
-@@ -628,8 +836,11 @@ void need_ip_conntrack(void) - { - } - -+EXPORT_SYMBOL(ip_conntrack_enable_ve0); - EXPORT_SYMBOL(ip_conntrack_protocol_register); - EXPORT_SYMBOL(ip_conntrack_protocol_unregister); -+EXPORT_SYMBOL(visible_ip_conntrack_protocol_register); -+EXPORT_SYMBOL(visible_ip_conntrack_protocol_unregister); - EXPORT_SYMBOL(invert_tuplepr); - EXPORT_SYMBOL(ip_conntrack_alter_reply); - EXPORT_SYMBOL(ip_conntrack_destroyed); -@@ -637,6 +848,8 @@ EXPORT_SYMBOL(ip_conntrack_get); - EXPORT_SYMBOL(need_ip_conntrack); - EXPORT_SYMBOL(ip_conntrack_helper_register); - EXPORT_SYMBOL(ip_conntrack_helper_unregister); -+EXPORT_SYMBOL(visible_ip_conntrack_helper_register); -+EXPORT_SYMBOL(visible_ip_conntrack_helper_unregister); - EXPORT_SYMBOL(ip_ct_selective_cleanup); - EXPORT_SYMBOL(ip_ct_refresh); - EXPORT_SYMBOL(ip_ct_find_proto); -@@ -652,8 +865,8 @@ EXPORT_SYMBOL(ip_conntrack_tuple_taken); - EXPORT_SYMBOL(ip_ct_gather_frags); - EXPORT_SYMBOL(ip_conntrack_htable_size); - EXPORT_SYMBOL(ip_conntrack_expect_list); --EXPORT_SYMBOL(ip_conntrack_lock); - EXPORT_SYMBOL(ip_conntrack_hash); -+EXPORT_SYMBOL(ip_conntrack_lock); - EXPORT_SYMBOL(ip_conntrack_untracked); - EXPORT_SYMBOL_GPL(ip_conntrack_find_get); - EXPORT_SYMBOL_GPL(ip_conntrack_put); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_fw_compat.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_fw_compat.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_fw_compat.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_fw_compat.c 2006-05-11 13:05:25.000000000 +0400 -@@ -80,7 +80,7 @@ fw_in(unsigned int hooknum, - &redirpt, pskb); - - if ((*pskb)->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) { -- *pskb = ip_ct_gather_frags(*pskb); -+ *pskb = ip_ct_gather_frags(*pskb, IP_DEFRAG_FW_COMPAT); - - if (!*pskb) - return NF_STOLEN; -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_core.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_core.c ---- 
linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_core.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_core.c 2006-05-11 13:05:45.000000000 +0400 -@@ -20,6 +20,7 @@ - #include <net/tcp.h> /* For tcp_prot in getorigdst */ - #include <linux/icmp.h> - #include <linux/udp.h> -+#include <ub/ub_mem.h> - - #define ASSERT_READ_LOCK(x) MUST_BE_READ_LOCKED(&ip_nat_lock) - #define ASSERT_WRITE_LOCK(x) MUST_BE_WRITE_LOCKED(&ip_nat_lock) -@@ -46,10 +47,19 @@ DECLARE_RWLOCK_EXTERN(ip_conntrack_lock) - /* Calculated at init based on memory size */ - static unsigned int ip_nat_htable_size; - --static struct list_head *bysource; --static struct list_head *byipsproto; -+#ifdef CONFIG_VE_IPTABLES -+#define ve_ip_nat_bysource \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_bysource) -+#define ve_ip_nat_byipsproto \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_bysource+ip_nat_htable_size) -+#else - LIST_HEAD(protos); - LIST_HEAD(helpers); -+static struct list_head *bysource; -+static struct list_head *byipsproto; -+#define ve_ip_nat_bysource bysource -+#define ve_ip_nat_byipsproto byipsproto -+#endif - - extern struct ip_nat_protocol unknown_nat_protocol; - -@@ -74,7 +84,9 @@ static void ip_nat_cleanup_conntrack(str - { - struct ip_nat_info *info = &conn->nat.info; - unsigned int hs, hp; -- -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_ip_conntrack *env; -+#endif - if (!info->initialized) - return; - -@@ -91,8 +103,15 @@ static void ip_nat_cleanup_conntrack(str - .tuple.dst.protonum); - - WRITE_LOCK(&ip_nat_lock); -+#ifdef CONFIG_VE_IPTABLES -+ env = conn->ct_env; -+ LIST_DELETE(&(env->_ip_nat_bysource)[hs], &info->bysource); -+ LIST_DELETE(&(env->_ip_nat_bysource + ip_nat_htable_size)[hp], -+ &info->byipsproto); -+#else - LIST_DELETE(&bysource[hs], &info->bysource); - LIST_DELETE(&byipsproto[hp], &info->byipsproto); -+#endif - WRITE_UNLOCK(&ip_nat_lock); - } - -@@ -118,7 +137,8 @@ find_nat_proto(u_int16_t protonum) - struct ip_nat_protocol *i; - - 
MUST_BE_READ_LOCKED(&ip_nat_lock); -- i = LIST_FIND(&protos, cmp_proto, struct ip_nat_protocol *, protonum); -+ i = LIST_FIND(&ve_ip_nat_protos, cmp_proto, -+ struct ip_nat_protocol *, protonum); - if (!i) - i = &unknown_nat_protocol; - return i; -@@ -197,7 +217,8 @@ find_appropriate_src(const struct ip_con - struct ip_nat_hash *i; - - MUST_BE_READ_LOCKED(&ip_nat_lock); -- i = LIST_FIND(&bysource[h], src_cmp, struct ip_nat_hash *, tuple, mr); -+ i = LIST_FIND(&ve_ip_nat_bysource[h], src_cmp, -+ struct ip_nat_hash *, tuple, mr); - if (i) - return &i->conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src; - else -@@ -253,7 +274,7 @@ count_maps(u_int32_t src, u_int32_t dst, - - MUST_BE_READ_LOCKED(&ip_nat_lock); - h = hash_by_ipsproto(src, dst, protonum); -- LIST_FIND(&byipsproto[h], fake_cmp, struct ip_nat_hash *, -+ LIST_FIND(&ve_ip_nat_byipsproto[h], fake_cmp, struct ip_nat_hash *, - src, dst, protonum, &score, conntrack); - - return score; -@@ -505,6 +526,28 @@ helper_cmp(const struct ip_nat_helper *h - return ip_ct_tuple_mask_cmp(tuple, &helper->tuple, &helper->mask); - } - -+/* this function gives us an ability to safely restore -+ * connection in case of failure */ -+int ip_nat_install_conntrack(struct ip_conntrack *conntrack, int helper) -+{ -+ int ret = 0; -+ -+ WRITE_LOCK(&ip_nat_lock); -+ if (helper) { -+ conntrack->nat.info.helper = LIST_FIND(&ve_ip_nat_helpers, -+ helper_cmp, struct ip_nat_helper *, -+ &conntrack->tuplehash[1].tuple); -+ if (conntrack->nat.info.helper == NULL) -+ ret = -EINVAL; -+ } -+ if (!ret) -+ place_in_hashes(conntrack, &conntrack->nat.info); -+ WRITE_UNLOCK(&ip_nat_lock); -+ return ret; -+} -+EXPORT_SYMBOL(ip_nat_install_conntrack); -+ -+ - /* Where to manip the reply packets (will be reverse manip). */ - static unsigned int opposite_hook[NF_IP_NUMHOOKS] - = { [NF_IP_PRE_ROUTING] = NF_IP_POST_ROUTING, -@@ -643,8 +686,8 @@ ip_nat_setup_info(struct ip_conntrack *c - - /* If there's a helper, assign it; based on new tuple. 
*/ - if (!conntrack->master) -- info->helper = LIST_FIND(&helpers, helper_cmp, struct ip_nat_helper *, -- &reply); -+ info->helper = LIST_FIND(&ve_ip_nat_helpers, -+ helper_cmp, struct ip_nat_helper *, &reply); - - /* It's done. */ - info->initialized |= (1 << HOOK2MANIP(hooknum)); -@@ -684,8 +727,8 @@ void replace_in_hashes(struct ip_conntra - list_del(&info->bysource.list); - list_del(&info->byipsproto.list); - -- list_prepend(&bysource[srchash], &info->bysource); -- list_prepend(&byipsproto[ipsprotohash], &info->byipsproto); -+ list_prepend(&ve_ip_nat_bysource[srchash], &info->bysource); -+ list_prepend(&ve_ip_nat_byipsproto[ipsprotohash], &info->byipsproto); - } - - void place_in_hashes(struct ip_conntrack *conntrack, -@@ -712,8 +755,8 @@ void place_in_hashes(struct ip_conntrack - info->byipsproto.conntrack = conntrack; - info->bysource.conntrack = conntrack; - -- list_prepend(&bysource[srchash], &info->bysource); -- list_prepend(&byipsproto[ipsprotohash], &info->byipsproto); -+ list_prepend(&ve_ip_nat_bysource[srchash], &info->bysource); -+ list_prepend(&ve_ip_nat_byipsproto[ipsprotohash], &info->byipsproto); - } - - /* Returns true if succeeded. */ -@@ -988,41 +1031,64 @@ icmp_reply_translation(struct sk_buff ** - return 0; - } - --int __init ip_nat_init(void) -+int ip_nat_init(void) - { - size_t i; -+ int ret; - -- /* Leave them the same for the moment. */ -- ip_nat_htable_size = ip_conntrack_htable_size; -+ if (ve_is_super(get_exec_env())) -+ ip_nat_htable_size = ip_conntrack_htable_size; -+ INIT_LIST_HEAD(&ve_ip_nat_protos); -+ INIT_LIST_HEAD(&ve_ip_nat_helpers); - - /* One vmalloc for both hash tables */ -- bysource = vmalloc(sizeof(struct list_head) * ip_nat_htable_size*2); -- if (!bysource) { -- return -ENOMEM; -- } -- byipsproto = bysource + ip_nat_htable_size; -- -- /* Sew in builtin protocols. 
*/ -- WRITE_LOCK(&ip_nat_lock); -- list_append(&protos, &ip_nat_protocol_tcp); -- list_append(&protos, &ip_nat_protocol_udp); -- list_append(&protos, &ip_nat_protocol_icmp); -- WRITE_UNLOCK(&ip_nat_lock); -+ ret = -ENOMEM; -+ ve_ip_nat_bysource = ub_vmalloc(sizeof(struct list_head)*ip_nat_htable_size*2); -+ if (!ve_ip_nat_bysource) -+ goto err; -+ /*byipsproto = bysource + ip_nat_htable_size;*/ - - for (i = 0; i < ip_nat_htable_size; i++) { -- INIT_LIST_HEAD(&bysource[i]); -- INIT_LIST_HEAD(&byipsproto[i]); -+ INIT_LIST_HEAD(&ve_ip_nat_bysource[i]); -+ INIT_LIST_HEAD(&ve_ip_nat_byipsproto[i]); -+ } -+ -+ if (!ve_is_super(get_exec_env())) { -+ ret = visible_ip_nat_protocol_register(&ip_nat_protocol_tcp); -+ if (ret) -+ goto tcp_fail; -+ ret = visible_ip_nat_protocol_register(&ip_nat_protocol_udp); -+ if (ret) -+ goto udp_fail; -+ ret = visible_ip_nat_protocol_register(&ip_nat_protocol_icmp); -+ if (ret) -+ goto icmp_fail; -+ } else { -+ /* Sew in builtin protocols. */ -+ WRITE_LOCK(&ip_nat_lock); -+ list_append(&ve_ip_nat_protos, &ip_nat_protocol_tcp); -+ list_append(&ve_ip_nat_protos, &ip_nat_protocol_udp); -+ list_append(&ve_ip_nat_protos, &ip_nat_protocol_icmp); -+ WRITE_UNLOCK(&ip_nat_lock); -+ -+ /* Initialize fake conntrack so that NAT will skip it */ -+ ip_conntrack_untracked.nat.info.initialized |= -+ (1 << IP_NAT_MANIP_SRC) | (1 << IP_NAT_MANIP_DST); - } - - /* FIXME: Man, this is a hack. 
<SIGH> */ -- IP_NF_ASSERT(ip_conntrack_destroyed == NULL); -- ip_conntrack_destroyed = &ip_nat_cleanup_conntrack; -- -- /* Initialize fake conntrack so that NAT will skip it */ -- ip_conntrack_untracked.nat.info.initialized |= -- (1 << IP_NAT_MANIP_SRC) | (1 << IP_NAT_MANIP_DST); -+ IP_NF_ASSERT(ve_ip_conntrack_destroyed == NULL); -+ ve_ip_conntrack_destroyed = &ip_nat_cleanup_conntrack; - - return 0; -+icmp_fail: -+ visible_ip_nat_protocol_unregister(&ip_nat_protocol_udp); -+udp_fail: -+ visible_ip_nat_protocol_unregister(&ip_nat_protocol_tcp); -+tcp_fail: -+ vfree(ve_ip_nat_bysource); -+err: -+ return ret; - } - - /* Clear NAT section of all conntracks, in case we're loaded again. */ -@@ -1036,6 +1102,13 @@ static int clean_nat(const struct ip_con - void ip_nat_cleanup(void) - { - ip_ct_selective_cleanup(&clean_nat, NULL); -- ip_conntrack_destroyed = NULL; -- vfree(bysource); -+ ve_ip_conntrack_destroyed = NULL; -+ vfree(ve_ip_nat_bysource); -+ ve_ip_nat_bysource = NULL; -+ -+ if (!ve_is_super(get_exec_env())){ -+ visible_ip_nat_protocol_unregister(&ip_nat_protocol_icmp); -+ visible_ip_nat_protocol_unregister(&ip_nat_protocol_udp); -+ visible_ip_nat_protocol_unregister(&ip_nat_protocol_tcp); -+ } - } -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_ftp.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_ftp.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_ftp.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_ftp.c 2006-05-11 13:05:41.000000000 +0400 -@@ -18,6 +18,7 @@ - #include <linux/netfilter_ipv4/ip_nat_rule.h> - #include <linux/netfilter_ipv4/ip_conntrack_ftp.h> - #include <linux/netfilter_ipv4/ip_conntrack_helper.h> -+#include <linux/nfcalls.h> - - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Rusty Russell <rusty@rustcorp.com.au>"); -@@ -31,11 +32,17 @@ MODULE_DESCRIPTION("ftp NAT helper"); - - #define MAX_PORTS 8 - static int ports[MAX_PORTS]; --static int ports_c; - - MODULE_PARM(ports, "1-" 
__MODULE_STRING(MAX_PORTS) "i"); - --DECLARE_LOCK_EXTERN(ip_ftp_lock); -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ports_c \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_ftp_ports_c) -+#else -+static int ports_c = 0; -+#define ve_ports_c ports_c -+#endif - - /* FIXME: Time out? --RR */ - -@@ -59,8 +66,6 @@ ftp_nat_expected(struct sk_buff **pskb, - DEBUGP("nat_expected: We have a connection!\n"); - exp_ftp_info = &ct->master->help.exp_ftp_info; - -- LOCK_BH(&ip_ftp_lock); -- - if (exp_ftp_info->ftptype == IP_CT_FTP_PORT - || exp_ftp_info->ftptype == IP_CT_FTP_EPRT) { - /* PORT command: make connection go to the client. */ -@@ -75,7 +80,6 @@ ftp_nat_expected(struct sk_buff **pskb, - DEBUGP("nat_expected: PASV cmd. %u.%u.%u.%u->%u.%u.%u.%u\n", - NIPQUAD(newsrcip), NIPQUAD(newdstip)); - } -- UNLOCK_BH(&ip_ftp_lock); - - if (HOOK2MANIP(hooknum) == IP_NAT_MANIP_SRC) - newip = newsrcip; -@@ -111,8 +115,6 @@ mangle_rfc959_packet(struct sk_buff **ps - { - char buffer[sizeof("nnn,nnn,nnn,nnn,nnn,nnn")]; - -- MUST_BE_LOCKED(&ip_ftp_lock); -- - sprintf(buffer, "%u,%u,%u,%u,%u,%u", - NIPQUAD(newip), port>>8, port&0xFF); - -@@ -134,8 +136,6 @@ mangle_eprt_packet(struct sk_buff **pskb - { - char buffer[sizeof("|1|255.255.255.255|65535|")]; - -- MUST_BE_LOCKED(&ip_ftp_lock); -- - sprintf(buffer, "|1|%u.%u.%u.%u|%u|", NIPQUAD(newip), port); - - DEBUGP("calling ip_nat_mangle_tcp_packet\n"); -@@ -156,8 +156,6 @@ mangle_epsv_packet(struct sk_buff **pskb - { - char buffer[sizeof("|||65535|")]; - -- MUST_BE_LOCKED(&ip_ftp_lock); -- - sprintf(buffer, "|||%u|", port); - - DEBUGP("calling ip_nat_mangle_tcp_packet\n"); -@@ -189,7 +187,6 @@ static int ftp_data_fixup(const struct i - u_int16_t port; - struct ip_conntrack_tuple newtuple; - -- MUST_BE_LOCKED(&ip_ftp_lock); - DEBUGP("FTP_NAT: seq %u + %u in %u\n", - expect->seq, ct_ftp_info->len, - ntohl(tcph->seq)); -@@ -268,13 +265,11 @@ static unsigned int help(struct ip_connt - } - - datalen = (*pskb)->len - iph->ihl * 
4 - tcph->doff * 4; -- LOCK_BH(&ip_ftp_lock); - /* If it's in the right range... */ - if (between(exp->seq + ct_ftp_info->len, - ntohl(tcph->seq), - ntohl(tcph->seq) + datalen)) { - if (!ftp_data_fixup(ct_ftp_info, ct, pskb, ctinfo, exp)) { -- UNLOCK_BH(&ip_ftp_lock); - return NF_DROP; - } - } else { -@@ -286,26 +281,52 @@ static unsigned int help(struct ip_connt - ntohl(tcph->seq), - ntohl(tcph->seq) + datalen); - } -- UNLOCK_BH(&ip_ftp_lock); - return NF_DROP; - } -- UNLOCK_BH(&ip_ftp_lock); -- - return NF_ACCEPT; - } - - static struct ip_nat_helper ftp[MAX_PORTS]; - static char ftp_names[MAX_PORTS][10]; - --/* Not __exit: called from init() */ --static void fini(void) -+void fini_iptable_nat_ftp(void) - { - int i; - -- for (i = 0; i < ports_c; i++) { -+ for (i = 0; i < ve_ports_c; i++) { - DEBUGP("ip_nat_ftp: unregistering port %d\n", ports[i]); -- ip_nat_helper_unregister(&ftp[i]); -+ visible_ip_nat_helper_unregister(&ftp[i]); -+ } -+ ve_ports_c = 0; -+} -+ -+int init_iptable_nat_ftp(void) -+{ -+ int i, ret = 0; -+ -+ ve_ports_c = 0; -+ for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { -+ DEBUGP("ip_nat_ftp: Trying to register for port %d\n", -+ ports[i]); -+ ret = visible_ip_nat_helper_register(&ftp[i]); -+ if (ret) { -+ printk("ip_nat_ftp: error registering " -+ "helper for port %d\n", ports[i]); -+ fini_iptable_nat_ftp(); -+ return ret; -+ } -+ ve_ports_c++; - } -+ return 0; -+} -+ -+/* Not __exit: called from init() */ -+static void fini(void) -+{ -+ KSYMMODUNRESOLVE(ip_nat_ftp); -+ KSYMUNRESOLVE(init_iptable_nat_ftp); -+ KSYMUNRESOLVE(fini_iptable_nat_ftp); -+ fini_iptable_nat_ftp(); - } - - static int __init init(void) -@@ -316,6 +337,7 @@ static int __init init(void) - if (ports[0] == 0) - ports[0] = FTP_PORT; - -+ ve_ports_c = 0; - for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { - ftp[i].tuple.dst.protonum = IPPROTO_TCP; - ftp[i].tuple.src.u.tcp.port = htons(ports[i]); -@@ -335,7 +357,7 @@ static int __init init(void) - - DEBUGP("ip_nat_ftp: Trying to 
register for port %d\n", - ports[i]); -- ret = ip_nat_helper_register(&ftp[i]); -+ ret = visible_ip_nat_helper_register(&ftp[i]); - - if (ret) { - printk("ip_nat_ftp: error registering " -@@ -343,9 +365,12 @@ static int __init init(void) - fini(); - return ret; - } -- ports_c++; -+ ve_ports_c++; - } - -+ KSYMRESOLVE(init_iptable_nat_ftp); -+ KSYMRESOLVE(fini_iptable_nat_ftp); -+ KSYMMODRESOLVE(ip_nat_ftp); - return ret; - } - -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_helper.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_helper.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_helper.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_helper.c 2006-05-11 13:05:41.000000000 +0400 -@@ -410,33 +410,59 @@ int ip_nat_helper_register(struct ip_nat - int ret = 0; - - WRITE_LOCK(&ip_nat_lock); -- if (LIST_FIND(&helpers, helper_cmp, struct ip_nat_helper *,&me->tuple)) -+ if (LIST_FIND(&ve_ip_nat_helpers, helper_cmp, -+ struct ip_nat_helper *,&me->tuple)) - ret = -EBUSY; - else -- list_prepend(&helpers, me); -+ list_prepend(&ve_ip_nat_helpers, me); - WRITE_UNLOCK(&ip_nat_lock); - - return ret; - } - --static int --kill_helper(const struct ip_conntrack *i, void *helper) -+int visible_ip_nat_helper_register(struct ip_nat_helper *me) - { - int ret; -+ struct module *mod = me->me; - -- READ_LOCK(&ip_nat_lock); -- ret = (i->nat.info.helper == helper); -- READ_UNLOCK(&ip_nat_lock); -+ if (!ve_is_super(get_exec_env())) { -+ struct ip_nat_helper *tmp; -+ __module_get(mod); -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ip_nat_helper), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, me, sizeof(struct ip_nat_helper)); -+ me = tmp; -+ } - -+ ret = ip_nat_helper_register(me); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) { -+ kfree(me); -+nomem: -+ module_put(mod); -+ } - return ret; - } - -+static int -+kill_helper(const struct ip_conntrack *i, void *helper) -+{ -+ 
return (i->nat.info.helper == helper); -+} -+ - void ip_nat_helper_unregister(struct ip_nat_helper *me) - { - WRITE_LOCK(&ip_nat_lock); - /* Autoloading conntrack helper might have failed */ -- if (LIST_FIND(&helpers, helper_cmp, struct ip_nat_helper *,&me->tuple)) { -- LIST_DELETE(&helpers, me); -+ if (LIST_FIND(&ve_ip_nat_helpers, helper_cmp, -+ struct ip_nat_helper *,&me->tuple)) { -+ LIST_DELETE(&ve_ip_nat_helpers, me); - } - WRITE_UNLOCK(&ip_nat_lock); - -@@ -452,3 +478,26 @@ void ip_nat_helper_unregister(struct ip_ - worse. --RR */ - ip_ct_selective_cleanup(kill_helper, me); - } -+ -+void visible_ip_nat_helper_unregister(struct ip_nat_helper *me) -+{ -+ struct ip_nat_helper *i; -+ -+ READ_LOCK(&ip_nat_lock); -+ list_for_each_entry(i, &ve_ip_nat_helpers, list) { -+ if (i->name == me->name) { -+ me = i; -+ break; -+ } -+ } -+ READ_UNLOCK(&ip_nat_lock); -+ if (me != i) -+ return; -+ -+ ip_nat_helper_unregister(me); -+ -+ if (!ve_is_super(get_exec_env())) { -+ module_put(me->me); -+ kfree(me); -+ } -+} -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_irc.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_irc.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_irc.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_irc.c 2006-05-11 13:05:41.000000000 +0400 -@@ -27,6 +27,7 @@ - #include <linux/netfilter_ipv4/ip_nat_rule.h> - #include <linux/netfilter_ipv4/ip_conntrack_irc.h> - #include <linux/netfilter_ipv4/ip_conntrack_helper.h> -+#include <linux/nfcalls.h> - - #if 0 - #define DEBUGP printk -@@ -36,7 +37,15 @@ - - #define MAX_PORTS 8 - static int ports[MAX_PORTS]; --static int ports_c; -+ -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ports_c \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_irc_ports_c) -+#else -+static int ports_c = 0; -+#define ve_ports_c ports_c -+#endif - - MODULE_AUTHOR("Harald Welte <laforge@gnumonks.org>"); - MODULE_DESCRIPTION("IRC (DCC) NAT helper"); -@@ -44,9 +53,6 
@@ MODULE_LICENSE("GPL"); - MODULE_PARM(ports, "1-" __MODULE_STRING(MAX_PORTS) "i"); - MODULE_PARM_DESC(ports, "port numbers of IRC servers"); - --/* protects irc part of conntracks */ --DECLARE_LOCK_EXTERN(ip_irc_lock); -- - /* FIXME: Time out? --RR */ - - static unsigned int -@@ -102,8 +108,6 @@ static int irc_data_fixup(const struct i - /* "4294967296 65635 " */ - char buffer[18]; - -- MUST_BE_LOCKED(&ip_irc_lock); -- - DEBUGP("IRC_NAT: info (seq %u + %u) in %u\n", - expect->seq, ct_irc_info->len, - ntohl(tcph->seq)); -@@ -111,11 +115,6 @@ static int irc_data_fixup(const struct i - newip = ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.ip; - - /* Alter conntrack's expectations. */ -- -- /* We can read expect here without conntrack lock, since it's -- only set in ip_conntrack_irc, with ip_irc_lock held -- writable */ -- - t = expect->tuple; - t.dst.ip = newip; - for (port = ct_irc_info->port; port != 0; port++) { -@@ -185,13 +184,11 @@ static unsigned int help(struct ip_connt - DEBUGP("got beyond not touching\n"); - - datalen = (*pskb)->len - iph->ihl * 4 - tcph->doff * 4; -- LOCK_BH(&ip_irc_lock); - /* Check whether the whole IP/address pattern is carried in the payload */ - if (between(exp->seq + ct_irc_info->len, - ntohl(tcph->seq), - ntohl(tcph->seq) + datalen)) { - if (!irc_data_fixup(ct_irc_info, ct, pskb, ctinfo, exp)) { -- UNLOCK_BH(&ip_irc_lock); - return NF_DROP; - } - } else { -@@ -204,28 +201,59 @@ static unsigned int help(struct ip_connt - ntohl(tcph->seq), - ntohl(tcph->seq) + datalen); - } -- UNLOCK_BH(&ip_irc_lock); - return NF_DROP; - } -- UNLOCK_BH(&ip_irc_lock); -- - return NF_ACCEPT; - } - - static struct ip_nat_helper ip_nat_irc_helpers[MAX_PORTS]; - static char irc_names[MAX_PORTS][10]; - --/* This function is intentionally _NOT_ defined as __exit, because -- * it is needed by init() */ --static void fini(void) -+void fini_iptable_nat_irc(void) - { - int i; - -- for (i = 0; i < ports_c; i++) { -+ for (i = 0; i < ve_ports_c; i++) { - 
DEBUGP("ip_nat_irc: unregistering helper for port %d\n", - ports[i]); -- ip_nat_helper_unregister(&ip_nat_irc_helpers[i]); -+ visible_ip_nat_helper_unregister(&ip_nat_irc_helpers[i]); - } -+ ve_ports_c = 0; -+} -+ -+/* This function is intentionally _NOT_ defined as __exit, because -+ * it is needed by the init function */ -+static void fini(void) -+{ -+ KSYMMODUNRESOLVE(ip_nat_irc); -+ KSYMUNRESOLVE(init_iptable_nat_irc); -+ KSYMUNRESOLVE(fini_iptable_nat_irc); -+ fini_iptable_nat_irc(); -+} -+ -+int init_iptable_nat_irc(void) -+{ -+ int ret = 0; -+ int i; -+ struct ip_nat_helper *hlpr; -+ -+ ve_ports_c = 0; -+ for (i = 0; (i < MAX_PORTS) && ports[i]; i++) { -+ hlpr = &ip_nat_irc_helpers[i]; -+ DEBUGP -+ ("ip_nat_irc: Trying to register helper for port %d: name %s\n", -+ ports[i], hlpr->name); -+ ret = visible_ip_nat_helper_register(hlpr); -+ if (ret) { -+ printk -+ ("ip_nat_irc: error registering helper for port %d\n", -+ ports[i]); -+ fini_iptable_nat_irc(); -+ return 1; -+ } -+ ve_ports_c++; -+ } -+ return 0; - } - - static int __init init(void) -@@ -239,6 +267,7 @@ static int __init init(void) - ports[0] = IRC_PORT; - } - -+ ve_ports_c = 0; - for (i = 0; (i < MAX_PORTS) && ports[i] != 0; i++) { - hlpr = &ip_nat_irc_helpers[i]; - hlpr->tuple.dst.protonum = IPPROTO_TCP; -@@ -260,7 +289,7 @@ static int __init init(void) - DEBUGP - ("ip_nat_irc: Trying to register helper for port %d: name %s\n", - ports[i], hlpr->name); -- ret = ip_nat_helper_register(hlpr); -+ ret = visible_ip_nat_helper_register(hlpr); - - if (ret) { - printk -@@ -269,8 +298,12 @@ static int __init init(void) - fini(); - return 1; - } -- ports_c++; -+ ve_ports_c++; - } -+ -+ KSYMRESOLVE(init_iptable_nat_irc); -+ KSYMRESOLVE(fini_iptable_nat_irc); -+ KSYMMODRESOLVE(ip_nat_irc); - return ret; - } - -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_proto_tcp.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_proto_tcp.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_proto_tcp.c 
2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_proto_tcp.c 2006-05-11 13:05:27.000000000 +0400 -@@ -40,7 +40,8 @@ tcp_unique_tuple(struct ip_conntrack_tup - enum ip_nat_manip_type maniptype, - const struct ip_conntrack *conntrack) - { -- static u_int16_t port, *portptr; -+ static u_int16_t port; -+ u_int16_t *portptr; - unsigned int range_size, min, i; - - if (maniptype == IP_NAT_MANIP_SRC) -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_proto_udp.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_proto_udp.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_proto_udp.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_proto_udp.c 2006-05-11 13:05:27.000000000 +0400 -@@ -41,7 +41,8 @@ udp_unique_tuple(struct ip_conntrack_tup - enum ip_nat_manip_type maniptype, - const struct ip_conntrack *conntrack) - { -- static u_int16_t port, *portptr; -+ static u_int16_t port; -+ u_int16_t *portptr; - unsigned int range_size, min, i; - - if (maniptype == IP_NAT_MANIP_SRC) -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_rule.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_rule.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_rule.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_rule.c 2006-05-11 13:05:49.000000000 +0400 -@@ -17,6 +17,7 @@ - #include <linux/proc_fs.h> - #include <net/checksum.h> - #include <linux/bitops.h> -+#include <ub/ub_mem.h> - - #define ASSERT_READ_LOCK(x) MUST_BE_READ_LOCKED(&ip_nat_lock) - #define ASSERT_WRITE_LOCK(x) MUST_BE_WRITE_LOCKED(&ip_nat_lock) -@@ -33,6 +34,16 @@ - #define DEBUGP(format, args...) 
- #endif - -+#ifdef CONFIG_VE_IPTABLES -+#define ve_ip_nat_table \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_table) -+#define ve_ip_nat_initial_table \ -+ (get_exec_env()->_ip_conntrack->_ip_nat_initial_table) -+#else -+#define ve_ip_nat_table &nat_table -+#define ve_ip_nat_initial_table &nat_initial_table -+#endif -+ - #define NAT_VALID_HOOKS ((1<<NF_IP_PRE_ROUTING) | (1<<NF_IP_POST_ROUTING) | (1<<NF_IP_LOCAL_OUT)) - - /* Standard entry. */ -@@ -54,12 +65,12 @@ struct ipt_error - struct ipt_error_target target; - }; - --static struct -+static struct ipt_nat_initial_table - { - struct ipt_replace repl; - struct ipt_standard entries[3]; - struct ipt_error term; --} nat_initial_table __initdata -+} nat_initial_table - = { { "nat", NAT_VALID_HOOKS, 4, - sizeof(struct ipt_standard) * 3 + sizeof(struct ipt_error), - { [NF_IP_PRE_ROUTING] = 0, -@@ -241,6 +252,93 @@ static int ipt_dnat_checkentry(const cha - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat_to_user(void *target, void **dstptr, -+ int *size, int off) -+{ -+ struct ipt_entry_target *pt; -+ struct ip_nat_multi_range *pinfo; -+ struct compat_ip_nat_multi_range info; -+ u_int16_t tsize; -+ -+ pt = (struct ipt_entry_target *)target; -+ tsize = pt->u.user.target_size; -+ if (__copy_to_user(*dstptr, pt, sizeof(struct ipt_entry_target))) -+ return -EFAULT; -+ pinfo = (struct ip_nat_multi_range *)pt->data; -+ memset(&info, 0, sizeof(struct compat_ip_nat_multi_range)); -+ info.rangesize = pinfo->rangesize; -+ info.range[0].flags = pinfo->range[0].flags; -+ info.range[0].min_ip = pinfo->range[0].min_ip; -+ info.range[0].max_ip = pinfo->range[0].max_ip; -+ info.range[0].min = pinfo->range[0].min; -+ info.range[0].max = pinfo->range[0].max; -+ if (__copy_to_user(*dstptr + sizeof(struct ipt_entry_target), -+ &info, sizeof(struct compat_ip_nat_multi_range))) -+ return -EFAULT; -+ tsize -= off; -+ if (put_user(tsize, (u_int16_t *)*dstptr)) -+ return -EFAULT; -+ *size -= off; -+ *dstptr += tsize; -+ return 0; -+} 
-+ -+static int compat_from_user(void *target, void **dstptr, -+ int *size, int off) -+{ -+ struct compat_ipt_entry_target *pt; -+ struct ipt_entry_target *dstpt; -+ struct compat_ip_nat_multi_range *pinfo; -+ struct ip_nat_multi_range info; -+ u_int16_t tsize; -+ -+ pt = (struct compat_ipt_entry_target *)target; -+ dstpt = (struct ipt_entry_target *)*dstptr; -+ tsize = pt->u.user.target_size; -+ memcpy(*dstptr, pt, sizeof(struct compat_ipt_entry_target)); -+ pinfo = (struct compat_ip_nat_multi_range *)pt->data; -+ memset(&info, 0, sizeof(struct ip_nat_multi_range)); -+ info.rangesize = pinfo->rangesize; -+ info.range[0].flags = pinfo->range[0].flags; -+ info.range[0].min_ip = pinfo->range[0].min_ip; -+ info.range[0].max_ip = pinfo->range[0].max_ip; -+ info.range[0].min = pinfo->range[0].min; -+ info.range[0].max = pinfo->range[0].max; -+ memcpy(*dstptr + sizeof(struct compat_ipt_entry_target), -+ &info, sizeof(struct ip_nat_multi_range)); -+ tsize += off; -+ dstpt->u.user.target_size = tsize; -+ *size += off; -+ *dstptr += tsize; -+ return 0; -+} -+ -+static int compat(void *target, void **dstptr, int *size, int convert) -+{ -+ int ret, off; -+ -+ off = IPT_ALIGN(sizeof(struct ip_nat_multi_range)) - -+ COMPAT_IPT_ALIGN(sizeof(struct compat_ip_nat_multi_range)); -+ switch (convert) { -+ case COMPAT_TO_USER: -+ ret = compat_to_user(target, dstptr, size, off); -+ break; -+ case COMPAT_FROM_USER: -+ ret = compat_from_user(target, dstptr, size, off); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ ret = 0; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} -+#endif -+ - inline unsigned int - alloc_null_binding(struct ip_conntrack *conntrack, - struct ip_nat_info *info, -@@ -271,7 +369,7 @@ int ip_nat_rule_find(struct sk_buff **ps - { - int ret; - -- ret = ipt_do_table(pskb, hooknum, in, out, &nat_table, NULL); -+ ret = ipt_do_table(pskb, hooknum, in, out, ve_ip_nat_table, NULL); - - if (ret == NF_ACCEPT) { - if (!(info->initialized & 
(1 << HOOK2MANIP(hooknum)))) -@@ -285,42 +383,91 @@ static struct ipt_target ipt_snat_reg = - .name = "SNAT", - .target = ipt_snat_target, - .checkentry = ipt_snat_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - }; - - static struct ipt_target ipt_dnat_reg = { - .name = "DNAT", - .target = ipt_dnat_target, - .checkentry = ipt_dnat_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - }; - --int __init ip_nat_rule_init(void) -+int ip_nat_rule_init(void) - { - int ret; - -- ret = ipt_register_table(&nat_table); -+#ifdef CONFIG_VE_IPTABLES -+ if (ve_is_super(get_exec_env())) { -+ ve_ip_nat_table = &nat_table; -+ ve_ip_nat_initial_table = &nat_initial_table; -+ } else { -+ /* allocate structures in ve_struct */ -+ ret = -ENOMEM; -+ ve_ip_nat_initial_table = -+ ub_kmalloc(sizeof(nat_initial_table), GFP_KERNEL); -+ if (!ve_ip_nat_initial_table) -+ goto nomem_initial; -+ ve_ip_nat_table = ub_kmalloc(sizeof(nat_table), GFP_KERNEL); -+ if (!ve_ip_nat_table) -+ goto nomem_table; -+ -+ memcpy(ve_ip_nat_initial_table, &nat_initial_table, -+ sizeof(nat_initial_table)); -+ memcpy(ve_ip_nat_table, &nat_table, -+ sizeof(nat_table)); -+ ve_ip_nat_table->table = -+ &ve_ip_nat_initial_table->repl; -+ } -+#endif -+ -+ ret = ipt_register_table(ve_ip_nat_table); - if (ret != 0) -- return ret; -- ret = ipt_register_target(&ipt_snat_reg); -+ goto out; -+ ret = visible_ipt_register_target(&ipt_snat_reg); - if (ret != 0) - goto unregister_table; - -- ret = ipt_register_target(&ipt_dnat_reg); -+ ret = visible_ipt_register_target(&ipt_dnat_reg); - if (ret != 0) - goto unregister_snat; - - return ret; - - unregister_snat: -- ipt_unregister_target(&ipt_snat_reg); -+ visible_ipt_unregister_target(&ipt_snat_reg); - unregister_table: -- ipt_unregister_table(&nat_table); -- -+ ipt_unregister_table(ve_ip_nat_table); -+ out: -+#ifdef CONFIG_VE_IPTABLES -+ if (!ve_is_super(get_exec_env())) -+ kfree(ve_ip_nat_table); -+ ve_ip_nat_table = NULL; -+ nomem_table: -+ if 
(!ve_is_super(get_exec_env())) -+ kfree(ve_ip_nat_initial_table); -+ ve_ip_nat_initial_table = NULL; -+ nomem_initial: -+#endif - return ret; - } - - void ip_nat_rule_cleanup(void) - { -- ipt_unregister_target(&ipt_dnat_reg); -- ipt_unregister_target(&ipt_snat_reg); -- ipt_unregister_table(&nat_table); -+ ipt_unregister_table(ve_ip_nat_table); -+ visible_ipt_unregister_target(&ipt_dnat_reg); -+ visible_ipt_unregister_target(&ipt_snat_reg); -+ -+#ifdef CONFIG_VE -+ if (!ve_is_super(get_exec_env())) { -+ kfree(ve_ip_nat_initial_table); -+ kfree(ve_ip_nat_table); -+ } -+ ve_ip_nat_initial_table = NULL; -+ ve_ip_nat_table = NULL; -+#endif - } -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_standalone.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_standalone.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_nat_standalone.c 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_nat_standalone.c 2006-05-11 13:05:42.000000000 +0400 -@@ -30,6 +30,7 @@ - #include <net/ip.h> - #include <net/checksum.h> - #include <linux/spinlock.h> -+#include <linux/nfcalls.h> - - #define ASSERT_READ_LOCK(x) MUST_BE_READ_LOCKED(&ip_nat_lock) - #define ASSERT_WRITE_LOCK(x) MUST_BE_WRITE_LOCKED(&ip_nat_lock) -@@ -200,7 +201,7 @@ ip_nat_out(unsigned int hooknum, - I'm starting to have nightmares about fragments. 
*/ - - if ((*pskb)->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) { -- *pskb = ip_ct_gather_frags(*pskb); -+ *pskb = ip_ct_gather_frags(*pskb, IP_DEFRAG_NAT_OUT); - - if (!*pskb) - return NF_STOLEN; -@@ -284,7 +285,7 @@ int ip_nat_protocol_register(struct ip_n - struct list_head *i; - - WRITE_LOCK(&ip_nat_lock); -- list_for_each(i, &protos) { -+ list_for_each(i, &ve_ip_nat_protos) { - if (((struct ip_nat_protocol *)i)->protonum - == proto->protonum) { - ret = -EBUSY; -@@ -292,23 +293,70 @@ int ip_nat_protocol_register(struct ip_n - } - } - -- list_prepend(&protos, proto); -+ list_prepend(&ve_ip_nat_protos, proto); - out: - WRITE_UNLOCK(&ip_nat_lock); - return ret; - } - -+int visible_ip_nat_protocol_register(struct ip_nat_protocol *proto) -+{ -+ int ret = 0; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct ip_nat_protocol *tmp; -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ip_nat_protocol), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, proto, sizeof(struct ip_nat_protocol)); -+ proto = tmp; -+ } -+ -+ ret = ip_nat_protocol_register(proto); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) -+ kfree(proto); -+nomem: -+ return ret; -+} -+ - /* Noone stores the protocol anywhere; simply delete it. */ - void ip_nat_protocol_unregister(struct ip_nat_protocol *proto) - { - WRITE_LOCK(&ip_nat_lock); -- LIST_DELETE(&protos, proto); -+ LIST_DELETE(&ve_ip_nat_protos, proto); - WRITE_UNLOCK(&ip_nat_lock); - - /* Someone could be still looking at the proto in a bh. 
*/ - synchronize_net(); - } - -+void visible_ip_nat_protocol_unregister(struct ip_nat_protocol *proto) -+{ -+ struct ip_nat_protocol *i; -+ -+ READ_LOCK(&ip_nat_lock); -+ list_for_each_entry(i, &ve_ip_nat_protos, list) { -+ if (i->protonum == proto->protonum) { -+ proto = i; -+ break; -+ } -+ } -+ READ_UNLOCK(&ip_nat_lock); -+ if (proto != i) -+ return; -+ -+ ip_nat_protocol_unregister(proto); -+ -+ if (!ve_is_super(get_exec_env())) -+ kfree(proto); -+} -+ - static int init_or_cleanup(int init) - { - int ret = 0; -@@ -317,77 +365,113 @@ static int init_or_cleanup(int init) - - if (!init) goto cleanup; - -+ if (!ve_is_super(get_exec_env())) -+ __module_get(THIS_MODULE); -+ - ret = ip_nat_rule_init(); - if (ret < 0) { - printk("ip_nat_init: can't setup rules.\n"); -- goto cleanup_nothing; -+ goto cleanup_modput; - } - ret = ip_nat_init(); - if (ret < 0) { - printk("ip_nat_init: can't setup rules.\n"); - goto cleanup_rule_init; - } -- ret = nf_register_hook(&ip_nat_in_ops); -+ if (ve_is_super(get_exec_env()) && !ip_conntrack_enable_ve0) -+ return 0; -+ -+ ret = visible_nf_register_hook(&ip_nat_in_ops); - if (ret < 0) { - printk("ip_nat_init: can't register in hook.\n"); - goto cleanup_nat; - } -- ret = nf_register_hook(&ip_nat_out_ops); -+ ret = visible_nf_register_hook(&ip_nat_out_ops); - if (ret < 0) { - printk("ip_nat_init: can't register out hook.\n"); - goto cleanup_inops; - } - #ifdef CONFIG_IP_NF_NAT_LOCAL -- ret = nf_register_hook(&ip_nat_local_out_ops); -+ ret = visible_nf_register_hook(&ip_nat_local_out_ops); - if (ret < 0) { - printk("ip_nat_init: can't register local out hook.\n"); - goto cleanup_outops; - } -- ret = nf_register_hook(&ip_nat_local_in_ops); -+ ret = visible_nf_register_hook(&ip_nat_local_in_ops); - if (ret < 0) { - printk("ip_nat_init: can't register local in hook.\n"); - goto cleanup_localoutops; - } - #endif -- return ret; -+ return 0; - - cleanup: -+ if (ve_is_super(get_exec_env()) && !ip_conntrack_enable_ve0) -+ goto cleanup_nat; - 
#ifdef CONFIG_IP_NF_NAT_LOCAL -- nf_unregister_hook(&ip_nat_local_in_ops); -+ visible_nf_unregister_hook(&ip_nat_local_in_ops); - cleanup_localoutops: -- nf_unregister_hook(&ip_nat_local_out_ops); -+ visible_nf_unregister_hook(&ip_nat_local_out_ops); - cleanup_outops: - #endif -- nf_unregister_hook(&ip_nat_out_ops); -+ visible_nf_unregister_hook(&ip_nat_out_ops); - cleanup_inops: -- nf_unregister_hook(&ip_nat_in_ops); -+ visible_nf_unregister_hook(&ip_nat_in_ops); - cleanup_nat: - ip_nat_cleanup(); - cleanup_rule_init: - ip_nat_rule_cleanup(); -- cleanup_nothing: -+ cleanup_modput: -+ if (!ve_is_super(get_exec_env())) -+ module_put(THIS_MODULE); - MUST_BE_READ_WRITE_UNLOCKED(&ip_nat_lock); - return ret; - } - --static int __init init(void) -+int init_iptable_nat(void) - { - return init_or_cleanup(1); - } - --static void __exit fini(void) -+void fini_iptable_nat(void) - { - init_or_cleanup(0); - } - --module_init(init); -+static int __init init(void) -+{ -+ int err; -+ -+ err = init_iptable_nat(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_nat); -+ KSYMRESOLVE(fini_iptable_nat); -+ KSYMMODRESOLVE(iptable_nat); -+ return 0; -+} -+ -+static void __exit fini(void) -+{ -+ KSYMMODUNRESOLVE(iptable_nat); -+ KSYMUNRESOLVE(init_iptable_nat); -+ KSYMUNRESOLVE(fini_iptable_nat); -+ fini_iptable_nat(); -+} -+ -+fs_initcall(init); - module_exit(fini); - - EXPORT_SYMBOL(ip_nat_setup_info); - EXPORT_SYMBOL(ip_nat_protocol_register); -+EXPORT_SYMBOL(visible_ip_nat_protocol_register); - EXPORT_SYMBOL(ip_nat_protocol_unregister); -+EXPORT_SYMBOL(visible_ip_nat_protocol_unregister); - EXPORT_SYMBOL(ip_nat_helper_register); -+EXPORT_SYMBOL(visible_ip_nat_helper_register); - EXPORT_SYMBOL(ip_nat_helper_unregister); -+EXPORT_SYMBOL(visible_ip_nat_helper_unregister); - EXPORT_SYMBOL(ip_nat_cheat_check); - EXPORT_SYMBOL(ip_nat_mangle_tcp_packet); - EXPORT_SYMBOL(ip_nat_mangle_udp_packet); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_queue.c 
linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_queue.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_queue.c 2004-08-14 14:56:25.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_queue.c 2006-05-11 13:05:42.000000000 +0400 -@@ -3,6 +3,7 @@ - * communicating with userspace via netlink. - * - * (C) 2000-2002 James Morris <jmorris@intercode.com.au> -+ * (C) 2003-2005 Netfilter Core Team <coreteam@netfilter.org> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as -@@ -14,6 +15,7 @@ - * Zander). - * 2000-08-01: Added Nick Williams' MAC support. - * 2002-06-25: Code cleanup. -+ * 2005-05-26: local_bh_{disable,enable} around nf_reinject (Harald Welte) - * - */ - #include <linux/module.h> -@@ -66,7 +68,15 @@ static DECLARE_MUTEX(ipqnl_sem); - static void - ipq_issue_verdict(struct ipq_queue_entry *entry, int verdict) - { -+ /* TCP input path (and probably other bits) assume to be called -+ * from softirq context, not from syscall, like ipq_issue_verdict is -+ * called. TCP input path deadlocks with locks taken from timer -+ * softirq, e.g. 
We therefore emulate this by local_bh_disable() */ -+ -+ local_bh_disable(); - nf_reinject(entry->skb, entry->info, verdict); -+ local_bh_enable(); -+ - kfree(entry); - } - -@@ -540,7 +550,14 @@ ipq_rcv_sk(struct sock *sk, int len) - return; - - while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { -+#ifdef CONFIG_VE -+ struct ve_struct *env; -+ env = set_exec_env(VE_OWNER_SKB(skb)); -+#endif - ipq_rcv_skb(skb); -+#ifdef CONFIG_VE -+ (void)set_exec_env(env); -+#endif - kfree_skb(skb); - } - -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ip_tables.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_tables.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ip_tables.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ip_tables.c 2006-05-11 13:05:49.000000000 +0400 -@@ -23,12 +23,20 @@ - #include <linux/udp.h> - #include <linux/icmp.h> - #include <net/ip.h> -+#include <net/compat.h> - #include <asm/uaccess.h> - #include <asm/semaphore.h> - #include <linux/proc_fs.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ip_tables.h> - -+#include <ub/ub_mem.h> -+ -+#ifdef CONFIG_USER_RESOURCE -+#include <ub/beancounter.h> -+#endif -+ - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Netfilter Core Team <coreteam@netfilter.org>"); - MODULE_DESCRIPTION("IPv4 packet filter"); -@@ -108,6 +116,52 @@ struct ipt_table_info - static LIST_HEAD(ipt_target); - static LIST_HEAD(ipt_match); - static LIST_HEAD(ipt_tables); -+ -+#ifdef CONFIG_VE_IPTABLES -+/* include ve.h and define get_exec_env */ -+#include <linux/sched.h> -+ -+int init_iptables(void); -+ -+#define ve_ipt_target (*(get_exec_env()->_ipt_target)) -+#define ve_ipt_match (*(get_exec_env()->_ipt_match)) -+#define ve_ipt_tables (*(get_exec_env()->_ipt_tables)) -+#define ve_ipt_standard_target (*(get_exec_env()->_ipt_standard_target)) -+#define ve_ipt_error_target (*(get_exec_env()->_ipt_error_target)) -+#define ve_tcp_matchstruct (*(get_exec_env()->_tcp_matchstruct)) -+#define 
ve_udp_matchstruct (*(get_exec_env()->_udp_matchstruct)) -+#define ve_icmp_matchstruct (*(get_exec_env()->_icmp_matchstruct)) -+ -+ -+#ifdef CONFIG_USER_RESOURCE -+#define UB_NUMIPTENT 23 -+static int charge_iptables(struct user_beancounter *ub, unsigned long size) -+{ -+ if (ub == NULL) -+ return 0; -+ return charge_beancounter(ub, UB_NUMIPTENT, size, 1); -+} -+static void uncharge_iptables(struct user_beancounter *ub, unsigned long size) -+{ -+ if (ub == NULL) -+ return; -+ uncharge_beancounter(ub, UB_NUMIPTENT, size); -+} -+#endif /* CONFIG_USER_RESOURCE */ -+ -+#else /* CONFIG_VE_IPTABLES */ -+ -+#define ve_ipt_target ipt_target -+#define ve_ipt_match ipt_match -+#define ve_ipt_tables ipt_tables -+#define ve_ipt_standard_target ipt_standard_target -+#define ve_ipt_error_target ipt_error_target -+#define ve_tcp_matchstruct tcp_matchstruct -+#define ve_udp_matchstruct udp_matchstruct -+#define ve_icmp_matchstruct icmp_matchstruct -+ -+#endif /* CONFIG_VE_IPTABLES */ -+ - #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0) - - #ifdef CONFIG_SMP -@@ -122,6 +176,29 @@ static LIST_HEAD(ipt_tables); - #define up(x) do { printk("UP:%u:" #x "\n", __LINE__); up(x); } while(0) - #endif - -+static struct ipt_table_info *ipt_table_info_alloc(int size) -+{ -+ struct ipt_table_info *newinfo; -+ -+ if (size >= PAGE_SIZE) -+ newinfo = ub_vmalloc_best(size); -+ else -+ newinfo = ub_kmalloc(size, GFP_KERNEL); -+ -+ return newinfo; -+} -+ -+static void ipt_table_info_free(struct ipt_table_info *info) -+{ -+ if ((unsigned long)info >= VMALLOC_START && -+ (unsigned long)info < VMALLOC_END) -+ vfree(info); -+ else -+ kfree(info); -+} -+ -+#define ipt_table_info_ub(info) (mem_ub(info)) -+ - /* Returns whether matches rule or not. 
*/ - static inline int - ip_packet_match(const struct iphdr *ip, -@@ -310,7 +387,7 @@ ipt_do_table(struct sk_buff **pskb, - do { - IP_NF_ASSERT(e); - IP_NF_ASSERT(back); -- (*pskb)->nfcache |= e->nfcache; -+ (*pskb)->nfcache |= e->nfcache & NFC_IPT_MASK; - if (ip_packet_match(ip, indev, outdev, &e->ip, offset)) { - struct ipt_entry_target *t; - -@@ -417,9 +494,9 @@ find_inlist_lock_noload(struct list_head - - #if 0 - duprintf("find_inlist: searching for `%s' in %s.\n", -- name, head == &ipt_target ? "ipt_target" -- : head == &ipt_match ? "ipt_match" -- : head == &ipt_tables ? "ipt_tables" : "UNKNOWN"); -+ name, head == &ve_ipt_target ? "ipt_target" -+ : head == &ve_ipt_match ? "ipt_match" -+ : head == &ve_ipt_tables ? "ipt_tables" : "UNKNOWN"); - #endif - - *error = down_interruptible(mutex); -@@ -460,19 +537,19 @@ find_inlist_lock(struct list_head *head, - static inline struct ipt_table * - ipt_find_table_lock(const char *name, int *error, struct semaphore *mutex) - { -- return find_inlist_lock(&ipt_tables, name, "iptable_", error, mutex); -+ return find_inlist_lock(&ve_ipt_tables, name, "iptable_", error, mutex); - } - - static inline struct ipt_match * - find_match_lock(const char *name, int *error, struct semaphore *mutex) - { -- return find_inlist_lock(&ipt_match, name, "ipt_", error, mutex); -+ return find_inlist_lock(&ve_ipt_match, name, "ipt_", error, mutex); - } - - struct ipt_target * - ipt_find_target_lock(const char *name, int *error, struct semaphore *mutex) - { -- return find_inlist_lock(&ipt_target, name, "ipt_", error, mutex); -+ return find_inlist_lock(&ve_ipt_target, name, "ipt_", error, mutex); - } - - /* All zeroes == unconditional rule. 
*/ -@@ -513,7 +590,7 @@ mark_source_chains(struct ipt_table_info - = (void *)ipt_get_target(e); - - if (e->comefrom & (1 << NF_IP_NUMHOOKS)) { -- printk("iptables: loop hook %u pos %u %08X.\n", -+ ve_printk(VE_LOG, "iptables: loop hook %u pos %u %08X.\n", - hook, pos, e->comefrom); - return 0; - } -@@ -583,7 +660,6 @@ mark_source_chains(struct ipt_table_info - } - return 1; - } -- - static inline int - cleanup_match(struct ipt_entry_match *m, unsigned int *i) - { -@@ -607,7 +683,7 @@ standard_check(const struct ipt_entry_ta - if (t->u.target_size - != IPT_ALIGN(sizeof(struct ipt_standard_target))) { - duprintf("standard_check: target size %u != %u\n", -- t->u.target_size, -+ t->u.target_size, (unsigned int) - IPT_ALIGN(sizeof(struct ipt_standard_target))); - return 0; - } -@@ -698,7 +774,7 @@ check_entry(struct ipt_entry *e, const c - t->u.kernel.target = target; - up(&ipt_mutex); - -- if (t->u.kernel.target == &ipt_standard_target) { -+ if (t->u.kernel.target == &ve_ipt_standard_target) { - if (!standard_check(t, size)) { - ret = -EINVAL; - goto cleanup_matches; -@@ -866,6 +942,69 @@ translate_table(const char *name, - return ret; - } - -+#if defined(CONFIG_VE_IPTABLES) && defined(CONFIG_USER_RESOURCE) -+static int charge_replace_table(struct ipt_table_info *oldinfo, -+ struct ipt_table_info *newinfo) -+{ -+ struct user_beancounter *old_ub, *new_ub; -+ int old_number, new_number; -+ -+ old_ub = ipt_table_info_ub(oldinfo); -+ new_ub = ipt_table_info_ub(newinfo); -+ old_number = oldinfo->number; -+ new_number = newinfo->number; -+ -+ /* XXX: I don't understand the code below and am not sure that it does -+ * something reasonable. 
2002/04/26 SAW */ -+ if (old_ub == new_ub) { -+ int charge; -+ /* charge only differences in entries */ -+ charge = new_number - old_number; -+ if (charge > 0) { -+ if (charge_iptables(old_ub, charge)) -+ return -1; -+ } else -+ uncharge_iptables(old_ub, -charge); -+ } else { -+ /* different contexts; do charge current and uncharge old */ -+ if (charge_iptables(new_ub, new_number)) -+ return -1; -+ uncharge_iptables(old_ub, old_number); -+ } -+ return 0; -+} -+#endif -+ -+static int setup_table(struct ipt_table *table, struct ipt_table_info *info) -+{ -+#ifdef CONFIG_NETFILTER_DEBUG -+ { -+ struct ipt_entry *table_base; -+ unsigned int i; -+ -+ for (i = 0; i < NR_CPUS; i++) { -+ table_base = -+ (void *)info->entries -+ + TABLE_OFFSET(info, i); -+ -+ table_base->comefrom = 0xdead57ac; -+ } -+ } -+#endif -+#if defined(CONFIG_VE_IPTABLES) && defined(CONFIG_USER_RESOURCE) -+ { -+ struct user_beancounter *ub; -+ -+ ub = ipt_table_info_ub(info); -+ if (charge_iptables(ub, info->number)) -+ return -ENOMEM; -+ } -+#endif -+ table->private = info; -+ info->initial_entries = 0; -+ return 0; -+} -+ - static struct ipt_table_info * - replace_table(struct ipt_table *table, - unsigned int num_counters, -@@ -900,6 +1039,16 @@ replace_table(struct ipt_table *table, - return NULL; - } - oldinfo = table->private; -+ -+#if defined(CONFIG_VE_IPTABLES) && defined(CONFIG_USER_RESOURCE) -+ if (charge_replace_table(oldinfo, newinfo)) { -+ oldinfo = NULL; -+ write_unlock_bh(&table->lock); -+ *error = -ENOMEM; -+ return NULL; -+ } -+#endif -+ - table->private = newinfo; - newinfo->initial_entries = oldinfo->initial_entries; - write_unlock_bh(&table->lock); -@@ -936,24 +1085,19 @@ get_counters(const struct ipt_table_info - } - } - --static int --copy_entries_to_user(unsigned int total_size, -- struct ipt_table *table, -- void __user *userptr) -+static inline struct ipt_counters * alloc_counters(struct ipt_table *table) - { -- unsigned int off, num, countersize; -- struct ipt_entry *e; - 
struct ipt_counters *counters; -- int ret = 0; -+ unsigned int countersize; - - /* We need atomic snapshot of counters: rest doesn't change - (other than comefrom, which userspace doesn't care - about). */ - countersize = sizeof(struct ipt_counters) * table->private->number; -- counters = vmalloc(countersize); -+ counters = vmalloc_best(countersize); - - if (counters == NULL) -- return -ENOMEM; -+ return ERR_PTR(-ENOMEM); - - /* First, sum counters... */ - memset(counters, 0, countersize); -@@ -961,6 +1105,23 @@ copy_entries_to_user(unsigned int total_ - get_counters(table->private, counters); - write_unlock_bh(&table->lock); - -+ return counters; -+} -+ -+static int -+copy_entries_to_user(unsigned int total_size, -+ struct ipt_table *table, -+ void __user *userptr) -+{ -+ unsigned int off, num; -+ struct ipt_entry *e; -+ struct ipt_counters *counters; -+ int ret = 0; -+ -+ counters = alloc_counters(table); -+ if (IS_ERR(counters)) -+ return PTR_ERR(counters); -+ - /* ... then copy entire thing from CPU 0... 
*/ - if (copy_to_user(userptr, table->private->entries, total_size) != 0) { - ret = -EFAULT; -@@ -1015,216 +1176,1207 @@ copy_entries_to_user(unsigned int total_ - return ret; - } - --static int --get_entries(const struct ipt_get_entries *entries, -- struct ipt_get_entries __user *uptr) -+#ifdef CONFIG_COMPAT -+static DECLARE_MUTEX(compat_ipt_mutex); -+ -+struct compat_delta { -+ struct compat_delta *next; -+ u_int16_t offset; -+ short delta; -+}; -+ -+static struct compat_delta *compat_offsets = NULL; -+ -+static int compat_add_offset(u_int16_t offset, short delta) - { -- int ret; -- struct ipt_table *t; -+ struct compat_delta *tmp; - -- t = ipt_find_table_lock(entries->name, &ret, &ipt_mutex); -- if (t) { -- duprintf("t->private->number = %u\n", -- t->private->number); -- if (entries->size == t->private->size) -- ret = copy_entries_to_user(t->private->size, -- t, uptr->entrytable); -- else { -- duprintf("get_entries: I've got %u not %u!\n", -- t->private->size, -- entries->size); -- ret = -EINVAL; -- } -- up(&ipt_mutex); -- } else -- duprintf("get_entries: Can't find %s!\n", -- entries->name); -+ tmp = kmalloc(sizeof(struct compat_delta), GFP_KERNEL); -+ if (!tmp) -+ return -ENOMEM; -+ tmp->offset = offset; -+ tmp->delta = delta; -+ if (compat_offsets) { -+ tmp->next = compat_offsets->next; -+ compat_offsets->next = tmp; -+ } else { -+ compat_offsets = tmp; -+ tmp->next = NULL; -+ } -+ return 0; -+} - -- return ret; -+static void compat_flush_offsets(void) -+{ -+ struct compat_delta *tmp, *next; -+ -+ if (compat_offsets) { -+ for(tmp = compat_offsets; tmp; tmp = next) { -+ next = tmp->next; -+ kfree(tmp); -+ } -+ compat_offsets = NULL; -+ } - } - --static int --do_replace(void __user *user, unsigned int len) -+static short compat_calc_jump(u_int16_t offset) - { -- int ret; -- struct ipt_replace tmp; -- struct ipt_table *t; -- struct ipt_table_info *newinfo, *oldinfo; -- struct ipt_counters *counters; -+ struct compat_delta *tmp; -+ short delta; - -- if 
(copy_from_user(&tmp, user, sizeof(tmp)) != 0) -- return -EFAULT; -+ for(tmp = compat_offsets, delta = 0; tmp; tmp = tmp->next) -+ if (tmp->offset < offset) -+ delta += tmp->delta; -+ return delta; -+} - -- /* Hack: Causes ipchains to give correct error msg --RR */ -- if (len != sizeof(tmp) + tmp.size) -- return -ENOPROTOOPT; -+struct compat_ipt_standard_target -+{ -+ struct compat_ipt_entry_target target; -+ compat_int_t verdict; -+}; - -- /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */ -- if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) -- return -ENOMEM; -+#define IPT_ST_OFFSET (sizeof(struct ipt_standard_target) - \ -+ sizeof(struct compat_ipt_standard_target)) - -- newinfo = vmalloc(sizeof(struct ipt_table_info) -- + SMP_ALIGN(tmp.size) * NR_CPUS); -- if (!newinfo) -- return -ENOMEM; -+struct ipt_standard -+{ -+ struct ipt_entry entry; -+ struct ipt_standard_target target; -+}; - -- if (copy_from_user(newinfo->entries, user + sizeof(tmp), -- tmp.size) != 0) { -- ret = -EFAULT; -- goto free_newinfo; -- } -+struct compat_ipt_standard -+{ -+ struct compat_ipt_entry entry; -+ struct compat_ipt_standard_target target; -+}; - -- counters = vmalloc(tmp.num_counters * sizeof(struct ipt_counters)); -- if (!counters) { -- ret = -ENOMEM; -- goto free_newinfo; -+static int compat_ipt_standard_fn(void *target, -+ void **dstptr, int *size, int convert) -+{ -+ struct compat_ipt_standard_target compat_st, *pcompat_st; -+ struct ipt_standard_target st, *pst; -+ int ret; -+ -+ ret = 0; -+ switch (convert) { -+ case COMPAT_TO_USER: -+ pst = (struct ipt_standard_target *)target; -+ memcpy(&compat_st.target, &pst->target, -+ sizeof(struct ipt_entry_target)); -+ compat_st.verdict = pst->verdict; -+ if (compat_st.verdict > 0) -+ compat_st.verdict -= -+ compat_calc_jump(compat_st.verdict); -+ compat_st.target.u.user.target_size = -+ sizeof(struct compat_ipt_standard_target); -+ if (__copy_to_user(*dstptr, &compat_st, -+ sizeof(struct 
compat_ipt_standard_target))) -+ ret = -EFAULT; -+ *size -= IPT_ST_OFFSET; -+ *dstptr += sizeof(struct compat_ipt_standard_target); -+ break; -+ case COMPAT_FROM_USER: -+ pcompat_st = -+ (struct compat_ipt_standard_target *)target; -+ memcpy(&st.target, &pcompat_st->target, -+ sizeof(struct ipt_entry_target)); -+ st.verdict = pcompat_st->verdict; -+ if (st.verdict > 0) -+ st.verdict += compat_calc_jump(st.verdict); -+ st.target.u.user.target_size = -+ sizeof(struct ipt_standard_target); -+ memcpy(*dstptr, &st, -+ sizeof(struct ipt_standard_target)); -+ *size += IPT_ST_OFFSET; -+ *dstptr += sizeof(struct ipt_standard_target); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += IPT_ST_OFFSET; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; - } -- memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters)); -+ return ret; -+} - -- ret = translate_table(tmp.name, tmp.valid_hooks, -- newinfo, tmp.size, tmp.num_entries, -- tmp.hook_entry, tmp.underflow); -- if (ret != 0) -- goto free_newinfo_counters; -+int ipt_target_align_compat(void *target, void **dstptr, -+ int *size, int off, int convert) -+{ -+ struct compat_ipt_entry_target *pcompat; -+ struct ipt_entry_target *pt; -+ u_int16_t tsize; -+ int ret; - -- duprintf("ip_tables: Translated table\n"); -+ ret = 0; -+ switch (convert) { -+ case COMPAT_TO_USER: -+ pt = (struct ipt_entry_target *)target; -+ tsize = pt->u.user.target_size; -+ if (__copy_to_user(*dstptr, pt, tsize)) { -+ ret = -EFAULT; -+ break; -+ } -+ tsize -= off; -+ if (put_user(tsize, (u_int16_t *)*dstptr)) -+ ret = -EFAULT; -+ *size -= off; -+ *dstptr += tsize; -+ break; -+ case COMPAT_FROM_USER: -+ pcompat = (struct compat_ipt_entry_target *)target; -+ pt = (struct ipt_entry_target *)*dstptr; -+ tsize = pcompat->u.user.target_size; -+ memcpy(pt, pcompat, tsize); -+ tsize += off; -+ pt->u.user.target_size = tsize; -+ *size += off; -+ *dstptr += tsize; -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ break; -+ default: -+ ret = 
-ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} - -- t = ipt_find_table_lock(tmp.name, &ret, &ipt_mutex); -- if (!t) -- goto free_newinfo_counters_untrans; -+int ipt_match_align_compat(void *match, void **dstptr, -+ int *size, int off, int convert) -+{ -+ struct compat_ipt_entry_match *pcompat_m; -+ struct ipt_entry_match *pm; -+ u_int16_t msize; -+ int ret; - -- /* You lied! */ -- if (tmp.valid_hooks != t->valid_hooks) { -- duprintf("Valid hook crap: %08X vs %08X\n", -- tmp.valid_hooks, t->valid_hooks); -- ret = -EINVAL; -- goto free_newinfo_counters_untrans_unlock; -+ ret = 0; -+ switch (convert) { -+ case COMPAT_TO_USER: -+ pm = (struct ipt_entry_match *)match; -+ msize = pm->u.user.match_size; -+ if (__copy_to_user(*dstptr, pm, msize)) { -+ ret = -EFAULT; -+ break; -+ } -+ msize -= off; -+ if (put_user(msize, (u_int16_t *)*dstptr)) -+ ret = -EFAULT; -+ *size -= off; -+ *dstptr += msize; -+ break; -+ case COMPAT_FROM_USER: -+ pcompat_m = (struct compat_ipt_entry_match *)match; -+ pm = (struct ipt_entry_match *)*dstptr; -+ msize = pcompat_m->u.user.match_size; -+ memcpy(pm, pcompat_m, msize); -+ msize += off; -+ pm->u.user.match_size = msize; -+ *size += off; -+ *dstptr += msize; -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; - } -+ return ret; -+} - -- /* Get a reference in advance, we're not allowed fail later */ -- if (!try_module_get(t->me)) { -- ret = -EBUSY; -- goto free_newinfo_counters_untrans_unlock; -- } -+static int tcp_compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; - -+ off = IPT_ALIGN(sizeof(struct ipt_tcp)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_tcp)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} - -- oldinfo = replace_table(t, tmp.num_counters, newinfo, &ret); -- if (!oldinfo) -- goto put_module; -+static int udp_compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; - -- /* Update module usage count based 
on number of rules */ -- duprintf("do_replace: oldnum=%u, initnum=%u, newnum=%u\n", -- oldinfo->number, oldinfo->initial_entries, newinfo->number); -- if ((oldinfo->number > oldinfo->initial_entries) || -- (newinfo->number <= oldinfo->initial_entries)) -- module_put(t->me); -- if ((oldinfo->number > oldinfo->initial_entries) && -- (newinfo->number <= oldinfo->initial_entries)) -- module_put(t->me); -+ off = IPT_ALIGN(sizeof(struct ipt_udp)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_udp)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} - -- /* Get the old counters. */ -- get_counters(oldinfo, counters); -- /* Decrease module usage counts and free resource */ -- IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL); -- vfree(oldinfo); -- /* Silent error: too late now. */ -- copy_to_user(tmp.counters, counters, -- sizeof(struct ipt_counters) * tmp.num_counters); -- vfree(counters); -- up(&ipt_mutex); -- return 0; -+static int icmp_compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; - -- put_module: -- module_put(t->me); -- free_newinfo_counters_untrans_unlock: -- up(&ipt_mutex); -- free_newinfo_counters_untrans: -- IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL); -- free_newinfo_counters: -- vfree(counters); -- free_newinfo: -- vfree(newinfo); -- return ret; -+ off = IPT_ALIGN(sizeof(struct ipt_icmp)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_icmp)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); - } - --/* We're lazy, and add to the first CPU; overflow works its fey magic -- * and everything is OK. 
*/ - static inline int --add_counter_to_entry(struct ipt_entry *e, -- const struct ipt_counters addme[], -- unsigned int *i) -+compat_calc_match(struct ipt_entry_match *m, int * size) - { --#if 0 -- duprintf("add_counter: Entry %u %lu/%lu + %lu/%lu\n", -- *i, -- (long unsigned int)e->counters.pcnt, -- (long unsigned int)e->counters.bcnt, -- (long unsigned int)addme[*i].pcnt, -- (long unsigned int)addme[*i].bcnt); --#endif -- -- ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt); -- -- (*i)++; -+ if (m->u.kernel.match->compat) -+ m->u.kernel.match->compat(m, NULL, size, COMPAT_CALC_SIZE); - return 0; - } - --static int --do_add_counters(void __user *user, unsigned int len) -+static int compat_calc_entry(struct ipt_entry *e, -+ struct ipt_table_info *info, struct ipt_table_info *newinfo) - { -- unsigned int i; -- struct ipt_counters_info tmp, *paddc; -- struct ipt_table *t; -- int ret; -+ struct ipt_entry_target *t; -+ u_int16_t entry_offset; -+ int off, i, ret; - -- if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) -- return -EFAULT; -+ off = 0; -+ entry_offset = (void *)e - (void *)info->entries; -+ IPT_MATCH_ITERATE(e, compat_calc_match, &off); -+ t = ipt_get_target(e); -+ if (t->u.kernel.target->compat) -+ t->u.kernel.target->compat(t, NULL, &off, COMPAT_CALC_SIZE); -+ newinfo->size -= off; -+ ret = compat_add_offset(entry_offset, off); -+ if (ret) -+ return ret; -+ -+ for (i = 0; i< NF_IP_NUMHOOKS; i++) { -+ if (info->hook_entry[i] && (e < (struct ipt_entry *) -+ (info->entries + info->hook_entry[i]))) -+ newinfo->hook_entry[i] -= off; -+ if (info->underflow[i] && (e < (struct ipt_entry *) -+ (info->entries + info->underflow[i]))) -+ newinfo->underflow[i] -= off; -+ } -+ return 0; -+} - -- if (len != sizeof(tmp) + tmp.num_counters*sizeof(struct ipt_counters)) -+static int compat_table_info(struct ipt_table_info *info, -+ struct ipt_table_info *newinfo) -+{ -+ if (!newinfo) - return -EINVAL; - -- paddc = vmalloc(len); -+ memcpy(newinfo, info, sizeof(struct 
ipt_table_info)); -+ return IPT_ENTRY_ITERATE(info->entries, -+ info->size, compat_calc_entry, info, newinfo); -+} -+#endif -+ -+static int get_info(void __user *user, int *len) -+{ -+ char name[IPT_TABLE_MAXNAMELEN]; -+ struct ipt_table *t; -+ int ret, size; -+ -+#ifdef CONFIG_COMPAT -+ if (is_current_32bits()) -+ size = sizeof(struct compat_ipt_getinfo); -+ else -+#endif -+ size = sizeof(struct ipt_getinfo); -+ -+ if (*len != size) { -+ duprintf("length %u != %u\n", *len, -+ (unsigned int)sizeof(struct ipt_getinfo)); -+ return -EINVAL; -+ } -+ -+ if (copy_from_user(name, user, sizeof(name)) != 0) -+ return -EFAULT; -+ -+ name[IPT_TABLE_MAXNAMELEN-1] = '\0'; -+#ifdef CONFIG_COMPAT -+ down(&compat_ipt_mutex); -+#endif -+ t = ipt_find_table_lock(name, &ret, &ipt_mutex); -+ if (t) { -+ struct ipt_getinfo info; -+#ifdef CONFIG_COMPAT -+ struct compat_ipt_getinfo compat_info; -+#endif -+ void *pinfo; -+ -+#ifdef CONFIG_COMPAT -+ if (is_current_32bits()) { -+ struct ipt_table_info t_info; -+ ret = compat_table_info(t->private, &t_info); -+ compat_flush_offsets(); -+ memcpy(compat_info.hook_entry, t_info.hook_entry, -+ sizeof(compat_info.hook_entry)); -+ memcpy(compat_info.underflow, t_info.underflow, -+ sizeof(compat_info.underflow)); -+ compat_info.valid_hooks = t->valid_hooks; -+ compat_info.num_entries = t->private->number; -+ compat_info.size = t_info.size; -+ strcpy(compat_info.name, name); -+ pinfo = (void *)&compat_info; -+ } else -+#endif -+ { -+ info.valid_hooks = t->valid_hooks; -+ memcpy(info.hook_entry, t->private->hook_entry, -+ sizeof(info.hook_entry)); -+ memcpy(info.underflow, t->private->underflow, -+ sizeof(info.underflow)); -+ info.num_entries = t->private->number; -+ info.size = t->private->size; -+ strcpy(info.name, name); -+ pinfo = (void *)&info; -+ } -+ -+ if (copy_to_user(user, pinfo, *len) != 0) -+ ret = -EFAULT; -+ else -+ ret = 0; -+ -+ up(&ipt_mutex); -+ } -+#ifdef CONFIG_COMPAT -+ up(&compat_ipt_mutex); -+#endif -+ return ret; -+} -+ 
-+static int -+get_entries(struct ipt_get_entries __user *uptr, int *len) -+{ -+ int ret; -+ struct ipt_get_entries get; -+ struct ipt_table *t; -+ -+ if (*len < sizeof(get)) { -+ duprintf("get_entries: %u < %d\n", *len, -+ (unsigned int)sizeof(get)); -+ return -EINVAL; -+ } -+ -+ if (copy_from_user(&get, uptr, sizeof(get)) != 0) -+ return -EFAULT; -+ -+ if (*len != sizeof(struct ipt_get_entries) + get.size) { -+ duprintf("get_entries: %u != %u\n", *len, -+ (unsigned int)(sizeof(struct ipt_get_entries) + -+ get.size)); -+ return -EINVAL; -+ } -+ -+ t = ipt_find_table_lock(get.name, &ret, &ipt_mutex); -+ if (t) { -+ duprintf("t->private->number = %u\n", -+ t->private->number); -+ if (get.size == t->private->size) -+ ret = copy_entries_to_user(t->private->size, -+ t, uptr->entrytable); -+ else { -+ duprintf("get_entries: I've got %u not %u!\n", -+ t->private->size, -+ get.size); -+ ret = -EINVAL; -+ } -+ up(&ipt_mutex); -+ } else -+ duprintf("get_entries: Can't find %s!\n", -+ get.name); -+ -+ return ret; -+} -+ -+static int -+__do_replace(const char *name, unsigned int valid_hooks, -+ struct ipt_table_info *newinfo, unsigned int size, -+ unsigned int num_counters, void __user *counters_ptr) -+{ -+ int ret; -+ struct ipt_table *t; -+ struct ipt_table_info *oldinfo; -+ struct ipt_counters *counters; -+ -+ counters = ub_vmalloc_best(num_counters * -+ sizeof(struct ipt_counters)); -+ if (!counters) { -+ ret = -ENOMEM; -+ goto out; -+ } -+ memset(counters, 0, num_counters * sizeof(struct ipt_counters)); -+ -+ t = ipt_find_table_lock(name, &ret, &ipt_mutex); -+ if (!t) -+ goto free_newinfo_counters_untrans; -+ -+ /* You lied! 
*/ -+ if (valid_hooks != t->valid_hooks) { -+ duprintf("Valid hook crap: %08X vs %08X\n", -+ valid_hooks, t->valid_hooks); -+ ret = -EINVAL; -+ goto free_newinfo_counters_untrans_unlock; -+ } -+ -+ /* Get a reference in advance, we're not allowed fail later */ -+ if (!try_module_get(t->me)) { -+ ret = -EBUSY; -+ goto free_newinfo_counters_untrans_unlock; -+ } -+ -+ oldinfo = replace_table(t, num_counters, newinfo, &ret); -+ if (!oldinfo) -+ goto put_module; -+ -+ /* Update module usage count based on number of rules */ -+ duprintf("do_replace: oldnum=%u, initnum=%u, newnum=%u\n", -+ oldinfo->number, oldinfo->initial_entries, newinfo->number); -+ if ((oldinfo->number > oldinfo->initial_entries) || -+ (newinfo->number <= oldinfo->initial_entries)) -+ module_put(t->me); -+ if ((oldinfo->number > oldinfo->initial_entries) && -+ (newinfo->number <= oldinfo->initial_entries)) -+ module_put(t->me); -+ -+ /* Get the old counters. */ -+ get_counters(oldinfo, counters); -+ /* Decrease module usage counts and free resource */ -+ IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL); -+ ipt_table_info_free(oldinfo); -+ /* Silent error: too late now. 
*/ -+ copy_to_user(counters_ptr, counters, -+ sizeof(struct ipt_counters) * num_counters); -+ vfree(counters); -+ up(&ipt_mutex); -+ return 0; -+ put_module: -+ module_put(t->me); -+ free_newinfo_counters_untrans_unlock: -+ up(&ipt_mutex); -+ free_newinfo_counters_untrans: -+ vfree(counters); -+ out: -+ return ret; -+} -+ -+static int -+do_replace(void __user *user, unsigned int len) -+{ -+ int ret; -+ struct ipt_replace tmp; -+ struct ipt_table_info *newinfo; -+ -+ if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) -+ return -EFAULT; -+ -+ /* Hack: Causes ipchains to give correct error msg --RR */ -+ if (len != sizeof(tmp) + tmp.size) -+ return -ENOPROTOOPT; -+ -+ /* overflow check */ -+ if (tmp.size >= (INT_MAX - sizeof(struct ipt_table_info)) / NR_CPUS - -+ SMP_CACHE_BYTES) -+ return -ENOMEM; -+ if (tmp.num_counters >= INT_MAX / sizeof(struct ipt_counters)) -+ return -ENOMEM; -+ -+ /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */ -+ if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) -+ return -ENOMEM; -+ -+ newinfo = ipt_table_info_alloc(sizeof(struct ipt_table_info) -+ + SMP_ALIGN(tmp.size) * NR_CPUS); -+ if (!newinfo) -+ return -ENOMEM; -+ -+ if (copy_from_user(newinfo->entries, user + sizeof(tmp), tmp.size) != 0) { -+ ret = -EFAULT; -+ goto free_newinfo; -+ } -+ -+ ret = translate_table(tmp.name, tmp.valid_hooks, -+ newinfo, tmp.size, tmp.num_entries, -+ tmp.hook_entry, tmp.underflow); -+ if (ret != 0) -+ goto free_newinfo; -+ -+ duprintf("ip_tables: Translated table\n"); -+ -+ ret = __do_replace(tmp.name, tmp.valid_hooks, -+ newinfo, tmp.size, tmp.num_counters, -+ tmp.counters); -+ if (ret) -+ goto free_newinfo_untrans; -+ return 0; -+ -+ free_newinfo_untrans: -+ IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL); -+ free_newinfo: -+ ipt_table_info_free(newinfo); -+ return ret; -+} -+ -+/* We're lazy, and add to the first CPU; overflow works its fey magic -+ * and everything is OK. 
*/ -+static inline int -+add_counter_to_entry(struct ipt_entry *e, -+ const struct ipt_counters addme[], -+ unsigned int *i) -+{ -+#if 0 -+ duprintf("add_counter: Entry %u %lu/%lu + %lu/%lu\n", -+ *i, -+ (long unsigned int)e->counters.pcnt, -+ (long unsigned int)e->counters.bcnt, -+ (long unsigned int)addme[*i].pcnt, -+ (long unsigned int)addme[*i].bcnt); -+#endif -+ -+ ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt); -+ -+ (*i)++; -+ return 0; -+} -+ -+static int -+do_add_counters(void __user *user, unsigned int len) -+{ -+ unsigned int i; -+ struct ipt_counters_info tmp; -+ void *ptmp; -+ struct ipt_table *t; -+ unsigned int num_counters; -+ char *name; -+ struct ipt_counters *paddc; -+ int ret, size; -+#ifdef CONFIG_COMPAT -+ struct compat_ipt_counters_info compat_tmp; -+ -+ if (is_current_32bits()) { -+ ptmp = &compat_tmp; -+ size = sizeof(struct compat_ipt_counters_info); -+ } else -+#endif -+ { -+ ptmp = &tmp; -+ size = sizeof(struct ipt_counters_info); -+ } -+ -+ if (copy_from_user(ptmp, user, size) != 0) -+ return -EFAULT; -+ -+#ifdef CONFIG_COMPAT -+ if (is_current_32bits()) { -+ num_counters = compat_tmp.num_counters; -+ name = compat_tmp.name; -+ } else -+#endif -+ { -+ num_counters = tmp.num_counters; -+ name = tmp.name; -+ } -+ -+ if (len != size + num_counters * sizeof(struct ipt_counters)) -+ return -EINVAL; -+ -+ paddc = ub_vmalloc_best(len - size); - if (!paddc) - return -ENOMEM; - -- if (copy_from_user(paddc, user, len) != 0) { -+ if (copy_from_user(paddc, user + size, len - size) != 0) { -+ ret = -EFAULT; -+ goto free; -+ } -+ -+ t = ipt_find_table_lock(name, &ret, &ipt_mutex); -+ if (!t) -+ goto free; -+ -+ write_lock_bh(&t->lock); -+ if (t->private->number != num_counters) { -+ ret = -EINVAL; -+ goto unlock_up_free; -+ } -+ -+ i = 0; -+ IPT_ENTRY_ITERATE(t->private->entries, -+ t->private->size, -+ add_counter_to_entry, -+ paddc, -+ &i); -+ unlock_up_free: -+ write_unlock_bh(&t->lock); -+ up(&ipt_mutex); -+ free: -+ vfree(paddc); -+ -+ 
return ret; -+} -+ -+#ifdef CONFIG_COMPAT -+struct compat_ipt_replace { -+ char name[IPT_TABLE_MAXNAMELEN]; -+ u32 valid_hooks; -+ u32 num_entries; -+ u32 size; -+ u32 hook_entry[NF_IP_NUMHOOKS]; -+ u32 underflow[NF_IP_NUMHOOKS]; -+ u32 num_counters; -+ compat_uptr_t counters; /* struct ipt_counters * */ -+ struct compat_ipt_entry entries[0]; -+}; -+ -+static inline int compat_copy_match_to_user(struct ipt_entry_match *m, -+ void __user **dstptr, compat_uint_t *size) -+{ -+ if (m->u.kernel.match->compat) -+ m->u.kernel.match->compat(m, dstptr, size, COMPAT_TO_USER); -+ else { -+ if (__copy_to_user(*dstptr, m, m->u.match_size)) -+ return -EFAULT; -+ *dstptr += m->u.match_size; -+ } -+ return 0; -+} -+ -+static int compat_copy_entry_to_user(struct ipt_entry *e, -+ void __user **dstptr, compat_uint_t *size) -+{ -+ struct ipt_entry_target __user *t; -+ struct compat_ipt_entry __user *ce; -+ u_int16_t target_offset, next_offset; -+ compat_uint_t origsize; -+ int ret; -+ -+ ret = -EFAULT; -+ origsize = *size; -+ ce = (struct compat_ipt_entry __user *)*dstptr; -+ if (__copy_to_user(ce, e, sizeof(struct ipt_entry))) -+ goto out; -+ -+ *dstptr += sizeof(struct compat_ipt_entry); -+ ret = IPT_MATCH_ITERATE(e, compat_copy_match_to_user, dstptr, size); -+ target_offset = e->target_offset - (origsize - *size); -+ if (ret) -+ goto out; -+ t = ipt_get_target(e); -+ if (t->u.kernel.target->compat) { -+ ret = t->u.kernel.target->compat(t, -+ dstptr, size, COMPAT_TO_USER); -+ if (ret) -+ goto out; -+ } else { -+ ret = -EFAULT; -+ if (__copy_to_user(*dstptr, t, t->u.target_size)) -+ goto out; -+ *dstptr += t->u.target_size; -+ } -+ ret = -EFAULT; -+ next_offset = e->next_offset - (origsize - *size); -+ if (__put_user(target_offset, &ce->target_offset)) -+ goto out; -+ if (__put_user(next_offset, &ce->next_offset)) -+ goto out; -+ return 0; -+out: -+ return ret; -+} -+ -+static inline int -+compat_check_calc_match(struct ipt_entry_match *m, -+ const char *name, -+ const struct ipt_ip 
*ip, -+ unsigned int hookmask, -+ int *size, int *i) -+{ -+ int ret; -+ struct ipt_match *match; -+ -+ match = find_match_lock(m->u.user.name, &ret, &ipt_mutex); -+ if (!match) { -+ duprintf("check_match: `%s' not found\n", m->u.user.name); -+ return ret; -+ } -+ if (!try_module_get(match->me)) { -+ up(&ipt_mutex); -+ return -ENOENT; -+ } -+ m->u.kernel.match = match; -+ up(&ipt_mutex); -+ -+ if (m->u.kernel.match->compat) -+ m->u.kernel.match->compat(m, NULL, size, COMPAT_CALC_SIZE); -+ -+ (*i)++; -+ return 0; -+} -+ -+static inline int -+check_compat_entry_size_and_hooks(struct ipt_entry *e, -+ struct ipt_table_info *newinfo, -+ unsigned char *base, -+ unsigned char *limit, -+ unsigned int *hook_entries, -+ unsigned int *underflows, -+ unsigned int *i, -+ const char *name) -+{ -+ struct ipt_entry_target *t; -+ struct ipt_target *target; -+ u_int16_t entry_offset; -+ int ret, off, h, j; -+ -+ duprintf("check_compat_entry_size_and_hooks %p\n", e); -+ if ((unsigned long)e % __alignof__(struct compat_ipt_entry) != 0 -+ || (unsigned char *)e + sizeof(struct compat_ipt_entry) >= limit) { -+ duprintf("Bad offset %p, limit = %p\n", e, limit); -+ return -EINVAL; -+ } -+ -+ if (e->next_offset < sizeof(struct compat_ipt_entry) + -+ sizeof(struct compat_ipt_entry_target)) { -+ duprintf("checking: element %p size %u\n", -+ e, e->next_offset); -+ return -EINVAL; -+ } -+ -+ if (!ip_checkentry(&e->ip)) { -+ duprintf("ip_tables: ip check failed %p %s.\n", e, name); -+ return -EINVAL; -+ } -+ -+ off = 0; -+ entry_offset = (void *)e - (void *)base; -+ j = 0; -+ ret = IPT_MATCH_ITERATE(e, compat_check_calc_match, name, &e->ip, -+ e->comefrom, &off, &j); -+ if (ret != 0) -+ goto out; -+ -+ t = ipt_get_target(e); -+ target = ipt_find_target_lock(t->u.user.name, &ret, &ipt_mutex); -+ if (!target) { -+ duprintf("check_entry: `%s' not found\n", t->u.user.name); -+ goto out; -+ } -+ if (!try_module_get(target->me)) { -+ up(&ipt_mutex); -+ ret = -ENOENT; -+ goto out; -+ } -+ 
t->u.kernel.target = target; -+ up(&ipt_mutex); -+ -+ if (t->u.kernel.target->compat) -+ t->u.kernel.target->compat(t, NULL, &off, COMPAT_CALC_SIZE); -+ newinfo->size += off; -+ ret = compat_add_offset(entry_offset, off); -+ if (ret) -+ goto out; -+ -+ /* Check hooks & underflows */ -+ for (h = 0; h < NF_IP_NUMHOOKS; h++) { -+ if ((unsigned char *)e - base == hook_entries[h]) -+ newinfo->hook_entry[h] = hook_entries[h]; -+ if ((unsigned char *)e - base == underflows[h]) -+ newinfo->underflow[h] = underflows[h]; -+ } -+ -+ /* Clear counters and comefrom */ -+ e->counters = ((struct ipt_counters) { 0, 0 }); -+ e->comefrom = 0; -+ -+ (*i)++; -+ return 0; -+out: -+ IPT_MATCH_ITERATE(e, cleanup_match, &j); -+ return ret; -+} -+ -+static inline int compat_copy_match_from_user(struct ipt_entry_match *m, -+ void **dstptr, compat_uint_t *size, const char *name, -+ const struct ipt_ip *ip, unsigned int hookmask) -+{ -+ struct ipt_entry_match *dm; -+ -+ dm = (struct ipt_entry_match *)*dstptr; -+ if (m->u.kernel.match->compat) -+ m->u.kernel.match->compat(m, dstptr, size, COMPAT_FROM_USER); -+ else { -+ memcpy(*dstptr, m, m->u.match_size); -+ *dstptr += m->u.match_size; -+ } -+ -+ if (dm->u.kernel.match->checkentry -+ && !dm->u.kernel.match->checkentry(name, ip, dm->data, -+ dm->u.match_size - sizeof(*dm), -+ hookmask)) { -+ module_put(dm->u.kernel.match->me); -+ duprintf("ip_tables: check failed for `%s'.\n", -+ dm->u.kernel.match->name); -+ return -EINVAL; -+ } -+ -+ return 0; -+} -+ -+static int compat_copy_entry_from_user(struct ipt_entry *e, void **dstptr, -+ unsigned int *size, const char *name, -+ struct ipt_table_info *newinfo, unsigned char *base) -+{ -+ struct ipt_entry_target *t; -+ struct ipt_entry *de; -+ unsigned int origsize; -+ int ret, h; -+ -+ ret = 0; -+ origsize = *size; -+ de = (struct ipt_entry *)*dstptr; -+ memcpy(de, e, sizeof(struct ipt_entry)); -+ -+ *dstptr += sizeof(struct compat_ipt_entry); -+ ret = IPT_MATCH_ITERATE(e, compat_copy_match_from_user, 
dstptr, size, -+ name, &de->ip, de->comefrom); -+ if (ret) -+ goto out; -+ de->target_offset = e->target_offset - (origsize - *size); -+ t = ipt_get_target(e); -+ if (t->u.kernel.target->compat) -+ t->u.kernel.target->compat(t, -+ dstptr, size, COMPAT_FROM_USER); -+ else { -+ memcpy(*dstptr, t, t->u.target_size); -+ *dstptr += t->u.target_size; -+ } -+ -+ de->next_offset = e->next_offset - (origsize - *size); -+ for (h = 0; h < NF_IP_NUMHOOKS; h++) { -+ if ((unsigned char *)de - base < newinfo->hook_entry[h]) -+ newinfo->hook_entry[h] -= origsize - *size; -+ if ((unsigned char *)de - base < newinfo->underflow[h]) -+ newinfo->underflow[h] -= origsize - *size; -+ } -+ -+ ret = -EINVAL; -+ t = ipt_get_target(de); -+ if (t->u.kernel.target == &ve_ipt_standard_target) { -+ if (!standard_check(t, *size)) -+ goto out; -+ } else if (t->u.kernel.target->checkentry -+ && !t->u.kernel.target->checkentry(name, de, t->data, -+ t->u.target_size -+ - sizeof(*t), -+ de->comefrom)) { -+ module_put(t->u.kernel.target->me); -+ duprintf("ip_tables: compat: check failed for `%s'.\n", -+ t->u.kernel.target->name); -+ goto out; -+ } -+ ret = 0; -+out: -+ return ret; -+} -+ -+static int -+translate_compat_table(const char *name, -+ unsigned int valid_hooks, -+ struct ipt_table_info **pinfo, -+ unsigned int total_size, -+ unsigned int number, -+ unsigned int *hook_entries, -+ unsigned int *underflows) -+{ -+ unsigned int i; -+ struct ipt_table_info *newinfo, *info; -+ void *pos; -+ unsigned int size; -+ int ret; -+ -+ info = *pinfo; -+ info->size = total_size; -+ info->number = number; -+ -+ /* Init all hooks to impossible value. */ -+ for (i = 0; i < NF_IP_NUMHOOKS; i++) { -+ info->hook_entry[i] = 0xFFFFFFFF; -+ info->underflow[i] = 0xFFFFFFFF; -+ } -+ -+ duprintf("translate_compat_table: size %u\n", info->size); -+ i = 0; -+ down(&compat_ipt_mutex); -+ /* Walk through entries, checking offsets. 
*/ -+ ret = IPT_ENTRY_ITERATE(info->entries, total_size, -+ check_compat_entry_size_and_hooks, -+ info, info->entries, -+ info->entries + total_size, -+ hook_entries, underflows, &i, name); -+ if (ret != 0) -+ goto out_unlock; -+ -+ ret = -EINVAL; -+ if (i != number) { -+ duprintf("translate_compat_table: %u not %u entries\n", -+ i, number); -+ goto out_unlock; -+ } -+ -+ /* Check hooks all assigned */ -+ for (i = 0; i < NF_IP_NUMHOOKS; i++) { -+ /* Only hooks which are valid */ -+ if (!(valid_hooks & (1 << i))) -+ continue; -+ if (info->hook_entry[i] == 0xFFFFFFFF) { -+ duprintf("Invalid hook entry %u %u\n", -+ i, hook_entries[i]); -+ goto out_unlock; -+ } -+ if (info->underflow[i] == 0xFFFFFFFF) { -+ duprintf("Invalid underflow %u %u\n", -+ i, underflows[i]); -+ goto out_unlock; -+ } -+ } -+ -+ ret = -ENOMEM; -+ newinfo = ipt_table_info_alloc(sizeof(struct ipt_table_info) -+ + SMP_ALIGN(info->size) * NR_CPUS); -+ if (!newinfo) -+ goto out_unlock; -+ -+ memcpy(newinfo, info, sizeof(struct ipt_table_info)); -+ pos = newinfo->entries; -+ size = total_size; -+ ret = IPT_ENTRY_ITERATE(info->entries, total_size, -+ compat_copy_entry_from_user, &pos, &size, -+ name, newinfo, newinfo->entries); -+ compat_flush_offsets(); -+ up(&compat_ipt_mutex); -+ if (ret) -+ goto free_newinfo; -+ -+ ret = -ELOOP; -+ if (!mark_source_chains(newinfo, valid_hooks)) -+ goto free_newinfo; -+ -+ /* And one copy for every other CPU */ -+ for (i = 1; i < NR_CPUS; i++) { -+ memcpy(newinfo->entries + SMP_ALIGN(newinfo->size)*i, -+ newinfo->entries, -+ SMP_ALIGN(newinfo->size)); -+ } -+ -+ *pinfo = newinfo; -+ ipt_table_info_free(info); -+ return 0; -+ -+free_newinfo: -+ ipt_table_info_free(newinfo); -+out: -+ return ret; -+out_unlock: -+ up(&compat_ipt_mutex); -+ goto out; -+} -+ -+static int -+compat_do_replace(void __user *user, unsigned int len) -+{ -+ int ret; -+ struct compat_ipt_replace tmp; -+ struct ipt_table_info *newinfo; -+ -+ if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) -+ 
return -EFAULT; -+ -+ /* Hack: Causes ipchains to give correct error msg --RR */ -+ if (len != sizeof(tmp) + tmp.size) -+ return -ENOPROTOOPT; -+ -+ /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */ -+ if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages) -+ return -ENOMEM; -+ -+ newinfo = ipt_table_info_alloc(sizeof(struct ipt_table_info) -+ + SMP_ALIGN(tmp.size) * NR_CPUS); -+ if (!newinfo) -+ return -ENOMEM; -+ -+ if (copy_from_user(newinfo->entries, user + sizeof(tmp), tmp.size) != 0) { - ret = -EFAULT; -- goto free; -+ goto free_newinfo; - } - -- t = ipt_find_table_lock(tmp.name, &ret, &ipt_mutex); -- if (!t) -- goto free; -+ ret = translate_compat_table(tmp.name, tmp.valid_hooks, -+ &newinfo, tmp.size, tmp.num_entries, -+ tmp.hook_entry, tmp.underflow); -+ if (ret != 0) -+ goto free_newinfo; - -- write_lock_bh(&t->lock); -- if (t->private->number != paddc->num_counters) { -- ret = -EINVAL; -- goto unlock_up_free; -+ duprintf("do_compat_replace: Translated table\n"); -+ -+ ret = __do_replace(tmp.name, tmp.valid_hooks, -+ newinfo, tmp.size, tmp.num_counters, -+ compat_ptr(tmp.counters)); -+ if (ret) -+ goto free_newinfo_untrans; -+ return 0; -+ -+ free_newinfo_untrans: -+ IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL); -+ free_newinfo: -+ ipt_table_info_free(newinfo); -+ return ret; -+} -+ -+struct compat_ipt_get_entries -+{ -+ char name[IPT_TABLE_MAXNAMELEN]; -+ compat_uint_t size; -+ struct compat_ipt_entry entrytable[0]; -+}; -+ -+static int compat_copy_entries_to_user(unsigned int total_size, -+ struct ipt_table *table, void __user *userptr) -+{ -+ unsigned int off, num; -+ struct compat_ipt_entry e; -+ struct ipt_counters *counters; -+ void __user *pos; -+ unsigned int size; -+ int ret = 0; -+ -+ counters = alloc_counters(table); -+ if (IS_ERR(counters)) -+ return PTR_ERR(counters); -+ -+ /* ... then copy entire thing from CPU 0... 
*/ -+ pos = userptr; -+ size = total_size; -+ ret = IPT_ENTRY_ITERATE(table->private->entries, -+ total_size, compat_copy_entry_to_user, &pos, &size); -+ -+ /* ... then go back and fix counters and names */ -+ for (off = 0, num = 0; off < size; off += e.next_offset, num++) { -+ unsigned int i; -+ struct ipt_entry_match m; -+ struct ipt_entry_target t; -+ -+ ret = -EFAULT; -+ if (copy_from_user(&e, userptr + off, -+ sizeof(struct compat_ipt_entry))) -+ goto free_counters; -+ if (copy_to_user(userptr + off + -+ offsetof(struct compat_ipt_entry, counters), -+ &counters[num], sizeof(counters[num]))) -+ goto free_counters; -+ -+ for (i = sizeof(struct compat_ipt_entry); -+ i < e.target_offset; i += m.u.match_size) { -+ if (copy_from_user(&m, userptr + off + i, -+ sizeof(struct ipt_entry_match))) -+ goto free_counters; -+ if (copy_to_user(userptr + off + i + -+ offsetof(struct ipt_entry_match, u.user.name), -+ m.u.kernel.match->name, -+ strlen(m.u.kernel.match->name) + 1)) -+ goto free_counters; -+ } -+ -+ if (copy_from_user(&t, userptr + off + e.target_offset, -+ sizeof(struct ipt_entry_target))) -+ goto free_counters; -+ if (copy_to_user(userptr + off + e.target_offset + -+ offsetof(struct ipt_entry_target, u.user.name), -+ t.u.kernel.target->name, -+ strlen(t.u.kernel.target->name) + 1)) -+ goto free_counters; - } -+ ret = 0; -+free_counters: -+ vfree(counters); -+ return ret; -+} - -- i = 0; -- IPT_ENTRY_ITERATE(t->private->entries, -- t->private->size, -- add_counter_to_entry, -- paddc->counters, -- &i); -- unlock_up_free: -- write_unlock_bh(&t->lock); -- up(&ipt_mutex); -- free: -- vfree(paddc); -+static int -+compat_get_entries(struct compat_ipt_get_entries __user *uptr, int *len) -+{ -+ int ret; -+ struct compat_ipt_get_entries get; -+ struct ipt_table *t; -+ -+ -+ if (*len < sizeof(get)) { -+ duprintf("compat_get_entries: %u < %u\n", -+ *len, (unsigned int)sizeof(get)); -+ return -EINVAL; -+ } -+ -+ if (copy_from_user(&get, uptr, sizeof(get)) != 0) -+ return 
-EFAULT; -+ -+ if (*len != sizeof(struct compat_ipt_get_entries) + get.size) { -+ duprintf("compat_get_entries: %u != %u\n", *len, -+ (unsigned int)(sizeof(struct compat_ipt_get_entries) + -+ get.size)); -+ return -EINVAL; -+ } -+ -+ down(&compat_ipt_mutex); -+ t = ipt_find_table_lock(get.name, &ret, &ipt_mutex); -+ if (t) { -+ struct ipt_table_info info; -+ duprintf("t->private->number = %u\n", -+ t->private->number); -+ ret = compat_table_info(t->private, &info); -+ if (!ret && get.size == info.size) { -+ ret = compat_copy_entries_to_user(t->private->size, -+ t, uptr->entrytable); -+ } else if (!ret) { -+ duprintf("compat_get_entries: I've got %u not %u!\n", -+ t->private->size, -+ get.size); -+ ret = -EINVAL; -+ } -+ compat_flush_offsets(); -+ up(&ipt_mutex); -+ } else -+ duprintf("compat_get_entries: Can't find %s!\n", -+ get.name); -+ up(&compat_ipt_mutex); -+ return ret; -+} -+ -+static int -+compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) -+{ -+ int ret; - -+ switch (cmd) { -+ case IPT_SO_GET_INFO: -+ ret = get_info(user, len); -+ break; -+ case IPT_SO_GET_ENTRIES: -+ ret = compat_get_entries(user, len); -+ break; -+ default: -+ duprintf("compat_do_ipt_get_ctl: unknown request %i\n", cmd); -+ ret = -EINVAL; -+ } - return ret; - } -+#endif - - static int - do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) - { - int ret; - -- if (!capable(CAP_NET_ADMIN)) -+ if (!capable(CAP_VE_NET_ADMIN)) - return -EPERM; - -+#ifdef CONFIG_COMPAT -+ if (is_current_32bits() && (cmd == IPT_SO_SET_REPLACE)) -+ return compat_do_replace(user, len); -+#endif -+ - switch (cmd) { - case IPT_SO_SET_REPLACE: - ret = do_replace(user, len); -@@ -1247,65 +2399,22 @@ do_ipt_get_ctl(struct sock *sk, int cmd, - { - int ret; - -- if (!capable(CAP_NET_ADMIN)) -+ if (!capable(CAP_VE_NET_ADMIN)) - return -EPERM; - -- switch (cmd) { -- case IPT_SO_GET_INFO: { -- char name[IPT_TABLE_MAXNAMELEN]; -- struct ipt_table *t; -- -- if (*len != 
sizeof(struct ipt_getinfo)) { -- duprintf("length %u != %u\n", *len, -- sizeof(struct ipt_getinfo)); -- ret = -EINVAL; -- break; -- } -- -- if (copy_from_user(name, user, sizeof(name)) != 0) { -- ret = -EFAULT; -- break; -- } -- name[IPT_TABLE_MAXNAMELEN-1] = '\0'; -- t = ipt_find_table_lock(name, &ret, &ipt_mutex); -- if (t) { -- struct ipt_getinfo info; -- -- info.valid_hooks = t->valid_hooks; -- memcpy(info.hook_entry, t->private->hook_entry, -- sizeof(info.hook_entry)); -- memcpy(info.underflow, t->private->underflow, -- sizeof(info.underflow)); -- info.num_entries = t->private->number; -- info.size = t->private->size; -- strcpy(info.name, name); -- -- if (copy_to_user(user, &info, *len) != 0) -- ret = -EFAULT; -- else -- ret = 0; -- -- up(&ipt_mutex); -- } -- } -- break; -+#ifdef CONFIG_COMPAT -+ if (is_current_32bits()) -+ return compat_do_ipt_get_ctl(sk, cmd, user, len); -+#endif - -- case IPT_SO_GET_ENTRIES: { -- struct ipt_get_entries get; -+ switch (cmd) { -+ case IPT_SO_GET_INFO: -+ ret = get_info(user, len); -+ break; - -- if (*len < sizeof(get)) { -- duprintf("get_entries: %u < %u\n", *len, sizeof(get)); -- ret = -EINVAL; -- } else if (copy_from_user(&get, user, sizeof(get)) != 0) { -- ret = -EFAULT; -- } else if (*len != sizeof(struct ipt_get_entries) + get.size) { -- duprintf("get_entries: %u != %u\n", *len, -- sizeof(struct ipt_get_entries) + get.size); -- ret = -EINVAL; -- } else -- ret = get_entries(&get, user); -+ case IPT_SO_GET_ENTRIES: -+ ret = get_entries(user, len); - break; -- } - - default: - duprintf("do_ipt_get_ctl: unknown request %i\n", cmd); -@@ -1325,7 +2434,7 @@ ipt_register_target(struct ipt_target *t - if (ret != 0) - return ret; - -- if (!list_named_insert(&ipt_target, target)) { -+ if (!list_named_insert(&ve_ipt_target, target)) { - duprintf("ipt_register_target: `%s' already in list!\n", - target->name); - ret = -EINVAL; -@@ -1334,12 +2443,60 @@ ipt_register_target(struct ipt_target *t - return ret; - } - -+int 
-+visible_ipt_register_target(struct ipt_target *target) -+{ -+ int ret; -+ struct module *mod = target->me; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct ipt_target *tmp; -+ __module_get(mod); -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ipt_target), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, target, sizeof(struct ipt_target)); -+ target = tmp; -+ } -+ -+ ret = ipt_register_target(target); -+ if (ret) -+ goto out; -+ -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) { -+ kfree(target); -+nomem: -+ module_put(mod); -+ } -+ return ret; -+} -+ - void - ipt_unregister_target(struct ipt_target *target) - { - down(&ipt_mutex); -- LIST_DELETE(&ipt_target, target); -+ LIST_DELETE(&ve_ipt_target, target); -+ up(&ipt_mutex); -+} -+ -+void -+visible_ipt_unregister_target(struct ipt_target *target) -+{ -+ down(&ipt_mutex); -+ target = list_named_find(&ve_ipt_target, target->name); - up(&ipt_mutex); -+ if (!target) -+ return; -+ -+ ipt_unregister_target(target); -+ -+ if (!ve_is_super(get_exec_env())) { -+ module_put(target->me); -+ kfree(target); -+ } - } - - int -@@ -1351,13 +2508,43 @@ ipt_register_match(struct ipt_match *mat - if (ret != 0) - return ret; - -- if (!list_named_insert(&ipt_match, match)) { -+ if (!list_named_insert(&ve_ipt_match, match)) { - duprintf("ipt_register_match: `%s' already in list!\n", - match->name); - ret = -EINVAL; - } - up(&ipt_mutex); -+ return ret; -+} -+ -+int -+visible_ipt_register_match(struct ipt_match *match) -+{ -+ int ret; -+ struct module *mod = match->me; -+ -+ if (!ve_is_super(get_exec_env())) { -+ struct ipt_match *tmp; -+ __module_get(mod); -+ ret = -ENOMEM; -+ tmp = kmalloc(sizeof(struct ipt_match), GFP_KERNEL); -+ if (!tmp) -+ goto nomem; -+ memcpy(tmp, match, sizeof(struct ipt_match)); -+ match = tmp; -+ } -+ -+ ret = ipt_register_match(match); -+ if (ret) -+ goto out; - -+ return 0; -+out: -+ if (!ve_is_super(get_exec_env())) { -+ kfree(match); -+nomem: -+ module_put(mod); -+ } - return ret; - } - 
-@@ -1365,7 +2552,38 @@ void - ipt_unregister_match(struct ipt_match *match) - { - down(&ipt_mutex); -- LIST_DELETE(&ipt_match, match); -+ LIST_DELETE(&ve_ipt_match, match); -+ up(&ipt_mutex); -+} -+ -+void -+visible_ipt_unregister_match(struct ipt_match *match) -+{ -+ down(&ipt_mutex); -+ match = list_named_find(&ve_ipt_match, match->name); -+ up(&ipt_mutex); -+ if (!match) -+ return; -+ -+ ipt_unregister_match(match); -+ -+ if (!ve_is_super(get_exec_env())) { -+ module_put(match->me); -+ kfree(match); -+ } -+} -+ -+void ipt_flush_table(struct ipt_table *table) -+{ -+ if (table == NULL) -+ return; -+ -+ down(&ipt_mutex); -+ IPT_ENTRY_ITERATE(table->private->entries, table->private->size, -+ cleanup_entry, NULL); -+ if (table->private->number > table->private->initial_entries) -+ module_put(table->me); -+ table->private->size = 0; - up(&ipt_mutex); - } - -@@ -1373,13 +2591,12 @@ int ipt_register_table(struct ipt_table - { - int ret; - struct ipt_table_info *newinfo; -- static struct ipt_table_info bootstrap -- = { 0, 0, 0, { 0 }, { 0 }, { } }; - -- newinfo = vmalloc(sizeof(struct ipt_table_info) -+ ret = -ENOMEM; -+ newinfo = ipt_table_info_alloc(sizeof(struct ipt_table_info) - + SMP_ALIGN(table->table->size) * NR_CPUS); - if (!newinfo) -- return -ENOMEM; -+ goto out; - - memcpy(newinfo->entries, table->table->entries, table->table->size); - -@@ -1388,56 +2605,58 @@ int ipt_register_table(struct ipt_table - table->table->num_entries, - table->table->hook_entry, - table->table->underflow); -- if (ret != 0) { -- vfree(newinfo); -- return ret; -- } -+ if (ret != 0) -+ goto out_free; - - ret = down_interruptible(&ipt_mutex); -- if (ret != 0) { -- vfree(newinfo); -- return ret; -- } -+ if (ret != 0) -+ goto out_free; - - /* Don't autoload: we'd eat our tail... 
*/ -- if (list_named_find(&ipt_tables, table->name)) { -- ret = -EEXIST; -- goto free_unlock; -- } -+ ret = -EEXIST; -+ if (list_named_find(&ve_ipt_tables, table->name)) -+ goto out_free_unlock; - -- /* Simplifies replace_table code. */ -- table->private = &bootstrap; -- if (!replace_table(table, 0, newinfo, &ret)) -- goto free_unlock; -+ table->lock = RW_LOCK_UNLOCKED; -+ ret = setup_table(table, newinfo); -+ if (ret) -+ goto out_free_unlock; - - duprintf("table->private->number = %u\n", - table->private->number); -- -+ - /* save number of initial entries */ - table->private->initial_entries = table->private->number; - -- table->lock = RW_LOCK_UNLOCKED; -- list_prepend(&ipt_tables, table); -+ list_prepend(&ve_ipt_tables, table); - -- unlock: - up(&ipt_mutex); -- return ret; -+ return 0; - -- free_unlock: -- vfree(newinfo); -- goto unlock; -+out_free_unlock: -+ up(&ipt_mutex); -+out_free: -+ ipt_table_info_free(newinfo); -+out: -+ return ret; - } - - void ipt_unregister_table(struct ipt_table *table) - { - down(&ipt_mutex); -- LIST_DELETE(&ipt_tables, table); -+ LIST_DELETE(&ve_ipt_tables, table); - up(&ipt_mutex); - -+ /* size to uncharge taken from ipt_register_table */ -+#if defined(CONFIG_VE_IPTABLES) && defined(CONFIG_USER_RESOURCE) -+ uncharge_iptables(ipt_table_info_ub(table->private), -+ table->private->number); -+#endif -+ - /* Decrease module usage counts and free resources */ - IPT_ENTRY_ITERATE(table->private->entries, table->private->size, - cleanup_entry, NULL); -- vfree(table->private); -+ ipt_table_info_free(table->private); - } - - /* Returns 1 if the port is matched by the range, 0 otherwise */ -@@ -1604,8 +2823,8 @@ udp_checkentry(const char *tablename, - return 0; - } - if (matchinfosize != IPT_ALIGN(sizeof(struct ipt_udp))) { -- duprintf("ipt_udp: matchsize %u != %u\n", -- matchinfosize, IPT_ALIGN(sizeof(struct ipt_udp))); -+ duprintf("ipt_udp: matchsize %u != %u\n", matchinfosize, -+ (unsigned int)IPT_ALIGN(sizeof(struct ipt_udp))); - return 
0; - } - if (udpinfo->invflags & ~IPT_UDP_INV_MASK) { -@@ -1677,6 +2896,9 @@ icmp_checkentry(const char *tablename, - /* The built-in targets: standard (NULL) and error. */ - static struct ipt_target ipt_standard_target = { - .name = IPT_STANDARD_TARGET, -+#ifdef CONFIG_COMPAT -+ .compat = &compat_ipt_standard_fn, -+#endif - }; - - static struct ipt_target ipt_error_target = { -@@ -1698,18 +2920,27 @@ static struct ipt_match tcp_matchstruct - .name = "tcp", - .match = &tcp_match, - .checkentry = &tcp_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &tcp_compat, -+#endif - }; - - static struct ipt_match udp_matchstruct = { - .name = "udp", - .match = &udp_match, - .checkentry = &udp_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &udp_compat, -+#endif - }; - - static struct ipt_match icmp_matchstruct = { - .name = "icmp", - .match = &icmp_match, - .checkentry = &icmp_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &icmp_compat, -+#endif - }; - - #ifdef CONFIG_PROC_FS -@@ -1735,7 +2966,7 @@ static inline int print_target(const str - off_t start_offset, char *buffer, int length, - off_t *pos, unsigned int *count) - { -- if (t == &ipt_standard_target || t == &ipt_error_target) -+ if (t == &ve_ipt_standard_target || t == &ve_ipt_error_target) - return 0; - return print_name((char *)t, start_offset, buffer, length, pos, count); - } -@@ -1745,10 +2976,16 @@ static int ipt_get_tables(char *buffer, - off_t pos = 0; - unsigned int count = 0; - -+#ifdef CONFIG_VE_IPTABLES -+ /* if we don't initialized for current VE exiting */ -+ if (&ve_ipt_standard_target == NULL) -+ return 0; -+#endif -+ - if (down_interruptible(&ipt_mutex) != 0) - return 0; - -- LIST_FIND(&ipt_tables, print_name, void *, -+ LIST_FIND(&ve_ipt_tables, print_name, void *, - offset, buffer, length, &pos, &count); - - up(&ipt_mutex); -@@ -1763,10 +3000,15 @@ static int ipt_get_targets(char *buffer, - off_t pos = 0; - unsigned int count = 0; - -+#ifdef CONFIG_VE_IPTABLES -+ /* if we don't initialized for current 
VE exiting */ -+ if (&ve_ipt_standard_target == NULL) -+ return 0; -+#endif - if (down_interruptible(&ipt_mutex) != 0) - return 0; - -- LIST_FIND(&ipt_target, print_target, struct ipt_target *, -+ LIST_FIND(&ve_ipt_target, print_target, struct ipt_target *, - offset, buffer, length, &pos, &count); - - up(&ipt_mutex); -@@ -1780,10 +3022,15 @@ static int ipt_get_matches(char *buffer, - off_t pos = 0; - unsigned int count = 0; - -+#ifdef CONFIG_VE_IPTABLES -+ /* if we don't initialized for current VE exiting */ -+ if (&ve_ipt_standard_target == NULL) -+ return 0; -+#endif - if (down_interruptible(&ipt_mutex) != 0) - return 0; - -- LIST_FIND(&ipt_match, print_name, void *, -+ LIST_FIND(&ve_ipt_match, print_name, void *, - offset, buffer, length, &pos, &count); - - up(&ipt_mutex); -@@ -1799,6 +3046,7 @@ static struct { char *name; get_info_t * - { NULL, NULL} }; - #endif /*CONFIG_PROC_FS*/ - -+void fini_iptables(void); - static int __init init(void) - { - int ret; -@@ -1839,11 +3087,132 @@ static int __init init(void) - #endif - - printk("ip_tables: (C) 2000-2002 Netfilter core team\n"); -+ -+#if defined(CONFIG_VE_IPTABLES) -+ /* init ve0 */ -+ ret = init_iptables(); -+ if (ret == 0) { -+ KSYMRESOLVE(init_iptables); -+ KSYMRESOLVE(fini_iptables); -+ KSYMRESOLVE(ipt_flush_table); -+ KSYMMODRESOLVE(ip_tables); -+ } -+#else -+ ret = 0; -+#endif -+ return ret; -+} -+ -+#ifdef CONFIG_VE_IPTABLES -+/* alloc helper */ -+#define ALLOC_ENVF(field,label) \ -+ if ( !(envid->field = kmalloc(sizeof(*(envid->field)), GFP_KERNEL)) ) \ -+ goto label; -+int init_iptables(void) -+{ -+ struct ve_struct *envid; -+ -+ envid = get_exec_env(); -+ -+ if (ve_is_super(envid)) { -+ envid->_ipt_target = &ipt_target; -+ envid->_ipt_match = &ipt_match; -+ envid->_ipt_tables = &ipt_tables; -+ -+ envid->_ipt_standard_target = &ipt_standard_target; -+ envid->_ipt_error_target = &ipt_error_target; -+ envid->_tcp_matchstruct = &tcp_matchstruct; -+ envid->_udp_matchstruct = &udp_matchstruct; -+ 
envid->_icmp_matchstruct = &icmp_matchstruct; -+ } else { -+ /* allocate structures in ve_struct */ -+ ALLOC_ENVF(_ipt_target,nomem0); -+ ALLOC_ENVF(_ipt_match,nomem1); -+ ALLOC_ENVF(_ipt_tables,nomem2); -+ ALLOC_ENVF(_ipt_standard_target,nomem3); -+ ALLOC_ENVF(_ipt_error_target,nomem4); -+ ALLOC_ENVF(_tcp_matchstruct,nomem5); -+ ALLOC_ENVF(_udp_matchstruct,nomem6); -+ ALLOC_ENVF(_icmp_matchstruct,nomem7); -+ -+ /* FIXME: charge ubc */ -+ INIT_LIST_HEAD(envid->_ipt_target); -+ INIT_LIST_HEAD(envid->_ipt_match); -+ INIT_LIST_HEAD(envid->_ipt_tables); -+ -+ memcpy(envid->_ipt_standard_target, &ipt_standard_target, -+ sizeof(ipt_standard_target)); -+ memcpy(envid->_ipt_error_target, &ipt_error_target, -+ sizeof(ipt_error_target)); -+ memcpy(envid->_tcp_matchstruct, &tcp_matchstruct, -+ sizeof(tcp_matchstruct)); -+ memcpy(envid->_udp_matchstruct, &udp_matchstruct, -+ sizeof(udp_matchstruct)); -+ memcpy(envid->_icmp_matchstruct, &icmp_matchstruct, -+ sizeof(icmp_matchstruct)); -+ -+ down(&ipt_mutex); -+ list_append(envid->_ipt_target, envid->_ipt_standard_target); -+ list_append(envid->_ipt_target, envid->_ipt_error_target); -+ list_append(envid->_ipt_match, envid->_tcp_matchstruct); -+ list_append(envid->_ipt_match, envid->_udp_matchstruct); -+ list_append(envid->_ipt_match, envid->_icmp_matchstruct); -+ up(&ipt_mutex); -+ } -+ - return 0; -+ -+nomem7: -+ kfree(envid->_udp_matchstruct); envid->_udp_matchstruct = NULL; -+nomem6: -+ kfree(envid->_tcp_matchstruct); envid->_tcp_matchstruct = NULL; -+nomem5: -+ kfree(envid->_ipt_error_target); envid->_ipt_error_target = NULL; -+nomem4: -+ kfree(envid->_ipt_standard_target); envid->_ipt_standard_target = NULL; -+nomem3: -+ kfree(envid->_ipt_tables); envid->_ipt_tables = NULL; -+nomem2: -+ kfree(envid->_ipt_match); envid->_ipt_match = NULL; -+nomem1: -+ kfree(envid->_ipt_target); envid->_ipt_target = NULL; -+nomem0: -+ return -ENOMEM; -+} -+ -+void fini_iptables(void) -+{ -+ /* some cleanup */ -+ struct ve_struct *envid = 
get_exec_env(); -+ -+ if (envid->_ipt_tables != NULL && !ve_is_super(envid)) { -+ kfree(envid->_ipt_tables); -+ kfree(envid->_ipt_target); -+ kfree(envid->_ipt_match); -+ kfree(envid->_ipt_standard_target); -+ kfree(envid->_ipt_error_target); -+ kfree(envid->_tcp_matchstruct); -+ kfree(envid->_udp_matchstruct); -+ kfree(envid->_icmp_matchstruct); -+ } -+ -+ envid->_ipt_tables = NULL; -+ envid->_ipt_target = NULL; -+ envid->_ipt_match = NULL; -+ envid->_ipt_standard_target = NULL; -+ envid->_ipt_error_target = NULL; -+ envid->_tcp_matchstruct = NULL; -+ envid->_udp_matchstruct = NULL; -+ envid->_icmp_matchstruct = NULL; - } -+#endif - - static void __exit fini(void) - { -+ KSYMMODUNRESOLVE(ip_tables); -+ KSYMUNRESOLVE(init_iptables); -+ KSYMUNRESOLVE(fini_iptables); -+ KSYMUNRESOLVE(ipt_flush_table); - nf_unregister_sockopt(&ipt_sockopts); - #ifdef CONFIG_PROC_FS - { -@@ -1852,16 +3221,28 @@ static void __exit fini(void) - proc_net_remove(ipt_proc_entry[i].name); - } - #endif -+#ifdef CONFIG_VE_IPTABLES -+ fini_iptables(); -+#endif - } - -+EXPORT_SYMBOL(ipt_flush_table); - EXPORT_SYMBOL(ipt_register_table); - EXPORT_SYMBOL(ipt_unregister_table); - EXPORT_SYMBOL(ipt_register_match); - EXPORT_SYMBOL(ipt_unregister_match); - EXPORT_SYMBOL(ipt_do_table); -+EXPORT_SYMBOL(visible_ipt_register_match); -+EXPORT_SYMBOL(visible_ipt_unregister_match); - EXPORT_SYMBOL(ipt_register_target); - EXPORT_SYMBOL(ipt_unregister_target); -+EXPORT_SYMBOL(visible_ipt_register_target); -+EXPORT_SYMBOL(visible_ipt_unregister_target); - EXPORT_SYMBOL(ipt_find_target_lock); -+#ifdef CONFIG_COMPAT -+EXPORT_SYMBOL(ipt_match_align_compat); -+EXPORT_SYMBOL(ipt_target_align_compat); -+#endif - --module_init(init); -+subsys_initcall(init); - module_exit(fini); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_CLASSIFY.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_CLASSIFY.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_CLASSIFY.c 2004-08-14 14:54:46.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_CLASSIFY.c 2006-05-11 13:05:42.000000000 +0400 -@@ -48,7 +48,8 @@ checkentry(const char *tablename, - unsigned int hook_mask) - { - if (targinfosize != IPT_ALIGN(sizeof(struct ipt_classify_target_info))){ -- printk(KERN_ERR "CLASSIFY: invalid size (%u != %Zu).\n", -+ ve_printk(VE_LOG, KERN_ERR -+ "CLASSIFY: invalid size (%u != %Zu).\n", - targinfosize, - IPT_ALIGN(sizeof(struct ipt_classify_target_info))); - return 0; -@@ -56,13 +57,14 @@ checkentry(const char *tablename, - - if (hook_mask & ~((1 << NF_IP_LOCAL_OUT) | (1 << NF_IP_FORWARD) | - (1 << NF_IP_POST_ROUTING))) { -- printk(KERN_ERR "CLASSIFY: only valid in LOCAL_OUT, FORWARD " -+ ve_printk(VE_LOG, KERN_ERR -+ "CLASSIFY: only valid in LOCAL_OUT, FORWARD " - "and POST_ROUTING.\n"); - return 0; - } - - if (strcmp(tablename, "mangle") != 0) { -- printk(KERN_ERR "CLASSIFY: can only be called from " -+ ve_printk(VE_LOG, KERN_ERR "CLASSIFY: can only be called from " - "\"mangle\" table, not \"%s\".\n", - tablename); - return 0; -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_LOG.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_LOG.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_LOG.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_LOG.c 2006-05-11 13:05:49.000000000 +0400 -@@ -18,6 +18,7 @@ - #include <net/udp.h> - #include <net/tcp.h> - #include <net/route.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter.h> - #include <linux/netfilter_ipv4/ip_tables.h> -@@ -48,32 +49,32 @@ static void dump_packet(const struct ipt - struct iphdr iph; - - if (skb_copy_bits(skb, iphoff, &iph, sizeof(iph)) < 0) { -- printk("TRUNCATED"); -+ ve_printk(VE_LOG, "TRUNCATED"); - return; - } - - /* Important fields: - * TOS, len, DF/MF, fragment offset, TTL, src, dst, options. 
*/ - /* Max length: 40 "SRC=255.255.255.255 DST=255.255.255.255 " */ -- printk("SRC=%u.%u.%u.%u DST=%u.%u.%u.%u ", -+ ve_printk(VE_LOG, "SRC=%u.%u.%u.%u DST=%u.%u.%u.%u ", - NIPQUAD(iph.saddr), NIPQUAD(iph.daddr)); - - /* Max length: 46 "LEN=65535 TOS=0xFF PREC=0xFF TTL=255 ID=65535 " */ -- printk("LEN=%u TOS=0x%02X PREC=0x%02X TTL=%u ID=%u ", -+ ve_printk(VE_LOG, "LEN=%u TOS=0x%02X PREC=0x%02X TTL=%u ID=%u ", - ntohs(iph.tot_len), iph.tos & IPTOS_TOS_MASK, - iph.tos & IPTOS_PREC_MASK, iph.ttl, ntohs(iph.id)); - - /* Max length: 6 "CE DF MF " */ - if (ntohs(iph.frag_off) & IP_CE) -- printk("CE "); -+ ve_printk(VE_LOG, "CE "); - if (ntohs(iph.frag_off) & IP_DF) -- printk("DF "); -+ ve_printk(VE_LOG, "DF "); - if (ntohs(iph.frag_off) & IP_MF) -- printk("MF "); -+ ve_printk(VE_LOG, "MF "); - - /* Max length: 11 "FRAG:65535 " */ - if (ntohs(iph.frag_off) & IP_OFFSET) -- printk("FRAG:%u ", ntohs(iph.frag_off) & IP_OFFSET); -+ ve_printk(VE_LOG, "FRAG:%u ", ntohs(iph.frag_off) & IP_OFFSET); - - if ((info->logflags & IPT_LOG_IPOPT) - && iph.ihl * 4 > sizeof(struct iphdr)) { -@@ -82,15 +83,15 @@ static void dump_packet(const struct ipt - - optsize = iph.ihl * 4 - sizeof(struct iphdr); - if (skb_copy_bits(skb, iphoff+sizeof(iph), opt, optsize) < 0) { -- printk("TRUNCATED"); -+ ve_printk(VE_LOG, "TRUNCATED"); - return; - } - - /* Max length: 127 "OPT (" 15*4*2chars ") " */ -- printk("OPT ("); -+ ve_printk(VE_LOG, "OPT ("); - for (i = 0; i < optsize; i++) -- printk("%02X", opt[i]); -- printk(") "); -+ ve_printk(VE_LOG, "%02X", opt[i]); -+ ve_printk(VE_LOG, ") "); - } - - switch (iph.protocol) { -@@ -98,7 +99,7 @@ static void dump_packet(const struct ipt - struct tcphdr tcph; - - /* Max length: 10 "PROTO=TCP " */ -- printk("PROTO=TCP "); -+ ve_printk(VE_LOG, "PROTO=TCP "); - - if (ntohs(iph.frag_off) & IP_OFFSET) - break; -@@ -106,41 +107,41 @@ static void dump_packet(const struct ipt - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (skb_copy_bits(skb, iphoff+iph.ihl*4, 
&tcph, sizeof(tcph)) - < 0) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } - - /* Max length: 20 "SPT=65535 DPT=65535 " */ -- printk("SPT=%u DPT=%u ", -+ ve_printk(VE_LOG, "SPT=%u DPT=%u ", - ntohs(tcph.source), ntohs(tcph.dest)); - /* Max length: 30 "SEQ=4294967295 ACK=4294967295 " */ - if (info->logflags & IPT_LOG_TCPSEQ) -- printk("SEQ=%u ACK=%u ", -+ ve_printk(VE_LOG, "SEQ=%u ACK=%u ", - ntohl(tcph.seq), ntohl(tcph.ack_seq)); - /* Max length: 13 "WINDOW=65535 " */ -- printk("WINDOW=%u ", ntohs(tcph.window)); -+ ve_printk(VE_LOG, "WINDOW=%u ", ntohs(tcph.window)); - /* Max length: 9 "RES=0x3F " */ -- printk("RES=0x%02x ", (u8)(ntohl(tcp_flag_word(&tcph) & TCP_RESERVED_BITS) >> 22)); -+ ve_printk(VE_LOG, "RES=0x%02x ", (u8)(ntohl(tcp_flag_word(&tcph) & TCP_RESERVED_BITS) >> 22)); - /* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */ - if (tcph.cwr) -- printk("CWR "); -+ ve_printk(VE_LOG, "CWR "); - if (tcph.ece) -- printk("ECE "); -+ ve_printk(VE_LOG, "ECE "); - if (tcph.urg) -- printk("URG "); -+ ve_printk(VE_LOG, "URG "); - if (tcph.ack) -- printk("ACK "); -+ ve_printk(VE_LOG, "ACK "); - if (tcph.psh) -- printk("PSH "); -+ ve_printk(VE_LOG, "PSH "); - if (tcph.rst) -- printk("RST "); -+ ve_printk(VE_LOG, "RST "); - if (tcph.syn) -- printk("SYN "); -+ ve_printk(VE_LOG, "SYN "); - if (tcph.fin) -- printk("FIN "); -+ ve_printk(VE_LOG, "FIN "); - /* Max length: 11 "URGP=65535 " */ -- printk("URGP=%u ", ntohs(tcph.urg_ptr)); -+ ve_printk(VE_LOG, "URGP=%u ", ntohs(tcph.urg_ptr)); - - if ((info->logflags & IPT_LOG_TCPOPT) - && tcph.doff * 4 > sizeof(struct tcphdr)) { -@@ -150,15 +151,15 @@ static void dump_packet(const struct ipt - optsize = tcph.doff * 4 - sizeof(struct tcphdr); - if (skb_copy_bits(skb, iphoff+iph.ihl*4 + sizeof(tcph), - opt, optsize) < 0) { -- printk("TRUNCATED"); -+ ve_printk(VE_LOG, "TRUNCATED"); - return; - } - - /* Max length: 127 "OPT (" 15*4*2chars ") " */ 
-- printk("OPT ("); -+ ve_printk(VE_LOG, "OPT ("); - for (i = 0; i < optsize; i++) -- printk("%02X", opt[i]); -- printk(") "); -+ ve_printk(VE_LOG, "%02X", opt[i]); -+ ve_printk(VE_LOG, ") "); - } - break; - } -@@ -166,7 +167,7 @@ static void dump_packet(const struct ipt - struct udphdr udph; - - /* Max length: 10 "PROTO=UDP " */ -- printk("PROTO=UDP "); -+ ve_printk(VE_LOG, "PROTO=UDP "); - - if (ntohs(iph.frag_off) & IP_OFFSET) - break; -@@ -174,13 +175,13 @@ static void dump_packet(const struct ipt - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (skb_copy_bits(skb, iphoff+iph.ihl*4, &udph, sizeof(udph)) - < 0) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } - - /* Max length: 20 "SPT=65535 DPT=65535 " */ -- printk("SPT=%u DPT=%u LEN=%u ", -+ ve_printk(VE_LOG, "SPT=%u DPT=%u LEN=%u ", - ntohs(udph.source), ntohs(udph.dest), - ntohs(udph.len)); - break; -@@ -206,7 +207,7 @@ static void dump_packet(const struct ipt - [ICMP_ADDRESSREPLY] = 12 }; - - /* Max length: 11 "PROTO=ICMP " */ -- printk("PROTO=ICMP "); -+ ve_printk(VE_LOG, "PROTO=ICMP "); - - if (ntohs(iph.frag_off) & IP_OFFSET) - break; -@@ -214,19 +215,19 @@ static void dump_packet(const struct ipt - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (skb_copy_bits(skb, iphoff+iph.ihl*4, &icmph, sizeof(icmph)) - < 0) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } - - /* Max length: 18 "TYPE=255 CODE=255 " */ -- printk("TYPE=%u CODE=%u ", icmph.type, icmph.code); -+ ve_printk(VE_LOG, "TYPE=%u CODE=%u ", icmph.type, icmph.code); - - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (icmph.type <= NR_ICMP_TYPES - && required_len[icmph.type] - && skb->len-iphoff-iph.ihl*4 < required_len[icmph.type]) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } 
-@@ -235,19 +236,19 @@ static void dump_packet(const struct ipt - case ICMP_ECHOREPLY: - case ICMP_ECHO: - /* Max length: 19 "ID=65535 SEQ=65535 " */ -- printk("ID=%u SEQ=%u ", -+ ve_printk(VE_LOG, "ID=%u SEQ=%u ", - ntohs(icmph.un.echo.id), - ntohs(icmph.un.echo.sequence)); - break; - - case ICMP_PARAMETERPROB: - /* Max length: 14 "PARAMETER=255 " */ -- printk("PARAMETER=%u ", -+ ve_printk(VE_LOG, "PARAMETER=%u ", - ntohl(icmph.un.gateway) >> 24); - break; - case ICMP_REDIRECT: - /* Max length: 24 "GATEWAY=255.255.255.255 " */ -- printk("GATEWAY=%u.%u.%u.%u ", -+ ve_printk(VE_LOG, "GATEWAY=%u.%u.%u.%u ", - NIPQUAD(icmph.un.gateway)); - /* Fall through */ - case ICMP_DEST_UNREACH: -@@ -255,16 +256,16 @@ static void dump_packet(const struct ipt - case ICMP_TIME_EXCEEDED: - /* Max length: 3+maxlen */ - if (!iphoff) { /* Only recurse once. */ -- printk("["); -+ ve_printk(VE_LOG, "["); - dump_packet(info, skb, - iphoff + iph.ihl*4+sizeof(icmph)); -- printk("] "); -+ ve_printk(VE_LOG, "] "); - } - - /* Max length: 10 "MTU=65535 " */ - if (icmph.type == ICMP_DEST_UNREACH - && icmph.code == ICMP_FRAG_NEEDED) -- printk("MTU=%u ", ntohs(icmph.un.frag.mtu)); -+ ve_printk(VE_LOG, "MTU=%u ", ntohs(icmph.un.frag.mtu)); - } - break; - } -@@ -276,24 +277,24 @@ static void dump_packet(const struct ipt - break; - - /* Max length: 9 "PROTO=AH " */ -- printk("PROTO=AH "); -+ ve_printk(VE_LOG, "PROTO=AH "); - - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (skb_copy_bits(skb, iphoff+iph.ihl*4, &ah, sizeof(ah)) < 0) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } - - /* Length: 15 "SPI=0xF1234567 " */ -- printk("SPI=0x%x ", ntohl(ah.spi)); -+ ve_printk(VE_LOG, "SPI=0x%x ", ntohl(ah.spi)); - break; - } - case IPPROTO_ESP: { - struct ip_esp_hdr esph; - - /* Max length: 10 "PROTO=ESP " */ -- printk("PROTO=ESP "); -+ ve_printk(VE_LOG, "PROTO=ESP "); - - if (ntohs(iph.frag_off) & IP_OFFSET) - break; 
-@@ -301,18 +302,18 @@ static void dump_packet(const struct ipt - /* Max length: 25 "INCOMPLETE [65535 bytes] " */ - if (skb_copy_bits(skb, iphoff+iph.ihl*4, &esph, sizeof(esph)) - < 0) { -- printk("INCOMPLETE [%u bytes] ", -+ ve_printk(VE_LOG, "INCOMPLETE [%u bytes] ", - skb->len - iphoff - iph.ihl*4); - break; - } - - /* Length: 15 "SPI=0xF1234567 " */ -- printk("SPI=0x%x ", ntohl(esph.spi)); -+ ve_printk(VE_LOG, "SPI=0x%x ", ntohl(esph.spi)); - break; - } - /* Max length: 10 "PROTO 255 " */ - default: -- printk("PROTO=%u ", iph.protocol); -+ ve_printk(VE_LOG, "PROTO=%u ", iph.protocol); - } - - /* Proto Max log string length */ -@@ -339,8 +340,8 @@ ipt_log_packet(unsigned int hooknum, - const char *prefix) - { - spin_lock_bh(&log_lock); -- printk(level_string); -- printk("%sIN=%s OUT=%s ", -+ ve_printk(VE_LOG, level_string); -+ ve_printk(VE_LOG, "%sIN=%s OUT=%s ", - prefix == NULL ? loginfo->prefix : prefix, - in ? in->name : "", - out ? out->name : ""); -@@ -350,29 +351,29 @@ ipt_log_packet(unsigned int hooknum, - struct net_device *physoutdev = skb->nf_bridge->physoutdev; - - if (physindev && in != physindev) -- printk("PHYSIN=%s ", physindev->name); -+ ve_printk(VE_LOG, "PHYSIN=%s ", physindev->name); - if (physoutdev && out != physoutdev) -- printk("PHYSOUT=%s ", physoutdev->name); -+ ve_printk(VE_LOG, "PHYSOUT=%s ", physoutdev->name); - } - #endif - - if (in && !out) { - /* MAC logging for input chain only. */ -- printk("MAC="); -+ ve_printk(VE_LOG, "MAC="); - if (skb->dev && skb->dev->hard_header_len - && skb->mac.raw != (void*)skb->nh.iph) { - int i; - unsigned char *p = skb->mac.raw; - for (i = 0; i < skb->dev->hard_header_len; i++,p++) -- printk("%02x%c", *p, -+ ve_printk(VE_LOG, "%02x%c", *p, - i==skb->dev->hard_header_len - 1 - ? 
' ':':'); - } else -- printk(" "); -+ ve_printk(VE_LOG, " "); - } - - dump_packet(loginfo, skb, 0); -- printk("\n"); -+ ve_printk(VE_LOG, "\n"); - spin_unlock_bh(&log_lock); - } - -@@ -437,28 +438,62 @@ static int ipt_log_checkentry(const char - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int ipt_log_compat(void *target, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_log_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_log_info)); -+ return ipt_target_align_compat(target, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_target ipt_log_reg = { - .name = "LOG", - .target = ipt_log_target, - .checkentry = ipt_log_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = ipt_log_compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_LOG(void) -+{ -+ return visible_ipt_register_target(&ipt_log_reg); -+} -+ -+void fini_iptable_LOG(void) -+{ -+ visible_ipt_unregister_target(&ipt_log_reg); -+} -+ - static int __init init(void) - { -- if (ipt_register_target(&ipt_log_reg)) -- return -EINVAL; -+ int err; -+ -+ err = init_iptable_LOG(); -+ if (err < 0) -+ return err; - if (nflog) - nf_log_register(PF_INET, &ipt_logfn); -- -+ -+ KSYMRESOLVE(init_iptable_LOG); -+ KSYMRESOLVE(fini_iptable_LOG); -+ KSYMMODRESOLVE(ipt_LOG); - return 0; - } - - static void __exit fini(void) - { -+ KSYMMODUNRESOLVE(ipt_LOG); -+ KSYMUNRESOLVE(init_iptable_LOG); -+ KSYMUNRESOLVE(fini_iptable_LOG); - if (nflog) - nf_log_unregister(PF_INET, &ipt_logfn); -- ipt_unregister_target(&ipt_log_reg); -+ fini_iptable_LOG(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_MARK.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_MARK.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_MARK.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_MARK.c 2006-05-11 13:05:42.000000000 +0400 -@@ -44,14 +44,15 @@ checkentry(const char *tablename, - unsigned int hook_mask) - { - if 
(targinfosize != IPT_ALIGN(sizeof(struct ipt_mark_target_info))) { -- printk(KERN_WARNING "MARK: targinfosize %u != %Zu\n", -+ ve_printk(VE_LOG, KERN_WARNING "MARK: targinfosize %u != %Zu\n", - targinfosize, - IPT_ALIGN(sizeof(struct ipt_mark_target_info))); - return 0; - } - - if (strcmp(tablename, "mangle") != 0) { -- printk(KERN_WARNING "MARK: can only be called from \"mangle\" table, not \"%s\"\n", tablename); -+ ve_printk(VE_LOG, KERN_WARNING "MARK: can only be called from " -+ "\"mangle\" table, not \"%s\"\n", tablename); - return 0; - } - -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_MASQUERADE.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_MASQUERADE.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_MASQUERADE.c 2004-08-14 14:55:34.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_MASQUERADE.c 2006-05-11 13:05:42.000000000 +0400 -@@ -140,6 +140,7 @@ masquerade_target(struct sk_buff **pskb, - return ip_nat_setup_info(ct, &newrange, hooknum); - } - -+#if 0 - static inline int - device_cmp(const struct ip_conntrack *i, void *_ina) - { -@@ -173,6 +174,7 @@ static int masq_inet_event(struct notifi - static struct notifier_block masq_inet_notifier = { - .notifier_call = masq_inet_event, - }; -+#endif - - static struct ipt_target masquerade = { - .name = "MASQUERADE", -@@ -187,9 +189,13 @@ static int __init init(void) - - ret = ipt_register_target(&masquerade); - -+#if 0 -+/* This notifier is unnecessary and may -+ lead to oops in virtual environments */ - if (ret == 0) - /* Register IP address change reports */ - register_inetaddr_notifier(&masq_inet_notifier); -+#endif - - return ret; - } -@@ -197,7 +203,7 @@ static int __init init(void) - static void __exit fini(void) - { - ipt_unregister_target(&masquerade); -- unregister_inetaddr_notifier(&masq_inet_notifier); -+/* unregister_inetaddr_notifier(&masq_inet_notifier); */ - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_REDIRECT.c 
linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_REDIRECT.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_REDIRECT.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_REDIRECT.c 2006-05-11 13:05:42.000000000 +0400 -@@ -17,6 +17,7 @@ - #include <linux/inetdevice.h> - #include <net/protocol.h> - #include <net/checksum.h> -+#include <linux/nfcalls.h> - #include <linux/netfilter_ipv4.h> - #include <linux/netfilter_ipv4/ip_nat_rule.h> - -@@ -25,7 +26,7 @@ MODULE_AUTHOR("Netfilter Core Team <core - MODULE_DESCRIPTION("iptables REDIRECT target module"); - - #if 0 --#define DEBUGP printk -+#define DEBUGP ve_printk - #else - #define DEBUGP(format, args...) - #endif -@@ -115,14 +116,36 @@ static struct ipt_target redirect_reg = - .me = THIS_MODULE, - }; - -+int init_iptable_REDIRECT(void) -+{ -+ return visible_ipt_register_target(&redirect_reg); -+} -+ -+void fini_iptable_REDIRECT(void) -+{ -+ visible_ipt_unregister_target(&redirect_reg); -+} -+ - static int __init init(void) - { -- return ipt_register_target(&redirect_reg); -+ int err; -+ -+ err = init_iptable_REDIRECT(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_REDIRECT); -+ KSYMRESOLVE(fini_iptable_REDIRECT); -+ KSYMMODRESOLVE(ipt_REDIRECT); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_target(&redirect_reg); -+ KSYMMODUNRESOLVE(ipt_REDIRECT); -+ KSYMUNRESOLVE(init_iptable_REDIRECT); -+ KSYMUNRESOLVE(fini_iptable_REDIRECT); -+ fini_iptable_REDIRECT(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_REJECT.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_REJECT.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_REJECT.c 2004-08-14 14:56:00.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_REJECT.c 2006-05-11 13:05:49.000000000 +0400 -@@ -22,6 +22,7 @@ - #include <net/ip.h> - #include <net/tcp.h> - #include <net/route.h> -+#include <linux/nfcalls.h> - #include 
<linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_REJECT.h> - #ifdef CONFIG_BRIDGE_NETFILTER -@@ -440,7 +441,7 @@ static int check(const char *tablename, - } - - if (rejinfo->with == IPT_ICMP_ECHOREPLY) { -- printk("REJECT: ECHOREPLY no longer supported.\n"); -+ ve_printk(VE_LOG, "REJECT: ECHOREPLY no longer supported.\n"); - return 0; - } else if (rejinfo->with == IPT_TCP_RESET) { - /* Must specify that it's a TCP packet */ -@@ -454,21 +455,58 @@ static int check(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *target, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_reject_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_reject_info)); -+ return ipt_target_align_compat(target, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_target ipt_reject_reg = { - .name = "REJECT", - .target = reject, - .checkentry = check, -+#ifdef CONFIG_COMPAT -+ .compat = compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_REJECT(void) -+{ -+ return visible_ipt_register_target(&ipt_reject_reg); -+} -+ -+void fini_iptable_REJECT(void) -+{ -+ visible_ipt_unregister_target(&ipt_reject_reg); -+} -+ - static int __init init(void) - { -- return ipt_register_target(&ipt_reject_reg); -+ int err; -+ -+ err = init_iptable_REJECT(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_REJECT); -+ KSYMRESOLVE(fini_iptable_REJECT); -+ KSYMMODRESOLVE(ipt_REJECT); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_target(&ipt_reject_reg); -+ KSYMMODUNRESOLVE(ipt_REJECT); -+ KSYMUNRESOLVE(init_iptable_REJECT); -+ KSYMUNRESOLVE(fini_iptable_REJECT); -+ fini_iptable_REJECT(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_TCPMSS.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_TCPMSS.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_TCPMSS.c 2004-08-14 14:55:32.000000000 +0400 -+++ 
linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_TCPMSS.c 2006-05-11 13:05:49.000000000 +0400 -@@ -13,6 +13,7 @@ - - #include <linux/ip.h> - #include <net/tcp.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_TCPMSS.h> -@@ -228,7 +229,8 @@ ipt_tcpmss_checkentry(const char *tablen - ((hook_mask & ~((1 << NF_IP_FORWARD) - | (1 << NF_IP_LOCAL_OUT) - | (1 << NF_IP_POST_ROUTING))) != 0)) { -- printk("TCPMSS: path-MTU clamping only supported in FORWARD, OUTPUT and POSTROUTING hooks\n"); -+ ve_printk(VE_LOG, "TCPMSS: path-MTU clamping only supported in " -+ "FORWARD, OUTPUT and POSTROUTING hooks\n"); - return 0; - } - -@@ -237,25 +239,62 @@ ipt_tcpmss_checkentry(const char *tablen - && IPT_MATCH_ITERATE(e, find_syn_match)) - return 1; - -- printk("TCPMSS: Only works on TCP SYN packets\n"); -+ ve_printk(VE_LOG, "TCPMSS: Only works on TCP SYN packets\n"); - return 0; - } - -+#ifdef CONFIG_COMPAT -+static int ipt_tcpmss_compat(void *target, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_tcpmss_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_tcpmss_info)); -+ return ipt_target_align_compat(target, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_target ipt_tcpmss_reg = { - .name = "TCPMSS", - .target = ipt_tcpmss_target, - .checkentry = ipt_tcpmss_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = ipt_tcpmss_compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_TCPMSS(void) -+{ -+ return visible_ipt_register_target(&ipt_tcpmss_reg); -+} -+ -+void fini_iptable_TCPMSS(void) -+{ -+ visible_ipt_unregister_target(&ipt_tcpmss_reg); -+} -+ - static int __init init(void) - { -- return ipt_register_target(&ipt_tcpmss_reg); -+ int err; -+ -+ err = init_iptable_TCPMSS(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_TCPMSS); -+ KSYMRESOLVE(fini_iptable_TCPMSS); -+ KSYMMODRESOLVE(ipt_TCPMSS); -+ return 0; - } - - static void __exit 
fini(void) - { -- ipt_unregister_target(&ipt_tcpmss_reg); -+ KSYMMODUNRESOLVE(ipt_TCPMSS); -+ KSYMUNRESOLVE(init_iptable_TCPMSS); -+ KSYMUNRESOLVE(fini_iptable_TCPMSS); -+ fini_iptable_TCPMSS(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_TOS.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_TOS.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_TOS.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_TOS.c 2006-05-11 13:05:49.000000000 +0400 -@@ -15,6 +15,7 @@ - - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_TOS.h> -+#include <linux/nfcalls.h> - - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Netfilter Core Team <coreteam@netfilter.org>"); -@@ -61,14 +62,15 @@ checkentry(const char *tablename, - const u_int8_t tos = ((struct ipt_tos_target_info *)targinfo)->tos; - - if (targinfosize != IPT_ALIGN(sizeof(struct ipt_tos_target_info))) { -- printk(KERN_WARNING "TOS: targinfosize %u != %Zu\n", -+ ve_printk(VE_LOG, KERN_WARNING "TOS: targinfosize %u != %Zu\n", - targinfosize, - IPT_ALIGN(sizeof(struct ipt_tos_target_info))); - return 0; - } - - if (strcmp(tablename, "mangle") != 0) { -- printk(KERN_WARNING "TOS: can only be called from \"mangle\" table, not \"%s\"\n", tablename); -+ ve_printk(VE_LOG, KERN_WARNING "TOS: can only be called from " -+ "\"mangle\" table, not \"%s\"\n", tablename); - return 0; - } - -@@ -77,28 +79,65 @@ checkentry(const char *tablename, - && tos != IPTOS_RELIABILITY - && tos != IPTOS_MINCOST - && tos != IPTOS_NORMALSVC) { -- printk(KERN_WARNING "TOS: bad tos value %#x\n", tos); -+ ve_printk(VE_LOG, KERN_WARNING "TOS: bad tos value %#x\n", tos); - return 0; - } - - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *target, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_tos_target_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_tos_target_info)); -+ return 
ipt_target_align_compat(target, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_target ipt_tos_reg = { - .name = "TOS", - .target = target, - .checkentry = checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_TOS(void) -+{ -+ return visible_ipt_register_target(&ipt_tos_reg); -+} -+ -+void fini_iptable_TOS(void) -+{ -+ visible_ipt_unregister_target(&ipt_tos_reg); -+} -+ - static int __init init(void) - { -- return ipt_register_target(&ipt_tos_reg); -+ int err; -+ -+ err = init_iptable_TOS(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_TOS); -+ KSYMRESOLVE(fini_iptable_TOS); -+ KSYMMODRESOLVE(ipt_TOS); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_target(&ipt_tos_reg); -+ KSYMMODUNRESOLVE(ipt_TOS); -+ KSYMUNRESOLVE(init_iptable_TOS); -+ KSYMUNRESOLVE(fini_iptable_TOS); -+ fini_iptable_TOS(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_ULOG.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_ULOG.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_ULOG.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_ULOG.c 2006-05-11 13:05:42.000000000 +0400 -@@ -129,6 +129,9 @@ static void ulog_send(unsigned int nlgro - /* timer function to flush queue in ULOG_FLUSH_INTERVAL time */ - static void ulog_timer(unsigned long data) - { -+#ifdef CONFIG_VE -+#error timer context should be evaluated -+#endif - DEBUGP("ipt_ULOG: timer function called, calling ulog_send\n"); - - /* lock to protect against somebody modifying our structure -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_conntrack.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_conntrack.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_conntrack.c 2004-08-14 14:56:15.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_conntrack.c 2006-05-11 13:05:49.000000000 +0400 -@@ -13,6 +13,7 @@ - #include 
<linux/netfilter_ipv4/ip_conntrack.h> - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_conntrack.h> -+#include <linux/nfcalls.h> - - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Marc Boucher <marc@mbsi.ca>"); -@@ -114,22 +115,146 @@ static int check(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat_to_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct ipt_entry_match *pm; -+ struct ipt_conntrack_info *pinfo; -+ struct compat_ipt_conntrack_info info; -+ u_int16_t msize; -+ -+ pm = (struct ipt_entry_match *)match; -+ msize = pm->u.user.match_size; -+ if (__copy_to_user(*dstptr, pm, sizeof(struct ipt_entry_match))) -+ return -EFAULT; -+ pinfo = (struct ipt_conntrack_info *)pm->data; -+ memset(&info, 0, sizeof(struct compat_ipt_conntrack_info)); -+ info.statemask = pinfo->statemask; -+ info.statusmask = pinfo->statusmask; -+ memcpy(info.tuple, pinfo->tuple, IP_CT_DIR_MAX * -+ sizeof(struct ip_conntrack_tuple)); -+ memcpy(info.sipmsk, pinfo->sipmsk, -+ IP_CT_DIR_MAX * sizeof(struct in_addr)); -+ memcpy(info.dipmsk, pinfo->dipmsk, -+ IP_CT_DIR_MAX * sizeof(struct in_addr)); -+ info.expires_min = pinfo->expires_min; -+ info.expires_max = pinfo->expires_max; -+ info.flags = pinfo->flags; -+ info.invflags = pinfo->invflags; -+ if (__copy_to_user(*dstptr + sizeof(struct ipt_entry_match), -+ &info, sizeof(struct compat_ipt_conntrack_info))) -+ return -EFAULT; -+ msize -= off; -+ if (put_user(msize, (u_int16_t *)*dstptr)) -+ return -EFAULT; -+ *size -= off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat_from_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct compat_ipt_entry_match *pm; -+ struct ipt_entry_match *dstpm; -+ struct compat_ipt_conntrack_info *pinfo; -+ struct ipt_conntrack_info info; -+ u_int16_t msize; -+ -+ pm = (struct compat_ipt_entry_match *)match; -+ dstpm = (struct ipt_entry_match *)*dstptr; -+ msize = pm->u.user.match_size; -+ memcpy(*dstptr, 
pm, sizeof(struct compat_ipt_entry_match)); -+ pinfo = (struct compat_ipt_conntrack_info *)pm->data; -+ memset(&info, 0, sizeof(struct ipt_conntrack_info)); -+ info.statemask = pinfo->statemask; -+ info.statusmask = pinfo->statusmask; -+ memcpy(info.tuple, pinfo->tuple, IP_CT_DIR_MAX * -+ sizeof(struct ip_conntrack_tuple)); -+ memcpy(info.sipmsk, pinfo->sipmsk, -+ IP_CT_DIR_MAX * sizeof(struct in_addr)); -+ memcpy(info.dipmsk, pinfo->dipmsk, -+ IP_CT_DIR_MAX * sizeof(struct in_addr)); -+ info.expires_min = pinfo->expires_min; -+ info.expires_max = pinfo->expires_max; -+ info.flags = pinfo->flags; -+ info.invflags = pinfo->invflags; -+ memcpy(*dstptr + sizeof(struct compat_ipt_entry_match), -+ &info, sizeof(struct ipt_conntrack_info)); -+ msize += off; -+ dstpm->u.user.match_size = msize; -+ *size += off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat(void *match, void **dstptr, int *size, int convert) -+{ -+ int ret, off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_conntrack_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct compat_ipt_conntrack_info)); -+ switch (convert) { -+ case COMPAT_TO_USER: -+ ret = compat_to_user(match, dstptr, size, off); -+ break; -+ case COMPAT_FROM_USER: -+ ret = compat_from_user(match, dstptr, size, off); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ ret = 0; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} -+#endif -+ - static struct ipt_match conntrack_match = { - .name = "conntrack", - .match = &match, - .checkentry = &check, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_conntrack_match(void) -+{ -+ return visible_ipt_register_match(&conntrack_match); -+} -+ -+void fini_iptable_conntrack_match(void) -+{ -+ visible_ipt_unregister_match(&conntrack_match); -+} -+ - static int __init init(void) - { -+ int err; -+ - need_ip_conntrack(); -- return ipt_register_match(&conntrack_match); -+ err = init_iptable_conntrack_match(); -+ if (err < 0) 
-+ return err; -+ -+ KSYMRESOLVE(init_iptable_conntrack_match); -+ KSYMRESOLVE(fini_iptable_conntrack_match); -+ KSYMMODRESOLVE(ipt_conntrack); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&conntrack_match); -+ KSYMMODUNRESOLVE(ipt_conntrack); -+ KSYMUNRESOLVE(init_iptable_conntrack_match); -+ KSYMUNRESOLVE(fini_iptable_conntrack_match); -+ fini_iptable_conntrack_match(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_helper.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_helper.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_helper.c 2004-08-14 14:56:26.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_helper.c 2006-05-11 13:05:49.000000000 +0400 -@@ -18,6 +18,7 @@ - #include <linux/netfilter_ipv4/ip_conntrack_helper.h> - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_helper.h> -+#include <linux/nfcalls.h> - - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Martin Josefsson <gandalf@netfilter.org>"); -@@ -98,21 +99,125 @@ static int check(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat_to_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct ipt_entry_match *pm; -+ struct ipt_helper_info *pinfo; -+ struct compat_ipt_helper_info info; -+ u_int16_t msize; -+ -+ pm = (struct ipt_entry_match *)match; -+ msize = pm->u.user.match_size; -+ if (__copy_to_user(*dstptr, pm, sizeof(struct ipt_entry_match))) -+ return -EFAULT; -+ pinfo = (struct ipt_helper_info *)pm->data; -+ memset(&info, 0, sizeof(struct compat_ipt_helper_info)); -+ info.invert = pinfo->invert; -+ memcpy(info.name, pinfo->name, 30); -+ if (__copy_to_user(*dstptr + sizeof(struct ipt_entry_match), -+ &info, sizeof(struct compat_ipt_helper_info))) -+ return -EFAULT; -+ msize -= off; -+ if (put_user(msize, (u_int16_t *)*dstptr)) -+ return -EFAULT; -+ *size -= off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat_from_user(void 
*match, void **dstptr, -+ int *size, int off) -+{ -+ struct compat_ipt_entry_match *pm; -+ struct ipt_entry_match *dstpm; -+ struct compat_ipt_helper_info *pinfo; -+ struct ipt_helper_info info; -+ u_int16_t msize; -+ -+ pm = (struct compat_ipt_entry_match *)match; -+ dstpm = (struct ipt_entry_match *)*dstptr; -+ msize = pm->u.user.match_size; -+ memcpy(*dstptr, pm, sizeof(struct compat_ipt_entry_match)); -+ pinfo = (struct compat_ipt_helper_info *)pm->data; -+ memset(&info, 0, sizeof(struct ipt_helper_info)); -+ info.invert = pinfo->invert; -+ memcpy(info.name, pinfo->name, 30); -+ memcpy(*dstptr + sizeof(struct compat_ipt_entry_match), -+ &info, sizeof(struct ipt_helper_info)); -+ msize += off; -+ dstpm->u.user.match_size = msize; -+ *size += off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat(void *match, void **dstptr, int *size, int convert) -+{ -+ int ret, off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_helper_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct compat_ipt_helper_info)); -+ switch (convert) { -+ case COMPAT_TO_USER: -+ ret = compat_to_user(match, dstptr, size, off); -+ break; -+ case COMPAT_FROM_USER: -+ ret = compat_from_user(match, dstptr, size, off); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ ret = 0; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} -+#endif -+ - static struct ipt_match helper_match = { - .name = "helper", - .match = &match, - .checkentry = &check, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_helper(void) -+{ -+ return visible_ipt_register_match(&helper_match); -+} -+ -+void fini_iptable_helper(void) -+{ -+ visible_ipt_unregister_match(&helper_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&helper_match); -+ int err; -+ -+ err = init_iptable_helper(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_helper); -+ KSYMRESOLVE(fini_iptable_helper); -+ KSYMMODRESOLVE(ipt_helper); -+ 
return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&helper_match); -+ KSYMMODUNRESOLVE(ipt_helper); -+ KSYMUNRESOLVE(init_iptable_helper); -+ KSYMUNRESOLVE(fini_iptable_helper); -+ fini_iptable_helper(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_length.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_length.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_length.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_length.c 2006-05-11 13:05:49.000000000 +0400 -@@ -8,6 +8,7 @@ - - #include <linux/module.h> - #include <linux/skbuff.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ipt_length.h> - #include <linux/netfilter_ipv4/ip_tables.h> -@@ -43,21 +44,58 @@ checkentry(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_length_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_length_info)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_match length_match = { - .name = "length", - .match = &match, - .checkentry = &checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_length(void) -+{ -+ return visible_ipt_register_match(&length_match); -+} -+ -+void fini_iptable_length(void) -+{ -+ visible_ipt_unregister_match(&length_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&length_match); -+ int err; -+ -+ err = init_iptable_length(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_length); -+ KSYMRESOLVE(fini_iptable_length); -+ KSYMMODRESOLVE(ipt_length); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&length_match); -+ KSYMMODUNRESOLVE(ipt_length); -+ KSYMUNRESOLVE(init_iptable_length); -+ 
KSYMUNRESOLVE(fini_iptable_length); -+ fini_iptable_length(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_limit.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_limit.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_limit.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_limit.c 2006-05-11 13:05:49.000000000 +0400 -@@ -17,6 +17,7 @@ - #include <linux/skbuff.h> - #include <linux/spinlock.h> - #include <linux/interrupt.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_limit.h> -@@ -25,6 +26,13 @@ MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Herve Eychenne <rv@wallfire.org>"); - MODULE_DESCRIPTION("iptables rate limit match"); - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_ipt_limit_reg (*(get_exec_env()->_ipt_limit_reg)) -+#else -+#define ve_ipt_limit_reg ipt_limit_reg -+#endif -+ - /* The algorithm used is the Simple Token Bucket Filter (TBF) - * see net/sched/sch_tbf.c in the linux source tree - */ -@@ -116,7 +124,7 @@ ipt_limit_checkentry(const char *tablena - /* Check for overflow. 
*/ - if (r->burst == 0 - || user2credits(r->avg * r->burst) < user2credits(r->avg)) { -- printk("Overflow in ipt_limit, try lower: %u/%u\n", -+ ve_printk(VE_LOG, "Overflow in ipt_limit, try lower: %u/%u\n", - r->avg, r->burst); - return 0; - } -@@ -134,23 +142,128 @@ ipt_limit_checkentry(const char *tablena - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int ipt_limit_compat_to_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct ipt_entry_match *pm; -+ struct ipt_rateinfo *pinfo; -+ struct compat_ipt_rateinfo rinfo; -+ u_int16_t msize; -+ -+ pm = (struct ipt_entry_match *)match; -+ msize = pm->u.user.match_size; -+ if (__copy_to_user(*dstptr, pm, sizeof(struct ipt_entry_match))) -+ return -EFAULT; -+ pinfo = (struct ipt_rateinfo *)pm->data; -+ memset(&rinfo, 0, sizeof(struct compat_ipt_rateinfo)); -+ rinfo.avg = pinfo->avg; -+ rinfo.burst = pinfo->burst; -+ if (__copy_to_user(*dstptr + sizeof(struct ipt_entry_match), -+ &rinfo, sizeof(struct compat_ipt_rateinfo))) -+ return -EFAULT; -+ msize -= off; -+ if (put_user(msize, (u_int16_t *)*dstptr)) -+ return -EFAULT; -+ *size -= off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int ipt_limit_compat_from_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct compat_ipt_entry_match *pm; -+ struct ipt_entry_match *dstpm; -+ struct compat_ipt_rateinfo *pinfo; -+ struct ipt_rateinfo rinfo; -+ u_int16_t msize; -+ -+ pm = (struct compat_ipt_entry_match *)match; -+ dstpm = (struct ipt_entry_match *)*dstptr; -+ msize = pm->u.user.match_size; -+ memcpy(*dstptr, pm, sizeof(struct compat_ipt_entry_match)); -+ pinfo = (struct compat_ipt_rateinfo *)pm->data; -+ memset(&rinfo, 0, sizeof(struct ipt_rateinfo)); -+ rinfo.avg = pinfo->avg; -+ rinfo.burst = pinfo->burst; -+ memcpy(*dstptr + sizeof(struct compat_ipt_entry_match), -+ &rinfo, sizeof(struct ipt_rateinfo)); -+ msize += off; -+ dstpm->u.user.match_size = msize; -+ *size += off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int 
ipt_limit_compat(void *match, void **dstptr, -+ int *size, int convert) -+{ -+ int ret, off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_rateinfo)) - -+ COMPAT_IPT_ALIGN(sizeof(struct compat_ipt_rateinfo)); -+ switch (convert) { -+ case COMPAT_TO_USER: -+ ret = ipt_limit_compat_to_user(match, -+ dstptr, size, off); -+ break; -+ case COMPAT_FROM_USER: -+ ret = ipt_limit_compat_from_user(match, -+ dstptr, size, off); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ ret = 0; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} -+#endif -+ - static struct ipt_match ipt_limit_reg = { - .name = "limit", - .match = ipt_limit_match, - .checkentry = ipt_limit_checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = ipt_limit_compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_limit(void) -+{ -+ return visible_ipt_register_match(&ipt_limit_reg); -+} -+ -+void fini_iptable_limit(void) -+{ -+ visible_ipt_unregister_match(&ipt_limit_reg); -+} -+ - static int __init init(void) - { -- if (ipt_register_match(&ipt_limit_reg)) -- return -EINVAL; -+ int err; -+ -+ err = init_iptable_limit(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_limit); -+ KSYMRESOLVE(fini_iptable_limit); -+ KSYMMODRESOLVE(ipt_limit); - return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&ipt_limit_reg); -+ KSYMMODUNRESOLVE(ipt_limit); -+ KSYMUNRESOLVE(init_iptable_limit); -+ KSYMUNRESOLVE(fini_iptable_limit); -+ fini_iptable_limit(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_mac.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_mac.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_mac.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_mac.c 2006-05-11 13:05:42.000000000 +0400 -@@ -48,7 +48,8 @@ ipt_mac_checkentry(const char *tablename - if (hook_mask - & ~((1 << NF_IP_PRE_ROUTING) | (1 << NF_IP_LOCAL_IN) - | (1 << NF_IP_FORWARD))) { -- printk("ipt_mac: only 
valid for PRE_ROUTING, LOCAL_IN or FORWARD.\n"); -+ ve_printk(VE_LOG, "ipt_mac: only valid for PRE_ROUTING, " -+ "LOCAL_IN or FORWARD.\n"); - return 0; - } - -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_multiport.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_multiport.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_multiport.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_multiport.c 2006-05-11 13:05:49.000000000 +0400 -@@ -13,6 +13,7 @@ - #include <linux/types.h> - #include <linux/udp.h> - #include <linux/skbuff.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ipt_multiport.h> - #include <linux/netfilter_ipv4/ip_tables.h> -@@ -21,6 +22,13 @@ MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Netfilter Core Team <coreteam@netfilter.org>"); - MODULE_DESCRIPTION("iptables multiple port match module"); - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_multiport_match (*(get_exec_env()->_multiport_match)) -+#else -+#define ve_multiport_match multiport_match -+#endif -+ - #if 0 - #define duprintf(format, args...) 
printk(format , ## args) - #else -@@ -100,21 +108,58 @@ checkentry(const char *tablename, - && multiinfo->count <= IPT_MULTI_PORTS; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_multiport)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_multiport)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_match multiport_match = { - .name = "multiport", - .match = &match, - .checkentry = &checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_multiport(void) -+{ -+ return visible_ipt_register_match(&multiport_match); -+} -+ -+void fini_iptable_multiport(void) -+{ -+ visible_ipt_unregister_match(&multiport_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&multiport_match); -+ int err; -+ -+ err = init_iptable_multiport(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_multiport); -+ KSYMRESOLVE(fini_iptable_multiport); -+ KSYMMODRESOLVE(ipt_multiport); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&multiport_match); -+ KSYMMODUNRESOLVE(ipt_multiport); -+ KSYMUNRESOLVE(init_iptable_multiport); -+ KSYMUNRESOLVE(fini_iptable_multiport); -+ fini_iptable_multiport(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_owner.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_owner.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_owner.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_owner.c 2006-05-11 13:05:42.000000000 +0400 -@@ -23,12 +23,13 @@ MODULE_DESCRIPTION("iptables owner match - static int - match_comm(const struct sk_buff *skb, const char *comm) - { -+#ifndef CONFIG_VE - struct task_struct *g, *p; - struct files_struct *files; - int i; - - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ 
do_each_thread_ve(g, p) { - if(strncmp(p->comm, comm, sizeof(p->comm))) - continue; - -@@ -48,20 +49,22 @@ match_comm(const struct sk_buff *skb, co - spin_unlock(&files->file_lock); - } - task_unlock(p); -- } while_each_thread(g, p); -+ } while_each_thread_ve(g, p); - read_unlock(&tasklist_lock); -+#endif - return 0; - } - - static int - match_pid(const struct sk_buff *skb, pid_t pid) - { -+#ifndef CONFIG_VE - struct task_struct *p; - struct files_struct *files; - int i; - - read_lock(&tasklist_lock); -- p = find_task_by_pid(pid); -+ p = find_task_by_pid_ve(pid); - if (!p) - goto out; - task_lock(p); -@@ -82,18 +85,20 @@ match_pid(const struct sk_buff *skb, pid - task_unlock(p); - out: - read_unlock(&tasklist_lock); -+#endif - return 0; - } - - static int - match_sid(const struct sk_buff *skb, pid_t sid) - { -+#ifndef CONFIG_VE - struct task_struct *g, *p; - struct file *file = skb->sk->sk_socket->file; - int i, found=0; - - read_lock(&tasklist_lock); -- do_each_thread(g, p) { -+ do_each_thread_ve(g, p) { - struct files_struct *files; - if (p->signal->session != sid) - continue; -@@ -113,11 +118,14 @@ match_sid(const struct sk_buff *skb, pid - task_unlock(p); - if (found) - goto out; -- } while_each_thread(g, p); -+ } while_each_thread_ve(g, p); - out: - read_unlock(&tasklist_lock); - - return found; -+#else -+ return 0; -+#endif - } - - static int -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_recent.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_recent.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_recent.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_recent.c 2006-05-11 13:05:34.000000000 +0400 -@@ -222,7 +222,7 @@ static int ip_recent_ctrl(struct file *f - curr_table->table[count].last_seen = 0; - curr_table->table[count].addr = 0; - curr_table->table[count].ttl = 0; -- memset(curr_table->table[count].last_pkts,0,ip_pkt_list_tot*sizeof(u_int32_t)); -+ 
memset(curr_table->table[count].last_pkts,0,ip_pkt_list_tot*sizeof(unsigned long)); - curr_table->table[count].oldest_pkt = 0; - curr_table->table[count].time_pos = 0; - curr_table->time_info[count].position = count; -@@ -501,7 +501,7 @@ match(const struct sk_buff *skb, - location = time_info[curr_table->time_pos].position; - hash_table[r_list[location].hash_entry] = -1; - hash_table[hash_result] = location; -- memset(r_list[location].last_pkts,0,ip_pkt_list_tot*sizeof(u_int32_t)); -+ memset(r_list[location].last_pkts,0,ip_pkt_list_tot*sizeof(unsigned long)); - r_list[location].time_pos = curr_table->time_pos; - r_list[location].addr = addr; - r_list[location].ttl = ttl; -@@ -630,7 +630,7 @@ match(const struct sk_buff *skb, - r_list[location].last_seen = 0; - r_list[location].addr = 0; - r_list[location].ttl = 0; -- memset(r_list[location].last_pkts,0,ip_pkt_list_tot*sizeof(u_int32_t)); -+ memset(r_list[location].last_pkts,0,ip_pkt_list_tot*sizeof(unsigned long)); - r_list[location].oldest_pkt = 0; - ans = !info->invert; - } -@@ -733,10 +733,10 @@ checkentry(const char *tablename, - memset(curr_table->table,0,sizeof(struct recent_ip_list)*ip_list_tot); - #ifdef DEBUG - if(debug) printk(KERN_INFO RECENT_NAME ": checkentry: Allocating %d for pkt_list.\n", -- sizeof(u_int32_t)*ip_pkt_list_tot*ip_list_tot); -+ sizeof(unsigned long)*ip_pkt_list_tot*ip_list_tot); - #endif - -- hold = vmalloc(sizeof(u_int32_t)*ip_pkt_list_tot*ip_list_tot); -+ hold = vmalloc(sizeof(unsigned long)*ip_pkt_list_tot*ip_list_tot); - #ifdef DEBUG - if(debug) printk(KERN_INFO RECENT_NAME ": checkentry: After pkt_list allocation.\n"); - #endif -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_state.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_state.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_state.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_state.c 2006-05-11 13:05:49.000000000 +0400 -@@ -10,6 +10,7 @@ - - #include <linux/module.h> - 
#include <linux/skbuff.h> -+#include <linux/nfcalls.h> - #include <linux/netfilter_ipv4/ip_conntrack.h> - #include <linux/netfilter_ipv4/ip_tables.h> - #include <linux/netfilter_ipv4/ipt_state.h> -@@ -52,22 +53,124 @@ static int check(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat_to_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct ipt_entry_match *pm; -+ struct ipt_state_info *pinfo; -+ struct compat_ipt_state_info info; -+ u_int16_t msize; -+ -+ pm = (struct ipt_entry_match *)match; -+ msize = pm->u.user.match_size; -+ if (__copy_to_user(*dstptr, pm, sizeof(struct ipt_entry_match))) -+ return -EFAULT; -+ pinfo = (struct ipt_state_info *)pm->data; -+ memset(&info, 0, sizeof(struct compat_ipt_state_info)); -+ info.statemask = pinfo->statemask; -+ if (__copy_to_user(*dstptr + sizeof(struct ipt_entry_match), -+ &info, sizeof(struct compat_ipt_state_info))) -+ return -EFAULT; -+ msize -= off; -+ if (put_user(msize, (u_int16_t *)*dstptr)) -+ return -EFAULT; -+ *size -= off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat_from_user(void *match, void **dstptr, -+ int *size, int off) -+{ -+ struct compat_ipt_entry_match *pm; -+ struct ipt_entry_match *dstpm; -+ struct compat_ipt_state_info *pinfo; -+ struct ipt_state_info info; -+ u_int16_t msize; -+ -+ pm = (struct compat_ipt_entry_match *)match; -+ dstpm = (struct ipt_entry_match *)*dstptr; -+ msize = pm->u.user.match_size; -+ memcpy(*dstptr, pm, sizeof(struct compat_ipt_entry_match)); -+ pinfo = (struct compat_ipt_state_info *)pm->data; -+ memset(&info, 0, sizeof(struct ipt_state_info)); -+ info.statemask = pinfo->statemask; -+ memcpy(*dstptr + sizeof(struct compat_ipt_entry_match), -+ &info, sizeof(struct ipt_state_info)); -+ msize += off; -+ dstpm->u.user.match_size = msize; -+ *size += off; -+ *dstptr += msize; -+ return 0; -+} -+ -+static int compat(void *match, void **dstptr, int *size, int convert) -+{ -+ int ret, off; -+ -+ off = 
IPT_ALIGN(sizeof(struct ipt_state_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct compat_ipt_state_info)); -+ switch (convert) { -+ case COMPAT_TO_USER: -+ ret = compat_to_user(match, dstptr, size, off); -+ break; -+ case COMPAT_FROM_USER: -+ ret = compat_from_user(match, dstptr, size, off); -+ break; -+ case COMPAT_CALC_SIZE: -+ *size += off; -+ ret = 0; -+ break; -+ default: -+ ret = -ENOPROTOOPT; -+ break; -+ } -+ return ret; -+} -+#endif -+ - static struct ipt_match state_match = { - .name = "state", - .match = &match, - .checkentry = &check, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_state(void) -+{ -+ return visible_ipt_register_match(&state_match); -+} -+ -+void fini_iptable_state(void) -+{ -+ visible_ipt_unregister_match(&state_match); -+} -+ - static int __init init(void) - { -+ int err; -+ - need_ip_conntrack(); -- return ipt_register_match(&state_match); -+ err = init_iptable_state(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_state); -+ KSYMRESOLVE(fini_iptable_state); -+ KSYMMODRESOLVE(ipt_state); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&state_match); -+ KSYMMODUNRESOLVE(ipt_state); -+ KSYMUNRESOLVE(init_iptable_state); -+ KSYMUNRESOLVE(fini_iptable_state); -+ fini_iptable_state(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_tcpmss.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_tcpmss.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_tcpmss.c 2004-08-14 14:54:49.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_tcpmss.c 2006-05-11 13:05:49.000000000 +0400 -@@ -10,6 +10,7 @@ - #include <linux/module.h> - #include <linux/skbuff.h> - #include <net/tcp.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ipt_tcpmss.h> - #include <linux/netfilter_ipv4/ip_tables.h> -@@ -103,28 +104,65 @@ checkentry(const char *tablename, - - /* Must specify -p tcp */ - if (ip->proto != 
IPPROTO_TCP || (ip->invflags & IPT_INV_PROTO)) { -- printk("tcpmss: Only works on TCP packets\n"); -+ ve_printk(VE_LOG, "tcpmss: Only works on TCP packets\n"); - return 0; - } - - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_tcpmss_match_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_tcpmss_match_info)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_match tcpmss_match = { - .name = "tcpmss", - .match = &match, - .checkentry = &checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_tcpmss(void) -+{ -+ return visible_ipt_register_match(&tcpmss_match); -+} -+ -+void fini_iptable_tcpmss(void) -+{ -+ visible_ipt_unregister_match(&tcpmss_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&tcpmss_match); -+ int err; -+ -+ err = init_iptable_tcpmss(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_tcpmss); -+ KSYMRESOLVE(fini_iptable_tcpmss); -+ KSYMMODRESOLVE(ipt_tcpmss); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&tcpmss_match); -+ KSYMMODUNRESOLVE(ipt_tcpmss); -+ KSYMUNRESOLVE(init_iptable_tcpmss); -+ KSYMUNRESOLVE(fini_iptable_tcpmss); -+ fini_iptable_tcpmss(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_tos.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_tos.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_tos.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_tos.c 2006-05-11 13:05:49.000000000 +0400 -@@ -10,6 +10,7 @@ - - #include <linux/module.h> - #include <linux/skbuff.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ipt_tos.h> - #include <linux/netfilter_ipv4/ip_tables.h> -@@ -17,6 +18,13 @@ - MODULE_LICENSE("GPL"); - MODULE_DESCRIPTION("iptables 
TOS match module"); - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_tos_match (*(get_exec_env()->_tos_match)) -+#else -+#define ve_tos_match tos_match -+#endif -+ - static int - match(const struct sk_buff *skb, - const struct net_device *in, -@@ -43,21 +51,58 @@ checkentry(const char *tablename, - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_tos_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_tos_info)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_match tos_match = { - .name = "tos", - .match = &match, - .checkentry = &checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_tos(void) -+{ -+ return visible_ipt_register_match(&tos_match); -+} -+ -+void fini_iptable_tos(void) -+{ -+ visible_ipt_unregister_match(&tos_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&tos_match); -+ int err; -+ -+ err = init_iptable_tos(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_tos); -+ KSYMRESOLVE(fini_iptable_tos); -+ KSYMMODRESOLVE(ipt_tos); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&tos_match); -+ KSYMMODUNRESOLVE(ipt_tos); -+ KSYMUNRESOLVE(init_iptable_tos); -+ KSYMUNRESOLVE(fini_iptable_tos); -+ fini_iptable_tos(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_ttl.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_ttl.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/ipt_ttl.c 2004-08-14 14:56:24.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/ipt_ttl.c 2006-05-11 13:05:49.000000000 +0400 -@@ -11,6 +11,7 @@ - - #include <linux/module.h> - #include <linux/skbuff.h> -+#include <linux/nfcalls.h> - - #include <linux/netfilter_ipv4/ipt_ttl.h> - #include <linux/netfilter_ipv4/ip_tables.h> 
-@@ -57,22 +58,58 @@ static int checkentry(const char *tablen - return 1; - } - -+#ifdef CONFIG_COMPAT -+static int compat(void *match, -+ void **dstptr, int *size, int convert) -+{ -+ int off; -+ -+ off = IPT_ALIGN(sizeof(struct ipt_ttl_info)) - -+ COMPAT_IPT_ALIGN(sizeof(struct ipt_ttl_info)); -+ return ipt_match_align_compat(match, dstptr, size, off, convert); -+} -+#endif -+ - static struct ipt_match ttl_match = { - .name = "ttl", - .match = &match, - .checkentry = &checkentry, -+#ifdef CONFIG_COMPAT -+ .compat = &compat, -+#endif - .me = THIS_MODULE, - }; - -+int init_iptable_ttl(void) -+{ -+ return visible_ipt_register_match(&ttl_match); -+} -+ -+void fini_iptable_ttl(void) -+{ -+ visible_ipt_unregister_match(&ttl_match); -+} -+ - static int __init init(void) - { -- return ipt_register_match(&ttl_match); -+ int err; -+ -+ err = init_iptable_ttl(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_ttl); -+ KSYMRESOLVE(fini_iptable_ttl); -+ KSYMMODRESOLVE(ipt_ttl); -+ return 0; - } - - static void __exit fini(void) - { -- ipt_unregister_match(&ttl_match); -- -+ KSYMMODUNRESOLVE(ipt_ttl); -+ KSYMUNRESOLVE(init_iptable_ttl); -+ KSYMUNRESOLVE(fini_iptable_ttl); -+ fini_iptable_ttl(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/iptable_filter.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/iptable_filter.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/iptable_filter.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/iptable_filter.c 2006-05-11 13:05:42.000000000 +0400 -@@ -11,12 +11,23 @@ - */ - - #include <linux/module.h> -+#include <linux/nfcalls.h> - #include <linux/netfilter_ipv4/ip_tables.h> -+#include <ub/ub_mem.h> - - MODULE_LICENSE("GPL"); - MODULE_AUTHOR("Netfilter Core Team <coreteam@netfilter.org>"); - MODULE_DESCRIPTION("iptables filter table"); - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_packet_filter (*(get_exec_env()->_ve_ipt_filter_pf)) 
-+#define ve_ipt_ops (get_exec_env()->_ve_ipt_filter_io) -+#else -+#define ve_packet_filter packet_filter -+#define ve_ipt_ops ipt_ops -+#endif -+ - #define FILTER_VALID_HOOKS ((1 << NF_IP_LOCAL_IN) | (1 << NF_IP_FORWARD) | (1 << NF_IP_LOCAL_OUT)) - - /* Standard entry. */ -@@ -38,12 +49,12 @@ struct ipt_error - struct ipt_error_target target; - }; - --static struct -+static struct ipt_filter_initial_table - { - struct ipt_replace repl; - struct ipt_standard entries[3]; - struct ipt_error term; --} initial_table __initdata -+} initial_table - = { { "filter", FILTER_VALID_HOOKS, 4, - sizeof(struct ipt_standard) * 3 + sizeof(struct ipt_error), - { [NF_IP_LOCAL_IN] = 0, -@@ -108,7 +119,7 @@ ipt_hook(unsigned int hook, - const struct net_device *out, - int (*okfn)(struct sk_buff *)) - { -- return ipt_do_table(pskb, hook, in, out, &packet_filter, NULL); -+ return ipt_do_table(pskb, hook, in, out, &ve_packet_filter, NULL); - } - - static unsigned int -@@ -126,7 +137,7 @@ ipt_local_out_hook(unsigned int hook, - return NF_ACCEPT; - } - -- return ipt_do_table(pskb, hook, in, out, &packet_filter, NULL); -+ return ipt_do_table(pskb, hook, in, out, &ve_packet_filter, NULL); - } - - static struct nf_hook_ops ipt_ops[] = { -@@ -157,56 +168,161 @@ static struct nf_hook_ops ipt_ops[] = { - static int forward = NF_ACCEPT; - MODULE_PARM(forward, "i"); - --static int __init init(void) -+#ifdef CONFIG_VE_IPTABLES -+static void init_ve0_iptable_filter(struct ve_struct *envid) -+{ -+ envid->_ipt_filter_initial_table = &initial_table; -+ envid->_ve_ipt_filter_pf = &packet_filter; -+ envid->_ve_ipt_filter_io = ipt_ops; -+} -+#endif -+ -+int init_iptable_filter(void) - { - int ret; -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *envid; - -- if (forward < 0 || forward > NF_MAX_VERDICT) { -- printk("iptables forward must be 0 or 1\n"); -- return -EINVAL; -- } -+ envid = get_exec_env(); - -- /* Entry 1 is the FORWARD hook */ -- initial_table.entries[1].target.verdict = -forward - 1; -+ if 
(ve_is_super(envid)) { -+ init_ve0_iptable_filter(envid); -+ } else { -+ __module_get(THIS_MODULE); -+ ret = -ENOMEM; -+ envid->_ipt_filter_initial_table = -+ ub_kmalloc(sizeof(initial_table), GFP_KERNEL); -+ if (!envid->_ipt_filter_initial_table) -+ goto nomem_1; -+ envid->_ve_ipt_filter_pf = -+ ub_kmalloc(sizeof(packet_filter), GFP_KERNEL); -+ if (!envid->_ve_ipt_filter_pf) -+ goto nomem_2; -+ envid->_ve_ipt_filter_io = -+ ub_kmalloc(sizeof(ipt_ops), GFP_KERNEL); -+ if (!envid->_ve_ipt_filter_io) -+ goto nomem_3; -+ -+ /* -+ * Note: in general, it isn't safe to copy the static table -+ * used for VE0, since that table is already registered -+ * and now has some run-time information. -+ * However, inspection of ip_tables.c shows that the only -+ * dynamically changed fields `list' and `private' are -+ * given new values in ipt_register_table() without looking -+ * at the old values. 2004/06/01 SAW -+ */ -+ memcpy(envid->_ipt_filter_initial_table, &initial_table, -+ sizeof(initial_table)); -+ memcpy(envid->_ve_ipt_filter_pf, &packet_filter, -+ sizeof(packet_filter)); -+ memcpy(envid->_ve_ipt_filter_io, &ipt_ops[0], sizeof(ipt_ops)); -+ -+ envid->_ve_ipt_filter_pf->table = -+ &envid->_ipt_filter_initial_table->repl; -+ } -+#endif - - /* Register table */ -- ret = ipt_register_table(&packet_filter); -+ ret = ipt_register_table(&ve_packet_filter); - if (ret < 0) -- return ret; -+ goto nomem_4; - - /* Register hooks */ -- ret = nf_register_hook(&ipt_ops[0]); -+ ret = nf_register_hook(&ve_ipt_ops[0]); - if (ret < 0) - goto cleanup_table; - -- ret = nf_register_hook(&ipt_ops[1]); -+ ret = nf_register_hook(&ve_ipt_ops[1]); - if (ret < 0) - goto cleanup_hook0; - -- ret = nf_register_hook(&ipt_ops[2]); -+ ret = nf_register_hook(&ve_ipt_ops[2]); - if (ret < 0) - goto cleanup_hook1; - - return ret; - - cleanup_hook1: -- nf_unregister_hook(&ipt_ops[1]); -+ nf_unregister_hook(&ve_ipt_ops[1]); - cleanup_hook0: -- nf_unregister_hook(&ipt_ops[0]); -+ 
nf_unregister_hook(&ve_ipt_ops[0]); - cleanup_table: -- ipt_unregister_table(&packet_filter); -- -+ ipt_unregister_table(&ve_packet_filter); -+ nomem_4: -+#ifdef CONFIG_VE_IPTABLES -+ if (!ve_is_super(envid)) -+ kfree(envid->_ve_ipt_filter_io); -+ envid->_ve_ipt_filter_io = NULL; -+ nomem_3: -+ if (!ve_is_super(envid)) -+ kfree(envid->_ve_ipt_filter_pf); -+ envid->_ve_ipt_filter_pf = NULL; -+ nomem_2: -+ if (!ve_is_super(envid)) -+ kfree(envid->_ipt_filter_initial_table); -+ envid->_ipt_filter_initial_table = NULL; -+ nomem_1: -+ if (!ve_is_super(envid)) -+ module_put(THIS_MODULE); -+#endif - return ret; - } - --static void __exit fini(void) -+void fini_iptable_filter(void) - { - unsigned int i; -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *envid; -+#endif - - for (i = 0; i < sizeof(ipt_ops)/sizeof(struct nf_hook_ops); i++) -- nf_unregister_hook(&ipt_ops[i]); -+ nf_unregister_hook(&ve_ipt_ops[i]); -+ -+ ipt_unregister_table(&ve_packet_filter); -+ -+#ifdef CONFIG_VE_IPTABLES -+ envid = get_exec_env(); -+ if (envid->_ipt_filter_initial_table != NULL && !ve_is_super(envid)) { -+ kfree(envid->_ipt_filter_initial_table); -+ kfree(envid->_ve_ipt_filter_pf); -+ kfree(envid->_ve_ipt_filter_io); -+ module_put(THIS_MODULE); -+ } -+ envid->_ipt_filter_initial_table = NULL; -+ envid->_ve_ipt_filter_pf = NULL; -+ envid->_ve_ipt_filter_io = NULL; -+#endif -+} -+ -+static int __init init(void) -+{ -+ int err; - -- ipt_unregister_table(&packet_filter); -+ if (forward < 0 || forward > NF_MAX_VERDICT) { -+ printk("iptables forward must be 0 or 1\n"); -+ return -EINVAL; -+ } -+ -+ /* Entry 1 is the FORWARD hook */ -+ initial_table.entries[1].target.verdict = -forward - 1; -+ -+ err = init_iptable_filter(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_filter); -+ KSYMRESOLVE(fini_iptable_filter); -+ KSYMMODRESOLVE(iptable_filter); -+ return 0; -+} -+ -+static void __exit fini(void) -+{ -+ KSYMMODUNRESOLVE(iptable_filter); -+ KSYMUNRESOLVE(init_iptable_filter); -+ 
KSYMUNRESOLVE(fini_iptable_filter); -+ fini_iptable_filter(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/netfilter/iptable_mangle.c linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/iptable_mangle.c ---- linux-2.6.8.1.orig/net/ipv4/netfilter/iptable_mangle.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/netfilter/iptable_mangle.c 2006-05-11 13:05:42.000000000 +0400 -@@ -17,6 +17,7 @@ - #include <linux/skbuff.h> - #include <net/sock.h> - #include <net/route.h> -+#include <linux/nfcalls.h> - #include <linux/ip.h> - - MODULE_LICENSE("GPL"); -@@ -54,7 +55,7 @@ static struct - struct ipt_replace repl; - struct ipt_standard entries[5]; - struct ipt_error term; --} initial_table __initdata -+} initial_table - = { { "mangle", MANGLE_VALID_HOOKS, 6, - sizeof(struct ipt_standard) * 5 + sizeof(struct ipt_error), - { [NF_IP_PRE_ROUTING] = 0, -@@ -131,6 +132,13 @@ static struct ipt_table packet_mangler = - .me = THIS_MODULE, - }; - -+#ifdef CONFIG_VE_IPTABLES -+#include <linux/sched.h> -+#define ve_packet_mangler (*(get_exec_env()->_ipt_mangle_table)) -+#else -+#define ve_packet_mangler packet_mangler -+#endif -+ - /* The work comes in here from netfilter.c. */ - static unsigned int - ipt_route_hook(unsigned int hook, -@@ -139,7 +147,7 @@ ipt_route_hook(unsigned int hook, - const struct net_device *out, - int (*okfn)(struct sk_buff *)) - { -- return ipt_do_table(pskb, hook, in, out, &packet_mangler, NULL); -+ return ipt_do_table(pskb, hook, in, out, &ve_packet_mangler, NULL); - } - - static unsigned int -@@ -168,7 +176,8 @@ ipt_local_hook(unsigned int hook, - daddr = (*pskb)->nh.iph->daddr; - tos = (*pskb)->nh.iph->tos; - -- ret = ipt_do_table(pskb, hook, in, out, &packet_mangler, NULL); -+ ret = ipt_do_table(pskb, hook, in, out, &ve_packet_mangler, NULL); -+ - /* Reroute for ANY change. 
*/ - if (ret != NF_DROP && ret != NF_STOLEN && ret != NF_QUEUE - && ((*pskb)->nh.iph->saddr != saddr -@@ -220,12 +229,12 @@ static struct nf_hook_ops ipt_ops[] = { - }, - }; - --static int __init init(void) -+static int mangle_init(struct ipt_table *packet_mangler, struct nf_hook_ops ipt_ops[]) - { - int ret; - - /* Register table */ -- ret = ipt_register_table(&packet_mangler); -+ ret = ipt_register_table(packet_mangler); - if (ret < 0) - return ret; - -@@ -261,19 +270,117 @@ static int __init init(void) - cleanup_hook0: - nf_unregister_hook(&ipt_ops[0]); - cleanup_table: -- ipt_unregister_table(&packet_mangler); -+ ipt_unregister_table(packet_mangler); - - return ret; - } - --static void __exit fini(void) -+static void mangle_fini(struct ipt_table *packet_mangler, struct nf_hook_ops ipt_ops[]) - { - unsigned int i; - -- for (i = 0; i < sizeof(ipt_ops)/sizeof(struct nf_hook_ops); i++) -+ for (i = 0; i < 5; i++) - nf_unregister_hook(&ipt_ops[i]); - -- ipt_unregister_table(&packet_mangler); -+ ipt_unregister_table(packet_mangler); -+} -+ -+int init_iptable_mangle(void) -+{ -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *envid; -+ struct ipt_table *table; -+ struct nf_hook_ops *hooks; -+ int err; -+ -+ envid = get_exec_env(); -+ if (ve_is_super(envid)) { -+ table = &packet_mangler; -+ hooks = ipt_ops; -+ } else { -+ __module_get(THIS_MODULE); -+ err = -ENOMEM; -+ table = kmalloc(sizeof(packet_mangler), GFP_KERNEL); -+ if (table == NULL) -+ goto nomem_1; -+ hooks = kmalloc(sizeof(ipt_ops), GFP_KERNEL); -+ if (hooks == NULL) -+ goto nomem_2; -+ -+ memcpy(table, &packet_mangler, sizeof(packet_mangler)); -+ memcpy(hooks, ipt_ops, sizeof(ipt_ops)); -+ } -+ envid->_ipt_mangle_hooks = hooks; -+ envid->_ipt_mangle_table = table; -+ -+ err = mangle_init(table, hooks); -+ if (err) -+ goto err_minit; -+ -+ return 0; -+ -+err_minit: -+ envid->_ipt_mangle_table = NULL; -+ envid->_ipt_mangle_hooks = NULL; -+ if (!ve_is_super(envid)) -+ kfree(hooks); -+nomem_2: -+ if 
(!ve_is_super(envid)) { -+ kfree(table); -+nomem_1: -+ module_put(THIS_MODULE); -+ } -+ return err; -+#else -+ return mangle_init(&packet_mangler, ipt_ops); -+#endif -+} -+ -+void fini_iptable_mangle(void) -+{ -+#ifdef CONFIG_VE_IPTABLES -+ struct ve_struct *envid; -+ struct ipt_table *table; -+ struct nf_hook_ops *hooks; -+ -+ envid = get_exec_env(); -+ table = envid->_ipt_mangle_table; -+ hooks = envid->_ipt_mangle_hooks; -+ if (table == NULL) -+ return; -+ mangle_fini(table, hooks); -+ envid->_ipt_mangle_table = NULL; -+ envid->_ipt_mangle_hooks = NULL; -+ if (!ve_is_super(envid)) { -+ kfree(hooks); -+ kfree(table); -+ module_put(THIS_MODULE); -+ } -+#else -+ mangle_fini(&packet_mangler, ipt_ops); -+#endif -+} -+ -+static int __init init(void) -+{ -+ int err; -+ -+ err = init_iptable_mangle(); -+ if (err < 0) -+ return err; -+ -+ KSYMRESOLVE(init_iptable_mangle); -+ KSYMRESOLVE(fini_iptable_mangle); -+ KSYMMODRESOLVE(iptable_mangle); -+ return 0; -+} -+ -+static void __exit fini(void) -+{ -+ KSYMMODUNRESOLVE(iptable_mangle); -+ KSYMUNRESOLVE(init_iptable_mangle); -+ KSYMUNRESOLVE(fini_iptable_mangle); -+ fini_iptable_mangle(); - } - - module_init(init); -diff -uprN linux-2.6.8.1.orig/net/ipv4/proc.c linux-2.6.8.1-ve022stab078/net/ipv4/proc.c ---- linux-2.6.8.1.orig/net/ipv4/proc.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/proc.c 2006-05-11 13:05:42.000000000 +0400 -@@ -262,11 +262,12 @@ static int snmp_seq_show(struct seq_file - seq_printf(seq, " %s", snmp4_ipstats_list[i].name); - - seq_printf(seq, "\nIp: %d %d", -- ipv4_devconf.forwarding ? 1 : 2, sysctl_ip_default_ttl); -+ ve_ipv4_devconf.forwarding ? 
1 : 2, -+ sysctl_ip_default_ttl); - - for (i = 0; snmp4_ipstats_list[i].name != NULL; i++) - seq_printf(seq, " %lu", -- fold_field((void **) ip_statistics, -+ fold_field((void **) ve_ip_statistics, - snmp4_ipstats_list[i].entry)); - - seq_puts(seq, "\nIcmp:"); -@@ -276,7 +277,7 @@ static int snmp_seq_show(struct seq_file - seq_puts(seq, "\nIcmp:"); - for (i = 0; snmp4_icmp_list[i].name != NULL; i++) - seq_printf(seq, " %lu", -- fold_field((void **) icmp_statistics, -+ fold_field((void **) ve_icmp_statistics, - snmp4_icmp_list[i].entry)); - - seq_puts(seq, "\nTcp:"); -@@ -288,11 +289,11 @@ static int snmp_seq_show(struct seq_file - /* MaxConn field is signed, RFC 2012 */ - if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN) - seq_printf(seq, " %ld", -- fold_field((void **) tcp_statistics, -+ fold_field((void **) ve_tcp_statistics, - snmp4_tcp_list[i].entry)); - else - seq_printf(seq, " %lu", -- fold_field((void **) tcp_statistics, -+ fold_field((void **) ve_tcp_statistics, - snmp4_tcp_list[i].entry)); - } - -@@ -303,7 +304,7 @@ static int snmp_seq_show(struct seq_file - seq_puts(seq, "\nUdp:"); - for (i = 0; snmp4_udp_list[i].name != NULL; i++) - seq_printf(seq, " %lu", -- fold_field((void **) udp_statistics, -+ fold_field((void **) ve_udp_statistics, - snmp4_udp_list[i].entry)); - - seq_putc(seq, '\n'); -@@ -337,7 +338,7 @@ static int netstat_seq_show(struct seq_f - seq_puts(seq, "\nTcpExt:"); - for (i = 0; snmp4_net_list[i].name != NULL; i++) - seq_printf(seq, " %lu", -- fold_field((void **) net_statistics, -+ fold_field((void **) ve_net_statistics, - snmp4_net_list[i].entry)); - - seq_putc(seq, '\n'); -diff -uprN linux-2.6.8.1.orig/net/ipv4/raw.c linux-2.6.8.1-ve022stab078/net/ipv4/raw.c ---- linux-2.6.8.1.orig/net/ipv4/raw.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/raw.c 2006-05-11 13:05:42.000000000 +0400 -@@ -114,7 +114,8 @@ struct sock *__raw_v4_lookup(struct sock - if (inet->num == num && - !(inet->daddr && inet->daddr != 
raddr) && - !(inet->rcv_saddr && inet->rcv_saddr != laddr) && -- !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)) -+ !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif) && -+ ve_accessible_strict(VE_OWNER_SK(sk), get_exec_env())) - goto found; /* gotcha */ - } - sk = NULL; -@@ -689,8 +690,12 @@ static struct sock *raw_get_first(struct - struct hlist_node *node; - - sk_for_each(sk, node, &raw_v4_htable[state->bucket]) -- if (sk->sk_family == PF_INET) -+ if (sk->sk_family == PF_INET) { -+ if (!ve_accessible(VE_OWNER_SK(sk), -+ get_exec_env())) -+ continue; - goto found; -+ } - } - sk = NULL; - found: -@@ -704,8 +709,14 @@ static struct sock *raw_get_next(struct - do { - sk = sk_next(sk); - try_again: -- ; -- } while (sk && sk->sk_family != PF_INET); -+ if (!sk) -+ break; -+ if (sk->sk_family != PF_INET) -+ continue; -+ if (ve_accessible(VE_OWNER_SK(sk), -+ get_exec_env())) -+ break; -+ } while (1); - - if (!sk && ++state->bucket < RAWV4_HTABLE_SIZE) { - sk = sk_head(&raw_v4_htable[state->bucket]); -diff -uprN linux-2.6.8.1.orig/net/ipv4/route.c linux-2.6.8.1-ve022stab078/net/ipv4/route.c ---- linux-2.6.8.1.orig/net/ipv4/route.c 2004-08-14 14:56:23.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/route.c 2006-05-11 13:05:42.000000000 +0400 -@@ -108,6 +108,8 @@ - - #define RT_GC_TIMEOUT (300*HZ) - -+int ip_rt_src_check = 1; -+ - int ip_rt_min_delay = 2 * HZ; - int ip_rt_max_delay = 10 * HZ; - int ip_rt_max_size; -@@ -215,11 +217,28 @@ static unsigned int rt_hash_code(u32 dad - & rt_hash_mask); - } - -+void prepare_rt_cache(void) -+{ -+#ifdef CONFIG_VE -+ struct rtable *r; -+ int i; -+ -+ for (i = rt_hash_mask; i >= 0; i--) { -+ spin_lock_bh(&rt_hash_table[i].lock); -+ for (r = rt_hash_table[i].chain; r; r = r->u.rt_next) { -+ r->fl.owner_env = get_ve0(); -+ } -+ spin_unlock_bh(&rt_hash_table[i].lock); -+ } -+#endif -+} -+ - #ifdef CONFIG_PROC_FS - struct rt_cache_iter_state { - int bucket; - }; - -+static struct rtable *rt_cache_get_next(struct seq_file 
*seq, struct rtable *r); - static struct rtable *rt_cache_get_first(struct seq_file *seq) - { - struct rtable *r = NULL; -@@ -232,6 +251,8 @@ static struct rtable *rt_cache_get_first - break; - rcu_read_unlock(); - } -+ if (r && !ve_accessible_strict(r->fl.owner_env, get_exec_env())) -+ r = rt_cache_get_next(seq, r); - return r; - } - -@@ -239,15 +260,20 @@ static struct rtable *rt_cache_get_next( - { - struct rt_cache_iter_state *st = seq->private; - -+start: - smp_read_barrier_depends(); -- r = r->u.rt_next; -+ do { -+ r = r->u.rt_next; -+ } while (r && !ve_accessible_strict(r->fl.owner_env, get_exec_env())); - while (!r) { - rcu_read_unlock(); - if (--st->bucket < 0) -- break; -+ goto out; - rcu_read_lock(); - r = rt_hash_table[st->bucket].chain; - } -+ goto start; -+out: - return r; - } - -@@ -549,26 +575,106 @@ static void rt_check_expire(unsigned lon - mod_timer(&rt_periodic_timer, now + ip_rt_gc_interval); - } - -+typedef unsigned long rt_flush_gen_t; -+ -+#ifdef CONFIG_VE -+ -+static rt_flush_gen_t rt_flush_gen; -+ -+/* called under rt_flush_lock */ -+static void set_rt_flush_required(struct ve_struct *env) -+{ -+ /* -+ * If the global generation rt_flush_gen is equal to G, then -+ * the pass considering entries labelled by G is yet to come. 
-+ */ -+ env->rt_flush_required = rt_flush_gen; -+} -+ -+static spinlock_t rt_flush_lock; -+static rt_flush_gen_t reset_rt_flush_required(void) -+{ -+ rt_flush_gen_t g; -+ -+ spin_lock_bh(&rt_flush_lock); -+ g = rt_flush_gen++; -+ spin_unlock_bh(&rt_flush_lock); -+ return g; -+} -+ -+static int check_rt_flush_required(struct ve_struct *env, rt_flush_gen_t gen) -+{ -+ /* can be checked without the lock */ -+ return env->rt_flush_required >= gen; -+} -+ -+#else -+ -+static void set_rt_flush_required(struct ve_struct *env) -+{ -+} -+ -+static rt_flush_gen_t reset_rt_flush_required(void) -+{ -+ return 0; -+} -+ -+#endif -+ - /* This can run from both BH and non-BH contexts, the latter - * in the case of a forced flush event. - */ - static void rt_run_flush(unsigned long dummy) - { - int i; -- struct rtable *rth, *next; -+ struct rtable * rth, * next; -+ struct rtable * tail; -+ rt_flush_gen_t gen; - - rt_deadline = 0; - - get_random_bytes(&rt_hash_rnd, 4); - -+ gen = reset_rt_flush_required(); -+ - for (i = rt_hash_mask; i >= 0; i--) { -+#ifdef CONFIG_VE -+ struct rtable ** prev, * p; -+ -+ spin_lock_bh(&rt_hash_table[i].lock); -+ rth = rt_hash_table[i].chain; -+ -+ /* defer releasing the head of the list after spin_unlock */ -+ for (tail = rth; tail; tail = tail->u.rt_next) -+ if (!check_rt_flush_required(tail->fl.owner_env, gen)) -+ break; -+ if (rth != tail) -+ rt_hash_table[i].chain = tail; -+ -+ /* call rt_free on entries after the tail requiring flush */ -+ prev = &rt_hash_table[i].chain; -+ for (p = *prev; p; p = next) { -+ next = p->u.rt_next; -+ if (!check_rt_flush_required(p->fl.owner_env, gen)) { -+ prev = &p->u.rt_next; -+ } else { -+ *prev = next; -+ rt_free(p); -+ } -+ } -+ -+#else - spin_lock_bh(&rt_hash_table[i].lock); - rth = rt_hash_table[i].chain; -+ - if (rth) - rt_hash_table[i].chain = NULL; -+ tail = NULL; -+ -+#endif - spin_unlock_bh(&rt_hash_table[i].lock); - -- for (; rth; rth = next) { -+ for (; rth != tail; rth = next) { - next = 
rth->u.rt_next; - rt_free(rth); - } -@@ -604,6 +710,8 @@ void rt_cache_flush(int delay) - delay = tmo; - } - -+ set_rt_flush_required(get_exec_env()); -+ - if (delay <= 0) { - spin_unlock_bh(&rt_flush_lock); - rt_run_flush(0); -@@ -619,9 +727,30 @@ void rt_cache_flush(int delay) - - static void rt_secret_rebuild(unsigned long dummy) - { -+ int i; -+ struct rtable *rth, *next; - unsigned long now = jiffies; - -- rt_cache_flush(0); -+ spin_lock_bh(&rt_flush_lock); -+ del_timer(&rt_flush_timer); -+ spin_unlock_bh(&rt_flush_lock); -+ -+ rt_deadline = 0; -+ get_random_bytes(&rt_hash_rnd, 4); -+ -+ for (i = rt_hash_mask; i >= 0; i--) { -+ spin_lock_bh(&rt_hash_table[i].lock); -+ rth = rt_hash_table[i].chain; -+ if (rth) -+ rt_hash_table[i].chain = NULL; -+ spin_unlock_bh(&rt_hash_table[i].lock); -+ -+ for (; rth; rth = next) { -+ next = rth->u.rt_next; -+ rt_free(rth); -+ } -+ } -+ - mod_timer(&rt_secret_timer, now + ip_rt_secret_interval); - } - -@@ -763,7 +892,8 @@ static inline int compare_keys(struct fl - { - return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 && - fl1->oif == fl2->oif && -- fl1->iif == fl2->iif; -+ fl1->iif == fl2->iif && -+ ve_accessible_strict(fl1->owner_env, fl2->owner_env); - } - - static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp) -@@ -975,7 +1105,9 @@ void ip_rt_redirect(u32 old_gw, u32 dadd - struct rtable *rth, **rthp; - u32 skeys[2] = { saddr, 0 }; - int ikeys[2] = { dev->ifindex, 0 }; -+ struct ve_struct *ve; - -+ ve = get_exec_env(); - tos &= IPTOS_RT_MASK; - - if (!in_dev) -@@ -1012,6 +1144,10 @@ void ip_rt_redirect(u32 old_gw, u32 dadd - rth->fl.fl4_src != skeys[i] || - rth->fl.fl4_tos != tos || - rth->fl.oif != ikeys[k] || -+#ifdef CONFIG_VE -+ !ve_accessible_strict(rth->fl.owner_env, -+ ve) || -+#endif - rth->fl.iif != 0) { - rthp = &rth->u.rt_next; - continue; -@@ -1050,6 +1186,9 @@ void ip_rt_redirect(u32 old_gw, u32 dadd - rt->u.dst.neighbour = NULL; - rt->u.dst.hh = NULL; - 
rt->u.dst.xfrm = NULL; -+#ifdef CONFIG_VE -+ rt->fl.owner_env = ve; -+#endif - - rt->rt_flags |= RTCF_REDIRECTED; - -@@ -1495,6 +1634,9 @@ static int ip_route_input_mc(struct sk_b - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark= skb->nfmark; - #endif -+#ifdef CONFIG_VE -+ rth->fl.owner_env = get_exec_env(); -+#endif - rth->fl.fl4_src = saddr; - rth->rt_src = saddr; - #ifdef CONFIG_IP_ROUTE_NAT -@@ -1506,7 +1648,7 @@ static int ip_route_input_mc(struct sk_b - #endif - rth->rt_iif = - rth->fl.iif = dev->ifindex; -- rth->u.dst.dev = &loopback_dev; -+ rth->u.dst.dev = &visible_loopback_dev; - dev_hold(rth->u.dst.dev); - rth->idev = in_dev_get(rth->u.dst.dev); - rth->fl.oif = 0; -@@ -1641,7 +1783,7 @@ static int ip_route_input_slow(struct sk - if (res.type == RTN_LOCAL) { - int result; - result = fib_validate_source(saddr, daddr, tos, -- loopback_dev.ifindex, -+ visible_loopback_dev.ifindex, - dev, &spec_dst, &itag); - if (result < 0) - goto martian_source; -@@ -1705,6 +1847,9 @@ static int ip_route_input_slow(struct sk - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark= skb->nfmark; - #endif -+#ifdef CONFIG_VE -+ rth->fl.owner_env = get_exec_env(); -+#endif - rth->fl.fl4_src = saddr; - rth->rt_src = saddr; - rth->rt_gateway = daddr; -@@ -1774,6 +1919,9 @@ local_input: - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark= skb->nfmark; - #endif -+#ifdef CONFIG_VE -+ rth->fl.owner_env = get_exec_env(); -+#endif - rth->fl.fl4_src = saddr; - rth->rt_src = saddr; - #ifdef CONFIG_IP_ROUTE_NAT -@@ -1785,7 +1933,7 @@ local_input: - #endif - rth->rt_iif = - rth->fl.iif = dev->ifindex; -- rth->u.dst.dev = &loopback_dev; -+ rth->u.dst.dev = &visible_loopback_dev; - dev_hold(rth->u.dst.dev); - rth->idev = in_dev_get(rth->u.dst.dev); - rth->rt_gateway = daddr; -@@ -1873,6 +2021,9 @@ int ip_route_input(struct sk_buff *skb, - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark == skb->nfmark && - #endif -+#ifdef CONFIG_VE -+ rth->fl.owner_env == get_exec_env() && -+#endif - 
rth->fl.fl4_tos == tos) { - rth->u.dst.lastuse = jiffies; - dst_hold(&rth->u.dst); -@@ -1938,7 +2089,7 @@ static int ip_route_output_slow(struct r - .fwmark = oldflp->fl4_fwmark - #endif - } }, -- .iif = loopback_dev.ifindex, -+ .iif = visible_loopback_dev.ifindex, - .oif = oldflp->oif }; - struct fib_result res; - unsigned flags = 0; -@@ -1961,10 +2112,13 @@ static int ip_route_output_slow(struct r - ZERONET(oldflp->fl4_src)) - goto out; - -- /* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */ -- dev_out = ip_dev_find(oldflp->fl4_src); -- if (dev_out == NULL) -- goto out; -+ if (ip_rt_src_check) { -+ /* It is equivalent to -+ inet_addr_type(saddr) == RTN_LOCAL */ -+ dev_out = ip_dev_find(oldflp->fl4_src); -+ if (dev_out == NULL) -+ goto out; -+ } - - /* I removed check for oif == dev_out->oif here. - It was wrong for two reasons: -@@ -1991,6 +2145,12 @@ static int ip_route_output_slow(struct r - Luckily, this hack is good workaround. - */ - -+ if (dev_out == NULL) { -+ dev_out = ip_dev_find(oldflp->fl4_src); -+ if (dev_out == NULL) -+ goto out; -+ } -+ - fl.oif = dev_out->ifindex; - goto make_route; - } -@@ -2030,9 +2190,9 @@ static int ip_route_output_slow(struct r - fl.fl4_dst = fl.fl4_src = htonl(INADDR_LOOPBACK); - if (dev_out) - dev_put(dev_out); -- dev_out = &loopback_dev; -+ dev_out = &visible_loopback_dev; - dev_hold(dev_out); -- fl.oif = loopback_dev.ifindex; -+ fl.oif = visible_loopback_dev.ifindex; - res.type = RTN_LOCAL; - flags |= RTCF_LOCAL; - goto make_route; -@@ -2080,7 +2240,7 @@ static int ip_route_output_slow(struct r - fl.fl4_src = fl.fl4_dst; - if (dev_out) - dev_put(dev_out); -- dev_out = &loopback_dev; -+ dev_out = &visible_loopback_dev; - dev_hold(dev_out); - fl.oif = dev_out->ifindex; - if (res.fi) -@@ -2162,6 +2322,9 @@ make_route: - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark= oldflp->fl4_fwmark; - #endif -+#ifdef CONFIG_VE -+ rth->fl.owner_env = get_exec_env(); -+#endif - rth->rt_dst = fl.fl4_dst; - rth->rt_src = 
fl.fl4_src; - #ifdef CONFIG_IP_ROUTE_NAT -@@ -2241,6 +2404,7 @@ int __ip_route_output_key(struct rtable - #ifdef CONFIG_IP_ROUTE_FWMARK - rth->fl.fl4_fwmark == flp->fl4_fwmark && - #endif -+ ve_accessible_strict(rth->fl.owner_env, get_exec_env()) && - !((rth->fl.fl4_tos ^ flp->fl4_tos) & - (IPTOS_RT_MASK | RTO_ONLINK))) { - rth->u.dst.lastuse = jiffies; -@@ -2345,7 +2509,7 @@ static int rt_fill_info(struct sk_buff * - u32 dst = rt->rt_dst; - - if (MULTICAST(dst) && !LOCAL_MCAST(dst) && -- ipv4_devconf.mc_forwarding) { -+ ve_ipv4_devconf.mc_forwarding) { - int err = ipmr_get_route(skb, r, nowait); - if (err <= 0) { - if (!nowait) { -@@ -2390,7 +2554,10 @@ int inet_rtm_getroute(struct sk_buff *in - /* Reserve room for dummy headers, this skb can pass - through good chunk of routing engine. - */ -- skb->mac.raw = skb->data; -+ skb->mac.raw = skb->nh.raw = skb->data; -+ -+ /* Bugfix: need to give ip_route_input enough of an IP header to not gag. */ -+ skb->nh.iph->protocol = IPPROTO_ICMP; - skb_reserve(skb, MAX_HEADER + sizeof(struct iphdr)); - - if (rta[RTA_SRC - 1]) -@@ -2496,6 +2663,11 @@ void ip_rt_multicast_event(struct in_dev - #ifdef CONFIG_SYSCTL - static int flush_delay; - -+void *get_flush_delay_addr(void) -+{ -+ return &flush_delay; -+} -+ - static int ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, - struct file *filp, void __user *buffer, - size_t *lenp, loff_t *ppos) -@@ -2509,6 +2681,13 @@ static int ipv4_sysctl_rtcache_flush(ctl - return -EINVAL; - } - -+int visible_ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, -+ struct file *filp, void __user *buffer, -+ size_t *lenp, loff_t *ppos) -+{ -+ return ipv4_sysctl_rtcache_flush(ctl, write, filp, buffer, lenp, ppos); -+} -+ - static int ipv4_sysctl_rtcache_flush_strategy(ctl_table *table, - int __user *name, - int nlen, -@@ -2527,6 +2706,19 @@ static int ipv4_sysctl_rtcache_flush_str - return 0; - } - -+int visible_ipv4_sysctl_rtcache_flush_strategy(ctl_table *table, -+ int __user *name, -+ int 
nlen, -+ void __user *oldval, -+ size_t __user *oldlenp, -+ void __user *newval, -+ size_t newlen, -+ void **context) -+{ -+ return ipv4_sysctl_rtcache_flush_strategy(table, name, nlen, oldval, -+ oldlenp, newval, newlen, context); -+} -+ - ctl_table ipv4_route_table[] = { - { - .ctl_name = NET_IPV4_ROUTE_FLUSH, -@@ -2838,7 +3030,7 @@ int __init ip_rt_init(void) - } - - #ifdef CONFIG_NET_CLS_ROUTE -- create_proc_read_entry("rt_acct", 0, proc_net, ip_rt_acct_read, NULL); -+ create_proc_read_entry("net/rt_acct", 0, NULL, ip_rt_acct_read, NULL); - #endif - #endif - #ifdef CONFIG_XFRM -diff -uprN linux-2.6.8.1.orig/net/ipv4/sysctl_net_ipv4.c linux-2.6.8.1-ve022stab078/net/ipv4/sysctl_net_ipv4.c ---- linux-2.6.8.1.orig/net/ipv4/sysctl_net_ipv4.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/sysctl_net_ipv4.c 2006-05-11 13:05:42.000000000 +0400 -@@ -48,6 +48,8 @@ extern int inet_peer_maxttl; - extern int inet_peer_gc_mintime; - extern int inet_peer_gc_maxtime; - -+int sysctl_tcp_use_sg = 1; -+ - #ifdef CONFIG_SYSCTL - static int tcp_retr1_max = 255; - static int ip_local_port_range_min[] = { 1, 1 }; -@@ -64,17 +66,23 @@ static - int ipv4_sysctl_forward(ctl_table *ctl, int write, struct file * filp, - void __user *buffer, size_t *lenp, loff_t *ppos) - { -- int val = ipv4_devconf.forwarding; -+ int val = ve_ipv4_devconf.forwarding; - int ret; - - ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos); - -- if (write && ipv4_devconf.forwarding != val) -+ if (write && ve_ipv4_devconf.forwarding != val) - inet_forward_change(); - - return ret; - } - -+int visible_ipv4_sysctl_forward(ctl_table *ctl, int write, struct file * filp, -+ void __user *buffer, size_t *lenp, loff_t *ppos) -+{ -+ return ipv4_sysctl_forward(ctl, write, filp, buffer, lenp, ppos); -+} -+ - static int ipv4_sysctl_forward_strategy(ctl_table *table, - int __user *name, int nlen, - void __user *oldval, size_t __user *oldlenp, -@@ -117,6 +125,16 @@ static int 
ipv4_sysctl_forward_strategy( - return 1; - } - -+int visible_ipv4_sysctl_forward_strategy(ctl_table *table, -+ int __user *name, int nlen, -+ void __user *oldval, size_t __user *oldlenp, -+ void __user *newval, size_t newlen, -+ void **context) -+{ -+ return ipv4_sysctl_forward_strategy(table, name, nlen, -+ oldval, oldlenp, newval, newlen, context); -+} -+ - ctl_table ipv4_table[] = { - { - .ctl_name = NET_IPV4_TCP_TIMESTAMPS, -@@ -682,6 +700,14 @@ ctl_table ipv4_table[] = { - .mode = 0644, - .proc_handler = &proc_dointvec, - }, -+ { -+ .ctl_name = NET_TCP_USE_SG, -+ .procname = "tcp_use_sg", -+ .data = &sysctl_tcp_use_sg, -+ .maxlen = sizeof(int), -+ .mode = 0644, -+ .proc_handler = &proc_dointvec, -+ }, - { .ctl_name = 0 } - }; - -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp.c ---- linux-2.6.8.1.orig/net/ipv4/tcp.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp.c 2006-05-11 13:05:44.000000000 +0400 -@@ -248,6 +248,7 @@ - */ - - #include <linux/config.h> -+#include <linux/kmem_cache.h> - #include <linux/module.h> - #include <linux/types.h> - #include <linux/fcntl.h> -@@ -262,6 +263,9 @@ - #include <net/xfrm.h> - #include <net/ip.h> - -+#include <ub/ub_orphan.h> -+#include <ub/ub_net.h> -+#include <ub/ub_tcp.h> - - #include <asm/uaccess.h> - #include <asm/ioctls.h> -@@ -333,6 +337,7 @@ unsigned int tcp_poll(struct file *file, - unsigned int mask; - struct sock *sk = sock->sk; - struct tcp_opt *tp = tcp_sk(sk); -+ int check_send_space; - - poll_wait(file, sk->sk_sleep, wait); - if (sk->sk_state == TCP_LISTEN) -@@ -347,6 +352,21 @@ unsigned int tcp_poll(struct file *file, - if (sk->sk_err) - mask = POLLERR; - -+ check_send_space = 1; -+#ifdef CONFIG_USER_RESOURCE -+ if (!(sk->sk_shutdown & SEND_SHUTDOWN) && sock_has_ubc(sk)) { -+ unsigned long size; -+ size = MAX_TCP_HEADER + tp->mss_cache; -+ if (size > SOCK_MIN_UBCSPACE) -+ size = SOCK_MIN_UBCSPACE; -+ size = skb_charge_size(size); -+ if 
(ub_sock_makewres_tcp(sk, size)) { -+ check_send_space = 0; -+ ub_sock_sndqueueadd_tcp(sk, size); -+ } -+ } -+#endif -+ - /* - * POLLHUP is certainly not done right. But poll() doesn't - * have a notion of HUP in just one direction, and for a -@@ -390,7 +410,7 @@ unsigned int tcp_poll(struct file *file, - sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data)) - mask |= POLLIN | POLLRDNORM; - -- if (!(sk->sk_shutdown & SEND_SHUTDOWN)) { -+ if (check_send_space && !(sk->sk_shutdown & SEND_SHUTDOWN)) { - if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) { - mask |= POLLOUT | POLLWRNORM; - } else { /* send SIGIO later */ -@@ -566,7 +586,7 @@ static void tcp_listen_stop (struct sock - - sock_orphan(child); - -- atomic_inc(&tcp_orphan_count); -+ tcp_inc_orphan_count(child); - - tcp_destroy_sock(child); - -@@ -659,16 +679,23 @@ static ssize_t do_tcp_sendpages(struct s - int copy, i; - int offset = poffset % PAGE_SIZE; - int size = min_t(size_t, psize, PAGE_SIZE - offset); -+ unsigned long chargesize = 0; - - if (!sk->sk_send_head || (copy = mss_now - skb->len) <= 0) { - new_segment: -+ chargesize = 0; - if (!sk_stream_memory_free(sk)) - goto wait_for_sndbuf; - -+ chargesize = skb_charge_size(MAX_TCP_HEADER + -+ tp->mss_cache); -+ if (ub_sock_getwres_tcp(sk, chargesize) < 0) -+ goto wait_for_ubspace; - skb = sk_stream_alloc_pskb(sk, 0, tp->mss_cache, - sk->sk_allocation); - if (!skb) - goto wait_for_memory; -+ ub_skb_set_charge(skb, sk, chargesize, UB_TCPSNDBUF); - - skb_entail(sk, tp, skb); - copy = mss_now; -@@ -715,10 +742,14 @@ new_segment: - wait_for_sndbuf: - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - wait_for_memory: -+ ub_sock_retwres_tcp(sk, chargesize, -+ skb_charge_size(MAX_TCP_HEADER + tp->mss_cache)); -+ chargesize = 0; -+wait_for_ubspace: - if (copied) - tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); - -- if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) -+ if ((err = sk_stream_wait_memory(sk, &timeo, chargesize)) != 0) - goto do_error; - 
- mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); -@@ -758,9 +789,6 @@ ssize_t tcp_sendpage(struct socket *sock - return res; - } - --#define TCP_PAGE(sk) (sk->sk_sndmsg_page) --#define TCP_OFF(sk) (sk->sk_sndmsg_off) -- - static inline int select_size(struct sock *sk, struct tcp_opt *tp) - { - int tmp = tp->mss_cache_std; -@@ -814,6 +842,7 @@ int tcp_sendmsg(struct kiocb *iocb, stru - while (--iovlen >= 0) { - int seglen = iov->iov_len; - unsigned char __user *from = iov->iov_base; -+ unsigned long chargesize = 0; - - iov++; - -@@ -824,18 +853,26 @@ int tcp_sendmsg(struct kiocb *iocb, stru - - if (!sk->sk_send_head || - (copy = mss_now - skb->len) <= 0) { -+ unsigned long size; - - new_segment: - /* Allocate new segment. If the interface is SG, - * allocate skb fitting to single page. - */ -+ chargesize = 0; - if (!sk_stream_memory_free(sk)) - goto wait_for_sndbuf; -- -- skb = sk_stream_alloc_pskb(sk, select_size(sk, tp), -- 0, sk->sk_allocation); -+ size = select_size(sk, tp); -+ chargesize = skb_charge_size(MAX_TCP_HEADER + -+ size); -+ if (ub_sock_getwres_tcp(sk, chargesize) < 0) -+ goto wait_for_ubspace; -+ skb = sk_stream_alloc_pskb(sk, size, 0, -+ sk->sk_allocation); - if (!skb) - goto wait_for_memory; -+ ub_skb_set_charge(skb, sk, chargesize, -+ UB_TCPSNDBUF); - - /* - * Check whether we can use HW checksum. -@@ -888,11 +925,15 @@ new_segment: - ~(L1_CACHE_BYTES - 1); - if (off == PAGE_SIZE) { - put_page(page); -+ ub_sock_tcp_detachpage(sk); - TCP_PAGE(sk) = page = NULL; - } - } - - if (!page) { -+ chargesize = PAGE_SIZE; -+ if (ub_sock_tcp_chargepage(sk) < 0) -+ goto wait_for_ubspace; - /* Allocate new cache page. 
*/ - if (!(page = sk_stream_alloc_page(sk))) - goto wait_for_memory; -@@ -928,7 +969,8 @@ new_segment: - } else if (off + copy < PAGE_SIZE) { - get_page(page); - TCP_PAGE(sk) = page; -- } -+ } else -+ ub_sock_tcp_detachpage(sk); - } - - TCP_OFF(sk) = off + copy; -@@ -958,10 +1000,15 @@ new_segment: - wait_for_sndbuf: - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - wait_for_memory: -+ ub_sock_retwres_tcp(sk, chargesize, -+ skb_charge_size(MAX_TCP_HEADER+tp->mss_cache)); -+ chargesize = 0; -+wait_for_ubspace: - if (copied) - tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); - -- if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) -+ if ((err = sk_stream_wait_memory(sk, &timeo, -+ chargesize)) != 0) - goto do_error; - - mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); -@@ -1058,7 +1105,18 @@ static void cleanup_rbuf(struct sock *sk - #if TCP_DEBUG - struct sk_buff *skb = skb_peek(&sk->sk_receive_queue); - -- BUG_TRAP(!skb || before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq)); -+ if (!(skb==NULL || before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq))) { -+ printk("KERNEL: assertion: skb==NULL || " -+ "before(tp->copied_seq, skb->end_seq)\n"); -+ printk("VE%u pid %d comm %.16s\n", -+ (get_exec_env() ? VEID(get_exec_env()) : 0), -+ current->pid, current->comm); -+ printk("copied=%d, copied_seq=%d, rcv_nxt=%d\n", copied, -+ tp->copied_seq, tp->rcv_nxt); -+ printk("skb->len=%d, skb->seq=%d, skb->end_seq=%d\n", -+ skb->len, TCP_SKB_CB(skb)->seq, -+ TCP_SKB_CB(skb)->end_seq); -+ } - #endif - - if (tcp_ack_scheduled(tp)) { -@@ -1281,7 +1339,22 @@ int tcp_recvmsg(struct kiocb *iocb, stru - goto found_ok_skb; - if (skb->h.th->fin) - goto found_fin_ok; -- BUG_TRAP(flags & MSG_PEEK); -+ if (!(flags & MSG_PEEK)) { -+ printk("KERNEL: assertion: flags&MSG_PEEK\n"); -+ printk("VE%u pid %d comm %.16s\n", -+ (get_exec_env() ? 
-+ VEID(get_exec_env()) : 0), -+ current->pid, current->comm); -+ printk("flags=0x%x, len=%d, copied_seq=%d, " -+ "rcv_nxt=%d\n", flags, len, -+ tp->copied_seq, tp->rcv_nxt); -+ printk("skb->len=%d, *seq=%d, skb->seq=%d, " -+ "skb->end_seq=%d, offset=%d\n", -+ skb->len, *seq, -+ TCP_SKB_CB(skb)->seq, -+ TCP_SKB_CB(skb)->end_seq, -+ offset); -+ } - skb = skb->next; - } while (skb != (struct sk_buff *)&sk->sk_receive_queue); - -@@ -1344,8 +1417,18 @@ int tcp_recvmsg(struct kiocb *iocb, stru - - tp->ucopy.len = len; - -- BUG_TRAP(tp->copied_seq == tp->rcv_nxt || -- (flags & (MSG_PEEK | MSG_TRUNC))); -+ if (!(tp->copied_seq == tp->rcv_nxt || -+ (flags&(MSG_PEEK|MSG_TRUNC)))) { -+ printk("KERNEL: assertion: tp->copied_seq == " -+ "tp->rcv_nxt || ...\n"); -+ printk("VE%u pid %d comm %.16s\n", -+ (get_exec_env() ? -+ VEID(get_exec_env()) : 0), -+ current->pid, current->comm); -+ printk("flags=0x%x, len=%d, copied_seq=%d, " -+ "rcv_nxt=%d\n", flags, len, -+ tp->copied_seq, tp->rcv_nxt); -+ } - - /* Ugly... 
If prequeue is not empty, we have to - * process it before releasing socket, otherwise -@@ -1614,7 +1697,7 @@ void tcp_destroy_sock(struct sock *sk) - } - #endif - -- atomic_dec(&tcp_orphan_count); -+ tcp_dec_orphan_count(sk); - sock_put(sk); - } - -@@ -1738,7 +1821,7 @@ adjudge_to_death: - if (tmo > TCP_TIMEWAIT_LEN) { - tcp_reset_keepalive_timer(sk, tcp_fin_time(tp)); - } else { -- atomic_inc(&tcp_orphan_count); -+ tcp_inc_orphan_count(sk); - tcp_time_wait(sk, TCP_FIN_WAIT2, tmo); - goto out; - } -@@ -1746,9 +1829,7 @@ adjudge_to_death: - } - if (sk->sk_state != TCP_CLOSE) { - sk_stream_mem_reclaim(sk); -- if (atomic_read(&tcp_orphan_count) > sysctl_tcp_max_orphans || -- (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && -- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) { -+ if (tcp_too_many_orphans(sk, tcp_get_orphan_count(sk))) { - if (net_ratelimit()) - printk(KERN_INFO "TCP: too many of orphaned " - "sockets\n"); -@@ -1757,7 +1838,7 @@ adjudge_to_death: - NET_INC_STATS_BH(LINUX_MIB_TCPABORTONMEMORY); - } - } -- atomic_inc(&tcp_orphan_count); -+ tcp_inc_orphan_count(sk); - - if (sk->sk_state == TCP_CLOSE) - tcp_destroy_sock(sk); -@@ -1823,12 +1904,13 @@ int tcp_disconnect(struct sock *sk, int - tp->packets_out = 0; - tp->snd_ssthresh = 0x7fffffff; - tp->snd_cwnd_cnt = 0; -+ tp->advmss = 65535; - tcp_set_ca_state(tp, TCP_CA_Open); - tcp_clear_retrans(tp); - tcp_delack_init(tp); - sk->sk_send_head = NULL; -- tp->saw_tstamp = 0; -- tcp_sack_reset(tp); -+ tp->rx_opt.saw_tstamp = 0; -+ tcp_sack_reset(&tp->rx_opt); - __sk_dst_reset(sk); - - BUG_TRAP(!inet->num || tp->bind_hash); -@@ -1967,7 +2049,7 @@ int tcp_setsockopt(struct sock *sk, int - err = -EINVAL; - break; - } -- tp->user_mss = val; -+ tp->rx_opt.user_mss = val; - break; - - case TCP_NODELAY: -@@ -2125,7 +2207,7 @@ int tcp_getsockopt(struct sock *sk, int - case TCP_MAXSEG: - val = tp->mss_cache_std; - if (!val && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) -- val = tp->user_mss; -+ val = 
tp->rx_opt.user_mss; - break; - case TCP_NODELAY: - val = !!(tp->nonagle&TCP_NAGLE_OFF); -@@ -2189,6 +2271,7 @@ int tcp_getsockopt(struct sock *sk, int - - extern void __skb_cb_too_small_for_tcp(int, int); - extern void tcpdiag_init(void); -+extern unsigned int nr_free_lowpages(void); - - static __initdata unsigned long thash_entries; - static int __init set_thash_entries(char *str) -@@ -2212,24 +2295,26 @@ void __init tcp_init(void) - - tcp_openreq_cachep = kmem_cache_create("tcp_open_request", - sizeof(struct open_request), -- 0, SLAB_HWCACHE_ALIGN, -+ 0, SLAB_HWCACHE_ALIGN | SLAB_UBC, - NULL, NULL); - if (!tcp_openreq_cachep) - panic("tcp_init: Cannot alloc open_request cache."); - - tcp_bucket_cachep = kmem_cache_create("tcp_bind_bucket", - sizeof(struct tcp_bind_bucket), -- 0, SLAB_HWCACHE_ALIGN, -+ 0, SLAB_HWCACHE_ALIGN | SLAB_UBC, - NULL, NULL); - if (!tcp_bucket_cachep) - panic("tcp_init: Cannot alloc tcp_bind_bucket cache."); - - tcp_timewait_cachep = kmem_cache_create("tcp_tw_bucket", - sizeof(struct tcp_tw_bucket), -- 0, SLAB_HWCACHE_ALIGN, -+ 0, -+ SLAB_HWCACHE_ALIGN | SLAB_UBC, - NULL, NULL); - if (!tcp_timewait_cachep) - panic("tcp_init: Cannot alloc tcp_tw_bucket cache."); -+ tcp_timewait_cachep->flags |= CFLGS_ENVIDS; - - /* Size and allocate the main established and bind bucket - * hash tables. 
-@@ -2295,10 +2380,19 @@ void __init tcp_init(void) - } - tcp_port_rover = sysctl_local_port_range[0] - 1; - -+ goal = nr_free_lowpages() / 6; -+ while (order >= 3 && (1536<<order) > goal) -+ order--; -+ - sysctl_tcp_mem[0] = 768 << order; - sysctl_tcp_mem[1] = 1024 << order; - sysctl_tcp_mem[2] = 1536 << order; - -+ if (sysctl_tcp_mem[2] - sysctl_tcp_mem[1] > 4096) -+ sysctl_tcp_mem[1] = sysctl_tcp_mem[2] - 4096; -+ if (sysctl_tcp_mem[1] - sysctl_tcp_mem[0] > 4096) -+ sysctl_tcp_mem[0] = sysctl_tcp_mem[1] - 4096; -+ - if (order < 3) { - sysctl_tcp_wmem[2] = 64 * 1024; - sysctl_tcp_rmem[0] = PAGE_SIZE; -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_diag.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_diag.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_diag.c 2004-08-14 14:54:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_diag.c 2006-05-11 13:05:42.000000000 +0400 -@@ -55,14 +55,14 @@ void tcp_get_info(struct sock *sk, struc - info->tcpi_probes = tp->probes_out; - info->tcpi_backoff = tp->backoff; - -- if (tp->tstamp_ok) -+ if (tp->rx_opt.tstamp_ok) - info->tcpi_options |= TCPI_OPT_TIMESTAMPS; -- if (tp->sack_ok) -+ if (tp->rx_opt.sack_ok) - info->tcpi_options |= TCPI_OPT_SACK; -- if (tp->wscale_ok) { -+ if (tp->rx_opt.wscale_ok) { - info->tcpi_options |= TCPI_OPT_WSCALE; -- info->tcpi_snd_wscale = tp->snd_wscale; -- info->tcpi_rcv_wscale = tp->rcv_wscale; -+ info->tcpi_snd_wscale = tp->rx_opt.snd_wscale; -+ info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale; - } - - if (tp->ecn_flags&TCP_ECN_OK) -@@ -253,7 +253,7 @@ static int tcpdiag_get_exact(struct sk_b - return -EINVAL; - } - -- if (sk == NULL) -+ if (sk == NULL || !ve_accessible(VE_OWNER_SK(sk), get_exec_env())) - return -ENOENT; - - err = -ESTALE; -@@ -465,6 +465,9 @@ static int tcpdiag_dump(struct sk_buff * - int s_i, s_num; - struct tcpdiagreq *r = NLMSG_DATA(cb->nlh); - struct rtattr *bc = NULL; -+ struct ve_struct *ve; -+ -+ ve = get_exec_env(); - - if (cb->nlh->nlmsg_len > 4+NLMSG_SPACE(sizeof(struct 
tcpdiagreq))) - bc = (struct rtattr*)(r+1); -@@ -486,6 +489,9 @@ static int tcpdiag_dump(struct sk_buff * - num = 0; - sk_for_each(sk, node, &tcp_listening_hash[i]) { - struct inet_opt *inet = inet_sk(sk); -+ -+ if (!ve_accessible(VE_OWNER_SK(sk), ve)) -+ continue; - if (num < s_num) - continue; - if (!(r->tcpdiag_states&TCPF_LISTEN) || -@@ -528,6 +534,8 @@ skip_listen_ht: - sk_for_each(sk, node, &head->chain) { - struct inet_opt *inet = inet_sk(sk); - -+ if (!ve_accessible(VE_OWNER_SK(sk), ve)) -+ continue; - if (num < s_num) - continue; - if (!(r->tcpdiag_states & (1 << sk->sk_state))) -@@ -552,10 +560,14 @@ skip_listen_ht: - sk_for_each(sk, node, - &tcp_ehash[i + tcp_ehash_size].chain) { - struct inet_opt *inet = inet_sk(sk); -+ struct tcp_tw_bucket *tw; - -+ tw = (struct tcp_tw_bucket*)sk; -+ if (!ve_accessible_veid(TW_VEID(tw), VEID(ve))) -+ continue; - if (num < s_num) - continue; -- if (!(r->tcpdiag_states & (1 << sk->sk_zapped))) -+ if (!(r->tcpdiag_states & (1 << tw->tw_substate))) - continue; - if (r->id.tcpdiag_sport != inet->sport && - r->id.tcpdiag_sport) -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_input.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_input.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_input.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_input.c 2006-05-11 13:05:39.000000000 +0400 -@@ -72,6 +72,8 @@ - #include <net/inet_common.h> - #include <linux/ipsec.h> - -+#include <ub/ub_tcp.h> -+ - int sysctl_tcp_timestamps = 1; - int sysctl_tcp_window_scaling = 1; - int sysctl_tcp_sack = 1; -@@ -118,9 +120,9 @@ int sysctl_tcp_bic_low_window = 14; - #define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE) - #define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED) - --#define IsReno(tp) ((tp)->sack_ok == 0) --#define IsFack(tp) ((tp)->sack_ok & 2) --#define IsDSack(tp) ((tp)->sack_ok & 4) -+#define IsReno(tp) ((tp)->rx_opt.sack_ok == 0) -+#define IsFack(tp) ((tp)->rx_opt.sack_ok & 2) -+#define IsDSack(tp) ((tp)->rx_opt.sack_ok & 
4) - - #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) - -@@ -203,7 +205,7 @@ static __inline__ int tcp_in_quickack_mo - - static void tcp_fixup_sndbuf(struct sock *sk) - { -- int sndmem = tcp_sk(sk)->mss_clamp + MAX_TCP_HEADER + 16 + -+ int sndmem = tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER + 16 + - sizeof(struct sk_buff); - - if (sk->sk_sndbuf < 3 * sndmem) -@@ -259,7 +261,7 @@ tcp_grow_window(struct sock *sk, struct - /* Check #1 */ - if (tp->rcv_ssthresh < tp->window_clamp && - (int)tp->rcv_ssthresh < tcp_space(sk) && -- !tcp_memory_pressure) { -+ ub_tcp_rmem_allows_expand(sk)) { - int incr; - - /* Check #2. Increase window, if skb with such overhead -@@ -328,6 +330,8 @@ static void tcp_init_buffer_space(struct - - tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp); - tp->snd_cwnd_stamp = tcp_time_stamp; -+ -+ ub_tcp_update_maxadvmss(sk); - } - - static void init_bictcp(struct tcp_opt *tp) -@@ -358,7 +362,7 @@ static void tcp_clamp_window(struct sock - if (ofo_win) { - if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] && - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) && -- !tcp_memory_pressure && -+ !ub_tcp_memory_pressure(sk) && - atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) - sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc), - sysctl_tcp_rmem[2]); -@@ -438,10 +442,10 @@ new_measure: - - static inline void tcp_rcv_rtt_measure_ts(struct tcp_opt *tp, struct sk_buff *skb) - { -- if (tp->rcv_tsecr && -+ if (tp->rx_opt.rcv_tsecr && - (TCP_SKB_CB(skb)->end_seq - - TCP_SKB_CB(skb)->seq >= tp->ack.rcv_mss)) -- tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rcv_tsecr, 0); -+ tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0); - } - - /* -@@ -828,7 +832,7 @@ static void tcp_init_metrics(struct sock - } - if (dst_metric(dst, RTAX_REORDERING) && - tp->reordering != dst_metric(dst, RTAX_REORDERING)) { -- tp->sack_ok &= ~2; -+ tp->rx_opt.sack_ok &= ~2; - tp->reordering = dst_metric(dst, RTAX_REORDERING); - } - -@@ -860,7 +864,7 
@@ static void tcp_init_metrics(struct sock - } - tcp_set_rto(tp); - tcp_bound_rto(tp); -- if (tp->rto < TCP_TIMEOUT_INIT && !tp->saw_tstamp) -+ if (tp->rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) - goto reset; - tp->snd_cwnd = tcp_init_cwnd(tp, dst); - tp->snd_cwnd_stamp = tcp_time_stamp; -@@ -871,7 +875,7 @@ reset: - * supported, TCP will fail to recalculate correct - * rtt, if initial rto is too small. FORGET ALL AND RESET! - */ -- if (!tp->saw_tstamp && tp->srtt) { -+ if (!tp->rx_opt.saw_tstamp && tp->srtt) { - tp->srtt = 0; - tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT; - tp->rto = TCP_TIMEOUT_INIT; -@@ -894,12 +898,12 @@ static void tcp_update_reordering(struct - NET_INC_STATS_BH(LINUX_MIB_TCPSACKREORDER); - #if FASTRETRANS_DEBUG > 1 - printk(KERN_DEBUG "Disorder%d %d %u f%u s%u rr%d\n", -- tp->sack_ok, tp->ca_state, -+ tp->rx_opt.sack_ok, tp->ca_state, - tp->reordering, tp->fackets_out, tp->sacked_out, - tp->undo_marker ? tp->undo_retrans : 0); - #endif - /* Disable FACK yet. */ -- tp->sack_ok &= ~2; -+ tp->rx_opt.sack_ok &= ~2; - } - } - -@@ -989,13 +993,13 @@ tcp_sacktag_write_queue(struct sock *sk, - - if (before(start_seq, ack)) { - dup_sack = 1; -- tp->sack_ok |= 4; -+ tp->rx_opt.sack_ok |= 4; - NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV); - } else if (num_sacks > 1 && - !after(end_seq, ntohl(sp[1].end_seq)) && - !before(start_seq, ntohl(sp[1].start_seq))) { - dup_sack = 1; -- tp->sack_ok |= 4; -+ tp->rx_opt.sack_ok |= 4; - NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV); - } - -@@ -1617,8 +1621,8 @@ static void tcp_cwnd_down(struct tcp_opt - static __inline__ int tcp_packet_delayed(struct tcp_opt *tp) - { - return !tp->retrans_stamp || -- (tp->saw_tstamp && tp->rcv_tsecr && -- (__s32)(tp->rcv_tsecr - tp->retrans_stamp) < 0); -+ (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && -+ (__s32)(tp->rx_opt.rcv_tsecr - tp->retrans_stamp) < 0); - } - - /* Undo procedures. 
*/ -@@ -1966,7 +1970,7 @@ static void tcp_ack_saw_tstamp(struct tc - * answer arrives rto becomes 120 seconds! If at least one of segments - * in window is lost... Voila. --ANK (010210) - */ -- seq_rtt = tcp_time_stamp - tp->rcv_tsecr; -+ seq_rtt = tcp_time_stamp - tp->rx_opt.rcv_tsecr; - tcp_rtt_estimator(tp, seq_rtt); - tcp_set_rto(tp); - tp->backoff = 0; -@@ -1997,7 +2001,7 @@ static __inline__ void - tcp_ack_update_rtt(struct tcp_opt *tp, int flag, s32 seq_rtt) - { - /* Note that peer MAY send zero echo. In this case it is ignored. (rfc1323) */ -- if (tp->saw_tstamp && tp->rcv_tsecr) -+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr) - tcp_ack_saw_tstamp(tp, flag); - else if (seq_rtt >= 0) - tcp_ack_no_tstamp(tp, seq_rtt, flag); -@@ -2401,7 +2405,7 @@ static int tcp_clean_rtx_queue(struct so - BUG_TRAP((int)tp->sacked_out >= 0); - BUG_TRAP((int)tp->lost_out >= 0); - BUG_TRAP((int)tp->retrans_out >= 0); -- if (!tp->packets_out && tp->sack_ok) { -+ if (!tp->packets_out && tp->rx_opt.sack_ok) { - if (tp->lost_out) { - printk(KERN_DEBUG "Leak l=%u %d\n", tp->lost_out, - tp->ca_state); -@@ -2477,7 +2481,7 @@ static int tcp_ack_update_window(struct - u32 nwin = ntohs(skb->h.th->window); - - if (likely(!skb->h.th->syn)) -- nwin <<= tp->snd_wscale; -+ nwin <<= tp->rx_opt.snd_wscale; - - if (tcp_may_update_window(tp, ack, ack_seq, nwin)) { - flag |= FLAG_WIN_UPDATE; -@@ -2888,14 +2892,15 @@ uninteresting_ack: - * But, this can also be called on packets in the established flow when - * the fast version below fails. 
- */ --void tcp_parse_options(struct sk_buff *skb, struct tcp_opt *tp, int estab) -+void tcp_parse_options(struct sk_buff *skb, -+ struct tcp_options_received *opt_rx, int estab) - { - unsigned char *ptr; - struct tcphdr *th = skb->h.th; - int length=(th->doff*4)-sizeof(struct tcphdr); - - ptr = (unsigned char *)(th + 1); -- tp->saw_tstamp = 0; -+ opt_rx->saw_tstamp = 0; - - while(length>0) { - int opcode=*ptr++; -@@ -2918,41 +2923,41 @@ void tcp_parse_options(struct sk_buff *s - if(opsize==TCPOLEN_MSS && th->syn && !estab) { - u16 in_mss = ntohs(*(__u16 *)ptr); - if (in_mss) { -- if (tp->user_mss && tp->user_mss < in_mss) -- in_mss = tp->user_mss; -- tp->mss_clamp = in_mss; -+ if (opt_rx->user_mss && opt_rx->user_mss < in_mss) -+ in_mss = opt_rx->user_mss; -+ opt_rx->mss_clamp = in_mss; - } - } - break; - case TCPOPT_WINDOW: - if(opsize==TCPOLEN_WINDOW && th->syn && !estab) - if (sysctl_tcp_window_scaling) { -- tp->wscale_ok = 1; -- tp->snd_wscale = *(__u8 *)ptr; -- if(tp->snd_wscale > 14) { -+ opt_rx->wscale_ok = 1; -+ opt_rx->snd_wscale = *(__u8 *)ptr; -+ if(opt_rx->snd_wscale > 14) { - if(net_ratelimit()) - printk("tcp_parse_options: Illegal window " - "scaling value %d >14 received.", -- tp->snd_wscale); -- tp->snd_wscale = 14; -+ opt_rx->snd_wscale); -+ opt_rx->snd_wscale = 14; - } - } - break; - case TCPOPT_TIMESTAMP: - if(opsize==TCPOLEN_TIMESTAMP) { -- if ((estab && tp->tstamp_ok) || -+ if ((estab && opt_rx->tstamp_ok) || - (!estab && sysctl_tcp_timestamps)) { -- tp->saw_tstamp = 1; -- tp->rcv_tsval = ntohl(*(__u32 *)ptr); -- tp->rcv_tsecr = ntohl(*(__u32 *)(ptr+4)); -+ opt_rx->saw_tstamp = 1; -+ opt_rx->rcv_tsval = ntohl(*(__u32 *)ptr); -+ opt_rx->rcv_tsecr = ntohl(*(__u32 *)(ptr+4)); - } - } - break; - case TCPOPT_SACK_PERM: - if(opsize==TCPOLEN_SACK_PERM && th->syn && !estab) { - if (sysctl_tcp_sack) { -- tp->sack_ok = 1; -- tcp_sack_reset(tp); -+ opt_rx->sack_ok = 1; -+ tcp_sack_reset(opt_rx); - } - } - break; -@@ -2960,7 +2965,7 @@ void 
tcp_parse_options(struct sk_buff *s - case TCPOPT_SACK: - if((opsize >= (TCPOLEN_SACK_BASE + TCPOLEN_SACK_PERBLOCK)) && - !((opsize - TCPOLEN_SACK_BASE) % TCPOLEN_SACK_PERBLOCK) && -- tp->sack_ok) { -+ opt_rx->sack_ok) { - TCP_SKB_CB(skb)->sacked = (ptr - 2) - (unsigned char *)th; - } - }; -@@ -2976,36 +2981,36 @@ void tcp_parse_options(struct sk_buff *s - static __inline__ int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th, struct tcp_opt *tp) - { - if (th->doff == sizeof(struct tcphdr)>>2) { -- tp->saw_tstamp = 0; -+ tp->rx_opt.saw_tstamp = 0; - return 0; -- } else if (tp->tstamp_ok && -+ } else if (tp->rx_opt.tstamp_ok && - th->doff == (sizeof(struct tcphdr)>>2)+(TCPOLEN_TSTAMP_ALIGNED>>2)) { - __u32 *ptr = (__u32 *)(th + 1); - if (*ptr == ntohl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) - | (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP)) { -- tp->saw_tstamp = 1; -+ tp->rx_opt.saw_tstamp = 1; - ++ptr; -- tp->rcv_tsval = ntohl(*ptr); -+ tp->rx_opt.rcv_tsval = ntohl(*ptr); - ++ptr; -- tp->rcv_tsecr = ntohl(*ptr); -+ tp->rx_opt.rcv_tsecr = ntohl(*ptr); - return 1; - } - } -- tcp_parse_options(skb, tp, 1); -+ tcp_parse_options(skb, &tp->rx_opt, 1); - return 1; - } - - static __inline__ void - tcp_store_ts_recent(struct tcp_opt *tp) - { -- tp->ts_recent = tp->rcv_tsval; -- tp->ts_recent_stamp = xtime.tv_sec; -+ tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval; -+ tp->rx_opt.ts_recent_stamp = xtime.tv_sec; - } - - static __inline__ void - tcp_replace_ts_recent(struct tcp_opt *tp, u32 seq) - { -- if (tp->saw_tstamp && !after(seq, tp->rcv_wup)) { -+ if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) { - /* PAWS bug workaround wrt. ACK frames, the PAWS discard - * extra check below makes sure this can only happen - * for pure ACK frames. -DaveM -@@ -3013,8 +3018,8 @@ tcp_replace_ts_recent(struct tcp_opt *tp - * Not only, also it occurs for expired timestamps. 
- */ - -- if((s32)(tp->rcv_tsval - tp->ts_recent) >= 0 || -- xtime.tv_sec >= tp->ts_recent_stamp + TCP_PAWS_24DAYS) -+ if((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) >= 0 || -+ xtime.tv_sec >= tp->rx_opt.ts_recent_stamp + TCP_PAWS_24DAYS) - tcp_store_ts_recent(tp); - } - } -@@ -3055,16 +3060,16 @@ static int tcp_disordered_ack(struct tcp - ack == tp->snd_una && - - /* 3. ... and does not update window. */ -- !tcp_may_update_window(tp, ack, seq, ntohs(th->window)<<tp->snd_wscale) && -+ !tcp_may_update_window(tp, ack, seq, ntohs(th->window)<<tp->rx_opt.snd_wscale) && - - /* 4. ... and sits in replay window. */ -- (s32)(tp->ts_recent - tp->rcv_tsval) <= (tp->rto*1024)/HZ); -+ (s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) <= (tp->rto*1024)/HZ); - } - - static __inline__ int tcp_paws_discard(struct tcp_opt *tp, struct sk_buff *skb) - { -- return ((s32)(tp->ts_recent - tp->rcv_tsval) > TCP_PAWS_WINDOW && -- xtime.tv_sec < tp->ts_recent_stamp + TCP_PAWS_24DAYS && -+ return ((s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) > TCP_PAWS_WINDOW && -+ xtime.tv_sec < tp->rx_opt.ts_recent_stamp + TCP_PAWS_24DAYS && - !tcp_disordered_ack(tp, skb)); - } - -@@ -3177,8 +3182,8 @@ static void tcp_fin(struct sk_buff *skb, - * Probably, we should reset in this case. For now drop them. 
- */ - __skb_queue_purge(&tp->out_of_order_queue); -- if (tp->sack_ok) -- tcp_sack_reset(tp); -+ if (tp->rx_opt.sack_ok) -+ tcp_sack_reset(&tp->rx_opt); - sk_stream_mem_reclaim(sk); - - if (!sock_flag(sk, SOCK_DEAD)) { -@@ -3208,22 +3213,22 @@ tcp_sack_extend(struct tcp_sack_block *s - - static __inline__ void tcp_dsack_set(struct tcp_opt *tp, u32 seq, u32 end_seq) - { -- if (tp->sack_ok && sysctl_tcp_dsack) { -+ if (tp->rx_opt.sack_ok && sysctl_tcp_dsack) { - if (before(seq, tp->rcv_nxt)) - NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOLDSENT); - else - NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFOSENT); - -- tp->dsack = 1; -+ tp->rx_opt.dsack = 1; - tp->duplicate_sack[0].start_seq = seq; - tp->duplicate_sack[0].end_seq = end_seq; -- tp->eff_sacks = min(tp->num_sacks+1, 4-tp->tstamp_ok); -+ tp->rx_opt.eff_sacks = min(tp->rx_opt.num_sacks+1, 4-tp->rx_opt.tstamp_ok); - } - } - - static __inline__ void tcp_dsack_extend(struct tcp_opt *tp, u32 seq, u32 end_seq) - { -- if (!tp->dsack) -+ if (!tp->rx_opt.dsack) - tcp_dsack_set(tp, seq, end_seq); - else - tcp_sack_extend(tp->duplicate_sack, seq, end_seq); -@@ -3238,7 +3243,7 @@ static void tcp_send_dupack(struct sock - NET_INC_STATS_BH(LINUX_MIB_DELAYEDACKLOST); - tcp_enter_quickack_mode(tp); - -- if (tp->sack_ok && sysctl_tcp_dsack) { -+ if (tp->rx_opt.sack_ok && sysctl_tcp_dsack) { - u32 end_seq = TCP_SKB_CB(skb)->end_seq; - - if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) -@@ -3262,16 +3267,16 @@ static void tcp_sack_maybe_coalesce(stru - /* See if the recent change to the first SACK eats into - * or hits the sequence space of other SACK blocks, if so coalesce. - */ -- for (this_sack = 1; this_sack < tp->num_sacks; ) { -+ for (this_sack = 1; this_sack < tp->rx_opt.num_sacks; ) { - if (tcp_sack_extend(sp, swalk->start_seq, swalk->end_seq)) { - int i; - - /* Zap SWALK, by moving every further SACK up by one slot. - * Decrease num_sacks. 
- */ -- tp->num_sacks--; -- tp->eff_sacks = min(tp->num_sacks+tp->dsack, 4-tp->tstamp_ok); -- for(i=this_sack; i < tp->num_sacks; i++) -+ tp->rx_opt.num_sacks--; -+ tp->rx_opt.eff_sacks = min(tp->rx_opt.num_sacks + tp->rx_opt.dsack, 4 - tp->rx_opt.tstamp_ok); -+ for(i=this_sack; i < tp->rx_opt.num_sacks; i++) - sp[i] = sp[i+1]; - continue; - } -@@ -3296,7 +3301,7 @@ static void tcp_sack_new_ofo_skb(struct - { - struct tcp_opt *tp = tcp_sk(sk); - struct tcp_sack_block *sp = &tp->selective_acks[0]; -- int cur_sacks = tp->num_sacks; -+ int cur_sacks = tp->rx_opt.num_sacks; - int this_sack; - - if (!cur_sacks) -@@ -3321,7 +3326,7 @@ static void tcp_sack_new_ofo_skb(struct - */ - if (this_sack >= 4) { - this_sack--; -- tp->num_sacks--; -+ tp->rx_opt.num_sacks--; - sp--; - } - for(; this_sack > 0; this_sack--, sp--) -@@ -3331,8 +3336,8 @@ new_sack: - /* Build the new head SACK, and we're done. */ - sp->start_seq = seq; - sp->end_seq = end_seq; -- tp->num_sacks++; -- tp->eff_sacks = min(tp->num_sacks + tp->dsack, 4 - tp->tstamp_ok); -+ tp->rx_opt.num_sacks++; -+ tp->rx_opt.eff_sacks = min(tp->rx_opt.num_sacks + tp->rx_opt.dsack, 4 - tp->rx_opt.tstamp_ok); - } - - /* RCV.NXT advances, some SACKs should be eaten. */ -@@ -3340,13 +3345,13 @@ new_sack: - static void tcp_sack_remove(struct tcp_opt *tp) - { - struct tcp_sack_block *sp = &tp->selective_acks[0]; -- int num_sacks = tp->num_sacks; -+ int num_sacks = tp->rx_opt.num_sacks; - int this_sack; - - /* Empty ofo queue, hence, all the SACKs are eaten. Clear. 
*/ - if (skb_queue_len(&tp->out_of_order_queue) == 0) { -- tp->num_sacks = 0; -- tp->eff_sacks = tp->dsack; -+ tp->rx_opt.num_sacks = 0; -+ tp->rx_opt.eff_sacks = tp->rx_opt.dsack; - return; - } - -@@ -3367,9 +3372,9 @@ static void tcp_sack_remove(struct tcp_o - this_sack++; - sp++; - } -- if (num_sacks != tp->num_sacks) { -- tp->num_sacks = num_sacks; -- tp->eff_sacks = min(tp->num_sacks+tp->dsack, 4-tp->tstamp_ok); -+ if (num_sacks != tp->rx_opt.num_sacks) { -+ tp->rx_opt.num_sacks = num_sacks; -+ tp->rx_opt.eff_sacks = min(tp->rx_opt.num_sacks + tp->rx_opt.dsack, 4 - tp->rx_opt.tstamp_ok); - } - } - -@@ -3427,10 +3432,10 @@ static void tcp_data_queue(struct sock * - - TCP_ECN_accept_cwr(tp, skb); - -- if (tp->dsack) { -- tp->dsack = 0; -- tp->eff_sacks = min_t(unsigned int, tp->num_sacks, -- 4 - tp->tstamp_ok); -+ if (tp->rx_opt.dsack) { -+ tp->rx_opt.dsack = 0; -+ tp->rx_opt.eff_sacks = min_t(unsigned int, tp->rx_opt.num_sacks, -+ 4 - tp->rx_opt.tstamp_ok); - } - - /* Queue data for delivery to the user. -@@ -3467,7 +3472,7 @@ queue_and_out: - !sk_stream_rmem_schedule(sk, skb))) { - if (tcp_prune_queue(sk) < 0 || - !sk_stream_rmem_schedule(sk, skb)) -- goto drop; -+ goto drop_part; - } - sk_stream_set_owner_r(skb, sk); - __skb_queue_tail(&sk->sk_receive_queue, skb); -@@ -3488,7 +3493,7 @@ queue_and_out: - tp->ack.pingpong = 0; - } - -- if (tp->num_sacks) -+ if (tp->rx_opt.num_sacks) - tcp_sack_remove(tp); - - tcp_fast_path_check(sk, tp); -@@ -3511,6 +3516,12 @@ out_of_window: - drop: - __kfree_skb(skb); - return; -+ -+drop_part: -+ if (after(tp->copied_seq, tp->rcv_nxt)) -+ tp->rcv_nxt = tp->copied_seq; -+ __kfree_skb(skb); -+ return; - } - - /* Out of window. F.e. zero window probe. */ -@@ -3555,10 +3566,10 @@ drop: - - if (!skb_peek(&tp->out_of_order_queue)) { - /* Initial out of order segment, build 1 SACK. 
*/ -- if (tp->sack_ok) { -- tp->num_sacks = 1; -- tp->dsack = 0; -- tp->eff_sacks = 1; -+ if (tp->rx_opt.sack_ok) { -+ tp->rx_opt.num_sacks = 1; -+ tp->rx_opt.dsack = 0; -+ tp->rx_opt.eff_sacks = 1; - tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq; - tp->selective_acks[0].end_seq = - TCP_SKB_CB(skb)->end_seq; -@@ -3572,7 +3583,7 @@ drop: - if (seq == TCP_SKB_CB(skb1)->end_seq) { - __skb_append(skb1, skb); - -- if (!tp->num_sacks || -+ if (!tp->rx_opt.num_sacks || - tp->selective_acks[0].end_seq != seq) - goto add_sack; - -@@ -3620,7 +3631,7 @@ drop: - } - - add_sack: -- if (tp->sack_ok) -+ if (tp->rx_opt.sack_ok) - tcp_sack_new_ofo_skb(sk, seq, end_seq); - } - } -@@ -3682,6 +3693,10 @@ tcp_collapse(struct sock *sk, struct sk_ - nskb = alloc_skb(copy+header, GFP_ATOMIC); - if (!nskb) - return; -+ if (ub_tcprcvbuf_charge_forced(skb->sk, nskb) < 0) { -+ kfree_skb(nskb); -+ return; -+ } - skb_reserve(nskb, header); - memcpy(nskb->head, skb->head, header); - nskb->nh.raw = nskb->head + (skb->nh.raw-skb->head); -@@ -3777,7 +3792,7 @@ static int tcp_prune_queue(struct sock * - - if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) - tcp_clamp_window(sk, tp); -- else if (tcp_memory_pressure) -+ else if (ub_tcp_memory_pressure(sk)) - tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss); - - tcp_collapse_ofo_queue(sk); -@@ -3803,8 +3818,8 @@ static int tcp_prune_queue(struct sock * - * is in a sad state like this, we care only about integrity - * of the connection not performance. 
- */ -- if (tp->sack_ok) -- tcp_sack_reset(tp); -+ if (tp->rx_opt.sack_ok) -+ tcp_sack_reset(&tp->rx_opt); - sk_stream_mem_reclaim(sk); - } - -@@ -3859,7 +3874,7 @@ static void tcp_new_space(struct sock *s - !(sk->sk_userlocks & SOCK_SNDBUF_LOCK) && - !tcp_memory_pressure && - atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) { -- int sndmem = max_t(u32, tp->mss_clamp, tp->mss_cache) + -+ int sndmem = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) + - MAX_TCP_HEADER + 16 + sizeof(struct sk_buff), - demanded = max_t(unsigned int, tp->snd_cwnd, - tp->reordering + 1); -@@ -4126,7 +4141,7 @@ int tcp_rcv_established(struct sock *sk, - * We do checksum and copy also but from device to kernel. - */ - -- tp->saw_tstamp = 0; -+ tp->rx_opt.saw_tstamp = 0; - - /* pred_flags is 0xS?10 << 16 + snd_wnd - * if header_predition is to be made -@@ -4155,14 +4170,14 @@ int tcp_rcv_established(struct sock *sk, - | (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP)) - goto slow_path; - -- tp->saw_tstamp = 1; -+ tp->rx_opt.saw_tstamp = 1; - ++ptr; -- tp->rcv_tsval = ntohl(*ptr); -+ tp->rx_opt.rcv_tsval = ntohl(*ptr); - ++ptr; -- tp->rcv_tsecr = ntohl(*ptr); -+ tp->rx_opt.rcv_tsecr = ntohl(*ptr); - - /* If PAWS failed, check it more carefully in slow path */ -- if ((s32)(tp->rcv_tsval - tp->ts_recent) < 0) -+ if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0) - goto slow_path; - - /* DO NOT update ts_recent here, if checksum fails -@@ -4242,6 +4257,10 @@ int tcp_rcv_established(struct sock *sk, - - if ((int)skb->truesize > sk->sk_forward_alloc) - goto step5; -+ /* This is OK not to try to free memory here. -+ * Do this below on slow path. Den */ -+ if (ub_tcprcvbuf_charge(sk, skb) < 0) -+ goto step5; - - NET_INC_STATS_BH(LINUX_MIB_TCPHPHITS); - -@@ -4288,7 +4307,7 @@ slow_path: - /* - * RFC1323: H1. Apply PAWS check first. 
- */ -- if (tcp_fast_parse_options(skb, th, tp) && tp->saw_tstamp && -+ if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp && - tcp_paws_discard(tp, skb)) { - if (!th->rst) { - NET_INC_STATS_BH(LINUX_MIB_PAWSESTABREJECTED); -@@ -4360,9 +4379,9 @@ static int tcp_rcv_synsent_state_process - struct tcphdr *th, unsigned len) - { - struct tcp_opt *tp = tcp_sk(sk); -- int saved_clamp = tp->mss_clamp; -+ int saved_clamp = tp->rx_opt.mss_clamp; - -- tcp_parse_options(skb, tp, 0); -+ tcp_parse_options(skb, &tp->rx_opt, 0); - - if (th->ack) { - /* rfc793: -@@ -4379,8 +4398,8 @@ static int tcp_rcv_synsent_state_process - if (TCP_SKB_CB(skb)->ack_seq != tp->snd_nxt) - goto reset_and_undo; - -- if (tp->saw_tstamp && tp->rcv_tsecr && -- !between(tp->rcv_tsecr, tp->retrans_stamp, -+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && -+ !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp, - tcp_time_stamp)) { - NET_INC_STATS_BH(LINUX_MIB_PAWSACTIVEREJECTED); - goto reset_and_undo; -@@ -4435,13 +4454,13 @@ static int tcp_rcv_synsent_state_process - tp->snd_wnd = ntohs(th->window); - tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(skb)->seq); - -- if (!tp->wscale_ok) { -- tp->snd_wscale = tp->rcv_wscale = 0; -+ if (!tp->rx_opt.wscale_ok) { -+ tp->rx_opt.snd_wscale = tp->rx_opt.rcv_wscale = 0; - tp->window_clamp = min(tp->window_clamp, 65535U); - } - -- if (tp->saw_tstamp) { -- tp->tstamp_ok = 1; -+ if (tp->rx_opt.saw_tstamp) { -+ tp->rx_opt.tstamp_ok = 1; - tp->tcp_header_len = - sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED; - tp->advmss -= TCPOLEN_TSTAMP_ALIGNED; -@@ -4450,8 +4469,8 @@ static int tcp_rcv_synsent_state_process - tp->tcp_header_len = sizeof(struct tcphdr); - } - -- if (tp->sack_ok && sysctl_tcp_fack) -- tp->sack_ok |= 2; -+ if (tp->rx_opt.sack_ok && sysctl_tcp_fack) -+ tp->rx_opt.sack_ok |= 2; - - tcp_sync_mss(sk, tp->pmtu_cookie); - tcp_initialize_rcv_mss(sk); -@@ -4478,7 +4497,7 @@ static int tcp_rcv_synsent_state_process - if (sock_flag(sk, 
SOCK_KEEPOPEN)) - tcp_reset_keepalive_timer(sk, keepalive_time_when(tp)); - -- if (!tp->snd_wscale) -+ if (!tp->rx_opt.snd_wscale) - __tcp_fast_path_on(tp, tp->snd_wnd); - else - tp->pred_flags = 0; -@@ -4525,7 +4544,7 @@ discard: - } - - /* PAWS check. */ -- if (tp->ts_recent_stamp && tp->saw_tstamp && tcp_paws_check(tp, 0)) -+ if (tp->rx_opt.ts_recent_stamp && tp->rx_opt.saw_tstamp && tcp_paws_check(&tp->rx_opt, 0)) - goto discard_and_undo; - - if (th->syn) { -@@ -4535,8 +4554,8 @@ discard: - */ - tcp_set_state(sk, TCP_SYN_RECV); - -- if (tp->saw_tstamp) { -- tp->tstamp_ok = 1; -+ if (tp->rx_opt.saw_tstamp) { -+ tp->rx_opt.tstamp_ok = 1; - tcp_store_ts_recent(tp); - tp->tcp_header_len = - sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED; -@@ -4583,13 +4602,13 @@ discard: - */ - - discard_and_undo: -- tcp_clear_options(tp); -- tp->mss_clamp = saved_clamp; -+ tcp_clear_options(&tp->rx_opt); -+ tp->rx_opt.mss_clamp = saved_clamp; - goto discard; - - reset_and_undo: -- tcp_clear_options(tp); -- tp->mss_clamp = saved_clamp; -+ tcp_clear_options(&tp->rx_opt); -+ tp->rx_opt.mss_clamp = saved_clamp; - return 1; - } - -@@ -4607,7 +4626,7 @@ int tcp_rcv_state_process(struct sock *s - struct tcp_opt *tp = tcp_sk(sk); - int queued = 0; - -- tp->saw_tstamp = 0; -+ tp->rx_opt.saw_tstamp = 0; - - switch (sk->sk_state) { - case TCP_CLOSE: -@@ -4662,7 +4681,7 @@ int tcp_rcv_state_process(struct sock *s - return 0; - } - -- if (tcp_fast_parse_options(skb, th, tp) && tp->saw_tstamp && -+ if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp && - tcp_paws_discard(tp, skb)) { - if (!th->rst) { - NET_INC_STATS_BH(LINUX_MIB_PAWSESTABREJECTED); -@@ -4722,7 +4741,7 @@ int tcp_rcv_state_process(struct sock *s - - tp->snd_una = TCP_SKB_CB(skb)->ack_seq; - tp->snd_wnd = ntohs(th->window) << -- tp->snd_wscale; -+ tp->rx_opt.snd_wscale; - tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq, - TCP_SKB_CB(skb)->seq); - -@@ -4730,11 +4749,11 @@ int tcp_rcv_state_process(struct sock *s - * and does 
not calculate rtt. - * Fix it at least with timestamps. - */ -- if (tp->saw_tstamp && tp->rcv_tsecr && -+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && - !tp->srtt) - tcp_ack_saw_tstamp(tp, 0); - -- if (tp->tstamp_ok) -+ if (tp->rx_opt.tstamp_ok) - tp->advmss -= TCPOLEN_TSTAMP_ALIGNED; - - /* Make sure socket is routed, for -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_ipv4.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_ipv4.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_ipv4.c 2004-08-14 14:55:10.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_ipv4.c 2006-05-11 13:05:44.000000000 +0400 -@@ -69,12 +69,16 @@ - #include <net/inet_common.h> - #include <net/xfrm.h> - -+#include <ub/ub_tcp.h> -+ - #include <linux/inet.h> - #include <linux/ipv6.h> - #include <linux/stddef.h> - #include <linux/proc_fs.h> - #include <linux/seq_file.h> - -+#include <linux/ve_owner.h> -+ - extern int sysctl_ip_dynaddr; - int sysctl_tcp_tw_reuse; - int sysctl_tcp_low_latency; -@@ -105,9 +109,10 @@ int sysctl_local_port_range[2] = { 1024, - int tcp_port_rover = 1024 - 1; - - static __inline__ int tcp_hashfn(__u32 laddr, __u16 lport, -- __u32 faddr, __u16 fport) -+ __u32 faddr, __u16 fport, -+ envid_t veid) - { -- int h = (laddr ^ lport) ^ (faddr ^ fport); -+ int h = (laddr ^ lport) ^ (faddr ^ fport) ^ (veid ^ (veid >> 16)); - h ^= h >> 16; - h ^= h >> 8; - return h & (tcp_ehash_size - 1); -@@ -120,15 +125,20 @@ static __inline__ int tcp_sk_hashfn(stru - __u16 lport = inet->num; - __u32 faddr = inet->daddr; - __u16 fport = inet->dport; -+ envid_t veid = VEID(VE_OWNER_SK(sk)); - -- return tcp_hashfn(laddr, lport, faddr, fport); -+ return tcp_hashfn(laddr, lport, faddr, fport, veid); - } - -+DCL_VE_OWNER(TB, GENERIC, struct tcp_bind_bucket, owner_env, -+ inline, (always_inline)) -+ - /* Allocate and initialize a new TCP local port bind bucket. - * The bindhash mutex for snum's hash chain must be held here. 
- */ - struct tcp_bind_bucket *tcp_bucket_create(struct tcp_bind_hashbucket *head, -- unsigned short snum) -+ unsigned short snum, -+ struct ve_struct *env) - { - struct tcp_bind_bucket *tb = kmem_cache_alloc(tcp_bucket_cachep, - SLAB_ATOMIC); -@@ -136,6 +146,7 @@ struct tcp_bind_bucket *tcp_bucket_creat - tb->port = snum; - tb->fastreuse = 0; - INIT_HLIST_HEAD(&tb->owners); -+ SET_VE_OWNER_TB(tb, env); - hlist_add_head(&tb->node, &head->chain); - } - return tb; -@@ -153,10 +164,11 @@ void tcp_bucket_destroy(struct tcp_bind_ - /* Caller must disable local BH processing. */ - static __inline__ void __tcp_inherit_port(struct sock *sk, struct sock *child) - { -- struct tcp_bind_hashbucket *head = -- &tcp_bhash[tcp_bhashfn(inet_sk(child)->num)]; -+ struct tcp_bind_hashbucket *head; - struct tcp_bind_bucket *tb; - -+ head = &tcp_bhash[tcp_bhashfn(inet_sk(child)->num, -+ VEID(VE_OWNER_SK(child)))]; - spin_lock(&head->lock); - tb = tcp_sk(sk)->bind_hash; - sk_add_bind_node(child, &tb->owners); -@@ -212,8 +224,10 @@ static int tcp_v4_get_port(struct sock * - struct tcp_bind_hashbucket *head; - struct hlist_node *node; - struct tcp_bind_bucket *tb; -+ struct ve_struct *env; - int ret; - -+ env = VE_OWNER_SK(sk); - local_bh_disable(); - if (!snum) { - int low = sysctl_local_port_range[0]; -@@ -227,10 +241,11 @@ static int tcp_v4_get_port(struct sock * - rover++; - if (rover < low || rover > high) - rover = low; -- head = &tcp_bhash[tcp_bhashfn(rover)]; -+ head = &tcp_bhash[tcp_bhashfn(rover, VEID(env))]; - spin_lock(&head->lock); - tb_for_each(tb, node, &head->chain) -- if (tb->port == rover) -+ if (tb->port == rover && -+ ve_accessible_strict(VE_OWNER_TB(tb), env)) - goto next; - break; - next: -@@ -249,10 +264,11 @@ static int tcp_v4_get_port(struct sock * - */ - snum = rover; - } else { -- head = &tcp_bhash[tcp_bhashfn(snum)]; -+ head = &tcp_bhash[tcp_bhashfn(snum, VEID(env))]; - spin_lock(&head->lock); - tb_for_each(tb, node, &head->chain) -- if (tb->port == snum) -+ if 
(tb->port == snum && -+ ve_accessible_strict(VE_OWNER_TB(tb), env)) - goto tb_found; - } - tb = NULL; -@@ -272,7 +288,7 @@ tb_found: - } - tb_not_found: - ret = 1; -- if (!tb && (tb = tcp_bucket_create(head, snum)) == NULL) -+ if (!tb && (tb = tcp_bucket_create(head, snum, env)) == NULL) - goto fail_unlock; - if (hlist_empty(&tb->owners)) { - if (sk->sk_reuse && sk->sk_state != TCP_LISTEN) -@@ -301,9 +317,10 @@ fail: - static void __tcp_put_port(struct sock *sk) - { - struct inet_opt *inet = inet_sk(sk); -- struct tcp_bind_hashbucket *head = &tcp_bhash[tcp_bhashfn(inet->num)]; -+ struct tcp_bind_hashbucket *head; - struct tcp_bind_bucket *tb; - -+ head = &tcp_bhash[tcp_bhashfn(inet->num, VEID(VE_OWNER_SK(sk)))]; - spin_lock(&head->lock); - tb = tcp_sk(sk)->bind_hash; - __sk_del_bind_node(sk); -@@ -412,7 +429,8 @@ void tcp_unhash(struct sock *sk) - * during the search since they can never be otherwise. - */ - static struct sock *__tcp_v4_lookup_listener(struct hlist_head *head, u32 daddr, -- unsigned short hnum, int dif) -+ unsigned short hnum, int dif, -+ struct ve_struct *env) - { - struct sock *result = NULL, *sk; - struct hlist_node *node; -@@ -422,7 +440,9 @@ static struct sock *__tcp_v4_lookup_list - sk_for_each(sk, node, head) { - struct inet_opt *inet = inet_sk(sk); - -- if (inet->num == hnum && !ipv6_only_sock(sk)) { -+ if (inet->num == hnum && -+ ve_accessible_strict(VE_OWNER_SK(sk), env) && -+ !ipv6_only_sock(sk)) { - __u32 rcv_saddr = inet->rcv_saddr; - - score = (sk->sk_family == PF_INET ? 
1 : 0); -@@ -453,18 +473,21 @@ inline struct sock *tcp_v4_lookup_listen - { - struct sock *sk = NULL; - struct hlist_head *head; -+ struct ve_struct *env; - -+ env = get_exec_env(); - read_lock(&tcp_lhash_lock); -- head = &tcp_listening_hash[tcp_lhashfn(hnum)]; -+ head = &tcp_listening_hash[tcp_lhashfn(hnum, VEID(env))]; - if (!hlist_empty(head)) { - struct inet_opt *inet = inet_sk((sk = __sk_head(head))); - - if (inet->num == hnum && !sk->sk_node.next && -+ ve_accessible_strict(VE_OWNER_SK(sk), env) && - (!inet->rcv_saddr || inet->rcv_saddr == daddr) && - (sk->sk_family == PF_INET || !ipv6_only_sock(sk)) && - !sk->sk_bound_dev_if) - goto sherry_cache; -- sk = __tcp_v4_lookup_listener(head, daddr, hnum, dif); -+ sk = __tcp_v4_lookup_listener(head, daddr, hnum, dif, env); - } - if (sk) { - sherry_cache: -@@ -492,17 +515,22 @@ static inline struct sock *__tcp_v4_look - /* Optimize here for direct hit, only listening connections can - * have wildcards anyways. - */ -- int hash = tcp_hashfn(daddr, hnum, saddr, sport); -+ int hash; -+ struct ve_struct *env; -+ -+ env = get_exec_env(); -+ hash = tcp_hashfn(daddr, hnum, saddr, sport, VEID(env)); - head = &tcp_ehash[hash]; - read_lock(&head->lock); - sk_for_each(sk, node, &head->chain) { -- if (TCP_IPV4_MATCH(sk, acookie, saddr, daddr, ports, dif)) -+ if (TCP_IPV4_MATCH(sk, acookie, saddr, daddr, ports, dif, env)) - goto hit; /* You sunk my battleship! */ - } - - /* Must check for a TIME_WAIT'er before going to listener hash. 
*/ - sk_for_each(sk, node, &(head + tcp_ehash_size)->chain) { -- if (TCP_IPV4_TW_MATCH(sk, acookie, saddr, daddr, ports, dif)) -+ if (TCP_IPV4_TW_MATCH(sk, acookie, saddr, daddr, -+ ports, dif, env)) - goto hit; - } - sk = NULL; -@@ -553,11 +581,16 @@ static int __tcp_v4_check_established(st - int dif = sk->sk_bound_dev_if; - TCP_V4_ADDR_COOKIE(acookie, saddr, daddr) - __u32 ports = TCP_COMBINED_PORTS(inet->dport, lport); -- int hash = tcp_hashfn(daddr, lport, saddr, inet->dport); -- struct tcp_ehash_bucket *head = &tcp_ehash[hash]; -+ int hash; -+ struct tcp_ehash_bucket *head; - struct sock *sk2; - struct hlist_node *node; - struct tcp_tw_bucket *tw; -+ struct ve_struct *env; -+ -+ env = VE_OWNER_SK(sk); -+ hash = tcp_hashfn(daddr, lport, saddr, inet->dport, VEID(env)); -+ head = &tcp_ehash[hash]; - - write_lock(&head->lock); - -@@ -565,7 +598,8 @@ static int __tcp_v4_check_established(st - sk_for_each(sk2, node, &(head + tcp_ehash_size)->chain) { - tw = (struct tcp_tw_bucket *)sk2; - -- if (TCP_IPV4_TW_MATCH(sk2, acookie, saddr, daddr, ports, dif)) { -+ if (TCP_IPV4_TW_MATCH(sk2, acookie, saddr, daddr, -+ ports, dif, env)) { - struct tcp_opt *tp = tcp_sk(sk); - - /* With PAWS, it is safe from the viewpoint -@@ -589,8 +623,8 @@ static int __tcp_v4_check_established(st - if ((tp->write_seq = - tw->tw_snd_nxt + 65535 + 2) == 0) - tp->write_seq = 1; -- tp->ts_recent = tw->tw_ts_recent; -- tp->ts_recent_stamp = tw->tw_ts_recent_stamp; -+ tp->rx_opt.ts_recent = tw->tw_ts_recent; -+ tp->rx_opt.ts_recent_stamp = tw->tw_ts_recent_stamp; - sock_hold(sk2); - goto unique; - } else -@@ -601,7 +635,7 @@ static int __tcp_v4_check_established(st - - /* And established part... 
*/ - sk_for_each(sk2, node, &head->chain) { -- if (TCP_IPV4_MATCH(sk2, acookie, saddr, daddr, ports, dif)) -+ if (TCP_IPV4_MATCH(sk2, acookie, saddr, daddr, ports, dif, env)) - goto not_unique; - } - -@@ -643,7 +677,9 @@ static int tcp_v4_hash_connect(struct so - struct tcp_bind_hashbucket *head; - struct tcp_bind_bucket *tb; - int ret; -+ struct ve_struct *env; - -+ env = VE_OWNER_SK(sk); - if (!snum) { - int rover; - int low = sysctl_local_port_range[0]; -@@ -674,7 +710,7 @@ static int tcp_v4_hash_connect(struct so - rover++; - if ((rover < low) || (rover > high)) - rover = low; -- head = &tcp_bhash[tcp_bhashfn(rover)]; -+ head = &tcp_bhash[tcp_bhashfn(rover, VEID(env))]; - spin_lock(&head->lock); - - /* Does not bother with rcv_saddr checks, -@@ -682,7 +718,9 @@ static int tcp_v4_hash_connect(struct so - * unique enough. - */ - tb_for_each(tb, node, &head->chain) { -- if (tb->port == rover) { -+ if (tb->port == rover && -+ ve_accessible_strict(VE_OWNER_TB(tb), env)) -+ { - BUG_TRAP(!hlist_empty(&tb->owners)); - if (tb->fastreuse >= 0) - goto next_port; -@@ -694,7 +732,7 @@ static int tcp_v4_hash_connect(struct so - } - } - -- tb = tcp_bucket_create(head, rover); -+ tb = tcp_bucket_create(head, rover, env); - if (!tb) { - spin_unlock(&head->lock); - break; -@@ -733,7 +771,7 @@ ok: - goto out; - } - -- head = &tcp_bhash[tcp_bhashfn(snum)]; -+ head = &tcp_bhash[tcp_bhashfn(snum, VEID(env))]; - tb = tcp_sk(sk)->bind_hash; - spin_lock_bh(&head->lock); - if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) { -@@ -793,25 +831,25 @@ int tcp_v4_connect(struct sock *sk, stru - inet->saddr = rt->rt_src; - inet->rcv_saddr = inet->saddr; - -- if (tp->ts_recent_stamp && inet->daddr != daddr) { -+ if (tp->rx_opt.ts_recent_stamp && inet->daddr != daddr) { - /* Reset inherited state */ -- tp->ts_recent = 0; -- tp->ts_recent_stamp = 0; -- tp->write_seq = 0; -+ tp->rx_opt.ts_recent = 0; -+ tp->rx_opt.ts_recent_stamp = 0; -+ tp->write_seq = 0; - } - - if (sysctl_tcp_tw_recycle 
&& -- !tp->ts_recent_stamp && rt->rt_dst == daddr) { -+ !tp->rx_opt.ts_recent_stamp && rt->rt_dst == daddr) { - struct inet_peer *peer = rt_get_peer(rt); - - /* VJ's idea. We save last timestamp seen from - * the destination in peer table, when entering state TIME-WAIT -- * and initialize ts_recent from it, when trying new connection. -+ * and initialize rx_opt.ts_recent from it, when trying new connection. - */ - - if (peer && peer->tcp_ts_stamp + TCP_PAWS_MSL >= xtime.tv_sec) { -- tp->ts_recent_stamp = peer->tcp_ts_stamp; -- tp->ts_recent = peer->tcp_ts; -+ tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp; -+ tp->rx_opt.ts_recent = peer->tcp_ts; - } - } - -@@ -822,7 +860,7 @@ int tcp_v4_connect(struct sock *sk, stru - if (inet->opt) - tp->ext_header_len = inet->opt->optlen; - -- tp->mss_clamp = 536; -+ tp->rx_opt.mss_clamp = 536; - - /* Socket identity is still unknown (sport may be zero). - * However we set state to SYN-SENT and not releasing socket -@@ -1033,11 +1071,7 @@ void tcp_v4_err(struct sk_buff *skb, u32 - - switch (type) { - case ICMP_SOURCE_QUENCH: -- /* This is deprecated, but if someone generated it, -- * we have no reasons to ignore it. -- */ -- if (!sock_owned_by_user(sk)) -- tcp_enter_cwr(tp); -+ /* Just silently ignore these. 
*/ - goto out; - case ICMP_PARAMETERPROB: - err = EPROTO; -@@ -1261,9 +1295,8 @@ static void tcp_v4_timewait_ack(struct s - struct tcp_tw_bucket *tw = (struct tcp_tw_bucket *)sk; - - tcp_v4_send_ack(skb, tw->tw_snd_nxt, tw->tw_rcv_nxt, -- tw->tw_rcv_wnd >> tw->tw_rcv_wscale, tw->tw_ts_recent); -- -- tcp_tw_put(tw); -+ tw->tw_rcv_wnd >> (tw->tw_rcv_wscale & TW_WSCALE_MASK), -+ tw->tw_ts_recent); - } - - static void tcp_v4_or_send_ack(struct sk_buff *skb, struct open_request *req) -@@ -1407,7 +1440,7 @@ struct or_calltable or_ipv4 = { - - int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) - { -- struct tcp_opt tp; -+ struct tcp_options_received tmp_opt; - struct open_request *req; - __u32 saddr = skb->nh.iph->saddr; - __u32 daddr = skb->nh.iph->daddr; -@@ -1449,29 +1482,29 @@ int tcp_v4_conn_request(struct sock *sk, - if (!req) - goto drop; - -- tcp_clear_options(&tp); -- tp.mss_clamp = 536; -- tp.user_mss = tcp_sk(sk)->user_mss; -+ tcp_clear_options(&tmp_opt); -+ tmp_opt.mss_clamp = 536; -+ tmp_opt.user_mss = tcp_sk(sk)->rx_opt.user_mss; - -- tcp_parse_options(skb, &tp, 0); -+ tcp_parse_options(skb, &tmp_opt, 0); - - if (want_cookie) { -- tcp_clear_options(&tp); -- tp.saw_tstamp = 0; -+ tcp_clear_options(&tmp_opt); -+ tmp_opt.saw_tstamp = 0; - } - -- if (tp.saw_tstamp && !tp.rcv_tsval) { -+ if (tmp_opt.saw_tstamp && !tmp_opt.rcv_tsval) { - /* Some OSes (unknown ones, but I see them on web server, which - * contains information interesting only for windows' - * users) do not send their stamp in SYN. It is easy case. - * We simply do not advertise TS support. 
- */ -- tp.saw_tstamp = 0; -- tp.tstamp_ok = 0; -+ tmp_opt.saw_tstamp = 0; -+ tmp_opt.tstamp_ok = 0; - } -- tp.tstamp_ok = tp.saw_tstamp; -+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp; - -- tcp_openreq_init(req, &tp, skb); -+ tcp_openreq_init(req, &tmp_opt, skb); - - req->af.v4_req.loc_addr = daddr; - req->af.v4_req.rmt_addr = saddr; -@@ -1497,7 +1530,7 @@ int tcp_v4_conn_request(struct sock *sk, - * timewait bucket, so that all the necessary checks - * are made in the function processing timewait state. - */ -- if (tp.saw_tstamp && -+ if (tmp_opt.saw_tstamp && - sysctl_tcp_tw_recycle && - (dst = tcp_v4_route_req(sk, req)) != NULL && - (peer = rt_get_peer((struct rtable *)dst)) != NULL && -@@ -1684,12 +1717,15 @@ static int tcp_v4_checksum_init(struct s - */ - int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) - { -+ struct user_beancounter *ub; -+ -+ ub = set_sk_exec_ub(sk); - if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ - TCP_CHECK_TIMER(sk); - if (tcp_rcv_established(sk, skb, skb->h.th, skb->len)) - goto reset; - TCP_CHECK_TIMER(sk); -- return 0; -+ goto restore_context; - } - - if (skb->len < (skb->h.th->doff << 2) || tcp_checksum_complete(skb)) -@@ -1703,7 +1739,7 @@ int tcp_v4_do_rcv(struct sock *sk, struc - if (nsk != sk) { - if (tcp_child_process(sk, nsk, skb)) - goto reset; -- return 0; -+ goto restore_context; - } - } - -@@ -1711,6 +1747,9 @@ int tcp_v4_do_rcv(struct sock *sk, struc - if (tcp_rcv_state_process(sk, skb, skb->h.th, skb->len)) - goto reset; - TCP_CHECK_TIMER(sk); -+ -+restore_context: -+ (void)set_exec_ub(ub); - return 0; - - reset: -@@ -1722,7 +1761,7 @@ discard: - * might be destroyed here. This current version compiles correctly, - * but you have been warned. 
- */ -- return 0; -+ goto restore_context; - - csum_err: - TCP_INC_STATS_BH(TCP_MIB_INERRS); -@@ -1835,13 +1874,17 @@ do_time_wait: - tcp_tw_put((struct tcp_tw_bucket *) sk); - goto discard_it; - } -+ spin_lock(&((struct tcp_tw_bucket *)sk)->tw_lock); - switch (tcp_timewait_state_process((struct tcp_tw_bucket *)sk, - skb, th, skb->len)) { - case TCP_TW_SYN: { -- struct sock *sk2 = tcp_v4_lookup_listener(skb->nh.iph->daddr, -+ struct sock *sk2; -+ -+ sk2 = tcp_v4_lookup_listener(skb->nh.iph->daddr, - ntohs(th->dest), - tcp_v4_iif(skb)); - if (sk2) { -+ spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock); - tcp_tw_deschedule((struct tcp_tw_bucket *)sk); - tcp_tw_put((struct tcp_tw_bucket *)sk); - sk = sk2; -@@ -1853,9 +1896,13 @@ do_time_wait: - tcp_v4_timewait_ack(sk, skb); - break; - case TCP_TW_RST: -+ spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock); -+ tcp_tw_put((struct tcp_tw_bucket *)sk); - goto no_tcp_socket; - case TCP_TW_SUCCESS:; - } -+ spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock); -+ tcp_tw_put((struct tcp_tw_bucket *)sk); - goto discard_it; - } - -@@ -2001,11 +2048,11 @@ int tcp_v4_remember_stamp(struct sock *s - } - - if (peer) { -- if ((s32)(peer->tcp_ts - tp->ts_recent) <= 0 || -+ if ((s32)(peer->tcp_ts - tp->rx_opt.ts_recent) <= 0 || - (peer->tcp_ts_stamp + TCP_PAWS_MSL < xtime.tv_sec && -- peer->tcp_ts_stamp <= tp->ts_recent_stamp)) { -- peer->tcp_ts_stamp = tp->ts_recent_stamp; -- peer->tcp_ts = tp->ts_recent; -+ peer->tcp_ts_stamp <= tp->rx_opt.ts_recent_stamp)) { -+ peer->tcp_ts_stamp = tp->rx_opt.ts_recent_stamp; -+ peer->tcp_ts = tp->rx_opt.ts_recent; - } - if (release_it) - inet_putpeer(peer); -@@ -2077,6 +2124,8 @@ static int tcp_v4_init_sock(struct sock - tp->snd_cwnd_clamp = ~0; - tp->mss_cache = 536; - -+ tp->advmss = 65535; /* max value */ -+ - tp->reordering = sysctl_tcp_reordering; - - sk->sk_state = TCP_CLOSE; -@@ -2117,6 +2166,8 @@ int tcp_v4_destroy_sock(struct sock *sk) - * If sendmsg cached page exists, toss it. 
- */ - if (sk->sk_sndmsg_page) { -+ /* queue is empty, uncharge */ -+ ub_sock_tcp_detachpage(sk); - __free_page(sk->sk_sndmsg_page); - sk->sk_sndmsg_page = NULL; - } -@@ -2131,16 +2182,34 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock); - #ifdef CONFIG_PROC_FS - /* Proc filesystem TCP sock list dumping. */ - --static inline struct tcp_tw_bucket *tw_head(struct hlist_head *head) -+static inline struct tcp_tw_bucket *tw_head(struct hlist_head *head, -+ envid_t veid) - { -- return hlist_empty(head) ? NULL : -- list_entry(head->first, struct tcp_tw_bucket, tw_node); -+ struct tcp_tw_bucket *tw; -+ struct hlist_node *pos; -+ -+ if (hlist_empty(head)) -+ return NULL; -+ hlist_for_each_entry(tw, pos, head, tw_node) { -+ if (!ve_accessible_veid(TW_VEID(tw), veid)) -+ continue; -+ return tw; -+ } -+ return NULL; - } - --static inline struct tcp_tw_bucket *tw_next(struct tcp_tw_bucket *tw) -+static inline struct tcp_tw_bucket *tw_next(struct tcp_tw_bucket *tw, -+ envid_t veid) - { -- return tw->tw_node.next ? 
-- hlist_entry(tw->tw_node.next, typeof(*tw), tw_node) : NULL; -+ while (1) { -+ if (tw->tw_node.next == NULL) -+ return NULL; -+ tw = hlist_entry(tw->tw_node.next, typeof(*tw), tw_node); -+ if (!ve_accessible_veid(TW_VEID(tw), veid)) -+ continue; -+ return tw; -+ } -+ return NULL; /* make compiler happy */ - } - - static void *listening_get_next(struct seq_file *seq, void *cur) -@@ -2149,7 +2218,9 @@ static void *listening_get_next(struct s - struct hlist_node *node; - struct sock *sk = cur; - struct tcp_iter_state* st = seq->private; -+ struct ve_struct *ve; - -+ ve = get_exec_env(); - if (!sk) { - st->bucket = 0; - sk = sk_head(&tcp_listening_hash[0]); -@@ -2183,6 +2254,8 @@ get_req: - sk = sk_next(sk); - get_sk: - sk_for_each_from(sk, node) { -+ if (!ve_accessible(VE_OWNER_SK(sk), ve)) -+ continue; - if (sk->sk_family == st->family) { - cur = sk; - goto out; -@@ -2222,7 +2295,9 @@ static void *established_get_first(struc - { - struct tcp_iter_state* st = seq->private; - void *rc = NULL; -+ struct ve_struct *ve; - -+ ve = get_exec_env(); - for (st->bucket = 0; st->bucket < tcp_ehash_size; ++st->bucket) { - struct sock *sk; - struct hlist_node *node; -@@ -2230,6 +2305,8 @@ static void *established_get_first(struc - - read_lock(&tcp_ehash[st->bucket].lock); - sk_for_each(sk, node, &tcp_ehash[st->bucket].chain) { -+ if (!ve_accessible(VE_OWNER_SK(sk), ve)) -+ continue; - if (sk->sk_family != st->family) { - continue; - } -@@ -2239,6 +2316,8 @@ static void *established_get_first(struc - st->state = TCP_SEQ_STATE_TIME_WAIT; - tw_for_each(tw, node, - &tcp_ehash[st->bucket + tcp_ehash_size].chain) { -+ if (!ve_accessible_veid(TW_VEID(tw), VEID(ve))) -+ continue; - if (tw->tw_family != st->family) { - continue; - } -@@ -2258,16 +2337,17 @@ static void *established_get_next(struct - struct tcp_tw_bucket *tw; - struct hlist_node *node; - struct tcp_iter_state* st = seq->private; -+ struct ve_struct *ve; - -+ ve = get_exec_env(); - ++st->num; - - if (st->state == 
TCP_SEQ_STATE_TIME_WAIT) { - tw = cur; -- tw = tw_next(tw); -+ tw = tw_next(tw, VEID(ve)); - get_tw: -- while (tw && tw->tw_family != st->family) { -- tw = tw_next(tw); -- } -+ while (tw && tw->tw_family != st->family) -+ tw = tw_next(tw, VEID(ve)); - if (tw) { - cur = tw; - goto out; -@@ -2285,12 +2365,14 @@ get_tw: - sk = sk_next(sk); - - sk_for_each_from(sk, node) { -+ if (!ve_accessible(VE_OWNER_SK(sk), ve)) -+ continue; - if (sk->sk_family == st->family) - goto found; - } - - st->state = TCP_SEQ_STATE_TIME_WAIT; -- tw = tw_head(&tcp_ehash[st->bucket + tcp_ehash_size].chain); -+ tw = tw_head(&tcp_ehash[st->bucket + tcp_ehash_size].chain, VEID(ve)); - goto get_tw; - found: - cur = sk; -@@ -2636,6 +2718,85 @@ void __init tcp_v4_init(struct net_proto - tcp_socket->sk->sk_prot->unhash(tcp_socket->sk); - } - -+#if defined(CONFIG_VE_NETDEV) || defined(CONFIG_VE_NETDEV_MODULE) -+static void tcp_kill_ve_onesk(struct sock *sk) -+{ -+ struct tcp_opt *tp = tcp_sk(sk); -+ -+ /* Check the assumed state of the socket. */ -+ if (!sock_flag(sk, SOCK_DEAD)) { -+ static int printed; -+invalid: -+ if (!printed) -+ printk(KERN_DEBUG "Killing sk: dead %d, state %d, " -+ "wrseq %u unseq %u, wrqu %d.\n", -+ sock_flag(sk, SOCK_DEAD), sk->sk_state, -+ tp->write_seq, tp->snd_una, -+ !skb_queue_empty(&sk->sk_write_queue)); -+ printed = 1; -+ return; -+ } -+ -+ tcp_send_active_reset(sk, GFP_ATOMIC); -+ switch (sk->sk_state) { -+ case TCP_FIN_WAIT1: -+ case TCP_CLOSING: -+ /* In these 2 states the peer may want us to retransmit -+ * some data and/or FIN. Entering "resetting mode" -+ * instead. -+ */ -+ tcp_time_wait(sk, TCP_CLOSE, 0); -+ break; -+ case TCP_FIN_WAIT2: -+ /* By some reason the socket may stay in this state -+ * without turning into a TW bucket. Fix it. -+ */ -+ tcp_time_wait(sk, TCP_FIN_WAIT2, 0); -+ break; -+ case TCP_LAST_ACK: -+ /* Just jump into CLOSED state. */ -+ tcp_done(sk); -+ break; -+ default: -+ /* The socket must be already close()d. 
*/ -+ goto invalid; -+ } -+} -+ -+void tcp_v4_kill_ve_sockets(struct ve_struct *envid) -+{ -+ struct tcp_ehash_bucket *head; -+ int i; -+ -+ /* alive */ -+ local_bh_disable(); -+ head = tcp_ehash; -+ for (i = 0; i < tcp_ehash_size; i++) { -+ struct sock *sk; -+ struct hlist_node *node; -+more_work: -+ write_lock(&head[i].lock); -+ sk_for_each(sk, node, &head[i].chain) { -+ if (ve_accessible_strict(VE_OWNER_SK(sk), envid)) { -+ sock_hold(sk); -+ write_unlock(&head[i].lock); -+ -+ bh_lock_sock(sk); -+ /* sk might have disappeared from the hash before -+ * we got the lock */ -+ if (sk->sk_state != TCP_CLOSE) -+ tcp_kill_ve_onesk(sk); -+ bh_unlock_sock(sk); -+ sock_put(sk); -+ goto more_work; -+ } -+ } -+ write_unlock(&head[i].lock); -+ } -+ local_bh_enable(); -+} -+#endif -+ - EXPORT_SYMBOL(ipv4_specific); - EXPORT_SYMBOL(tcp_bind_hash); - EXPORT_SYMBOL(tcp_bucket_create); -@@ -2654,6 +2815,7 @@ EXPORT_SYMBOL(tcp_v4_rebuild_header); - EXPORT_SYMBOL(tcp_v4_remember_stamp); - EXPORT_SYMBOL(tcp_v4_send_check); - EXPORT_SYMBOL(tcp_v4_syn_recv_sock); -+EXPORT_SYMBOL(tcp_v4_kill_ve_sockets); - - #ifdef CONFIG_PROC_FS - EXPORT_SYMBOL(tcp_proc_register); -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_minisocks.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_minisocks.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_minisocks.c 2004-08-14 14:55:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_minisocks.c 2006-05-11 13:05:44.000000000 +0400 -@@ -29,6 +29,8 @@ - #include <net/inet_common.h> - #include <net/xfrm.h> - -+#include <ub/ub_net.h> -+ - #ifdef CONFIG_SYSCTL - #define SYNC_INIT 0 /* let the user enable it */ - #else -@@ -74,7 +76,7 @@ static void tcp_timewait_kill(struct tcp - write_unlock(&ehead->lock); - - /* Disassociate with bind bucket. 
*/ -- bhead = &tcp_bhash[tcp_bhashfn(tw->tw_num)]; -+ bhead = &tcp_bhash[tcp_bhashfn(tw->tw_num, TW_VEID(tw))]; - spin_lock(&bhead->lock); - tb = tw->tw_tb; - __hlist_del(&tw->tw_bind_node); -@@ -123,17 +125,17 @@ enum tcp_tw_status - tcp_timewait_state_process(struct tcp_tw_bucket *tw, struct sk_buff *skb, - struct tcphdr *th, unsigned len) - { -- struct tcp_opt tp; -+ struct tcp_options_received tmp_opt; - int paws_reject = 0; - -- tp.saw_tstamp = 0; -+ tmp_opt.saw_tstamp = 0; - if (th->doff > (sizeof(struct tcphdr) >> 2) && tw->tw_ts_recent_stamp) { -- tcp_parse_options(skb, &tp, 0); -+ tcp_parse_options(skb, &tmp_opt, 0); - -- if (tp.saw_tstamp) { -- tp.ts_recent = tw->tw_ts_recent; -- tp.ts_recent_stamp = tw->tw_ts_recent_stamp; -- paws_reject = tcp_paws_check(&tp, th->rst); -+ if (tmp_opt.saw_tstamp) { -+ tmp_opt.ts_recent = tw->tw_ts_recent; -+ tmp_opt.ts_recent_stamp = tw->tw_ts_recent_stamp; -+ paws_reject = tcp_paws_check(&tmp_opt, th->rst); - } - } - -@@ -150,33 +152,28 @@ tcp_timewait_state_process(struct tcp_tw - if (th->rst) - goto kill; - -- if (th->syn && !before(TCP_SKB_CB(skb)->seq, tw->tw_rcv_nxt)) -- goto kill_with_rst; -+ if (th->syn && !before(TCP_SKB_CB(skb)->seq, tw->tw_rcv_nxt)) { -+ tw->tw_substate = TCP_CLOSE; -+ tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN); -+ return TCP_TW_RST; -+ } - - /* Dup ACK? */ - if (!after(TCP_SKB_CB(skb)->end_seq, tw->tw_rcv_nxt) || -- TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) { -- tcp_tw_put(tw); -+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) - return TCP_TW_SUCCESS; -- } - -- /* New data or FIN. If new data arrive after half-duplex close, -- * reset. -- */ -- if (!th->fin || -- TCP_SKB_CB(skb)->end_seq != tw->tw_rcv_nxt + 1) { --kill_with_rst: -- tcp_tw_deschedule(tw); -- tcp_tw_put(tw); -- return TCP_TW_RST; -- } -- -- /* FIN arrived, enter true time-wait state. */ -- tw->tw_substate = TCP_TIME_WAIT; -- tw->tw_rcv_nxt = TCP_SKB_CB(skb)->end_seq; -- if (tp.saw_tstamp) { -+ /* New data or FIN. 
*/ -+ if (th->fin && TCP_SKB_CB(skb)->end_seq == tw->tw_rcv_nxt + 1) { -+ /* FIN arrived, enter true time-wait state. */ -+ tw->tw_substate = TCP_TIME_WAIT; -+ tw->tw_rcv_nxt = TCP_SKB_CB(skb)->end_seq; -+ } else -+ /* If new data arrive after half-duplex close, reset. */ -+ tw->tw_substate = TCP_CLOSE; -+ if (tmp_opt.saw_tstamp) { - tw->tw_ts_recent_stamp = xtime.tv_sec; -- tw->tw_ts_recent = tp.rcv_tsval; -+ tw->tw_ts_recent = tmp_opt.rcv_tsval; - } - - /* I am shamed, but failed to make it more elegant. -@@ -190,7 +187,9 @@ kill_with_rst: - tcp_tw_schedule(tw, tw->tw_timeout); - else - tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN); -- return TCP_TW_ACK; -+ -+ return (tw->tw_substate == TCP_TIME_WAIT) ? -+ TCP_TW_ACK : TCP_TW_RST; - } - - /* -@@ -223,18 +222,16 @@ kill_with_rst: - if (sysctl_tcp_rfc1337 == 0) { - kill: - tcp_tw_deschedule(tw); -- tcp_tw_put(tw); - return TCP_TW_SUCCESS; - } - } - tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN); - -- if (tp.saw_tstamp) { -- tw->tw_ts_recent = tp.rcv_tsval; -+ if (tmp_opt.saw_tstamp) { -+ tw->tw_ts_recent = tmp_opt.rcv_tsval; - tw->tw_ts_recent_stamp = xtime.tv_sec; - } - -- tcp_tw_put(tw); - return TCP_TW_SUCCESS; - } - -@@ -257,7 +254,7 @@ kill: - - if (th->syn && !th->rst && !th->ack && !paws_reject && - (after(TCP_SKB_CB(skb)->seq, tw->tw_rcv_nxt) || -- (tp.saw_tstamp && (s32)(tw->tw_ts_recent - tp.rcv_tsval) < 0))) { -+ (tmp_opt.saw_tstamp && (s32)(tw->tw_ts_recent - tmp_opt.rcv_tsval) < 0))) { - u32 isn = tw->tw_snd_nxt + 65535 + 2; - if (isn == 0) - isn++; -@@ -268,7 +265,7 @@ kill: - if (paws_reject) - NET_INC_STATS_BH(LINUX_MIB_PAWSESTABREJECTED); - -- if(!th->rst) { -+ if (!th->rst) { - /* In this case we must reset the TIMEWAIT timer. - * - * If it is ACKless SYN it may be both old duplicate -@@ -278,12 +275,9 @@ kill: - if (paws_reject || th->ack) - tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN); - -- /* Send ACK. Note, we do not put the bucket, -- * it will be released by caller. 
-- */ -- return TCP_TW_ACK; -+ return (tw->tw_substate == TCP_TIME_WAIT) ? -+ TCP_TW_ACK : TCP_TW_RST; - } -- tcp_tw_put(tw); - return TCP_TW_SUCCESS; - } - -@@ -301,7 +295,8 @@ static void __tcp_tw_hashdance(struct so - Note, that any socket with inet_sk(sk)->num != 0 MUST be bound in - binding cache, even if it is closed. - */ -- bhead = &tcp_bhash[tcp_bhashfn(inet_sk(sk)->num)]; -+ bhead = &tcp_bhash[tcp_bhashfn(inet_sk(sk)->num, -+ VEID(VE_OWNER_SK(sk)))]; - spin_lock(&bhead->lock); - tw->tw_tb = tcp_sk(sk)->bind_hash; - BUG_TRAP(tcp_sk(sk)->bind_hash); -@@ -329,12 +324,15 @@ void tcp_time_wait(struct sock *sk, int - struct tcp_tw_bucket *tw = NULL; - struct tcp_opt *tp = tcp_sk(sk); - int recycle_ok = 0; -+ struct user_beancounter *ub; - -- if (sysctl_tcp_tw_recycle && tp->ts_recent_stamp) -+ if (sysctl_tcp_tw_recycle && tp->rx_opt.ts_recent_stamp) - recycle_ok = tp->af_specific->remember_stamp(sk); - -+ ub = set_sk_exec_ub(sk); - if (tcp_tw_count < sysctl_tcp_max_tw_buckets) - tw = kmem_cache_alloc(tcp_timewait_cachep, SLAB_ATOMIC); -+ (void)set_exec_ub(ub); - - if(tw != NULL) { - struct inet_opt *inet = inet_sk(sk); -@@ -351,16 +349,19 @@ void tcp_time_wait(struct sock *sk, int - tw->tw_dport = inet->dport; - tw->tw_family = sk->sk_family; - tw->tw_reuse = sk->sk_reuse; -- tw->tw_rcv_wscale = tp->rcv_wscale; -+ tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale; -+ if (sk->sk_user_data != NULL) -+ tw->tw_rcv_wscale |= TW_WSCALE_SPEC; - atomic_set(&tw->tw_refcnt, 1); - - tw->tw_hashent = sk->sk_hashent; - tw->tw_rcv_nxt = tp->rcv_nxt; - tw->tw_snd_nxt = tp->snd_nxt; - tw->tw_rcv_wnd = tcp_receive_window(tp); -- tw->tw_ts_recent = tp->ts_recent; -- tw->tw_ts_recent_stamp = tp->ts_recent_stamp; -+ tw->tw_ts_recent = tp->rx_opt.ts_recent; -+ tw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp; - tw_dead_node_init(tw); -+ spin_lock_init(&tw->tw_lock); - - #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) - if (tw->tw_family == PF_INET6) { -@@ -375,6 +376,8 @@ void 
tcp_time_wait(struct sock *sk, int - tw->tw_v6_ipv6only = 0; - } - #endif -+ SET_TW_VEID(tw, VEID(VE_OWNER_SK(sk))); -+ - /* Linkage updates. */ - __tcp_tw_hashdance(sk, tw); - -@@ -401,7 +404,8 @@ void tcp_time_wait(struct sock *sk, int - printk(KERN_INFO "TCP: time wait bucket table overflow\n"); - } - -- tcp_update_metrics(sk); -+ if (state != TCP_CLOSE) -+ tcp_update_metrics(sk); - tcp_done(sk); - } - -@@ -694,6 +698,10 @@ struct sock *tcp_create_openreq_child(st - struct sk_filter *filter; - - memcpy(newsk, sk, sizeof(struct tcp_sock)); -+ -+ if (ub_tcp_sock_charge(newsk) < 0) -+ goto out_sk_free; -+ - newsk->sk_state = TCP_SYN_RECV; - - /* SANITY */ -@@ -703,6 +711,7 @@ struct sock *tcp_create_openreq_child(st - /* Clone the TCP header template */ - inet_sk(newsk)->dport = req->rmt_port; - -+ SET_VE_OWNER_SK(newsk, VE_OWNER_SK(sk)); - sock_lock_init(newsk); - bh_lock_sock(newsk); - -@@ -729,6 +738,7 @@ struct sock *tcp_create_openreq_child(st - if (unlikely(xfrm_sk_clone_policy(newsk))) { - /* It is still raw copy of parent, so invalidate - * destructor and make plain sk_free() */ -+out_sk_free: - newsk->sk_destruct = NULL; - sk_free(newsk); - return NULL; -@@ -778,13 +788,13 @@ struct sock *tcp_create_openreq_child(st - newtp->pushed_seq = newtp->write_seq; - newtp->copied_seq = req->rcv_isn + 1; - -- newtp->saw_tstamp = 0; -+ newtp->rx_opt.saw_tstamp = 0; - -- newtp->dsack = 0; -- newtp->eff_sacks = 0; -+ newtp->rx_opt.dsack = 0; -+ newtp->rx_opt.eff_sacks = 0; - - newtp->probes_out = 0; -- newtp->num_sacks = 0; -+ newtp->rx_opt.num_sacks = 0; - newtp->urg_data = 0; - newtp->listen_opt = NULL; - newtp->accept_queue = newtp->accept_queue_tail = NULL; -@@ -807,36 +817,36 @@ struct sock *tcp_create_openreq_child(st - newsk->sk_sleep = NULL; - newsk->sk_owner = NULL; - -- newtp->tstamp_ok = req->tstamp_ok; -- if((newtp->sack_ok = req->sack_ok) != 0) { -+ newtp->rx_opt.tstamp_ok = req->tstamp_ok; -+ if((newtp->rx_opt.sack_ok = req->sack_ok) != 0) { - if 
(sysctl_tcp_fack) -- newtp->sack_ok |= 2; -+ newtp->rx_opt.sack_ok |= 2; - } - newtp->window_clamp = req->window_clamp; - newtp->rcv_ssthresh = req->rcv_wnd; - newtp->rcv_wnd = req->rcv_wnd; -- newtp->wscale_ok = req->wscale_ok; -- if (newtp->wscale_ok) { -- newtp->snd_wscale = req->snd_wscale; -- newtp->rcv_wscale = req->rcv_wscale; -+ newtp->rx_opt.wscale_ok = req->wscale_ok; -+ if (newtp->rx_opt.wscale_ok) { -+ newtp->rx_opt.snd_wscale = req->snd_wscale; -+ newtp->rx_opt.rcv_wscale = req->rcv_wscale; - } else { -- newtp->snd_wscale = newtp->rcv_wscale = 0; -+ newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0; - newtp->window_clamp = min(newtp->window_clamp, 65535U); - } -- newtp->snd_wnd = ntohs(skb->h.th->window) << newtp->snd_wscale; -+ newtp->snd_wnd = ntohs(skb->h.th->window) << newtp->rx_opt.snd_wscale; - newtp->max_window = newtp->snd_wnd; - -- if (newtp->tstamp_ok) { -- newtp->ts_recent = req->ts_recent; -- newtp->ts_recent_stamp = xtime.tv_sec; -+ if (newtp->rx_opt.tstamp_ok) { -+ newtp->rx_opt.ts_recent = req->ts_recent; -+ newtp->rx_opt.ts_recent_stamp = xtime.tv_sec; - newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED; - } else { -- newtp->ts_recent_stamp = 0; -+ newtp->rx_opt.ts_recent_stamp = 0; - newtp->tcp_header_len = sizeof(struct tcphdr); - } - if (skb->len >= TCP_MIN_RCVMSS+newtp->tcp_header_len) - newtp->ack.last_seg_size = skb->len-newtp->tcp_header_len; -- newtp->mss_clamp = req->mss; -+ newtp->rx_opt.mss_clamp = req->mss; - TCP_ECN_openreq_child(newtp, req); - if (newtp->ecn_flags&TCP_ECN_OK) - newsk->sk_no_largesend = 1; -@@ -860,21 +870,21 @@ struct sock *tcp_check_req(struct sock * - struct tcp_opt *tp = tcp_sk(sk); - u32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK); - int paws_reject = 0; -- struct tcp_opt ttp; -+ struct tcp_options_received tmp_opt; - struct sock *child; - -- ttp.saw_tstamp = 0; -+ tmp_opt.saw_tstamp = 0; - if (th->doff > (sizeof(struct tcphdr)>>2)) { -- 
tcp_parse_options(skb, &ttp, 0); -+ tcp_parse_options(skb, &tmp_opt, 0); - -- if (ttp.saw_tstamp) { -- ttp.ts_recent = req->ts_recent; -+ if (tmp_opt.saw_tstamp) { -+ tmp_opt.ts_recent = req->ts_recent; - /* We do not store true stamp, but it is not required, - * it can be estimated (approximately) - * from another data. - */ -- ttp.ts_recent_stamp = xtime.tv_sec - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans); -- paws_reject = tcp_paws_check(&ttp, th->rst); -+ tmp_opt.ts_recent_stamp = xtime.tv_sec - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans); -+ paws_reject = tcp_paws_check(&tmp_opt, th->rst); - } - } - -@@ -979,63 +989,63 @@ struct sock *tcp_check_req(struct sock * - - /* In sequence, PAWS is OK. */ - -- if (ttp.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, req->rcv_isn+1)) -- req->ts_recent = ttp.rcv_tsval; -+ if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, req->rcv_isn+1)) -+ req->ts_recent = tmp_opt.rcv_tsval; - -- if (TCP_SKB_CB(skb)->seq == req->rcv_isn) { -- /* Truncate SYN, it is out of window starting -- at req->rcv_isn+1. */ -- flg &= ~TCP_FLAG_SYN; -- } -+ if (TCP_SKB_CB(skb)->seq == req->rcv_isn) { -+ /* Truncate SYN, it is out of window starting -+ at req->rcv_isn+1. */ -+ flg &= ~TCP_FLAG_SYN; -+ } - -- /* RFC793: "second check the RST bit" and -- * "fourth, check the SYN bit" -- */ -- if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) -- goto embryonic_reset; -+ /* RFC793: "second check the RST bit" and -+ * "fourth, check the SYN bit" -+ */ -+ if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) -+ goto embryonic_reset; - -- /* ACK sequence verified above, just make sure ACK is -- * set. If ACK not set, just silently drop the packet. -- */ -- if (!(flg & TCP_FLAG_ACK)) -- return NULL; -+ /* ACK sequence verified above, just make sure ACK is -+ * set. If ACK not set, just silently drop the packet. -+ */ -+ if (!(flg & TCP_FLAG_ACK)) -+ return NULL; - -- /* If TCP_DEFER_ACCEPT is set, drop bare ACK. 
*/ -- if (tp->defer_accept && TCP_SKB_CB(skb)->end_seq == req->rcv_isn+1) { -- req->acked = 1; -- return NULL; -- } -+ /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */ -+ if (tp->defer_accept && TCP_SKB_CB(skb)->end_seq == req->rcv_isn+1) { -+ req->acked = 1; -+ return NULL; -+ } - -- /* OK, ACK is valid, create big socket and -- * feed this segment to it. It will repeat all -- * the tests. THIS SEGMENT MUST MOVE SOCKET TO -- * ESTABLISHED STATE. If it will be dropped after -- * socket is created, wait for troubles. -- */ -- child = tp->af_specific->syn_recv_sock(sk, skb, req, NULL); -- if (child == NULL) -- goto listen_overflow; -- -- sk_set_owner(child, sk->sk_owner); -- tcp_synq_unlink(tp, req, prev); -- tcp_synq_removed(sk, req); -- -- tcp_acceptq_queue(sk, req, child); -- return child; -- --listen_overflow: -- if (!sysctl_tcp_abort_on_overflow) { -- req->acked = 1; -- return NULL; -- } -+ /* OK, ACK is valid, create big socket and -+ * feed this segment to it. It will repeat all -+ * the tests. THIS SEGMENT MUST MOVE SOCKET TO -+ * ESTABLISHED STATE. If it will be dropped after -+ * socket is created, wait for troubles. 
-+ */ -+ child = tp->af_specific->syn_recv_sock(sk, skb, req, NULL); -+ if (child == NULL) -+ goto listen_overflow; -+ -+ sk_set_owner(child, sk->sk_owner); -+ tcp_synq_unlink(tp, req, prev); -+ tcp_synq_removed(sk, req); -+ -+ tcp_acceptq_queue(sk, req, child); -+ return child; -+ -+ listen_overflow: -+ if (!sysctl_tcp_abort_on_overflow) { -+ req->acked = 1; -+ return NULL; -+ } - --embryonic_reset: -- NET_INC_STATS_BH(LINUX_MIB_EMBRYONICRSTS); -- if (!(flg & TCP_FLAG_RST)) -- req->class->send_reset(skb); -+ embryonic_reset: -+ NET_INC_STATS_BH(LINUX_MIB_EMBRYONICRSTS); -+ if (!(flg & TCP_FLAG_RST)) -+ req->class->send_reset(skb); - -- tcp_synq_drop(sk, req, prev); -- return NULL; -+ tcp_synq_drop(sk, req, prev); -+ return NULL; - } - - /* -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_output.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_output.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_output.c 2004-08-14 14:55:47.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_output.c 2006-05-11 13:05:44.000000000 +0400 -@@ -42,6 +42,9 @@ - #include <linux/module.h> - #include <linux/smp_lock.h> - -+#include <ub/ub_net.h> -+#include <ub/ub_tcp.h> -+ - /* People can turn this off for buggy TCP's found in printers etc. */ - int sysctl_tcp_retrans_collapse = 1; - -@@ -171,13 +174,13 @@ static __inline__ u16 tcp_select_window( - /* Make sure we do not exceed the maximum possible - * scaled window. - */ -- if (!tp->rcv_wscale) -+ if (!tp->rx_opt.rcv_wscale) - new_win = min(new_win, MAX_TCP_WINDOW); - else -- new_win = min(new_win, (65535U << tp->rcv_wscale)); -+ new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale)); - - /* RFC1323 scaling applied */ -- new_win >>= tp->rcv_wscale; -+ new_win >>= tp->rx_opt.rcv_wscale; - - /* If we advertise zero window, disable fast path. 
*/ - if (new_win == 0) -@@ -187,6 +190,13 @@ static __inline__ u16 tcp_select_window( - } - - -+static int skb_header_size(struct sock *sk, int tcp_hlen) -+{ -+ struct ip_options *opt = inet_sk(sk)->opt; -+ return tcp_hlen + sizeof(struct iphdr) + -+ (opt ? opt->optlen : 0) + ETH_HLEN /* For hard header */; -+} -+ - /* This routine actually transmits TCP packets queued in by - * tcp_do_sendmsg(). This is used by both the initial - * transmission and possible later retransmissions. -@@ -205,6 +215,7 @@ int tcp_transmit_skb(struct sock *sk, st - struct tcp_opt *tp = tcp_sk(sk); - struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); - int tcp_header_size = tp->tcp_header_len; -+ int header_size; - struct tcphdr *th; - int sysctl_flags; - int err; -@@ -229,14 +240,28 @@ int tcp_transmit_skb(struct sock *sk, st - if(!(sysctl_flags & SYSCTL_FLAG_TSTAMPS)) - tcp_header_size += TCPOLEN_SACKPERM_ALIGNED; - } -- } else if (tp->eff_sacks) { -+ } else if (tp->rx_opt.eff_sacks) { - /* A SACK is 2 pad bytes, a 2 byte header, plus - * 2 32-bit sequence numbers for each SACK block. - */ - tcp_header_size += (TCPOLEN_SACK_BASE_ALIGNED + -- (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); -+ (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK)); - } -- -+ -+ /* Unfortunately, we can have skb from outside world here -+ * with size insufficient for header. It is impossible to make -+ * guess when we queue skb, so the decision should be made -+ * here. 
Den -+ */ -+ header_size = skb_header_size(sk, tcp_header_size); -+ if (skb->data - header_size < skb->head) { -+ int delta = header_size - skb_headroom(skb); -+ err = pskb_expand_head(skb, SKB_DATA_ALIGN(delta), -+ 0, GFP_ATOMIC); -+ if (err) -+ return err; -+ } -+ - /* - * If the connection is idle and we are restarting, - * then we don't want to do any Vegas calculations -@@ -282,9 +307,9 @@ int tcp_transmit_skb(struct sock *sk, st - (sysctl_flags & SYSCTL_FLAG_TSTAMPS), - (sysctl_flags & SYSCTL_FLAG_SACK), - (sysctl_flags & SYSCTL_FLAG_WSCALE), -- tp->rcv_wscale, -+ tp->rx_opt.rcv_wscale, - tcb->when, -- tp->ts_recent); -+ tp->rx_opt.ts_recent); - } else { - tcp_build_and_update_options((__u32 *)(th + 1), - tp, tcb->when); -@@ -374,15 +399,23 @@ static int tcp_fragment(struct sock *sk, - int nsize = skb->len - len; - u16 flags; - -- if (skb_cloned(skb) && -- skb_is_nonlinear(skb) && -- pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) -- return -ENOMEM; -+ if (skb_cloned(skb) && skb_is_nonlinear(skb)) { -+ unsigned long chargesize; -+ chargesize = skb_bc(skb)->charged; -+ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) -+ return -ENOMEM; -+ ub_sock_retwres_tcp(sk, chargesize, chargesize); -+ ub_tcpsndbuf_charge_forced(sk, skb); -+ } - - /* Get a new skb... force flag on. */ - buff = sk_stream_alloc_skb(sk, nsize, GFP_ATOMIC); - if (buff == NULL) - return -ENOMEM; /* We'll just try again later. */ -+ if (ub_tcpsndbuf_charge(sk, buff) < 0) { -+ kfree_skb(buff); -+ return -ENOMEM; -+ } - sk_charge_skb(sk, buff); - - /* Correct the sequence numbers. */ -@@ -479,10 +512,10 @@ static int tcp_trim_head(struct sock *sk - - /* This function synchronize snd mss to current pmtu/exthdr set. - -- tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts -+ tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts - for TCP options, but includes only bare TCP header. - -- tp->mss_clamp is mss negotiated at connection setup. 
-+ tp->rx_opt.mss_clamp is mss negotiated at connection setup. - It is minumum of user_mss and mss received with SYN. - It also does not include TCP options. - -@@ -491,7 +524,7 @@ static int tcp_trim_head(struct sock *sk - tp->mss_cache is current effective sending mss, including - all tcp options except for SACKs. It is evaluated, - taking into account current pmtu, but never exceeds -- tp->mss_clamp. -+ tp->rx_opt.mss_clamp. - - NOTE1. rfc1122 clearly states that advertised MSS - DOES NOT include either tcp or ip options. -@@ -515,8 +548,8 @@ int tcp_sync_mss(struct sock *sk, u32 pm - mss_now = pmtu - tp->af_specific->net_header_len - sizeof(struct tcphdr); - - /* Clamp it (mss_clamp does not include tcp options) */ -- if (mss_now > tp->mss_clamp) -- mss_now = tp->mss_clamp; -+ if (mss_now > tp->rx_opt.mss_clamp) -+ mss_now = tp->rx_opt.mss_clamp; - - /* Now subtract optional transport overhead */ - mss_now -= tp->ext_header_len + tp->ext2_header_len; -@@ -680,7 +713,7 @@ u32 __tcp_select_window(struct sock *sk) - if (free_space < full_space/2) { - tp->ack.quick = 0; - -- if (tcp_memory_pressure) -+ if (ub_tcp_shrink_rcvbuf(sk)) - tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U*tp->advmss); - - if (free_space < mss) -@@ -694,16 +727,16 @@ u32 __tcp_select_window(struct sock *sk) - * scaled window will not line up with the MSS boundary anyway. - */ - window = tp->rcv_wnd; -- if (tp->rcv_wscale) { -+ if (tp->rx_opt.rcv_wscale) { - window = free_space; - - /* Advertise enough space so that it won't get scaled away. - * Import case: prevent zero window announcement if - * 1<<rcv_wscale > mss. - */ -- if (((window >> tp->rcv_wscale) << tp->rcv_wscale) != window) -- window = (((window >> tp->rcv_wscale) + 1) -- << tp->rcv_wscale); -+ if (((window >> tp->rx_opt.rcv_wscale) << tp->rx_opt.rcv_wscale) != window) -+ window = (((window >> tp->rx_opt.rcv_wscale) + 1) -+ << tp->rx_opt.rcv_wscale); - } else { - /* Get the largest window that is a nice multiple of mss. 
- * Window clamp already applied above. -@@ -778,7 +811,7 @@ static void tcp_retrans_try_collapse(str - tp->left_out--; - } - /* Reno case is special. Sigh... */ -- if (!tp->sack_ok && tp->sacked_out) { -+ if (!tp->rx_opt.sack_ok && tp->sacked_out) { - tp->sacked_out--; - tp->left_out--; - } -@@ -998,7 +1031,7 @@ void tcp_xmit_retransmit_queue(struct so - return; - - /* No forward retransmissions in Reno are possible. */ -- if (!tp->sack_ok) -+ if (!tp->rx_opt.sack_ok) - return; - - /* Yeah, we have to make difficult choice between forward transmission -@@ -1062,6 +1095,7 @@ void tcp_send_fin(struct sock *sk) - break; - yield(); - } -+ ub_tcpsndbuf_charge_forced(sk, skb); - - /* Reserve space for headers and prepare control bits. */ - skb_reserve(skb, MAX_TCP_HEADER); -@@ -1127,6 +1161,10 @@ int tcp_send_synack(struct sock *sk) - struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC); - if (nskb == NULL) - return -ENOMEM; -+ if (ub_tcpsndbuf_charge(sk, skb) < 0) { -+ kfree_skb(nskb); -+ return -ENOMEM; -+ } - __skb_unlink(skb, &sk->sk_write_queue); - __skb_queue_head(&sk->sk_write_queue, nskb); - sk_stream_free_skb(sk, skb); -@@ -1224,23 +1262,38 @@ static inline void tcp_connect_init(stru - (sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0); - - /* If user gave his TCP_MAXSEG, record it to clamp */ -- if (tp->user_mss) -- tp->mss_clamp = tp->user_mss; -+ if (tp->rx_opt.user_mss) -+ tp->rx_opt.mss_clamp = tp->rx_opt.user_mss; - tp->max_window = 0; - tcp_sync_mss(sk, dst_pmtu(dst)); - -+ if (tp->advmss == 0 || dst_metric(dst, RTAX_ADVMSS) == 0) { -+ printk("Oops in connect_init! 
tp->advmss=%d, dst->advmss=%d\n", -+ tp->advmss, dst_metric(dst, RTAX_ADVMSS)); -+ printk("dst: pmtu=%u, advmss=%u\n", -+ dst_metric(dst, RTAX_MTU), -+ dst_metric(dst, RTAX_ADVMSS)); -+ printk("sk->state=%d, tp: ack.rcv_mss=%d, mss_cache=%d, " -+ "advmss=%d, user_mss=%d\n", -+ sk->sk_state, tp->ack.rcv_mss, tp->mss_cache, -+ tp->advmss, tp->rx_opt.user_mss); -+ } -+ - if (!tp->window_clamp) - tp->window_clamp = dst_metric(dst, RTAX_WINDOW); -- tp->advmss = dst_metric(dst, RTAX_ADVMSS); -+ if (dst_metric(dst, RTAX_ADVMSS) < tp->advmss) -+ tp->advmss = dst_metric(dst, RTAX_ADVMSS); -+ if (tp->advmss == 0) -+ tp->advmss = 1460; - tcp_initialize_rcv_mss(sk); - tcp_vegas_init(tp); - - tcp_select_initial_window(tcp_full_space(sk), -- tp->advmss - (tp->ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0), -+ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0), - &tp->rcv_wnd, - &tp->window_clamp, - sysctl_tcp_window_scaling, -- &tp->rcv_wscale); -+ &tp->rx_opt.rcv_wscale); - - tp->rcv_ssthresh = tp->rcv_wnd; - -@@ -1272,6 +1325,10 @@ int tcp_connect(struct sock *sk) - buff = alloc_skb(MAX_TCP_HEADER + 15, sk->sk_allocation); - if (unlikely(buff == NULL)) - return -ENOBUFS; -+ if (ub_tcpsndbuf_charge(sk, buff) < 0) { -+ kfree_skb(buff); -+ return -ENOBUFS; -+ } - - /* Reserve space for headers. 
*/ - skb_reserve(buff, MAX_TCP_HEADER); -diff -uprN linux-2.6.8.1.orig/net/ipv4/tcp_timer.c linux-2.6.8.1-ve022stab078/net/ipv4/tcp_timer.c ---- linux-2.6.8.1.orig/net/ipv4/tcp_timer.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/tcp_timer.c 2006-05-11 13:05:42.000000000 +0400 -@@ -22,6 +22,8 @@ - - #include <linux/module.h> - #include <net/tcp.h> -+#include <ub/ub_orphan.h> -+#include <ub/ub_tcp.h> - - int sysctl_tcp_syn_retries = TCP_SYN_RETRIES; - int sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES; -@@ -100,7 +102,7 @@ static void tcp_write_err(struct sock *s - static int tcp_out_of_resources(struct sock *sk, int do_reset) - { - struct tcp_opt *tp = tcp_sk(sk); -- int orphans = atomic_read(&tcp_orphan_count); -+ int orphans = tcp_get_orphan_count(sk); - - /* If peer does not open window for long time, or did not transmit - * anything for long time, penalize it. */ -@@ -111,9 +113,7 @@ static int tcp_out_of_resources(struct s - if (sk->sk_err_soft) - orphans <<= 1; - -- if (orphans >= sysctl_tcp_max_orphans || -- (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && -- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) { -+ if (tcp_too_many_orphans(sk, orphans)) { - if (net_ratelimit()) - printk(KERN_INFO "Out of socket memory\n"); - -@@ -206,6 +206,7 @@ static int tcp_write_timeout(struct sock - static void tcp_delack_timer(unsigned long data) - { - struct sock *sk = (struct sock*)data; -+ struct ve_struct *env = set_exec_env(VE_OWNER_SK(sk)); - struct tcp_opt *tp = tcp_sk(sk); - - bh_lock_sock(sk); -@@ -257,11 +258,12 @@ static void tcp_delack_timer(unsigned lo - TCP_CHECK_TIMER(sk); - - out: -- if (tcp_memory_pressure) -+ if (ub_tcp_memory_pressure(sk)) - sk_stream_mem_reclaim(sk); - out_unlock: - bh_unlock_sock(sk); - sock_put(sk); -+ (void)set_exec_env(env); - } - - static void tcp_probe_timer(struct sock *sk) -@@ -315,6 +317,9 @@ static void tcp_probe_timer(struct sock - static void tcp_retransmit_timer(struct sock *sk) - { - struct 
tcp_opt *tp = tcp_sk(sk); -+ struct ve_struct *ve_old; -+ -+ ve_old = set_exec_env(VE_OWNER_SK(sk)); - - if (tp->packets_out == 0) - goto out; -@@ -351,7 +356,7 @@ static void tcp_retransmit_timer(struct - - if (tp->retransmits == 0) { - if (tp->ca_state == TCP_CA_Disorder || tp->ca_state == TCP_CA_Recovery) { -- if (tp->sack_ok) { -+ if (tp->rx_opt.sack_ok) { - if (tp->ca_state == TCP_CA_Recovery) - NET_INC_STATS_BH(LINUX_MIB_TCPSACKRECOVERYFAIL); - else -@@ -410,12 +415,14 @@ out_reset_timer: - if (tp->retransmits > sysctl_tcp_retries1) - __sk_dst_reset(sk); - --out:; -+out: -+ (void)set_exec_env(ve_old); - } - - static void tcp_write_timer(unsigned long data) - { - struct sock *sk = (struct sock*)data; -+ struct ve_struct *env = set_exec_env(VE_OWNER_SK(sk)); - struct tcp_opt *tp = tcp_sk(sk); - int event; - -@@ -452,6 +459,7 @@ out: - out_unlock: - bh_unlock_sock(sk); - sock_put(sk); -+ (void)set_exec_env(env); - } - - /* -@@ -571,6 +579,7 @@ void tcp_set_keepalive(struct sock *sk, - static void tcp_keepalive_timer (unsigned long data) - { - struct sock *sk = (struct sock *) data; -+ struct ve_struct *env = set_exec_env(VE_OWNER_SK(sk)); - struct tcp_opt *tp = tcp_sk(sk); - __u32 elapsed; - -@@ -645,6 +654,7 @@ death: - out: - bh_unlock_sock(sk); - sock_put(sk); -+ (void)set_exec_env(env); - } - - EXPORT_SYMBOL(tcp_clear_xmit_timers); -diff -uprN linux-2.6.8.1.orig/net/ipv4/udp.c linux-2.6.8.1-ve022stab078/net/ipv4/udp.c ---- linux-2.6.8.1.orig/net/ipv4/udp.c 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv4/udp.c 2006-05-11 13:05:42.000000000 +0400 -@@ -125,7 +125,9 @@ static int udp_v4_get_port(struct sock * - struct hlist_node *node; - struct sock *sk2; - struct inet_opt *inet = inet_sk(sk); -+ struct ve_struct *env; - -+ env = VE_OWNER_SK(sk); - write_lock_bh(&udp_hash_lock); - if (snum == 0) { - int best_size_so_far, best, result, i; -@@ -139,7 +141,7 @@ static int udp_v4_get_port(struct sock * - struct hlist_head *list; - int 
size; - -- list = &udp_hash[result & (UDP_HTABLE_SIZE - 1)]; -+ list = &udp_hash[udp_hashfn(result, VEID(env))]; - if (hlist_empty(list)) { - if (result > sysctl_local_port_range[1]) - result = sysctl_local_port_range[0] + -@@ -161,7 +163,7 @@ static int udp_v4_get_port(struct sock * - result = sysctl_local_port_range[0] - + ((result - sysctl_local_port_range[0]) & - (UDP_HTABLE_SIZE - 1)); -- if (!udp_lport_inuse(result)) -+ if (!udp_lport_inuse(result, env)) - break; - } - if (i >= (1 << 16) / UDP_HTABLE_SIZE) -@@ -170,11 +172,12 @@ gotit: - udp_port_rover = snum = result; - } else { - sk_for_each(sk2, node, -- &udp_hash[snum & (UDP_HTABLE_SIZE - 1)]) { -+ &udp_hash[udp_hashfn(snum, VEID(env))]) { - struct inet_opt *inet2 = inet_sk(sk2); - - if (inet2->num == snum && - sk2 != sk && -+ ve_accessible_strict(VE_OWNER_SK(sk2), env) && - !ipv6_only_sock(sk2) && - (!sk2->sk_bound_dev_if || - !sk->sk_bound_dev_if || -@@ -188,7 +191,7 @@ gotit: - } - inet->num = snum; - if (sk_unhashed(sk)) { -- struct hlist_head *h = &udp_hash[snum & (UDP_HTABLE_SIZE - 1)]; -+ struct hlist_head *h = &udp_hash[udp_hashfn(snum, VEID(env))]; - - sk_add_node(sk, h); - sock_prot_inc_use(sk->sk_prot); -@@ -225,11 +228,15 @@ struct sock *udp_v4_lookup_longway(u32 s - struct hlist_node *node; - unsigned short hnum = ntohs(dport); - int badness = -1; -+ struct ve_struct *env; - -- sk_for_each(sk, node, &udp_hash[hnum & (UDP_HTABLE_SIZE - 1)]) { -+ env = get_exec_env(); -+ sk_for_each(sk, node, &udp_hash[udp_hashfn(hnum, VEID(env))]) { - struct inet_opt *inet = inet_sk(sk); - -- if (inet->num == hnum && !ipv6_only_sock(sk)) { -+ if (inet->num == hnum && -+ ve_accessible_strict(VE_OWNER_SK(sk), env) && -+ !ipv6_only_sock(sk)) { - int score = (sk->sk_family == PF_INET ? 
1 : 0); - if (inet->rcv_saddr) { - if (inet->rcv_saddr != daddr) -@@ -1053,7 +1060,8 @@ static int udp_v4_mcast_deliver(struct s - int dif; - - read_lock(&udp_hash_lock); -- sk = sk_head(&udp_hash[ntohs(uh->dest) & (UDP_HTABLE_SIZE - 1)]); -+ sk = sk_head(&udp_hash[udp_hashfn(ntohs(uh->dest), -+ VEID(VE_OWNER_SKB(skb)))]); - dif = skb->dev->ifindex; - sk = udp_v4_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif); - if (sk) { -@@ -1329,10 +1337,14 @@ static struct sock *udp_get_first(struct - { - struct sock *sk; - struct udp_iter_state *state = seq->private; -+ struct ve_struct *env; - -+ env = get_exec_env(); - for (state->bucket = 0; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) { - struct hlist_node *node; - sk_for_each(sk, node, &udp_hash[state->bucket]) { -+ if (!ve_accessible(VE_OWNER_SK(sk), env)) -+ continue; - if (sk->sk_family == state->family) - goto found; - } -@@ -1349,8 +1361,13 @@ static struct sock *udp_get_next(struct - do { - sk = sk_next(sk); - try_again: -- ; -- } while (sk && sk->sk_family != state->family); -+ if (!sk) -+ break; -+ if (sk->sk_family != state->family) -+ continue; -+ if (ve_accessible(VE_OWNER_SK(sk), get_exec_env())) -+ break; -+ } while (1); - - if (!sk && ++state->bucket < UDP_HTABLE_SIZE) { - sk = sk_head(&udp_hash[state->bucket]); -diff -uprN linux-2.6.8.1.orig/net/ipv6/addrconf.c linux-2.6.8.1-ve022stab078/net/ipv6/addrconf.c ---- linux-2.6.8.1.orig/net/ipv6/addrconf.c 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/addrconf.c 2006-05-11 13:05:42.000000000 +0400 -@@ -1875,6 +1875,10 @@ static int addrconf_notify(struct notifi - struct net_device *dev = (struct net_device *) data; - struct inet6_dev *idev = __in6_dev_get(dev); - -+ /* not virtualized yet */ -+ if (!ve_is_super(get_exec_env())) -+ return NOTIFY_OK; -+ - switch(event) { - case NETDEV_UP: - switch(dev->type) { -diff -uprN linux-2.6.8.1.orig/net/ipv6/datagram.c linux-2.6.8.1-ve022stab078/net/ipv6/datagram.c ---- 
linux-2.6.8.1.orig/net/ipv6/datagram.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/datagram.c 2006-05-11 13:05:33.000000000 +0400 -@@ -416,9 +416,7 @@ int datagram_send_ctl(struct msghdr *msg - int addr_type; - struct net_device *dev = NULL; - -- if (cmsg->cmsg_len < sizeof(struct cmsghdr) || -- (unsigned long)(((char*)cmsg - (char*)msg->msg_control) -- + cmsg->cmsg_len) > msg->msg_controllen) { -+ if (!CMSG_OK(msg, cmsg)) { - err = -EINVAL; - goto exit_f; - } -diff -uprN linux-2.6.8.1.orig/net/ipv6/ip6_output.c linux-2.6.8.1-ve022stab078/net/ipv6/ip6_output.c ---- linux-2.6.8.1.orig/net/ipv6/ip6_output.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/ip6_output.c 2006-05-11 13:05:25.000000000 +0400 -@@ -593,6 +593,7 @@ static int ip6_fragment(struct sk_buff * - /* Prepare header of the next frame, - * before previous one went down. */ - if (frag) { -+ frag->ip_summed = CHECKSUM_NONE; - frag->h.raw = frag->data; - fh = (struct frag_hdr*)__skb_push(frag, sizeof(struct frag_hdr)); - frag->nh.raw = __skb_push(frag, hlen); -diff -uprN linux-2.6.8.1.orig/net/ipv6/ipv6_sockglue.c linux-2.6.8.1-ve022stab078/net/ipv6/ipv6_sockglue.c ---- linux-2.6.8.1.orig/net/ipv6/ipv6_sockglue.c 2004-08-14 14:54:48.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/ipv6_sockglue.c 2006-05-11 13:05:34.000000000 +0400 -@@ -503,6 +503,9 @@ done: - break; - case IPV6_IPSEC_POLICY: - case IPV6_XFRM_POLICY: -+ retv = -EPERM; -+ if (!capable(CAP_NET_ADMIN)) -+ break; - retv = xfrm_user_policy(sk, optname, optval, optlen); - break; - -diff -uprN linux-2.6.8.1.orig/net/ipv6/mcast.c linux-2.6.8.1-ve022stab078/net/ipv6/mcast.c ---- linux-2.6.8.1.orig/net/ipv6/mcast.c 2004-08-14 14:56:01.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/mcast.c 2006-05-11 13:05:42.000000000 +0400 -@@ -389,12 +389,12 @@ int ip6_mc_source(int add, int omode, st - goto done; - rv = !0; - for (i=0; i<psl->sl_count; i++) { -- rv = 
memcmp(&psl->sl_addr, group, -+ rv = memcmp(&psl->sl_addr[i], source, - sizeof(struct in6_addr)); -- if (rv >= 0) -+ if (rv == 0) - break; - } -- if (!rv) /* source not found */ -+ if (rv) /* source not found */ - goto done; - - /* update the interface filter */ -@@ -435,8 +435,8 @@ int ip6_mc_source(int add, int omode, st - } - rv = 1; /* > 0 for insert logic below if sl_count is 0 */ - for (i=0; i<psl->sl_count; i++) { -- rv = memcmp(&psl->sl_addr, group, sizeof(struct in6_addr)); -- if (rv >= 0) -+ rv = memcmp(&psl->sl_addr[i], source, sizeof(struct in6_addr)); -+ if (rv == 0) - break; - } - if (rv == 0) /* address already there is an error */ -@@ -1175,6 +1175,11 @@ int igmp6_event_report(struct sk_buff *s - if (skb->pkt_type == PACKET_LOOPBACK) - return 0; - -+ /* send our report if the MC router may not have heard this report */ -+ if (skb->pkt_type != PACKET_MULTICAST && -+ skb->pkt_type != PACKET_BROADCAST) -+ return 0; -+ - if (!pskb_may_pull(skb, sizeof(struct in6_addr))) - return -EINVAL; - -diff -uprN linux-2.6.8.1.orig/net/ipv6/netfilter/ip6_queue.c linux-2.6.8.1-ve022stab078/net/ipv6/netfilter/ip6_queue.c ---- linux-2.6.8.1.orig/net/ipv6/netfilter/ip6_queue.c 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/netfilter/ip6_queue.c 2006-05-11 13:05:27.000000000 +0400 -@@ -71,7 +71,9 @@ static DECLARE_MUTEX(ipqnl_sem); - static void - ipq_issue_verdict(struct ipq_queue_entry *entry, int verdict) - { -+ local_bh_disable(); - nf_reinject(entry->skb, entry->info, verdict); -+ local_bh_enable(); - kfree(entry); - } - -diff -uprN linux-2.6.8.1.orig/net/ipv6/tcp_ipv6.c linux-2.6.8.1-ve022stab078/net/ipv6/tcp_ipv6.c ---- linux-2.6.8.1.orig/net/ipv6/tcp_ipv6.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/ipv6/tcp_ipv6.c 2006-05-11 13:05:42.000000000 +0400 -@@ -142,7 +142,7 @@ static int tcp_v6_get_port(struct sock * - do { rover++; - if ((rover < low) || (rover > high)) - rover = low; -- head = 
&tcp_bhash[tcp_bhashfn(rover)]; -+ head = &tcp_bhash[tcp_bhashfn(rover, 0)]; - spin_lock(&head->lock); - tb_for_each(tb, node, &head->chain) - if (tb->port == rover) -@@ -162,7 +162,7 @@ static int tcp_v6_get_port(struct sock * - /* OK, here is the one we will use. */ - snum = rover; - } else { -- head = &tcp_bhash[tcp_bhashfn(snum)]; -+ head = &tcp_bhash[tcp_bhashfn(snum, 0)]; - spin_lock(&head->lock); - tb_for_each(tb, node, &head->chain) - if (tb->port == snum) -@@ -183,7 +183,7 @@ tb_found: - } - tb_not_found: - ret = 1; -- if (!tb && (tb = tcp_bucket_create(head, snum)) == NULL) -+ if (!tb && (tb = tcp_bucket_create(head, snum, NULL)) == NULL) - goto fail_unlock; - if (hlist_empty(&tb->owners)) { - if (sk->sk_reuse && sk->sk_state != TCP_LISTEN) -@@ -255,7 +255,7 @@ static struct sock *tcp_v6_lookup_listen - - hiscore=0; - read_lock(&tcp_lhash_lock); -- sk_for_each(sk, node, &tcp_listening_hash[tcp_lhashfn(hnum)]) { -+ sk_for_each(sk, node, &tcp_listening_hash[tcp_lhashfn(hnum, 0)]) { - if (inet_sk(sk)->num == hnum && sk->sk_family == PF_INET6) { - struct ipv6_pinfo *np = inet6_sk(sk); - -@@ -470,8 +470,8 @@ static int tcp_v6_check_established(stru - tp->write_seq = tw->tw_snd_nxt + 65535 + 2; - if (!tp->write_seq) - tp->write_seq = 1; -- tp->ts_recent = tw->tw_ts_recent; -- tp->ts_recent_stamp = tw->tw_ts_recent_stamp; -+ tp->rx_opt.ts_recent = tw->tw_ts_recent; -+ tp->rx_opt.ts_recent_stamp = tw->tw_ts_recent_stamp; - sock_hold(sk2); - goto unique; - } else -@@ -522,7 +522,7 @@ static int tcp_v6_hash_connect(struct so - inet_sk(sk)->sport = htons(inet_sk(sk)->num); - } - -- head = &tcp_bhash[tcp_bhashfn(inet_sk(sk)->num)]; -+ head = &tcp_bhash[tcp_bhashfn(inet_sk(sk)->num, 0)]; - tb = tb_head(head); - - spin_lock_bh(&head->lock); -@@ -606,10 +606,10 @@ static int tcp_v6_connect(struct sock *s - return -EINVAL; - } - -- if (tp->ts_recent_stamp && -+ if (tp->rx_opt.ts_recent_stamp && - ipv6_addr_cmp(&np->daddr, &usin->sin6_addr)) { -- tp->ts_recent = 0; -- 
tp->ts_recent_stamp = 0; -+ tp->rx_opt.ts_recent = 0; -+ tp->rx_opt.ts_recent_stamp = 0; - tp->write_seq = 0; - } - -@@ -686,13 +686,15 @@ static int tcp_v6_connect(struct sock *s - ip6_dst_store(sk, dst, NULL); - sk->sk_route_caps = dst->dev->features & - ~(NETIF_F_IP_CSUM | NETIF_F_TSO); -+ if (!sysctl_tcp_use_sg) -+ sk->sk_route_caps &= ~NETIF_F_SG; - - tp->ext_header_len = 0; - if (np->opt) - tp->ext_header_len = np->opt->opt_flen + np->opt->opt_nflen; - tp->ext2_header_len = dst->header_len; - -- tp->mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); -+ tp->rx_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); - - inet->dport = usin->sin6_port; - -@@ -1166,7 +1168,8 @@ static void tcp_v6_synq_add(struct sock - static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) - { - struct ipv6_pinfo *np = inet6_sk(sk); -- struct tcp_opt tmptp, *tp = tcp_sk(sk); -+ struct tcp_options_received tmp_opt; -+ struct tcp_opt *tp = tcp_sk(sk); - struct open_request *req = NULL; - __u32 isn = TCP_SKB_CB(skb)->when; - -@@ -1192,14 +1195,14 @@ static int tcp_v6_conn_request(struct so - if (req == NULL) - goto drop; - -- tcp_clear_options(&tmptp); -- tmptp.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); -- tmptp.user_mss = tp->user_mss; -+ tcp_clear_options(&tmp_opt); -+ tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); -+ tmp_opt.user_mss = tp->rx_opt.user_mss; - -- tcp_parse_options(skb, &tmptp, 0); -+ tcp_parse_options(skb, &tmp_opt, 0); - -- tmptp.tstamp_ok = tmptp.saw_tstamp; -- tcp_openreq_init(req, &tmptp, skb); -+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp; -+ tcp_openreq_init(req, &tmp_opt, skb); - - req->class = &or_ipv6; - ipv6_addr_copy(&req->af.v6_req.rmt_addr, &skb->nh.ipv6h->saddr); -@@ -1343,6 +1346,8 @@ static struct sock * tcp_v6_syn_recv_soc - ip6_dst_store(newsk, dst, NULL); - newsk->sk_route_caps = dst->dev->features & - ~(NETIF_F_IP_CSUM | 
NETIF_F_TSO);
-+	if (!sysctl_tcp_use_sg)
-+		sk->sk_route_caps &= ~NETIF_F_SG;
- 
- 	newtcp6sk = (struct tcp6_sock *)newsk;
- 	newtcp6sk->pinet6 = &newtcp6sk->inet6;
-@@ -1675,12 +1680,14 @@ do_time_wait:
- 		goto discard_it;
- 	}
- 
-+	spin_lock(&((struct tcp_tw_bucket *)sk)->tw_lock);
- 	switch(tcp_timewait_state_process((struct tcp_tw_bucket *)sk,
- 					  skb, th, skb->len)) {
- 	case TCP_TW_SYN:
- 	{
- 		struct sock *sk2;
- 
-+		spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock);
- 		sk2 = tcp_v6_lookup_listener(&skb->nh.ipv6h->daddr, ntohs(th->dest), tcp_v6_iif(skb));
- 		if (sk2 != NULL) {
- 			tcp_tw_deschedule((struct tcp_tw_bucket *)sk);
-@@ -1694,9 +1701,13 @@ do_time_wait:
- 		tcp_v6_timewait_ack(sk, skb);
- 		break;
- 	case TCP_TW_RST:
-+		spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock);
-+		tcp_tw_put((struct tcp_tw_bucket *)sk);
- 		goto no_tcp_socket;
- 	case TCP_TW_SUCCESS:;
- 	}
-+	spin_unlock(&((struct tcp_tw_bucket *)sk)->tw_lock);
-+	tcp_tw_put((struct tcp_tw_bucket *)sk);
- 	goto discard_it;
- }
- 
-@@ -1736,6 +1747,8 @@ static int tcp_v6_rebuild_header(struct
- 		ip6_dst_store(sk, dst, NULL);
- 		sk->sk_route_caps = dst->dev->features &
- 			~(NETIF_F_IP_CSUM | NETIF_F_TSO);
-+		if (!sysctl_tcp_use_sg)
-+			sk->sk_route_caps &= ~NETIF_F_SG;
- 		tcp_sk(sk)->ext2_header_len = dst->header_len;
- 	}
- 
-@@ -1778,6 +1791,8 @@ static int tcp_v6_xmit(struct sk_buff *s
- 		ip6_dst_store(sk, dst, NULL);
- 		sk->sk_route_caps = dst->dev->features &
- 			~(NETIF_F_IP_CSUM | NETIF_F_TSO);
-+		if (!sysctl_tcp_use_sg)
-+			sk->sk_route_caps &= ~NETIF_F_SG;
- 		tcp_sk(sk)->ext2_header_len = dst->header_len;
- 	}
- 
-diff -uprN linux-2.6.8.1.orig/net/ipv6/udp.c linux-2.6.8.1-ve022stab078/net/ipv6/udp.c
---- linux-2.6.8.1.orig/net/ipv6/udp.c	2004-08-14 14:56:00.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/ipv6/udp.c	2006-05-11 13:05:42.000000000 +0400
-@@ -67,7 +67,9 @@ static int udp_v6_get_port(struct sock *
- {
- 	struct sock *sk2;
- 	struct hlist_node *node;
-+	struct ve_struct *env;
- 
-+	env = VE_OWNER_SK(sk);
- 	write_lock_bh(&udp_hash_lock);
- 	if (snum == 0) {
- 		int best_size_so_far, best, result, i;
-@@ -81,7 +83,7 @@
- 			int size;
- 			struct hlist_head *list;
- 
--			list = &udp_hash[result & (UDP_HTABLE_SIZE - 1)];
-+			list = &udp_hash[udp_hashfn(result, VEID(env))];
- 			if (hlist_empty(list)) {
- 				if (result > sysctl_local_port_range[1])
- 					result = sysctl_local_port_range[0] +
-@@ -103,16 +105,17 @@
- 			result = sysctl_local_port_range[0]
- 				+ ((result - sysctl_local_port_range[0]) &
- 				   (UDP_HTABLE_SIZE - 1));
--			if (!udp_lport_inuse(result))
-+			if (!udp_lport_inuse(result, env))
- 				break;
- 		}
- gotit:
- 		udp_port_rover = snum = result;
- 	} else {
- 		sk_for_each(sk2, node,
--			    &udp_hash[snum & (UDP_HTABLE_SIZE - 1)]) {
-+			    &udp_hash[udp_hashfn(snum, VEID(env))]) {
- 			if (inet_sk(sk2)->num == snum &&
- 			    sk2 != sk &&
-+			    ve_accessible_strict(VE_OWNER_SK(sk2), env) &&
- 			    (!sk2->sk_bound_dev_if ||
- 			     !sk->sk_bound_dev_if ||
- 			     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
-@@ -124,7 +127,7 @@ gotit:
- 
- 	inet_sk(sk)->num = snum;
- 	if (sk_unhashed(sk)) {
--		sk_add_node(sk, &udp_hash[snum & (UDP_HTABLE_SIZE - 1)]);
-+		sk_add_node(sk, &udp_hash[udp_hashfn(snum, VEID(env))]);
- 		sock_prot_inc_use(sk->sk_prot);
- 	}
- 	write_unlock_bh(&udp_hash_lock);
-diff -uprN linux-2.6.8.1.orig/net/netlink/af_netlink.c linux-2.6.8.1-ve022stab078/net/netlink/af_netlink.c
---- linux-2.6.8.1.orig/net/netlink/af_netlink.c	2004-08-14 14:55:32.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/netlink/af_netlink.c	2006-05-11 13:05:45.000000000 +0400
-@@ -47,26 +47,15 @@
- #include <net/sock.h>
- #include <net/scm.h>
- 
-+#include <ub/beancounter.h>
-+#include <ub/ub_net.h>
-+
- #define Nprintk(a...)
- 
- #if defined(CONFIG_NETLINK_DEV) || defined(CONFIG_NETLINK_DEV_MODULE)
- #define NL_EMULATE_DEV
- #endif
- 
--struct netlink_opt
--{
--	u32 pid;
--	unsigned groups;
--	u32 dst_pid;
--	unsigned dst_groups;
--	unsigned long state;
--	int (*handler)(int unit, struct sk_buff *skb);
--	wait_queue_head_t wait;
--	struct netlink_callback *cb;
--	spinlock_t cb_lock;
--	void (*data_ready)(struct sock *sk, int bytes);
--};
--
- #define nlk_sk(__sk) ((struct netlink_opt *)(__sk)->sk_protinfo)
- 
- static struct hlist_head nl_table[MAX_LINKS];
-@@ -165,7 +154,10 @@ static __inline__ struct sock *netlink_l
- 
- 	read_lock(&nl_table_lock);
- 	sk_for_each(sk, node, &nl_table[protocol]) {
--		if (nlk_sk(sk)->pid == pid) {
-+		/* VEs should find sockets, created by kernel */
-+		if ((nlk_sk(sk)->pid == pid) &&
-+		    (!pid || ve_accessible_strict(VE_OWNER_SK(sk),
-+			    get_exec_env()))){
- 			sock_hold(sk);
- 			goto found;
- 		}
-@@ -186,7 +178,9 @@ static int netlink_insert(struct sock *s
- 
- 	netlink_table_grab();
- 	sk_for_each(osk, node, &nl_table[sk->sk_protocol]) {
--		if (nlk_sk(osk)->pid == pid)
-+		if ((nlk_sk(osk)->pid == pid) &&
-+		    ve_accessible_strict(VE_OWNER_SK(osk),
-+			    get_exec_env()))
- 			break;
- 	}
- 	if (!node) {
-@@ -226,15 +220,16 @@ static int netlink_create(struct socket
- 	sk = sk_alloc(PF_NETLINK, GFP_KERNEL, 1, NULL);
- 	if (!sk)
- 		return -ENOMEM;
-+	if (ub_other_sock_charge(sk))
-+		goto out_free;
- 
- 	sock_init_data(sock,sk);
- 	sk_set_owner(sk, THIS_MODULE);
- 
- 	nlk = sk->sk_protinfo = kmalloc(sizeof(*nlk), GFP_KERNEL);
--	if (!nlk) {
--		sk_free(sk);
--		return -ENOMEM;
--	}
-+	if (!nlk)
-+		goto out_free;
-+
- 	memset(nlk, 0, sizeof(*nlk));
- 
- 	spin_lock_init(&nlk->cb_lock);
-@@ -244,6 +239,10 @@ static int netlink_create(struct socket
- 
- 	sk->sk_protocol = protocol;
- 	return 0;
-+
-+out_free:
-+	sk_free(sk);
-+	return -ENOMEM;
- }
- 
- static int netlink_release(struct socket *sock)
-@@ -255,6 +254,7 @@ static int netlink_release(struct socket
- 		return 0;
- 
- 	netlink_remove(sk);
-+	sock_orphan(sk);
- 	nlk = nlk_sk(sk);
- 
- 	spin_lock(&nlk->cb_lock);
-@@ -269,7 +269,6 @@ static int netlink_release(struct socket
- 	/* OK. Socket is unlinked, and, therefore,
- 	   no new packets will arrive */
- 
--	sock_orphan(sk);
- 	sock->sk = NULL;
- 	wake_up_interruptible_all(&nlk->wait);
- 
-@@ -292,13 +291,15 @@ static int netlink_autobind(struct socke
- 	struct sock *sk = sock->sk;
- 	struct sock *osk;
- 	struct hlist_node *node;
--	s32 pid = current->pid;
-+	s32 pid = virt_pid(current);
- 	int err;
- 
- retry:
- 	netlink_table_grab();
- 	sk_for_each(osk, node, &nl_table[sk->sk_protocol]) {
--		if (nlk_sk(osk)->pid == pid) {
-+		if ((nlk_sk(osk)->pid == pid) &&
-+		    ve_accessible_strict(VE_OWNER_SK(osk),
-+			    get_exec_env())){
- 			/* Bind collision, search negative pid values. */
- 			if (pid > 0)
- 				pid = -4096;
-@@ -319,7 +320,7 @@ retry:
- static inline int netlink_capable(struct socket *sock, unsigned flag)
- {
- 	return (nl_nonroot[sock->sk->sk_protocol] & flag) ||
--	       capable(CAP_NET_ADMIN);
-+	       capable(CAP_VE_NET_ADMIN);
- }
- 
- static int netlink_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
-@@ -465,7 +466,8 @@ struct sock *netlink_getsockbyfilp(struc
-  *	0: continue
-  *	1: repeat lookup - reference dropped while waiting for socket memory.
-  */
--int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo)
-+int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock,
-+		long timeo, struct sock *ssk)
- {
- 	struct netlink_opt *nlk;
- 
-@@ -479,7 +481,7 @@ int netlink_attachskb(struct sock *sk, s
- 	    test_bit(0, &nlk->state)) {
- 		DECLARE_WAITQUEUE(wait, current);
- 		if (!timeo) {
--			if (!nlk->pid)
-+			if (!ssk || nlk_sk(ssk)->pid == 0)
- 				netlink_overrun(sk);
- 			sock_put(sk);
- 			kfree_skb(skb);
-@@ -523,6 +525,11 @@ int netlink_sendskb(struct sock *sk, str
- 		return len;
- 	}
- #endif
-+	if (ub_sockrcvbuf_charge(sk, skb) < 0) {
-+		sock_put(sk);
-+		kfree_skb(skb);
-+		return -EACCES;
-+	}
- 
- 	skb_queue_tail(&sk->sk_receive_queue, skb);
- 	sk->sk_data_ready(sk, len);
-@@ -549,7 +556,7 @@ retry:
- 		kfree_skb(skb);
- 		return PTR_ERR(sk);
- 	}
--	err = netlink_attachskb(sk, skb, nonblock, timeo);
-+	err = netlink_attachskb(sk, skb, nonblock, timeo, ssk);
- 	if (err == 1)
- 		goto retry;
- 	if (err)
-@@ -570,12 +577,15 @@ static __inline__ int netlink_broadcast_
- #endif
- 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
- 	    !test_bit(0, &nlk->state)) {
-+		if (ub_sockrcvbuf_charge(sk, skb))
-+			goto out;
- 		skb_orphan(skb);
- 		skb_set_owner_r(skb, sk);
- 		skb_queue_tail(&sk->sk_receive_queue, skb);
- 		sk->sk_data_ready(sk, skb->len);
- 		return 0;
- 	}
-+out:
- 	return -1;
- }
- 
-@@ -601,6 +611,9 @@ int netlink_broadcast(struct sock *ssk,
- 		if (nlk->pid == pid || !(nlk->groups & group))
- 			continue;
- 
-+		if (!ve_accessible_strict(get_exec_env(), VE_OWNER_SK(sk)))
-+			continue;
-+
- 		if (failure) {
- 			netlink_overrun(sk);
- 			continue;
-@@ -656,6 +669,9 @@ void netlink_set_err(struct sock *ssk, u
- 		if (nlk->pid == pid || !(nlk->groups & group))
- 			continue;
- 
-+		if (!ve_accessible_strict(get_exec_env(), VE_OWNER_SK(sk)))
-+			continue;
-+
- 		sk->sk_err = code;
- 		sk->sk_error_report(sk);
- 	}
-@@ -678,12 +694,17 @@ static int netlink_sendmsg(struct kiocb
- 	struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
- 	struct sock *sk = sock->sk;
- 	struct netlink_opt *nlk = nlk_sk(sk);
--	struct sockaddr_nl *addr=msg->msg_name;
-+	struct sockaddr_nl *addr = msg->msg_name;
- 	u32 dst_pid;
--	u32 dst_groups;
- 	struct sk_buff *skb;
- 	int err;
- 	struct scm_cookie scm;
-+	struct sock *dstsk;
-+	long timeo;
-+	int no_ubc, no_buf;
-+	unsigned long chargesize;
-+
-+	DECLARE_WAITQUEUE(wait, current);
- 
- 	if (msg->msg_flags&MSG_OOB)
- 		return -EOPNOTSUPP;
-@@ -694,17 +715,16 @@ static int netlink_sendmsg(struct kiocb
- 	if (err < 0)
- 		return err;
- 
-+	/* Broadcasts are disabled as it was in 2.4 with UBC. According to
-+	 * ANK this is OK. Den */
- 	if (msg->msg_namelen) {
- 		if (addr->nl_family != AF_NETLINK)
- 			return -EINVAL;
- 		dst_pid = addr->nl_pid;
--		dst_groups = addr->nl_groups;
--		if (dst_groups && !netlink_capable(sock, NL_NONROOT_SEND))
-+		if (addr->nl_groups && !netlink_capable(sock, NL_NONROOT_SEND))
- 			return -EPERM;
--	} else {
-+	} else
- 		dst_pid = nlk->dst_pid;
--		dst_groups = nlk->dst_groups;
--	}
- 
- 	if (!nlk->pid) {
- 		err = netlink_autobind(sock);
-@@ -717,13 +737,13 @@ static int netlink_sendmsg(struct kiocb
- 		goto out;
- 	err = -ENOBUFS;
- 	skb = alloc_skb(len, GFP_KERNEL);
--	if (skb==NULL)
-+	if (skb == NULL)
- 		goto out;
- 
- 	NETLINK_CB(skb).pid	= nlk->pid;
- 	NETLINK_CB(skb).groups	= nlk->groups;
- 	NETLINK_CB(skb).dst_pid = dst_pid;
--	NETLINK_CB(skb).dst_groups = dst_groups;
-+	NETLINK_CB(skb).dst_groups = 0;
- 	memcpy(NETLINK_CREDS(skb), &siocb->scm->creds, sizeof(struct ucred));
- 
- 	/* What can I do? Netlink is asynchronous, so that
-@@ -733,25 +753,88 @@ static int netlink_sendmsg(struct kiocb
- 	 */
- 
- 	err = -EFAULT;
--	if (memcpy_fromiovec(skb_put(skb,len), msg->msg_iov, len)) {
--		kfree_skb(skb);
--		goto out;
--	}
-+	if (memcpy_fromiovec(skb_put(skb,len), msg->msg_iov, len))
-+		goto out_free;
- 
- 	err = security_netlink_send(sk, skb);
--	if (err) {
--		kfree_skb(skb);
--		goto out;
-+	if (err)
-+		goto out_free;
-+
-+	timeo = sock_sndtimeo(sk, msg->msg_flags&MSG_DONTWAIT);
-+retry:
-+	dstsk = netlink_getsockbypid(sk, dst_pid);
-+	if (IS_ERR(dstsk)) {
-+		err = PTR_ERR(dstsk);
-+		goto out_free;
-+	}
-+
-+	nlk = nlk_sk(dstsk);
-+#ifdef NL_EMULATE_DEV
-+	if (nlk->handler) {
-+		skb_orphan(skb);
-+		err = nlk->handler(protocol, skb);
-+		goto out_put;
- 	}
-+#endif
-+
-+	/* BTW, it could be done once, before the retry loop */
-+	chargesize = skb_charge_fullsize(skb);
-+	no_ubc = ub_sock_getwres_other(sk, chargesize);
-+	no_buf = atomic_read(&dstsk->sk_rmem_alloc) > dstsk->sk_rcvbuf ||
-+			test_bit(0, &nlk->state);
-+	if (no_ubc || no_buf) {
-+		wait_queue_head_t *sleep;
-+
-+		if (!no_ubc)
-+			ub_sock_retwres_other(sk, chargesize,
-+					SOCK_MIN_UBCSPACE_CH);
-+		err = -EAGAIN;
-+		if (timeo == 0) {
-+			kfree_skb(skb);
-+			goto out_put;
-+		}
- 
--	if (dst_groups) {
--		atomic_inc(&skb->users);
--		netlink_broadcast(sk, skb, dst_pid, dst_groups, GFP_KERNEL);
-+		/* wake up comes to different queues */
-+		sleep = no_ubc ? sk->sk_sleep : &nlk->wait;
-+		__set_current_state(TASK_INTERRUPTIBLE);
-+		add_wait_queue(sleep, &wait);
-+
-+		/* this if can't be moved upper because ub_sock_snd_queue_add()
-+		 * may change task state to TASK_RUNNING */
-+		if (no_ubc)
-+			ub_sock_sndqueueadd_other(sk, chargesize);
-+
-+		if ((atomic_read(&dstsk->sk_rmem_alloc) > dstsk->sk_rcvbuf ||
-+				test_bit(0, &nlk->state) || no_ubc) &&
-+				!sock_flag(dstsk, SOCK_DEAD))
-+			timeo = schedule_timeout(timeo);
-+
-+		__set_current_state(TASK_RUNNING);
-+		remove_wait_queue(sleep, &wait);
-+		if (no_ubc)
-+			ub_sock_sndqueuedel(sk);
-+		sock_put(dstsk);
-+
-+		if (!signal_pending(current))
-+			goto retry;
-+		err = sock_intr_errno(timeo);
-+		goto out_free;
- 	}
--	err = netlink_unicast(sk, skb, dst_pid, msg->msg_flags&MSG_DONTWAIT);
- 
-+	skb_orphan(skb);
-+	skb_set_owner_r(skb, dstsk);
-+	ub_skb_set_charge(skb, sk, chargesize, UB_OTHERSOCKBUF);
-+	skb_queue_tail(&dstsk->sk_receive_queue, skb);
-+	dstsk->sk_data_ready(dstsk, len);
-+	err = len;
-+out_put:
-+	sock_put(dstsk);
- out:
- 	return err;
-+
-+out_free:
-+	kfree_skb(skb);
-+	return err;
- }
- 
- static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
-@@ -882,6 +965,10 @@ static int netlink_dump(struct sock *sk)
- 	skb = sock_rmalloc(sk, NLMSG_GOODSIZE, 0, GFP_KERNEL);
- 	if (!skb)
- 		return -ENOBUFS;
-+	if (ub_nlrcvbuf_charge(skb, sk) < 0) {
-+		kfree_skb(skb);
-+		return -EACCES;
-+	}
- 
- 	spin_lock(&nlk->cb_lock);
- 
-@@ -942,9 +1029,9 @@ int netlink_dump_start(struct sock *ssk,
- 		return -ECONNREFUSED;
- 	}
- 	nlk = nlk_sk(sk);
--	/* A dump is in progress... */
-+	/* A dump or destruction is in progress... */
- 	spin_lock(&nlk->cb_lock);
--	if (nlk->cb) {
-+	if (nlk->cb || sock_flag(sk, SOCK_DEAD)) {
- 		spin_unlock(&nlk->cb_lock);
- 		netlink_destroy_callback(cb);
- 		sock_put(sk);
-@@ -1198,6 +1285,7 @@ static int __init netlink_proto_init(voi
- 	}
- 	sock_register(&netlink_family_ops);
- #ifdef CONFIG_PROC_FS
-+	/* FIXME: virtualize before give access from VEs */
- 	proc_net_fops_create("netlink", 0, &netlink_seq_fops);
- #endif
- 	/* The netlink device handler may be needed early. */
-diff -uprN linux-2.6.8.1.orig/net/packet/af_packet.c linux-2.6.8.1-ve022stab078/net/packet/af_packet.c
---- linux-2.6.8.1.orig/net/packet/af_packet.c	2004-08-14 14:55:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/packet/af_packet.c	2006-05-11 13:05:42.000000000 +0400
-@@ -71,6 +71,8 @@
- #include <linux/module.h>
- #include <linux/init.h>
- 
-+#include <ub/ub_net.h>
-+
- #ifdef CONFIG_INET
- #include <net/inet_common.h>
- #endif
-@@ -260,7 +262,8 @@ static int packet_rcv_spkt(struct sk_buf
- 	 *	so that this procedure is noop.
- 	 */
- 
--	if (skb->pkt_type == PACKET_LOOPBACK)
-+	if (skb->pkt_type == PACKET_LOOPBACK ||
-+	    !ve_accessible(VE_OWNER_SKB(skb), VE_OWNER_SK(sk)))
- 		goto out;
- 
- 	if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
-@@ -449,6 +452,9 @@ static int packet_rcv(struct sk_buff *sk
- 	sk = pt->af_packet_priv;
- 	po = pkt_sk(sk);
- 
-+	if (!ve_accessible(VE_OWNER_SKB(skb), VE_OWNER_SK(sk)))
-+		goto drop;
-+
- 	skb->dev = dev;
- 
- 	if (dev->hard_header) {
-@@ -508,6 +514,9 @@ static int packet_rcv(struct sk_buff *sk
- 		if (pskb_trim(skb, snaplen))
- 			goto drop_n_acct;
- 
-+	if (ub_sockrcvbuf_charge(sk, skb))
-+		goto drop_n_acct;
-+
- 	skb_set_owner_r(skb, sk);
- 	skb->dev = NULL;
- 	dst_release(skb->dst);
-@@ -555,6 +564,9 @@ static int tpacket_rcv(struct sk_buff *s
- 	sk = pt->af_packet_priv;
- 	po = pkt_sk(sk);
- 
-+	if (!ve_accessible(VE_OWNER_SKB(skb), VE_OWNER_SK(sk)))
-+		goto drop;
-+
- 	if (dev->hard_header) {
- 		if (sk->sk_type != SOCK_DGRAM)
- 			skb_push(skb, skb->data - skb->mac.raw);
-@@ -604,6 +616,12 @@ static int tpacket_rcv(struct sk_buff *s
- 	if (snaplen > skb->len-skb->data_len)
- 		snaplen = skb->len-skb->data_len;
- 
-+	if (copy_skb &&
-+			ub_sockrcvbuf_charge(sk, copy_skb)) {
-+		spin_lock(&sk->sk_receive_queue.lock);
-+		goto ring_is_full;
-+	}
-+
- 	spin_lock(&sk->sk_receive_queue.lock);
- 	h = (struct tpacket_hdr *)packet_lookup_frame(po, po->head);
- 
-@@ -975,6 +993,8 @@ static int packet_create(struct socket *
- 	sk = sk_alloc(PF_PACKET, GFP_KERNEL, 1, NULL);
- 	if (sk == NULL)
- 		goto out;
-+	if (ub_other_sock_charge(sk))
-+		goto out_free;
- 
- 	sock->ops = &packet_ops;
- #ifdef CONFIG_SOCK_PACKET
-@@ -1394,11 +1414,16 @@ static int packet_notifier(struct notifi
- 	struct sock *sk;
- 	struct hlist_node *node;
- 	struct net_device *dev = (struct net_device*)data;
-+	struct ve_struct *ve;
- 
-+	ve = get_exec_env();
- 	read_lock(&packet_sklist_lock);
- 	sk_for_each(sk, node, &packet_sklist) {
- 		struct packet_opt *po = pkt_sk(sk);
- 
-+		if (!ve_accessible_strict(VE_OWNER_SK(sk), ve))
-+			continue;
-+
- 		switch (msg) {
- 		case NETDEV_UNREGISTER:
- #ifdef CONFIG_PACKET_MULTICAST
-@@ -1797,6 +1822,8 @@ static inline struct sock *packet_seq_id
- 	struct hlist_node *node;
- 
- 	sk_for_each(s, node, &packet_sklist) {
-+		if (!ve_accessible(VE_OWNER_SK(s), get_exec_env()))
-+			continue;
- 		if (!off--)
- 			return s;
- 	}
-@@ -1812,9 +1839,13 @@ static void *packet_seq_start(struct seq
- static void *packet_seq_next(struct seq_file *seq, void *v, loff_t *pos)
- {
- 	++*pos;
--	return  (v == SEQ_START_TOKEN)
--		? sk_head(&packet_sklist)
--		: sk_next((struct sock*)v) ;
-+	do {
-+		v = (v == SEQ_START_TOKEN)
-+			? sk_head(&packet_sklist)
-+			: sk_next((struct sock*)v);
-+	} while (v != NULL &&
-+		!ve_accessible(VE_OWNER_SK((struct sock*)v), get_exec_env()));
-+	return v;
- }
- 
- static void packet_seq_stop(struct seq_file *seq, void *v)
-diff -uprN linux-2.6.8.1.orig/net/rose/rose_route.c linux-2.6.8.1-ve022stab078/net/rose/rose_route.c
---- linux-2.6.8.1.orig/net/rose/rose_route.c	2004-08-14 14:56:23.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/rose/rose_route.c	2006-05-11 13:05:34.000000000 +0400
-@@ -727,7 +727,8 @@ int rose_rt_ioctl(unsigned int cmd, void
- 		}
- 		if (rose_route.mask > 10) /* Mask can't be more than 10 digits */
- 			return -EINVAL;
--
-+		if (rose_route.ndigis > 8) /* No more than 8 digipeats */
-+			return -EINVAL;
- 		err = rose_add_node(&rose_route, dev);
- 		dev_put(dev);
- 		return err;
-diff -uprN linux-2.6.8.1.orig/net/sched/sch_api.c linux-2.6.8.1-ve022stab078/net/sched/sch_api.c
---- linux-2.6.8.1.orig/net/sched/sch_api.c	2004-08-14 14:55:20.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sched/sch_api.c	2006-05-11 13:05:42.000000000 +0400
-@@ -1204,7 +1204,7 @@ static int __init pktsched_init(void)
- 
- 	register_qdisc(&pfifo_qdisc_ops);
- 	register_qdisc(&bfifo_qdisc_ops);
--	proc_net_fops_create("psched", 0, &psched_fops);
-+	__proc_net_fops_create("net/psched", 0, &psched_fops, NULL);
- 
- 	return 0;
- }
-diff -uprN linux-2.6.8.1.orig/net/sched/sch_cbq.c linux-2.6.8.1-ve022stab078/net/sched/sch_cbq.c
---- linux-2.6.8.1.orig/net/sched/sch_cbq.c	2004-08-14 14:54:49.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sched/sch_cbq.c	2006-05-11 13:05:36.000000000 +0400
-@@ -956,8 +956,8 @@ cbq_dequeue_prio(struct Qdisc *sch, int
- 
- 			if (cl->deficit <= 0) {
- 				q->active[prio] = cl;
--				cl = cl->next_alive;
- 				cl->deficit += cl->quantum;
-+				cl = cl->next_alive;
- 			}
- 			return skb;
- 
-@@ -1133,17 +1133,19 @@ static void cbq_normalize_quanta(struct
- 
- 	for (h=0; h<16; h++) {
- 		for (cl = q->classes[h]; cl; cl = cl->next) {
-+			long mtu;
- 			/* BUGGGG... Beware! This expression suffer of
- 			   arithmetic overflows!
- 			 */
- 			if (cl->priority == prio) {
--				cl->quantum = (cl->weight*cl->allot*q->nclasses[prio])/
--					q->quanta[prio];
--			}
--			if (cl->quantum <= 0 || cl->quantum>32*cl->qdisc->dev->mtu) {
--				printk(KERN_WARNING "CBQ: class %08x has bad quantum==%ld, repaired.\n", cl->classid, cl->quantum);
--				cl->quantum = cl->qdisc->dev->mtu/2 + 1;
-+				cl->quantum = (cl->weight * cl->allot) /
-+					(q->quanta[prio] / q->nclasses[prio]);
- 			}
-+			mtu = cl->qdisc->dev->mtu;
-+			if (cl->quantum <= mtu/2)
-+				cl->quantum = mtu/2 + 1;
-+			else if (cl->quantum > 32*mtu)
-+				cl->quantum = 32*mtu;
- 		}
- 	}
- }
-@@ -1746,15 +1748,20 @@ static void cbq_destroy_filters(struct c
- 	}
- }
- 
--static void cbq_destroy_class(struct cbq_class *cl)
-+static void cbq_destroy_class(struct Qdisc *sch, struct cbq_class *cl)
- {
-+	struct cbq_sched_data *q = qdisc_priv(sch);
-+
-+	BUG_TRAP(!cl->filters);
-+
- 	cbq_destroy_filters(cl);
- 	qdisc_destroy(cl->q);
- 	qdisc_put_rtab(cl->R_tab);
- #ifdef CONFIG_NET_ESTIMATOR
- 	qdisc_kill_estimator(&cl->stats);
- #endif
--	kfree(cl);
-+	if (cl != &q->link)
-+		kfree(cl);
- }
- 
- static void
-@@ -1767,22 +1774,23 @@ cbq_destroy(struct Qdisc* sch)
- #ifdef CONFIG_NET_CLS_POLICE
- 	q->rx_class = NULL;
- #endif
--	for (h = 0; h < 16; h++) {
-+	/*
-+	 * Filters must be destroyed first because we don't destroy the
-+	 * classes from root to leafs which means that filters can still
-+	 * be bound to classes which have been destroyed already. --TGR '04
-+	 */
-+	for (h = 0; h < 16; h++)
- 		for (cl = q->classes[h]; cl; cl = cl->next)
- 			cbq_destroy_filters(cl);
--	}
- 
- 	for (h = 0; h < 16; h++) {
- 		struct cbq_class *next;
- 
- 		for (cl = q->classes[h]; cl; cl = next) {
- 			next = cl->next;
--			if (cl != &q->link)
--				cbq_destroy_class(cl);
-+			cbq_destroy_class(sch, cl);
- 		}
- 	}
--
--	qdisc_put_rtab(q->link.R_tab);
- }
- 
- static void cbq_put(struct Qdisc *sch, unsigned long arg)
-@@ -1799,7 +1807,7 @@ static void cbq_put(struct Qdisc *sch, u
- 		spin_unlock_bh(&sch->dev->queue_lock);
- #endif
- 
--		cbq_destroy_class(cl);
-+		cbq_destroy_class(sch, cl);
- 	}
- }
- 
-@@ -2035,7 +2043,7 @@ static int cbq_delete(struct Qdisc *sch,
- 	sch_tree_unlock(sch);
- 
- 	if (--cl->refcnt == 0)
--		cbq_destroy_class(cl);
-+		cbq_destroy_class(sch, cl);
- 
- 	return 0;
- }
-diff -uprN linux-2.6.8.1.orig/net/sched/sch_generic.c linux-2.6.8.1-ve022stab078/net/sched/sch_generic.c
---- linux-2.6.8.1.orig/net/sched/sch_generic.c	2004-08-14 14:54:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sched/sch_generic.c	2006-05-11 13:05:42.000000000 +0400
-@@ -97,6 +97,9 @@ int qdisc_restart(struct net_device *dev
- 
- 	/* Dequeue packet */
- 	if ((skb = q->dequeue(q)) != NULL) {
-+		struct ve_struct *envid;
-+
-+		envid = set_exec_env(VE_OWNER_SKB(skb));
- 		if (spin_trylock(&dev->xmit_lock)) {
- 			/* Remember that the driver is grabbed by us. */
- 			dev->xmit_lock_owner = smp_processor_id();
-@@ -113,6 +116,7 @@ int qdisc_restart(struct net_device *dev
- 				spin_unlock(&dev->xmit_lock);
- 
- 				spin_lock(&dev->queue_lock);
-+				(void)set_exec_env(envid);
- 				return -1;
- 			}
- 		}
-@@ -134,6 +138,7 @@ int qdisc_restart(struct net_device *dev
- 				kfree_skb(skb);
- 				if (net_ratelimit())
- 					printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
-+				(void)set_exec_env(envid);
- 				return -1;
- 			}
- 			__get_cpu_var(netdev_rx_stat).cpu_collision++;
-@@ -151,6 +156,7 @@ int qdisc_restart(struct net_device *dev
- 
- 		q->ops->requeue(skb, q);
- 		netif_schedule(dev);
-+		(void)set_exec_env(envid);
- 		return 1;
- 	}
- 	return q->q.qlen;
-@@ -557,3 +563,4 @@ EXPORT_SYMBOL(qdisc_reset);
- EXPORT_SYMBOL(qdisc_restart);
- EXPORT_SYMBOL(qdisc_lock_tree);
- EXPORT_SYMBOL(qdisc_unlock_tree);
-+EXPORT_SYMBOL(dev_shutdown);
-diff -uprN linux-2.6.8.1.orig/net/sched/sch_teql.c linux-2.6.8.1-ve022stab078/net/sched/sch_teql.c
---- linux-2.6.8.1.orig/net/sched/sch_teql.c	2004-08-14 14:54:51.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sched/sch_teql.c	2006-05-11 13:05:42.000000000 +0400
-@@ -186,6 +186,9 @@ static int teql_qdisc_init(struct Qdisc
- 	struct teql_master *m = (struct teql_master*)sch->ops;
- 	struct teql_sched_data *q = qdisc_priv(sch);
- 
-+	if (!capable(CAP_NET_ADMIN))
-+		return -EPERM;
-+
- 	if (dev->hard_header_len > m->dev->hard_header_len)
- 		return -EINVAL;
- 
-diff -uprN linux-2.6.8.1.orig/net/sctp/socket.c linux-2.6.8.1-ve022stab078/net/sctp/socket.c
---- linux-2.6.8.1.orig/net/sctp/socket.c	2004-08-14 14:56:25.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sctp/socket.c	2006-05-11 13:05:33.000000000 +0400
-@@ -4052,12 +4052,8 @@ SCTP_STATIC int sctp_msghdr_parse(const
- 	for (cmsg = CMSG_FIRSTHDR(msg);
- 	     cmsg != NULL;
- 	     cmsg = CMSG_NXTHDR((struct msghdr*)msg, cmsg)) {
--		/* Check for minimum length.  The SCM code has this check. */
--		if (cmsg->cmsg_len < sizeof(struct cmsghdr) ||
--		    (unsigned long)(((char*)cmsg - (char*)msg->msg_control)
--				    + cmsg->cmsg_len) > msg->msg_controllen) {
-+		if (!CMSG_OK(msg, cmsg))
- 			return -EINVAL;
--		}
- 
- 		/* Should we parse this header or ignore?  */
- 		if (cmsg->cmsg_level != IPPROTO_SCTP)
-diff -uprN linux-2.6.8.1.orig/net/socket.c linux-2.6.8.1-ve022stab078/net/socket.c
---- linux-2.6.8.1.orig/net/socket.c	2004-08-14 14:55:10.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/socket.c	2006-05-11 13:05:42.000000000 +0400
-@@ -81,6 +81,7 @@
- #include <linux/syscalls.h>
- #include <linux/compat.h>
- #include <linux/kmod.h>
-+#include <linux/in.h>
- 
- #ifdef CONFIG_NET_RADIO
- #include <linux/wireless.h>		/* Note : will define WIRELESS_EXT */
-@@ -1071,6 +1072,37 @@ int sock_wake_async(struct socket *sock,
- 	return 0;
- }
- 
-+int vz_security_proto_check(int family, int type, int protocol)
-+{
-+#ifdef CONFIG_VE
-+	if (ve_is_super(get_exec_env()))
-+		return 0;
-+
-+	switch (family) {
-+	case PF_UNSPEC:
-+	case PF_PACKET:
-+	case PF_NETLINK:
-+	case PF_UNIX:
-+		break;
-+	case PF_INET:
-+		switch (protocol) {
-+		case IPPROTO_IP:
-+		case IPPROTO_ICMP:
-+		case IPPROTO_TCP:
-+		case IPPROTO_UDP:
-+		case IPPROTO_RAW:
-+			break;
-+		default:
-+			return -EAFNOSUPPORT;
-+		}
-+		break;
-+	default:
-+		return -EAFNOSUPPORT;
-+	}
-+#endif
-+	return 0;
-+}
-+
- static int __sock_create(int family, int type, int protocol, struct socket **res, int kern)
- {
- 	int i;
-@@ -1099,6 +1131,11 @@ static int __sock_create(int family, int
- 		family = PF_PACKET;
- 	}
- 
-+	/* VZ compatibility layer */
-+	err = vz_security_proto_check(family, type, protocol);
-+	if (err < 0)
-+		return err;
-+
- 	err = security_socket_create(family, type, protocol, kern);
- 	if (err)
- 		return err;
-@@ -1746,10 +1783,11 @@ asmlinkage long sys_sendmsg(int fd, stru
- 		goto out_freeiov;
- 	ctl_len = msg_sys.msg_controllen;
- 	if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
--		err = cmsghdr_from_user_compat_to_kern(&msg_sys, ctl, sizeof(ctl));
-+		err = cmsghdr_from_user_compat_to_kern(&msg_sys, sock->sk, ctl, sizeof(ctl));
- 		if (err)
- 			goto out_freeiov;
- 		ctl_buf = msg_sys.msg_control;
-+		ctl_len = msg_sys.msg_controllen;
- 	} else if (ctl_len) {
- 		if (ctl_len > sizeof(ctl))
- 		{
-diff -uprN linux-2.6.8.1.orig/net/sunrpc/clnt.c linux-2.6.8.1-ve022stab078/net/sunrpc/clnt.c
---- linux-2.6.8.1.orig/net/sunrpc/clnt.c	2004-08-14 14:55:19.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sunrpc/clnt.c	2006-05-11 13:05:42.000000000 +0400
-@@ -164,10 +164,10 @@ rpc_create_client(struct rpc_xprt *xprt,
- 	}
- 
- 	/* save the nodename */
--	clnt->cl_nodelen = strlen(system_utsname.nodename);
-+	clnt->cl_nodelen = strlen(ve_utsname.nodename);
- 	if (clnt->cl_nodelen > UNX_MAXNODENAME)
- 		clnt->cl_nodelen = UNX_MAXNODENAME;
--	memcpy(clnt->cl_nodename, system_utsname.nodename, clnt->cl_nodelen);
-+	memcpy(clnt->cl_nodename, ve_utsname.nodename, clnt->cl_nodelen);
- 	return clnt;
- 
- out_no_auth:
-diff -uprN linux-2.6.8.1.orig/net/sunrpc/sched.c linux-2.6.8.1-ve022stab078/net/sunrpc/sched.c
---- linux-2.6.8.1.orig/net/sunrpc/sched.c	2004-08-14 14:55:33.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sunrpc/sched.c	2006-05-11 13:05:25.000000000 +0400
-@@ -1125,9 +1125,9 @@ rpciod(void *ptr)
- 			spin_lock_bh(&rpc_queue_lock);
- 		}
- 		__rpc_schedule();
--		if (current->flags & PF_FREEZE) {
-+		if (test_thread_flag(TIF_FREEZE)) {
- 			spin_unlock_bh(&rpc_queue_lock);
--			refrigerator(PF_FREEZE);
-+			refrigerator();
- 			spin_lock_bh(&rpc_queue_lock);
- 		}
- 
-diff -uprN linux-2.6.8.1.orig/net/sunrpc/svcsock.c linux-2.6.8.1-ve022stab078/net/sunrpc/svcsock.c
---- linux-2.6.8.1.orig/net/sunrpc/svcsock.c	2004-08-14 14:54:49.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sunrpc/svcsock.c	2006-05-11 13:05:44.000000000 +0400
-@@ -362,6 +362,9 @@ svc_sendto(struct svc_rqst *rqstp, struc
- 	size_t		base = xdr->page_base;
- 	unsigned int	pglen = xdr->page_len;
- 	unsigned int	flags = MSG_MORE;
-+	struct ve_struct *old_env;
-+
-+	old_env = set_exec_env(get_ve0());
- 
- 	slen = xdr->len;
- 
-@@ -426,6 +429,8 @@ out:
- 		rqstp->rq_sock, xdr->head[0].iov_base, xdr->head[0].iov_len, xdr->len, len,
- 		rqstp->rq_addr.sin_addr.s_addr);
- 
-+	(void)set_exec_env(old_env);
-+
- 	return len;
- }
- 
-@@ -438,9 +443,12 @@ svc_recv_available(struct svc_sock *svsk
- 	mm_segment_t	oldfs;
- 	struct socket	*sock = svsk->sk_sock;
- 	int		avail, err;
-+	struct ve_struct *old_env;
- 
- 	oldfs = get_fs(); set_fs(KERNEL_DS);
-+	old_env = set_exec_env(get_ve0());
- 	err = sock->ops->ioctl(sock, TIOCINQ, (unsigned long) &avail);
-+	(void)set_exec_env(old_env);
- 	set_fs(oldfs);
- 
- 	return (err >= 0)? avail : err;
-@@ -455,6 +463,7 @@ svc_recvfrom(struct svc_rqst *rqstp, str
- 	struct msghdr	msg;
- 	struct socket	*sock;
- 	int		len, alen;
-+	struct ve_struct *old_env;
- 
- 	rqstp->rq_addrlen = sizeof(rqstp->rq_addr);
- 	sock = rqstp->rq_sock->sk_sock;
-@@ -466,7 +475,9 @@ svc_recvfrom(struct svc_rqst *rqstp, str
- 
- 	msg.msg_flags	= MSG_DONTWAIT;
- 
-+	old_env = set_exec_env(get_ve0());
- 	len = kernel_recvmsg(sock, &msg, iov, nr, buflen, MSG_DONTWAIT);
-+	(void)set_exec_env(get_ve0());
- 
- 	/* sock_recvmsg doesn't fill in the name/namelen, so we must..
- 	 * possibly we should cache this in the svc_sock structure
-@@ -770,17 +781,19 @@ svc_tcp_accept(struct svc_sock *svsk)
- 	struct proto_ops *ops;
- 	struct svc_sock	*newsvsk;
- 	int		err, slen;
-+	struct ve_struct *old_env;
- 
- 	dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
- 	if (!sock)
- 		return;
- 
-+	old_env = set_exec_env(get_ve0());
- 	err = sock_create_lite(PF_INET, SOCK_STREAM, IPPROTO_TCP, &newsock);
- 	if (err) {
- 		if (err == -ENOMEM)
- 			printk(KERN_WARNING "%s: no more sockets!\n",
- 			       serv->sv_name);
--		return;
-+		goto restore;
- 	}
- 
- 	dprintk("svc: tcp_accept %p allocated\n", newsock);
-@@ -874,6 +887,8 @@ svc_tcp_accept(struct svc_sock *svsk)
- 
- 	}
- 
-+	(void)set_exec_env(old_env);
-+
- 	if (serv->sv_stats)
- 		serv->sv_stats->nettcpconn++;
- 
-@@ -881,6 +896,8 @@ svc_tcp_accept(struct svc_sock *svsk)
- 
- failed:
- 	sock_release(newsock);
-+restore:
-+	(void)set_exec_env(old_env);
- 	return;
- }
- 
-@@ -1227,8 +1244,8 @@ svc_recv(struct svc_serv *serv, struct s
- 
- 		schedule_timeout(timeout);
- 
--		if (current->flags & PF_FREEZE)
--			refrigerator(PF_FREEZE);
-+		if (test_thread_flag(TIF_FREEZE))
-+			refrigerator();
- 
- 		spin_lock_bh(&serv->sv_lock);
- 		remove_wait_queue(&rqstp->rq_wait, &wait);
-@@ -1397,6 +1414,7 @@ svc_create_socket(struct svc_serv *serv,
- 	struct socket	*sock;
- 	int		error;
- 	int		type;
-+	struct ve_struct *old_env;
- 
- 	dprintk("svc: svc_create_socket(%s, %d, %u.%u.%u.%u:%d)\n",
- 				serv->sv_program->pg_name, protocol,
-@@ -1410,8 +1428,10 @@ svc_create_socket(struct svc_serv *serv,
- 	}
- 	type = (protocol == IPPROTO_UDP)? SOCK_DGRAM : SOCK_STREAM;
- 
-+	old_env = set_exec_env(get_ve0());
-+
- 	if ((error = sock_create_kern(PF_INET, type, protocol, &sock)) < 0)
--		return error;
-+		goto restore;
- 
- 	if (sin != NULL) {
- 		if (type == SOCK_STREAM)
-@@ -1427,12 +1447,16 @@ svc_create_socket(struct svc_serv *serv,
- 		goto bummer;
- 	}
- 
--	if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL)
-+	if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL) {
-+		(void)set_exec_env(old_env);
- 		return 0;
-+	}
- 
- bummer:
- 	dprintk("svc: svc_create_socket error = %d\n", -error);
- 	sock_release(sock);
-+restore:
-+	(void)set_exec_env(old_env);
- 	return error;
- }
- 
-@@ -1450,6 +1474,8 @@ svc_delete_socket(struct svc_sock *svsk)
- 	serv = svsk->sk_server;
- 	sk = svsk->sk_sk;
- 
-+	/* XXX: serialization? */
-+	sk->sk_user_data = NULL;
- 	sk->sk_state_change = svsk->sk_ostate;
- 	sk->sk_data_ready = svsk->sk_odata;
- 	sk->sk_write_space = svsk->sk_owspace;
-diff -uprN linux-2.6.8.1.orig/net/sunrpc/xprt.c linux-2.6.8.1-ve022stab078/net/sunrpc/xprt.c
---- linux-2.6.8.1.orig/net/sunrpc/xprt.c	2004-08-14 14:55:47.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/sunrpc/xprt.c	2006-05-11 13:05:42.000000000 +0400
-@@ -246,6 +246,7 @@ xprt_sendmsg(struct rpc_xprt *xprt, stru
- 	int		addrlen = 0;
- 	unsigned int	skip;
- 	int		result;
-+	struct ve_struct *old_env;
- 
- 	if (!sock)
- 		return -ENOTCONN;
-@@ -263,7 +264,9 @@ xprt_sendmsg(struct rpc_xprt *xprt, stru
- 	skip = req->rq_bytes_sent;
- 
- 	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-+	old_env = set_exec_env(get_ve0());
- 	result = xdr_sendpages(sock, addr, addrlen, xdr, skip, MSG_DONTWAIT);
-+	(void)set_exec_env(old_env);
- 
- 	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", xdr->len - skip, result);
- 
-@@ -484,6 +487,7 @@ static void xprt_socket_connect(void *ar
- 	struct rpc_xprt *xprt = (struct rpc_xprt *)args;
- 	struct socket *sock = xprt->sock;
- 	int status = -EIO;
-+	struct ve_struct *old_env;
- 
- 	if (xprt->shutdown || xprt->addr.sin_port == 0)
- 		goto out;
-@@ -508,8 +512,10 @@ static void xprt_socket_connect(void *ar
- 	/*
- 	 * Tell the socket layer to start connecting...
- 	 */
-+	old_env = set_exec_env(get_ve0());
- 	status = sock->ops->connect(sock, (struct sockaddr *) &xprt->addr,
- 			sizeof(xprt->addr), O_NONBLOCK);
-+	(void)set_exec_env(old_env);
- 	dprintk("RPC: %p  connect status %d connected %d sock state %d\n",
- 			xprt, -status, xprt_connected(xprt), sock->sk->sk_state);
- 	if (status < 0) {
-@@ -1506,13 +1512,16 @@ static inline int xprt_bindresvport(stru
- 		.sin_family = AF_INET,
- 	};
- 	int		err, port;
-+	struct ve_struct *old_env;
- 
- 	/* Were we already bound to a given port? Try to reuse it */
- 	port = xprt->port;
- 	do {
- 		myaddr.sin_port = htons(port);
-+		old_env = set_exec_env(get_ve0());
- 		err = sock->ops->bind(sock, (struct sockaddr *) &myaddr,
- 						sizeof(myaddr));
-+		(void)set_exec_env(old_env);
- 		if (err == 0) {
- 			xprt->port = port;
- 			return 0;
-@@ -1588,15 +1597,18 @@ static struct socket * xprt_create_socke
- {
- 	struct socket	*sock;
- 	int		type, err;
-+	struct ve_struct *old_env;
- 
- 	dprintk("RPC:      xprt_create_socket(%s %d)\n",
- 			   (proto == IPPROTO_UDP)? "udp" : "tcp", proto);
- 
- 	type = (proto == IPPROTO_UDP)? SOCK_DGRAM : SOCK_STREAM;
- 
-+	old_env = set_exec_env(get_ve0());
-+
- 	if ((err = sock_create_kern(PF_INET, type, proto, &sock)) < 0) {
- 		printk("RPC: can't create socket (%d).\n", -err);
--		return NULL;
-+		goto out;
- 	}
- 
- 	/* If the caller has the capability, bind to a reserved port */
-@@ -1605,10 +1617,13 @@ static struct socket * xprt_create_socke
- 		goto failed;
- 	}
- 
-+	(void)set_exec_env(old_env);
- 	return sock;
- 
- failed:
- 	sock_release(sock);
-+out:
-+	(void)set_exec_env(old_env);
- 	return NULL;
- }
- 
-diff -uprN linux-2.6.8.1.orig/net/unix/af_unix.c linux-2.6.8.1-ve022stab078/net/unix/af_unix.c
---- linux-2.6.8.1.orig/net/unix/af_unix.c	2004-08-14 14:55:35.000000000 +0400
-+++ linux-2.6.8.1-ve022stab078/net/unix/af_unix.c	2006-05-11 13:05:42.000000000 +0400
-@@ -119,6 +119,9 @@
- #include <net/checksum.h>
- #include <linux/security.h>
- 
-+#include <ub/ub_net.h>
-+#include <ub/beancounter.h>
-+
- int sysctl_unix_max_dgram_qlen = 10;
- 
- kmem_cache_t *unix_sk_cachep;
-@@ -242,6 +245,8 @@ static struct sock *__unix_find_socket_b
- 	sk_for_each(s, node, &unix_socket_table[hash ^ type]) {
- 		struct unix_sock *u = unix_sk(s);
- 
-+		if (!ve_accessible(VE_OWNER_SK(s), get_exec_env()))
-+			continue;
- 		if (u->addr->len == len &&
- 		    !memcmp(u->addr->name, sunname, len))
- 			goto found;
-@@ -446,7 +451,7 @@ static int unix_listen(struct socket *so
- 	sk->sk_max_ack_backlog	= backlog;
- 	sk->sk_state		= TCP_LISTEN;
- 	/* set credentials so connect can copy them */
--	sk->sk_peercred.pid	= current->tgid;
-+	sk->sk_peercred.pid	= virt_tgid(current);
- 	sk->sk_peercred.uid	= current->euid;
- 	sk->sk_peercred.gid	= current->egid;
- 	err = 0;
-@@ -553,6 +558,8 @@ static struct sock * unix_create1(struct
- 			      unix_sk_cachep);
- 	if (!sk)
- 		goto out;
-+	if (ub_other_sock_charge(sk))
-+		goto out_sk_free;
- 
- 	atomic_inc(&unix_nr_socks);
- 
-@@ -572,6 +579,9 @@ static struct sock * unix_create1(struct
- 	unix_insert_socket(unix_sockets_unbound, sk);
- out:
- 	return sk;
-+out_sk_free:
-+	sk_free(sk);
-+	return NULL;
- }
- 
- static int unix_create(struct socket *sock, int protocol)
-@@ -677,7 +687,7 @@ static struct sock *unix_find_other(stru
- 		err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd);
- 		if (err)
- 			goto fail;
--		err = permission(nd.dentry->d_inode,MAY_WRITE, &nd);
-+		err = permission(nd.dentry->d_inode, MAY_WRITE, &nd, NULL);
- 		if (err)
- 			goto put_fail;
- 
-@@ -955,6 +965,7 @@ static int unix_stream_connect(struct so
- 	int st;
- 	int err;
- 	long timeo;
-+	unsigned long chargesize;
- 
- 	err = unix_mkname(sunaddr, addr_len, &hash);
- 	if (err < 0)
-@@ -982,6 +993,10 @@ static int unix_stream_connect(struct so
- 	skb = sock_wmalloc(newsk, 1, 0, GFP_KERNEL);
- 	if (skb == NULL)
- 		goto out;
-+	chargesize = skb_charge_fullsize(skb);
-+	if (ub_sock_getwres_other(newsk, chargesize) < 0)
-+		goto out;
-+	ub_skb_set_charge(skb, newsk, chargesize, UB_OTHERSOCKBUF);
- 
- restart:
- 	/*  Find listening sock. */
-@@ -1065,7 +1080,7 @@ restart:
- 	unix_peer(newsk)	= sk;
- 	newsk->sk_state		= TCP_ESTABLISHED;
- 	newsk->sk_type		= sk->sk_type;
--	newsk->sk_peercred.pid	= current->tgid;
-+	newsk->sk_peercred.pid	= virt_tgid(current);
- 	newsk->sk_peercred.uid	= current->euid;
- 	newsk->sk_peercred.gid	= current->egid;
- 	newu = unix_sk(newsk);
-@@ -1127,7 +1142,7 @@ static int unix_socketpair(struct socket
- 	sock_hold(skb);
- 	unix_peer(ska)=skb;
- 	unix_peer(skb)=ska;
--	ska->sk_peercred.pid = skb->sk_peercred.pid = current->tgid;
-+	ska->sk_peercred.pid = skb->sk_peercred.pid = virt_tgid(current);
- 	ska->sk_peercred.uid = skb->sk_peercred.uid = current->euid;
- 	ska->sk_peercred.gid = skb->sk_peercred.gid = current->egid;
- 
-@@ -1450,6 +1465,16 @@ static int unix_stream_sendmsg(struct ki
- 
- 		size=len-sent;
- 
-+		if (msg->msg_flags & MSG_DONTWAIT)
-+			ub_sock_makewres_other(sk, skb_charge_size(size));
-+		if (sock_bc(sk) != NULL &&
-+				sock_bc(sk)->poll_reserv >=
-+					SOCK_MIN_UBCSPACE &&
-+				skb_charge_size(size) >
-+					sock_bc(sk)->poll_reserv)
-+			size = skb_charge_datalen(sock_bc(sk)->poll_reserv);
-+
-+
- 		/* Keep two messages in the pipe so it schedules better */
- 		if (size > sk->sk_sndbuf / 2 - 64)
- 			size = sk->sk_sndbuf / 2 - 64;
-@@ -1461,7 +1486,8 @@ static int unix_stream_sendmsg(struct ki
- 		 *	Grab a buffer
- 		 */
- 
--		skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
-+		skb = sock_alloc_send_skb2(sk, size, SOCK_MIN_UBCSPACE,
-+					msg->msg_flags&MSG_DONTWAIT, &err);
- 
- 		if (skb==NULL)
- 			goto out_err;
-@@ -1546,9 +1572,11 @@ static int unix_dgram_recvmsg(struct kio
- 
- 	msg->msg_namelen = 0;
- 
-+	down(&u->readsem);
-+
- 	skb = skb_recv_datagram(sk, flags, noblock, &err);
- 	if (!skb)
--		goto out;
-+		goto out_unlock;
- 
- 	wake_up_interruptible(&u->peer_wait);
- 
-@@ -1598,6 +1626,8 @@ static int unix_dgram_recvmsg(struct kio
- 
- out_free:
- 	skb_free_datagram(sk,skb);
-+out_unlock:
-+	up(&u->readsem);
- out:
- 	return err;
- }
-@@ -1859,6 +1889,7 @@ static unsigned int unix_poll(struct fil
- {
- 	struct sock *sk = sock->sk;
- 	unsigned int mask;
-+	int no_ub_res;
- 
- 	poll_wait(file, sk->sk_sleep, wait);
- 	mask = 0;
-@@ -1869,6 +1900,10 @@ static unsigned int unix_poll(struct fil
- 	if (sk->sk_shutdown == SHUTDOWN_MASK)
- 		mask |= POLLHUP;
- 
-+	no_ub_res = ub_sock_makewres_other(sk, SOCK_MIN_UBCSPACE_CH);
-+	if (no_ub_res)
-+		ub_sock_sndqueueadd_other(sk, SOCK_MIN_UBCSPACE_CH);
-+
- 	/* readable? */
- 	if (!skb_queue_empty(&sk->sk_receive_queue) ||
- 	    (sk->sk_shutdown & RCV_SHUTDOWN))
- 		mask |= POLLIN | POLLRDNORM;
- 
- 	/* writable? */
-@@ -1882,7 +1917,7 @@ static unsigned int unix_poll(struct fil
- 	 * we set writable also when the other side has shut down the
- 	 * connection. This prevents stuck sockets.
- */ -- if (unix_writable(sk)) -+ if (!no_ub_res && unix_writable(sk)) - mask |= POLLOUT | POLLWRNORM | POLLWRBAND; - - return mask; -diff -uprN linux-2.6.8.1.orig/net/xfrm/xfrm_user.c linux-2.6.8.1-ve022stab078/net/xfrm/xfrm_user.c ---- linux-2.6.8.1.orig/net/xfrm/xfrm_user.c 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/net/xfrm/xfrm_user.c 2006-05-11 13:05:34.000000000 +0400 -@@ -1139,6 +1139,9 @@ struct xfrm_policy *xfrm_compile_policy( - if (nr > XFRM_MAX_DEPTH) - return NULL; - -+ if (p->dir > XFRM_POLICY_OUT) -+ return NULL; -+ - xp = xfrm_policy_alloc(GFP_KERNEL); - if (xp == NULL) { - *dir = -ENOBUFS; -diff -uprN linux-2.6.8.1.orig/scripts/kconfig/mconf.c linux-2.6.8.1-ve022stab078/scripts/kconfig/mconf.c ---- linux-2.6.8.1.orig/scripts/kconfig/mconf.c 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/scripts/kconfig/mconf.c 2006-05-11 13:05:32.000000000 +0400 -@@ -88,7 +88,7 @@ static char *args[1024], **argptr = args - static int indent; - static struct termios ios_org; - static int rows = 0, cols = 0; --static struct menu *current_menu; -+struct menu *current_menu; - static int child_count; - static int do_resize; - static int single_menu_mode; -diff -uprN linux-2.6.8.1.orig/security/commoncap.c linux-2.6.8.1-ve022stab078/security/commoncap.c ---- linux-2.6.8.1.orig/security/commoncap.c 2004-08-14 14:55:19.000000000 +0400 -+++ linux-2.6.8.1-ve022stab078/security/commoncap.c 2006-05-11 13:05:49.000000000 +0400 -@@ -17,6 +17,7 @@ - #include <linux/mman.h> - #include <linux/pagemap.h> - #include <linux/swap.h> -+#include <linux/virtinfo.h> - #include <linux/smp_lock.h> - #include <linux/skbuff.h> - #include <linux/netlink.h> -@@ -174,7 +175,7 @@ int cap_inode_setxattr(struct dentry *de - { - if (!strncmp(name, XATTR_SECURITY_PREFIX, - sizeof(XATTR_SECURITY_PREFIX) - 1) && -- !capable(CAP_SYS_ADMIN)) -+ !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_ADMIN)) - return -EPERM; - return 0; - } -@@ -183,7 +184,7 @@ int 
cap_inode_removexattr(struct dentry - { - if (!strncmp(name, XATTR_SECURITY_PREFIX, - sizeof(XATTR_SECURITY_PREFIX) - 1) && -- !capable(CAP_SYS_ADMIN)) -+ !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_ADMIN)) - return -EPERM; - return 0; - } -@@ -289,7 +290,7 @@ void cap_task_reparent_to_init (struct t - - int cap_syslog (int type) - { -- if ((type != 3 && type != 10) && !capable(CAP_SYS_ADMIN)) -+ if ((type != 3 && type != 10) && !capable(CAP_VE_SYS_ADMIN)) - return -EPERM; - return 0; - } -@@ -311,6 +312,18 @@ int cap_vm_enough_memory(long pages) - - vm_acct_memory(pages); - -+#ifdef CONFIG_USER_RESOURCE -+ switch (virtinfo_notifier_call(VITYPE_GENERAL, VIRTINFO_ENOUGHMEM, -+ (void *)pages) -+ & (NOTIFY_OK | NOTIFY_FAIL)) { -+ case NOTIFY_OK: -+ return 0; -+ case NOTIFY_FAIL: -+ vm_unacct_memory(pages); -+ return -ENOMEM; -+ } -+#endif -+ - /* - * Sometimes we want to use more memory than we have - */ -diff -uprN linux-2.6.8.1.orig/arch/i386/Kconfig linux-2.6.8.1-ve022test023/arch/i386/Kconfig ---- linux-2.6.8.1.orig/arch/i386/Kconfig 2004-08-14 14:54:50.000000000 +0400 -+++ linux-2.6.8.1-ve022test023/arch/i386/Kconfig 2005-06-08 13:32:09.000000000 +0400 -@@ -424,6 +424,54 @@ config X86_OOSTORE - depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR - default y - -+config X86_4G -+ bool "4 GB kernel-space and 4 GB user-space virtual memory support" -+ help -+ This option is only useful for systems that have more than 1 GB -+ of RAM. -+ -+ The default kernel VM layout leaves 1 GB of virtual memory for -+ kernel-space mappings, and 3 GB of VM for user-space applications. -+ This option ups both the kernel-space VM and the user-space VM to -+ 4 GB. -+ -+ The cost of this option is additional TLB flushes done at -+ system-entry points that transition from user-mode into kernel-mode. -+ I.e. system calls and page faults, and IRQs that interrupt user-mode -+ code. There's also additional overhead to kernel operations that copy -+ memory to/from user-space. 
The overhead from this is hard to tell and -+ depends on the workload - it can be anything from no visible overhead -+ to 20-30% overhead. A good rule of thumb is to count with a runtime -+ overhead of 20%. -+ -+ The upside is the much increased kernel-space VM, which more than -+ quadruples the maximum amount of RAM supported. Kernels compiled with -+ this option boot on 64GB of RAM and still have more than 3.1 GB of -+ 'lowmem' left. Another bonus is that highmem IO bouncing decreases, -+ if used with drivers that still use bounce-buffers. -+ -+ There's also a 33% increase in user-space VM size - database -+ applications might see a boost from this. -+ -+ But the cost of the TLB flushes and the runtime overhead has to be -+ weighed against the bonuses offered by the larger VM spaces. The -+ dividing line depends on the actual workload - there might be 4 GB -+ systems that benefit from this option. Systems with less than 4 GB -+ of RAM will rarely see a benefit from this option - but it's not -+ out of question, the exact circumstances have to be considered. -+ -+config X86_SWITCH_PAGETABLES -+ def_bool X86_4G -+ -+config X86_4G_VM_LAYOUT -+ def_bool X86_4G -+ -+config X86_UACCESS_INDIRECT -+ def_bool X86_4G -+ -+config X86_HIGH_ENTRY -+ def_bool X86_4G -+ - config HPET_TIMER - bool "HPET Timer Support" - help -@@ -482,6 +530,28 @@ config NR_CPUS - This is purely to save memory - each supported CPU adds - approximately eight kilobytes to the kernel image. - -+config FAIRSCHED -+ bool "Fair CPU scheduler (EXPERIMENTAL)" -+ default y -+ help -+ Config option for Fair CPU scheduler (fairsched). -+ This option allows to group processes to scheduling nodes -+ which receive CPU proportional to their weight. -+ This is very important feature for process groups isolation and -+ QoS management. -+ -+ If unsure, say N. 
-+ -+config SCHED_VCPU -+ bool "VCPU scheduler support" -+ depends on SMP || FAIRSCHED -+ default FAIRSCHED -+ help -+ VCPU scheduler support adds additional layer of abstraction -+ which allows to virtualize cpu notion and split physical cpus -+ and virtual cpus. This support allows to use CPU fair scheduler, -+ dynamically add/remove cpus to/from VPS and so on. -+ - config SCHED_SMT - bool "SMT (Hyperthreading) scheduler support" - depends on SMP -@@ -1242,6 +1316,14 @@ config MAGIC_SYSRQ - keys are documented in <file:Documentation/sysrq.txt>. Don't say Y - unless you really know what this hack does. - -+config SYSRQ_DEBUG -+ bool "Debugging via sysrq keys" -+ depends on MAGIC_SYSRQ -+ help -+ Say Y if you want to extend functionality of magic key. It will -+ provide you with some debugging facilities such as dumping and -+ writing memory, resolving symbols and some other. -+ - config DEBUG_SPINLOCK - bool "Spinlock debugging" - depends on DEBUG_KERNEL -@@ -1298,6 +1380,14 @@ config 4KSTACKS - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. - -+config NMI_WATCHDOG -+ bool "NMI Watchdog" -+ default y -+ help -+ If you say Y here the kernel will activate NMI watchdog by default -+ on boot. You can still activate NMI watchdog via nmi_watchdog -+ command line option even if you say N here. 
-+ - config X86_FIND_SMP_CONFIG - bool - depends on X86_LOCAL_APIC || X86_VOYAGER -@@ -1310,12 +1400,18 @@ config X86_MPPARSE - - endmenu - -+menu "OpenVZ" -+source "kernel/Kconfig.openvz" -+endmenu -+ - source "security/Kconfig" - - source "crypto/Kconfig" - - source "lib/Kconfig" - -+source "kernel/ub/Kconfig" -+ - config X86_SMP - bool - depends on SMP && !X86_VOYAGER -diff -uprN linux-2.6.8.1.orig/drivers/net/Makefile linux-2.6.8.1-ve022stab028/drivers/net/Makefile ---- linux-2.6.8.1.orig/drivers/net/Makefile 2004-08-14 14:55:09.000000000 +0400 -+++ linux-2.6.8.1-ve022stab028/drivers/net/Makefile 2005-07-22 11:16:23.000000000 +0400 -@@ -11,6 +11,9 @@ obj-$(CONFIG_IBM_EMAC) += ibm_emac/ - obj-$(CONFIG_IXGB) += ixgb/ - obj-$(CONFIG_BONDING) += bonding/ - -+obj-$(CONFIG_VE_NETDEV) += vznetdev.o -+vznetdev-objs := open_vznet.o venet_core.o -+ - # - # link order important here - # -diff -uprN linux-2.6.8.1.orig/fs/Kconfig linux-2.6.8.1-ve022stab038/fs/Kconfig ---- linux-2.6.8.1.orig/fs/Kconfig 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab038/fs/Kconfig 2005-09-22 14:49:52.000000000 +0400 -@@ -417,6 +417,15 @@ config QUOTA - with the quota tools. Probably the quota support is only useful for - multi user systems. If unsure, say N. - -+config QUOTA_COMPAT -+ bool "Compatibility with older quotactl interface" -+ depends on QUOTA -+ help -+ This option enables compatibility layer for older version -+ of quotactl interface with byte granularity (QUOTAON at 0x0100, -+ GETQUOTA at 0x0D00). Interface versions older than that one and -+ with block granularity are still not supported. -+ - config QFMT_V1 - tristate "Old quota format support" - depends on QUOTA -@@ -433,6 +442,38 @@ config QFMT_V2 - need this functionality say Y here. Note that you will need recent - quota utilities (>= 3.01) for new quota format with this kernel. 
- -+config SIM_FS -+ tristate "VPS filesystem" -+ depends on VZ_QUOTA -+ default m -+ help -+ This file system is a part of Virtuozzo. It intoduces a fake -+ superblock and blockdev to VE to hide real device and show -+ statfs results taken from quota. -+ -+config VZ_QUOTA -+ tristate "Virtuozzo Disk Quota support" -+ depends on QUOTA -+ default m -+ help -+ Virtuozzo Disk Quota imposes disk quota on directories with their -+ files and subdirectories in total. Such disk quota is used to -+ account and limit disk usage by Virtuozzo VPS, but also may be used -+ separately. -+ -+config VZ_QUOTA_UNLOAD -+ bool "Unloadable Virtuozzo Disk Quota module" -+ depends on VZ_QUOTA=m -+ default n -+ help -+ Make Virtuozzo Disk Quota module unloadable. -+ Doesn't work reliably now. -+ -+config VZ_QUOTA_UGID -+ bool "Per-user and per-group quota in Virtuozzo quota partitions" -+ depends on VZ_QUOTA!=n -+ default y -+ - config QUOTACTL - bool - depends on XFS_QUOTA || QUOTA -diff -uprN linux-2.6.8.1.orig/kernel/Makefile linux-2.6.8.1-ve022stab036/kernel/Makefile ---- linux-2.6.8.1.orig/kernel/Makefile 2004-08-14 14:54:51.000000000 +0400 -+++ linux-2.6.8.1-ve022stab036/kernel/Makefile 2005-09-17 15:18:16.000000000 +0400 -@@ -2,13 +2,22 @@ - # Makefile for the linux kernel. 
- # - --obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \ -+obj-y = sched.o fairsched.o \ -+ fork.o exec_domain.o panic.o printk.o profile.o \ - exit.o itimer.o time.o softirq.o resource.o \ - sysctl.o capability.o ptrace.o timer.o user.o \ - signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o intermodule.o extable.o params.o posix-timers.o \ - kthread.o - -+obj-$(CONFIG_VE) += ve.o -+obj-y += ub/ -+obj-y += veowner.o -+obj-$(CONFIG_VE_CALLS) += vzdev.o -+obj-$(CONFIG_VZ_WDOG) += vzwdog.o -+obj-$(CONFIG_VE_CALLS) += vzmon.o -+vzmon-objs = vecalls.o -+ - obj-$(CONFIG_FUTEX) += futex.o - obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o - obj-$(CONFIG_SMP) += cpu.o -diff -uprN linux-2.6.8.1.orig/fs/Makefile linux-2.6.8.1-ve022stab026/fs/Makefile ---- linux-2.6.8.1.orig/fs/Makefile 2004-08-14 14:55:33.000000000 +0400 -+++ linux-2.6.8.1-ve022stab026/fs/Makefile 2005-07-08 16:26:55.000000000 +0400 -@@ -36,6 +36,12 @@ obj-$(CONFIG_QUOTA) += dquot.o - obj-$(CONFIG_QFMT_V1) += quota_v1.o - obj-$(CONFIG_QFMT_V2) += quota_v2.o - obj-$(CONFIG_QUOTACTL) += quota.o -+obj-$(CONFIG_VZ_QUOTA) += vzdquota.o -+vzdquota-y += vzdquot.o vzdq_mgmt.o vzdq_ops.o vzdq_tree.o -+vzdquota-$(CONFIG_VZ_QUOTA_UGID) += vzdq_ugid.o -+vzdquota-$(CONFIG_VZ_QUOTA_UGID) += vzdq_file.o -+ -+obj-$(CONFIG_SIM_FS) += simfs.o - - obj-$(CONFIG_PROC_FS) += proc/ - obj-y += partitions/ -diff -uprN linux-2.6.8.1.orig/arch/x86_64/Kconfig linux-2.6.8.1-ve022stab036/arch/x86_64/Kconfig ---- linux-2.6.8.1.orig/arch/x86_64/Kconfig 2004-08-14 14:55:59.000000000 +0400 -+++ linux-2.6.8.1-ve022stab036/arch/x86_64/Kconfig 2005-09-17 15:18:15.000000000 +0400 -@@ -239,6 +239,28 @@ config PREEMPT - Say Y here if you are feeling brave and building a kernel for a - desktop, embedded or real-time system. Say N if you are unsure. - -+config FAIRSCHED -+ bool "Fair CPU scheduler (EXPERIMENTAL)" -+ default y -+ help -+ Config option for Fair CPU scheduler (fairsched). 
-+ This option allows to group processes to scheduling nodes -+ which receive CPU proportional to their weight. -+ This is very important feature for process groups isolation and -+ QoS management. -+ -+ If unsure, say N. -+ -+config SCHED_VCPU -+ bool "VCPU scheduler support" -+ depends on SMP || FAIRSCHED -+ default FAIRSCHED -+ help -+ VCPU scheduler support adds additional layer of abstraction -+ which allows to virtualize cpu notion and split physical cpus -+ and virtual cpus. This support allows to use CPU fair scheduler, -+ dynamically add/remove cpus to/from VPS and so on. -+ - config SCHED_SMT - bool "SMT (Hyperthreading) scheduler support" - depends on SMP -@@ -499,9 +525,14 @@ config IOMMU_LEAK - - endmenu - -+menu "OpenVZ" -+source "kernel/Kconfig.openvz" -+endmenu -+ - source "security/Kconfig" - - source "crypto/Kconfig" - - source "lib/Kconfig" - -+source "kernel/ub/Kconfig" -diff -uprN linux-2.6.8.1.orig/arch/ia64/Kconfig linux-2.6.8.1-ve022stab042/arch/ia64/Kconfig ---- linux-2.6.8.1.orig/arch/ia64/Kconfig 2004-08-14 14:56:22.000000000 +0400 -+++ linux-2.6.8.1-ve022stab042/arch/ia64/Kconfig 2005-10-14 14:56:03.000000000 +0400 -@@ -251,6 +251,28 @@ config PREEMPT - Say Y here if you are building a kernel for a desktop, embedded - or real-time system. Say N if you are unsure. - -+config FAIRSCHED -+ bool "Fair CPU scheduler (EXPERIMENTAL)" -+ default y -+ help -+ Config option for Fair CPU scheduler (fairsched). -+ This option allows to group processes to scheduling nodes -+ which receive CPU proportional to their weight. -+ This is very important feature for process groups isolation and -+ QoS management. -+ -+ If unsure, say N. -+ -+config SCHED_VCPU -+ bool "VCPU scheduler support" -+ depends on SMP || FAIRSCHED -+ default FAIRSCHED -+ help -+ VCPU scheduler support adds additional layer of abstraction -+ which allows to virtualize cpu notion and split physical cpus -+ and virtual cpus. 
This support allows to use CPU fair scheduler, -+ dynamically add/remove cpus to/from VPS and so on. -+ - config HAVE_DEC_LOCK - bool - depends on (SMP || PREEMPT) -@@ -486,6 +512,12 @@ config SYSVIPC_COMPAT - default y - endmenu - -+menu "OpenVZ" -+source "kernel/Kconfig.openvz" -+endmenu -+ - source "security/Kconfig" - - source "crypto/Kconfig" -+ -+source "kernel/ub/Kconfig" |