I do apologize for the EuroBSDCon slides. I'd redone the title page and made some changes to the slides, and they didn't make it through for approval by this afternoon.

Okay, so I'm going to be talking about isolating jobs for performance and predictability in clusters. Before I get into that, I want to talk a little bit about who we are and what our problem space is like, because that has an effect on our solution space.

I work for The Aerospace Corporation. We operate a federally funded research and development center in the area of national security space. In particular we work with the Air Force Space and Missile Systems Center and with the National Reconnaissance Office, and our engineers support a wide variety of activities within that area. We have a bit over fourteen hundred — sorry, twenty-four hundred — engineers, in virtually every discipline. As you would expect, we have our rocket scientists, we have people who build satellites, we have people who build the sensors that go on satellites, and people who study the sorts of things you see when you use those sensors.
We also have civil engineers and electronics engineers and computer people, so we literally do everything related to space, and all sorts of things you might not expect to be related to space, since we also, for instance, help build ground systems, because satellites aren't very useful if there isn't anything to talk to them.

And since these engineers are solving all these different problems, we have engineering applications in virtually every size you can think of, ranging from little spreadsheet things that you might not think of as engineering applications, but they are, to MATLAB programs, or a lot of C code — traditional, for us, serial code — and then large parallel applications, either in-house things like genetic algorithms, or classic parallel codes like you would run on a Cray: materials simulation or fluid flow, that sort of thing. So we have this big application space. I just wanted to give a little introduction to that, because it does come back and influence the sort of solutions we look at.

So, for the rest of the talk — we skipped a slide; there we are, that's a little better. What I'm interested in is high performance computing: I do high performance computing at the company, and I provide high performance computing resources to our users as part of my role in our technical computing services organization.

Our primary resource at this point is the Fellowship cluster — it's named for the Fellowship of the Ring. It's a few hundred compute nodes plus the core systems, and over here there's a large Cisco switch.
Actually, today there are two 6509s in there, because we couldn't get the port density we wanted otherwise. It's primarily a gigabit Ethernet system, and it runs FreeBSD, currently 6.0 because we haven't upgraded it yet; we're planning to move probably to 7.1, or maybe slightly past 7.1 if we want to get the latest HWPMC changes in.

We use the Sun Grid Engine scheduler, which is one of the two main options for open-source resource managers on clusters, the other being TORQUE and the Maui combination from Cluster Resources.

For storage we have a Sun Thumper — that's actually 40 TB, but that's really the raw number; it's thirty-two usable once you start using RAID-Z2, since you might actually like to still have your data should a disk fail, and with today's disks RAID 5 doesn't really cut it.

And then we also have some other resources coming on — two smaller clusters, unfortunately probably running Linux, and some SMPs — but I'm going to be concentrating here on the work we're doing on our FreeBSD-based cluster.

So, first of all I want to talk about why we want to share resources — it should be fairly obvious, but I'll talk about it a little bit — and then what goes wrong when you start sharing resources. After that I'll talk about some different solutions to those problems, and some fairly trivial experiments that we've done so far in terms of enhancing the scheduler or using operating system features to mitigate those problems, and then conclude with some future work.
So, obviously, if you have a resource the size of our cluster — fourteen hundred cores, roughly — you probably want to share it. Unless you purpose-built it for a single application, you're going to want to have your users sharing it, and you don't want to just say, you know, "you get it on Monday"; that's probably not going to be a very effective option, especially not with as many users as we have.

We also can't afford to buy another one every time a user shows up. One of our senior VPs said a while back that we could probably afford to buy just about anything we could need — once. We can't buy ten of them, though. If we really, really needed it, dropping small numbers of millions of dollars on computing resources wouldn't be impossible, but we can't just have every engineer who wants one call up Dell and say "ship me ten racks." It's not going to work.

And the other thing is that we also need to provide quick turnaround for some users, so we can't have one user hogging the system until they're done and only then letting the next one run, because we have some users who'll come in and say, "well, I need to run for three months" — and we've had users come in and literally run, pretty much using the entire system, for three months — so we've had to provide some ability for other users to still get their work done. So we do have to have some sharing.

However, when you start to share any resource like this, you start getting contention: users need the same thing at the same time, so they fight back and forth for it and can't get what they want, and you have to balance them a bit. Also, some jobs lie when they request resources and actually need more than they ask for, which can cause problems. We schedule them, we say "you're going to fit here fine," and they run off and use more than they said, and if we don't have a mechanism to constrain them, we have problems.
Likewise, once these users start to contend, that doesn't just result in the jobs taking longer in terms of wall-clock time because they run more slowly; there's overhead related to that contention. They get swapped out due to pressure on various subsystems — if you, for instance, run out of memory, then you go into swap and you end up wasting all your cycles pulling junk in and out of disk, and wasting your bandwidth on that. So there are resource costs to the contention, not merely a delay in returning results.

So now I'm going to switch gears and talk a little bit about different solutions to these contention issues and look at different ways of solving the problem. Most of these are things that have already been done, but I want to go through the different approaches and then evaluate them in our context.

A classic solution to the problem is gang scheduling. It's basically conventional Unix process context switching writ large: you have your parallel job running on a system, it runs for a while, and after a certain amount of time you kick it off of all the nodes and let the next one come in. Typically when people do this they do it on the order of hours, because the context-switch time is extremely high. It's not just like swapping a process in and out: you suddenly have to coordinate this context switch across all your processes. If you're running, say, MPI over TCP, you actually need to tear down the TCP sessions, because you can't just have TCP timers sitting around, that sort of thing. So there's a lot of overhead associated with this.
You take a long context switch, but if all of your infrastructure supports this, it's fairly effective, and it does allow jobs to avoid interfering with each other, which is nice — you don't have issues, because you're typically allocating whole swaths of the system. And for properly written applications, partial results can be returned, which for some of our users is really important. Where you're doing a refinement, users want to look at the results and say: okay, is this just going off into the weeds, or does it look like it's actually converging on some sort of useful solution? They don't want to wait until the end.

The downside, of course, is that the context-switch costs are very high, and most importantly there's really a lack of useful implementations. A number of platforms have implemented this in the past, but in practice, on modern clusters built from commodity hardware with communication libraries written on standard protocols, the tools just aren't there, and so it's not very practical.

Also, it doesn't really make a lot of sense with small jobs. One of the things we've found is that we have users with embarrassingly parallel problems who need to look at, say, twenty thousand studies. They could write something that looked more like a conventional parallel application, where they wrote a scheduler and set up MPI — a Message Passing Interface — and handed out tasks to pieces of their job, and then you could gang-schedule that. But then they would be running a scheduler, and they would probably do a bad job of it — it turns out it's actually fairly difficult to do right, even in the trivial case. So what they do instead is submit twenty thousand jobs to Grid Engine and say, okay, whatever, deal with it. In earlier versions that might have been a problem; current versions of the code easily handle a million jobs, so it's not really a big deal. But those sorts of users wouldn't fit well into a gang-scheduled environment, at least not a conventional one where you do gang scheduling at the granularity of jobs, so from that perspective it wouldn't work very well. If you do have all the pieces in place and you are running big parallel applications, it is in fact an extremely effective approach.
Another option, which is sort of related — in fact it takes an even coarser granularity — is single-application or single-project clusters, or sub-clusters. For instance, this is used at some national labs, where you're given a cycle allocation for a year based on your grant proposal, and what your cycle allocation actually comes to you as is: here's your cluster, here's a frontend, here's this chunk of nodes, they're yours, go to it. Install your own OS, whatever you want; it's yours.

At a somewhat finer scale there are things such as Emulab — which is the network emulation system, but it also does OS install and configuration, so you could do dynamic allocation that way — or Sun's Project Hedeby (now productized as, I think, Service Domain Manager), or Clusters on Demand — they were actually talking about web-hosting clusters, but it's the same idea. Things that allow rapid deployment let you do this at a more granular level than the allocate-them-once-a-year approach, but nonetheless let you give people whole clusters to work with.

One nice thing about it is that the isolation between the processes is complete, so you don't have to worry about users stomping on each other. It's their own system; they can trash it all they want. If they flood the network or run the nodes into swap, well, that's their problem. It also has the advantage that you can tailor the images — the operating systems on the nodes — to meet the exact needs of the application.

The downside, of course, is its coarse granularity; in our environment that doesn't work very well, since we do have all these different types of jobs. Context switches are also pretty expensive — certainly on the order of minutes; Emulab typically claims something like ten minutes. There are some systems out there — for instance if you use, I think it's coreboot they're calling it today;
it used to be LinuxBIOS — where you can actually deploy a system in tens of seconds, mostly by getting rid of all that junk the BIOS writers wrote; the OS boots pretty fast if you don't have all that stuff to waylay you. But in practice, on off-the-shelf hardware, the context-switch times are quite high.

Users, of course, can still interfere with themselves. You can argue that's not a problem, but ideally you would like to prevent it. One of the things I have to deal with is that my users are almost universally not trained as computer scientists or programmers. They're trained in their domain area, and they're really good in that area, but their concepts of the way hardware and software work don't match reality in many cases.

(Inaudible question.) It's pretty rare in practice. Well, I've heard of one lab that does it significantly, but they do it on sort of a yearly allocation basis and throw the hardware away after two or three years, and you do typically have some sort of deployment system in place. Or, in those types of cases, usually your application comes with "here's what we're going to spend, on this many people, on this project," so it's a big resource allocation.

And, yeah, one other issue with this is that there's no real easy way to capture underutilized resources. For example, if you have an application which is, say, single-threaded and uses a ton of memory, and it's running on a machine — the machines we're buying these days are eight-core — that's wasting a lot of CPU cycles; you're just generating a lot of heat doing nothing. Ideally you would like a scheduler that said: okay, you're using seven of the eight gigabytes of RAM, but we've got these jobs sitting here that need next to nothing — need a hundred megabytes — so we'll slap seven of those in along with the big job and backfill. With this mechanism there's no good way to do that. Obviously, if one user has that application mix, they can do it themselves, but it's not something where we can easily bring in more jobs and mix them to take advantage of the different resources.
A related approach is to install virtualization software on the equipment. This is the essence of what cloud computing is at the moment — it's Amazon providing Xen hosting for relatively arbitrary OS images. It does have the advantage that it allows rapid deployment, and in theory, if your application is scalable, it provides for extremely high scalability, particularly if you aren't us and therefore can possibly use somebody else's hardware. In our applications' case that's not very practical, so we can't do that.

It also has the advantage that people can have their own image in there, which is tightly resource-constrained, but you can run more than one of them on a node. So, for instance, you can give one job four cores and another job two cores and have a couple of single-core jobs. In theory you get fairly strong isolation there — obviously there are shared resources underneath, and you probably can't afford to completely isolate, say, network bandwidth at the bottom layer; you can do some, but if you go overboard you can spend all your time on accounting.

You can, again, tailor the images to the job, and in this environment you can actually do that even more strongly than in the sub-cluster approach, in that you can often run a five- or ten-year-old operating system if you're using full virtualization. That can allow obsolete code with weird baselines to work, which is important in our space, because the average program — our average project — runs ten years or more, and as a result you might have to go rerun a program that was written way back on some ancient version of Windows or whatever.

It also provides the ability to recover resources, as I was talking about before, which you can't do easily with sub-clusters, because here you can just slip another image onto the node, say "you can use anything," and give that image idle priority, essentially.

The downside, of course, is that the isolation is incomplete, in that there is shared hardware. You're not likely to find, I don't think, any of the virtualization systems out there right now that
virtualize your segment of memory bandwidth or your segment of cache space, so users can in fact interfere with themselves and each other in this environment. It's also not really efficient for small jobs: the cost of running an entire OS for every job is fairly high. Even with relatively light Unix-like OSes, you're still looking at a couple hundred megabytes in practice once you get everything up and running, unless you run something totally stripped down. There's significant overhead. There's CPU slowdown — typical estimates are in the twenty percent range, though numbers really range from fifty percent to five percent depending on what exactly you're doing, possibly even lower, or higher. And because you have the whole OS there, there's a lot of duplicated stuff. The various vendors have their answers — they claim, you know, "we can merge that: you're running the same kernel, so we'll share the memory" — but at some level it's all going to get duplicated.

A related option comes from the internet hosting industry, which is to use the technology from virtual private servers. The example everyone here is probably familiar with is jails, where you can provide your own file system root, your own network interface, and whatnot. The nice thing about this is that, unlike full virtualization, the overhead is very small: basically it costs you an entry in your process table, an entry in a few structures, and some extra tests in the kernel, but otherwise there's not a huge overhead for the virtualization, and you don't need an extra kernel for every image. So you get the difference here between being able to squeeze maybe two hundred VMware images onto a machine — the VMware people say "no, no, don't do that," but we have machines running nearly that many —
and, on the other hand, people out there who run thousands of virtual hosts using this technique on a single machine. So there's a big difference in resource use, especially with lightly loaded instances. In our environment we're looking at running a very small number of them, but still, that overhead is significant.

You still have some ability to tailor the images to a job's needs. You could have a custom root, so that, for instance, you could be running FreeBSD 6.0 in one virtual server and 7.0 in another — you have to be running a 7.0 or 8.0 kernel to make that work, but it allows you to do it. We can also, in principle, do evil things like a 64-bit kernel with 32-bit user spaces, for when you have applications you can't find the source to anymore, or libraries you don't have the source to anymore — there are interesting things there. And the other nice thing is that since you're doing a very lightweight and incomplete virtualization, you don't have to virtualize things you don't care about, so you don't have the overhead of virtualizing everything.

The downsides, of course, are incomplete isolation — you are running processes on the same kernel, and they can interfere with each other — and dubious flexibility: obviously, I don't think anyone should have the ability to run Windows in a jail. There's some NetBSD support, but I don't think it's really gotten to that point.
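To make that concrete, starting one of these lightweight virtual servers looks roughly like the following; this is just an illustration, not anything from our cluster, and the path, hostname, and address are made up.

    #!/usr/bin/env ruby
    # Hypothetical example: start a jail with its own file system root,
    # hostname, and IP address using the FreeBSD 6.x/7.x-era jail(8) syntax
    # (jail path hostname ip-number command).  All the names are invented.
    root = "/data/jails/job1"          # a pre-populated userland for this jail
    host = "job1.cluster.example.org"  # hostname seen inside the jail
    addr = "10.1.1.10"                 # address aliased onto the node for the jail

    # Unlike booting a VM image, this just starts a constrained process tree;
    # the kernel-side cost is an entry in a few structures.
    system("jail", root, host, addr, "/bin/sh", "/etc/rc") or
      abort "failed to start jail at #{root}"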
One final area, which sort of diverges from this, is the classic Unix solution to the problem on a single machine, which is to use existing resource limits and resource-partitioning techniques. For example, all Unix-like systems have per-process resource limits, and cluster schedulers typically support the common ones, so you can set a memory limit or a CPU-time limit on your process, and the schedulers typically provide at least launch-time support for applying those limits to the set of processes that is part of the job.

There are also a number of forms of resource partitioning available as standard features. Memory disks are one: if you want a file system space that's limited in size, create a memory disk and back it with an mmapped file or with swap, as a way of partitioning disk use. And then there are techniques like CPU affinities, where you can lock a process to a single processor or a set of processors so it can't interfere with processes running on other processors.

The nice thing about this is, first, that you're using existing facilities, so you don't have to write lots of new features for a niche application, and they tend to integrate well with existing schedulers — in many cases parts of them are already implemented. In fact, the experiments I'll talk about later all use this type of technique.
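As a rough illustration of those per-process limits — this is a sketch, not our launcher, and the numbers are arbitrary — a tiny launcher can apply them before running the job's command, and everything the job forks inherits them:

    #!/usr/bin/env ruby
    # Illustrative only: apply classic per-process limits, then run the job's
    # command under them.  Limits are inherited across fork and exec, so
    # everything the job starts is covered too.
    abort "usage: limit.rb command [args]" if ARGV.empty?
    Process.setrlimit(Process::RLIMIT_CPU, 3600)         # at most an hour of CPU time
    Process.setrlimit(Process::RLIMIT_AS,  2 * 1024**3)  # at most 2 GB of address space
    exec(*ARGV)                                          # e.g. ruby limit.rb ./my_job arg1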
The cons are, of course, incomplete isolation again, and there's typically no unified framework for the concept of a job, where a job is composed of a set of processes. There are a number of data structures within the kernel — for instance the session — which sort of aggregate processes, but there isn't one in BSD or Linux at this point that lets you place resource limits on them the way you can on a process. IRIX did have support like that, where there's a job ID and there can be a job limit, and Solaris projects are sort of similar but not quite the same — processes are part of a project, but it isn't quite the same inherited relationship.

And typically there aren't limits on things like bandwidth. There was a sort of bandwidth-limiting, nice(1)-style interface that I saw posted as a research project many years ago, I think in the 2.x days, where you could say "this process can have five megabits" or whatever, but I haven't really seen anything take off; that would be a pretty neat thing to have. Actually, one other exception there is on IRIX again: the XFS file system supported guaranteed data rates on file handles. You could open a file and say "I need ten megabits read" or "ten megabits write" or whatever, and it would say okay or no, and then you could read and write and it would do evil things at the file system layer in some cases, all to ensure that you got that streaming data rate.
So now I'm going to talk about what we've done. What we needed was a solution that handles a wide range of job types. Of the options we looked at — for instance, single-application or single-project clusters — I think the isolation they provide is essentially unparalleled, but in our environment we would probably have to virtualize in order to be efficient, in terms of handling our job mix and the fact that our users tend to have spikes in their use on a large scale. For instance, the GPS folks will show up and say "we need to run for a month," and then some indeterminate number of months later they'll do it again. For that sort of quick demand we really need something virtualized, and then we have to pay the price of the overhead; and again it doesn't handle small jobs well, and that's a large portion of our job mix — of the quarter million or so jobs we've run on our cluster, I would guess that more than half were submitted in batches of more than ten thousand.

The other approach we looked at is using resource limits. The nice thing, of course, is that they achieve useful isolation, and they're implementable with either existing functionality or small extensions, so that's what we've concentrated on. We've also been doing some thinking about whether we could take the techniques there and combine them with jails or related features — maybe bulking up jails to be more like zones in Solaris, or containers, I think they're calling them this week — so we're looking at that as well, to be able to provide per-user operating environments, potentially isolating users from upgrades. For instance, as we upgrade the kernel, users could continue using the old images if they don't have time to rebuild their applications, and deal with the updates in libraries and whatnot later. They also have the potential to provide strong isolation for security purposes, which could be useful in the future.
We do think that, of these mechanisms, the resource limits and partitioning scheme and virtual private servers have very similar implementation requirements — setup is a fair bit more expensive in the VPS case, but nonetheless they're fairly similar.

So, what we've been doing is working with Sun Grid Engine. We originally intended to actually extend Sun Grid Engine and modify its daemons to do the work, but what we ended up doing instead is realizing that we can specify an alternate program to run instead of the shepherd. The shepherd is the process that starts the script for each job on a given node; it collects usage and forwards signals to the children, and it's also responsible for starting remote components. So a shepherd is started and then, traditionally, in Sun Grid Engine it starts its own rsh daemon and jobs connect over that — these days they have their own mechanism, which is secure and doesn't use the crufty old rsh code.

So what we've done is implement a wrapper script which allows a pre-command hook to run before the shepherd starts — so before we start the shepherd we can run env, or true, or whatever we need to set up the environment it runs in, or the CPU set setup I'll show later — and a post-command hook for cleanup. It's implemented in Ruby, because I felt like it.
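The real script has more to it, but the shape of it is roughly the sketch below; the install path and the hook directories here are illustrative guesses rather than our actual layout:

    #!/usr/bin/env ruby
    # Hypothetical shepherd wrapper: SGE is told to start this instead of
    # sge_shepherd.  The paths are assumptions for illustration.
    SHEPHERD = "/opt/sge/bin/fbsd-amd64/sge_shepherd"

    def run_hooks(dir)
      Dir.glob(File.join(dir, "*")).sort.each do |hook|
        system(hook) or warn "hook #{hook} failed"
      end
    end

    run_hooks("/usr/local/etc/sge/pre_command.d")   # e.g. make the per-job TMPDIR, set up a CPU set
    ok = system(SHEPHERD, *ARGV)                    # run the real shepherd with the same arguments
    run_hooks("/usr/local/etc/sge/post_command.d")  # e.g. tear the md device and CPU set back down
    exit(ok ? 0 : 1)                                # a real wrapper would pass the exact status through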
The first thing we implemented was memory-backed temporary directories. The motivation for this is that we've had problems where users will, you know, run /tmp out on the nodes. The way we have the nodes configured, they do have disks, and most of the disk is available as /tmp. We had some cases, particularly early on, where users would fill up the disks and not delete things — their job would crash, or they would forget to add cleanup code, or whatever — and then other jobs would fail strangely. You might expect you'd just get a nice error message, but programmers being programmers, people would not do their error handling correctly. A number of libraries have issues like this: for instance, the PVM library unexpectedly fails and reports a completely strange error if it can't create a file in /tmp, because it needs to create a Unix-domain socket so it can talk to itself.

So, what we've done here: it turns out that Sun Grid Engine actually creates a temporary directory, typically under /tmp though you can change that, and points TMPDIR at that location. We've educated most of our users now to use that location correctly, so they'll use that variable and create their files under TMPDIR, and then when the job exits, Grid Engine deletes the directory and it all gets cleaned up. The problem, of course, is that if multiple jobs are running on the same node at the same time, one of them can still fill /tmp.

So the solution was pretty simple: we created a wrapper hook that, at the beginning of the job, creates a swap-backed md file system of a user-requestable size, with a default. This has a number of advantages. The biggest one, of course, is that it's fixed size, so the user gets what they asked for, and once they run out of space, they run out of space — well, too bad, they should have asked for more. The other advantage is the side effect that, now that we're running swap-backed memory file systems for temp space, users who only use a fairly small amount of it should see vastly improved performance, because they're writing to memory rather than to disk.

A quick example: we have a little job script here that prints TMPDIR and prints the amount of space available, and we submit it requesting a hundred megabytes of temp space — here's a live demo — and when you look at the output, you can see it does in fact create a memory file system.
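What the pre-command hook does here boils down to roughly the following sketch; the size variable and its default are illustrative, not the exact interface we use:

    #!/usr/bin/env ruby
    # Illustrative pre-command hook: back the job's TMPDIR with a fixed-size,
    # swap-backed memory file system.  JOB_TMP_MB and the 100 MB default are
    # assumptions for this sketch.
    tmpdir  = ENV["TMPDIR"] or abort "TMPDIR not set"
    size_mb = Integer(ENV["JOB_TMP_MB"] || 100)

    # mdmfs(8) attaches a swap-backed md(4) device of the given size and
    # mounts a file system on the directory in one step.
    system("mdmfs", "-s", "#{size_mb}m", "md", tmpdir) or
      abort "could not create #{size_mb} MB memory file system on #{tmpdir}"

    # The matching post-command hook unmounts it and frees the md unit:
    #   umount $TMPDIR && mdconfig -d -u <unit>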
I attempted to be clever and have the available space come out to roughly what the user asked for. The version I had when I was trying this was not entirely accurate at guessing what all the UFS overhead would be, so the result was not quite consistent — I couldn't figure out an easy function for it. It does a better job than it did to start with; it's not perfect, but it's a good enough fix for today, and we're going to deploy it pretty soon. It works pretty well.

Sometimes, however, it's not enough. The biggest issue is that there are badly designed programs all over the world that don't use TMPDIR like they're supposed to. (Inaudible question.) So there are all these applications that still need /tmp, say during startup, that sort of thing, and we have problems with those. Realistically, we can't change all of them — it's just not going to happen — so we still have problems with people running out of resources. So we feel that the most general solution is a per-job /tmp, virtualizing that portion of the file system namespace, and variant symlinks can do that, so we said, okay, let's give it a shot.

Just to introduce the concept of variant symlinks for people who aren't familiar with them: variant symlinks are basically symlinks that contain variables which are expanded at run time, which allows paths to be different for different processes. For example, you create some files, you create a symlink whose contents are a variable with a default value, and you get different results with different variable settings.
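As a sketch of where this is going — the variable name and the exact %{...} syntax are just illustrative, and the expansion only happens on a kernel with the variant symlink support I'm about to describe — a per-job /tmp looks something like this:

    #!/usr/bin/env ruby
    # Sketch only: the %{...} syntax and the variable name are illustrative.
    # On a stock kernel the link target is just a literal string; with
    # variant symlink support it is expanded per process at path lookup time.
    require 'fileutils'

    FileUtils.mkdir_p("/var/jobtmp/1234")                    # this job's private temp space
    File.symlink("/var/jobtmp/%{SGE_JOB_ID}", "/tmp_demo")   # one link, many targets

    # A process whose SGE_JOB_ID is set to 1234 resolves /tmp_demo to
    # /var/jobtmp/1234; a process with a different value (or the default)
    # resolves the same path to a different directory.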
So, to talk about the implementation we've done: it's derived from the DragonFly implementation — most of the data structures are identical — but I've made a number of changes. The biggest one is that we took the concept of scopes and turned it entirely around. In DragonFly there is a system scope, which is overridden by a user scope and by a process scope. The problem with that is, if you only think about, say, the system scope, and you decide you want to do something clever — like a root file system where /lib points to different things for different architectures — well, that works quite nicely until the users come along and set their arch variable out from under you. If you have, say, a setuid program and you don't code defensively, the obvious bad things happen. Obviously you would write your code not to do that, and I believe they did, but there's a whole class of problems where it's easy to screw up and do something wrong there. By reversing the order, we reduce the risks.

At the moment we don't have a user scope. I just don't like the idea of the user scope, to be honest; the problem is that you then have per-user state in the kernel that just sits around forever — you can never garbage-collect it except administratively — and that doesn't seem like a great idea to me. And a jail scope just hasn't been implemented, because it wasn't entirely clear what the semantics should be.

I also added default-value support — shell-style default variables — which to some extent undoes the scope change, in that the default value becomes a system scope that is overridden by everything; but there are cases where we need that. In particular, if you want to implement a /tmp which varies, you have to do something like this, because /tmp needs to work even when the job variables aren't set.

I also decided to use percent instead of dollar sign, to avoid confusion with shell variables — these are a separate namespace in the kernel; you can't do all the evaluation in user space with environment variables, which is a classic vulnerability, in the CVE database for instance — and we're not using @, to avoid confusion with AFS or with the NetBSD
implementation, which does not allow user- or administratively-settable values. I also don't have any automatic variables, such as the sys value that is universally set in the NetBSD implementation, or a UID variable, which they also have. And currently it does not allow setting values in other processes — you can only set them in your own process and have them inherited. That may change, but one of my goals here is, because there are subtle ways to make dumb mistakes and cause security vulnerabilities, to slim the feature set down to the point where you have some reasonable chance of not doing that if you start building systems on it for deployment.

The final area we've worked on moves away from the file system space and into CPU sets. Jeff Roberson implemented CPU set functionality, which allows you to create a CPU set, put a process into it, and then set the affinity of that CPU set. By default, every process has an anonymous CPU set nested inside one created by a parent.

For a little background here: in a typical SGE configuration, every node has one slot per CPU. There are a number of other ways you can configure it — basically a slot is something a job can run in, and a parallel job crosses slots and can be in more than one slot. For instance, in many applications where the code tends to spend a fair bit of time waiting for I/O, you might configure more than one slot per CPU, so two slots per core is not uncommon; but probably the most common configuration, and the one you get out of the box when you just install Grid Engine, is one slot for each CPU, and that's how we run, because we want users to have that whole CPU for whatever they want to do with it.

So jobs are allocated one or more slots, depending on whether they're sequential or parallel jobs and how many they ask for. But this is just a convention — there's no actual connection between slots and CPUs — so it's quite possible to submit a non-parallel job that goes off and spawns a zillion threads and sucks up all the CPUs on the whole system. In some early versions of Grid Engine there actually was support for tying slots to CPUs
if you set it up that way. There was a sensible implementation for IRIX, and then things got weirder and weirder as people tried to implement it on other platforms with vastly different CPU-binding semantics, and at this point it's entirely broken on every platform as far as I can tell. So we decided: okay, we've got this wrapper, let's see what we can do in terms of making things work.

We now have the wrapper store allocations in the file system, and we have a not-yet-recursive allocation algorithm. What we try to do is find the best-fitting set of adjacent cores, and if that doesn't work, we take the largest available set and repeat until we've got enough slots. The goal is to minimize fragmentation. We haven't done any analysis to determine whether that's actually an appropriate algorithm, but offhand it seems fine, given that I thought about it over lunch.

It should also carry over to other OSes: it turns out that the FreeBSD CPU set API and the Linux one differ only in very small details — they're semantically almost identical, which is convenient, so converting between them is pretty straightforward.

So I did a set of benchmarks to demonstrate the effectiveness of CPU sets; they also happen to exercise the wrapper, but that's not really the point. I used a little eight-core Intel Xeon box running a 7.1 pre-release that had the CPU set code back-ported from 8.0 shortly before release — well, not so shortly; it was supposed to be shortly before — and SGE 6.2. We used a simple integer benchmark, an N-Queens program: for instance, on an 8x8 board, place the 8 queens so they can't capture each other. It's a simple load benchmark. We ran a small version of the problem as our measurement command, and to generate load we ran a larger version that runs much longer.

Some results. For a baseline, the most interesting thing is to do a baseline run: you see some variance, but it's not really very high — not surprising, since the program doesn't really do anything except suck CPU; there's really not much going on.
Some results. The most interesting thing is first to do a baseline run: you see some variance, but it's not really very high, which is not surprising since the benchmark doesn't really do anything except consume CPU. Really not much going on.

Next we've got seven load processes and a single test process running. We see things slow down slightly and the standard deviation go up a bit, a small deviation from the baseline. The obvious explanation is that we're simply context switching a bit more, because we no longer have CPUs that are doing nothing at all; there's some extra load from the system as well, since the kernel and background tasks still have to run.

Then we have a badly behaved application: eight load processes, which soak up all the CPUs, and then we try to run our measurement process. We see a substantial performance decrease, about in the range we would expect.

To see whether we could get rid of that decrease, we fired it up with cpuset. The interesting thing is that we now get no statistically significant difference from the baseline case: with seven load processes, if we use CPU sets, the variance disappears. In fact we see a slight performance improvement and a reduction in variance, so cpuset is actually improving performance even when we're not overloaded. And in the overloaded case it's the same, because the other processes are stuck on other CPUs.

One interesting side note: when I was doing some tests early on, comparing the baseline and the baseline with cpuset, firing it off with the original algorithm, which grabbed CPU 0, produced a significant performance decline, because there's a lot of stuff that ends up running on CPU 0. That led to the quick observation that you want to allocate from the high-numbered CPUs down, so that you use the CPUs which are not running the random processes that get stuck on zero or, on some architectures, get all the interrupts, and avoid core 0 in particular.
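Putting the two observations together (prefer the best-fitting run of adjacent cores, and allocate from the high-numbered cores down so that core 0 is used last), the selection logic might look roughly like the sketch below. The boolean occupancy array standing in for the wrapper's on-disk allocation state is an assumption for illustration, not how the wrapper actually stores things.

    #include <stdio.h>

    #define NCPUS 8

    /*
     * Pick 'want' adjacent free cores from busy[] (1 = already allocated).
     * Prefer the smallest free run that still fits and, among equal fits,
     * the highest-numbered cores.  Returns the first core of the chosen
     * range, or -1 if no single run is big enough (the caller would then
     * take the largest run and repeat for the remainder).
     */
    static int
    pick_cores(const int busy[NCPUS], int want)
    {
        int best_start = -1, best_len = NCPUS + 1;
        int i = NCPUS - 1;

        while (i >= 0) {
            if (busy[i]) {
                i--;
                continue;
            }
            int end = i;                        /* free run is [start..end] */
            while (i >= 0 && !busy[i])
                i--;
            int start = i + 1;
            int len = end - start + 1;
            if (len >= want && len < best_len) {
                best_len = len;
                best_start = end - want + 1;    /* top cores of the run */
            }
        }
        return best_start;
    }

    int
    main(void)
    {
        int busy[NCPUS] = { 1, 1, 0, 0, 0, 0, 1, 0 };   /* 0, 1 and 6 taken */

        /* Picks cores 4-5: the run 2-5 is the best fit, taken from the top. */
        printf("2 cores -> start at CPU %d\n", pick_cores(busy, 2));
        return 0;
    }

This is only a restatement of the strategy described above and, as noted, nobody has analysed yet whether best fit is really the right policy.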
So, some conclusions. I think we have a useful proof of concept that we're going to deploy: certainly the memory stuff soon, and once we upgrade to 7 we'll definitely be deploying the CPU sets, since they improve performance in both the contended and the uncontended case.

In the future we'd like to do some more work with the virtual private server stuff. In particular, it would be really interesting to be able to run different FreeBSD versions in jails, or to run, for instance, CentOS images in jails, since we run CentOS on our Linux-based systems. There could be some really interesting possibilities there: for example, we could potentially DTrace Linux applications, which is never going to happen on native Linux. Another example: Paul Saab was doing some benchmarking recently, and on the same hardware he was seeing a three-and-a-half-times improvement in basic matrix multiplication on -CURRENT relative to Linux, because of the recently added superpages functionality, where you vastly reduce the number of TLB entries needed to map the working set. That sort of thing applies even to our Linux-using population and could give FreeBSD some real wins there.
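To put rough, generic numbers on why superpages matter for something like a large matrix multiply (these are illustrative figures, not measurements from this work): a 512 MB working set mapped with 4 KB pages needs 512 MB / 4 KB = 131,072 distinct page mappings, while 2 MB superpages cover the same memory with 512 MB / 2 MB = 256 mappings. A data TLB on hardware of that era holds on the order of a few hundred to a thousand entries, so the 4 KB case thrashes the TLB on every pass over the matrix, whereas with superpages essentially the whole working set stays mapped, which is the kind of effect that produces speedups like the one mentioned above.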
I'd also like to look more at isolating users from kernel upgrades. One of the issues we've had is that when you do a version bump, we have users who depend on all sorts of libraries, which the vendors like to rev and make stupid API-breaking changes to fairly regularly. It would be nice if users could get all the benefits of kernel upgrades and then upgrade the rest at their leisure, so we're hoping to do that in the future as well.

We'd also like to see more limits on bandwidth-type resources. For instance, it's fairly easy to limit the number of sockets I own, but it's hard to place a total limit on the network bandwidth used by a particular process. And when almost all of our storage is on NFS, how do you classify that traffic without a fair bit of change to the kernel and somehow tagging it? It's an interesting challenge.

We'd also like to see someone implement something like the IRIX job ID, to allow the scheduler to simply tag processes as part of a job. Currently Grid Engine uses a clever but evil hack: they add an extra group to the process, drawn from a range of groups set aside for the purpose, and since the group is inherited and users can't drop it, that lets them track the processes belonging to a job. But it's an ugly hack, and with the current limits on the number of groups it can become a real problem.
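The group trick works roughly like the sketch below; this is a sketch of the idea, not Grid Engine's actual code, and the reserved GID (20000) is made up. The launcher, while still privileged, appends a GID from a reserved range to its supplementary groups before starting the job, and since unprivileged processes cannot drop supplementary groups, every descendant can later be matched against that GID.

    #include <sys/param.h>
    #include <sys/types.h>
    #include <err.h>
    #include <grp.h>
    #include <limits.h>
    #include <unistd.h>

    /*
     * Tag the current process (and all of its future children) with an
     * extra supplementary GID taken from a range reserved for job
     * tracking.  Must run with root privileges.
     */
    static void
    tag_job(gid_t job_gid)
    {
        gid_t groups[NGROUPS_MAX];
        int n;

        n = getgroups(NGROUPS_MAX - 1, groups);
        if (n < 0)
            err(1, "getgroups");
        groups[n++] = job_gid;      /* this is where the small group limit bites */
        if (setgroups(n, groups) != 0)
            err(1, "setgroups");
    }

    int
    main(int argc, char **argv)
    {
        if (argc < 2)
            errx(1, "usage: tagrun command [args ...]");
        tag_job(20000);             /* hypothetical GID reserved for this job */
        execvp(argv[1], argv + 1);
        err(1, "execvp");
    }

A monitoring or cleanup pass can then walk the process list and treat every process carrying that GID as part of the job, which is exactly the tracking property described above.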
Actually, before I take questions, I want to put in one quick point: if you think this is interesting, you live in the area, and you're looking for a job, we are trying to hire a few people. It's difficult to hire good people; we do have some openings, and we're looking for BSD people and general sysadmin people. So, questions?

(inaudible question) Yes, I would expect that to happen, but it's not something I've attempted to test. What I would really like is a topology-aware allocator, so that you can request things like "I want to share cache" or "I don't want to share cache", "I want to share memory bandwidth" or "I don't want to share memory bandwidth". Open MPI 1.3 on the Linux side has a topology-aware wrapper for their CPU affinity functionality called PLPA, Portable Linux Processor Affinity, if that's actually what the acronym stands for. In essence they have to work around the fact that there were three different kernel APIs for the same CPU-affinity syscall, because the vendors all did it themselves somehow: the same syscall number, but completely incompatible. When the application first loads, it calls the syscall and tries to figure out which variant it is from the errors it returns depending on which arguments are missing; completely evil. I think people should just port their API and have their library work, but we don't need to do that junk because we did not make that mistake. So I would like to see the topology-aware stuff in particular.

(inaudible question) The trick is that it's fairly easy to limit application bandwidth; it becomes more difficult when your interfaces are shared between application traffic and, say, NFS. Classifying that is trickier: you'd have to add a fair bit of code to trace it down through the kernel. Certainly doable.

(inaudible question) I have contemplated doing just that. In fact, the other thing we've considered, more as a research project than as a practical thing, would be independent VLANs, because then we could do things like give each process its own VLAN so they couldn't even share at the internet layer. Once the VIMAGE stuff is in place, for instance, we would be able to do that and say: you've got your interface, it's yours, whatever. And then we could limit it: we could rate-limit at the kernel, we'd have a logically isolated network as well, and with some of the latest switches we could actually rate-limit at the switch too.

(inaudible questions) So, to the first question: we don't run multiple sensitivity levels of data on these clusters (it's an unclassified cluster); we've avoided that problem by not allowing it. But it is a real issue, just not one we've had to deal with. In practice, stuff that's sensitive has handling requirements such that you can't touch the same hardware without a scrub, so you'd need a very coarse granularity and a pretty ridiculously aggressive remote
imaging process that moves all of the data off. If I were to do that, I would probably get rid of the disks entirely and just go diskless; that would eliminate my number-one failure case, so it would be pretty good, but we haven't done it.

As for NFS failures: we've had occasional problems with NFS overloading, but no real problems beyond that. We're all on a local network, it's fairly tightly contained, so we haven't had trouble with the server going down for extended periods and causing everything to hang. It's been more an issue of the problem Panasas has described as incast: you can take out pretty much any NFS server that way. We had the BlueArc guys come in (the FPGA-based stuff with multiple ten-gigabit links), and I said, you know, I've got to do this, and they said "can we not try this with your whole cluster?", because if you've got three hundred and fifty gigabit Ethernet interfaces going into the system, you can saturate even ten gig pretty trivially. So at that level there's an inherent problem: if we need to handle that kind of bandwidth we've got to get a parallel file system or a cluster file system, or for doing streaming stuff we could go buy a SAN or something.

Anyone else? Thank you, everyone. (applause and end)