So basically, in this second part we are mainly going to look at how to handle some locking problems that you can encounter in the kernel. Two kinds of problems you can get with locks, which are pretty common, are described here.

The first one is called Lock Order Reversal (LOR). You have, for example, a thread A which owns a lock called L1, and another thread B which owns a lock L2. Then thread A tries to... right, it's wrong, the slide is wrong. Thread A tries to acquire L2, but obviously it sleeps, because L2 is owned by thread B; then thread B tries to acquire the lock L1, and it sleeps because L1 is owned by thread A. This is a situation that never ends, and it's pretty well documented in Cormen and in the traditional literature: it's a classical deadlock. This means, as everybody who has ever read an operating systems book knows, that locks should maintain an ordering with regard to each other. That's not very simple when you are speaking about a kernel.

From this point of view, the fact that there are three classes of locks is going to count, because you can never mix two different kinds of locks freely. For example, a spinlock and a mutex can be mixed in only one direction: you can have the mutex first and the spinlock later, while the opposite is not allowed. So you will see that these kinds of deadlocks are possible only within the same class of locks, like for example two mutexes, or two spin mutexes, or such.

Also, even if it's not very well documented, spinlocks in FreeBSD, for example, have a way to identify this kind of deadlock, and it's actually implemented, a feature enabled in the code: they just count how many times they are spinning, and if that count exceeds an exaggerated threshold, it means they are probably in a deadlock, and the system panics.
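To make the lock order reversal concrete, here is a minimal sketch using the mutex(9) interface; the lock and function names are made up for illustration, and both mutexes are assumed to have been set up with mtx_init() elsewhere:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx l1, l2;	/* assume mtx_init() was done elsewhere */

/* Thread A acquires L1 first, then L2. */
static void
thread_a(void *arg)
{
	mtx_lock(&l1);		/* A owns L1 */
	mtx_lock(&l2);		/* blocks: B owns L2 */
	/* ... */
	mtx_unlock(&l2);
	mtx_unlock(&l1);
}

/* Thread B uses the reversed order: L2 first, then L1. */
static void
thread_b(void *arg)
{
	mtx_lock(&l2);		/* B owns L2 */
	mtx_lock(&l1);		/* blocks: A owns L1 -> deadlock */
	/* ... */
	mtx_unlock(&l1);
	mtx_unlock(&l2);
}
```

A kernel built with WITNESS, which we will see in a moment, would print a lock order reversal warning the first time it observes the two conflicting orders, even if the actual deadlock never triggers.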
Another common problem with locking is when you have a wait channel, like a condition variable, which is protected by a lock. There are race windows when condition variables are used together with some preliminary condition, like a waiters counter that has to be updated any time a thread goes to sleep on it, or flags that have to be set. If those conditions are not protected by the same lock as the wait channel, you can end up having some threads sleeping on the wait channel with nobody ever waking them up again. This is usually called a missed wakeup, and it's a pretty common mistake that leads to a deadlock.

The problem is that it's very difficult to differentiate between a missed wakeup and, for example, a thread that is sleeping forever by design and is simply not likely to be awakened. So these kinds of deadlocks are very, very difficult to discover, and they require the bit of work we are going to see right now, using some kernel subsystems and some facilities integrated into the debugger.
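To make the missed-wakeup pattern concrete, here is a minimal sketch using the condvar(9) and mutex(9) interfaces; the names (m, done_cv, ready) are mine, and initialization with mtx_init()/cv_init() is assumed to happen elsewhere:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

static struct mtx m;		/* protects both done_cv and ready */
static struct cv done_cv;
static int ready;		/* the "preliminary condition" */

static void
waiter(void)
{
	mtx_lock(&m);
	while (ready == 0)
		cv_wait(&done_cv, &m);	/* drops and reacquires m atomically */
	mtx_unlock(&m);
}

static void
waker(void)
{
	mtx_lock(&m);
	ready = 1;
	cv_signal(&done_cv);
	mtx_unlock(&m);
}

/*
 * If waker() set `ready` without holding m, a thread in waiter() could
 * test `ready`, see 0, and be preempted right before cv_wait(); the
 * signal would then fire while nobody sleeps, and the waiter would
 * sleep forever: the missed wakeup described above.
 */
```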
In FreeBSD we have quite a lot of good mechanisms we can use to cope with these kernel problems. The first one, and the most important, is called WITNESS. It was introduced in the context of SMPng and has been rewritten in the recent past, mainly through a contribution from Isilon Systems, who contributed back their rewriting of WITNESS. This subsystem is very important because it tracks down exactly every ordering between the locks, so that if there is an ordering violation, like a LOR, it's going to tell the system. You can even set it to panic directly if it finds a deadlock, or even just a possible deadlock. Another important feature is that it can keep track of read/write locks; doing that, we can identify possible deadlocks even on the reader paths. That said, WITNESS is pretty big, so activating it on a production system is never an option. It's mainly used when you are developing a new feature in the kernel and you are testing it heavily, in particular if it has some relation to locking.

We should also say that with the new code provided by Isilon and Nokia, the overhead added by WITNESS is greatly reduced, to about a tenth of what we had before. WITNESS is very good at tracking LORs, but it's not very good at helping you with lost wakeups, for example, mainly because of its nature. It has very good integration with the DDB debugger, and in the 8.x release we have new features that help you print out backtraces of the threads contending in a LOR, together with their orderings, and it can show graphs of the lock relations even from user space, so you don't have to go into the kernel debugger to look at its output.

Well, I see that sometimes around releases there is confusion about what information to report for deadlock conditions and what help users can give developers there. So we are going to see all the relevant information to collect when there is a deadlock in the kernel. Usually, if you want to find a deadlock that's happening in the kernel, your first line of analysis starts from DDB, rather than from post-mortem analysis, which comes afterwards; using DDB you will get more process state and better information. The most important items for finding the deadlock are the LORs reported by WITNESS, in order to see if something strange may be happening, and then the state of all the threads running on the system that is deadlocking. You can tell you are deadlocking if you see that there are just idle threads on the runqueues: that's like saying the runqueues are completely flushed and all the threads are sleeping in their containers. You also need to know exactly which locks are acquired in the system, and that's something WITNESS provides. And the very important thing is to know why the threads stopped, so one of the most important things is retrieving what the threads were doing when they were put to sleep.
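Assuming a development kernel built with options DDB and options WITNESS, the session to gather all of this could look roughly like the following; the command names are from ddb(4), though the exact set varies between releases, so treat this as a sketch:

```
db> ps                 (state of every thread: wait channel, sleep message, flags)
db> show allpcpu       (what each CPU is currently doing)
db> show alllocks      (every lock currently held, with owner and file:line)
db> alltrace           (stack backtrace of every thread)
db> show lockedvnods   (locked vnodes, for VFS/buffer-cache deadlocks)
```

These are the outputs the next slides walk through.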
The backtraces of all the threads involved are printed out in order to identify the deadlock. In the case that the buffer cache and VFS are probably part of the deadlock, you should also print out the information about the vnodes: what we're interested in is which vnodes are locked, which are actually referenced, and in which way they were locked.

So, this is an example of the thread states in the case of a deadlock. This is a real example of a deadlock, and even if what you see here is not totally complete, you can see that all the threads are sleeping. This one is the message used for the wait channel they are sleeping on, or used by the container, like the turnstile or the sleepqueue. If I recall correctly, it was a forced unmount that was deadlocking at some point; I'm not really sure, because I would have to look at it again. You can see that the relevant command here is ps, which DDB supports.

Another important thing is getting the situation of all the CPUs, as you can see there. That matters because you can have some data structures corrupted in the per-CPU data. That's a very common situation where you can get deadlocks, because, for example, leaving corrupted per-CPU state behind can lead to bigger, massive breakage, like double faults and things like that. So it's usually always a good idea to look at all the CPUs in the system. The command is show allpcpu.

This one is a WITNESS-specific command, show alllocks, and it's going to show all the locks held in the system: who the owner is (in this case it's a mount point lock, and the thread is this one), which lock it is (that's the address), and where it was acquired; it gives you file and line. Actually, that's only possible with WITNESS, because otherwise keeping that kind of holder information around in a general-purpose kernel would be very expensive for the locking subsystem.

Then, the most important thing is the backtrace of every thread.
This is going to show the backtrace of all the threads. In this case, the thread with these addresses, TID and PID, basically went to sleep on a vnode. You can see the top of the backtrace, which is the context-switching function; those are the sleepqueue functions of the container that is holding the thread; and this one is what it was doing before, in this case a filesystem mount operation. You will see that on a complete listing you will have a lot of these traces, but they are very important for developers in order to understand what is going on.

These are the locked vnodes, which are also very important when a deadlock happens in VFS or in the buffer cache. You will see that these are the reference counts linked to the vnode; they are specific to some handling of the vnodes, such as recycling and complete freeing. That's the mount point the vnode belongs to, and that is the backtrace of what happened when the vnode lock was acquired. You can see that this command also gives you information about the lock attached to the vnode: for example, it tells you that the lock is held in exclusive mode and that there are some shared waiters on its queues. That's also the inode number.

There is other information you can get from DDB that is linked to debugging deadlocks, like the sleep chains, for any wait channel, if you have its address; and you can, for example, also print the whole table of lock relations from WITNESS, but that's almost never useful, because you should already know which relation is the one that can give trouble.
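For reference, the two commands being alluded to here should be the following; take the exact spellings as my assumption and check ddb(4) for your release:

```
db> show sleepchain <thread>   (follow the chain of what a sleeping thread waits on)
db> show witness               (dump the whole lock-order table WITNESS has built)
```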
So if you are going to submit problem reports, through GNATS, about something that is probably a deadlock in kernel space, these are the pieces of information that we actually need. Now, it's very rare to see really good reports about deadlocks, so I think it is a very good thing to talk about it.

Along with WITNESS, we have another important mechanism that can help us with deadlocks, and it's called KTR. KTR is basically a logger, a kernel logger of events. It's highly configurable: you can, for example, handle different classes of events. In FreeBSD we have classes linked to the scheduler, to locking, to VFS, to callouts, and they are all handled the same way, so they can be selectively enabled or masked. The difference is that you can compile in several classes, say ten KTR classes, and then, if you are just interested in three of them, you are going to see just those three, even if all ten are actually compiled in.

An important thing is that KTR doesn't copy what you hand it: it doesn't store the information pointed to, it just stores the pointer, not the data. For strings, for example, it doesn't make copies; you need to pass pointers that stay persistent in memory, otherwise KTR won't be able to access them later.

The good thing about KTR is that you can also look at it from user space, through the ktrdump interface. Why is that important for locking? Because afterwards it can tell you what happened, on which CPU it happened, and the order it happened in. This is very important when you are tracking down races, for example, when you are not sure about the order of operations and how they happened: it's going to tell you.
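As an illustration of that pointer-only logging, events are emitted with the CTRn() family of macros from sys/ktr.h; this hypothetical tracepoint logs into the locking class:

```c
#include <sys/param.h>
#include <sys/ktr.h>

/*
 * Hypothetical tracepoint: records an event in the KTR_LOCK class.
 * KTR stores only the format-string pointer and the raw argument
 * words; nothing is copied, so the format string (a literal is fine)
 * and any %s argument must stay resident in memory.
 */
static void
trace_lock_event(void *lockp, const char *name)
{
	CTR2(KTR_LOCK, "acquired %s at %p", name, lockp);
}
```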
For example, that is a typical trace of KTR, where you have the CPU where the event happened, that's the index, and that's a timestamp; I think it's retrieved directly from the TSC, but I'm actually not sure. In this case I was tracking down the scheduler class, so I was interested mainly in scheduler workloads, and I could see, for example, that a context switch happened, scheduling the idle thread, and then other information, like for example the load of CPU 1, and a priority boost, like this one, and other things.

You can enable KTR with the option KTR, but you must handle it carefully. For example, there is an option I didn't include here, which is the length of the buffer used to store the entries; it's called KTR_ENTRIES, and you should specify enough entries to get reliable tracking. If you are going to track a lot of events, a short queue is not an option, because you are going to lose information; a typical queue length is 2K entries. These other options let you compile in some classes, or mask them, or even mask CPUs, if you have a big SMP environment, so that you can selectively enable only some of them. This is very good, for example, for tracking down races in the sleepqueues. You can find the references here.
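As a sketch, a kernel configuration for the kind of setup just described might look like this; the option names are from ktr(4), but take the values and the KTR_CPUMASK syntax (it changed between releases) as illustrative:

```
options KTR
options KTR_ENTRIES=2048                          # ring length; must be a power of two
options KTR_COMPILE=(KTR_SCHED|KTR_LOCK|KTR_VFS)  # classes compiled in
options KTR_MASK=KTR_SCHED                        # classes actually logged at boot
options KTR_CPUMASK=0x3                           # trace only CPUs 0 and 1
```

From user space, something like `ktrdump -c -t` should then print the buffer with CPU numbers and timestamps, if I read the man page correctly.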
So, I will spend the last part of the speech speaking about possible improvements to our locking system, which is not in bad shape. Well, I think that our locking system is actually pretty complete, but it's also pretty confusing for newcomers, and it's not widely documented, so maybe we should spend a good amount of time on documentation. As you can see, even in this presentation, which is not very big, there are many things to say, and they are not very simple to understand, in particular for people who just need to do simple tasks. For example, I saw a lot of guys coming from the Linux world who wanted to use spinlocks all the time; obviously they were missing something about our architecture.

From a purely technical point of view, it would be very good if we could remove legacy and overlapping support. For example, we have lockmgr and sx locks, which are both read/write locks, and are both served by sleepqueues. They have some differences, obviously, but mainly we could implement the missing bits and just keep one of the two interfaces. In the same way, as I told you before, the sleeping primitives msleep, rw_sleep and sx_sleep should probably be replaced by condition variables and dropped from our kernel, and we should probably drop sema too, because it is obsolete and can be replaced by condvars and mutexes.

From a more strictly technical point of view, as you can see, we have spent a lot of time optimizing our blocking primitives, but very little on our spinning primitives. That's obviously because blocking primitives are our first choice, but spinlocks could be improved too, using techniques such as the so-called back-off algorithms, and queued spinlocks. I have a patch for that, but I would really need to test it and tune it on a big SMP environment; I don't think that, for now, it can handle such an environment properly. In case you are not familiar with these algorithms: back-off algorithms try to reduce the cache pressure from the threads contending the lock by making them wait between attempts, with a waiting time that increases as time passes. Queued spinlocks, instead, have each waiter spin on a local variable which is not shared with the other threads.
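Here is a userland sketch of a test-and-test-and-set spinlock with exponential back-off, in C11 so that it is self-contained; a kernel version would use the atomic(9) primitives and cpu_spinwait() instead, and every name here is made up for illustration:

```c
#include <stdatomic.h>

struct bo_lock {
	atomic_int locked;		/* 0 = free, 1 = held */
};

static void
bo_lock_acquire(struct bo_lock *l)
{
	unsigned delay = 1;

	for (;;) {
		/*
		 * Spin with plain loads first: this keeps the cache line
		 * in shared state instead of bouncing it around with
		 * read-modify-write traffic from every waiter.
		 */
		while (atomic_load_explicit(&l->locked,
		    memory_order_relaxed) != 0)
			;
		if (atomic_exchange_explicit(&l->locked, 1,
		    memory_order_acquire) == 0)
			return;		/* acquired */
		/*
		 * Lost the race: pause before retrying, and double the
		 * pause each time; this growing delay is what relieves
		 * the pressure on the lock's cache line.
		 */
		for (unsigned i = 0; i < delay; i++)
			atomic_signal_fence(memory_order_seq_cst);
		if (delay < 1024)
			delay <<= 1;
	}
}

static void
bo_lock_release(struct bo_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A queued (MCS-style) spinlock goes further: each waiter enqueues a node and spins only on its own node's flag, so no cache line is shared between waiters at all.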
Another interesting thing would be benchmarking different wake-up algorithms for the blocking primitives. We have an algorithm that has proven to be quite good, but we have not compared it with other kinds of wake-up policies that could have a higher overhead but could give better times on a big SMP environment.

Another thing that would be very interesting to fix is the priority inversion problem in the case of read locks. There is an approach, called owner of record, that could be implemented in FreeBSD, but I saw that it's pretty slow for our fast path. In FreeBSD, all our locking primitives are built around a fast path and a hard path, where the fast path is often just a single atomic operation, and if it fails, it falls down into the hard path, which does all the hard work with the sleepqueues. In this case, the owner-of-record technique was going to make the fast path not so simple anymore. Basically, it just designates one of the readers as the nominal owner, and it can switch it in and out as readers come and go, and it practically lends the priority to this owner of record on behalf of the whole read lock.

Another important thing, obviously, is improving locking in the places where the optimum approach was not chosen. I see a lot of parts in which the primitives chosen by the developers are not the most suitable ones, and we should switch to the right ones; like, for example, the indiscriminate usage of spinlocks where blocking primitives would do, just to handle cases like the one we saw before with malloc, that needs to sleep.
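To illustrate the fast path/hard path split just mentioned, the general shape is something like the following sketch; it is not the real mtx code, and hard_lock() is a hypothetical stand-in for the turnstile/sleepqueue hard path:

```c
#include <sys/types.h>
#include <machine/atomic.h>

#define	LK_UNOWNED	((uintptr_t)0)	/* illustrative "free" value */

/* Hypothetical hard path: turnstiles/sleepqueues, priority propagation. */
void	hard_lock(volatile uintptr_t *lw, uintptr_t tid);

/*
 * Fast path: a single atomic compare-and-set of the lock word from
 * "free" to the current thread.  Only when that fails do we enter
 * the hard path and pay for the expensive machinery.
 */
static __inline void
lock_acquire(volatile uintptr_t *lw, uintptr_t tid)
{
	if (atomic_cmpset_acq_ptr(lw, LK_UNOWNED, tid) == 0)
		hard_lock(lw, tid);
}
```

Any questions?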