Andrew Morton on kernel development
Years ago, there was a great deal of worry about the possibility of burning out Linus. Life seems to have gotten easier for him since then; now instead, I've heard concerns about burning out Andrew. It seems that you do a lot; how do you keep the pace and how long can we expect you to stay at it?
I do less than I used to. Mainly because I have to - you can't do the same thing at a high level of intensity for over five years and stay sane.

I'm still keeping up with the reviewing and merging but the -mm release periods are now far too long.
There are of course many things which I should do but which I do not.
Over the years my role has fortunately decreased - more maintainers are running their own trees and the introduction of the linux-next tree (operated by Stephen Rothwell) has helped a lot.
The linux-next tree means that 85% of the code which I used to redistribute for external testing is now being redistributed by Stephen. Some time in the next month or two I will dive into my scripts and will find a way to get the sufficiently-stable parts of the -mm tree into linux-next and then I will hopefully be able to stop doing -mm releases altogether.
So. The work level is ramping down, and others are taking things on.
What can we do to help?
I think code review would be the main thing. It's a pretty specialised function to review new code well. The people who specialise in the area which the new code is changing are the best reviewers but unfortunately I will regularly find myself having to review someone else's stuff.

Secondly: it would help if people's patches were less buggy. I still have to fix a stupidly large number of compile warnings and compilation errors and each -mm release requires me to perform probably three or four separate bisection searches to weed out bad patches.
Thirdly: testing, testing, testing.
Fourthly: it's stupid how often I end up being the primary responder on bug reports. I'll typically read the linux-kernel list in 1000-email batches once every few days and each time I will come across multiple bug reports which are one to three days old and which nobody has done anything about! And sometimes I know that the person who is responsible for that part of the kernel has read the report. grr.
Is it your opinion that the quality of the kernel is in decline? Most developers seem to be pretty sanguine about the overall quality problem. Assuming there's a difference of opinion here, where do you think it comes from? How can we resolve it?
I used to think it was in decline, and I think that I might think that it still is. I see so many regressions which we never fix. Obviously we fix bugs as well as add them, but it is very hard to determine what the overall result of this is.

When I'm out and about I will very often hear from people whose machines we broke in ways which I'd never heard about before. I ask them to send a bug report (expecting that nothing will end up being done about it) but they rarely do.
So I don't know where we are and I don't know what to do. All I can do is to encourage testers to report bugs and to be persistent with them, and I continue to stick my thumb in developers' ribs to get something done about them.
I do think that it would be nice to have a bugfix-only kernel release. One which is loudly publicised and during which we encourage everyone to send us their bug reports and we'll spend a couple of months doing nothing else but try to fix them. I haven't pushed this much at all, but it would be interesting to try it once. If it is beneficial, we can do it again some other time.
There have been a number of kernel security problems disclosed recently. Is any particular effort being put into the prevention and repair of security holes? What do you think we should be doing in this area?
People continue to develop new static code checkers and new runtime infrastructure which can find security holes.

But a security hole is just a bug - it is just a particular type of bug, so one way in which we can reduce the incidence rate is to write fewer bugs. See above. More careful coding, more careful review, etc.
Now, is there any special pattern to a security-affecting bug? One which would allow us to focus more resources on preventing that type of bug than we do upon preventing "average" bugs? Well, perhaps. If someone were to sit down and go through the past five years' worth of kernel security bugs and pull together an overall picture of what our commonly-made security-affecting bugs are, then that information could perhaps be used to guide code-reviewers' efforts and code-checking tools.
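As a purely illustrative aside (an invented example, not drawn from the interview or from any real kernel bug): one commonly-cited pattern of security-affecting bug is the signed/unsigned length-check slip - exactly the sort of thing such a survey might surface and that reviewers and static checkers look for.

```c
#include <string.h>

#define BUFSZ 64

static char buf[BUFSZ];

/* Hypothetical example of a classic signedness bug: if 'len'
 * arrives as a negative int, the signed comparison below passes,
 * but memcpy() then receives it converted to a huge size_t -
 * an ordinary bug that is also a security hole. */
void bad_copy(const char *src, int len)
{
	if (len > BUFSZ)	/* bug: never rejects len < 0 */
		return;
	memcpy(buf, src, len);	/* negative len wraps to a huge size */
}

/* The fixed version: an unsigned length cannot be negative. */
void good_copy(const char *src, size_t len)
{
	if (len > BUFSZ)
		return;
	memcpy(buf, src, len);
}
```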
That being said, I have the impression that most of our "security holes" are bugs in ancient crufty old code, mainly drivers, which nobody runs and which nobody even loads. So most metrics and measurements on kernel security holes are, I believe, misleading and unuseful.
Those security-affecting bugs in the core kernel which affect all kernel users are rare, simply because so much attention and work gets devoted to the core kernel. This is why the recent splice bug was such a surprise and head-slapper.
I have sensed that there is a bit of confusion about the difference between -mm and linux-next. How would you describe the purpose of these two trees? Which one should interested people be testing?
Well, things are in flux at present.

The -mm tree used to consist of the following:
- 80-odd subsystem maintainer trees (git and quilt), eg: scsi, usb, net.
- various patches which I picked up which should be in a subsystem maintainer's tree, but which for one of various reasons didn't get merged there. I spend a lot of time acting as backup for leaky maintainers.
- patches which are mastered in the -mm tree. These are now organised as subsystems too, and I count about 100 such subsystems which are mastered in -mm. eg: fbdev, signals, uml, procfs. And memory management.
- more speculative things which aren't intended for mainline in the short-term, such as new filesystems (eg reiser4).
- debugging patches which I never intend to send upstream.

The 80-odd subsystem trees in fact account for 85% of the changes which go into Linux. Pretty much all of the remaining 15% are the only-in-mm patches.
Right now (at 2.6.26-rc4 in "kernel time"), the 80-odd subsystem trees are in linux-next. I now merge linux-next into -mm rather than the 80-odd separate trees.
As mentioned previously, I plan to move more of -mm into linux-next - the 100-odd little subsystem trees.
Once that has happened, there isn't really much left in -mm. Just:

- the patches which subsystem maintainers leaked. I send these to the subsystem maintainers.
- the speculative not-for-next-release features
- the not-to-be-merged debugging patches.

Do you have any specific goals for the development of the kernel over the next year or so? What would they be?
Steady as she goes, basically.

I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there! Eventually we should get into more of a maintenance mode where we just fix bugs, tweak performance and add new drivers. Famous last words.
And it's just vaguely possible that we're starting to see that happening now. I do get a sense that there are fewer "big" changes coming in. When I sent my usual 1000-patch stream at Linus for 2.6.26 I actually received an email from him asking (paraphrased) "hey, where's all the scary stuff?"
In the early-May discussions, Linus said a couple of times that he does not think code review helps much. Do you agree with that point of view?
Nope.

How would you describe the real role of code review in the kernel development process?
Well, it finds bugs. It improves the quality of the code. Sometimes it prevents really really bad things from getting into the product. Such as rootholes in the core kernel. I've spotted a decent number of these at review time.

It also increases the number of people who have an understanding of the new code - both the reviewer(s) and those who closely followed the review are now better able to support that code.
Also, I expect that the prospect of receiving a close review will keep the originators on their toes - make them take more care over their work.
There clearly must be quite a bit of communication between you and Linus, but much of it, it seems, is out of the public view. Could you describe how the two of you work together? How are decisions (such as when to release) made?
Actually we hardly ever say anything much. We'll meet face-to-face once or twice a year and "hi how's it going".

We each know how the other works and I hope we find each other predictable and that we have no particular issues with the other's actions. There just doesn't seem to be much to say, really.
Is there anything else you would like to say to LWN's readers?
Sure. Please do contribute to Linux, and a great way of doing that is to test latest mainline or linux-next or -mm and to report on any problems which you encounter.

Nothing special is needed - just install it on as many machines as you dare and use them in your normal day-to-day activities.
If you do hit a bug (and you will) then please be persistent in getting us to fix it. Don't let us release a kernel with your bug in it! Shout at us if that's what it takes. Just don't let us break your machines.
Our testers are our greatest resource - the whole kernel project would grind to a complete halt without them. I profusely thank them at every opportunity I get :)
We would like to thank Andrew for taking time to answer our questions.
Andrew Morton on kernel development
Posted Jun 11, 2008 16:16 UTC (Wed) by Hanno (guest, #41730) [Link]
"I do think that it would be nice to have a bugfix-only kernel release." Yes, please.
Andrew Morton on kernel development
Posted Jun 11, 2008 17:10 UTC (Wed) by MisterIO (subscriber, #36192) [Link]
It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.
Andrew Morton on kernel development
Posted Jun 11, 2008 17:27 UTC (Wed) by hmh (subscriber, #3838) [Link]
> It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.

Which many will do, causing total chaos in the next merge window. That's the reason it hasn't been done yet, AFAIK.
Now, if we could get a sufficiently big number of kernel regulars (like at least 50% of the ones with more than three patches merged in the last three releases) and all subsystem maintainers (so as to keep the new-feature-craze crowd under control) to pledge to the big bugfix experiment, then it just might work.
Andrew Morton on kernel development
Posted Jun 11, 2008 17:59 UTC (Wed) by proski (subscriber, #104) [Link]
It's not a matter of making developers do something else. It's a priority thing. Most developers work both on new features and on bugfixes. Sometimes bugs are exposed as the code is modified to include new features.

If some kernel is declared stable, it means that only bugfixes are accepted. In other words, the merge window is skipped. To make the point, the previous kernel could be tagged as rc1 for the stable kernel.
I don't know if it's going to work, but it may be worth trying once.
Andrew Morton on kernel development
Posted Jun 11, 2008 17:34 UTC (Wed) by cdamian (subscriber, #1271) [Link]
I preferred the odd/even system we had before 2.6.

I also gave up on reporting kernel bugs. Usually I am the only person with that bug and hardware configuration and nobody will fix it. This is not specific to the kernel though. I think I never got any of the bugs which I reported to Fedora, Red Hat or GNOME fixed.

Two other things: is the kernel bugzilla used at all? Are there any tests like unit tests to catch regressions for the kernel? Both are pretty standard for any other open source project nowadays.
Andrew Morton on kernel development
Posted Jun 11, 2008 18:52 UTC (Wed) by grundler (subscriber, #23450) [Link]
> I also gave up on reporting kernel bugs.

I'm sorry to hear that. I know that reporting bugs is a lot of work.

> Usually I am the only person with that bug and hardware
> configuration and nobody will fix it.

If no one else really has that HW, then there could be lots of reasons:

1) They don't care - many developers don't care about parisc, sparc, 100VG or token ring networking, scaling up or down (embedded vs large systems), etc.
2) They don't have documentation for the offending HW.
3) No one else was able to reproduce the bug and it's not obvious what is wrong.

> This is not specific to the kernel though. I think I never got
> any of the bugs which I reported to Fedora, Red Hat or GNOME fixed.

Before someone else suggests these, maybe the way the bugs are reported has something to do with the response rate? There are some good essays/resources out there on how to file useful bug reports. I don't want to suggest yours are not useful since I've never seen one (or don't know if I have). Just when you mention problems across all open source projects, I wonder.

> Two other things: is the kernel bugzilla used at all?
> Are there any tests like unit tests to catch regressions for the kernel?
> Both are pretty standard for any other open source project nowadays.

Agreed. But to be clear, the kernel is a bit different from most open source projects since it controls HW and lots of buggy BIOS flavors.

(1) I'm using bugzilla.kernel.org to track tulip driver bugs. Not everyone is doing that. It's helped that akpm has (had?) funding (from google?) for someone to help clean up and poke maintainers about outstanding bugs. Despite not everyone using it, it's still a better tracking mechanism than sending an email to lkml. Do both: email to get attention and bugzilla to track details. But also send bug reports to topic-specific lists since it's more likely people who care about your HW will notice the report.

(2) Not that I'm aware of. The kernel interacts with HW a lot. It's very difficult to emulate or "mock" that interaction. Not impossible, just hard, and the emulation almost never can capture all the nuances of broken HW (see drivers/net/tg3.c for examples). Secondly, we very often can only test large subsystems or several subsystems at once. E.g. a file system test almost always ends up stressing the VM and IO subsystems. Networking stresses DMA and SK buff allocators. UML and other virtualization of the OS make it possible to test some subsystems w/o specific HW.

However, there are smaller pieces of the kernel which can be isolated and tested: e.g. bit ops (i.e. ffs()), resource allocators, etc. It's just a lot of work to automate the testing of those bits of code. But this is certainly a good area to contribute if someone wanted to learn how kernel code (doesn't? :)) work.

For testing subsystems, see autotest.kernel.org and http://ltp.sourceforge.net/. autotest is attempting to find regressions during the development cycle.
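To make that last point concrete, here is a minimal sketch of the kind of isolated, userspace unit test being described; my_ffs() is a hypothetical stand-in for a kernel ffs()-style helper (1-based index of the least significant set bit, 0 when no bit is set), not actual kernel code:

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in with the usual ffs() semantics; a real kernel version
 * would be arch-specific, often a single instruction. */
static int my_ffs(unsigned int x)
{
	int i;

	if (x == 0)
		return 0;
	for (i = 1; !(x & 1); i++)
		x >>= 1;
	return i;
}

int main(void)
{
	unsigned int i;

	assert(my_ffs(0) == 0);
	assert(my_ffs(1) == 1);
	assert(my_ffs(6) == 2);		/* 0b110 -> lowest set bit is bit 2 */
	/* Exhaustively check every single-bit input. */
	for (i = 0; i < 32; i++)
		assert(my_ffs(1u << i) == (int)(i + 1));
	printf("all ffs tests passed\n");
	return 0;
}
```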
Andrew Morton on kernel development
Posted Jun 11, 2008 19:02 UTC (Wed) by nbarriga (guest, #49347) [Link]
It seems that autotest.kernel.org doesn't exist...
Andrew Morton on kernel development
Posted Jun 11, 2008 22:57 UTC (Wed) by erwbgy (subscriber, #4104) [Link]
That should be http://test.kernel.org/ and http://test.kernel.org/autotest for documentation.
Andrew Morton on kernel development
Posted Jun 12, 2008 2:23 UTC (Thu) by grundler (subscriber, #23450) [Link]
Yes - I meant http://test.kernel.org. Sorry about that.
Andrew Morton on kernel development
Posted Jun 11, 2008 20:01 UTC (Wed) by iabervon (subscriber, #722) [Link]
I think there's a substantial difference between the way he phrased the suggestion here and what I've seen before. People tend to think of a bugfix-only release as one in which the mainline only merges bugfixes. Simply making that policy would almost certainly lead to no more bugfixes than usual, and twice as many features hitting the following release window.

On the other hand, if the process were driven from the other end, it might work: spend some period collecting a lot of unfixed bugs, and saturate developers' development time with them, and, in the cycle after that, there ought to be a lot of bugfixes and no new features, simply because all that will have matured at the merge window will be bugfixes.

So, if there were a period where there was a campaign to collect long-standing bugs and regressions against non-recent versions, with the aim of having all of these get resolved in a particular future version, as the main goal for that release, I think that would be useful.
Andrew Morton on kernel development
Posted Jun 11, 2008 19:39 UTC (Wed) by job (subscriber, #670) [Link]
I've been bitten by some bugs earlier in the 2.6 series, but I have not had any trouble since around 2.6.18 I believe. It may be luck, it may be hard work from Andrew and everyone else involved. Thank you, everyone!
Sometimes it is depressing
Posted Jun 11, 2008 21:47 UTC (Wed) by mikov (subscriber, #33179) [Link]
Sometimes I get depressed when thinking about the kernel. Mostly because I feel powerless to affect it in any way - I can't sponsor somebody to work on fixing bugs (that would be the ideal case) and unfortunately in most cases I don't have the expertise to fix bugs myself.

For example, only recently I discovered to my utter amazement that USB 2.0 still doesn't work well! I tried to connect a simple USB->Serial converter and it started failing in mysterious ways - e.g. it would work 80% of the time, but then there would be a lost byte, etc. There are workarounds (disabling USB 2.0 from the BIOS, unloading the USB 2.0 modules, using a USB 1.0 hub, etc), but it is depressing that USB 2.0, which is on practically 100% of all machines, doesn't work. Of course it works nicely under Windows.

I eventually dug out a couple of messages from Greg KH explaining that it has been a known problem for a long time (I don't remember the exact details), but there is simply not enough interest in fixing it. This is *not* an issue of undocumented hardware!

I can't really complain, since I am not paying for Linux, but it is ... I already said it ... depressing.
Sometimes it is depressing
Posted Jun 11, 2008 22:11 UTC (Wed) by dilinger (subscriber, #2867) [Link]
You don't have to sponsor developers; just send them the misbehaving hardware. Chances are good that if it's useful hardware, it'll get fixed.
Sometimes it is depressing
Posted Jun 11, 2008 22:28 UTC (Wed) by mikov (subscriber, #33179) [Link]
I am afraid it is not that simple. I am sure that there isn't a single developer without a USB 2.0 PC, so there is no point in sending them anything. USB 2.0 hubs can be bought for about $30 (and PCs have hubs built in anyway), add another $10 for a USB->serial converter. I don't mind spending that if it would improve the kernel.

As I mentioned, this is not a case of undocumented or expensive hardware. The USB 2.0 kernel subsystem is apparently not quite ready and it can't handle USB 2.0 hubs. At least that is my understanding - I could be wrong.

Even assuming that it made sense to send hardware, where should I send it?
Sometimes it is depressing
Posted Jun 11, 2008 22:32 UTC (Wed) by dilinger (subscriber, #2867) [Link]
I *highly* doubt this is a USB 2.0 host problem. More likely, it's a problem w/ the specific USB device that you're using, or a host bug that's only triggered by your USB device. There are plenty of buggy USB devices out there. I've used plenty of USB 2.0 devices with no problems. I've also used USB serial adapters with no problems at all. However, your specific USB serial adapter is clearly problematic, and that's not something that other people are likely to see unless they have the same hardware that you have.
Sometimes it is depressing
Posted Jun 11, 2008 23:08 UTC (Wed) by mikov (subscriber, #33179) [Link]
The device is fine. The USB converter uses the Prolific chip, which as far as I can tell is one of the most common ones and highly recommended for Linux. I have several different converters using it, including a $350 industrial 8-port one. They all fail (also on machines with different USB chipsets) as long as USB 2.0 is enabled. The failure is fairly subtle, so it is not always immediately obvious. Needless to say, all converters work flawlessly under Windows ...

See this post: http://lkml.org/lkml/2006/6/12/279

To quote from further down the thread: "Yeah, it's a timing issue with the EHCI TT code. It's never been "correct" and we have had this problem since we first got USB 2.0 support. You were just lucky in not hitting it before"

BTW, I last tried this with a fairly recent kernel (2.6.22).
Sometimes it is depressing
Posted Jun 11, 2008 23:58 UTC (Wed) by walken (subscriber, #7089) [Link]
Eh, I have that chip too. I don't know if it's got anything to do with Linux (my understanding is that the chip asks to be polled over USB every millisecond, and there are only 1000 frames that can go over the USB bus per second, so that device won't work if it has to share the USB bus with anything else).

There is an easy workaround: plug this device into a port where it won't have to share the bus with any other device. I.e. if you have two USB ports on your machine, plug the Prolific chip into one of them and everything else into a hub on the other port.

I have no idea if things are better in Windows; I thought it was an issue with the USB device itself.

BTW, did you try the USB_EHCI_TT_NEWSCHED thing discussed in that thread?
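For what it's worth, the arithmetic behind that understanding fits in a few lines of C. The 1 ms polling interval (bInterval = 1, in USB descriptor terms) is taken from the comment above as an assumption, not a verified property of the pl2303:

```c
#include <stdio.h>

int main(void)
{
	/* Full-speed USB: one frame every millisecond. */
	const int frames_per_second = 1000;
	/* Assumed polling interval of the device, in frames. */
	const int poll_interval_frames = 1;

	int polls_per_second = frames_per_second / poll_interval_frames;

	printf("device is polled in %d of %d frames per second (%.0f%%)\n",
	       polls_per_second, frames_per_second,
	       100.0 * polls_per_second / frames_per_second);
	return 0;
}
```

If that reading is right, every frame carries a poll for this one device, which is why the suggested workaround is to give it a bus segment of its own.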
Sometimes it is depressing
Posted Jun 12, 2008 0:24 UTC (Thu) by mikov (subscriber, #33179) [Link]
I am fairly certain the problem is not related to sharing the USB bus. I had four of those converters connected to an ordinary USB hub working 100% reliably, as long as USB 2.0 was disabled. Plus, you can buy a fairly expensive (hundreds of $) multi-port converter which internally is nothing more than a couple of cascaded USB hubs and pl2303 chips. I hope that they wouldn't be selling such devices if the underlying chip was fundamentally broken. Lastly, it all works peachy in Windows.

I tried USB_EHCI_TT_NEWSCHED (it is included in 2.6.22), but it didn't fix it. Alas I didn't have the chance to dig too deep (and I am not a USB expert, although I have done kernel programming) - sometimes it took many hours to reproduce the errors, and using USB 1.1 solved my immediate problem.

When I saw Greg KH's explanation that there are problems in the USB 2.0 implementation that have been known for years, I lost my hope of improving the situation constructively. Perhaps I should pick it up again. What is the best forum to report this problem? Apparently not the kernel Bugzilla? :-)
Sometimes it is depressing
Posted Jun 12, 2008 3:11 UTC (Thu) by dilinger (subscriber, #2867) [Link]
You'll note the wording GregKH used... "should be fixed", etc. Mark Lord had to report back that it was still broken. If GregKH actually had the hardware available to reproduce it, development and fix time would be much quicker.

As for bugs that are known for years: this is free software. The only people that are going to fix it are ones that are either paid to do so, or have an itch to scratch because their hardware is not working correctly. The fact that this is a corner case, and has an easy workaround, makes it pretty clear why it has taken so long to get it fixed.

I fail to see what's so depressing. It's hard enough reproducing bugs when you have the hardware, but not having it available makes fixing bugs many times more difficult (and kills much of the motivation to do anything about it).
Sometimes it is depressing
Posted Jun 12, 2008 4:23 UTC (Thu) by mikov (subscriber, #33179) [Link]
I don't think that this is a corner case at all. It is unacceptable to have random devices fail subtly and quietly when connected to a standard bus. Especially when such a fundamental and established interface as USB is concerned. It is disappointing that the kernel has known bugs of this nature which are not being addressed.

The problem is not so much that my particular device doesn't work. The depressing part is that it _really_ is nobody's fault. The development model is what it is. There is nothing better and there is nothing we can do about it. RedHat is not going to pay for fixing this because they don't care about desktops with random hardware. Canonical is not going to fix it because they don't contribute that much to the kernel. Nobody is going to pay for fixing it. There is nothing to be done. That is depressing.
Sometimes it is depressing
Posted Jun 12, 2008 5:19 UTC (Thu) by dilinger (subscriber, #2867) [Link]
It *is* a corner case. A device is plugged into a USB1.1-only hub plugged into a USB2 port. From the thread, my assumption is that the kernel (ehci) thinks 2.0 is supported because the host supports it, and thus attempts to talk 2.0 to the device. The hub in the middle screws things up. Bypass the USB1.1 hub, and things work just fine. If that's _not_ what you're doing, then you are seeing a different bug.
Sometimes it is depressing
Posted Jun 12, 2008 14:15 UTC (Thu) by mikov (subscriber, #33179) [Link]
This is not what is happening. The problem occurs when a USB 1.1 device is plugged into a USB 2.0 hub. AFAICT, this matches the description of the bug referenced in Greg KH's post. This is a frequent case - there are many USB 1.1 devices, but at the same time all hubs that can be purchased right now are 2.0. I suspect that most people are not seeing the problem simply because few people actually use hubs. Since the problem is subtle - a couple of lost bytes every couple of hours - most people wouldn't recognize it anyway.
Sometimes it is depressing
Posted Jun 12, 2008 20:09 UTC (Thu) by nhippi (subscriber, #34640) [Link]
Sometimes it's depressing to see how many posts some people bother to write about their problems to a random forum, when with the same amount of energy one could have filed a bug in bugzilla.kernel.org ...
Sometimes it is depressing
Posted Jun 12, 2008 21:22 UTC (Thu) by mikov (subscriber, #33179) [Link]
It is even more depressing when the Slashdot trolls start posting on LWN. First of all, this is not some random forum. Secondly, had you bothered to read the messages, you'd have seen that the bug is already known. Lastly, in case you missed it, the subject is not my specific problem, but the philosophical futility of reporting bugs in something free.

Incidentally, it appears that you don't even realize how much effort and time it takes to make a useful bug report. It is ironic that some people find it more acceptable to pollute bugzilla with useless whining complaints, rather than discussing it in a forum.
Sometimes it is depressing
Posted Jun 13, 2008 17:27 UTC (Fri) by dilinger (subscriber, #2867) [Link]
Once again: no. The original reporter says that when he plugs the pl2303 device directly into the USB2.0 hub, it works just fine. It's only when it goes through a USB1.1 dock/hub that it fails. So, once again: YOU ARE TALKING ABOUT SOMETHING COMPLETELY DIFFERENT FROM THE LINK YOU POSTED.

Most people aren't seeing the problem because most USB1.1 devices work just fine in USB2.0 hubs. The problem described in the link you supplied is a corner case (some weird built-in serial adapter in a hub/dock thingy). The problem you've described sounds like it's specific to some portion of your hardware. I dug through my hardware pile and found a pl2303. It works just fine in a USB2.0 port.

If you want to moan about how depressing kernel development is, that's fine; but claiming that it's hopeless when you refuse to get involved is just silly.
Sometimes it is depressing
Posted Jun 13, 2008 18:09 UTC (Fri) by mikov (subscriber, #33179) [Link]
> Most people aren't seeing the problem because most USB1.1 devices work just fine in USB2.0 hubs. The problem described in the link you supplied is a corner case (some weird built-in serial adapter in a hub/dock thingy). The problem you've described sounds like it's specific to some portion of your hardware.
Sigh. I explained this a couple of times. It is not specific to my hardware. As I already said, I have tested this with several different pl2303 converters, including very expensive ones. I have tested it on different machines with different USB chipsets. I have even tested a couple of different kernel versions. I am not an idiot, you know :-)
The description of the problem is simple and I don't see why I have to keep repeating it over and over. Apparently USB1.1 devices have problems when plugged into USB 2.0 hubs.
I agree t