In this episode I’ll cover an extremely satisfying application of Lean and Agile on a large ‘Legacy’ effort I was part of a while back.
Please understand that I am not at liberty to discuss certain (that is, most) details. I must protect the guilty. I will give just the bare details – only what is necessary. I bet you don’t believe I will be able to give just the bare details, right?
So – this is about using the No Estimate Approach on an End-Of-Life Legacy Support project.
First, a few definitions:
What is “Legacy” code?
I am glad you asked. There are many definitions of legacy code. Here is the one I use: Legacy code is ANY CODE THAT IS DOING ACTUAL WORK THAT YOU WANT TO KEEP WORKING. (Please note the clever use of “all caps” for emphasis.) Does that make sense to you? Doesn’t matter – just pretend that is what it means. Using my definition, most projects become legacy code as soon as you have some functionality in production. That’s when the bits hit the fan, usually, so to speak. However, even if it isn’t yet in production, it could still be “legacy” if we have stuff working we don’t want to break.
What is Legacy Support?
Regardless of what you think legacy support is, for the purposes of this post I am using it to mean this: “Work done (bug fixes and minor enhancements – that sort of thing, or more serious stuff if needed) to extend and protect the life of code that is in use”. I’m leaving out more extensive enhancements like “whole, new functionality”. You can judge for yourself the difference between minor and “more extensive” enhancements. Doesn’t really matter to me.
What is End-Of-Life?
For the purposes of this post, an End-Of-Life project is a project that is almost retired. It’s still in use but going to be replaced sometime soon. In software programming “Soon” means “perhaps sometime, but we have no clue when.” The basic idea is that people are using it, and it is important enough to keep supporting, but we’ll stop doing big additions and improvements. Some bugs will just be ignored, but any deemed to be important to fix will be fixed if possible.
Project History and Details
The project that we are discussing is a large, line-of-business set of applications (‘The System’) that is critical to the users of the software for managing most aspects of their business. The typical user of The System manages thousands to millions of their own customers – sales, billing, transactions of all sorts, bunches of “real-time” stuff, reporting, etc. Typical line-of-business stuff. I was told the code base was about 2 million lines of code [and that’s about enough, in my opinion – any more is just bragging.]
The System has been in production for 15 years or so and has moved through several different code bases. The current code base is about 4 or 5 years old. It uses several languages – 4 or 5 perhaps. Typical stuff.
Enhancements and bug fixes are managed, worked on, and delivered using a waterfall approach, with a 3 month delivery schedule (that takes 6 months due to test and fix cycle, and other typical deadline issues).
Due to design, industry, and technical reasons it is deemed necessary to re-write The System in a new language with a new architecture.
The Path Going Forward
The current version of The System MUST continue to serve the user base, and even more – it has to be clear that we are still responsive to concerns and issues of the user base so they don’t start looking to competitors to solve problems while the “greenfield” version of The System is being developed. It has to look and feel like “business as usual, only better”.
A “maintenance team” is put together to provide this service.
I’m not sure exactly how many people were working on The System before the End-Of-Life Maintenance, but it was something on the order of 20 to 25 people – maybe more, it was a while back and I didn’t keep track.
The new team was to be 6 people: 2 QA, 2 BA (Business analyst or “Product Owner”), 2 Dev. I was one of the two Devs, and also the manager, or co-manager (more or less). I “volunteered” for this assignment on one condition: I would be allowed to manage this effort my way – which means Lean/Agile/XP.
We’ll call this System EOL (for system end-of-life).
The State of System EOL on Day One
There are approximately 500 active bugs in the bug tracking system. Some of these bugs have been pernicious. Some have never been addressed due to level, some have been introduced in the fixing of other bugs, some have been there a long time but just recently discovered – All the typical bug stuff.
500 Active Bugs???!
There are 500 active bugs! There are a number of reasons there are so many bugs. Before we talk about that, I’d like to ask a question.
- Question: What is an appropriate number of known bugs for a project that is in production?
- Answer: Zero.
Got that? I’ll allow for any bugs that have easy work-arounds – you can leave those if you have to. I’ll also allow for any bugs that were discovered since the last deployment. But you better fix those NOW and then you’ll have zero bugs once again. Nice.
Remember that: Zero bugs. Does that seem unreasonable to you? If it does, I’ll ask another question: Why bother reading my blog? Go do something else that seems reasonable to you. Why waste your time on what I’ve got to say? I am NOT REASONABLE most of the time. I want a better way to do stuff than whatever we used to think is “reasonable”. Zero bugs is a “lofty goal” in some workplaces. In others, it is business as usual. I prefer zero bugs.
So, 500 bugs is not just way too many, it is infinitely too many. Or whatever 500 divided by zero is. Anyone know?
One reason there are so many bugs: The old process.
There were several reasons all these bugs existed, but the one that is most pertinent to our discussion today is the process that was in place for assessing and fixing bugs. The way things were being done was about like this:
- Bug report is entered into the tracking system by a user, support person, or developer
- The bug is reviewed at a “triage” meeting where its importance is determined
- If it is important enough, it is assigned to an “investigator” who will assess it: that is, attempt to reproduce it, determine the likely cause and the module that probably contains the “bad code”, and estimate the time to fix.
- After the assessment, the bug goes back to a “bug review” meeting to re-assess its importance and schedule the work to be done.
- If it is deemed important enough, it is scheduled to be fixed (either sooner or later based on the “cost” of fixing it and importance to the customers). If it is not important enough, it is closed and its status and other details are communicated to the interested parties.
- It is assigned to a developer who works in the area of the “bad code”. Since the developer is a different person than the investigator, the developer starts from square one and “re-thinks” the stuff the investigator already thought about.
- In working on the bug, the developer discovers the problem is NOT in the area of code it was thought to be in, so he adds his comments and assigns it back to the bug review team.
- The bug review team looks at it again, and then re-assigns it to an investigator.
- And so on and so on…
- Once a bug is actually fixed, it goes through QA where it is discovered that it didn’t fix what needed to be fixed.
- Go through whole process again.
- Once a bug is fixed and passes QA, it is deployed in the next deployment, and…
- The users discover that it didn’t fix what needed to be fixed, or the fix was only partial, or that something else is now broken.
- Even worse – some other things have been broken that are not yet discovered that will lead to other bug reports sometime soon.
Okay – you get the point. I am not making this up.
There were other reasons System EOL was in this state, which included lack of unit tests, code neglect (huge methods, overwhelming complexity, fragility, coupling and cohesion problems), and so on… but we’ll solve one thing at a time.
Chutes and Ladders
Well. Does that remind you of that game you used to play when you were a kid? It should. And it is JUST AS FRUSTRATING, at least conceptually. Only difference is you get paid to play Bug Fixing Chutes and Ladders. But overall – it is terrible and horrendous (and other bad words).
What to do???
Since I was allowed to manage System EOL in my own way – Here is what we did.
Value Stream Map
First day I brought in 6 copies of Mary and Tom Poppendieck’s “Lean Software Development” book, and we all agreed to read it, and more or less follow their basic concept. I’d been doing XP for years (more or less) but it’s a hard sell so I opted for the “easy sell”. If Lean is good enough for Toyota, it’s good enough for me.
First thing we did was a value stream map on the “Chutes and Ladders” methodology that was in use. If you haven’t seen a value stream map or don’t know how to do one, take a look at Mary and Tom Poppendieck’s “Lean Software Development” book. The basic idea is that anything that is not directly adding value can (and should) be eliminated. For example: waiting in queues is a waste – it brings no value to the person interested in having the bug fixed.
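If you like to see this sort of thing as numbers, here is a minimal sketch of what a value stream map boils down to. Everything in it is an assumption for illustration: the `Step` class, the step names (loosely borrowed from the old process above), and especially the hours, which are made up and are not measurements from System EOL.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    wait_hours: float   # time the bug sits in a queue before this step starts
    work_hours: float   # hands-on time spent in this step
    adds_value: bool    # does the person who wants the bug fixed care about it?


# Hypothetical numbers only; plug in your own observations.
old_process = [
    Step("Triage meeting", 40, 0.5, False),
    Step("Investigator assessment", 80, 4, False),
    Step("Bug review meeting", 40, 0.5, False),
    Step("Developer fix", 120, 6, True),
    Step("QA verification", 40, 2, True),
]


def summarize(steps):
    """Return (value-adding hours, total elapsed hours, efficiency ratio)."""
    value_hours = sum(s.work_hours for s in steps if s.adds_value)
    elapsed_hours = sum(s.wait_hours + s.work_hours for s in steps)
    return value_hours, elapsed_hours, value_hours / elapsed_hours


value_hours, elapsed_hours, efficiency = summarize(old_process)
print(f"value-adding: {value_hours}h of {elapsed_hours}h elapsed ({efficiency:.1%})")
```

The point of the exercise is the ratio: the waiting and the non-value-adding steps are the candidates for elimination.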
I bet you can guess that we found a lot of NON-VALUE-ADDING behaviors. So, we made a few rules:
- No Triage meetings or Bug Review meetings
- No Assessments for “where the code is broken”
- No prioritization or severity determination (a bug is a bug is a bug). If it’s in the tracking system, fix it.
- No Estimation of effort. We gotta work on all the bugs.
- … and Other things
NOTE: No Estimates, No prioritization, No meetings. These are wastes (at least in some situations, if not a lot) – bringing no value to the “customer”. Even worse, waste usually removes value by using up time and “spirit”. A team doing a lot of meetings, estimates, assessments, meetings, prioritization, meetings, waiting, etc. becomes NUMB. Avoid numbness.
An initial assessment:
We quickly read each bug in the tracking system, looking for duplicates, bugs no one had asked about in a year or more, and bugs where the description could not be understood (and there was no one who could clarify). This was done very quickly and eliminated about one half of the bugs. NICE. 500 bugs cut down to 250 bugs. No one can say that is not progress!
This took about a week (as I recall). During this first week the developers and QA took on bugs to fix, since there were plenty that were clear enough to work on – but spent some of their time helping the BAs “weed out” the bugs that we removed.
Daily process we adopted:
- Daily morning 1/2 hour meeting. Sitting. Quickly talk about some interesting things (movies we had seen, a great lunch place we found, etc.), and occasionally some work stuff as well. Seriously – this was a time for the team to become comfortable with each other and build some common memories and start the day off on a pleasant note. And everyone outside the team expected we’d do daily meetings and we didn’t want to disappoint them.
- Pick a bug, any bug.
- QA and Dev sit together and reproduce bug. Confer with BA as needed.
- QA and Dev examine code together, and create a characterization test to “clamp down” the code.
- Dev fixes code using TDD whenever possible, conferring with and pairing with QA and BA whenever needed
- QA and Dev test code to verify fix works as expected
- QA and Dev run all existing characterization tests
- Code is checked in and project is built
- QA runs all existing automated UA tests and any needed manual tests.
- Pick next bug – preferably one that is in some way related to last bug – but not critical. Just pick one, and repeat.
- About every two weeks a build of whatever was fixed at that point was made available to customers.
- They could take the build if any bugs they were interested in were fixed, or they could ignore it. It was up to them.
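For anyone who hasn’t written a characterization test: the idea is to record what the code does right now, quirks and all, before you touch it, so you find out immediately if a fix changes behavior you didn’t mean to change. Here is a minimal sketch in Python. The function `legacy_invoice_total` is a made-up stand-in, not code from System EOL, and the expected values were obtained the way you would really do it: run the existing code and write down what comes out.

```python
import unittest


def legacy_invoice_total(line_items, discount_code):
    """Hypothetical stand-in for a tangled legacy routine."""
    total = 0
    for price, qty in line_items:
        total += price * qty
    if discount_code == "VIP":
        total = total * 0.9
    # Quirk we may not like, but someone may depend on it:
    # negative totals are silently clamped to zero.
    if total < 0:
        total = 0
    return round(total, 2)


class CharacterizeInvoiceTotal(unittest.TestCase):
    """Pins down what the code does today, not what we think it should do."""

    def test_plain_order(self):
        self.assertEqual(legacy_invoice_total([(10.0, 2), (5.0, 1)], None), 25.0)

    def test_vip_discount(self):
        self.assertEqual(legacy_invoice_total([(10.0, 2), (5.0, 1)], "VIP"), 22.5)

    def test_negative_total_is_clamped(self):
        self.assertEqual(legacy_invoice_total([(-10.0, 1)], None), 0)


def run_characterization_tests():
    """Run the pinned-behavior tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CharacterizeInvoiceTotal)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_characterization_tests()
```

With the current behavior clamped down like this, the actual fix can then be driven by a new failing test that describes the behavior the bug report asks for.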
Guess how long it took to clean this project of bugs. Give up? 3 months!
250 bugs in 3 months – that’s about 4 bugs a day!!! Well, we did have 17 “un-fixable” bugs. Those were bugs that we weren’t allowed to fix for some business reason. So still – an average of just under 4 bugs a day.
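If you want to check my arithmetic, here is a quick sketch. The 21 working days per month is my assumption for the calculation (it is not a figure we tracked), and `bugs_per_day` is just a helper name I made up.

```python
WORKING_DAYS_PER_MONTH = 21  # assumption: roughly 21 working days in a month


def bugs_per_day(bugs, months, working_days_per_month=WORKING_DAYS_PER_MONTH):
    """Average number of bugs cleared per working day."""
    return bugs / (months * working_days_per_month)


all_bugs = bugs_per_day(250, 3)        # counting every bug left after the weeding
fixed_only = bugs_per_day(250 - 17, 3)  # leaving out the 17 we weren't allowed to fix

print(f"all 250 bugs:   {all_bugs:.2f} per day")
print(f"233 fixed bugs: {fixed_only:.2f} per day")
```

Either way you count, it lands a little under 4 a day.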
This is the power of eliminating waste. All of our time was spent doing meaningful work, REAL WORK on bugs. No reviewing, no estimating, no “pre” investigation that wasn’t part of doing the fix, no passing code to QA that we were not confident would pass all tests, etc. In other words: NO WASTE. Or at least: We were no longer doing any of the steps that we had identified as waste.
A Few Other Interesting Things:
Well, what we found is that in fixing some bugs, we’d incidentally fix other bugs. That is, some bugs were caused by a single underlying malfunction. That happens, and it is a sweet bonus when it does.
We had no recidivism. Once delivered to the customers, we did not have a single fix that was “sent back”. Once fixed it stayed fixed. I think this can mostly be attributed to the QA and Dev working together, the unit tests put in place, and the “one-at-a-time” approach to “fix and build.” We would fix ONE BUG at a time. Fix, Build, Unit Tests, Checked in, Built, Tested, confirmed. Only then would the next bug be tackled.
So… this is about no estimates. Our rule: Eliminate waste. In this case, estimates were a waste. They were thought to provide a way to “manage” the work being done, but were in fact reducing the ability of the organization to “manage” getting work done. I think they are like the dice in chutes and ladders? Probably not. Still, focus on the real work, don’t muddle things by playing chutes and ladders.