Good to be unskilled?

Have you ever wanted someone to say this about your company: "Wow, we are really horrible at doing X"?  And I don’t mean in a thank-god-they-are-admitting-they-have-issues kind of way, but more in a "I’m really proud we can’t do that well" kind of way.  Up until recently, I thought the answer to my question was "Are you crazy!?!?  Of course not!  I want us to do everything really well."

Like many large companies, the ones I’ve been working with are going through a tough time due to the economy.  And of course, that means layoffs.  People that I liked were not so lucky.  You get around to talking with these folks about what the experience of being laid off was like.  I mean, it can’t be fun, but you want to know if it went relatively well.

Of course, it doesn’t go well.  I realize there’s no good way to lay someone off, but there are less bad ways.  I’ve heard horrid rumors of other companies laying people off via email and simply locking the doors to the building.  This experience was nowhere near that far down the scale.

But there are always things you can do better.  For example, the process of laying people off starts first thing in the morning and continues until everyone has been told.  But since nobody gets any warning about whether they are going to be laid off, those of us who kept our jobs sat around in our offices panicking that we were next.  At some point during the day it was over, and wouldn’t you want to know that?  Employees heard nothing until hours after the last layoff had been done.

Unnecessary hours, in my opinion, that nobody should have had to spend worrying.  I know, I know, think of how the people who were laid off felt.  Was it really that bad of a thing to leave employees wondering?  No, not really, but it could have been done better.

Later on that evening, I considered how lucky we were that the company is terrible at communicating during layoffs.  Why are they so terrible?  Well, it’s a rare occurrence.  If you don’t practice it, then even if you learn from a prior experience, you never get to apply those lessons.  If a company were expertly prepared to do layoffs, I’d be a little worried.

Sure, isn’t it great that they are super-capable?  No!  It’s awful!  It’s something they shouldn’t be doing, something that they have rarely had to do.  They should be god-awful at it.  Frankly, it hurt at the time, but now I’m downright pleased.

And this doesn’t just apply to communicating layoffs.  Disaster recovery of all forms might be fair game.  I mean, if you’ve gotten your system, process, product, whatever it is, so reliable that you’re unprepared for when it fails, that might actually be a good thing.

If you fail all the time, you’ll have the people and processes in place to deal with the failure.  You’ll be expert fire fighters, and that’s just not the place you want to be.

I’m not offering a free pass to companies for not being capable of dealing with a disaster that is likely to occur – like your servers or network going down – but at some point if you’ve really gotten good at something, I’d expect you’d be bad at dealing with the outlier.

Could it be good to be unskilled at something you shouldn’t be doing in the first place?  I think so.

Risk based testing doesn’t change the goal

I was just out to lunch with a friend who was telling me that the quality assurance department they work for was switching over to risk based testing.  It’s a simple concept as I understand it – test more where there is more risk, test less where there is less risk.  Risk is determined via experience, typically in the form of some scoring system which rates how risky a given change or application is.  The higher the score, the more or different types of testing you do.

Now I’m not a testing expert by any means, but the conversation turned to how they were going to measure their success.  Prior to risk based testing, the measure of success for the department was defect containment rate (DCR).

Defect containment rate is fairly basic as well.  It’s simply (every defect you find in testing) / (every defect you find in testing + every defect you find in production).  For example, if you find 75 defects while testing and, after the code reaches production, 25 more defects are found, then you have a 75% (75 / (75 + 25)) defect containment rate.  Generally, the higher your DCR, the better.
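Here’s a minimal sketch of that arithmetic in Python, using the same numbers (the function name is mine, just for illustration):

```python
def defect_containment_rate(test_defects: int, prod_defects: int) -> float:
    """DCR = defects found in testing / (defects found in testing + defects found in production)."""
    return test_defects / (test_defects + prod_defects)

# The numbers from the example: 75 defects found in test, 25 found in production.
print(defect_containment_rate(75, 25))  # 0.75 -> a 75% containment rate
```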

But, no, I’m told by my friend that the new measurement will be on defects found only in the areas QA tested.  So, by that logic, if through risk based testing you determined function A wasn’t risky enough to be worth testing, and it breaks in production, that defect shouldn’t be counted against you… Such a decision would only serve to shrink the denominator.  You’d still report all the bugs you found in test, but for each production defect that was found, you’d get to decide whether or not you meant to test for that bug.  Suddenly, 25 defects in production might only count as 10 or 15 if you deemed the remainder "things we weren’t looking for."  Now instead of 75% containment (75 / (75 + 25)) you’d have an 88% (75 / (75 + 10)) containment rate.  Hey, you improved!!!  Wrong!

Something is amiss here!  Since when does changing the way you do things change what is important to your customer?  If your prior measurement – defect containment rate – measured what your customer expected of you, where did you get the free pass to stop accomplishing that goal?

You don’t design metrics around what will look good.  Looking good is NOT equivalent to doing good.  Actually being good and meeting your customers’ needs is the goal.  If you refuse to measure how defects impact your customer just because you weren’t looking for those defects, it doesn’t make the defects go away.

Workflow management isn’t about online forms

Putting a process online isn’t just putting the forms on a website.  That’s how most people treat workflow management software, though, so it’s no surprise that they hate the software.  I recently experienced such a system, and I can see why people hate it.

For example, in this system, let’s say that I want to get support for a production server.  I have to submit a support ticket, but not just any support ticket will do.  There are about 30, yes 30, types of support tickets I can choose from!  Why are there so many?  Because each group who supports some system wanted their own form.  And so the very first thing I have to do in the process of getting support is make a decision: which ticket is the right ticket to open?  Get it wrong and I have to start the process all over again!  In reality, first I have to fill in a long form, effectively get the whole thing wrong, and then do it again once my form is rejected and they tell me which form was actually the correct one.  Does it start to make you feel like you’re in a ridiculous government bureaucracy?  It does to me!

First off, it’s suboptimal that I have to make a decision about who gets the ticket in the first place.  Whether I had a single form or 30 is irrelevant.  Why do I have to tell people who gets the form?  I have no idea about the way support has their departments set up, and frankly I don’t care!  Really and truly, I don’t.  Secondly, if you are going to make me choose who gets the ticket, why do I get punished so heavily when I choose incorrectly?  I have to completely re-enter the same data on a different form.

It’s a great example of how people mistreat workflow management.  The process, for me, includes deciding which group gets the ticket.  Why is that critical step, the very first step and the one I can so easily screw up, not automated?  Why is the magical decision tree about which form is right a) nowhere to be found and b) not executed for me?
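Here’s a minimal sketch in Python of what I mean.  The group names and ticket fields are hypothetical; the point is that the routing decision can live in the workflow tool instead of in the requester’s head:

```python
# Hypothetical routing rules: the requester describes the problem and the
# workflow tool decides which support group owns the ticket. Group names
# and ticket fields are made up purely for illustration.
ROUTING_RULES = [
    (lambda t: t["environment"] == "production" and t["category"] == "database",
     "DBA Support"),
    (lambda t: t["environment"] == "production", "Production Operations"),
    (lambda t: t["category"] == "access", "Security Administration"),
]
DEFAULT_GROUP = "Service Desk Triage"


def route_ticket(ticket: dict) -> str:
    """Return the owning group so the requester never has to pick a form."""
    for predicate, group in ROUTING_RULES:
        if predicate(ticket):
            return group
    return DEFAULT_GROUP


print(route_ticket({"environment": "production", "category": "database"}))
# -> DBA Support
```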

The second example I have seen is very similar.  Every time someone closes a defect tracking ticket, they must assign it a root cause.  The root cause data helps us figure out where we want to improve the process.  In fact, the first improvement would be to improve the root cause collection system itself.  As it is, people are given a list of choices such as "requirements issue", "design issue", "coding issue", etc.  We don’t provide definitions of what each one means, although the meanings might seem obvious.  That is, until you get to the next step.  In the next step, once you’ve chosen a root cause, you must choose a slightly more detailed cause.  For coding, for example, there’s "internal error" or "vendor error."  It all seems well and good until you get to this ambiguous one: "not coded according to requirements."

I have always read this to mean "the requirement was clear but the developer didn’t code it that way anyway."  And yet, the other day someone said "no, it means the requirement was so ambiguous that the developer didn’t know what to code."  Really?  I didn’t get that from the description.  Again, here’s a great example of leaving the process offline while putting the form online.

In order to arrive at the root cause, there are a series of questions you have to ask yourself.  It forms a decision tree.  Assuming you follow the decision tree, you get the right root cause.  If you don’t, and make a guess, you get what we have above, which is a misunderstanding over what it means to have something "not coded to requirements."  Instead of asking someone what the root cause is, ask them the questions from the decision tree.  Because the questions are simple yes/no questions, by the time they reach the end of the decision tree, the root cause has already been decided.

Was there a requirement for this defect?  If no, it’s a missing requirement.  If yes, continue.  Was the requirement clear?  If no, it’s a vague or incomplete requirement.  If yes, continue.  Did the design take the requirement into consideration?  If no, it was a design flaw.  If yes, continue.  Did the developer code the requirement as written?  If no, it is a requirement not coded as written.  And so on… you see how it goes.  But don’t just give me a drop down for the root cause.  Ask the questions.
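A minimal sketch of that decision tree in Python, using the questions and root-cause labels from the example above (the helper names are mine):

```python
def ask(question: str) -> bool:
    """Prompt for a yes/no answer."""
    return input(question + " (y/n): ").strip().lower().startswith("y")


def root_cause() -> str:
    """Walk the decision tree instead of offering a drop-down of root causes."""
    if not ask("Was there a requirement for this defect?"):
        return "missing requirement"
    if not ask("Was the requirement clear?"):
        return "vague or incomplete requirement"
    if not ask("Did the design take the requirement into consideration?"):
        return "design flaw"
    if not ask("Did the developer code the requirement as written?"):
        return "requirement not coded as written"
    return "coding error"  # ...and so on, down whatever branches remain


print("Root cause:", root_cause())
```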

The point is simple.  If you only have half the process online, for example just the forms, then the rest of the process is occurring without any error-proofing.  What’s the point of using workflow management for half the job?

A single point of failure

This is a thought-provoking entry from Curious Cat Management on the single point of failure.  How many days do we allow an individual to hold the company, or some part of it, hostage to his or her exclusive knowledge?  How often do you allow the performance of your team to be dictated by individuals?

And what do you do when those same individuals have an off day, or an off week, or a personal problem, or even the sniffles…

How wonderful is it when your key employee comes through for you in fighting the hundred new fires that came up today?  Did you ever stop to think that the hundred fires might have been set ablaze by the same individual?

True, there’s nothing exciting about adherence to a process.  There’s nothing exciting or remarkable about any individual when a well-designed process eliminates the art from regular high performance.  What is left to differentiate someone if they are not the linchpin in your system, your single point of failure?

Instead of rewarding the individual for a stellar performance, how about rewarding the individual for bringing up the entire team’s performance?  It reminds me of World War II, when leaders looked to the individuals in the trenches for ingenious solutions to the things that stood in their way.  Those individuals were recognized for their contributions to the war effort, but the benefit was not in one person working harder or faster; it was in making it possible for everyone around him or her to be more productive.

Don’t reward people for standing out above their peers; reward them for not standing out above their peers because they’ve brought their peers up to their level.

A process without decisions

Maybe this is a rule of good process design that I’d simply never heard of, because it seemed suddenly obvious to me when I thought about it today.  I’m going to write about it anyway.

I’m going to propose that a key feature of a good process is one that has no decision points in the process flow.  That’s right, no decision points.  A decision equals an opportunity to make the wrong decision and therefore the opportunity to waste time or jeopardize quality.

If a process is to be consistently good, then the last thing you want to do is give an individual (especially software developers) the opportunity to make the wrong decision.

I’ll give you an example – the creation of an analysis document.  Let’s say you have a process which allows the developer to choose either the long (and therefore rigorous) format or the short format for the document.  The decision you want the developer to make is "select the format appropriate for the scale of the change."  Which one would you choose as a developer?  If you were well intentioned, you might really consider the question.  Still, it’s the kind of question you could screw up.  The decision is based on judgment, and maybe your developer judges it incorrectly.  The developer chooses the long form when the short form would have been acceptable and wastes all kinds of time filling out an unnecessarily complex document.  Or, the developer chooses the short form, misses the necessary rigor, and inserts a major design defect into production.  Either way, the decision allows for the wrong choice to be made some of the time.

What about a not so well intentioned developer?  A lazy developer, perhaps.  Not that those kind of people exist at your company.  Oh no, all your employees are well intentioned all the time…

A lazy developer would always opt for the short form.  Maybe s/he doesn’t value analysis or maybe s/he does but thinks writing things down is silly.  Who knows, who cares.  The short form isn’t always appropriate.

Here’s what I’m proposing.  No short form, no long form, and certainly no choice about whether or not to fill out the form at all.  NO DECISION!  Just have one form that scales to meet the need automatically.  I know what you’re thinking: how can that work when we have 246 sections in our standard design document and a developer needs to decide whether each section is appropriate or not?  WRONG!  Why do you need 246 sections in your document?  Do you think each of those sections is critical to success?  Have you bothered to figure out which sections, if done well or poorly, actually affect performance?  Probably not.  You probably like lengthy forms with nice headers and sections, and instructions in each section about how it should be filled out.  You’ve ignored the idea of the critical few – and there are a critical few – that really affect process performance.  Everything else is noise.

It is possible (I have done it) to have a single document format, nay a single process, which you follow unwaveringly through new development, enhancements, and even simple bug fixes.  The fewer decisions in the process, the better.

With that in mind, I propose this measurement of the goodness of your process.  To figure this out, you have to get down to the micro level.  If associates are making decisions about which sections of a document to fill in, that’s a decision you should count.

Decision density = (# of decisions in a process) / (total # of steps in the process, counting the decisions themselves as steps).

For example: 

  • analysis
  • analysis review
  • decision: analysis OK?
  • design
  • design review
  • decision: design OK?
  • code
  • code review
  • decision: code OK?
  • test
  • decision: test OK?

That would have a decision density of 4 decisions / 11 total steps ≈ 36%.  I’d say that a decision density of less than 50% is a good start.  The lower the better.  There’s probably a better way to look at it, since you might want to weight micro decisions (those made without consultation with a peer or group) as being worse than the bigger decisions in the process (like those made after an inspection step).
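As a rough sketch, here’s that calculation in Python for the example process above:

```python
# Each step in the example process, flagged as a decision or not.
steps = [
    ("analysis",               False),
    ("analysis review",        False),
    ("decision: analysis OK?", True),
    ("design",                 False),
    ("design review",          False),
    ("decision: design OK?",   True),
    ("code",                   False),
    ("code review",            False),
    ("decision: code OK?",     True),
    ("test",                   False),
    ("decision: test OK?",     True),
]

decisions = sum(1 for _, is_decision in steps if is_decision)
density = decisions / len(steps)
print(f"{decisions} decisions / {len(steps)} steps = {density:.0%}")  # 4 / 11 = 36%
```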

The short story is: avoid decisions in the process.  The more decisions you have, the more you have a hand-wavy methodology rather than a process.

Avoiding Non-compliance

If you’ve built yourself a good process and proved that it works, you want people to follow the process.  You know that it makes a difference and it is really frustrating when people continue to cause problems by not doing “what’s best for them.”  Like a couple of prior posts, this one is about why people don’t do what you want them to do.

 I’ve already been down the People Are Selfish road so I’m not going to rehash it.  Let’s take that for granted.  Today I wanted to give you another option for getting people to do what you know to be the right thing.

Here’s the idea, and it’s super-simple.  Make it easier to do the right thing than the wrong thing.  “What!?!?” you say, “I’m not paying for that kind of advice!”  Of course, you don’t pay me anything, so…

When process compliance is a problem, you have to look at what it takes to get the process right.  Going back to that people-are-selfish (and lazy) thing, if there’s a shortcut to be taken, people will take it.  If taking that shortcut hurts the performance of the process, then you must make it hurt more to go down that path than not to.

Take, for example, my old team at my last job.  No software developer likes writing documentation (or thinking through a problem); they want to get right down to coding.  I already knew that this was going to be an issue.  The process I had designed worked because it relied on peer review to ensure that everything was getting thought through.  That also meant that being sloppy was not an option for my employees.

Why were people sloppy?  Well, they figured that if they didn’t get the design right in the first place, they’d just go back and fix the code later, and that fixing the code later would be easier than figuring out the right thing now.  Of course, that’s not true.  Finding a bug later in the process takes orders of magnitude more effort than finding and fixing it early on.

So, what could I do to make people do the right thing?  I couldn’t make people write good documentation.  However, because I had set up controlled access to source control, I could make them write more documentation.  In fact, this is exactly what I did.

If a person wrote a defect and it made it to QA, they would have to write a complete set of documentation in order to fix the bug.  That meant that even for the simplest one-line code defect, I required an analysis document, an analysis review by a peer, a design document, a design document review by a peer, the code, a code review document by a peer, and a test results document.  It is really annoying to fill out all those documents for a single-line code change.  It is so annoying, in fact, that it drives down the defect rate.  The cost to the developer of screwing up is now significantly greater than the cost of getting it right in the first place.

Developers, or anyone for that matter, do not see the costs their screw-ups impose on someone else.  If my laziness results in the need for a full-time support person, it doesn’t hurt me directly at all.  But if my laziness creates work for me, then I’m less inclined to be lazy.  I hate doing stuff over.  I hate filling out documentation.  I hate being punished for something that I had the capability to avoid.

So, while what I really wanted was for people to do a good job on the documentation, there was no way I could enforce it.  But since doing a bad job could result in the significant punishment of even more documentation, I had an indirect means of getting what I wanted in the first place.

And that is making it easier to do the right thing rather than the wrong thing.

How not to improve a process

One of the things that Six Sigma is all about is data-driven decision making.  This seems like a good thing to do intuitively, but I think people skip it anyway because getting the data is difficult a lot of the time.  Now, your experience probably won’t be as disastrous as the one I’m about to describe to illustrate my point, but it could get there…

Fortunately, my example has nothing to do with a real process I was working on.  I was in black belt training and we were using the Statapult to learn about various statistical techniques.  I am, by nature, super competitive, so I wanted to win the competition that was set up between the various teams.  That meant putting the most balls on target and then presenting your findings (your learning experience) to the teams at the end.

For those of you not familiar with the Statapult, I’ll provide some background.  Essentially, it’s a very small catapult that fires wiffle and ping-pong balls.  It has a bunch of adjustments on it.  Here’s a picture of it:

[Image: a Statapult]

You have many adjustments you can make to it.  There’s the tension pin, the stop pin, the pullback angle, the rubber band connection point on the arm, and the cup that holds the ball on the arm.  Two things need to be tackled in the statapult experiment.

  1. You have to figure out how the various adjustments affect the distance the ball flies.  After all, the goal is to launch the ball onto some target a random distance away.
  2. You have to figure out how to get the variation out of the system so that you can do #1.  This is critical.  There is a lot of noise in an uncontrolled statapult.  The rubber band is strong enough to move the entire statapult when released, even if a couple of hefty folks are holding it down.  And then there are other things, like the tension pin, which rotates in its slot as you pull back the rubber band.

In fact, it’s that very tension pin that taught me more about the statapult than anything else.  Here’s the important message that I want to demonstrate to you: do not improve a process based on your assumptions because you can do bad, bad things without realizing it.

After getting a day to play with the statapult, we decided (again, being very competitive) that we needed some serious improvements to control variation.  I went to the hardware store on the way home and bought:

  • 4 lag bolts, 2 1/2 inches long
  • 16 washers
  • 4 wing nuts
  • A roll of foam insulating tape
  • 2 metal plates
  • A piece of 1/4″ plywood for a base (I had this in my garage)

When we came back in, I built a base to control our statapult’s movement.  We drilled holes in the plywood and pushed the bolts (with washers to prevent rip-out) through the board.  We then placed the statapult between the bolts and used the metal plates, more washers and wing nuts to hold the statapult to the base.  I attached the insulating foam to the underside of the metal plates to a) protect the statapult from damage (they’re insanely expensive) and b) to provide shock absorption. 

Finally, we looked at what else we could do to take noise out of the system.  Well, surely that spinning tension pin was a bad thing.  Every time we pulled back the arm it would rotate, and the rubber band was clearly slipping over it.  This had to be a bad thing, so we clamped down the tension pin so it wouldn’t rotate.

By the time we were done with this, we were learning about Design of Experiments (DOE).  We were doing a 2^k factorial DOE, which is pretty straightforward, and we expected that when we were done we’d have the best results in the entire class.  After all, we had all the noise under control.

The reality was that we had one of the worst results, and we didn’t understand why.  The Master Black Belt teaching the class took a look at our data and concluded it was because of the firing angles we had chosen.  For a DOE to be effective, it helps to have at least one continuous variable to adjust.  All the other adjustments on the statapult are discrete, so it is important to use a wide range of firing angles to get fine control over the firing distance.  Because of where we were in the classroom, if we used a wide range of firing angles we’d launch the ball right into the wall.  That meant we couldn’t get a measurement on the shot and therefore couldn’t compute the results.  To compensate, we used firing angles between 135 and 140 degrees for our experiments, and it seemed like that might not have been enough of a range.

Our MBB was kind enough to stay with us, do the experiment over with a wider range of firing angles (and a bit more room to fire), and see what we got.  The results still sucked.  It wasn’t the firing angle, but we discovered something very interesting.

When you do a DOE, you randomize the order in which you adjust all the parameters.  You do this to avoid introducing bias into the experiment.  You also take two measurements at each set of parameters.  So, if you were using settings A, B, and C in the first shot, you might do 15 more shots with various settings before coming back around and doing settings A, B, and C again.  Because it was getting late, our MBB said to skip the randomization, since each time you randomize you have to reset the whole device and that takes a while.  So, we were doing shot 1 and shot 2 back to back at settings A, B, and C.  And we found out something crazy: the two shots weren’t producing anywhere near the same result.  If you use the same settings on the statapult and you’ve gotten rid of all the noise in the system, it should produce the same result (or at least darn close to it).
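For what it’s worth, here’s a minimal sketch in Python of how you might lay out a replicated 2^k run order with the randomization we skipped.  The factor names and levels are hypothetical, not our exact settings:

```python
import itertools
import random

# Hypothetical low/high levels for three Statapult factors.
factors = {
    "pull_back_angle": [140, 160],      # degrees
    "stop_pin_hole":   [1, 3],
    "band_hook":       ["low", "high"],
}

# Every combination of levels = 2^k treatment combinations.
combinations = [dict(zip(factors, levels))
                for levels in itertools.product(*factors.values())]

# Two replicate shots at each combination, then shuffle the run order so
# drift over time (fatigue, loosening bolts) doesn't bias any one setting --
# the step we skipped, to our detriment.
runs = combinations * 2
random.shuffle(runs)

for i, run in enumerate(runs, 1):
    print(f"shot {i}: {run}")
```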

That’s when we had an epiphany.  Locking down the tension pin rotation was a bad thing!  Yes, the rubber band did move over the rotating pin, but that was a good thing, not a bad thing.  Allowing the pin to rotate equalized the tension on the rubber band on both sides of the pin.  Without it, sometimes the rubber band would be really tight and sometimes really loose, resulting in unpredictable shots.

Alas, it was now quite late and our MBB was unable to stay for round number 3.  One of my teammates and I did the entire experiment again with just the two of us (it’s much easier with 4 or more people).  And finally, we got a good DOE and a regression equation with an adjusted R-squared of 95.6%.
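As a rough sketch, here’s how you could fit that kind of regression and compute the adjusted R-squared from a set of DOE results.  The data here is invented purely to show the arithmetic, not our actual shots:

```python
import numpy as np

# Hypothetical DOE results: each row of X_raw is (pull-back angle in degrees,
# stop-pin position); y holds the distance flown in inches. Made-up numbers.
X_raw = np.array([
    [140, 1], [140, 3], [160, 1], [160, 3],
    [140, 1], [140, 3], [160, 1], [160, 3],   # replicate shots
], dtype=float)
y = np.array([48, 62, 71, 90, 50, 60, 73, 88], dtype=float)

# Fit distance = b0 + b1*angle + b2*stop_pin by ordinary least squares.
X = np.column_stack([np.ones(len(y)), X_raw])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef

# R-squared and adjusted R-squared (which penalizes extra model terms).
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
n, p = len(y), X_raw.shape[1]
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"coefficients: {coef}, R2={r2:.3f}, adjusted R2={r2_adj:.3f}")
```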

In the end, we didn’t win the competition.  It turns out there’s one more noise factor in the system we didn’t count on – having everyone else in class watch you during competition makes you really nervous.  We did come in second place, primarily because all our failures let us tell one of the best learning stories about the statapult.

Getting back to my message for you of "do not improve a process based on your assumptions because you can do bad, bad things without realizing it," you can see how making bold assumptions about the statapult resulted in us making the system work worse than it would have if we just hadn’t played with it at all.  This is what you should take away about real-life process improvement.  It is not enough to think you know what’s right for the process; if you don’t understand how the process works, then don’t play with it.  Or do play with it, but have the fortitude to admit you screwed it up and now know enough not to do it again.