The game starts when something breaks.
A service is running slowly, and the sounds of a room full of frustration echo down a phone line. Somewhere, business has expensively stopped, amid a mess of lagging screens and pounded keyboards.
The helpdesk technician provides sympathetic reassurance, gathers some detail, thinks for a moment, and passes the issue on. A nice serve, smooth and clean, nothing to trouble the line judges here.
THUD!
And it’s over to the application team. For a while.
“It’s not us. There’s nothing in the error logs. They’re as clean as a whistle”.
Plant feet, watch the ball…
WHACK!
Linux Server Support. Sure-footed and alert, almost nimble (it’s all that dashing around those tight right-angle corners in the data center). But no, it seems this one’s not for them.
“CPU usage is normal, and there’s plenty of space on the system partition”.
SLICE!
The networks team alertly receive it. “It can’t be down to us. Everything’s flowing smoothly, and anyway we degaussed the sockets earlier”. (Bear with me. I was never very good at networking).
“Anyway, it’s slow for us, too. It must be an application problem”.
BIFF!
Back to the application team it goes. But they’re waiting right at the net. “Last time this was a RAID problem”, someone offers.
CLOUT!
…and it’s a swift volley to the storage team.
I love describing this situation in a presentation, partly because it’s fun to embellish it with a bit of bouncy time-and-motion. Mostly, though, it’s because most people in the room (at the very least, those whose glasses of water I’ve not just knocked over) seem to laugh and nod at the familiarity of it all.
Often, something dramatic has to happen to get things fixed. Calls are made, managers are shouted at, and things escalate. Eventually people are made to sit round the same table, the issue is thrashed out, and finally a bit of co-operation brings a swift resolution.
You see, it turns out that the servers are missing a patch, which is causing new application updates to fail. The applications can’t write to their log files because the network isn’t correctly routing around the SAN fabric, which was taken down for maintenance that has overrun. It took a group of people, working together, armed with proper information on the interdependent parts of the service, to join the dots.
Would this series of mistakes seem normal in other lines of work? Okay, it still happens sometimes, but in general most people are perfectly capable of getting together to fix problems and make decisions. Round-table meetings, group emails and conference calls are nothing new. When we want to chat about something, it’s easy. If we want to know who’s able to talk right now, it’s right there in our office communicator tools and on our mobile phones.
It’s hard to explain why so many service management tools remain stuck in a clumsy world of single assignments, opaque availability, and uncoordinated actions. Big problems don’t get fixed quickly if the normal pattern is to whack them over the net in the hope that they don’t come back.
Fixing stuff needs collaboration, not ticket tennis. I’ve really been enjoying demonstrating the collaboration tools in our latest Service Desk product. Chat simply makes sense. Common views of the services we’re providing customers simply make sense. It demos great, works great, and quite frankly, it all seems rather obvious.