How to solve production problems

Jonathan FisherSeptember 3rd, 2011Last Updated: October 21st, 2012

0 1,146 6 minutes read

At my job, I’m finding that a large percentage of the questions that come to my desk are “Hey this thing you did is broken” or “Hey, this thing you worked with is on the fritz.” 99% of the time, I’d have no idea why the behavior they were asking about was occurring!

However, I think I’ve got a great problem solving methodology, which I attribute to being key in my success. Maybe some agile writer can take the concepts here and we can develop a formal methodology for solving problems.

Preventative Measures

As professionals, we should accept that production problems are unavoidable. You will create defects in your code, and no amount of testing will ever weed them all out. Code is rushed, servers crash, disks fail, and datacenters overheat. Accepting the fact that production problems occur is very important, but most importantly you need to plan for them. As one of my favorite financial advisor says, “It’s not a matter of if these events will happen; it’s simply a matter of when they will happen.” Plan your week/month with hours available to investigate strange behavior, crashes or user complaints.

Your developers must be on call. I’ve discovered there is a strong correlation between the cleanliness of a developer’s code, and whether or not he had to be on call at some point in his career. People that don’t have to shovel their own shiztpoo do not have stake in preventing problems! Here’s the thing though, the on-call phone is for emergencies. There should be an understanding that if the issue isn’t happening right now, then it doesn’t need to be worked right now. Too often a pager is used to extend an employee’s work day! If you get to the point where you are looking at code, you should probably agree to throw up a 500 page and sick the whole team on the problem in the morning. Furthermore, if a user calls in the middle of the night to report an issue that they failed to report hours ago, chances are it can wait till daylight. It’s important to protect the on call phone, save it for when the meadow is truly on fire!

Use a bug tracking system! This is so obvious it amazes me how many companies do not have this. When events occur, update the bug report with an occurrence. It’s very important that you keep record of the event… When you suggest to management they give you time to fix an issue, having a record of the hours spent doing crisis diagnosis is an invaluable resource!

Keep an application inventory. Have your developers create a wiki entry for every application, that has the following information:

What configuration files are need to run
Which databases are connected to
What tables are read
What EIS systems are interacted with
What names of queues/topics are used
What usernames are used to interact with above systems
Where the logs are located
Known problems!

Do not underestimate how vital this information is! In your SDLC, you should have a step where the above information is updated!

Ok, I’m on call. The phone is ringing. What do I do?

Your first step should always be to get an accurate description of the symptoms. You then must narrow the problem scope to either being able: to reproduce it on demand, or predict when it will occur next. For instance, if you have a B2B webservice and the consumer is complaining your sending SOAP faults, your first move should be to have them dump the XML request that is causing the issue. If an end user is complaining about an error message when they click a link, you should have them copy the link and take screenshots. I call this “narrowing the problem.” This is the most important step. It is not optional, you should not proceed to work on a system unless you know what the problem is! You could easily worsen the situation by making arbitrary changes, and the likelihood of doing something random to solve an unknown issue is abysmal.

What if the end user causing the problem is not available, or the person on the other end is unwilling to provide the information you’re requesting? I think the best tactic at this point to ask them how they would like you to proceed. Let them finish completely without interrupting. Perhaps they were onto something, or had an alternative. Or perhaps they’re in left field, and if it leans towards left field, explain to them that you cannot solve a problem you don’t understand and can’t recreate. Don’t be afraid to stick up for yourself, it takes information to solve problems!

Ok, lets get crackin. It’s late and I want to go back to sleep

After you’ve narrowed the problem to a very very specific story or use case, it’s time to get working. First, copy the logs of the production server onto your local computer. If the logs roll while you’re troubleshooting, you may never find root cause. Next, copy any data related to the problem out of the database and store in text format with the log files. This may mean taking a snapshot of configuration tables of an application, the user’s profile rows, and maybe the data they’re trying to access. If you think it’s related, copy it! If you think it’s unrelated, copy it! Finally, note the version of the application, the bugfix level of the server, and the version number of the libraries available on the server. Whew!

The next thing is to get a pencil and paper. You should write down ever step you took and the time, and more importantly why you did it. This must be done as you perform the steps, but it doesn’t have to be verbose. For example:

“3:12am Noticed the logfile contained many a transaction timeout, meaning the connection pool may be locked. Restarting the application server”

There are three parts to this entry, the first is the observation, the second is the analysis, and the third is the action to correct.

So how do you find the problem?

It’s time to use that application inventory that you did way back in the preventative measures section. Lets say the log is reporting that the user cannot be found in the database. Since you’ve already narrowed the problem, simply refer to your inventory and connect as the application, then proceed to locate the user. If you can’t find them, then problem solved (for now)! Your inventory will tell you exactly what table and database to retrieve this information from.

But what if you’re not finding anything? Well… you need to poke around some more. Filter through your logs files and database entries. Do you see anything that looks out of place? Since you’ve already narrowed the issue down and reproduce it, can you try reproducing it using a different user? How about if you tail the log in realtime while the end user create the problem again? Can you enable debug logging in the frameworks (Spring, Hibernate, etc)? Can you reproduce the problem in a non-production environment? How about remote enabling remote debugging and stepping through as the system executes? How about cleaning up temporary files or deleting JSP compilation directories? Has the system ran out of disk space? Finally, does a good old fashioned restart fix the problem?

The hangover

So you were on the phone for an hour or two last night, what do you do the next day besides show up late? Well before you leave your house, take that pencil/paper with you. Open your bug tracking system, or send an email to the owners and include the original complaint, the logs, the database entries, and what finally fixed the problem. Transcribe your quick notes into the ticket. It’s best if this is done the next day, before your forget how you fixed the problem and why you made the decisions you did.

If you followed every step in this entry, you should have no problem giving management a thorough report of what was wrong and how it was fixed. This information is critical to mend over any rough spots with your customers. It can also be used to CYA if any flak comes your way! Finally, I hope this will make you a problem solving rockstar, ready to write better, simpler code (with tons of debug statements using SLF4J).

Reference: How to solve production problems from our JCG partner Jonathan at the The Code Mechanic blog.

Related Articles :