High Scalability - Log Everything All the Time
Log Everything All the Time
Thursday, August 30, 2007 at 9:01AM

This JoelOnSoftware thread asks the age-old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done.
To see why the typical logging approach is broken, imagine this scenario: your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging death knell: it's an intermittent problem and customers are pissed. Fix it. Now.
So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few network switches and routers along the way. Where did the fault happen? What went wrong?
All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested, and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all.
What you need to be able to do is trace through all relevant logs, pull together a timeline of all relevant operations, and see what happened. And this is where trace/info etc. is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did.
So there are really no logging levels. You need to log everything that will help you diagnose any future problem. What you really need is a time machine, but you don't have one. But if you log enough state you can mimic a time machine. This is what will allow you to follow a request from start to finish and see if what you expect to be happening is actually happening. Did an interface drop a packet? Did a reply time out? Is a mutex on perma-lock? So many things can go wrong.
Over time systems usually evolve to the point of logging everything. They start with little or no logging. Then, problem by problem, they add more and more logging. But the problem is the logging isn't systematic or well thought out, which leads to poor coverage and poor performance.
Logs are where you find anomalies. An anomaly is something unexpected, like operations happening that you didn't expect, in a different order than expected, or taking longer than expected. Anomalies have always driven science forward. Finding and fixing them will help make your system better too. They expose flaws you might not otherwise see. They broaden your understanding of how your system really responds to the world.
So step back and take a look at what you need to debug problems in the field. Don't be afraid to add what you need to see how your system actually works.
For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request, so all work for a request can be tied together. This will allow you to trace the comment add from the client all the way through the system. Usually when looking at log data you have no idea what work maps to which request. Once you know that, debugging becomes a lot easier.
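As a rough sketch of the idea (the helper names and `request_id` field here are illustrative, not from the original post), each entry point mints one globally unique ID and every log line the request touches carries it:

```python
import logging
import uuid

# Illustrative sketch: mint one globally unique ID per incoming request
# and stamp it on every log line that request touches.
logging.basicConfig(format="%(asctime)s %(request_id)s %(message)s")
log = logging.getLogger("app")
log.setLevel(logging.INFO)

def new_request_id() -> str:
    # A UUID4 is unique across servers with no coordination needed.
    return uuid.uuid4().hex

def add_comment(request_id: str, user: str, text: str) -> bool:
    ctx = {"request_id": request_id}
    # Log the parameters in, not just that the function was called...
    log.info("add_comment user=%s len=%d", user, len(text), extra=ctx)
    ok = True  # ... real work here: spam filter, database write, cache update ...
    # ...and log the result out, so the whole path can be reconstructed later.
    log.info("add_comment result=%s", ok, extra=ctx)
    return ok

rid = new_request_id()
add_comment(rid, "alice", "first post")
```

Grepping the merged logs of every server for one ID then yields the complete story of that one request, in order.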
Every hop a request takes should log meta-information about how long the request took to process, how big the request was, and what the status of the request was. This will help you pinpoint latency issues and any outliers that happen with big messages. If you do this correctly, you can simulate the running of the system completely from this log data.
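One way to sketch this (the wrapper and record fields below are assumptions, not from the original post) is to wrap each hop's handler so every call emits one structured record with duration, size, and status:

```python
import json
import time
from typing import Any, Callable

# Illustrative sketch: wrap each hop's handler so every call emits one
# machine-parseable record with duration, request size, and status.
def logged_hop(name: str, handler: Callable[[bytes], Any]) -> Callable[[bytes], Any]:
    def wrapper(payload: bytes) -> Any:
        start = time.monotonic()
        status = "ok"
        try:
            return handler(payload)
        except Exception:
            status = "error"
            raise
        finally:
            record = {
                "hop": name,
                "bytes": len(payload),
                "ms": round((time.monotonic() - start) * 1000, 3),
                "status": status,
            }
            print(json.dumps(record))  # one structured line per hop
    return wrapper

# Example hop: a trivial stand-in for a spam filter.
spam_filter = logged_hop("spam_filter", lambda payload: b"clean" in payload)
spam_filter(b"clean message body")
```

Because every hop logs the same fields, latency outliers and size correlations fall out of a simple scan of the records, and the request's whole path can be replayed offline.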
I am not being completely honest when I say there are no debugging levels. There are two levels: system and developer. System is logging everything you need to log to debug the system. It is never turned off. There is no need. System logging is always on.
Developers can add more detailed log levels for their code that can be turned on and off on a module-by-module basis. For example, if you have a routing algorithm, you may only want to see the detailed logging for that on occasion. The trick is there are no generic info-type debug levels. You create a named module in your software with a debug level for tracing the routing algorithm. You can turn that on when you want, and only that feature is impacted.
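A minimal sketch of per-module developer logging, using Python's named loggers (the module names here are made up for illustration):

```python
import logging

# Illustrative sketch: instead of one generic "debug" level, each feature
# gets a named logger that can be toggled independently at runtime.
logging.basicConfig(format="%(name)s %(message)s")
routing_log = logging.getLogger("routing")    # per-feature, not generic
comments_log = logging.getLogger("comments")

def set_module_debug(module: str, enabled: bool) -> None:
    # Turning on detail for one module leaves every other module untouched.
    logging.getLogger(module).setLevel(logging.DEBUG if enabled else logging.INFO)

set_module_debug("routing", True)    # chase a routing bug...
routing_log.debug("considering next-hop candidates=%d", 3)
set_module_debug("routing", False)   # ...then go quiet again
```

The point of the design is isolation: cranking up the routing module's detail costs nothing in the comments path.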
I usually have a configuration file with initial debug levels. But then I make each process have a command port hosting a simple embedded web server and telnet processor, so you can change debug levels and other settings on the fly through the web or telnet interface. This is pretty handy in the field and during development.
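The command-port idea can be sketched as a tiny embedded HTTP endpoint that flips a module's debug level while the process keeps running (the URL scheme and names here are assumptions for illustration, not the author's actual interface):

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread

# Illustrative sketch of a per-process "command port": an embedded HTTP
# endpoint that changes a module's debug level in a live process.
class CommandHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /debug/routing/on   or   GET /debug/routing/off
        _, _, module, state = self.path.split("/")
        level = logging.DEBUG if state == "on" else logging.INFO
        logging.getLogger(module).setLevel(level)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"{module} -> {logging.getLevelName(level)}\n".encode())

    def log_message(self, *args):
        pass  # keep the command port itself quiet

def start_command_port(port: int) -> HTTPServer:
    # Runs in a daemon thread so it never blocks the process's real work.
    server = HTTPServer(("127.0.0.1", port), CommandHandler)
    Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In the field you would then hit `http://host:port/debug/routing/on` with a browser or curl to get detailed routing logs from a misbehaving process, with no restart and no redeploy.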
I can hear many of you saying this is too inefficient. We could never log all that data! That's crazy! Not true. I've worked on very sensitive, high performance, real-time embedded systems where every nanosecond was dear, and they still had very high levels of logging, even in driver land. It's in how you do it.
You would be right if you logged everything within the same thread directly to disk. Then you are toast. It won't ever work. So don't do that.
There are lots of tricks you can use to make logging fast enough that you can do it all the time.
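One such trick follows directly from the warning above: never write to disk from the hot thread. The caller does a cheap in-memory enqueue and a background thread owns the slow I/O. A minimal sketch, with illustrative names (the original post does not give an implementation):

```python
import queue
import threading

# Sketch of asynchronous logging: the hot path only does a queue put;
# a single background thread does the slow disk writes.
class AsyncLog:
    def __init__(self, path: str, maxsize: int = 65536):
        self._q = queue.Queue(maxsize)
        self._path = path
        self._t = threading.Thread(target=self._drain, daemon=True)
        self._t.start()

    def log(self, line: str) -> None:
        try:
            self._q.put_nowait(line)  # never block the caller
        except queue.Full:
            pass                      # shed log load rather than stall the system

    def _drain(self) -> None:
        with open(self._path, "a") as f:
            while True:
                line = self._q.get()
                if line is None:      # shutdown sentinel
                    break
                f.write(line + "\n")
                f.flush()

    def close(self) -> None:
        self._q.put(None)
        self._t.join()
```

The design choice worth noting is the drop-on-full policy: when logging can't keep up, it is usually better to lose some log lines than to let logging back-pressure stall the system being logged.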
In large scale distributed systems, logging data is all you have to debug most problems.
So log everything all the time. You may still get that call at 2AM, but at least you'll have a fighting chance to fix any problems that do come up.