Understand Your System Behavior By Implement Distributed Tracing Using Jaeger
I’ve been more than 3 years in programming and building many applications from simple website to an API that needs to scale. Whether it hobby project or work related job, since choosing title “back-end developer” as the first career, I’m aware that building a good back-end is not easy as what I’ve imagine. One challenges that comes into my work is how could I well-knowing about the system that I face every day?
Believe me, you will never know your Software well!
Have you think that you are deeply known of your software that you’ve created like you think that it’s the part of yours? Until it came into moment when your software scream at you at the night, waking you up and you didn’t know what it exactly doing. You even don’t know what it want.
To give you some clear picture, what I mean here is, you may think that all the lines of code that you write is well. But, you forget that the software itself is not just what you write, you may use reusable code created by other developer, or one of your function calls another service made by other department. When you test it in development or staging mode, you will find it works well. Yet, when it comes to production, the things may gone wrong and you lost your mind what caused that problem. Whether it your function slow down the request, slow database query, or the other service that you call is slow?
If you only call one service, you will easy to notice by checking to that service. How about if you call multiple service? Or, when it comes to monolith service, how do you know that the bottleneck is your database, not your logic or unreachable Redis cache that you forget to handle? This is why you need to trace every single process in your software. In terms of back-end service which handle the HTTP API request, you **MUST** trace it start from client request received in the server until it writes the response back to client.
Why do you need distributed tracing and what is it?
In terms of micro-service architecture, one service responsible to one business capability or other strategies that you can read here. This leads to the need of tracing multi-separated service which much harder than tracing monolith architecture. In monolith architecture the whole process life-cycle is on one project, but in micro-service architecture to handle one request it needs to call n other services.
You may thinks that it is enough to create separate log service that records all request log for each services. But, when your API hits by hundreds request per second, how do you know that one request log in service A corresponds to log in service B and so on?
… distributed tracing helps you, as a developer, to know deeply about your software behavior.
By this far, I should expect that you got the idea why distributed tracing is really needed. I copied the statement from here: “ Distributed tracing helps pinpoint where failures occur and what causes suboptimal performance.” From there, it clear stated that distributed tracing helps you, as a developer, to know deeply about your software behavior.
How tracing works?
To easy to follow I will made some analogous example. Supposed you are Dora the Explorer and you given a task to get to sand castle from your current position, tree house. To get there you must face some obstacle that takes time. Here’s your map to get to sand castle:
From above picture you can see that to reach sand castle from tree house, you must pass through the forest, crossing over the lake, enter the pyramid and voila you arrived to sand castle!
But, for every obstacle you met, it takes time to reach the final destination. You want to know how much time taken between the tree house and the sand castle. Therefore, you may record the current time before you depart and once after you arrived.
Now, I want you, as Dora, to report to me which obstacle that takes time you the most: how much time it takes, what causing you takes the long time and every detail that you may find it good to report. So, you may record the start time and end time of each obstacle, note some additional information when you find it useful to report, etc. Here’s the visual:
If you already got the idea, there you go! Now, you should already knowing about the tracing works. Each obstacle is like a function, then the way between them is a function calls, that may be ignored to consider since it must be fast.
Let’s coding!
Now, to be more techie, we can bring that Dora journey into code and trace what she do in every obstacle. First, we need to create 4 function:
startTheJourneyFromTreeHouse
where the Dora prepare some food for her travel’s stockpassTheForest
where Dora walk with Boot, her monkey, through the forest.crossTheLake
where the Dora cross the lake. We add some condition here, if the day is rainy, then the process will be slower because Dora doesn’t want to be drowned by her boat since the wind blows fast. This condition also to show you that the tracing can help you to pinpoint where the slowing process occurred.enterThePyramid
where Dora enter to the pyramid and surpass the dark room inside the pyramid.
As you can see, there is always context
parameter in first argument of each function. And from that context, we create a span using opentracing.StartSpanFromContext
. Each span represents function process and therefore must be initiated in the first of function process and finish using deferred function. Context
carry the information of tracing, so when you call, for example function passTheForest
, the context will brings the information from previous function call which is startTheJourneyFromTreeHouse
.
The function call will done like this:
serverSpan := eCtx.Get("serverSpan").(opentracing.Span)
ctx := opentracing.ContextWithSpan(
context.Background(),
serverSpan,
)
defer ctx.Done()...out, _ = startTheJourneyFromTreeHouse(ctx, in)
out, _ = passTheForest(ctx, out)
out, _ = crossTheLake(ctx, out, isRainyDay)
out, _ = enterThePyramid(ctx, out)
Variable serverSpan
is a value that already set in traceMiddleware
. It contains the parent span that wrap all function call span:
spanCtx, _ := opentracing.GlobalTracer().Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(req.Header),
)serverSpan := opentracing.StartSpan(
c.Request().URL.Path,
ext.RPCServerOption(spanCtx),
)
c.Set("serverSpan", serverSpan)
So, each time when you call function, it always carry the information from previous function call. And then you will get the stack trace (which visualized using span) that look like this:
Since in every span we record the info such as a message, we can see that by expanding them:
Now, you notice that the longest process is the function crossTheLake
with parameter is_rainy_day
true. To ensure that this is not an anomaly, you can try to look for another trace in same end point. Finally, using this clue, you can try to optimize that function without guessing like a seer, because you have the data here!
Final words
You already know that knowing the exact problem in back-end can be tricky and painful when the traffic is high. But, the tracing system can help you to minimize your headache to trace which process that causing bottleneck of the system.
You may want to implement it soon once you see this article, but you may wondering how to ensure that your current code meet the requirement to implement this tracing. Tracing works by pipe-lining information using context
and send it’s information using UDP to Jaeger. So, if your current Go code doesn’t have context, you must add it before you can trace your application.
Note:
- The working example of above snippet can be found here: https://github.com/yusufsyaifudin/go-opentracing-example