4 - Flume and Fire Fighting
[ Note: This is a fairly long post. It also leans more on the technical side but I have tried to keep it as simple as possible. I felt the need to go into this much detail because the experience described below has affected me tremendously in the way I approach work and design systems. Of the many projects I have done at Funzio this one is the most memorable and the most hard hitting. ]
Around the time I joined, Funzio had decided to adopt Vertica as the analytics backend system. It is a highly scalable data warehousing system that supports SQL. It required minimal administration and support compared to other systems like Hadoop. I did not know anything about it. Ram gave me the 700 page manual and told me to figure out how to hook our game into it. Some other people had already started working on it before I was put onto this task.
We had decided to use Flume, an open source tool which streams logs from one machine to another. It had a plugin architecture where we can hook our code into and write to whatever type of destination we wanted. It was written in Java and Vertica had JDBC connector which we used to build the plugin.
The game servers log the analytics data locally to a file. Flume had the ability to watch a file and track what changes were made to it. So as we logged more data into the file, a flume agent process would grab the changes and send it a flume collector process over the internal network. The collector would collect the additional log statements from multiple flume agents and merge them. The collector also processed the data via our plugin and then dump the transformed data into Vertica. Flume also had a flume master process which maintained the collectors and agents and dynamically rerouted traffic as and when needed. This is called an ETL (Extract, Transform and Load) pipeline in analytics and is standard way of processing data. Reiterating, Flume had 3 parts to it -
Flume has a feature called end to end reliability streaming. What this meant is that it kept track of each and every record that was generated from the agents and saw that it was loaded into Vertica. If the data was lost somewhere midway it would resend the data from the agent. Since analytics data is critical we had it enabled. In a few days we had the whole setup up and running ready to take traffic.
The process of releasing our games is in two phases. We do a “soft launch” first where we release our new game in 1 or 2 countries with very little marketing. We get a few thousand users and see how the game performs. We closely monitor the in game behaviour of the players, find the bugs and problems in the game mechanics, tweak the game and iterate. Once we are satisfied with it’s performance, we do a worldwide launch and go for a full marketing push.
During the soft launch phase Flume and Vertica held up without any hiccups, it was only few tens of thousands of records per day. We added more tables to track more stuff and the system accepted the new data without any problems. Then on launch day this happened…
The game was prepared for this traffic, we had 35 servers ready to handle it. We had also put more collectors (4 of them) in anticipation of the volume of analytics data increase. I thought that it was enough and that we can always add more collectors to handle more traffic. It turned out (as usual) I was wrong.
The first sign of problem was subtle. As the day went by, we saw that the numbers were slightly off (less than 0.5%). I thought that it was due to the lag from the time record was generated to the time it was actually put into Vertica, and it always used to catch up. But, during the second day the numbers were going off by a lot and we saw that the data was streaming in much more slowly (lag of a few hours). This pointed to serious problem in the pipeline.
We added more collectors and it seemed to catch up. But another problem surfaced. The numbers seemed off where everything was more in number, like the number of players logging in was significantly more than they were supposed to be compared with the data we had from other sources. W-T-F was going on? Ram had the foresight of telling me to generate a unique identifier (uuid) for every record that was created. Delving more into the data we found out that there were tons of duplicate records in Vertica. Okay, I thought, since Vertica was a SQL database, we can put a primary key constraint on uuid and it won’t accept duplicate records. I cleaned out the records, which took a mind numbing 5 hours of manual mechanic labour, added the constraint on the field and got on with other tasks. Fairly straightforward right?
Wrong. Although the documentation said that primary key was used as a keyword, we missed the fine print that said that although the primary key was supported… it wasn’t enforced. All Vertica could do is that it can detect that the constraint was violated and tell that to us AFTER duplicates were inserted. I finally uncovered this fact a couple of days later. Back to cleaning up the duplicates for another day. We had to keep the database clean since it was used extensively for analyzing our game. We cannot afford to have dirty data since Funzio is a very data driven company and all our decisions were based on that data.
While all this was going on, the pipeline was slowing down due to increased amount of data we were tracking along with the consistent high traffic. Few of the agent nodes had error-ed out and we had to restart them to bring them back up. This issue was chewing out more and more of my time which could’ve been spent on supporting the game itself. A few days after the launch, it was 9 PM and I had just reached home. Alex, our sole business analyst at that time, noticed that the data pipeline had slowed to a crawl, it was backed up for around 8 hours. Ram looked at Flume’s monitoring page and saw that the agent processes were going into an error state with alarming frequency. This issue had become unmanageable and we had to tackle it ASAP. Time to go nuclear.
I was back at the office with Ram and Patrick (our analytics intern) at 10 30 PM. We shut down the flume pipeline entirely so as to not excarbate the issue and went to the drawing board. 30 minutes later we came up with a short term plan to deal with it. We split the traffic into two streams, one was high value but low volume data that was streamed in real time and the other was high volume but the content was not necessary to be available in realtime. We setup a batch processing script that would copy it into Vertica once a day. We had the new system live on production in the next few hours and breathed a sigh of relief as we saw traffic was flowing smoothly and was stable. We also understood the internals of flume enough to setup a monitoring system which alerted us the moment it saw signs of flume getting backed up because of traffic.
I have not yet talked about why the duplicates were generated in the first place. We knew that we were logging a record only once, but it was getting inserted more than once. Why? The culprit was Flume again. Remember that we had enforced end to end reliability to make sure each record was getting through. When too much traffic went through Flume, the records were not getting inserted at a fast enough rate from the collectors into Vertica. The records would go through eventually, but not before the agent would time out and thought that the record was lost… and it sent that record again to the collectors. This caused more traffic to be generated which led to more timeouts and the agents went into a death spiral finally ending in an error state and crashing the entire system. This behaviour was not documented anywhere but was a manifestation of how Flume’s protocol worked. It took me more than a week to figure this out.
//FIREFIGHTING A.K.A YOU AREN’T OUT OF WOODS YET
Flume used to get backed up for short periods of time randomly, or so it seemed. We started to notice that whenever it got backed up, Vertica would have some sort of CPU or I/O load issues. The problem here was clearly our plugin. It could not copy over data to Vertica at a fast enough rate and was very sensitive to Vertica’s load. We coded up a new plugin which was more efficient in the copy process and tested it in our development environment.
But we could not QA it. The reason for this was the way our AWS (Amazon Web Services) network was setup. All of our servers were on the same private network, it kept all our deployment processes and service communication simple and easily maintainable. Also, flume would dynamically reroute traffic to different collectors depending on its load. This was a very useful feature. But this meant that any flume process in that network would register automatically to the production network and participate in streaming data. We quickly shut down our new QA network as soon as we saw production data go into our QA vertica server and vice versa. Configuring a whole new network and keeping it semi isolated from the other servers on the network was too much of a pain and we did not have the time or resources to do so. There were 5 server engineers plus Ram supporting two games (Crime City FB and iOS) played by millions of users. I was the sole guy maintaining the analytics system and I was also simultaneously managing other projects.
There were several other issues which caused Flume to back up once in a while. I had to go clean up the mess manually afterwards and find out what had caused the problem. All through this, Ram had given me the following advice which I shall follow religiously forever.
"Never ever get into a phase where you are constantly fighting fires. It sucks up your time and you cannot ever build more tools on top of that system. Always strive to minimize the time you spend fighting fires."
Easy to read, sounds obvious, but a very subtle principle. Its really easy to not notice that what you are doing is firefighting but think that you are “learning the tools” and trying to design around it. You can easily get stuck at that level and never move forward to build more advanced tools/systems on top and it deviates you from the overall vision of your plan.
We eventually decided to get rid of flume altogether. Transitioning away from it in a seamless manner without interrupting the production systems presented with another set of challenges which I may detail in another post.
Its also worth mentioning that there was nothing wrong with Flume per se. Although it was still a project in its infancy, it was version 0.6 or something close to it, its just the way we used it to attack the problem that was wrong. Also, I did not have complete understanding of both Flume and Vertica, which caused the above mentioned issues.
The overall lessons/principles I learnt:
- Research your tools. It may not be possible to know everything about them at once but always strive to learn the internal workings of the tools you use.
- Have a kick ass monitoring system to be your eyes and ears at all times. Also, log everything to keep track of how your stuff is working.
- Make sure that your design is flexible enough to be changed at a moment’s notice. You will throw your old work out of the window multiple times over the lifetime of any project. Learn to be at peace with that. The project will inevitably look completely different from whatever you initially thought of.
- Minimize the time you spend on “firefighting”. Again, minimize the time you spend on “firefighting”. Once more just in case it did not sink in, minimize the time you spend on “firefighting”.