
Stay tuned...
Dispatches from within the MySQL Falcon storage engine team.


"This is Ron. How may I help you?"Twentysomething, confident in tone. A good sign.
"Hi Ron. This is Chris. I just configured a replacement router and need to have it enabled on your network."Hard stop.
"That is not a router. It is a mod-em. We call it a cable mod-em."Like I was five, or something.
"Ok, well, I use it to manage the network, that's why I call it a router."
"That is a cable modem, not a router."Strictly speaking, he was right: the Comcast Business Gateway (SMC 8014) modulates and demodulates the cable signal and is therefore a cable modem, albeit one with router-y capabilities.
"I have two Comcast devices. Next to the router is a Scientific Atlanta DPX2203. It's what we get our TV and phone with. What do you call that?"He should have. It was just a standard cable modem.
"I have no idea what that is."
He probably doesn't do residential customer support. Let it go.
Ron logged on to the router. He did something and it rebooted.
"Try it now."
No joy. My carefully configured network seized up like an acrophobic mountain goat.
"Ron, the LAN can't see the WAN, local DHCP is busted, the NAT and DMZ configuration pages are disabled on the router and my server can't see the network."
"What's your server's IP?"
He shouldn't have to know, but I gave it to him. The router rebooted.
"Ok, try it now."Better, but not right.
"The NAT and DMZ options are still disabled, and my local systems are assigned public IPs. Not good."I explained, patiently and at some length, that I wanted my local network behind the firewall and that my server had to bypass the firewall. I also didn't want my local systems using the block of static IPs assigned to my account. Basic stuff, but Ron simply didn't grasp the concept.
"I can't help you with that. Those features are disabled for your account."Something had gone terribly, terribly wrong in my world. Subarachnoid hemorrhage? Radon?
Shake it off, Powers.
"Now, wait a minute. The router was shipped unlocked, just like my old router. I configured it, just like my old router. When you remotely configured it, those features became disabled. I NEED THEM. Did the firmware revision change or something?"Clearly, I was being tested. Ron said this with complete conviction, and it threw me.
"Sometimes the firmware changes, but NAT is disabled. You do not need those features."
"Ron, I most certainly DO need those features to configure my network. I had a server in the DMZ and a LAN behind the firewall, and now you're saying that I can no longer do that? Has my account status changed?"Confusion yielded to irritation.
"No, but we don't support those features. I really can't help you with that."
"Look, this is unacceptable. You replaced my router--sorry, cable modem--but disabled the features that I was using. It's useless now. Please let me speak to your supervisor or someone else who can help."Ron stumbled, then recovered.
"I can transfer you, but those features are not enabled for your account. We don't support NAT."My incredulity finally inspired a calm sense of purpose.
I will appeal to Ron's higher mind. He can be reasoned with. He is stubborn because he is afraid. I know he is wrong and he suspects he is wrong but wants to appear confident like his fellow techs who really do know their stuff.I chose to de-escalate and reason with him instead.
"Ok, let's think this through. Assuming you are right, why is my laptop accessing the Comcast DHCP server? Are you saying I can't even turn on the router's DHCP?"A pause.
"Let me look."The router rebooted. Third time.
"Check it now."All lights were green. Homepages blossomed.
"There! Great! What did you change?"I thanked him. He said to call back if I needed anything else.
"I enabled NAT."
"The question is how something that broken could escape detection."In fact, MySQL 4.1, 5.0 and 5.1 is used extensively throughout development. We last discussed dogfooding Falcon at the London meeting in May, though nothing came of it.
"What we really need is an internal application that actually uses Falcon for something we care about. If we were actually using it, we would have noticed the breakage in about a minute."
"In an ideal world, we should have had a recovery test in PushBuild or in the Weekly Falcon Test starting sometime around Nixon's resignation."Manyi
"I wish this bug was caught in the course of the execution of some determined, comprehensive QA strategy for Falcon. Instead, it was caught because I happened to have a personal vendetta against Falcon recovery that drove me to create those tests."
1) We need better coverage on regression testing. Serious side-effects of a patch from September were not caught by regression testing until Philip introduced more recovery tests. We need to complete the remaining tasks on FalconQAIdeas and improve the regression testing we have today.Ann:
2) Falcon patches tend to be large and contain several bug fixes and worklogs, that adds complexity and makes it more difficult for the reviews. One way to solve this is having smaller patches and each patch corresponds to a bug report or a worklog.
"Neither worklogs nor small patches would have made this any easier to find. You have to understand the code, specifically that there can be several dirty page with the same page number and walk the hash duplicates chain to get all the pages written. I doubt that anybody but Jim would have caught that subtlety."Hakan had seen the recovery failure, but hadn't realized it:
"Actually it was caught by the weekly DBT2 testing with 100 warehouses, where we load the data and restart the server. There were couple of 100 warehouses runs missing in September, but I ignored it because of Falcon being unstable in general."Vlad:
"Todo: Add a hook into DBT2 to figure out why a 100 warehouse run was not started."
"I'm sorry to be annoying with my ideas of quality, but I really believe if we had good unit testing (and one for page cache of course) such regression would be caught there. SystemQA test is good to find regressions but it does not help localize the error..."Jim
"I'm the heretic about code reviews. An architectural discussion before the design is cast into code is a hundred times more useful than a code review afterward...a bad implementation of a good design is easier to fix than a good implementation of a bad design."And on it went.
"I think this entire episode was a gift. We identified the cause of a serious, complex problem and consequently identified holes in our process."I continued with my analysis, bordering on bureaucratic blather:
"Strictly speaking, our QA system worked, albeit very inefficiently, because the bug didn't make its way into a telco switch or a 911 call center or Lehman Brothers financial system or even 6.0.7 for that matter (ok, maybe that was luck), so let's improve the process and move on."
What Happened
There was a "perfect storm" of concurrent events:
- A complex, multi-part change to a core component (page cache), not fully reviewed
- Other core changes piled on top (online alter, memory manager, transaction manager)
- Disproportional reliance on Pushbuild I results
- SystemQA stress tests were changed, thus no pass/fail history (was recovery bug new or old?)
- DBT2 failures were detected, but not correlated
What To Do
(I promise that these ideas are perfect and that there will never be another problem if we adopt them.)
Stricter Code Reviews
- One commit-one bug (mea culpa on this one!)
- Stability is KING
- Jim is still the reigning Falcon expert. USE HIM. Nag, if necessary.
- Changes to core components (page cache, memory manager) require a higher barrier of entry:
  - Clear, reproducible proof of benefit
  - Clear evidence of no regression
  - Assign a reviewer to understand EVERY LINE of the fix
I also think we should suspend all performance changes until we get results from PAE.
Assign Experts
Assign responsibility for subsystem expertise. I've seen this work on other teams ("Who's the DMA guy? Ralph? Ok, great.") Each subsystem (TransactionManager, Cache) has an expert, a go-to person whose sign-off is required for code reviews.
Cross-component expertise is important but we need more depth.
"El Photo Grande" for Falcon QA
We need the equivalent of Homeland Security for Falcon QA, an aggregation of QA intelligence such that we can, perhaps in a dashboard or something, see the overall state of QA: Pushbuilds I&II, SystemQA, DBT2, Performance, etc.
Once we establish a comprehensive QA baseline, then subtle-but-catastrophic bugs will manifest as perturbations in the field.
Ann thoughtfully countered many of my suggestions:
"It's not possible to do a visual code review of something as complex as Kelly's flush mechanism."
"The problem is that we have to develop QA in parallel with the code."
"This is all hard stuff, particularly recovery since it's not easy to make it deterministic, and if you do make it deterministic, you miss cases."
"People are better at analyzing and understanding designs than code."
I disagree with Jim and Ann on the value of code reviews. Yes, senior engineers should spend more time reviewing design than source code, but my experience has been that a second or third pair of eyes will always catch stuff--always--and that changes to core components must be reviewed by senior engineers. Less critical changes require less scrutiny.

"Problem-solving is hunting; it is savage pleasure and we are born to it."
--Thomas Harris, Silence of the Lambs


Falcon page cache has a findBuffer() routine that looks at the BDB list starting at the oldest and moving toward the youngest. This code is single threaded and can include time consuming disk IO. This change will increase the concurrency by executing most of the algorithm without locks.
Falcon page cache currently uses a single lock on the entire hash table. This change will create a lock per hash bucket allowing for higher concurrency.
The general idea was to push locking further down into the cache by replacing a single exclusive lock at the top with separate locks for each slot of the hash table.
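To make the lock-per-bucket idea concrete, here is a minimal C++ sketch of a hash table with one lock per slot. It is not Falcon's code: `StripedPageTable`, its members, and the choice of `std::mutex` are assumptions made purely for illustration, and the real cache manages Bdb buffers rather than integers.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical illustration: a page table with one mutex per hash slot,
// instead of a single lock guarding the whole table.
class StripedPageTable {
public:
    explicit StripedPageTable(std::size_t bucketCount)
        : buckets_(bucketCount), locks_(bucketCount) {
        assert(bucketCount > 0);
    }

    // Insert or update a page; only the owning slot is locked.
    void put(int pageNumber, int value) {
        std::size_t slot = slotFor(pageNumber);
        std::lock_guard<std::mutex> guard(locks_[slot]);
        for (Entry& e : buckets_[slot]) {
            if (e.pageNumber == pageNumber) {
                e.value = value;
                return;
            }
        }
        buckets_[slot].push_back({pageNumber, value});
    }

    // Look up a page; returns false if it is not cached.
    bool get(int pageNumber, int& value) {
        std::size_t slot = slotFor(pageNumber);
        std::lock_guard<std::mutex> guard(locks_[slot]);
        for (const Entry& e : buckets_[slot]) {
            if (e.pageNumber == pageNumber) {
                value = e.value;
                return true;
            }
        }
        return false;
    }

private:
    struct Entry {
        int pageNumber;
        int value;
    };

    // Assumes non-negative page numbers.
    std::size_t slotFor(int pageNumber) const {
        return static_cast<std::size_t>(pageNumber) % buckets_.size();
    }

    std::vector<std::vector<Entry>> buckets_;
    std::vector<std::mutex> locks_;  // locks_[i] guards buckets_[i] only
};
```

Threads touching pages that hash to different slots no longer contend with each other. The price is that anything needing a consistent view across slots, a checkpoint flush for example, now has to reason about many locks instead of one.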
"The problem is simple and straight forward. Cache.cpp is totally broken. More specifically, the I/O thread (and architecture) is totally broken."The fix appeared to be simple enough--ten minutes by Kelly's estimation--but rather than spin more cycles on optimization, Kevin chose a reliable but somewhat heavy-handed course of action:
"When a checkpoint was required, a dirty page bitmap was produced. The bitmap, however, didn't contain the table space id, so ioThread had to loop through the collisions in the hash table to find the dirty buffer(s)."
"This only handles the first Bdb in a collision chain. In almost all circumstances, this will be the falcon_user tablespace. Consequently, dirty pages in falcon_master are never written."
"I'm sorry it has taken this long to discover that Philip's problems were page cache bugs masquerading as recovery bugs."
"I think we may need to reconsider some of our procedures."
"In light of this analysis, I think it is prudent to back out Kelly's cache changes. We need to focus on delivering the most stable Falcon engine that we can, most immediately, to the [performance] team, who are our #1 customer right now."The performance team was the Sun PAE group lead by Allan Packer. During the Dev meeting, we agreed to get a stable version of Falcon to them ASAP so they could begin their own in-depth performance analysis.
"I do not see any way to prevent 'totally broken' in the future. How should we should change our processes to detect an error before the push? It passed all the system tests."Indeed, the problem wasn't with the Falcon code, it was with the Falcon process.