Sunday, December 07, 2008

Bluster, Blather and B.S.

Many have published their Very Important Opinions on the MySQL 5.1 GA. Now that the issue has been forced, I must take a stand.

Falcon is strictly a MySQL 6.0 project. I did not participate directly in the development of MySQL 5.1 and cannot render an informed technical opinion as to its quality except to say that I absolutely trust the judgment of MySQL management and the ability of my fellow engineers, and I fully stand behind their decision to release MySQL 5.1 GA.

In twenty-two years, I have worked everywhere from mom-and-pop software shops to companies like IBM, CSC, Harris and Boston Scientific, and everything in between. I have developed operating systems, telecommunication software, database microkernels, medical device firmware and, most importantly, applications for the wholesale distribution of beer.

Every single one of these products shipped with known bugs--serious bugs--and every single one of these products shipped with at least someone strongly questioning the decision to ship. Every single one.

But we did ship, and the operating system ran millions of PCs and the telephone switch connected billions of phone calls and the database stored petabytes of data and the heart device saved thousands of lives and the warehouse delivered millions of gallons of beer.

And the bugs got fixed and then we moved on. We moved on. Never once did I question our ability to do so, and never once did we fail to do so.


So, in my experience, "Disagree and Commit" isn't just a feel-good platitude, it is a proven, practical approach that works. It is the hallmark of a mature organization, and it is the core principle of smart, successful teams that actually deliver.

Yet, despite all of the bluster, blather, and b.s. (mine included), the issue is really quite simple: If you can't be a team player, find another team.

Thursday, October 30, 2008

Play Action Fake

I host a MySQL server for use by the Falcon team, and this requires business class broadband service.

Comcast Business Services support has been consistently stellar.
Network or line problems are addressed within the same day, and I always get straight through to a knowledgeable person when I call support.

Recently, the coax connector on the cable modem/router snapped loose. I called Business Services, and within two hours a replacement router sat on my desk. I configured the device and called to have them enable it on their network. I got right through.
"This is Ron. How may I help you?"
Twentysomething, confident in tone. A good sign.
"Hi Ron. This is Chris. I just configured a replacement router and need to have it enabled on your network."
Hard stop.
"That is not a router. It is a mod-em. We call it a cable mod-em."
Like I was five, or something.
"Ok, well, I use it to manage the network, that's why I call it a router."
"That is a cable modem, not a router."
Strictly speaking, he was right: the Comcast Business Gateway (SMC 8014) modulates and demodulates the cable signal and is therefore a cable modem, albeit one with router-y capabilities.

I called it a router anyway. Another question:
"I have two Comcast devices. Next to the router is a Scientific Atlanta DPX2203. It's what we get our TV and phone with. What do you call that?"

"I have no idea what that is."

He should have. It was just a standard cable modem.
He probably doesn't do residential customer support. Let it go.
Ron logged on to the router. He did something and it rebooted.

"Try it now."
No joy. My carefully configured network seized up like an acrophobic mountain goat.
"Ron, the LAN can't see the WAN, local DHCP is busted, the NAT and DMZ configuration pages are disabled on the router and my server can't see the network."

"What's your server's IP?"

He shouldn't have to know, but I gave it to him. The router rebooted.
"Ok, try it now."
Better, but not right.
"The NAT and DMZ options are still disabled, and my local systems are assigned public IPs. Not good."
I explained, patiently and at some length, that I wanted my local network behind the firewall and that my server had to bypass the firewall. I also didn't want my local systems using the block of static IPs assigned to my account. Basic stuff, but Ron simply didn't grasp the concept.

Then he said this, I swear:
"I can't help you with that. Those features are disabled for your account."
Something had gone terribly, terribly wrong in my world. Subarachnoid hemorrhage? Radon?
Shake it off, Powers.
"Now, wait a minute. The router was shipped unlocked, just like my old router. I configured it, just like my old router. When you remotely configured it, those features became disabled. I NEED THEM. Did the firmware revision change or something?"

"Sometimes the firmware changes, but NAT is disabled. You do not need those features."
Clearly, I was being tested. Ron said this with complete conviction, and it threw me.
"Ron, I most certainly DO need those features to configure my network. I had a server in the DMZ and a LAN behind the firewall, and now you're saying that I can no longer do that? Has my account status changed?"

"No, but we don't support those features. I really can't help you with that."
Confusion yielded to irritation.
"Look, this is unacceptable. You replaced my router--sorry, cable modem--but disabled the features that I was using. It's useless now. Please let me speak to your supervisor or someone else who can help."
Ron stumbled, then recovered.
"I can transfer you, but those features are not enabled for your account. We don't support NAT."
My incredulity finally inspired a calm sense of purpose.
I will appeal to Ron's higher mind. He can be reasoned with. He is stubborn because he is afraid. I know he is wrong and he suspects he is wrong but wants to appear confident like his fellow techs who really do know their stuff.
I chose to de-escalate and reason with him instead.
"Ok, let's think this through. Assuming you are right, why is my laptop accessing the Comcast DHCP server? Are you saying I can't even turn on the router's DHCP?"
A pause.
"Let me look."
The router rebooted. Third time.
"Check it now."
All lights were green. Homepages blossomed.
"There! Great! What did you change?"

"I enabled NAT."
I thanked him. He said to call back if I needed anything else.

Saturday, October 25, 2008

The Weekly Falcon Index

Planned Falcon Blog posts: 4
Actual posts: 4
Hours resolving network outage due to physically damaged router: 3
Hours from first service call that a new router arrived via courier: 2
Hours haggling with Comcast tech over router configuration: 1
Hours resolving concurrent but unrelated disk corruption: 2.5
Hours locked out of building where I was supposed to speak at MySQL Meetup: 0.5

Falcon IRC non-system messages: 2661
IRC champ: Vlad (26%)
Runner-up: Lars-Erik (19%)
Highest percentage of smiley faces per line of chat: 19%
Friendliest chatter: Olav
German/Norwegian message ratio: 1:1
References to reiserfs: 13
References to prison: 2
References to football (soccer): 182
Gratuitous references to Ramadan: 1
Best quote: "You obviously do not live in an apartment with a bunch of left-wing girls."
Source of quote: [redacted]

Thursday, October 23, 2008

Some Perspective on Recent Events, Part III

[Part I, Part II]

Perturbations in the Field


In the weeks leading up to the Dev meeting in Riga, the Falcon team dumped a record number of changes into the codebase, including a page cache optimization that contained a severe but
undetectable bug.

The usual indicators offered no sign of trouble: the Pushbuild matrix was green, the Falcon regression tests passed, System QA reported nothing unusual. It wasn't until Philip modified the System QA stress tests that a problem emerged.

The modification was simple: kill mysqld at the end of the test, then restart the server and wait for Falcon recovery to complete. This unsophisticated, almost whimsical improvisation should have amounted to little more than a savage afterthought in the test plan. However, not only did it expose a severe bug, it also revealed a serious gap in our test coverage.

What else had we missed? Jim surfaced an old argument:
"The question is how something that broken could escape detection."

"What we really need is an internal application that actually uses Falcon for something we care about. If we were actually using it, we would have noticed the breakage in about a minute."
In fact, MySQL 4.1, 5.0 and 5.1 are used extensively throughout development. We last discussed dogfooding Falcon at the London meeting in May, though nothing came of it.

A Personal Vendetta

The discussion continued.


Philip:
"In an ideal world, we should have had a recovery test in PushBuild or in the Weekly Falcon Test starting sometime around Nixon's resignation."

"I wish this bug was caught in the course of the execution of some determined, comprehensive QA strategy for Falcon. Instead, it was caught because I happened to have a personal vendetta against Falcon recovery that drove me to create those tests."
Manyi:
1) We need better coverage on regression testing. Serious side-effects of a patch from September were not caught by regression testing until Philip introduced more recovery tests. We need to complete the remaining tasks on FalconQAIdeas and improve the regression testing we have today.

2) Falcon patches tend to be large and contain several bug fixes and worklogs, which adds complexity and makes reviews more difficult. One way to solve this is to have smaller patches, with each patch corresponding to a bug report or a worklog.
Ann:
"Neither worklogs nor small patches would have made this any easier to find. You have to understand the code, specifically that there can be several dirty page with the same page number and walk the hash duplicates chain to get all the pages written. I doubt that anybody but Jim would have caught that subtlety."
Hakan had seen the recovery failure, but hadn't realized it:
"Actually it was caught by the weekly DBT2 testing with 100 warehouses, where we load the data and restart the server. There were couple of 100 warehouses runs missing in September, but I ignored it because of Falcon being unstable in general."

"Todo: Add a hook into DBT2 to figure out why a 100 warehouse run was not started."
Vlad:
"I'm sorry to be annoying with my ideas of quality, but I really believe if we had good unit testing (and one for page cache of course) such regression would be caught there. SystemQA test is good to find regressions but it does not help localize the error..."
Jim:
"I'm the heretic about code reviews. An architectural discussion before the design is cast into code is a hundred times more useful than a code review afterward...a bad implementation of a good design is easier to fix than a good implementation of a bad design."
And on it went.

El Photo Grande

My take was that this bug did us a favor:

"I think this entire episode was a gift. We identified the cause of a serious, complex problem and consequently identified holes in our process."

"Strictly speaking, our QA system worked, albeit very inefficiently, because the bug didn't make its way into a telco switch or a 911 call center or Lehman Brothers financial system or even 6.0.7 for that matter (ok, maybe that was luck), so let's improve the process and move on."
I continued with my analysis, bordering on bureaucratic blather:
What Happened
There was a "perfect storm" of concurrent events:
  1. A complex, multi-part change to a core component (page cache), not fully reviewed
  2. Other core changes piled on top (online alter, memory manager, transaction manager)
  3. Disproportionate reliance on Pushbuild I results
  4. SystemQA stress tests were changed, thus no pass/fail history (was recovery bug new or old?)
  5. DBT2 failures were detected, but not correlated
What To Do
(I promise that these ideas are perfect and that there will never be another problem if we adopt them.)

Stricter Code Reviews
  1. One commit-one bug (mea culpa on this one!)
  2. Stability is KING
  3. Jim is still the reigning Falcon expert. USE HIM. Nag, if necessary.
  4. Changes to core components (page cache, memory manager) require a higher barrier of entry:
  • Clear, reproducible proof of benefit
  • Clear evidence of no regression
  • Assign a reviewer to understand EVERY LINE of the fix
I also think we should suspend all performance changes until we get results from PAE.

Assign Experts
Assign responsibility for subsystem expertise. I've seen this work on other teams ("Who's the DMA guy? Ralph? Ok, great.") Each subsystem (TransactionManager, Cache) has an expert, a go-to person whose sign-off is required for code reviews.

Cross-component expertise is important but we need more depth.

"El Photo Grande" for Falcon QA
We need the equivalent of Homeland Security for Falcon QA, an aggregation of QA intelligence such that we can, perhaps in a dashboard or something, see the overall state of QA: Pushbuilds I&II, SystemQA, DBT2, Performance, etc.

Once we establish a comprehensive QA baseline, then subtle-but-catastrophic bugs will manifest as perturbations in the field.
Ann thoughtfully countered many of my suggestions:
"It's not possible to do a visual code review of something as complex as Kelly's flush mechanism."

"The problem is that we have to develop QA in parallel with the code."


"This is all hard stuff, particularly recovery since it's not easy to make it deterministic, and if you do make it deterministic, you miss cases."

"People are better at analyzing and understanding designs than code."
I disagree with Jim and Ann on the value of code reviews. Yes, senior engineers should spend more time reviewing design than source code, but my experience has been that a second or third pair of eyes will always catch stuff--always--and that changes to core components must be reviewed by senior engineers. Less critical changes require less scrutiny.

But code reviews always miss stuff, too. Always.

The Mansfield Anomaly

One case in particular comes to mind, a pacemaker firmware defect so subtle and so pernicious that it was named in honor of the engineer who introduced it.

The "Mansfield Anomaly" evaded in-depth peer reviews, design analysis tests, unit tests, s
ystem verification and validation, pre-clinical testing and even clinical trials. It never manifested as a device failure in the field, fortunately, and the bug was eventually found in-house and a firmware patch released.

I was introduced to the Mansfield Anomaly by an advisory-level engineer during an informal training session. He flashed a rather innocuous 'switch' statement on the screen and challenged the ten or so engineers in the room to find the bug.

No one did.

After entertaining a few guesses, he pointed out a very subtle cut-and-paste error that could result in a device failure, but only during a 4ms hardware timing window. Anyone could (and did) miss it, especially in the context of a more comprehensive review with hundreds of lines of code.
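
I obviously can't reproduce the actual code here, but the genre is familiar enough. Imagine something like this entirely hypothetical fragment, where a copied case body keeps the wrong identifier:

/* Hypothetical illustration only -- not the actual Mansfield code. The genre:
   a copied case body that keeps the wrong identifier, easy to skim past in a
   review spanning hundreds of lines. */
enum pace_mode { MODE_SENSE, MODE_PACE, MODE_INHIBIT };

static void set_timer(int ms) { (void)ms; /* program the hardware timer (stub) */ }

static void apply_mode(enum pace_mode mode, int atrial_ms, int ventricular_ms)
{
    switch (mode) {
    case MODE_SENSE:
        set_timer(atrial_ms);
        break;
    case MODE_PACE:
        set_timer(atrial_ms);      /* copied from MODE_SENSE; should use ventricular_ms */
        break;
    case MODE_INHIBIT:
        set_timer(ventricular_ms);
        break;
    }
}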


A Savage Pleasure

So, back to the original question: How can the Falcon team better ensure that subtle-but-devastating bugs are caught? Code reviews? Design reviews? Dogfooding? Unit tests? More system tests? The MySQL community?

All of the above, of course, but what can a team do mid-project? What is the most effective change to make now?


Here's what we've done so far:
  1. More eyes on the code--all pushes must be code-reviewed
  2. Smaller pushes into the codebase
  3. Public intra-team email: falcon@lists.mysql.com
  4. Falcon QA is moving ahead with new tests
  5. Run the new tests on previous releases, compare with current or pending release
  6. Focus on stability over performance--for now
  7. Save before and after file snapshots for recovery debugging
  8. Pushbuild2: a new array of servers dedicated to automated System QA stress tests
Item 8 is my personal favorite. Stress tests are irrational killers, but I find them to be a very satisfying challenge, which brings to mind one of my favorite quotes:
"Problem-solving is hunting; it is savage pleasure and we are born to it."

--Thomas Harris, Silence of the Lambs

Tuesday, October 21, 2008

Bursty

That's always been my M.O., I guess, but last week was pretty busy:

Flyway

I went down to the Flyway Film Festival in Pepin, Wisconsin over the weekend to screen "Cottonwood", a short film I directed over the summer as part of the 48 Hour Film Project.


I gave a brief Q&A about the making of the film, then did some schmoozing at the filmmaker's reception. I had a great chat with the director of "Redneck Zombies"--the guy was hilarious.

"Cottonwood" is five minutes long--six with credits--and was written, shot, edited and scored in exactly forty-eight hours. And although "Cottonwood"
looks like a film that was made in an absolute, mad panic (it was), we had a blast doing it.

Minnesota MySQL+PHP Meetup

Last Wednesday, I had the great pleasure of attending the local MySQL+PHP meetup.

The attendance was excellent, the attendees were interesting, and I had the opportunity to talk about Falcon, give a rundown of the Riga meeting, and dispel (I think) some of the rumors making the rounds in the MySQL community.

There were many questions about the Sun acquisition and its effect on MySQL. I was happy to emphasize how positive the acquisition has been, most recently evidenced by the addition of several developers to Falcon and a commitment to avail us of the PAE performance team's expertise.

I also met three local MySQL customers who are interested in giving Falcon a test run with their web applications. When we're closer to beta, I'll arrange to visit them and help with the configuration.

Chill/Thaw Madness

I spent the rest of the week and weekend completely immersed in debugging several chill/thaw and concurrent online alter bugs emanating from the System QA stress tests. Falcon must be stable before we can hand it over to the PAE team for testing--that has been my number 1 priority since Riga.

I have a pile of fixes to push--some of which I will do tonight--but after wrangling with several nasty thaw bugs, I've come to realize that any mechanism that complicated should probably be scrapped (more on this as it develops.)

Ok, now I'm getting agitated. Back to it.

Some Perspective on Recent Events, Part II

"Why Falcon Doesn't Work Anymore"

Recap: After the MySQL Dev meeting in Riga, Philip "Troublemaker" Stoev reported a serious problem with Falcon recovery. Ann suspected that the problem was not a true recovery bug, which Jim confirmed.


The problem wasn't that we found a bug. The problem was that we didn't.

Run-up to Riga


The Falcon team was really rockin' in the last half of August and early September. With the Dev meeting looming, all changes for the 6.0.7 alpha had to be pushed by month's end, otherwise we'd lose weeks of QA testing.


Lots of stuff was dumped into the codebase during that short time. The Falcon weekly report for August 22 records pushes to the Transaction Manager, Memory Manager, Page Cache and Online Alter, just to name a few.

The build team finally cloned mysql-6.0-falcon on September 4th, but the heavy volume of source changes continued, peaking on the 10th, one week before the conference:

22:57:47 <wlad> Today is the absolute world-record in number of pushes/hour in the Falcon tree
22:57:47 <wlad> I think we already had 8 and seeing Chris' determination, there are more to come:)

22:58:56 <cpowers> Doing my best...iterative process of refinement, you know...
22:59:19 <cpowers> Got to keep them pushbuild servers happy
22:59:27 * wlad applauds:)


Pushbuild is the MySQL automated QA system. Every code push into the source tree triggers a wave of regression tests across twelve uniquely configured servers. Test results are displayed in a simple matrix, where each cell represents the status of a single code change on a single server. If the cell is green, all tests passed on that server. Red indicates a test failure, build break or even a compiler warning.

A high number of concurrent pushes will cause Pushbuild tests to stack up, creating a multi-hour delay before results are available. Such was the case during the Great Week of Pushes as the Falcon team checked in change after change. The frantic pace also resulted in several preventable regressions that kicked the tree back into the red, necessitating repeat check-ins and further aggravating the Pushbuild backlog.

However, by the end of the week, Pushbuild returned to its refreshing, reassuringly green self, and we prepared for our journey to Latvia.

Worklog #4479

During the pre-Dev meeting melee, one item made it into the Falcon tree with relatively little fanfare: a two-part cache optimization provided by our performance expert, Kelly Long. From the associated worklog descriptions:

Falcon page cache has a findBuffer() routine that looks at the BDB list starting at the oldest and moving toward the youngest. This code is single threaded and can include time consuming disk IO. This change will increase the concurrency by executing most of the algorithm without locks.

Falcon page cache currently uses a single lock on the entire hash table. This change will create a lock per hash bucket allowing for higher concurrency.

The general idea was to push locking further down into the cache by replacing a single exclusive lock at the top with separate locks for each slot of the hash table.
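
For the curious, the shape of the idea looks roughly like this--a minimal sketch with invented names, not the actual Cache.cpp code:

// Illustrative only -- not the actual Falcon Cache.cpp; names and layout are
// invented. The point is the change in granularity: one lock per hash slot
// instead of a single exclusive lock over the whole table.
#include <cstddef>
#include <mutex>
#include <vector>

struct Bdb {                     // stand-in for a buffer descriptor block
    int  pageNumber;
    Bdb* collision;              // next entry hashing to the same slot
};

class BucketLockedCache {
    struct Bucket {
        std::mutex lock;         // per-slot lock; contention is limited to one chain
        Bdb*       chain = nullptr;
    };
    std::vector<Bucket> buckets;
public:
    explicit BucketLockedCache(std::size_t slots) : buckets(slots) {}

    Bdb* fetch(int pageNumber) {
        Bucket& bucket = buckets[pageNumber % buckets.size()];
        std::lock_guard<std::mutex> guard(bucket.lock);
        for (Bdb* bdb = bucket.chain; bdb; bdb = bdb->collision)
            if (bdb->pageNumber == pageNumber)
                return bdb;
        return nullptr;          // miss; a real cache would pick a victim buffer here
    }
};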


Jim rejected the notion when it was first proposed back in May, pointing out that the algorithm would effectively triple the number of exclusive locks, even for cache misses.

Kelly was convinced, however, that the change would allow for better parallelism, and that a loss in performance at lower concurrency would be more than compensated by a performance gain at higher concurrency. His own benchmarks appeared to support his contention.


The issue was revisited--contentiously--at the July meeting in Boston, but with no consensus. As the Dev meeting grew near, Kevin reconsidered the matter and eventually granted approval for the change, so Kelly's cache optimization finally made its way into the Falcon tree--along with twenty-five other changes in the same week.


Hard Landing


The Falcon team returned from Riga at the end of September, tanned, rested and ready for the drive towards Falcon beta. No sooner had we sat down at our dusty keyboards than Philip alerted us to new problems with Falcon recovery.


Recovery is a normal part of the engine initialization process that reestablishes the previous in-memory state and applies pending updates following an abrupt or abnormal shutdown. Recovery failures are serious and notoriously difficult to debug, so Philip's discovery of new failures stirred things up considerably.


Fortunately, Ann had a hunch as to what the problem might be. After some digging, Jim pinpointed the failure as a bug in the oft-debated cache optimization:

"The problem is simple and straight forward. Cache.cpp is totally broken. More specifically, the I/O thread (and architecture) is totally broken."

"When a checkpoint was required, a dirty page bitmap was produced. The bitmap, however, didn't contain the table space id, so ioThread had to loop through the collisions in the hash table to find the dirty buffer(s)."


"This only handles the first Bdb in a collision chain. In almost all circumstances, this will be the falcon_user tablespace. Consequently, dirty pages in falcon_master are never written."


"I'm sorry it has taken this long to discover that Philip's problems were page cache bugs masquerading as recovery bugs."


"I think we may need to reconsider some of our procedures."
The fix appeared to be simple enough--ten minutes by Kelly's estimation--but rather than spin more cycles on optimization, Kevin chose a reliable but somewhat heavy-handed course of action:
"In light of this analysis, I think it is prudent to back out Kelly's cache changes. We need to focus on delivering the most stable Falcon engine that we can, most immediately, to the [performance] team, who are our #1 customer right now."
The performance team was the Sun PAE group led by Allan Packer. During the Dev meeting, we agreed to get a stable version of Falcon to them ASAP so they could begin their own in-depth performance analysis.

But the issue at hand was much larger than a code regression--we'd certainly had those before. The issue was that a serious bug escaped detection.

Vlad got right to the heart of the matter:

"I do not see any way to prevent 'totally broken' in the future. How should we should change our processes to detect an error before the push? It passed all the system tests."
Indeed, the problem wasn't with the Falcon code, it was with the Falcon process.

(The third and final installment of the story continues tomorrow, "Perturbations in the Field")

Tuesday, October 07, 2008

Sort of Out of Sorts

I promised Giuseppe that I'd post Part II of "Some Perspective on Recent Events" today, but I got hit w/ some kind of bug.

Last month, authorities at Schiphol duly confiscated my Black Balsam, aka "The Latvian Hammer", so I've had to resort to bland, marginally effective over-the-counter remedies.

Let's see how tomorrow goes.

Monday, October 06, 2008

Prolificity

I woke up today at 4:45 am, which is a completely worthless time to be awake. I lay there for twenty minutes tracking post-REM phosphenes and thinking random thoughts. Here's one of them.

The Planet MySQL aggregator ranks blogs according to 'activity', or number of posts in the last month. I typically care little for such things because unnecessary attention attracts unwanted expectation. However, were one to care about such things, this method of ranking favors those who tend to post like this:

Sunday, October 5
I broke my pencil but that's ok because I have another pencil.

Monday, October 6
I like Drizzle.

Tuesday, October 7
I like, drizzle.

Wednesday, October 8
I believe I lost my shoes, Clyde. I think the dog got 'em.

Free idea: Rank blogs using a Prolificity factor that incorporates two subfactors, Currency and Abundance.

Currency is the number of posts in the last month, and is a fair measure of how current the blog is.

Abundance is the moving weighted average of word count per post, which is a measure of the substance of the average blog post.

A moving weighted average will favor more recent posts, so that 100,000-word post of your PhD thesis back in 2002--though a fine contribution to All Things--would have little relevance today.


Expressed formulaically,
Prolificity = Currency*Abundance
Let's say Currency wins in the event of a tie, so maybe this:
Prolificity = Currency*Abundance + Currency
For example,
Blog Theta: 10 posts, 2000 words, 200 words/post MWA last 10 posts

Currency = 10, Abundance = 200
Prolificity = 10*200 + 10 = 2010

Blog Zeta: 2 posts, 2000 words, 1000 words/post MWA last 10 posts

Currency = 2, Abundance = 1000
Prolificity = 2*1000 + 2 = 2002
Blog Theta ranks higher because it's more current, but Blog Zeta isn't entirely out of the game. Perhaps the formula could be tweaked, but I'll leave that as an exercise for the reader.
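
Or, to get that exercise started, a toy version of the formula (the inputs mean whatever you decide they mean):

// A toy sketch of the free idea above; nothing official about it.
#include <iostream>

double prolificity(double currency, double abundance) {
    // Currency = posts in the last month; Abundance = moving weighted average
    // of words per post. The trailing "+ currency" breaks ties toward recency.
    return currency * abundance + currency;
}

int main() {
    std::cout << "Blog Theta: " << prolificity(10, 200) << "\n";   // 2010
    std::cout << "Blog Zeta:  " << prolificity(2, 1000) << "\n";   // 2002
}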

Sunday, October 05, 2008

Some Perspective on Recent Events

[This has nothing to do with the financial crisis or the McCain/FeyPalin ticket. Sorry.]

Last Wednesday began innocently enough. Philip's email to the Falcon team, "Regarding Falcon Recovery", lamented the lack of progress in fixing recovery bugs. He detailed specific failures that he was seeing, many of them new, and pointed out that the number of recovery bugs was increasing. He closed with
"All of this means that the recovery problems must be tackled immediately and head-on. A database without functioning recovery can be at most alpha-quality software."
He was, of course, absolutely right. The lack of reliable database recovery is like flying with non-locking landing gear, so we took a hard look at the problem.

Then all hell broke loose.

Falcon Recovery

Falcon recovery is the equivalent of waking up after a night of partying and trying to figure out where you are, what you did and who you did it with (I have never experienced such a thing. Ever.)

When Falcon starts, it determines whether it was shut down cleanly. If not, it invokes recovery to apply the remaining active parts of the serial log, beginning with the last serial log entry, then backing up to its recovery start point.

This multi-phase procedure must work without fail, every time, even following an aborted recovery attempt (this is called double recovery.)

The recovery process applies changes in the serial log to page images in the page cache, until they are all done, then flushes the cache by forcing a checkpoint. Falcon does this in three passes through the serial log:

Phase I: Take Inventory, Establish State

- Determine transaction states
- Determine object states (track state transitions, record final state)

- Determine the last checkpointed record (prior objects guaranteed on disk)

Phase II: Physical Allocation
- Allocate and release required pages and sections
- Track object "incarnation" and state--update active objects with last incarnation

Phase III: Logical Application

- Apply data and index changes (avoid reallocating pages in use)

(For details, have a look at SerialLog::recover() in SerialLog.cpp.)
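
If you prefer the shape of it in code, here is a heavily simplified sketch of the three-pass structure. Everything below is an illustrative stand-in; the real thing lives in SerialLog::recover() and is far more involved:

// Heavily simplified sketch of the three-pass shape of recovery; all names
// here are illustrative stand-ins, not the real Falcon classes.
#include <vector>

struct SerialLogRecord;                        // one serial log entry (stand-in)

struct RecoverySketch {
    std::vector<SerialLogRecord*> records;     // active portion of the serial log

    void recover() {
        for (SerialLogRecord* record : records)   // Phase I: inventory --
            noteStates(record);                    // transaction/object states, last checkpoint

        for (SerialLogRecord* record : records)   // Phase II: physical allocation --
            applyPhysical(record);                 // pages/sections, final "incarnations"

        for (SerialLogRecord* record : records)   // Phase III: logical application --
            applyLogical(record);                  // data and index changes

        forceCheckpoint();                         // flush the rebuilt cache to disk
    }

    void noteStates(SerialLogRecord*)    {}
    void applyPhysical(SerialLogRecord*) {}
    void applyLogical(SerialLogRecord*)  {}
    void forceCheckpoint()               {}
};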


Recovery failures are notoriously difficult to debug. Sometimes you have to determine not only what did happen, but also what didn't happen, such as a missing page update. And quite apart from tracking various types of recovery objects, recovery must also track the state or "incarnation" of each recovery object. For instance, if a record was updated five times, we only care about the most recent version--the fifth incarnation.

(Free idea: "The Fifth Incarnation", an epic tale of lives lived, lives lost, loves lived and lost loves. At quality book stores near you.)

"Expected 7, got 2"

Philip modified the SystemQA stress tests to force a recovery and found that the engine crashed every time. Up to this point, Falcon recovery was regarded as generally reliable, though not flawless, but now we had a reproducible failure that always manifested the same way:
Fatal assertion in Cache::fetchPage(), "Error: page 103/0 wrong page type, expected 7 got 2".
Cache::fetchPage() lies within the very nucleus of the Falcon database engine. It is the cosmological core through which every essential database action must flow.


Getting a "wrong page type" assertion in Cache::fetchPage() is like asking your HAL 9000 to compute "4 + 4" and getting "green" for an answer. This is not the type of error you ever want to encounter in a database, or in Jupiter orbit for that matter, because it puts everything into question. Nothing can be trusted--data pages, index pages, orbital trajectories--nothing.

Given that the cache error appeared to occur only during recovery, and given the complexity of the recovery process, it was only natural to assume that we had somehow failed to properly record a page update within the serial log. (We later discovered that the error had indeed occurred elsewhere, but only occasionally. Hakan saw it during DBT2 testing, and I had also seen it testing online alter.)

"This used to work. Always"

Ninety minutes after Philip's email, Ann responded:
"This appears to be a recently introduced problem that (probably) has nothing to do with recovery. It appears that the PageInventoryPage is not being written often enough. I don't know why. As a result, pages that are actually in use are recorded as being free. This isn't a problem for a running Falcon because the page is up-to-date in the cache. After a crash, the out-of-date version is read from disk and pages are reallocated."
Falcon uses PageInventoryPages (PIPs) to track used and available pages, so an out-of-date PIP implies a very serious corruption.

Jim provided his assessment shortly thereafter:
"It doesn't work at all. There has been a huge regression. The simplest of all possible test -- start with a clean database, create a single table, kill the server, and restart -- fails first time, every time."

"This used to work. Always."

"We don't have a recovery bug, we have a non-functional code base."
And so it began.

(Part 2 of the story continues tomorrow, "Why Falcon Doesn't Work Anymore")

The Weekly Falcon Index

Planned Falcon Blog posts: 4
Actual posts: 1

Team emails debating the cost of changing core components: 44
Post-debate emails suggesting a major change to a core component: 1
Days advance notice to change my LDAP password: 9
Nag emails to change my LDAP password: 9
Days of delay before changing my LDAP password: 9
Hours unable to access email after changing my LDAP password: 6
Calls to support after changing my LDAP password: 1
Falcon IRC non-system messages: 2022
IRC champ: Hakan (23%)
Runner-up: Vlad (21%)
Messages in German: 35
References to "Ramadan": 10
Best quote: "I try never to argue against clarity"
Source of quote: Ann
Intrusions by senior staff: 1
Number of times this week I typed a password into IRC: 1
Number of times this year: 3

Monday, September 29, 2008

Falcon in London

[I'm still playing catch-up after a five-month blogging hiatus]

By May of 2008, the Falcon team's technical center of gravity lay squarely within the European continent, so it was only fair and sensible that we gather where the combined travel effort would be at a minimum.

This was to be the first assemblage of the entire team, including new team members. We were also promised (and later cruelly denied) a rare sighting of Hakan K., who was unable to attend the all-hands staff meeting in January or the UC in April, as well as appearances by MC Brown and James Day.

MC did an outstanding job as a one-man advance team, deftly applying his hitherto unknown logistical genius to the task of arranging for accommodations at a reasonably-priced, centrally-located, well-appointed and distinctly British hotel..."Holiday Inn" I think was the name.

Big News

We started the week in London on an interesting note: Jim announced that he was stepping down from the Falcon project in order to pursue his cloud database idea. This wasn't a complete shock, because Jim had been itching to explore his cloud database concept for some time, but it was still big news.

The timing of the announcement was pretty good, because Falcon was nearly feature complete and the team had gained enough expertise to carry on. We were also relieved to find that Ann Harrison (Jim's wife) intended to remain with Falcon, because she has a holographic knowledge of much of the architecture and would provide a reliable back channel of communication to Jim--reliable because they share the same office and she can, and does, holler questions to him over the partition.

We also had a full-time project manager and three new engineers, so Falcon was clearly in capable hands. The only remaining issue was whether Jim could arrange a deal with Sun to continue assisting the project (which has since been worked out, so not much has changed on a day-to-day basis.)

Falcon Immersion

A key item on the agenda was Falcon training for the new team members. Jim provided an excellent, in-depth review of the salient aspects of the Falcon design, reinforced by line-by-line walkthroughs of certain core operations.

I took notes which will ultimately make their way into Falcon design documents and even the code itself (the stringent use of source code comments within Falcon warrants further discussion, as does the overall coding style. I will address this in a future post.)

It was a long week of long days, and we covered a considerable amount of material (I will also cover some of these in future posts):

  • Record insert process, from StorageInterface to data page
  • Serial Log record formats
  • Page Cache design and areas for optimization
  • Internal SQL engine
  • Low memory operations and record cache maintenance
  • Falcon on-disk structure
  • Falcon indexes, design and operation
  • Page types: Data, Index, Inventory, Overflow, Blobs
  • Engine initialization sequence
  • Falcon handler and the MySQL Storage Interface API
  • Overview of the Falcon Storage classes (StorageConnection, StorageTable, etc.)
  • Synchronization objects
  • Replication: Statement-based vs Row-based
Statement-based Replication

So let's explore that last item a bit.

You know how it goes: Someone asks an innocent question in a room full of people, and before you know it, an hour of unintended discussion ensues. Such was the case with Falcon and replication.

The question was: Why doesn't Falcon support statement-based replication? The complete answer is complicated and worthy of a separate post, so I will only summarize here.

Falcon supports row-based replication (RBR), where every row event on the master lands in the binlog and is subsequently pulled in by the slaves. One advantage of RBR is also its greatest disadvantage: everything is logged, which is fine except for unqualified statements like "DELETE FROM T1", or for operations that are later rolled back.

On the other hand, statement-based replication does not handle non-deterministic or concurrent operations very well, and must therefore rely upon the assumption that two statements won't have any interaction. Ann puts this very succinctly:

"Falcon does not and will not support statement based logging because Falcon transactions are not serializable, and in the absence of serializable transactions, statement based logging does not produce consistent results."

As I understand it, InnoDB gets around the problem by locking records touched by subqueries, obviously at the expense of concurrency. For Falcon to support SBR, the same would have to hold true, requiring several expensive changes that would also affect performance:

  • Query expressions that drive data changes must be locking
  • Autoincrement must be serialized
  • Implement next-key and end-of-table locks
  • Other stuff

This is not the complete list of changes that we'd have to make to support SBR, nor does it take into account potential enhancements to MySQL replication, but captures the essence of the issue.

(My interest in replication stems from a two-year stint on the replication project at Pervasive Software; however, that was a peer-to-peer implementation employing a change-capture mechanism rather than binlogs--a fundamentally different approach.)

Falcon Schedule

[Disclaimer: I cannot disclose product release dates. Sorry.]

Release planning is a black art, subject to the hazards of miscalculation, fuzzy estimations and plain old bad luck. Our biggest question at the London meeting was how to coordinate the performance optimization effort with that of fixing stability and managing the bug load. Assigning a block of time to any one of these tasks, much less all three, is not only difficult but outright dangerous if not done carefully. But it had to be done, and so we did.

Assigning priorities is a classic problem in software development, but the textbook approach doesn't always apply. For example, performance optimization should be done late in the project, right? Ok, but what if performance is a major feature, or what if the performance, in some cases, is really bad? Isn't that a bug? And what if the performance fix requires a change to a core component, such as the synchronization model or caching algorithm? Wouldn't you want those sorts of changes in earlier rather than later?

I suppose with well-defined release criteria and well-defined performance metrics and well-defined defect criteria and well-defined quality metrics, the answers to such questions might be readily apparent, but alas, ours is not the tidy, frictionless universe of textbook theory. No, ours is a messy, non-Newtonian hodgepodge of money, marketing and mission creep, an existential riot fueled by personality, tempered by perception and begging for order in an unordered cosmos.

But I digress.

Actually, we really do have well-defined release criteria and even some pretty good performance criteria, although "good performance" has been something of a moving target for Falcon. The biggest unknown was the time required to bring the bug count under control.

[Disclaimer: I am wholly unqualified to comment upon MySQL 5.1]

After considerable and sometimes heated debate, we did manage to hammer out a shaky consensus on a GA date, saddled with preconditions and "only ifs" as it was. The Falcon schedule has since been revised, and last week in Riga it was thoroughly rehashed again within the context of the Server 6.0 release schedule.

And that's all I have to say about that.

Pilgrimage

The stark reality of the Falcon team meetings is that we spend 80% of the time sitting around a conference table screaming at each other.

No, that's not true. Let me try again.

The reality of our meetings is that we spend a long week of long days in a conference room digging into technical stuff. Despite the travel overhead, I find these meetings to be extremely productive and personally energizing.

In London, there were only two things I absolutely had to do: Visit the British Museum, and, well, see for yourself.

I won't bore you with museum photos, but I did manage to rope Kevin, Vlad and Vlad's wife, Tatyana, into making the pilgrimage with me to Apple Studios. Tatyana was kind enough to shoot the three of us crossing the zebra crossing. Unfortunately, some bearded dude jumped in front of us while we crossed.

He didn't say much. I think he was French.

Saturday, September 27, 2008

Troublemakers, Part I

Philip Stoev, Software Engineer

The mission of the System QA team is to beat the living daylights out of a MySQL release before it is set free into the world. Philip Stoev is on the System QA team, not the Falcon team, but bless his Bulgarian heart, he's caused more trouble on Falcon than anyone in recent memory except for perhaps Peter Z.

The "trouble" of which I speak is best illustrated by this vignette: Imagine that HMS Falcon is on a shakedown cruise in the Mediterranean. Weather is clear, crew is happy, course is plotted. Midshipman Stoev voices a concern.


"I beg your pardon, Captain, but I believe the ship has run aground."

"Nonsense, Mr. Stoev. The ship is quite sound. She floats upright, the sails are full and the crew is happy with rum and song.
"

"Yes, of course, Captain, but if you'll please have a look belowdecks, sir. The bilges are awash, I'm afraid, and the keel is broken."

"Nonsense, Mr. Stoev, you imagine the worst. Now, hand me my spyglass, if you please, and go on about your duties."

"As you wish, sir."

Get the picture?

Philip does his job so well that he can, with the power of a two-line email, stop an entire release dead in its tracks. Last year, this caused tremendous frustration because System QA didn't begin testing until after a release clone-off, which is perilously late in the release cycle. One or two showstopper bugs can, and did, gum up the works for weeks, even months.

Since then we've fixed the process, and now System QA tests are run regularly against pre-release code. There even exists an array of Pushbuild2 servers dedicated solely to System QA stress tests, so each push into the tree results in a battery of stress test executions.

"Executions", indeed. These unassuming, cold-hearted tests, cobbled together from Perl and PHP, efficiently render the very fabric of our precious little engine into hot, screaming shards of digital shrapnel.

Case in point: falcon_online_alter.
This little gem is designed to "exercise" Falcon's online alter capability in the same sense that having your shirt pulled over your head and being kicked down the stairs is "exercise". If I sound bitter, it's only because I've been chasing stress test bugs for the better part of two months.

Let's take an example:

ALTER TABLE t1 ADD INDEX i1 (s1), ADD INDEX i2 (s2);

In MySQL 6.0, "ONLINE" is implicit. Prior to the ONLINE ALTER implementation in Falcon, ADD INDEX worked like this:

1. Get shared lock on table T1

2. Create temporary table #sql-abc with new attribute (e.g. index)
3. Copy records from T1 to #sql-abc (index created during copy)
4. Get exclusive lock on T1
5. Rename T1 to #sql-xyz
6. Rename #sql-abc to T1
7. Drop #sql-xyz

This is a horrendously inefficient way to create an index, so Falcon now creates them online such that a new index is created and populated in place--no temporary tables, no copying, etc. (the same is true with ADD COLUMN, although it's temporarily disabled due to a bug.)


So, with online alter, ADD INDEX works like this:

1. Server queries Falcon to see if it can do the operation online
2. Falcon says, "Yeah, sure, go for it."
3. Server commands online alter
4. Falcon creates new index
5. Falcon populates new index

Falcon briefly locks the table during the DDL phase, but then normal operations resume, even while the index is populated. Pretty cool, right?
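
In rough pseudocode--invented names, not the real handler API or ha_falcon--the flow looks something like this:

// Loose sketch of the flow above with invented names; this is not the MySQL
// handler API or the real ha_falcon code.
#include <iostream>

struct EngineSketch {
    bool canAlterOnline(const char* operation) {     // 1-2. server asks, engine answers
        (void)operation;
        return true;                                  // "Yeah, sure, go for it."
    }
    void addIndexOnline() {                           // 3-5. server commands online alter
        lockTable();                                  // brief lock covering only the DDL phase
        createIndexMetadata();                        // new index created in place, no temp table
        unlockTable();                                // normal operations resume here...
        populateIndex();                              // ...while the index is filled in
    }
    void lockTable()           { std::cout << "lock table\n"; }
    void unlockTable()         { std::cout << "unlock table\n"; }
    void createIndexMetadata() { std::cout << "create index\n"; }
    void populateIndex()       { std::cout << "populate index\n"; }
};

int main() {
    EngineSketch falcon;
    if (falcon.canAlterOnline("ADD INDEX"))
        falcon.addIndexOnline();
}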

Well, sure, except when Philip runs his random query generator that issues thousands of random ALTER operations emanating from many client threads. This is a wholly unnatural act--I mean, what sane application performs thousands of ALTERs from multiple clients? It's nuts, just nuts.

But, and this is a big but, this test finds stuff. All sorts of stuff. Imagine putting a beautiful Ford 4.6L 3-valve SOHC on an engine test stand, bolting it down, then running it at 8,000 RPM until it flies apart. Automotive engineers actually do this, so why not software engineers? We need to know where the weak spots are, and, ideally, fix them. And so we do.

Chill/Thaw

Philip also has a chill/thaw stress test, falcon_chill_thaw. This, too, is pure, resident evil, and I mean this in the kindest way.

The Falcon "chill/thaw" mechanism works like this: Each record in the record cache consists of a "system" part and a "data" part. When the total amount of record data associated with an active transaction exceeds a predefined threshold, Falcon "chills" all of the records in the transaction by writing the data portion of each record to the serial log and freeing it from the record cache. When necessary, Falcon "thaws" individual records by restoring their data portions from the serial log.


By default, the falcon_record_chill_threshold is 5MB, which means that when the size of a transaction exceeds 5MB, Falcon moves the record data for the transaction into the serial log. For the falcon_chill_thaw stress tests, Philip cranks the record chill threshold down to 1 byte, which means every record for every transaction is chilled and thawed countless times.
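
A loose sketch of the chill side of the mechanism, with invented names rather than the real Falcon classes:

// Loose sketch only: when a transaction's record data exceeds the threshold,
// the data part of each record is written to the serial log and released
// from the record cache. Names are invented, not the real Falcon classes.
#include <cstddef>
#include <vector>

struct RecordSketch {
    std::vector<char> data;                    // the "data" part; the "system" part stays resident
    long long         serialLogOffset = -1;    // where the chilled data now lives

    std::size_t chill(long long offsetInLog) { // returns bytes released from the cache
        std::size_t released = data.size();
        serialLogOffset = offsetInLog;
        data.clear();
        data.shrink_to_fit();
        return released;
    }
};

struct TransactionSketch {
    std::vector<RecordSketch*> records;
    std::size_t                dataBytes = 0;

    void maybeChill(std::size_t chillThreshold) {   // e.g. falcon_record_chill_threshold, 5MB
        if (dataBytes < chillThreshold)
            return;
        for (RecordSketch* record : records)
            dataBytes -= record->chill(writeToSerialLog(record));
    }
    long long writeToSerialLog(RecordSketch*) { return 0; }   // stand-in
};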

Again, this is a wholly unnatural act, no doubt illegal in the state (not country) of Georgia, and, in fact, I believe we are going to disallow chill thresholds lower than 32K. However, this test has exposed flaws in the chill/thaw code path that ordinarily would never see the light of day.

I have in a logfile a beautiful example of three threads trying to thaw the same record. Falcon is designed to resolve this intrinsic race condition with a compare-and-swap of the data pointer in Record::thaw(). However, the code path leading to the CAS is not (yet) properly reentrant, so we get all sorts of interesting behavior as a consequence of multiple concurrent thaws.
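
The intended compare-and-swap looks roughly like this--again, a sketch with invented names, not the real Record::thaw():

// Sketch of the intended compare-and-swap, not the real Record::thaw().
// Several threads may race to thaw the same record; only one buffer wins,
// and the losers discard their copies.
#include <atomic>

struct ThawableRecord {
    std::atomic<char*> data{nullptr};              // null while the record is chilled

    char* thaw() {
        if (char* current = data.load())
            return current;                        // already thawed by someone else
        char* restored = readDataFromSerialLog();
        char* expected = nullptr;
        if (data.compare_exchange_strong(expected, restored))
            return restored;                       // we won the race
        delete[] restored;                         // another thread won; drop our copy
        return expected;                           // 'expected' now holds the winner's pointer
    }
    char* readDataFromSerialLog() { return new char[64](); }  // stand-in
};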

On a live system, this would likely manifest in seemingly random, untraceable errors.
So, it's better to find this stuff now, but it is necessarily time-consuming.

The simple fact remains, however, that Philip is responsible for exposing some serious synchronization issues within Falcon, and for that I am grateful.

I think.