There's a good chance you've heard something about a new review tool coming to Mozilla and how it will change everything. There's an even better chance you've stumbled across one of gps' blog posts on how we use mercurial at Mozilla.
With mozreview entering beta, I decided to throw out my old mq-based workflow and try all the latest and greatest tools. That means mercurial bookmarks, a unified mozilla-central, mozreview, and completely expunging mq from my workflow.
Making all these changes at the same time was a little daunting, but the end result seems to be a much easier and more efficient workflow. I'm writing down the steps I took in case it helps someone else interested in making the switch. Everything in this post is either repeating the mozreview documentation or one of gps' blog posts, but I figured it might help to have a step-by-step tutorial that puts all the pieces together, from someone who is also a mercurial noob.

Setup Mercurial
Before starting you need to do a bit of setup. You'll need the mercurial reviewboard and firefoxtree extensions and mercurial 3.0 or later. Luckily you can run:

$ mach mercurial-setup
Answering 'yes' to everything should get you what you need. Make sure you at least enable the rebase extension. In my case, mercurial 3.0 or later didn't exist in my package repositories (Fedora 20), so I had to download and install it manually.

MozReview
There is also some setup required to use the mozreview tool. Follow the instructions to get started.

Tagging the Baseline
Because we enabled the firefoxtree extension, anytime we pull a remote repo from hg.mozilla.org, a local tag will be created for us. So before proceeding further, make sure we have our baseline tagged:

$ hg pull https://hg.mozilla.org/mozilla-central
$ hg log -r central
Now we know where mozilla-central tip is. This is important because we'll be pulling mozilla-inbound on top later.

Create path Aliases
Edit: Apparently the firefoxtree extension provides built-in aliases so there's no need to do this step. The aliases follow the central, inbound, aurora convention.
Typing the url out each time is tiresome, so I recommend creating path aliases in your ~/.hgrc:

[paths]
m-c = https://hg.mozilla.org/mozilla-central
m-i = https://hg.mozilla.org/integration/mozilla-inbound
m-a = https://hg.mozilla.org/releases/mozilla-aurora
m-b = https://hg.mozilla.org/releases/mozilla-beta
m-r = https://hg.mozilla.org/releases/mozilla-release

Learning Bookmarks
It's a good idea to be at least somewhat familiar with bookmarks before starting. Reading this tutorial is a great primer on what to expect.

Start Working on a Bug
Now that we're all set up and we understand the basics of bookmarks, it's time to get started. Create a bookmark for the feature work you want to do:

$ hg bookmark my_feature
Make changes and commit as often as you want. Make sure at least one of the commits has the bug number associated with your work; this will be used by mozreview later:

... do some changes ...
$ hg commit -m "Bug 1234567 - Fix that thing that is broken"
... do more changes ...
$ hg commit -m "Only one commit message needs a bug number"
Maybe you want to pull central again and rebase your changes on top of it. No problem:

$ hg update central
$ hg pull central
$ hg rebase -b my_feature -d central

Pushing a Bookmark for Review
When you are ready for review, all you do is:

$ hg update my_feature
$ hg push review
Mercurial will automatically push the currently active bookmark to the review repository. This is equivalent (no need to update):

$ hg push -r my_feature review
At this point you should see some links being dumped to the console, one for each commit in your bookmark as well as a parent link to the overall review. Open this last link to see your review request. At this stage the review is unpublished; you'll need to add some reviewers and publish it before anyone else can see it. Instead of explaining how to do this, I highly recommend reading the mozreview instructions carefully. I would have saved myself a lot of time if I had just paid closer attention to them.
Once published, mozreview will automatically update the associated bug with appropriate information.

Fixing Review Comments
If all went well, someone has received your review request. If you need to make some follow up changes, it's super easy. Just activate the bookmark, make a new commit and re-push:

$ hg update my_feature
... fix review comments ...
$ hg commit -m "Address review comments"
$ hg push review
Mozreview will automatically detect which commits have been pushed to the review server and update the review accordingly. In the reviewboard UI it will be possible for reviewers to see both the interdiff and the full diff by moving a commit slider around.

Pushing to Inbound
Once you've received the r+, it's time to push to mozilla-inbound. Remember that firefoxtree makes local tags when you pull from a remote repo on hg.mozilla.org, so let's do that:

$ hg update central
$ hg pull inbound
$ hg log -r inbound
Next we rebase our bookmark on top of inbound. In this case I want to use the --collapse argument to fold the review changes into the original commit:

$ hg rebase -b my_feature -d inbound --collapse
A file will open in your default editor where you can modify the commit message to whatever you want. In this case I'll just delete everything except the original commit message and add "r=".
And now everything is ready! Verify you are pushing what you expect and push:

$ hg outgoing -r my_feature inbound
$ hg push -r my_feature inbound

Pushing to other Branches
The beauty of this system is that it is trivial to land patches on any tree you want. If I wanted to land my_feature on aurora:

$ hg pull aurora
$ hg rebase -b my_feature -d aurora
$ hg outgoing -r my_feature aurora
$ hg push -r my_feature aurora

Syncing work across Computers
You can use a remote clone of mozilla-central to sync bookmarks between computers. Instead of pushing with -r, push with -B. This will publish the bookmark on the remote server:

$ hg push -B my_feature <my remote mercurial server>
From another computer, you can pull the bookmark in the same way:

$ hg pull -B my_feature <my remote mercurial server>
WARNING: As of this writing, Mozilla's user repositories are publishing! This means that when you push a commit to one of them, it will be marked as public on your local clone, and you won't be able to push it to either the review server or mozilla-inbound. If this happens, you need to run:

$ hg phase -f --draft <rev>
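For reference, the standard Mercurial way to make a repository non-publishing is a phases setting in the server-side repo's hgrc. This is stock Mercurial configuration; whether you can actually apply it to a Mozilla user repository is a separate question:

```
[phases]
publish = false
```

With this in place, commits pushed to the repo stay in the draft phase instead of being marked public.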
This is enough of a pain that I'd recommend avoiding user repositories for this purpose unless you can figure out how to make them non-publishing.

Conclusion
I'll need to play around with things a little more, but so far everything has been working exactly as advertised. Kudos to everyone involved in making this workflow possible!
The LogView add-on, available now on AMO, solves some of these problems. It continuously records the logcat output and monitors it. When it sees an error in the logcat, the error is displayed as a toast for visibility.
You can also access the current logs through the new about:logs page.
The add-on only supports Jelly Bean (4.1) and above, and only Fennec logs are included rather than logs for all apps. Check out the source code or contribute on GitHub.
Feature suggestions are also welcome! I think the next version will have the ability to filter logs in about:logs. It will also allow you to copy logs to the clipboard and/or post logs as a pastebin link.
While October 24-26 marked the fifth official MozFest celebration, it was an exhilarating first for the newly formed Policy & Advocacy track. Before we wrap up the event, the Policy & Advocacy Wranglers want to share our thoughts and observations on the event.
This year, we broadened our focus from 2013’s Privacy track to involve the entire Policy & Advocacy community, celebrating the Web We Want and highlighting the global movement to protect the free and open web.
What We Planned
Our track featured more than 20 sessions spanning digital citizenship, kids' safety, net neutrality, privacy, security, and anti-surveillance. The advocacy sessions shared the secrets of successful campaigns, the tools of the trade, and how to use trouble to your advantage. One session invited people to conceptualize a new Internet Alert System. The track also featured talks about current events and issues, including the surveillance ecosystem, net neutrality, and Do Not Track. Those looking to use or gain technical skills had the opportunity to join four consecutive Hackathons — ranging from creating mesh networks to creating data visualizations — and a ‘Humane Cryptoparty’, which emphasized a human-centered approach to privacy tools and practical advice and guides for self-hosting email.
Another unique session was our Privacy Learning Lab. The Learning Lab was an experiment to attract those who might want to consume and learn about privacy in smaller, less intimidating chunks. Participants could join at any time, and move through each of five tables, covering topics as diverse as location privacy, the Clean Data Movement, metadata, using Webmaker tools, and an eye-catching privacy game called OffGrid. Several of our Learning Lab participants also shared their ideas during Sunday night’s closing party demos.
On the mainstage, we announced the Ford-Mozilla Open Web Fellows program, a new program recruiting tech leaders to work at nonprofit organizations that are protecting the open Web. The search is on for Fellows who will have opportunities ranging from the ACLU, where the fellow will work with the team that is defending Edward Snowden, to Amnesty International, where the fellow will be at the center of human rights and the Internet, to the Open Technology Institute, where the fellow will work with the organization’s M-Lab initiative and serve as a data scientist for the open Web movement. Applications for the 2015 Fellows are still open, and the deadline to apply is Wednesday, December 31, 2014.
Creating the Environment
At MozFest, the interactive feel leads with the physical environment. The Policy & Advocacy track was housed high on the 7th floor of Ravensbourne, a media campus in the heart of London. In designing the right environment for our community, we planned several interactive displays to entice people to climb those stairs and fill those elevators to come see what we were all about. Our entrance included a ‘superhero photo booth’ which celebrated that we are all heroes of the web. Throughout the festival, people dressed up in superhero costumes, took selfies, and tweeted them to their networks with #WebWeWant.
Continuing into our space, two thought-provoking walls invited interaction. At the colorful “Web We Want” ‘chalkboard’ (inspired by Candy Chang’s iconic work), anyone could grab a chalkboard pen to express their thoughts about the web – a big hit with participants and videographers alike. Colorful responses ranged from “built by people, fun and open!” to “decentralized”, “private”, “empowering”, “an explosion of creativity,” and so much more.
Another wall, based on a recent cross-cultural study on trust, invited people to write their personal definitions of transparency and privacy. On a central kiosk we just may have hosted the first-ever offline Reddit session (not intentionally, but when the Internet connection unexpectedly glitched, reddit quickly adapted with an innovative offline AMA). Using colorful post-it notes, participants expressed a set of principles and values important to the open Web.
What We Learned
As this was the first year for the Policy & Advocacy track, we were in prototyping mode. We were testing what works and what doesn’t, and optimizing on the fly. We learned so many lessons that we’ll chew on for next year, but we’d also like to share a few here.
We were incredibly inspired by what an AMAZING Policy & Advocacy community exists and the immeasurable value of face-to-face interaction to share ideas and solve problems together. For us Wranglers, the most difficult part of the planning process was having so many amazing proposals to choose from and not being able to include them all. Indeed, we may have created too many sessions, not giving people enough time to explore the rest of MozFest.
Another thing we learned was the need to document what was happening in the sessions. We heard several requests for video (perhaps even Firefox phone) recordings, to enable people who couldn’t attend the festival to participate and to mitigate schedule overload for the people there. We’ll pitch that idea to the organizers next year, along with additional Learning Labs as a way to share more ideas in smaller chunks.
All in all, this was a great MozFest and a terrific beginning for the Policy & Advocacy track. We’d love to hear your feedback — email us at firstname.lastname@example.org. We look forward to putting what we learned into practice for next year.
Your Friendly Policy & Advocacy Space Wranglers,
Dave Steer, Alina Hua and Stacy Martin
At the beginning of October, I went to Austin for the Digital PM Summit, which is an amazingly useful gathering of digital project managers, now in its second year. I was invited to speak about Retrospectives—my favorite topic! I enjoyed putting together a presentation and talking with some really talented PMs about how to create a culture of experimentation and continuous improvement.
Then, earlier this week, I had the opportunity to talk with the very smart and fun YNPN Launchpad Fellows about how to apply Agile methodologies to non-technical projects in nonprofit organizations. I’m becoming a little obsessed with non-technical applications of Agile (see ScrumYourWedding, coming soon!).
I <3 talking to people!
- 58 changesets
- 166 files changed
- 5644 insertions
- 1883 deletions
Extension occurrences: cpp 54, h 36, xhtml 18, js 11, css 7, build 6, webidl 5, java 3, ini 3, in 3, html 3, c 2, xul 1, py 1, mm 1, mk 1, list 1, jsx 1, ipdlh 1, hgtags 1
Module occurrences: layout 42, gfx 32, content 14, browser 12, dom 11, widget 10, parser 10, media 7, mobile 6, toolkit 4, netwerk 3, testing 2, js 2, webapprt 1, modules 1
List of changesets:

Michael Comella: Bug 1092254 - Use Solo.waitForCondition under the hood in BaseTest.waitForTest. r=liuche, a=test-only - cdd31f8931ae
Mark Finkle: Bug 1091410 - Intermittent testLinkContextMenu | Wait for the URLBar. r=bnicholson, a=test-only - 2ad92b68de0b
Henri Sivonen: Bug 1088635. r=smaug, a=dveditz. - 2be3d4150683
Mats Palmgren: Bug 1077687 - If we have a pending request to rebuild all style data then do so instead of processing individual restyles. r=roc, a=dveditz - fdb8b52bea5c
Randall Barker: Bug 1055562 - Crash in java.lang.IllegalStateException: Callback has already been executed. r=wesj, a=lsblakk - 3c9ba9327aa9
Bas Schouten: Bug 1093694 - Don't allow any graphics features when there's a driver version mismatch. r=jrmuizel, a=sledru - 38b0e08b93b7
Benoit Jacob: Bug 1021265 - Regard d3d11 as broken with displaylink on versions <= 22.214.171.124484, and fall back to basic layers. r=jrmuizel, a=sledru - 57c47cb49c03
Ryan VanderMeulen: Backed out changeset fdb8b52bea5c (Bug 1077687) for bustage. - 5b4bac2ebf6c
Mats Palmgren: Bug 1077687 - If we have a pending request to rebuild all style data then do so instead of processing individual restyles. r=roc, a=dveditz - 4e78f69ca4a9
Georg Fritzsche: Bug 1094035 - Keyed Histograms do not reflect key strings to JS correctly. r=froydnj, a=lmandel - b94f02c9dc7d
Jeff Gilbert: Bug 1037147 - Remove SharedTextureHandle and friends r=mattwoodrow,snorp a=lmandel - 04a5da64e518
James Willcox: Bug 1014614 - Rename nsSurfaceTexture to AndroidSurfaceTexture r=jgilbert a=lsblakk - 51f45407f843
James Willcox: Bug 1014614 - Expose more SurfaceTexture API in AndroidSurfaceTexture r=blassey a=lsblakk - ed90f61eb314
James Willcox: Bug 1014614 - Expose Android native window via AndroidNativeWindow wrapper r=blassey a=lsblakk - 9d1af2396d45
James Willcox: Bug 1014614 - Do not try to use a temporary texture for SurfaceTexture r=jgilbert a=lsblakk - 6b03a2b8f2f4
James Willcox: Bug 1014614 - Fix JNI wrapper for registering SurfaceTexture listener callbacks r=blassey a=lsblakk - bef38c92bab9
James Willcox: Bug 1014614 - Support attach/detach of GLContext to AndroidSurfaceTexture r=jgilbert a=lsblakk - c82e88a99ca3
Andrew Martin McDonough: Bug 1014614 - Use Android MediaCodec for decoding H264 and AAC in MP4 r=cpearce,edwin a=lsblakk - 47ea294898a0
James Willcox: Bug 1014614 - Add GLBlitHelper::BlitImageToFramebuffer and support SurfaceTexture images r=jgilbert a=lsblakk - 2973ae13faaa
James Willcox: Bug 1014614 - Fix readback of SurfaceTextureImage r=jgilbert a=lsblakk - 5813f7c574ce
James Willcox: Bug 1089423 - Catch MediaCodec exceptions r=gcp a=lsblakk - cd94c836426e
James Willcox: Bug 1089159 - Correctly use MediaCodec's audio output format r=cpearce a=lsblakk - 5811de401315
Daniel Holbert: Bug 1055665 part 1: Backout changeset aece7f9f944c (i.e. backout Bug 1032922 part 2). a=lmandel - b4e9b4dab577
Daniel Holbert: Bug 1055665 part 2: Backout changeset af2a4fb980ad (i.e. backout Bug 1032922 part 1). a=lmandel - d04d205b6c12
Stephen Pohl: Bug 1091109: Don't sign webapprt-stub on OSX because webapps fail to launch due to quarantine bit. r=smichaud,myk a=lmandel - a8edc81c39d5
Randell Jesup: Bug 1061702: Stop audio sources from continuing to play garbage after being stopped r=roc a=lmandel - d9f441a027e5
Daniel Holbert: Revert changesets d04d205b6c1 and b4e9b4dab577 because they landed with the wrong bug number. - 0430d2b93ed3
Daniel Holbert: Bug 1093316 part 1: Backout changeset aece7f9f944c (i.e. backout Bug 1032922 part 2). a=lmandel - af442befe914
Daniel Holbert: Bug 1093316 part 2: Backout changeset af2a4fb980ad (i.e. backout Bug 1032922 part 1). a=lmandel - 6f460d9ed80d
Ralph Giles: Bug 1073805 - Fix HE-AAC regression on windows. r=kinetik,cpearce a=lmandel - decaff6b28c7
Andrea Marchesini: Bug 1082734 - Disable location.searchParams for cross-origin insecure data access. r=bz, a=lmandel - d8080081d33a
Benoit Jacob: Bug 1093863 - Blacklist D3D on dual Intel/AMD not advertised as such in the registry. r=jrmuizel, a=lmandel - c8d99c0a36d9
James Willcox: Back out 04a5da64e518..5811de401315 - 375b5fca3825
James Willcox: Merge backout, a=bustage - 4cd1151d9de0
Stephen Pohl: Backout a8edc81c39d5 for causing an increased number of intermittent Bug 1059238. a=bustage - 81cf187bba10
Mark Banner: Bug 1093475 When a Loop call URL is deleted/blocked, use the proper session. r=mikedeboer a=lmandel - caa27159afeb
Randell Jesup: Bug 1090415: add *.room.co to screensharing whitelist rs=mreavy a=lmandel - 43e9c7a57468
Jim Chen: Bug 1073328 - Prevent using our own handler as system handler. r=snorp, a=lmandel - 967cb2edcd52
Brian Hackett: Bug 1091459 - Only interrupt JS execution after running long enough that the slow script dialog might need to be shown. r=bholley, a=lmandel - de49643707ae
Gijs Kruitbosch: Bug 1063121 - Dropping out of fullscreen mode without titlebar breaks titlebar/tabs layout. r=jimm, a=lmandel - f6b893ef9186
Dragana Damjanovic: Bug 1085266 - NetworkActivityMonitor PRIOMethods changed to be static, because not attached nsUDPSockets were crashing if SocketTransportService had been shut down. A small fix to nsUDPSocket destructor has been added. r=michal, a=lmandel - dfe08b30f41f
Andrew McCreight: Bug 1066212 - Disable dom/audiochannel/tests/test_telephonyPolicy.html on Android. r=baku a=test-only - 57e502f33317
Jeff Gilbert: Bug 1037147 - Remove SharedTextureHandle and friends r=mattwoodrow,snorp a=lmandel - 99c1af40ea1b
James Willcox: Bug 1014614 - Rename nsSurfaceTexture to AndroidSurfaceTexture r=jgilbert a=lsblakk - 03bb1ca133ee
James Willcox: Bug 1014614 - Expose more SurfaceTexture API in AndroidSurfaceTexture r=blassey a=lsblakk - edbde27790f4
James Willcox: Bug 1014614 - Expose Android native window via AndroidNativeWindow wrapper r=blassey a=lsblakk - 4053de7eee7b
James Willcox: Bug 1014614 - Do not try to use a temporary texture for SurfaceTexture r=jgilbert a=lsblakk - 2a7e9525f500
James Willcox: Bug 1014614 - Fix JNI wrapper for registering SurfaceTexture listener callbacks r=blassey a=lsblakk - 9c77e16f165c
James Willcox: Bug 1014614 - Support attach/detach of GLContext to AndroidSurfaceTexture r=jgilbert a=lsblakk - 997aac78a0b2
Andrew Martin McDonough: Bug 1014614 - Use Android MediaCodec for decoding H264 and AAC in MP4 r=cpearce,edwin a=lsblakk - 04cc5b970bb6
James Willcox: Bug 1014614 - Add GLBlitHelper::BlitImageToFramebuffer and support SurfaceTexture images r=jgilbert a=lsblakk - 2a84f955f197
James Willcox: Bug 1014614 - Fix readback of SurfaceTextureImage r=jgilbert a=lsblakk - ac0c848981db
James Willcox: Bug 1089423 - Catch MediaCodec exceptions r=gcp a=lsblakk - a915fb067948
James Willcox: Bug 1089159 - Correctly use MediaCodec's audio output format r=cpearce a=lsblakk - 53692e16c248
Jonathan Kew: Bug 1093949 - Reverse scroll position for RTL content. r=mats, a=lmandel - 4e453b566e83
Stephen Pohl: Bug 1091109: Don't sign webapprt-stub on OSX because webapps fail to launch due to quarantine bit on CLOSED TREE. r=smichaud,myk a=lmandel,RyanVM - 557655b23004
Nils Ohlmeier [:drno]: Bug 1089207: fix parsing of invalid fmtp att r=drno,jesup a=lmandel - f30e1c0c0694
Nicolas Silva: Bug 1089183 - Blacklist D2D on a range of ATI drivers that don't handle dxgi keyed mutex properly. r=bjacob, a=sledru - 8e812440658b
I was honored to give the opening keynote for USENIX URES14 East in Philadelphia in June 2014.
“The Value of Release Engineering as a Force Multiplier” keynote built on top of the “RelEng as a Force Multiplier” presentation I gave at RelEngConf 2013 and then as a Google Tech Talk. (The full set of slides are available here. If you want the original 25MB keynote file, let me know.)
Anyone who has ever talked with me about RelEng knows I feel very strongly that Release Engineering is important to the success of every software project. Writing a popular v1.0 product is just the first step. If you want to keep your initial early-adopter users by shipping v1.0.1 fixes, or grow your user base by shipping new v2.0 features to your existing users, you need a reproducible pipeline for accurately delivering software in a repeatable manner. Otherwise, you are “only” delivering a short-lived flash-in-the-pan one-off project. In my opinion, this pipeline is another product that software companies need to develop, alongside their own unique product, if they want to stay in the marketplace, and scale.
It's typical for Release Engineers to talk about the value of RelEng in terms that Release Engineers value – timely delivery, accurate builds, turnaround time, etc. I believe it's important to also describe Release Engineering in terms that people across an organization can understand. In my keynote, I specifically talked about the value of RelEng in terms that people-who-run-companies value – unique business opportunities, market / competitive advantages, new business models, reduced legal risk, etc.
Examples included: Mozilla’s infrastructure improvements which reduced turnaround time for delivering security fixes as well as helped deter future attacks… Hortonwork’s business ability to provide enterprise-grade support SLAs to customers running mission critical production “big data” systems on 100% open source Apache Hadoop… and even NASA’s remote software update of the Mars Rover.
People seemed to enjoy the presentation, with lively questions during, afterwards… and even into the end-of-day panel session.
Big thanks to the organizers (especially Dinah McNutt (RelEng at Google), Gareth Bowles) – they did an awesome job putting together a unique and special event.
Oh, and one more thing! Next week, USENIX URES14 West will start on Monday 10nov2014 in Seattle. If you are in the area, or can get there for Monday, you should attend! And make sure to see Kmoir’s presentation “Scaling Capacity While Saving Cash” – if you follow her blog, you know you can expect it to be well worth attending.
Last Thursday we had our regular weekly call about the Reps program, where we talk about what’s going on in the program and what Reps have been doing during the last week.
- Firefox 10th updates.
- Reps of the month October.
- Reminder about Monday Meetings.
- Community Calls for QA.
- Reps newsletter coming soon.
Don’t forget to comment about this call on Discourse and we hope to see you next week!
- 4 changesets
- 8 files changed
- 53 insertions
- 16 deletions
Extension occurrences: cpp 4, txt 2, hgtags 1, h 1
Module occurrences: widget 3, gfx 2, config 1, browser 1
List of changesets:

Nicolas Silva: Bug 1064107 - Ensure that gfxPlatform is initialized by the time we create the compositor. r=Bas, a=sledru - 691739025fac
Bas Schouten: Bug 1093694 - Don't allow any graphics features when there's a driver version mismatch. r=jrmuizel, a=sledru - 7311ad1fba8c
Benoit Jacob: Bug 1021265 - Regard d3d11 as broken with displaylink on versions <= 126.96.36.199484, and fall back to basic layers. r=jrmuizel, a=sledru - 63daea50bacd
Benoit Jacob: Bug 1093863 - Blacklist D3D on dual Intel/AMD not advertised as such in the registry. r=jrmuizel, a=lmandel - 983a710b51c4
We had a hiccup on hg.mozilla.org yesterday. It resulted in prolonged tree closures for Firefox. Bug 1094922 tracks the issue.

What Happened
We noticed that many HTTP requests to hg.mozilla.org were getting 503 responses. On initial glance, the servers were healthy. CPU was below 100% utilization, I/O wait was reasonable. And there was little to no swapping. Furthermore, the logs showed a healthy stream of requests being successfully processed at levels that are typical. In other words, it looked like business as usual on the servers.
Upon deeper investigation, we noticed that the WSGI process pool on the servers was fully saturated. There were 24 slots/processes per server allocated to process Mercurial requests, and all 24 of them were actively processing requests. This created a backlog of requests that had been accepted by the HTTP server but were waiting on an internal dispatch/proxy to WSGI. To the client, this would appear as a request with a long lag before response generation.

Mitigation
This being an emergency (trees were already closed and developers were effectively unable to use hg.mozilla.org), we decided to increase the size of the WSGI worker pool. After all, we had CPU, I/O, and memory capacity to spare and we could identify the root cause later. We first bumped worker pool capacity from 24 to 36 and immediately saw a significant reduction in the number of pending requests awaiting a WSGI worker. We still had spare CPU, I/O, and memory capacity and were still seeing requests waiting on a WSGI worker, so we bumped the capacity to 48 processes. At that time, we stopped seeing worker pool exhaustion and all incoming requests were being handed off to a WSGI worker as soon as they came in.
At this time, things were looking pretty healthy from the server end.

Impact on Memory and Swapping
Increasing the number of WSGI processes had the side-effect of increasing the total amount of system memory used by Mercurial processes in two ways. First, more processes means more memory. That part is obvious. Second, more processes means fewer requests for each process per unit of time, and thus it takes longer for each process to reach its max number of requests and be reaped. (It is a common practice in servers to have a single process handle multiple requests. This avoids the overhead of spawning a new process and loading possibly expensive context into it.)
We had our Mercurial WSGI processes configured to serve 100 requests before being reaped. With the doubling of WSGI processes from 24 to 48, those processes were lingering for 2x as long as before. Since the Mercurial processes grow over time (they are aggressive about caching repository data), this was slowly exhausting our memory pool.
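For illustration, in mod_wsgi (assuming that is the WSGI container in play; our exact deployment may differ) the two knobs discussed here, the process count and the per-process request limit, look roughly like this. The path and process-group name below are made up:

```
# Hypothetical mod_wsgi config; directive names are standard mod_wsgi,
# the numbers mirror the values from this incident (48 workers, reap after 50 requests).
WSGIDaemonProcess hgweb processes=48 threads=1 maximum-requests=50
WSGIProcessGroup hgweb
WSGIScriptAlias / /var/hg/hgweb.wsgi
```

`maximum-requests` is the reaping threshold: each worker exits after serving that many requests, which bounds how much cached repository data any one process can accumulate.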
It took a few hours, but a few servers started flirting with high swap usage. (We don't expect the servers to swap.) This is how we identified that memory use wasn't sane.
We lowered the maximum requests per process from 100 to 50 to match the ratio increase in the WSGI worker pool.

Mercurial Memory "Leak"
When we started looking at the memory usage of WSGI processes in more detail, we noticed something strange: RSS of Mercurial processes was growing steadily when processes were streaming bundle data. This seemed very odd to me. Being a Mercurial developer, I was pretty sure the behavior was wrong.
I filed a bug against Mercurial.
I was able to reproduce the issue locally and started running a bisection to find the regressing changeset. To my surprise, this issue has been around since Mercurial 2.7!
I looked at the code in question, identified why so much memory was being allocated, and submitted patches to stop an unbounded memory growth during clone/pull and to further reduce memory use during those operations. Both of those patches have been queued to go in the next stable release of Mercurial, 3.2.1.
Mercurial 3.2 is still not as memory efficient during clones as Mercurial 2.5.4. If I have time, I'd like to formulate more patches. But the important fix - not growing memory unbounded during clone/pull - is in place.
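If you want to watch for this kind of RSS growth yourself, here is a crude sketch (not the tooling we used). Note that `ru_maxrss` is in KiB on Linux but bytes on macOS:

```python
import resource

def peak_rss_kib():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()
# Build and touch ~50 MiB, standing in for cached bundle/repository data.
blob = b"x" * (50 * 1024 * 1024)
after = peak_rss_kib()
print(after > before)  # the allocation shows up as peak RSS growth
```

Sampling a metric like this per worker over time is enough to spot a process whose memory only ever goes up.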
Armed with the knowledge that Mercurial is leaky (although not a leak in the traditional sense, since the memory was eventually garbage collected), we further reduced the max requests per process from 50 to 20. This will cause processes to be reaped sooner and will be more aggressive about keeping RSS growth in check.

Root Cause
We suspect the root cause of the event is a network event.
Before this outage, we rarely had more than 10 requests being served from the WSGI worker pool. In other words, we were often well below 50% capacity. But something changed yesterday. More slots were being occupied and high-bandwidth operations were taking longer to complete. Kendall Libby noted that outbound traffic dropped by ~800 Mbps during the event. For reasons that still haven't been identified, the network became slower, clones weren't being processed as quickly, and clients were occupying WSGI processes for longer amounts of time. This eventually exhausted the available process pool, leading to HTTP 503's, intermittent service availability, and a tree closure.
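The dynamic here is just Little's law: the average number of busy workers equals arrival rate times mean service time, so slower service with unchanged traffic can blow through a fixed pool. A toy sketch with made-up numbers (we don't have exact rates from the incident):

```python
POOL_SIZE = 24  # WSGI slots per server before the emergency bump

def busy_workers(arrival_rate_per_s, mean_service_s):
    """Little's law: average number of requests concurrently in service."""
    return arrival_rate_per_s * mean_service_s

# Normal day: clones and pulls complete quickly.
print(busy_workers(2.0, 4.0))   # 8.0 -- comfortably under the 24-slot pool
# Degraded network: same traffic, each request holds a worker far longer.
print(busy_workers(2.0, 15.0))  # 30.0 -- exceeds the pool, so requests queue
```

With hypothetical numbers like these, a ~4x slowdown in service time is all it takes to turn a half-idle pool into a saturated one.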
Interestingly, we noticed that in-flight HTTP requests are abnormally high again this morning. However, because the servers are now configured to handle the extra capacity, we seem to be powering through it without any issues.

In Hindsight
You can make the argument that the servers weren't configured to serve as much traffic as possible. After all, we were able to double the WSGI process pool without hitting CPU, I/O, and memory limits.
The servers were conservatively configured. However, the worker pool was initially configured at 2x CPU core count. And as a general rule of thumb, you don't want your worker pool to be much greater than CPU count because that introduces context switching and can give each individual process a smaller slice of the CPU to process requests, leading to higher latency. Since clone operations often manage to peg a single CPU core, there is some justification for keeping the ratio of WSGI workers to CPU count low. Furthermore, we rarely came close to exhausting the WSGI worker pool before. There was little to no justification for increasing capacity to a threshold not normally seen.
But at the same time, even with 4x workers to CPU cores, our CPU usage rarely flirts with 100% across all cores, even with the majority of workers occupied. Until we actually hit CPU (or I/O) limits, running a high multiplier seems like the right thing to do.
Long term, we expect CPU usage during clone operations to drop dramatically. Mike Hommey has contributed a patch to Mercurial that allows servers to hand out a URL of a bundle file to fetch during clone. So a server can say "I have your data: fetch this static file from S3, then apply this small subset of the data that I'll give you." When properly deployed and used at Mozilla, this will effectively drop server-side CPU usage for clones to nothing.

Where to do Better
There was a long delay between the Nagios alerts firing and someone with domain-specific knowledge looking at the problem.
The trees could have reopened earlier. We were pretty confident about the state of things at 1000, yet trees opened in metered mode at 1246 and completely at 1909. That said, the swapping issue wasn't mitigated until 1615, so you can argue that being conservative about reopening was justified: a full reopening could have triggered excessive swapping and another round of chaos for everyone involved.
We need an alert on WSGI pool exhaustion. It took longer than it should have to identify this problem. However, now that we've encountered it, it should be obvious if/when it happens again.
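A check of the kind this paragraph calls for could be sketched as a Nagios-style plugin. Everything here, from the thresholds to how busy workers would be counted, is a hypothetical illustration and not the actual hg.mozilla.org monitoring configuration:

```python
# Hypothetical Nagios-style check for WSGI pool exhaustion.
# Threshold values are made-up assumptions for illustration.

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

def check_pool(busy, total, warn=0.75, crit=0.90):
    """Map worker-pool utilization to a Nagios exit code."""
    utilization = busy / float(total)
    if utilization >= crit:
        return CRITICAL
    if utilization >= warn:
        return WARNING
    return OK

if __name__ == "__main__":
    import sys
    # e.g. invoked as: check_wsgi_pool 46 48
    sys.exit(check_pool(int(sys.argv[1]), int(sys.argv[2])))
```

The key design point is alerting on pool utilization rather than on downstream symptoms like 503s, so the alert fires while there is still headroom to react.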
Firefox release automation is the largest single consumer of hg.mozilla.org. Since they operate thousands of machines, any reduction in interaction or increase in efficiency will result in drastic load reductions on the server. Chris AtLee and Jordan Lund have been working on bug 1050109 to reduce clones of the mozharness and build/tools repositories, which should go a long way toward dropping load on the server.

Timeline of Events
All times PST.
- 0705 - First Nagios alerts fire
- 0819 - Trees closed
- 0915 - WSGI process pool increased from 24 to 36
- 0945 - WSGI process pool increased from 36 to 48
- 1246 - Trees reopen in metered mode
- 1615 - Decrease max requests per process from 100 to 50
- 1909 - Trees open completely
- 0012 - Patches to reduce memory usage submitted to Mercurial
- 0800 - Mercurial patches accepted
- 0915 - Decrease max requests per process from 50 to 20
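The two knobs adjusted in the timeline (the WSGI process pool size and the max requests per process) map onto configuration directives roughly like the following, assuming an Apache/mod_wsgi deployment — an assumption on my part, since the post never names the server stack:

```apache
# Hypothetical sketch only: the paths, process-group name, and the
# Apache/mod_wsgi deployment itself are assumptions, not the actual
# hg.mozilla.org configuration.
WSGIDaemonProcess hgweb processes=48 threads=1 maximum-requests=20
WSGIScriptAlias / /var/hg/hgweb.wsgi process-group=hgweb
```

Lowering maximum-requests recycles worker processes more aggressively, which bounds the per-process memory growth described above at the cost of more frequent interpreter restarts.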
Last weekend I had the pleasure of joining the Mozilla Arabic community for their annual meetup, this time in Istanbul, Turkey.
The meetup schedule was packed for two full days, and we barely had time to cover all planned items. We made it, though, thanks to the fantastic organizing team (Melek, Sofien, Majda, Rami (who joined remotely), Migdadi and Nefzaoui).
Note to self #1: This is once again a reminder that such 30-person meetups that happen annually (or less frequently) need to run beyond 2 days. Adding half a day on Friday would help tremendously, letting everyone sync up, bringing people up to speed, and informing the schedule of the next two days.
The first day was dedicated to meta-community organization issues. The Arabic community is a group of regional communities that come together under shared goals (especially around l10n). The challenge of having such a meta-community is that the regional ones already have structure, leadership, pace and goals in place, and those might not necessarily be compatible with each other. We initially spent some time determining the shared functions, roles and goals that should be handled at the meta-community level rather than the individual community level (things like l10n oversight, Arabic community visibility, cross-community events and activities, etc.). The structure proposed (which I totally support) is a coordination committee with a rolling chair. Each community gets to be the chairing (“hosting”) one, driving and coordinating the meta-community for a period of 6 months; then another community takes over.
The notable pros of this approach are the load shared over time, the visibility it brings to individual communities, the helpful exposure to different coordination styles, and the sense of involvement and leadership all communities get to experience. The ball is already rolling with this approach, and a meeting next week will determine the first chairing community and finalize the way forward.
The second day was more project-specific. We had 3 core themes (L10n, FirefoxOS and Webmaker) and we split up in groups to run sessions on them. Partially training, partially brainstorming on upcoming activities in the region, it was a productive experience for both participants and session owners. Haven’t showcased WebIDE to people? Introduce them to the magic of developing apps with Firefox Desktop and watch them drool.
During the meetup we also had a long session on participation and community building (which was somewhat different from the approach taken at previous meetings). This time we introduced the idea of “Innovation from the edges” and brainstormed under two arcs: “Innovative ideas that you would like to work on” and “Ways that the rest of the Mozilla project could help you”.
Starting with the realization that the Mozilla Project (supported by Mozilla Corporation and Mozilla Foundation) cannot plan, execute, innovate on and support every possible activity and project that advances the Mozilla mission, we let people loose to come up with regional (and global) activities and projects that would bring innovation to Mozilla and help us advance our mission. The response was enthusiastic and informative. People quickly came up with ideas they would like to work on, ranging from engineering projects to partnerships with other projects on the ground. More interestingly, patterns emerged under the “how the rest of Mozilla can support you” arc. Hands-on training (technical or not), a mandate to represent Mozilla, access to tools and systems (in an open way), and IT resources were some recurring themes we identified. All of these will be taken back to the Mozilla Community Building team and the appropriate Working Groups to inform our strategy for the near future and enable us to better support regional and functional communities.
Note to self #2: Budget and swag (our default go-tos for regional support) were not even mentioned in the “how we can support you” session. We may need to rethink many of our assumptions moving forward.
I am confident that the Arabic community has a solid way forward planned after this meetup, and I can’t wait to see the results. As for the learnings we got out of this weekend, we need to evaluate them and plan the way forward for our participation strategy, informed by such inputs.
Event wiki page: https://wiki.mozilla.org/ArabicMozillaMeetup/2014
Analysis of the community: https://etherpad.mozilla.org/arabic-meetup-swot
Action plan: https://arabicmozilla.etherpad.mozilla.org/meetup-14-action-plan
For a year now the Systems and Data Working Group of Mozilla has been meeting, brainstorming about community building systems, designing and implementing them and pioneering new ways to measure contribution activity across Mozilla.
In the process of evaluating existing systems (like mozillians.org) and creating new ones (like Baloo), it became obvious that we needed a common set of principles to apply to all systems in service of community building within Mozilla. That would enable Mozillians to easily access tools and contribute in a way that maximizes impact. We, the Systems and Data Working Group, recommend that these principles be adopted for all tools used by Mozilla.
The principles, grouped into buckets, are:
- Unified Identity
- Tools should have a single source of truth for people data
- Integration with HRIS
- mozillians.org already has staff and volunteer information, so it is a good candidate as the single source of truth
- Tools should de-duplicate people information by integrating with a single source of truth
- e.g. Reps: not integrated with Mozillians.org, leading to lots of duplicate information across two profiles
- Unified Authentication and Authorization
- Tools should use a single identity platform that provides permissions-based access to tools (like Mozillians.org)
- e.g. Reps: add people to the Reps group on mozillians.org to give them permission to use rep.mozilla.org as a Rep
- Accessible Metrics
- Tools should track each contribution a Mozillian makes and provide it in an accessible way to create a holistic view of contributions
- Tools should be localized so they are accessible to our global community
- Tools should teach the user how to use the tool, answer common usage questions, and have general documentation
- Tools should recognize the contributions that they enable
- Tools should enable anyone to improve the tool by filing bugs, suggesting ideas and bringing those ideas to life
- Content de-duplication
- Tools should de-duplicate the content that is created in those tools, making it accessible to other systems
- Tools should be personal and written in the Mozilla voice
This has been a collaborative effort involving various stakeholders of tools within Mozilla, who have been reviewing the principles and providing feedback during our meetings. We are seeking more feedback moving forward, especially with regard to how these principles impact the roadmaps of various tools and translate into actual features. Feel free to comment here or join our discussions on the community-building mailing list.