Mozilla Nederland

The Dutch Mozilla community

Andrew Halberstadt: Python 3 at Mozilla

Mozilla planet - di, 30/04/2019 - 21:25

Mozilla uses a lot of Python. Most of our build system, CI configuration, test harnesses, command line tooling and countless other scripts, tools or Github projects are all handled by Python. In mozilla-central there are over 3500 Python files (excluding third party files), comprising roughly 230k lines of code. Additionally there are 462 repositories labelled with Python in the Mozilla org on Github (though many of these are not active). That’s a lot of Python, and most of it is Python 2.

With Python 2’s exaugural year well underway, it is a good time to take stock of the situation and ask some questions. How far along has Mozilla come in the Python 3 migration? Which large work items lie on the critical path? And do we have a plan to get to a good state in time for Python 2’s EOL on January 1st, 2020?


Mozilla VR Blog: Firefox Reality coming to SteamVR

Mozilla planet - di, 30/04/2019 - 19:02

We are excited to announce that we’re working with Valve to bring the immersive web to SteamVR!

This January, we announced that we were bringing the Firefox Reality experience to desktop devices and the Vive stores. Since then, collaborating closely with Valve, we have been working to also bring Firefox Reality to the SteamVR immersive experience. In the coming months, users will be offered a way to install Firefox Reality via a new web dashboard button, and then launch a browser window over any OpenVR experience.

With a few simple clicks, users will be able to access web content such as tips or guides or stream a Twitch comment channel without having to exit their immersive experiences. In addition, users will be able to log into their Firefox account once, and access synced bookmarks and cookies across both Firefox and Firefox Reality — no need to log in twice!


We are excited to collaborate with Valve and release Firefox for SteamVR this summer.


Mozilla GFX: WebRender newsletter #44

Mozilla planet - di, 30/04/2019 - 18:33

WebRender is a GPU-based 2D rendering engine for the web, written in Rust. It currently powers Mozilla's research web browser Servo and is on its way to becoming Firefox's rendering engine.

WebRender on Linux in Firefox Nightly

Right after the previous newsletter was published, Andrew and Jeff enabled WebRender on Nightly for Linux users with Intel integrated GPUs and Mesa 18.2 or newer, provided their screen resolution is 3440×1440 or less.
We decided to start with Mesa thanks to the quality of its drivers. Users with 4K screens will have to wait a little longer (or enable WebRender manually), as there are a number of specific optimizations we want to do before we are comfortable running WebRender on these very high resolution screens. While most recent discrete GPUs can stomach about anything we throw at them, integrated GPUs operate on a much tighter budget and compete with the CPU for memory bandwidth. 4K screens are real little memory-bandwidth-eating monsters.

WebRender roadmap

Jessie put together a roadmap of the WebRender project and other graphics endeavors, based on the items discussed during the week in Toronto.
It gives a good idea of the topics that we are focusing on for the coming months.

A week in Toronto – Part deux

In the previous newsletter I went over a number of the topics that we discussed during the graphics team’s last get-together in Toronto. Let’s continue here.

WebRender on Android

We went over a number of the items on WebRender's Android TODO list. Getting WebRender to work at all on Android is one thing: it requires a lot of platform-specific low-level glue code, which Sotaro has been steadily improving lately.

On top of that come more questions:

  • Which portion of the Android user population supports the OpenGL features that WebRender relies on?
  • Which OpenGL features could we stop relying on to cover more users?
  • What do we do about the remaining users whose devices support such a small OpenGL feature set that we don't plan to bring WebRender to them in the foreseeable future?

Among the features that WebRender currently relies on heavily but that are (surprisingly) not universally supported in this day and age:

  • texture arrays,
  • float 32 textures,
  • texture fetches in vertex shaders,
  • instancing.

We discussed various workarounds. Some of them will be easy to implement, some harder; some will come at a cost, and some we are not sure will provide an acceptable user experience. As it turns out, building a modern rendering engine while also targeting devices that are anything but modern is quite a challenge. Who would have thought!
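For illustration, here is a minimal sketch of the kind of capability probe this implies, assuming a current GLES context. It is not Gecko's actual detection code, and the version threshold and vertex-texture check are simplifying assumptions.

#include <GLES3/gl3.h>
#include <string>
#include <vector>

// Rough probe for the features WebRender leans on. Illustrative only:
// the real detection logic in Gecko is more involved than this.
std::vector<std::string> MissingWebRenderFeatures() {
  std::vector<std::string> missing;

  // Texture arrays, float32 textures and instancing are core in GLES 3.0+.
  // (On a GLES 2 context this query fails and `major` stays 0, which also
  // lands in the "missing" bucket.)
  GLint major = 0;
  glGetIntegerv(GL_MAJOR_VERSION, &major);
  if (major < 3) {
    missing.push_back("OpenGL ES 3.0 (texture arrays, float textures, instancing)");
  }

  // Texture fetches in vertex shaders: some GPUs expose zero vertex texture units.
  GLint vertexTextureUnits = 0;
  glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS, &vertexTextureUnits);
  if (vertexTextureUnits == 0) {
    missing.push_back("vertex texture fetch");
  }

  return missing;
}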

Frame scheduling

Rendering a frame, from a change of layout triggered by some JavaScript to photons flying out of the screen, goes through a long pipeline. Sometimes a step in that pipeline takes longer than we would like, but other parts of the pipeline absorb and hide the issue and all is mostly fine. Sometimes, though, a slowdown in a particular place with the wrong timing can cause a chain of bad interactions, resulting in a back and forth between rapid bursts of a few frames and a couple of missed frames as parts of the system oscillate between throttling themselves on and off.

I am describing this in the abstract because the technical description of how and why this can happen in Gecko is complicated. It’s a big topic that impacts the design of a lot of pieces in Firefox’s rendering engine. We talked about this and came up with some short and long term potential improvements.

Intel 4K performance

I mentioned this towards the beginning of this post. Integrated GPUs tend to be more limited in, well, most things, but most importantly in memory bandwidth, which is exacerbated by sharing RAM with the CPU. Jeff and Markus observed that when a high-resolution screen's framebuffer doesn't fit in the integrated GPU's dedicated caches, it can be significantly faster to split the screen into a few large regions and render them one by one. This comes at the cost of batch breaks and an increased number of draw calls, but restricting rendering to smaller portions of the screen gives the GPU a more cache-friendly workload than rendering the entire screen in a single pass.

This approach is interestingly similar to the way tiled GPUs, common on mobile devices, work.
On top of that there are some optimizations we want to investigate to reduce the number of batch breaks caused by text on platforms that do not support dual-source blending, as well as a continued investigation into what is slow specifically on Intel devices.
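As a hedged sketch of the idea described above (not the actual WebRender change), rendering can be restricted to a handful of horizontal bands with a scissor rect, trading extra draw calls for a working set that fits the GPU's caches. The helper and band count below are illustrative assumptions.

#include <GLES3/gl3.h>
#include <algorithm>

// Hypothetical helper: render the frame in a few horizontal bands instead of
// one full-screen pass, so each band's working set is more cache-friendly.
template <typename DrawFrameFn>
void RenderInBands(int width, int height, int bands, DrawFrameFn drawFrame) {
  glEnable(GL_SCISSOR_TEST);
  const int bandHeight = (height + bands - 1) / bands;
  for (int i = 0; i < bands; ++i) {
    const int y = i * bandHeight;
    const int h = std::min(bandHeight, height - y);
    glScissor(0, y, width, h);  // clip all draws to this band
    drawFrame();                // re-issue the frame's draws (hence the batch breaks)
  }
  glDisable(GL_SCISSOR_TEST);
}

// e.g. RenderInBands(3840, 2160, 4, [&] { DrawDisplayList(); });
// where DrawDisplayList is a stand-in for whatever draws the frame.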

Other topics

We went over a number of other technical topics, such as WebRender's threading architecture, the gory details of support for backface-visibility, where to get the best Thai food in downtown Toronto, and more. I won't cover them here because they are somewhat hard and/or boring to explain (or because I wasn't involved enough in the topics to do them justice on this blog).

In conclusion

It's been a very useful and busy week. The graphics team will meet next in Whistler in June along with the rest of Mozilla. By then Firefox 67 will ship, enabling WebRender for a subset of Windows users in the release channel, which is a huge milestone for us.

Enabling WebRender in Firefox Nightly

In about:config, enable the pref gfx.webrender.all and restart the browser.

Reporting bugs

The best place to report bugs related to WebRender in Firefox is the Graphics :: WebRender component in bugzilla.

Note that it is possible to log in with a GitHub account.

Using WebRender in a Rust project

WebRender is available as a standalone crate on crates.io (documentation)


The Mozilla Blog: $2.4 Million in Prizes for Schools Teaching Ethics Alongside Computer Science

Mozilla planet - di, 30/04/2019 - 15:00
Omidyar Network, Mozilla, Schmidt Futures, and Craig Newmark Philanthropies are announcing the Stage I winners of our Responsible Computer Science Challenge

 

Today, we are announcing the first winners of the Responsible Computer Science Challenge. We’re awarding $2.4 million to 17 initiatives that integrate ethics into undergraduate computer science courses.

The winners’ proposed curricula are novel: They include in-class role-playing games to explore the impact of technology on society. They embed philosophy experts and social scientists in computer science classes. They feature “red teams” that probe students’ projects for possible negative societal impacts. And they have computer science students partner with local nonprofits and government agencies.

The winners will receive awards of up to $150,000, and they span the following categories: public university, private university, liberal arts college, community college, and Jesuit university. Stage I winners are located across 13 states, with computer science programs ranging in size from 87 students to 3,650 students.

The Responsible Computer Science Challenge is an ambitious initiative by Omidyar Network, Mozilla, Schmidt Futures, and Craig Newmark Philanthropies. It aims to integrate ethics and responsibility into undergraduate computer science curricula and pedagogy at U.S. colleges and universities.

Says Kathy Pham, computer scientist and Mozilla Fellow co-leading the Challenge: “Today’s computer scientists write code with the potential to affect billions of people’s privacy, security, equality, and well-being. Technology today can influence what journalism we read and what political discussions we engage with; whether or not we qualify for a mortgage or insurance policy; how results about us come up in an online search; whether we are released on bail or have to stay; and so much more.”

Pham continues: “These 17 winners recognize that power, and take crucial steps to integrate ethics and responsibility into core courses like algorithms, compilers, computer architecture, neural networks, and data structures. Furthermore, they will release their materials and methodology in the open, allowing other individuals and institutions to adapt and use them in their own environment, broadening the reach of the work. By deeply integrating ethics into computer science curricula and sharing the content openly, we can create more responsible technology from the start.”

Says Yoav Schlesinger, principal at Omidyar Network’s Tech and Society Lab co-leading the Challenge: “Revamping training for the next generation of technologists is critical to changing the way tech is built now and into the future. We are impressed with the quality of submissions and even more pleased to see such outstanding proposals awarded funding as part of Stage I of the Responsible Computer Science Challenge. With these financial resources, we are confident that winners will go on to develop exciting, innovative coursework that will not only be implemented at their home institutions, but also scaled to additional colleges and universities across the country.”

Challenge winners are announced in two stages: Stage I (today), for concepts that deeply integrate ethics into existing undergraduate computer science courses, either through syllabi changes or teaching methodology adjustments. Stage I winners receive up to $150,000 each to develop and pilot their ideas. Stage II (summer 2020) supports the spread and scale of the most promising approaches developed in Stage I. In total, the Challenge will award up to $3.5 million in prizes.

The winners announced today were selected by a panel of 19 independent judges from universities, community organizations, and the tech industry. Judges deliberated over the course of three weeks.

The Winners

(School | Location | Principal Investigator)

Allegheny College | Meadville, PA | Oliver Bonham-Carter 

While studying fields like artificial intelligence and data analytics, students will investigate potential ethical and societal challenges. For example: They might interrogate how medical data is analyzed, used, or secured. Lessons will include relevant readings, hands-on activities, and talks from experts in the field.

 

Bemidji State University | Bemidji, MN | Marty J. Wolf, Colleen Greer

The university will lead workshops that guide faculty at other institutions in developing and implementing responsible computer science teaching modules. The workshops will convene not just computer science faculty, but also social science and humanities faculty.

 

Bowdoin College | Brunswick, ME | Stacy Doore

Computer science students will participate in “ethical narratives laboratories,” where they experiment with and test the impact of technology on society. These laboratories will include transformative engagement with real and fictional narratives including case studies, science fiction readings, films, shows, and personal interviews.

 

Columbia University | New York, NY | Augustin Chaintreau

This approach integrates ethics directly into the computer science curriculum, rather than making it a stand-alone course. Students will consult and engage with an “ethical companion” that supplements a typical course textbook, allowing ethics to be addressed immediately alongside key concepts. The companion provides examples, case studies, and problem sets that connect ethics with topics like computer vision and algorithm design.

 

Georgetown University | Washington, DC | Nitin Vaidya

Georgetown’s computer science department will collaborate with the school’s Ethics Lab to create interactive experiences that illuminate how ethics and computer science interact. The goal is to introduce a series of active-learning engagements across a semester-long arc into selected courses in the computer science curriculum.

 

Georgia Institute of Technology | Atlanta, GA | Ellen Zegura

This approach embeds social responsibility into the computer science curriculum, starting with the introductory courses. Students will engage in role-playing games (RPGs) to examine how a new technology might impact the public. For example: How facial recognition or self-driving cars might affect a community.

 

Harvard University | Cambridge, MA | Barbara Grosz

Harvard will expand the open-access resources of its Embedded EthiCS program which pairs computer science faculty with philosophy PhD students to develop ethical reasoning modules that are incorporated into courses throughout the computer science curriculum. Computer science postdocs will augment module development through design of activities relevant to students’ future technology careers.

 

Miami Dade College | Miami, FL | Antonio Delgado

The college will integrate social impact projects and collaborations with local nonprofits and government agencies into the computer science curriculum. Computer science syllabi will also be updated to include ethics exercises and assignments.

 

Northeastern University | Boston, MA | Christo Wilson

This initiative will embed an ethics component into the university’s computer science, cybersecurity, and data science programs. The ethics component will include lectures, discussion prompts, case studies, exercises, and more. Students will also have access to a philosophy faculty advisor with expertise in information and data ethics.

 

Santa Clara University | Santa Clara, CA | Sukanya Manna, Shiva Houshmand, Subramaniam Vincent

This initiative will help CS students develop a deliberative ethical analysis framework that complements their technical learning. It will develop software engineering ethics, cybersecurity ethics, and data ethics modules, with integration of case studies and projects. These modules will also be adapted into free MOOC materials, so other institutions worldwide can benefit from the curriculum.

 

University of California, Berkeley | Berkeley, CA | James Demmel, Cathryn Carson

This initiative integrates a “Human Contexts and Ethics Toolkit” into the computer science/data science curriculum. The toolkit helps students discover when and how their work intersects with social power structures. For example: bias in data collection, privacy impacts, and algorithmic decision making.

 

University at Buffalo | Buffalo, NY | Atri Rudra

In this initiative, freshmen studying computer science will discuss ethics in the first-year seminar “How the internet works.” Sophomores will study responsible algorithmic development for real-world problems. Juniors will study the ethical implications of machine learning. And seniors will incorporate ethical thinking into their capstone course.

 

University of California, Davis | Davis, CA | Annamaria (Nina) Amenta, Gerardo Con Díaz, and Xin Liu

Computer science students will be exposed to social science and humanities while pursuing their major, culminating in a “conscientious” senior project. The project will entail developing technology while assessing its impact on inclusion, privacy, and other factors, and there will be opportunities for projects with local nonprofits or government agencies.

 

University of Colorado, Boulder | Boulder, CO | Casey Fiesler

This initiative integrates an ethics component into introductory programming classes, and features an “ethics fellows program” that embeds students with an interest in ethics into upper division computer science and technical classes.

 

University of Maryland, Baltimore County | Baltimore, MD | Helena Mentis

This initiative uses three avenues to integrate ethics into the computer science curriculum: peer discussions on how technologies might affect different populations; negative implications evaluations, i.e. “red teams” that probe the potential negative societal impacts of students’ projects; and a training program to equip teaching assistants with ethics and equality literacy.

 

University of Utah | Salt Lake City, UT | Suresh Venkatasubramanian, Sorelle A. Friedler (Haverford College), Seny Kamara (Brown University)

Computer science students will be encouraged to apply problem solving and critical thinking not just to the design of algorithms, but also to the social issues their algorithms intersect with. For example: When studying bitcoin mining algorithms, students will focus on energy usage and environmental impact. The curriculum will be developed with the help of domain experts in sustainability, surveillance, criminal justice, and other issue areas.

 

Washington University | St. Louis, MO | Ron Cytron

Computer science students will participate in “studio sessions,” or group discussions that unpack how their technical education and skills intersect with issues like individual privacy, data security, and biased algorithms.

 

The Responsible Computer Science Challenge is part of Mozilla’s mission to empower the people and projects on the front lines of internet health work. Learn more about Mozilla Awards.

Launched in October 2018, the Responsible Computer Science Challenge, incubated at Omidyar Network’s Tech and Society Solutions Lab, is part of Omidyar Network’s growing efforts to mitigate the unintended consequences of technology on our social fabric, and ensure products are responsibly designed and brought to market.

The post $2.4 Million in Prizes for Schools Teaching Ethics Alongside Computer Science appeared first on The Mozilla Blog.


Mark Côté: Deconstruction of a Failure

Mozilla planet - di, 30/04/2019 - 14:38
Something I regularly tell my daughter, who can tend towards perfectionism, is that we all fail. Over the last few years, I’ve seen more and more talks and articles about embracing failure. The key is, of course, to learn from the failure. I’ve written a bit before about what I learned from leading the MozReview project, Mozilla’s experiment with a new approach to code review that lasted from about 2014 to 2018.

About:Community: Firefox 67 new contributors

Mozilla planet - di, 30/04/2019 - 03:45

With the release of Firefox 67, we are pleased to welcome the 75 developers who contributed their first code change to Firefox in this release, 66 of whom were brand new volunteers! Please join us in thanking each of these diligent and enthusiastic individuals, and take a look at their contributions:


The Servo Blog: This Week In Servo 129

Mozilla planet - di, 30/04/2019 - 02:30

In the past week, we merged 68 PRs in the Servo organization’s repositories.

Planning and Status

Our roadmap is available online, including the team’s plans for 2019.

This week’s status updates are here.

Exciting works in progress

Notable Additions
  • ferjm implemented enough Shadow DOM support to build user agent widgets, including media controls.
  • miller-time standardized the use of referrers in fetch requests.
  • krk added a build-time validation that the DOM inheritance hierarchy matches the WebIDL hierarchy.
  • paulrouget redesigned part of the embedding API to separate per-window from per-application APIs.
  • AZWN created an API for using the type system to represent important properties of the JS engine.
  • Akhilesh1996 implemented the setValueCurveAtTime Web Audio API.
  • jdm transitioned the Windows build to rely on clang-cl instead of the MSVC compiler.
  • snarasi6 implemented the setPosition and setOrientation Web Audio APIs.
New Contributors

Interested in helping build a web browser? Take a look at our curated list of issues that are good for new contributors!


The Mozilla Blog: Facebook’s Ad Archive API is Inadequate

Mozilla planet - ma, 29/04/2019 - 14:49
Facebook’s tool meets only two of experts’ five minimum standards. That’s a failing grade.

 

Facebook pledged in February to release an ad archive API, in order to make political advertising on the platform more transparent. The company finally released this API in late March — and we’ve been doing a review to determine if it is up to snuff.

While we appreciate Facebook following through on its commitment to make the ad archive API public, its execution on the API leaves something to be desired. The European Commission also hinted at this last week in its analysis when it said that “further technical improvements” are necessary.

The fact is, the API doesn't provide necessary data. And it is designed in ways that hinder the important work of researchers, who inform the public and policymakers about the nature and consequences of misinformation.

Last month, Mozilla and more than sixty researchers published five guidelines we hoped Facebook’s API would meet. Facebook’s API fails to meet three of these five guidelines. It’s too early to determine if it meets the two other guidelines. Below is our analysis:

[1] ❌

Researchers’ guideline: A functional, open API should have comprehensive political advertising content.

Facebook’s API: It’s impossible to determine if Facebook’s API is comprehensive, because it requires you to use keywords to search the database. It does not provide you with all ad data and allow you to filter it down using specific criteria or filters, the way nearly all other online databases do. And since you cannot download data in bulk and ads in the API are not given a unique identifier, Facebook makes it impossible to get a complete picture of all of the ads running on their platform (which is exactly the opposite of what they claim to be doing).

[2] ❌

Researchers’ guideline: A functional, open API should provide the content of the advertisement and information about targeting criteria.

Facebook’s API: The API provides no information on targeting criteria, so researchers have no way to tell the audience that advertisers are paying to reach. The API also doesn’t provide any engagement data (e.g., clicks, likes, and shares), which means researchers cannot see how users interacted with an ad. Targeting and engagement data is important because it lets researchers see what types of users an advertiser is trying to influence, and whether or not their attempts were successful.

[3]

Researchers’ guideline: A functional, open API should have up-to-date and historical data access.

Facebook’s API: Ad data will be available in the archive for seven years, which is actually pretty good. Because the API is new and still hasn’t been properly populated, we cannot yet assess whether it is up-to-date, whether bugs will be fixed, or whether Facebook will support long-term studies.

[4]

Researchers’ guideline: A functional, open API should be accessible to and shareable with the general public.

Facebook’s API: This data is now available as part of Facebook’s standard GraphAPI and governed by Facebook Developers Terms of Service. It is too early to determine what exact constraints this will create for public availability and disclosure of data.

[5] ❌

Researchers’ guideline: A functional, open API should empower, not limit, research and analysis.

Facebook's API: The current API design puts huge constraints on researchers, rather than allowing them to discover what is really happening on the platform. The limitations in each of these categories, coupled with search rate limits, mean it could take researchers months to evaluate ads in a certain region or on a certain topic.

 

It’s not too late for Facebook to fix its API. We hope they take action soon. And, we hope bodies like the European Commission carefully scrutinize the tool’s shortcomings.

Mozilla will also be conducting an analysis of Google’s ad API when it is released in the coming weeks. Since Facebook’s ad archive API fails to let researchers do their jobs ahead of the upcoming European Parliamentary elections, we hope that Google will step up and deliver an API that enables this important research.

The post Facebook’s Ad Archive API is Inadequate appeared first on The Mozilla Blog.


Daniel Stenberg: What is the incentive for curl to release the library for free?

Mozilla planet - ma, 29/04/2019 - 09:43

(This is a repost of the answer I posted on stackoverflow for this question. This answer immediately became my most ever upvoted answer on stackoverflow with 516 upvotes during the 48 hours it was up before a moderator deleted it for unspecified reasons. It had then already been marked “on hold” for being “primarily opinion-based” and then locked but kept: “exists because it has historical significance”. But apparently that wasn’t good enough. I’ve saved a screenshot of the deletion. Debated on meta.stackoverflow.com. Status now: it was brought back but remains locked.)

I’m Daniel Stenberg.

I made curl

I founded the curl project back in 1998, I wrote the initial curl version and I created libcurl. I’ve written more than half of all the 24,000 commits done in the source code repository up to this point in time. I’m still the lead developer of the project. To a large extent, curl is my baby.

I shipped the first version of curl as open source since I wanted to “give back” to the open source world that had given me so much code already. I had used so much open source and I wanted to be as cool as the other open source authors.

Thanks to it being open source, literally thousands of people have been able to help us out over the years and have improved the products, the documentation, the web site and just about every other detail around the project. curl and libcurl would never have become the products that they are today were they not open source. The list of contributors now surpasses 1,900 names, and currently the list grows by a few hundred names per year.

Thanks to curl and libcurl being open source and liberally licensed, they were immediately adopted in numerous products and soon shipped by operating systems and Linux distributions everywhere thus getting a reach beyond imagination.

Thanks to them being “everywhere”, available and liberally licensed they got adopted and used everywhere and by everyone. It created a defacto transfer library standard.

At an estimated six billion installations worldwide, we can safely say that curl is the most widely used internet transfer library in the world. It simply would not have gotten there had it not been open source. curl runs in billions of mobile phones, a billion Windows 10 installations, half a billion games and several hundred million TVs – and more.

Should I have released it with a proprietary license instead and charged users for it? It never occurred to me, and it wouldn't have worked because I would never have managed to create this kind of stellar project on my own. And projects and companies wouldn't have used it.

Why do I still work on curl?

Now, why do I and my fellow curl developers still continue to develop curl and give it away for free to the world?

  1. I can’t speak for my fellow project team members. We all participate in this for our own reasons.
  2. I think it’s still the right thing to do. I’m proud of what we’ve accomplished and I truly want to make the world a better place and I think curl does its little part in this.
  3. There are still bugs to fix and features to add!
  4. curl is free but my time is not. I still have a job and someone still has to pay someone for me to get paid every month so that I can put food on the table for my family. I charge customers and companies to help them with curl. You too can get my help for a fee, which then indirectly helps making sure that curl continues to evolve, remain free and the kick-ass product it is.
  5. curl was my spare time project for twenty years before I started working with it full time. I’ve had great jobs and worked on awesome projects. I’ve been in a position of luxury where I could continue to work on curl on my spare time and keep shipping a quality product for free. My work on curl has given me friends, boosted my career and taken me to places I would not have been at otherwise.
  6. I would not do it differently if I could go back and do it again.
Am I proud of what we’ve done?

Yes. So insanely much.

But I’m not satisfied with this and I’m not just leaning back, happy with what we’ve done. I keep working on curl every single day, to improve, to fix bugs, to add features and to make sure curl keeps being the number one file transfer solution for the world even going forward.

We do mistakes along the way. We make the wrong decisions and sometimes we implement things in crazy ways. But to win in the end and to conquer the world is about patience and endurance and constantly going back and reconsidering previous decisions and correcting previous mistakes. To continuously iterate, polish off rough edges and gradually improve over time.

Never give in. Never stop. Fix bugs. Add features. Iterate. To the end of time.

For real?

Yeah. For real.

Do I ever get tired? Is it ever done?

Sure, I get tired at times. Working on something every day for over twenty years isn't a paved downhill road. Sometimes there are obstacles. At times things are rough. Occasionally people are just as ugly and annoying as people can be.

But curl is my life's project and I have patience. I have thick skin and I don't give up easily. The tough times pass and most days are awesome. I get to hang out with awesome people, and the reward of knowing that my code helps drive the Internet revolution everywhere is an ego boost above normal.

curl will never be “done” and so far I think work on curl is pretty much the most fun I can imagine. Yes, I still think so even after twenty years in the driver’s seat. And as long as I think it’s fun I intend to keep at it.


Robert O'Callahan: Goodbye Mozilla IRC

Mozilla planet - ma, 29/04/2019 - 06:56

I've been connected to Mozilla IRC for about 20 years. When I first started hanging out on Mozilla IRC I was a grad student at CMU. It's how I got to know a lot of Mozilla people. I was never an IRC op or power user, but when #mozilla was getting overwhelmed with browser user chat I was the one who created #developers. RIP.

I'll be sad to see it go, but I understand the decision. Technologies have best-before dates. I hope that Mozilla chooses a replacement that sucks less. I hope they don't choose Slack. Slack deliberately treats non-Chrome browsers as second-class — in particular, Slack Calls don't work in Firefox. That's obviously a problem for Mozilla users, and it would send a bad message if Mozilla says that sort of attitude is fine with them.

I look forward to finding out what the new venue is. I hope it will be friendly to non-Mozilla-staff and the community can move over more or less intact.


David Humphrey: irc.mozilla.org

Mozilla planet - za, 27/04/2019 - 04:51

Today I read Mike Hoye's blog post about Mozilla's IRC server coming to an end.  He writes:

Mozilla has relied on IRC as our main synchronous communications tool since the beginning...While we still use it heavily, IRC is an ongoing source of abuse and  harassment for many of our colleagues and getting connected to this now-obscure forum is an unnecessary technical barrier for anyone finding their way to Mozilla via the web.  

And, while "Mozilla intends to deprecate IRC," he goes on to say:

we definitely still need a globally-available, synchronous and text-first communication tool.

While I made dinner tonight, I thought back over my long history using Mozilla's IRC system, and tried to understand its place in my personal development within Mozilla and open source.

/invite

I remember the very first time I used IRC.  It was 2004, and earlier in the week I had met with Mike Shaver at Seneca, probably for the first time, and he'd ended our meeting with a phrase I'd never heard before, but I nodded knowingly nevertheless: "Ping me in #developers."

Ping me.  What on earth did that mean!? Little did I know that this phrase would come to signify so much about the next decade of my life.  After some research and initial trial and error, 'dave' joined irc.mozilla.org and found his way to the unlisted #developers channel.  And there was 'shaver', along with 300 or so other #developers.

The immediacy of it was unlike anything I'd used before (or since).  To join irc was to be transported somewhere else.  You weren't anywhere, or rather, you were simultaneously everywhere.  For many of these years I was connecting to irc from an old farm house in the middle of rural Ontario over a satellite internet connection.  But when I got online, there in the channels with me were people from New Zealand, the US, Sweden, and everywhere in between.

Possibly you've been on video calls with people from around the world, and felt something similar.  However, what was different from a video call, or teleconference, or any other medium I've used since, is that the time together didn't need to end.  You weren't meeting as such, and there wasn't a timebox or shared goal around your presence there.  Instead, you were working amongst one another, co-existing, listening, and most importantly for me, learning.

/join

Over the next year, irc went from being something I used here and there to something I used all the time.  I became 'humph' (one day Brendan confused me for Dave Herman, and shaver started calling me 'humph' to clarify) and have remained so ever since.  There are lots of people who have only ever called me 'humph' even to my face, which is hilarious and odd, but also very special.

Mike Beltzner taught me how to overcome one of the more difficult aspects of IRC: maintaining context after you log off.  Using screen and irssi I was able to start, leave, and then pick up conversations at a later time.  It's something you take for granted on Slack, but was critical to me being able to leverage IRC as a source of knowledge: if I asked a question, it might be hours before the person who could answer it would wake up and join irc from another part of the planet.

I became more engaged with different areas of the project.  IRC is siloed.  A given server is partitioned into many different channels, and each has its own sub-culture, appropriate topics, and community.  However, people typically participate in many channels.  As you get to know someone in one channel, you'll often hear more about the work happening in another.  Slowly I got invited into other channels and met more and more people across the Mozilla ecosystem.

Doing so took me places I hadn't anticipated.  For example, at some point I started chatting with people in #thunderbird, which led to me becoming an active contributor--I remember 'dascher' just started assigning me bugs to fix!  Another time I discovered the #static channel and a guy named 'taras' who was building crazy static analysis tools with gcc.  Without irc I can confidently say that I would have never started DXR, or worked on web audio, WebGL, all kinds of Firefox patches, or many of the other things I did.  I needed to be part of a community of peers and mentors for this work to be possible.

At a certain point I went from joining other channels to creating my own.  I started to build many communities within Mozilla to support new developers.  It was incredible to watch them fill up with a mix of experienced Mozilla contributors and people completely new to the project.  Over the years it helped to shape my approach to getting students involved in open source through direct participation.

/list

In some ways, IRC was short for "I Really Can do this."  On my own?  No.  No way. But with the support of a community that wasn't going to abandon me, who would answer my questions, spend long hours helping me debug things, or introduce me to people who might be able to unlock my progress, I was able to get all kinds of new things done.  People like shaver, ted, gavin, beltzner, vlad, jorendorff, reed, preed, bz, stuart, Standard8, Gijs, bsmedberg, rhelmer, dmose, myk, Sid, Pomax, and a hundred other friends and colleagues.

The kind of help you get on irc isn't perfect.  I can remember many times asking a question, and having bsmedberg give a reply, which would take me the rest of the day (or week!) to unpack and fully understand.  You got hints.  You got clues.  You were (sometimes) pointed in the right direction.  But no one was going to hold your hand the whole way.  You were at once surrounded by people who knew, and also completely on your own.  It still required a lot of personal research.  Everyone was also struggling with their own pieces of the puzzle, and it was key to know how much to ask, and how much to do on your own.

/query

Probably the most rewarding part of irc were the private messages.  Out of the blue, someone would ping you, sometimes in channel (or a new channel), but often just to you personally.  I developed many amazing friendships this way, some of them with people I've never met outside of a text window.

When I was working on the Firefox Audio Data API, I spent many weeks fighting with the DOM implementation.  There were quite a few people who knew this code, but their knowledge of it was too far beyond me, and I needed to work my way up to a place where we could discuss things.  I was very much on my own, and it was hard work.

One day I got a ping from someone calling themselves 'notmasteryet'.  I'd been blogging about my work, and linked to my patches, and 'notmasteryet' had started working on them.  You can't imagine the feeling of having someone on the internet randomly find you and say, "I think I figured out this tricky bit you've been struggling to make work."  That's exactly what happened, and we went on to spend many amazing weeks and months working on this together, sharing this quiet corner of Mozilla's irc server, moving at our own pace.

I hesitated to tell a story like this because there is no way to do justice to the many relationships I formed during the next decade.  I can't tell you all the amazing stories.  At one time or another, I got to work with just about everyone in Mozilla, and many became friends.  IRC allowed me to become a part of Mozilla in ways that would have been impossible just reading blogs, mailing lists, or bugzilla.  To build relationships, one needs long periods of time together.  It happens slowly.

/part

But then, at a certain point, I stopped completely.  It's maybe been four or five years since I last used irc.  There are lots of reasons for it.  Partly it was due to things mhoye discussed in his blog post (I can confirm that harassment is real on irc). But also Mozilla had changed, and many of my friends and colleagues had moved on.  IRC, and the Mozilla that populated it, is part of the past.

Around the same time I was leaving IRC, Slack was just starting to take off.  Since then, Slack has come to dominate the space once occupied by tools like irc.  As I write this, Slack is in the process of doing its IPO, with an impressive $400M in revenue last year.  Slack is popular.

When I gave up irc, I really didn't want to start in on another version of the same thing.  I've used it a lot out of necessity, and even in my open source classes as a way to expose my students to it, so they'll know how it works.  But I've never really found it compelling.  Slack is a better irc, there's no doubt.  But it's also not what I loved about irc.mozilla.org.

Mike writes that he's in the process of evaluating possible replacements for irc within Mozilla.  I think it's great that he and Mozilla are wrestling with this.  I wish more open source projects would do it, too.  Having a way to get deeply engaged with a community is important, especially one as large as Mozilla.

Whatever product or tool gets chosen, it needs to allow people to join without being invited.  Tools like Slack do a great job with authentication and managing identity.  But to achieve it they rely on gatekeeping.  I wasn't the typical person who used irc.mozilla.org when I started; but by using it for a long time, I made it a different place.  It's really important that any tool like this does more than just support the in-groups (e.g., employees, core contributors, etc).  It's also really important that any tool like this does better than create out-groups.

/quit

IRC was a critical part of my beginnings in open source.  I loved it.  I still miss many of the friends I used to talk to daily.  I miss having people ping me.  As I work with my open source students, I think a lot about what I'd do if I was starting today.  It's not possible to follow the same path I took.  The conclusion I've come to is that the only way to get started is to focus on connecting with people.  In the end, the tools don't matter, they change.  But the people matter a lot, and we should put all of our effort into building relationships with them.  


Robert O'Callahan: Update To rr Master To Debug Firefox Trunk

Mozilla planet - za, 27/04/2019 - 02:42

A few days ago Firefox started using LMDB (via rkv) to store some startup info. LMDB relies on file descriptor I/O being coherent with memory-maps in a way that rr didn't support, so people have had trouble debugging Firefox in rr, and Pernosco's CI test failure reproducer also broke. We have checked in a fix to rr master and are in the process of updating the Pernosco pipeline.

The issue is that LMDB opens a file, maps it into memory MAP_SHARED, and then opens the file again and writes to it through the new file descriptor, and requires that the written data be immediately reflected in the shared memory mapping. (This behavior is not guaranteed by POSIX but is guaranteed by Linux.) rr needs to observe these writes and record the necessary memory changes, otherwise they won't happen during replay (because writes to files don't happen during replay) and replay will fail. rr already handled the case where the application writes to the file descriptor (technically, the file description) that was used to map the file — Chromium has needed this for a while.

The LMDB case is harder to handle. To fix LMDB, whenever the application opens a file for writing, we have to check whether any shared mapping of that file exists and, if so, mark that file description so writes to it have their shared-memory effects recorded. Unfortunately this adds overhead to writable file opens, but hopefully it doesn't matter much since in many workloads most file opens are read-only. (If it turns out to be a problem there are ways we can optimize further.) While fixing this, we also added support for the case where the application opens a file (possibly multiple times with different file descriptions) and then creates a shared mapping of one of them. To handle that, when creating a shared mapping we have to scan all open files to see if any of them refer to the mapped file, and if so, mark them so the effects of their writes are recorded.
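To make the LMDB pattern concrete, here is a stripped-down sketch of the behavior rr now has to record: a file is mapped MAP_SHARED through one file description and written through a second, independent one, and on Linux the write is immediately visible through the mapping. (The path is made up, the file is assumed to already exist and be at least a page long, and error handling is omitted.)

#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  // First open: map the file MAP_SHARED, the way LMDB maps its database.
  int map_fd = open("data.mdb", O_RDONLY);
  void* map = mmap(nullptr, 4096, PROT_READ, MAP_SHARED, map_fd, 0);

  // Second, independent open of the same file; writes go through this one.
  int write_fd = open("data.mdb", O_WRONLY);
  const char payload[] = "hello";
  pwrite(write_fd, payload, sizeof(payload), 0);

  // On Linux the written bytes show up in the shared mapping right away,
  // which is exactly what rr must notice and record for replay to work.
  assert(memcmp(map, payload, sizeof(payload)) == 0);

  munmap(map, 4096);
  close(write_fd);
  close(map_fd);
}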

Update: Actually, at least this commit is required.


Chris H-C: Firefox Origin Telemetry: Putting Prio in Practice

Mozilla planet - vr, 26/04/2019 - 21:32

Prio is neat. It allows us to learn counts of things that happen across the Firefox population without ever being able to learn which Firefox sent us which pieces of information.

For example, Content Blocking will soon be using this to count how often different trackers are blocked and exempted from blocking, so we can more quickly roll out our Enhanced Tracking Protection to our users to protect them from companies who want to track their activities across the Web.

To get from “Prio is neat” to “Content Blocking is using it” required a lot of effort and the design and implementation of a system I called Firefox Origin Telemetry.

Prio on its own has some very rough edges. It can only operate on a list of at most 2046 yes or no questions (a bit vector). It needs to know cryptographic keys from the servers that will be doing the sums and decryption. It needs to know what a “Batch ID” is. And it needs something to reliably and reasonably-frequently send the data once it has been encoded.

So how can we turn “tracker fb.com was blocked” into a bit in a bit vector into an encoded prio buffer into a network payload…

Firefox Origin Telemetry has two lists: a list of “origins” and a list of “metrics”. The list of origins is a list of where things happen. Did you block fb.com or google.com? Each of those trackers are “origins”. The list of metrics is a list of what happened. Did you block fb.com or did you have to exempt it from blocking because otherwise the site broke? Both “blocked” and “exempt” are “metrics”.

In this way Content Blocking can, whenever fb.com is blocked, call

Telemetry::RecordOrigin(OriginMetricID::ContentBlocking_Blocked, "fb.com");

And Firefox Origin Telemetry will take it from there.

Step 0 is in-memory storage. Firefox Origin Telemetry stores tables mapping from encoding id (ContentBlocking_Blocked) to tables of origins mapped to counts (“fb.com”: 1). If there’s any data in Firefox Origin Telemetry, you can view it in about:telemetry and it might look something like this:

(Screenshot: the Origin Telemetry section of about:telemetry)
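In type terms, that in-memory storage is roughly a map from encoding ID to a map from origin to count. A minimal sketch follows; the names only mirror the real Gecko ones and will differ in detail.

#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical mirror of Firefox Origin Telemetry's in-memory tables:
// encoding ID -> (origin -> count).
enum class OriginMetricID { ContentBlocking_Blocked, ContentBlocking_Exempt };

using OriginCounts = std::unordered_map<std::string, uint32_t>;
using OriginStorage = std::map<OriginMetricID, OriginCounts>;

void RecordOrigin(OriginStorage& storage, OriginMetricID id, const std::string& origin) {
  ++storage[id][origin];  // after one blocked load of fb.com: {"fb.com": 1}
}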

Step 1 is App Encoding: turning “ContentBlocking_Blocked: {“fb.com”: 1}” into “bit twelve on shard 2 should be set to 1 for encoding ‘content-blocking-blocked’ ”

The full list of origins is too long to talk to Prio. So Firefox Origin Telemetry splits the list into 2046-element “shards”. The order of the origins list and the split locations for the shards must be stable and known ahead of time. When we change it in the future (either because Prio can start accepting larger or smaller buffers, or when the list of origins changes) we will have to change the name of the encoding from ‘content-blocking-blocked’ to maybe ‘content-blocking-blocked-v2’.
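A minimal sketch of that app-encoding step, assuming a pre-agreed ordering of the origins list (the table and names here are illustrative, not Gecko's):

#include <cstddef>
#include <string>
#include <unordered_map>

// Prio can only sum bit vectors of at most 2046 entries, so the stable,
// pre-agreed origins list is cut into shards of that size.
constexpr size_t kShardBits = 2046;

struct OriginLocation {
  size_t shard;  // which 2046-bit shard the origin falls into
  size_t bit;    // which bit within that shard
};

// `originIndex` stands in for the stable ordering of the real origins list,
// which Firefox and the servers must agree on ahead of time.
OriginLocation LocateOrigin(const std::unordered_map<std::string, size_t>& originIndex,
                            const std::string& origin) {
  const size_t index = originIndex.at(origin);
  // e.g. index 4104 -> shard 2, bit 12, matching the "bit twelve on shard 2" example.
  return OriginLocation{index / kShardBits, index % kShardBits};
}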

Step 2 is Prio Encoding: Firefox Origin Telemetry generates batch IDs of the encoding name suffixed with the shard number: for our example the batch ID is “content-blocking-blocked-1”. The server keys are communicated by Firefox Preferences (you can see them in about:config). With those pieces and the bit vector shards themselves, Prio has everything it needs to generate opaque binary blobs about 50 kilobytes in size.

Yeah, 2kb of data in a 50kb package. Not a small increase.

Step 3 is Base64 Encoding where we turn those 50kb binary blobs into 67kb strings of the letters a-z and A-Z, the numbers 0-9, and the symbols “+” or “/”. This is so we can send it in a normal Telemetry ping.
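Two surface details of steps 2 and 3 can be sketched directly (the Prio encoding itself is a library call and is elided here): the batch ID is just the encoding name with the shard number appended, and base64 grows the payload by the usual 4/3 ratio, which is how roughly 50 kB of Prio output becomes roughly 67 kB of text.

#include <cstddef>
#include <string>

// Batch ID for an encoding name and shard, e.g. "content-blocking-blocked-1".
std::string BatchId(const std::string& encodingName, size_t shard) {
  return encodingName + "-" + std::to_string(shard);
}

// Size of the base64 encoding of n bytes: 4 output characters per 3 input
// bytes, rounded up to a whole quantum.
constexpr size_t Base64Length(size_t n) {
  return 4 * ((n + 2) / 3);
}

static_assert(Base64Length(50 * 1024) == 68268, "~50 KB of binary is ~67 KB of base64");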

Step 4 is the “prio” ping. Once a day or when Firefox shuts down we need to send a ping containing these pairs of batch ids and base64-encoded strings plus a minimum amount of environmental data (Firefox version, current date, etc.), if there’s data to be sent. In the event that sending fails, we need to retry (TelemetrySend). After sending the ping should be available to be inspected for a period of time (TelemetryArchive).

…basically, this is where Telemetry does what Telemetry does best.

And then the ping becomes the problem of the servers who need to count and verify and sum and decode and… stuff. I dunno, I’m a Firefox Telemetry Engineer, not a Data Engineer. :amiyaguchi’s doing that part, not me : )

I’ve smoothed over some details here, but I hope I’ve given you an idea of what value Firefox Origin Telemetry brings to Firefox’s data collection systems. It makes Prio usable for callers like Content Blocking and establishes systems for managing the keys and batch IDs necessary for decoding on the server side (Prio will generate int vector shards for us, but how will we know which position of which shard maps back to which origin and which metric?).

Firefox Origin Telemetry is shipping in Firefox 68 and is currently only enabled for Firefox Nightly and Beta. Content Blocking is targeting Firefox 69 to start using Origin Telemetry to measure tracker blocking and exempting for 0.014% of pageloads of 1% of clients.

:chutten


Mike Hoye: Synchronous Text

Mozilla planet - vr, 26/04/2019 - 19:44

Envoy.

Let’s lead with the punchline: the question of what comes after IRC, for Mozilla, is now on my desk.

I wasn’t in the room when IRC.mozilla.org was stood up, but from what I’ve heard IRC wasn’t “chosen” so much as it was the obvious default, the only tool available in the late ’90s. Suffice to say that as a globally distributed organization, Mozilla has relied on IRC as our main synchronous communications tool since the beginning. For much of that time it’s served us well, if for some less-than-ideal values of “us” and “well”.

Like a lot of the early internet IRC is a quasi-standard protocol built with far more of the optimism of the time than the paranoia the infosec community now refers to as “common sense”, born before we learned how much easier it is to automate bad acts than it is to foster healthy communities. Like all unauthenticated systems on the modern net it’s aging badly and showing no signs of getting better.

While we still use it heavily, IRC is an ongoing source of abuse and harassment for many of our colleagues and getting connected to this now-obscure forum is an unnecessary technical barrier for anyone finding their way to Mozilla via the web. Available interfaces really haven’t kept up with modern expectations, spambots and harassment are endemic to the platform, and in light of that it’s no coincidence that people trying to get in touch with us from inside schools, colleges or corporate networks are finding that often as not IRC traffic isn’t allowed past institutional firewalls at all.

All of that adds up to a set of real hazards and unnecessary barriers to participation in the Mozilla project; we definitely still need a globally-available, synchronous and text-first communication tool; our commitment to working in the open as an organization hasn’t changed. But we’re setting a higher bar for ourselves and our communities now and IRC can’t meet that bar. We’ve come to the conclusion that for all IRC’s utility, it’s irresponsible of us to ask our people – employees, volunteers, partners or anyone else – to work in an environment that we can’t make sure is healthy, safe and productive.

In short, it’s no longer practical or responsible for us to keep that forum alive.

In the next small number of months, Mozilla intends to deprecate IRC as our primary synchronous-text communications platform, stand up a replacement and decommission irc.mozilla.org soon afterwards. I’m charged with leading that process on behalf of the organization.

Very soon, I’ll be setting up the evaluation process for a couple of candidate replacement stacks. We’re lucky; we’re spoiled for good options these days. I’ll talk a bit more about them in a future post, but the broad strokes of our requirements are pretty straightforward:

  • We are not rolling our own. Whether we host it ourselves or pay for a service, we’re getting something off the shelf that best meets our needs.
  • It needs to be accessible to the greater Mozilla community.
  • We are evaluating products, not protocols.
  • We aren’t picking an outlier; whatever stack we choose needs to be a modern, proven service that seems to have a solid provenance and a good life ahead of it. We’re not moving from one idiosyncratic outlier stack to another idiosyncratic outlier stack.
  • While we’re investigating options for semi-anonymous or pseudonymous connections, we will require authentication, because:
  • The Mozilla Community Participation Guidelines will apply, and they’ll be enforced.

I found this at the top of a draft FAQ I’d started putting together a while back. It might not be what you’d call “complete”, but maybe it is:

Q: Why are we moving away from IRC? IRC is fine!
A: IRC is not fine.

Q: Seriously? You’re kidding, right?
A: I’m dead serious.

I don’t do blog comments anymore – unfortunately, for a lot of the same reasons I’m dealing with this – but if you’ve got questions, you can email me.

Or, if you like, you can find me on IRC.


Christopher Arnold: An Author-Optimized Social Network Approach

Mozilla planet - vr, 26/04/2019 - 19:13
(Scientific American art credit: jaybendt.com)

In this month's edition of Scientific American magazine, Wade Roush comments on social networks' potential deleterious impact on emotional well-being (Scientific American, May 2019: "Turning Off the Emotion Pump"). He prompts, "Are there better social technologies than Facebook?" and cites previous attempts such as the now-defunct Path and the still-struggling Diaspora as potentially promising developments. I don't wish to detract from the contemporary concerns about notification overload and privacy leaks. But I'd like to highlight the positive side of social platforms for spurring creative collaboration, and suggest an approach that could expand the positive impacts they facilitate in the future. I think the answer to his question is: more diversity of platforms, and better utilities, are needed.


In our current era, everyone is a participant, in some way, in the authorship of the web. That's a profound and positive thing. We are all enfranchised in a way that previously most were not.  As an advocate for the power of the internet for advancing creative expression, I believe the benefits we've gained by this online enfranchisement should not be overshadowed by aforementioned bumps along the road.  We need more advancement, perhaps in a different way than has been achieved in most mainstream social platforms to date.  Perhaps it is just the utilization that needs to shift, more than the tools themselves. But as a product-focused person, I think some design factors could shape this change we'd need to see to have social networks be a positive force in everybody's lives. 


When Facebook turned away from "the Facebook Wall", its earliest iteration, I was fascinated by this innovation.  It was no longer a bunch of different profile destinations interlinked by notifications of what people said about each other. It became an atomized webpage that looked different to everyone who saw it, depending on the quality of contributions of the linked users.  The outcome was a mixed bag because the range of experiences of each visitor were so different. Some people saw amazing things, from active creators/contributors they'd linked to.  Some people saw the boredom of a stagnant or overly-narrow pool of peer contributors reflected back to them. Whatever your opinion of the content of Facebook, Twitter and Reddit, as subscription services they provide tremendous utility in today's web.  They are far superior to the web-rings and Open Directory Project of the 1990s, as they are reader-driven rather than author/editor driven. 


The experimental approach I'm going to suggest for advancement of next-generation social networks should probably happen outside the established platforms. For when experimentation is done within these services it can jeopardize the perceived user control and trust that attracted their users in the first place.   


In a brainstorm with an entrepreneur named Lisa, she pointed out that the most engaging and involved collaborative discussions she'd seen had taken place in Ravelry and Second Life. Knitting and creating 3D art take an amazing amount of time. She posited that it may be this invested time that leads to the quality of the personal interactions that happen on such platforms. It may actually be the casualness of engagement on conventional public forums that makes those interactions more haphazard, impersonal and less constructive or considerate. Our brainstorm turned to how more such platforms might emerge to spur ever greater realization of new authorship, artistry and collaboration. We focused not on volume of people or velocity of engagement, but rather on the greatest individual contribution.


The focus (raison d'être) of a platform tends to skew the nature of the behaviors on it and can hamper or facilitate the individual creation or art represented, depending on the constraints of the platform interface. (For instance, Blogger, Wordpress and Medium are great for long-form essays. Twitter, Instagram and Reddit excel as forums for sharing observations about other works or references.) If one were to frame a platform objective around the maximum volume of individual contribution or artistry and less around the interactions, you'd get a different kind of network. And across a network of networks, it would be possible to observe which components of a platform contribute best to the unfettered artistry of the individual contributors among them.


I am going to refer to this platform concept as "Mikoshi", because it reminds me of the Japanese portable shrines of the same name. In festival parades, dozens of people heft a one-ton shrine atop their shoulders. The bobbing of the shrine is supposed to bring good luck to the participants and onlookers. The time I participated in a mikoshi parade, I found it to be an exhausting effort, fun as it was. The thing that stuck out to me was that the whole group was focused toward one end. There were no detractors.


Metaphorically, I see the mikoshi act of revelry as somewhat similar to the collaborative creative artistry sharing that Lisa was pointing out. In Lisa's example, there was a barrier to entry and a shared intent in the group. You had to be a knitter or a 3D artist to have a seat at the table. Why would hurdles create the improved quality of engagement and discourse? Presumably, if you're at that table you want to see others succeed and create more! There is a certain amount of credibility and respect the community gives contributors based on the table-stakes of participation that got them there. This is the same with most other effort-intensive sharing platforms, like Mixcloud and Soundcloud, where I contribute. The work of others inspires us to increase our level of commitment and quality as well. The shared direction, the furtherance of art, propels ever more art by all participants. It improves in a virtuous cycle, driving greater complexity, quality and retention over time.


To achieve a pure utility of greatest contributor creation would be a different process from creating a tool optimized purely for volume or velocity of engagement. Lisa and I posited an evolving, biological style of product "mutation" that might create a proliferating organic process, driven by participant contribution and automated selection of attributes observed across the healthiest offshoot networks. Maximum individual authorship should be the leading selective pressure for Mikoshi to work. This is not to say that essays are better than aphorisms because of their length. But the goal to be incentivized by a creativity-inspiring ecosystem should be one where the individuals participating feel empowered to create to the maximum extent. There are other tools designed for optimizing velocity and visibility, but those elements could be detrimental to individual participation or group dynamics.


To give over control to contribution-driven optimization as an end, Mikoshi would need to be a modular system akin to Automattic's Wordpress platform. But platform mutation would have to be achieved agnostic of author self-promotion: the optimizing mutation of Mikoshi would need to be outside the influence of content creators' drive for self-promotion. This is similar to the way that "Pagerank" listened to the interlinking of non-affiliated web publishers to drive its anti-spam filter, rather than to the publishers' own attempts to promote themselves. Visibility and promulgation of new Mikoshi offshoots should be delegated to a different, promotion-agnostic algorithm entirely, one looking at the health of a community of active authors in preceding Mikoshi groups. Evolutionary adaptation is driven by what ends up dying. But Mikoshi would be driven by what previously thrived.


I don't think Mikoshi should be a single tool, but an approach to building many different web properties. It's centered around planned redundancy and planned end-of-life for non-productive forks of Mikoshi. Any single Mikoshi offshoot could exist indefinitely. But ideally, certain of them would thrive and attract greater engagement and offshoots.

The successive alterations of Mikoshi would be enabled by its capability to fork, like open source projects such as Linux or Gecko do.  As successive deployments are customized and distributed, the most useful elements of the underlying architecture can be notated with telemetry to suggest optimizations to other Mikoshi forks that may not have certain specific tools.  This quasi-organic process, with feedback on the overall contribution "health" of the ecosystem represented by participant contribution, could then suggest attributes for viable offshoot networks to come.  (I'm framing this akin to a browser's extensions, or a Wordpress template's themes and plugins which offer certain optional expansions to pages using past templates of other developers.)  The end products of Mikoshi are multitudinous and not constrained.  Similar to Wordpress, attributes to be included in any future iteration are at the discretion of the communities maintaining them.


Of course Facebook and Reddit could facilitate this. Yet "roll your own platform" doesn't particularly fit their business models. Mozilla manages several purpose-built social networks for its communities (Bugzilla and Mozillians internally, and the former Webmaker and the new Hubs for web enthusiasts), but Mikoshi doesn't particularly fit their mission or business model either. I believe Automattic is better positioned to go after this opportunity, as it already powers 1/3 of global websites and has competencies in massively-scaled hosting of web pages with social components.


I know from my own personal explorations on dozens of web publishing and media platforms that they have each, in different ways, facilitated and drawn out different aspects of my own creativity. I've seen many of these platforms die off. It wasn't that those old platforms didn't have great utility or value to their users. Most of them were just not designed to evolve. They were essentially too rigid, or encountered political problems within the organizations that hosted them. As the old Ani Difranco song "Buildings and Bridges" points out, "What doesn't bend breaks." (Caution: the lyrics contain some potentially objectionable language.) The web of tomorrow may need a new manner of collaborative social network that is able to weather the internal and external pressures that threaten it. Designing an adaptive platform like Mikoshi may accomplish this.
Categorieën: Mozilla-nl planet

Cameron Kaiser: Another interesting TenFourFox downstream

Mozilla planet - vr, 26/04/2019 - 09:49
Because we're one of the few older forks of Firefox to still backport security updates, TenFourFox code turns up in surprising places sometimes. I've known about roytam's various Pale Moon and Mozilla builds; the patches are used in both the rebuilds of Pale Moon 27 and 28 and his own fork of 45ESR. Arctic Fox, which is a Pale Moon 27 (descended from Firefox 38, with patches) rebuild for Snow Leopard and PowerPC Linux, also uses TenFourFox security patches as well as some of our OS X platform code.

Recently I was also informed of a new place TenFourFox code has turned up: OS/2. There's no Rust for OS/2, so they're in the same boat as PowerPC OS X, and it doesn't look like 52ESR was ever successfully ported to OS/2 either; indeed, the last "official" Firefox I can find from Bitwise is 45.9. Dave Yeo took that version (as well as Thunderbird 45.9 and SeaMonkey 2.42.9) and backported our accumulated security patches along with other fixes to yield updated "SUa1" Firefox, Thunderbird and SeaMonkey builds for OS/2. If you're curious, here are the prerequisites.

Frankly, I'm glad that we can give back to other orphaned platforms, and while I'm definitely not slow to ding Mozilla for eroding cross-platform support, they've still been the friendliest to portability even considering recent lapses. Even though we're not current on Firefox anymore other than the features I rewrite for TenFourFox, we're still part of the family and it's nice to see our work keeping other systems and niche userbases running.

An update for FPR14 final, which is still scheduled for mid-May, is a new localization for Simplified Chinese from a new contributor. Thanks, paizhang! Updated language packs will be made available with FPR14 for all languages except Japanese, which is still maintained separately.

Categorieën: Mozilla-nl planet

Niko Matsakis: AiC: Language-design team meta working group

Mozilla planet - vr, 26/04/2019 - 06:00

On internals, I just announced the formation of the language-design team meta working group. The role of the meta working group is to figure out how other language-design team working groups should work. The plan is to begin by enumerating some of our goals – the problems we aim to solve, the good things we aim to keep – and then move on to draw up more detailed plans. I expect this discussion will intersect the RFC process quite heavily (at least when it comes to language design changes). Should be interesting! It’s all happening in the open, and a major goal of mine is for this to be easy to follow along with from the outside – so if talking about talking is your thing, you should check it out.

Categorieën: Mozilla-nl planet

The Mozilla Blog: Firefox and Emerging Markets Leadership

Mozilla planet - do, 25/04/2019 - 22:05

Building on the success of Firefox Quantum, we have a renewed focus on better enabling people to take control of their internet-connected lives as their trusted personal agent — through continued evolution of the browser and web platform — and with new products and services that provide enhanced security, privacy and user agency across connected life.

To accelerate this work, we’re announcing some changes to our senior leadership team:

Dave Camp has been appointed SVP Firefox. In this new role, Dave will be responsible for overall Firefox product and web platform development.

As a long time Mozillian, Dave joined Mozilla in 2006 to work on Gecko, building networking and security features and was a contributor to the release of Firefox 3. After a short stint at a startup he rejoined Mozilla in 2011 as part of the Firefox Developer Tools team. Dave has since served in a variety of senior leadership roles within the Firefox product organization, most recently leading the Firefox engineering team through the launch of Firefox Quantum.

Under Dave’s leadership the new Firefox organization will pull together all product management, engineering, technology and operations in support of our Firefox products, services and web platform. As part of this change, we are also announcing the promotion of Marissa (Reese) Wood to VP Firefox Product Management, and Joe Hildebrand to VP Firefox Engineering. Both Joe and Reese have been key drivers of the continued development of our core browser across platforms, and the expansion of the Firefox portfolio of products and services globally.

In addition, we are increasing our investment and focus in emerging markets. Building on the early success of products like Firefox Lite, which we launched in India earlier this year, we are also formally establishing an emerging markets team based in Taipei:

Stan Leong appointed as VP and General Manager, Emerging Markets. In this new role, Stan will be responsible for our product development and go-to-market strategy for the region. Stan joins us from DCX Technology where he was Global Head of Emerging Product Engineering. He has a great combination of start-up and large company experience having spent years at Hewlett Packard, and he has worked extensively in the Asian markets.

As part of this, Mark Mayo, who has served as our Chief Product Officer (CPO), will move into a new role focused on strategic product development initiatives with an initial emphasis on accelerating our emerging markets strategy. We will be conducting an executive search for a CPO to lead the ongoing development and evolution of our global product portfolio.

I’m confident that with these changes, we are well positioned to continue the evolution of the browser and web platform and introduce new products and services that provide enhanced security, privacy and user agency across connected life.

The post Firefox and Emerging Markets Leadership appeared first on The Mozilla Blog.

Categorieën: Mozilla-nl planet

Nathan Froyd: an unexpected benefit of standardizing on clang-cl

Mozilla planet - do, 25/04/2019 - 18:45

I wrote several months ago about our impending decision to switch to clang-cl on Windows.  In the intervening months, we did that, and we also dropped MSVC as a supported compiler.  (We still build on Linux with GCC, and will probably continue to do that for some time.)  One (extremely welcome) consequence of the switch to clang-cl has only become clear to me in the past couple of weeks: using assembly language across platforms is no longer painful.

First, a little bit of background: GCC (and Clang) support a feature called inline assembly, which enables you to write little snippets of assembly code directly in your C/C++ program.  The syntax is baroque, it’s incredibly easy to shoot yourself in the foot with it, and it’s incredibly useful for a variety of low-level things.  MSVC supports inline assembly as well, but only on x86, and with a completely different syntax than GCC.
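To make the contrast concrete, here is a minimal sketch of GCC/Clang extended inline assembly (the function is a trivial x86-64 add that I made up for illustration; it is not code from mozilla-central):

#include <cstdint>

// Illustrative only: add two integers via GCC/Clang extended inline asm
// (x86-64, AT&T syntax). MSVC's __asm blocks use a completely different
// syntax and are only available when targeting 32-bit x86.
inline uint64_t add_via_asm(uint64_t a, uint64_t b) {
  uint64_t result = a;
  asm("addq %1, %0"    // result += b
      : "+r"(result)   // "+r": result is both read and written, in a register
      : "r"(b)         // "r": b may live in any register
      : "cc");         // the addition clobbers the condition codes
  return result;
}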

OK, so maybe you want to put your code in a separate assembly file instead.  The complementary assembler for GCC (courtesy of binutils) is called gas, with its own specific syntax for various low-level details.  If you give gcc an assembly file, it knows to pass it directly to gas, and will even run the C preprocessor on the assembly before invoking gas if you request that.  So you only ever need to invoke gcc to compile everything, and the right thing will just happen. MSVC, by contrast, requires you to invoke a separate, differently-named assembler for each architecture, with different assembly language syntaxes (e.g. directives for the x86-64 assembler are quite different than the arm64 assembler), and preprocessing files beforehand requires you to jump through hoops.  (To be fair, a number of these details are handled for you if you’re building from inside Visual Studio; the differences are only annoying to handle in cross-platform build systems.)

In short, dealing with assembler in a world where you have to support MSVC is somewhat painful.  You have to copy-and-paste code, or maybe you write Perl scripts to translate from the gas syntax to whatever flavor of syntax the Microsoft assembler you’re using expects.  Your build system needs to handle Windows and non-Windows differently for assembly files, and may even need to handle different architectures for Windows differently.  Things like our ICU data generation have been made somewhat more complex than necessary to support Windows platforms.

Enter clang-cl.  Since clang-cl is just clang under the hood, it handles being passed assembly files on the command line in the same way and will even preprocess them for you.  Additionally, clang-cl contains a gas-compatible assembly syntax parser, so assembly files that you pass on the command line are parsed by clang-cl and therefore you can now write a single assembly syntax that works on Unix-y and Windows platforms.  (You do, of course, have to handle differing platform calling conventions and the like, but that’s simplified by having a preprocessor available all the time.)  Finally, clang-cl supports GCC-style inline assembly, so you don’t even have to drop into separate assembly files if you don’t want to.
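As a sketch of what that enables, a single preprocessed assembly file in gas syntax can now serve both kinds of toolchain, with the C preprocessor papering over the calling-convention difference (the file and symbol names here are invented, and details like symbol decoration on some targets are glossed over):

/* add.S -- hypothetical example; x86-64 only, gas (AT&T) syntax. */
#ifdef _WIN64
  #define ARG0 %rcx   /* Windows x64 calling convention */
  #define ARG1 %rdx
#else
  #define ARG0 %rdi   /* System V AMD64 calling convention */
  #define ARG1 %rsi
#endif

  .text
  .globl my_add
my_add:
  mov ARG0, %rax
  add ARG1, %rax
  ret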

In short, clang-cl solves every problem that made assembly usage painful on Windows. Might we have a future world where open source projects that have to deal with any amount of assembly standardize on clang-cl for their Windows support, and declare MSVC unsupported?

Categorieën: Mozilla-nl planet

Henri Sivonen: It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

Mozilla planet - do, 25/04/2019 - 15:30

Henri Sivonen, 2019-04-24

Disclosure: I work for Mozilla, and my professional activity includes being the Gecko module owner for character encodings.

Disclaimer: Even though this document links to code and documents written as part of my Mozilla activities, this document is written in a personal capacity.

Summary

Text processing facilities in the C++ standard library have been mostly agnostic of the actual character encoding of text. The few operations that are sensitive to the actual character encoding are defined to behave according to the implementation-defined “narrow execution encoding” (for buffers of char) and the implementation-defined “wide execution encoding” (for buffers of wchar_t).

Meanwhile, over the last two decades, a different dominant design has arisen for text processing in other programming languages as well as in C and C++ usage despite what the C and C++ standard-library facilities provide: Representing text as Unicode, and only Unicode, internally in the application even if some other representation is required externally for backward compatibility.

I think the C++ standard should adopt the approach of “Unicode-only internally” for new text processing facilities and should not support non-Unicode execution encodings in newly-introduced features. This allows new features to have less abstraction obfuscation for Unicode usage, avoids digging legacy applications deeper into non-Unicode commitment, and avoids the specification and implementation effort of adapting new features to make sense for non-Unicode execution encodings.

Concretely, I suggest:

  • In new features, do not support numbers other than Unicode scalar values as a numbering scheme for abstract characters, and design new APIs to be aware of Unicode scalar values as appropriate instead of allowing other numbering schemes. (I.e. make Unicode the only coded character set supported for new features.)
  • Use char32_t directly as the concrete type for an individual Unicode scalar value without allowing for parametrization of the type that conceptually represents a Unicode scalar value. (For sequences of Unicode scalar values, UTF-8 is preferred.)
  • When introducing new text processing facilities (other than the next item on this list), support only UTF in-memory text representations: UTF-8 and, potentially, depending on feature, also UTF-16 or also UTF-16 and UTF-32. That is, do not seek to make new text processing features applicable to non-UTF execution encodings. (This document should not be taken as a request to add features for UTF-16 or UTF-32 beyond iteration over string views by scalar value. To avoid distraction from the main point, this document should also not be taken as advocating against providing any particular feature for UTF-16 or UTF-32.)
  • Non-UTF character encodings may be supported in a conversion API whose purpose is to convert from a legacy encoding into a UTF-only representation near the IO boundary or at the boundary between a legacy part (that relies on execution encoding) and a new part (that uses Unicode) of an application. Such APIs should be std::span-based instead of iterator-based.
  • When an operation logically requires a valid sequence of Unicode scalar values, the API must either define the operation to fail upon encountering invalid UTF-8/16/32 or must replace each error with a U+FFFD REPLACEMENT CHARACTER as follows: What constitutes a single error in UTF-8 is defined in the WHATWG Encoding Standard (which matches the “best practice” from the Unicode Standard). In UTF-16, each unpaired surrogate is an error. In UTF-32, each code unit whose numeric value isn’t a valid Unicode scalar value is an error.
  • Instead of standardizing Text_view as proposed, standardize a way to obtain a Unicode scalar value iterator from std::u8string_view, std::u16string_view, and std::u32string_view.
Context

This write-up is in response to (and in disagreement with) the “Character Types” section in the P0244R2 Text_view paper:

This library defines a character class template parameterized by character set type used to represent character values. The purpose of this class template is to make explicit the association of a code point value and a character set.

It has been suggested that char32_t be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings. Non-Unicode encodings, including the encodings used for ordinary and wide string literals, would still require a distinct character type (such as a specialization of the character class template) so that the correct character set can be inferred from objects of the character type.

This suggestion raises concerns for the author. To a certain degree, it can be accommodated by removing the current members of the character class template in favor of free functions and type trait templates. However, it results in ambiguities when enumerating the elements of a UTF-32 string literal; are the elements code point or character values? Well, the answer would be both (and code unit values as well). This raises the potential for inadvertently writing (generic) code that confuses code points and characters, runs as expected for UTF-32 encodings, but fails to compile for other encodings. The author would prefer to enforce correct code via the type system and is unaware of any particular benefits that the ability to treat UTF-32 string literals as sequences of character type would bring.

It has also been suggested that char32_t might suffice as the only character type; that decoding of any encoded string include implicit transcoding to Unicode code points. The author believes that this suggestion is not feasible for several reasons:

  1. Some encodings use character sets that define characters such that round trip transcoding to Unicode and back fails to preserve the original code point value. For example, Shift-JIS (Microsoft code page 932) defines duplicate code points for the same character for compatibility with IBM and NEC character set extensions.
    https://support.microsoft.com/en-us/kb/170559 [sic; dead link]
  2. Transcoding to Unicode for all non-Unicode encodings would carry non-negligible performance costs and would pessimize platforms such as IBM’s z/OS that use EBCIDC by default for the non-Unicode execution character sets.

To summarize, it raises three concerns:

  1. Ambiguity between code units and scalar values (the paper says “code points”, but I say “scalar values” to emphasize the exclusion of surrogates) in the UTF-32 case.
  2. Some encodings, particularly Microsoft code page 932, can represent one Unicode scalar value in more than one way, so the distinction of which way does not round-trip.
  3. Transcoding non-Unicode execution encodings has a performance cost that pessimizes particularly IBM z/OS.
Terminology and Background

(This section and the next section should not be taken as ’splaining to SG16 what they already know. The over-explaining is meant to make this document more coherent for a broader audience of readers who might be interested in C++ standardization without full familiarity with text processing terminology or background, or the details of Microsoft code page 932.)

An abstract character is an atomic unit of text. Depending on writing system, the analysis of what constitutes an atomic unit may differ, but a given implementation on a computer has to identify some things as atomic units. Unicode’s opinion of what is an abstract character is the most widely applied opinion. In fact, Unicode itself has multiple opinions on this, and Unicode Normalization Forms bridge these multiple opinions.

A character set is a set of abstract characters. In principle, a set of characters can be defined without assigning numbers to them.

A coded character set assigns numbers, called code points, to the items in the character set, i.e. to each abstract character.

When the Unicode code space was extended beyond the Basic Multilingual Plane, some code points were set aside for the UTF-16 surrogate mechanism and, therefore, do not represent abstract characters. A Unicode scalar value is a Unicode code point that is not a surrogate code point. For consistency with Unicode, I use the term scalar value below when referring to non-Unicode coded character sets, too.

A character encoding is a way to represent a conceptual sequence of scalar values from one or more coded character sets as a concrete sequence of bytes. The bytes are called code units. Unicode defines in-memory Unicode encoding forms whose code unit is not a byte: UTF-16 and UTF-32. (For these Unicode encoding forms, there are corresponding Unicode encoding schemes that use byte code units and represent a non-byte code unit from a corresponding encoding form as multiple bytes and, therefore, could be used in byte-oriented IO even though UTF-8 is preferred for interchange. UTF-8, of course, uses byte code units as both a Unicode encoding form and as a Unicode encoding scheme.)

Coded character sets that assign scalar values in the range 0...255 (decimal) can be considered to trivially imply a character encoding for themselves: You just store the scalar value as an unsigned byte value. (Often such coded character sets import US-ASCII as the lower half.)

However, it is possible to define less obvious encodings even for character sets that only have up to 256 characters. IBM has several EBCDIC character encodings for the set of characters defined in ISO-8859-1. That is, compared to the trivial ISO-8859-1 encoding (the original, not the Web alias for windows-1252), these EBCDIC encodings permute the byte value assignments.

Unicode is the universal coded character set that by design includes abstract characters from all notable legacy coded character sets such that character encodings for legacy coded character sets can be redefined to represent Unicode scalar values. Consider representing ż in the ISO-8859-2 encoding. When we treat the ISO-8859-2 encoding as an encoding for the Unicode coded character set (as opposed to treating it as an encoding for the ISO-8859-2 coded character set), byte 0xBF decodes to Unicode scalar value U+017C (and not as scalar value 0xBF).
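As a minimal sketch of that distinction (the function and its name are made up here for illustration; a real decoder would carry a full 256-entry lookup table), decoding a single ISO-8859-2 byte to a Unicode scalar value looks like this:

// Hypothetical helper, not a real library API. Bytes 0x00–0xA0 of
// ISO-8859-2 map to Unicode scalar values 1:1; the upper half needs a
// lookup. Only the byte discussed in the text is spelled out here.
char32_t decode_iso_8859_2_byte(unsigned char byte) {
  if (byte <= 0xA0) {
    return byte;                 // ASCII, C1 controls and NBSP map 1:1
  }
  switch (byte) {
    case 0xBF: return U'\u017C'; // ż: the scalar value is U+017C, not 0xBF
    // ... the remaining upper-half bytes would each have a table entry ...
    default:   return U'\uFFFD'; // placeholder for entries not shown here
  }
}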

A compatibility character is a character that according to Unicode principles should not be a distinct abstract character but that Unicode nonetheless codes as a distinct abstract character because some legacy coded character set treated it as distinct.

The Microsoft Code Page 932 Issue

Usually in C++ a “character type” refers to a code unit type, but the Text_view paper uses the term “character type” to refer to a Unicode scalar value when the encoding is a Unicode encoding form. The paper implies that an analogous non-Unicode type exists for Microsoft code page 932 (Microsoft’s version of Shift_JIS), but does one really exist?

Microsoft code page 932 takes the 8-bit encoding of the JIS X 0201 coded character set, whose upper half is half-width katakana and lower half is ASCII-based, and replaces the lower half with actual US-ASCII (moving the difference between US-ASCII and the lower half of 8-bit-encoded JIS X 0201 into a font problem!). It then takes the JIS X 0208 coded character set and represents it with two-byte sequences (for the lead byte making use of the unassigned range of JIS X 0201). JIS X 0208 code points aren’t really one-dimensional scalars, but instead two-dimensional row and column numbers in a 94 by 94 grid. (See the first 94 rows of the visualization supplied with the Encoding Standard; avoid opening the link on a RAM-limited device!) Shift_JIS / Microsoft code page 932 does not put these two numbers into bytes directly, but conceptually arranges each two rows of 94 columns into one row of 188 columns and then transforms these new row and column numbers into bytes with some offsetting.
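For concreteness, the offsetting works roughly like this sketch, which follows the Shift_JIS encoder steps of the WHATWG Encoding Standard (the function name is mine, and the example pointer value is worked backwards from the 0x92 0x96 byte pair given later in this document):

#include <cstdint>
#include <utility>

// Turn a zero-based pointer into the rearranged 188-column grid into a
// Shift_JIS lead/trail byte pair, per the Encoding Standard's encoder.
std::pair<std::uint8_t, std::uint8_t> shift_jis_bytes(std::uint32_t pointer) {
  std::uint32_t lead = pointer / 188;
  std::uint32_t trail = pointer % 188;
  std::uint8_t lead_byte =
      static_cast<std::uint8_t>(lead + (lead < 0x1F ? 0x81 : 0xC1));
  std::uint8_t trail_byte =
      static_cast<std::uint8_t>(trail + (trail < 0x3F ? 0x40 : 0x41));
  return {lead_byte, trail_byte};
}
// Example: pointer 3281 yields {0x92, 0x96}, the JIS X 0208 instance of 猪.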

While the JIS X 0208 grid is rearranged into 47 rows of a 188-column grid, the full 188-column grid has 60 rows. The last 13 rows are used for IBM extensions and for private use. The private use area maps to the (start of the) Unicode Private Use Area. (See a visualization of the rearranged grid with the private use part showing up as unassigned; again avoid opening the link on a RAM-limited device.)

The extension part is where the concern that the Text_view paper seeks to address comes in. NEC and IBM came up with some characters that they felt JIS X 0208 needed to be extended with. NEC’s own extensions go onto row 13 (in one-based numbering) of the 94 by 94 JIS X 0208 grid (unallocated in JIS X 0208 proper), so that extension can safely be treated as if it had always been part of JIS X 0208 itself. The IBM extension, however, goes onto the last 3 rows of the 60-row Shift_JIS grid, i.e. outside the space that the JIS X 0208 94 by 94 grid maps to. However, US-ASCII, the half-width katakana part of JIS X 0201, and JIS X 0208 are also encoded, in a different way, by EUC-JP. EUC-JP can only encode the 94 by 94 grid of JIS X 0208. To make the IBM extensions fit into the 94 by 94 grid, NEC relocated the IBM extensions within the 94 by 94 grid in space that the JIS X 0208 standard left unallocated.

When considering IBM Shift_JIS and NEC EUC-JP (without later JIS X 0213 extension), both encode the same set of characters, but in a different way. Furthermore, both can round-trip via Unicode. Unicode principles analyze some of the IBM extension kanji as duplicates of kanji that were already in the original JIS X 0208. However, to enable round-tripping (which was thought worthwhile to achieve at the time), Unicode treats the IBM duplicates as compatibility characters. (Round-tripping is lost, of course, if the text decoded into Unicode is normalized such that compatibility characters are replaced with their canonical equivalents before re-encoding.)

This brings us to the issue that the Text_view paper treats as significant: Since Shift_JIS can represent the whole 94 by 94 JIS X 0208 grid and NEC put the IBM extension there, a naïve conversion from EUC-JP to Shift_JIS can fail to relocate the IBM extension characters to the end of the Shift_JIS code space and can put them in the position where they land if the 94 by 94 grid is simply transformed as the first 47 rows of the 188-column-wide Shift_JIS grid. When decoding to Unicode, Microsoft code page 932 supports both locations for the IBM extensions, but when encoding from Unicode, it has to pick one way of doing things, and it picks the end of the Shift_JIS code space.

That is, Unicode does not assign another set of compatibility characters to Microsoft code page 932’s duplication of the IBM extensions, so despite NEC EUC-JP and IBM Shift_JIS being round-trippable via Unicode, Microsoft code page 932, i.e. Microsoft Shift_JIS, is not. This makes sense considering that there is no analysis that claims the IBM and NEC instances of the IBM extensions as semantically different: They clearly have provenance that indicates that the duplication isn’t an attempt to make a distinction in meaning. The Text_view paper takes the position that C++ should round-trip the NEC instance of the IBM extensions in Microsoft code page 932 as distinct from the IBM instance of the IBM extensions even though Microsoft’s own implementation does not. In fact, the whole point of the Text_view paper mentioning Microsoft code page 932 is to give an example of a legacy encoding that doesn’t round-trip via Unicode, despite Unicode generally having been designed to round-trip legacy encodings, and to opine that it ought to round-trip in C++.

So:

  • The Text_view paper wants there to exist a non-transcoding-based, non-Unicode analog for what for UTF-8 would be a Unicode scalar value but for Microsoft code page 932 instead.
  • The standards that Microsoft code page 932 has been built on do not give us such a scalar.
    • Even if the private use space and the extensions are considered to occupy a consistent grid with the JIS X 0208 characters, the US-ASCII plus JIS X 0201 part is not placed on the same grid.
    • The canonical way of referring to JIS X 0208 independently of bytes isn’t a reference by one-dimensional scalar but a reference by two (one-based) numbers identifying a cell on the 94 by 94 grid.
  • The Text_view paper wants the scalar to be defined such that a distinction between the IBM instance of the IBM extensions and the NEC instance of the IBM extensions is maintained even though Microsoft, the originator of the code page, does not treat these two instances as meaningfully distinct.
Inferring a Coded Character Set from an Encoding

(This section is based on the constraints imposed by Text_view paper instead of being based on what the reference implementation does for Microsoft code page 932. From code inspection, it appears that support for multi-byte narrow execution encodings is unimplemented, and when trying to verify this experimentally, I timed out trying to get it running due to an internal compiler error when trying to build with a newer GCC and a GCC compilation error when trying to build the known-good GCC revision.)

While the standards don’t provide a scalar value definition for Microsoft code page 932, it’s easy to make one up based on tradition: Traditionally, the two-byte characters in CJK legacy encodings have been referred to by interpreting the two bytes as a 16-bit big-endian unsigned number presented as hexadecimal (and single-byte characters as an 8-bit unsigned number).

As an example, let’s consider 猪 (which Wiktionary translates as wild boar). Its canonical Unicode scalar value is U+732A. That’s what the JIS X 0208 instance decodes to when decoding Microsoft code page 932 into Unicode. The compatibility character for the IBM kanji purpose is U+FA16. That’s what both the IBM instance of the IBM extension and the NEC instance of the IBM extension decode to when decoding Microsoft code page 932 into Unicode. (For reasons unknown to me, Unicode couples U+FA16 with the IBM kanji compatibility purpose and assigns another compatibility character, U+FAA0, for compatibility with North Korean KPS 10721-2000 standard, which is irrelevant to Microsoft code page 932. Note that not all IBM kanji have corresponding DPRK compatibility characters, so we couldn’t repurpose the DPRK compatibility characters for distinguishing the IBM and NEC instances of the IBM extensions even if we wanted to.)

When interpreting the Microsoft code page 932 bytes as a big-endian integer, the JIS X 0208 instance of 猪 would be 0x9296, the IBM instance would be 0xFB5E, and the NEC instance would be 0xEE42. To highlight how these “scalars” are coupled with the encoding instead of the standard character sets that the encodings originally encode, in EUC-JP the JIS X 0208 instance would be 0xC3F6 and the NEC instance would be 0xFBA3. Also, for illustration, if the same rule was applied to UTF-8, the scalar would be 0xE78CAA instead of U+732A. Clearly, we don’t want the scalars to be different between UTF-8, UTF-16, and UTF-32, so it is at least theoretically unsatisfactory for Microsoft code page 932 and EUC-JP to get different scalars for what are clearly the same characters in the underlying character sets.
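In code, the “interpret the bytes as a big-endian integer” numbering is nothing more than this (a hypothetical helper written purely for illustration; the values are the ones given above):

#include <cstdint>

// The traditional byte-pair-as-big-endian-integer "scalar" for a
// double-byte character. Not an API anyone ships; shown only to make the
// numbering scheme concrete.
constexpr std::uint16_t inferred_scalar(std::uint8_t lead, std::uint8_t trail) {
  return static_cast<std::uint16_t>((lead << 8) | trail);
}
static_assert(inferred_scalar(0x92, 0x96) == 0x9296); // JIS X 0208 instance of 猪
static_assert(inferred_scalar(0xFB, 0x5E) == 0xFB5E); // IBM instance of the IBM extension
static_assert(inferred_scalar(0xEE, 0x42) == 0xEE42); // NEC instance of the IBM extension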

It would be possible to do something else that’d give the same scalar values for Shift_JIS and EUC-JP without a lookup table. We could number the characters on the two-dimensional grid starting with 256 for the top left cell to reserve the scalars 0…255 for the JIS X 0201 part. It’s worth noting, though, that this approach wouldn’t work well for Korean and Simplified Chinese encodings that take inspiration from the 94 by 94 structure of JIS X 0208. KS X 1001 and GB2312 also define a 94 by 94 grid like JIS X 0208. However, while Microsoft code page 932 extends the grid down, so a consecutive numbering would just add greater numbers to the end, Microsoft code pages 949 and 936 extend the KS X 1001 and GB2312 grids above and to the left, which means that a consecutive numbering of the extended grid would be totally different from the consecutive numbering of the unextended grid. On the other hand, interpreting each byte pair as a big-endian 16-bit integer would yield the same values in the extended and unextended Korean and Simplified Chinese cases. (See visualizations for 949 and 936; again avoid opening on a RAM-limited device. Search for “U+3000” to locate the top left corner of the original 94 by 94 grid.)

What About EBCDIC?

Text_view wants to avoid transcoding overhead on z/OS, but z/OS has multiple character encodings for the ISO-8859-1 character set. It seems conceptually bogus for all these to have different scalar values for the same character set. However, for all of them to have the same scalar values, a lookup table-based permutation would be needed. If that table permuted to the ISO-8859-1 order, it would be the same as the Unicode order, at which point the scalar values might as well be Unicode scalar values, which Text_view wanted to avoid on z/OS citing performance concerns. (Of course, z/OS also has EBCDIC encodings whose character set is not ISO-8859-1.)

What About GB18030?

The whole point of GB18030 is that it encodes Unicode scalar values in a way that makes the encoding byte-compatible with GBK (Microsoft code page 936) and GB2312. This operation is inherently lookup table-dependent. Inventing a scalar definition for GB18030 that achieved the Text_view goal of avoiding lookup tables would break the design goal of GB18030 that it encodes all Unicode scalar values. (In the Web Platform, due to legacy reasons, it encodes all but one scalar value and represents one scalar value twice.)

What’s Wrong with This?

Let’s evaluate the above in the light of P1238R0, the SG16: Unicode Direction paper.

The reason why Text_view tries to fit Unicode-motivated operations onto legacy encodings is that, as noted by “1.1 Constraint: The ordinary and wide execution encodings are implementation defined”, non-UTF execution encodings exist. This is, obviously, true. However, I disagree with the conclusion of making new features apply to these pre-existing execution encodings. I think there is no obligation to adapt new features to make sense for non-UTF execution encodings. It should be sufficient to keep existing legacy code running, i.e. not removing existing features should be sufficient. On the topic of wchar_t the Unicode Direction paper says “1.4. Constraint: wchar_t is a portability deadend”. I think char with a non-UTF-8 execution encoding should also be declared a deadend, whereas the Unicode Direction paper merely notes “1.3. Constraint: There is no portable primary execution encoding”. Making new features work with a deadend foundation lures applications deeper into deadends, which is bad.

While inferring scalar values for an encoding by interpreting the encoded bytes for each character as a big-endian integer (thereby effectively inferring a, potentially non-standard, coded character set from an encoding) might be argued to be traditional enough to fit “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”, it is a bad fit for “1.6. Constraint: Implementors cannot afford to rewrite ICU”. If there is concern about implementors not having the bandwidth to implement text processing features from scratch and, therefore, needing to be able to delegate to ICU, it makes no sense to make implementations or the C++ standard come up with non-Unicode numberings for abstract characters, since such numberings aren’t supported by ICU and would necessarily require writing new code for anachronistic non-Unicode schemes.

Aside: Maybe analyzing the approach of using byte sequences interpreted as big-endian numbers looks like attacking a straw man and there could be some other non-Unicode numbering instead, such as the consecutive numbering outlined above. Any alternative non-Unicode numbering would still fail “1.6. Constraint: Implementors cannot afford to rewrite ICU” and would also fail “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”.

Furthermore, I think the Text_view paper’s aspiration of distinguishing between the IBM and NEC instances of the IBM extensions in Microsoft code page 932 fails “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”, because it effectively amounts to inventing additional compatibility characters that aren’t recognized as distinct by Unicode or the originator of the code page (Microsoft).

Moreover, iterating over a buffer of text by scalar value is a relatively simple operation when considering the range of operations that make sense to offer for Unicode text but that may not obviously fit non-UTF execution encodings. For example, in the light of “4.2. Directive: Standardize generic interfaces for Unicode algorithms” it would be reasonable and expected to provide operations for performing Unicode Normalization on strings. What does it mean to normalize a string to Unicode Normalization Form D under the ISO-8859-1 execution encoding? What does it mean to apply any Unicode Normalization Form under the windows-1258 execution encoding, which represents Vietnamese in a way that doesn’t match any Unicode Normalization Form? If the answer just is to make these no-ops for non-UTF encodings, would that be the right answer for GB18030? Coming up with answers other than just saying that new text processing operations shouldn’t try to fit non-UTF encodings at all would very quickly violate the guideline to “Avoid excessive inventiveness”.

Looking at other programming languages in the light of “2.1. Guideline: Avoid excessive inventiveness; look for existing practice” provides the way forward. Notable other languages have settled on not supporting coded character sets other than Unicode. That is, only the Unicode way of assigning scalar values to abstract characters is supported. Interoperability with legacy character encodings is achieved by decoding into Unicode upon input and, if non-UTF-8 output is truly required for interoperability, by encoding into legacy encoding upon output. The Unicode Direction paper already acknowledges this dominant design in “4.4. Directive: Improve support for transcoding at program boundaries”. I think C++ should consider the boundary between non-UTF-8 char and non-UTF-16/32 wchar_t on one hand and Unicode (preferably represented as UTF-8) on the other hand as a similar transcoding boundary between legacy code and new code such that new text processing features (other than the encoding conversion feature itself!) are provided on the char8_t/char16_t/char32_t side but not on the non-UTF execution encoding side. That is, while the Text_view paper says “Transcoding to Unicode for all non-Unicode encodings would carry non-negligible performance costs and would pessimize platforms such as IBM’s z/OS that use EBCIDC [sic] by default for the non-Unicode execution character sets.”, I think it’s more appropriate to impose such a cost at the boundary of legacy and future parts of z/OS programs than to contaminate all new text processing APIs with the question “What does this operation even mean for non-UTF encodings generally and EBCDIC encodings specifically?”. (In the case of Windows, the system already works in UTF-16 internally, so all narrow execution encodings already involve transcoding at the system interface boundary. In that context, it seems inappropriate to pretend that the legacy narrow execution encodings on Windows were somehow free of transcoding cost to begin with.)

To avoid a distraction from my main point, I’m explicitly not opining in this document on whether new text processing features should be available for sequences of char when the narrow execution encoding is UTF-8, for sequences of wchar_t when sizeof(wchar_t) is 2 and the wide execution encoding is UTF-16, or for sequences of wchar_t when sizeof(wchar_t) is 4 and the wide execution encoding is UTF-32.

The Type for a Unicode Scalar Value Should Be char32_t

The conclusion of the previous section is that new C++ facilities should not support number assignments to abstract characters other than Unicode, i.e. should not support coded character sets (either standardized or inferred from an encoding) other than Unicode. The conclusion makes it unnecessary to abstract type-wise over Unicode scalar values and some other kinds of scalar values. It just leaves the question of what the concrete type for a Unicode scalar value should be.

The Text_view paper says:

“It has been suggested that char32_t be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings.

I disagree with this and am firmly in the camp that char32_t should be the type for a Unicode scalar value.

The sentence “This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings.” is particularly alarming. Seeking to use char32_t as a code unit type for encodings other than UTF-32 would dilute the meaning of char32_t into another wchar_t mess. (I’m happy to see that P1041R4 “Make char16_t/char32_t string literals be UTF-16/32” was voted into C++20.)

As for the appropriateness of using the same type both for a UTF-32 code unit and a Unicode scalar value, the whole point of UTF-32 is that its code unit value is directly the Unicode scalar value. That is what UTF-32 is all about, and UTF-32 has nothing else to offer: The value space that UTF-32 can represent is more compactly represented by UTF-8 and UTF-16 both of which are more commonly needed for interoperation with existing interfaces. When having the code units be directly the scalar values is UTF-32’s whole point, it would be unhelpful to distinguish type-wise between UTF-32 code units and Unicode scalar values. (Also, considering that buffers of UTF-32 are rarely useful but iterators yielding Unicode scalar values make sense, it would be sad to make the iterators have a complicated type.)

To provide interfaces that are generic across std::u8string_view, std::u16string_view, and std::u32string_view (and, thereby, strings for which these views can be taken), all of these should have a way to obtain a scalar value iterator that yields char32_t values. To make sure such iterators really yield only Unicode scalar values in an interoperable way, the iterator should yield U+FFFD upon error. What constitutes a single error in UTF-8 is defined in the WHATWG Encoding Standard (matches the “best practice” from the Unicode Standard). In UTF-16, each unpaired surrogate is an error. In UTF-32, each code unit whose numeric value isn’t a valid Unicode scalar value is an error. (The last sentence might be taken as admission that UTF-32 code units and scalar values are not the same after all. It is not. It is merely an acknowledgement that C++ does not statically prevent programs that could erroneously put an invalid value into a buffer that is supposed to be UTF-32.)

In general, new APIs should be defined to handle invalid UTF-8/16/32 either according to the replacement behavior described in the previous paragraph or by stopping and signaling error on the first error. In particular, the replacement behavior should not be left as implementation-defined, considering that differences in the replacement behavior between V8 and Blink lead to a bug. (See another write-up on this topic.)
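To make the intended semantics concrete, here is a sketch (the function name is mine, not proposed standard wording) of decoding one scalar value from a std::u8string_view with the error granularity described above; a scalar value iterator over a u8string_view could be built on a step function like this, advancing by the returned length and yielding the returned char32_t:

#include <cstddef>
#include <string_view>
#include <utility>

// Sketch only. Returns the next scalar value (U+FFFD for each error) and
// the number of code units consumed. The lead/trail byte ranges follow
// Unicode Table 3-7, which yields the same per-error granularity as the
// WHATWG Encoding Standard's UTF-8 decoder.
std::pair<char32_t, std::size_t> decode_one_scalar(std::u8string_view in) {
  if (in.empty()) return {U'\uFFFD', 0};               // callers should not pass an empty view
  char8_t b0 = in[0];
  if (b0 < 0x80) return {b0, 1};                       // ASCII fast path
  std::size_t len;
  char32_t scalar;
  char8_t lo = 0x80, hi = 0xBF;                        // allowed range for the first trail byte
  if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; scalar = b0 & 0x1F; }
  else if (b0 == 0xE0)               { len = 3; scalar = 0x00; lo = 0xA0; }
  else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; scalar = b0 & 0x0F; }
  else if (b0 == 0xED)               { len = 3; scalar = 0x0D; hi = 0x9F; }
  else if (b0 == 0xEE || b0 == 0xEF) { len = 3; scalar = b0 & 0x0F; }
  else if (b0 == 0xF0)               { len = 4; scalar = 0x00; lo = 0x90; }
  else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; scalar = b0 & 0x07; }
  else if (b0 == 0xF4)               { len = 4; scalar = 0x04; hi = 0x8F; }
  else return {U'\uFFFD', 1};                          // C0/C1-style lead, F5..FF, or a lone trail byte
  for (std::size_t i = 1; i < len; ++i) {
    if (i >= in.size() || in[i] < lo || in[i] > hi) {
      return {U'\uFFFD', i};                           // the maximal subpart consumed so far is one error
    }
    scalar = (scalar << 6) | (in[i] & 0x3F);
    lo = 0x80; hi = 0xBF;                              // later trail bytes use the default range
  }
  return {scalar, len};
}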

Transcoding Should Be std::span-Based Instead of Iterator-Based

Since the above contemplates a conversion facility between legacy encodings and Unicode encoding forms, it seems on-topic to briefly opine on what such an API should look like. The Text_view paper says:

Transcoding between encodings that use the same character set is currently possible. The following example transcodes a UTF-8 string to UTF-16.

std::string in = get_a_utf8_string();
std::u16string out;
std::back_insert_iterator<std::u16string> out_it{out};
auto tv_in = make_text_view<utf8_encoding>(in);
auto tv_out = make_otext_iterator<utf16_encoding>(out_it);
std::copy(tv_in.begin(), tv_in.end(), tv_out);

Transcoding between encodings that use different character sets is not currently supported due to lack of interfaces to transcode a code point from one character set to the code point of a different one.

Additionally, naively transcoding between encodings using std::copy() works, but is not optimal; techniques are known to accelerate transcoding between some sets of encoding. For example, SIMD instructions can be utilized in some cases to transcode multiple code points in parallel.

Future work is intended to enable optimized transcoding and transcoding between distinct character sets.

I agree with the assessment that iterator and std::copy()-based transcoding is not optimal due to SIMD considerations. To enable the use of SIMD, the input and output should be std::spans, which, unlike iterators, allow the converter to look at more than one element of the std::span at a time. I have designed and implemented such an API for C++, and I invite SG16 to adopt its general API design. I have written a document that covers the API design problems that I sought to address and the design of the API (in Rust but directly applicable to C++). (Please don’t be distracted by the implementation internals being Rust instead of C++. The API design is still valid for C++ even if the design constraint of the implementation internals being behind C linkage is removed. Also, please don’t be distracted by the API predating char8_t.)
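As a sketch of the shape being argued for (the function below is invented for illustration and is not the API the linked document describes), a span-based converter reports how much it read and wrote, so the caller can loop over arbitrarily large or streamed input, and the implementation is free to look at many elements at once:

#include <cstddef>
#include <cstdint>
#include <span>
#include <utility>

// Illustrative only: convert ISO-8859-1 bytes to UTF-8 code units.
// Returns {bytes read, code units written}; it stops early when the
// output span is full so the caller can supply a fresh buffer and resume.
std::pair<std::size_t, std::size_t>
latin1_to_utf8(std::span<const std::uint8_t> src, std::span<char8_t> dst) {
  std::size_t read = 0, written = 0;
  while (read < src.size()) {
    std::uint8_t b = src[read];
    std::size_t needed = (b < 0x80) ? 1 : 2;
    if (written + needed > dst.size()) break;                    // output full
    if (b < 0x80) {
      dst[written++] = static_cast<char8_t>(b);
    } else {
      dst[written++] = static_cast<char8_t>(0xC0 | (b >> 6));    // lead byte: 0xC2 or 0xC3
      dst[written++] = static_cast<char8_t>(0x80 | (b & 0x3F));  // trail byte
    }
    ++read;
  }
  return {read, written};
}

Because the converter sees whole spans rather than one element at a time, an implementation can, for example, process runs of ASCII with SIMD, which the iterator-plus-std::copy() formulation cannot express.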

Implications for Text_view

Above I’ve opined that only UTF-8, UTF-16, and UTF-32 (as Unicode encoding forms—not as Unicode encoding schemes!) should be supported for iteration by scalar value and that legacy encodings should be addressed by a conversion facility. Therefore, I think that Text_view should not be standardized as proposed. Instead, I think std::u8string_view, std::u16string_view, and std::u32string_view should gain a way to obtain a Unicode scalar value iterator (that yields values of type char32_t), and a std::span-based encoding conversion API should be provided as a distinct feature (as opposed to trying to connect Unicode scalar value iterators with std::copy()).

Categorieën: Mozilla-nl planet
