
deployment-prep: Code stewardship request
Open, Medium, Public

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}. In the years since, it has become a resource used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system; instead, it is a shared environment used by multiple stakeholders to validate code and configuration changes outside production. A decision to sunset the beta cluster would be highly disruptive if it did not come with a proposal to build a replacement environment of some type. This environment, however, has spent years in a state of uncertain maintainership, and the code stewardship process seems like the most mature process we have for discussing the merits of the project and how it might be better supported and resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances: there are a large number of project admins, but little to no documentation about which people are taking care of which instances
  • Production Puppet code is used, but it often needs customization to work within the various constraints of the Cloud VPS project environment; making such changes requires special +2 rights, and no holder of those rights is currently active in the project
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes

Related Objects

Event Timeline


The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc.) keep up with an SRE team of 20 people when maintaining a shadow environment?

Question from someone uninitiated: under what rule must the "production environment" and the "shadow environment" (that's what it sounds like) be maintained by different groups? Given the stated purpose, "validate code and configuration changes in a non-production environment", maximum benefit could be expected from identical setups, which would arguably be easiest to achieve by using the same tech stack and know-how (i.e. people). Thanks.

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

Some services and products are maintained by their owners in the production data centers and Beta Cluster alike (most Product teams, and in Tech: Perf, Analytics, and a few others).

For some other services this is not the case, which halts much development and testing whenever problems crop up.

A non-urgent but recent example to illustrate this is T139044: Enable GTID on beta cluster mariaDB once upgraded.

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

We are a few months away from the second birthday party, is anything moving on this front? :) I ask as I was just reminded how bad of a job we're collectively doing at keeping deployment-prep healthy (see T257118#6536304).

I had a conversation about this ticket with @nskaggs and I felt that I should update this ticket after our conversation.

The problem of stewardship for beta cluster is really a series of problems:

  1. Beta means different things to different people
  2. Maintenance of beta
  3. Sunsetting beta

Beta means different things to different people

In 2018 a few folks on Release-Engineering-Team conducted a survey on the uses of beta cluster. From the survey we were able to identify the following uses of Beta Cluster:

  • Showcasing new work
  • End-to-end/unit testing of changes in isolation
  • Manual QA, quick iteration on bug fixes
  • Long-term testing of alpha features & services in an integrated environment
  • Test how changes integrate with a production-like environment before release
  • Test the deployment procedure
  • Test performance regressions
  • Test integration of changes with production-like data
  • Test with live traffic

The first thing to notice is that some of these use-cases work against one another. Testing changes in isolation cannot be done alongside long-term testing of alpha features. New services and new extensions that are not in production make the environment less "production-like". New versions of production software in beta make beta less stable; but delayed upgrades of production software might also leave beta unstable.

Beta has many purposes but not a single primary purpose -- it's used for everything: it's a tragedy of the commons. There has never been a shared understanding of what "production-like" means for the beta cluster. It likely means different things to different people.

There is no single perfect thing for beta to become because it is currently doing so many things. There is no perfect beta cluster, only perfect beta clusters tailored to their use-cases. Back in 2015 the idea of "Beta Cluster as a Service" (BCaaS [bɪˈkɒz]) had some minor traction, but for all the reasons mentioned in T215217#4965494 it didn't happen.

Maintenance

Production is maintained by a group of 23 people (SRE) dedicated to keeping that environment running, up-to-date, and safe. Release-Engineering-Team used to pretend that we could keep pace with production as a group of 7 people who are also responsible for CI, deployment, code review, and development environments, but that has proven not to work in practice. The environment is also different enough from production that folks familiar with production are not able to maintain it productively either.

A fantastic example of the types of maintenance problems we have was me breaking beta a few hours ago (T267439: MediaWiki beta varnish is down): an upstream puppet patch broke puppet in beta, and when I fixed puppet it caused problems with packages I'd never heard of. A lot of specialized knowledge is needed to keep production running, and it only gets more specialized over time.

Currently there is a project to move existing services (as well as MediaWiki) through the deployment-pipeline and into kubernetes in production. This is making beta cluster even less production-like: there is no k8s in beta and no team has a plan to build or maintain one.

My stance on beta cluster has been: Release-Engineering-Team cares if beta cluster is broken, and we'll try to wrangle the appropriate people to help. This is very different from the kind of active maintenance that beta needs to fight entropy.

Sunsetting Beta

Another finding from the 2018 survey was that 80% of respondents said that they "agree" or "mostly agree" with the statement, "I depend on Beta Cluster for some of my regular testing needs". This past week the beta cluster found 3 release train blockers that never hit production. Beta is important and currently has no replacement. Many of its instances are pets, not cattle.

Beta is also definitely an ongoing pain point for both Release-Engineering-Team and cloud-services-team.

Sunsetting beta requires a plan to replace the use-cases of beta with something more maintainable. We're in the midst of a large transition in production, containerizing our services. There is a staging cluster for services that will likely supplant some portion of beta's use-cases (a "production-like" environment). The remaining use-cases will likely fall into the realm of local development and (possibly) something that uses existing containers to let developers share changes with one another, something akin to the existing patchdemo project. This was a major recommendation made as part of the exploration of existing local development tooling. As we begin to supplant the use-cases of beta cluster, we can form a more fully realized plan for shutting it down.

As someone who considers beta essential to my role, I'll add a data point with my use case.

I have root on the webperf hosts, but those are configured via puppet and I don't have +2 rights in operations/puppet. But I do have root in beta, so I'm able to cherry-pick patches there for testing. (Even with our puppet linter and compiler infrastructure, it's extremely difficult to craft working patches without some way to test them, which requires having a puppetmaster and hosts with the affected roles.)

A specific example: upgrading the performance team services to use Python 3 (T267269) requires a series of inter-dependent patches to update both our code and some system library dependencies. The puppet changes took several patchsets to get right, e.g. figuring out why services weren't being restarted. It would have been extremely painful to iterate on this in production.
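The cherry-pick mechanic described above is plain git: fetch the proposed change from Gerrit and cherry-pick it onto the beta puppetmaster's checkout. A self-contained sketch of that mechanic, with local repositories standing in for Gerrit and the puppetmaster clone (all repo names, file contents, and change numbers here are hypothetical):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# "Upstream" repo standing in for operations/puppet on Gerrit
git init -q upstream
cd upstream
git -c user.email=t@example.org -c user.name=tester \
    commit -q --allow-empty -m 'initial'
echo 'include role::webperf' > site.pp
git add site.pp
git -c user.email=t@example.org -c user.name=tester \
    commit -q -m 'production state'
# A proposed change on its own ref (stand-in for refs/changes/45/12345/1)
git checkout -qb change-12345
echo 'include role::webperf  # python3 migration' > site.pp
git -c user.email=t@example.org -c user.name=tester \
    commit -qam 'webperf: migrate to python3'
git checkout -q -
cd ..

# The beta puppetmaster's working clone
git clone -q upstream beta-puppetmaster
cd beta-puppetmaster
# Fetch the proposed change and cherry-pick it for testing; against real
# Gerrit this would be `git fetch origin refs/changes/45/12345/1`
git fetch -q origin change-12345
git -c user.email=t@example.org -c user.name=tester \
    cherry-pick -x FETCH_HEAD
grep 'python3' site.pp
```

`cherry-pick -x` records the original commit hash in the message, which later makes it easier to see which production patches have been cherry-picked onto beta and to un-cherry-pick them once merged upstream.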

Some pain points I've experienced:

  1. Often, the first step in testing a puppet patch is to get beta back to a working state, pre-patch. For example: T244776#6364483 (Swift in beta had been mostly broken for some time).
  2. Sometimes, differences between production and beta create problems unique to beta. For example: T248041 (puppetmaster OOMs).
  3. Long-lived divergences between beta and production can be a problem, e.g. merge conflicts. For example: T244624. It'd be nice to have a clear policy about when it's OK to un-cherry-pick someone else's patch. (My stance on this re: my patches is in T245402#6517866 - please un-cherry-pick at will).

For the most part, I budget for the above when scoping testing of patches. Certainly not having a testing environment—or having a less permissive test environment without root access—would be way worse than the unrelated issues I've had to fix along the way.

There's a tragedy of the commons, but there are also economies from having a shared environment. I'm not sure it would be reasonable to expect someone to spin up e.g. their own Swift stack whenever they wanted to test a related change. Given our current dependence on puppet in production, I'm not sure spinning up a usable local testing environment for most services is even possible.

I know this is an ongoing issue, but if I were to do some maintenance on beta (read only or temporary outage, T268628), who should I notify?

I know this is an ongoing issue, but if I were to do some maintenance on beta (read only or temporary outage, T268628), who should I notify?

For brief outages, I'd think #wikimedia-releng (and the related SAL) is probably sufficient - that's where I look when something isn't working to see if someone else is already fixing it.

Cloud folks: it'd be cool if there were automatically-generated email lists based on project membership, e.g. a way to address a reply to all recipients of the "Puppet failure on ..." emails.

Cloud folks: it'd be cool if there were automatically-generated email lists based on project membership, e.g. a way to address a reply to all recipients of the "Puppet failure on ..." emails.

T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org)

After some discussion, the Release Engineering and Quality and Test Engineering teams have decided to make QTE the "Product Owners" of BetaCluster. This decision comes as part of a larger testing infrastructure effort. The details of what this means and how we will proceed will come out over the course of the coming weeks. In the meantime, this task will be marked as Resolved as the primary objective of this task was to address the lack of "Code Stewardship" or more aptly "Product Ownership".

Apologies for posting on this closed task, but is there any news on the above, some sort of eta on an announcement, details, etc?

The details of what this means and how we will proceed will come out over the course of the coming weeks.

Was this ever done?

taavi removed Jrbranaa as the assignee of this task.
taavi added a subscriber: Jrbranaa.

Boldly re-opening this task, given the details mentioned in T215217#6665452 have not been published (it's been several months now, outside the "weeks" range) and the primary problem of beta cluster being unmaintained and broken is still an issue.

I just linked this task to someone today after explaining that, afaik the code ownership of beta is stalled and I don't know why. So an open status makes the most sense indeed.

@Majavah - yes, this work has stalled due to a shift in my priorities over the last few months. However, it's back on the "front burner". I think it makes sense for this task to remain open until a plan has been pulled together and published.

Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022. These include:

  • The entire media storage (Swift) cluster
  • The entire ElasticSearch cluster
  • The kafka-main and kafka-jumbo clusters, responsible for the MW job queue, purging cached pages, and other tasks, plus Zookeeper, responsible for providing authentication to all Kafka clusters
  • Multiple miscellaneous support services

Is anyone going to work on those?

Should we decide we need support for longer, https://deb.freexian.com/extended-lts/ could be an option. Note that both the timeframe and the specific support would have to be defined. I would also caution that this is NOT a "solution" for avoiding the upgrade of these instances; however, it could be part of an upgrade plan if needed.
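For context, Freexian's extended LTS works as an additional APT repository on the affected hosts. A rough sketch of what enrolling a Stretch instance might look like (the repository line, suite name, and keyring steps here are assumptions and would need to be confirmed against Freexian's own documentation before use):

```shell
# HYPOTHETICAL sketch -- verify every line against
# https://deb.freexian.com/extended-lts/ before running anything.

# 1. Add the extended-LTS repository for stretch (exact line per Freexian docs):
echo 'deb http://deb.freexian.com/extended-lts stretch main contrib non-free' \
  | sudo tee /etc/apt/sources.list.d/extended-lts.list

# 2. Trust the archive signing key; Freexian ships a keyring package and
#    documents the exact bootstrap command on their site.

# 3. Then update and pull in the extended security updates as usual:
sudo apt-get update
sudo apt-get upgrade
```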

In T215217#7796938, @Majavah wrote:

Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022.

Thanks for raising this, @Majavah, and for all the work you've done on beta — it's in a better place than you found it.


As I mentioned in T215217#6610236, Release-Engineering-Team cares if Beta is down; however, we're not resourced to rebuild all of beta (which is what needs to happen now).

My current plan is to draft something for the tech decision forum so we can figure it out together.

Should we decide we need support for longer, https://deb.freexian.com/extended-lts/ could be an option. Note that both the timeframe and the specific support would have to be defined. I would also caution that this is NOT a "solution" for avoiding the upgrade of these instances; however, it could be part of an upgrade plan if needed.

If this is an acceptable solution to buy time, I'm in favor of doing this.

In the time that this would buy, we can figure out how to sustain beta (I hope).

Pinging because one month has passed since the last comment on this.

For everyone's info, currently no Code-Stewardship-Reviews are taking place as there is no clear path forward and as this is not prioritized work.
(Entirely personal opinion: I also assume lack of decision authority due to WMF not having a CTO currently. However, discussing this is off-topic for this task.)

I would like to point out that, especially on dewiki, beta is actively used downstream for development of templates, modules, JavaScript, etc., with permissions elevated compared to production. It would be a pity to lose these capabilities.