
Data Platform Engineering/Data Platform SRE/Upcoming Operations


How can I find out which operations are planned or in progress on DPE systems?

  • Check the list below for upcoming operations.
  • Check Server Admin Log (SAL) tagged with "DPE" for general ongoing operations.
  • Check Analytics logs for ongoing operations specific to DPE.

Instructions for SRE

The teams and people who depend on the technical stacks we maintain need visibility into upcoming operations that might affect their work. This is achieved by:

  • Sending an email to the appropriate mailing list when scheduling an operation.
  • Listing known upcoming operations on this page (and removing them once they are completed).
  • If the operation might have an impact on the whole organization, !log the start (and the end, when appropriate) of the operation in the #wikimedia-operations IRC channel, tagged with "DPE". The tag makes it easy to find DPE-related operations in the Server Admin Log (SAL).
  • If the operation only concerns DPE systems, !log the start (and the end, when appropriate) of the operation in the #wikimedia-analytics IRC channel. The Analytics logs are on mediawiki.org.
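The two logging rules above can be sketched as a small helper that picks the channel and builds the !log line. This is only an illustration: the function name, the "start"/"end" wording, and the position of the "DPE" tag inside the message are assumptions, not a documented convention.

```python
def build_log_command(summary: str, phase: str, org_wide: bool) -> tuple[str, str]:
    """Return (channel, message) for a !log entry.

    Hypothetical helper: the message layout, including where the "DPE"
    tag sits, is an assumption made for illustration only.
    """
    if phase not in ("start", "end"):
        raise ValueError("phase must be 'start' or 'end'")

    if org_wide:
        # Operations with possible organization-wide impact go to
        # #wikimedia-operations and carry the "DPE" tag.
        return "#wikimedia-operations", f"!log DPE {phase}: {summary}"

    # Operations that only concern DPE systems go to #wikimedia-analytics.
    return "#wikimedia-analytics", f"!log {phase}: {summary}"


if __name__ == "__main__":
    channel, message = build_log_command("example upgrade", "start", org_wide=True)
    print(channel, message)
```

The helper only formats text; sending the line to IRC is left to whatever client the SRE already uses.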

Limitations

By nature, the DPE SRE team performs operations on our production infrastructure all the time, and not everyone needs to know about every one of them. For example, a standard service restart with a low chance of impact does not need to be actively communicated in advance. At the other extreme, a complex upgrade with a high chance of introducing incompatibilities needs to be scheduled and coordinated outside of this page. SREs are expected to use their best judgement, with a few guidelines:

  • Every operation on our production infrastructure needs to be logged in #wikimedia-operations.
  • Operations that have a low probability of impact don't need to be announced in advance or tracked outside of their Phabricator tickets and IRC/SAL logs.
  • Operations that have no impact if everything goes well, but could have an impact if something does not go according to plan, should be announced and tracked.

Upcoming operations

  • template: <date/time (YYYY-MM-DD HH:MM UTC)>: task T1234 - summary of the operation - expected impact - potential unexpected consequences
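As a rough aid for filling in the template, the following sketch formats one entry in the same field order. The function name and the sample values are hypothetical; only the field order and the date format come from the template above.

```python
from datetime import datetime, timezone


def format_operation_entry(
    when: datetime,
    task: str,
    summary: str,
    expected_impact: str,
    unexpected_consequences: str,
) -> str:
    """Format one list entry following the template on this page.

    Hypothetical helper for illustration; it is not part of any
    existing tooling.
    """
    if when.tzinfo is None:
        # Refuse naive datetimes so the UTC label is never wrong.
        raise ValueError("when must be timezone-aware")

    # Convert to UTC and render as YYYY-MM-DD HH:MM UTC.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        f"{stamp}: task {task} - {summary} - "
        f"{expected_impact} - {unexpected_consequences}"
    )


if __name__ == "__main__":
    # Placeholder values only; T1234 is the task ID used in the template.
    print(
        format_operation_entry(
            datetime(2030, 1, 15, 9, 30, tzinfo=timezone.utc),
            "T1234",
            "restart of an example service",
            "none expected",
            "brief query failures",
        )
    )
```

Requiring a timezone-aware datetime is a design choice here: it avoids silently labelling a local time as UTC.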