GSOC ‘23: Automating Area Management in MusicBrainz

Hi, this is Prathamesh (IRC Nick: “Pratha-Fish”), an aspiring Data Engineer based in India with a sheer obsession for data, music, computers, and open-source software.

This year, I had the pleasure of being mentored by bitmap and reosarevok, and of being guided past every hurdle by the incredible team at the MetaBrainz Foundation. This blog post summarizes my journey with MetaBrainz over the past 22 weeks.

Introduction

MusicBrainz, while primarily a music metadata goldmine, also maintains an extensive database of area metadata to link various entities like artists, concerts, labels, recording studios, instruments, etc. to their respective areas like cities, countries, and subdivisions. Currently, adding and editing this data is a super tedious process: someone submits an AREQ ticket, and drsaunde resolves it manually (thanks for your hard work!).

The objective of this project was to completely automate this uninspiring process by building a robust data pipeline to automatically import and synchronize area data from Wikidata and Geonames into MusicBrainz!

The Problem Statement

Automate the process to maintain area metadata on MusicBrainz by automatically synchronizing area metadata between MusicBrainz and Wikidata using a data pipeline.

The Solution

My proposed solution can be divided into five distinct steps, as follows:

  1. Fetch cities, subdivisions, and countries from Wikidata.
  2. Fetch cities, subdivisions, and countries from the MusicBrainz database.
  3. Combine, transform, and wrangle the collected data into an intermediary format.
  4. Efficiently compare metadata from MusicBrainz and Wikidata.
  5. Utilize the MusicBrainz-Bot to automatically add or edit areas reported by the comparator.
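
To make step 1 a bit more concrete, here is a minimal sketch of fetching the areas located directly inside a given country from the Wikidata Query Service. P131 ("located in the administrative territorial entity") and P300 ("ISO 3166-2 code") are real Wikidata properties, but the query shape, the function names, the output keys, and the User-Agent string are assumptions made for illustration, not the bot's actual code.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_subdivision_query(parent_qid: str) -> str:
    """Build a SPARQL query for areas located directly in `parent_qid`."""
    return f"""
    SELECT ?area ?areaLabel ?iso WHERE {{
      ?area wdt:P131 wd:{parent_qid} .
      OPTIONAL {{ ?area wdt:P300 ?iso . }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """


def parse_bindings(response: dict) -> list[dict]:
    """Flatten a SPARQL JSON result into plain rows (hypothetical column names)."""
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append(
            {
                # Entity URIs end in the Q-ID, e.g. .../entity/Q42
                "qid": binding["area"]["value"].rsplit("/", 1)[-1],
                "name": binding["areaLabel"]["value"],
                # The ISO code is optional, so it may be absent from a binding
                "iso_code": binding.get("iso", {}).get("value"),
            }
        )
    return rows


def fetch_subdivisions(parent_qid: str) -> list[dict]:
    """Run the query over HTTP and return the parsed rows."""
    params = urllib.parse.urlencode(
        {"query": build_subdivision_query(parent_qid), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "area-fetch-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_bindings(json.load(resp))
```

Calling `fetch_subdivisions("Q668")` would query the areas located directly in India. Walking the hierarchy then amounts to calling the same function again with each returned `qid` as the new parent.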

Implementation

This complete project was implemented using Python (especially pandas), PostgreSQL, Docker, and Linux (shell scripting, hosting, etc.).

(1) MusicBrainz-AreaBot

  • This bot features a ton of services and utilities to fetch, clean, combine, transform, and wrangle data and report inconsistencies between Wikidata and MusicBrainz data.
  • Some of its services include:
  • Fetch Wikidata areas (step 1):
    • Uses SPARQL queries to fetch areas from the Wikidata Query Service.
    • Scripted using Python and pandas to programmatically fetch areas in order of hierarchy (countries > subdivisions for each country > cities for each subdivision).
    • Supercharged with multiprocessing, which fetches data 400% faster through parallel computation.
    • Fetches and dumps ~100k areas to a CSV file.
  • Fetch MusicBrainz areas (step 2):
    • Uses a complex SQL query to recursively fetch areas from MusicBrainz.
    • Fetches and dumps ~120k areas to a CSV file.
  • Combine, Transform, and Wrangle areas (step 3):
    • Uses pandas to load and combine area dumps, generate URLs, generate unique indices, etc.
    • Regularizes the columns and column names in both Wikidata and MusicBrainz areas. Structures the data into an intermediary format.
  • Compare Metadata (step 4):
    • Efficiently compares transformed data using pandas and reports missing data in MusicBrainz based on Wikidata data.
    • Creates payloads for the MusicBrainz-Bot to further process.
  • Update Areas on MusicBrainz (step 5):
    • Uses the MusicBrainz-Bot (discussed further below) at its core to automate the addition of area differences to the MusicBrainz database based on payloads generated in the previous steps.
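
As a rough illustration of steps 3 and 4, the sketch below normalizes two area dumps into one shared layout and then uses a pandas left merge with `indicator=True` to report the Wikidata areas that have no counterpart in MusicBrainz. The column names, the idea of matching on a Wikidata ID, and the payload shape are all assumptions made for this example; the real bot's intermediary format may differ.

```python
import pandas as pd

# Shared (hypothetical) intermediary layout used by both sources.
COMMON_COLUMNS = ["wikidata_id", "name", "iso_code"]


def normalize(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Rename source-specific columns to the shared layout and tidy names."""
    out = df.rename(columns=column_map)[COMMON_COLUMNS].copy()
    out["name"] = out["name"].str.strip()
    return out


def find_missing_in_mb(wd: pd.DataFrame, mb: pd.DataFrame) -> pd.DataFrame:
    """Return Wikidata rows whose ID does not appear in the MusicBrainz dump."""
    known_ids = mb[["wikidata_id"]].drop_duplicates()
    merged = wd.merge(known_ids, on="wikidata_id", how="left", indicator=True)
    missing = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    return missing.reset_index(drop=True)


def to_payloads(missing: pd.DataFrame) -> list:
    """Turn each missing area into a plain dict that a bot could consume."""
    return missing.to_dict(orient="records")


# Tiny made-up dumps standing in for the two CSV files.
wd_raw = pd.DataFrame(
    {
        "qid": ["Q1", "Q2", "Q3"],
        "label": [" Alpha", "Beta", "Gamma "],
        "iso": ["AA-1", "AA-2", "AA-3"],
    }
)
mb_raw = pd.DataFrame(
    {"wikidata_qid": ["Q2"], "area_name": ["Beta"], "iso_3166_2": ["AA-2"]}
)

wd = normalize(wd_raw, {"qid": "wikidata_id", "label": "name", "iso": "iso_code"})
mb = normalize(
    mb_raw,
    {"wikidata_qid": "wikidata_id", "area_name": "name", "iso_3166_2": "iso_code"},
)
payloads = to_payloads(find_missing_in_mb(wd, mb))
```

Here `payloads` holds one dict per area that exists in the Wikidata dump but not in the MusicBrainz one. Doing the comparison as a single merge keeps it vectorized, which matters once the dumps reach the ~100k row range.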

(2) MusicBrainz-Bot

  • The first half of the project started off by updating the MusicBrainz-Bot to add support for adding and updating area data.
  • Written in Python, this bot framework uses Mechanize to automate form submission on the MusicBrainz-Server.

(3) MusicBrainz-Docker

  • Provides a handy MusicBrainz DB and MusicBrainz Server setup rolled into easy-to-set-up Docker containers.
  • Supports testing in MusicBrainz-Bot and area lookups for MusicBrainz-AreaBot.

Project Outcomes

Here are some interesting stats and insights from the very first run of this project!

  • There are currently ~120,000 areas in the MusicBrainz database.
  • Using this bot, we’ve fetched ~100,000 areas from Wikidata complete with ISO codes, Reference URLs, Relations, etc.
  • New areas added: ~73,000 (60.44% Increase!)
  • Existing areas edited: 4,000 (3.95% Updated)

What’s Left

  • Tests for the MusicBrainz-AreaBot
  • Documentation for the MusicBrainz-AreaBot
  • Code Refactoring and Review
  • Deployment on a full database outside of a testing environment.
  • Some Minor Bug Fixes
  • Add support for posting relationships with data.

End Note

Just like last year, this project was an absolute joy to work on! Now that it is almost ready for production, I believe it will be instrumental in significantly reducing the manual work required to maintain areas in MusicBrainz!

There were a few ups and downs over a significant stretch of the project, and I’d really love to acknowledge and thank bitmap and reosarevok for their unwavering support through everything. Without them, this project would have been a lost cause long ago.

I’ve personally learned a lot from this project and honed my skill set while contributing to something I find meaningful as a music nerd. I simply can’t thank this community enough for everything. ❤️

Cheers!

6 thoughts on “GSOC ‘23: Automating Area Management in MusicBrainz”

  1. Thanks Prathamesh, it was nice to mentor you on this project and I was happy to see all the progress you made. 🙂 Great work!

  2. That’s certainly the plan eventually! We’ll need to test it a bit, possibly make a few changes, but the whole idea is to make this work 🙂

  3. So cool! I was just about to make a new area request but I’m going to put my faith in you instead Prathmesh (and all the diligent testers/reviews!) and wait a bit 😛

    A cool side-effect of linking projects like this is that MB editors might end up contributing to Wikidata, when a location isn’t right or needs to be added. Which is better than duplicating the work, just on our end! Data nerds working together, for better results everywhere ❤

  4. First of all its drsaunde or D.R. Saunders. I am not a doctor and I have never played one on TV.

    Secondly, I don’t know why you are thanking me, I didn’t do anything at all to help your project, even though I had lots of experience and expertise to share, I was never asked. In fact nobody from musicbrainz so much as even mentioned this project to me, so this post was like… “Surprise!”

    Part of me feels that I should be incredibly insulted how invisible and worthless I am, however I guess I should be happy to be rid of the responsibilities. Thanks for taking over all the area tickets Prathmesh.

    Could you please let me know what your JIRA ID is, so that I can make sure you are assigned all future tickets.

  5. I have fixed the name spelling on the blog post. Other than that small mishap, any blame in communication here is the mentors’ (and me in particular), not on our GSoC student who did nothing wrong. I had already talked this situation through with drsaunde in private back in December, but I thought I’d make it clear here since the comment got posted publicly eventually and some people are now seeing it and worrying that we never responded to it 🙂
