World of Warcraft, Cleaning ArenaStats Match Results Data

Problem


The World of Warcraft ‘ArenaStats’ add-on lets players automatically track the results of their Arena matches (2v2, 3v3, and 5v5 player-versus-player). Because the add-on is fan-made rather than officially supported, it occasionally records results incorrectly. Can we clean the exported data to get a usable dataset?

ArenaStats TBC.

Data Collection and Data Cleansing


Jupyter Notebook – Python

The most common ArenaStats errors are duplicated matches and a ‘ghost player’ error, in which a player from the previous match is carried over into the record for the following match (so three players are listed where there were only two). Duplicate matches are easy to filter out, but ghost players are harder to identify: detection relies on the race-class-faction combinations differing between the two matches. If both matches feature characters with the same race, class, and faction, the ghost player cannot be determined and the match has to be dropped from the dataset.

Example of a ghost player.
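The ghost-player check described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: the function name `remove_ghost_player` and the representation of each player as a `(race, class, faction)` tuple are assumptions made for the example. It treats a recorded player as the ghost only when exactly one player in the current record matches someone from the previous match; otherwise it gives up so the match can be dropped.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (race, class, faction), e.g. ("Orc", "Warrior", "Horde").
Player = Tuple[str, str, str]


def remove_ghost_player(
    previous_team: List[Player],
    current_team: List[Player],
    players_per_team: int,
) -> Optional[List[Player]]:
    """Return current_team without the ghost player.

    Returns None when the ghost cannot be determined, which signals
    that the match should be dropped from the dataset.
    """
    extra = len(current_team) - players_per_team
    if extra <= 0:
        return list(current_team)  # no ghost recorded, nothing to fix
    if extra > 1:
        return None  # more than one surplus player: not handled in this sketch

    previous = set(previous_team)
    # Positions in the current record whose race-class-faction combination
    # also appeared in the previous match.
    candidates = [i for i, player in enumerate(current_team) if player in previous]
    if len(candidates) != 1:
        return None  # no match, or several equally plausible ghosts

    ghost_index = candidates[0]
    return [p for i, p in enumerate(current_team) if i != ghost_index]


previous_match = [("Human", "Rogue", "Alliance"), ("Dwarf", "Priest", "Alliance")]
current_match = [
    ("Orc", "Warrior", "Horde"),
    ("Undead", "Mage", "Horde"),
    ("Human", "Rogue", "Alliance"),  # carried over from the previous match
]
fixed = remove_ghost_player(previous_match, current_match, players_per_team=2)

# Ambiguous case: every recorded player also fits the previous match.
ambiguous = remove_ghost_player(
    previous_match,
    [("Human", "Rogue", "Alliance"), ("Dwarf", "Priest", "Alliance"), ("Human", "Rogue", "Alliance")],
    players_per_team=2,
)
```

Returning `None` instead of guessing keeps the two outcomes of the check separate: a repaired record, or a record flagged for removal.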

Data Cleaning steps:

  1. Add columns for ‘queueType’ (2v2, 3v3, 5v5) and ‘playerPerTeam’ (2, 3, 5).
  2. Drop duplicated records (matches with ‘matchDuration’ = 0 and/or a blank ‘teamName’ field are duplicates).
  3. Check for ghost players.
    1. Attempt to determine the ghost player. Update the record by dropping the erroneously recorded player.
    2. If the ghost player cannot be determined, drop the record from the dataset entirely.
  4. Update the ‘zoneId’ column with map names.
  5. Standardize the ‘date’ format.
  6. Replace the ‘endTime’ column with a single ‘matchDuration’ column in seconds.
  7. Add columns for ‘teamComp’ and ‘enemyTeamComp’ (e.g. “Rogue-Priest”) and a binary ‘winLoss’ column.
Cleaned ArenaStats dataset.
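Steps 1, 2, and 7 of the cleaning list can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's code: the column names ‘queueType’, ‘playerPerTeam’, ‘matchDuration’, ‘teamName’, ‘teamComp’, ‘enemyTeamComp’, and ‘winLoss’ come from the steps above, while the raw fields `teamSize`, `teamClasses`, `enemyClasses`, and `won` are hypothetical stand-ins for whatever the export actually provides. The sketch also assumes ghost players have already been handled.

```python
QUEUE_TYPES = {2: "2v2", 3: "3v3", 5: "5v5"}


def is_duplicate(match):
    """Step 2: a zero duration and/or a blank team name marks a duplicate."""
    return match["matchDuration"] == 0 or not match["teamName"].strip()


def add_derived_columns(match):
    """Steps 1 and 7: add queue, composition, and win/loss columns."""
    out = dict(match)  # copy so the raw record is left untouched
    size = match["teamSize"]  # hypothetical raw field
    out["playerPerTeam"] = size
    out["queueType"] = QUEUE_TYPES[size]
    # Classes are joined in recorded order, e.g. "Rogue-Priest".
    out["teamComp"] = "-".join(match["teamClasses"])  # hypothetical raw field
    out["enemyTeamComp"] = "-".join(match["enemyClasses"])  # hypothetical raw field
    out["winLoss"] = 1 if match["won"] else 0  # hypothetical raw field
    return out


def clean(matches):
    """Drop duplicates, then add the derived columns to what remains."""
    return [add_derived_columns(m) for m in matches if not is_duplicate(m)]


raw_matches = [
    {"teamName": "Example Team", "matchDuration": 95, "teamSize": 2,
     "teamClasses": ["Rogue", "Priest"], "enemyClasses": ["Warrior", "Druid"], "won": True},
    {"teamName": "", "matchDuration": 95, "teamSize": 2,
     "teamClasses": ["Rogue", "Priest"], "enemyClasses": ["Warrior", "Druid"], "won": True},
    {"teamName": "Example Team", "matchDuration": 0, "teamSize": 2,
     "teamClasses": ["Rogue", "Priest"], "enemyClasses": ["Mage", "Rogue"], "won": False},
]
cleaned = clean(raw_matches)
```

In a notebook the same logic would more likely be written with a DataFrame library; plain dictionaries are used here only to keep the example self-contained.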

Data Visualizations Built with Dataset


TBC Arena Matchups, Season 3, Rogue-Priest – Jupyter Notebook – R

TBC Arena Season 3, Rogue-Priest, Distribution of Match Durations – Jupyter Notebook – Python