As you probably know, the 2022 NCAA Men’s Basketball Tournament ended earlier this month with the Kansas Jayhawks winning their fourth national championship. But while the event is over, we haven’t put it in our rearview mirror yet. That’s because we thought it would make for a good opportunity to write about the process of creating a data app rather than showing a data app. Specifically we will follow up on our previous post on March Madness.

One of the reasons Domo is a great platform is the end-to-end functionality it offers in creating data apps. Two of the first steps in creating a data app are collecting all of the data and blending the data together. This can be difficult, messy, and time-consuming. This post will address some of the data inconsistencies we ran into with our March Madness data app, and show how we think about bringing data into Domo and automating some of these types of processes.

During the pandemic, the NCAA set up a page with all of the results of every men’s tournament from 1939-2019. The data itself can be messy, and has errors and inconsistencies throughout. Additionally, the format of the tournament has changed many times over the years. It’s gone from being a 32-team tournament, to a 64-team tournament, to now a 68-team tournament. And at one stage there was a third-place game.

We wanted this project to mirror what many users have to go through sometimes to get data. So, instead of purchasing data from one of the many sports-data providers, we decided to get data from the NCAA using Python and Beautiful Soup, a Python package for parsing HTML and XML documents. The Domo platform is incredibly powerful and flexible, as it comes with a lot of built-in data connectors while allowing people to break out their high-code skills when they want to.

We opened Jupyter Workspaces (a beta feature) inside of our Domo instance and created a Python notebook to scrape the data and deposit it into Domo. You can also set Jupyter Notebooks to run on a schedule, clicking on the dataflow button in the notebook:

After getting the data into Domo, we blended the data together using the Magic ETL tool. Simple SQL-like statements allowed us to create a common data definition amongst the tournaments, such as for Round data. Below is a look at the raw Round data, and the number of times that Round appeared in the imported data for a game played:

Here you can see all sorts of interesting information. For instance, the first round can be called “First Round,” “First Round (Round of 64),” or even “Second Round (Round of 64),” because at one time they considered that the second round after the play-in round.

To normalize the data, we looked at all of the different Round names, and aligned on Round names so that our data app would function correctly. We created those transforms in Magic 2.0 with simple case statements like this:

CASE when `round` = 'CHAMPIONSHIP' then 'National Championship'
when `round` = 'Championship' then 'National Championship'
when `round` = 'round-1' then 'First Round (Round of 64)'
when `round` = 'First Round' then 'First Round (Round of 64)'
when `round` = 'round-2' then 'Second Round (Round of 32)'
when `round` = 'Second Round' then 'Second Round (Round of 32)'
when `round` = 'round-3' then 'Sweet 16'
when `round` = 'round-4' then 'Elite 8'
when `round` = 'Sweet Sixteen' then 'Sweet 16'
when `round` = 'Elite Eight' then 'Elite 8'
when `round` = 'Second Round (Round of 64)' then 'First Round (Round of 64)'
when `round` = 'Third Round (Round of 32)' then 'Second Round (Round of 32)'
when `round` = 'FINAL FOUR®' then 'Final Four®'
when `round` = 'Final Four' then 'Final Four®'
when `round` = 'Regional Finals' then 'Elite 8'
when `round` = 'Regional Semifinals' then 'Sweet 16'
when `round` = 'FIRST FOUR®' then 'First Four®'
when `round` = 'First Four' then 'First Four®'
when `round` = 'Opening Round' then 'Opening Round Game'
else `round`
end

Outputting those gave us a blended dataset, giving us four decades’ worth of March Madness that can be analyzed and shared with anybody. Pretty cool, huh?






Source link