Data Analysis: Analyzing Vermont Breweries Using the Untappd API

Introduction

Vermont is a beer mecca.

With the most breweries per-capita of the United States, and beers from the state regularly topping the charts of BeerAdvocate and Untappd, Vermont is known for its beer. I love beer, have gone through an IPA phase and left it, and live in Vermont; so I wanted to dig a little deeper into the brewery scene.

As someone who has worked as a beverage purchaser in a store known for it's beer selection, picking which beer to stock is a difficult choice. There are THOUSANDS of beers made in the state, and only so many refrigerated shelves!

It doesn't help that the dominant style (IPAs) have the shortest lifespan of any kind of beer. The biggest beers in the state strictly enforce a lifespan of 90 days; representatives come to the store to verify that you aren't selling skunked beer. This means that a bad choice can really tie up cash flow, so it's very important to stay current with trends, reputations, brewmasters, and any other information you can get your hands on.

Project Status

This is an ongoing project to visualize data retrieved about Vermont breweries and the beer they make. More charts, insights, and data sources are being investigated as you read.

TLDR;

I can be verbose (it's a blog, after all), so if you're just interested in the guts, this is a streamlined version:

  1. I retrieved data from Untappd on all Vermont breweries by getting a list from the Vermont Brewers Association and manually vetting brewery_ids
  2. Once I stored this as JSON, I used Pandas to strip the info down, normalize it, and store it in CSV format
  3. With this data, I made some charts:
    1. Bar charts of the most produced beer styles, most reviewed styles, and top 10 styles by weighted average
    2. A bubble chart of breweries with rating on the y axis, number of unique beers produced on the x axis, and the size of the bubble denoting the number of reviews the brewery had received
    3. A heatmap of beer style against ABV
    4. A scatterplot of breweries with rating on the x-axis and number of reviews on the y-axis
  4. These charts provided insights on the VT craft beer scene:
    1. IPAs are still the most drank/known style, but the sours & wild ales are very well reviewed as well
    2. Generally speaking, higher ABV beers have a higher rating
    3. Using the scatterplot, I was able to generate a list of breweries that were Hidden Gems, Worth the Hype, Overrated, and "Meh".

Background

Data Sources

To answer questions about Vermont breweries, I needed information on Vermont breweries. There were a couple of resources I had at my disposal:

Untappd

Nearly a decade ago, my senior college project was called BrewFind, and my team and I built a website & android app that displayed the breweries in the state and the beers they produced. We did this with the help of the Untappd API - and fortunately, my client key from this project is still valid, since they have stopped giving them out.

Untappd is a 'social drinking experience' app where people can 'check in' with a specific beer at a specific venue. Each time they check in, they can rate the beer. It's used pretty widely, and has a good collection of information on both the breweries and beers...and most importantly, I had a working API key.

Vermont Brewers Association (VBA)

The Vermont Brewers Association also has up-to-date information on the active breweries in the state, but nothing that is programmatically available.

Open Brewery Database (OBDB)

Finally, the Open Brewery Database is an open API with information self-reported by individuals out in the world.

Questions

In particular, I wanted to learn:

  1. What are the most dominant styles of beer (by production, rating, and review count)?
  2. Is there a correlation between ABV and rating?
  3. Is there a relationship between number of beers made by a brewery and the brewery's overall rating? What are the breweries that stand out in number of beers produced, number of ratings, number of styles, etc.
  4. Is there a relationship between ABV, style, and rating? Are there certain styles rated more favorably with a higher/lower ABV?
  5. Is there any movement away from IPAs? Is there a way I can get data to verify this?
  6. As a lager-head, I have plenty of anecdotal evidence that people are starting to move away from the dominance of hops, but no numbers.

These were just some of the questions I had, so I set about getting my hands on some data

Data Collection

Listing Vermont Breweries

First, I needed a list of the breweries in the state. Looking at the Untappd API, there isn't actually a way to search for breweries by location: there is just a generic query parameter that searches everything from name to state to description. Not particularly helpful.

The Open Brewery Database does have a way to filter by state, but after comparing the list retrieved to the listing on the VBA website, it was missing a lot of information.

I took the list that OBDB gave me, and manually supplemented it with the missing breweries on the VBA website. Now, to fulfill the /brewery/info query on Untappd, I needed their brewery_id for each one.

Retrieving brewery_ids

As mentioned, the /brewery/search endpoint just takes a generic q parameter, which I personally don't trust. Theoretically, I could ping the search with the name I had, and hope that the first result was the brewery I was looking for, and save the brewery_id returned - but there were cases where the brewery itself had variances in the name (Brewing Company vs Brewing Co., etc.) Also, my API key only allowed for 100 queries an hour, so I didn't want to risk nuking my attempts on something that didn't work.

With under 100 breweries, I set about getting the brewery_id from inspecting the Untappd web page for the brewery itself. I found a reliable place on the page layout where it was stored, and manually retrieved the brewery_id for each brewery I had in my list.

Now, with a CSV of all brewery names and IDs, I was able to write a python script to iterate over the list, hit the /brewery/info endpoint, store the data, then write everything out to a JSON file.

Getting Brewery Info

This mostly worked quickly, the only thing I got hung up on with briefly was that I was making my requests too quickly, and my queries got blocked. The API documentation didn't have any rate-limiting information other than 100 queries per hour, so I didn't think to throttle my requests.

In the end, I set an asyncrequestlimiter on my queries to wait a few seconds in between each request, and considering it was effectively an initialization script, I called it good.

Data Cleaning and Organization

Cleaning

The data came as JSON, and to ensure I didn't have to run the large set of throttled queries, I saved every bit of the results so I could come back to it if I needed it.

Looking at the format, it was clear that I needed to use an ETL (Extract, Transform, Load) strategy to sanitize and normalize the data before getting into it.

I first parsed through the JSON file, and made a list of brewery objects and a list of beer objects. Before either got loaded in, each field was sanitized as either a string or a number (replacing spaces & quotes, verifying as a float or int), or replaced with a default value ("" and -1).

Once created, I loaded the lists into Pandas DataFrames, and used the .any() function to look at missing data. Fortunately, it was mostly intact - just one missing column value in the whole thing (an address).

Organizing

With each brewery containing a list of beers and check-ins, and each beer/check-in containing information on the brewery, the data needed to be normalized.

I first stored brewery objects, then stripped out each beer object and kept only the brewery_id as a shared key. Once parsed, organized, and sanitized, both lists got written out to CSV files for later access.

Data Visualization

I wanted to use Pandas, Matplotlib, and Seaborn to visualize the data retrieved to see if I could get any further information to support or deny my hunches.

Bar Charts: Basic Style Information

The first set of graphs was straightforward horizontal bar charts:

  1. Highest rated beer styles
  2. Most produced beer styles
  3. Most reviewed beers styles

This was easy enough to do by loading in all the beer information into a DataFrame, then grouping by style and aggregating by style, rating, and number of ratings.

beer_styles = beer_df.groupby('beer_style').agg(
    Num_Beers=('beer_style', 'count'),
    Avg_Abv=('beer_abv', 'mean'),
    Avg_Ibu=('beer_ibu', 'mean'),
    Avg_Rating=('rating_score', 'mean'),
    Total_Num_Ratings=('rating_count', 'sum')
)

From there, for each type of chart I sorted by the relevant characteristic and took the top 10 values to plot.

For my bar charts, I wanted to put what I had been reading in The Visual Display of Quantitative Data by Edward Tufte, so I made a few modifications for my charts:

  1. I removed the frame
  2. I removed the default x/y ticks, and replaced them with a label at the end of each bar

Here are the charts:

Most Produced VT Beer Styles Most Reviewed VT Beer Styles Top 10 Styles by Weighted Average

By doing this, I significantly increased the data-ink ratio on my charts, and confirmed the following information:

  • There are by far more unique IPAs in the state than any other style
    • IPA, IIPA, Hazy IPA, Pale Ale, and Hazy IIPA are the top 5 styles in the state by unique beers produced.
    • I would like to find actual prodution numbers on these styles to see most produced by volume.
  • IPAs are by far the most reviewed style.
  • American Wild Ales are the highest rated style (by weighted average). All IPA styles were close behind.

The Wild Ales threw me a bit, and it only came out after I applied a weighted average instead of just the average from each beer (which was sometimes influenced by only a handful of reviews). Further digging showed that one brewery (House of Fermentology) claimed 21 of the 30 total American Wild Ales in the state...but all of them very highly rated!

Dot Dash Plot: Rating vs. Unique Beers and Number of Reviews

I made a scatter plot of breweries in the state with the number of distinct beers produced on the x-axis, and the average rating on the y-axis. To add another dimension, I made the size of the dot correlate to the number of total reviews that the brewery has received.

This would show at a glance:

  1. The highest rated breweries
  2. The breweries with the most beers made
  3. Outliers in either of these categories
  4. Rating vs number of ratings (substituting for 'popularity')

After making it, again in the spirit of Tufte, I made some small modifications:

  1. I removed the frame
  2. I replaced the x/y ticks with ticks for each point on the axis, making it a dot dash plot
  3. I reduced the range to only the range represented by the data
  4. I labeled the largest (most reviewed) dots

The chart:

Dot Dash Plot of VT Breweries by Rating, Number of Beers, and Number of Untappd Reviews

This showed a few interesting things:

  1. While The Alchemist is very highly rated/reviewed, it only has a very small handful of beers (famously, only 2 available outside of the tasting room). Hill Farmstead has a very similar rating, even more reviews, and FAR more distinct beers made.
  2. Foam Brewers have close to the MOST number of unique beers, and only slightly lower of an average review/number of reviews. Impressive!
  3. Harpoon claims the title for most unique beers, and has a large number of reviews, but has one of the lowest average ratings
  4. Almost exactly in the middle of the chart lies Zero Gravity, which feels exactly right. They're pretty solid, make a ton of seasonal beers, and are fairly well known. Right in the middle.

This is a chart that prompts more questions:

  1. What are some of the smaller bubbles? What are the entries with the highest ratings and lowest review counts (hidden gems)?
  2. What are the breweries with a low score and highest number of ratings (overrated breweries)?

I realized it would be easy enough to make a quadrant chart of rating vs number of ratings to organize the vast number of breweries in the state.

Scatterplot: Rating vs Number of Reviews

What I had in mind was creating a scatterplot of rating vs number of reviews, and dividing this into four quadrants:

  1. Hidden Gems: Low Number Reviews, High Rating
  2. Worth the Hype: High Number Reviews, High Rating
  3. Overrated: High Number Reviews, Low Rating
  4. Meh: Low Number Reviews, Low Rating

I knew what I had to do: find the median of number of reviews and rating, create a mask for each quadrant and apply it to the set to find breweries in each quadrant, change the color of the dot, and label outliers.

This basic implementation I was able to do myself, but it felt a little clunky. I was adding a 'color' column to the dataset, then plotting the whole thing, then going back and adding labels. It was a bit messy, so I fed the request into ChatGPT to see what it would do, and the result was elegant (and a good learning experience).

Instead of doing everything all at once, ChatGPT used np.where() to create a column quadrant based on the criteria, then created a dictionary of quadrant values to colors. It iterated over quadrants using a groupby, and plotted quadrants one by one, picking the color from the dictionary. Outliers were determined by a separate function, which were fed sub-frames: this was done by using a groupby and applying the function to each quadrant. Finally, the outliers were iterated over and labeled individually. So simple!

I made a few other adjustments:

  1. The y-axis (number of reviews) was made logarithmic: there was a lot of data clumping toward the bottom
  2. I used adjustText to prevent labels from overlapping

The chart:

VT Brewery Quadrant Scatterplot

Once I had the chart, I realized that no size would make it easy to read and process: it was a very fun chart to make, but the data was probably better represented by a table!

Here are the abridged breweries in each category, sorted by rating:

Hidden Gems (high to low) Rating # Reviews
Backacre Beermakers 4.421 8929
House of Fermentology 4.335 14252
Red Clover Ale Co. 4.113 8183
Wooster Bill's 4.1 20
Lucy & Howe Brewing Co. 3.886 3294
Hype Worthy (high to low) Rating # Reviews
The Alchemist 4.328 754993
Hill Farmstead Brewery 4.318 1325612
River Roost Brewery 4.187 64439
Foam Brewers 4.151 501954
Lawson's Finest Liquids 4.113 664608
Overrated (low to high) Rating # Reviews
Vermont Pub & Brewery 3.412 39378
Magic Hat Brewing Company 3.428 1270738
Harpoon Brewery 3.545 1405179
Long Trail Brewing Company 3.591 769595
Goodwater Brewery 3.6 20231
Meh (low to high) Rating # Reviews
J'Ville Brewery 3.253 1972
McNeill's Brewery 3.341 9404
Bobcat Cafe & Brewery 3.46 3362
Northshire Brewery 3.511 8188
Saint J Brewery 3.519 2703

Full tables are available in the appendix

For those familiar with the scene, you may be wondering about some of these breweries...where are they?

First, Wooster Bill's is just not right. After a little digging, it turns out that this was from a typo! I missed a digit on Idletyme's brewery_id, and Wooster Bill's was the result. I've fixed the data since writing, but it gave me a chuckle, so I left it in.

But for the others, this was a personal choice: I left in the breweries that are no longer operational. I did this partially because some of the closed breweries still make it to the top of the "Overrated" category! J'ville, McNeill's, and Magic Hat (to name a few) are no longer producing beer.

However, if I wanted to use this data to make a list of recommendations for travelers, I would certainly want to filter those breweries out.

Heatmap: Style vs ABV

Finally, I wanted to make a heatmap of style and ABV to overall rating.

To do this, I first got the average rating of the data grouped by style and ABV. This I pivoted to give an index of style, columns of ABV category, and values of the average rating. I correlated this, plotted it on a Seaborn heatmap, and got...garbage.

The chart that resulted was unreadable: there were too many styles of beer and unique ABVs: I needed to group the data.

I went back and bucketed the beers' ABV into Low (0-5%), Medium (5-8%), High (8-10%), and Very High (10+%). More difficult would be bucketing the styles...there are a massive number of unique styles, and some defy categorization. What is a Kentucky Common?

Most of what I was curious about were basic categories: hoppy, dark, light, and sour beer. I made a map of 'style' to 'type', which took the biggest styles and gave them the appropriate type (a Pilsner is a Light beer, etc). Once categorized, the heatmap made a lot more sense:

Style to ABV Heatmap of VT Beers

A rough picture emerged:

  1. IPAs and Sours were generally the highest rated categories (which we had seen in bar charts earlier)
  2. Very High ABV beers were the most popular in these categories (and dark beers as well), and low ABV beers generally weren't as popular
  3. The sweet spot across styles was 8-10% ABV, which feels about right to my gut. A double IPA is most commonly 8%, and that is a dominant style.

Future Directions

There are more data to organize and questions to answer — these are a few that I'm looking into next:

Check-in Data

There is a great deal of 'check-in' data in the initial JSON payload that I would like to unpack and parse through. Using a NLP, it would be interesting to see what sentiment looks like on comments.

Mapping

I would like to know how many check-ins happen outside of the state. What are the most popular beers outside of the state (other than Heady Topper)? Where is it most popular?

Interactivity

For so many of these charts, I want to click around on them. Building some of these charts to be interactive in a web environment would be excellent.

Conclusion (so far)

  1. IPAs are still generally dominant
  2. Sour beers are less made, but generally rated higher
  3. Very high ABV beers are generally rated higher
  4. Lists of Hidden Gems, Hypeworthy, Overrated, and Meh breweries have been created.

I'm looking forward to working more on this! You can keep an eye on the respository (linked below), my blog, or any of my socials to stay up to date.

If you have comments, questions, or errors to point out, please get in touch, I'd love to hear what you think.

Thanks for reading!

I'm currently looking for work, especially as a Data Analyst/Scientist, so please holler if you would like to connect on a project or role.

Appendix

Code

You can check out the repository here. I've included the intial JSON payload if you want to do your own work on it.

Hidden Gems

Brewery Rating # Reviews
Backacre Beermakers 4.421 8929
House of Fermentology 4.335 14252
Red Clover Ale Co. 4.113 8183
Wooster Bill's 4.1 20
Lucy & Howe Brewing Co. 3.886 3294
Black Flannel Brewing 3.879 12643
Soulmate Brewing Company 3.837 339
Farnham Ale & Lager 3.831 13361
Weird Window Brewing 3.809 8031
Whirligig Brewing 3.789 2121
Vermont Beer Makers 3.788 15738
Dirt Church Brewing Co 3.776 2602
Kickback Brewery 3.74 479
Scotch Hill Brewing Co. 3.726 658

Worth the Hype

Brewery Rating # Reviews
The Alchemist 4.328 754993
Hill Farmstead Brewery 4.318 1325612
River Roost Brewery 4.187 64439
Foam Brewers 4.151 501954
Lawson's Finest Liquids 4.113 664608
Frost Beer Works 4.058 192485
Fiddlehead Brewing Company 4.031 298087
Foley Brothers Brewing 4.021 161663
Burlington Beer Company 4.001 578840
Upper Pass Beer Company 3.992 104255
Ten Bends Beer 3.985 78639
Green Empire Brewing 3.985 27741
Four Quarters Brewing 3.97 165367
Prohibition Pig 3.918 73802
Hermit Thrush Brewery 3.895 149607
14th Star Brewing Company 3.849 131685
Whetstone Beer Co. 3.833 27189
Good Measure Brewing Co. 3.771 17282
Zero Gravity Craft Brewery 3.769 351251
Lost Nation Brewing 3.758 128875
Simple Roots Brewing Company 3.743 19818
Bent Hill Brewery 3.729 27127

Overrated

Brewery Rating # Reviews
Vermont Pub & Brewery 3.412 39378
Magic Hat Brewing Company 3.428 1270738
Harpoon Brewery 3.545 1405179
Long Trail Brewing Company 3.591 769595
Goodwater Brewery 3.6 20231
Queen City Brewery 3.601 52175
Otter Creek Brewing 3.626 506254
von Trapp Brewing 3.627 173600
Drop-In Brewing Company 3.635 28345
Switchback Brewing Company 3.635 139258
Madison Brewing Company 3.672 23972
Stone Corral 3.68 29899
Rock Art Brewery 3.686 119613
Rutland Beer Works 3.715 30916

Meh

Brewery Rating # Reviews
J'Ville Brewery 3.253 1972
McNeill's Brewery 3.341 9404
Bobcat Cafe & Brewery 3.46 3362
Northshire Brewery 3.511 8188
Saint J Brewery 3.519 2703
Trout River Brewing Company 3.532 16152
Jasper Murdock's Alehouse 3.536 5575
Brewster River Pub & Brewery 3.563 2973
Dog River Brewery 3.566 524
Kingdom Brewing 3.586 6260
Outer Limits Brewing 3.621 7969
1st Republic Brewing Co. 3.669 9874
Halyard Brewing Company 3.67 5967
Hired Hand Brewing Company 3.681 3909
Hogback Mountain Brewing 3.681 4804
Next Trick Brewing 3.684 1867
Mill River Brewing 3.69 9413
Beer Naked Brewery 3.7 3630
Brocklebank Craft Brewing 3.706 2381
Two Heroes Brewery 3.712 2246
Red Barn Brewing 3.718 6107