An Introduction to Cluster Analytics in the Prospect World

Written by: Ray Butler and Tyler Spicer (@tylerjspicer)

Follow us on Twitter! @Prospects365

We’ve been hard at work here at Prospects 365, both on articles already published (the weekly Ramblings and player profiles like this one on Brennen Davis) and on some things behind the scenes. From the latter category, we’re finally ready to roll out something we’ve been working on for a while: applying cluster analytics to baseball. More specifically, we’re in the beginning stages of applying cluster analytics to help us identify unheralded prospects who should be squarely on your radar.

In the statistical universe, an observation is one or more measurements pertaining to an item of interest. For us, of course, the ‘items of interest’ are prospects. For simplicity, this article will not be covering the various methods of performing a cluster analysis. Instead, my hope is to discuss why clustering is beneficial for prospect research; I’ll also touch on some of our initial results and findings.

Observation      HR    BA      SO
Wander Franco    6     .325    18
Tyler Freeman    3     .300    24

Cluster analysis is useful in baseball when comparing the stat lines of numerous players and finding commonalities between them. Observations of players within the same cluster are more similar than those in different clusters. This can help us identify otherwise-unknown prospects if they are placed into clusters with top prospects who are performing well. This analysis is both exploratory and preliminary. Just because the analysis says a player is in the same cluster as Wander Franco does not mean said player is a Wander Franco-type prospect or will ever be a Wander Franco-type prospect. The analysis is strictly meant to uncover undiscussed prospects who are performing similarly to elite prospects in various statistical categories.

For example, say we want to compare how Low-A middle infielders are performing this season. To do this, we decide to evaluate surface stats such as prospects’ batting average, home runs, and strikeouts (our actual evaluations dive much deeper). If we were to plot these variables in a 3-dimensional space it might look something like this:

[Figure: Tyler Freeman and Wander Franco plotted by home runs, batting average and strikeouts]

In order to see how similar the players are to one another, we can draw a straight line between the data points and measure what is called the Euclidean distance. The shorter the line, the more similar the players are statistically.
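To make that concrete, here is a minimal Python sketch that measures the straight-line distance between the two stat lines from the table above. The `euclidean` helper is just an illustrative name, not part of any library.

```python
import math

# Stat lines from the table above, in the order (HR, BA, SO).
franco = (6, 0.325, 18)
freeman = (3, 0.300, 24)


def euclidean(a, b):
    """Straight-line distance between two equal-length stat lines."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean(franco, freeman)
print(round(distance, 2))  # 6.71
```

One caveat: on raw numbers, home runs and strikeouts swamp batting average simply because they are counted on much bigger scales (the .025 gap in average barely moves the result). Rescaling each stat before measuring distance is a common way around that, though whether and how to rescale is a judgment call and is not something this article covers.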

We are able to do this analysis for any combination of variables that might make prospects stand out from one another. We actually did this for Low-A middle infielders, and below are the results for players of interest plotted on a strikeout and batting average graph. The colors represent which cluster each player belongs to after clustering on five different advanced variables. You can see the similarities of the players in the different clusters: Blue = contact-oriented players with speed, Orange = players contributing to power and speed categories, Green = players with average speed, power and contact outputs.

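[Figure: cluster example, Low-A middle infielders plotted by strikeouts and batting average and colored by cluster]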
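For readers curious about the mechanics, the sketch below shows one common way groups like these can be formed: k-means clustering on standardized stats. It is purely illustrative. This article does not cover which clustering method or variables were used, so k-means, the three stat columns (K%, ISO, SB), the farthest-point seeding and the six made-up players (A through F) are all assumptions for the sake of a toy example, not the real inputs.

```python
import math
from statistics import mean, pstdev

# Hypothetical stat lines (K%, ISO, SB) for six made-up players.
players = ["A", "B", "C", "D", "E", "F"]
stats = [
    (12.0, 0.080, 20), (13.0, 0.090, 18),  # contact and speed
    (22.0, 0.200, 15), (23.0, 0.210, 14),  # power and speed
    (18.0, 0.130, 6), (19.0, 0.120, 5),    # middling across the board
]


def standardize(rows):
    """Turn each column into z-scores so no single stat dominates distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against a constant column
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def seed_centers(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iters=50):
    """Plain k-means: assign each point to its nearest center, move each
    center to the mean of its members, repeat until nothing changes."""
    centers = seed_centers(points, k)
    labels = []
    for _ in range(max_iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(mean(col) for col in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels


labels = kmeans(standardize(stats), k=3)
for name, label in zip(players, labels):
    print(name, "-> cluster", label)
```

With these made-up numbers, A and B land in one cluster, C and D in another, and E and F in a third. Real stat lines are rarely this tidy, which is why the choice of variables and the number of clusters takes trial and error.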

In the chart above, the players in orange are prospects who immediately become intriguing to us. It’s doubtful that names like Nick Podkul, Rodolfo Castro and Jeremy Pena are on many people’s radar. However, the cluster analysis is saying that, from a holistic standpoint, they are having statistical seasons similar to those of Wander Franco and Tyler Freeman. Perhaps after live looks or additional studying, we can determine whether any of these players can seriously become a prospect to keep an eye on moving forward.

That is exactly what we have done in our trials using cluster analysis. With more intricate variables and some trial and error, we quickly discovered several young-for-their-level, under-the-radar prospects who are playing beyond their years this season.

Cluster analytics did not create Canaan Smith and Drew Rom. But in an age in which fantasy baseballers are constantly searching for new resources to give them a leg up on their competition, the clustering process did help bring them to light. It unveiled them, if you will.

Smith is a 20-year-old outfielder in the Yankees’ system who’s slashed .309/.399/.498 with 7 home runs and 4 stolen bases in 253 plate appearances for Low-A Charleston. He’s 1.5 years younger than his average competition in the South Atlantic League and just participated in the league’s All-Star game. Smith currently sports an impressive 158 wRC+ this season.

Rom is a 19-year-old pitcher in the Orioles’ system who signed for nearly $200,000 more than his slot value after being selected in the fourth round of last summer’s MLB Draft. In 11 Low-A starts this season, the 6’2”, 170 lb. southpaw has posted a 1.23 ERA (2.65 xFIP) in 51.1 innings pitched. He boasts a 32.5 K% (!) and has walked a modest 7.3% of the batters he’s faced this season. It’s also pretty cool to read about the left-hander embracing the analytical approach to pitching, which he talks about in this Baseball America article ($). Rom is 2.8 years younger than his average competition in the South Atlantic League; just like Smith, the southpaw was also named a league All-Star. Prospecting legend John Calvagno recently ranked Rom higher than prospects like Roansy Contreras, Mark Vientos and others in his midseason “Sally” prospect list.

As we stated above, it’s up to us to dive deeper into the names that clustering provides us. In other words, the post-clustering process becomes about searching for substance behind the eye-opening statistics. And why exactly has *this* prospect gone completely unheralded up until now?

A closer look at older scouting reports on Smith suggests he’s got some substantial swing-and-miss in his game. Some scouts think he’s destined to be a 1B-only prospect. The strikeout rate has dropped quite a bit in 2019 (30.4 K% in Short Season last season, 23.3 K% in Low-A this season), so sustaining that improved contact rate should lead to a much-improved stock. He has also started exclusively in the corner outfield spots this season.

Rom doesn’t possess premium velocity and often sits in the high 80s with his fastball. He commands his offspeed extremely well for his age, and that kind of success versus Low-A hitters can make a pitcher appear to be something he’s not. Thankfully, the left-hander is only 19 years old and has a frame that should allow him to easily add good weight as he continues developing physically. As he fills out, we should see his fastball tick up a few miles per hour. Considering he’s already analytically inclined, I trust the Orioles (wut???) and their new player development directors to help Rom reach his potential. I really am cautiously bullish on the outlook here moving forward.

With Rookie League and Short Season prospects in the process of accumulating their 2019 samples, our hope is to begin some preliminary clustering within the Dominican Summer League, Arizona League, Gulf Coast League, Pioneer League, Appalachian League, New York-Penn League and Northwest League relatively soon. We really think that clustering will pay its biggest dividends in those leagues throughout the summer.

For now, we think it’s best to keep our clustering variables to ourselves. We’re convinced we’re onto something big here, and the possibilities associated with clustering within the baseball world are practically endless. Because of this, we want to continue optimizing our process before divulging the finer points. We will, however, continue to trickle out information (and interesting prospects, of course) as often as we see fit.

Cluster analytics could eventually change the way we go about researching, evaluating and comparing prospects, especially in the low minors. We hope you join us on our journey to something new and better.


Featured image courtesy of photographer Patrick Cavey and
