Constructing My All-Time NBA Lineup Through K-Means Clustering
By: Nikhil Sharma
It's a debate that pervades all cultures, ages and creeds. From every barbershop to every lunch table in the world, the question rings out: what is your best all-time starting five? Who can construct the best lineup to put on the hardwood using any player from any time?
It's a question that will never have a definitive answer, unless we somehow find a way to travel through time. We could always run NBA 2K simulations, but we all know how dubious that game is at mirroring real-life basketball scenarios. The argument is also almost entirely subjective, since there is no consensus ranking of NBA players throughout history. Your dream lineup could differ wildly from someone else's based on when you grew up watching basketball, your affinity for statistical analyses of players versus the eye test, a preference for offense over defense (or vice versa), or many other factors.
I figured that I would bring a novel approach to the age-old question by tackling it with some machine learning concepts. K-Means Clustering is an unsupervised algorithm that aims to find k groups in unlabeled data based on feature similarity. Each group, or cluster, that the algorithm creates has a centroid, the average of the points in that cluster, whose features define the cluster. Each point in the data is then assigned to the cluster whose centroid is nearest. An example of an interesting application of the algorithm is using it to group NBA players with comparable play-styles based on their statistical similarities, as done here. I decided to take it a step further: instead of using all NBA players, what if we only clustered a subset of the absolute best NBA players, then chose players from our clusters of player types to come up with an optimal all-time starting lineup?
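To make the assign-then-update idea concrete, here is a minimal Python sketch of k-means (Lloyd's algorithm). It is only an illustration on six made-up 2-D points, not the code behind this analysis; the function name `kmeans`, its parameters, and the toy data are all assumptions of the sketch.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    # Start from k data points chosen at random (a simple, common initialization).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
    return labels, centroids


# Made-up toy data: two obvious groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(toy_points, k=2)
```

On the toy data the first three points end up sharing one label and the last three share the other. Which group gets label 0 depends on the random starting centroids.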
First, I scraped basic and advanced statistics from basketball-reference.com for the 1979-80 through 2017-18 seasons. I used these seasons because 1979-80 was the season in which the NBA introduced the 3-point line; thus, all of the advanced statistics for players from that season onwards are complete. Then, I scraped a list of 905 All-Star selections from the same range of seasons and matched each one to that player's statistics for the season. My rationale in using only the All-Stars for this analysis was to only have the best of the best available for me to ultimately choose my lineup from.
The data had 40 variables, which would have been too many features to use for clustering due to the potential for overfitting (the model would take into account all of the specific kinks in our data). Using principal component analysis (PCA), I reduced the 40 variables down to 3 components, which together explained 62.42% of the variance in the data. While most would generally reduce the data down to 2 components, the first 2 principal components explained only 49.87% of the variance in this case. Thus, I decided to use an extra component to carry out clustering, even if that meant possibly overfitting the data to some degree.
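For anyone curious where a figure like "3 components explain 62.42% of the variance" comes from, here is a hedged Python sketch using NumPy. It assumes the columns are standardized before PCA (the post does not say whether that was done), and the data is a synthetic stand-in, not the scraped stats. The function name is hypothetical.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # Assumption: standardize each column so no stat dominates because of its scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values of the standardized data are proportional to the
    # variance along each principal component (largest first).
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic stand-in (NOT the real data): 200 rows and 3 columns, where the
# second column is almost a rescaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2 * base + 0.05 * rng.normal(size=200),
    rng.normal(size=200),
])

ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)  # cumulative[i] = share kept by the first i + 1 components
```

Reading down `cumulative` shows how many components are needed to reach a given share of the variance, which is the trade-off weighed above between keeping 2 components and keeping 3.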
I then ran the k-means algorithm on the data. I settled on 7 as the number of clusters via silhouette scores, which compare how close a given point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. Since we want the clusters to be partitioned well, we want each cluster to be tight and far away from its neighbors, and therefore want a high average silhouette score, which was achieved with 7 clusters. Here is a visualization of the clusters in a 3-dimensional space, plotted based on the 3 principal components:
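As a side note on the silhouette score used to pick 7 clusters: for each point it takes a, the mean distance to the other members of its own cluster, and b, the mean distance to the members of the nearest other cluster, and computes (b - a) / max(a, b). A minimal Python sketch follows, using made-up points and a hypothetical function name. It assumes at least two clusters.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette score; values near 1 mean tight, well-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it drops out of the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up toy data: two obvious groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_score = mean_silhouette(toy_points, [0, 0, 0, 1, 1, 1])  # sensible grouping
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1, 0, 1])   # scrambled grouping
```

Choosing the number of clusters this way means running the clustering once per candidate k, scoring each result, and keeping the k with the highest average.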
The algorithm partitioned the players into 7 clusters of sizes 186, 115, 153, 75, 139, 95, and 142. These clusters explained 70.9% of the variance in the data (the ratio of the between-cluster sum of squares to the total sum of squares). I decided to look through the players in each cluster to get a sense of the kinds of players that the algorithm grouped together. Also, to help understand the clusters better, I found the player closest to the centroid of each cluster (by minimum Euclidean distance) to see which kind of player was closest to the defining features of that cluster. My summarized results for each cluster follow.
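As a brief aside before the per-cluster summaries, the two measurements just mentioned can be sketched in a few lines of Python. This is an illustration on made-up points with placeholder names, not the code used for the analysis. The function names are hypothetical.

```python
import math


def between_ss_ratio(points, labels):
    """between_SS / total_SS: share of total variance explained by the clustering."""
    dim = len(points[0])
    grand_mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    total_ss = sum(math.dist(p, grand_mean) ** 2 for p in points)

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    within_ss = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        within_ss += sum(math.dist(p, centroid) ** 2 for p in members)
    # total_SS = within_SS + between_SS, so between_SS is the remainder.
    return (total_ss - within_ss) / total_ss


def closest_to_centroid(names, points, labels):
    """For each cluster label, the name of the member nearest that cluster's centroid."""
    clusters = {}
    for name, p, lab in zip(names, points, labels):
        clusters.setdefault(lab, []).append((name, p))
    result = {}
    for lab, members in clusters.items():
        centroid = [sum(col) / len(members) for col in zip(*(p for _, p in members))]
        # Minimum Euclidean distance to the centroid picks the representative.
        result[lab] = min(members, key=lambda m: math.dist(m[1], centroid))[0]
    return result


# Made-up toy data with placeholder names (not real players or stats).
toy_names = ["player_a", "player_b", "player_c", "player_d", "player_e", "player_f"]
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = [0, 0, 0, 1, 1, 1]

ratio = between_ss_ratio(toy_points, toy_labels)
representatives = closest_to_centroid(toy_names, toy_points, toy_labels)
```

A ratio close to 1 means most of the spread in the data lies between clusters rather than inside them. The toy clusters are far better separated than real player data, so the toy ratio comes out much higher than the 70.9% reported above.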
Cluster 1: Either Not There Yet, or Already Been There
Closest-to-Centroid: Eddie Johnson (1981)
Notable Players: Allen Iverson (2009), Carmelo Anthony (2015-17), Clyde Drexler (1986, 93-94, 96), Dennis Johnson (1980-82, 85), Dwyane Wade (2014, 16), Isiah Thomas (1982, 87-92, 93), Jason Kidd (1998, 2000-02, 04, 08), Joe Johnson (2007-11, 14), John Wall (2014-16), Kobe Bryant (1998, 2016), Kyrie Irving (2013-14), Julius Erving (1986-87)
This cluster features a lot of players who were either young at the time and had not quite reached their apex (such as younger versions of Kobe, Isiah Thomas and Kyrie Irving) or old and declining (such as Dr. J and Allen Iverson in their last few seasons, and the Carmelo Anthony of recent years). Still, there are some pretty great players here, like Jason Kidd and Isiah Thomas in their respective primes.
Cluster 2: Cream of the Crop Big Men
Closest-to-Centroid: Annalittucks, Erneh!!!!!! (Charles Barkley, 1995)
Notable Players: Anthony Davis (2014, 17-18), Charles Barkley (1987-90, 92-93, 95-96), David Robinson (1990-96, 98, 2000), Hakeem Olajuwon (1985-90, 92-96), Kareem Abdul-Jabbar (1980-83, 85), Karl Malone (1988-89, 91-98, 2000), Kevin Garnett (2003-06), Moses Malone (1981-83, 85), Patrick Ewing (1989-92, 94), Shaquille O'Neal (1993-96, 98, 2000, 03-06), Tim Duncan (1998, 2000-05, 07)
Now, here is a cluster with some exciting results. The common thread that connects this group is fairly obvious: very high-quality big guys. You really couldn't go wrong with any choice here, as there are tons of dominant, MVP-caliber behemoths.
Cluster 3: Not-as-Good Big Men
Closest-to-Centroid: Marques Johnson (1986)
Notable Players: Amar'e Stoudemire (2009-11), Blake Griffin (2011-13), Chris Bosh (2006-08, 10-12), Chris Webber (2000-02), Jack Sikma (1981-85), James Worthy (1981, 86-89, 90), Jermaine O'Neal (2002-05, 07), Joel Embiid (2018), Kevin McHale (1984, 89-91), Kevin Garnett (1997-98, 2000-02, 07), LaMarcus Aldridge (2012-16, 18), Larry Bird (1980-81), Moses Malone (1986-89), Patrick Ewing (1988, 93, 95-96), Tim Duncan (2006, 2008-10, 13), Yao Ming (2006, 08-09)
There is a clear talent gap here in comparison to the previous cluster. Not only do we see inferior players here, but we also see some repeats of players in their statistically less productive seasons. Still, just as with cluster 1, there are some great players here (like one of my favorite players of all time, Chris Bosh).
Cluster 4: Jacks of All Trades
Closest-to-Centroid: Chris Mullin (1990)
Notable Players: Chris Paul (2008-09), Dirk Nowitzki (2003, 05-07), Dwyane Wade (2006-07, 09-11), Gilbert Arenas (2006), Grant Hill (1997), James Harden (2013, 15-18), Kawhi Leonard (2017), Kevin Durant (2010, 12-14, 16-18), Kobe Bryant (2003, 06-08), Larry Bird (1985-88), LeBron James (2005-18), Magic Johnson (1987, 1990), Michael Jordan (1985, 87-97), Russell Westbrook (2015-17)
Here are some of the guys that the basketball world would consider among the most elite, MVP-caliber players in the history of the game. Most of these players are guard/wing types, with the exception of the occasional Dirk Nowitzki (who, in many ways, statistically resembles a wing player in his scoring methods). Just as with cluster 2, most choices here are respectable stars.
Cluster 5: Pure Scorers
Closest-to-Centroid: Jim Paxson (1983)
Notable Players: Allen Iverson (2000-06, 08), Carmelo Anthony (2007-08, 10-14), Clyde Drexler (1988-91), DeMar DeRozan (2016-18), Dirk Nowitzki (2002, 04, 08-11), Dominique Wilkins (1986-91, 93-94), George Gervin (1980-84), Gilbert Arenas (2007), Kobe Bryant (2001-02, 04-05, 09, 11-13), Larry Bird (1984, 90), Michael Jordan (1998, 2002), Paul Pierce (2002-03, 06), Russell Westbrook (2011-13, 18), Stephon Marbury (2001), Steve Francis (2003), Tracy McGrady (2001-02, 04-07), Vince Carter (2000-01, 05-07)
In this cluster, we have a good number of high-volume scorers in their primes. In today's environment of NBA analysis where efficiency is heralded as king, I think this list of players would give Advanced NBA Twitter a collective aneurysm. Still, many of these players are bona fide studs, and a volume scorer can serve a great function on a team (especially in an all-time lineup, where we can assign their roles to our liking).
Cluster 6: Defensive Bigs
Closest-to-Centroid: Chris Bosh (2014)
Notable Players: Alonzo Mourning (2002), Andre Drummond (2016, 18), Ben Wallace (2004-06), Bill Laimbeer (1983-84, 87), Charles Oakley (1994), Chris Bosh (2013-14), Dennis Rodman (1990, 92), Dikembe Mutombo (1992, 95-98, 2000-02), Draymond Green (2016-18), Dwight Howard (2007, 13-14), Joakim Noah (2013-14), Kevin Garnett (2009, 10-11, 13), Robert Parish (1985-87, 90-91), Shaquille O'Neal (2007, 09), Tim Duncan (2011, 15)
A lot of the players we look back on for their defense pop up here, including Charles Oakley, Dennis Rodman, Bill Laimbeer and Ben Wallace. It is also interesting to note that some great big men in the later stages of their careers are present, such as past-their-primes Kevin Garnett and Tim Duncan. This makes sense, as a defensive-minded big is often the role a once-marquee big man plays once they take a backseat from their previous alpha role on a team.
Cluster 7: 3&D(istributor?)
Closest-to-Centroid: Peja Stojakovic (2002)
Notable Players: Chauncey Billups (2006-10), Chris Paul (2011-16), Damian Lillard (2014-15, 18), Deron Williams (2010-11), Gary Payton (1995-98, 2000), Gordon Hayward (2017), Jimmy Butler (2015), John Stockton (1989-97, 2000), Kemba Walker (2017-18), Klay Thompson (2015-16), Kyle Lowry (2016-18), Kyrie Irving (2015, 17-18), Magic Johnson (1980, 83-86, 88, 91), Paul George (2014, 16, 18), Paul Pierce (2005, 08, 10-11), Peja Stojakovic (2002-04), Ray Allen (2000-02, 04-06, 08-09, 11), Reggie Miller (1990, 95-96, 98, 2000), Scottie Pippen (1995-97), Stephen Curry (2014-15, 17-18), Steve Nash (2002-03, 05-06, 08, 10, 12)
This was probably the most fascinating cluster, as it had a mix of two somewhat distinct player types: distributors and 3&D wings. To further investigate, I carried out a method recommended by this article to extract feature importance in clusters. First, I constructed a random forest model that predicted the probability of every player belonging to the 7th cluster. Then, using a built-in function in the randomForest library in R, I saw which variables were most important in constructing the model. This showed me that in determining whether a player was in the 7th cluster or not, the two most important variables were 2-Pointers Attempted Per Game and 3-Point Attempt Rate (percentage of field goal attempts that were 3-pointers). Here are boxplots of those variables across all clusters:
As we can see, in most cases, players in cluster 7 took fewer 2-pointers than players in other clusters, and generally took far more 3-pointers than players in other clusters. Generally, floor generals might not take as many 2-pointers as other players since they are looking more to move the ball than score for themselves. However, they are often good spot-up shooters from the arc, since those are the open shots they get when the defense collapses. On the other hand, a sniper from 3-point land will obviously be more inclined to take more 3-pointers and fewer 2-pointers to help their team with that sweet extra point. Thus, it makes sense that the clustering algorithm would group dimers and 3-point specialists together.
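The comparison that the boxplots make can also be checked numerically. Below is a small Python sketch that computes 3-Point Attempt Rate, flags which rows belong to cluster 7 (the yes/no target the random forest was trained to predict), and takes per-cluster medians. The rows are made-up numbers for illustration only, not real player stats, and the function names are hypothetical. The random forest step itself is not reproduced here.

```python
from statistics import median

# Made-up rows for illustration only: (cluster, 2PA per game, 3PA per game).
rows = [
    (7, 8.0, 6.0), (7, 7.5, 5.5), (7, 9.0, 4.0),
    (2, 16.0, 0.2), (2, 15.0, 0.5),
    (5, 17.0, 3.0), (5, 16.5, 2.5),
]


def three_point_attempt_rate(two_pa, three_pa):
    """Share of field goal attempts that are 3-pointers: 3PA / (2PA + 3PA)."""
    return three_pa / (two_pa + three_pa)


def median_by_cluster(rows, stat):
    """Median of stat(two_pa, three_pa) within each cluster."""
    groups = {}
    for cluster, two_pa, three_pa in rows:
        groups.setdefault(cluster, []).append(stat(two_pa, three_pa))
    return {cluster: median(values) for cluster, values in groups.items()}


# One-vs-rest target: 1 if the row is in cluster 7, else 0.
in_cluster_7 = [1 if cluster == 7 else 0 for cluster, _, _ in rows]

rate_medians = median_by_cluster(rows, three_point_attempt_rate)
two_pa_medians = median_by_cluster(rows, lambda two_pa, three_pa: two_pa)
```

With real data, a cluster 7 median that sits well above the others for attempt rate and well below them for 2-point attempts would match what the boxplots show.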
After making sense of each cluster, I had to choose the five clusters that I would pick players from for my ultimate starting lineup. I did not consider players from clusters 1 and 3 in my lineup construction, since the players in those clusters were either inferior to players in other clusters or displayed redundant skillsets. Here are my picks:
G: Magic Johnson (1986)
Cluster 7: 3&D(istributor?)
Magic Johnson is, indisputably, one of the greatest point guards of all time. I wanted a floor general for my lineup who is not only an excellent passer but can also completely shut down the opposing point guard on defense. At 6'9" and 215 pounds, and in a season where he led the league in assists, Magic fits the perfect mold for my team's point guard.
G: Michael Jordan (1996)
Cluster 4: Jacks of All Trades
This was a tough one, and likely a divisive pick. Cluster 4 had absolutely excellent players, with the two highlights being LeBron James and Michael Jordan in their prime seasons. Now, my picking of MJ over LeBron here is not representative of my stake in the GOAT debate. I just thought that for this team, LeBron's talents might not be the greatest fit with Magic at the helm, especially considering how supremely talented they both are as passers. For my second guard spot, I valued a guy who could score from anywhere on the floor, no matter who is guarding him or what the time on the clock is. I think no one in the history of basketball fits this profile better than Michael Jeffrey Jordan (in my Skip Bayless voice). And what better version of MJ to pick than him in the 1995-96 season, when he led the Bulls to a legendary 72-10 record and finished the job in the playoffs (something that a certain other team couldn't do)?
F: Larry Bird (1984)
Cluster 5: Pure Scorers
I really wanted a player here who could play the ultimate second banana to Michael Jordan. I wanted the absolute peak form of the 3&D archetype, so that this lineup could have one player who could always be a consistent scorer on the wing via spot-up shooting or the occasional iso, so as to take some of the pressure off MJ. Think of Klay Thompson, but on steroids. To fill this role, I took 1984 MVP Larry Bird. In looking at the players in the "Pure Scorers" cluster, the Laker stan in me really wanted to take the rightful MVP of the 2006 NBA season, Kobe Bryant. However, Bird was the better shooter, would likely be more willing to accept a secondary role on a team, and also had a bigger body to guard opposing wings. Am I relegating Larry Bird to a role that is far beneath his stature? Perhaps. But I feel putting him in a position like this would maximize his talents to the nth degree, since he would not have to carry the load he was used to on this team.
F: Draymond Green (2017)
Cluster 6: Defensive Bigs
Here is my truly controversial pick. I definitely wanted a player from the "Defensive Bigs" cluster in my lineup, since having a good defender in the interior is so important in the NBA game. However, I wanted someone with at least a little bit of offensive prowess, just so that he could hit the occasional three or make a nice wing pass every once in a while. Thus, I narrowed my selections down to Draymond Green and Robert Parish. From here, I had to decide which I valued more: a defensive Swiss Army knife with good passing ability or a quick paint presence who could stretch the floor. Ultimately, I went with 2017 Defensive Player of the Year Draymond Green. I am certainly influenced by the "switch everything" philosophy that has permeated basketball in recent years, but I do think it is a sound defensive strategy that is bound to lead to success. And with this lineup, we now have four relatively big-bodied players who can switch 1-4 pretty easily, anchored by one of the smartest, most vocal defensive leaders I have personally witnessed play basketball. On top of his defensive prowess, he can also step out beyond the arc every now and then, and also make some truly excellent passes in the flow of the offense.
C: Shaquille O'Neal (2000)
Cluster 2: Cream of the Crop Big Men
One of the most dominant big men of all time. Need I say more? Shaquille O'Neal, in his MVP season, was a force to be reckoned with on both ends of the floor. Having Shaq in the paint is one of the best insurance policies this lineup could have. Jordan and Bird both having off nights? Feed the ball to Shaq down low and let him feast. In his prime, Shaq could also hold his own against some of the greatest big men of all time. A crafty player somehow gets through the rotating carousel of my four elite defenders? Let him try to finish over Shaq in the paint. Especially given that this was the season he turned up his defensive intensity and earned his first NBA All-Defensive Team selection, he would be quite menacing for opposing players near the basket.
Overall, I think my lineup would stack up well against anyone else's all-time starting five. On the offensive side, I have three of the most dominant bucket-getters in the history of basketball in Jordan, Bird and Shaq, arguably the greatest floor general of all time in Magic and an excellent secondary distributor to keep the ball rolling in Draymond. On the defensive end, I have four of the smartest and most tenacious defenders to ever play the game, with enough size to seamlessly switch onto any guard or forward thrown at them, PLUS The Big Aristotle lurking in the paint. I promise you, no one is scoring on this lineup.
If we go by the clusters, my lineup has an amazing distributor, a jack of all trades, a pure scorer, a defensive big and a cream-of-the-crop big. If this isn't a formula for a wonderfully balanced all-time starting lineup, I don't know what is. I am quite pleased with the team I have constructed, and would be willing to take on a debate with any challengers who have a starting five that they think could beat mine.
Really though, who is scoring on these guys?
If you're interested in my code, you can check it out here. I've included my data as well, so that you can take a stab at this and try to construct a lineup to rival mine if you'd like.