
Author | 高羊羊羊羊羊杨
Source | CSDN Blog
Header image | Paid download from Visual China Group (视觉中国)
Produced by | CSDN (ID: CSDNnews)
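Not long ago, Lakers star Kobe Bryant died in a tragic accident. For countless fans it was a bolt from the blue. He defied the odds and still could not change his fate, yet fate never managed to change him, and the Mamba spirit will live on. With the arrival of the big-data era, it seems that almost anything can be tied to the words "big data." Big-data analysis has long been widely used in athletes' career planning, medicine, finance, and other fields. In this article we use Python to analyze Kobe's career from several angles, as a tribute to the "Boss" (老大)!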
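Background
That day, in the early hours of January 27, 2020, I couldn't sleep. I tossed and turned until 4 a.m., then unlocked my phone, squinted at the glaring screen, and started scrolling through Weibo, only to come across shocking news: Kobe Bryant had died in an accident. Normally I would have reported the post on the spot, since clickbait deserves what it gets, but more and more similar posts kept appearing, until I finally saw the official announcement.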

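As my caption put it: I have never seen Los Angeles at four in the morning, but at four in the morning I learned that you were gone. 1978-2020.
As fans, all we can do is mourn and remember. Don't spread rumors, and don't exploit the "Mamba spirit."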

Data Acquisition
Source: the official NBA data set covering nearly twenty years of Kobe Bryant's career (fairly large, roughly 30,000 rows).

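Data Processing
Browsing the file, it is easy to see that many values are missing. The blunt approach is to delete every row that contains a missing value, but that has to be weighed against sample completeness and the accuracy of the predictions.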
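Before deciding how to handle the gaps, it helps to count them per column. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`, with invented values) in place of the real file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shot log; the values are invented for illustration.
toy = pd.DataFrame({
    'shot_distance': [15, 22, 0, 18],
    'shot_made_flag': [1.0, np.nan, 0.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Blunt option: drop every row that has any missing value.
complete_rows = toy.dropna()
print(len(toy), '->', len(complete_rows))
```

On the toy data, half of the rows disappear with the blunt option, which is the trade-off described above.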

First, we run a simple outlier check on shot distance, visualized as a box plot.
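# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

catering_sale = '2.csv'
data = pd.read_csv(catering_sale, index_col='shot_id')  # read the data, using the "shot_id" column as the index

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

plt.figure()  # create the figure
p = data.boxplot(return_type='dict')  # draw the box plot directly with the DataFrame method
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = p['fliers'][0].get_ydata()
y.sort()  # sort ascending; this modifies the array in place
# the author's run reported 30687 rows in total
print('{} rows in total, of which {} are outliers'.format(len(data), len(y)))

# Label each outlier with annotate.
# Labels of nearby points overlap and become hard to read, so the text position is nudged.
for i in range(len(x)):
    if i > 0:
        gap = y[i] - y[i - 1]
        shift = 0.8 / gap if gap != 0 else 0  # avoid dividing by zero when two outliers share a value
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - shift, y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # display the box plot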
We get the following result:

By this check, the column contains 68 outliers. Here we delete the rows that contain them; the other columns are handled the same way.
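Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly and drops the flagged values. The helper name `iqr_outlier_mask` and the sample distances are assumptions made for illustration, not part of the original analysis:

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Return a boolean mask that is True where a value lies outside the box-plot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Invented distances: mostly ordinary shots plus one half-court heave.
distances = np.array([2, 5, 8, 12, 15, 18, 20, 24, 26, 70])
mask = iqr_outlier_mask(distances)
kept = distances[~mask]  # the same mask can index DataFrame rows
print(distances[mask])
```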

Data Integration
Load the data, then filter it and add new columns as needed.
import pandas as pd

allData = pd.read_csv('data.csv')
# keep only the rows whose shot outcome is known
data = allData[allData['shot_made_flag'].notnull()].reset_index()

# add new columns
data['game_date_DT'] = pd.to_datetime(data['game_date'])
data['dayOfWeek'] = data['game_date_DT'].dt.dayofweek
data['dayOfYear'] = data['game_date_DT'].dt.dayofyear
data['secondsFromPeriodEnd'] = 60 * data['minutes_remaining'] + data['seconds_remaining']
data['secondsFromPeriodStart'] = 60 * (11 - data['minutes_remaining']) + (60 - data['seconds_remaining'])
data['secondsFromGameStart'] = (data['period'] <= 4).astype(int) * (data['period'] - 1) * 12 * 60 + (
        data['period'] > 4).astype(int) * ((data['period'] - 4) * 5 * 60 + 3 * 12 * 60) + data['secondsFromPeriodStart']

'''
Where:
secondsFromPeriodEnd    seconds left until the end of the period
secondsFromPeriodStart  seconds elapsed since the start of the period
secondsFromGameStart    seconds elapsed since the start of the game
'''

# sanity-check the data
print(data.loc[:10, ['period', 'minutes_remaining', 'seconds_remaining', 'secondsFromGameStart']])
Running it gives the following result:
Everything looks normal.
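The `secondsFromGameStart` formula is easier to check on single values. Here is a scalar sketch of the same arithmetic; the function name `seconds_from_game_start` is introduced only for this illustration:

```python
def seconds_from_game_start(period, minutes_remaining, seconds_remaining):
    """Scalar version of the secondsFromGameStart column formula."""
    seconds_from_period_start = 60 * (11 - minutes_remaining) + (60 - seconds_remaining)
    if period <= 4:
        # regulation periods last 12 minutes each
        offset = (period - 1) * 12 * 60
    else:
        # overtime periods last 5 minutes; the 3 * 12 * 60 term compensates for
        # seconds_from_period_start assuming a 12-minute clock
        offset = (period - 4) * 5 * 60 + 3 * 12 * 60
    return offset + seconds_from_period_start

print(seconds_from_game_start(1, 11, 59))  # first second of the game
print(seconds_from_game_start(4, 0, 0))    # end of regulation
print(seconds_from_game_start(5, 4, 59))   # first second of overtime
```

Regulation ends at 4 * 12 * 60 = 2880 seconds, and the first overtime second lands right after it, which is what the red period lines in the later plots rely on.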

Plotting Shot Attempts
Plot shot attempts as a function of time elapsed since the start of the game.
Here we use the matplotlib package.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (16, 16)
plt.rcParams['font.size'] = 16
binsSizes = [24, 12, 6]
plt.figure()

# one subplot per bin size
for k, binSizeInSeconds in enumerate(binsSizes):
    timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
    attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)

    maxHeight = max(attemptsAsFunctionOfTime) + 30
    barWidth = 0.999 * (timeBins[1] - timeBins[0])
    plt.subplot(len(binsSizes), 1, k + 1)
    plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
    plt.title(str(binSizeInSeconds) + ' second time bins')
    # red lines mark the end of each period and overtime
    plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
                  4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
    plt.xlim((-20, 3200))
    plt.ylim((0, maxHeight))
    plt.ylabel('attempts')
plt.xlabel('time [seconds from start of game]')
plt.show()
Here is the result:

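We can see that Kobe's shot attempts tend to increase as the game goes on.
Plotting the Accuracy Comparison
Here we plot attempts and accuracy side by side to judge how accurate Kobe's shooting was.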
# Plot shooting accuracy as a function of time during the game.
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 16

binSizeInSeconds = 20
timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)
madeAttemptsAsFunctionOfTime, b = np.histogram(data.loc[data['shot_made_flag'] == 1, 'secondsFromGameStart'],
                                               bins=timeBins)
attemptsAsFunctionOfTime[attemptsAsFunctionOfTime < 1] = 1  # avoid dividing by zero in empty bins
accuracyAsFunctionOfTime = madeAttemptsAsFunctionOfTime.astype(float) / attemptsAsFunctionOfTime
accuracyAsFunctionOfTime[attemptsAsFunctionOfTime <= 50] = 0  # zero accuracy in bins that don't have enough samples

maxHeight = max(attemptsAsFunctionOfTime) + 30
barWidth = 0.999 * (timeBins[1] - timeBins[0])

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
plt.ylim((0, maxHeight))

# y-axis of the upper plot: number of attempts
plt.ylabel('attempts')
plt.title(str(binSizeInSeconds) + ' second time bins')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], accuracyAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
# y-axis of the lower plot: accuracy
plt.ylabel('accuracy')
plt.xlabel('time [seconds from start of game]')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0.0, ymax=0.7, colors='r')
plt.show()
Let's see how it looks:

The analysis shows that Kobe's shooting accuracy hovers around 0.4, but that alone is not the insight we are after.
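The accuracy per time bin is simply made shots divided by attempts in that bin. A tiny sketch on invented data (the arrays `shot_times` and `made` are assumptions for illustration) shows the two histograms and the division:

```python
import numpy as np

# Invented shots: time of each attempt (seconds) and whether it went in.
shot_times = np.array([5, 12, 18, 25, 31, 38, 44, 52])
made = np.array([1, 0, 1, 1, 0, 0, 1, 0])

bins = np.arange(0, 61, 20)  # three 20-second bins
attempts, _ = np.histogram(shot_times, bins=bins)
makes, _ = np.histogram(shot_times[made == 1], bins=bins)

# Guard against empty bins before dividing.
accuracy = makes / np.maximum(attempts, 1)
print(attempts, makes, accuracy)
```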
To mine the data further, we need to bring in some algorithms.

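GMM Clustering
So what is GMM clustering?
GMM is short for Gaussian mixture model. Roughly speaking, the idea is that a distribution can be viewed as the combined result of several Gaussian distributions, so a complicated distribution can be approximated by a mixture of Gaussians.
Many natural phenomena follow a Gaussian (normal) distribution. In practice, though, a distribution is shaped by several factors, some of them man-made. Each factor may give rise to its own Gaussian, and combining the factors yields a mixture of Gaussians. (This is my personal understanding.)
Hence the principle behind GMM clustering: estimate the means and variances of K Gaussians from the samples, which fixes the K Gaussian models. During clustering, a sample is not assigned outright to one class; instead we compute the probability that it belongs to each distribution.
A Gaussian mixture is usually fitted with the EM algorithm as its likelihood estimation method.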
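To make the "probability of belonging to each distribution" concrete, here is a one-dimensional sketch of the E-step of EM, which computes those probabilities for a fixed mixture. The helper names and the two-component mixture are invented for illustration; the analysis itself relies on scikit-learn's implementation:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """E-step of EM: probability that x came from each component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture: close-range shots around 3 ft, long-range shots around 23 ft.
weights, means, stds = [0.5, 0.5], [3.0, 23.0], [2.0, 2.0]
print(responsibilities(4.0, weights, means, stds))   # almost certainly the first component
print(responsibilities(13.0, weights, means, stds))  # exactly halfway: a 50/50 split
```

The M-step, not shown here, would re-estimate the weights, means, and standard deviations from these probabilities, and EM alternates the two steps until the fit stops improving.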
'''
Now let's continue our initial exploration and look at the spatial location of Kobe's shots.
We do this by building a Gaussian mixture model that tries to give a compact summary of his shot locations.
Cluster Kobe's shot attempts by court location with a GMM.
'''
from sklearn import mixture

numGaussians = 13
gaussianMixtureModel = mixture.GaussianMixture(n_components=numGaussians, covariance_type='full',
                                               init_params='kmeans', n_init=50,
                                               verbose=0, random_state=5)
gaussianMixtureModel.fit(data.loc[:, ['loc_x', 'loc_y']])

# add the GMM cluster as a column of the data set
data['shotLocationCluster'] = gaussianMixtureModel.predict(data.loc[:, ['loc_x', 'loc_y']])

Court Visualization
Here we borrow the draw_court function from MichaelKrueger's excellent script.
The draw_court function
from matplotlib.patches import Circle, Rectangle, Arc


def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    # if no axes object is provided to plot onto, get the current one
    if ax is None:
        ax = plt.gca()

    # create the various parts of an NBA basketball court

    # create the hoop
    # the hoop's diameter is 18 in, so its radius is 9 in,
    # which is 7.5 in this coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # create the backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # the paint
    # outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # free throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)

    # free throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')

    # restricted zone: an arc with a 4ft radius from the center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # three-point line
    # side 3pt lines, 14ft long
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)

    # 3pt arc: centered on the hoop, 23'9" away from it
    # adjust theta1 and theta2 until the arc lines up with the side lines
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # center court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # list of the court elements to draw onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # draw the half-court line, baseline and side lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # add the court elements to the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax
2D Gaussian Plot
Define a function that draws the 2D Gaussians.
Draw2DGaussians
import matplotlib as mpl


def Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages):
    fig, h = plt.subplots()
    for i, (mean, covarianceMatrix) in enumerate(zip(gaussianMixtureModel.means_, gaussianMixtureModel.covariances_)):
        # get the eigenvalues and eigenvectors of the covariance matrix
        v, w = np.linalg.eigh(covarianceMatrix)
        v = 2.5 * np.sqrt(v)  # go to units of standard deviation instead of variance

        # compute the ellipse angle and the two axis lengths, then draw it
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan(u[1] / u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        # angle is passed by keyword; recent matplotlib versions no longer accept it positionally
        currEllipse = mpl.patches.Ellipse(mean, v[0], v[1], angle=180 + angle, color=ellipseColors[i])
        currEllipse.set_alpha(0.5)
        h.add_artist(currEllipse)
        h.text(mean[0] + 7, mean[1] - 1, ellipseTextMessages[i], fontsize=13, color='blue')
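The link between a covariance matrix and an ellipse can be checked on a hand-made example. The sketch below uses a hypothetical diagonal covariance and the column convention of `np.linalg.eigh` (eigenvector `i` is column `i`), which is one common way to get the orientation and is not necessarily identical to the shortcut used in Draw2DGaussians:

```python
import numpy as np

# Hypothetical covariance of one cluster: wide along x, narrow along y.
covariance = np.array([[4.0, 0.0],
                       [0.0, 1.0]])

# eigh returns eigenvalues in ascending order with matching eigenvector columns.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Variance -> standard deviation, scaled by 2.5 as in Draw2DGaussians.
axis_lengths = 2.5 * np.sqrt(eigenvalues)

# The longest axis points along the eigenvector of the largest eigenvalue.
major_direction = eigenvectors[:, -1]
angle_degrees = np.degrees(np.arctan2(major_direction[1], major_direction[0])) % 180
print(axis_lengths, angle_degrees)
```

With variances 4 and 1 the standard deviations are 2 and 1, so the scaled axis lengths come out as 5 and 2.5, with the long axis lying along x.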
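Now we draw the 2D Gaussian plot of shot attempts. Each ellipse is a contour whose axes are scaled to 2.5 standard deviations of its Gaussian, and each blue number is the percentage of shots attributed to that Gaussian.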
# show the Gaussian mixture ellipses for shot attempts
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

ellipseTextMessages = [str(100 * gaussianMixtureModel.weights_[x])[:4] + '%' for x in range(numGaussians)]
ellipseColors = ['red', 'green', 'purple', 'cyan', 'magenta', 'yellow', 'blue', 'orange', 'silver', 'maroon', 'lime',
                 'olive', 'brown', 'darkblue']
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot attempts')
plt.show()
Here is the result:

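The colored 2D Gaussian plot shows that Kobe took more shot attempts from the left side of the court (the right side from his point of view). This is probably because he is right-handed. We can also see that a large share of his attempts (16.8%) came from directly under the basket, and a further 5.06% came from very close to it.
It doesn't look perfect, but it does show some useful things.
Next we plot the shooting accuracy of each Gaussian cluster. The blue numbers now show the accuracy of the shots taken from that cluster, which gives a feel for which shots are easy and which are hard.
For each cluster, compute its accuracy and plot it.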
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

variableCategories = data['shotLocationCluster'].value_counts().index.tolist()

# accuracy of each cluster = made shots / attempted shots
clusterAccuracy = {}
for category in variableCategories:
    shotsAttempted = np.array(data['shotLocationCluster'] == category).sum()
    shotsMade = np.array(data.loc[data['shotLocationCluster'] == category, 'shot_made_flag'] == 1).sum()
    clusterAccuracy[category] = float(shotsMade) / shotsAttempted

ellipseTextMessages = [str(100 * clusterAccuracy[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot accuracy')
plt.show()
Here is the resulting plot:

We can clearly see the relationship between shot distance and accuracy.
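The per-cluster accuracy can also be expressed as a pandas group-by, since the mean of a 0/1 flag is the share of made shots. This is an alternative sketch on an invented table (`toy`), not the code that produced the plot:

```python
import pandas as pd

# Invented shots: cluster label and outcome.
toy = pd.DataFrame({
    'shotLocationCluster': [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'shot_made_flag':      [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 flag within each group is that group's accuracy.
cluster_accuracy = toy.groupby('shotLocationCluster')['shot_made_flag'].mean().to_dict()
print(cluster_accuracy)
```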
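Plotting the 2D Spatio-Temporal Chart
Another interesting fact: Kobe not only took more shots from the right side (from his point of view), he was also better at those shots.
Now let's plot a 2D spatio-temporal chart of Kobe's career. The x-axis is the time elapsed since the start of the game; the y-axis is the cluster index of Kobe's shots (sorted by cluster accuracy); the intensity of each cell reflects the number of attempts Kobe took from that cluster at that time; the red vertical lines separate the periods of the game.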
import operator

# Plot the 2D spatio-temporal histogram of Kobe's games over his whole career
plt.rcParams['figure.figsize'] = (18, 10)  # figure size
plt.rcParams['font.size'] = 18  # font size

# sort the clusters by their accuracy
sortedClustersByAccuracyTuple = sorted(clusterAccuracy.items(), key=operator.itemgetter(1), reverse=True)
sortedClustersByAccuracy = [x[0] for x in sortedClustersByAccuracyTuple]

binSizeInSeconds = 12
timeInUnitsOfBins = ((data['secondsFromGameStart'] + 0.0001) / binSizeInSeconds).astype(int)
locationInUintsOfClusters = np.array(
    [sortedClustersByAccuracy.index(data.loc[x, 'shotLocationCluster']) for x in range(data.shape[0])])

# build the spatio-temporal histogram of Kobe's games
shotAttempts = np.zeros((gaussianMixtureModel.n_components, 1 + max(timeInUnitsOfBins)))
for shot in range(data.shape[0]):
    shotAttempts[locationInUintsOfClusters[shot], timeInUnitsOfBins[shot]] += 1

# stretch the y-axis so that each cluster row is easier to see
shotAttempts = np.kron(shotAttempts, np.ones((5, 1)))

# positions where each period ends
vlinesList = 0.5001 + np.array([0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60]).astype(
    int) / binSizeInSeconds

plt.figure(figsize=(13, 8))  # width and height
plt.imshow(shotAttempts, cmap='copper', interpolation="nearest")  # nearest-neighbor interpolation keeps cell edges sharp
plt.xlim(0, float(4 * 12 * 60 + 6 * 60) / binSizeInSeconds)
plt.vlines(x=vlinesList, ymin=-0.5, ymax=shotAttempts.shape[0] - 0.5, colors='r')
plt.xlabel('time from start of game [sec]')
plt.ylabel('cluster (sorted by accuracy)')
plt.show()
Here is the output:

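The clusters are sorted by accuracy in descending order. High-accuracy shots are at the top, and low-accuracy half-court shots are at the bottom. We can now see that the "last-second shots" in the first, second, and third periods are in fact desperation shots from very far away. Interestingly, though, in the fourth period the last-second shots do not belong to the desperation cluster but to the regular three-point cluster (still hard to make, but not hopeless).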
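The core of the chart is a count of shots per (cluster, time bin) cell. As a sketch of that counting step on invented indices, `np.add.at` accumulates the counts without an explicit Python loop; the array names here are assumptions for illustration:

```python
import numpy as np

# Invented data: for each shot, its cluster row and its time-bin column.
cluster_rows = np.array([0, 0, 1, 2, 2, 2])
time_bins = np.array([0, 3, 3, 1, 1, 4])

grid = np.zeros((3, 5))
# np.add.at accumulates correctly even when the same cell appears more than once.
np.add.at(grid, (cluster_rows, time_bins), 1)
print(grid)
```

A plain `grid[cluster_rows, time_bins] += 1` would count a repeated cell only once, which is why the unbuffered `np.add.at` is used here.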
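In the analysis that follows, we assess shot difficulty from the attributes of each shot (such as shot type and shot distance).
Next we create a new table for the shot difficulty model.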
def FactorizeCategoricalVariable(inputDB, categoricalVarName):
    # one-hot encode a categorical column: one 0/1 column per category
    opponentCategories = inputDB[categoricalVarName].value_counts().index.tolist()

    outputDB = pd.DataFrame()
    for category in opponentCategories:
        featureName = categoricalVarName + ': ' + str(category)
        outputDB[featureName] = (inputDB[categoricalVarName] == category).astype(int)

    return outputDB


featuresDB = pd.DataFrame()
# a matchup without '@' is a home game
featuresDB['homeGame'] = data['matchup'].apply(lambda x: 1 if (x.find('@') < 0) else 0)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'opponent')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'action_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'combined_shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_basic')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_area')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_range')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shotLocationCluster')], axis=1)

featuresDB['playoffGame'] = data['playoffs']
featuresDB['locX'] = data['loc_x']
featuresDB['locY'] = data['loc_y']
featuresDB['distanceFromBasket'] = data['shot_distance']
featuresDB['secondsFromPeriodEnd'] = data['secondsFromPeriodEnd']

# encode periodic features as a point on a circle
featuresDB['dayOfWeek_cycX'] = np.sin(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['dayOfWeek_cycY'] = np.cos(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['timeOfYear_cycX'] = np.sin(2 * np.pi * (data['dayOfYear'] / 365))
featuresDB['timeOfYear_cycY'] = np.cos(2 * np.pi * (data['dayOfYear'] / 365))

labelsDB = data['shot_made_flag']
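The sine and cosine pairs used for day of week and time of year place a periodic value on the unit circle, so the last and first values of the cycle end up next to each other. A small sketch of why that helps, with helper names introduced only for this illustration:

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    (sa, ca), (sb, cb) = cyclic_encode(a, period), cyclic_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

# pandas dayofweek runs from 0 (Monday) to 6 (Sunday).
print(circle_distance(6, 0, 7))  # Sunday vs Monday: neighbours on the circle
print(circle_distance(0, 1, 7))  # Monday vs Tuesday: same spacing
print(circle_distance(0, 3, 7))  # Monday vs Thursday: far apart
```

With the raw integer encoding, Sunday (6) and Monday (0) would look six days apart even though they are adjacent.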
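Build a model on the featuresDB table and make sure it does not overfit (that is, the training error stays about the same as the validation error).
We use an ExtraTrees classifier.
The model is kept simple so that it does not overfit.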
import time

from sklearn import ensemble, model_selection
from sklearn.metrics import log_loss, accuracy_score as accuracy

randomSeed = 1
numFolds = 4

stratifiedCV = model_selection.StratifiedKFold(n_splits=numFolds, shuffle=True, random_state=randomSeed)

mainLearner = ensemble.ExtraTreesClassifier(n_estimators=500, max_depth=5,
                                            min_samples_leaf=120, max_features=120,
                                            criterion='entropy', bootstrap=False,
                                            n_jobs=-1, random_state=randomSeed)

startTime = time.time()
trainAccuracy = []
validAccuracy = []
trainLogLosses = []
validLogLosses = []
for trainInds, validInds in stratifiedCV.split(featuresDB, labelsDB):
    # split into training and validation sets
    X_train_CV = featuresDB.iloc[trainInds, :]
    y_train_CV = labelsDB.iloc[trainInds]
    X_valid_CV = featuresDB.iloc[validInds, :]
    y_valid_CV = labelsDB.iloc[validInds]

    # train
    mainLearner.fit(X_train_CV, y_train_CV)

    # predict
    y_train_hat_mainLearner = mainLearner.predict_proba(X_train_CV)[:, 1]
    y_valid_hat_mainLearner = mainLearner.predict_proba(X_valid_CV)[:, 1]

    # store the results
    trainAccuracy.append(accuracy(y_train_CV, y_train_hat_mainLearner > 0.5))
    validAccuracy.append(accuracy(y_valid_CV, y_valid_hat_mainLearner > 0.5))
    trainLogLosses.append(log_loss(y_train_CV, y_train_hat_mainLearner))
    validLogLosses.append(log_loss(y_valid_CV, y_valid_hat_mainLearner))

print("-----------------------------------------------------")
print("total (train,valid) Accuracy = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainAccuracy), np.mean(validAccuracy), (time.time() - startTime) / 60))
print("total (train,valid) Log Loss = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainLogLosses), np.mean(validLogLosses), (time.time() - startTime) / 60))
print("-----------------------------------------------------")

# refit on all the data and store the predicted make probability as the shot difficulty
mainLearner.fit(featuresDB, labelsDB)
data['shotDifficulty'] = mainLearner.predict_proba(featuresDB)[:, 1]

# to get some insight, let's look at the feature importances
featureInds = mainLearner.feature_importances_.argsort()[::-1]
featureImportance = pd.DataFrame({'featureName': featuresDB.columns[featureInds],
                                  'importanceET': mainLearner.feature_importances_[featureInds]})

print(featureImportance.iloc[:30, :])
Let's see the output:
total (train,valid) Accuracy = (0.67912,0.67860). took 0.29 minutes
total (train,valid) Log Loss = (0.60812,0.61100). took 0.29 minutes
-----------------------------------------------------
                         featureName  importanceET
0             action_type: Jump Shot      0.578036
1            action_type: Layup Shot      0.173274
2           combined_shot_type: Dunk      0.113341
3                           homeGame     0.0288043
4             action_type: Dunk Shot     0.0161591
5             shotLocationCluster: 9     0.0136386
6          combined_shot_type: Layup    0.00949568
7                 distanceFromBasket     0.0084703
8         shot_zone_range: 16-24 ft.     0.0072107
9        action_type: Slam Dunk Shot    0.00690316
10     combined_shot_type: Jump Shot    0.00592586
11              secondsFromPeriodEnd    0.00589391
12    action_type: Running Jump Shot    0.00544904
13           shotLocationCluster: 11    0.00449125
14                              locY    0.00388509
15   action_type: Driving Layup Shot    0.00364757
16  shot_zone_range: Less Than 8 ft.    0.00349615
17      combined_shot_type: Tip Shot    0.00260399
18         shot_zone_area: Center(C)     0.0011585
19                     opponent: DEN   0.000882106
20    action_type: Driving Dunk Shot   0.000848156
21  shot_zone_basic: Restricted Area   0.000650022
22            shotLocationCluster: 2   0.000513476
23             action_type: Tip Shot   0.000489918
24        shot_zone_basic: Mid-Range   0.000487306
25     action_type: Pullup Jump shot   0.000453641
26         shot_zone_range: 8-16 ft.   0.000452574
27                   timeOfYear_cycX   0.000432267
28                    dayOfWeek_cycX    0.00039668
29            shotLocationCluster: 8   0.000254077

Process finished with exit code 0
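For reference, the two metrics in that output can be computed by hand. The following sketch uses invented labels and probabilities, and the helper names are assumptions for illustration; the analysis itself uses scikit-learn's implementations:

```python
import math

def accuracy_at_half(y_true, p_pred):
    """Share of predictions that fall on the correct side of 0.5."""
    hits = sum((p > 0.5) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Invented labels and predicted make probabilities.
labels = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.1]
print(accuracy_at_half(labels, probs))
print(binary_log_loss(labels, probs))
print(binary_log_loss(labels, [0.5] * 4))  # a coin-flip model scores ln 2
```

A model that always predicts 0.5 scores ln 2, about 0.693, so the reported validation log loss of 0.611 is better than a coin flip, and the small gap between training and validation scores is the sign that the model is not overfitting.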
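Here I want to look at some aspects of Kobe Bryant's decision-making. To do that, we collect two different groups of shots and analyze the differences between them:
- shots taken right after a made shot
- shots taken right after a missed shot
I collected the data for each group according to whether Kobe made or missed his previous shot.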
timeBetweenShotsDict = {}
timeBetweenShotsDict['madeLast'] = []
timeBetweenShotsDict['missedLast'] = []

changeInDistFromBasketDict = {}
changeInDistFromBasketDict['madeLast'] = []
changeInDistFromBasketDict['missedLast'] = []

changeInShotDifficultyDict = {}
changeInShotDifficultyDict['madeLast'] = []
changeInShotDifficultyDict['missedLast'] = []

afterMadeShotsList = []
afterMissedShotsList = []

for shot in range(1, data.shape[0]):

    # make sure the current shot and the previous shot are in the same period of the same game
    sameGame = data.loc[shot, 'game_date'] == data.loc[shot - 1, 'game_date']
    samePeriod = data.loc[shot, 'period'] == data.loc[shot - 1, 'period']

    if samePeriod and sameGame:
        madeLastShot = data.loc[shot - 1, 'shot_made_flag'] == 1
        missedLastShot = data.loc[shot - 1, 'shot_made_flag'] == 0

        timeDifferenceFromLastShot = data.loc[shot, 'secondsFromGameStart'] - data.loc[shot - 1, 'secondsFromGameStart']
        distDifferenceFromLastShot = data.loc[shot, 'shot_distance'] - data.loc[shot - 1, 'shot_distance']
        shotDifficultyDifferenceFromLastShot = data.loc[shot, 'shotDifficulty'] - data.loc[shot - 1, 'shotDifficulty']

        # check for corrupt data points (assuming all samples should have been chronologically ordered)
        if timeDifferenceFromLastShot < 0:
            continue

        if madeLastShot:
            timeBetweenShotsDict['madeLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['madeLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['madeLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMadeShotsList.append(shot)

        if missedLastShot:
            timeBetweenShotsDict['missedLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['missedLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['missedLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMissedShotsList.append(shot)

afterMissedData = data.iloc[afterMissedShotsList, :]
afterMadeData = data.iloc[afterMadeShotsList, :]

shotChancesListAfterMade = afterMadeData['shotDifficulty'].tolist()
totalAttemptsAfterMade = afterMadeData.shape[0]
totalMadeAfterMade = np.array(afterMadeData['shot_made_flag'] == 1).sum()

shotChancesListAfterMissed = afterMissedData['shotDifficulty'].tolist()
totalAttemptsAfterMissed = afterMissedData.shape[0]
totalMadeAfterMissed = np.array(afterMissedData['shot_made_flag'] == 1).sum()
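The same "previous shot in the same game and period" pairing can be sketched with a grouped `shift`, which avoids the explicit loop. This is an alternative illustration on an invented shot log (`toy`); unlike the loop, it does not skip pairs with a negative time difference:

```python
import pandas as pd

# Invented shot log, already in chronological order.
toy = pd.DataFrame({
    'game_date':      ['g1', 'g1', 'g1', 'g1', 'g2', 'g2'],
    'period':         [1, 1, 1, 2, 1, 1],
    'shot_made_flag': [1, 0, 1, 1, 0, 1],
})

# Outcome of the previous shot within the same game and period;
# NaN for the first shot of each group.
toy['prev_made'] = toy.groupby(['game_date', 'period'])['shot_made_flag'].shift(1)

after_made = toy[toy['prev_made'] == 1]
after_missed = toy[toy['prev_made'] == 0]
print(len(after_made), after_made['shot_made_flag'].mean())
print(len(after_missed), after_missed['shot_made_flag'].mean())
```

Comparing the make rate of the two groups is the simplest version of the hot-hand question that the rest of the article looks at.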
Histograms
Plot a histogram of "time since last shot" for each of the two groups.
plt.rcParams['figure.figsize'] = (13, 10)

# use the joint histogram only to get bin edges shared by both groups
jointHist, timeBins = np.histogram(timeBetweenShotsDict['madeLast'] + timeBetweenShotsDict['missedLast'], bins=200)
barWidth = 0.999 * (timeBins[1] - timeBins[0])

timeDiffHist_GivenMadeLastShot, b = np.histogram(timeBetweenShotsDict['madeLast'], bins=timeBins)
timeDiffHist_GivenMissedLastShot, b = np.histogram(timeBetweenShotsDict['missedLast'], bins=timeBins)
maxHeight = max(max(timeDiffHist_GivenMadeLastShot), max(timeDiffHist_GivenMissedLastShot)) + 30

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], timeDiffHist_GivenMadeLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('made last shot')
plt.ylabel('counts')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], timeDiffHist_GivenMissedLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('missed last shot')
plt.xlabel('time since last shot')
plt.ylabel('counts')
plt.show()
Here is the output:

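The plot suggests that after making a shot Kobe is somewhat eager to take the next one, while the flatter stretches probably correspond to the other team having possession, which takes some time to win back.
Cumulative Histograms
To better visualize the difference between the histograms, let's look at the cumulative histograms.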
plt.rcParams['figure.figsize'] = (13, 6)

# cumulative sums, normalized so that each curve ends at 1
timeDiffCumHist_GivenMadeLastShot = np.cumsum(timeDiffHist_GivenMadeLastShot).astype(float)
timeDiffCumHist_GivenMadeLastShot = timeDiffCumHist_GivenMadeLastShot / max(timeDiffCumHist_GivenMadeLastShot)
timeDiffCumHist_GivenMissedLastShot = np.cumsum(timeDiffHist_GivenMissedLastShot).astype(float)
timeDiffCumHist_GivenMissedLastShot = timeDiffCumHist_GivenMissedLastShot / max(timeDiffCumHist_GivenMissedLastShot)

maxHeight = max(timeDiffCumHist_GivenMadeLastShot[-1], timeDiffCumHist_GivenMissedLastShot[-1])

plt.figure()
madePrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMadeLastShot, label='made Prev')
plt.xlim((0, 500))
missedPrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMissedLastShot, label='missed Prev')
plt.xlim((0, 500))
plt.ylim((0, 1))
plt.title('cumulative density function - CDF')
plt.xlabel('time since last shot')
plt.legend(loc='lower right')
plt.show()
The result looks like this:

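A difference between the two distributions is visible, but it is not very clear, so let's go back to the Gaussian cluster view to display the data.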
# show where shots are taken after made shots and after missed shots
plt.rcParams['figure.figsize'] = (13, 10)

variableCategories = afterMadeData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMadeData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMadeData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after made shots')

variableCategories = afterMissedData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMissedData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMissedData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after missed shots')
plt.show()
Let's look at the final result:

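Conclusion
It is now clear that after missing a shot, Kobe is more likely to take his next shot directly under the basket. The plots also show that after making a shot, he is more likely to try a three-pointer next. However, nothing in this case study provides solid evidence that Kobe had a "hot hand." It is easy to see that Kobe was a player who put real work into his game under the basket and around the free-throw line, and a supremely confident leader. He truly deserves to be called our "Boss"!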

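Room for Improvement
The data set used here is very large and rich, and even includes detailed data on every type of jump shot and layup. Readers who are interested can explore the information this analysis has not yet touched; you are sure to find plenty.
Note: this analysis may contain mistakes, and corrections from readers are welcome. Thanks for reading.
Original link: https://blog.csdn.net/weixin_43656359/article/details/104586776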