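Preface
#Statistics# #DataAnalysis# #Pandas# #matplotlib#
Today's topic is Section 1.3 of "Introduction to the Practice of Statistics": describing the distribution of data with numbers. These numbers include:
- Mean: the sum of all values divided by the number of values;
- Median: the value in the middle position of the sorted data (for an even number of values, the average of the two middle values);
- Quartiles: the three cut points that divide the values, sorted from smallest to largest, into four equal parts;
- Lower quartile (1st quartile, Q1): the first of these cut points;
- Upper quartile (3rd quartile, Q3): the third of these cut points;
- Minimum and Maximum;
- Interquartile range (IQR): Q3 - Q1.
The figure below uses a concrete ordered list to illustrate the quartile concepts above: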
Figure: Quartiles (Q1, Median, and Q3 marked on an ordered list)
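To make the definitions above concrete, here is a minimal sketch that computes the five-number summary and the IQR with only the standard library. It assumes the "median of each half" rule for quartiles that introductory textbooks commonly teach; other conventions can give slightly different values. The function name `five_number_summary` and the sample list are hypothetical, chosen only for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Needs at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values to the left of the median position
    upper = data[half + n % 2:]  # values to the right of the median position
    return data[0], median(lower), median(data), median(upper), data[-1]


sample = [2, 4, 4, 5, 7, 8, 9, 12]  # hypothetical data, not from the textbook
minimum, q1, med, q3, maximum = five_number_summary(sample)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)
```

For an odd number of values the middle value itself is left out of both halves, which is what the `n % 2` offset does.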
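We already met these concepts in the video series "Observing data with 5 indicators - the box-and-whisker plot"; interested readers can look there for details. In this section we mainly use Python (Pandas and Matplotlib) to identify and flag outliers.
First, let's review the raw data, shown in the figure below:

Raw data in 1.25
In the previous lesson we drew the bar chart below in Python. It clearly shows that the United States has the largest number of registered users, 155.74 million. Is that value an outlier, and how can we tell? Let's work it out in code.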

Bar chart of 1.25
Code implementation
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 29 22:26:21 2021
@author: Derek Zhu
"Introduction to the Practice of Statistics" - NINTH EDITION
Written by
David S. Moore
George P. McCabe
Bruce A. Craig
Chapter 01 - Practices - 1.3 Describing Distributions with Numbers
"""
# Step1: Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
# =============================================================================
# 1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests that Facebook apps can generate large amounts of money, as much as $1 million a month.
# The following table gives the numbers of Facebook users by country for the top 10 countries based on the number of users:
# =============================================================================
os.chdir("E:/2018-VGIC/21_SuccessFactor/Statistics")
df_125 = pd.read_excel('statistics data.xlsx', sheet_name='1.25', header=0)
# Sort the data (used when displaying the chart)
df_125 = df_125.sort_values("Users", ascending=True)
df_125
# =============================================================================
# Out[24]:
#           Country   Users
# 3         Germany   21.46
# 4          France   23.19
# 5     Philippines   26.87
# 0          Brazil   29.30
# 2          Mexico   29.80
# 7  United Kingdom   30.39
# 9          Turkey   30.63
# 1           India   37.38
# 6       Indonesia   40.52
# 8   United States  155.74
# =============================================================================
#%%
# Statistics figures
# =============================================================================
#      Q1-1.5IQR   Q1   median  Q3   Q3+1.5IQR
#                   |-----:-----|
#   o      |--------|     :     |--------|    o  o
#                   |-----:-----|
# flier             <----------->            fliers
#                        IQR
# =============================================================================
# Compute the mean
df_125['Users'].mean()
# Out[60]: 42.528
# Compute the median
med = df_125['Users'].median()
# Out[15]: 30.095
# Compute the maximum
df_125['Users'].max()
# Out[107]: 155.74
# Compute the minimum
df_125['Users'].min()
# Out[108]: 21.46
# Note the interpolation method used for the quartiles: 'midpoint' is chosen here
# with the aim of matching the textbook. Quartile conventions differ between sources,
# so other methods can give slightly different values.
# =============================================================================
# interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
# This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
# =============================================================================
# Lower quartile (1st quartile)
Q1 = df_125['Users'].quantile(q = 0.25, interpolation='midpoint')
Q1
# Out[95]: 28.085
# Upper quartile (3rd quartile)
Q3 = df_125['Users'].quantile(q = 0.75, interpolation='midpoint')
Q3
# Out[97]: 34.005
# Interquartile range (IQR)
IQR = Q3 - Q1
IQR
# Out[99]: 5.920000000000002
# Lower and upper thresholds (fences) that define outliers
T1 = Q1 - 1.5 * IQR
T1
# Out[102]: 19.205
T2 = Q3 + 1.5 * IQR
T2
# Out[103]: 42.885000000000005
# Flag the outliers
df_125['IsOutliers'] = df_125['Users'].apply(lambda x: 'Y' if (x < T1 or x > T2) else 'N')
df_125
# =============================================================================
# Out[105]:
#           Country   Users IsOutliers
# 3         Germany   21.46          N
# 4          France   23.19          N
# 5     Philippines   26.87          N
# 0          Brazil   29.30          N
# 2          Mexico   29.80          N
# 7  United Kingdom   30.39          N
# 9          Turkey   30.63          N
# 1           India   37.38          N
# 6       Indonesia   40.52          N
# 8   United States  155.74          Y
# =============================================================================
Conclusion: compared with the other 9 countries in the top 10, the number of Facebook users in the United States is a statistical outlier under the 1.5 × IQR rule and deserves special attention.
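The same check can also be written without `apply`. The following is a minimal sketch, assuming the ten values shown in the output above and the 'midpoint' quartiles used in this post; the helper name `flag_outliers` is hypothetical. It builds a boolean mask with vectorized comparisons, which is usually the more idiomatic choice in Pandas.

```python
import pandas as pd


def flag_outliers(series, k=1.5, interpolation="midpoint"):
    """Return a boolean Series that is True where a value lies outside
    the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25, interpolation=interpolation)
    q3 = series.quantile(0.75, interpolation=interpolation)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Values copied from the table above (users in millions)
users = pd.Series(
    [29.30, 37.38, 29.80, 21.46, 23.19, 26.87, 40.52, 30.39, 155.74, 30.63],
    index=["Brazil", "India", "Mexico", "Germany", "France", "Philippines",
           "Indonesia", "United Kingdom", "United States", "Turkey"],
    name="Users",
)
mask = flag_outliers(users)
outliers = users[mask]
print(outliers)
```

A boolean mask can be used directly for filtering, as in `users[mask]`, so no 'Y'/'N' text column is needed.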
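Box-and-whisker plot
Some readers may feel that a bar plot does not intuitively show the five indicators above (maximum, minimum, median, lower quartile (1st quartile), upper quartile (3rd quartile)). A box-and-whisker plot shows them more directly and also marks the outliers.
Code implementation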
#%%
# Data distribution displayed via Box-whisker plot
# df_125['Users'].plot.box(title="Box and whisker plot", grid=True);
# Data for the box-and-whisker plot
X2 = df_125["Users"]
# Drawing a Box-whisker plot
fig2, ax2 = plt.subplots(figsize=(8, 6), dpi=100)
ax2.boxplot(X2, labels=['Users in millions'])
# Set labels for title
ax2.set_title("Box-whisker plot for User Numbers in Top 10 Countries")
# Set data labels
ptiles_vers = df_125['Users'].quantile([0, 0.25, 0.50, 0.75, 1], interpolation = 'midpoint')
ptiles_vers
# =============================================================================
# Out[48]:
# 0.00 21.460
# 0.25 28.085
# 0.50 30.095
# 0.75 34.005
# 1.00 155.740
# Name: Users, dtype: float64
# =============================================================================
for i in ptiles_vers:
    # Write each statistic next to the box, at its own height on the y-axis
    ax2.text(1.1, i, "{}".format(round(i, 2)))
# Show fig and save to file
fig2.show()
fig2.savefig("statistics data fig 1.25 - box-whisker plot.png", dpi = 800)
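Below is the result of running the code: a box-and-whisker plot built from the five summary statistics (maximum, minimum, median, lower quartile (1st quartile), upper quartile (3rd quartile)), with the outlier drawn separately.
Even someone unfamiliar with data analysis will be curious about the outlier (155.74), which is exactly the purpose of applying statistics to data analysis here: "observing the distribution of data through numbers".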

Box-whisker plot
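One detail is worth checking when labelling the plot. As far as we know, Matplotlib's `boxplot` computes the box edges with `numpy.percentile` and its default linear interpolation, so the box may not sit exactly at the 'midpoint' quartiles printed as labels. The sketch below assumes the ten values from the table above and the default whisker factor of 1.5, and reproduces the numbers the plot is expected to draw.

```python
import numpy as np

# Values copied from the table above (users in millions)
users = np.array([21.46, 23.19, 26.87, 29.30, 29.80,
                  30.39, 30.63, 37.38, 40.52, 155.74])

# Quartiles with NumPy's default (linear) interpolation
q1, med, q3 = np.percentile(users, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme data points still inside the 1.5 * IQR fences
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = users[(users >= low_fence) & (users <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually as fliers
fliers = users[(users < low_fence) | (users > high_fence)]

print(f"Q1={q1:.4f}, median={med:.4f}, Q3={q3:.4f}")
print(f"whiskers: {whisker_low} to {whisker_high}, fliers: {fliers}")
```

With these values the upper whisker ends at 40.52 rather than at the maximum, and 155.74 is drawn as a separate flier. If the labels should match the drawn box exactly, compute them with the same interpolation as the plot.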
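Summary
- Understand the five summary statistics commonly used in statistics (maximum, minimum, median, lower quartile (1st quartile), upper quartile (3rd quartile));
- Compute these statistics with the built-in Pandas DataFrame/Series methods, and use apply to add a computed column that flags outliers;
- Use a Matplotlib box-and-whisker plot for a more intuitive display. Note: the value labels must be computed separately.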

Mindmap of Statistics Practice