Preface

#Statistics# #DataAnalysis# #Pandas# #matplotlib#

Today's main topic is Section 1.3 of "Applied Statistics": describing the distribution of data with numbers. These numbers include:

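Mean: the sum of all values divided by the number of values;
Median: the value at the middle position of the sorted data set (the average of the two middle values when the count is even);
Quartiles: the values at the three cut points when all values are sorted from smallest to largest and divided into four equal parts.
Lower quartile, 1st quartile (Q1): the quartile at the first cut point;
Upper quartile, 3rd quartile (Q3): the quartile at the third cut point;
Minimum and Maximum;
Interquartile range (IQR): Q3 - Q1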
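
As a minimal sketch of the definitions above, the following computes each number with pandas for the ten user counts (in millions) from Exercise 1.25, the data set analyzed in this post. The helper name `five_number_summary` is hypothetical, and it relies on pandas' default quantile interpolation (`'linear'`), so its quartiles can differ slightly from values computed under another convention.

```python
import pandas as pd

# Facebook users (millions) for the top 10 countries, from Exercise 1.25
users = pd.Series([21.46, 23.19, 26.87, 29.30, 29.80,
                   30.39, 30.63, 37.38, 40.52, 155.74])


def five_number_summary(s):
    """Return min, Q1, median, Q3, max and IQR (hypothetical helper)."""
    q1 = s.quantile(0.25)  # pandas default interpolation is 'linear'
    q3 = s.quantile(0.75)
    return {
        "min": s.min(),
        "Q1": q1,
        "median": s.median(),
        "Q3": q3,
        "max": s.max(),
        "IQR": q3 - q1,
    }


summary = five_number_summary(users)
print("mean", round(users.mean(), 4))
for name, value in summary.items():
    print(name, round(value, 4))
```

Because quartile conventions vary between tools, it is worth stating which one was used whenever Q1, Q3, or the IQR is reported.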

The example figures below illustrate the quartile concepts above using a concrete ordered sequence:

[Figures: Median; Quartiles]

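We already covered these concepts in the video series "Five Indicators for Looking at Data: the Box-Whisker Plot"; interested readers can look there for details. In this section we mainly use Python with Pandas and Matplotlib to screen for and identify outliers.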

First, a look back at the raw data, shown in the figure below:

[Figure: Raw data for Exercise 1.25]

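In the previous lesson we drew the bar chart below with Python. It clearly shows that the United States has the most registered users, 155.74 million. Is that value an outlier, and how can we tell? We answer with code.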

[Figure: Bar chart for Exercise 1.25]
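
Before applying a formal rule, one quick and informal check is to see how far the mean and the median move when the largest value is left out. The sketch below does this for the ten values in the raw data; it is an illustration added here, not part of the textbook exercise.

```python
import pandas as pd

# Users in millions, indexed by country (values from Exercise 1.25)
users = pd.Series(
    [29.30, 37.38, 29.80, 21.46, 23.19, 26.87, 40.52, 30.39, 155.74, 30.63],
    index=["Brazil", "India", "Mexico", "Germany", "France", "Philippines",
           "Indonesia", "United Kingdom", "United States", "Turkey"],
)
without_us = users.drop("United States")

# Compare the mean and median with and without the largest value
comparison = pd.DataFrame(
    {
        "mean": [users.mean(), without_us.mean()],
        "median": [users.median(), without_us.median()],
    },
    index=["all 10 countries", "without United States"],
)
print(comparison.round(3))
```

The mean falls from about 42.5 to about 29.9, while the median only moves from 30.095 to 29.80. A single value that shifts the mean this much is worth testing against an outlier rule.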

Code Implementation

# -*- coding: utf-8 -*-
"""
Created on Wed Dec 29 22:26:21 2021

@author: Derek Zhu

"Introduction to the Practice of Statistics" - NINTH EDITION
Written by 
David S. Moore
George P. McCabe
Bruce A. Craig
Chapter 01 - Practices - 1.3 Describing Distributions with Numbers
"""

# Step1: Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
# =============================================================================
# 1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests that Facebook apps can generate large amounts of money, as much as $1 million a month.
# The following table gives the numbers of Facebook users by country for the top 10 countries based on the number of users:
# =============================================================================

os.chdir("E:/2018-VGIC/21_SuccessFactor/Statistics")

df_125 = pd.read_excel('statistics data.xlsx', sheet_name='1.25', header=0)

# Sort the data; used later when displaying the chart
df_125 = df_125.sort_values("Users", ascending=True)
df_125
# =============================================================================
# Out[24]: 
#           Country   Users
# 3         Germany   21.46
# 4          France   23.19
# 5     Philippines   26.87
# 0          Brazil   29.30
# 2          Mexico   29.80
# 7  United Kingdom   30.39
# 9          Turkey   30.63
# 1           India   37.38
# 6       Indonesia   40.52
# 8   United States  155.74
# =============================================================================

#%%
# Statistics figures
# =============================================================================
#      Q1-1.5IQR   Q1   median  Q3   Q3+1.5IQR
#                   |-----:-----|
#   o      |--------|     :     |--------|    o  o
#                   |-----:-----|
# flier             <----------->            fliers
#                        IQR
# =============================================================================

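# Compute the mean
df_125['Users'].mean()
# Out[60]: 42.528

# Compute the median
med = df_125['Users'].median()
med
# Out[15]: 30.095

# Compute the maximum
df_125['Users'].max()
# Out[107]: 155.74

# Compute the minimum
df_125['Users'].min()
# Out[108]: 21.46

# Note the quantile calculation method: to get results consistent with "Applied Statistics", the author uses the 'midpoint' interpolation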
# =============================================================================
# interpolation{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
# This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
# =============================================================================

# Lower quartile, 1st quartile
Q1 = df_125['Users'].quantile(q = 0.25, interpolation='midpoint')
Q1
# Out[95]: 28.085

# Upper quartile, 3rd quartile
Q3 = df_125['Users'].quantile(q = 0.75, interpolation='midpoint')
Q3
# Out[97]: 34.005

# Interquartile range (IQR)
IQR = Q3 - Q1
IQR
# Out[99]: 5.920000000000002

# Outlier thresholds: the lower and upper fences
T1 = Q1 - 1.5 * IQR
T1
# Out[102]: 19.205

T2 = Q3 + 1.5 * IQR
T2
# Out[103]: 42.885000000000005

# Flag outliers
df_125['IsOutliers'] = df_125['Users'].apply(lambda x: 'Y' if (x < T1 or x > T2) else 'N')
df_125
# =============================================================================
# Out[105]: 
#           Country   Users IsOutliers
# 3         Germany   21.46          N
# 4          France   23.19          N
# 5     Philippines   26.87          N
# 0          Brazil   29.30          N
# 2          Mexico   29.80          N
# 7  United Kingdom   30.39          N
# 9          Turkey   30.63          N
# 1           India   37.38          N
# 6       Indonesia   40.52          N
# 8   United States  155.74          Y
# =============================================================================

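# Conclusion: compared with the other nine countries in the top 10, the number of registered Facebook users in the United States is statistically an outlier and deserves special attention.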

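Conclusion: compared with the other nine countries in the top 10, the number of registered Facebook users in the United States is statistically an outlier and deserves special attention.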
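
Quartile conventions differ between tools and textbooks, so it is reasonable to ask whether the conclusion depends on the one chosen. The sketch below compares the `'midpoint'` interpolation used in the script above with a common textbook rule that takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data. Whether the book's answer key uses exactly that rule is an assumption here, and the helper names `quartiles_median_of_halves` and `flag_outliers` are hypothetical. The mask is built with the vectorized `Series.between` rather than `apply`.

```python
import pandas as pd

users = pd.Series([21.46, 23.19, 26.87, 29.30, 29.80,
                   30.39, 30.63, 37.38, 40.52, 155.74])


def quartiles_midpoint(s):
    """Q1 and Q3 via pandas with interpolation='midpoint'."""
    return (s.quantile(0.25, interpolation="midpoint"),
            s.quantile(0.75, interpolation="midpoint"))


def quartiles_median_of_halves(s):
    """Q1 and Q3 as medians of the lower and upper halves (hypothetical helper).

    With an odd count, the overall median is excluded from both halves.
    """
    ordered = s.sort_values().reset_index(drop=True)
    half = len(ordered) // 2
    lower = ordered.iloc[:half]
    upper = ordered.iloc[len(ordered) - half:]
    return lower.median(), upper.median()


def flag_outliers(s, q1, q3, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    iqr = q3 - q1
    return ~s.between(q1 - k * iqr, q3 + k * iqr)


results = {}
for name, rule in [("midpoint", quartiles_midpoint),
                   ("median of halves", quartiles_median_of_halves)]:
    q1, q3 = rule(users)
    outliers = users[flag_outliers(users, q1, q3)].tolist()
    results[name] = (q1, q3, outliers)
    print(name, round(q1, 3), round(q3, 3), outliers)
```

The two rules give different quartiles for these ten values (28.085 and 34.005 versus 26.87 and 37.38), yet both flag only 155.74, so the conclusion about the United States does not hinge on the convention.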

Box-Whisker Plot

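Some readers may feel that a bar plot does not show the five indicators above (maximum, minimum, median, lower quartile (1st quartile), and upper quartile (3rd quartile)) very intuitively. A box-whisker plot shows them more directly, and also displays outliers.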

Code Implementation

#%%
# Data distribution displayed via Box-whisker plot
# df_125['Users'].plot.box(title="Box and whisker plot", grid=True);

# Data to draw in the Box-whisker plot
X2 = df_125["Users"]

# Drawing a Box-whisker plot
fig2, ax2 = plt.subplots(figsize=(8, 6), dpi=100)
ax2.boxplot(X2, labels=['Users in millions'])

# Set the title
ax2.set_title("Box-whisker plot for User Numbers in Top 10 Countries")

# Set data labels
ptiles_vers = df_125['Users'].quantile([0, 0.25, 0.50, 0.75, 1], interpolation = 'midpoint')
ptiles_vers
# =============================================================================
# Out[48]: 
# 0.00     21.460
# 0.25     28.085
# 0.50     30.095
# 0.75     34.005
# 1.00    155.740
# Name: Users, dtype: float64
# =============================================================================

for i in ptiles_vers:
    ax2.text(1.1, i, "{}".format(round(i, 2)))

# Show fig and save to file
fig2.show()
fig2.savefig("statistics data fig 1.25 - box-whisker plot.png", dpi = 800)

Below is the result of running the code: a box-whisker plot built from the five statistical indicators (maximum, minimum, median, lower quartile (1st quartile), and upper quartile (3rd quartile)), plus the outlier.

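Even someone unfamiliar with data analysis will be curious about the outlier (155.74), which achieves the goal of applying statistics to data analysis: "observing the distribution of data through numbers".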

[Figure: Box-whisker plot]
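
Matplotlib computes its own quartiles when it draws the box, so labels calculated separately with `interpolation='midpoint'` may not sit exactly on the box edges. To our understanding, `Axes.boxplot` obtains its numbers from `matplotlib.cbook.boxplot_stats`; the sketch below calls that function directly to inspect the values behind the drawing. Treat it as a way to check the plot, not as a statement of how the figure above was produced.

```python
from matplotlib import cbook

# Users in millions for the top 10 countries (Exercise 1.25)
users = [21.46, 23.19, 26.87, 29.30, 29.80,
         30.39, 30.63, 37.38, 40.52, 155.74]

# boxplot_stats returns one dict per data set; whis=1.5 is the default rule
stats = cbook.boxplot_stats(users, whis=1.5)[0]

# Whisker ends, box edges and median as matplotlib would draw them
for key in ("whislo", "q1", "med", "q3", "whishi"):
    print(key, round(float(stats[key]), 4))

# Points beyond the whiskers are drawn as individual fliers
fliers = [float(v) for v in stats["fliers"]]
print("fliers", fliers)
```

For these values the box edges come out at 27.4775 and 35.6925, the upper whisker ends at 40.52 (the largest value inside the fence), and 155.74 is the only flier. The labels 28.085 and 34.005 therefore sit slightly inside the drawn box.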

Summary

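Understand the five commonly used summary statistics (maximum, minimum, median, lower quartile (1st quartile), and upper quartile (3rd quartile)).
Compute these statistics with the built-in methods of pandas.DataFrame, and add a calculated column with apply to flag outliers.
Show them more intuitively with a Matplotlib box-whisker plot; note that the value labels must be computed separately.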

[Figure: Mind map of the statistics exercise]
