Selecting the Better Athlete: A Story-Driven Statistics Tutorial

Author: Prof CKDash Tutorials

Story setup: Two athletes (A and B) each run a 100 m sprint 10 times with 5‑minute intervals.
Their recorded times (in seconds) are:
Athlete A: 20, 21, 20, 19, 20, 21, 19, 20, 20, 20
Athlete B: 17, 21, 22, 22, 19, 16, 21, 24, 20, 18
Both averages are 20 s, so who is the more consistent, and therefore the better choice?
We'll use statistics to find out.

1) Setup & Data

# Install/load required packages
pkgs <- c("tidyverse", "knitr", "kableExtra", "ggplot2", "scales", "ggdist")
to_install <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")

# Attach the packages; invisible() suppresses the list of attached
# packages that lapply() would otherwise print for every element of pkgs
invisible(lapply(pkgs, library, character.only = TRUE))

# Raw vectors
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# Long data frame
dat <- tibble(
  Athlete = rep(c("A", "B"), each = 10),
  Time = c(athlete_A, athlete_B)
)

dat

# A tibble: 20 × 2
   Athlete  Time
   <chr>   <dbl>
 1 A          20
 2 A          21
 3 A          20
 4 A          19
 5 A          20
 6 A          21
 7 A          19
 8 A          20
 9 A          20
10 A          20
11 B          17
12 B          21
13 B          22
14 B          22
15 B          19
16 B          16
17 B          21
18 B          24
19 B          20
20 B          18

2) Frequency Distribution (Tabular)

We build a simple frequency table (counts) for each Athlete × Time and also a tidy table by time bins.

# Simple frequency of exact times by athlete
freq_exact <- dat %>%
  count(Athlete, Time, name = "Frequency") %>%
  arrange(Athlete, Time)

freq_exact %>%
  kbl(caption = "Exact-Time Frequency Table (seconds)",
      align = "c") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped","hover"))

Exact-Time Frequency Table (seconds)
Athlete	Time	Frequency
A	19	2
A	20	6
A	21	2
B	16	1
B	17	1
B	18	1
B	19	1
B	20	1
B	21	2
B	22	2
B	24	1

# Frequency by bins (e.g., 2-second bins) for a quick histogram-like table
bin_width <- 2
breaks <- seq(floor(min(dat$Time)), ceiling(max(dat$Time)), by = bin_width)
freq_binned <- dat %>%
  mutate(Bin = cut(Time, breaks = breaks, right = FALSE, include.lowest = TRUE)) %>%
  count(Athlete, Bin, name = "Frequency")

freq_binned %>%
  kbl(caption = "Binned Frequency Table (2-second bins)",
      align = "c") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped","hover"))

Binned Frequency Table (2-second bins)
Athlete	Bin	Frequency
A	[18,20)	2
A	[20,22)	8
B	[16,18)	2
B	[18,20)	2
B	[20,22)	3
B	[22,24]	3
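
The role of include.lowest = TRUE in the binning step is easy to miss, so here is a minimal base-R sketch using Athlete B's times and the same 2-second breaks. With right = FALSE each bin is closed on the left and open on the right, so the slowest run (24 s) sits exactly on the last break and would be dropped unless the final bin is closed. The variable names (times_B, breaks_B, and so on) are illustrative and not part of the tutorial's own code.

```r
# Athlete B's times and the same 2-second breaks as above: 16, 18, 20, 22, 24
times_B  <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)
breaks_B <- seq(16, 24, by = 2)

# Left-closed bins; include.lowest = TRUE also closes the last bin on the right
bins_closed   <- cut(times_B, breaks = breaks_B, right = FALSE, include.lowest = TRUE)
counts_closed <- table(bins_closed)
print(counts_closed)

# Without include.lowest, the 24 s run falls outside [22,24) and becomes NA
bins_open <- cut(times_B, breaks = breaks_B, right = FALSE)
n_dropped <- sum(is.na(bins_open))
print(n_dropped)
```

The counts from the closed version match the Athlete B rows of the binned table, while the second version silently loses one observation.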

3) Descriptive Statistics (Mean & Standard Deviation)

summary_stats <- dat %>%
  group_by(Athlete) %>%
  summarise(
    N = n(),
    Mean = mean(Time),
    SD = sd(Time),
    Min = min(Time),
    Q1 = quantile(Time, 0.25),
    Median = median(Time),
    Q3 = quantile(Time, 0.75),
    Max = max(Time),
    .groups = "drop"
  )

summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kbl(caption = "Descriptive Statistics by Athlete",
      align = "c",
      col.names = c("Athlete","N","Mean","SD","Min","Q1","Median","Q3","Max")) %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped","hover"))

Descriptive Statistics by Athlete
Athlete	N	Mean	SD	Min	Q1	Median	Q3	Max
A	10	20	0.67	19	20.00	20.0	20.00	21
B	10	20	2.49	16	18.25	20.5	21.75	24

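Interpretation: Both athletes have exactly the same mean (20 s), but their SDs differ: 0.67 s for A versus 2.49 s for B. A smaller SD means the times cluster more tightly around the mean, so Athlete A is the more consistent runner.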
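
To make the SD values less of a black box, the sketch below recomputes them from the sample-SD definition: the square root of the sum of squared deviations from the mean, divided by n - 1. The helper manual_sd() is a hypothetical name introduced here for illustration; base R's sd() does the same job.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# Hypothetical helper: sample SD written out from its definition
manual_sd <- function(x) {
  deviations <- x - mean(x)                      # distance of each run from the mean
  sqrt(sum(deviations^2) / (length(x) - 1))      # divide by n - 1, then take the root
}

# Sum of squared deviations: 4 for A, 56 for B
ss_A <- sum((athlete_A - mean(athlete_A))^2)
ss_B <- sum((athlete_B - mean(athlete_B))^2)

sd_A <- manual_sd(athlete_A)
sd_B <- manual_sd(athlete_B)
print(round(c(A = sd_A, B = sd_B), 2))
```

Both athletes have the same n and the same mean, so the whole difference in SD comes from the sum of squared deviations: 4 for A against 56 for B.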

4) Histogram (overlaid) with Density

ggplot(dat, aes(x = Time, fill = Athlete)) +
  geom_histogram(aes(y = after_stat(density)), bins = 8, color = "black", alpha = 0.6, position = "identity") +
  geom_density(alpha = 0.2) +
  labs(title = "Histogram with Density Overlay",
       x = "Time (s)", y = "Density") +
  theme_minimal(base_size = 15)

5) Bell-Shaped Normal Curve (per athlete)

We overlay each athlete’s normal curve using their own mean and SD.

# Prepare grid and curve data
curves <- dat %>%
  group_by(Athlete) %>%
  summarise(Mean = mean(Time), SD = sd(Time), .groups = "drop")

xgrid <- seq(min(dat$Time) - 3, max(dat$Time) + 3, length.out = 1000)

curve_df <- curves %>%
  rowwise() %>%
  mutate(
    x = list(xgrid),
    density = list(dnorm(xgrid, mean = Mean, sd = SD))
  ) %>%
  unnest(c(x, density))

ggplot() +
  geom_line(data = curve_df, aes(x = x, y = density, color = Athlete), linewidth = 1.1) +
  labs(title = "Bell Shape Curve per Athlete (Normal Fit)",
       x = "Time (s)", y = "Density") +
  theme_minimal(base_size = 15)

6) Bar Plot with Standard Deviation Bars

bar_df <- summary_stats %>% select(Athlete, Mean, SD)

ggplot(bar_df, aes(x = Athlete, y = Mean, fill = Athlete)) +
  geom_col(width = 0.5, color = "black") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD),
                width = 0.18, linewidth = 1) +
  labs(title = "Average Time with Standard Deviation Bars",
       x = "Athlete", y = "Average Time (s)") +
  theme_minimal(base_size = 15)

7) Boxplot (with Mean marked)

ggplot(dat, aes(x = Athlete, y = Time, fill = Athlete)) +
  geom_boxplot(width = 0.6, outlier.shape = 16, outlier.size = 2, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 4, fill = "yellow", color = "black") +
  labs(title = "Boxplot by Athlete (yellow dot = Mean)",
       x = "Athlete", y = "Time (s)") +
  theme_minimal(base_size = 15)

8) t-test: Do the means differ?

Here we test whether the two means differ, using a two-sample t-test. Setting var.equal = FALSE requests Welch's t-test, which does not assume that the two athletes' times have equal variances.

tt <- t.test(athlete_A, athlete_B, var.equal = FALSE)
tt


    Welch Two Sample t-test

data:  athlete_A and athlete_B
t = 0, df = 10.279, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.812592  1.812592
sample estimates:
mean of x mean of y 
       20        20

Explanation:
- Null hypothesis (H₀): Mean(A) = Mean(B)
- Alternative (H₁): Mean(A) ≠ Mean(B)
- If p-value < 0.05, we reject H₀ and say the means differ significantly.
- In our case, the sample means are identical (t = 0, p-value = 1), so we fail to reject H₀: there is no evidence that the means differ. The SDs, however, are clearly different (0.67 s vs 2.49 s), implying Athlete A is more consistent.
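
For readers who want to see where the numbers in the Welch output come from, here is a minimal sketch that rebuilds the t statistic, the degrees of freedom (Welch-Satterthwaite approximation), and the two-sided p-value by hand. The intermediate names (se2_A, df_welch, and so on) are illustrative only.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

n_A <- length(athlete_A)
n_B <- length(athlete_B)

# Squared standard error of each sample mean
se2_A <- var(athlete_A) / n_A
se2_B <- var(athlete_B) / n_B

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(athlete_A) - mean(athlete_B)) / sqrt(se2_A + se2_B)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_A + se2_B)^2 /
  (se2_A^2 / (n_A - 1) + se2_B^2 / (n_B - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

Because the two sample means are equal, the numerator of the t statistic is zero, which is why the test reports t = 0 and a p-value of 1 regardless of how different the spreads are.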

Teaching tip: To compare variability directly, discuss SD and show the bar plot + boxplot; for a formal test of variances, one may use an F-test or Levene/Brown‑Forsythe (not shown here to keep focus).
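
For readers who do want the formal check mentioned in the teaching tip, a minimal sketch of the F-test uses var.test() from base R's stats package. Treat it as an illustration rather than part of the main analysis: the F-test assumes normally distributed data and is known to be sensitive to departures from normality, which is one reason Levene-type tests (available in add-on packages) are often preferred.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# F-test of H0: the two population variances are equal (ratio = 1)
vt <- var.test(athlete_A, athlete_B)
print(vt)

# The F statistic is simply the ratio of the two sample variances
f_ratio <- var(athlete_A) / var(athlete_B)
print(f_ratio)
```

A p-value below 0.05 here would indicate that the difference in variability between the two athletes is statistically significant, which complements the t-test's conclusion about the means.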

9) (Optional) Upload & Display Your Own Images

Use this section to add classroom images (e.g., school & students) that illustrate population vs. samples.

# Place image files in the same folder as this .qmd
# and replace the names below, then render.
# Example:
# knitr::include_graphics("population_1000_students.png")
# knitr::include_graphics("four_samples_4x50.png")

9.1 Key Takeaways

Even when the means are the same, the SDs can differ; SD is essential for measuring consistency.
The histogram and boxplot show the shape and spread of the data.
The bar plot with SD bars shows the mean and the consistency at a glance.
The t-test checks for a difference in means; here the decision depends on the p-value, but for consistency the key is to look at the SD.
