R blog
Selecting the Better Athlete: A Story-Driven Statistics Tutorial
Story setup: Two athletes (A and B) each run a 100 m sprint 10 times with 5‑minute intervals.
Their recorded times (in seconds) are:
- Athlete A: 20, 21, 20, 19, 20, 21, 19, 20, 20, 20
- Athlete B: 17, 21, 22, 22, 19, 16, 21, 24, 20, 18
Both averages are 20 s, but who is the more consistent, and therefore the better choice?
We'll use statistics to find out.
1) Setup & Data
# Install/load required packages
pkgs <- c("tidyverse", "knitr", "kableExtra", "ggplot2", "scales", "ggdist")
to_install <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
# Attach the packages; invisible() suppresses the list of attached packages that lapply() would otherwise print
invisible(lapply(pkgs, library, character.only = TRUE))
# Raw vectors
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)
# Long data frame
dat <- tibble(
  Athlete = rep(c("A", "B"), each = 10),
  Time = c(athlete_A, athlete_B)
)
dat
# A tibble: 20 × 2
Athlete Time
<chr> <dbl>
1 A 20
2 A 21
3 A 20
4 A 19
5 A 20
6 A 21
7 A 19
8 A 20
9 A 20
10 A 20
11 B 17
12 B 21
13 B 22
14 B 22
15 B 19
16 B 16
17 B 21
18 B 24
19 B 20
20 B 18
2) Frequency Distribution (Tabular)
We build a simple frequency table (counts) for each Athlete × Time and also a tidy table by time bins.
# Simple frequency of exact times by athlete
freq_exact <- dat %>%
  count(Athlete, Time, name = "Frequency") %>%
  arrange(Athlete, Time)
freq_exact %>%
  kbl(caption = "Exact-Time Frequency Table (seconds)",
      align = "c") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "hover"))
| Athlete | Time | Frequency |
|---|---|---|
| A | 19 | 2 |
| A | 20 | 6 |
| A | 21 | 2 |
| B | 16 | 1 |
| B | 17 | 1 |
| B | 18 | 1 |
| B | 19 | 1 |
| B | 20 | 1 |
| B | 21 | 2 |
| B | 22 | 2 |
| B | 24 | 1 |
# Frequency by bins (e.g., 2-second bins) for a quick histogram-like table
bin_width <- 2
breaks <- seq(floor(min(dat$Time)), ceiling(max(dat$Time)), by = bin_width)
freq_binned <- dat %>%
  mutate(Bin = cut(Time, breaks = breaks, right = FALSE, include.lowest = TRUE)) %>%
  count(Athlete, Bin, name = "Frequency")
freq_binned %>%
  kbl(caption = "Binned Frequency Table (2-second bins)",
      align = "c") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "hover"))
| Athlete | Bin | Frequency |
|---|---|---|
| A | [18,20) | 2 |
| A | [20,22) | 8 |
| B | [16,18) | 2 |
| B | [18,20) | 2 |
| B | [20,22) | 3 |
| B | [22,24] | 3 |
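The `seq()` call used for the breaks ends exactly at the maximum only because the range of these data (16 to 24 s) happens to be a multiple of the bin width. As a minimal sketch, assuming you want to reuse the binning on other data, the hypothetical helper `make_breaks()` below (a name made up for this illustration) extends the last break so that every observation falls into a bin. It uses base R only.

```r
# Hypothetical helper (an assumption for illustration):
# build breaks of a fixed width that always cover the full range of x.
make_breaks <- function(x, width) {
  lo <- floor(min(x))
  n_bins <- max(1, ceiling((max(x) - lo) / width))  # at least one bin
  seq(lo, lo + n_bins * width, by = width)
}

times <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)  # Athlete B's times
breaks_ok <- make_breaks(times, 2)                   # 16, 18, 20, 22, 24

# A made-up extra time of 25 s would fall outside seq(16, 25, by = 2),
# so cut() would return NA for it; the helper adds one more bin instead.
times_extra <- c(times, 25)
breaks_extra <- make_breaks(times_extra, 2)
bins_extra <- cut(times_extra, breaks = breaks_extra,
                  right = FALSE, include.lowest = TRUE)
table(bins_extra)
```

With the original data the helper returns the same breaks as before, so the binned table is unchanged.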
3) Descriptive Statistics (Mean & Standard Deviation)
summary_stats <- dat %>%
  group_by(Athlete) %>%
  summarise(
    N = n(),
    Mean = mean(Time),
    SD = sd(Time),
    Min = min(Time),
    Q1 = quantile(Time, 0.25),
    Median = median(Time),
    Q3 = quantile(Time, 0.75),
    Max = max(Time),
    .groups = "drop"
  )
summary_stats %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kbl(caption = "Descriptive Statistics by Athlete",
      align = "c",
      col.names = c("Athlete", "N", "Mean", "SD", "Min", "Q1", "Median", "Q3", "Max")) %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "hover"))
| Athlete | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| A | 10 | 20 | 0.67 | 19 | 20.00 | 20.0 | 20.00 | 21 |
| B | 10 | 20 | 2.49 | 16 | 18.25 | 20.5 | 21.75 | 24 |
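Interpretation: Both athletes have the same mean (exactly 20 s), but Athlete A's SD (0.67 s) is far smaller than Athlete B's (2.49 s). A smaller SD means the times cluster more tightly around the mean, i.e. more consistency.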
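To make the SD concrete, here is a small sketch that recomputes it by hand from the definition, s = sqrt(sum((x - mean)^2) / (n - 1)), and checks the result against `sd()`. The helper name `sd_by_hand()` is made up for this illustration; the code uses base R only.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# Hypothetical helper: sample standard deviation from its definition
sd_by_hand <- function(x) {
  deviations <- x - mean(x)                   # distance of each run from the mean
  sqrt(sum(deviations^2) / (length(x) - 1))   # n - 1 in the denominator, as in sd()
}

sd_A <- sd_by_hand(athlete_A)
sd_B <- sd_by_hand(athlete_B)
round(c(A = sd_A, B = sd_B), 2)

# Compare with the built-in function
all.equal(c(sd_A, sd_B), c(sd(athlete_A), sd(athlete_B)))
```

For Athlete A the squared deviations sum to 4, so s = sqrt(4/9) ≈ 0.67; for Athlete B they sum to 56, so s = sqrt(56/9) ≈ 2.49.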
4) Histogram (overlaid) with Density
ggplot(dat, aes(x = Time, fill = Athlete)) +
  geom_histogram(aes(y = after_stat(density)), bins = 8, color = "black",
                 alpha = 0.6, position = "identity") +
  geom_density(alpha = 0.2) +
  labs(title = "Histogram with Density Overlay",
       x = "Time (s)", y = "Density") +
  theme_minimal(base_size = 15)
5) Bell-Shaped Normal Curve (per athlete)
We overlay each athlete’s normal curve using their own mean and SD.
# Prepare grid and curve data
curves <- dat %>%
  group_by(Athlete) %>%
  summarise(Mean = mean(Time), SD = sd(Time), .groups = "drop")
xgrid <- seq(min(dat$Time) - 3, max(dat$Time) + 3, length.out = 1000)
curve_df <- curves %>%
  rowwise() %>%
  mutate(
    x = list(xgrid),
    density = list(dnorm(xgrid, mean = Mean, sd = SD))
  ) %>%
  unnest(c(x, density))
ggplot() +
  geom_line(data = curve_df, aes(x = x, y = density, color = Athlete), linewidth = 1.1) +
  labs(title = "Bell Shape Curve per Athlete (Normal Fit)",
       x = "Time (s)", y = "Density") +
  theme_minimal(base_size = 15)
6) Bar Plot with Standard Deviation Bars
bar_df <- summary_stats %>% select(Athlete, Mean, SD)
ggplot(bar_df, aes(x = Athlete, y = Mean, fill = Athlete)) +
  geom_col(width = 0.5, color = "black") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD),
                width = 0.18, linewidth = 1) +
  labs(title = "Average Time with Standard Deviation Bars",
       x = "Athlete", y = "Average Time (s)") +
  theme_minimal(base_size = 15)
7) Boxplot (with Mean marked)
ggplot(dat, aes(x = Athlete, y = Time, fill = Athlete)) +
  geom_boxplot(width = 0.6, outlier.shape = 16, outlier.size = 2, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 4,
               fill = "yellow", color = "black") +
  labs(title = "Boxplot by Athlete (yellow dot = Mean)",
       x = "Athlete", y = "Time (s)") +
  theme_minimal(base_size = 15)
8) t-test: Do the means differ?
Here we test whether the means are statistically different using a two-sample t-test. Setting var.equal = FALSE requests Welch's version of the test, which does not assume that the two athletes' times have equal variances.
tt <- t.test(athlete_A, athlete_B, var.equal = FALSE)
tt
Welch Two Sample t-test
data: athlete_A and athlete_B
t = 0, df = 10.279, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.812592 1.812592
sample estimates:
mean of x mean of y
20 20
Explanation:
- Null hypothesis (H₀): Mean(A) = Mean(B)
- Alternative (H₁): Mean(A) ≠ Mean(B)
- If p-value < 0.05, we reject H₀ and say the means differ significantly.
- In our case the sample means are identical (20 s each), so t = 0 and the p-value is 1: we fail to reject H₀. The SDs, however, differ clearly (0.67 s vs. 2.49 s), which indicates that Athlete A is more consistent.
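The Welch statistic and the fractional degrees of freedom reported by `t.test()` can be reproduced by hand. The sketch below uses base R only; the variable names are chosen for this illustration.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# Squared standard error contributed by each sample: s^2 / n
se2_A <- var(athlete_A) / length(athlete_A)
se2_B <- var(athlete_B) / length(athlete_B)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(athlete_A) - mean(athlete_B)) / sqrt(se2_A + se2_B)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (se2_A + se2_B)^2 /
  (se2_A^2 / (length(athlete_A) - 1) + se2_B^2 / (length(athlete_B) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

Because Athlete B's variance dominates the combined standard error, the degrees of freedom land near 10 rather than the 18 a pooled-variance test would use.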
Teaching tip: To compare variability directly, discuss SD and show the bar plot + boxplot; for a formal test of variances, one may use an F-test or Levene/Brown‑Forsythe (not shown here to keep focus).
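As a minimal sketch of the F-test mentioned in the tip, base R's `var.test()` compares the two variances directly. The F-test assumes both samples come from normal distributions, which is hard to check with only 10 runs each, so treat the result as illustrative rather than conclusive.

```r
athlete_A <- c(20, 21, 20, 19, 20, 21, 19, 20, 20, 20)
athlete_B <- c(17, 21, 22, 22, 19, 16, 21, 24, 20, 18)

# F-test for equality of variances (H0: var(A) / var(B) = 1)
ft <- var.test(athlete_A, athlete_B)

ft$statistic   # ratio of sample variances, var(A) / var(B)
ft$parameter   # numerator and denominator degrees of freedom
ft$p.value     # two-sided p-value
```

Here the statistic is the variance ratio (4/9) / (56/9) ≈ 0.07, far from the value of 1 expected under equal variances.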
9) (Optional) Upload & Display Your Own Images
Use this section to add classroom images (e.g., school & students) that illustrate population vs. samples.
# Place image files in the same folder as this .qmd
# and replace the names below, then render.
# Example:
# knitr::include_graphics("population_1000_students.png")
# knitr::include_graphics("four_samples_4x50.png")
Key Takeaways
- Even when the means are the same, the SDs can differ; SD is essential for measuring consistency.
- A histogram or boxplot shows the shape and spread of the data.
- A bar plot with SD bars shows the mean and the consistency at a glance.
- The t-test checks for a difference in means; here the decision depends on the p-value, but for consistency the key is to look at the SD.