Computer Simulation using examples from `infer`

package. Code for quiz 13

Load the R packages that we will use

Set random seed generator to 123

Load data into R, assign it to

`hr`

- Use
`skim`

to summarize the data in`hr`

```
skim(hr)
```

Name | hr |

Number of rows | 500 |

Number of columns | 6 |

_______________________ | |

Column type frequency: | |

factor | 4 |

numeric | 2 |

________________________ | |

Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|

gender | 0 | 1 | FALSE | 2 | mal: 256, fem: 244 |

evaluation | 0 | 1 | FALSE | 4 | bad: 154, fai: 142, goo: 108, ver: 96 |

salary | 0 | 1 | FALSE | 6 | lev: 95, lev: 94, lev: 87, lev: 85 |

status | 0 | 1 | FALSE | 3 | fir: 194, pro: 179, ok: 127 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

age | 0 | 1 | 39.86 | 11.55 | 20.3 | 29.60 | 40.2 | 50.1 | 59.9 | ▇▇▇▇▇ |

hours | 0 | 1 | 49.39 | 13.15 | 35.0 | 37.48 | 45.6 | 58.9 | 79.9 | ▇▃▂▂▂ |

- The mean hours worked per week is 49.4.

**Is the mean number of hours worked per week 48?**

`specify`

that`hours`

is the variable of interest.

```
Response: hours (numeric)
# A tibble: 500 x 1
hours
<dbl>
1 78.1
2 35.1
3 36.9
4 38.5
5 36.1
6 78.1
7 76
8 35.6
9 35.6
10 56.8
# ... with 490 more rows
```

`hypothesize`

that the average hours worked is 48

```
hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48)
```

```
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
hours
<dbl>
1 78.1
2 35.1
3 36.9
4 38.5
5 36.1
6 78.1
7 76
8 35.6
9 35.6
10 56.8
# ... with 490 more rows
```

`generate`

1000 replicates representing the null hypothesis

```
hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap")
```

```
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups: replicate [1,000]
replicate hours
<int> <dbl>
1 1 39.7
2 1 44.3
3 1 46.8
4 1 33.7
5 1 39.6
6 1 39.5
7 1 40.5
8 1 55.8
9 1 72.6
10 1 35.7
# ... with 499,990 more rows
```

- The output has 500,000 rows

`calculate`

the distribution of statistics from the generated dataassign output to

`null_t_distribution`

display

`null_t_distribution`

```
null_t_distribution <- hr %>%
specify(response = age) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "t")
null_t_distribution
```

```
Response: age (numeric)
Null Hypothesis: point
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 0.144
2 2 -1.72
3 3 0.404
4 4 -1.11
5 5 0.00894
6 6 1.46
7 7 -0.905
8 8 -0.663
9 9 0.291
10 10 3.09
# ... with 990 more rows
```

`null_t_distribution`

has 1000 t-stats.

`visualize`

the simulated null distribution.

```
visualise(null_t_distribution)
```

`calculate`

the statistic from your observed data.assign the output to

`observed_t_statistic`

display

`observed_t_statistic`

```
observed_t_statistic <- hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
calculate(stat = "t")
observed_t_statistic
```

```
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 1 x 1
stat
<dbl>
1 2.37
```

`get_p_value`

from the simulated null distribution and the observed statistic.

```
null_t_distribution %>%
get_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
```

```
# A tibble: 1 x 1
p_value
<dbl>
1 0.014
```

`shade_p_value`

on the simulated null distribution.

```
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
```

Is the p-value < 0.05?

`yes`

Does your analysis support the null hypothesis that the true mean number of hours worked was 48?

`no`

- Load data into R, assign it to
`hr_2`

```
hr_2 <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv",
col_types = "fddfff")
```

**Is the average number of hours worked the same for both genders in hr_2?**

- Use
`skim`

to summarize the data in`hr_2`

by gender.

Name | Piped data |

Number of rows | 500 |

Number of columns | 6 |

_______________________ | |

Column type frequency: | |

factor | 3 |

numeric | 2 |

________________________ | |

Group variables | gender |

**Variable type: factor**

skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|

evaluation | female | 0 | 1 | FALSE | 4 | fai: 81, bad: 71, ver: 57, goo: 51 |

evaluation | male | 0 | 1 | FALSE | 4 | bad: 82, fai: 61, goo: 55, ver: 42 |

salary | female | 0 | 1 | FALSE | 6 | lev: 54, lev: 50, lev: 44, lev: 41 |

salary | male | 0 | 1 | FALSE | 6 | lev: 52, lev: 47, lev: 46, lev: 39 |

status | female | 0 | 1 | FALSE | 3 | fir: 96, pro: 87, ok: 77 |

status | male | 0 | 1 | FALSE | 3 | fir: 89, ok: 76, pro: 75 |

**Variable type: numeric**

skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|

age | female | 0 | 1 | 41.78 | 11.50 | 20.5 | 32.15 | 42.35 | 51.62 | 59.9 | ▆▅▇▆▇ |

age | male | 0 | 1 | 39.32 | 11.55 | 20.2 | 28.70 | 38.55 | 49.52 | 59.7 | ▇▇▆▇▆ |

hours | female | 0 | 1 | 50.32 | 13.23 | 35.0 | 38.38 | 47.80 | 60.40 | 79.7 | ▇▃▃▂▂ |

hours | male | 0 | 1 | 48.24 | 12.95 | 35.0 | 37.00 | 42.40 | 57.00 | 78.1 | ▇▂▂▁▂ |

Females worked an average of

**50.3**hours per week.Males worked an average of

**48.2**hours per week.

- Use
`geom_boxplot`

to plot distribution of hours worked by gender.

```
hr_2 %>%
ggplot(aes(x = gender, y = hours)) +
geom_boxplot()
```

`specify`

the variables of interest are hours and gender

```
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 36.5 female
2 55.8 female
3 35 male
4 52 female
5 35.1 male
6 36.3 female
7 40.1 female
8 42.7 female
9 66.6 male
10 35.5 male
# ... with 490 more rows
```

`hypothesize`

that the number of hours worked and gender are independent.

```
hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence")
```

```
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 36.5 female
2 55.8 female
3 35 male
4 52 female
5 35.1 male
6 36.3 female
7 40.1 female
8 42.7 female
9 66.6 male
10 35.5 male
# ... with 490 more rows
```

`generate`

1000 replicates representing the null hypothesis

```
hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute")
```

```
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours gender replicate
<dbl> <fct> <int>
1 36.4 female 1
2 35.8 female 1
3 35.6 male 1
4 39.6 female 1
5 35.8 male 1
6 55.8 female 1
7 63.8 female 1
8 40.3 female 1
9 56.5 male 1
10 50.1 male 1
# ... with 499,990 more rows
```

- The output has 500,000 rows.

`calculate`

the distribution of statistics from the generated data.Assign output to

`null_distribution_2_sample_permute`

Display

`null_distribution_2_sample_permute`

```
null_distribution_2_sample_permute <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t", order = c("female", "male"))
null_distribution_2_sample_permute
```

```
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 -0.208
2 2 -0.328
3 3 -2.28
4 4 0.528
5 5 1.60
6 6 0.795
7 7 1.24
8 8 -3.31
9 9 0.517
10 10 0.949
# ... with 990 more rows
```

`null_t_distribution`

has 1000 t-stats.

`visualize`

the simulated null distribution.

```
visualize(null_distribution_2_sample_permute)
```

`calculate`

the statistic from your observed data.Assign the output to

`observed_t_2_sample_stat`

Display

`observed_t_2_sample_stat`

```
observed_t_2_sample_stat <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
calculate(stat = "t", order = c("female", "male"))
observed_t_2_sample_stat
```

```
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 1 x 1
stat
<dbl>
1 1.78
```

`get_p_value`

from your simulated null distribution and the observed statistic

```
null_t_distribution %>%
get_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
```

```
# A tibble: 1 x 1
p_value
<dbl>
1 0.086
```

`shade_p_value`

on the simulated null distribution.

```
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
```

Is the p-value < 0.05?

`no`

Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same?

`yes`

- Load data into R, assign to
`hr_anova`

```
hr_anova <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv",
col_types = "fddfff")
```

**Is the average number of hours worked the same for all three status (fired, ok, & promoted)?**

- Use
`skim`

to summarize the data into`hr_anova`

by`status`

Name | Piped data |

Number of rows | 500 |

Number of columns | 6 |

_______________________ | |

Column type frequency: | |

factor | 3 |

numeric | 2 |

________________________ | |

Group variables | status |

**Variable type: factor**

skim_variable | status | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|

gender | promoted | 0 | 1 | FALSE | 2 | mal: 90, fem: 89 |

gender | fired | 0 | 1 | FALSE | 2 | fem: 101, mal: 93 |

gender | ok | 0 | 1 | FALSE | 2 | mal: 73, fem: 54 |

evaluation | promoted | 0 | 1 | FALSE | 4 | goo: 70, ver: 62, fai: 24, bad: 23 |

evaluation | fired | 0 | 1 | FALSE | 4 | bad: 78, fai: 72, goo: 25, ver: 19 |

evaluation | ok | 0 | 1 | FALSE | 4 | bad: 53, fai: 46, ver: 15, goo: 13 |

salary | promoted | 0 | 1 | FALSE | 6 | lev: 42, lev: 42, lev: 39, lev: 34 |

salary | fired | 0 | 1 | FALSE | 6 | lev: 54, lev: 44, lev: 34, lev: 24 |

salary | ok | 0 | 1 | FALSE | 6 | lev: 32, lev: 31, lev: 26, lev: 19 |

**Variable type: numeric**

skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|

age | promoted | 0 | 1 | 40.63 | 11.25 | 20.4 | 30.75 | 41.10 | 50.25 | 59.9 | ▆▇▇▇▇ |

age | fired | 0 | 1 | 40.03 | 11.53 | 20.3 | 29.45 | 40.40 | 50.08 | 59.9 | ▇▅▇▆▆ |

age | ok | 0 | 1 | 38.50 | 11.98 | 20.3 | 28.15 | 38.70 | 49.45 | 59.9 | ▇▆▅▅▆ |

hours | promoted | 0 | 1 | 59.21 | 12.66 | 35.0 | 49.75 | 58.90 | 70.65 | 79.9 | ▅▆▇▇▇ |

hours | fired | 0 | 1 | 41.67 | 8.37 | 35.0 | 36.10 | 38.45 | 43.40 | 77.7 | ▇▂▁▁▁ |

hours | ok | 0 | 1 | 47.35 | 10.86 | 35.0 | 37.10 | 45.70 | 54.50 | 78.9 | ▇▅▃▂▁ |

Employees that were

**fired**worked an average of**41.7**hours per week.Employees that were

**ok**worked an average of**47.4**hours per week.Employees that were

**promoted**worked an average of**59.2**hours per week.

- Use
`geom_boxplot`

to plot distributions of hours worked by status.

```
hr_anova %>%
ggplot(aes(x = status, y = hours)) +
geom_boxplot()
```

`specify`

the variables of interest are`hours`

and`status`

```
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 78.1 promoted
2 35.1 fired
3 36.9 fired
4 38.5 fired
5 36.1 fired
6 78.1 promoted
7 76 promoted
8 35.6 fired
9 35.6 ok
10 56.8 promoted
# ... with 490 more rows
```

`hypothesize`

that the number of hours worked and status are independent.

```
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesise(null = "independence")
```

```
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 78.1 promoted
2 35.1 fired
3 36.9 fired
4 38.5 fired
5 36.1 fired
6 78.1 promoted
7 76 promoted
8 35.6 fired
9 35.6 ok
10 56.8 promoted
# ... with 490 more rows
```

`generate`

1000 replicates representing the null hypothesis

```
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesise(null = "independence") %>%
generate(reps = 1000, type = "permute")
```

```
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours status replicate
<dbl> <fct> <int>
1 41.9 promoted 1
2 36.7 fired 1
3 35 fired 1
4 58.9 fired 1
5 36.1 fired 1
6 39.4 promoted 1
7 54.3 promoted 1
8 59.2 fired 1
9 40.2 ok 1
10 35.3 promoted 1
# ... with 499,990 more rows
```

- The output has 500,000 rows.

`calculate`

the distribution of statistics from generated dataAssign the output to

`null_distribution_anova`

Display

`null_distribution_anova`

```
null_distribution_anova <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesise(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
null_distribution_anova
```

```
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 0.312
2 2 2.85
3 3 0.369
4 4 0.142
5 5 0.511
6 6 2.73
7 7 1.06
8 8 0.171
9 9 0.310
10 10 1.11
# ... with 990 more rows
```

`null_distribution_anova`

has 1000 F-stats.

`visualize`

the statistic from your observed data.

```
visualise(null_distribution_anova)
```

`calculate`

the statistic from your observed data.Assign the output to

`observed_f_sample_stat`

Display

`observed_f_sample_stat`

```
observed_f_sample_stat <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
calculate(stat = "F")
observed_f_sample_stat
```

```
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 1 x 1
stat
<dbl>
1 128.
```

`get_p_value`

from the simulated null distribution and the observed statistic.

```
null_distribution_anova %>%
get_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
```

```
# A tibble: 1 x 1
p_value
<dbl>
1 0
```

`shade_p_value`

on the simulated null distribution.

```
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
```

Is the p-value < 0.05?

`yes`

Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same?

`no`