In this section, we explore the patterns and characteristics of rat inspections using several hypothesis tests, and we analyze the association between the number of inspections and potential covariates, including climatic and geographical factors.
In the borough visualizations, the boroughs of New York appeared to differ from one another. To verify this, we use chi-square tests to assess homogeneity among boroughs across recent years and the distribution of inspections among boroughs in 2021.
We first look at the number of inspections per year to determine whether the distribution of inspections across boroughs varies from year to year.
\(H_0\) : The distribution of inspections among boroughs is the same across years.
\(H_1\) : The distribution of inspections among boroughs is not the same across all years.
# create a data frame of inspection counts by borough and year
rat_years = rat %>%
  janitor::clean_names() %>%
  dplyr::select(inspection_year, borough) %>%
  group_by(borough, inspection_year) %>%
  # drop the grouping so later steps work on a plain tibble
  summarize(frequency = n(), .groups = "drop") %>%
  mutate(borough = as.factor(borough),
         borough = fct_reorder(borough, frequency, .desc = TRUE)) %>%
  # one row per borough, one column per year
  pivot_wider(names_from = "inspection_year", values_from = "frequency")
# print the table (transposed: years as rows, boroughs as columns)
rat_years %>% t() %>%
  knitr::kable(caption = "Distribution of inspections across the years")
borough | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
---|---|---|---|---|---|
2012 | 43241 | 54080 | 52816 | 17183 | 4642 |
2013 | 51546 | 27762 | 34775 | 14335 | 3339 |
2014 | 48293 | 29302 | 34813 | 15322 | 6359 |
2015 | 56551 | 22937 | 39104 | 14178 | 2235 |
2016 | 54009 | 35925 | 68354 | 17615 | 4596 |
2017 | 72298 | 62579 | 83434 | 23484 | 6190 |
2018 | 64570 | 88004 | 84824 | 12613 | 3717 |
2019 | 64669 | 85285 | 77958 | 15070 | 4269 |
2020 | 17477 | 29401 | 19356 | 3947 | 1589 |
2021 | 26116 | 38256 | 32044 | 6733 | 2480 |
# chi-square test of homogeneity (the first column, borough, is dropped)
chisq.test(rat_years[,-1]) %>%
  broom::tidy()
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 69810. 0 36 Pearson's Chi-squared test
The chi-square test above gives a statistic of about 69810 on 36 degrees of freedom, with a p-value too small to display (it prints as 0), so we reject the null hypothesis at the 1% significance level. The distribution of inspections among boroughs differs across years.
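The calculation behind this test can be reproduced by hand as a check. The sketch below is illustrative rather than part of the original analysis: it retypes the counts from the table above into a matrix (the name `inspections` and the years-by-boroughs orientation are assumptions made here; the statistic does not depend on the orientation), computes the expected counts under homogeneity, and evaluates the p-value on the log scale, since the ordinary p-value is too small to print.

```r
# Counts copied from the table above: rows are years, columns are boroughs.
inspections <- matrix(
  c(43241, 54080, 52816, 17183, 4642,
    51546, 27762, 34775, 14335, 3339,
    48293, 29302, 34813, 15322, 6359,
    56551, 22937, 39104, 14178, 2235,
    54009, 35925, 68354, 17615, 4596,
    72298, 62579, 83434, 23484, 6190,
    64570, 88004, 84824, 12613, 3717,
    64669, 85285, 77958, 15070, 4269,
    17477, 29401, 19356, 3947, 1589,
    26116, 38256, 32044, 6733, 2480),
  ncol = 5, byrow = TRUE,
  dimnames = list(
    year = as.character(2012:2021),
    borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
  )
)

# Expected count in each cell under homogeneity: row total * column total / grand total.
expected_manual <- outer(rowSums(inspections), colSums(inspections)) / sum(inspections)

# Pearson chi-square statistic and its degrees of freedom.
stat_manual <- sum((inspections - expected_manual)^2 / expected_manual)
df_manual <- (nrow(inspections) - 1) * (ncol(inspections) - 1)

# Upper-tail p-value on the log scale, to avoid underflow to 0.
log_p <- pchisq(stat_manual, df_manual, lower.tail = FALSE, log.p = TRUE)

# Built-in test for comparison.
test <- chisq.test(inspections)
print(c(manual = stat_manual, builtin = unname(test$statistic), df = df_manual))
print(log_p)
```

The Pearson residuals in `test$residuals` show which year and borough combinations depart most from the homogeneity assumption.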
Given that the distribution of inspections among boroughs varies from year to year, we next look at how inspections were distributed among boroughs in 2021.
\(H_0\) : Inspections are equally distributed among boroughs in 2021.
\(H_1\) : Inspections are not equally distributed among boroughs in 2021.
# create a data frame of 2021 inspection counts by borough
rat_boro = rat_2021 %>%
  janitor::clean_names() %>%
  dplyr::select(inspection_year, borough) %>%
  group_by(borough, inspection_year) %>%
  summarize(frequency = n(), .groups = "drop") %>%
  mutate(borough = as.factor(borough),
         borough = fct_reorder(borough, frequency, .desc = TRUE)) %>%
  dplyr::select(borough, frequency)
# print the table
knitr::kable(rat_boro, caption = "Distribution of inspections in 2021")
borough | frequency |
---|---|
Bronx | 26116 |
Brooklyn | 38256 |
Manhattan | 32044 |
Queens | 6733 |
Staten Island | 2480 |
# chi-square goodness-of-fit test against equal proportions
chisq.test(rat_boro[,-1]) %>%
  broom::tidy()
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <dbl> <chr>
## 1 46974. 0 4 Chi-squared test for given probabilities
The chi-square test gives a statistic of about 46974 on 4 degrees of freedom, with a p-value that prints as 0, so we reject the null hypothesis: inspections are not equally distributed among boroughs in 2021.
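As a check on the goodness-of-fit result, the statistic can be computed directly from the 2021 counts. This is a minimal sketch: the vector name `observed` is introduced here for illustration, and the null hypothesis is assumed to assign each borough an equal share of inspections, which is the default of `chisq.test()` for a single vector of counts.

```r
# 2021 inspection counts by borough, copied from the table above.
observed <- c(Bronx = 26116, Brooklyn = 38256, Manhattan = 32044,
              Queens = 6733, "Staten Island" = 2480)

# Under the null hypothesis each borough receives an equal share of inspections.
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson goodness-of-fit statistic, degrees of freedom, and p-value.
stat_manual <- sum((observed - expected)^2 / expected)
df_manual <- length(observed) - 1
p_manual <- pchisq(stat_manual, df_manual, lower.tail = FALSE)

# Built-in test for comparison (equal probabilities by default).
test <- chisq.test(observed)
print(c(manual = stat_manual, builtin = unname(test$statistic), df = df_manual))

# Observed share of inspections per borough, to compare with the 20% expected.
print(round(observed / sum(observed), 3))
```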
In the visualizations, the number of rat inspections appeared to vary with temperature and weather, and to change after the onset of COVID-19.
\(H_0\) : Inspections are equally distributed across temperature categories in 2021.
\(H_1\) : Inspections are not equally distributed across temperature categories in 2021.
# create a data frame
# one row per 2021 inspection
rat_n = rat_2021
# one row per date, with its average temperature and temperature category
rat_t = rat %>%
  mutate(Average_Temperature = (tmin + tmax)/2) %>%
  rename(Date = date) %>%
  # bin the average temperature; values above 40 become NA
  mutate(Feeling = case_when(
    Average_Temperature <= 0 ~ 'Frozen',
    Average_Temperature <= 10 ~ 'Cold',
    Average_Temperature <= 20 ~ 'Cool',
    Average_Temperature <= 30 ~ 'Warm',
    Average_Temperature <= 40 ~ 'Hot')) %>%
  distinct(inspection_year, inspection_month, inspection_day, Date, Average_Temperature, tmax, tmin, Feeling)
# attach the temperature category to each inspection and count inspections per category
rat_temp =
  left_join(rat_n, rat_t, by = c("inspection_year", "inspection_month", "inspection_day")) %>%
  dplyr::select(Average_Temperature, Feeling) %>%
  group_by(Feeling) %>%
  summarize(frequency = n()) %>%
  mutate(Feeling = as.factor(Feeling),
         Feeling = fct_reorder(Feeling, frequency, .desc = TRUE)) %>%
  pivot_wider(names_from = "Feeling", values_from = "frequency")
# print the table
knitr::kable(rat_temp, caption = "The association between number of rat inspections and temperature in 2021")
Cold | Cool | Frozen | Hot | Warm |
---|---|---|---|---|
23188 | 28723 | 2425 | 532 | 50761 |
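# chi-square test on all five temperature categories
# (rat_temp[,-1] would drop the first column, Cold, and test only four categories)
chisq.test(unlist(rat_temp)) %>%
  broom::tidy()
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <dbl> <chr>
## 1 81135. 0 4 Chi-squared test for given probabilities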
The p-value from the chi-square test prints as 0, so we reject the null hypothesis: inspections are not equally distributed across temperature categories in 2021.
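The temperature binning used in this test can also be written with base R's `cut()`. The sketch below is an illustration under assumptions, not the code used in the analysis: the temperatures in `avg_temp` are hypothetical values chosen to sit on and around the category boundaries, and `right = TRUE` makes each interval closed on the right so that it matches the `<=` comparisons in `case_when()`.

```r
# Hypothetical average temperatures, chosen to probe the category boundaries.
avg_temp <- c(-5, 0, 0.1, 10, 15, 25, 35)

# Same cut points as the case_when() rules: (-Inf, 0], (0, 10], (10, 20], (20, 30], (30, 40].
feeling <- cut(
  avg_temp,
  breaks = c(-Inf, 0, 10, 20, 30, 40),
  labels = c("Frozen", "Cold", "Cool", "Warm", "Hot"),
  right = TRUE
)

# Category assigned to each temperature, then the count per category.
print(data.frame(avg_temp, feeling))
print(table(feeling))

# A value above 40 falls outside every interval and becomes NA,
# just as case_when() returns NA when no condition matches.
too_hot <- cut(45, breaks = c(-Inf, 0, 10, 20, 30, 40),
               labels = c("Frozen", "Cold", "Cool", "Warm", "Hot"), right = TRUE)
print(too_hot)
```

Unlike `case_when()`, `cut()` returns a factor whose levels follow the order of the breaks, so tables and plots list the categories from coldest to hottest.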
For rodent inspections over time, we compare the monthly inspection counts in 2019, before COVID-19, with those in 2020, after its onset. Because the Wilcoxon rank-sum test is rank-based, it compares the locations of the two distributions rather than their means.
\(H_0\) : The distribution of monthly inspection counts is the same in 2019 (before COVID-19) and in 2020 (after its onset).
\(H_1\) : The distribution of monthly inspection counts differs between 2019 and 2020.
# create a data frame of monthly inspection counts for 2019 and 2020
rat_covid = rbind(rat_2019, rat_2020) %>%
  janitor::clean_names() %>%
  dplyr::select(inspection_year, inspection_month) %>%
  mutate(Inspection_year = as.factor(inspection_year),
         # order the months chronologically rather than alphabetically
         Inspection_month = fct_relevel(as.factor(inspection_month),
           "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) %>%
  group_by(Inspection_year, Inspection_month) %>%
  summarize(Frequency = n(), .groups = "drop")
# print the table
rat_covid2 = rat_covid %>%
  pivot_wider(names_from = "Inspection_year", values_from = "Frequency")
knitr::kable(rat_covid2, caption = "Number of rat inspections before and after covid")
Inspection_month | 2019 | 2020 |
---|---|---|
Jan | 21833 | 6589 |
Feb | 20477 | 15398 |
Mar | 24150 | 10340 |
Apr | 25887 | 3 |
May | 20150 | 2217 |
Jun | 21006 | 5180 |
Jul | 16577 | 6371 |
Aug | 19201 | 6628 |
Sep | 19474 | 7827 |
Oct | 17654 | 2896 |
Nov | 18360 | 3585 |
Dec | 22482 | 4736 |
# Wilcoxon Rank-sum test
wilcox.test(Frequency ~ Inspection_year, data = rat_covid,
exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Frequency by Inspection_year
## W = 144, p-value = 3.658e-05
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank-sum test gives W = 144 with a p-value of 3.658e-05, so we reject the null hypothesis at the 1% significance level: there is a location shift between the two years. With 12 months in each year, W = 144 is the largest value the statistic can take, which reflects that every monthly count in 2019 exceeds every monthly count in 2020. We conclude that monthly inspection counts differ between 2019, before COVID-19, and 2020, after its onset.
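The W statistic has a direct interpretation that can be checked from the monthly table. In the sketch below, which is illustrative only, the vectors `counts_2019` and `counts_2020` are retyped from the table above, and W is computed as the number of (2019 month, 2020 month) pairs in which the 2019 count is larger, with ties counted as one half.

```r
# Monthly inspection counts, January to December, copied from the table above.
counts_2019 <- c(21833, 20477, 24150, 25887, 20150, 21006,
                 16577, 19201, 19474, 17654, 18360, 22482)
counts_2020 <- c(6589, 15398, 10340, 3, 2217, 5180,
                 6371, 6628, 7827, 2896, 3585, 4736)

# W counts the pairs in which the 2019 value exceeds the 2020 value (ties count 0.5).
w_manual <- sum(outer(counts_2019, counts_2020, ">")) +
  0.5 * sum(outer(counts_2019, counts_2020, "=="))

# Largest possible W: every one of the 12 x 12 pairs favours 2019.
w_max <- length(counts_2019) * length(counts_2020)

# Built-in test with the normal approximation, as in the analysis above.
w_test <- wilcox.test(counts_2019, counts_2020, exact = FALSE)
print(c(manual = w_manual, builtin = unname(w_test$statistic), maximum = w_max))
print(w_test$p.value)
```

Because W equals its maximum, the two samples do not overlap at all: the lowest monthly count in 2019 is still higher than the highest monthly count in 2020.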
Based on the statistical analyses above, we mainly found that: