HYPOTHESIS TEST

In this section, we use hypothesis tests to explore the patterns and characteristics of rat inspections, and we analyze the association between the number of inspections and several potential covariates, including climatic and geographical factors.

Homogeneity among Boroughs

In the borough visualizations, we found differences among the boroughs of New York. To verify this, we use chi-square tests to assess homogeneity among boroughs in recent years and among boroughs in 2021.

Chi-square testing homogeneity among boroughs in recent years

We look at the number of inspections per year to determine whether the distribution of inspections across boroughs varies from year to year.

\(H_0\) : The distribution of inspections among boroughs is the same across years.

\(H_1\) : The distribution of inspections among boroughs is not the same across all years.

# create a data frame
rat_years = rat %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, borough) %>% 
  group_by(borough, inspection_year) %>%
  summarize(frequency = n()) %>%
  mutate(borough = as.factor(borough),
    borough = fct_reorder(borough, frequency, .desc = TRUE)) %>% 
  pivot_wider(names_from = "inspection_year", values_from = "frequency")

# print the table
rat_years %>% t() %>% 
knitr::kable(caption = "Distribution of inspections across the years")

Distribution of inspections across the years
borough	Bronx	Brooklyn	Manhattan	Queens	Staten Island
2012	43241	54080	52816	17183	4642
2013	51546	27762	34775	14335	3339
2014	48293	29302	34813	15322	6359
2015	56551	22937	39104	14178	2235
2016	54009	35925	68354	17615	4596
2017	72298	62579	83434	23484	6190
2018	64570	88004	84824	12613	3717
2019	64669	85285	77958	15070	4269
2020	17477	29401	19356	3947	1589
2021	26116	38256	32044	6733	2480

# Chi-square test
chisq.test(rat_years[,-1]) %>% 
   broom::tidy()

## # A tibble: 1 × 4
##   statistic p.value parameter method                    
##       <dbl>   <dbl>     <int> <chr>                     
## 1    69810.       0        36 Pearson's Chi-squared test

The chi-square test above gives a p-value that prints as 0 (it is too small to display), so we reject the null hypothesis at the 1% significance level. This means that the distribution of inspections among boroughs differs across years.
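As a worked check on the degrees of freedom reported in the output above, the Pearson statistic for a table with \(r\) boroughs and \(c\) years can be written as follows. The symbols \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\) and \(N\) are notation introduced here (observed count, expected count, borough total, year total and grand total); they do not appear in the original analysis.

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}
\]

\[
\mathrm{df} = (r - 1)(c - 1) = (5 - 1)(10 - 1) = 36
\]

With 5 boroughs and 10 years (2012 to 2021), this agrees with the value 36 in the parameter column.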

Chi-square testing homogeneity among boroughs in 2021

Given that the distribution of inspections among boroughs varies from year to year, we next look at inspections among boroughs in 2021 alone.

\(H_0\) : Inspections are equally distributed among boroughs in 2021.

\(H_1\) : Inspections are not equally distributed among boroughs in 2021.

# count inspections per borough in 2021
rat_boro = rat_2021 %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, borough) %>% 
  group_by(borough, inspection_year) %>% 
  summarize(frequency = n()) %>%
  mutate(borough = as.factor(borough),
    borough = fct_reorder(borough, frequency, .desc = TRUE)) %>% 
  dplyr::select(borough, frequency)

# print the table
knitr::kable(rat_boro, caption = "Distribution of inspections in 2021")

Distribution of inspections in 2021
borough	frequency
Bronx	26116
Brooklyn	38256
Manhattan	32044
Queens	6733
Staten Island	2480

# chi-square test
chisq.test(rat_boro[,-1]) %>% 
  broom::tidy()

## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1    46974.       0         4 Chi-squared test for given probabilities

The chi-square test result leads us to reject the null hypothesis: the inspections are not equally distributed among boroughs in 2021.
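To make the null hypothesis concrete, the following minimal sketch reruns the test on the five counts printed in the 2021 table and shows the expected count per borough under equal probabilities. The vector name `counts_2021` and the result name `gof_2021` are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the 2021 table above; `counts_2021` is a name
# introduced for this sketch and does not appear in the original analysis.
counts_2021 <- c(Bronx = 26116, Brooklyn = 38256, Manhattan = 32044,
                 Queens = 6733, `Staten Island` = 2480)

# With a plain vector and no `p` argument, chisq.test() runs a
# goodness-of-fit test against equal probabilities (1/5 per borough).
gof_2021 <- chisq.test(counts_2021)

gof_2021$expected   # expected count per borough under H0: sum(counts) / 5
gof_2021$statistic  # Pearson chi-square statistic
gof_2021$parameter  # degrees of freedom: 5 - 1 = 4
```

Under this null hypothesis every borough is expected to receive one fifth of the 105629 inspections, so the test measures departure from an even split rather than from any population-adjusted or area-adjusted baseline.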

Association with Weather and COVID-19

In the visualizations, rat inspection counts appear to change with temperature and weather, as well as with the onset of COVID-19.

Chi-square testing association between inspection density and temperature in 2021

\(H_0\) : Inspections are equally distributed across temperature categories in 2021.

\(H_1\) : Inspections are not equally distributed across temperature categories in 2021.

# create a data frame
rat_n = rat_2021 %>%
  group_by(inspection_year, inspection_month, inspection_day)

# one row per date, with the daily average temperature binned into a category
rat_t = rat %>%
  mutate(Average_Temperature = (tmin + tmax)/2) %>% 
  rename(Date = date) %>%
  # each condition is checked in order, so every bin is (previous cutoff, cutoff]
  mutate(Feeling = case_when(
      Average_Temperature <= 0 ~ 'Frozen',
      Average_Temperature <= 10 ~ 'Cold',
      Average_Temperature <= 20 ~ 'Cool',
      Average_Temperature <= 30 ~ 'Warm',
      Average_Temperature <= 40 ~ 'Hot')) %>% 
  distinct(inspection_year, inspection_month, inspection_day, Date, Average_Temperature, tmax, tmin, Feeling)

rat_temp = 
  left_join(rat_n, rat_t, by = c("inspection_year", "inspection_month", "inspection_day")) %>%
  dplyr::select(Average_Temperature, Feeling) %>% 
  group_by(Feeling) %>%
  summarize(frequency = n()) %>%
  mutate(Feeling = as.factor(Feeling),
    Feeling = fct_reorder(Feeling, frequency, .desc = TRUE)) %>% 
  pivot_wider(names_from = "Feeling", values_from = "frequency") 

# print the table
knitr::kable(rat_temp, caption = "The association between number of rat inspections and temperature in 2021")

The association between number of rat inspections and temperature in 2021
Cold	Cool	Frozen	Hot	Warm
23188	28723	2425	532	50761

# chi-square test on all five categories
# (rat_temp[,-1] would drop the first column, Cold, and test only four)
chisq.test(unlist(rat_temp)) %>% 
  broom::tidy()

## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1    81135.       0         4 Chi-squared test for given probabilities

The chi-square test result leads us to reject the null hypothesis: inspections in 2021 are not equally distributed across temperature categories.

Wilcoxon Rank-sum testing changes of inspection density caused by COVID-19

For rodent inspections across time, we compare monthly inspection counts in 2019, before COVID-19, with those in 2020, after its onset.

\(H_0\) : The distribution of monthly inspection counts is the same before and after COVID-19.

\(H_1\) : The distribution of monthly inspection counts differs before and after COVID-19.

# count inspections per month in 2019 and 2020
rat_covid = rbind(rat_2019, rat_2020) %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, inspection_month) %>% 
  mutate(Inspection_year = as.factor(inspection_year),
         # order months chronologically rather than alphabetically
         Inspection_month = fct_relevel(as.factor(inspection_month),
                       "Jan", "Feb", "Mar","Apr","May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) %>%
  group_by(Inspection_year,Inspection_month) %>%
  summarize(Frequency = n())

# print the table
rat_covid2 = rat_covid %>%   
  pivot_wider(names_from = "Inspection_year", values_from = "Frequency")
knitr::kable(rat_covid2, caption = "Number of rat inspections before and after covid")

Number of rat inspections before and after covid
Inspection_month	2019	2020
Jan	21833	6589
Feb	20477	15398
Mar	24150	10340
Apr	25887	3
May	20150	2217
Jun	21006	5180
Jul	16577	6371
Aug	19201	6628
Sep	19474	7827
Oct	17654	2896
Nov	18360	3585
Dec	22482	4736

# Wilcoxon Rank-sum test
wilcox.test(Frequency ~ Inspection_year, data = rat_covid,
                   exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Frequency by Inspection_year
## W = 144, p-value = 3.658e-05
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank-sum test gives W = 144 with a p-value of 3.658e-05, so we reject the null hypothesis and conclude that monthly inspection counts differ between 2019, before COVID-19, and 2020, after its onset. Since each year contributes 12 monthly counts, W = 144 = 12 × 12 is the largest possible value: every monthly count in 2019 exceeds every monthly count in 2020.
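As a rough cross-check of the test above, the sketch below reruns it on the monthly counts printed in the table and compares W with a direct count of the (2019, 2020) month pairs in which the 2019 value is larger. The names `counts_2019`, `counts_2020`, `res_covid` and `pairs_2019_larger` are introduced here for illustration only.

```r
# Monthly counts copied from the table above (Jan to Dec); these vector
# names are assumptions of this sketch, not objects from the analysis.
counts_2019 <- c(21833, 20477, 24150, 25887, 20150, 21006,
                 16577, 19201, 19474, 17654, 18360, 22482)
counts_2020 <- c(6589, 15398, 10340, 3, 2217, 5180,
                 6371, 6628, 7827, 2896, 3585, 4736)

# Same test as above, with the two years passed as separate vectors.
res_covid <- wilcox.test(counts_2019, counts_2020, exact = FALSE)

# Without ties, W equals the number of (2019, 2020) pairs in which
# the 2019 count is larger than the 2020 count.
pairs_2019_larger <- sum(outer(counts_2019, counts_2020, ">"))

res_covid$statistic                        # W from the test
pairs_2019_larger                          # direct pair count
length(counts_2019) * length(counts_2020)  # total number of pairs
```

When the pair count equals the total number of pairs, the two samples do not overlap at all. This treats the 24 monthly counts as two independent samples, as the original test does.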

Summaries

Based on the statistical analyses above, we mainly found that:

The distribution of inspections among boroughs is not homogeneous across years. Brooklyn and Manhattan recorded the most inspections from 2018 to 2021 and differ markedly from the other boroughs.
The number of inspections is associated with other covariates, such as temperature and COVID-19.
