HYPOTHESIS TEST

 

In this section, we explore the patterns and characteristics of rat inspections using several hypothesis tests, and we analyze the association between the number of inspections and potential covariates, including climatic and geographical factors.

 

 

Homogeneity among Boroughs

 

In the borough visualizations, we saw apparent differences in inspection counts among the boroughs of New York. To verify this, we use chi-square tests to assess homogeneity among boroughs across recent years and among boroughs in 2021.

 

Chi-square testing homogeneity among boroughs in recent years

We look at the number of inspections per year to determine whether the distribution of inspections among boroughs varies from year to year.

\(H_0\) : The distribution of inspections among boroughs is the same in every year.

\(H_1\) : The distribution of inspections among boroughs is not the same in every year.
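
For reference, the Pearson statistic behind this test can be written as below. The symbols \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\) and \(N\) are notation introduced here for illustration: \(O_{ij}\) is the observed count for year \(i\) and borough \(j\), \(R_i\) and \(C_j\) are the row and column totals, \(N\) is the grand total, and \(E_{ij}\) is the count expected under \(H_0\).

\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i\,C_j}{N}, \qquad \mathrm{df} = (r-1)(c-1)
\]

With \(r = 10\) years and \(c = 5\) boroughs, the reference distribution has \((10-1)(5-1) = 36\) degrees of freedom.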

# create a data frame
rat_years = rat %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, borough) %>% 
  group_by(borough, inspection_year) %>%
  summarize(frequency = n()) %>%
  mutate(borough = as.factor(borough),
    borough = fct_reorder(borough, frequency, .desc = TRUE)) %>% 
  pivot_wider(names_from = "inspection_year", values_from = "frequency")

# print the table
rat_years %>% t() %>% 
knitr::kable(caption = "Distribution of inspections across the years")
Distribution of inspections across the years
borough  Bronx  Brooklyn  Manhattan  Queens  Staten Island
2012     43241     54080      52816   17183           4642
2013     51546     27762      34775   14335           3339
2014     48293     29302      34813   15322           6359
2015     56551     22937      39104   14178           2235
2016     54009     35925      68354   17615           4596
2017     72298     62579      83434   23484           6190
2018     64570     88004      84824   12613           3717
2019     64669     85285      77958   15070           4269
2020     17477     29401      19356    3947           1589
2021     26116     38256      32044    6733           2480
# Chi-square test
chisq.test(rat_years[,-1]) %>% 
   broom::tidy()
## # A tibble: 1 × 4
##   statistic p.value parameter method                    
##       <dbl>   <dbl>     <int> <chr>                     
## 1    69810.       0        36 Pearson's Chi-squared test

The chi-square test gives a statistic of about 69,810 on 36 degrees of freedom, with a p-value so small that it is printed as 0, so we reject the null hypothesis at the 1% significance level. The distribution of inspections among boroughs differs across years.
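
With counts this large, almost any departure from homogeneity produces a p-value near zero, so it can help to look at an effect size and at which cells deviate most. The following is a minimal sketch, not part of the original analysis: it rebuilds the count table by hand from the printed values, and the names `counts`, `res` and `cramers_v` are introduced here for illustration. Cramér's V is an added measure, computed from the same chi-square statistic.

```r
# Counts copied from the table above: rows = years, columns = boroughs
counts <- matrix(
  c(43241, 54080, 52816, 17183, 4642,
    51546, 27762, 34775, 14335, 3339,
    48293, 29302, 34813, 15322, 6359,
    56551, 22937, 39104, 14178, 2235,
    54009, 35925, 68354, 17615, 4596,
    72298, 62579, 83434, 23484, 6190,
    64570, 88004, 84824, 12613, 3717,
    64669, 85285, 77958, 15070, 4269,
    17477, 29401, 19356,  3947, 1589,
    26116, 38256, 32044,  6733, 2480),
  ncol = 5, byrow = TRUE,
  dimnames = list(
    year = as.character(2012:2021),
    borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
  )
)

# Pearson chi-square test of homogeneity on the 10 x 5 table
res <- chisq.test(counts)

# Cramer's V: 0 means identical distributions, 1 means maximal association
cramers_v <- sqrt(unname(res$statistic) / (sum(counts) * (min(dim(counts)) - 1)))

# Standardized residuals: large absolute values flag the
# borough-years that deviate most from the null hypothesis
round(res$stdres, 1)
cramers_v
```

The test is unchanged by transposing the table, so it does not matter that years are in rows here and in columns in the original data frame.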

 

 

Chi-square testing homogeneity among boroughs in 2021

Given that the distribution of inspections among boroughs varies from year to year, we next look at how inspections are distributed among boroughs in 2021 alone.

\(H_0\) : Inspections are equally distributed among the five boroughs in 2021.

\(H_1\) : Inspections are not equally distributed among the five boroughs in 2021.

# count inspections per borough in 2021
rat_boro = rat_2021 %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, borough) %>% 
  group_by(borough, inspection_year) %>% 
  summarize(frequency = n()) %>%
  mutate(borough = as.factor(borough),
    borough = fct_reorder(borough, frequency, .desc = TRUE)) %>% 
  dplyr::select(borough, frequency)

# print the table
knitr::kable(rat_boro, caption = "Distribution of inspections in 2021")
Distribution of inspections in 2021
borough frequency
Bronx 26116
Brooklyn 38256
Manhattan 32044
Queens 6733
Staten Island 2480
# chi-square test
chisq.test(rat_boro[,-1]) %>% 
  broom::tidy()
## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1    46974.       0         4 Chi-squared test for given probabilities

The chi-square test rejects the null hypothesis: inspections are not equally distributed among boroughs in 2021.
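
As a quick check, the goodness-of-fit test can be reproduced directly from the five printed counts. This is a sketch under the assumption that the null hypothesis is an equal share of 1/5 per borough, which is what `chisq.test()` uses by default for a single vector of counts. The names `boro_2021` and `gof` are introduced here for illustration.

```r
# 2021 counts copied from the table above
boro_2021 <- c(Bronx = 26116, Brooklyn = 38256, Manhattan = 32044,
               Queens = 6733, `Staten Island` = 2480)

# Goodness-of-fit test against equal shares (1/5 per borough)
gof <- chisq.test(boro_2021, p = rep(1 / 5, 5))

gof$expected    # expected count per borough under H0 (total / 5)
gof$statistic   # Pearson chi-square statistic
gof$parameter   # degrees of freedom: 5 - 1 = 4

# Observed share of inspections per borough
round(prop.table(boro_2021), 3)
```

A different null hypothesis, for example shares proportional to borough population, could be tested by passing other probabilities to `p`. Those values are not part of this analysis and would have to be supplied.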

 

 

Association with Weather and COVID-19

 

In the visualizations, we saw that the number of rat inspections changes with temperature and weather, and that it changed after the onset of COVID-19.

 

Chi-square testing association between inspection density and temperature in 2021

\(H_0\) : Inspections are equally distributed among the temperature categories in 2021.

\(H_1\) : Inspections are not equally distributed among the temperature categories in 2021.

# create a data frame
rat_n = rat_2021 %>%
  group_by(inspection_year, inspection_month, inspection_day)

rat_t = rat %>%
  mutate(Average_Temperature = (tmin + tmax)/2) %>% 
  rename(Date = date) %>%
  # bin the daily average temperature into categories
  mutate(Feeling = case_when(
      Average_Temperature <= 0 ~ 'Frozen',
      Average_Temperature <= 10 ~ 'Cold',
      Average_Temperature <= 20 ~ 'Cool',
      Average_Temperature <= 30 ~ 'Warm',
      Average_Temperature <= 40 ~ 'Hot')) %>% 
  distinct(inspection_year, inspection_month, inspection_day, Date, Average_Temperature, tmax, tmin, Feeling)

rat_temp = 
  left_join(rat_n, rat_t, by = c("inspection_year", "inspection_month", "inspection_day")) %>%
  dplyr::select(Average_Temperature, Feeling) %>% 
  group_by(Feeling) %>%
  summarize(frequency = n()) %>%
  mutate(Feeling = as.factor(Feeling),
    Feeling = fct_reorder(Feeling, frequency, .desc = TRUE)) %>% 
  pivot_wider(names_from = "Feeling", values_from = "frequency") 

# print the table
knitr::kable(rat_temp, caption = "The association between number of rat inspections and temperature in 2021")
The association between number of rat inspections and temperature in 2021
Cold Cool Frozen Hot Warm
23188 28723 2425 532 50761
# chi-square test on all five categories
# (rat_temp has no label column, so no column is dropped)
chisq.test(rat_temp) %>% 
  broom::tidy()
## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1    81135.       0         4 Chi-squared test for given probabilities

The chi-square test rejects the null hypothesis: inspections are not equally distributed among the temperature categories in 2021.
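
The temperature categories come from threshold rules of the form `Average_Temperature <= 10 ~ 'Cold'`. As an alternative sketch, the same binning can be written with base R `cut()`, which also keeps the categories in temperature order instead of frequency order. The vector `avg_temp` holds hypothetical values chosen only to exercise the bin edges; it is not data from this project.

```r
# Hypothetical daily average temperatures, for illustration only
avg_temp <- c(-5, 0, 0.1, 10, 15, 25, 35, 45)

# right = TRUE makes each bin (lower, upper], matching the `<=` rules;
# values above 40 fall outside every bin and become NA
feeling <- cut(avg_temp,
               breaks = c(-Inf, 0, 10, 20, 30, 40),
               labels = c("Frozen", "Cold", "Cool", "Warm", "Hot"),
               right = TRUE)

# Show each temperature next to its category, then count per category
data.frame(avg_temp, feeling)
table(feeling, useNA = "ifany")
```

Values above 40 come out as `NA` here, which is the same behavior as a `case_when()` without a fallback branch.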

 

 

Wilcoxon Rank-sum testing changes of inspection density caused by COVID-19

To examine rodent inspections over time, we compare the monthly inspection counts in 2019, before COVID-19, with the monthly inspection counts in 2020, after its onset. The Wilcoxon rank-sum test compares the two distributions by rank and does not compare means.

\(H_0\) : The distribution of monthly inspection counts is the same before and after COVID-19.

\(H_1\) : The distribution of monthly inspection counts differs before and after COVID-19.

# count inspections per month in 2019 and 2020
rat_covid = rbind(rat_2019, rat_2020) %>% 
  janitor::clean_names() %>% 
  dplyr::select(inspection_year, inspection_month) %>% 
  mutate(Inspection_year = as.factor(inspection_year),
         Inspection_month = as.factor(inspection_month)) %>%
  mutate(Inspection_month = inspection_month %>% 
                       fct_relevel("Jan", "Feb", "Mar","Apr","May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) %>%
  group_by(Inspection_year,Inspection_month) %>%
  summarize(Frequency = n())

# print the table
rat_covid2 = rat_covid %>%   
  pivot_wider(names_from = "Inspection_year", values_from = "Frequency")
knitr::kable(rat_covid2, caption = "Number of rat inspections before and after covid")
Number of rat inspections before and after covid
Inspection_month 2019 2020
Jan 21833 6589
Feb 20477 15398
Mar 24150 10340
Apr 25887 3
May 20150 2217
Jun 21006 5180
Jul 16577 6371
Aug 19201 6628
Sep 19474 7827
Oct 17654 2896
Nov 18360 3585
Dec 22482 4736
# Wilcoxon Rank-sum test
wilcox.test(Frequency ~ Inspection_year, data = rat_covid,
                   exact = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Frequency by Inspection_year
## W = 144, p-value = 3.658e-05
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank-sum test gives W = 144 with a p-value of 3.658e-05, so we reject the null hypothesis. Here W = 144 equals 12 × 12, which means that every monthly count in 2019 exceeds every monthly count in 2020. We conclude that monthly inspection counts differ between 2019, before COVID-19, and 2020, after its onset, with 2020 consistently lower.
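
To make the statistic less opaque, the sketch below reproduces the test from the printed monthly counts and recomputes W and the p-value by hand. It assumes the normal approximation with continuity correction that `exact = FALSE` selects, and it relies on there being no ties in these data. The names `n_2019`, `n_2020`, `w_by_hand` and `p_by_hand` are introduced here for illustration.

```r
# Monthly counts copied from the table above (Jan to Dec)
n_2019 <- c(21833, 20477, 24150, 25887, 20150, 21006,
            16577, 19201, 19474, 17654, 18360, 22482)
n_2020 <- c(6589, 15398, 10340, 3, 2217, 5180,
            6371, 6628, 7827, 2896, 3585, 4736)

res <- wilcox.test(n_2019, n_2020, exact = FALSE)

# W counts the (2019, 2020) pairs in which the 2019 month is larger
w_by_hand <- sum(outer(n_2019, n_2020, ">"))

# Normal approximation with continuity correction (valid without ties)
m <- length(n_2019)
n <- length(n_2020)
z <- (w_by_hand - m * n / 2 - 0.5) / sqrt(m * n * (m + n + 1) / 12)
p_by_hand <- 2 * pnorm(z, lower.tail = FALSE)

# Compare the built-in results with the hand computation
c(W = unname(res$statistic), W_by_hand = w_by_hand)
c(p = res$p.value, p_by_hand = p_by_hand)
```

Because the smallest 2019 count (16577) is larger than the largest 2020 count (15398), all 144 pairs favor 2019, which is the maximum value W can take with 12 months per year.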

 

 

Summaries

Based on the statistical analyses above, our main findings are:

  • The distribution of inspections among boroughs is not homogeneous over time. In 2021, Brooklyn and Manhattan had the largest numbers of rat inspections, and the counts differed markedly across boroughs.
  • The number of inspections is associated with other covariates, such as temperature and the COVID-19 period.