---
title: "hw3-5.1"
author: "Mark Pearl"
date: "29/01/2020"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(outliers)
```

```{r uscrime_data with sample}
# Load the US crime data; the path below is specific to the author's machine.
uscrime_data <- read.table('C:/Users/mjpearl/Desktop/omsa/ISYE-6501-OAN/hw3/data/uscrime.txt', header = TRUE, stringsAsFactors = FALSE)
head(uscrime_data)
```

## Plots Section
The following plots explore the distribution of the `Crime` variable and give a visual check for potential outliers.

```{r boxplot}
boxplot(x = uscrime_data$Crime)
```

The boxplot shows a few observations above the upper whisker. Points drawn beyond the whisker lie more than 1.5 times the interquartile range above Q3, which is the boxplot's rule for flagging potential outliers (the two highest are close to a Crime value of 2000). No observations fall below the lower whisker, so there is no sign of outliers at the low end.
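The whisker rule can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses base R's `boxplot.stats()`, which applies the same 1.5 times IQR rule that `boxplot()` uses by default, on a small made-up vector `toy` (an assumption for illustration only).

```{r whisker_rule_sketch}
# Illustrative values only; this is not the uscrime data.
toy <- c(10, 12, 13, 14, 15, 16, 18, 40)

# boxplot.stats() returns the five-number summary in $stats and the
# points lying beyond the whiskers (1.5 * IQR from the hinges) in $out.
toy_stats <- boxplot.stats(toy, coef = 1.5)
flagged <- toy_stats$out

toy_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
flagged          # values the boxplot would draw as individual points
```

Replacing `toy` with `uscrime_data$Crime` should list the observations that appear above the upper whisker in the boxplot.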

```{r uscrime_data histogram}
hist(uscrime_data$Crime)
```

The histogram shows a distribution that is skewed to the right. Grubbs' test assumes that the data are approximately normally distributed, so its result should be read with that in mind. The middle portion of the histogram does look roughly normal, which suggests the long right tail could be due to outlying observations rather than to the shape of the distribution itself. We therefore continue with Grubbs' test for further investigation.

## Grubbs' Test Section
With `type = 10`, `grubbs.test` checks a single value for being an outlier, by default the one that differs most from the mean. To test both the maximum and the minimum of `Crime`, we also run the test with `opposite = TRUE`, which checks the value at the opposite extreme.
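Because Grubbs' test relies on approximate normality, it can help to check that assumption numerically before running the tests. The sketch below is an addition to the original analysis and uses simulated data only (the names `normal_sample` and `skewed_sample` are hypothetical): it applies base R's `shapiro.test()` to a normal sample and to a right-skewed one.

```{r normality_check_sketch}
set.seed(42)  # make the simulated samples reproducible

# Simulated data for illustration only; this is not the uscrime data.
normal_sample <- rnorm(50, mean = 900, sd = 300)
skewed_sample <- rexp(50, rate = 1 / 900)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a small p-value is evidence against normality.
normal_check <- shapiro.test(normal_sample)
skewed_check <- shapiro.test(skewed_sample)

normal_check$p.value
skewed_check$p.value
```

Running `shapiro.test(uscrime_data$Crime)` would apply the same check to the crime data. A small p-value there would be a reason to treat the Grubbs' test results with extra caution.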

```{r grubtest1}
grubbs.test(x = uscrime_data$Crime, type = 10, opposite = FALSE)
```

```{r grubtest2}
grubbs.test(x = uscrime_data$Crime, type = 10, opposite = TRUE)
```

The first output concerns the maximum observation, the one close to 2000 on the plots. Its p-value of 0.07 is above the conventional 0.05 threshold but below 0.10, so the observation can be deemed an outlier only at the 10% significance level; the evidence is suggestive rather than conclusive. This agrees with the boxplot, which also flagged the highest values.

The second output concerns the lowest value of `Crime`. With a p-value of 1, the test gives no evidence that it is an outlier.
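To make the two outputs more concrete, the statistic behind them can be reproduced by hand. The following is a rough sketch rather than the `outliers` package's implementation: the helper `grubbs_by_hand()` and the vector `toy` are hypothetical, and the critical value uses the standard one-sided formula based on the t distribution.

```{r grubbs_by_hand_sketch}
# Hypothetical helper: one-sided Grubbs statistic for the maximum or minimum,
# compared against the critical value at significance level alpha.
grubbs_by_hand <- function(x, side = c("max", "min"), alpha = 0.05) {
  side <- match.arg(side)
  n <- length(x)
  # Distance of the tested extreme from the mean, in standard deviations
  g <- if (side == "max") {
    (max(x) - mean(x)) / sd(x)
  } else {
    (mean(x) - min(x)) / sd(x)
  }
  # Upper alpha / n critical value of the t distribution with n - 2 df
  t_crit <- qt(1 - alpha / n, df = n - 2)
  g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(statistic = g, critical = g_crit, is_outlier = g > g_crit)
}

# Illustrative values only; this is not the uscrime data.
toy <- c(10, 12, 13, 14, 15, 16, 18, 40)
high <- grubbs_by_hand(toy, side = "max")
low <- grubbs_by_hand(toy, side = "min")

high$is_outlier  # is the largest toy value flagged at alpha = 0.05?
low$is_outlier   # is the smallest toy value flagged at alpha = 0.05?
```

Assuming the data are loaded as above, `grubbs_by_hand(uscrime_data$Crime, side = "max")$statistic` should match the `G` value reported by the first `grubbs.test` call, and `side = "min"` should match the second.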