# (PART) Week 2 {-}
```{r setup_201, echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
```
<!-- This file by Martin Monkman
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/ -->
# R objects & variable types {#Robjects}
## Examining the data
R and other programming tools handle data files differently than other data analysis tools you may be familiar with, such as Excel or Google Sheets. There are two important differences:
**1. How the data are accessed**
The first vital difference is that in a tool like Excel, when we open the file, any changes that are made (by us or automatically by Excel) will be preserved when we save the file. This might be changes in the values, any formulas we enter into the cells, or formatting changes we make.
When we read a data file into R, we are _not_ opening the file, and the original file is unchanged. Instead, R copies the values stored in the file into the R environment, where they are held temporarily in the computer's memory.
Let's think about the words used: we _open_ a file in Excel, but we _read_ a file into our R environment.
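
The difference between reading and opening can be checked directly. The chunk below is a minimal sketch, assuming a small made-up CSV file written to a temporary location (the file and the object names `csv_path`, `scores`, and `scores_on_disk` are invented for this illustration). Changing the object in the R environment leaves the file on disk untouched; the file would change only if we explicitly wrote the object back out, for example with `write.csv()`.

```{r}
# write a small, hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("a", "b"), score = c(10, 20)),
          csv_path, row.names = FALSE)

# read the file: the values are copied into an object in the R environment
scores <- read.csv(csv_path)

# change the copy held in memory
scores$score <- scores$score * 2

# read the file again: the values on disk have not changed
scores_on_disk <- read.csv(csv_path)
scores$score
scores_on_disk$score
```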
**2. How the data are displayed**
The second difference is that when we open a data file in a spreadsheet or similar data analysis tool, we immediately _see_ the data. In Excel, the Gapminder dataset looks like this:
![gapminder spreadsheet](static/img/excel_gapminder_2.JPG){width=100%}
Like the crew of the Enterprise, we get the data "On screen!" so we can visually investigate it.
![_"On screen!"_](static/img/gapminder_STNG_onscreen_x.png){width=100%}
### _Looking_ at data
In R, when we load a dataframe:
* we assign it to an object
* and that object shows up in our environment pane.
![_gapminder in RStudio_](static/img/gapminder_RStudio_2.JPG){width=100%}
This is a different view of the universe.
In R, there are a few options for visually scanning your data:

| function | description |
| :-- | :-- |
|*Content:* | |
|`head()` | shows the first rows; the default is 6 |
|`tail()` | shows the last rows; the default is 6 |
|`View()` or `view()`| opens the dataframe in a viewer tab in RStudio; `View()` is part of base R, while `view()` comes from the tibble package |
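
As a quick illustration of `head()` and `tail()`, the chunk below uses `mtcars`, a small dataset that ships with R, as a stand-in for the Gapminder data (an assumption made so that the example runs without any extra packages). The optional `n` argument changes the number of rows shown.

```{r}
# first six rows (the default)
head(mtcars)

# last three rows, using the optional n argument
tail(mtcars, n = 3)
```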
***
In addition to visually inspecting the data, other functions help us understand a dataframe:

| function | description |
| :-- | :-- |
|*Size:* | |
|`dim()` | returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object) |
|`nrow()` | returns the number of rows |
|`ncol()` | returns the number of columns |
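
The size functions can be tried on any dataframe. This sketch again assumes the built-in `mtcars` dataset as a stand-in; it is not the dataset used elsewhere in the course.

```{r}
# dimensions: number of rows first, then number of columns
dim(mtcars)

# the same information, one number at a time
nrow(mtcars)
ncol(mtcars)
```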
Functions to understand the contents of your dataframe:

| function | description |
| :-- | :-- |
|*Names:* | |
|`names()` | returns the column names (synonym of `colnames()` for data.frame objects) |
|*Summary:* | |
|`ls()` | returns the names of the objects in a specified environment; given a dataframe, it returns the column names in alphabetical order |
|`str()` | structure of the object and information about the class, length and content of each column |
| `ls.str()` | combines `ls()` and `str()`|
|`summary()` | summary statistics for each column |
|`glimpse()` | returns the number of rows and columns of the tibble, the name and type of each column, and a preview of as many values as will fit on the screen. Unlike the other inspecting functions listed above, `glimpse()` is not a “base R” function, so you need to have the dplyr or tibble package loaded to be able to execute it. |
Note: most of these functions are “generic.” They can be used on other types of objects besides dataframes or tibbles.
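
The chunk below sketches a few of these functions in use, once more assuming the built-in `mtcars` dataset as a stand-in. Because `glimpse()` is not part of base R, the call is wrapped in a check so that it runs only if the dplyr package is installed.

```{r}
# column names
names(mtcars)

# structure: class, dimensions, and the type and first values of each column
str(mtcars)

# summary statistics for two of the columns
summary(mtcars[, c("mpg", "cyl")])

# glimpse() is not base R; run it only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(mtcars)
}
```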
## Variable types
In R (as with other programming languages), data are stored as different variable types. In a spreadsheet program like Excel this is obscured from the user, but in R it's explicit, and in many contexts, it matters.

When a tibble is printed or passed to `glimpse()`, the type of each column is shown as an abbreviation:

* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
* `date` stands for dates.
* `dttm` stands for date-times (a date + a time).
* `fct` stands for factors, which R uses to represent categorical variables with fixed possible values.
* `lgl` stands for logical, vectors that contain only TRUE or FALSE (or `NA`).
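
To see these types outside of a tibble, base R's `class()` function reports the type of any value, using full names rather than the abbreviations. The values below are made up for illustration.

```{r}
# class() reports the type of each value
class(1L)                                  # integer: the L suffix marks a whole number
class(1.5)                                 # double, which class() reports as "numeric"
class("hello")                             # character
class(as.Date("2020-01-31"))               # date
class(as.POSIXct("2020-01-31 12:00:00", tz = "UTC"))  # date-time
class(factor(c("low", "high", "low")))     # factor
class(TRUE)                                # logical
```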
## Missing values
### Readings
J.D. Long and Paul Teetor, [R Cookbook (2nd ed.)](https://rc2e.com/inputandoutput)
* [5.24 Removing NAs from a Data Frame](https://rc2e.com/datastructures#recipe-id249)
Missing values are common in datasets, and there are many reasons why they might occur. In R they are usually represented as `NA`, an explicit marker that a value is missing (which is different from a zero or a blank cell).

* `NA` values can appear in a vector of any type, e.g. numeric or character

Missing values are _contagious_

* an `NA` in the input will usually return an `NA` in the output
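
A minimal sketch of these two points, using made-up vectors: an `NA` takes on the type of the vector that contains it, and it is distinct from a zero or an empty string.

```{r}
# NA takes on the type of the vector it belongs to
typeof(c(1.5, NA))     # a numeric (double) vector
typeof(c("a", NA))     # a character vector

# NA is not the same as zero or an empty string
is.na(c(0, NA))
is.na(c("", NA))

# comparing NA to itself also gives NA: two unknown values may or may not match
NA == NA
```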
### Functions for missing values
Dealing with those pesky `NA` values

| function or argument | action |
| :-- | :-- |
| `na.rm = TRUE` | an argument to functions such as `mean()` and `sum()`; removes `NA` values before the calculation |
| `is.na(x)` | returns TRUE or FALSE for each value in `x` |
| `anyNA(x)` | returns a single TRUE if any value in `x` is `NA`, and FALSE otherwise |
### A short example
Here are three examples of what happens when `NA` values are part of your calculation.
Add a numeric value to an `NA`:
```{r}
1 + NA
```
Adding 1 to every item in a numeric vector that includes an `NA`:
```{r}
num_list <- c(1, 2, NA, 4, 5)
1 + num_list
```
Calculating the mean of a numeric vector that includes an `NA`:
```{r}
mean(num_list)
```
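
For comparison, here is a sketch of the same calculation with the `na.rm = TRUE` argument. It uses a separate, made-up vector named `values` so that it does not interfere with `num_list`.

```{r}
values <- c(1, 2, NA, 4, 5)

# without na.rm, the NA spreads to the result
mean(values)

# with na.rm = TRUE, the NA is dropped and the mean of 1, 2, 4, 5 is returned
mean(values, na.rm = TRUE)
```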
### Functions for dealing with NA values
The functions `is.na()` and `anyNA()` return logical values: `TRUE` or `FALSE`.
What does `is.na(x)` return?
```{r}
# example
num_list <- c(1, NA, 3)
# answer
is.na(num_list)
```
There are three values in `num_list`, so `is.na()` returns three results. Only the second is `TRUE`, because only the second value is `NA`.
What about `anyNA(x)`?
```{r}
anyNA(num_list)
```
`anyNA()` always returns a single value. Here it is `TRUE`, because at least one of the three values in `num_list` is `NA`.
What if we put an exclamation mark, the "not" operator, in front of `is.na()`? Each `TRUE` becomes `FALSE` and each `FALSE` becomes `TRUE`, so the result marks the values that are _not_ missing:
```{r}
!is.na(num_list)
```
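
One common use of `!is.na()` is to keep only the values that are present. The sketch below assumes a made-up vector named `measurements`; it also shows that summing the result of `is.na()` counts the missing values.

```{r}
measurements <- c(1, NA, 3)

# keep only the values that are not NA
measurements[!is.na(measurements)]

# count the missing values: TRUE counts as 1, FALSE as 0
sum(is.na(measurements))
```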
The `NA` value in our vector of numbers will cause a function like `sum()` to return an `NA` result. Use `na.rm = TRUE` as an argument within the `sum()` function to calculate the sum of `num_list`:
```{r}
# example
sum(num_list)
# answer
sum(num_list, na.rm = TRUE)
```
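
The same ideas extend from a single vector to a whole dataframe, which is the topic of the reading listed above. As a minimal sketch, assuming a small made-up dataframe named `survey`, base R's `complete.cases()` flags the rows without any `NA`, and `na.omit()` drops every row that contains one.

```{r}
# a small, made-up dataframe with one NA in each of two columns
survey <- data.frame(
  id   = c(1, 2, 3, 4),
  age  = c(34, NA, 51, 29),
  city = c("Victoria", "Vancouver", NA, "Kelowna")
)

# which rows have no missing values?
complete.cases(survey)

# drop every row that contains at least one NA
survey_complete <- na.omit(survey)
survey_complete
```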
## Takeaways

* using `is.na()` and `anyNA()` to find unknown (`NA`) values in a variable
* using `na.rm = TRUE` to remove `NA` values from a calculation
-30-