# Rutgers University Sociology R Bootcamp

September 2, 2022

# This Morning’s Goals

1. Missing Data

2. Importing Data

3. Cleaning Data

• Editing Variable Values

• Creating New Variables

# How does Missing Data Come Up

• Organically in your data

• Coding error

• Respondents

• Refusal

• Don’t Know

• Inapplicability

Let’s create some on purpose so we can look at it.

missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

We can use the is.na() function to see which values in our vector are missing.

• sum() the result to count how many there are. (This works because TRUE counts as 1 and FALSE counts as 0.)

• table() gives us a summary of how many are TRUE (missing) and how many are FALSE (not missing)

is.na(missvec)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
sum(is.na(missvec))
## [1] 2
table(is.na(missvec))
##
## FALSE  TRUE
##     7     2
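Two other quick checks build on the same idea. The sketch below is an illustration rather than part of the original slides: which() gives the positions of the missing values, and mean() of a logical vector gives the share that is missing. The names missing_positions and share_missing are made up for this example.

```r
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

# which() returns the positions where is.na() is TRUE
missing_positions <- which(is.na(missvec))
missing_positions

# The mean of a logical vector is the proportion of TRUE values
share_missing <- mean(is.na(missvec))
share_missing
```

Here the missing values sit in positions 4 and 6, which is 2 of the 9 values.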

When we have missing data, element-wise operations on a vector are calculated for every value except the missing ones, which stay NA:

missvec * 3
## [1] 135 267  21  NA 294  NA   9 135  21
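Functions that summarize a whole vector behave differently from element-wise math: a single NA makes the whole result NA unless we ask R to skip the missing values. A minimal sketch using base R's na.rm argument (the object names total_all, total_seen, and mean_seen are just for illustration):

```r
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

# With missing values present, the summary itself comes back missing
total_all <- sum(missvec)
total_all

# na.rm = TRUE drops the missing values before calculating
total_seen <- sum(missvec, na.rm = TRUE)
mean_seen  <- mean(missvec, na.rm = TRUE)
total_seen
mean_seen
```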

When we use the table() function on data, we can’t see the missing values unless we include the useNA = "ifany" argument, like this:

table(missvec)
## missvec
##  3  7 45 89 98
##  1  2  2  1  1
table(missvec, useNA = "ifany")
## missvec
##    3    7   45   89   98 <NA>
##    1    2    2    1    1    2

Now we can see the two missing values.

We could also set it to useNA = "always", which includes a spot for missings in the table, even if there aren’t any.

nomiss <- c("karen", "quan", "tom", "kristen", "paul", "karen", "paul", "karen")
table(nomiss)
## nomiss
##   karen kristen    paul    quan     tom
##       3       1       2       1       1
table(nomiss, useNA = "ifany")
## nomiss
##   karen kristen    paul    quan     tom
##       3       1       2       1       1
table(nomiss, useNA = "always")
## nomiss
##   karen kristen    paul    quan     tom    <NA>
##       3       1       2       1       1       0

# Types of Missing Data

Remember how we have different types of data:

• Numeric

• Double: number with something after the decimal point

• Integer: number without something after the decimal (e.g. 6L, 199L)

• Created by typing “L” after the number

• Requires less storage than double

• Character

• Logical

There are also different types of missing values. Right now, we’re only going to use four:

• NA: the default missing value; this is how missing data shows up in a table

• NA_character_ is what you must use to assign character data as missing

• NA_integer_ is what you must use to assign integer data as missing

• NA_real_ is what you must use to assign double data as missing

Note that the last three types of missings have an underscore (_) at the end of them.

The others might come up this next year, but we see them so rarely that we won’t use them now.

When in doubt, run typeof() to see how it is stored.

a <- 1       # A number without L becomes double
b <- 3L      # Integers are created by putting a capital L after the number
c <- "word"
d <- TRUE
typeof(a)
## [1] "double"
typeof(b)
## [1] "integer"
typeof(c)
## [1] "character"
typeof(d)
## [1] "logical"
typeof(missvec)
## [1] "double"
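The same typeof() check works on the missing values themselves. This short sketch (not from the original slides; na_types is a made-up name) shows how each kind of NA is stored, and that a plain NA placed inside a vector takes on the type of its neighbors:

```r
# How each flavor of missing value is stored
na_types <- c(
  plain     = typeof(NA),
  character = typeof(NA_character_),
  integer   = typeof(NA_integer_),
  real      = typeof(NA_real_)
)
na_types

# Inside a vector, a plain NA is converted to match the other values
typeof(c(1L, NA))
typeof(c("a", NA))
```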

You might remember from yesterday that we used the class() function. The two are very similar, but typeof() tells us whether a number is stored as integer or double, which makes it more helpful for deciding whether we need NA_real_ or NA_integer_.

a <- 1
b <- 3L      # Integers are created by putting a capital L after the number
typeof(a)
## [1] "double"
class(a)
## [1] "numeric"
typeof(b)
## [1] "integer"
class(b)
## [1] "integer"

# Practicing with Missing Data

Let’s create some missing data:

misstext <- c("Blue","Green", "Orange", NA_character_, "Red", "Blue", "Orange", NA, "NA")
misstext
## [1] "Blue"   "Green"  "Orange" NA       "Red"    "Blue"   "Orange" NA
## [9] "NA"
table(misstext, useNA = "ifany")
## misstext
##   Blue  Green     NA Orange    Red   <NA>
##      2      1      1      2      1      2

Note here that it didn’t matter whether we used NA or NA_character_ when we created the vector, as both register as <NA> in the table.

One value labeled "NA" was kept as text, though. Because we used quotation marks around it when we created the vector, R thought we wanted it as text and left it undisturbed.
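If that text "NA" was really meant to be missing, we can convert it ourselves. The following is a sketch of one way to do it in base R, not the only way; cleaned, n_missing_before, and n_missing_after are names invented for this example.

```r
misstext <- c("Blue", "Green", "Orange", NA_character_, "Red",
              "Blue", "Orange", NA, "NA")

# is.na() does not count the text "NA" as missing
n_missing_before <- sum(is.na(misstext))
n_missing_before

# Replace the text "NA" with a real missing value.
# %in% returns FALSE (not NA) for values that are already missing,
# so it is safe to use for indexing here.
cleaned <- misstext
cleaned[cleaned %in% "NA"] <- NA_character_

n_missing_after <- sum(is.na(cleaned))
n_missing_after
```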

Now, let’s combine our two vectors into a tibble.

library(tidyverse)         # Remember to call this in or the below code won't work 
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
misstibble <- tibble(missvec,
                     misstext) %>%
  rename(numbers = missvec,
         colors = misstext)
misstibble
## # A tibble: 9 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      NA <NA>
## 5      98 Red
## 6      NA Blue
## 7       3 Orange
## 8      45 <NA>
## 9       7 NA

We can see that some rows aren’t missing anything, while others are missing one or even two values.

# Removing Data with Missing Values

If we want to keep only the rows that have no missing data, we can use the drop_na() function.

• You might also see na.omit() used. It is similar to drop_na(), but drop_na() also lets us choose which columns to check for missing values.

• drop_na() can also be piped.

drop_na(misstibble)
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      98 Red
## 5       3 Orange
## 6       7 NA
misstibble %>% drop_na()
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      98 Red
## 5       3 Orange
## 6       7 NA

# Removing Missing Data From Specific Columns

drop_na() also has the ability to remove observations that are missing on specific columns.

• To do this, put the column names inside drop_na(). It works with or without the pipe; we’ll pipe it here.

misstibble %>% drop_na()
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      98 Red
## 5       3 Orange
## 6       7 NA
misstibble %>% drop_na(numbers)  # Drop observations that are missing on numbers
## # A tibble: 7 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      98 Red
## 5       3 Orange
## 6      45 <NA>
## 7       7 NA
misstibble %>% drop_na(colors)   # Drop observations that are missing on colors
## # A tibble: 7 x 2
##   numbers colors
##     <dbl> <chr>
## 1      45 Blue
## 2      89 Green
## 3       7 Orange
## 4      98 Red
## 5      NA Blue
## 6       3 Orange
## 7       7 NA

This is useful where there is a lot of missingness throughout the dataset and we don’t want to remove observations that are missing on variables we don’t care about.
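Before dropping anything, it can help to count how much is missing in each column. The sketch below rebuilds the small tibble from scratch so it runs on its own; it assumes the tibble and tidyr packages are installed, and the names missing_per_column, n_complete, and kept are made up for illustration.

```r
library(tibble)
library(tidyr)

misstibble <- tibble(
  numbers = c(45, 89, 7, NA, 98, NA, 3, 45, 7),
  colors  = c("Blue", "Green", "Orange", NA, "Red", "Blue", "Orange", NA, "NA")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(misstibble))
missing_per_column

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(misstibble))
n_complete

# drop_na() also accepts the data as its first argument, no pipe needed
kept <- drop_na(misstibble, numbers)
nrow(kept)
```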

# Official Question Time 1

Since we started today, we’ve done:

1. Missing data

1. How to create it

2. Types of it

• NA

• NA_character_

• NA_real_

• NA_integer_

3. How to detect it

• is.na()

4. How to remove it

• drop_na()

# Clearing the Environment

We’re about to change gears a little and won’t need these objects we just saved. Let’s remove them from the global environment so they won’t clutter up our workspace.

ls()
## [1] "a"          "b"          "c"          "d"          "misstext"
## [6] "misstibble" "missvec"    "nomiss"
rm(list=ls())
ls()
## character(0)

# Let’s Clean Up the GSS

This morning, I sent you a file containing data from the General Social Survey (GSS).

Go ahead and save that into your working directory, then click on it to import it into your Global Environment.

• Remember, you can also run getwd() to see where R is working from and where to save the file.

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
load("~/R/ay23_lab_ta/bootcamp/rawgss.RData")
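If you are not sure what an .RData file contains, note that load() returns the names of the objects it restored. The sketch below does not use the GSS file; it saves a tiny made-up data frame (demo_data) to a temporary folder and loads it back, so every name and path in it is hypothetical.

```r
# Save a small, made-up data frame to a temporary .RData file
demo_path <- file.path(tempdir(), "demo.RData")
demo_data <- data.frame(id = 1:3)
save(demo_data, file = demo_path)

# Remove it from the environment, then load it back from the file
rm(demo_data)
loaded_names <- load(demo_path)

# load() hands back the names of whatever it restored
loaded_names
exists("demo_data")
```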

Using our imported GSS data, and what we’ve learned so far on data management, let’s create some variables.

# Creating a New Dataset

Looking at the size of the dataset (dim(GSS) = 2348, 89), we can see that there are way more variables than we want right now. Let’s create a new dataset “gss_bootcamp”, and select only a few of the variables we want.

1. Demographic factors

1. ID_ (Respondent’s ID)

2. AGE (age in years)

3. EDUC (years of education)

2. Political Variables

1. PARTYID (Party Identification)

2. POLVIEWS (Political Ideology)

3. Economic Variables

1. PRESTG10 (occupational prestige score)

2. INCOME (annual income in categories)

3. TVHOURS (hours of TV watched last week)

4. Quality of Life

1. HAPPY (General Happiness)

2. HEALTH (General Health)

5. We also want a set of variables asking what they think about government spending on a series of factors. These all start with “NAT” and end with a descriptor of the topic.

Go ahead and try this yourself, then go to the next slide to see how I did it.

# Smaller Dataset

library(tidyverse)
gss_bootcamp <- GSS %>%
  as_tibble() %>%
  select(ID_, AGE, EDUC,              # Demographics
         PARTYID, POLVIEWS,           # Political Variables
         PRESTG10, INCOME, TVHOURS,   # Economic Variables
         HAPPY, HEALTH,               # Quality of Life
         starts_with("NAT")           # National Spending
  )

Now, the dataset is much smaller with only 28 variables.
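To see what starts_with() is doing without the full GSS, here is a small sketch on made-up data. It assumes the dplyr and tibble packages are installed; the toy tibble and its values are invented, and only the column-name pattern mirrors the GSS.

```r
library(dplyr)
library(tibble)

# A tiny, made-up dataset with two "NAT" columns and one we don't want
toy <- tibble(
  ID_     = 1:3,
  AGE     = c(30L, 41L, 52L),
  NATFARE = c(1L, 2L, 3L),
  NATROAD = c(2L, 2L, 1L),
  OTHER   = c(9L, 9L, 9L)
)

# Named columns first, then every column whose name starts with "NAT"
toy_small <- toy %>%
  select(ID_, AGE, starts_with("NAT"))

names(toy_small)
```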

# The Beauty of Source Code

Actually, let’s add sex and race to the dataset too.

If we hadn’t written this as source code, we’d have to retype the entire previous section of code.

But instead, we can simply go back and add “SEX” and “RACE” to our code and rerun the section.

gss_bootcamp <- GSS %>%
  as_tibble() %>%
  select(ID_, AGE, EDUC, SEX, RACE,   # Demographics
         PARTYID, POLVIEWS,           # Political Variables
         PRESTG10, INCOME, TVHOURS,   # Economic Variables
         HAPPY, HEALTH,               # Quality of Life
         starts_with("NAT")           # National Spending
  )

This is why it’s also important not to overwrite the original file (“GSS”).

• Because GSS is kept unchanged the entire time, we can simply rerun the code without having to load up the dataset from our files again.

And now, let’s take a look at our dataset with the View(gss_bootcamp) function to see it in our source pane, or print(gss_bootcamp) to see it in the console. Go ahead and pick one (or both).

print(gss_bootcamp)
## # A tibble: 2,348 x 30
##      ID_   AGE  EDUC   SEX  RACE PARTYID POLVIEWS PRESTG10 INCOME TVHOURS HAPPY
##    <int> <int> <int> <int> <int>   <int>    <int>    <int>  <int>   <int> <int>
##  1     1    43    14     1     1       5        6       47     13       3     2
##  2     2    74    10     2     1       2        8       22     12      -1     1
##  3     3    42    16     1     1       4        5       61     12       1     1
##  4     4    63    16     2     1       2        4       59     13       1     1
##  5     5    71    18     1     2       6        7       53     13      -1     2
##  6     6    67    16     2     1       2        3       53     98      10     3
##  7     7    59    13     2     2       0        4       48     10      -1     2
##  8     8    43    12     1     1       5        5       35     12      -1     2
##  9     9    62     8     2     1       3        4       35      5       4     3
## 10    10    55    12     1     1       1        8       39     12       2     2
## # ... with 2,338 more rows, and 19 more variables: HEALTH <int>, NATFARE <int>,
## #   NATROAD <int>, NATSOC <int>, NATMASS <int>, NATPARK <int>, NATCHLD <int>,
## #   NATSCI <int>, NATENRGY <int>, NATAID <int>, NATARMS <int>, NATSPAC <int>,
## #   NATENVIR <int>, NATHEAL <int>, NATCITY <int>, NATCRIME <int>,
## #   NATDRUG <int>, NATEDUC <int>, NATRACE <int>

We see that all we have are numbers. Next up: how to make these numbers cleaner for us to work with.

# Official Question Time 2

Since the last OQT, we’ve done:

1. Creating a new dataset from our old one

2. Selecting variables of interest

# Mutating Variables

Let’s start by adding a variable onto our dataset.

The tidyverse package dplyr gives us a great tool for mutating our variables. It’s called, aptly enough, mutate().

It takes the form mutate(variable=value).

Let’s say we want to add a column to our gss_bootcamp where every value is 123456. We can do that with:

gss_bootcamp <- gss_bootcamp %>%
  mutate(numbers = 123456)
gss_bootcamp %>% select(numbers)
## # A tibble: 2,348 x 1
##    numbers
##      <dbl>
##  1  123456
##  2  123456
##  3  123456
##  4  123456
##  5  123456
##  6  123456
##  7  123456
##  8  123456
##  9  123456
## 10  123456
## # ... with 2,338 more rows

We told it that the column “numbers” should contain 123456. Because the column didn’t exist, it added it onto our existing dataset.

We can also use mutate to reference columns within the same dataset, or even the column itself. For example:

gss_bootcamp %>%
  mutate(newnumbers = numbers,
         numbers = numbers - 100000,
         bignumbers = numbers * 2) %>%
  select(numbers, newnumbers, bignumbers)
## # A tibble: 2,348 x 3
##    numbers newnumbers bignumbers
##      <dbl>      <dbl>      <dbl>
##  1   23456     123456      46912
##  2   23456     123456      46912
##  3   23456     123456      46912
##  4   23456     123456      46912
##  5   23456     123456      46912
##  6   23456     123456      46912
##  7   23456     123456      46912
##  8   23456     123456      46912
##  9   23456     123456      46912
## 10   23456     123456      46912
## # ... with 2,338 more rows

We gave it our dataset, piped it down, and told R to mutate three columns:

1. newnumbers, a new column, should take the value of numbers

2. numbers, an existing column, should take the value numbers - 100000

• In practice it’s better to add new columns than to overwrite existing ones, but we didn’t save this result, so we’re not at risk of overwriting.

3. bignumbers, a new column, should take the value numbers * 2

• Also, note that it took the new value of numbers, not the original value, since it came after we changed the value.
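The order effect is easier to see with small numbers. This is a sketch on a made-up three-row tibble (assuming dplyr and tibble are installed); the column names copy and doubled are invented for the example.

```r
library(dplyr)
library(tibble)

toy <- tibble(numbers = c(10, 20, 30))

result <- toy %>%
  mutate(copy    = numbers,        # uses the original numbers
         numbers = numbers - 5,    # overwrites numbers
         doubled = numbers * 2)    # uses the overwritten numbers

result
```

Because mutate() works through its arguments from top to bottom, copy keeps 10, 20, 30 while doubled is built from 5, 15, 25.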

# Recoding with ifelse()

One of the most basic computing functions is ifelse(). It takes the form ifelse(test, yes, no). In other words, “if the data passes the test with TRUE, follow the yes condition; else, follow the no condition.”

To illustrate, let’s make a small vector:

smallvec <- c(1,2,3,4,5,6,7)

ifelse(smallvec<4, 1, 0)
## [1] 1 1 1 0 0 0 0

Here’s what it did:

1. Take each value in smallvec

2. Test if the value is less than 4

3. If TRUE, return 1

4. If FALSE, return 0

We can also make it return whatever we want. For example:

ifelse(smallvec<4, "Small","Big" )
## [1] "Small" "Small" "Small" "Big"   "Big"   "Big"   "Big"
ifelse(smallvec<4, 234/5, 847^2)
## [1]     46.8     46.8     46.8 717409.0 717409.0 717409.0 717409.0
ifelse(smallvec<4, smallvec, smallvec^3)
## [1]   1   2   3  64 125 216 343

So now let’s recode our variable. In the GSS, SEX is coded 1 = male, 2 = female, but we only have the numbers in our dataset. We can use our ifelse() function to give these values text names. Let’s create a new variable called sex_text that has text values for the variable sex.

gss_bootcamp <- gss_bootcamp %>%
  mutate(sex_text = ifelse(SEX == 1, "Male", "Female"))

After recoding like this, I like to go back and make a table() to make sure my code did what I wanted.

table(gss_bootcamp$sex_text, gss_bootcamp$SEX, useNA = "ifany")
##
##             1    2
##   Female    0 1296
##   Male   1052    0
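Checking with a table matters because ifelse() sends everything that fails the test to the "no" value. The sketch below uses a made-up vector; the code 9 is hypothetical and does not appear in the GSS SEX variable shown above.

```r
# 1 and 2 are the real codes; 9 is a made-up stray code; NA is missing
codes <- c(1L, 2L, 9L, NA)

labels <- ifelse(codes == 1, "Male", "Female")
labels
```

The stray 9 is labeled "Female" because it is simply "not 1", while the missing value stays missing.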

# Recoding with case_when()

We can also do this more explicitly using case_when(), a function from the tidyverse. case_when() lets us do variable management with logical tests in a way that is easy to follow and understand.

It takes the form

dataset %>%                         # Dataset
  mutate(variable = case_when(      # Mutate a variable depending on a case
    condition ~ value,
    condition ~ value
  ))

What does this mean?

1. We start with our dataset, as always.

2. Then, we tell R to mutate and give it our variable.

• Both these steps are the same as before

3. Then we say, “actually, instead of applying the same value (or value calculation) for everybody, R should assign the value depending on a condition”

• The tilde used to assign is in the top left corner of your keyboard, next to the 1

• Commas go after all condition lines except the last

• Common Error #2: Forgetting to separate with a comma

4. Lastly, we close our parentheses

So we can code sex again in a different way, this time using case_when().

gss_bootcamp <- gss_bootcamp %>%
  mutate(sex_text_cw = case_when(
    SEX == 1 ~ "Male",          # If SEX == 1, then assign the value "Male"
    SEX == 2 ~ "Female"         # If SEX == 2, then assign the value "Female"
  ))

See what we did there?

1. Start with our data and pipe it down

2. mutate our data so that sex_text_cw takes a value dependent on the following conditions

• If SEX == 1, then assign the value “Male”

• If SEX == 2, then assign the value “Female”

Lastly, let’s create two tables showing that the coding worked exactly as we wanted and there are no missing values.

table(gss_bootcamp$sex_text_cw, gss_bootcamp$SEX, useNA = "ifany") # Checking our work with the original variable
##
##             1    2
##   Female    0 1296
##   Male   1052    0
table(gss_bootcamp$sex_text_cw, gss_bootcamp$sex_text, useNA = "ifany") # Checking our work with the ifelse variable from earlier 
##
##          Female Male
##   Female   1296    0
##   Male        0 1052
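One practical difference from ifelse() is what happens to values that match none of the conditions: case_when() leaves them missing instead of lumping them into a catch-all. A small sketch on a made-up vector (assuming dplyr is installed; the code 9 is hypothetical):

```r
library(dplyr)

# 1 and 2 are the real codes; 9 is a made-up stray code; NA is missing
codes <- c(1L, 2L, 9L, NA)

labels_cw <- case_when(
  codes == 1 ~ "Male",
  codes == 2 ~ "Female"
)
labels_cw
```

Both the stray 9 and the original NA come out as missing, which makes unexpected codes easy to spot in a table with useNA = "ifany".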

# Using case_when() with Three or More Groups

As we just saw, case_when() is little more than a chain of ifelse() calls.

We can do the same thing with ifelse() instead of case_when(), but it gets tricky with more than two levels. For example, let’s recode RACE to give it text values.

gss_bootcamp <- gss_bootcamp %>%
  mutate(race3_ifelse = ifelse(RACE == 1, "White",     # If race is 1, name it White
                        ifelse(RACE == 2, "Black",     # If it isn't 1, now test: If race is 2, name it Black
                               "Other")))              # If it isn't 2, now name it Other

table(gss_bootcamp$race3_ifelse, gss_bootcamp$RACE, useNA = "ifany")
##
##            1    2    3
##   Black    0  385    0
##   Other    0    0  270
##   White 1693    0    0

It totally works, but is kinda hard to follow. Here’s the same thing using case_when().

gss_bootcamp <- gss_bootcamp %>%
  mutate(race3 = case_when(
    RACE == 1 ~ "White",
    RACE == 2 ~ "Black",
    RACE == 3 ~ "Other"))

table(gss_bootcamp$race3, gss_bootcamp$RACE, useNA = "ifany")
##
##            1    2    3
##   Black    0  385    0
##   Other    0    0  270
##   White 1693    0    0

With this, it is easier to follow what the conditions and output values are.

# Recoding Categorical Variables

Congrats! You now know the basics of data management!

Now, we’re going to do a lot of practice on this. We’re going to do a few variables together, then have a big chunk of time to work on this separately.

Earlier, I knew the values of SEX and RACE off the top of my head because I’ve worked with them a lot. But the GSS has hundreds of variables, and our small one still has 30. We don’t have to remember the values; instead, we can turn to the codebook for more detail. A codebook lists every variable and says what each value in the dataset actually means.

The first variable we’re going to work with is HAPPY. Let’s find the variable in GSS’s online codebook to get started.

1. Click “SEARCH VARIABLES” (no account required)

2. Type “happy” into the search bar and click it when it pops up

We see there that the data is coded as follows:

| Code | Label |
|------|-------|
| 1 | Very happy |
| 2 | Pretty happy |
| 3 | Not too happy |
| 8 | Don’t know |
| 0 | Not applicable |

Let’s create a ‘text’ version of this variable that uses the labels instead of the codes. We can use case_when() to do this. In the spirit of making our names short but descriptive, and not overwriting anything, let’s call it happy_text.

gss_bootcamp <- gss_bootcamp %>%
  mutate(happy_text = case_when(
    HAPPY == 1 ~ "Very Happy",
    HAPPY == 2 ~ "Pretty Happy",
    HAPPY == 3 ~ "Not too Happy",
    HAPPY > 3  ~ NA_character_           # Since there aren't any 0's, we don't need to add a line for 0's
  ))

After coding, let’s create a table of the old and new variables to compare. Remember to tell R to show the missings, too, so we can make sure they were properly coded.

table(gss_bootcamp$HAPPY, gss_bootcamp$happy_text, useNA = "always")
##
##        Not too Happy Pretty Happy Very Happy <NA>
##   1                0            0        701    0
##   2                0         1307          0    0
##   3              336            0          0    0
##   8                0            0          0    4
##   <NA>             0            0          0    0

Great, it all works as we expected!

# Recoding Party ID

Let’s repeat the previous procedure for PARTYID. We see the following coding:

| Code | Label |
|------|-------|
| 0 | Strong democrat |
| 1 | Not str democrat |
| 2 | Ind, near dem |
| 3 | Independent |
| 4 | Ind, near rep |
| 5 | Not str republican |
| 6 | Strong republican |
| 7 | Other party |
| 8 | Don’t know |

Let’s condense this down to three categories, plus a fourth group that we’ll set as missing:

1. Democrats (0-2)

2. Republicans (4-6)

3. Independents / Other Party (3 & 7)

4. Don’t Know / No Answer (8 & 9) - set as missing

Because the codes don’t line up easily with our categories, we can use the %in% operator to help us out.

Let’s first create a set of vectors that contain the values for each category.

democratcodes <- c(0,1,2)
republicancodes <- c(4,5,6)
independentcodes <- c(3,7)
dknacodes <- c(8,9)
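As a quick illustration of why %in% is convenient here, the sketch below compares it with chaining == tests. The partycodes vector is made up for the example and is not the real PARTYID column.

```r
# A few made-up party codes, including one missing value
partycodes <- c(0, 3, 7, 9, NA)
independentcodes <- c(3, 7)

# %in% asks, value by value: "is this one of the listed codes?"
in_result <- partycodes %in% independentcodes
in_result

# The same test written out with == and | (or)
or_result <- partycodes == 3 | partycodes == 7
or_result
```

The two agree on every non-missing value, but %in% returns FALSE for the NA while the == version returns NA, so %in% never leaves a missing test result behind.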

Then, using the %in% operator, we can recode PARTYID a little more easily.

gss_bootcamp <- gss_bootcamp %>%
  mutate(party3 = case_when(                      # NOTE that I created a new variable instead of overwriting the old one
    PARTYID %in% democratcodes ~ "Democrat",
    PARTYID %in% republicancodes ~ "Republican",
    PARTYID %in% independentcodes ~ "Independent",
    PARTYID %in% dknacodes ~ NA_character_
  ))

table(gss_bootcamp$PARTYID, gss_bootcamp$party3, useNA = "always")   # We can use the table feature to make sure everybody is where they're supposed to be
##
##        Democrat Independent Republican <NA>
##   0         379           0          0    0
##   1         352           0          0    0
##   2         307           0          0    0
##   3           0         414          0    0
##   4           0           0        259    0
##   5           0           0        272    0
##   6           0           0        255    0
##   7           0          77          0    0
##   9           0           0          0   33
##   <NA>        0           0          0    0

# Recoding Numeric Variables

Some of our variables, like AGE and TVHOURS, are numeric. Let’s take a quick look at them:

table(gss_bootcamp$AGE)
##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## 22 26 15 27 40 29 38 43 31 39 45 43 50 34 43 42 65 40 40 43 38 55 41 40 40 39
## 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## 40 29 42 30 33 31 37 41 29 48 35 48 50 39 37 46 46 39 24 37 33 39 36 27 39 41
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 99
## 45 33 22 20 29 23 19 28 11  9 12  9 12  7 10 12 14  5  8 29  7
typeof(gss_bootcamp$AGE)
## [1] "integer"
table(gss_bootcamp$TVHOURS)
##
##  -1   0   1   2   3   4   5   6   7   8   9  10  12  14  15  16  17  18  20  24
## 789 145 349 376 240 169 100  63  13  38   6  23  14   2   4   3   1   1   4   4
##  98  99
##   3   1
typeof(gss_bootcamp$TVHOURS)
## [1] "integer"

Yep, they sure are numeric. (Or, at least integer.) But if we look at the end of the table and at the codebook, we can see there are some odd things happening.

For AGE:

| Code | Label |
|------|-------|
| 89 | 89 or older |
| 98 | Don’t Know |

And for TVHOURS:

| Code | Label |
|------|-------|
| -1 | Not Applicable |
| 98 | Don’t Know |

Let’s code all of these as missing and keep the values for everything else. Also, because typeof(gss_bootcamp$AGE) and typeof(gss_bootcamp$TVHOURS) are both integer, and the version of case_when() we’re using requires every output value to be the same type, we use NA_integer_ here instead of plain NA.

gss_bootcamp <- gss_bootcamp %>%
  mutate(
    newage = case_when(            # You can't go wrong with just calling a new variable "newvariable"
      AGE >= 89 ~ NA_integer_,
      AGE <= 88 ~ AGE
    ),
    newtvhours = case_when(
      TVHOURS >= 98 ~ NA_integer_,
      TVHOURS == -1 ~ NA_integer_,
      TRUE ~ TVHOURS
    )
  )
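case_when() checks its conditions from top to bottom and uses the first one that matches, which is why a catch-all TRUE line belongs at the end. The sketch below shows this on a made-up vector of TV hours (assuming dplyr is installed); the names good and bad are invented for the example.

```r
library(dplyr)

# Made-up values: -1, 98 and 99 stand in for the special codes
tvhours <- c(-1L, 0L, 4L, 98L, 99L)

# Catch-all last: special codes become missing, everything else is kept
good <- case_when(
  tvhours >= 98 ~ NA_integer_,
  tvhours == -1 ~ NA_integer_,
  TRUE ~ tvhours
)
good

# Catch-all first: every value matches TRUE, so nothing is ever recoded
bad <- case_when(
  TRUE ~ tvhours,
  tvhours >= 98 ~ NA_integer_
)
bad
```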

table(gss_bootcamp$newage, useNA = "ifany")
##
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33
##   22   26   15   27   40   29   38   43   31   39   45   43   50   34   43   42
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49
##   65   40   40   43   38   55   41   40   40   39   40   29   42   30   33   31
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65
##   37   41   29   48   35   48   50   39   37   46   46   39   24   37   33   39
##   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81
##   36   27   39   41   45   33   22   20   29   23   19   28   11    9   12    9
##   82   83   84   85   86   87   88 <NA>
##   12    7   10   12   14    5    8   36
table(gss_bootcamp$newtvhours, gss_bootcamp$TVHOURS, useNA = "ifany")
##
##         -1   0   1   2   3   4   5   6   7   8   9  10  12  14  15  16  17  18
##   0      0 145   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   1      0   0 349   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   2      0   0   0 376   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3      0   0   0   0 240   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4      0   0   0   0   0 169   0   0   0   0   0   0   0   0   0   0   0   0
##   5      0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0
##   6      0   0   0   0   0   0   0  63   0   0   0   0   0   0   0   0   0   0
##   7      0   0   0   0   0   0   0   0  13   0   0   0   0   0   0   0   0   0
##   8      0   0   0   0   0   0   0   0   0  38   0   0   0   0   0   0   0   0
##   9      0   0   0   0   0   0   0   0   0   0   6   0   0   0   0   0   0   0
##   10     0   0   0   0   0   0   0   0   0   0   0  23   0   0   0   0   0   0
##   12     0   0   0   0   0   0   0   0   0   0   0   0  14   0   0   0   0   0
##   14     0   0   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0
##   15     0   0   0   0   0   0   0   0   0   0   0   0   0   0   4   0   0   0
##   16     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0   0
##   17     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
##   18     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
##   20     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   24     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   <NA> 789   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##
##         20  24  98  99
##   0      0   0   0   0
##   1      0   0   0   0
##   2      0   0   0   0
##   3      0   0   0   0
##   4      0   0   0   0
##   5      0   0   0   0
##   6      0   0   0   0
##   7      0   0   0   0
##   8      0   0   0   0
##   9      0   0   0   0
##   10     0   0   0   0
##   12     0   0   0   0
##   14     0   0   0   0
##   15     0   0   0   0
##   16     0   0   0   0
##   17     0   0   0   0
##   18     0   0   0   0
##   20     4   0   0   0
##   24     0   4   0   0
##   <NA>   0   0   3   1

This diagonal line means it’s working. 0 goes to 0, 9 goes to 9, etc. But the ones we coded to NA (-1, 98, 99) are now <NA>.

In the last line for newtvhours, I add a final statement TRUE ~ TVHOURS.
This means “everything left over” that hasn’t already been coded. It is very handy when we’re working with multiple conditions as it takes everything that hasn’t been previously included. Just be careful, though, because you might not want everything that’s left.

# Official Question Time 3

Since the last OQT, we’ve done:

1. Recoding with ifelse()

2. Recoding with case_when()

3. Practicing Data Management with Categorical Data

• SEX -> sex_text

• RACE -> race3

• HAPPY -> happy_text

• PARTYID -> party3

4. Practicing Data Management With Numeric Data

• AGE -> newage

• TVHOURS -> newtvhours

# Coding Practice

Let’s take some time to code our data. Using the online codebook, code the following variables:

1. INCOME

• Recode 0, 98, and 99 to missing

• Everything else keep the same

2. EDUC

• Recode 98 and 99 to missing

• Everything else keep the same

3. POLVIEWS

• Code as liberal, moderate, conservative

• Don’t know, No Answer, and Not Applicable code to missing

4. Pick three of the national spending variables (starts with “nat”)

• Don’t know, No Answer, and Not Applicable code to missing

The next slide has snippets of code for if/when you get stuck.

# Code Snippets for Selected Variables

gss_bootcamp <- gss_bootcamp %>%
  mutate(
    #### Sample Recode for income Variable
    newincome = case_when(
      INCOME == 0 ~ NA_integer_,
      INCOME >= 98 ~ NA_integer_,
      TRUE ~ INCOME
    ),
    #### Sample Recode for education Variable
    education = case_when(
      EDUC %in% c(-1, 98, 99) ~ NA_integer_,
      TRUE ~ EDUC
    ),
    #### Sample Recode for polviews Variable
    newpolviews = case_when(
      POLVIEWS %in% c(1, 2, 3) ~ "Liberal",
      POLVIEWS %in% c(5, 6, 7) ~ "Conservative",
      POLVIEWS %in% 4 ~ "Moderate",
      POLVIEWS %in% c(0, 8, 9) ~ NA_character_
    ),
    #### Sample Recode for spending Variable
    newnatenvir = case_when(
      NATENVIR %in% c(8, 9, 0) ~ NA_integer_,
      TRUE ~ NATENVIR
    ))

# Official Question Time 5

Since the last OQT, we’ve done:

1. Practicing coding variables

# Wrapping Up

# One Last Practice

For one last practice, complete the following steps:

1. GSS Recoding

1. Load the GSS data from the file

2. From “GSS,” create a dataset “newgss” that includes the following variables

1. Race

2. Sex

3. Age

4. Self-ranked social position (“RANK”)

5. Occupational Prestige

6. Spouse’s Occupational Prestige (“SPPRES10”)

7. Frequency of Prayer (“PRAY”)

3. Recode data and missing data appropriately for the analyses in Parts 3 and 4

4. Remove anybody with missing values

2. Show how many rows and columns are in newgss

3. From newgss, create a new dataset of only those age 50 and up and show how many are in it

4. From newgss, create a new dataset of high-status people who pray often (however you choose to define it), then show how many are in it

# Answers

load("~/R/ay23_lab_ta/gss_files/rawgss.RData")
# Part 1.1: Load GSS and name it newgss
newgss <- GSS %>%
  # Part 1.2: Variable selection
  select(RACE, SEX, AGE, RANK, PRESTG10, SPPRES10, PRAY) %>%
  # Part 1.3: Recoding Data (Including Missing Data)
  mutate(
    newage = case_when(
      AGE >= 89 ~ NA_integer_,
      AGE <= 88 ~ AGE
    ),
    socpos = case_when(
      RANK %in% c(0, 98, 99) ~ NA_integer_,
      TRUE ~ RANK
    ),
    prayer_freq = case_when(
      PRAY == 1 ~ "Several Times/Day",
      PRAY == 2 ~ "1ce/Day",
      PRAY == 3 ~ "Several Times/Week",
      PRAY == 4 ~ "1ce/Week",
      PRAY == 5 ~ "< 1ce/Week",
      PRAY == 6 ~ "Never",
      PRAY %in% c(0, 8, 9) ~ NA_character_
    )
  ) %>%
  rename(                          # Not necessary but makes it easier to read
    race = RACE,
    sex = SEX,
    prestige = PRESTG10,
    spouse_prestige = SPPRES10
  ) %>%
  # Part 1.4: Removing data with missing values
  drop_na()
# Part 2: Count of newgss
dim(newgss)
## [1] 2216 10
# Part 3: Dataset of age >= 50
gss_old <- newgss %>%
  filter(newage >= 50)
nrow(gss_old)         # One way to show how many are there
## [1] 1044
# Part 4: High status + pray often
richprayers <- newgss %>%
  # Social position is greater than 7/10 & prays at least once/day
  # Careful: prayer_freq is text, so `< 3` compares strings rather than codes;
  # filtering on PRAY < 3 would match the rule stated above
  filter(socpos > 7 & prayer_freq <3)
str(richprayers)      # Another way to show how many are there
## 'data.frame': 61 obs. of 10 variables:
##  $ race           : int  1 1 1 3 1 2 1 2 2 2 ...
##  $ sex            : int  2 1 2 1 2 1 2 1 2 1 ...
##  $ AGE            : int  61 52 42 53 84 83 83 86 34 59 ...
##  $ RANK           : int  10 8 9 10 8 10 10 10 9 8 ...
##  $ prestige       : int  45 42 52 49 64 27 32 40 35 35 ...
##  $ spouse_prestige: int  0 0 31 0 0 36 0 0 0 0 ...
##  $ PRAY           : int  2 4 2 2 5 2 2 2 2 2 ...
##  $ newage         : int  61 52 42 53 84 83 83 86 34 59 ...
##  $ socpos         : int  10 8 9 10 8 10 10 10 9 8 ...
##  $ prayer_freq    : chr  "1ce/Day" "1ce/Week" "1ce/Day" "1ce/Day" ...
##  - attr(*, "col.label")= chr [1:89] "Rs religious preference" "Favor preference in hiring blacks" "Blacks overcome prejudice without favors " "How close feel to blacks  " ...
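One thing worth checking in Part 4 is what happens when a text column is compared with a number: R converts the number to text and compares the two as strings. The sketch below uses three made-up rows (assuming dplyr and tibble are installed) to contrast filtering on the text label with filtering on the numeric code; toy, by_text, and by_code are names invented for the example.

```r
library(dplyr)
library(tibble)

# Three made-up respondents using the same labels as the recode above
toy <- tibble(
  PRAY        = c(1L, 2L, 4L),
  prayer_freq = c("Several Times/Day", "1ce/Day", "1ce/Week")
)

# Comparing text with 3 compares the strings against "3"
by_text <- toy %>% filter(prayer_freq < 3)
by_text$PRAY

# Comparing the numeric code keeps "at least once a day" (codes 1 and 2)
by_code <- toy %>% filter(PRAY < 3)
by_code$PRAY
```

The text comparison keeps the labels that start with "1" and drops "Several Times/Day", so filtering on the numeric code is the safer way to express "prays at least once a day".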

# Official Question Time 6

So far today, we’ve learned

1. Missing Data

2. Workspaces and Projects

3. Importing Data

4. Coding and Recoding Data

1. Categorical

2. Numeric

3. Missing