Cleaning SFO Weather Data

Note: Helpful Data Wrangling Notes
  • month.abb is a built-in character vector in R that holds the twelve 3-letter month abbreviations.
  • You can create your own data frame with the tibble() function. Look up the documentation for this function by typing ?tibble::tibble in the Console.
  • You can create regular sequences in R with :, e.g., 3:5 generates the sequence c(3, 4, 5).
  • You can also create regular sequences in R with seq(), e.g., seq(from = 3, to = 5, by = 1) generates the sequence c(3, 4, 5). Look up the documentation for this function by typing ?seq in the Console.
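The notes above can be tried out together in a few lines. This is only a small sketch: the object toy and its column names are made up for illustration and are not part of the exercise data.

```r
library(tibble)

# month.abb is a built-in character vector of length 12
first_three <- month.abb[1:3]
first_three

# `:` and seq() produce the same values here
same_values <- all(3:5 == seq(from = 3, to = 5, by = 1))
same_values

# A toy tibble (hypothetical columns); later columns can use earlier ones
toy <- tibble(
  month_num  = 1:3,
  month_name = month.abb[month_num]
)
toy
```

Printing toy in the Console shows the column types under the column names, just like the printed output in the exercise below.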
Important: Practicing Keyboard Shortcuts

Try out the following as you work on this exercise:

  • Tab completion (Try this out when writing your file paths! Typing out a partial path will pull up a mini file-explorer)
  • Insert a code chunk
  • Run a code chunk
  • Navigating around words and lines (selecting and deleting them)
  • Run selected lines (not a whole code chunk)
  • Insert the assignment operator (<-)
  • Insert the pipe operator (|>)

Exercise

Carry out the following steps to clean and save the San Francisco weather data. Make sure to download the data file and add it to your portfolio repository as instructed.

Code
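library(tidyverse)  # attaches dplyr, readr, and (from tidyverse 2.0.0) lubridate
library(readr)      # already attached by tidyverse; provides read_csv() and write_csv()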
  1. Read in the weather data from this file, using the correct relative file path, after you move it to the instructed location.
Code
weather <- read_csv("data/weather.csv")
head(weather)
# A tibble: 6 × 18
  Month   Day   Low  High NormalLow NormalHigh RecordLow LowYr RecordHigh HiYear
  <dbl> <dbl> <dbl> <dbl>     <dbl>      <dbl>     <dbl> <dbl>      <dbl>  <dbl>
1    11    20    48    55        48         62        35  1964         69   2005
2     6    16    52    68        53         70        46  1952         90   1961
3     5     9    47    63        50         66        41  1950         88   1993
4    10    26    47    69        52         69        39  1954         89   2003
5     9    27    55    82        55         73        47  1955         96   2010
6     7     6    52    70        54         71        47  1953         86   1957
# ℹ 8 more variables: Precip <dbl>, RecordPrecip <dbl>, PrecipYr <dbl>,
#   date <chr>, Record <lgl>, RecordText <chr>, RecordP <lgl>, CulmPrec <dbl>
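If read_csv() cannot find the file, it helps to remember that a relative path is resolved from R's current working directory. The sketch below sets up a throwaway folder to show this. The folder name toy_project and the two-row data set are made up for illustration; your real path depends on where you were told to put the file.

```r
library(readr)

# Work inside a temporary directory so this sketch does not touch a real project
project_dir <- file.path(tempdir(), "toy_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(project_dir)

# Write a tiny stand-in file at data/weather.csv (hypothetical contents)
write_csv(data.frame(Month = c(11, 6), Day = c(20, 16)), "data/weather.csv")

# The relative path is looked up starting from getwd()
path_found  <- file.exists("data/weather.csv")
toy_weather <- read_csv("data/weather.csv", show_col_types = FALSE)

# Restore the original working directory
setwd(old_wd)
```

Running file.exists() on your path before read_csv() is a quick way to check whether the path or the file location is the problem.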
  1. There is a variable that has values that don’t make sense in the data context. Figure out which variable this is and clean it up by making those values missing using na_if().
Code
# 99999 is not a valid year, so recode it as missing (NA)
weather_clean <- weather %>%
  mutate(PrecipYr = na_if(PrecipYr, 99999))
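To see what na_if() does in isolation, here is a minimal sketch on a made-up tibble. The column year and its values are hypothetical; the only thing carried over from the exercise is the idea that 99999 stands in for a missing year.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 99999 is an impossible year
toy <- tibble(year = c(1964, 99999, 2005))

# Looking at the range is one way to spot a value that does not fit
range(toy$year)

# na_if() replaces every value equal to 99999 with NA and leaves the rest alone
toy_clean <- toy %>%
  mutate(year = na_if(year, 99999))

toy_clean
```

The same kind of check (range(), summary(), or sorting a column) can help you find the variable with nonsensical values in the weather data.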
  1. Create a variable called dateInYear that indicates the day of the year (1-365) for each case. (Jan 1 should be 1, and Dec 31 should be 365).
Code
# Parse the date strings as month/day/year, then extract the day of the year
weather_clean %>%
  mutate(dateInYear = yday(mdy(date)))
# A tibble: 365 × 19
   Month   Day   Low  High NormalLow NormalHigh RecordLow LowYr RecordHigh
   <dbl> <dbl> <dbl> <dbl>     <dbl>      <dbl>     <dbl> <dbl>      <dbl>
 1    11    20    48    55        48         62        35  1964         69
 2     6    16    52    68        53         70        46  1952         90
 3     5     9    47    63        50         66        41  1950         88
 4    10    26    47    69        52         69        39  1954         89
 5     9    27    55    82        55         73        47  1955         96
 6     7     6    52    70        54         71        47  1953         86
 7    11     3    48    60        51         66        40  1971         84
 8     3    26    47    58        47         62        38  1980         79
 9    10     4    57    66        55         72        47  1989         95
10    11    26    49    59        47         60        36  1952         76
# ℹ 355 more rows
# ℹ 10 more variables: HiYear <dbl>, Precip <dbl>, RecordPrecip <dbl>,
#   PrecipYr <dbl>, date <chr>, Record <lgl>, RecordText <chr>, RecordP <lgl>,
#   CulmPrec <dbl>, dateInYear <dbl>
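The two lubridate functions used above can be checked on a few hand-picked dates. This is a sketch with made-up date strings; it assumes the strings are written in month/day/year order, which is what mdy() expects.

```r
library(lubridate)

# Hypothetical date strings in month/day/year order (2011 is not a leap year)
dates <- c("1/1/2011", "3/1/2011", "12/31/2011")

# mdy() turns the strings into Date objects; yday() returns the day of the year
day_of_year <- yday(mdy(dates))
day_of_year  # 1 60 365

# In a leap year, Dec 31 is day 366 rather than 365
leap_last_day <- yday(mdy("12/31/2012"))
leap_last_day  # 366
```

So whether Dec 31 comes out as 365 depends on the year stored in the date strings.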
  1. Create a variable called month_name that shows the 3-letter month abbreviation for each case.
Code
# Use each month number as an index into month.abb to look up its abbreviation
weather_clean %>%
  mutate(month_name = month.abb[Month]) %>%
  head()
# A tibble: 6 × 19
  Month   Day   Low  High NormalLow NormalHigh RecordLow LowYr RecordHigh HiYear
  <dbl> <dbl> <dbl> <dbl>     <dbl>      <dbl>     <dbl> <dbl>      <dbl>  <dbl>
1    11    20    48    55        48         62        35  1964         69   2005
2     6    16    52    68        53         70        46  1952         90   1961
3     5     9    47    63        50         66        41  1950         88   1993
4    10    26    47    69        52         69        39  1954         89   2003
5     9    27    55    82        55         73        47  1955         96   2010
6     7     6    52    70        54         71        47  1953         86   1957
# ℹ 9 more variables: Precip <dbl>, RecordPrecip <dbl>, PrecipYr <dbl>,
#   date <chr>, Record <lgl>, RecordText <chr>, RecordP <lgl>, CulmPrec <dbl>,
#   month_name <chr>
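The lookup month.abb[Month] works because indexing a vector with numbers returns the elements at those positions. Here is a small sketch on made-up month numbers. The factor step at the end is an optional extra that is not part of the exercise; it shows one way to keep months in calendar order instead of alphabetical order.

```r
# Hypothetical month numbers, matching the first three rows printed above
month_num <- c(11, 6, 5)

# Position 11 of month.abb is "Nov", position 6 is "Jun", position 5 is "May"
month_name <- month.abb[month_num]
month_name  # "Nov" "Jun" "May"

# Optional: a factor with levels = month.abb sorts in calendar order
month_fct <- factor(month_name, levels = month.abb)
as.character(sort(month_fct))  # "May" "Jun" "Nov"
```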
  1. Save the wrangled data to the data/processed/ folder using write_csv(). Name this file weather_clean.csv. Look up the documentation for this function by typing ?write_csv in the Console. You’ll need to write an appropriate relative path.
Code
write_csv(weather_clean, file = "data/processed/weather_clean.csv")
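Note that the chunks for dateInYear and month_name above print their results without assigning them, so those columns are not stored in weather_clean. One way to make sure every new column reaches the saved file is to do all the cleaning steps in a single assigned pipeline. The sketch below does this on a made-up two-row tibble and writes to a temporary file. The data values and the temporary location are assumptions for illustration, not the exercise's actual data or path.

```r
library(dplyr)
library(tibble)
library(readr)
library(lubridate)

# Hypothetical stand-in for the weather data (column names follow the exercise)
toy <- tibble(
  Month    = c(1, 12),
  PrecipYr = c(1952, 99999),
  date     = c("1/1/2011", "12/31/2011")
)

# Do every cleaning step in one pipeline and assign the result
toy_clean <- toy %>%
  mutate(
    PrecipYr   = na_if(PrecipYr, 99999),
    dateInYear = yday(mdy(date)),
    month_name = month.abb[Month]
  )

# Write to a temporary file, then read it back to confirm what was saved
out_file <- file.path(tempdir(), "weather_clean.csv")
write_csv(toy_clean, file = out_file)
saved <- read_csv(out_file, show_col_types = FALSE)
names(saved)
```

Reading the file back in is a quick check that the columns you expect were actually written.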