Will S
One of our datasets would require some web scraping to access, so I wanted to try just that.
I started by looking at the website's robots.txt and learned that we are allowed to scrape it.
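As a rough illustration of what that check involves, here is a minimal base-R sketch. The function name `is_path_allowed` and the example rules are hypothetical (they are not the site's actual robots.txt), and the parsing is deliberately simplified: it only looks at `User-agent: *` groups and plain `Disallow:` prefixes, ignoring `Allow:` rules and wildcards.

```r
# Simplified robots.txt check (hypothetical helper, not a full parser):
# collects Disallow prefixes that apply to "User-agent: *" and tests a path.
is_path_allowed <- function(robots_lines, path) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  in_star_group <- FALSE
  prev_was_agent <- FALSE
  disallowed <- character(0)
  for (line in lines) {
    field <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (field == "user-agent") {
      # A User-agent line after a rule line starts a new group
      if (!prev_was_agent) in_star_group <- FALSE
      if (value == "*") in_star_group <- TRUE
      prev_was_agent <- TRUE
    } else {
      prev_was_agent <- FALSE
      if (field == "disallow" && in_star_group && nzchar(value)) {
        disallowed <- c(disallowed, value)
      }
    }
  }
  if (length(disallowed) == 0) return(TRUE)
  !any(startsWith(path, disallowed))
}

# Made-up rules for illustration only
robots_example <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Disallow: /search  # internal search pages"
)
stats_allowed <- is_path_allowed(robots_example, "/wnba/stats/player")
admin_allowed <- is_path_allowed(robots_example, "/admin/users")
```

In practice the lines could come from something like `readLines()` on the site's robots.txt URL, and reading the file by eye, as I did, works just as well for a single site.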
First I had to install rvest in our virtual environment.
Then I wrote the two functions below to scrape a page of player stats into a data frame:
library(rvest)  # read_html(), html_elements(), html_text()
library(dplyr)  # mutate(), across(), %>%

# Helper function to reduce html_elements() |> html_text() code duplication
get_text_from_page <- function(page, css_selector) {
  page |>
    html_elements(css_selector) |>
    html_text()
}

# Scrape one stats page into a data frame with one row per player
scrape_page <- function(url) {
  Sys.sleep(2)  # pause before each request to go easy on the server
  page <- read_html(url)
  column_names <- get_text_from_page(page, ".Table__TH")
  table <- get_text_from_page(page, ".Table__TD")

  # get index and names: the first 100 cells alternate between the two
  names_index <- table[1:100]
  index <- names_index[seq(from = 1, to = length(names_index), by = 2)]
  names <- names_index[seq(from = 2, to = length(names_index), by = 2)]

  # get stat values: the remaining cells hold 20 values per player, row by row
  df <- data.frame(matrix(table[101:length(table)], nrow = length(index), ncol = 20, byrow = TRUE))
  df <- cbind(index, names, df)
  names(df) <- column_names

  # every column except Name and POS is numeric
  df <- df %>% mutate(across(-c(Name, POS), as.numeric))
  return(df)
}
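
To make the reshaping step in `scrape_page()` easier to follow, here is a small sketch on made-up data. It assumes the same cell layout that the indexing above relies on (all rank/name pairs first, then the stat values row by row), but with 3 players and 2 stat columns instead of 50 and 20. The player names, values, and variable names are hypothetical.

```r
# Toy stand-in for the flat vector of table cells (made-up values)
cells <- c(
  "1", "Player A", "2", "Player B", "3", "Player C",  # rank/name pairs
  "G", "30", "F", "28", "C", "25"                     # POS and GP, row by row
)
n_players <- 3
n_stat_cols <- 2

# Split the interleaved rank/name cells into two vectors
rank_name <- cells[1:(2 * n_players)]
ranks <- rank_name[seq(from = 1, to = length(rank_name), by = 2)]
player_names <- rank_name[seq(from = 2, to = length(rank_name), by = 2)]

# Fill the remaining cells into a matrix one row (player) at a time
stats <- matrix(
  cells[(2 * n_players + 1):length(cells)],
  nrow = n_players, ncol = n_stat_cols, byrow = TRUE
)

toy <- data.frame(ranks, player_names, stats, stringsAsFactors = FALSE)
names(toy) <- c("RK", "Name", "POS", "GP")

# Convert the numeric columns, leaving Name and POS as text
toy$RK <- as.numeric(toy$RK)
toy$GP <- as.numeric(toy$GP)
```

The `byrow = TRUE` argument is what keeps each player's stats together on one row; without it, `matrix()` would fill column by column and scramble the values across players.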
#base_url <- "https://www.espn.com/wnba/stats/player"
#urls <- str_c("https://www.espn.com/wnba/stats/player/_/season/", 1997:2025, "/seasontype/2")
#df <- scrape_page(base_url)
#dfs <- purrr::map2(urls, 1997:2025, ~ scrape_page(.x) |> dplyr::mutate(year = .y))
#df <- bind_rows(dfs)
#write_csv(df, "../data/raw/ESPN_WNBA_Top50RegularSeason_1997to2025.csv")

#espn <- read.csv("../data/raw/ESPN_WNBA_Top50RegularSeason_1997to2025.csv")
#kaggle <- read.csv("../data/raw/Kaggle_WNBA_draft.csv")
#espn <- espn %>%
# mutate(
# team.abbr = str_extract(Name, "[A-Z]{3}$"),
# Name = str_remove(Name, "[A-Z]{3}$")
# )
#head(espn)
#head(kaggle)
#combined <- espn %>%
# full_join(
# kaggle,
# by = c("Name" = "player", "year" = "year"))
#write_csv(combined, "../data/raw/combined_ESPN_Kaggle_WNBA_1997to2025.csv")

Then Hamza scraped a different website entirely, and we went with his work instead since its coverage was more extensive.
After that, I spent most of my time formatting the index page and creating the plots and their animations.