This post is part two in a series of posts introducing funspotr. See also:
- Identifying R Functions & Packages Used in GitHub Repos (funspotr part 1)
- Network plots of code collections (funspotr part 3)
This post shows how funspotr can also be applied to parse gists:
By functions or packages used…? https://t.co/kbSLOpQZLF
— Bryan Shalloway (@brshallo) January 22, 2022
A problem I bumped into was that most of Chelsea’s gists don’t actually have .R or .Rmd extensions, so my approach skipped most of her snippets. I wanted to parse my own gists but ran into a related problem: most of my GitHub gist code snippets are saved as .md files[1], so knitr::purl() won’t work[2].
In this post I…
- create a function to extract code chunks from simple .md files
- parse the functions and packages in my code using funspotr[3].
This post was updated on 2023-10-11 to make it consistent with updated {funspotr} code. Tables were also updated to reflect brshallo gists at that time. The following post on network plots, however, was not updated.
Parsing code
First I used funspotr to get a table with all of my gists.
library(dplyr)
library(purrr)
library(stringr)
library(funspotr)
brshallo_gists <- funspotr::list_files_github_gists("brshallo", pattern = ".")
brshallo_gists
## # A tibble: 112 × 2
## relative_paths absolute_paths
## <chr> <chr>
## 1 find_in_files.R https://gist.githubusercontent.…
## 2 permits-issued.md https://gist.githubusercontent.…
## 3 seattle-units-added-new-permits.md https://gist.githubusercontent.…
## 4 rolling-mean-conditioned-date.R https://gist.githubusercontent.…
## 5 rolling-mean-conditioned-on-iteration-date.R https://gist.githubusercontent.…
## 6 lag-multiple-across.md https://gist.githubusercontent.…
## 7 log-log-transform-example.md https://gist.githubusercontent.…
## 8 convert-currencies.R https://gist.githubusercontent.…
## 9 unique-set-speed-test.R https://gist.githubusercontent.…
## 10 unique-combinations.R https://gist.githubusercontent.…
## # ℹ 102 more rows
Parsing R files
funspotr is already set up to parse all the unique functions and packages from R or Rmd files.
r_gists <- brshallo_gists %>%
  filter(funspotr:::str_detect_r_docs(relative_paths))
r_gists_parsed <- funspotr::spot_funs_files(r_gists)
r_gists_unnested <- r_gists_parsed %>%
  funspotr::unnest_results()
A warning message (hidden from this post) indicates that a couple of files did not parse correctly. In this particular case those files were created as reprexes with .md output but I saved them as .R files, hence they failed parsing.
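One way to see which files will trip up the parser before handing them to funspotr is to try base R's parse() on each one. The sketch below is not part of funspotr: parses_as_r() is a hypothetical helper, and the two temporary files are stand-ins for real gists (one valid R script, one reprex-style markdown file saved with a .R extension).

```r
# Hypothetical helper (not part of funspotr): TRUE if a file parses as R code.
parses_as_r <- function(path) {
  tryCatch({
    parse(file = path)
    TRUE
  }, error = function(e) FALSE)
}

# Stand-in files: valid R, and markdown output saved with a .R extension.
good <- tempfile(fileext = ".R")
bad <- tempfile(fileext = ".R")
writeLines(c("library(stats)", "x <- mean(1:10)"), good)
writeLines(c("``` r", "x <- mean(1:10)", "```"), bad)

# Named logical vector: one entry per file.
parse_ok <- vapply(c(good = good, bad = bad), parses_as_r, logical(1))
parse_ok
```

The fenced file fails because the backtick fences are not valid R syntax, which is the same reason the mis-saved gists drop out of the parsed results.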
r_gists_unnested
## # A tibble: 701 × 4
## funs pkgs relative_paths absolute_paths
## <chr> <chr> <chr> <chr>
## 1 library base find_in_files.R https://gist.githubuserconte…
## 2 dir_ls fs find_in_files.R https://gist.githubuserconte…
## 3 map purrr find_in_files.R https://gist.githubuserconte…
## 4 grep base find_in_files.R https://gist.githubuserconte…
## 5 readLines base find_in_files.R https://gist.githubuserconte…
## 6 keep purrr find_in_files.R https://gist.githubuserconte…
## 7 length base find_in_files.R https://gist.githubuserconte…
## 8 library base rolling-mean-conditioned-date.R https://gist.githubuserconte…
## 9 seq base rolling-mean-conditioned-date.R https://gist.githubuserconte…
## 10 map purrr rolling-mean-conditioned-date.R https://gist.githubuserconte…
## # ℹ 691 more rows
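The split between R-type files and everything else relies on funspotr:::str_detect_r_docs(), an internal helper whose exact behavior may differ between versions. A rough stand-in, assuming only the file extension matters, can be written in base R. Both is_r_doc() and the last two sample file names are illustrative, not taken from funspotr or from my gists.

```r
# Illustrative stand-in (an assumption, not funspotr's implementation):
# treat files ending in .R or .Rmd, in any case, as R documents.
is_r_doc <- function(paths) {
  grepl("\\.(r|rmd)$", paths, ignore.case = TRUE)
}

files <- c("find_in_files.R", "permits-issued.md", "notes.Rmd", "README")

# Group the file names the same way the post does: R documents vs. the rest.
split(files, ifelse(is_r_doc(files), "r_docs", "other"))
```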
Parsing markdown files
To parse my .md files, I wrote a function, extract_code_md(), that…
- reads in a file
- extracts the text in code chunks[4]
- saves it to a temporary file
- returns the file path of the temporary file
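# keep the even-numbered elements: after splitting on fences these are the code chunks
subset_even <- function(x) x[!seq_along(x) %% 2]
extract_code_md <- function(file_path){
  lines <- readr::read_file(file_path) %>%
    # split on every fence line (opening or closing)
    stringr::str_split("```.*", simplify = TRUE) %>%
    subset_even() %>%
    # glue the chunks back together with a marker comment between them
    stringr::str_flatten("\n## new chunk \n")
  file_output <- tempfile(fileext = ".R")
  writeLines(lines, file_output)
  file_output
}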
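To see why keeping the even-numbered pieces works, here is a small worked example of the same fence-splitting idea on an in-memory markdown string. It is a base-R variant (strsplit() and paste() swapped in for the readr and stringr calls, an assumption made so the sketch runs without extra packages), and the markdown content is made up.

```r
# Made-up markdown with two fenced chunks.
md <- paste(
  "Intro text",
  "``` r",
  "library(stats)",
  "x <- mean(1:10)",
  "```",
  "More prose",
  "``` r",
  "y <- sd(1:10)",
  "```",
  sep = "\n"
)

# perl = TRUE so that "." stops at line ends and only the fence line is consumed.
pieces <- strsplit(md, "```.*", perl = TRUE)[[1]]

# Pieces alternate prose, code, prose, code: the even positions are code.
code_pieces <- pieces[seq_along(pieces) %% 2 == 0]

out <- tempfile(fileext = ".R")
writeLines(paste(code_pieces, collapse = "\n## new chunk \n"), out)

# The temporary file now parses as ordinary R code.
exprs <- parse(out)
length(exprs)
```

The alternation is the fragile part: an unmatched fence anywhere in the file shifts every later piece from "code" to "prose" and vice versa.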
I map extract_code_md() over all the .md gists and then parse the files using funspotr.
md_gists <- brshallo_gists %>%
  filter(!funspotr:::str_detect_r_docs(relative_paths))
md_gists_local <- md_gists %>%
  rename(urls = absolute_paths) %>%
  # name absolute_paths because that's what funspotr::spot_funs_files() expects
  mutate(absolute_paths = map_chr(urls, extract_code_md))
md_gists_parsed <- funspotr::spot_funs_files(md_gists_local) %>%
  mutate(absolute_paths = urls) %>%
  select(-urls)
md_gists_unnested <- md_gists_parsed %>%
  funspotr::unnest_results()
As with the R files, some files did not parse correctly, though the warnings are hidden by the warning = FALSE setting in the code chunks. These files are simply not included in the unnested output.
md_gists_unnested
## # A tibble: 1,061 x 5
## funs pkgs in_multiple_pkgs contents urls
## <chr> <chr> <lgl> <chr> <chr>
## 1 library base FALSE grouped-nested-t-test.md "C:\\Users\~
## 2 require base FALSE grouped-nested-t-test.md "C:\\Users\~
## 3 install_github remotes FALSE grouped-nested-t-test.md "C:\\Users\~
## 4 na.omit stats FALSE grouped-nested-t-test.md "C:\\Users\~
## 5 t.test stats FALSE grouped-nested-t-test.md "C:\\Users\~
## 6 tidy broom FALSE grouped-nested-t-test.md "C:\\Users\~
## 7 pull dplyr FALSE grouped-nested-t-test.md "C:\\Users\~
## 8 group_by dplyr FALSE grouped-nested-t-test.md "C:\\Users\~
## 9 summarise dplyr FALSE grouped-nested-t-test.md "C:\\Users\~
## 10 list base FALSE grouped-nested-t-test.md "C:\\Users\~
## # ... with 1,051 more rows
Note that I’m assuming all the code snippets are R code[5].
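If that assumption does not hold, one option is to keep only the chunks whose opening fence is tagged as R. The sketch below is not what this post does: extract_r_chunks() is a hypothetical helper, it works line by line rather than on the whole file, and it assumes fences are properly paired.

```r
# Hypothetical helper: keep only lines inside fences whose opener is tagged as R
# (e.g. "```r", "``` r" or "```{r}"). Assumes every fence is properly closed.
extract_r_chunks <- function(lines) {
  is_fence <- grepl("^\\s*```", lines, perl = TRUE)
  fence_count <- cumsum(is_fence)
  # Inside a chunk: an odd number of fences seen so far, and not a fence line itself.
  in_chunk <- fence_count %% 2 == 1 & !is_fence

  # Opening fences are the 1st, 3rd, 5th, ... fence lines.
  fences <- lines[is_fence]
  openers <- fences[seq_along(fences) %% 2 == 1]
  opener_is_r <- grepl("^\\s*```\\s*\\{?[rR]\\b", openers, perl = TRUE)

  # Map each in-chunk line to its chunk number, then to its opener's tag.
  keep <- in_chunk
  chunk_no <- (fence_count[in_chunk] + 1) %/% 2
  keep[in_chunk] <- opener_is_r[chunk_no]
  lines[keep]
}

md_lines <- c(
  "Some prose",
  "``` r", "x <- 1", "```",
  "```python", "print(1)", "```",
  "```{r}", "y <- 2", "```"
)
r_code <- extract_r_chunks(md_lines)
r_code
```

Untagged fences are dropped by this version, so it would be stricter than the approach used in the post, which keeps every fenced chunk.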
Binding files together
I bind these files together and then arrange them based on the initial order in brshallo_gists[6].
gists_unnested <- bind_rows(
  r_gists_unnested,
  md_gists_unnested
) %>%
  # got this arranging by a vector trick from SO:
  # https://stackoverflow.com/questions/52216341/how-to-sort-rows-of-a-data-frame-based-on-a-vector-using-dplyr-pipe
  arrange(match(relative_paths, brshallo_gists$relative_paths)) %>%
  # add back in links to the URLs where the files are, rather than the urls
  # column being local paths for .md snippets
  select(-absolute_paths) %>%
  left_join(brshallo_gists, by = "relative_paths")
gists_unnested %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                filter = 'top',
                escape = FALSE,
                options = list(pageLength = 20))
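The arrange(match(...)) step is easier to follow on a toy example. The sketch below uses base R's order() in place of dplyr::arrange() so it runs without extra packages. The data frame is made up, and only the file names are borrowed from the listing above.

```r
# Toy stand-ins for the gist listing and the bound results.
listing_order <- c("find_in_files.R", "permits-issued.md", "convert-currencies.R")
results <- data.frame(
  funs = c("tidy", "map", "grep", "mutate"),
  relative_paths = c("permits-issued.md", "find_in_files.R",
                     "find_in_files.R", "convert-currencies.R"),
  stringsAsFactors = FALSE
)

# match() gives each row's position in listing_order; order() sorts rows by it.
positions <- match(results$relative_paths, listing_order)
sorted <- results[order(positions), ]
sorted$relative_paths
```

Rows from the same file keep their relative order because the sort is stable, so functions still appear in the order they were found within each file.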
Organizing snippets
Perhaps I’ll do a follow-up and show some ways the relationships between the resulting parsed code snippets may be visualized in a network or organized in some other way.
Mentioned in the initial thread, Obsidian seems to be a product that does some things along these lines:
I’ve found this useful (https://t.co/OYzwfTltLG) I is a tool for writing, organizing, linking markdown files.
— John Lee (@Jdlee888) January 21, 2022
Appendix
Interactively save the current gists to a folder so they can be read from another file if needed.
post_path <- fs::path_dir(rstudioapi::getSourceEditorContext()$path)
fs::dir_create(post_path, "data")
readr::write_csv(
  gists_unnested,
  fs::path(post_path, "data", paste0("brshallo-gists-", format(Sys.Date(), "%Y%m%d"), ".csv"))
)
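Reading a snapshot back in from another file could look something like the sketch below. It assumes the dated file-name pattern used above, uses base R in place of readr and fs, and writes two stand-in snapshots to a temporary folder, so the folder name and toy data are illustrative only.

```r
# Stand-in data folder with two dated snapshots (toy content).
data_dir <- file.path(tempdir(), "gists-data")
dir.create(data_dir, showWarnings = FALSE)
toy <- data.frame(funs = "map", pkgs = "purrr")
write.csv(toy, file.path(data_dir, "brshallo-gists-20220122.csv"), row.names = FALSE)
write.csv(toy, file.path(data_dir, "brshallo-gists-20231011.csv"), row.names = FALSE)

# The yyyymmdd suffix sorts chronologically as text, so the last name is the newest.
snapshots <- sort(list.files(
  data_dir,
  pattern = "^brshallo-gists-[0-9]{8}\\.csv$",
  full.names = TRUE
))
latest <- snapshots[length(snapshots)]

gists_saved <- read.csv(latest)
basename(latest)
```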
[2] knitr::purl() is used in functions within funspotr to parse R markdown files.
[3] In the future I may do a follow-up that passes the parsed functions and packages through a network analysis or some other approach to better visualize the relationships between code snippets.
[4] Based on what exists between ticks. Kind of like a less reliable version of knitr::purl() but for .md files. Also posted function on SO question.
[5] Otherwise the R code parsing steps in funspotr will fail.
[6] Note that this will just return the unique functions in each file. If I wanted to see every time I used a function, I would have passed show_each_use = TRUE to spot_funs_files().