Scientific reuse of openly published biodiversity information: Programmatic access to and analysis of primary biodiversity information using R. Nordic Oikos 2018, pre-conference R workshop, 18th and 19th February 2018 in Trondheim, Norway.
February 18-19, 2018
Biodiversity Atlas Sweden is funded by the Swedish Research Council (Grant No 2017-00688).
The session covers getting started: installing the recommended software and tools, including the installation procedures for some of the packages used in the upcoming sessions on geospatial data analysis in R.
The session also suggests a few online learning resources for intermediate and advanced use beyond simple R scripts, including training resources that help when building packages for data and web apps, and workflows for international collaboration.
If time permits, a few usage examples will also be shown.
There are different ways to get everything you need onto your laptop:
Easiest / SaaS - Log in to a remote server where everything has already been set up (convenient: requires only a login and password).
"Traditional Non-Isolated" - Do a traditional manual installation of all software components step by step, with the various adaptations required by your OS (Windows, Mac, Linux etc). Non-isolated means the installation is mixed in with everything else on your laptop.
"Traditional Isolated - VM" - Create a Virtual Machine, for example using VirtualBox, starting from an instance of a Free and Open-Source Software (FOSS) OS.
Modern / FOSS - Pull image(s) with all required software prepackaged from Docker Hub (requires Docker) and launch them locally. The `docker` command is like `git`, but for software packages rather than code (portable binaries, which can also be used for datasets!): use `docker push` to publish and `docker pull` to use (versioned) images from Docker Hub.
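As a minimal sketch of that publish/pull cycle (the image and organisation names below are placeholders, not from this workshop, and the version tags are illustrative):

```bash
# pull a versioned image from Docker Hub (the tag shown is illustrative)
docker pull rocker/geospatial:3.4.3

# build and publish your own image under a version tag (placeholder names)
docker build -t myorg/mygeotools:0.1.0 .
docker login
docker push myorg/mygeotools:0.1.0
```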
Running the rocker/geospatial image gives an official "stack" with most things needed for geospatial data analysis work:
```bash
#!/bin/bash
docker run -d --name mywebide \
  --user $(id -u):$(id -g) \
  --publish 8787:8787 \
  --volume $(pwd):/home/rstudio \
  rocker/geospatial
firefox http://localhost:8787 &
```
This image (roughly 5 GB) extends the above stack with additional R packages:
```bash
docker pull bioatlas/mirroreum
```
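Assuming mirroreum serves RStudio in the browser on port 8787 just like the rocker image above (an assumption, not stated here), it can be launched the same way:

```bash
# assumption: bioatlas/mirroreum exposes RStudio on port 8787 like rocker/geospatial
docker run -d --name mirroreum \
  --publish 8787:8787 \
  --volume $(pwd):/home/rstudio \
  bioatlas/mirroreum
firefox http://localhost:8787 &
```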
For geospatial data analysis and spatiotemporal work in general, and in particular for working with data from GBIF and various related data sources, this is a recommended software "stack":
```bash
# find the package dependencies by grepping the repo for library and require, ex:
# rgrep "library" | grep -oP "library\(.*?\)" | sort -u
```
```r
install.packages(c(
  "rgbif", "rstudioapi", "ALA4R",                 # data access
  "tidyverse", "plyr", "rio", "stringr",          # wrangling
  "DT", "gapminder", "plotly", "RColorBrewer",    # visuals
  "spocc", "rgeos", "rgdal", "mapproj", "maps",
  "maptools", "raster", "mapr",                   # geospatial
  "ape", "phytools"                               # phylogenetic tools
))
```
What packages do you have?
```bash
R -e "cat(rownames(installed.packages()))"
# use setdiff() to see what you're missing
```
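For example, a small R sketch that compares a required set of packages (here just a subset of the list above) with what is installed, and installs anything missing:

```r
# packages this workshop expects (subset of the list above)
required <- c("rgbif", "ALA4R", "tidyverse", "spocc", "mapr", "raster")

# which of these are not yet installed?
missing <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```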
The `git` tool allows decentralized, asynchronous collaboration across individuals and teams by managing different versions of sets of files (often code).
It supports an increasingly standardized workflow known as "GitFlow", in which so-called "Pull Requests" (or "Merge Requests") combine contributions from various individuals, allowing changes on branches to evolve into new master versions.
The `git` command and the GitFlow workflow can be used locally by a single user, since they make it possible to track changes, revert to earlier versions, and so on.
Git is also really useful at scale, for example when collaborating with several other colleagues: it lets you sync changes from remote repositories into your local repos by pulling from GitHub, a site where open source code can be stored and citable code can be published at no cost.
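A minimal sketch of that pull-based workflow, using the workshop repository that also appears below (the branch name is illustrative):

```bash
# get a local copy of a remote repository
git clone git@github.com:GBIF-Europe/nordic_oikos_2018_r.git
cd nordic_oikos_2018_r

# later on: sync changes made by colleagues into your local repo
git pull origin master

# make your own contribution on a branch, push it and open a Pull Request on GitHub
git checkout -b my-contribution
git push -u origin my-contribution
```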
Resources:
If you use the CLI to push HTML files to a branch named "gh-pages", they will appear on the Internet via GitHub Pages:
```bash
# start a new git repo in a local directory
apt install git
mkdir -p ~/repos/myrepo && cd ~/repos/myrepo
git init

# add content and push to remote repo
git add index.Rmd
git commit -m "initial commit"
git remote add origin \
  git@github.com:GBIF-Europe/nordic_oikos_2018_r.git
git push --set-upstream origin master
```
```bash
# knit a .Rmd in RStudio and get a .html file
git checkout -b gh-pages
git add index.html
git commit -m "add webpage"

# publish by pushing the HTML file to the gh-pages branch
git push -u origin gh-pages

# verify that the web page is there
firefox https://gbif-europe.github.io/nordic_oikos_2018_r/s2_r_intro
```
There are GUIs, for example in RStudio, for working with local and remote repositories (for example on GitHub) that abstract away some of the steps; all of them can also always be done with `git` at the CLI.
Setting up RStudio to use `git` and GitHub means using the Git icon in the RStudio toolbar, but first you need to do some initial setup in RStudio, which is richly documented in many places if you search the Internet.
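One piece of initial setup that is always needed, so that commits made from RStudio are attributed to you, is telling git who you are; the name and email below are placeholders:

```bash
# one-time setup: identify yourself to git (replace with your own details)
git config --global user.name "Your Name"
git config --global user.email "you@example.org"

# check that git is available on the PATH that RStudio sees
git --version
```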
If time permits…
The R package is the unit to collaborate on for functionality that may be useful to others (algorithms, web apps, data). R packages are published and shared through CRAN/MRAN or R-Forge, global networks of archive mirrors for R packages. You can also put an R package on GitHub (and someone else can then install it with `devtools::install_github()`). The path from an R script to an R package involves:
In RStudio, create a new project for an R package, load the devtools package, and follow the guidelines in http://r-pkgs.had.co.nz/
Put the functions from the R script in the "R/" directory and document exported functions using roxygen2 / `devtools::document()` (the comments look like `#'`); move the parts of the R script where you call the functions into tests under "tests/" (use `devtools::use_testthat()` and `devtools::use_test()`) - see the sketch after this list.
The library() statements go away; instead, imports are declared in the function documentation with `#' @importFrom` or `#' @import` comments, and end up in the NAMESPACE and DESCRIPTION files.
Add commentary in a vignette - an introduction.Rmd providing a short tutorial for the relevant workflow, showing how to use the functionality.
Run "devtools::check()" to see if package conforms to CRAN requirements and fix complaints reported, then put code on GitHub to collaborate with others on the package
An example R package with the earlier function can be found here: the darwinator R package example @ GitHub.
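To try such a GitHub-hosted package yourself, a hedged sketch (the account path below is a placeholder; use the repository linked above):

```r
# install an R package straight from GitHub (replace the placeholder account)
# install.packages("devtools")   # if devtools is not installed yet
devtools::install_github("some-account/darwinator")
library(darwinator)
```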