February 18-19, 2018

Nordic Oikos 2018 – GBIF data with R

Scientific reuse of openly published biodiversity information: Programmatic access to and analysis of primary biodiversity information using R. Nordic Oikos 2018, pre-conference R workshop, 18th and 19th February 2018 in Trondheim, Norway.

Acknowledgements

The Swedish Research Council is the funder of Biodiversity Atlas Sweden

Grant No 2017-00688

Session 2: Quick intro to R, RStudio, GitHub

This session covers getting started: installing the recommended software and tools, including some of the packages used in upcoming sessions for geospatial data analysis in R.

The session will also suggest a few online learning resources for intermediate and advanced use beyond simple R scripts - including training resources to help when building packages for data and web apps and workflows for international collaboration.

Topics

  • Before work - How can you get it all onto your laptop?
  • What "stack" for R along with an Integrated Development Environment (IDE) and other trimmings is normally needed for working with geospatial data?
  • What is git and GitHub and how can you use workflows like GitFlow to collaborate with others?

If time permits, also some usage examples.

Installing your stack

There are different ways to get everything you need onto your laptop:

  1. Easiest / SaaS - Use The Cloud and a Web Browser
  2. Traditional Non-Isolated - Manual Interactive Steps
  3. Traditional Isolated VM - Semi-Automated Steps
  4. Modern / FOSS - Use Docker

Different ways to get started:

Easiest / SaaS - Log in to a remote server (convenient - requires only a login/pass) where everything has been set up already.

  • Convenient, but … "Software As A Service" has drawbacks
  • Cloud Server - No Offline Work Possible
  • You Often Cannot Fix Remote Issues
  • Closed data (not yet published) gets shared with service provider
  • How do you connect to local data?

"Traditional Non-Isolated" - Do a traditional manual installation of all software components step-by-step along with various adaptations required depending on the OS used (Windows, Mac, Linux etc). Non-isolated = it is mixed up with other stuff on your laptop.

  • Manual Interactive Steps
  • Not Portable - "it works (only?) on my machine"
  • Troubleshooting system library installation issues - read manuals or search the Internet :)
  • How Can You Share or Run Your Work Reproducibly On Another Host?

"Traditional Isolated - VM" - Create a VM - Virtual Machine - for example using VirtualBox - starting with an instance of a Free and Open-Source Software-based (FOSS) OS.

  1. Download a recent Linux Mint (Linux Mint uses Debian or Ubuntu package bases) or equivalent OS, see https://distrowatch.com/ for popularity rankings
  2. Get VirtualBox and launch an instance of this ISO in VirtualBox - configure VirtualBox to use bridged networking
  3. Log in and run custom install scripts to automate installs in a "semi-deterministic" but not always "immutable" way - or use tools like Ansible
  4. A unit (.vmdk) can be moved to another server - with limited portability

Modern / FOSS - Pull image(s) with all required prepackaged software from Docker Hub (requires Docker) and launch them locally. The docker command works like git, but for software packages rather than code (portable binaries, though it can also be used for datasets!): use docker push to publish and docker pull to fetch (versioned) images from Docker Hub.

  • Fast, Versioned and Up-To-Date
  • Permits Offline Work
  • FOSS software :) - you can use local data and build from source!
  • Great for reproducible open research
  • Can be used to share both code and data
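
The pull and push steps described above can be sketched roughly as follows. This is an illustration only: the target name "myuser/mystack" and the tag "1.0" are hypothetical, and the small run helper is an assumption added here so the commands can be previewed without Docker installed.

```shell
#!/bin/bash
# Sketch: pull a versioned image, re-tag it, and publish it to Docker Hub.
# "myuser/mystack" and the tag "1.0" are hypothetical names.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

share_image() {
  local source_image="$1"
  local target_image="$2"
  run docker pull "$source_image"                  # fetch the image from Docker Hub
  run docker tag "$source_image" "$target_image"   # give it your own name and version
  run docker push "$target_image"                  # publish it (requires docker login)
}

share_image rocker/geospatial myuser/mystack:1.0
```

With DRY_RUN left at 1 the script only prints the three docker commands; set DRY_RUN=0 to actually run them.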

Modern Setup - Steps

This gives an official "stack" with most things needed for geospatial data analysis work:

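#!/bin/bash
# start RStudio Server from the rocker/geospatial image in the background
docker run -d --name mywebide \
  --user "$(id -u):$(id -g)" \
  --publish 8787:8787 \
  --volume "$(pwd):/home/rstudio" \
  rocker/geospatial

# open the web IDE in a browser
firefox http://localhost:8787 &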

This 5 GB image extends the above stack with additional R packages:

    docker pull bioatlas/mirroreum

Traditional Setup - Manual steps

  1. Install R (>= v3.0) possibly with Rtools (if on Windows OS) from https://cran.rstudio.com/
  2. Install RStudio from https://www.rstudio.com/products/rstudio/download/#download
  3. Install various packages - which packages do you need? - including system libraries - and hunt for dependencies, sometimes resolving issues, conflicts etc by searching the Internet for solutions
  4. Customize your setup. For example, edit the .Rprofile in your home directory to override relevant settings and options; with the ALA4R package, you can override its default options there

Working with geospatial data

Which packages do you need?

# find the packages a repo uses by searching it for library() and require() calls, e.g.:
# grep -rhoP "(library|require)\(.*?\)" . | sort -u
    install.packages(c(
      "rgbif", "rstudioapi", "ALA4R",  # data access
      "tidyverse", "plyr", "rio", "stringr", # wrangling   
      "DT", "gapminder", "plotly", "RColorBrewer", # visuals
      "spocc", "rgeos", "rgdal", "mapproj", "maps", 
      "maptools", "raster", "mapr",  # geospatial
      "ape", "phytools"  # phylogenetic tools
    ))
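
The grep approach hinted at in the comment above could be wrapped in a small shell function. This is a sketch, assuming the scripts load packages with plain library(pkg) or require(pkg) calls; the function name used_packages and the demo files are made up for illustration.

```shell
#!/bin/bash
# List the packages that R scripts under a directory load via library() or require().
used_packages() {
  local dir="${1:-.}"
  grep -rhoE --include='*.R' --include='*.Rmd' \
    '(library|require)\([^)]+\)' "$dir" |
    sed -E 's/^(library|require)\(//; s/\)$//; s/,.*$//' |   # keep only the first argument
    tr -d "\"' " |                                           # drop quotes and spaces
    sort -u
}

# Demo on two throwaway scripts
demo_dir="$(mktemp -d)"
printf 'library(rgbif)\nlibrary("maps")\n' > "$demo_dir/a.R"
printf 'require(raster, quietly = TRUE)\nlibrary(rgbif)\n' > "$demo_dir/b.R"
used_packages "$demo_dir"
rm -rf "$demo_dir"
```

The demo prints maps, raster and rgbif, one per line. Calls that build the package name dynamically would not be picked up by this pattern.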

What packages do you have?

    R -e "cat(rownames(installed.packages()))"
    # use setdiff() to see what you're missing?
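
As a shell-side analogue of the setdiff() idea, the sketch below compares a list of wanted packages with a list of installed ones. It assumes both lists are plain text files with one package name per line; the function name missing_packages and the sample lists are hypothetical.

```shell
#!/bin/bash
# Print the packages that are wanted but not installed.
# $1: file with wanted packages, $2: file with installed packages (one name per line)
missing_packages() {
  grep -vxFf "$2" "$1" | sort -u
}

# The installed list could be produced with R, for example:
#   Rscript -e 'cat(rownames(installed.packages()), sep = "\n")' > installed.txt

# Demo with small sample lists
work_dir="$(mktemp -d)"
printf '%s\n' rgbif maps raster > "$work_dir/wanted.txt"
printf '%s\n' maps stats utils > "$work_dir/installed.txt"
missing_packages "$work_dir/wanted.txt" "$work_dir/installed.txt"
rm -rf "$work_dir"
```

For the sample lists this prints raster and rgbif, the two names that appear only in the wanted list.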

Using git, GitHub, GitFlow

The git tool allows decentralized asynchronous collaboration across individuals and teams by managing different versions of sets of files (often code).

This tool supports a workflow, increasingly standardized and known as "GitFlow", for merging so-called "Pull Requests" or "Merge Requests", which combine contributions from different individuals and allow changes on branches to evolve into new master versions.
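
A local sketch of the branch-and-merge idea behind this workflow is shown below. It is a simplification: on GitHub the merge would normally happen through a Pull Request rather than a local git merge, and the branch name "feature/add-map", the file names, and the identity are made up for the demo.

```shell
#!/bin/bash
# Sketch: develop a change on a feature branch and merge it back into master.
set -e
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Workshop User"         # identity for this demo repo only
git config user.email "user@example.org"

echo '# notes' > README.md
git add README.md
git commit -q -m "initial commit"

# work on a separate branch so master stays untouched meanwhile
git checkout -q -b feature/add-map
echo 'library(maps)' > map.R
git add map.R
git commit -q -m "add map script"

# merge the finished feature back; --no-ff records an explicit merge commit
git checkout -q master
git merge -q --no-ff -m "merge feature/add-map" feature/add-map
git log --oneline --graph
```

The final log shows the feature commit joined to master by a merge commit, which is roughly what a merged Pull Request leaves behind.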

Solo work or team work?

The git command and the GitFlow workflow are useful even for a single user working locally, because they make it possible to track changes and revert to earlier versions.
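
Tracking and reverting for a solo user could look roughly like this. It is a minimal sketch in a throwaway directory; the file name analysis.R and the identity are hypothetical.

```shell
#!/bin/bash
# Sketch: track a file locally with git and restore an earlier version of it.
set -e
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Workshop User"         # identity for this demo repo only
git config user.email "user@example.org"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "first version"

echo 'x <- 2' > analysis.R
git commit -q -am "second version"

git log --oneline                            # lists both commits
git checkout -q HEAD~1 -- analysis.R         # bring back the file from the previous commit
cat analysis.R
```

After the checkout, analysis.R again contains the first version (x <- 1), while both commits remain in the history.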

It is also really useful at scale, for example when collaborating with several colleagues. It lets you sync changes from remote repositories into your local repos by pulling from GitHub - a site where open source code can be stored and citable code can be published at no cost.

Resources:

Publish a web page from CLI with git and the "gh-pages" branch

If you use the CLI to push HTML files to a branch named "gh-pages", GitHub publishes them as a web page on the Internet.

    # install git (Debian/Ubuntu), then start a new git repo in a local directory
    sudo apt install git
    mkdir -p ~/repos/myrepo && cd ~/repos/myrepo
    git init
    
    # add content and push to remote repo
    git add index.Rmd
    git commit -m "initial commit"
    git remote add origin \
      git@github.com:GBIF-Europe/nordic_oikos_2018_r.git
    git push --set-upstream origin master

    # knit a .Rmd in RStudio and get a .html file
    git checkout -b gh-pages
    git add index.html
    git commit -m "add webpage"
    
    # publish by pushing HTML-file to gh-pages branch
    git push -u origin gh-pages
    
    # verify that the web page is there
    firefox https://gbif-europe.github.io/nordic_oikos_2018_r/s2_r_intro
    

Using the RStudio IDE UI toolbar with git and GitHub

Some tools, RStudio among them, provide UIs for working with local and remote repositories (for example at GitHub). These abstract away steps that can always also be done using git at the CLI.

To use git and GitHub from RStudio, use the Git icon in the RStudio toolbar. This first requires some initial setup in RStudio, which is well documented in many places online.

Learning Resources

Usage Examples

If time permits…

Rscript -> R package

The R package is the unit to collaborate on for functionality that may be useful to others (algorithms, web apps, data). R packages are published and shared through CRAN/MRAN or R-Forge - global networks of archive mirrors for R packages. You can also put an R package on GitHub (and someone else can install such a package with devtools::install_github()). The path from an Rscript to an R package involves:

Steps in RStudio

  1. In RStudio, create a new project for an R package and load the devtools package and follow guidelines in http://r-pkgs.had.co.nz/

  2. Put the functions from the Rscript in the "/R/" directory, document exported functions using roxygen2/devtools::document() (comments start with #'), and move the parts of the Rscript where you call the functions into tests under "/tests/" (set up with usethis::use_testthat() and usethis::use_test(); older devtools versions exposed these as devtools::use_testthat and devtools::use_test)

  3. The library statements go away; the dependencies are instead declared in the function documentation using #' @importFrom or #' @import comments, ending up in the NAMESPACE and DESCRIPTION files

  4. Add commentary in a vignette - an introduction.Rmd providing a short tutorial for the relevant workflow - showing how to use the functionality.

  5. Run "devtools::check()" to see if the package conforms to CRAN requirements and fix any complaints reported, then put the code on GitHub to collaborate with others on the package
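
To make the "/R/", "/tests/" and roxygen2 steps above more concrete, the sketch below writes out the on-disk layout of a tiny package by hand. It is not a complete, check-ready package: the name "mypkg" and the function first_rows() are hypothetical, and in practice RStudio and devtools generate most of these files (including NAMESPACE, which devtools::document() derives from the #' comments).

```shell
#!/bin/bash
# Sketch of the layout of a small R package; "mypkg" and first_rows() are hypothetical.
set -e
pkg_dir="$(mktemp -d)/mypkg"
mkdir -p "$pkg_dir/R" "$pkg_dir/tests/testthat" "$pkg_dir/vignettes"

# Package metadata; Imports lists the packages the code depends on
cat > "$pkg_dir/DESCRIPTION" <<'EOF'
Package: mypkg
Title: Example Package Built From A Script
Version: 0.0.1
Description: Functions moved from an R script into a package.
License: GPL-3
Imports: utils
Suggests: testthat
EOF

# A function documented with roxygen2 comments (#') instead of library() calls
cat > "$pkg_dir/R/first_rows.R" <<'EOF'
#' Return the first rows of a data frame
#'
#' @param x a data frame
#' @importFrom utils head
#' @export
first_rows <- function(x) head(x, 3)
EOF

# The calls that used to sit at the bottom of the script become tests
cat > "$pkg_dir/tests/testthat/test-first_rows.R" <<'EOF'
test_that("first_rows returns three rows", {
  expect_equal(nrow(first_rows(iris)), 3)
})
EOF

# Show the resulting files
find "$pkg_dir" -type f | sort
```

The vignettes directory is left empty here; an introduction.Rmd as described in the list would go there.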

An example R package with the earlier function can be found here: darwinator R package example @ GitHub