library(foreign) #for reading dbf files
library(tidyverse) #for data handling, pipes and visualisation
library(readxl) #for data import directly from Excel
library(janitor) #for unified, easy-to-handle format of variable names

Turboveg to R
The aim of this part of the tutorial is to show you step by step how to import the data from Turboveg 2 to R and prepare it for further analyses.
1.1 Turboveg data format
Turboveg for Windows is a program designed for the storage, selection, and export of vegetation plot data (relevés). The data are divided among several files that are matched by either species ID or relevé ID. You do not see this structure directly in the Turboveg interface, but you can find it in the Turbowin folder, subfolder data and particular database (see example below).

At some point, you need to export the data and process them further. This can be done in a specialised software called JUICE, but also directly in R.

To get Turboveg data to R, you first need to export the Turboveg database to a folder where you want to process the data. Or you can use the edit function and filter plots according to your selection and export only this selection as a new database. Alternatively, you can access the files directly in the data folder in Turbowin, but if you do something wrong here, you might completely lose your data.
1.2 Load libraries
For further data handling we will use the following libraries.
1.3 Import env file = header data
1.3.1 Import
One option is to check the exported database manually: open the file called tvhabita.dbf in Excel and save it as tvhabita.xlsx or tvhabita.csv (UTF-8 encoded) into your data folder. Although this includes one more step outside R, it is still rather straightforward and saves you trouble with different formats in Turboveg and in R (encoding issues). The only limiting factor is the size of file that Excel can handle.
You can then import the file from Excel

env <- read_excel("data/tvhabita.xlsx")

or from a csv file, which is a slightly more universal option, and we will use it for the import of most of the files.
env <- read_csv("data/tvhabita.csv")

Rows: 65 Columns: 55
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): COUNTRY, COVERSCALE, AUTHOR, MOSS_IDENT, LICH_IDENT, REMARKS, COOR...
dbl (33): RELEVE_NR, DATE, SURF_AREA, ALTITUDE, EXPOSITION, INCLINATIO, COV_...
lgl (13): REFERENCE, TABLE_NR, NR_IN_TAB, PROJECT, SYNTAXON, UTM, SYNOPTIC, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you check the imported names, they are rather difficult to handle.
names(env)

 [1] "RELEVE_NR" "COUNTRY" "REFERENCE" "TABLE_NR" "NR_IN_TAB"
[6] "COVERSCALE" "PROJECT" "AUTHOR" "DATE" "SYNTAXON"
[11] "SURF_AREA" "UTM" "ALTITUDE" "EXPOSITION" "INCLINATIO"
[16] "COV_TOTAL" "COV_TREES" "COV_SHRUBS" "COV_HERBS" "COV_MOSSES"
[21] "COV_LICHEN" "COV_ALGAE" "COV_LITTER" "COV_WATER" "COV_ROCK"
[26] "TREE_HIGH" "TREE_LOW" "SHRUB_HIGH" "SHRUB_LOW" "HERB_HIGH"
[31] "HERB_LOW" "HERB_MAX" "CRYPT_HIGH" "MOSS_IDENT" "LICH_IDENT"
[36] "REMARKS" "SYNOPTIC" "NR_OF_REL" "SYNTAXONEU" "SBSREFCODE"
[41] "LONGITUDE" "LATITUDE" "COORD_CODE" "SYNTAX_OLD" "LOCALITY"
[46] "BIAS_MIN" "BIAS_GPS" "CEBA_GRID" "FIELD_NR" "HABITAT"
[51] "GEOLOGY" "SOIL" "PH_H20" "PH_KCL" "SELECTION"
Therefore, we will directly change them to tidy names with the clean_names function from the janitor package. An alternative is to rename them one by one using e.g. rename, but here we want to save time and effort.
env <- read_csv("data/tvhabita.csv") %>% 
  clean_names()

Rows: 65 Columns: 55
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): COUNTRY, COVERSCALE, AUTHOR, MOSS_IDENT, LICH_IDENT, REMARKS, COOR...
dbl (33): RELEVE_NR, DATE, SURF_AREA, ALTITUDE, EXPOSITION, INCLINATIO, COV_...
lgl (13): REFERENCE, TABLE_NR, NR_IN_TAB, PROJECT, SYNTAXON, UTM, SYNOPTIC, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tibble(env)

# A tibble: 65 × 55
releve_nr country reference table_nr nr_in_tab coverscale project author
<dbl> <chr> <lgl> <lgl> <lgl> <chr> <lgl> <chr>
1 1 CZ NA NA NA 02 NA 0669
2 2 CZ NA NA NA 02 NA 0832
3 3 CZ NA NA NA 02 NA 0832
4 4 CZ NA NA NA 02 NA 0832
5 5 CZ NA NA NA 02 NA 0832
6 6 CZ NA NA NA 02 NA 0832
7 7 CZ NA NA NA 02 NA 0832
8 8 CZ NA NA NA 02 NA 0832
9 9 CZ NA NA NA 02 NA 0832
10 10 CZ NA NA NA 02 NA 0832
# ℹ 55 more rows
# ℹ 47 more variables: date <dbl>, syntaxon <lgl>, surf_area <dbl>, utm <lgl>,
# altitude <dbl>, exposition <dbl>, inclinatio <dbl>, cov_total <dbl>,
# cov_trees <dbl>, cov_shrubs <dbl>, cov_herbs <dbl>, cov_mosses <dbl>,
# cov_lichen <dbl>, cov_algae <dbl>, cov_litter <dbl>, cov_water <dbl>,
# cov_rock <dbl>, tree_high <dbl>, tree_low <dbl>, shrub_high <dbl>,
# shrub_low <dbl>, herb_high <dbl>, herb_low <dbl>, herb_max <dbl>, …
Note that the pipe %>% allows the output of the previous command to be used as input to the next command, instead of using nested functions. A pipe thus binds individual steps into a sequence that reads from left to right. You can insert it into your code with Ctrl+Shift+M.
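As a minimal illustration (a toy example, not part of the dataset workflow), the nested call and the piped chain produce the same result:

```r
library(dplyr) # provides the %>% pipe (re-exported from magrittr)

x <- c(1, 4, 9, 16)

# nested: read from the inside out
nested <- round(sqrt(sum(x)), 1)

# piped: read from left to right, one step per line
piped <- x %>% sum() %>% sqrt() %>% round(1)

identical(nested, piped) # TRUE, both are 5.5
```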
names(env)

 [1] "releve_nr" "country" "reference" "table_nr" "nr_in_tab"
[6] "coverscale" "project" "author" "date" "syntaxon"
[11] "surf_area" "utm" "altitude" "exposition" "inclinatio"
[16] "cov_total" "cov_trees" "cov_shrubs" "cov_herbs" "cov_mosses"
[21] "cov_lichen" "cov_algae" "cov_litter" "cov_water" "cov_rock"
[26] "tree_high" "tree_low" "shrub_high" "shrub_low" "herb_high"
[31] "herb_low" "herb_max" "crypt_high" "moss_ident" "lich_ident"
[36] "remarks" "synoptic" "nr_of_rel" "syntaxoneu" "sbsrefcode"
[41] "longitude" "latitude" "coord_code" "syntax_old" "locality"
[46] "bias_min" "bias_gps" "ceba_grid" "field_nr" "habitat"
[51] "geology" "soil" "ph_h20" "ph_kcl" "selection"
The pipe also enables us to see the output before saving the result. We want to select just a few variables to check, without overwriting the data until we are happy with the selection. For example, here we see that the habitat information is not filled in (it returns NAs), so I will not use it.
env %>%
  select(releve_nr, habitat, latitude, longitude)

# A tibble: 65 × 4
releve_nr habitat latitude longitude
<dbl> <lgl> <dbl> <dbl>
1 1 NA 491312. 164039.
2 2 NA 492015. 163420
3 3 NA 492129. 163346.
4 4 NA 492126. 162913.
5 5 NA 490210. 162138.
6 6 NA 490203. 162128.
7 7 NA 490158. 162117.
8 8 NA 490253. 162349.
9 9 NA 490730. 161735.
10 10 NA 490742. 161622.
# ℹ 55 more rows
When I am fine with the selection, I can overwrite the object.
env <- env %>%
select(releve_nr, coverscale,field_nr, country, author, date, syntaxon,
altitude, exposition, inclinatio,
cov_trees, cov_shrubs, cov_herbs, cov_mosses,
         latitude, longitude, bias_gps, locality)

Or I can add all the steps I did so far into one pipeline and check the resulting dataset.
env <- read_csv("data/tvhabita.csv")%>%
clean_names() %>%
select(releve_nr, coverscale, field_nr, country, author, date, syntaxon,
altitude, exposition, inclinatio,
cov_trees, cov_shrubs, cov_herbs, cov_mosses,
latitude, longitude, bias_gps, locality ) %>%
  glimpse()

Rows: 65 Columns: 55
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): COUNTRY, COVERSCALE, AUTHOR, MOSS_IDENT, LICH_IDENT, REMARKS, COOR...
dbl (33): RELEVE_NR, DATE, SURF_AREA, ALTITUDE, EXPOSITION, INCLINATIO, COV_...
lgl (13): REFERENCE, TABLE_NR, NR_IN_TAB, PROJECT, SYNTAXON, UTM, SYNOPTIC, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 65
Columns: 18
$ releve_nr <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 18, 28, 29, 31, …
$ coverscale <chr> "02", "02", "02", "02", "02", "02", "02", "02", "02", "02",…
$ field_nr <chr> "1/2007", "2/2007", "3/2007", "4/2007", "5/2007", "6/2007",…
$ country <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ",…
$ author <chr> "0669", "0832", "0832", "0832", "0832", "0832", "0832", "08…
$ date <dbl> 20070611, 20070702, 20070702, 20070702, 20070703, 20070703,…
$ syntaxon <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ altitude <dbl> 412, 458, 414, 379, 374, 380, 373, 390, 255, 340, 368, 280,…
$ exposition <dbl> 135, 150, 150, 210, NA, 170, 65, NA, 80, 220, 95, 340, 120,…
$ inclinatio <dbl> 4, 24, 13, 21, NA, 10, 6, NA, 38, 13, 29, 47, 33, 24, NA, N…
$ cov_trees <dbl> 80, 80, 80, 75, 70, 65, 65, 85, 80, 70, 85, 60, 75, 70, 70,…
$ cov_shrubs <dbl> 15, 0, 0, 0, 0, 1, 0, 0, 20, 0, 15, 20, 0, 0, 0, 0, 15, 0, …
$ cov_herbs <dbl> 20, 25, 25, 30, 35, 60, 70, 70, 15, 75, 8, 30, 60, 85, 55, …
$ cov_mosses <dbl> 1, 10, 8, 10, 8, 3, 5, 5, 10, 3, 5, 8, 20, 3, 8, 5, 5, 10, …
$ latitude <dbl> 491312.1, 492015.3, 492128.9, 492126.4, 490210.2, 490203.1,…
$ longitude <dbl> 164038.6, 163420.0, 163345.8, 162913.4, 162137.8, 162128.3,…
$ bias_gps <dbl> 7, 5, 6, 10, 6, 5, 6, 14, 6, 4, 5, 11, 9, 9, 6, 6, 0, 6, 11…
$ locality <chr> "Brno, Maloměřice, Hády; 1,7 km od středu městské části", "…
And save it for easier access. Always keep releve_nr and coverscale, as you will need them later.
write_excel_csv(env, "data/env.csv")

An alternative option is to import the file exported from the Turboveg database, named tvhabita.dbf, directly. Since dbf is a specific file type, we need to use a specialised package. Here I use the read.dbf function from the foreign package.
env_dbf <- read.dbf("data/tvhabita.dbf", as.is = F) %>% 
  clean_names()

Check the structure directly in R

view(env_dbf)

Get the list of the variable names
names(env_dbf)

 [1] "releve_nr" "country" "reference" "table_nr" "nr_in_tab"
[6] "coverscale" "project" "author" "date" "syntaxon"
[11] "surf_area" "utm" "altitude" "exposition" "inclinatio"
[16] "cov_total" "cov_trees" "cov_shrubs" "cov_herbs" "cov_mosses"
[21] "cov_lichen" "cov_algae" "cov_litter" "cov_water" "cov_rock"
[26] "tree_high" "tree_low" "shrub_high" "shrub_low" "herb_high"
[31] "herb_low" "herb_max" "crypt_high" "moss_ident" "lich_ident"
[36] "remarks" "synoptic" "nr_of_rel" "syntaxoneu" "sbsrefcode"
[41] "longitude" "latitude" "coord_code" "syntax_old" "locality"
[46] "bias_min" "bias_gps" "ceba_grid" "field_nr" "habitat"
[51] "geology" "soil" "ph_h20" "ph_kcl" "selection"
There are several issues with this type of import, as the original files may use encodings that are not compatible with R. For the Czech dataset, I needed to further change the encoding so that the diacritics in text columns are translated correctly.
First we need to select the columns that include text and can have issues with diacritics and special symbols, e.g. remarks, locality… I specify them in the brackets and use the function iconv to change the encoding to UTF-8. You may need to experiment a bit to see if it works correctly and change the original type in the from argument. I added one more line to transform the data frame to a tibble, which is the data format used in tidyverse packages.
env_dbf %>%
mutate(across(c(remarks, locality, habitat, soil),
~ iconv(.x, from = "cp852", to = "UTF-8"))) %>%
select(locality) %>%
  as_tibble()

# A tibble: 65 × 1
locality
<chr>
1 Brno, Maloměřice, Hády; 1,7 km od středu městské části
2 Svinošice (Brno, Kuřim); 450 m SZ od středu obce
3 Lažany (Brno, Kuřim); 1 km VSV od středu obce
4 Všechovice (Tišnov); 500 m Z od středu obce
5 Vedrovice (Moravský Krumlov), Krumlovský les; 2,2 km SZ od středu obce
6 Vedrovice (Moravský Krumlov), Krumlovský les; 2,1 km SZ od středu obce
7 Vedrovice (Moravský Krumlov), Krumlovský les; 2,2 km SZ od středu obce
8 Vedrovice (Moravský Krumlov), Krumlovský les; 3,3 km SSV od středu obce
9 Čučice (Oslavany), Přírodní park Oslava; 1,7 km JV od středu obce
10 Čučice (Oslavany), Přírodní park Oslava; 1 km JJZ od středu obce
# ℹ 55 more rows
In the next step we select the variables we want to keep, which is useful because the database structure also includes predefined variables, even if they are empty. This is another advantage of the select function.
env <- env_dbf %>%
mutate(across(c(remarks, locality, habitat, soil),
~ iconv(.x, from = "cp852", to = "UTF-8"))) %>%
select(releve_nr, coverscale,field_nr, country, author, date, syntaxon,
altitude, exposition, inclinatio,
cov_trees, cov_shrubs, cov_herbs, cov_mosses,
latitude, longitude, bias_gps, locality )%>%
  as_tibble()

1.3.2 Coordinates
In Turboveg 2 databases, the geographical coordinates are stored in the format DDMMSS.SS. To show the plots on a map or perform any spatial calculations, you will need to transform the coordinates to decimal degrees.

That is the purpose of the following function, where I specify which parts of the string are degrees (positions 1-2), minutes (3-4) and seconds (5-11) and how to transform them.
coord_to_degrees <- function(coord){
  as.numeric(str_sub(coord, 1, 2)) + as.numeric(str_sub(coord, 3, 4)) / 60 + as.numeric(str_sub(coord, 5, 11)) / 3600
}

Below I will check if it works as I need. Please note that here we apply it to a central European dataset; adjustments are needed in other regions, e.g. if the coordinates are higher than 99° ~ DDDMMSS.SS
env %>%
mutate(deg_lat = coord_to_degrees(latitude),
deg_lon = coord_to_degrees(longitude)) %>%
  select(releve_nr, latitude, deg_lat, longitude, deg_lon)

# A tibble: 65 × 5
releve_nr latitude deg_lat longitude deg_lon
<int> <dbl> <dbl> <dbl> <dbl>
1 1 491312. 49.2 164039. 16.7
2 2 492015. 49.3 163420 16.6
3 3 492129. 49.4 163346. 16.6
4 4 492126. 49.4 162913. 16.5
5 5 490210. 49.0 162138. 16.4
6 6 490203. 49.0 162128. 16.4
7 7 490158. 49.0 162117. 16.4
8 8 490253. 49.0 162349. 16.4
9 9 490730. 49.1 161735. 16.3
10 10 490742. 49.1 161622. 16.3
# ℹ 55 more rows
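As a quick sanity check on a single value, we can apply the same coord_to_degrees function (repeated here so the snippet is self-contained) to one coordinate string:

```r
library(stringr) # for str_sub

# same function as defined above
coord_to_degrees <- function(coord){
  as.numeric(str_sub(coord, 1, 2)) +
    as.numeric(str_sub(coord, 3, 4)) / 60 +
    as.numeric(str_sub(coord, 5, 11)) / 3600
}

# 49°13'12.1" stored as DDMMSS.SS:
coord_to_degrees("491312.1") # 49 + 13/60 + 12.1/3600 = 49.22003
```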
If I know I will need coordinates for mapping, I will add the respective lines directly to my pipeline
env <- read_csv("data/tvhabita.csv")%>%
clean_names() %>%
mutate(deg_lat = coord_to_degrees(latitude),
deg_lon = coord_to_degrees(longitude)) %>%
select(releve_nr, coverscale, field_nr, country, author, date, syntaxon,
altitude, exposition, inclinatio,
cov_trees, cov_shrubs, cov_herbs, cov_mosses,
deg_lat, deg_lon, bias_gps, locality) %>%
  glimpse()

Rows: 65 Columns: 55
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): COUNTRY, COVERSCALE, AUTHOR, MOSS_IDENT, LICH_IDENT, REMARKS, COOR...
dbl (33): RELEVE_NR, DATE, SURF_AREA, ALTITUDE, EXPOSITION, INCLINATIO, COV_...
lgl (13): REFERENCE, TABLE_NR, NR_IN_TAB, PROJECT, SYNTAXON, UTM, SYNOPTIC, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 65
Columns: 18
$ releve_nr <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 18, 28, 29, 31, …
$ coverscale <chr> "02", "02", "02", "02", "02", "02", "02", "02", "02", "02",…
$ field_nr <chr> "1/2007", "2/2007", "3/2007", "4/2007", "5/2007", "6/2007",…
$ country <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ",…
$ author <chr> "0669", "0832", "0832", "0832", "0832", "0832", "0832", "08…
$ date <dbl> 20070611, 20070702, 20070702, 20070702, 20070703, 20070703,…
$ syntaxon <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ altitude <dbl> 412, 458, 414, 379, 374, 380, 373, 390, 255, 340, 368, 280,…
$ exposition <dbl> 135, 150, 150, 210, NA, 170, 65, NA, 80, 220, 95, 340, 120,…
$ inclinatio <dbl> 4, 24, 13, 21, NA, 10, 6, NA, 38, 13, 29, 47, 33, 24, NA, N…
$ cov_trees <dbl> 80, 80, 80, 75, 70, 65, 65, 85, 80, 70, 85, 60, 75, 70, 70,…
$ cov_shrubs <dbl> 15, 0, 0, 0, 0, 1, 0, 0, 20, 0, 15, 20, 0, 0, 0, 0, 15, 0, …
$ cov_herbs <dbl> 20, 25, 25, 30, 35, 60, 70, 70, 15, 75, 8, 30, 60, 85, 55, …
$ cov_mosses <dbl> 1, 10, 8, 10, 8, 3, 5, 5, 10, 3, 5, 8, 20, 3, 8, 5, 5, 10, …
$ deg_lat <dbl> 49.22003, 49.33758, 49.35803, 49.35733, 49.03617, 49.03419,…
$ deg_lon <dbl> 16.67739, 16.57222, 16.56272, 16.48706, 16.36050, 16.35786,…
$ bias_gps <dbl> 7, 5, 6, 10, 6, 5, 6, 14, 6, 4, 5, 11, 9, 9, 6, 6, 0, 6, 11…
$ locality <chr> "Brno, Maloměřice, Hády; 1,7 km od středu městské části", "…
and save it for later, for example with write_csv, or with write_excel_csv, which keeps the UTF-8 encoding without any further specification (useful for Czech diacritics).
write_excel_csv(env, "data/env.csv")

1.4 Import spe file = species file in a long format
The first option is again to open tvabund.dbf in Excel, save it as tvabund.csv and import it into our R environment. Again, I will use the clean_names function during import, so that we have the same style of variable names.
tvabund <- read_csv("data/tvabund.csv") %>%
  clean_names()

Rows: 2278 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COVER_CODE
dbl (3): RELEVE_NR, SPECIES_NR, LAYER
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
An alternative option is to read the data directly from the dbf file. In this case, it is less complicated than the import of the env file, as there are no problematic text variables. This becomes very important for large files with many rows, because opening an oversized dbf file in Excel can silently delete the rows above Excel's row limit.
tvabund <- read.dbf("data/tvabund.dbf", as.is = F) %>%
  clean_names()

glimpse(tvabund)

Rows: 2,278
Columns: 4
$ releve_nr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ species_nr <dbl> 80, 80, 83, 651, 1506, 2206, 2246, 2248, 2315, 2455, 3067, …
$ cover_code <chr> "2b", "+", "r", "+", "r", "1", "r", "+", "+", "r", "2m", "2…
$ layer <dbl> 1, 7, 7, 6, 6, 6, 6, 6, 7, 6, 6, 4, 7, 4, 7, 7, 6, 7, 6, 6,…
Now we check the data again and see that there are no species names, just numbers. Also, the cover is given in the original codes and not in percentages. See the scheme below to understand where each piece of information is stored.

We therefore have to prepare the different files we need, import them and merge them.
1.4.1 Nomenclature
In the abund file, species numbers refer to the codes in the checklist used in the Turboveg database. To translate them into species names you will need a translation table with the original number in the database, the original name in the database and the name you want to use in the analyses.

Preparation of the translation table: To show you how to prepare such a table, I opened the checklist file, called species.dbf, from the folder Turbowin/species/CzechiaSlovakia2015 and saved it in the data folder as species.csv. Using the following pipeline, you can prepare your own translation table and add other names or information.
nomenclature_raw <- read_csv("data/species.csv") %>%
clean_names() %>%
left_join(select(., species_nr, accepted_abbreviat = abbreviat),
by = c("valid_nr" = "species_nr")) %>%
mutate(accepted_name = if_else(synonym, accepted_abbreviat, abbreviat)) %>%
  select(-accepted_abbreviat)

Rows: 13863 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): LETTERCODE, SHORTNAME, ABBREVIAT
dbl (2): SPECIES_NR, VALID_NR
lgl (1): SYNONYM
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
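To make the synonym-resolving self-join above concrete, here is the same pipeline applied to a hypothetical three-row checklist (the names and numbers are made up for illustration; the real species.csv has many more columns):

```r
library(dplyr)

# toy checklist: valid_nr points to the accepted taxon's species_nr
species_toy <- tibble(
  species_nr = c(1, 2, 3),
  abbreviat  = c("Abies alba", "Abies pectinata", "Picea abies"),
  synonym    = c(FALSE, TRUE, FALSE),
  valid_nr   = c(1, 1, 3)
)

# join the table to itself to look up the accepted name for synonyms
species_toy %>%
  left_join(select(., species_nr, accepted_abbreviat = abbreviat),
            by = c("valid_nr" = "species_nr")) %>%
  mutate(accepted_name = if_else(synonym, accepted_abbreviat, abbreviat)) %>%
  select(-accepted_abbreviat)
# the synonym "Abies pectinata" now carries accepted_name "Abies alba"
```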
We will now import the nomenclature file that is already adapted for the Czech flora.
nomenclature <- read_csv("data/table_nomenclature.csv") %>%
  clean_names()

Rows: 13873 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): TURBOVEG_NAME, KAPLAN_PLUS_NAME, KAPLAN_PLUS_NAME_SIMPLE, AUTHORSHI...
dbl (1): SPECIES_NR
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tibble(nomenclature)

# A tibble: 13,873 × 8
species_nr turboveg_name kaplan_plus_name kaplan_plus_name_sim…¹ authorship
<dbl> <chr> <chr> <chr> <chr>
1 1 Abies alba Abies alba Abies alba Mill.
2 2 Abies species Abies Abies Mill.
3 3 Abietinella ab… Abietinella abi… Abietinella abietina (Hedw.) M…
4 4 Abietinella ab… Abietinella abi… Abietinella abietina … (Hedw.) M…
5 5 Abietinella ab… Abietinella abi… Abietinella abietina … (Mitt.) S…
6 6 Abietinella sp… Abietinella Abietinella Müll.Hal.
7 7 Abrothallus be… Abrothallus ber… Abrothallus bertianus (Sommerf.…
8 8 Abrothallus ca… Abrothallus cae… Abrothallus caerulesc… I.Kotte
9 9 Abrothallus ce… Abrothallus cet… Abrothallus cetrariae I.Kotte
10 10 Abrothallus mi… Abrothallus mic… Abrothallus microsper… Tul.
# ℹ 13,863 more rows
# ℹ abbreviated name: ¹kaplan_plus_name_simple
# ℹ 3 more variables: czechveg_esy_name <chr>, czechveg_esy_name_subsp <chr>,
# plant_group <chr>
There are several advantages to this approach. First, you can adjust the nomenclature to the newest source/regional checklist. In our example, the name in Turboveg is translated to the nomenclature presented in the recent edition of the Key to the Flora of the Czech Republic, named after its main editor Kaplan.
Second, I can add a concept that groups several taxa into higher units; e.g. taxa that are not easy to recognise in the field are assigned to aggregates. This is exactly the same approach you use when creating an expert system file, but here it is even easier to understand and much easier to change the translation when you need to fix something. The corresponding name in this file is called ESy.
Last but not least, I can directly add much more information to such a table, for example invasion status, growth form or anything else. Here we have an indication of whether the species is nonvascular.
I might want to check how the species are translated using my translation file and select just the matching rows. One option is to create a variable called selection indicating whether the species is in the subset.
nomenclature_check <- nomenclature %>% 
  left_join(tvabund %>% 
              distinct(species_nr) %>% 
              mutate(selection = 1))

Joining with `by = join_by(species_nr)`
or I can even add the frequency, i.e. how many times each species appears in the records of the dataset.
nomenclature_check <- nomenclature %>% 
  left_join(tvabund %>% 
              count(species_nr))

Joining with `by = join_by(species_nr)`
I can then write the file, make adjustments e.g. in Excel and upload it again. The great thing is that I have an indication of which species are in the dataset, so I do not have to pay attention to the other rows. Note that if you prefer names with ë or hybrid marks, it is better to save the file with UTF-8 encoding, which is automatically used by the write_excel_csv function.
write_csv(nomenclature_check, "data/nomenclature_check.csv")

Then upload the new, adjusted file.
nomenclature <- read_csv("data/nomenclature_check.csv") %>%
  clean_names()

1.4.2 Cover
We have a translation table for nomenclature, but we still need to translate cover codes to percentages. For this we need information about the cover scale (stored in the header data / tvhabita / env file) and information on how to translate the values of that particular scale to percentages. The file used here was prepared based on the translation of cover values in different scales to percentages following the EVA database approach. One more column was added to enable different adjustments, for example changing the values for rare species. For any project, we suggest opening the file and checking whether the scales you are using are there and whether you agree with the translation.
cover <- read_csv("data/table_cover.csv") %>% clean_names()

Rows: 333 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): COVERSCALE, COVER_SCALE_NAME, COVER_CODE
dbl (2): COVER_PERC_EVA, COVER_PERC
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tibble(cover)

# A tibble: 333 × 5
coverscale cover_scale_name cover_code cover_perc_eva cover_perc
<chr> <chr> <chr> <dbl> <dbl>
1 00 Percentage % 0.1 0.1 0.1
2 00 Percentage % 0.1 0.1 0.1
3 00 Percentage % 0.2 0.2 0.2
4 00 Percentage % 0.3 0.3 0.3
5 00 Percentage % 0.4 0.4 0.4
6 00 Percentage % 0.5 0.5 0.5
7 00 Percentage % 0.5 0.5 0.5
8 00 Percentage % 0.6 0.6 0.6
9 00 Percentage % 0.7 0.7 0.7
10 00 Percentage % 0.8 0.8 0.8
# ℹ 323 more rows
Here I can check the different scale names included in the file
cover %>% distinct(cover_scale_name)

# A tibble: 16 × 1
cover_scale_name
<chr>
1 Percentage %
2 Braun/Blanquet (old)
3 Braun/Blanquet (new)
4 Londo
5 Presence/Absence
6 Ordinale scale (1-9)
7 Barkman, Doing & Segal
8 Doing
9 Domin
10 Sedlakova
11 Zlatník
12 Hedl
13 Percentual scale
14 Kučera
15 Domin (uprava Hadac)
16 Percentual (r, +)
And I can also filter the rows of specific scales, e.g. here I am looking for all those that start with the pattern “Braun”.
cover %>%
filter(str_starts(cover_scale_name, "Braun")) %>%
  print(n = 20)

# A tibble: 17 × 5
coverscale cover_scale_name cover_code cover_perc_eva cover_perc
<chr> <chr> <chr> <dbl> <dbl>
1 01 Braun/Blanquet (old) 1 3 3
2 01 Braun/Blanquet (old) 2 13 13
3 01 Braun/Blanquet (old) 3 38 38
4 01 Braun/Blanquet (old) 4 63 63
5 01 Braun/Blanquet (old) 5 88 88
6 01 Braun/Blanquet (old) + 2 2
7 01 Braun/Blanquet (old) r 1 1
8 02 Braun/Blanquet (new) 1 3 3
9 02 Braun/Blanquet (new) 2 13 13
10 02 Braun/Blanquet (new) 3 38 37.5
11 02 Braun/Blanquet (new) 4 63 62.5
12 02 Braun/Blanquet (new) 5 88 87.5
13 02 Braun/Blanquet (new) + 2 0.5
14 02 Braun/Blanquet (new) 2a 8 10
15 02 Braun/Blanquet (new) 2b 18 20
16 02 Braun/Blanquet (new) 2m 4 4
17 02 Braun/Blanquet (new) r 1 0.1
1.4.3 Merging all files into a complete spe file
Finally, I have the translations for nomenclature and cover, so I need to put everything together. I append the nomenclature file, from which I select only the relevant variables within the nested pipe. The same applies to joining the env and cover files. See more on joins here.
tvabund %>%
left_join(nomenclature %>%
select(species_nr, turboveg_name, kaplan_plus_name_simple, czechveg_esy_name_subsp, plant_group)) %>%
left_join(env %>% select(releve_nr, coverscale)) %>%
  left_join(cover %>% select(coverscale, cover_code, cover_perc))

Joining with `by = join_by(species_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(cover_code, coverscale)`
# A tibble: 2,278 × 10
releve_nr species_nr cover_code layer turboveg_name kaplan_plus_name_sim…¹
<dbl> <dbl> <chr> <dbl> <chr> <chr>
1 1 80 2b 1 Acer campestre Acer campestre
2 1 80 + 7 Acer campestre Acer campestre
3 1 83 r 7 Acer platanoides Acer platanoides
4 1 651 + 6 Anemone species Anemone
5 1 1506 r 6 Bromus benekenii Bromus benekenii
6 1 2206 1 6 Carex digitata Carex digitata var. d…
7 1 2246 r 6 Carex michelii Carex michelii
8 1 2248 + 6 Carex montana Carex montana
9 1 2315 + 7 Carpinus betulus Carpinus betulus
10 1 2455 r 6 Cephalanthera d… Cephalanthera damason…
# ℹ 2,268 more rows
# ℹ abbreviated name: ¹kaplan_plus_name_simple
# ℹ 4 more variables: czechveg_esy_name_subsp <chr>, plant_group <chr>,
# coverscale <chr>, cover_perc <dbl>
The output contains these variables
"releve_nr" "species_nr" "cover_code" "layer" "turboveg_name" "kaplan_plus_name_simple" "czechveg_esy_name_subsp" "plant_group" "coverscale" "cover_perc"
If I am satisfied with the result, I assign the pipeline to the spe object and add one more line to select just the needed variables. Here, I decided to use the expert system name (czechveg_esy_name_subsp), and I renamed it directly in the select function.
spe<- tvabund %>%
left_join(nomenclature %>%
select(species_nr,czechveg_esy_name_subsp, plant_group)) %>%
left_join(env %>% select(releve_nr, coverscale)) %>%
left_join(cover %>% select(coverscale,cover_code,cover_perc)) %>%
  # filter(nonvascular != 1) %>% # optional to remove nonvasculars
  select(releve_nr, species = czechveg_esy_name_subsp, plant_group, layer, cover_perc)

Joining with `by = join_by(species_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(cover_code, coverscale)`
To see the result we will use view
view(spe)

We can again save the final file to be easily accessible later. The function write_excel_csv is useful if your file contains specific symbols like ë, as it uses UTF-8 encoding by default.

write_excel_csv(spe, "data/spe.csv")

1.5 Merging of species covers across layers
1.5.1 Duplicate species records
What is the problem? Sometimes some species names are listed more than once in the same plot, either because we changed the original concept (from subspecies to species level, or after additional identification) or because we recorded the same species in different layers. Depending on our further questions and analyses, this might be a minor or a bigger problem.

A> In the first case, a duplicate within one layer, I can fix the problem by summing the values for the same species in the same layer to get distinct species-layer combinations per plot. This is something we need to do; otherwise our data would go against the tidy approach and we would experience issues in joins, summarisation, etc.
B> In the other case, a duplicate across layers, the data are OK, because there is one more variable that makes each record unique (layer). But if we want to look at the whole community and e.g. calculate the share of some life forms weighted by cover, we again need to sum the values across layers and treat all the species as if they were in the same layer (this is then usually marked as 0).
1.5.2 Duplicates checking
We recommend always doing the following check of the data. Simply group species by releves/plots and count whether some species appear on more than one row.
spe %>%
group_by(releve_nr, species, layer) %>%
count() %>%
  filter(n>1)

# A tibble: 1 × 4
# Groups: releve_nr, species, layer [1]
releve_nr species layer n
<dbl> <chr> <dbl> <int>
1 132 Galium palustre agg. 6 2
We can see that in releve 132 there is a conflict in the species Galium palustre agg. Most probably we separated two species in the field that we later decided to group into this aggregate. We can go back and check where exactly this happened by specifying exactly where to look.
tvabund %>%
select(releve_nr, species_nr)%>%
left_join(nomenclature %>%
select(species_nr, turboveg_name, species=czechveg_esy_name_subsp)) %>%
  filter(releve_nr %in% c(132, 183182) & species == "Galium palustre agg.")

Joining with `by = join_by(species_nr)`
# A tibble: 2 × 4
releve_nr species_nr turboveg_name species
<dbl> <dbl> <chr> <chr>
1 132 4558 Galium elongatum Galium palustre agg.
2 132 4570 Galium palustre Galium palustre agg.
Alternatively, I can save the first output and use the semi_join function, which is very useful if there are more rows I want to check and I do not want to specify multiple conditions in the filter.
test<-spe %>%
group_by(releve_nr, species, layer) %>%
count() %>%
filter(n>1)
tvabund %>%
select(releve_nr, species_nr)%>%
left_join(nomenclature %>%
select(species_nr, turboveg_name, species=czechveg_esy_name_subsp)) %>%
  semi_join(test)

Joining with `by = join_by(species_nr)`
Joining with `by = join_by(releve_nr, species)`
# A tibble: 2 × 4
releve_nr species_nr turboveg_name species
<dbl> <dbl> <chr> <chr>
1 132 4558 Galium elongatum Galium palustre agg.
2 132 4570 Galium palustre Galium palustre agg.
OK, I understand why it happened and I will have to fix it, but first we continue checking. Now we will check whether there is also a problem with species across layers (B). I simply change the grouping condition to the higher hierarchical level.
spe %>%
group_by(releve_nr, species) %>%
count() %>%
  filter(n>1)

# A tibble: 187 × 3
# Groups: releve_nr, species [187]
releve_nr species n
<dbl> <chr> <int>
1 1 Acer campestre 2
2 1 Quercus petraea agg. 2
3 2 Quercus petraea agg. 2
4 3 Quercus petraea agg. 2
5 4 Carpinus betulus 2
6 4 Quercus petraea agg. 2
7 5 Quercus petraea agg. 2
8 6 Ligustrum vulgare 2
9 6 Quercus petraea agg. 2
10 6 Rosa 2
# ℹ 177 more rows
We got a lot of duplicates, right? But this is understandable in the vegetation type we have, so keep it in mind for later analyses.
Sometimes it is good to take some extra time and just look at what is inside. Are these just trees recorded also as shrubs and juveniles, or are there some herbs included in the tree layer by mistake? Add %>% view() to see the whole list.
spe %>%
distinct(species,layer)%>%
group_by(species) %>%
count() %>%
filter(n>1)
# A tibble: 36 × 2
# Groups: species [36]
species n
<chr> <int>
1 Acer campestre 3
2 Acer platanoides 3
3 Acer pseudoplatanus 3
4 Aesculus hippocastanum 2
5 Alnus glutinosa 3
6 Alnus incana 3
7 Carpinus betulus 3
8 Cornus mas 2
9 Cornus sanguinea 2
10 Corylus avellana 3
# ℹ 26 more rows
1.5.3 Fixing duplicate rows
Now, finally, the fixing. For some questions, the easiest way to resolve duplicate rows is to select only the relevant variables and groups and use the distinct() function. For species richness, for example, this would be enough. BUT we would lose information about abundance, in our case the percentage cover of each species.
spe %>%
distinct(releve_nr, species, layer)
# A tibble: 2,277 × 3
releve_nr species layer
<dbl> <chr> <dbl>
1 1 Acer campestre 1
2 1 Acer campestre 7
3 1 Acer platanoides 7
4 1 Anemone 6
5 1 Bromus ramosus agg. 6
6 1 Carex digitata 6
7 1 Carex michelii 6
8 1 Carex montana 6
9 1 Carpinus betulus 7
10 1 Cephalanthera damasonium 6
# ℹ 2,267 more rows
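As a sketch of the species-richness use case mentioned above (the toy tibble is an assumption, not our dataset): after distinct() removes the layer duplicates, counting rows per plot gives richness.

```r
library(dplyr)

# toy table: Acer campestre recorded in two layers of plot 1
spe_toy <- tibble(
  releve_nr = c(1, 1, 1, 2),
  species   = c("Acer campestre", "Acer campestre", "Carex montana", "Carex montana"),
  layer     = c(1, 7, 6, 6)
)

spe_toy %>%
  distinct(releve_nr, species) %>%        # one row per species per plot, layers ignored
  count(releve_nr, name = "richness")     # plot 1: 2 species, plot 2: 1 species
```

Note that the duplicate record of Acer campestre no longer inflates the richness of plot 1, but its two cover values are gone.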
The percentage cover is estimated visually relative to the total plot area. In the field it is estimated independently of other plants, because we know that plants overlap in vertical space. If we use the ordinary sum() function, we can therefore easily get a total cover per plot above 100 %. Although we can separate the information into vegetation layers, this is still a rather coarse division, especially in grasslands, where most of the diversity is concentrated in a single, often very dense, layer.
Therefore we will use the approach suggested by H.S. Fischer in the paper On the combination of species from different vegetation layers (AVS 2015), where he suggested summing up covers while accounting for the overlap among species, so that the overall maximum value is 100 and all values are adjusted relative to this threshold. We will prepare a function called combine_cover.
combine_cover <- function(x){
while (length(x)>1){
x[2] <- x[1]+(100-x[1])*x[2]/100
x <- x[-1]
}
return(x)
}
A) Now let's check how it works. We will first fix the issue with duplicates within the same layer (A).
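To see the arithmetic, we can run the function on a few toy cover vectors (the values are illustrative, not from our data). At each step, the next species can only add cover within the space not already occupied, so the result never exceeds 100.

```r
# same function as defined above, repeated here so the example is self-contained
combine_cover <- function(x){
  while (length(x) > 1){
    x[2] <- x[1] + (100 - x[1]) * x[2] / 100  # second cover fills only the free space
    x <- x[-1]
  }
  return(x)
}

combine_cover(c(50, 50))      # 50 + 50*50/100 = 75, not 100
combine_cover(c(20, 30, 10))  # 20 + 80*0.30 = 44; then 44 + 56*0.10 = 49.6
combine_cover(100)            # a single full cover stays at 100
```

A plain sum of the first vector would give 100 and of the second 60, so the combined values are always smaller whenever more than one species is present.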
spe %>%
group_by(releve_nr, species, layer) %>%
summarise(cover_perc_new = combine_cover(cover_perc))
`summarise()` has grouped output by 'releve_nr', 'species'. You can override
using the `.groups` argument.
# A tibble: 2,277 × 4
# Groups: releve_nr, species [2,061]
releve_nr species layer cover_perc_new
<dbl> <chr> <dbl> <dbl>
1 1 Acer campestre 1 20
2 1 Acer campestre 7 0.5
3 1 Acer platanoides 7 0.1
4 1 Anemone 6 0.5
5 1 Bromus ramosus agg. 6 0.1
6 1 Carex digitata 6 3
7 1 Carex michelii 6 0.1
8 1 Carex montana 6 0.5
9 1 Carpinus betulus 7 0.5
10 1 Cephalanthera damasonium 6 0.1
# ℹ 2,267 more rows
and we will add the pipeline for checking whether there are still some duplicate rows. Note that summarise() changed the grouping (it dropped the last grouping level), so I have to specify the variables again in count() (or add group_by() before count() again).
spe %>%
group_by(releve_nr, species, layer) %>%
summarize(cover_perc_new = combine_cover(cover_perc))%>%
count(releve_nr, species, layer) %>%
filter(n>1)
# A tibble: 0 × 4
# Groups: releve_nr, species [0]
# ℹ 4 variables: releve_nr <dbl>, species <chr>, layer <dbl>, n <int>
When the output has no rows, it means our attempt solved the issue.
If I am happy with the result, I overwrite cover_perc directly and save the output for easier access (next time you can start by reloading this file), or I assign the whole pipeline to a new object, e.g. ->spe_merged.
spe %>%
group_by(releve_nr, species, layer) %>%
summarize(cover_perc = combine_cover(cover_perc))%>%
write_csv("data/spe_merged_covers.csv")
`summarise()` has grouped output by 'releve_nr', 'species'. You can override
using the `.groups` argument.
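The "has grouped output" message is harmless, but it can be silenced by setting the .groups argument of summarise() explicitly. A minimal sketch on an invented tibble (using sum() as a stand-in aggregator):

```r
library(dplyr)

toy <- tibble(
  releve_nr  = c(1, 1, 2),
  species    = c("a", "a", "b"),
  layer      = c(1, 1, 6),
  cover_perc = c(20, 30, 10)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
toy %>%
  group_by(releve_nr, species, layer) %>%
  summarise(cover_perc = sum(cover_perc), .groups = "drop")
```

With .groups = "drop" the result is fully ungrouped, so any count() or summarise() that follows starts from a clean slate.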
B) We also want to remove the information about layers and work at the whole-community level. This means we will do the same, just without adding layer to the grouping, as we no longer want to pay attention to it.
spe %>%
group_by(releve_nr, species) %>%
summarize(cover_perc = combine_cover(cover_perc))%>%
write_csv("data/spe_merged_covers_across_layers.csv")
1.5.4 Total cover of all species in the plot
The same approach we used for merging covers can also be used for calculating the total cover in a plot. Here you can see a comparison of the total cover calculated as an ordinary sum and the total cover calculated accounting for overlaps.
spe %>%
group_by(releve_nr) %>%
summarize(covertotal_sum = sum(cover_perc),
covertotal_overlap = combine_cover(cover_perc)) %>%
select(releve_nr, covertotal_sum, covertotal_overlap)%>%
arrange(desc(covertotal_sum))
# A tibble: 65 × 3
releve_nr covertotal_sum covertotal_overlap
<dbl> <dbl> <dbl>
1 125 232. 93.3
2 131 223. 92.4
3 130 201. 92.4
4 129 192. 89.4
5 113 192. 90.4
6 127 190. 90.2
7 99 189 91.3
8 128 184. 91.0
9 123 182. 87.4
10 86 176. 95.3
# ℹ 55 more rows
The same, this time with respect to layers:
spe %>%
group_by(releve_nr, layer) %>%
summarize(covertotal_sum = sum(cover_perc),
covertotal_overlap = combine_cover(cover_perc)) %>%
select(releve_nr, layer, covertotal_sum, covertotal_overlap)
`summarise()` has grouped output by 'releve_nr'. You can override using the
`.groups` argument.
# A tibble: 237 × 4
# Groups: releve_nr [65]
releve_nr layer covertotal_sum covertotal_overlap
<dbl> <dbl> <dbl> <dbl>
1 1 1 82.5 70
2 1 4 13.5 13.1
3 1 6 22.2 20.1
4 1 7 10 9.64
5 2 1 62.5 62.5
6 2 6 31.5 28.8
7 2 7 13.1 12.8
8 3 1 62.5 62.5
9 3 6 23.8 21.5
10 3 7 4 4
# ℹ 227 more rows