Basic data analyses

Author

Irena Axmanová

In this tutorial, we will show you how to sort and filter the data. We will also train how to append traits, indicator values or information about the status of species and calculate proportions of selected plant groups, community weighted and unweighted means or other summary statistics. If you are interested in a more general overview of the functions, you are encouraged to check also our study materials here.

library(tidyverse)
library(janitor)

2.1 Data import

We will play a bit with measured variables from our plots and try few useful functions. For more see study materials and links there. We will first read the data newly, to be sure we know what we are working with. At this point we can actually clean the environment, remove everything.

2.1.1 env dataset

First I want to add some more variables that were measured separately.

env_extra<- read_csv("data/env_measured.csv")
Rows: 65 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): forest_type_name
dbl (21): releve_nr, forest_type, herbs, juveniles, coverE1, biomass, soil_d...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For the env dataset, I want to import the clean version saved in the previous script. I will directly add selected information from the other file. For more on joins, see the study materials.

env<- read_csv("data/env.csv") %>%       
  left_join(env_extra %>%                                 
               select (releve_nr, forest_type, forest_type_name,                                                soil_pH, biomass)%>%                           
               unite(forest, forest_type, forest_type_name))
Rows: 65 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): coverscale, field_nr, country, author, locality
dbl (12): releve_nr, date, altitude, exposition, inclinatio, cov_trees, cov_...
lgl  (1): syntaxon

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(releve_nr)`

In this example I used nested pipes. If we want to examine this further we can test individual steps by sending them to console - I am attaching to the env file another file, called env_extra, but first I will modify it. I will select only the variables I want to keep

and I will prepare a merged variable called forest with code and name merged together with the unite function: 1_oak forest…

Try to send to console the pieces of the code step by step, to understand, what we are merging. Alternative way would be to save this part as another object and join it to the env file in another step. However, the approach above is more in tidyverse style.

2.1.2 spe dataset

I want to work with the spe dataset, and I want to keep information about layers, as I want to focus on the herb-layer.

spe <- read_csv ("data/spe_merged_covers.csv")
Rows: 2277 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (3): releve_nr, layer, cover_perc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1.3 traits

I also want to add some information about traits. So I will check what is available in my folder.

list.files("data")
 [1] "CzechVeg_tutorial.zip"               "data.zip"                           
 [3] "ellenberg_type_indicators_cz.csv"    "env.csv"                            
 [5] "env_measured.csv"                    "readme.txt"                         
 [7] "remarks.cdx"                         "remarks.dbf"                        
 [9] "spe.csv"                             "spe_juice_input.csv"                
[11] "spe_merged_covers.csv"               "spe_merged_covers_across_layers.csv"
[13] "species.csv"                         "status.csv"                         
[15] "table_cover.csv"                     "table_nomenclature.csv"             
[17] "traits.csv"                          "tvabund.cdx"                        
[19] "tvabund.csv"                         "tvabund.dbf"                        
[21] "TvAdmin.cdx"                         "TvAdmin.dbf"                        
[23] "tvhabita.cdx"                        "tvhabita.csv"                       
[25] "tvhabita.dbf"                        "tvwin.dbf"                          

I will import some traits e.g. plant height data, Ellenberg-type indicator values and status (origin and Red List categories). All these data are available in the PLADIAS database of the Czech flora and vegetation. I used the datasets from the download section and adjusted the format for easier data handling.

traits <- read_csv("data/traits.csv")
Rows: 317 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (2): height, seed_mass

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
indicator_values <- read_csv("data/ellenberg_type_indicators_cz.csv")
Rows: 3076 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (6): eiv_light, eiv_temperature, eiv_moisture, eiv_reaction, eiv_nutrien...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
status <- read_csv("data/status.csv")
Rows: 6098 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, origin, redlist

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Basic data handling

Select the columns/variables based on their names or position. Select can also be used in combination with the stringr package to identify the pattern in the names: try several options: starts_with, ends_with or a more general one contains. With the use of select in your script you can order the variables always in the same way e.g. ID, forest type, species number, productivity, even during import. And you can also rename the variables using select, to get exactly what you need.

env_extra %>% 
  select(ID=releve_nr, forest_type=forest_type_name, species_nr=herbs, productivity=biomass) 
# A tibble: 65 × 4
      ID forest_type         species_nr productivity
   <dbl> <chr>                    <dbl>        <dbl>
 1     1 oak hornbeam forest         26         12.8
 2     2 oak forest                  13          9.9
 3     3 oak forest                  14         15.2
 4     4 oak forest                  15         16  
 5     5 oak forest                  13         20.7
 6     6 oak forest                  16         46.4
 7     7 oak forest                  17         49.2
 8     8 oak hornbeam forest         21         48.7
 9     9 oak hornbeam forest         15         13.8
10    10 oak forest                  14         79.1
# ℹ 55 more rows

Distinct is a function that takes your data and removes all the duplicate rows, keeping only the unique ones. There are many cases where you will really appreciate this elegant and easy way. For example, I want a list of unique PlotIDs= releve_nr, unique combinations of two categories etc. Here I want to prepare a list of forest type codes and names, and as a first step, I sorted the dataset by forest type with arrange. Note that arrange can be used also in descending order e.g. arrange (desc(forest_type)).

env_extra %>%
  arrange(forest_type) %>%
  distinct(forest_type,forest_type_name)
# A tibble: 4 × 2
  forest_type forest_type_name   
        <dbl> <chr>              
1           1 oak forest         
2           2 oak hornbeam forest
3           3 ravine forest      
4           4 alluvial forest    

When we have a large dataset, we sometimes need to create a subset of the rows/cases by filter. First we have to define upon which variable we are going to filter the rows (e.g. Forest type, soil pH…) and which values are acceptable and which are not. Here I keep only plots from a type that exactly match.

env_extra %>% 
  filter(forest_type_name =="alluvial forest")
# A tibble: 11 × 22
   releve_nr forest_type forest_type_name herbs juveniles coverE1 biomass
       <dbl>       <dbl> <chr>            <dbl>     <dbl>   <dbl>   <dbl>
 1       101           4 alluvial forest     28         6      90    91.1
 2       103           4 alluvial forest     35         5      80   114. 
 3       104           4 alluvial forest     25        11      85   188. 
 4       110           4 alluvial forest     37         4      95   126. 
 5       111           4 alluvial forest     26        10      85    84.8
 6       113           4 alluvial forest     37        10      70    74.5
 7       125           4 alluvial forest     25         8      95   123. 
 8       127           4 alluvial forest     53         6       0   176  
 9       129           4 alluvial forest     31         7      85   100. 
10       131           4 alluvial forest     59         3      95   163. 
11       132           4 alluvial forest     60         7      90   287. 
# ℹ 15 more variables: soil_depth <dbl>, soil_pH <dbl>, slope <dbl>,
#   altitude <dbl>, coverE3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, TWI <dbl>

or those that are listed in the brackets

env_extra %>% 
  filter(forest_type_name %in% c("alluvial forest", "ravine forest"))
# A tibble: 21 × 22
   releve_nr forest_type forest_type_name herbs juveniles coverE1 biomass
       <dbl>       <dbl> <chr>            <dbl>     <dbl>   <dbl>   <dbl>
 1       101           4 alluvial forest     28         6      90    91.1
 2       103           4 alluvial forest     35         5      80   114. 
 3       104           4 alluvial forest     25        11      85   188. 
 4       105           3 ravine forest       12        10      80    65.5
 5       106           3 ravine forest       16         0      75    33.6
 6       108           3 ravine forest       28         7      70    82.3
 7       110           4 alluvial forest     37         4      95   126. 
 8       111           4 alluvial forest     26        10      85    84.8
 9       113           4 alluvial forest     37        10      70    74.5
10       114           3 ravine forest       12         6      50    71.5
# ℹ 11 more rows
# ℹ 15 more variables: soil_depth <dbl>, soil_pH <dbl>, slope <dbl>,
#   altitude <dbl>, coverE3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, TWI <dbl>

Or I want to filter based on some values. If I have NAs in the variable at the same time, I should specify if I want to keep these rows or not. In the example below I will keep all rows/plots where biomass is higher than 80g/m2 but I will also keep all the NAs, i.e. plots where it was not measured.

env_extra %>% 
  filter(biomass>80 |is.na(biomass))
# A tibble: 17 × 22
   releve_nr forest_type forest_type_name    herbs juveniles coverE1 biomass
       <dbl>       <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
 1       101           4 alluvial forest        28         6      90    91.1
 2       103           4 alluvial forest        35         5      80   114. 
 3       104           4 alluvial forest        25        11      85   188. 
 4       108           3 ravine forest          28         7      70    82.3
 5       110           4 alluvial forest        37         4      95   126. 
 6       111           4 alluvial forest        26        10      85    84.8
 7       120           2 oak hornbeam forest    36         6      75    98.5
 8       121           2 oak hornbeam forest    30        12      80    84.7
 9       122           3 ravine forest          29         6      60   102. 
10       124           2 oak hornbeam forest    34        15      75   102. 
11       125           4 alluvial forest        25         8      95   123. 
12       127           4 alluvial forest        53         6       0   176  
13       128           3 ravine forest          20         4      70   121. 
14       129           4 alluvial forest        31         7      85   100. 
15       130           3 ravine forest          18         5      85   238  
16       131           4 alluvial forest        59         3      95   163. 
17       132           4 alluvial forest        60         7      90   287. 
# ℹ 15 more variables: soil_depth <dbl>, soil_pH <dbl>, slope <dbl>,
#   altitude <dbl>, coverE3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, TWI <dbl>

You can also filter plots in a specified range of some value.

env_extra %>% 
  filter((biomass >= 40 & biomass <= 80) | is.na(biomass))
# A tibble: 20 × 22
   releve_nr forest_type forest_type_name    herbs juveniles coverE1 biomass
       <dbl>       <dbl> <chr>               <dbl>     <dbl>   <dbl>   <dbl>
 1         6           1 oak forest             16         3      60    46.4
 2         7           1 oak forest             17         5      70    49.2
 3         8           2 oak hornbeam forest    21         1      70    48.7
 4        10           1 oak forest             14         4      75    79.1
 5        18           2 oak hornbeam forest    13         3      85    72.2
 6        86           2 oak hornbeam forest    32         5      40    46.9
 7        87           2 oak hornbeam forest    34         9      65    42.9
 8        99           2 oak hornbeam forest    18         6      85    64.8
 9       102           2 oak hornbeam forest    37        13      65    57.4
10       105           3 ravine forest          12        10      80    65.5
11       109           2 oak hornbeam forest    25         9      50    60.7
12       112           2 oak hornbeam forest    32        14      60    58.4
13       113           4 alluvial forest        37        10      70    74.5
14       114           3 ravine forest          12         6      50    71.5
15       115           3 ravine forest          23         2      70    54.6
16       116           3 ravine forest          16         3      45    64.5
17       117           2 oak hornbeam forest    31         7      55    54.7
18       118           2 oak hornbeam forest     8         5      45    66.2
19       123           3 ravine forest          34        11      70    53.6
20       126           2 oak hornbeam forest    27         7      65    72.9
# ℹ 15 more variables: soil_depth <dbl>, soil_pH <dbl>, slope <dbl>,
#   altitude <dbl>, coverE3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, TWI <dbl>

Similarly, you can for example, filter plots based on their plot size. The example below is not working on our dataset, but it comes in useful in large EVA exports. Here I keep grassland plots in the range 10-100 m2, forest plots between 100-1000 m2 and I keep the plots without a specified plot size.

env %>% 
  filter(
    (habitat == "grassland" & plot_size >= 10 & plot_size <= 100) |
    (habitat == "forest" & plot_size >= 100 & plot_size <= 1000) |
    is.na(plot_size)
  )

Mutate Often, you need to prepare new variables, based on the existing variables, filled with some value (number or string) or you need to rewrite the values of a variable. For this, you can use the mutate function.

Here I prepare new variable by summing two columns:

env_extra %>% 
  mutate(herblayer_richness = herbs+juveniles) %>% 
  select(releve_nr, herblayer_richness, herbs, juveniles)
# A tibble: 65 × 4
   releve_nr herblayer_richness herbs juveniles
       <dbl>              <dbl> <dbl>     <dbl>
 1         1                 38    26        12
 2         2                 16    13         3
 3         3                 15    14         1
 4         4                 20    15         5
 5         5                 14    13         1
 6         6                 19    16         3
 7         7                 22    17         5
 8         8                 22    21         1
 9         9                 19    15         4
10        10                 18    14         4
# ℹ 55 more rows

Or I can define new variable based on some conditions with ifelse (first I state what happens if the condition is met (here “low”), then what happens with the rest (“high”) and I can nest more ifelse statements)

env_extra %>% 
  mutate(productivity = ifelse(biomass<60,"low","high")) %>% 
  select (releve_nr, forest_type_name, productivity, biomass) %>% 
  print(n=20)
# A tibble: 65 × 4
   releve_nr forest_type_name    productivity biomass
       <dbl> <chr>               <chr>          <dbl>
 1         1 oak hornbeam forest low             12.8
 2         2 oak forest          low              9.9
 3         3 oak forest          low             15.2
 4         4 oak forest          low             16  
 5         5 oak forest          low             20.7
 6         6 oak forest          low             46.4
 7         7 oak forest          low             49.2
 8         8 oak hornbeam forest low             48.7
 9         9 oak hornbeam forest low             13.8
10        10 oak forest          high            79.1
11        11 oak forest          low              7.4
12        14 oak hornbeam forest low             14.2
13        16 oak hornbeam forest low             26.2
14        18 oak hornbeam forest high            72.2
15        28 oak forest          low             29.7
16        29 oak hornbeam forest low             37.9
17        31 oak forest          low             14.9
18        32 oak hornbeam forest low             19.2
19        36 oak hornbeam forest low              3.3
20        41 oak hornbeam forest low             26.9
# ℹ 45 more rows

or with case_when I can define a set of conditions and assign the value to each of them

env_extra %>% 
  mutate(productivity = case_when(biomass <= 60 ~ "low", 
                                  biomass > 60 ~ "high")) 

Join functions are used either to append new information (left_join, inner_join, full_join) or to filter our dataset (semi_join, anti_join) based on the comparison with another dataset. We always need to have some common variables which have exactly the same name or we have to specify using by which variables are equivalent in the two datasets e.g. left_join(traits, by = c("species"= "taxon"). In most cases we will use left_join which is used for appending information from another file, without changing the number of rows in the original file. If the information is not available, the value will be NA.

For more on join functions see here.

2.3 Summarise

Using the group_by() function, we can define how the data should be arranged into groups. Then we can proceed with asking questions about these groups. The basic function of summarising information is to count the number of rows, and we do not have to even specify the grouping variable separately, only within count. Count returns numbers stored as n, but we can also directly rename it e.g. count (forest_type_name, name= “frequency”).

env_extra %>% 
  count(forest_type_name)
# A tibble: 4 × 2
  forest_type_name        n
  <chr>               <int>
1 alluvial forest        11
2 oak forest             16
3 oak hornbeam forest    28
4 ravine forest          10

But we can explore the data much more. For example calculate mean values per vegetation type.

env %>% 
  group_by(forest) %>%
  summarise(mean_biomass = mean(biomass))
# A tibble: 4 × 2
  forest                mean_biomass
  <chr>                        <dbl>
1 1_oak forest                  26.7
2 2_oak hornbeam forest         44.4
3 3_ravine forest               88.7
4 4_alluvial forest            139. 

We might produce a table with summary statistics like min, mean, max for several variables per forest type at once, listing both the variables and functions which should be applied.

env_extra %>%
  arrange(forest_type,forest_type_name) %>%
  group_by(forest_type,forest_type_name) %>% 
  summarise(across(c(coverE1, biomass, coverE3, soil_pH),
      list(
        min  = ~min(.x, na.rm = TRUE),
        mean = ~mean(.x, na.rm = TRUE),
        max  = ~max(.x, na.rm = TRUE))))
`summarise()` has grouped output by 'forest_type'. You can override using the
`.groups` argument.
# A tibble: 4 × 14
# Groups:   forest_type [4]
  forest_type forest_type_name  coverE1_min coverE1_mean coverE1_max biomass_min
        <dbl> <chr>                   <dbl>        <dbl>       <dbl>       <dbl>
1           1 oak forest                  8         37.1          75         7.4
2           2 oak hornbeam for…           7         52.6          85         3.3
3           3 ravine forest              45         67.5          85        33.6
4           4 alluvial forest             0         79.1          95        74.5
# ℹ 8 more variables: biomass_mean <dbl>, biomass_max <dbl>, coverE3_min <dbl>,
#   coverE3_mean <dbl>, coverE3_max <dbl>, soil_pH_min <dbl>,
#   soil_pH_mean <dbl>, soil_pH_max <dbl>

Rather than tables, we can produce graphical outputs, such as boxplots. See our study materials for basicand advanced ggplot2 functions.

env %>%
  arrange(forest) %>%
  select(forest, cover_herbs=cov_herbs, biomass, cover_trees =cov_trees, soil_pH) %>%
  tidyr::pivot_longer(
    cols = c(cover_herbs, biomass, cover_trees, soil_pH),
    names_to = "variable",
    values_to = "value")%>% 
  ggplot(aes(x = forest, y = value, fill=forest)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw() +
  theme(legend.position = 'none')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL) +
  labs (y = NULL)

We might be interested in species richness relationship with several factors. Below is an example of herb-layer species richness and biomass. At first I need to calculate the species richness in the spe file. And as an output I want to plot this with different colours for different forest types but keep the regression line for the whole dataset.

spe %>% 
  filter(layer %in% c(6, 7)) %>% 
  count(releve_nr, name = "herblayer_richness") %>% 
  left_join(env %>% 
              select(releve_nr, biomass, forest), by = "releve_nr") %>% 
  ggplot(aes(x = sqrt(biomass), y = herblayer_richness)) +
  geom_point(aes(colour = forest), size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black") +
  theme_bw() +
  labs(
    x = "sqrt Biomass",
    y = "Herb-layer species richness",
    colour = "Forest type"
  )
`geom_smooth()` using formula = 'y ~ x'

Proportions of endangered or alien species in forest type. I will first look at the numbers of these species, here on example of redlist i.e. endangered species. I just one to see average number of redlist species per forest type.

spe%>% 
  left_join(spe %>% 
              left_join(status)%>% 
              filter(!is.na(redlist))%>%
              count(releve_nr, name="richness_redlist"))%>%
  left_join(spe %>% 
              summarise(richness_all = n_distinct(species), 
                        .by = releve_nr))%>%
  distinct(releve_nr,richness_redlist,richness_all)%>%
  mutate(richness_redlist = replace_na(richness_redlist, 0))%>%
  mutate(richness_redlist_perc= richness_redlist/richness_all*100)%>%
  left_join(env%>% select(releve_nr, forest))%>%
  summarise(redlist_mean_perc = mean(richness_redlist_perc), .by=forest)%>%
  arrange(forest)
Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
# A tibble: 4 × 2
  forest                redlist_mean_perc
  <chr>                             <dbl>
1 1_oak forest                       7.95
2 2_oak hornbeam forest              6.45
3 3_ravine forest                    5.39
4 4_alluvial forest                  2.12

But I can also ask about diferences in abundances of some plant groups. What is the proportion of cover of alien species relative to total cover in different forests? I will again use the function we used for cover combination during import from Turboveg.

combine_cover <- function(x){
  while (length(x)>1){
    x[2] <- x[1]+(100-x[1])*x[2]/100
    x <- x[-1]
  }
  return(x)
}
spe%>% 
  left_join(spe %>% 
              left_join(status)%>% 
              filter(origin %in% c("arch","neo"))%>%
              summarize(cover_alien = combine_cover(cover_perc), 
                        .by=releve_nr))%>%
  left_join(spe %>% 
              summarize(cover_total = combine_cover(cover_perc), 
                        .by=releve_nr))%>%
  distinct(releve_nr, cover_alien, cover_total)%>%
  mutate(cover_alien = replace_na(cover_alien, 0))%>%
  mutate(cover_alien_perc= cover_alien/cover_total*100)%>%
  left_join(env%>% select(releve_nr, forest))%>%
  summarise(alien_cover_mean_perc = mean(cover_alien_perc), .by=forest)%>%
  arrange(forest)
Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
# A tibble: 4 × 2
  forest                alien_cover_mean_perc
  <chr>                                 <dbl>
1 1_oak forest                         0.0188
2 2_oak hornbeam forest                1.24  
3 3_ravine forest                      3.11  
4 4_alluvial forest                    3.44  

2.4 Community weighted means

Summarise is often used for getting so-called community means or community weighted means. For example we have traits for individual species and we want to calculate a mean for each site and compare it. We can also consider to use abundance to weight the result. This is called community weighted mean and it simply gives higher importance to those species that are more abundant and lower importance to the rare once. Here the abundance is approximated as percentage cover in the site. The example below compares mean plant height both weighted and unweighted. To have meaningful values, I looked just at the herb-layers species.

spe %>%
  left_join(traits) %>%
  filter(layer==6)%>%
  summarise(meanHeight= mean(height, na.rm=T),
            meanHeight_weighted = weighted.mean(height, cover_perc, na.rm = TRUE),
            .by=releve_nr)
Joining with `by = join_by(species)`
# A tibble: 65 × 3
   releve_nr meanHeight meanHeight_weighted
       <dbl>      <dbl>               <dbl>
 1         1      0.348               0.322
 2         2      0.454               0.494
 3         3      0.430               0.391
 4         4      0.570               0.522
 5         5      0.563               0.514
 6         6      0.562               0.573
 7         7      0.491               0.553
 8         8      0.458               0.699
 9         9      0.476               0.493
10        10      0.539               0.560
# ℹ 55 more rows

2.5 Ellenberg indicator values

In the same way we can calculate community means across more variables. Here we will try the Ellenberg indicator values (abbreviated as EIV, measures of species demands for the particular factors, the higher value mean higher demands or affinity to habitats with higher values of these environmental factors). We will apply them just for presences of species, so no weights. Therefore we do not need to care that much for the differences among layers. However, we will keep each species only once (list of unique species for each plot), so we will remove information about layers first and use distinct. Alternative is to add #filter(Layer==6)%>% if we want to focus on herb-layer only

spe %>%
  left_join(indicator_values) %>%
  select(-c(layer,cover_perc))%>%
  distinct()%>%
  group_by(releve_nr)%>%
  summarise(across(starts_with("eiv"), ~ mean(.x, na.rm = TRUE)))
Joining with `by = join_by(species)`
# A tibble: 65 × 7
   releve_nr eiv_light eiv_temperature eiv_moisture eiv_reaction eiv_nutrients
       <dbl>     <dbl>           <dbl>        <dbl>        <dbl>         <dbl>
 1         1      5.08            5.56         4.64         6.54          4.95
 2         2      4.87            5.2          4.8          5             4.33
 3         3      4.87            5.13         4.8          5             4.4 
 4         4      5.32            5.53         4.47         5.79          4.26
 5         5      6.07            5.57         4.14         5.57          3.86
 6         6      5.72            5.56         4.5          6.44          5.22
 7         7      5.75            5.45         4.45         6.25          4.95
 8         8      5.13            5.39         4.70         5.48          4.70
 9         9      5.38            5.67         4.48         5.95          4.43
10        10      6.38            5.88         3.69         6.19          3.88
# ℹ 55 more rows
# ℹ 1 more variable: eiv_salinity <dbl>

Boxplots of Ellenberg-type indicator values can be produced for example like this: You can try again sending parts of the script to console, always adding new part and see what happens.

spe %>% 
  left_join(indicator_values) %>%
  select(-c(layer, cover_perc)) %>%
  distinct() %>%
  group_by(releve_nr) %>%
  summarise(across(starts_with("EIV"), ~ mean(.x, na.rm = TRUE))) %>%
  left_join(env %>% select(releve_nr, forest))%>%
  pivot_longer(
    cols = starts_with("EIV"),
    names_to = "EIV_variable",
    values_to = "value")%>%
  ggplot(aes(x = forest, y = value, fill = forest)) +
  geom_boxplot() +
  facet_wrap(~ EIV_variable, scales = "free_y") +
  theme_bw()+
  theme(legend.position = 'none')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL) +
  labs (y = NULL)
Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`

2.6 Visualisation

Visualisation is not part of the tutorial at the moment, but we recommend our study materials for basic and advanced ggplot2 functions.