4. Simple permutations

In the example in A basic similarity search workflow we performed the similarity search with mobest::locate() for only one parameter permutation, keeping everything constant except the two dependent variables. As laid out above in The mobest_locateoverview table, however, mobest can automatically consider more parameter permutations, the most basic of which are directly available in locate(). This flexibility makes the search results harder to present. Here the concept of the origin vector can help to summarize the information.

Warning

Please note that every parameter permutation is combined with every other permutation, so the number of runs grows rapidly. If you, for example, submit five time slices and five search samples, the number of runs will be \(5 \times 5 = 25\) times larger than for one time slice and one sample. The permutation mechanism is explained in more detail in Advanced features of the mobest package.
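
The growth in the number of runs can be illustrated with a small base R sketch that does not call mobest at all. The sample identifiers and the three intermediate time slices are made up for illustration, and expand.grid() only stands in for the package's internal permutation logic.

```r
# Hypothetical inputs: five search time slices and five search samples.
search_times <- c(-6800, -6500, -6200, -5900, -5700)
sample_ids   <- paste0("sample_", 1:5)

# Every time slice is combined with every sample.
runs <- expand.grid(
  search_time = search_times,
  search_id   = sample_ids,
  stringsAsFactors = FALSE
)

# Number of combinations: 5 * 5 = 25, compared to 1 for a single
# time slice and a single sample.
nrow(runs)
```

Adding a third permuted parameter would multiply this number again by the number of its settings.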

The following code is included at the end of the similarity_search.R script.

4.1. Multiple search time slices

As explained in Search positions, the search_time argument can take an integer vector of relative or absolute ages. That means we can run the search not just for one, but for arbitrarily many time slices with a single call to locate().

Here is an example with two time slices.

search_result <- mobest::locate(
  independent        = ind,
  dependent          = dep,
  kernel             = kernset,
  search_independent = search_ind,
  search_dependent   = search_dep,
  search_space_grid  = spatial_pred_grid,
  search_time        = c(-6800, -5700),
  search_time_mode   = "absolute"
)
search_product <- mobest::multiply_dependent_probabilities(search_result)

Code for this figure.

ggplot() +
  geom_raster(
    data = search_product,
    mapping = aes(x = field_x, y = field_y, fill = probability)
  ) +
  scale_fill_viridis_c() +
  geom_sf(
    data = research_area_3035,
    fill = NA, colour = "red",
    linetype = "solid", linewidth = 1
  ) +
  geom_point(
    data = search_samples,
    mapping = aes(x, y),
    colour = "red"
  ) +
  ggtitle(
    label = "<Stuttgart> ~5250BC",
    subtitle = "Early Neolithic (Linear Pottery culture) - Lazaridis et al. 2014"
  ) +
  theme_bw() +
  theme(
    axis.title = element_blank()
  ) +
  guides(
    fill = guide_colourbar(title = "Similarity\nsearch\nprobability")
  ) +
  facet_wrap(
    ~search_time,
    # as_labeller() passes the facet values on as character strings
    labeller = as_labeller(
      \(value) paste0("Search time: ", abs(as.numeric(value)), "BC")
    )
  )

_images/search_map_two_timeslices.png — Similarity search map plot for two different time slices.

4.2. Multiple search samples

We can also select multiple search samples and prepare the input data for mobest::locate(). Here we introduce another sample, RISE434, originally published in [Allentoft et al., 2015].

search_samples <- samples_projected %>%
  dplyr::filter(
    Sample_ID %in% c("Stuttgart_published.DG", "RISE434.SG")
  )

search_ind <- mobest::create_spatpos(
  id = search_samples$Sample_ID,
  x  = search_samples$x,
  y  = search_samples$y,
  z  = search_samples$Date_BC_AD_Median
)
search_dep <- mobest::create_obs(
  C1 = search_samples$MDS_C1,
  C2 = search_samples$MDS_C2
)

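We set the search_time_mode to "relative" to get a different, meaningful search time for each sample. In this mode search_time is interpreted as an offset from each sample's own age, so -1500 means 1500 years before the respective sample.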

search_result <- mobest::locate(
  independent        = ind,
  dependent          = dep,
  kernel             = kernset,
  search_independent = search_ind,
  search_dependent   = search_dep,
  search_space_grid  = spatial_pred_grid,
  search_time        = -1500,
  search_time_mode   = "relative"
)
search_product <- mobest::multiply_dependent_probabilities(search_result)
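
To make the effect of search_time_mode = "relative" concrete, the following base R sketch adds the offset to each sample's age. The rounded ages are taken from the figure labels in this section, not from the data, and the arithmetic only illustrates the concept; it is not mobest's internal code.

```r
# Approximate median ages (negative values are years BC),
# rounded as in the figure labels of this section.
sample_age <- c(
  "Stuttgart_published.DG" = -5250,
  "RISE434.SG"             = -2750
)

# The value passed as search_time with search_time_mode = "relative".
relative_search_time <- -1500

# In relative mode the offset applies to every sample individually,
# which yields one absolute search time per sample.
absolute_search_time <- sample_age + relative_search_time
absolute_search_time
```

With search_time_mode = "absolute" the same value would instead be read as one fixed point in time for all samples.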

Code for this figure.

ggplot() +
  geom_raster(
    data = search_product,
    mapping = aes(x = field_x, y = field_y, fill = probability)
  ) +
  scale_fill_viridis_c() +
  geom_sf(
    data = research_area_3035,
    fill = NA, colour = "red",
    linetype = "solid", linewidth = 1
  ) +
  geom_point(
    data = search_samples %>% dplyr::rename(search_id = Sample_ID),
    mapping = aes(x, y),
    colour = "red"
  ) +
  theme_bw() +
  theme(
    axis.title = element_blank()
  ) +
  guides(
    fill = guide_colourbar(title = "Similarity\nsearch\nprobability")
  ) +
  facet_wrap(
    ~search_id,
    ncol = 2,
    labeller = labeller(
      search_id = c(
        "Stuttgart_published.DG" = paste(
          "<Stuttgart> ~5250BC",
          "Early Neolithic (Linear Pottery culture) - Lazaridis et al. 2014",
          "Search time: ~6750BC",
          sep = "\n"
        ),
        "RISE434.SG" = paste(
          "<RISE434> ~2750BC",
          "Late Neolithic (Corded Ware culture) - Allentoft et al. 2015",
          "Search time: ~4250BC",
          sep = "\n"
        )
      )
    )
  )

_images/search_map_two_samples.png — Similarity search map plot for two different search samples.

4.3. Summarizing multiple searches in one figure

Plotting individual similarity probability maps does not scale well as the number of searches grows. mobest::determine_origin_vectors() offers a way to derive a simple summary statistic by generating what we call origin vectors from objects of class mobest_locateproduct. Each vector connects the spatial point where a sample was found with the point of highest genetic similarity in the interpolated search field and its permutations. The output is of class mobest_originvectors and documents the distance and direction of the “origin vector”.

The origin vector summary can also be applied to individual parameter permutations, which is especially relevant for more complex applications involving mobest::locate_multi() (see Similarity search with permutations), but already comes in handy for just two different search times, as introduced in the Multiple search time slices section above.

We can take the search_product object from there and apply mobest::determine_origin_vectors() with the grouping variable search_time, to determine one origin vector for each of the two search time iterations.

origin_vectors <- mobest::determine_origin_vectors(search_product, search_time)
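
Conceptually, the origin vector for one group points to the grid cell with the highest probability in that group. The base R sketch below mimics this selection on a tiny, made-up table that borrows some column names from search_product. It is not the implementation used by determine_origin_vectors(), which may additionally handle ties and summaries across permutations.

```r
# Toy stand-in for a mobest_locateproduct table (all values invented).
toy_product <- data.frame(
  search_id   = "sample_A",
  search_time = c(-6800, -6800, -6800, -5700, -5700, -5700),
  field_x     = c(1e5, 2e5, 3e5, 1e5, 2e5, 3e5),
  field_y     = c(5e5, 5e5, 5e5, 5e5, 5e5, 5e5),
  probability = c(0.1, 0.7, 0.2, 0.5, 0.1, 0.4)
)

# Split by the grouping variable and keep the row of maximum
# probability within each group.
max_points <- do.call(
  rbind,
  lapply(
    split(toy_product, toy_product$search_time),
    function(group) group[which.max(group$probability), ]
  )
)

# One row per search time, each holding the most probable grid cell.
max_points[, c("search_time", "field_x", "field_y", "probability")]
```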

The resulting object of class mobest_originvectors features one row for each vector with the following variables.

Column	Description
independent_table_id	Identifier of the spatiotemporal position permutation
dependent_setting_id	Identifier of the dependent variable space position permutation
dependent_var_id	Identifier of the dependent variable
kernel_setting_id	Identifier of the kernel setting permutation
pred_grid_id	Identifier of the spatiotemporal prediction grid
field_id	Identifier of the spatiotemporal prediction point
field_x	Spatial x axis coordinate of the prediction point
field_y	Spatial y axis coordinate of the prediction point
field_z	Temporal coordinate (age) of the prediction point
field_geo_id	Identifier of the spatial prediction point
search_id	Identifier of the search sample
search_x	Spatial x axis coordinate of the search sample
search_y	Spatial y axis coordinate of the search sample
search_z	Temporal coordinate (age) of the search sample
search_time	Search time as provided by the user in the search_time argument of locate()
probability	Probability density calculated in locate()
ov_x	Length of the origin vector in x direction
ov_y	Length of the origin vector in y direction
ov_dist	Length of the origin vector in space (Euclidean distance)
ov_dist_se	Standard error of the mean of all vector lengths
ov_dist_sd	Standard deviation of all vector lengths
ov_angle_deg	Direction of the origin vector as an angle in degrees (0-360°)

Warning

Note that many of these variables can be mean values, because determine_origin_vectors() can summarize information across parameter permutations. Depending on the input mobest_locateproduct table and the specific grouping requirements, multiple origin vectors are determined and then summarized. This explains variables like ov_dist_se and ov_dist_sd, which are only meaningful for groups of vectors.
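
How the ov_* columns relate to each other can be sketched in base R. The coordinates below are invented, the sign convention (field minus search position) is an assumption, and the formulas are a plain reading of the column descriptions above. The package may compute or aggregate these values differently, and the angle is left out because its convention is not stated here.

```r
# One search sample and the maximum-probability points found in three
# hypothetical parameter permutations (coordinates in metres, invented).
search_x <- 4300000
search_y <- 2800000
field_x  <- c(5100000, 5300000, 5200000)
field_y  <- c(2200000, 2100000, 2400000)

# Vector components and Euclidean length per permutation.
# Assumed direction: from the sample towards the field point.
ov_x    <- field_x - search_x
ov_y    <- field_y - search_y
ov_dist <- sqrt(ov_x^2 + ov_y^2)

# Summary across the group of vectors.
n          <- length(ov_dist)
ov_dist_sd <- sd(ov_dist)
ov_dist_se <- ov_dist_sd / sqrt(n)

# Mean length, standard deviation and standard error in kilometres.
round(c(mean = mean(ov_dist), sd = ov_dist_sd, se = ov_dist_se) / 1000)

# A single vector has no spread to measure: sd() returns NA here.
sd(ov_dist[1])
```

The last line shows why the spread columns carry no information when each group contains exactly one vector.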

One basic way of making use of a mobest_originvectors object is to highlight the points of maximum similarity probability in the map plot.

Code for this figure.

ggplot() +
  geom_raster(
    data = search_product,
    mapping = aes(x = field_x, y = field_y, fill = probability)
  ) +
  scale_fill_viridis_c() +
  geom_sf(
    data = research_area_3035,
    fill = NA, colour = "red",
    linetype = "solid", linewidth = 1
  ) +
  geom_point(
    data = search_samples,
    mapping = aes(x, y),
    colour = "red"
  ) +
  geom_point(
    data = origin_vectors,
    mapping = aes(field_x, field_y),
    fill = "orange", shape = 24
  ) +
  geom_segment(
    data = origin_vectors,
    mapping = aes(
      x = search_x, y = search_y,
      xend = field_x, yend = field_y
    ),
    arrow = arrow(length = unit(0.2, "cm")),
    colour = "red"
  ) +
  geom_label(
    data = origin_vectors,
    mapping = aes(
      x = (field_x + search_x)/2, y = (field_y + search_y)/2,
      label = paste0(round(ov_dist/1000, -2), "km")
    ),
    fill = "white", colour = "red", size = 2
  ) +
  ggtitle(
    label = "<Stuttgart> ~5250BC",
    subtitle = "Early Neolithic (Linear Pottery culture) - Lazaridis et al. 2014"
  ) +
  theme_bw() +
  theme(
    axis.title = element_blank()
  ) +
  guides(
    fill = guide_colourbar(title = "Similarity\nsearch\nprobability")
  ) +
  facet_wrap(
    ~search_time,
    # as_labeller() passes the facet values on as character strings
    labeller = as_labeller(
      \(value) paste0("Search time: ", abs(as.numeric(value)), "BC")
    )
  )

_images/search_map_two_timeslices_with_ovs.png — Similarity search map plot for two different time slices including arrows to the point of maximum similarity probability. cf. Figure ‘Similarity search map plot for two different time slices.’

With origin_vectors we can also summarize the results for two different samples, as in Multiple search samples above, in a single figure.

If we take the search_product object from there and apply mobest::determine_origin_vectors() without a grouping variable (grouping by sample is the default), we get one origin vector for each of the two samples, which we can display in the same map figure.

origin_vectors <- mobest::determine_origin_vectors(search_product)

Code for this figure.

ggplot() +
  geom_sf(
    data = research_land_outline_3035,
    fill = "grey", color = NA
  ) +
  geom_sf(
    data = research_area_3035,
    fill = NA, colour = "red",
    linetype = "solid", linewidth = 1
  ) +
  geom_point(
    data = origin_vectors,
    mapping = aes(search_x, search_y),
    colour = "red"
  ) +
  geom_point(
    data = origin_vectors,
    mapping = aes(field_x, field_y),
    fill = "orange", shape = 24
  ) +
  geom_segment(
    data = origin_vectors,
    mapping = aes(
      x = search_x, y = search_y,
      xend = field_x, yend = field_y
    ),
    arrow = arrow(length = unit(0.2, "cm")),
    colour = "red"
  ) +
  theme_bw() +
  theme(
    axis.title = element_blank()
  )

_images/search_map_two_samples_in_one_plot.png — Prototype of a figure that combines the independent search results for two samples in one figure. cf. Figure ‘Similarity search map plot for two different search samples.’