4

Here is how my data frame is currently structured (first 6 rows). The data I used is available here.

ID  date        sps     time    pp  datetime            km
1   2012-06-19  MICRO   2:19    0   2012-06-19 02:19    80
2   2012-06-21  MUXX    23:23   1   2012-06-21 23:23    80
3   2012-07-15  MAMO    11:38   0   2012-07-15 11:38    80
4   2012-07-20  MICRO   22:19   0   2012-07-20 22:19    80
5   2012-07-29  MICRO   23:03   0   2012-07-29 23:03    80
8   2012-08-07  PRLO    2:04    0   2012-08-07 02:04    80

The columns stand for:

  • ID: identification number
  • date: date of observation
  • km: location
  • sps: species code
  • time: time of observation
  • pp: codes if the species (sps) observed is a predator (1) or prey (0)
  • datetime: conversion of date and time rows to as.POSIXct format

The question I am trying to answer:

Does the likelihood of a predator (pp = 1) being observed increase with the number of times prey (pp = 0) are observed (e.g. is prey followed by predator more likely than prey followed by prey, etc.) at each location (km)?

Background:

  • For each location (km) there is a unique row in my data with the time the image is taken and an identification of whether the photo is of a predator or prey.
  • There are many photos of predators and prey at each location.
  • For each location, observations of predators and prey are made in temporal sequence.

What I am trying to do:

  1. For each location, exhaustively count the number of temporal pairs of observations: prey-prey, prey-predator, predator-prey and predator-predator.

  2. For each location, shuffle (randomize) the observations of pred/prey (i.e. maintain the same total number of pred/prey as observed) and count the number of pairs of observations generated by the shuffle: prey-prey, prey-predator, predator-prey and predator-predator. Record. Calculate the difference between number of observations in step (1) and that found by each shuffle. Repeat 1000 times. This should give me a sense of how likely the original observation of prey-prey, prey-predator, predator-prey, and predator-predator paired sequences are given the observed proportion of pred/prey.

My question:

Assuming a Markov Chain model is the most appropriate way to answer my question, how would I be able to code this in R?

At this point, I believe the R package I should be using is markovchain, but I don't know how to translate steps 1 and 2 into R code.

Blundering Ecologist
  • 1,199
  • 2
  • 14
  • 38
  • 2
    What specifically are you having trouble with? The Markov model, or the R implementation? I suggest you edit your question so as to make this more specific, you are getting close votes because you said "does anyone have any suggestions". – Superbest Jan 26 '18 at 22:49
  • 2
    Also, I disagree with the close votes. This is a sufficiently small and straightforward case that is great as an illustrative example for beginners. A concise answer can easily be written (not sure if I can, I have only worked with MMs in Python not R) and would be a very useful, informative addition to the site. – Superbest Jan 26 '18 at 22:52
  • @Superbest, I have modified my question to clarify what I am hoping to get help with. – Blundering Ecologist Jan 29 '18 at 18:05
  • Strictly speaking, your question is not answerable by a Markov model. The probability "prey followed by prey followed by predator" cannot be determined since Markov chains only rely on the preceding state. – thc Jan 31 '18 at 00:47
  • @thc From your answer I will at least be able to get to the probability of an event given the preceding one. Is there a different method you would suggest if I wanted to look at more than just the preceding state? – Blundering Ecologist Jan 31 '18 at 20:24
  • 1
    You could look into Hidden Markov Models, but I am not familiar enough with HMMs to give you a definitive answer if it will work. – thc Jan 31 '18 at 20:30

1 Answers1

4

Here is some code:

library(dplyr)
library(markovchain)

Read in data and format time stamps

data <- read.table("~/Downloads/d1.txt", sep="\t", header=T, stringsAsFactors=F)
data$datetime <- as.POSIXct(data$datetime)

Sort by time

data <- data[order(data$datetime, decreasing=F),]

For each location, create a sequence of pp

data <- data %>% group_by(km) %>% summarize(pp_chain=list(pp)) %>% as.data.frame 
pp_chains <- data$pp_chain; names(pp_chains) <- data$km

Fit the Markov model on all sequence chains

fit <- markovchainFit(pp_chains)

Estimate transition probabilities:

print(fit$estimate)
          0          1
0 0.9116832 0.08831677
1 0.5250852 0.47491476

This matrix says the probability of transition from 0 to 0 is 0.91; the probability of transition from 0 to 1 is 0.088; and so on.

Estimate steady states:

print(steadyStates(fit$estimate))
             0         1
[1,] 0.8560214 0.1439786

We can compare the transition probabilities with the steady state. If transition were just random, then transition would not depend on the previous state and they would just be equal to the steady state values.

Since that's not the case, it's clear that if you have a predator you're much more likely to have another predator and vice versa.

thc
  • 9,527
  • 1
  • 24
  • 39
  • That is exactly what I was looking for. Thank you. If possible, could you explain what "steady state" means? (I am trying to understand the output.) Transition probabilities make sense, but I do not understand how steady states relate to them. – Blundering Ecologist Jan 31 '18 at 20:14
  • Steady state is the distribution such that if we start off with that distribution, that distribution of states will persist in the chain for all time. It is basically the proportion of states in your data. – thc Jan 31 '18 at 20:28
  • If I understand that correctly, it means that if the transitions were random then 0-0 would be 0.856, 0-1 and 1-0 would be 0.0, and 1-1 would be 0.143? – Blundering Ecologist Jan 31 '18 at 20:35
  • 1
    No, 0-0 would be 0.856, 0-1 would be 0.143; 1-0 would be 0.856 and 1-1 would be 0.143. It's a "limit". – thc Feb 01 '18 at 00:38
  • Do you know if there is a way to extract the probability estimates for each of the locations (`km`)? – Blundering Ecologist Feb 05 '18 at 23:34