Probability Models

Robert Kissell, Jim Poserina, in Optimal Sports Math, Statistics, and Fantasy, 2017

3.6 Conclusions

This chapter provided readers with an overview of various probability models that can be used to predict the winner of a game or match, rank teams, estimate the winning margin, and compute the probability that a team will win.

These probability models consist of deriving "team rating" metrics that are subsequently used as the basis for output predictions. The techniques discussed include logistic and power function optimization models, logit regression models using win/loss observations, and cumulative spread distribution functions (CDFs). These models will serve as the foundation behind our sports prediction models discussed in later chapters.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128051634000037

Measurement Criteria for Choosing among Models with Graded Responses

David Andrich, in Categorical Variables in Developmental Research, 1996

3.2.4 Summary of the CPM in relation to the measurement criteria

It is evident that the CPM satisfies none of the measurement criteria: it does not provide invariance of location, it does not specialize to the case of measurement when precision is to be increased by increasing the number of thresholds, and it does not permit the ordering of the response categories to be falsified. The reason it has none of these features is that it arises from a situation in which the distribution is not that of measurement error, but of a property already measured in which the entities are described by a mean and a variance.


URL:

https://www.sciencedirect.com/science/article/pii/B9780127249650500043

The basics of natural language processing

Chenguang Zhu, in Machine Reading Comprehension, 2021

2.4.2 Evaluation of language models

The language model establishes the probability model for text, that is, $P(w_1 w_2 \cdots w_m)$. So a language model is evaluated by the probability value it assigns to test text unseen during training.

In the evaluation, all sentences in the test set are concatenated together to make a single word sequence: $w_1, w_2, \ldots, w_N$, which includes the special symbols <s> and </s>. A language model should maximize the probability $P(w_1 w_2 \cdots w_N)$. However, as this probability favors shorter sentences, we use the perplexity metric to normalize it by the number of words:

$\mathrm{Perplexity}(w_1 w_2 \cdots w_N) = P(w_1 w_2 \cdots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}$

For example, in the bigram language model, $\mathrm{Perplexity}(w_1 w_2 \cdots w_N) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$.
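As a concrete illustration, the bigram perplexity above can be computed on a toy corpus. This is a minimal sketch, not code from the chapter: the add-one (Laplace) smoothing and the token strings are illustrative choices to keep every bigram probability nonzero.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, vocab_size):
    """Perplexity of an add-one-smoothed bigram model on a test sequence."""
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    unigrams = Counter(train_tokens)
    log_prob = 0.0
    n = 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        # P(cur | prev) with Laplace smoothing to avoid zero probabilities
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    # perplexity = P(w_1 ... w_N)^(-1/N)
    return math.exp(-log_prob / n)

train = "<s> the cat sat </s> <s> the dog sat </s>".split()
test = "<s> the cat sat </s>".split()
ppl = bigram_perplexity(train, test, vocab_size=len(set(train)))
```

A test sequence the model has seen yields a low perplexity relative to the vocabulary size, matching the intuition that lower perplexity means higher assigned probability.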

Since perplexity is a negative power of the probability, it should be minimized to maximize the original probability. On the public benchmark dataset Penn Tree Bank, the currently best language model can achieve a perplexity score around 35.8 [4].

It's worth noting that factors like the dataset size and the inclusion of punctuation can have a significant impact on the perplexity score. Therefore, beyond perplexity, a language model can also be evaluated by checking whether it helps with other downstream NLP tasks.


URL:

https://www.sciencedirect.com/science/article/pii/B9780323901185000023

Random Effects Analysis

W.D. Penny, A.J. Holmes, in Statistical Parametric Mapping, 2007

Maximum likelihood

Underlying RFX analysis is a probability model defined as follows. We first envisage that the mean effect in the population (i.e. averaged across subjects) is of size $w_{pop}$ and that the variability of this effect between subjects is $\sigma_b^2$. The mean effect for the $i$th subject (i.e. averaged across scans), $w_i$, is then assumed to be drawn from a Gaussian with mean $w_{pop}$ and variance $\sigma_b^2$. This process reflects the fact that we are drawing subjects at random from a large population. We then take into account the within-subject (i.e. across-scan) variability by modelling the $j$th observed effect in subject $i$ as being drawn from a Gaussian with mean $w_i$ and variance $\sigma_w^2$. Note that $\sigma_w^2$ is assumed to be the same for all subjects. This is a requirement of a balanced design. This two-stage process is shown graphically in Figure 12.1.

FIGURE 12.1. Synthetic data illustrating the probability model underlying random effects analysis. The dotted line is the Gaussian distribution underlying the second-level model with mean $w_{pop}$, the population effect, and variance $\sigma_b^2$, the between-subject variance. The mean subject effects, $w_i$, are drawn from this distribution. The solid lines are the Gaussians underlying the first-level models with means $w_i$ and variances $\sigma_w^2$. The crosses are the observed effects $y_{ij}$, which are drawn from the solid Gaussians.

Given a data set of effects from $N$ subjects with $n$ replications of that effect per subject, the population effect is modelled by a two-level process:

$y_{ij} = w_i + e_{ij}$

$w_i = w_{pop} + z_i$

where $w_i$ is the true mean effect for subject $i$, $y_{ij}$ is the $j$th observed effect for subject $i$, and $z_i$ is the between-subject error for the $i$th subject. These Gaussian errors have the same variance, $\sigma_b^2$. For the positron emission tomography (PET) data considered below this is a differential effect, the difference in activation between word generation and word shadowing. The first equation captures the within-subject variability and the second equation the between-subject variability.

The within-subject Gaussian error $e_{ij}$ has zero mean and variance $\mathrm{Var}[e_{ij}] = \sigma_w^2$. This assumes that the errors are independent over subjects and over replications within subject. The between-subject Gaussian error $z_i$ has zero mean and variance $\mathrm{Var}[z_i] = \sigma_b^2$. Collapsing the two levels into one gives:

$y_{ij} = w_{pop} + z_i + e_{ij}$

The maximum-likelihood estimate of the population mean is:

(12.3) $\hat{w}_{pop} = \frac{1}{Nn} \sum_{i=1}^{N} \sum_{j=1}^{n} y_{ij}$

We now make use of a number of statistical relations defined in Appendix 12.1 to show that this estimate has a mean $E[\hat{w}_{pop}] = w_{pop}$ and a variance given by:

(12.4) $\mathrm{Var}[\hat{w}_{pop}] = \mathrm{Var}\left[\sum_{i=1}^{N} \frac{1}{N} \sum_{j=1}^{n} \frac{1}{n} (w_{pop} + z_i + e_{ij})\right] = \mathrm{Var}\left[\sum_{i=1}^{N} \frac{1}{N} z_i\right] + \mathrm{Var}\left[\sum_{i=1}^{N} \frac{1}{N} \sum_{j=1}^{n} \frac{1}{n} e_{ij}\right] = \frac{\sigma_b^2}{N} + \frac{\sigma_w^2}{Nn}$

The variance of the population mean estimate contains contributions from both the within-subject and between-subject variance.
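The two-stage sampling process and the variance result in Equation (12.4) can be checked by simulation. This is a sketch with arbitrary parameter values (not taken from the chapter): it repeatedly draws subject effects and scan-level observations, then compares the empirical variance of the grand-mean estimate with the theoretical formula.

```python
import random
import statistics

random.seed(0)
w_pop, sigma_b, sigma_w = 2.0, 1.0, 0.5   # population effect and the two SDs
N, n = 12, 8                               # subjects, scans per subject

def population_estimate():
    # Two-level sampling: w_i ~ N(w_pop, sigma_b^2); y_ij ~ N(w_i, sigma_w^2)
    subject_means = []
    for _ in range(N):
        w_i = random.gauss(w_pop, sigma_b)
        y_i = [random.gauss(w_i, sigma_w) for _ in range(n)]
        subject_means.append(sum(y_i) / n)
    return sum(subject_means) / N          # grand mean = ML estimate of w_pop

samples = [population_estimate() for _ in range(4000)]
theoretical_var = sigma_b**2 / N + sigma_w**2 / (N * n)
empirical_var = statistics.pvariance(samples)
```

The empirical variance of the repeated estimates agrees with $\sigma_b^2/N + \sigma_w^2/(Nn)$ up to Monte Carlo error, illustrating that both variance components contribute.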


URL:

https://www.sciencedirect.com/science/article/pii/B9780123725608500127

Bayesian Methodology in Statistics

J.M. Bernardo, in Comprehensive Chemometrics, 2009

1.08.3.2 Predictive Distributions

Let data $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathcal{X}$, be a random sample from some distribution in the family $\{p(x \mid \omega), \omega \in \Omega\}$, $\pi(\omega)$ a (possibly improper) prior function describing available information (if any) on the value of $\omega$, and consider now a situation where it is desired to predict the value of a future observation $x \in \mathcal{X}$ generated by the same random mechanism that has generated the data $D$. It follows from the foundations arguments discussed in Section 1.08.2 that the solution to this prediction problem is simply encapsulated by the predictive distribution $p(x \mid D)$ describing the uncertainty on the value that $x$ will take, given the information provided by $D$ and any other available knowledge. Since $p(x \mid \omega, D) = p(x \mid \omega)$, it then follows from standard probability theory that

(16) $p(x \mid D) = \int_{\Omega} p(x \mid \omega)\, \pi(\omega \mid D)\, d\omega$

which is an average of the probability distributions of $x$ conditional on the (unknown) value of $\omega$, weighted with the posterior distribution of $\omega$ given $D$, $\pi(\omega \mid D) \propto p(D \mid \omega)\, \pi(\omega)$.

If the assumptions on the probability model are correct, the posterior predictive distribution p( x |D) will converge, as the sample size increases, to the distribution p( x |ω) that has generated the data. Indeed, about the best technique to assess the quality of the inferences about ω encapsulated in π(ω|D) is to check against the observed data the predictive distribution p( x |D) generated by π(ω|D). For a good introduction to Bayesian predictive inference, see Geisser. 19

Example 4.

(Prediction in a Poisson process)

Let $D = \{r_1, \ldots, r_n\}$ be a random sample from a Poisson distribution $\mathrm{Pn}(r \mid \lambda)$ with parameter $\lambda$, so that $p(D \mid \lambda) \propto \lambda^t e^{-\lambda n}$, where $t = \sum_{i=1}^{n} r_i$. It may be shown (see Section 1.08.4) that absence of initial information on the value of $\lambda$ may be formally described by the (improper) prior function $\pi(\lambda) = \lambda^{-1/2}$. Using Bayes' theorem, the corresponding posterior is

(17) $\pi(\lambda \mid D) \propto \lambda^t e^{-\lambda n}\, \lambda^{-1/2} = \lambda^{t - 1/2} e^{-\lambda n}$

the kernel of a gamma density Ga(λ|t  +   1/2, n), with mean (t  +   1/2)/n. The corresponding predictive distribution is the Poisson-Gamma mixture

(18) $p(r \mid D) = \int_0^{\infty} \mathrm{Pn}(r \mid \lambda)\, \mathrm{Ga}(\lambda \mid t + \tfrac{1}{2}, n)\, d\lambda = \frac{n^{t+1/2}}{\Gamma(t+1/2)}\, \frac{1}{r!}\, \frac{\Gamma(r+t+1/2)}{(1+n)^{r+t+1/2}}$

Suppose, for example, that in a firm producing automobile restraint systems, the entire production in each of 10 consecutive months has yielded no complaint from their clients. With no additional information on the average number λ of complaints per month, the quality assurance department of the firm may report that the probabilities that r complaints will be received in the next month of production are given by Equation (18), with t  =   0 and n  =   10. In particular, p(r  =   0|D)   =   0.953, p(r  =   1|D)   =   0.043, and p(r  =   2|D)   =   0.003. Many other situations may be described with the same model. For instance, if meteorological conditions remain similar in a given area, p(r  =   0|D)   =   0.953 would describe the chances of there being no flash flood next year, given that there have been no flash floods in the area for 10 years.
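Equation (18) is straightforward to evaluate numerically. The sketch below, assuming the parametrization above, reproduces the reported complaint probabilities for the ten complaint-free months:

```python
from math import gamma

def predictive_poisson(r, t, n):
    """Posterior predictive p(r | D) under the prior pi(lambda) = lambda^(-1/2):
    a Poisson-Gamma mixture with posterior Ga(lambda | t + 1/2, n)."""
    a = t + 0.5
    # r! computed via the gamma function: r! = Gamma(r + 1)
    return (n**a / gamma(a)) * gamma(r + a) / (gamma(r + 1) * (1 + n)**(r + a))

# Ten complaint-free months: t = 0 events observed over n = 10 periods
probs = [predictive_poisson(r, t=0, n=10) for r in range(3)]
```

Rounding `probs` to three decimals gives 0.953, 0.043, and 0.003, matching the values in the text.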

Example 5.

(Prediction in a normal process)

Consider now an example of prediction of the future value of a continuous observable quantity. Let D  =   {x 1,…,x n } be a random sample from a normal distribution N(x|μ,   σ). As mentioned in Example 3, absence of initial information on the values of both μ and σ is formally described by the improper prior function π(μ, σ)   =   σ−1, and this leads to the joint posterior density (13). The corresponding (posterior) predictive distribution is

(19) $p(x \mid D) = \int_0^{\infty}\!\int_{-\infty}^{\infty} \mathrm{N}(x \mid \mu, \sigma)\, \pi(\mu, \sigma \mid D)\, d\mu\, d\sigma = \mathrm{St}\!\left(x \,\middle|\, \bar{x},\; s\sqrt{\tfrac{n+1}{n-1}},\; n-1\right)$

If μ is known to be positive, the appropriate prior function will be the restricted function

(20) $\pi(\mu, \sigma) = \begin{cases} \sigma^{-1} & \text{if } \mu > 0 \\ 0 & \text{otherwise} \end{cases}$

However, the result in Equation (19) will still basically hold, provided the likelihood function p(D|μ, σ) is concentrated on positive μ values.

Suppose, for example, that in the firm producing automobile restraint systems, the observed breaking strengths of n  =   10 randomly chosen safety belt webbings have mean x̄   =   28.011   kN and standard deviation s  =   0.443   kN, and that the relevant engineering specification requires breaking strengths to be larger than 26   kN. If data may truly be assumed to be a random sample from a normal distribution, the likelihood function is only appreciable for positive μ values, and only the information provided by this small sample is to be used, then the quality engineer may claim that the probability that a safety belt randomly chosen from the same batch as the sample tested would satisfy the required specification is Pr(x  >   26|D)   =   0.9987. Besides, if production conditions remain constant, 99.87% of the safety belt webbings may be expected to have acceptable breaking strengths.
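The predictive probability can be checked numerically from Equation (19). The sketch below integrates the Student-t density with only the standard library; the upper integration bound and step count are arbitrary numerical choices, not part of the chapter.

```python
from math import gamma, sqrt, pi

def t_upper_tail(t0, df, hi=60.0, steps=100000):
    """Pr(T > t0) for Student's t, by trapezoidal integration of its density."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = (hi - t0) / steps
    total = 0.5 * (pdf(t0) + pdf(hi))
    for k in range(1, steps):
        total += pdf(t0 + k * h)
    return total * h

# Predictive distribution from Equation (19): St(x | 28.011, 0.443*sqrt(11/9), 9)
n, xbar, s = 10, 28.011, 0.443
scale = s * sqrt((n + 1) / (n - 1))
p = t_upper_tail((26.0 - xbar) / scale, df=n - 1)
```

The computed `p` agrees with the quoted Pr(x > 26|D) = 0.9987 to the stated precision.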


URL:

https://www.sciencedirect.com/science/article/pii/B9780444527011000958

Integrated Population Biology and Modeling, Part A

Ram Chandra Yadava, in Handbook of Statistics, 2018

3.5 A Probability Model for Time of First Birth

Singh (1964) proposed a probability model for the time of first birth, entitled "On the time of first birth." The time of first birth is simply the sum of the time of first conception and the gestation period $g$, which is normally taken as 9 months. Of course, this is based on the assumption that the first conception results in a live birth. Thus, if $X$ denotes the time of first conception after marriage, then $X$ follows the exponential distribution with p.d.f. $f(x) = \lambda e^{-\lambda x}$. Here, $\lambda$ may be interpreted as the conception rate.

To account for possible heterogeneity in λ, Singh (1964) assumed that λ follows a Pearson's type III distribution with p.d.f.

(58)

While applying his model to real data obtained from a survey conducted in Varanasi, he made a further adjustment for accounting for some sociocultural factors prevailing in the society at that time. He also gave a procedure to estimate the parameters in the model.


URL:

https://www.sciencedirect.com/science/article/pii/S0169716118300142

Statistical Methods for Physical Science

William Q. Meeker, Luis A. Escobar, in Methods in Experimental Physics, 1994

8.1.1 Modeling Variability with a Parametric Distribution

Chapter 1 shows how to use probability models to describe the variability in data generated from physical processes. Chapter 2 gives examples of some probability distributions that are commonly used in statistical models. For the most part, in this chapter, we will discuss problems with an underlying continuous model, although much of what we say also holds for discrete models.

The natural model for a continuous random variable, say Y, is the cumulative distribution function (cdf). Specific examples given in Chapters 1 and 2 are of the form

$\Pr(Y \le y) = F_Y(y; \theta)$

where θ is a vector of parameters and the range of Y is specified. One simple example that we will use later in this chapter is the exponential distribution for which

(8.1) $\Pr(Y \le y) = F_Y(y; \theta) = 1 - \exp(-y/\theta)$

where θ is the single parameter of the distribution (equal to the first moment or mean, in this example). The most commonly used parametric probability distributions have between one and four parameters, although there are distributions with more than four parameters. More complicated models involving mixtures or other combinations of distributions or models that include explanatory variables could contain many more parameters.
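A brief sketch of Equation (8.1), together with the fact that the maximum likelihood estimate of $\theta$ (the mean) is the sample average. The parameter value and sample size below are arbitrary illustrations, not data from the chapter.

```python
import math
import random

def exp_cdf(y, theta):
    """Pr(Y <= y) = 1 - exp(-y / theta) for the exponential distribution."""
    return 1.0 - math.exp(-y / theta) if y >= 0 else 0.0

# The ML estimate of theta is simply the sample mean
random.seed(1)
theta_true = 3.0
sample = [random.expovariate(1.0 / theta_true) for _ in range(20000)]
theta_hat = sum(sample) / len(sample)
```

For any exponential distribution, evaluating the cdf at $y = \theta$ gives $1 - e^{-1} \approx 0.632$, a handy sanity check on a fitted model.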


URL:

https://www.sciencedirect.com/science/article/pii/S0076695X08602586

Sports Prediction Models

Robert Kissell, Jim Poserina, in Optimal Sports Math, Statistics, and Fantasy, 2017

5.6 Logit Spread Model

Description

The logit spread model is a probability model that predicts the home team victory margin based on an inferred team rating metric and the home team winning margin. The model transforms the home team victory margin to a probability value between zero and one, and the model can then be solved via logit regression analysis.

The inferred team "ratings" are solved via a logit linear regression model. This technique was described in Chapter 3, Probability Models.

Model Form

The logit spread model has the following form:

$y^* = b_0 + b_h - b_a$

where b 0 denotes a home field advantage parameter, b h denotes the home team parameter value, and b a denotes the away team parameter value.

The left-hand side of the equation, $y^*$, is derived from the cumulative distribution function of the victory margin via a log-odds transformation and is computed as follows:

Step 1:

s i =Home team victory margin in game i.

($s_i > 0$ indicates the home team won the game, $s_i < 0$ indicates the home team lost the game, and $s_i = 0$ indicates the game ended in a tie).

Step 2:

Compute average home team victory margin, s ¯ , across all games.

Step 3:

Compute standard deviation of home team victory margin, σ s , across all games.

Step 4:

Compute the z-score of each spread, $z_i = (s_i - \bar{s}) / \sigma_s$.

Step 5:

Compute the cumulative probability corresponding to $z_i$ as $F(z_i)$. This ensures the values of $F(z_i)$ will be between 0 and 1. This can be computed via Excel or MATLAB in the following manner:

Excel: $F(z_i) = \mathrm{normsdist}(z_i)$

MATLAB: $F(z_i) = \mathrm{normcdf}(z_i)$

Step 6:

Compute the success ratio y * as follows:

$y^* = \frac{F(z_i)}{1 - F(z_i)}$

Step 7:

Take the log transformation of y * as follows:

$y = \log(y^*) = \log\left(\frac{F(z_i)}{1 - F(z_i)}\right)$
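Steps 1-7 can be collected into a short routine. This is a sketch with hypothetical victory margins; Python's `NormalDist` plays the role of the Excel/MATLAB functions named above.

```python
import math
import statistics
from statistics import NormalDist

def logit_spread_targets(spreads):
    """Steps 1-7: transform home-team victory margins into logit targets y."""
    s_bar = statistics.mean(spreads)            # Step 2: average margin
    sigma_s = statistics.pstdev(spreads)        # Step 3: standard deviation
    nd = NormalDist()
    ys = []
    for s_i in spreads:
        z_i = (s_i - s_bar) / sigma_s           # Step 4: z-score
        f_zi = nd.cdf(z_i)                      # Step 5: normsdist / normcdf
        ys.append(math.log(f_zi / (1 - f_zi)))  # Steps 6-7: log-odds
    return ys

spreads = [7, -3, 10, 2, -6, 14, -1, 4]         # hypothetical margins
ys = logit_spread_targets(spreads)
```

Margins above the average map to positive $y$ and margins below it to negative $y$, so the transformed values preserve the ordering of the original spreads while living on an unbounded scale suitable for OLS.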

Solving the Model

The team rating parameters b k are then estimated via OLS regression analysis on the following equation:

$y = \hat{b}_0 + \hat{b}_h - \hat{b}_a$

Analysts need to perform an analysis of the regression results including evaluation of R 2 , t-stats, F-value, and analysis of the error term.

Estimating Home Team Winning Spread

The home team expected winning spread is calculated as follows:

$y = \hat{b}_0 + \hat{b}_h - \hat{b}_a$

From the estimated y we determine the normalized z-score z for home team winning spread as follows:

$z = F^{-1}\left(\frac{e^y}{1 + e^y}\right)$

where $F^{-1}$ represents the inverse of the standard normal cumulative distribution function, applied to the logistic transform of the home team winning spread.

This can be computed via Excel or MATLAB functions as follows:

Excel: $z = \mathrm{normsinv}\left(\frac{e^y}{1 + e^y}\right)$

MATLAB: $z = \mathrm{norminv}\left(\frac{e^y}{1 + e^y}\right)$

Finally, the expected home team winning spread is calculated as follows:

$\hat{s} = z \cdot \sigma_s + \bar{s}$

where

$\bar{s}$ is the average home team winning spread

$\sigma_s$ is the standard deviation of the home team winning spread

Estimating Probability

The corresponding probability that the home team will win the game can be determined via a second regression analysis of the actual spread as a function of the estimated spread $\hat{s}$. The model has the form:

Actual Spread = c 0 + c 1 · s ˆ + Error

Solution of this regression will provide model parameters c ˆ 0 and c ˆ 1 and regression error term Error .

After solving the second regression model and determining the regression error term Error , we can compute the probability that the home team will win the game.

It can be computed directly from the Excel or MATLAB functions as follows:

In Excel:

$p = 1 - \mathrm{normdist}(0, \hat{s}, \mathrm{Error}, \mathrm{True})$

In MATLAB:

$p = 1 - \mathrm{normcdf}(0, \hat{s}, \mathrm{Error})$

If $p > 0.50$ then the home team is predicted to win the game. If $p < 0.50$ then the away team is expected to win the game. If $p = 0.50$ then the game is expected to end in a tie.
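The prediction steps can be sketched end to end. All parameter values below are hypothetical fitted values, not estimates from real data, and `NormalDist` stands in for the Excel/MATLAB functions above.

```python
from math import exp
from statistics import NormalDist

def predict_home_outcome(b0, b_home, b_away, s_bar, sigma_s, reg_err):
    """Turn fitted ratings into an expected spread and home-win probability.
    reg_err is the standard error of the second (spread) regression."""
    nd = NormalDist()
    y = b0 + b_home - b_away                  # model equation
    z = nd.inv_cdf(exp(y) / (1 + exp(y)))     # normsinv / norminv step
    s_hat = z * sigma_s + s_bar               # expected winning spread
    # Pr(home win) = Pr(actual spread > 0) under N(s_hat, reg_err^2)
    p = 1 - NormalDist(mu=s_hat, sigma=reg_err).cdf(0.0)
    return s_hat, p

s_hat, p = predict_home_outcome(b0=0.15, b_home=0.80, b_away=0.30,
                                s_bar=3.0, sigma_s=12.0, reg_err=11.0)
```

With a positive rating difference plus home advantage, the expected spread is positive and the home-win probability exceeds 0.50, consistent with the decision rule above.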


URL:

https://www.sciencedirect.com/science/article/pii/B9780128051634000050

Statistics, Foundations

D.A.S. Fraser, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

IV Statistical Model

A basic statistical model is a probability model with free parameters to allow a range of possibilities, one of which leads to a model that provides a reasonable approximation to a process or system being investigated. For example, in some context the normal $(\mu, \sigma^2)$ mentioned in Section III could be very appropriate. Also, the die example in Section III might be generalized to have probabilities $p_1, p_2, p_3, p_4, p_5, p_6$ at the six sample points 1, 2, 3, 4, 5, 6, where necessarily $p_i \ge 0$ and $\sum p_i = 1$, these being a consequence of probabilities viewed as proportions in a large aggregate.

Thus a general version of the basic model would have a space $S$, a collection $\mathcal{A}$ of subsets, and a probability measure $P(A; \theta)$ for each subset $A$ in $\mathcal{A}$. The probability measure includes a free parameter $\theta$ in a space $\Omega$ that allows a range of possibilities for the measure; in an application it would be implicit that some one value of $\theta$ provided probabilities that closely approximated the behavior in the application.

The basic model as just described is concerned with frequencies or proportions on a space of possibilities. It does not distinguish component elements of structure or the practical significance of the various components. This basic model has been central to statistics throughout the 20th century.

Some more-structured models have been proposed. For example, in the measurement context one could model the measurement errors as $z_1, \ldots, z_n$ from some error distribution $f(z)$. The error distribution could be the standard normal in Fig. 2 or it could be one of the Student distributions with parameter value equal to, say, 6, which provides more realistic, longer tails. The statistical model would then be $y_i = \mu + \sigma z_i$, where $\mu$ and $\sigma$ represent the response location and scaling. This alternative, more detailed model makes the process of statistical inference more straightforward and less arbitrary. For some details see Fraser (1979).

This raises the more general question of what further elements of structure from applications should reasonably be included in the statistical model. It also raises the question as to what modifications might arise in the statistical analyses using the more detailed or more specific models.

One modeling question was raised by Cox (1958). The context involved two measuring instruments, $I_1$ producing a measurement $y$ that is normal $(\theta, \sigma_1^2)$, and $I_2$ producing a measurement $y$ that is normal $(\theta, \sigma_2^2)$. A coin is tossed and, depending on whether it comes up heads or tails, we have $i = 1$ or $2$ with probability 1/2 each; the corresponding instrument $I_i$ is used to measure $\theta$. The standard modeling view would use the model $f(i, y; \theta) = (1/2)\, g(y - \theta; \sigma_i^2)$, where $g(z; \sigma^2)$ designates a normal density with mean 0 and variance $\sigma^2$. In an application the data would be $(i, y)$ with model $f(i, y; \theta)$. But clearly when the measurement is made the instrument that is used is known. This suggests that the model should be $g(y - \theta; \sigma_1^2)$ if $i = 1$ and $g(y - \theta; \sigma_2^2)$ if $i = 2$; this can be viewed as the conditional model, given the value of the indicator $i$.

Traditional statistical theory would give quite different results for the first model than for the second model. A global model may be appropriate for certain statistical calculations. But in an application when one knows how the measurement of θ is obtained, it seems quite unrealistic for the purposes of inference to suggest that the instrument might have been different and that this alternative should be factored into the calculations.


URL:

https://www.sciencedirect.com/science/article/pii/B0122274105007304

Continuous-Time Markov Chains

Sheldon Ross, in Introduction to Probability Models (Eleventh Edition), 2014

6.1 Introduction

In this chapter we consider a class of probability models that has a wide variety of applications in the real world. The members of this class are the continuous-time analogs of the Markov chains of Chapter 4 and as such are characterized by the Markovian property that, given the present state, the future is independent of the past.

One example of a continuous-time Markov chain has already been met. This is the Poisson process of Chapter 5. For if we let the total number of arrivals by time t (that is, N ( t ) ) be the state of the process at time t , then the Poisson process is a continuous-time Markov chain having states 0 , 1 , 2 , that always proceeds from state n to state n + 1 , where n 0 . Such a process is known as a pure birth process since when a transition occurs the state of the system is always increased by one. More generally, an exponential model that can go (in one transition) only from state n to either state n - 1 or state n + 1 is called a birth and death model. For such a model, transitions from state n to state n + 1 are designated as births, and those from n to n - 1 as deaths. Birth and death models have wide applicability in the study of biological systems and in the study of waiting line systems in which the state represents the number of customers in the system. These models will be studied extensively in this chapter.
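A birth and death process as described can be simulated directly from its exponential holding times. This is a sketch with arbitrary rates and horizon; with the death rate set to zero it reduces to the pure birth (Poisson) case, whose state at time $t$ is Poisson with mean (birth rate) × $t$.

```python
import random

def simulate_birth_death(birth_rate, death_rate, t_end, state=0, seed=42):
    """Simulate a birth-death chain: from state n, move to n+1 at rate
    birth_rate or (if n > 0) to n-1 at rate death_rate; holding times
    in each state are exponential with the total transition rate."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        total_rate = birth_rate + (death_rate if state > 0 else 0.0)
        t += rng.expovariate(total_rate)          # exponential holding time
        if t > t_end:
            return state                          # state occupied at t_end
        if state > 0 and rng.random() < death_rate / total_rate:
            state -= 1                            # death
        else:
            state += 1                            # birth

# With death_rate = 0 this is a pure birth (Poisson) process
final_state = simulate_birth_death(birth_rate=2.0, death_rate=0.0, t_end=10.0)
```

Averaging the pure-birth final state over many runs recovers the Poisson mean $\lambda t = 2.0 \times 10 = 20$, consistent with the Poisson-process example in the text.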

In Section 6.2 we define continuous-time Markov chains and then relate them to the discrete-time Markov chains of Chapter 4. In Section 6.3 we consider birth and death processes and in Section 6.4 we derive two sets of differential equations—the forward and backward equations—that describe the probability laws for the system. The material in Section 6.5 is concerned with determining the limiting (or long-run) probabilities connected with a continuous-time Markov chain. In Section 6.6 we consider the topic of time reversibility. We show that all birth and death processes are time reversible, and then illustrate the importance of this observation to queueing systems. In the final section we show how to "uniformize" Markov chains, a technique useful for numerical computations.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124079489000062