November 3, 2009
I, Charles Esson, declare that this thesis titled, ‘Transitive Closure, In-
breeding Coefficients, Pedigrees, on line Collection and on line Presentation’
and the work presented in it are my own. I confirm that this work has not
been submitted for the award of any other degree or diploma in any tertiary
institution.
Signed:
Date:
Abstract
An on-line system was developed to collect and display pedigree data, test
the developed algorithms and present the results. The system uses a
transitive closure, minimum path length, maximum path length and an
adjacency list to deliver inbreeding coefficients, pedigree trees, descendant
trees and enforcement of pedigree constraints in real time. Real time methods
were developed to display an animal’s pedigree and descendants, to calculate
the inbreeding coefficient, to enforce the constraints and to incrementally
maintain the maximum path length between nodes, along with improved routines
to incrementally insert and delete edges from a transitive closure.
Acknowledgements
Thank you:
Trish Esson: for the large data set used for testing and for putting up with me.
Andrew James: for the second data set.
Greg Simmons: for supervision, encouragement and assistance.
Sudanthi Wijewickrema: for proof reading.
List of Figures
2.1 Heritability
2.2 Example pedigree
List of Tables
Contents
1 Introduction
1.1 Previous work
1.2 Methodology
1.3 Thesis structure
2 Genetics
2.1 Early history
2.2 Two alleles
2.3 Quantitative genetics
2.4 Genotype and phenotype
2.5 Heritability
2.5.1 Children phenotype - Parent phenotype
2.5.2 Breeding value - Parent phenotype
2.5.3 Additive genetic effect
2.5.4 Correlation
2.6 Ordinary least squares
2.7 Generalized least squares
2.8 Best linear unbiased prediction
2.9 Predicting animal breeding values
2.10 Within herd estimated breeding values
2.10.1 No members in a group
2.11 Relationship matrix
2.11.1 Calculating the relationship matrix
2.12 Calculating the inbreeding coefficient
2.13 Incrementally maintaining A^{-1}
2.14 Conclusion
3.4 Incremental maintenance of a transitive closure
3.4.1 Insertion of an edge into a transitive closure
3.4.2 Deletion of an edge from a transitive closure
3.4.3 Three self joins of “TRUSTY” to recover all paths
3.4.4 Fragments
3.4.5 The shortest and longest path
3.4.6 Building a transitive closure from a depth first search
3.4.7 Alternative to using a transitive closure
3.4.8 Building the transitive closure one herd at a time
3.5 Conclusion
References
Appendices
A Code Structure
A.1 Input values and return values
A.2 Error codes
A.3 The table
A.3.1 base
A.3.2 db string
A.3.3 dbt string
A.3.4 strings
A.3.5 data convert
A.3.6 type
A.3.7 logic
A.3.8 heading string
A.3.9 location
A.4 Data entry and display
A.4.1 input to db
A.4.2 test db change
A.4.3 get data
A.4.4 output from db
B Ajax
B.1 File upload progress
B.2 Yahoo user interface; selecting fields to display
C The Application
C.1 Overview
C.2 Login and logoff
C.3 The herd screens
C.3.1 Herd edit screen
C.3.2 Herd setup screen
C.4 Animal screens
C.4.1 Public animal setup screen
C.4.2 Animal setup screen
C.5 Detail screen
C.5.1 Detail setup screen
C.6 Upload screens
C.6.1 Upload file structure
C.6.2 Upload sequence
C.7 Download files
D Linear Regression
D.1 Calculus
D.2 Statistics
D.3 Normal equation
Chapter 1
Introduction
In 2005 the Australian cashmere industry received funding from the Rural
Industries Research and Development Corporation (RIRDC) for a sire
referencing scheme. Further funding was granted to develop a method to
generate estimated breeding values (EBV) across the entire industry. This
project resulted in a batch solution developed by Andrew T. James (James,
2009). The industry now feels the EBV project is at a stage where it is
appropriate to move to on line collection and presentation of pedigree and
phenotype data, direct calculation of some results and display of additional
results from the batch process. An ideal system would present inbreeding
coefficients and within herd estimated breeding values without running the
batch process.
This research effort aims to develop methods that are fast enough to
meet the ideal design goals. An online system has been constructed and the
developed methods implemented to make sure performance is acceptable.
Testing showed that for the large industry wide herd, incrementally
maintaining the pedigree transitive closure in real time was too slow. Methods
were therefore developed to speed up the algorithms. The final solution to
this problem was the maintenance of a transitive closure for each user’s herd,
and the use of a batch process to maintain an industry wide transitive closure.
Methods to do this and decide which data structure to use for display were
developed.
1.2 Methodology
The design science paradigm seeks to extend the boundaries of human and
organizational capabilities by creating new and innovative artifacts. It ad-
dresses research through the building and evaluation of artifacts to meet
identified needs (Hevner, March, Park, & Ram, 2004). The industry’s
identified needs for the artifact were:
1.3 Thesis structure
To understand Merrrit and how the design goals were reached, several
concepts need to be understood and in some cases extended. Some chapters
introduce new material, others look at implementation details.
Chapter 4 takes the results from chapter 3 and presents the SQL code to
implement the edge insertion and deletion solutions.
Chapter 2
Genetics
The industry is interested in identifying animals that have the potential
to increase the economic returns quickly. Calculating accurate estimated
breeding values (EBV) and encouraging their use across the cashmere indus-
try is the underlying aim of this and previous projects. To identify the best
animal across multiple herds, we need to use mathematics that remove herd
specific effects. The mathematics to do this was developed in the 1960s and
1970s (Henderson, 1975). The techniques developed require an industry wide
pedigree and matings across herd boundaries (the sire reference scheme).
When developing breeding strategies, within herd EBVs are useful to the
growers. EBVs that take into account the performance of ancestors and
descendants are better at estimating an animal’s genetic merit than EBVs
based on the animal’s phenotype records alone.
Inbreeding coefficients are an indication of how closely individuals are
related and are used by breeders to control their inbreeding and out-breeding
programs.
This chapter takes a quick look at the history of genetics, the basic ideas
behind quantitative genetics and then tries to give some insight into
heritability. Sections 2.6, 2.7 and 2.8 build towards understanding best
linear unbiased prediction (BLUP), a method that uses phenotype records from
related animals to better predict an animal’s breeding value. Ordinary least squares
assumes there is no relationship between the residual errors, generalized least
squares assumes the events are independent but there is a known relationship
between the residuals, and best linear unbiased prediction assumes there is
a known relationship between events. When predicting breeding values an
event is an animal, and the relationship between events is controlled by the
genes passed from one generation to the next.
2.2 Two alleles
An allele is one of the alternative DNA sequences at the same physical gene
locus. A diploid organism (many plants are not diploids) inherits one
sequence from one parent and another sequence from the other. A diploid
organism carries two alleles, but only passes one of the inherited alleles to
its offspring. In the second simplest case, there are only two different
alleles in the population, with these being expressed as three possible
combinations.

Genotype    AA     Aa     aa
Frequency   p^2    2pq    q^2

Table 2.1: Hardy-Weinberg law
If the gene frequency of allele type A is p and the frequency of type a is
q, then the frequencies of the various combinations are as given in table 2.1.
To see this, forget about how the alleles are carried by the organism and
consider the probability of selecting AA, aa and Aa from a large population
(this is Hardy-Weinberg’s law restated).
The probability of selecting your first A is p, the probability of selecting
the second A is also p, and therefore the probability of selecting two As in
a row is $p^2$. There are two ways to select Aa (if the order doesn’t matter),
the probability of either is pq, so the probability of an Aa outcome is 2pq.
Similarly the probability of picking two a alleles in a row is $q^2$.
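The frequencies in table 2.1 can be checked numerically. The following is a minimal sketch, assuming only two alleles so that q = 1 - p; the function name `genotype_frequencies` is hypothetical and used for illustration only.

```python
def genotype_frequencies(p):
    """Return the Hardy-Weinberg frequencies (AA, Aa, aa) for allele frequency p of A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    q = 1.0 - p  # assumption: only two alleles exist, so q = 1 - p
    # AA needs two A selections, aa two a selections, Aa can occur in two orders.
    return p * p, 2.0 * p * q, q * q


freq_AA, freq_Aa, freq_aa = genotype_frequencies(0.7)
print(freq_AA, freq_Aa, freq_aa)
```

The three values always sum to one because they are the expansion of $(p + q)^2$.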
\[ (p_1+p_2+p_3+p_4)^2 = p_1^2+p_2^2+p_3^2+p_4^2+2p_1p_2+2p_1p_3+2p_1p_4+2p_2p_3+2p_2p_4+2p_3p_4 \tag{2.1} \]
If there are i alleles the frequency of the combinations is given by:
\[ (p_1 + \cdots + p_i)^2 \tag{2.2} \]
If there are two gene locations, the frequency of the combinations is given
by:
\[ ((p_1 + \cdots + p_i)^2 + (q_1 + \cdots + q_j)^2)^2 \tag{2.3} \]
Formula 2.3 can easily be extended to k gene locations. As the number
of different alleles increases the number of possible outcomes explodes, well
separated classes disappear and the desire to think in terms of alleles fades.
A trait that doesn’t have well separated classes is called a quantitative
trait. Quantitative traits tend to be normally distributed. To observe this
phenomenon, let us simplify the situation and pretend that we have a trait
that is affected by many gene locations. Let us further assume that the
probability of getting an allele that improves the outcome is p and is equal for
all locations and that the incremental improvements add up. This describes
a Bernoulli trial (Rozanov, 1964). Each animal is a separate event, and the
herd is the trial.
If the number of gene locations is n and the number of selections that
contributed positively to the outcome is k, the probability for a particular
sequence is:
The number of ways we can select the k positive outcomes is given by:
\[ C_k^n = \binom{n}{k} = \frac{n!}{k!(n-k)!} \tag{2.5} \]
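A short numerical sketch may help show why such a trait drifts towards a bell shape. It assumes the standard Bernoulli result that one particular sequence with k improving alleles out of n has probability $p^k (1-p)^{n-k}$; the function names and the values n = 20, p = 0.5 are illustrative assumptions, not taken from the thesis.

```python
from math import comb


def sequence_probability(n, k, p):
    """Probability of one particular sequence with k improving alleles out of n."""
    return p ** k * (1.0 - p) ** (n - k)


def probability_of_k(n, k, p):
    """Probability of any sequence with k improving alleles: C(n, k) sequences in total."""
    return comb(n, k) * sequence_probability(n, k, p)


# Illustrative values only: 20 gene locations, each improving with probability 0.5.
n, p = 20, 0.5
distribution = [probability_of_k(n, k, p) for k in range(n + 1)]
for k, prob in enumerate(distribution):
    print(k, round(prob, 4))
```

With equal probabilities the distribution peaks at k = n/2 and falls away symmetrically on either side.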
As well as additive effects, the genotype includes dominance effects and
epistatic effects. These can’t be measured, so they are taken out of the
genotype and, along with the measurement errors, lumped together as a
residual to get:
P (phenotype) = A(additive genetic effects) + E(environment) + e (2.8)
An equation to predict an animal’s phenotype could be written:
\[ z_i\,(\text{phenotype predicted}) = E\,(\text{environment}) + h^2 x_i\,(\text{additive genetic effects}) + e_i \tag{2.9} \]
The formula $h^2 x_i = A$ uses the animal’s phenotype record to predict its
breeding value. $x_i$ is the deviation from the herd mean phenotype, and $e_i$
is the difference between the predicted breeding value and the actual
breeding value.
Best linear unbiased prediction (discussed in section 2.8) uses the relatives’
phenotype records to produce a predicted breeding value that is closer to
the real breeding value.
2.5 Heritability
Geneticists talk about broad heritability ($H^2$) and narrow heritability
($h^2$). We can’t measure broad heritability so in a practical sense it is of
no interest. There are several definitions of narrow heritability and
investigating how they are related helps us understand heritability and
breeding values. We will start the exploration with an experiment that can be
performed.
[Figure 2.1: $b_{O,P} = \frac{1}{2}h^2$. Plot of offspring phenotype (axis marked 0 to +75) against parent phenotype (axis marked 0 to +500).]
Where $b_{O,P}$ is the gradient of the least squares line of best fit (regression
coefficient) of the parents’ phenotype against the offspring’s phenotype. If:
The phenotype measurement is the sum of the inherited attributes and the
environment (see equation 2.7).
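\[ P = A + E \tag{2.18} \]
\[ \sigma_{[(x+y),(w+z)]} = \sigma_{(x,w)} + \sigma_{(y,w)} + \sigma_{(x,z)} + \sigma_{(y,z)} \tag{2.19} \]
(Lynch & Walsh, 1997, equation 3.10g)
\[ \sigma_{A,P} = \sigma_{[A,(A+E)]} = \sigma_A^2 + \sigma_{A,E} \tag{2.20} \]
If the environment and genetic random variables are orthogonal $\sigma_{A,E}$ will be
zero and: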
2.5.4 Correlation
The correlation between an animal’s adjusted phenotype and its breeding
value is equal to the square root of its heritability. This leads to some
interesting formula manipulation but provides little insight.
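\begin{align*}
\sqrt{h^2} &= \sqrt{\frac{\sigma_A^2}{\sigma_P^2}} && \text{see equation (2.16)} \\
           &= \frac{\sqrt{\sigma_A^2}}{\sqrt{\sigma_P^2}} \\
h          &= \frac{\sigma_A}{\sigma_P} \\
r_{PA}     &= \frac{\sigma_A^2}{\sigma_P \sigma_A} \\
           &= \frac{\sigma_A \sigma_A}{\sigma_P \sigma_A} \\
           &= \frac{\sigma_A}{\sigma_P} \\
           &= h
\end{align*}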
2.6 Ordinary least squares
Ordinary least squares (OLS) looks for a vector β to linearly combine
measured estimators to provide an estimate, that is:
β gets a hat because the result is the predicted β that results from the data
set y and X not the actual β. The predicted value is:
\[ \hat{y} = X\hat{\beta} \tag{2.25} \]
\[ \mathrm{var}(\epsilon) = R\sigma^2 \tag{2.26} \]
the equation with the inverse of the square root of R, where $R^{1/2}R^{1/2} = R$.
Faraway, 2002 used the Cholesky decomposition where $SS^T = R$, with 2.22
being transformed using $S^{-1}$:
The transform works if the residual variances no longer have a linear rela-
tionship.
Replacing X and y in equation 2.24 with the transformed values gives the
generalized least squares (GLS) solution:
next event in a time series, it is obvious that the events are measurements of
points in the series. When used with genetics an event is a generation.
The value predicted for the next event, based on the knowledge of a
disturbance, or the knowledge of the relationship between the animals, is
obtained by minimizing the residual variance. In other words we are assum-
ing the new sample will not increase the variance. The effect that animal
relationships have on the variance is known and is discussed in section 2.11.
The BLUP is easier to follow if it is separated from complex family
relationships. We consider a signal affected by a known disturbance. The
following derivation is based on Goldberger, 1962. In the following
derivation (as with the rest of the thesis), lowercase bold symbols are used
to represent vectors, matrices are represented by upper case bold symbols
and values are typeset normally.
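The predicted value for the new event is:
\[ \mathring{y} = (\mathring{x}^T \beta) + \mathring{\epsilon} \tag{2.30} \]
The aim is to find the correlation between $\mathring{\epsilon}$ and $\epsilon$. We are hoping that:
\[ E(\mathring{\epsilon}) = 0 \tag{2.31} \]
\[ E(\epsilon\mathring{\epsilon}) = \omega \tag{2.32} \]
We need a vector of constants that projects y (the vector of all previous
dependent measurements) to a predicted value. That is:
\[ p = \hat{c}^T y \tag{2.33} \]
Recall that the vector of y values can be written:
\[ y = X\hat{\beta} + \epsilon \tag{2.34} \]
And that:
\[ E(\epsilon) = 0 \tag{2.35} \]
And let:
\[ E(\epsilon\epsilon^T) = V \tag{2.36} \]
We want a solution that minimizes:
\[ \sigma_p^2 = (p - \mathring{y})(p - \mathring{y})^T \tag{2.37} \]
Subject to:
\[ E(p - \mathring{y}) = 0 \tag{2.38} \]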
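Using 2.34, equation 2.33 can be written:
\[ p = \hat{c}^T X\hat{\beta} + \hat{c}^T \epsilon \tag{2.39} \]
\[ p - \mathring{y} = (\hat{c}^T X - \mathring{x}^T)\mathring{\beta} + \hat{c}^T \epsilon - \mathring{\epsilon} \tag{2.40} \]
\[ \hat{c}^T X - \mathring{x}^T = 0 \tag{2.41} \]
\[ p - \mathring{y} = \hat{c}^T \epsilon - \mathring{\epsilon} \tag{2.42} \]
And
Finding the partial derivatives of 2.44, setting them to zero and solving for
$\hat{c}$ gives us:
\[ \hat{c} = V^{-1}X(X^T V^{-1} X)^{-1}\mathring{x} + V^{-1}\left(I - X(X^T V^{-1} X)^{-1}X^T V^{-1}\right)\omega \tag{2.45} \]
\begin{align*}
p &= \hat{c}^T y \\
  &= \mathring{x}^T (X^T V^{-1} X)^{-1} X^T V^{-1} y + \omega^T V^{-1} y - \omega^T V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1} y \tag{2.46}
\end{align*}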
Looking back to section 2.7 (generalized least squares) we see that:
\[ y = X\hat{\beta} + Z\hat{a} + e \tag{2.50} \]
X is a design matrix for the fixed effects: that is, the effects that are not
influenced by additive genetic effects. Z is the design matrix used to select
breeding values (the animal). If there is only one record per animal, Z will
be the identity matrix. The residuals left over when the random effects are
removed are contained in the vector e.
If there are two fixed effects, say a doe can be on farm 1 or 2, the prediction
equation for the phenotype of doe i is:
If the animal is on farm 1 then $x_{i0}$ will be one and $x_{i1}$ will be zero.
The variance and covariances defined from the animal model with a desire
for some simplicity are:
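G is a symmetric matrix because A is symmetric. From the definitions above
$\mathrm{var}(\epsilon)$ can be calculated:
\begin{align*}
\mathrm{var}(\epsilon) &= V \\
  &= \mathrm{var}(Za + e) && \text{(see 2.50)} \\
  &= Z\,\mathrm{var}(a)Z^T + \mathrm{var}(e) + \mathrm{cov}(Za, e) + \mathrm{cov}(e, Za) \\
  &= Z\,\mathrm{var}(a)Z^T + \mathrm{var}(e) + Z\,\mathrm{cov}(a, e) + \mathrm{cov}(e, a)Z^T \\
  &= (ZGZ^T + R)\sigma^2 \tag{2.53}
\end{align*}
The covariance we are putting the effort into extracting is $\mathrm{cov}(\epsilon, \mathring{\epsilon})$.
\[ \mathrm{cov}(\epsilon, \mathring{\epsilon}) = \omega = z^T G Z \tag{2.54} \]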
Which is the formula given by Henderson, 1975 as the best linear unbiased
predictor of y given G and R.
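In 1950 Henderson provided a set of equations to simultaneously find $\hat{a}$
and $\hat{\beta}$. These are now referred to as the mixed model equations.
\[
\begin{bmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + G^{-1} \end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X^T R^{-1} y \\ Z^T R^{-1} y \end{bmatrix}
\tag{2.56}
\]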
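\begin{align*}
\frac{\sigma_A^2}{\sigma_P^2} &= h^2 && \text{(see equation 2.16)} \\
\sigma_E^2 &= \sigma_P^2 - \sigma_A^2 \\
\frac{\sigma_E^2}{\sigma_P^2} &= \frac{\sigma_P^2 - \sigma_A^2}{\sigma_P^2} \\
\frac{\sigma_E^2}{\sigma_P^2} &= 1 - h^2 \\
\frac{\sigma_E^2}{\sigma_A^2} &= \frac{\sigma_E^2/\sigma_P^2}{\sigma_A^2/\sigma_P^2} \\
\frac{\sigma_E^2}{\sigma_A^2} &= \frac{1 - h^2}{h^2}
\end{align*}
and equation 2.57 can be written:
\[
\begin{bmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \frac{1-h^2}{h^2} A^{-1} \end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X^T y \\ Z^T y \end{bmatrix}
\tag{2.58}
\]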
cells contain the inbreeding coefficients and the off diagonal cells contain the
relationship value between the two animals represented by the intersection
of the column and row.
The last animal to be added to the pedigree can take the last row in A
and as a result the last row in X and Z.
The left matrix in equation 2.58 can be written as:
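\[
\begin{bmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \frac{1-h^2}{h^2} A^{-1} \end{bmatrix}
=
\begin{bmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ 0 & \frac{1-h^2}{h^2} A^{-1} \end{bmatrix}
\tag{2.59}
\]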
The diagonal elements of the sub matrix $X^T X$ count the number of
animals in each group and the off diagonal elements the number in each pair
of groups. If the number of groups doesn’t change, the size of this sub matrix
remains constant and, to update it, one is added to the appropriate counts.
The number of columns in the sub matrix $Z^T X$ is equal to the number
of groups and the number of rows is equal to the number of animals. When
we add an animal to the end of the matrix, it grows by a row. One is placed
in a column if the animal belongs to the group in that column, zero in all
others. The matrix $X^T Z$ is the transpose of $Z^T X$.
The sub matrix $Z^T Z$ is a matrix that contains a diagonal entry if a
phenotype record is available for the animal and zero if not.
Excluding $A^{-1}$ from the consideration, updating the left hand side of 2.58
involves the addition of a row and column to a large matrix, and a little bit
of work in the top left hand corner to get the counts in order.
Matrix $A^{-1}$ is added to the bottom right hand corner. The incremental
updating of $A^{-1}$ is discussed in section 2.13.
Considering now the right hand side: $X^T y$ contains an entry per group,
the sum of all y values for the group. Adding the new value to the relevant
summation will update these values. $Z^T y$ selects the appropriate y value for
a particular row in the equation set.
When adding the row to the equations we get the last value on the right
hand side correct if we set it to the phenotype record, and zero if there is no
phenotype.
While it’s possible to incrementally set up the BLUP equations, we are
left with a set of simultaneous equations to solve, the number equaling the
number of animals in the herd plus the number of groups.
Providing a screen for the user to initiate the estimated breeding value
calculation and using AJAX to provide the user with progress details is the
solution. If we are going to solve the simultaneous equations in this manner,
we may as well set up the equations from scratch, and avoid incrementally
updating them. To build the equations from scratch, the animals must be sorted in
the order they entered the pedigree. The transitive closure and maximum
path length can be used to put the animals in the correct order. We arrange
the animals from those with the longest maximum path length between the
animal and its latest descendant to those with the shortest.
[Figure 2.2: Example pedigree.]
Ordering of animals
Calculation of the coefficients requires that the matrix be arranged from
the earliest generation to the most recent (parents precede their offspring).
This could be done by arranging the records from earliest birth date to most
recent, if all birth dates are known. If the maximum path depth is maintained,
the list can run from records with the longest path to a descendant to those
with the shortest path. For example: an animal with no descendants has a
longest path of zero, while a grandparent whose grand-offspring have no
offspring has a longest path of two. Table 2.2 gives the longest path
for the example pedigree given in figure 2.2. Note that animal two has a short
path of one edge to animal six and a long path of three edges to animal six.
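The ordering rule can be sketched as follows. This is a minimal illustration, not the application's code: the pedigree dictionary is an assumption reconstructed from the worked example (animal 3 has sire 1 and dam 2) and the Mrode (1996) example pedigree, and the function names are hypothetical.

```python
from functools import lru_cache

# Assumed example pedigree: animal -> (sire, dam); None marks an unknown parent.
PEDIGREE = {
    1: (None, None),
    2: (None, None),
    3: (1, 2),
    4: (1, None),
    5: (4, 3),
    6: (5, 2),
}


def longest_paths(pedigree):
    """Return, for every animal, the longest path (in edges) to any descendant."""
    children = {animal: [] for animal in pedigree}
    for animal, parents in pedigree.items():
        for parent in parents:
            if parent is not None:
                children[parent].append(animal)

    @lru_cache(maxsize=None)
    def depth(animal):
        # An animal without offspring has a longest path of zero.
        return max((1 + depth(child) for child in children[animal]), default=0)

    return {animal: depth(animal) for animal in pedigree}


def pedigree_order(pedigree):
    """Sort animals from the longest path to the shortest, so parents come first."""
    depths = longest_paths(pedigree)
    return sorted(pedigree, key=lambda animal: (-depths[animal], animal))


print(longest_paths(PEDIGREE))
print(pedigree_order(PEDIGREE))
```

A parent's longest path is always at least one greater than that of each of its offspring, which is why sorting on it places every parent ahead of its offspring without needing birth dates.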
breeding coefficient for an animal is equal to
$F_i = \frac{1}{2}a_{sd}$.
animal 1 2 3 4 5 6
1 1.00 0.00 0.50 0.50 0.50 0.25
2 0.00 1.00 0.50 0.00 0.25 0.625
3 0.50 0.50 1.00 0.25 0.625 0.563
4 0.50 0.00 0.25 1.00 0.625 0.313
5 0.50 0.25 0.625 0.625 1.125 0.688
6 0.25 0.625 0.563 0.313 0.688 1.125
Table 2.3: A matrix for example pedigree given in figure 2.2 (Mrode, 1996).
\begin{align*}
a_{uu} &= 0 \\
a_{11} &= 1 + 0.5(a_{sd}) = 1 + 0.5(a_{uu}) \\
       &= 1
\end{align*}
The entry for animal 3 (sire equals animal 1, dam equals animal 2) equals:
etc.
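The row by row calculation can be sketched in a few lines. This is an illustrative sketch only: it assumes the usual tabular rules (an off diagonal entry is half the sum of the entries for the animal's sire and dam, a diagonal entry is $1 + 0.5\,a_{sd}$), uses 0 for an unknown parent, and takes the pedigree from the Mrode (1996) example; the function name is hypothetical.

```python
# Assumed example pedigree: entry i-1 holds (sire, dam) of animal i; 0 = unknown.
PEDIGREE_LIST = [(0, 0), (0, 0), (1, 2), (1, 0), (4, 3), (5, 2)]


def relationship_matrix(pedigree):
    """Build A with the tabular method; animals must be ordered parents first."""
    n = len(pedigree)
    # Row and column 0 stand for the unknown parent and stay at zero.
    a = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        sire, dam = pedigree[i - 1]
        for j in range(1, i):
            # Relationship with an earlier animal: average over the two parents.
            a[i][j] = a[j][i] = 0.5 * (a[j][sire] + a[j][dam])
        # Diagonal: one plus half the relationship between the parents.
        a[i][i] = 1.0 + 0.5 * a[sire][dam]
    return a


A = relationship_matrix(PEDIGREE_LIST)
for row in A[1:]:
    print([round(value, 3) for value in row[1:]])
print("inbreeding of animal 5:", A[5][5] - 1.0)
```

The printed matrix can be compared with table 2.3; the inbreeding coefficient of an animal is its diagonal entry minus one.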
\[ A = TDT^T \tag{2.62} \]
\[ A = LL^T \tag{2.63} \]
That is, the entries in row i of L are $l_{i1}$ to $l_{ii}$, the entries in column i of
$L^T$ are $l_{i1}$ to $l_{ii}$. If we multiply row i with column i we end up with the sum
of the entries squared.
Putting it another way:
\begin{align*}
a_{11} &= l_{11}^2 \\
a_{22} &= l_{21}^2 + l_{22}^2 \\
a_{33} &= l_{31}^2 + l_{32}^2 + l_{33}^2
\end{align*}
etc.
The diagonal elements of A can be calculated using two vectors that have
a length equal to the number of animals in the pedigree. One vector contains
the diagonal element of A progressively calculated, and on completion all the
diagonal elements of A will be present. The other vector contains a current
column of L and the column changes as the calculation progresses.
Using the example pedigree introduced in section 2.12:
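\begin{align*}
l_{11} &= \sqrt{1.0 - 0.25(a_{ss} + a_{dd})} && \text{where } a_{ss} \text{ and } a_{dd} \text{ are unknown} \\
       &= \sqrt{1.0 - 0.25(0 + 0)} \\
       &= 1 \\
a_{11} &= l_{11}^2 \\
       &= 1^2 \\
       &= 1 \\
l_{21} &= 0.5(l_{sj} + l_{dj}) && \text{where } s \text{ and } d \text{ are unknown} \\
       &= 0 \\
l_{31} &= 0.5(l_{sj} + l_{dj}) && \text{where } s = 1 \text{ and } d = 2 \\
       &= 0.5(l_{11} + l_{21}) \\
       &= 0.5(1 + 0) = 0.5
\end{align*}
etc.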
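The two vector scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the thesis code: the pedigree is the Mrode (1996) example with 0 for an unknown parent, the animals are already ordered parents first, and the function name is hypothetical.

```python
from math import sqrt

# Assumed example pedigree: entry i-1 holds (sire, dam) of animal i; 0 = unknown.
PEDIGREE_LIST = [(0, 0), (0, 0), (1, 2), (1, 0), (4, 3), (5, 2)]


def relationship_diagonal(pedigree):
    """Return the diagonal of A using one column of L at a time."""
    n = len(pedigree)
    diagonal = [0.0] * (n + 1)  # index 0 is the unknown parent and stays zero
    for j in range(1, n + 1):
        column = [0.0] * (n + 1)  # current column j of L
        sire, dam = pedigree[j - 1]
        # Diagonal element of L; the parents' diagonals are complete by now.
        column[j] = sqrt(1.0 - 0.25 * (diagonal[sire] + diagonal[dam]))
        diagonal[j] += column[j] ** 2
        for i in range(j + 1, n + 1):
            sire, dam = pedigree[i - 1]
            # Entries below the diagonal average the parents' entries in column j.
            column[i] = 0.5 * (column[sire] + column[dam])
            diagonal[i] += column[i] ** 2
    return diagonal[1:]


print(relationship_diagonal(PEDIGREE_LIST))
```

Only two vectors of length n are held at any time, so the full matrix L is never stored.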
As mentioned above, the animals must be arranged from the first animal
in the pedigree to the last; this is dealt with further in section 3.4.5.
\[ A = TDT^T \tag{2.66} \]
2.14 Conclusion
Merrrit is being developed as a tool to increase the industry’s economic
performance. This section introduced the theory behind selection and the maths
needed to remove fixed effects that result from animals being born and bred
on different properties. To use the methods articulated we need the
pedigree. In the next section we look at representing the pedigree as a graph,
the transitive closure and the incremental maintenance of the transitive closure.
How the transitive closure is used by the application to meet the design goals
is left to chapter 5.
Chapter 3
The on line creation of pedigrees by herd owners is a primary project aim.
With multiple users creating the pedigree, it is important that the system
checks the integrity of the data as it is entered, and that it does not restrict
the order of data entry. For example, the parents of an animal must be born
before the offspring. If we allow animal data to be entered in any order, the
data for multiple offspring can be entered before the parent data. When the
parent data is entered we need to be able to find the birth date of the
first born offspring and make sure we are setting a reasonable date for the
parent’s birth.
In section 2.11.1, it was noted that we can calculate the inbreeding co-
efficient using an animal’s ancestors if they are arranged in the order they
appear in the pedigree. To perform the calculation, we need to be able to
find all descendants and arrange them in order.
The pedigree’s transitive closure is the foundation used to perform both
of these operations quickly. Putting the animals in the order they appear in
the pedigree requires us to maintain the maximum path length between the
animal and its descendants.
This chapter describes the transitive closure and the algorithms to
incrementally maintain the transitive closure. As performance suitable for the
on line maintenance of the pedigree is required, consideration is given to
methods that can be used to speed up the incremental maintenance of large
graphs (greater than ten thousand nodes).
The chapter also describes an algorithm to incrementally maintain the
minimum distance between graph nodes and describes how to extend previous
work so that the application can maintain the maximum distance between nodes.

From      To
Unknown   4
4         5
5         6
1         4
1         3
3         5
2         6

Table 3.1: Edges table for pedigree given in figure 2.2
cannot parent themselves and as only one strand of DNA can pass from a
parent to child, a mammal pedigree is a simple graph.
A directed acyclic graph is a directed graph with no directed cycles; that
is, for any vertex v , there is no nonempty directed path that starts and ends
on v . A pedigree is a directed acyclic graph as gene flow is from parents to
offspring and if you exclude time travel, your offspring can’t be your parents.
by selecting a particular parent. All ancestors can be found by selecting a
particular descendant.
The set of nodes that have paths that end at node a is given by:
The set of nodes with a path that start at node b is given by:
The inserted edge starts at a, so the set of nodes it starts at has a single
entry, a. Similarly, the inserted edge ends at b, so the set of nodes it ends
at is a one entry set containing node b.
The new paths created when the edge $(node_a, node_b)$ is added can be found
using the union of Cartesian products between the node sets described above:
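\begin{align*}
P_1 &= (A \times b) && \text{All the paths that end at } b. \\
P_2 &= (a \times B) && \text{All the paths that start at } a. \\
P_3 &= (A \times B) && \text{Paths that start in set } A \text{ and end in set } B. \\
P_4 &= (a \times b) && \text{The new edge.}
\end{align*}
\[ PS = P_1 \cup P_2 \cup P_3 \cup P_4 \tag{3.3} \]
The new TC is the union between the new paths and the old:
\[ TC_{new} = TC \cup PS \]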
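The insertion rule maps directly onto set operations. The sketch below is illustrative only: it holds the transitive closure as a Python set of (start, end) pairs rather than a database table, the function names are hypothetical, and the edges are those of table 3.1 with the row for the unknown parent left out.

```python
def insert_edge(tc, a, b):
    """Return the transitive closure after adding edge (a, b) to closure tc."""
    set_a = {i for (i, j) in tc if j == a}  # A: nodes with a path that ends at a
    set_b = {j for (i, j) in tc if i == b}  # B: nodes with a path that starts at b
    p1 = {(i, b) for i in set_a}                   # A x b
    p2 = {(a, j) for j in set_b}                   # a x B
    p3 = {(i, j) for i in set_a for j in set_b}    # A x B
    p4 = {(a, b)}                                  # the new edge
    # The set union drops any path that already existed in the old closure.
    return tc | p1 | p2 | p3 | p4


def build_closure(edges):
    """Build a closure by inserting the edges one at a time."""
    tc = set()
    for a, b in edges:
        tc = insert_edge(tc, a, b)
    return tc


EDGES = [(4, 5), (5, 6), (1, 4), (1, 3), (3, 5), (2, 6)]
closure = build_closure(EDGES)
print(sorted(closure))
```

Selecting on the first element of a pair gives all descendants of an animal, and selecting on the second gives all of its ancestors.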
Intersect set
The set of new paths may contain paths that exist in the old TC; that is,
there may be an intersect set. It is the finding of this intersect set that
creates problems when an edge is deleted.
After “SUSPECT” has been removed, the paths from 1 and 2 to b and 3 have
also been removed as there were paths between these nodes through a and b.
The path from any node on the upper side to 2 still exists. The path from 2
to b still exists because it is an edge. The path from b to any node on the
down side will still exist. Three head to tail joins of “TRUSTY” will recover
the missing path.
Figure 3.2: Three head to tail joins of trusted paths required.
“SUSPECT” was removed from “TRUSTY”. The case requiring three joins
is shown in figure 3.2.
The joining of heads to tails will generate a lot of paths that were not in
“SUSPECT”. Because a union removes duplicates, this doesn’t matter.
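A sketch of the deletion step is given below, with the caveat that the defining equations for “SUSPECT” and “TRUSTY” are not reproduced in this section. It assumes the common formulation for acyclic graphs: “SUSPECT” holds every path that could run through the deleted edge, and “TRUSTY” holds the remaining closure plus the remaining edges. The function names and the test graph (the edges of table 3.1 without the unknown parent) are illustrative.

```python
def compose(left, right):
    """Head to tail join: (i, k) for every (i, j) in left and (j, k) in right."""
    by_start = {}
    for j, k in right:
        by_start.setdefault(j, set()).add(k)
    return {(i, k) for (i, j) in left for k in by_start.get(j, ())}


def delete_edge(tc, edges, a, b):
    """Return the closure after deleting edge (a, b).

    tc is the closure and edges the edge set, both before the deletion.
    Assumes the graph is acyclic, as a pedigree is.
    """
    upper = {i for (i, j) in tc if j == a} | {a}  # nodes that reach a, plus a
    lower = {j for (i, j) in tc if i == b} | {b}  # nodes reached from b, plus b
    suspect = {(i, j) for i in upper for j in lower}
    remaining_edges = set(edges) - {(a, b)}
    # Surviving edges are always trusted, even if they fall inside SUSPECT.
    trusty = (tc - suspect) | remaining_edges
    two_joins = compose(trusty, trusty)
    three_joins = compose(two_joins, trusty)
    # The union removes the duplicate paths produced by the joins.
    return trusty | two_joins | three_joins


EDGES = {(4, 5), (5, 6), (1, 4), (1, 3), (3, 5), (2, 6)}
CLOSURE = EDGES | {(1, 5), (1, 6), (3, 6), (4, 6)}
print(sorted(delete_edge(CLOSURE, EDGES, 5, 6)))
```

Deleting the edge from 5 to 6 removes every path into 6 except the edge from 2, while the path from 1 to 5 survives.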
Let $(node_i, node_j)$ be one path in “TRUSTY” and $(node_v, node_w)$ be
another. Then:
3.4.4 Fragments
The intersect set (see section 3.4.1) will only contain paths with start nodes
and end nodes that are in “SUSPECT”. “SUSPECT” is generally smaller
than the transitive closure. When the pedigree is large, speed becomes an
vertices with different path lengths, the union operations will leave multiple
records for the same path. These need to be reduced to one record using
aggregate operators.
Let us describe any path in the TC as $(node_i, node_j, path_{min}, path_{max})$,
with the path starting at i and ending at j. The new edge being inserted is
$(node_a, node_b, 1, 1)$.
The set of nodes that have paths that end at node a is given by:
\[ A = \{ node_i, path_{min}, path_{max} \mid TC(node_j) = a \} \tag{3.6} \]
The set of nodes with a path that starts at node b is given by:
\[ B = \{ node_j, path_{min}, path_{max} \mid TC(node_i) = b \} \tag{3.7} \]
If the transitive closure is correctly constructed, there will be one tuple per
node pair and the tuple will contain the maximum and minimum path lengths
between the node pair. A will then only contain one tuple per node that is
joined to a and that tuple will contain the maximum and minimum lengths
to a. Similar arguments apply to set B .
The inserted edge starts at a, so the set of nodes it starts at has a single
entry, a, whose minimum and maximum path lengths will be one. Similarly,
the inserted edge ends at b, so the set of nodes it ends at is a one entry set
containing b.
The new paths created when the edge $(node_a, node_b, 1, 1)$ is added can be
found using the union of Cartesian products between the node sets described
above:
\begin{align*}
P_1 &= (A \times b,\; A_{path_{min}} + b_{path_{min}},\; A_{path_{max}} + b_{path_{max}}) \\
    &= (A \times b,\; A_{path_{min}} + 1,\; A_{path_{max}} + 1) \\
P_2 &= (a \times B,\; B_{path_{min}} + a_{path_{min}},\; B_{path_{max}} + a_{path_{max}}) \\
    &= (a \times B,\; B_{path_{min}} + 1,\; B_{path_{max}} + 1) \\
P_3 &= (A \times B,\; A_{path_{min}} + B_{path_{min}} + 1,\; A_{path_{max}} + B_{path_{max}} + 1) \\
P_4 &= (a, b, 1, 1)
\end{align*}
\[ PS = P_1 \cup P_2 \cup P_3 \cup P_4 \tag{3.8} \]
The new TC is the union between the new paths and the old. There will be
multiple entries for each path if there are paths with different lengths.
\[ TC_{possible} = TC_{old} \cup PS \]
The common paths with different lengths have to be reduced to one using
the aggregate function. The minimum path length will be the minimum path
found in the group with a common path, and the maximum path length
will be the maximum path found in the group with a common path. The
maximum and minimum paths may come from different tuples within the
group.
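The insertion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's code: the closure is held in a dictionary keyed by (from_node, to_node), the function name insert_edge is hypothetical, and the graph is assumed to stay acyclic, as a pedigree must.

```python
from collections import defaultdict


def insert_edge(tc, a, b):
    """Return a new closure after adding the edge a -> b.

    tc maps (from_node, to_node) -> (min_len, max_len).
    Assumes the graph remains acyclic after the insertion.
    """
    # Set A: nodes with a path ending at a; set B: nodes reachable from b.
    into_a = {i: d for (i, j), d in tc.items() if j == a}
    out_of_b = {j: d for (i, j), d in tc.items() if i == b}

    # TC_possible: old paths plus every candidate path through the new edge.
    candidates = defaultdict(list)
    for pair, lengths in tc.items():
        candidates[pair].append(lengths)
    candidates[(a, b)].append((1, 1))                        # P4
    for i, (mn, mx) in into_a.items():                       # P1
        candidates[(i, b)].append((mn + 1, mx + 1))
    for j, (mn, mx) in out_of_b.items():                     # P2
        candidates[(a, j)].append((mn + 1, mx + 1))
    for i, (amn, amx) in into_a.items():                     # P3
        for j, (bmn, bmx) in out_of_b.items():
            candidates[(i, j)].append((amn + bmn + 1, amx + bmx + 1))

    # Aggregate: one tuple per node pair, keeping the overall min and max.
    return {
        pair: (min(d[0] for d in found), max(d[1] for d in found))
        for pair, found in candidates.items()
    }
```

Starting from an empty dictionary and inserting the edges (1,2), (2,3), (1,3) and (3,4) one at a time leaves a single tuple per node pair; the minimum and maximum for the pair (1,4) come from different candidate tuples, as noted above.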
Adjacency list
An adjacency list is another name for an edge table. The list contains a
record for every edge, and each record contains a start node and an end node,
so each record in effect points to the next record in the chain. The data is
complete, but no structure information is present; hence, to find ancestors
and descendants, links have to be followed using multiple queries. As
mentioned above, the application effectively maintains an adjacency list, as
the sire and dam animal numbers are kept in the animal record.
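A small sketch may help show why an adjacency list needs repeated lookups. It assumes a hypothetical in-memory mapping from each animal to its (sire, dam) pair, standing in for the animal record; in a database each lookup counted below would be a separate query.

```python
def ancestors(parents, animal):
    """Collect all ancestors by following sire and dam links.

    parents maps animal -> (sire, dam); None marks an unknown parent.
    Returns the set of ancestors and the number of record lookups made.
    """
    found = set()
    frontier = [animal]
    lookups = 0
    while frontier:
        current = frontier.pop()
        lookups += 1  # one record read (one query) per visited animal
        for parent in parents.get(current, (None, None)):
            if parent is not None and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found, lookups
```

With a transitive closure the same ancestor set is available from a single selection on the descendant node, which is the trade-off the following sections rely on.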
Node codes
Node codes have been proposed by Elliott, Akgul, Mayes, & Ozsoyoglu, 2007
as a method of representing pedigrees. A node code table is larger than a
transitive closure, as all paths, along with the traversed edges, are
recorded in strings. Each node code is recorded in a table along with an
animal identity.
The node codes for the example pedigree (see figure 2.2) are shown in
figure 3.4. The foundation animals are given codes 0 to n from left to right.
The codes for the offspring of a node are given in sibling order. The node
codes record all paths that can be taken from the foundation animals to the
node, and a separator records the sex of the node traversed. Elliott et al.,
2007 uses “.”, “,” and “;” to denote female, male and don’t know.
If an offspring is inserted into the database, the node code of the offspring
is created by copying the parent node codes, appending a separator based on
the parent’s sex and then appending the offspring code. The insertion of a
leaf is simple, but the insertion of an ancestor is a little bit more
problematic as the codes for all offspring have to be altered. Elliott et al.,
2007 allocated the codes by performing a depth first search on a pedigree
that has been built and encoded using an adjacency list. The development of
algorithms for incremental additions to the pedigree and concurrent
maintenance of the node code table would be an interesting exercise.
Nodes that have common ancestors have node codes with a common prefix.
This fact can be used to generate the inbreeding coefficient using the node
codes for a particular animal.
[Figure 3.4: Node codes. Pedigree diagram with nodes Unknown, 1, 2, 3, 4, 5
and 6, followed by the table of animal codes.]
Animal  Codes      Animal  Codes
1       0          5       0.0.0
2       1                  0.1,0
3       0.1                1,0,0
        1,0        6       0.0.0.0
4       0.0                1,1
The selection of ancestors and offspring is similar to the finding of the
same data using a transitive closure. Offspring are selected by selecting node
codes that have a prefix in common with the ancestor of interest. An ancestor
may have many node codes but all of these prefixes will be propagated to
the children with the path taken from ancestor to offspring encoded in the
offspring’s suffix.
All the ancestors can be found by selecting records with node codes that
have the same prefix strings as the node codes of the animal of interest.
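The prefix selection described above can be sketched as follows. The code table is a small illustrative subset loosely modelled on figure 3.4, and the helper names are hypothetical; this is not the implementation of Elliott et al., 2007. A code counts as a prefix only when it is followed by a separator, so that code 1 is not mistaken for a prefix of code 10.

```python
SEPARATORS = ".,;"  # female, male, unknown sex


def is_prefix_code(ancestor_code, code):
    """True if ancestor_code is a proper prefix of code ending at a separator."""
    return (
        len(code) > len(ancestor_code)
        and code.startswith(ancestor_code)
        and code[len(ancestor_code)] in SEPARATORS
    )


def descendants(codes, animal):
    """Animals having a node code that extends one of this animal's codes."""
    return {
        other
        for other, other_codes in codes.items()
        if other != animal
        and any(is_prefix_code(a, c) for a in codes[animal] for c in other_codes)
    }


def ancestors(codes, animal):
    """Animals whose node codes are prefixes of this animal's codes."""
    return {
        other
        for other, other_codes in codes.items()
        if other != animal
        and any(is_prefix_code(a, c) for a in other_codes for c in codes[animal])
    }


# Illustrative code table: animal -> list of node codes (an assumption).
example_codes = {
    1: ["0"],
    2: ["1"],
    3: ["0.1", "1,0"],
    5: ["0.1,0", "1,0,0"],
}
```

In a database the same test would typically be a string prefix match on the node code column, which is why the selection resembles the single-step selection available from a transitive closure.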
tive method to that described in Elliott et al., 2007 for the calculation of
inbreeding coefficients.
3.5 Conclusion
It is possible to incrementally maintain the transitive closure, the minimum
path length and the maximum path length. Node codes as used by Elliott
et al., 2007 were introduced and their relationship to the transitive closure
considered. The next section considers the implementation of the algorithms
introduced in this section using PostgreSQL. As mentioned previously,
chapter 5 will consider how the transitive closure is used within the
application.
Chapter 4
Incremental maintenance of a
transitive closure using SQL
This chapter looks at using PostgreSQL version 7.1 and version 8.0 to
implement the incremental maintenance of a transitive closure. The theory
underpinning this code is discussed in chapter 3.
Possible paths created by the insertion of the edge are contained in tc_temp.
This includes paths that already exist in the transitive closure; that is, it
includes the intersect set. The intersect set is the set of paths that exist
in both tc_temp and the transitive closure.
INSERT INTO tc_temp (
SELECT from_node,child_node AS to_node ,max_depth+1 AS max_depth,
min_depth+1 AS min_depth
FROM parent_child_tc WHERE to_node=parent_node
UNION
SELECT parent_node AS from_node, to_node , max_depth+1 AS max_depth,
min_depth+1 AS min_depth
FROM parent_child_tc WHERE from_node=child_node
UNION
SELECT TC1.from_node AS from_node,TC2.to_node AS to_node,
((TC1.max_depth)+(TC2.max_depth)+1) AS max_depth,
((TC1.min_depth)+(TC2.min_depth)+1) AS min_depth
FROM parent_child_tc AS TC1, parent_child_tc AS TC2
WHERE TC1.to_node=parent_node AND TC2.from_node=child_node
UNION
SELECT parent_node AS from_node,child_node AS to_node,
1 AS max_depth,1 AS min_depth
);
Remove the intersect set to leave the new paths, and only the new paths, in
deltas.
INSERT INTO deltas (
SELECT * FROM tc_temp WHERE NOT EXISTS (
SELECT * FROM parent_child_tc
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node = tc_temp.to_node
)
);
Insert the new paths into the transitive closure
INSERT INTO parent_child_tc SELECT * FROM deltas;
The paths in tc_temp that were already in the transitive closure can have
different path lengths. We make sure the maximum path length is set to the
maximum path found in the set with the same start and end node.
UPDATE parent_child_tc SET max_depth=tc_temp.max_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node=tc_temp.to_node AND
parent_child_tc.max_depth < tc_temp.max_depth ;
And we make sure the minimum path is set to the minimum path.
UPDATE parent_child_tc SET min_depth=tc_temp.min_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node=tc_temp.to_node AND
parent_child_tc.min_depth > tc_temp.min_depth ;
PostgreSQL version 8 code
Truncating the tables is faster than re-creating them. TRUNCATE is a
PostgreSQL extension.
TRUNCATE TABLE SUSPECT, fragments, delta ;
Paths that go through the link to be deleted are added to the set “SUS-
PECT”. Paths in “SUSPECT” will be removed if the minimum path length
is greater than one.
INSERT INTO SUSPECT (
SELECT X.from_node AS from_node, Y.to_node AS to_node
FROM parent_child_tc AS X,parent_child_tc AS Y
WHERE X.to_node = parent_node AND Y.from_node= child_node
UNION
SELECT X.from_node AS from_node, child_node AS to_node
FROM parent_child_tc AS X
WHERE X.to_node = parent_node
UNION
SELECT parent_node AS from_node, X.to_node AS to_node
FROM parent_child_tc AS X
WHERE X.from_node = child_node
UNION
SELECT parent_node , child_node
);
to one if a record with the same start and end node is in “SUSPECT”. We
don’t have to test the minimum path length as the previous code will have
removed paths that are in the “SUSPECT” set if the minimum path length
was greater than one.
UPDATE parent_child_tc SET max_depth = 1 FROM SUSPECT WHERE
(parent_child_tc.from_node=SUSPECT.from_node) AND
(parent_child_tc.to_node= SUSPECT.to_node)
;
Fragment is a subset of the transitive closure that can regenerate paths that
should not have been deleted. The idea is simple; if the remaining path in the
transitive closure doesn’t have a start and end that is in the deleted set, the
path can’t be helpful in regenerating paths that should not have been lost.
Finding this subset takes less time than working with the full “TRUSTY”
set when regenerating the paths.
INSERT INTO fragments (
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM parent_child_tc AS pc JOIN SUSPECT AS sp
ON (sp.from_node=pc.from_node)
UNION
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM parent_child_tc AS pc JOIN SUSPECT AS sp
ON (sp.to_node=pc.to_node)
);
Using the fragment set we construct delta, the set of paths that may have
been lost but should not have been. This will contain paths that are already
in the transitive closure; these additional paths are deleted in the next
step. Three self joins of fragments are required to recover all possible
lost paths (see section 3.4.3).
INSERT INTO delta (
SELECT fr1.from_node,fr2.to_node,
(fr1.min_depth+fr2.min_depth) AS min_depth,
(fr1.max_depth+fr2.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2
WHERE fr1.to_node=fr2.from_node
UNION
SELECT fr1.from_node,fr3.to_node,
(fr1.min_depth+1+fr3.min_depth) AS min_depth,
(fr1.max_depth+1+fr3.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
WHERE (fr1.to_node=fr2.from_node) AND
(fr2.to_node = fr3.from_node)
);
All we want are the paths that should not have been lost.
DELETE FROM delta
USING parent_child_tc AS pc
WHERE (
(delta.from_node=pc.from_node) AND
(delta.to_node=pc.to_node)
)
;
Among the lost paths there can be multiple paths between the same nodes, but
we are only interested in the shortest and longest path of each group (common
start and end node).
INSERT INTO parent_child_tc
(SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
FROM delta AS a
GROUP BY a.from_node,a.to_node
)
;
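The deletion steps can be illustrated with a simplified in-memory sketch. It follows the same idea (a suspect set, a trusted remainder, and regeneration of the paths that should survive) but is not a translation of the SQL above: the function name, the dictionary layout and the explicit edge set are assumptions, and surviving paths are rebuilt as a trusted segment, one remaining edge, and a second trusted segment, where either segment may be empty. The graph is assumed to be acyclic.

```python
def delete_edge(tc, edges, a, b):
    """Return the closure after removing the edge a -> b.

    tc maps (from_node, to_node) -> (min_len, max_len); edges is the set of
    (parent, child) pairs before the deletion. Assumes an acyclic graph.
    """
    remaining = set(edges) - {(a, b)}

    # SUSPECT: every pair whose paths may pass through the deleted edge.
    starts = {a} | {i for (i, j) in tc if j == a}
    ends = {b} | {j for (i, j) in tc if i == b}
    suspect = {(x, y) for x in starts for y in ends}

    # Trusted paths cannot use the deleted edge, so their lengths stand.
    trusty = {pair: d for pair, d in tc.items() if pair not in suspect}

    def segments_ending_at(node):
        segs = {node: (0, 0)}  # the empty segment
        segs.update({i: d for (i, j), d in trusty.items() if j == node})
        return segs

    def segments_starting_at(node):
        segs = {node: (0, 0)}  # the empty segment
        segs.update({j: d for (i, j), d in trusty.items() if i == node})
        return segs

    # Rebuild suspect pairs from: trusted segment + remaining edge + trusted segment.
    rebuilt = {}
    for (u, v) in remaining:
        for x, (hmn, hmx) in segments_ending_at(u).items():
            for y, (tmn, tmx) in segments_starting_at(v).items():
                if (x, y) not in suspect:
                    continue
                mn, mx = hmn + 1 + tmn, hmx + 1 + tmx
                old = rebuilt.get((x, y))
                rebuilt[(x, y)] = (
                    (mn, mx) if old is None else (min(old[0], mn), max(old[1], mx))
                )

    new_tc = dict(trusty)
    new_tc.update(rebuilt)
    return new_tc
```

Suspect pairs that cannot be rebuilt simply disappear from the result, which corresponds to paths that existed only because of the deleted edge.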
Version 7.1 and version 8.0 create “SUSPECT” using the same strategy.
INSERT INTO SUSPECT (
SELECT X.from_node AS from_node, Y.to_node AS to_node
FROM parent_child_tc AS X, parent_child_tc AS Y
WHERE X.to_node=parent_node AND Y.from_node=child_node
UNION
SELECT X.from_node AS from_node, child_node AS to_node
FROM parent_child_tc AS X
WHERE X.to_node=parent_node
UNION
SELECT parent_node AS from_node,X.to_node AS to_node
FROM parent_child_tc AS X
WHERE X.from_node=child_node
UNION
SELECT parent_node,child_node
);
Version 8 creates the trusted set by deleting “SUSPECT” from the transitive
closure. The DELETE statement in version 7.1 is not as well developed, so
instead a “TRUSTY” table has to be created. A left outer join is used
instead of the NOT EXISTS code used in the code presented by Guozhu et al.,
1999 because the 7.1 optimizer is not smart enough to convert the query
used by NOT EXISTS to a left outer join. Instead it executes the inner query
in a loop, resulting in very poor performance. This is not a universal problem
with version 7.1, as the NOT EXISTS clause used in the edge insert code was
correctly optimized. The problem disappeared in version 8.
INSERT INTO TRUSTY (
SELECT parent_child_tc.from_node,parent_child_tc.to_node,
parent_child_tc.min_depth,parent_child_tc.max_depth
FROM parent_child_tc LEFT OUTER JOIN SUSPECT ON (
SUSPECT.from_node=parent_child_tc.from_node AND
SUSPECT.to_node = parent_child_tc.to_node
)
WHERE (SUSPECT.from_node IS null) AND
(parent_child_tc.min_depth<>1)
UNION
SELECT parent_child_tc.from_node,parent_child_tc.to_node,1,1
FROM parent_child_tc
WHERE (parent_child_tc.min_depth=1) AND
(NOT(
parent_child_tc.from_node=parent_node AND
parent_child_tc.to_node=child_node
)
)
);
The version 7.1 fragment set is created in a similar manner to the version
8 set. The version 8 code uses the reduced transitive closure set created by
deleting “SUSPECT” from the original transitive closure, while the version
7.1 code uses the “TRUSTY” table created above.
INSERT INTO fragments (
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
(sp.from_node=pc.from_node)
UNION
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
(sp.to_node=pc.to_node)
);
Version 8 code created delta and then deleted paths that were already in the
transitive closure that remained after “SUSPECT” was removed. Because
DELETE is not as well developed, version 7.1 creates a separate table tc_new
and then only puts into delta those paths not in “TRUSTY”. The version
7.1 code to create tc_new is similar to the version 8 code that creates the
initial delta.
INSERT INTO tc_new (
SELECT fr1.from_node,fr2.to_node,
(fr1.min_depth+fr2.min_depth) AS min_depth,
(fr1.max_depth+fr2.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2
WHERE fr1.to_node=fr2.from_node
UNION
SELECT fr1.from_node,fr3.to_node,
(fr1.min_depth+1+fr3.min_depth) AS min_depth,
(fr1.max_depth+1+fr3.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
WHERE (fr1.to_node=fr2.from_node) AND
(fr2.to_node = fr3.from_node)
);
In version 8, trusted paths are deleted from delta, while in version 7.1, paths
in tc_new that are not in “TRUSTY” are put in delta. Once again, we use
a left outer join because NOT EXISTS is not being converted to an outer join
by the version 7.1 optimizer.
INSERT INTO delta (
SELECT tc_new.from_node,tc_new.to_node,
tc_new.min_depth,tc_new.max_depth
FROM tc_new LEFT OUTER JOIN TRUSTY ON
(TRUSTY.from_node=tc_new.from_node AND
TRUSTY.to_node=tc_new.to_node)
WHERE TRUSTY.from_node IS null
);
Because DELETE is not as well developed, parent_child_tc is rebuilt from
scratch.
TRUNCATE parent_child_tc;
The new parent_child_tc contains “TRUSTY” and a single entry
for each path found in delta, with the entry containing the maximum and
minimum path lengths within the group.
INSERT INTO parent_child_tc
(SELECT * FROM TRUSTY)
UNION (
SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
FROM delta AS a
GROUP BY a.from_node,a.to_node
)
;
4.3 Conclusion
We have considered the PostgreSQL version 7.1 and version 8.0 code needed
to implement the algorithms presented in chapter 3. Different algorithms
are used for each PostgreSQL version because of PostgreSQL version 7.1
limitations. Different SQL databases offer slightly different versions of SQL,
so the code presented here may require modification if different database
systems are used. The next chapter looks at how the transitive closure is
used within the application, and the SQL code needed to implement the
desired functionality.
Chapter 5
This chapter discusses the various ways the transitive closure has been
used to meet the application goals.
Table           Description
animal_select   The animal identification and sex. Created when a record is
                required for a parent or for an animal whose data is being
                entered. When this record is created, an internal animal
                number is allocated by the system (animal id).
animal          All other animal details. This record is only created if the
                animal is referenced in some way other than being a parent.
                That is, not all animals have an animal record, and not all
                animals have a birth date.
[Diagram: animal_select table (id, tag details, sex) linked to animal table
(id, more data).]
Figure 5.1: Animal record structure
SELECT min(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.to_node = an.animal
WHERE tc.from_node = 'animal_of_interest' AND
tc.min_depth = 1
;
The PostgreSQL version 7.1 optimizer has problems with the following code.
The optimizer fails to note that the select can be converted to a join, the
inner select ends up in a loop and the performance is terrible.
SELECT min(birth_date) FROM animals WHERE
(animals.animal IN
( SELECT to_node AS animal
FROM parent_child_tc
WHERE from_node= 'animal_of_interest'
)
)
;
Merrrit users can also build the database in a forward direction. If this is
done, we must check that the new animal is not being born before its parents.
The following code selects the latest birth date of the parents.
SELECT max(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.from_node = an.animal
WHERE tc.to_node = 'animal_of_interest' AND tc.min_depth = 1
;
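The two birth date queries can be combined into a single check. The sketch below uses Python's sqlite3 module with an in-memory database purely for illustration; the table and column names follow the queries above, while the helper names, the sample data and the use of ISO date strings are assumptions. It is not the application's PHP code.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database: animal 3 has parents 1 and 2, offspring 4."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE parent_child_tc (
            from_node INTEGER, to_node INTEGER,
            min_depth INTEGER, max_depth INTEGER);
        CREATE TABLE animals (animal INTEGER PRIMARY KEY, birth_date TEXT);
        INSERT INTO animals VALUES (1, '2001-09-01'), (2, '2002-09-10'),
                                   (4, '2005-09-15');
        INSERT INTO parent_child_tc VALUES
            (1, 3, 1, 1), (2, 3, 1, 1), (3, 4, 1, 1),
            (1, 4, 2, 2), (2, 4, 2, 2);
        """
    )
    return conn


def birth_date_bounds(conn, animal):
    """Return (latest parent birth date, earliest offspring birth date)."""
    latest_parent = conn.execute(
        "SELECT max(an.birth_date) FROM parent_child_tc AS tc "
        "LEFT JOIN animals AS an ON tc.from_node = an.animal "
        "WHERE tc.to_node = ? AND tc.min_depth = 1",
        (animal,),
    ).fetchone()[0]
    earliest_child = conn.execute(
        "SELECT min(an.birth_date) FROM parent_child_tc AS tc "
        "LEFT JOIN animals AS an ON tc.to_node = an.animal "
        "WHERE tc.from_node = ? AND tc.min_depth = 1",
        (animal,),
    ).fetchone()[0]
    return latest_parent, earliest_child


def birth_date_is_sensible(conn, animal, birth_date):
    """True if birth_date falls after all parents and before all offspring."""
    latest_parent, earliest_child = birth_date_bounds(conn, animal)
    if latest_parent is not None and birth_date <= latest_parent:
        return False
    if earliest_child is not None and birth_date >= earliest_child:
        return False
    return True
```

Missing birth dates come back as NULL from the aggregate functions, so an animal with no dated parents or offspring is unconstrained on that side.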
5.4 Calculation of the inbreeding coefficient
To calculate the inbreeding coefficient, we need a vector of ancestors from
the first animal to enter the pedigree to the most recent. This was discussed
in section 2.12.
The short piece of code below performs this task. In the following code
the animal we are interested in is referred to as the animal_of_interest.
The transitive closure contains a from_node and a to_node. The from_node
is the ancestor node, the to_node is the descendant node. If the to_node
equals the animal_of_interest the from_node gives the animal number
of an ancestor. The max_depth data tells how many generations back the
animal entered the pedigree. The min_depth data tells how recently the
animal appears in the pedigree.
To get the records in order, the result is sorted on the depth field, deepest
first. The LEFT OUTER JOIN with the animal table picks up the sire and dam
animal records, if they exist. The SELECT after the UNION is picking up the
details for the animal we want the information on. An animal does not have
a transitive closure entry pointing to itself.
SELECT tc.from_node,an.sire,an.dam,tc.max_depth
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
ON tc.from_node = an.animal
WHERE
(tc.to_node=animal_of_interest) AND
(tc.max_depth < 'max_depth')
UNION
SELECT ans.animal AS from_node,an.sire,an.dam,0 AS max_depth
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
ON (ans.animal = an.animal)
WHERE ans.animal='animal_of_interest'
ORDER BY max_depth DESC ;
be a problem if an animal has too many descendants. This issue is discussed
further in section 6.2.
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
ON (tc.to_node = an.animal)
JOIN animal_select AS ans
ON (tc.to_node = ans.animal)
WHERE (tc.from_node='animal_of_interest')
UNION
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
ON (ans.animal=an.animal)
WHERE ans.animal='animal_of_interest'
;
5.7 Conclusion
The transitive closure is used to maintain data integrity, to select the records
needed to display the pedigree and descendants, and to generate the data
structure needed to calculate the inbreeding coefficient. We now move onto
a new topic, efficiently displaying the pedigree and descendants using a
browser.
Chapter 6
A human pedigree aims to display relationships and siblings common to
that relationship in birth order (Bennett et al., 1995). This creates complex
structures that are difficult to display, and as a result there is a body of litera-
ture that considers the problem (Tores & Barillot, 2001). The flow of genetic
material is important when displaying animal pedigrees, not relationships.
The flow of genetic material to an individual diploid can be represented by
a binary tree.
A survey of the animal genetics literature indicates animal pedigrees still
tend to use the Greek symbols for Mars (♂) and Venus (♀) to denote male and
female, with the standard symbols (circles and squares) used by human
geneticists (Schott, 2005) not being favored.
The pedigree displayed by Merrrit revolves around an individual animal. One
diagram shows the animal’s ancestors and another its offspring.
6.1 Ancestors
A tree structure is used to display the pedigree, with the animal of
interest to the left and the ancestors to the right. Next to each animal
node is a link that can be used to display the full animal record. The
structure to be displayed is created using PHP classes from the pear package
HTML_TreeMenu (Radi & Heyes, 2008). The PHP code that uses these classes
links together objects (instances of the PHP class) to describe the desired
menu. The printMenu() method then uses the resulting structure to output
JavaScript that the browser uses to display the menu. The browser puts
together the tree display using icons pointed to when the PHP class objects
are added to the PHP structure described above.
[Figure 6.1: Partially opened Merrrit pedigree display for animal 4461]
Figure 6.1 shows an example pedigree. The arrows embedded in the
graphics call JavaScript code when they are clicked, and are used to open
and close pedigree branches. When the branch is open, the arrow points into
the branch, and when the branch is closed (details not displayed) the arrow
points down. The depth of the undisplayed pedigree can be determined using
the “depth” data displayed as part of the animal’s URL.
The application code to create the PHP structure used to output the
JavaScript is found in the file /class/pedigree.php. The method used to
create a node in the tree is recursive, calling itself if there are ancestors to
itself. The recursion returns an object which is linked to a new object that is
then returned. In other words, it is a classic depth first search using recursive
code. The recursive code also returns how deeply the recursion is nested, and
the depth value returned is used as part of the data string sent to the browser
as the link text.
To reduce the number of calls to the SQL server, the code uses the tran-
sitive closure to select all of an animal’s ancestor records in one read. The
depth first search then becomes a memory operation that is reasonably fast.
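The recursive construction can be sketched as follows. This is an illustration only: the application uses PHP and the HTML_TreeMenu classes, whereas the sketch uses plain Python dictionaries, and the function name and record layout are assumptions. The records dictionary stands in for the ancestor records read in one pass using the transitive closure.

```python
def build_pedigree(records, animal):
    """Return (node, depth) for the ancestor tree rooted at animal.

    records maps animal -> (sire, dam), already loaded into memory;
    None marks an unknown parent. depth is the number of generations
    of known ancestors below this node.
    """
    sire, dam = records.get(animal, (None, None))
    parent_nodes = []
    depth = 0
    for parent in (sire, dam):
        if parent is not None:
            # Depth first: fully build the parent's subtree before moving on.
            subtree, sub_depth = build_pedigree(records, parent)
            parent_nodes.append(subtree)
            depth = max(depth, sub_depth + 1)
    node = {"animal": animal, "depth": depth, "parents": parent_nodes}
    return node, depth
```

Because every record is already in memory, the recursion performs no further database reads; the returned depth is the kind of value that could be shown next to each node in the link text.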
The pedigree downloaded to the browser contains all known ancestors,
but only the top four branches are displayed open. This reduces the size
of the initially displayed tree. If interested, the user can click on the arrow
icons to open closed branches.
6.2 Descendants
A tree structure is used to display descendants with the animal of interest
on the left and offspring to the right. The initial rendering displays all the
offspring (see figure 6.2) but none of the offspring’s descendants. Arrows are
added to the graphics if there are not too many offspring; these arrows may
be used to open up branches that show the identity of the descendants of the
offspring.
Animals early in the pedigree of a large herd can have many offspring.
For example, one of the herds used to test the system, which has over 10000
records, had an early buck, KASHMERE NERO, that had considerable influence
on that herd and had over 5000 descendants. Using the transitive closure to
read all records needed to display all descendants can be dangerous: if not
constrained, the server may not have enough memory allocated to the task for
record storage. Further, if communication speeds are moderate and the number
of progeny is not restrained, downloading all progeny information to the
browser can take considerable time.
The transitive closure is used to determine the number of descendants
before the display of descendants is started. If there are too many
descendants, the transitive closure is used to read only the offspring
records whose minimum path length equals one, and the offspring are displayed
using the set of records thus read. Using natural breeding methods, the
number of children is limited by the number of offspring that can be sired in
a year (for goats, fewer than 200) and by the number of years the sire is
sexually active (for goats, fewer than 10), giving an upper limit of
approximately 200 × 10 = 2000 records. With modern breeding methods, there is
no guarantee that this solution will work under all circumstances.
[Figure 6.2: The offspring of the animal are initially rendered, with the
branches detailing the descendants of the offspring closed.]
If the number of descendants is within reasonable limits, all descendant
records are selected using the transitive closure (see section 5.5). As
with the code that displays the pedigree, a recursive routine located in the
file /class/pedigree.php links class objects together to describe the tree,
and the class method printMenu() then uses the structure to output
JavaScript code to the browser. The JavaScript code is then used by the
browser to display the descendant tree.
The number of offspring and descendants of each animal node displayed
is determined using the transitive closure and the results are included in the
data displayed (see figure 6.2).
6.3 Conclusion
This chapter discusses using a browser to display an animal pedigree and the
animal’s descendants. The next chapter concludes the thesis and discusses
future research.
Chapter 7
7.1 Thesis conclusion
Previous work in this area has used strings that describe the genetic path
between animals (node codes) to improve the performance of pedigree searches
and the calculation of inbreeding coefficients. This application maintains
the minimum and maximum path length, an adjacency list and a transitive
closure. The structures used are maintained incrementally. If the above data
structures are maintained, it is possible to enforce sensible birth dates in
real time, calculate the inbreeding coefficient and present in real time the
pedigree and descendant trees.
Incrementally maintaining a transitive closure is viable if the number of
animals is less than ten thousand. If there are larger herds than this, then
herd transitive closures can be maintained, with the industry herd updates
occurring in the background.
It is possible to incrementally update the BLUP equations, but the num-
ber and size of simultaneous equations to be solved prevent us from gener-
ating results in real time. The solution is to allow the user to initiate the
calculation for their herd.
Pedigrees and descendant trees can be displayed by a browser using the
pear package HTML_TreeMenu.
This thesis focuses on the creative advance in the domain area; the artifact
was created to test the ideas. As predicted by Hevner et al., 2004, the
creation of the artifact opens up new research possibilities.
mum path length and a transitive closure can be used to speed up similar
queries. We have also shown that the data structures used can be maintained
incrementally. A proof that the node codes can or can’t be maintained in-
crementally would be interesting.
- Will the provision of EBV data as described above alter the decisions
of experienced breeders?
With EBV’s now available to the entire industry, the herd belonging to
the breeder who has a ’feel’ for the problem could be randomly divided, both
methods applied and the outcomes compared.
An argument could be mounted that taking the “feel” out of the selection
is advantageous as a solid outcome can be had by anyone willing to undertake
systematic recording and the application of the methodology. Is the previous
statement a reasonable argument? Another possible research question.
7.2.4 Will this work affect Lambplan?
Lambplan and Merinoplan are industry wide programs that use best linear
unbiased prediction to calculate EBV’s for the lamb meat and merino indus-
tries. Lambplan and Merinoplan are run by Sheep Genetics a joint program
created by Meat and Livestock Australia (MLA) and Australian Wool In-
novation Limited (AWI). Both systems are based on a centralized database
and batch calculations. There is a flat rate of $330 and a charge of $1.65 per
animal with a maximum charge of $2750 per breeder (Fee Schedule, 2009).
This is down from the 2004 charge of $355 flat rate and $2.32 per animal
(McNair, 2004). In 2009 the per animal charge only applies to the current
drop, and historic data (the very foundation of EBV’s based on BLUPs)
is stored for free. To use the service, you export your pedigree and
phenotype records and receive in return a file containing EBV’s and
inbreeding coefficients. Your data must be correctly formatted before
sending it.
Because of its small size, the development of a solution for the cashmere
industry was unattractive to the current operator. This forced the cashmere
industry to develop its own solution, which resulted in the creation of
Merrrit, the subject of this thesis. We aimed to develop a system that could
be self managed. The techniques developed could be used to self generate
within herd EBV’s and inbreeding coefficients for any industry, and this could
reduce the cost. Industries would still need centralized control to generate
across herd EBV’s as it is unlikely any breeder will entrust his full dataset
to another.
The author has listened to lamb and merino breeders complain about
the cost of their current solutions and the accuracy of the EBV’s provided.
An understanding of best linear unbiased prediction leads one to ask: Do the
complaints about the provided EBV values result from poor industry data
structure (not enough cross herd links)? This is research that should be
undertaken by the Lambplan and Merinoplan providers.
The possibility of extending this solution to other industries results in
several interesting research questions:
- How general is the dissatisfaction with Lambplan and Merinoplan
EBV’s and why has it occurred?
- If Lambplan and Merinoplan users were supplied with tools to display
their pedigree and to get inbreeding coefficients and within herd EBVs
without waiting for the industry calculation, would their satisfaction
with the service improve?
- Industry wide EBVs are only valuable if breeders believe the results.
For accurate across industry EBVs you need sound genetic links across
herds. Are Lambplan and Merinoplan results questionable because of
their poor data structure?
- Because cross herd/flock links are difficult and costly to maintain one
needs to ask the question: As a general rule, are breeders looking for
within herd EBVs or industry wide EBVs?
References
Bennett, R. L., A., S. K., Uhrich, S. B., O’Sullivan, C. K., Resta, R. G., &
Lochner-Doyle, D. (1995). Recommendation for standardized human
pedigree nomenclature. Journal of Genetic Counseling, 4 (4).
Dong, G., & Su, J. (1995). Incremental and decremental evaluation of
transitive closure by first-order queries. Information and Computation,
120 (1), 101–106.
Elliott, B., Akgul, S. F., Mayes, S., & Ozsoyoglu, Z. M. (2007). Efficient
evaluation of inbreeding queries on pedigree data. In Ssdbm ’07: Pro-
ceedings of the 19th international conference on scientific and statistical
database management. Washington, DC, USA: IEEE Computer Soci-
ety.
Elliott, B., Akgul, S. F., Ozsoyoglu, Z. M., & Manilich, E. (2006). A frame-
work for querying pedigree data. In Ssdbm ’06: Proceedings of the 18th
international conference on scientific and statistical database manage-
ment (pp. 71–80). Washington, DC, USA: IEEE Computer Society.
Faraway, J. (2002). Practical regression and anova using r. http://cran.r-
project.org/.
Fee schedule. (2009). Website. PO Box U254 UNE, Armidale, NSW, 2351:
Sheep Genetics. (http:www.sheepgenetics.org.au)
Goldberger, A. S. (1962). Best linear unbiased prediction in the generalized
linear regression model. Journal of the American Statistical Associa-
tion, 57 (298), 369–375.
Guozhu, D., Leonid, L., Jianwen, S., & Limsoon, W. (1999). Maintaining
transitive closure of graphs in sql. In Int. J. Information Technology,
5.
Henderson, C. R. (1975). Best linear unbiased estimator and prediction
under a selection model. Biometrics, 31 (2), 423–447.
Henderson, C. R. (1976). A simple method for computing the inverse of a
numerator relationship matrix used in the prediction of breeding values.
Biometrics, 33 (1), 69–83.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004, march). Design
science in information systems research. MIS Quarterly, 28 (1), 75–105.
James, A. T. (2009). Estimating breeding values for better cashmere - using
the australian cashmere. PO Box 4776 KINGSTON ACT 2604: Rural
Industries Research and Development Corporation.
Lynch, M., & Walsh, B. (1997). Genetics and analysis of quantitative traits.
23 Plumtree Road, Sunderland, MA 01375 U.S.A: Sinauer Associates,
Inc.
McNair, S. (2004). Mla responds to lambplan critics.
Website. Stock and Land. (http://sl.farmonline.
com.au/news/state/agribusiness-and-general/general/
mla-responds-to-lambplan-critics/9187.aspx)
Mrode, R. (1996). Linear models for the prediction of animal breeding values.
Wallingford Oxon: CAB International.
Mrode, R. (2005). Linear models for the prediction of animal breeding values
(2 ed.). Wallingford Oxon: CAB International.
Pang, C., Dong, G., & Ramamohanarao, K. (2005). Incremental maintenance
of shortest distance and transitive closure in first-order logic and sql.
ACM Transactions on Database Systems., 30 (3), 698–721.
Quass, R. L. (1976). Computing the diagonal elements and inverse of a large
numerator relationship matrix. Biometrics, 32 (4), 949–953.
Radi, H., & Heyes, R. (2008). Html treemenu. Website. (http://pear.php.
net/package/html_treemenu.html)
Rozanov, Y. A. (1964). Probability theory a concise course (R. A. Silverman,
Ed.). New York: Dover Publication Inc.
Schott, G. D. (2005). Sex symbols ancient and modern: their origins and
iconography on the pedigree. British Medical Journal, 331 (7531), 1509–
1510.
Strang, G. (2003). Introduction to linear algebra. Wellesley:
Wellesley-Cambridge Press.
Tores, F., & Barillot, E. (2001, February). The art of pedigree drawing:
algorithmic aspects. Bioinformatics (Oxford, England), 17 (2), 174–
179.
Yahoo user interface. (2009). Website. (http://developer.yahoo.com/yui/)
Appendices
Appendix A
Code Structure
The key to successful large project development is a solid foundation.
A large project is completed when the code base is abandoned, not before.
Design goals change, algorithms are discovered, users generate new ideas and
new team members schooled in the latest buzz words come and go. As the
code base grows, the investment increases and the option of starting the
project over fades. This chapter documents the foundations. Hopefully we
have a foundation flexible enough to support the project over the long term.
The arrays are stored in a language directory as they contain the strings to
be used as column headings when the data is displayed.
This design breaks the development problem into two parts: code to
interpret the tables and the development of the tables. Hopefully the options
supported by the table interpreter are broad enough to accommodate most
of the changes that will invariably be required as the application ages.
A.3.1 base
Data can be entered into the application using the web interface or via files
containing comma separated fields, with the field type being set using head-
ings. The file input code supports quite a variety of data encodings, but only
one encoding is needed for the web interface. The one encoding needed is
the base encoding.
A.3.2 db string
The name of the field in the base database table (the unjoined table). Which
table is the base table depends on the field, and the location is given in the
table entry: location.
A.3.4 strings
As mentioned above, files can be used to enter data into the system. The
table entry points to a table of strings that can be used as heading strings in
upload files. The heading string selects a table entry to handle the column.
The code that selects the table entry does a string match on the table pointed
to by this table entry.
A.3.6 type
Data can be of two types: data used to select the record, and data stored in
the record. This field identifies which of the two types the entry is.
A.3.7 logic
The tables control the select logic. If this is set to AND, then the select field
is used as an AND condition. If set to OR, then the select field is used as an
OR condition. The different tag types (first tag, second tag etc.) are set as OR
conditions, while the herd is set as an AND condition. Different herds can have the same
tag because the animal has to belong to a particular herd. Within a herd,
different tag types can be used to identify the animals. If a tag type is to be
useful, each animal within the herd has to have a unique tag.
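The select logic described above can be sketched as follows. This is only an illustration written in JavaScript (the application itself is written in PHP); the entry layout, the field names and the helper buildSelect are hypothetical and chosen for the example. The OR fields are grouped inside one bracket, and every AND field must then match as well.

```javascript
// Hypothetical table entries: only the `logic` property mirrors A.3.7.
const selectFields = [
  { dbString: "first_tag", logic: "OR" },
  { dbString: "second_tag", logic: "OR" },
  { dbString: "herd", logic: "AND" },
];

// Build a parameterised WHERE clause from the fields the user supplied.
function buildSelect(fields, values) {
  const orParts = [];
  const orParams = [];
  const andParts = [];
  const andParams = [];
  for (const f of fields) {
    // Skip select fields the user left empty.
    if (!Object.prototype.hasOwnProperty.call(values, f.dbString)) continue;
    const part = `${f.dbString} = ?`;
    if (f.logic === "OR") {
      orParts.push(part);
      orParams.push(values[f.dbString]);
    } else {
      andParts.push(part);
      andParams.push(values[f.dbString]);
    }
  }
  const clauses = [];
  if (orParts.length > 0) clauses.push(`(${orParts.join(" OR ")})`);
  clauses.push(...andParts);
  // Parameters are returned in the same order as the placeholders.
  return { where: clauses.join(" AND "), params: [...orParams, ...andParams] };
}

const q = buildSelect(selectFields, {
  first_tag: "4001",
  second_tag: "B17",
  herd: "DYN",
});
console.log(q.where); // (first_tag = ? OR second_tag = ?) AND herd = ?
```

Keeping the values in a separate parameter list, rather than pasting them into the string, is a design choice of this sketch and not something the thesis specifies.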
A.3.9 location
The location of the data is given by location and db_string.
A.4.1 input to db
This class method is used to convert user input into the database format.
As an example, the database stores the sex as ’m’ or ’f’, but an English user
inputs ’doe’, ’buck’ and many other options to represent the sex. The input to
the method is the user string. As the method has to be able to return an error
message or a result, the method returns an associative array with three possible
index values: data, error_code and error_string. If the input data is
successfully converted, then there will be no error_code or error_string
fields in the associative array. The code in this method should clean out
attempts at SQL injection and cross site scripting.
This method should only concern itself with the field in question as
test_db_change looks after how the data relates to the rest of the database.
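A conversion of this kind might look like the sketch below. It is written in JavaScript for illustration only (the class method in the application is PHP); the synonym list, the error code value and the function name sexInputToDb are assumptions. Only the three result keys data, error_code and error_string come from the description above.

```javascript
// Hypothetical synonym list; the text only names 'doe' and 'buck'.
const SEX_WORDS = {
  doe: "f",
  female: "f",
  f: "f",
  buck: "m",
  male: "m",
  m: "m",
};

// Convert a user string into the database format ('m' or 'f').
// On success only `data` is present; on failure only the error fields.
function sexInputToDb(userString) {
  const key = String(userString).trim().toLowerCase();
  // Only values from the fixed list reach the database, so SQL or
  // script fragments in the input are rejected rather than stored.
  if (Object.prototype.hasOwnProperty.call(SEX_WORDS, key)) {
    return { data: SEX_WORDS[key] };
  }
  // The user input is deliberately not echoed back in the message.
  return { error_code: 1, error_string: "Unrecognised sex value" };
}

console.log(sexInputToDb(" Doe ")); // { data: 'f' }
console.log(sexInputToDb("x'; DROP TABLE animal; --").error_code); // 1
```

Mapping through a fixed list of accepted words is one simple way to meet the requirement that the method cleans out injection attempts for this particular field; free-text fields would need a different approach.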
Appendix B
Ajax
Asynchronous JavaScript and XML (AJAX) is a current buzz word. It is
actually very simple and has a lot to do with HTML and very little to do
with XML. A JavaScript application is event driven and the code is available
for execution while a web page is being displayed. JavaScript can place a
request to a server that is independent of the page request the user sent to
get the page being displayed; this is the asynchronous part. Different browser
brands require different code to start the request, which complicates the
JavaScript code a little but nothing more. The replies to the asynchronous
request are events that are used to call some JavaScript code.
The document displayed to the user is structured because it was generated
using HTML. The location of the various elements within the structure is
documented and is described as the document object model (the DOM). Using
the DOM, the programmer can work out which object to modify in order to
change the data displayed by that object. The JavaScript that receives the
reply to the asynchronous request retrieves the response string (which may
be encoded as XML data, but then again may not), decodes it and modifies
the objects that describe the document. The browser then updates the
displayed document.
All that is required on the server side is a page that responds to the
asynchronous request. The server side request is just another GET and the
server doesn’t know or care that it is an asynchronous request. Having the
server output XML code is optional.
communication (see figure B.1). The thread doing the update seeks the
communication file to a known location (zero) and writes the active line
number of the input file to the communication file. The thread replying to the
browser’s asynchronous request reads the data from the communication file. A
Unix system will present the file reading thread with the last value written by
the file writing thread.
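The file based communication can be sketched as follows. This is a Node.js (JavaScript) illustration and not the application's PHP code; the file name and the helpers writeProgress and readProgress are hypothetical, and a single process plays both the writing and the reading role here.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical communication file; the application derives its name
// from the session id.
const commFile = path.join(os.tmpdir(), `upload_progress_${process.pid}`);

// Writer side: always write at offset zero so the first line of the
// file holds the most recent line number.
function writeProgress(fd, lineNumber) {
  fs.writeSync(fd, `${lineNumber}\n`, 0);
}

// Reader side: only the first line is of interest. A missing or empty
// file is reported as "0".
function readProgress(file) {
  try {
    const first = fs.readFileSync(file, "utf8").split("\n")[0];
    return first === "" ? "0" : first;
  } catch {
    return "0";
  }
}

const fd = fs.openSync(commFile, "w+");
writeProgress(fd, 41);
writeProgress(fd, 42);
const seen = readProgress(commFile);
fs.closeSync(fd);
fs.unlinkSync(commFile);
console.log(seen); // 42
```

Because each value ends with a newline and the line number only grows, an older, shorter value can never leave stray digits on the first line.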
The server code is found in the file ajax_upload_animal, and the code
and the comments pretty much say all that needs to be said.
<?php
// Directory containing this script (everything before the last '/').
$file_path = $_SERVER['SCRIPT_FILENAME'];
$app_directory = dirname($file_path);

// The data file name was sent as a hidden field in the
// file screen_upload_add_animal.php etc.
// It is made from the session id, which
// is different for each active user.
// basename() strips any directory part so that the request
// cannot reach files outside the temp directory.
$ajax_data_file = isset($_REQUEST['ajax_data_file'])
    ? basename($_REQUEST['ajax_data_file'])
    : '';

// The file read by this code is updated in
// screen_upload_add_animal.php etc. as records
// are added to the database.
// We are responding to an asynchronous read using the JavaScript
// found in the file animal_upload.js
function record_count_open($app_directory, $session_id) {
    $path = "{$app_directory}/temp/{$session_id}";
    // Returns false when the file is missing or cannot be opened.
    return is_file($path) ? fopen($path, "r") : false;
}

// The server writes the line at offset zero, so
// we only need to read the first line.
function record_count_read($handle) {
    if (!$handle) return '0';
    $output = fgets($handle);
    return ($output === false) ? '0' : $output;
}

$handle = record_count_open($app_directory, $ajax_data_file);
$records = record_count_read($handle);
if ($handle) fclose($handle);
echo $records;
?>
Note that we are outputting only one value and we do not decorate it
with XML encoding. The Javascript code is downloaded to the browser
as Javascript normally is; that is, with a <script></script> tag sent as
part of the header.
<script src="animal_upload.js" type="text/javascript"></script>
The JavaScript sent to the browser pretty much documents what needs to
be done at the browser end:
function ajaxFunction() {
    var xmlHttp;
    try {
        // Firefox, Opera 8.0+, Safari
        xmlHttp = new XMLHttpRequest();
    }
    catch (e) {
        // Internet Explorer
        try {
            xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e) {
            try {
                xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
            }
            catch (e) {
                // So they don't get a record load update,
                // don't upset the user
                // alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }
    // Called on every state change; state 4 means the reply
    // has been received in full.
    xmlHttp.onreadystatechange = function () {
        if (xmlHttp.readyState == 4) {
            document.ajax_form.records_loaded.value =
                xmlHttp.responseText;
        }
    };
    // The xmlHttp.open() and xmlHttp.send() calls that dispatch
    // the request are not part of this listing.
}
This code only contains the code to initiate the request and deal with the
response. The code to deal with the response is contained in the method:
xmlHttp.onreadystatechange.
The name of the DOM element depends on the document structure. ajax_form
and records_loaded are names given in the HTML code sent by the server to
display the original page.
As mentioned above, Javascript code is event driven. The event to initiate
the asynchronous request is created when the submit button is clicked. The
HTML code is in the file originally sent to display the page and has the
following form:
<input type="submit"
class="form"
size="60"
name="accept"
value="submit"
onclick="window.setInterval('ajaxFunction()', 2000)"
>
The interval event set by the code above will end when a new page is sent
to the browser by the server thread processing the submitted file (the user
file containing records to update the database). Care needs to be taken with
the server thread timeout value, as files that add large numbers of records
to large transitive closures can take a long time to process. Using AJAX to
provide feedback stops the user panicking but not the server.
B.2 Yahoo user interface; selecting fields to
display
“The YUI Library is a set of utilities and controls, written in JavaScript, for
building richly interactive web applications using techniques such as DOM
scripting, DHTML and AJAX” (Yahoo user interface, 2009).
The YUI consists of several Javascript libraries that extend the function-
ality of Javascript and to this is added application specific code written by
the application developer. The code written by the application developer can
ask the browser to load the Javascript libraries from the Yahoo server or the
developer can move the required files to a computer under his control and
send code to the browser that asks for the libraries to be loaded from there.
Merrit is an application used by many different breeders and has support
for many different data fields. For example, there are four options for
identifying an animal: primary tag, secondary tag, name and e tag. A user
will only want the fields he or she uses displayed in the data editing
screens. There needs to be a screen the user can use to select the fields
of interest. As the Yahoo User Interface offers drag and drop support, drag
and drop is an option.
The JavaScript code that looks after drag and drop manipulates the DOM tree
but doesn’t send an asynchronous request to the server.
Figure B.2: Field select drag and drop
The application specific code is found in the file /class/display_order.php.
The list of available fields is created using
the table described in section A.3. The options selected for display by the
user are sent to the server when the user clicks the submit button. The
options selected are encoded in a semicolon separated string which is saved
in a herd specific database table.
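The encoding of the selected fields can be sketched as below. The sketch is in JavaScript for illustration; the field names and the two helper functions are hypothetical, and only the semicolon separated format comes from the description above.

```javascript
// Encode the ordered list of selected fields as one string.
function encodeDisplayOrder(fields) {
  return fields.join(";");
}

// Decode the stored string back into an ordered list of field names.
function decodeDisplayOrder(stored) {
  return stored === "" ? [] : stored.split(";");
}

// Hypothetical field names, in the order the user dropped them.
const selected = ["first_tag", "name", "sex", "birth_date"];
const stored = encodeDisplayOrder(selected);
console.log(stored); // first_tag;name;sex;birth_date
console.log(decodeDisplayOrder(stored)); // [ 'first_tag', 'name', 'sex', 'birth_date' ]
```

The format relies on field names never containing a semicolon, which holds for identifier-like names such as those used here.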
Appendix C
The Application
C.1 Overview
The application screen is divided into four areas (see figure C.2):
- A browser area down the left side, which is used to navigate the appli-
cation.
- A link bar across the top, which has two important items, login and
log-off. These links are used by herd owners to log onto and off the
application.
The application has a public face and a herd owners’ face (see figure C.1).
The public face presents screens that can be used to view the data and
the herd owners’ face presents screens to view, edit, and upload data as
well as screens to setup the fields to be displayed. When a user enters the
application, the browser area displays the public tree, which contains a herd
branch. There is a list of herds under the herd branch. Clicking on the herd
code displays herd details, and clicking on the arrow to the left of the herd
code opens another branch. Within that branch, the user finds screens to
display animal data. When entering the application, the active screen area
displays an application summary.
- Text that provides an overview of the herd. The text can be altered by
the herd owner using the herd edit screen.
Once logged on, the herd owners may edit data, control the fields
displayed, upload files to alter data in batch lots, and download data
into files on their local machines.
The public may look at the data.
Figure C.1: Public and logged on browsers
Figure C.2: Entry screen with screen areas highlighted.
Figure C.3: Public screen: herd data
- Data from the herd record. The herd owner can alter which fields are
displayed using the herd setup screen and the contents of the fields
using the herd edit screen.
- A picture, which the herd owner can upload using the herd edit screen.
If the herd owner provides a link to his/her web site, the link is presented in
the title bar.
Figure C.4: Herd edit screen.
- Herd details are entered using a text box. HTML tags can be used
to add headings and highlight text etc. [upload herd details] must be
clicked to transfer data to the server.
- Fields to edit the herd record. The fields in this set are set up using
the herd setup screen. The set of fields that can be displayed in the herd
edit screen is larger than the set that can be displayed in the herd screen.
For instance, if the herd password is one of the displayed fields, it is
displayed in the herd edit screen but not in the herd screen. In the herd
setup screen, the fields that are displayed in both the herd edit screen
and the herd screen are displayed in a different colour.
Figure C.5: Herd setup screen.
Figure C.6: Animal screen, display only
it). Dragging and dropping the fields in the display column alters the order
of display. The [Update display] button must be clicked to send the changes
to the server.
Field width
The server displays the current list of selected fields in the field width section.
Different field widths can be set and the [update width] button is used to
upload the changes to the server. It is best to select your fields, set the order
and click [update display] and then set the field width in the new screen.
animal screen is found in the browser’s edit branch, and the link brings up
the private version.
The first animals displayed and the order of display are determined by
the tests and values set above the columns. An example screen is shown in
figure C.6.
The fields displayed in the public version of the animal screen are set
using the animal public setup screen. The link to this screen is found in
the setup branch, and the icon used to represent this screen contains a large
yellow mallet. Clicking on the link next to the icon displays a screen that
can be used to beat the public animal screen into some sort of order.
The fields displayed to a logged on user are set in the animal setup screen,
a link to which can also be found in the setup branch, with an icon that
contains a small hammer. Delicate use of this tool results in a private animal
screen set up to meet your editing needs.
As mentioned above, the fields above the columns are used to request the
display of records from a set starting point and order. You have a choice of
two tests: equal to, or greater than or equal to. If you want to display all
DYN does with a primary tag above 4000, you would set the column headings up
as shown in figure C.6.
There are three buttons and two fields below the displayed data. The
[display] button takes you back to the start of the list of records set by the
column headings, the [next] button takes you to the next group and the
[previous] button takes you back to the previous group. The record field is
used to set the number of records displayed on the screen and the offset field
is used to set the offset within the record set. The offset field changes when
the [next] and [previous] buttons are used.
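The interaction between the column tests, the record field and the offset field can be sketched as follows. This is a JavaScript illustration with made-up data; the record layout, the numeric tags and all function names are assumptions, and the way the offsets are clamped at the ends of the list is a choice made for the sketch.

```javascript
// Hypothetical animal records.
const animals = [
  { herd: "DYN", sex: "f", first_tag: 3990 },
  { herd: "DYN", sex: "f", first_tag: 4000 },
  { herd: "DYN", sex: "m", first_tag: 4001 },
  { herd: "DYN", sex: "f", first_tag: 4002 },
  { herd: "TST", sex: "f", first_tag: 4003 },
  { herd: "DYN", sex: "f", first_tag: 4010 },
];

// Each column test is either "=" or ">=".
function selectRecords(records, tests) {
  return records.filter((r) =>
    tests.every((t) =>
      t.test === "=" ? r[t.field] === t.value : r[t.field] >= t.value
    )
  );
}

// The group of records shown on one screen.
function page(matches, offset, count) {
  return matches.slice(offset, offset + count);
}

// [next] and [previous] only move the offset; [display] resets it to 0.
function nextOffset(offset, count, total) {
  return offset + count < total ? offset + count : offset;
}
function previousOffset(offset, count) {
  return Math.max(0, offset - count);
}

// All DYN does with a primary tag of 4000 or above, two per screen.
const matches = selectRecords(animals, [
  { field: "herd", test: "=", value: "DYN" },
  { field: "sex", test: "=", value: "f" },
  { field: "first_tag", test: ">=", value: 4000 },
]);
const first = page(matches, 0, 2).map((r) => r.first_tag);
const offset2 = nextOffset(0, 2, matches.length);
const second = page(matches, offset2, 2).map((r) => r.first_tag);
console.log(first, second); // [ 4000, 4002 ] [ 4010 ]
```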
Next to each record is a view button, which takes you to the screen that
will display the full animal record (see section C.5).
The private and public versions of the animal screen are almost the same.
If you have the right to edit a record, the private version allows you to enter
new data using the fields that display the current data. If you have the right
to edit the record, a delete button appears next to the record. The view
button takes you to a screen that can be used to edit all animal details,
including the animal phenotype records.
give their animals a name, a primary tag, a secondary tag and an electronic
tag. The same breeder may only want to make the primary tag public.
Another breeder may use only the primary tag and will not want the
private or public animal screen cluttered with fields that are not used. To
further complicate matters, different users may want the same data displayed
in different ways.
The public animals setup screen is divided into two areas: one is used to
select the fields to display and the order in which they are displayed, and the
other is used to set the field width.
Field width
The server displays the current list of selected fields in the field width section.
As the display field and order section is quite long, you may need to use the
right hand side bar to see this section. Different field widths can be set for
each selected field and the [update width] button can be used to upload the
changes to the server. It is best to select your fields, set the order and click
[update display] and then set the field width in the new screen.
To display many items on one line, it pays to keep the field widths as
small as possible.
Figure C.7: Animal detail screen
screen to edit and view private data. This screen functions in the same way
as the screen described in the previous section.
Select fields
The detail screen (see figure C.7 for a partial display) is long and divided
into several areas. Along the top are fields used to select a record. The fields
displayed to select a record are the select fields picked for display in the detail
setup screen (discussed later). First tag, second tag, name, e tag and herd
are the select fields.
The herd must be set and the system insists that all tag fields are unique
within the herd. Therefore, any tag field that has been set can be used to
search for an animal.
Animal record
The animal record section displays animal record data. If you are logged on,
the data can be edited where displayed and updated by clicking the update
button located under this section.
Photograph
A photo of the animal can be added. If the user is not logged on, the photo
is displayed to the right of the animal record data. If the user is logged on,
the screen contains a field to select a file containing an animal image and a
button to initiate a file upload.
Pedigree
The pedigree is displayed next. Chapter 6 discusses the displaying of pedi-
grees in some detail. The animal identifications displayed in the pedigree are
links to detail screens for the animal identified. The tree structure displays
how the animals are related.
Descendants
The descendants are displayed next. Chapter 6 discusses the displaying of
descendants in some detail. The descendant tree is available for browsing
pleasure, but if the number of descendants is too large, only the offspring
links are displayed.
Phenotype records
The phenotype records follow the descendants. At the time of writing, only
shearing records are implemented.
screens. There are sections for the animal record and supported phenotype
records. As with the animals screen, there are different setup screens for the
public and private versions of the detail screen.
Files may contain comments, which start with a // and go to the end of
the line. Comments are ignored by the decoder. Comments in hand created
files are useful, as they can be used to comment out records that you are
unsure of, and add details that are not used by the system. Figure C.8 has
a comment telling us that the file is used to create the example used in R.
A. Mrode’s book.
The first active line must contain a list of column names separated by the
same separator as that used in the rest of the file. The system selects the
separator, checking that the separator creates more than one field and that
the number of fields found in the first and second active line are the same.
Valid separators are ’^’, ’,’, ’:’, ’;’ and tab. The upload screen displays
a list of valid heading strings. For a discussion of how they are stored in the
application, see section A.3.4.
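The separator selection described above can be sketched as follows. This is a JavaScript illustration, not the application's decoder; the function names are hypothetical, and the order in which the candidate separators are tried is an assumption. A candidate is accepted when it splits the first active line into more than one field and gives the same number of fields on the second active line.

```javascript
const SEPARATORS = ["^", ",", ":", ";", "\t"];

// Active lines are what remains after stripping // comments and
// dropping lines that are then empty.
function activeLines(text) {
  return text
    .split(/\r?\n/)
    .map((line) => {
      const at = line.indexOf("//");
      return at === -1 ? line : line.slice(0, at);
    })
    .filter((line) => line.trim() !== "");
}

// Returns the separator, or null when no candidate fits.
function detectSeparator(text) {
  const [first, second] = activeLines(text);
  if (first === undefined || second === undefined) return null;
  for (const sep of SEPARATORS) {
    const fieldCount = first.split(sep).length;
    if (fieldCount > 1 && second.split(sep).length === fieldCount) {
      return sep;
    }
  }
  return null;
}

const upload = [
  "// hand made test file",
  "first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;",
  "E1;BUCK;;TST;;;",
].join("\n");
console.log(detectSeparator(upload)); // ;
```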
Hand created files may use " or ’ to bring the field value set in the
previous line to the current line. This speeds up data entry. For example,
twins have the same sire and dam, and the tag details from the first kid
record can be used for the second.
Figure C.9: Three steps to upload a file
As mentioned above, the number of fields in the first and second active
line must be the same, but the last field can be dropped in following lines.
It pays to make the last field a comment, as adding or leaving the comment
off is then optional.
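The two conveniences for hand created files, the ditto mark and the optional last field, can be sketched as follows. This is a JavaScript illustration; the function decodeRows, the column layout and the example records are hypothetical.

```javascript
// rows: data lines already split into fields (heading line excluded).
// width: number of fields in the heading line.
function decodeRows(rows, width) {
  const out = [];
  for (const row of rows) {
    // The last field may be dropped; restore it as an empty field.
    const padded = row.length === width - 1 ? [...row, ""] : [...row];
    // Ditto marks copy the value from the previous decoded line, so a
    // run of ditto marks keeps repeating the original value.
    const prev = out[out.length - 1];
    out.push(
      padded.map((value, i) =>
        (value === '"' || value === "'") && prev && prev[i] !== undefined
          ? prev[i]
          : value
      )
    );
  }
  return out;
}

// Hypothetical columns: first_tag, sex, sire, dam, comment.
const twins = decodeRows(
  [
    ["K1", "DOE", "E1", "E2", "first of twins"],
    ["K2", "BUCK", '"', '"'],
  ],
  5
);
console.log(twins[1]); // [ 'K2', 'BUCK', 'E1', 'E2', '' ]
```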
- The system decodes the file, and checks that the data types within the
fields are correct. If there are errors, the table is presented back to the
user, with error fields highlighted and a [reject] button.
- If the data types are correct, the system checks that the proposed data
will not violate database integrity tests. For example: it ensures that
birth dates are in order, and if we are adding records, that there is no
record for the animal present already. If the data doesn’t pass these
tests, it is presented back to the user with the faulty records highlighted
and a [reject] button. If all tests are passed, the data is presented back
with an [accept] and [reject] button.
- If the data is good and the user clicks the [accept] button, the data is
loaded into the database.
Figure C.10: File download screen
If the columns selected for download change, the [update fields] button
needs to be clicked. When the updated list is returned from the server, the
download can be started by clicking the [download] button.
Breeders with the privilege to do so can download the records from all
herds. This privilege is needed if the user is calculating industry wide esti-
mated breeding values.
first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;
E1;BUCK;;TST;;;
E2;DOE;;TST;;;
E3;DOE;1999-09-01;TST;E1;E2;
E4;BUCK;2000-12-01;TST;E1;UNSET;
E5;BUCK;2001-12-01;TST;E4;E3;
E6;BUCK;2002-12-01;TST;E5;E2;
UNSET;DOE;;TST;;;
Appendix D
Linear Regression
Chapter 2 mentioned linear regression and then moved on. This appendix
discusses the topic further.
The covariance over the variance gives the regression coefficient which is
the gradient of a line fitted to the data using least squares. Linear regression
can be explained using linear algebra, statistics or calculus. All three methods
provide some insight.
D.1 Calculus
A line has the formula:
$$ y = bx + a \tag{D.1} $$
To fit the line you alter a and b until the predicted value is as close to the
actual value as possible. The difference between the predicted and the actual
value is the residual. To find the minimum, you need a formula that tells
you how the sum of the squared residuals changes over the data set as a and b
vary. For a point the residual is:
$$ r_i = y_i - f(x_i) \tag{D.2} $$
Squaring the residuals and summing them gives a quadratic surface, with the
sum of the squared residuals on one axis and the values of a and b on the
other two. A quadratic surface has a minimum or a maximum where the
derivative is zero. We want to find the minimum as we alter a and b. We take
the partial derivatives with respect to a and b and solve the resulting
simultaneous equations.
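For a single point, with respect to a:
$$ \frac{\partial (r^2)}{\partial a} = 2r \frac{\partial r}{\partial a} $$
$$ r = y_i - b x_i - a $$
$$ \frac{\partial r}{\partial a} = -1 $$
$$ 0 = 2r \frac{\partial r}{\partial a} $$
$$ 0 = -2(y_i - b x_i - a) \tag{D.4} $$
For a single point, with respect to b:
$$ \frac{\partial (r^2)}{\partial b} = 2r \frac{\partial r}{\partial b} $$
$$ r = y_i - b x_i - a $$
$$ \frac{\partial r}{\partial b} = -x_i $$
$$ 0 = 2r \frac{\partial r}{\partial b} $$
$$ 0 = -2(y_i - b x_i - a)(x_i) \tag{D.5} $$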
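Equations D.4 and D.5 apply to a single point. If we sum each over the
whole data set and put the result in matrix form, we get:
$$
\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}
\begin{bmatrix} b \\ a \end{bmatrix}
=
\begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}
\tag{D.6}
$$
Solving for a:
$$ a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - (\sum x_i)^2} \tag{D.7} $$
$$ a = \frac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - n\bar{x}^2} \tag{D.8} $$
Solving for b:
$$ b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2} \tag{D.9} $$
$$ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \tag{D.10} $$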
It is important to note:
2. Least squares works out well because the sum of the squared residuals is
quadratic in a and b, so it has a single minimum where the first derivatives
are zero.
3. That b is the gradient of a line and a is where the line cuts through the
y axis when x is zero.
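As a numerical check of the derivation, the sketch below solves the two simultaneous equations of D.6 for a and b using sums over the data. It is written in JavaScript for illustration; the function name fitLine and the data set are made up, and the points were chosen to lie exactly on y = 2x + 1 so the result is easy to verify by hand.

```javascript
// Fit y = bx + a by least squares using the sums that appear in D.6.
function fitLine(xs, ys) {
  const n = xs.length;
  let sumX = 0;
  let sumY = 0;
  let sumXX = 0;
  let sumXY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
    sumXX += xs[i] * xs[i];
    sumXY += xs[i] * ys[i];
  }
  // Determinant of the 2 x 2 matrix in D.6.
  const det = n * sumXX - sumX * sumX;
  if (n < 2 || det === 0) {
    throw new Error("need at least two points with different x values");
  }
  const b = (n * sumXY - sumX * sumY) / det; // gradient
  const a = (sumY * sumXX - sumX * sumXY) / det; // intercept
  return { a, b };
}

// Sums: x = 10, y = 24, x^2 = 30, xy = 70, so det = 4*30 - 10^2 = 20.
console.log(fitLine([1, 2, 3, 4], [3, 5, 7, 9])); // { a: 1, b: 2 }
```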
D.2 Statistics
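Variance was introduced into the lexicon by Fisher in his 1918 paper (Lynch
& Walsh, 1997) and is defined as (see footnote 1):
$$ \sigma_{xx} = \frac{\sum (x_i^2) - n\bar{x}^2}{n - 1} \tag{D.11} $$
Covariance is defined as:
$$ \sigma_{xy} = \frac{\sum (x_i y_i) - n\bar{x}\bar{y}}{n - 1} \tag{D.12} $$
The regression coefficient is defined as the covariance over the variance:
$$ b = \frac{\sigma_{xy}}{\sigma_{xx}} \tag{D.13} $$
$$ b = \frac{\left(\sum (x_i y_i) - n\bar{x}\bar{y}\right) / (n - 1)}{\left(\sum (x_i^2) - n\bar{x}^2\right) / (n - 1)} \tag{D.14} $$
After simplification:
$$ b = \frac{\sum y_i x_i - n\bar{x}\bar{y}}{\sum (x_i^2) - n\bar{x}^2} \tag{D.15} $$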
1. The regression coefficient is nothing more than the gradient of the line
fitted using least squares.
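The same gradient can be computed from the statistical definitions. The JavaScript sketch below is an illustration only; the function names and the data are made up. It computes the variance and covariance with n − 1 in the denominator and divides one by the other; the n − 1 factors cancel, so the result matches the least squares gradient.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample covariance of xs and ys; covariance(xs, xs) is the variance.
function covariance(xs, ys) {
  const n = xs.length;
  let sumXY = 0;
  for (let i = 0; i < n; i++) sumXY += xs[i] * ys[i];
  return (sumXY - n * mean(xs) * mean(ys)) / (n - 1);
}

// Regression coefficient: covariance over variance.
function regressionCoefficient(xs, ys) {
  return covariance(xs, ys) / covariance(xs, xs);
}

// Points on y = 2x + 1, so the coefficient should be very close to 2.
const slope = regressionCoefficient([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(slope);
```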
1 There is good reason to use n − 1 when calculating the standard deviation: we are
working with differences, and the number of differences is one less than the number of
points. We are still dealing with differences when dealing with the entire population, so
mere mortals should still use n − 1; statisticians, being the high priests of science, are
communicating with god to obtain prior knowledge, so they can use n.
D.3 Normal equation
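If you think in terms of subspaces, the derivation of the normal equations
is straightforward. The matrix equation we wish to solve is:
$$ y = X\hat{\beta} + e $$
The least squares error vector e is orthogonal to the columns of X:
$$ X^T e = 0 $$
$$ e = y - X\hat{\beta} $$
So:
$$ X^T (y - X\hat{\beta}) = 0 $$
$$ X^T y - X^T X\hat{\beta} = 0 $$
$$ X^T y = X^T X\hat{\beta} $$
$$ (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T X\hat{\beta} $$
$$ (X^T X)^{-1} X^T y = I\hat{\beta} $$
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$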
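For a straight line the normal equations involve only a 2 x 2 matrix, which makes them easy to work through in code. The JavaScript sketch below is an illustration under assumptions: the function name is made up, X is taken to have one row [x_i, 1] per data point so that the estimate is [b, a], and the inverse is written out by hand for the 2 x 2 case rather than using a matrix library.

```javascript
// Solve (X^T X)^-1 X^T y for a line, with X rows of the form [x_i, 1].
function solveNormalEquations(xs, ys) {
  const X = xs.map((x) => [x, 1]);
  const XtX = [
    [0, 0],
    [0, 0],
  ];
  const Xty = [0, 0];
  // Accumulate X^T X (2 x 2) and X^T y (2 x 1) row by row.
  X.forEach((row, i) => {
    for (let j = 0; j < 2; j++) {
      Xty[j] += row[j] * ys[i];
      for (let k = 0; k < 2; k++) XtX[j][k] += row[j] * row[k];
    }
  });
  const det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0];
  if (det === 0) throw new Error("X^T X is singular");
  // Inverse of a 2 x 2 matrix.
  const inv = [
    [XtX[1][1] / det, -XtX[0][1] / det],
    [-XtX[1][0] / det, XtX[0][0] / det],
  ];
  // beta = inverse(X^T X) times X^T y, returned as [b, a].
  return [
    inv[0][0] * Xty[0] + inv[0][1] * Xty[1],
    inv[1][0] * Xty[0] + inv[1][1] * Xty[1],
  ];
}

// Points on y = 2x + 1: expect b close to 2 and a close to 1.
const beta = solveNormalEquations([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(beta);
```

Here X^T X works out to the matrix on the left of D.6 and X^T y to the vector on its right, so the matrix form and the calculus form are the same pair of equations.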