
University of Ballarat

Bachelor of Computing Honours thesis


Transitive Closure, Inbreeding
Coefficients, Pedigrees, on line
Collection and on line Presentation
Author: Charles E. Esson
Supervisor: Gregory L. Simmons

November 3, 2009
I, Charles Esson, declare that this thesis titled, ‘Transitive Closure, In-
breeding Coefficients, Pedigrees, on line Collection and on line Presentation’
and the work presented in it are my own. I confirm that this work has not
been submitted for the award of any other degree or diploma in any tertiary
institution.

Signed:

Date:

Abstract

An on-line system was developed to collect and display pedigree data, test
the developed algorithms and present the results. The system uses a
transitive closure, minimum path length, maximum path length and an
adjacency list to deliver inbreeding coefficients, pedigree trees, descendant
trees and enforcement of pedigree constraints in real time. Real time methods
to display an animal's pedigree and descendants, to calculate the inbreeding
coefficient, to enforce the constraints and to incrementally maintain the
maximum path length between nodes were developed, along with improved
routines to incrementally insert and delete edges from a transitive closure.

Key words: Pedigree, transitive closure, SQL view maintenance, inbreeding
coefficient

Acknowledgements

Thank you:
Trish Esson: for the large data set used for testing and for putting up with me.
Andrew James: for the second data set.
Greg Simmons: for supervision, encouragement and assistance.
Sudanthi Wijewickrema: for proof reading.

List of Figures

2.1 Heritability
2.2 Example pedigree

3.1 Restoration of deleted paths
3.2 Three head to tail joins of trusted paths required
3.3 Shortest and longest path
3.4 Node codes

5.1 Animal record structure

6.1 Partially opened Merrrit pedigree display for animal 4461
6.2 Displaying descendants

A.1 Application data input and output

B.1 Intertask communication
B.2 Field select drag and drop

C.1 Public and logged on browsers
C.2 Entry screen with screen areas highlighted
C.3 Public screen: herd data
C.4 Herd edit screen
C.5 Herd setup screen
C.6 Animal screen, display only
C.7 Animal detail screen
C.8 Example animal add file
C.9 Three steps to upload a file
C.10 File download screen
C.11 Example animal download file

D.1 Projecting a 3D point onto a 2D plane

List of Tables

2.1 Hardy-Weinberg law
2.2 Arrangement of animals from longest path to shortest
2.3 A matrix for example pedigree given in figure 2.2 (Mrode, 1996)

3.1 Edges table for pedigree given in figure 2.2
3.2 Transitive closure of example pedigree from figure 2.2

Contents

1 Introduction
  1.1 Previous work
  1.2 Methodology
  1.3 Thesis structure

2 Genetics
  2.1 Early history
  2.2 Two alleles
  2.3 Quantitative genetics
  2.4 Genotype and phenotype
  2.5 Heritability
    2.5.1 Children phenotype - Parent phenotype
    2.5.2 Breeding value - Parent phenotype
    2.5.3 Additive genetic effect
    2.5.4 Correlation
  2.6 Ordinary least squares
  2.7 Generalized least squares
  2.8 Best linear unbiased prediction
  2.9 Predicting animal breeding values
  2.10 Within herd estimated breeding values
    2.10.1 No members in a group
  2.11 Relationship matrix
    2.11.1 Calculating the relationship matrix
  2.12 Calculating the inbreeding coefficient
  2.13 Incrementally maintaining $A^{-1}$
  2.14 Conclusion

3 Graphs and transitive closures
  3.1 Directed acyclic graph
  3.2 Transitive closure
  3.3 Only one transitive closure needs to be maintained
  3.4 Incremental maintenance of a transitive closure
    3.4.1 Insertion of an edge into a transitive closure
    3.4.2 Deletion of an edge from a transitive closure
    3.4.3 Three self joins of "TRUSTY" to recover all paths
    3.4.4 Fragments
    3.4.5 The shortest and longest path
    3.4.6 Building a transitive closure from a depth first search
    3.4.7 Alternative to using a transitive closure
    3.4.8 Building the transitive closure one herd at a time
  3.5 Conclusion

4 Incremental maintenance of a transitive closure using SQL
  4.1 PostgreSQL transitive closure edge insert
    4.1.1 PostgreSQL version 7.1 code
  4.2 PostgreSQL transitive closure edge unlink
  4.3 Conclusion

5 Using the transitive closure, the SQL code
  5.1 Introduction to database table structure
  5.2 Valid birth date
  5.3 Can't change the sex of an animal with descendants
  5.4 Calculation of the inbreeding coefficient
  5.5 Descendant records
  5.6 Ancestor records
  5.7 Conclusion

6 Displaying pedigrees and descendants
  6.1 Ancestors
  6.2 Descendants
  6.3 Conclusion

7 Thesis conclusion and future research
  7.1 Thesis conclusion
  7.2 Future research
    7.2.1 Can node codes be maintained incrementally?
    7.2.2 Did the application meet its underlying design goal?
    7.2.3 Are production gains increased?
    7.2.4 Will this work affect Lambplan?

References

Appendices

A Code Structure
  A.1 Input values and return values
  A.2 Error codes
  A.3 The table
    A.3.1 base
    A.3.2 db string
    A.3.3 dbt string
    A.3.4 strings
    A.3.5 data convert
    A.3.6 type
    A.3.7 logic
    A.3.8 heading string
    A.3.9 location
  A.4 Data entry and display
    A.4.1 input to db
    A.4.2 test db change
    A.4.3 get data
    A.4.4 output from db

B Ajax
  B.1 File upload progress
  B.2 Yahoo user interface; selecting fields to display

C The Application
  C.1 Overview
  C.2 Login and logoff
  C.3 The herd screens
    C.3.1 Herd edit screen
    C.3.2 Herd setup screen
  C.4 Animal screens
    C.4.1 Public animal setup screen
    C.4.2 Animal setup screen
  C.5 Detail screen
    C.5.1 Detail setup screen
  C.6 Upload screens
    C.6.1 Upload file structure
    C.6.2 Upload sequence
  C.7 Download files

D Linear Regression
  D.1 Calculus
  D.2 Statistics
  D.3 Normal equation

Chapter 1

Introduction

In 2005 the Australian cashmere industry received funding from the Rural
Industries Research and Development Corporation (RIRDC) for a sire
referencing scheme. Further funding was granted to develop a method to
generate estimated breeding values (EBV) across the entire industry. This
project resulted in a batch solution developed by Andrew T. James (James,
2009). The industry now feels the EBV project is at a stage where it is
appropriate to move to on line collection and presentation of pedigree and
phenotype data, direct calculation of some results and display of additional
results from the batch process. An ideal system would present inbreeding
coefficients and within herd estimated breeding values without running the
batch process.

This research effort aims to develop methods that are fast enough to meet
the ideal design goals. An online system has been constructed and the
developed methods implemented to make sure performance is acceptable.

The key insights documented are:

- If we incrementally maintain a transitive closure of the pedigree along
  with the maximum path length between nodes, we can quickly obtain a list
  of all the ancestors and order the ancestors in the order they enter the
  pedigree.

- The work of Pang, Dong, & Ramamohanarao (2005) can be extended to
  maintain the maximum path between nodes.

- An animal's inbreeding coefficient is only affected by its ancestors.
  Recent ancestors have a greater effect on the inbreeding coefficient, and
  therefore the calculation speed can be increased by limiting the minimum
  path length between nodes to a maximum value.

- When an animal is added, the inverse of the additive genetic relationship
  between animals ($A^{-1}$) can be maintained incrementally; the entries
  affected are limited to a subset of all entries and these entries can be
  determined using a list of descendants and their ancestors.

- The performance of the SQL incremental edge deletion routine can be
  improved by creating a table of segments that may be useful in recreating
  excess edges deleted when suspect (see section 3.4.1) is removed from the
  old transitive closure. The segments that may be useful are those that
  start and end on a path that has been deleted.

- If you are maintaining a transitive closure and a minimum path length, you
  do not have to maintain an edge table.
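The first insight can be illustrated with a minimal Python sketch. It is not the SQL implementation developed in the thesis: the function names and the four-animal pedigree are assumptions made for illustration, and the closure is built in one batch rather than maintained incrementally. It stores the minimum and maximum path length for every ancestor and descendant pair, then lists an animal's ancestors with the longest maximum path first.

```python
def closure_with_path_lengths(edges):
    """Map (ancestor, descendant) to (min_len, max_len) for an acyclic pedigree.

    edges is a list of (parent, child) pairs. This is a batch build for
    illustration, not the incremental maintenance developed in the thesis.
    """
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)

    memo = {}

    def ancestors(node):
        # Memoised walk: every ancestor of node with its (min, max) path length.
        if node in memo:
            return memo[node]
        found = {}
        for parent in parents.get(node, ()):
            reached = {parent: (1, 1)}
            for anc, (lo, hi) in ancestors(parent).items():
                reached[anc] = (lo + 1, hi + 1)
            for anc, (lo, hi) in reached.items():
                old_lo, old_hi = found.get(anc, (lo, hi))
                found[anc] = (min(lo, old_lo), max(hi, old_hi))
        memo[node] = found
        return found

    closure = {}
    for node in {n for edge in edges for n in edge}:
        for anc, lengths in ancestors(node).items():
            closure[(anc, node)] = lengths
    return closure


def ancestors_oldest_first(closure, animal):
    """Ancestors of animal, longest maximum path first (ties broken by name)."""
    rows = [(hi, anc) for (anc, desc), (_, hi) in closure.items() if desc == animal]
    return [anc for hi, anc in sorted(rows, key=lambda row: (-row[0], row[1]))]


# Hypothetical pedigree: A and B are the parents of C; A and C are the parents of E.
edges = [("A", "C"), ("B", "C"), ("A", "E"), ("C", "E")]
closure = closure_with_path_lengths(edges)
print(closure[("A", "E")])                   # (1, 2)
print(ancestors_oldest_first(closure, "E"))  # ['A', 'B', 'C']
```

Sorting on the maximum path length guarantees that every parent appears before its offspring, because a parent's longest path to the animal is always at least one edge longer than its offspring's.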

Testing showed that for the large industry wide herd, incrementally
maintaining the pedigree transitive closure in real time was too slow.
Methods were therefore developed to speed up the algorithms. The final
solution to this problem was to maintain a transitive closure for each
user's herd, and to use a batch process to maintain an industry wide
transitive closure. Methods to do this, and to decide which data structure
to use for display, were developed.

1.1 Previous work


The mathematics and methods for calculating inbreeding coefficients and
estimated breeding values as a batch process across the entire data set are
well developed and solutions have been implemented. Efficient querying of
large pedigrees has been examined by Elliott, Akgul, Ozsoyoglu, & Manilich
(2006). Their work used “node codes”, which provide a complete description
of the genetic flow from one animal to another for all animals in the herd,
and aid the querying of the pedigree structure. However, a depth first
search was used to create the codes, which is a time consuming calculation,
making the proposed solution unsuitable for incremental calculation.

1.2 Methodology
The design science paradigm seeks to extend the boundaries of human and
organizational capabilities by creating new and innovative artifacts. It
addresses research through the building and evaluation of artifacts to meet
identified needs (Hevner, March, Park, & Ram, 2004). The industry's
identified needs for the artifact were:

- On line collection of pedigree and phenotype data.

- On line presentation of any results that can be generated in real time.

- Presentation of results calculated off line.

Presenting the research as design science, however, underplays the
achievements. Methods to calculate the inbreeding coefficients in real time
and methods to incrementally maintain $A^{-1}$ were important results, as
were improvements to the incremental transitive closure algorithms and
methods to incrementally maintain the maximum path between nodes. As well as
meeting the industry needs, the developed program was used to test the
success or failure of the proposed solutions and algorithm improvements.

1.3 Thesis structure
To understand Merrrit and how the design goals were reached, several
concepts need to be understood and in some cases extended. Some chapters
introduce new material; others look at implementation details.

Chapter 2 looks at the genetics and mathematics behind the calculation of
inbreeding coefficients and estimated breeding values. It is noted that to
calculate the inbreeding coefficient, you need the animal's ancestors in the
order they entered the pedigree. It is also noted that a list of descendants
and their ancestors can be used to incrementally update $A^{-1}$.

Chapter 3 considers incremental maintenance of a transitive closure. A
method to maintain the maximum path length is introduced; this is needed to
return a list of ancestors in the order needed to calculate the inbreeding
coefficient (using the method discussed in chapter 2). Maintaining the
minimum path length removes the need to maintain an edge table, which
previous work in this area has maintained. Methods to reduce the execution
time of incremental edge deletion are also introduced.

Chapter 4 takes the results from chapter 3 and presents the SQL code to
implement the edge insertion and deletion solutions.

Chapter 5 looks at how the application uses the transitive closure to
perform its functions. The SQL code used to maintain the integrity of the
pedigree, and to obtain the list of animals required to calculate the
inbreeding coefficient, is also introduced.

Chapter 6 discusses the on line displaying of pedigrees and descendant trees.

Chapter 7 presents the conclusion and further research.

Appendix A looks at the code structure of the developed application,
appendix B looks at AJAX and how it is used in the application, and appendix
C documents the application. Appendix D provides an introduction to linear
regression; the results are used in chapter 2.

Chapter 2

Genetics

The industry is interested in identifying animals that have the potential
to increase the economic returns quickly. Calculating accurate estimated
breeding values (EBV) and encouraging their use across the cashmere industry
is the underlying aim of this and previous projects. To identify the best
animal across multiple herds, we need to use mathematics that removes herd
specific effects. The mathematics to do this was developed in the 1960s and
1970s (Henderson, 1975). The techniques developed require an industry wide
pedigree and matings across herd boundaries (the sire reference scheme).

When developing breeding strategies, within herd EBVs are useful to the
growers. EBVs that take into account the performance of ancestors and
descendants are better at estimating an animal's genetic merit than EBVs
based on the animal's phenotype records alone.

Inbreeding coefficients are an indication of how closely individuals are
related and are used by breeders to control their inbreeding and
out-breeding programs.

This chapter takes a quick look at the history of genetics and the basic
ideas behind quantitative genetics, and then tries to give some insight into
heritability. Sections 2.6, 2.7 and 2.8 build towards understanding best
linear unbiased prediction (BLUP), a method that uses phenotype records from
related animals to better predict an animal's breeding value. Ordinary least
squares assumes there is no relationship between the residual errors;
generalized least squares assumes the events are independent but there is a
known relationship between the residuals; and best linear unbiased
prediction assumes there is a known relationship between events. When
predicting breeding values an event is an animal, and the relationship
between events is controlled by the genes passed from one generation to the
next.

2.1 Early history


Quantitative genetics developed in the early 20th century and resulted from
the melding of Mendelian genetics, rediscovered in 1900, and biometrics, a
branch of genetics founded by Francis Galton (1869, 1889), a science to
which the foundation of most modern statistics can be traced. Galton focused
on continuously varying characteristics (those that can't be easily
separated into separate classes). However, the melding of the two ideas did
not occur without argument. The death of W.F.R. Weldon (one of the key
protagonists) in 1906, along with the publication of key experiments,
resulted in the rapid emergence of the multiple factor hypothesis (Lynch &
Walsh, 1997). To understand the multiple factor hypothesis we need to
understand the factors.

2.2 Two alleles
An allele is one of the alternative DNA sequences at the same physical gene
locus. A diploid organism (many plants are not diploids) inherits one
sequence from one parent and another sequence from the other. A diploid
organism carries two alleles, but only passes one of the inherited alleles
to its offspring. In the second simplest case, there are only two different
alleles in the population, with these being expressed as three possible
combinations.

Genotype     AA       Aa       aa
Frequency    $p^2$    $2pq$    $q^2$

Table 2.1: Hardy-Weinberg law

If the gene frequency of allele type A is $p$ and the frequency of type a is
$q$, then the frequencies of the various combinations are as given in table
2.1. To see this, forget about how the alleles are carried by the organism
and consider the probability of selecting AA, aa and Aa from a large
population (this is Hardy-Weinberg's law restated).

The probability of selecting your first A is $p$, the probability of
selecting the second A is also $p$, and therefore the probability of
selecting two As in a row is $p^2$. There are two ways to select Aa (if the
order doesn't matter), the probability of either is $pq$, so the probability
of an Aa outcome is $2pq$. Similarly, the probability of picking two a
alleles in a row is $q^2$.
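The frequencies in table 2.1 can be checked with a few lines of Python. This is a minimal sketch: the function name and the allele frequency of 0.75 are assumptions chosen for illustration, and it assumes only two alleles exist so that q = 1 - p.

```python
def hardy_weinberg(p):
    """Genotype frequencies for alleles A (frequency p) and a (frequency 1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hardy_weinberg(0.75)
print(freqs)                # {'AA': 0.5625, 'Aa': 0.375, 'aa': 0.0625}
print(sum(freqs.values()))  # 1.0
```

The three frequencies always sum to one because they are the terms of the expansion of (p + q) squared.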

2.3 Quantitative genetics


If there are four alleles for a gene location in the population with
frequencies $p_1$, $p_2$, $p_3$ and $p_4$, the frequencies of the various
combinations are given by:

\[ (p_1+p_2+p_3+p_4)^2 = p_1^2+p_2^2+p_3^2+p_4^2+2p_1p_2+2p_1p_3+2p_1p_4+2p_2p_3+2p_2p_4+2p_3p_4 \tag{2.1} \]

If there are $i$ alleles the frequency of the combinations is given by:

\[ (p_1 + \cdots + p_i)^2 \tag{2.2} \]

If there are two gene locations, the frequency of the combinations is given
by:

\[ ((p_1 + \cdots + p_i)^2 + (q_1 + \cdots + q_j)^2)^2 \tag{2.3} \]

Formula 2.3 can easily be extended to $k$ gene locations. As the number of
different alleles increases the number of possible outcomes explodes, well
separated classes disappear and the desire to think in terms of alleles
fades.
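The expansion in equation 2.1 can be enumerated for a single gene location with a short Python sketch. The function name and the four allele frequencies are assumptions for illustration only. Each unordered pair of alleles is one genotype class, so the number of classes grows quickly with the number of alleles.

```python
from itertools import combinations_with_replacement


def genotype_frequencies(allele_freqs):
    """Frequency of each unordered allele pair at one gene location.

    Keys are 1-based allele indices, so (1, 2) is the p1 p2 combination.
    """
    freqs = {}
    pairs = combinations_with_replacement(range(len(allele_freqs)), 2)
    for i, j in pairs:
        weight = 1 if i == j else 2  # a mixed pair can be drawn in two orders
        freqs[(i + 1, j + 1)] = weight * allele_freqs[i] * allele_freqs[j]
    return freqs


freqs = genotype_frequencies([0.4, 0.3, 0.2, 0.1])
print(len(freqs))   # 10 genotype classes for four alleles
print(freqs[(1, 2)])  # the 2 * p1 * p2 term
```

With four alleles there are already ten classes, one for each term on the right-hand side of equation 2.1.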

A trait that doesn't have well separated classes is called a quantitative
trait. Quantitative traits tend to be normally distributed. To observe this
phenomenon, let us simplify the situation and pretend that we have a trait
that is affected by many gene locations. Let us further assume that the
probability of getting an allele that improves the outcome is $p$, that it
is equal for all locations, and that the incremental improvements add up.
This describes a Bernoulli trial (Rozanov, 1964). Each animal is a separate
event, and the herd is the trial.

If the number of gene locations is $n$ and the number of selections that
contributed positively to the outcome is $k$, the probability for a
particular sequence is:

\[ P(\omega) = p^k (1 - p)^{n-k} \tag{2.4} \]

The number of ways we can select the $k$ positive outcomes is given by:

\[ C_k^n = \binom{n}{k} = \frac{n!}{k!(n - k)!} \tag{2.5} \]

The probability distribution for this simplified case is given by:

\[ P_\xi(k) = C_k^n p^k (1 - p)^{n-k} \tag{2.6} \]

This is the binomial distribution and, when $n$ is large, it can be
approximated by a normal distribution.
Life is a normal distribution; it should however be noted that a normal
distribution is not assumed when deriving the equations for ordinary least
squares (OLS), generalized least squares (GLS), or best linear unbiased pre-
diction (BLUP).
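The normal approximation to equation 2.6 can be seen numerically. The sketch below is illustrative only: the choice of 200 gene locations with p = 0.5 is an assumption, and the comparison uses a normal density with the binomial mean np and variance np(1 - p).

```python
import math


def binomial_pmf(k, n, p):
    """Equation 2.6: probability of exactly k favourable alleles out of n."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


n, p = 200, 0.5  # hypothetical number of gene locations and allele probability
mean, sd = n * p, math.sqrt(n * p * (1 - p))

total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
max_gap = max(
    abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd)) for k in range(n + 1)
)
print(total)    # the binomial probabilities sum to one (up to rounding)
print(max_gap)  # largest difference between the two curves; small for large n
```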

2.4 Genotype and phenotype


A phenotype can be divided into a genotype and environment effects. The
genotype is the portion of the phenotype that occurred because of the genes,
that is, the portion that is related to previous animals. If we consider an
animal as an event, the genes connect events. The environment effects are
the portion that varies because of the animal's environment (nature or
nurture).

\[ P(\text{phenotype}) = G(\text{genotype}) + E(\text{environment}) \tag{2.7} \]

As well as additive effects, the genotype includes dominance effects and
epistatic effects. These can't be measured, so they are taken out of the
genotype and, along with the measurement errors, lumped together as a
residual to get:

\[ P(\text{phenotype}) = A(\text{additive genetic effects}) + E(\text{environment}) + e \tag{2.8} \]

An equation to predict an animal's phenotype could be written:

\[ z_i(\text{phenotype predicted}) = E(\text{environment}) + h^2 x_i(\text{additive genetic effects}) + e_i \tag{2.9} \]

The formula $h^2 x_i = A$ uses the animal's phenotype record to predict its
breeding value. $x_i$ is the deviation from the herd mean phenotype, and
$e_i$ is the difference between the predicted breeding value and the actual
breeding value.

Best linear unbiased prediction (discussed in section 2.8) uses the
relatives' phenotype records to produce a predicted breeding value that is
closer to the real breeding value.

2.5 Heritability

Geneticists talk about broad heritability ($H^2$) and narrow heritability
($h^2$). We can't measure broad heritability so in a practical sense it is
of no interest. There are several definitions of narrow heritability and
investigating how they are related helps us understand heritability and
breeding values. We will start the exploration with an experiment that can
be performed.

[Figure 2.1: $b_{O,P} = \frac{1}{2}h^2$. Offspring phenotype (axis marked 0 and +75) plotted against parent phenotype (axis marked 0 and +500).]

2.5.1 Children phenotype - Parent phenotype


Given a herd with known average phenotype ($\mu_{herd}$), and the mating of
several sires of known phenotype ($P_{sire}$), each with a different random
group of does from the herd, the average phenotype of the offspring
($P_{progeny}$) is expected to be:

\[ P_{progeny} = b_{O,P}(P_{sire} - \mu_{parent.herd}) + \mu_{progeny} \tag{2.10} \]

Where $b_{O,P}$ is the gradient of the least squares line of best fit (the
regression coefficient) of the offspring's phenotype on the parent's
phenotype. If:

\[ b_{O,P} \equiv \tfrac{1}{2}h^2 \tag{2.11} \]

then:

\[ P_{progeny} = \tfrac{1}{2}h^2 (P_{sire} - \mu_{parent.herd}) + \mu_{progeny} \tag{2.12} \]

The symbol $h^2$ is used to represent heritability, and $\frac{1}{2}$
appears in the above formula because the progeny only gets half of its genes
from the sire.

The regression coefficient can be calculated using linear regression, or as
the ratio of the covariance over the variance (see appendix D):

\[ b_{O,P} = \frac{\sigma_{P_{offspring},P_{parent}}}{\sigma^2_{P_{parent}}} \tag{2.13} \]

Or using a different syntax:

\[ b_{O,P} = \frac{\sigma_{P_{offspring}}\,\sigma_{P_{parent}}}{\sigma_{P_{parent}}\,\sigma_{P_{parent}}} \tag{2.14} \]
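Equation 2.13 can be tried on a small data set. The sketch below uses made-up sire and progeny figures (they are not measurements from the thesis) and a hypothetical helper name. It computes the regression coefficient as covariance over variance and then doubles it, following equation 2.11, to obtain a heritability estimate.

```python
from statistics import mean


def regression_slope(x, y):
    """Slope of the least squares line of y on x: covariance(x, y) / variance(x)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    var = sum((a - mx) ** 2 for a in x) / (len(x) - 1)
    return cov / var


# Hypothetical data, expressed as deviations from the herd mean phenotype.
sire_phenotype = [100.0, 200.0, 300.0, 400.0, 500.0]
progeny_mean_phenotype = [15.0, 30.0, 45.0, 60.0, 75.0]

b_op = regression_slope(sire_phenotype, progeny_mean_phenotype)
h2 = 2 * b_op  # equation 2.11 rearranged

print(b_op)  # close to 0.15
print(h2)    # close to 0.3
```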

2.5.2 Breeding value - Parent phenotype


Now that we have a definition of heritability, we can get some insight into
breeding values. The regression coefficient of an animal's breeding value
against its phenotype is also equal to heritability. Recalling that a
regression coefficient is the ratio of the covariance over the variance, we
get:

\[ b_{A,P} = \frac{\sigma_{A,P}}{\sigma_P^2} = h^2 \tag{2.15} \]

Taking this and the previous definition leaves us with an animal's breeding
value being twice the influence it is likely to have on its progeny. The
influence is halved because one parent only supplies half the genes to an
offspring.

2.5.3 Additive genetic effect


Heritability is also defined as the fraction of the phenotype variance
($\sigma_P^2$) that is due to the additive genetic effect. That is:

\[ h^2 = \frac{\sigma_A^2}{\sigma_P^2} \tag{2.16} \]

To prove this we need to use equation 2.15 and show:

\[ \sigma_{A,P} = \sigma_A^2 \tag{2.17} \]

The phenotype measurement is the sum of the inherited attributes and the
environment (see equation 2.7):

\[ P = A + E \tag{2.18} \]

The covariance of two sums expands as (Lynch & Walsh, 1997, equation 3.10g):

\[ \sigma_{[(x+y),(w+z)]} = \sigma_{(x,w)} + \sigma_{(y,w)} + \sigma_{(x,z)} + \sigma_{(y,z)} \tag{2.19} \]

so that:

\[ \sigma_{A,P} = \sigma_{[A,(A+E)]} = \sigma_A^2 + \sigma_{A,E} \tag{2.20} \]

If the environment and genetic random variables are orthogonal,
$\sigma_{A,E}$ will be zero and:

\[ \sigma_{A,P} = \sigma_A^2 \tag{2.21} \]

The equation $h^2 = \frac{\sigma_A^2}{\sigma_P^2}$ is important as the
variance of the additive genetic effect ($\sigma_A^2$) is used in the BLUP
equations and needs to be related back to heritability (see section 2.9).
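The step from equation 2.20 to 2.21 can be checked by simulation. The sketch below is an assumption-laden illustration, not thesis data: additive effects A and environment effects E are drawn independently from normal distributions with variances 4 and 9, so the expected heritability is 4/13. The two estimates differ only by the sample covariance between A and E, which shrinks towards zero as the sample grows.

```python
import random


def variance(xs):
    """Sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def covariance(xs, ys):
    """Sample covariance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


random.seed(1)
n = 20000
A = [random.gauss(0.0, 2.0) for _ in range(n)]  # additive effect, variance 4
E = [random.gauss(0.0, 3.0) for _ in range(n)]  # environment, variance 9
P = [a + e for a, e in zip(A, E)]               # equation 2.18

h2_from_variances = variance(A) / variance(P)        # equation 2.16
h2_from_regression = covariance(A, P) / variance(P)  # equation 2.15

print(h2_from_variances, h2_from_regression)  # both near 4 / 13
```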

2.5.4 Correlation
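The correlation between an animal's adjusted phenotype and its breeding
value is equal to the square root of the heritability. This leads to some
interesting formula manipulation but provides little insight.

\[
\begin{aligned}
\sqrt{h^2} &= \sqrt{\frac{\sigma_A^2}{\sigma_P^2}} && \text{(see equation 2.16)} \\
           &= \frac{\sqrt{\sigma_A^2}}{\sqrt{\sigma_P^2}} \\
h          &= \frac{\sigma_A}{\sigma_P} \\
r_{PA}     &= \frac{\sigma_{A,P}}{\sigma_P \sigma_A} = \frac{\sigma_A^2}{\sigma_P \sigma_A} && \text{(using equation 2.21)} \\
           &= \frac{\sigma_A \sigma_A}{\sigma_P \sigma_A} \\
           &= \frac{\sigma_A}{\sigma_P} \\
           &= h
\end{aligned}
\]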

2.6 Ordinary least squares
Ordinary least squares (OLS) looks for a vector $\boldsymbol{\beta}$ that
linearly combines measured estimators to provide an estimate, that is:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{(the actual } \mathbf{y} \text{ values)} \tag{2.22} \]

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} \quad \text{(the estimated } \mathbf{y} \text{ values)} \tag{2.23} \]

Where $\mathbf{X}$ is a matrix of the measured independent values for all
animals, $\mathbf{y}$ is a vector of the measured dependent values for all
animals, $\hat{\mathbf{y}}$ is a vector of estimated dependent values for
all animals and $\boldsymbol{\epsilon}$ is the vector of residuals.

As an example in this application space, ordinary least squares could be
used to create a formula to predict down-weight based on measured down
length, measured down diameter, sex and herd.

Ordinary least squares projects a hyperspace where the number of dimensions
is equal to the number of samples onto a space where the number of
dimensions equals the number of independent variables used to describe the
data. The differences between the samples and the predictions generated
using the independent variables are the residuals. The residuals are
orthogonal to the projection. The derivation of the normal equations is
given in appendix D.3. Using matrix notation, the normal equation is written
as:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{2.24} \]

$\boldsymbol{\beta}$ gets a hat because the result is the predicted
$\boldsymbol{\beta}$ that results from the data set $\mathbf{y}$ and
$\mathbf{X}$, not the actual $\boldsymbol{\beta}$. The predicted value is:

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \tag{2.25} \]

Ordinary least squares is discussed further in appendix D.
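The normal equation 2.24 can be exercised with a small, dependency-free Python sketch. The helper functions and the data (an intercept column plus a down length, with a down-weight response) are assumptions made for illustration. The sketch also checks the statement above that the residuals are orthogonal to the columns of X.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b (b is a matrix) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(X, y):
    """Equation 2.24, solved without forming the inverse explicitly."""
    Xt = transpose(X)
    beta = solve(matmul(Xt, X), matmul(Xt, [[v] for v in y]))
    return [row[0] for row in beta]


# Hypothetical data: column of ones (intercept) and down length; y is down-weight.
X = [[1, 30], [1, 40], [1, 50], [1, 60]]
y = [112, 138, 171, 199]

beta_hat = ols(X, y)
fitted = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]  # equation 2.25
residuals = [obs - fit for obs, fit in zip(y, fitted)]
orthogonality = [sum(c * r for c, r in zip(col, residuals)) for col in transpose(X)]

print(beta_hat)       # intercept near 22.7, slope near 2.94
print(orthogonality)  # both entries are zero up to rounding
```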

2.7 Generalized least squares


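If the residuals ($\boldsymbol{\epsilon}$) are normally distributed with
some relationship, the linear proportion of the relationship between the
residual variances can be expressed with the equation:

\[ \mathrm{var}(\boldsymbol{\epsilon}) = \mathbf{R}\sigma^2 \tag{2.26} \]

The linear proportion of the relationship between the residual variances can
be removed by performing a transform on equation 2.22, and
$\hat{\boldsymbol{\beta}}$ can then be found using ordinary least squares.
Lynch & Walsh (1997) transformed the equation with the inverse of the square
root of $\mathbf{R}$, where $\mathbf{R}^{1/2}\mathbf{R}^{1/2} = \mathbf{R}$.
Faraway (2002) used the Cholesky decomposition, where
$\mathbf{S}\mathbf{S}^T = \mathbf{R}$, with equation 2.22 being transformed
using $\mathbf{S}^{-1}$:

\[ \mathbf{S}^{-1}\mathbf{y} = \mathbf{S}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\epsilon} \tag{2.27} \]

The transform works if the transformed residual variances no longer have a
linear relationship:

\[ \mathrm{var}(\boldsymbol{\epsilon}) = \mathbf{R}\sigma^2 = \mathbf{S}\mathbf{S}^T\sigma^2 = \mathbf{S}\sigma^2\mathbf{S}^T \tag{2.28} \]

\[
\begin{aligned}
\mathrm{var}(\mathbf{S}^{-1}\boldsymbol{\epsilon}) &= \mathbf{S}^{-1}\,\mathrm{var}(\boldsymbol{\epsilon})\,\mathbf{S}^{-T} \\
  &= \mathbf{S}^{-1}\mathbf{S}\sigma^2\mathbf{S}^T\mathbf{S}^{-T} \\
  &= \mathbf{I}\sigma^2\mathbf{I} \\
  &= \mathbf{I}\sigma^2
\end{aligned}
\]

Replacing $\mathbf{X}$ and $\mathbf{y}$ in equation 2.24 with the
transformed values gives the generalized least squares (GLS) solution:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \tag{2.29} \]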
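The equivalence between the transformed OLS problem (equation 2.27) and the GLS solution (equation 2.29) can be checked numerically. The following is a sketch under stated assumptions: the helper functions, the data and the matrix R (residual correlation halving with each step of separation) are all hypothetical and chosen only so that R is symmetric and positive definite.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b (b is a matrix) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cholesky(r):
    """Lower-triangular S with S @ S^T = r, for symmetric positive definite r."""
    n = len(r)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = r[i][j] - sum(s[i][k] * s[j][k] for k in range(j))
            s[i][j] = math.sqrt(acc) if i == j else acc / s[j][j]
    return s


def gls(X, y, R):
    """Equation 2.29, using solves in place of explicit inverses."""
    Xt = transpose(X)
    lhs = matmul(Xt, solve(R, X))
    rhs = matmul(Xt, solve(R, [[v] for v in y]))
    return [row[0] for row in solve(lhs, rhs)]


def gls_by_whitening(X, y, R):
    """Transform with S^-1 (equation 2.27), then apply OLS (equation 2.24)."""
    S = cholesky(R)
    Xs, ys = solve(S, X), solve(S, [[v] for v in y])
    Xst = transpose(Xs)
    return [row[0] for row in solve(matmul(Xst, Xs), matmul(Xst, ys))]


# Hypothetical data and residual relationship matrix.
X = [[1, 30], [1, 40], [1, 50], [1, 60]]
y = [112, 138, 171, 199]
R = [[0.5 ** abs(i - j) for j in range(4)] for i in range(4)]

direct = gls(X, y, R)
whitened = gls_by_whitening(X, y, R)
print(direct)
print(whitened)  # agrees with the direct solution up to rounding
```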

2.8 Best linear unbiased prediction


To summarize previous sections: the phenotype of an animal can be estimated
using independent variables which can be measured; the herd, the sex and the
age when a measurement is taken are examples. The independent measurements
are referred to as independent variables, or estimators. The model is
obtained by minimizing the differences between the measured dependent
variable and the value estimated using the derived equation and the
estimators. This is done using ordinary least squares if it is assumed the
estimators are independent and have the same random error. If the estimators
are not independent (they have some covariance between them) and the
relationship is known, and/or the random errors are not the same and the
difference in size is known, then generalized least squares can be used to
improve the result.

An OLS or GLS regression answers the question: ‘given a set of independent
variables, what do you expect of the dependent variable?’. A best linear
unbiased prediction (BLUP) answers the question: ‘given a series of events
which are related and for which the independent variables have been
measured, what do you expect the next event to be?’ [1]. If we are
predicting the next event in a time series, it is obvious that the events
are measurements of points in the series. When used with genetics an event
is a generation.

[1] Very similar to the questions asked in Bayesian statistics.
The value predicted for the next event, based on the knowledge of a
disturbance, or the knowledge of the relationship between the animals, is
obtained by minimizing the residual variance. In other words we are assum-
ing the new sample will not increase the variance. The effect that animal
relationships have on the variance is known and is discussed in section 2.11.
The BLUP is easier to follow if it is separated from complex family
relationships. We consider a signal affected by a known disturbance. The
following derivation is based on Goldberger, 1962. In the following
derivation (as with the rest of the thesis), lowercase bold symbols are
used to represent vectors, matrices are represented by upper case bold
symbols and scalar values are typeset normally.
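The predicted value for the new event is:

$$\mathring{y} = \mathring{\mathbf{x}}^T\boldsymbol{\beta} + \mathring{\epsilon} \tag{2.30}$$

The aim is to find the correlation between $\mathring{\epsilon}$ and
$\boldsymbol{\epsilon}$. We assume that:

$$E(\mathring{\epsilon}) = 0 \tag{2.31}$$

$$E(\boldsymbol{\epsilon}\mathring{\epsilon}) = \boldsymbol{\omega} \tag{2.32}$$

We need a vector of constants that projects y (the vector of all previous
dependent measurements) to a predicted value. That is:

$$p = \hat{\mathbf{c}}^T\mathbf{y} \tag{2.33}$$

Recall that the vector of y values can be written:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{2.34}$$

And that:

$$E(\boldsymbol{\epsilon}) = \mathbf{0} \tag{2.35}$$

And let:

$$E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T) = \mathbf{V} \tag{2.36}$$

We want a solution that minimizes:

$$\sigma_p^2 = E[(p - \mathring{y})(p - \mathring{y})^T] \tag{2.37}$$

Subject to:

$$E(p - \mathring{y}) = 0 \tag{2.38}$$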
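Using 2.34, equation 2.33 can be written:

$$p = \hat{\mathbf{c}}^T\mathbf{X}\boldsymbol{\beta} + \hat{\mathbf{c}}^T\boldsymbol{\epsilon} \tag{2.39}$$

Using 2.39 and 2.30, $(p - \mathring{y})$ can be rewritten:

$$p - \mathring{y} = (\hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T)\boldsymbol{\beta} + \hat{\mathbf{c}}^T\boldsymbol{\epsilon} - \mathring{\epsilon} \tag{2.40}$$

If the prediction is to be unbiased for every value of $\boldsymbol{\beta}$:

$$\hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T = \mathbf{0} \tag{2.41}$$

Using 2.40 and setting $\hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T$ to zero:

$$p - \mathring{y} = \hat{\mathbf{c}}^T\boldsymbol{\epsilon} - \mathring{\epsilon} \tag{2.42}$$

And:

$$\begin{aligned}
\sigma_p^2 &= E[(p - \mathring{y})(p - \mathring{y})^T] \\
&= E[(\hat{\mathbf{c}}^T\boldsymbol{\epsilon} - \mathring{\epsilon})(\hat{\mathbf{c}}^T\boldsymbol{\epsilon} - \mathring{\epsilon})^T] \\
&= \hat{\mathbf{c}}^T E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)\hat{\mathbf{c}} + E(\mathring{\epsilon}^2) - 2\hat{\mathbf{c}}^T E(\boldsymbol{\epsilon}\mathring{\epsilon}) \\
&= \hat{\mathbf{c}}^T\mathbf{V}\hat{\mathbf{c}} + E(\mathring{\epsilon}^2) - 2\hat{\mathbf{c}}^T\boldsymbol{\omega}
\end{aligned} \tag{2.43}$$

To minimize 2.43 subject to 2.41 we find the stationary point of a Lagrange
function, where $\boldsymbol{\lambda}$ is a vector of Lagrange multipliers, say:

$$\begin{aligned}
\Lambda(\hat{\mathbf{c}}, \boldsymbol{\lambda}) &= f(\hat{\mathbf{c}}) - 2(g(\hat{\mathbf{c}}) - c)\boldsymbol{\lambda} \\
g(\hat{\mathbf{c}}) - c &= \hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T - \mathbf{0} \\
&= \hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T \\
f(\hat{\mathbf{c}}) &= \hat{\mathbf{c}}^T\mathbf{V}\hat{\mathbf{c}} + E(\mathring{\epsilon}^2) - 2\hat{\mathbf{c}}^T\boldsymbol{\omega} \\
\Lambda(\hat{\mathbf{c}}, \boldsymbol{\lambda}) &= \hat{\mathbf{c}}^T\mathbf{V}\hat{\mathbf{c}} + E(\mathring{\epsilon}^2) - 2\hat{\mathbf{c}}^T\boldsymbol{\omega} - 2(\hat{\mathbf{c}}^T\mathbf{X} - \mathring{\mathbf{x}}^T)\boldsymbol{\lambda}
\end{aligned} \tag{2.44}$$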

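Finding the partial derivatives of 2.44, setting them to zero and solving for
$\hat{\mathbf{c}}$ gives us:

$$\hat{\mathbf{c}} = \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathring{\mathbf{x}} + \mathbf{V}^{-1}(\mathbf{I} - \mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1})\boldsymbol{\omega} \tag{2.45}$$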
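
The intermediate steps behind 2.45 can be sketched as follows. This is an outline only, assuming V is symmetric and invertible, X has full column rank, $\boldsymbol{\lambda}$ is a column vector of Lagrange multipliers, and the cross term of 2.43 is taken as $-2\hat{\mathbf{c}}^T\boldsymbol{\omega}$ (the sign obtained by expanding $(\hat{\mathbf{c}}^T\boldsymbol{\epsilon} - \mathring{\epsilon})^2$).

```latex
\begin{aligned}
\frac{\partial \Lambda}{\partial \hat{\mathbf{c}}}
  &= 2\mathbf{V}\hat{\mathbf{c}} - 2\boldsymbol{\omega} - 2\mathbf{X}\boldsymbol{\lambda} = \mathbf{0}
  &&\Rightarrow\quad \hat{\mathbf{c}} = \mathbf{V}^{-1}(\boldsymbol{\omega} + \mathbf{X}\boldsymbol{\lambda}) \\
\frac{\partial \Lambda}{\partial \boldsymbol{\lambda}}
  &= -2(\mathbf{X}^T\hat{\mathbf{c}} - \mathring{\mathbf{x}}) = \mathbf{0}
  &&\Rightarrow\quad \mathbf{X}^T\hat{\mathbf{c}} = \mathring{\mathbf{x}} \\
\mathbf{X}^T\mathbf{V}^{-1}\boldsymbol{\omega} + \mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\lambda}
  &= \mathring{\mathbf{x}}
  &&\Rightarrow\quad \boldsymbol{\lambda} = (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}(\mathring{\mathbf{x}} - \mathbf{X}^T\mathbf{V}^{-1}\boldsymbol{\omega}) \\
\hat{\mathbf{c}}
  &= \mathbf{V}^{-1}\boldsymbol{\omega}
   + \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}(\mathring{\mathbf{x}} - \mathbf{X}^T\mathbf{V}^{-1}\boldsymbol{\omega})
\end{aligned}
```

Collecting the two terms in $\boldsymbol{\omega}$ gives the form shown in 2.45.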

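Using equation 2.33 and 2.45:

$$\begin{aligned}
p &= \hat{\mathbf{c}}^T\mathbf{y} \\
&= \mathring{\mathbf{x}}^T(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y} + \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y}
\end{aligned} \tag{2.46}$$
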
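Looking back to section 2.7 (generalized least squares) we see that:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \tag{2.47}$$

If the Cholesky Decomposition of V can be found, we can write:

$$\check{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y} \tag{2.48}$$

$$\begin{aligned}
p &= \mathring{\mathbf{x}}^T\check{\boldsymbol{\beta}} + \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{X}\check{\boldsymbol{\beta}} \\
&= \mathring{\mathbf{x}}^T\check{\boldsymbol{\beta}} + \boldsymbol{\omega}^T\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\check{\boldsymbol{\beta}})
\end{aligned} \tag{2.49}$$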

2.9 Predicting animal breeding values


The model of interest has the form:

$$\mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{Z}\hat{\mathbf{a}} + \mathbf{e} \tag{2.50}$$

X is a design matrix for the fixed effects: that is, the effects that are not
influenced by additive genetic effects. Z is the design matrix used to select
breeding values (the animal). If there is only one record per animal, Z will
be the identity matrix. The residuals left over when the random effects are
removed are contained in the vector e.
If there are two fixed effects, say a doe can be on farm 1 or 2, the
prediction equation for the phenotype of doe i is:

$$y_i = \beta_0 + \beta_1 x_{i0} + \beta_2 x_{i1} + a_i + e_i \tag{2.51}$$

If the animal is on farm 1 then $x_{i0}$ will be one and $x_{i1}$ will be zero.

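The variances and covariances defined from the animal model, with a desire
for some simplicity, are:

$$\begin{aligned}
\mathrm{var}(\mathbf{a}) &= \mathbf{G}\sigma^2 = \mathbf{A}k\sigma^2 = \mathbf{A}\sigma_a^2 \\
\mathrm{var}(\mathbf{e}) &= \mathbf{R} = \mathbf{I}\sigma^2 \\
\mathrm{cov}(\mathbf{a}, \mathbf{e}) &= \mathbf{0} \\
\mathrm{cov}(\mathbf{e}, \mathbf{a}) &= \mathbf{0}
\end{aligned} \tag{2.52}$$

where A is the relationship matrix.
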
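G is a symmetric matrix because A is symmetric. From the definitions above
$\mathrm{var}(\boldsymbol{\epsilon})$ can be calculated:

$$\begin{aligned}
\mathrm{var}(\boldsymbol{\epsilon}) &= \mathbf{V} \\
&= \mathrm{var}(\mathbf{Z}\mathbf{a} + \mathbf{e}) \quad \text{(see 2.50)} \\
&= \mathbf{Z}\,\mathrm{var}(\mathbf{a})\,\mathbf{Z}^T + \mathrm{var}(\mathbf{e}) + \mathrm{cov}(\mathbf{Z}\mathbf{a}, \mathbf{e}) + \mathrm{cov}(\mathbf{e}, \mathbf{Z}\mathbf{a}) \\
&= \mathbf{Z}\,\mathrm{var}(\mathbf{a})\,\mathbf{Z}^T + \mathrm{var}(\mathbf{e}) + \mathbf{Z}\,\mathrm{cov}(\mathbf{a}, \mathbf{e}) + \mathrm{cov}(\mathbf{e}, \mathbf{a})\,\mathbf{Z}^T \\
&= (\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})\sigma^2
\end{aligned} \tag{2.53}$$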

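The covariance we are putting the effort into extracting is the covariance
between $\boldsymbol{\epsilon}$ and $\mathring{\epsilon}$:

$$\boldsymbol{\omega}^T = \mathrm{cov}(\mathring{\epsilon}, \boldsymbol{\epsilon}) = \mathbf{z}^T\mathbf{G}\mathbf{Z}^T \tag{2.54}$$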

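Replacing $\boldsymbol{\omega}$ and V in equation 2.49 gives:

$$p = \mathring{\mathbf{x}}^T\check{\boldsymbol{\beta}} + \mathbf{z}^T\mathbf{G}\mathbf{Z}^T(\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})^{-1}(\mathbf{y} - \mathbf{X}\check{\boldsymbol{\beta}}) \tag{2.55}$$

This is the formula given by Henderson, 1975 as the best linear unbiased
predictor of y given G and R.
In 1950 Henderson provided a set of equations to simultaneously find
$\hat{\mathbf{a}}$ and $\hat{\boldsymbol{\beta}}$. These are now referred to
as the mixed model equations.

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^T\mathbf{R}^{-1}\mathbf{Z} \\
\mathbf{Z}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{a}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{y} \end{bmatrix} \tag{2.56}$$

In 1959 Henderson proved that $\hat{\boldsymbol{\beta}}$ in 2.56 was a
solution to 2.47 and in 1963 he proved that $\hat{\mathbf{a}}$ in 2.56 is
equal to $\mathbf{G}\mathbf{Z}^T(\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})^{-1}(\mathbf{y} - \mathbf{X}\check{\boldsymbol{\beta}})$
in 2.55 (Henderson, 1975).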
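If $\mathbf{R} = \mathbf{I}\sigma_E^2$ and $\mathbf{G} = \mathbf{A}\sigma_A^2$ then 2.56 can be rewritten:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \frac{\sigma_E^2}{\sigma_A^2}\mathbf{A}^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{a}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{y} \\ \mathbf{Z}^T\mathbf{y} \end{bmatrix} \tag{2.57}$$

The variance ratio can be written in terms of the heritability:

$$\begin{aligned}
\frac{\sigma_A^2}{\sigma_P^2} &= h^2 \quad \text{(see equation 2.16)} \\
\sigma_E^2 &= \sigma_P^2 - \sigma_A^2 \\
\frac{\sigma_E^2}{\sigma_P^2} &= \frac{\sigma_P^2 - \sigma_A^2}{\sigma_P^2} = 1 - h^2 \\
\frac{\sigma_E^2}{\sigma_A^2} &= \frac{\sigma_E^2/\sigma_P^2}{\sigma_A^2/\sigma_P^2} = \frac{1 - h^2}{h^2}
\end{aligned}$$

and equation 2.57 can be written:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \frac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{a}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{y} \\ \mathbf{Z}^T\mathbf{y} \end{bmatrix} \tag{2.58}$$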
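
To make the structure of equation 2.58 concrete, the following sketch builds the left and right hand sides from X, Z, the inverse relationship matrix and y, and solves them exactly with rational arithmetic. It is an illustration only: the function names `solve` and `mixed_model_solutions`, the list-of-lists matrix layout and the two-animal data set are assumptions made for the example, not part of the application.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Gauss-Jordan elimination with exact rational arithmetic.

    Raises StopIteration if the system is singular (no pivot found).
    """
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def mixed_model_solutions(x, z, a_inv, y, h2):
    """Build and solve equation 2.58; returns (beta_hat, a_hat)."""
    ratio = (1 - Fraction(h2)) / Fraction(h2)      # (1 - h^2) / h^2
    w = [xr + zr for xr, zr in zip(x, z)]          # rows of [X Z]
    p, q = len(x[0]), len(z[0])
    size = p + q
    # [X Z]^T [X Z] gives the four blocks X'X, X'Z, Z'X and Z'Z.
    lhs = [[sum(row[r] * row[c] for row in w) for c in range(size)]
           for r in range(size)]
    # Add the scaled inverse relationship matrix to the Z'Z block.
    for r in range(q):
        for c in range(q):
            lhs[p + r][p + c] += ratio * Fraction(a_inv[r][c])
    rhs = [sum(row[r] * obs for row, obs in zip(w, y)) for r in range(size)]
    solution = solve(lhs, rhs)
    return solution[:p], solution[p:]


# Hypothetical data: two unrelated animals in one group, one record each.
x = [[1], [1]]
z = [[1, 0], [0, 1]]
a_inv = [[1, 0], [0, 1]]
y = [10, 14]
beta_hat, a_hat = mixed_model_solutions(x, z, a_inv, y, Fraction(1, 2))
print(beta_hat, a_hat)   # group mean 12, breeding values -1 and 1
```

With a heritability of one half the ratio is one, so each animal's deviation from the group mean (minus two and plus two) is shrunk by half.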

As discussed in section 2.4, heritability can be determined in many ways;
restricted maximum likelihood (REML) can be used to calculate a value from
the data set (Lynch & Walsh, 1997). Selection alters heritability; in other
words, for the value of $h^2$ to change, a new generation is required and
that generation must be selected, which takes time. Heritability figures
are therefore taken from the batch process, as real time calculation is not
required.

2.10 Within herd estimated breeding values


Before this is discussed, we need to consider what animal model the
application is going to support, as there are many options available.
It makes sense to support within herd groups. Male and female is a group
classification that comes to mind, as is the animal’s age when the
phenotype record was created. The mathematics to support multiple phenotype
records of the same trait taken over time is a little more complex than the
mathematics discussed in the previous sections, as the variance between
records has to be determined by other means. Chapter 7 of the second
edition of Mrode’s book (Mrode, 2005) looks at the model needed for
longitudinal data.
To apply the techniques described in section 2.12, the pedigree must be
set up in the order the animals enter the pedigree. However, the A matrix
used in equation 2.58 can be set up in any order, as long as the cells
within the new row and column contain the correct data. That is, the
diagonal cells contain one plus the inbreeding coefficient and the off
diagonal cells contain the relationship value between the two animals
represented by the intersection of the column and row.
The last animal to be added to the pedigree can take the last row in A
and as a result the last row in X and Z.
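The left matrix in equation 2.58 can be written as:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \frac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z}
\end{bmatrix}
+
\begin{bmatrix}
\mathbf{0} & \mathbf{0} \\
\mathbf{0} & \frac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix} \tag{2.59}$$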

The diagonal elements of the sub matrix $\mathbf{X}^T\mathbf{X}$ count the
number of animals in each group and the off diagonal elements the number in
each pair of groups. If the number of groups doesn’t change, the size of
this sub matrix remains constant and, to update it, one is added to the
appropriate counts.
The number of columns in the sub matrix $\mathbf{Z}^T\mathbf{X}$ is equal to
the number of groups and the number of rows is equal to the number of
animals. When we add an animal to the end of the matrix, it grows by a row.
One is placed in a column if the animal belongs to the group in that
column, zero in all others. The matrix $\mathbf{X}^T\mathbf{Z}$ is the
transpose of $\mathbf{Z}^T\mathbf{X}$.
The sub matrix $\mathbf{Z}^T\mathbf{Z}$ is a diagonal matrix that contains a
non-zero diagonal entry if a phenotype record is available for the animal
and zero if not.
Excluding $\mathbf{A}^{-1}$ from the consideration, updating the left hand
side of 2.58 involves the addition of a row and column to a large matrix,
and a little bit of work in the top left hand corner to get the counts in
order.
Matrix $\mathbf{A}^{-1}$, scaled by $(1 - h^2)/h^2$, is added to the bottom
right hand corner. The incremental updating of $\mathbf{A}^{-1}$ is
discussed in section 2.13.
Considering now the right hand side: $\mathbf{X}^T\mathbf{y}$ contains an
entry per group, the sum of all y values for the group; adding the new
value to the relevant summation will update these values.
$\mathbf{Z}^T\mathbf{y}$ selects the appropriate y value for a particular
row in the equation set.
When adding the row to the equations we get the last value on the right
hand side correct if we set it to the phenotype record, or to zero if there
is no phenotype.
While it’s possible to incrementally set up the BLUP equations, we are
left with a set of simultaneous equations to solve, the number equaling the
number of animals in the herd plus the number of groups.
Providing a screen for the user to initiate the estimated breeding value
calculation, and using AJAX to provide the user with progress details, is
the solution. If we are going to solve the simultaneous equations in this
manner, we may as well set up the equations from scratch and avoid
incrementally updating them. To build the equations from scratch, the
animals must be sorted in the order they entered the pedigree. The
transitive closure and maximum path length can be used to put the animals
in the correct order. We arrange the animals from those with the longest
maximum path length between the animal and its latest descendant to those
with the shortest.

2.10.1 No members in a group


If the number of groups remains constant, it was noted above that updating
counts could be used to update $\mathbf{X}^T\mathbf{X}$. With this knowledge
it may be tempting to set up the equations with all possible groups.
However, if a group has no members, its row and column in the equations
contain only zeros and the simultaneous equations will not have a unique
solution.

2.11 Relationship matrix


In section 2.9 it was mentioned that G is defined as $\mathbf{A}\sigma_A^2$,
where A is the relationship matrix. The relationship matrix records the
flow of genetic information through the pedigree; that is, it records the
relationship between the variances. The matrix has twice the coefficient of
co-ancestry ($2\Theta_{ij}$) between animals i and j in the off diagonal
cells ij and ji, and the inbreeding coefficient plus one ($F_i + 1$) along
the diagonal cells (Lynch & Walsh, 1997, page 763).
The inbreeding coefficient ($F_i$) is the probability that the two gametes
within an individual are common by descent. The coefficient of kinship,
the coefficient of co-ancestry and the coefficient of consanguinity
($\Theta_{ij}$) are different names for the same thing: the probability
that two gametes, one taken from each of two individuals, are common by
descent. An offspring’s inbreeding coefficient is equal to its parents’
coefficient of kinship (Lynch & Walsh, 1997, page 135).
As ‘common by descent’ can only be found within the known relationships,
it is assumed that individuals are unrelated when they enter the pedigree
(Lynch & Walsh, 1997, page 132). As a result, we only need related
individuals to calculate the coefficients.

2.11.1 Calculating the relationship matrix


Example pedigree
This section uses the example pedigree in figure 2.2 to illustrate the calcula-
tions.

[Graph of the pedigree: nodes Unknown, 1 and 2 in the top row, then 4 and 3,
then 5, then 6]

Kid   Sire   Dam
3     1      2
4     1      unknown
5     4      3
6     5      2

Figure 2.2: Example pedigree (Mrode, 1996)

Ordering of animals
Calculation of the coefficients requires that the matrix be arranged from
the earliest generation to the most recent (parents precede their
offspring). This could be done by arranging the records from earliest birth
date to most recent, if all birth dates are known. If the maximum path
depth is maintained, the list can instead run from the records with the
longest path to a descendant to those with the shortest path. For example:
an animal with no descendants has a longest path of zero, while a
grandparent whose grand-offspring have no offspring has a longest path of
two. Table 2.2 gives the longest path for the example pedigree given in
figure 2.2. Note that animal two has a short path of one edge to animal six
and a long path of three edges to animal six.
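
The ordering rule can be sketched in a few lines. This is an illustration under stated assumptions: the pedigree is held in a dictionary keyed by animal number with `None` for an unknown parent, the function names are hypothetical, and the depths are recomputed from the parent links rather than read from an incrementally maintained table as the application does.

```python
# Example pedigree from figure 2.2: animal -> (sire, dam).
PEDIGREE = {1: (None, None), 2: (None, None), 3: (1, 2),
            4: (1, None), 5: (4, 3), 6: (5, 2)}


def longest_depths(pedigree):
    """Return {animal: edges on the longest path to any descendant}."""
    children = {animal: [] for animal in pedigree}
    for animal, parents in pedigree.items():
        for parent in parents:
            if parent is not None:
                children[parent].append(animal)

    depths = {}

    def depth(animal):
        # Memoised recursion; a very deep pedigree would need an
        # iterative version to stay within Python's recursion limit.
        if animal not in depths:
            depths[animal] = max((1 + depth(c) for c in children[animal]),
                                 default=0)
        return depths[animal]

    for animal in pedigree:
        depth(animal)
    return depths


def pedigree_order(pedigree):
    """Longest path first, so every parent precedes its offspring."""
    depths = longest_depths(pedigree)
    return sorted(pedigree, key=lambda animal: (-depths[animal], animal))


print(longest_depths(PEDIGREE))
print(pedigree_order(PEDIGREE))
```

A parent always has a longest path at least one edge greater than any of its offspring, which is why sorting on this value places parents first.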

Relationship between animals

How related the new animal is to the other animals is dependent on the
parents. The following subscripts are used:

i = Offspring
s = Sire
d = Dam
j = All other animals
u = Unknown

$$a_{ji} = a_{ij} = 0.5(a_{js} + a_{jd}) \tag{2.60}$$

$$a_{ii} = 1 + 0.5(a_{sd}) \tag{2.61}$$

Animal   Longest Depth
1        3
2        3
3        2
4        2
5        1
6        0

Table 2.2: Arrangement of animals from longest path to shortest

If the parent of an animal isn’t known, the inbreeding coefficient is set
to zero. The inbreeding coefficient for an animal is equal to
$F_i = 0.5(a_{sd})$.

animal 1 2 3 4 5 6
1 1.00 0.00 0.50 0.50 0.50 0.25
2 0.00 1.00 0.50 0.00 0.25 0.625
3 0.50 0.50 1.00 0.25 0.625 0.563
4 0.50 0.00 0.25 1.00 0.625 0.313
5 0.50 0.25 0.625 0.625 1.125 0.688
6 0.25 0.625 0.563 0.313 0.688 1.125

Table 2.3: A matrix for example pedigree given in figure 2.2 (Mrode, 1996).

The $a_{11}$ entry of the A matrix for animal 1 (unknown sire and dam) equals:

$$\begin{aligned}
a_{uu} &= 0 \\
a_{11} &= 1 + 0.5(a_{sd}) = 1 + 0.5(a_{uu}) \\
&= 1
\end{aligned}$$

The entries for animal 2 (unknown sire and dam) equal:

$$\begin{aligned}
a_{12} = a_{21} &= 0.5(a_{1s} + a_{1d}) \\
&= 0.5(a_{1u} + a_{1u}) \\
&= 0 \\
a_{uu} &= 0 \\
a_{22} &= 1 + 0.5(a_{sd}) \\
&= 1 + 0.5(a_{uu}) \\
&= 1
\end{aligned}$$

The entries for animal 3 (sire equals animal 1, dam equals animal 2) equal:

$$\begin{aligned}
a_{13} = a_{31} &= 0.5(a_{1s} + a_{1d}) \\
&= 0.5(a_{11} + a_{12}) \\
&= 0.5(1 + 0) \\
&= 0.5 \\
a_{23} = a_{32} &= 0.5(a_{2s} + a_{2d}) \\
&= 0.5(a_{21} + a_{22}) \\
&= 0.5(0 + 1) \\
&= 0.5 \\
a_{33} &= 1 + 0.5(a_{sd}) \\
&= 1 + 0.5(a_{12}) \\
&= 1 + 0.5(0) \\
&= 1
\end{aligned}$$

etc.
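
The remaining entries follow the same pattern. As a sketch, equations 2.60 and 2.61 can be applied to the whole example pedigree as below. The list layout, the use of `None` for an unknown parent and the function name `relationship_matrix` are assumptions made for the illustration; the animals must already be ordered so that parents precede offspring.

```python
# Example pedigree (figure 2.2) as (animal, sire, dam), parents first.
PEDIGREE = [(1, None, None), (2, None, None), (3, 1, 2),
            (4, 1, None), (5, 4, 3), (6, 5, 2)]


def relationship_matrix(pedigree):
    """Build A row by row using equations 2.60 and 2.61."""
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]

    def rel(row, parent):
        # Relationship between the animal in `row` and a parent;
        # an unknown parent contributes zero.
        return 0.0 if parent is None else a[row][index[parent]]

    for i, (_, sire, dam) in enumerate(pedigree):
        for j in range(i):
            # Equation 2.60: average of the relationships with both parents.
            a[j][i] = a[i][j] = 0.5 * (rel(j, sire) + rel(j, dam))
        # Equation 2.61: one plus half the relationship between the parents.
        if sire is None or dam is None:
            a_sd = 0.0
        else:
            a_sd = a[index[sire]][index[dam]]
        a[i][i] = 1.0 + 0.5 * a_sd
    return a


A = relationship_matrix(PEDIGREE)
for row in A:
    print(row)
inbreeding = [A[i][i] - 1.0 for i in range(len(A))]
print(inbreeding)
```

The printed rows can be compared with table 2.3, which shows the same values rounded to three decimal places.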

2.12 Calculating the inbreeding coefficient


To aid the selection of animals for joining, we wish to present to the
grower the inbreeding coefficients (these are equal to the diagonal
elements of A minus one). Quaas, 1976 presented a method to quickly
calculate $\mathbf{A}^{-1}$; ideas presented by Quaas, 1976 can be used to
calculate the desired data.
A positive definite symmetric matrix (which A is) can be decomposed as:

$$\mathbf{A} = \mathbf{T}\mathbf{D}\mathbf{T}^T \tag{2.62}$$

Where T is a lower triangular matrix with ones along the diagonal and D is
a diagonal matrix. If we define L as $\mathbf{T}\sqrt{\mathbf{D}}$, we have
the Cholesky Decomposition (Strang, 2003, page 334). Then:

$$\mathbf{A} = \mathbf{L}\mathbf{L}^T \tag{2.63}$$

L is a lower triangular matrix so:

$$a_{ii} = \sum_{m=1}^{i} l_{im}^2 \tag{2.64}$$

That is, the entries in row i of L are $l_{i1}$ to $l_{ii}$, and the entries
in column i of $\mathbf{L}^T$ are the same values. If we multiply row i with
column i we end up with the sum of the entries squared.
Putting it another way:

$$\begin{aligned}
a_{11} &= l_{11}^2 \\
a_{22} &= l_{21}^2 + l_{22}^2 \\
a_{33} &= l_{31}^2 + l_{32}^2 + l_{33}^2
\end{aligned}$$

etc.

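Henderson, 1976 provided rules for calculating the diagonal elements of L.

$$\begin{aligned}
l_{ii} &= \sqrt{d_i} \\
&= \sqrt{0.5 - 0.25(F_s + F_d)} \\
&= \sqrt{0.5 - 0.25(a_{ss} - 1 + a_{dd} - 1)} \\
&= \sqrt{1.0 - 0.25(a_{ss} + a_{dd})} \\
&= \sqrt{1.0 - 0.25\left(\sum_{m=1}^{s} l_{sm}^2 + \sum_{m=1}^{d} l_{dm}^2\right)}
\end{aligned}$$

The first two forms apply when both parents are known; in the last two
forms an unknown parent contributes zero.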

The off diagonal elements for column j are calculated using:

$$l_{ij} = 0.5(l_{sj} + l_{dj}), \quad s, d \ge j \tag{2.65}$$

The diagonal elements of A can be calculated using two vectors that have
a length equal to the number of animals in the pedigree. One vector contains
the diagonal element of A progressively calculated, and on completion all the
diagonal elements of A will be present. The other vector contains a current
column of L and the column changes as the calculation progresses.

Using the example pedigree introduced in section 2.11.1:

$$\begin{aligned}
l_{11} &= \sqrt{1.0 - 0.25(a_{ss} + a_{dd})} && \text{where } a_{ss} \text{ and } a_{dd} \text{ are unknown} \\
&= \sqrt{1.0 - 0.25(0 + 0)} \\
&= 1 \\
a_{11} &= l_{11}^2 \\
&= 1^2 \\
&= 1 \\
l_{21} &= 0.5(l_{s1} + l_{d1}) && \text{where } s \text{ and } d \text{ are unknown} \\
&= 0 \\
l_{31} &= 0.5(l_{s1} + l_{d1}) && \text{where } s = 1 \text{ and } d = 2 \\
&= 0.5(l_{11} + l_{21}) \\
&= 0.5(1 + 0) = 0.5
\end{aligned}$$

etc.

As mentioned above, the animals must be arranged from the first animal
in the pedigree to the last; this is dealt with further in section 3.4.5.
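
The two-vector calculation described above can be sketched as follows. This is an illustration, not the application's routine: the pedigree layout, the use of `None` for an unknown parent and the function name `diagonal_of_a` are assumptions, and the loop visits every later animal for each column, so it is quadratic in the number of animals.

```python
import math

# Example pedigree (figure 2.2) as (animal, sire, dam), parents first.
PEDIGREE = [(1, None, None), (2, None, None), (3, 1, 2),
            (4, 1, None), (5, 4, 3), (6, 5, 2)]


def diagonal_of_a(pedigree):
    """Return the diagonal of A using one column of L at a time."""
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    parents = [tuple(index[p] for p in (sire, dam) if p is not None)
               for _, sire, dam in pedigree]
    n = len(pedigree)
    diag = [0.0] * n      # running sums of l_im squared (equation 2.64)
    column = [0.0] * n    # the current column j of L
    for j in range(n):
        for k in range(n):
            column[k] = 0.0
        # diag[p] is already complete for each parent p, because p < j.
        column[j] = math.sqrt(1.0 - 0.25 * sum(diag[p] for p in parents[j]))
        diag[j] += column[j] ** 2
        for i in range(j + 1, n):
            # Equation 2.65; parents that precede j still hold zero.
            column[i] = 0.5 * sum(column[p] for p in parents[i])
            diag[i] += column[i] ** 2
    return diag


diag = diagonal_of_a(PEDIGREE)
inbreeding = [value - 1.0 for value in diag]
print(inbreeding)
```

For the example pedigree the values agree, to floating point precision, with the diagonal of table 2.3: animals 5 and 6 have an inbreeding coefficient of 0.125 and the rest are zero.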

2.13 Incrementally maintaining A−1


In the previous section it was noted that:

$$\mathbf{A} = \mathbf{T}\mathbf{D}\mathbf{T}^T \tag{2.66}$$

The inverse is equal to:

$$\mathbf{A}^{-1} = (\mathbf{T}^{-1})^T\mathbf{D}^{-1}\mathbf{T}^{-1} \tag{2.67}$$

The inverse of a diagonal matrix is the diagonal matrix of the reciprocals
of the diagonal elements. Find D and it is a simple matter to find
$\mathbf{D}^{-1}$. $\mathbf{T}^{-1}$ is a lower triangular matrix with 1
along the diagonal and -0.5 where the row for an animal meets the column
for each of its known parents (Mrode, 1996).
Incrementally calculating the elements of D was discussed in section 2.12.
To fully update D, the value for the added animal and all the ancestors of
its descendants have to be updated. The set to update is easy to find using
the transitive closure discussed in the next chapter.
To update $\mathbf{T}^{-1}$, a row and a column are added to the matrix with
the appropriate elements set to -0.5. After updating D and
$\mathbf{T}^{-1}$, equation 2.67 can be used to calculate $\mathbf{A}^{-1}$.
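
Equation 2.67 can be illustrated with the example pedigree. The sketch below builds the inverse of T and the inverse of D directly from the parent links and multiplies them out. It assumes the inbreeding coefficients are already known (they are taken from the diagonal of table 2.3), uses dense lists for clarity rather than speed, and the function name is hypothetical.

```python
# Example pedigree (figure 2.2) as (animal, sire, dam), parents first.
PEDIGREE = [(1, None, None), (2, None, None), (3, 1, 2),
            (4, 1, None), (5, 4, 3), (6, 5, 2)]
# Inbreeding coefficients: the diagonal of A (table 2.3) minus one.
INBREEDING = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.125, 6: 0.125}


def inverse_relationship_matrix(pedigree, inbreeding):
    """Return the inverse of A as (T^-1)^T D^-1 T^-1 (equation 2.67)."""
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    t_inv = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    d_inv = [0.0] * n
    for i, (_, sire, dam) in enumerate(pedigree):
        known = [p for p in (sire, dam) if p is not None]
        for parent in known:
            # Row of the animal, column of the parent.
            t_inv[i][index[parent]] = -0.5
        # d_i = 1 - 0.25 * (a_ss + a_dd), an unknown parent adding zero.
        d_i = 1.0 - 0.25 * sum(1.0 + inbreeding[p] for p in known)
        d_inv[i] = 1.0 / d_i
    return [[sum(t_inv[k][r] * d_inv[k] * t_inv[k][c] for k in range(n))
             for c in range(n)]
            for r in range(n)]


a_inv = inverse_relationship_matrix(PEDIGREE, INBREEDING)
for row in a_inv:
    print([round(value, 3) for value in row])
```

Because each row of the inverse of T has at most three non-zero entries, adding an animal only touches the cells belonging to the animal and its parents, which is what makes the incremental update practical.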

2.14 Conclusion
Merrrit is being developed as a tool to increase the industry’s economic
performance. This chapter introduced the theory behind selection and the
maths needed to remove fixed effects that result from animals being born
and bred on different properties. To use the methods articulated we need
the pedigree. In the next chapter we look at representing the pedigree as a
graph, the transitive closure and the incremental maintenance of the
transitive closure. How the transitive closure is used by the application
to meet the design goals is left to chapter 5.

Chapter 3

Graphs and transitive closures

The on line creation of pedigrees by herd owners is a primary project aim.
With multiple users creating the pedigree, it is important that the system
checks the integrity of the data as it is entered, and that it does not
restrict the order of data entry. For example, the parents of an animal
must be born before the offspring. If we allow animal data to be entered in
any order, the data for multiple offspring can be entered before the parent
data. When the parent data is entered we need to be able to retrieve the
birth date of the first born offspring and make sure we are setting a
reasonable date for the parent’s birth.
In section 2.11.1, it was noted that we can calculate the inbreeding
coefficient using an animal’s ancestors if they are arranged in the order
they appear in the pedigree. To perform the calculation, we need to be able
to find all of these animals and arrange them in order.
The pedigree’s transitive closure is the foundation used to perform both
of these operations quickly. Putting the animals in the order they appear in
the pedigree requires us to maintain the maximum path length between the
animal and its descendants.
This chapter describes the transitive closure and the algorithms to
incrementally maintain the transitive closure. As performance suitable for
the on line maintenance of the pedigree is required, consideration is given
to methods that can be used to speed up the incremental maintenance of
large graphs (greater than ten thousand nodes).
The chapter also describes an algorithm to incrementally maintain the
minimum distance between graph nodes and describes how to extend previous
work so that the application can maintain the maximum distance between
nodes.

From      To
Unknown   4
4         5
5         6
1         4
1         3
2         3
3         5
2         6

Table 3.1: Edges table for pedigree given in figure 2.2

3.1 Directed acyclic graph


A pedigree can be represented by a directed acyclic graph.
Geometrically a graph is defined as a set of points (vertices or nodes)
in space which are connected by a set of lines (edges). For a graph G, the
vertex set is denoted by V and the edge set by E, and we write G = (V, E).
A simple graph contains no self loops and no parallel edges. As mammals
cannot parent themselves and as only one strand of DNA can pass from a
parent to child, a mammal pedigree is a simple graph.
A directed acyclic graph is a directed graph with no directed cycles; that
is, for any vertex v, there is no nonempty directed path that starts and
ends on v. A pedigree is a directed acyclic graph as gene flow is from
parents to offspring and, if you exclude time travel, your offspring can’t
be your parents.

3.2 Transitive closure


A path from $v_1$ to $v_i$ is a sequence
$P = v_1, e_1, v_2, e_2, \cdots, e_{i-1}, v_i$. If the graph is simple (only
one edge between vertices), the path can be represented as:
$p = v_1, v_2, \cdots, v_i$.
Two nodes are connected if there is a path from one to the other. The
transitive closure of a graph G = (V, E) is the graph $G^+ = (V, E^+)$ such
that for all node pairs (v, w) in V there is an edge in $E^+$ if there is a
path between (v, w) in G.
Every directed acyclic graph has a topological ordering; that is, an
ordering of the vertices such that the starting node of every edge occurs
earlier in the ordering than the ending node of the edge. This is important
for the development of proofs for the algorithms used for the addition and
removal of edges.

From      To        From   To
Unknown   4         2      3
Unknown   5         2      5
Unknown   6         2      6
1         3         3      5
1         4         3      6
1         5         4      5
1         6         4      6
                    5      6

Table 3.2: Transitive closure of example pedigree from figure 2.2

3.3 Only one transitive closure needs to be maintained

If you have a directed acyclic graph with a path from $v_i$ to $v_j$ and
you change the direction of all edges, there will now be a path from $v_j$
to $v_i$. The transitive closure contains a list of paths. Changing the
direction of the edges in the original graph will not create new
connections between node pairs or destroy connections present. The
direction of each path, however, will be reversed. If you store the
transitive closure as node pairs that connect parents to all descendants,
then all descendants of an animal can be found by selecting the pairs that
start at that animal, and all ancestors can be found by selecting the pairs
that end at it.
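
As a small illustration, both look-ups can be run against a single stored closure. The set of pairs below is the transitive closure of the example pedigree in figure 2.2, with the label "U" standing in for the unknown sire; the label and the function names are assumptions made for the sketch.

```python
# Transitive closure of the example pedigree as (ancestor, descendant) pairs.
TC = {("U", 4), ("U", 5), ("U", 6),
      (1, 3), (1, 4), (1, 5), (1, 6),
      (2, 3), (2, 5), (2, 6),
      (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}


def descendants(tc, animal):
    """Select on the first node of each pair."""
    return {end for start, end in tc if start == animal}


def ancestors(tc, animal):
    """Select on the second node of each pair; no reversed closure needed."""
    return {start for start, end in tc if end == animal}


print(descendants(TC, 2))
print(ancestors(TC, 5))
```

In a relational table the same two look-ups are selections on the "from" and the "to" column respectively.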

3.4 Incremental maintenance of a transitive closure

Several published papers (Guozhu, Leonid, Jianwen, & Limsoon, 1999 and Dong
& Su, 1995) deal with the theoretical possibility of incrementally
maintaining a transitive closure; these also provide example SQL code.
This section extends these ideas with the aim of providing a solid
background for their use in the application.

3.4.1 Insertion of an edge into a transitive closure


If you insert an edge between a and b, then all nodes that had a path to a
will have a path to b and, similarly, all nodes that had a path from b will
now have a path from a.
There will now be paths between both sets of nodes; that is, from nodes
that had a path to a to nodes that had a path from b. There will also be a
path between a and b.
When nodes are presented as a pair, they represent a path whose node
order gives the direction. Generating the transitive closure (TC) after the
insertion of an edge between a and b requires us to generate two sets of
nodes. Let us describe any path in the TC as $(node_i, node_j)$, with the
path starting at i and ending at j. The new edge being inserted is
$(node_a, node_b)$.

The set of nodes that have paths that end at node a is given by:

$$A = \{ node_i \mid (node_i, node_j) \in TC,\ node_j = a \} \tag{3.1}$$

The set of nodes with a path that starts at node b is given by:

$$B = \{ node_j \mid (node_i, node_j) \in TC,\ node_i = b \} \tag{3.2}$$

The inserted edge starts at the single node a and ends at the single node
b; each is treated as a one entry set.
The new paths created when the edge $(node_a, node_b)$ is added can be found
using the union of Cartesian products between the node sets described above:

$$\begin{aligned}
P_1 &= (A \times b) && \text{All the paths that end at } b. \\
P_2 &= (a \times B) && \text{All the paths that start at } a. \\
P_3 &= (A \times B) && \text{Paths that start in set } A \text{ and end in set } B. \\
P_4 &= (a \times b) && \text{The new edge.} \\
P_S &= P_1 \cup P_2 \cup P_3 \cup P_4
\end{aligned} \tag{3.3}$$

The new TC is the union between the new paths and the old:

$$TC_{new} = TC \cup P_S$$
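
The insertion rule can be sketched with Python sets standing in for the database table. The function names and the "U" label for the unknown sire are assumptions made for the illustration, and the cycle check at the top is an addition that is not part of equation 3.3; it shows how the closure can also be used to reject an edge that would make an animal its own ancestor.

```python
def insert_edge(tc, a, b):
    """Return the transitive closure after adding edge (a, b)."""
    if a == b or (b, a) in tc:
        raise ValueError("edge would create a cycle")
    starts = {i for i, j in tc if j == a} | {a}    # set A plus a itself
    ends = {j for i, j in tc if i == b} | {b}      # set B plus b itself
    # One Cartesian product covers P1, P2, P3 and P4 of equation 3.3.
    new_paths = {(i, j) for i in starts for j in ends}
    return tc | new_paths                          # the union drops repeats


# Edges of the example pedigree in figure 2.2 (parent, offspring).
EDGES = [("U", 4), (1, 4), (1, 3), (2, 3), (4, 5), (3, 5), (5, 6), (2, 6)]


def closure_from_edges(edges):
    """Build a closure by inserting the edges one at a time."""
    tc = set()
    for a, b in edges:
        tc = insert_edge(tc, a, b)
    return tc


tc = closure_from_edges(EDGES)
print(len(tc))
print(sorted(j for i, j in tc if i == 2))
```

Because the result does not depend on the order in which the edges arrive, animals can be entered in any order, which is the requirement set out at the start of the chapter.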

Intersect set
The set of new paths may contain paths that exist in the old TC; that is,
there may be an intersect set. It is the finding of this intersect set that
creates problems when an edge is deleted.

3.4.2 Deletion of an edge from a transitive closure


Let the original set be $TC_{old}$. Once again let an edge be represented by
$(node_i, node_j)$, with the node order giving the direction. The set of all
possible paths that could have been created by the insertion of the edge
being deleted are the suspect paths. This set is created in the same way as
set $P_S$ in formula 3.3 in the previous section.
Paths that are not in the “SUSPECT” set but are in $TC_{old}$ are the
“TRUSTY” paths.
“SUSPECT” will contain paths that would be in the intersect set (see
section 3.4.1) if the edge was being added to the transitive closure. These
can be recovered from “TRUSTY”. There are two cases (shown in figure 3.1)
that need to be considered.
The paths represented by the case shown to the left of figure 3.1 can
be recovered by joining tails to heads. The case to the right cannot be
recovered unless “TRUSTY” contains all edges other than the edge (a, b). If
the shortest path length is maintained, the edges can be kept in “TRUSTY”
by making the subtraction of “SUSPECT” from $TC_{old}$ conditional on the
shortest path length being greater than one.

3.4.3 Three self joins of “TRUSTY” to recover all paths


The results given in Dong & Su, 1995 show that joining “TRUSTY” head to
tail three times recovers all paths that should not have been deleted when
“SUSPECT” was removed from the transitive closure. The case requiring three
joins is shown in figure 3.2.

[Figure 3.1, left panel: graph with nodes 1, 2, a, b and 3]
After “SUSPECT” has been removed, the following node pairs will still exist
within “TRUSTY”: (1, a), (1, 2), (2, 3), along with others. The node pair
(1, 3) would have been removed because that pair started and ended a path
that also went through a, b. Joining heads to tails will join (1, 2) and
(2, 3). The path (1, 3) will be restored.

[Figure 3.1, right panel: graph with nodes 1, a, b and 2]
After “SUSPECT” has been removed, the node pairs (1, b) and (1, 2) will
have been removed because node 1 was connected to node b and 2 through a.
These will not be recovered using the method to the left unless “TRUSTY”
still contains all edges other than the edge between (a, b).

Figure 3.1: Restoration of deleted paths

[Figure 3.2: graph with nodes 1, 2, a, b and 3]
After “SUSPECT” has been removed, the paths from 1 and 2 to b and 3 have
also been removed as there were paths between these nodes through a and b.
The path from any node on the upper side to 2 still exists. The path from 2
to b still exists because it is an edge. The path from b to any node on the
down side will still exist. Three head to tail joins of “TRUSTY” will
recover the missing path.

Figure 3.2: Three head to tail joins of trusted paths required.
The joining of heads to tails will generate a lot of paths that were not in
“SUSPECT”. Because a union removes duplicates, this doesn’t matter.
Let $(node_i, node_j)$ be one path in “TRUSTY” (written T below) and
$(node_v, node_w)$ be another. Then:

$$\begin{aligned}
T_a &= \{ node_i \mid T(node_j) = T(node_v) \} \\
T_b &= \{ node_w \mid T(node_j) = T(node_v) \} \\
U &= T_a \times T_b \\
U_a &= \{ node_i \mid U(node_j) = T(node_v) \} \\
U_b &= \{ node_w \mid U(node_j) = T(node_v) \} \\
I &= U_a \times U_b
\end{aligned} \tag{3.4}$$

The new transitive closure equals:

$$TC_{new} = \{ TC_{old} - SUSPECT \mid short.path > 1 \} \cup I \tag{3.5}$$

3.4.4 Fragments
The intersect set (see section 3.4.1) will only contain paths with start
nodes and end nodes that are in “SUSPECT”. “SUSPECT” is generally smaller
than the transitive closure. When the pedigree is large, speed becomes an
issue and reducing the set size becomes important.

$$\begin{aligned}
S_a &= \{ node_i \mid SELECT(node_i) \} \\
S_b &= \{ node_j \mid SELECT(node_i) \} \\
F_a &= \{ node_i \mid T_{new}(node_i) = S_a \ \mathrm{OR}\ T_{new}(node_j) = S_b \} \\
F_b &= \{ node_j \mid T_{new}(node_i) = S_a \ \mathrm{OR}\ T_{new}(node_j) = S_b \} \\
I &= F_a \times F_b
\end{aligned}$$

[Graph of the example pedigree]
The longest path between node two and six is three edges; the shortest path
between node two and six is one edge.

Figure 3.3: Shortest and longest path

3.4.5 The shortest and longest Path


The utility of the transitive closure is increased if the system maintains
the shortest and longest paths between nodes. For example, if the shortest
path length is maintained, the number of offspring is equal to the number
of paths that start from a node and have a path length of 1.
Several of the algorithms described in the previous sections require the
maintenance of an edge table or the shortest path (a path with a shortest
distance of one is an edge). The longest path is required when ordering
the data for the creation of the vectors needed to calculate the inbreeding
coefficient (see section 2.12).
The algorithms for deleting and inserting an edge rely on joining heads
to tails. The maximum length of a new path is equal to the sum of the
maximum lengths of the joined paths. The minimum path length is equal to
the sum of the minimum path lengths of the joined paths. Issues arise when
there are multiple paths between vertices and the Cartesian product of
nodes results in entries with the same start and end node but different
minimum or maximum path lengths.
The tuples now have four fields
$(node_{from}, node_{to}, path_{min}, path_{max})$, and are identical only
if all fields are the same. If there are multiple paths between vertices
with different path lengths, the union operations will leave multiple
records for the same path. These need to be reduced to one record using
aggregate operators.
Let us describe any path in the TC as
$(node_i, node_j, path_{min}, path_{max})$, with the path starting at i and
ending at j. The new edge being inserted is $(node_a, node_b, 1, 1)$.

The set of nodes that have paths that end at node a is given by:

$$A = \{ (node_i, path_{min}, path_{max}) \mid (node_i, node_j, path_{min}, path_{max}) \in TC,\ node_j = a \} \tag{3.6}$$

The set of nodes with a path that starts at node b is given by:

$$B = \{ (node_j, path_{min}, path_{max}) \mid (node_i, node_j, path_{min}, path_{max}) \in TC,\ node_i = b \} \tag{3.7}$$

If the transitive closure is correctly constructed, there will be one tuple
per node pair and the tuple will contain the maximum and minimum path
lengths between the node pair. A will then only contain one tuple per node
that is joined to a and that tuple will contain the maximum and minimum
lengths to a. Similar arguments apply to set B.
The inserted edge starts at the single node a and ends at the single node
b, and its minimum and maximum path lengths are both one.
The new paths created when the edge $(node_a, node_b, 1, 1)$ is added can be
found using the union of Cartesian products between the node sets described
above:

$$\begin{aligned}
P_1 &= (A \times b,\ A_{pathmin} + b_{pathmin},\ A_{pathmax} + b_{pathmax}) \\
&= (A \times b,\ A_{pathmin} + 1,\ A_{pathmax} + 1) \\
P_2 &= (a \times B,\ B_{pathmin} + a_{pathmin},\ B_{pathmax} + a_{pathmax}) \\
&= (a \times B,\ B_{pathmin} + 1,\ B_{pathmax} + 1) \\
P_3 &= (A \times B,\ A_{pathmin} + B_{pathmin} + 1,\ A_{pathmax} + B_{pathmax} + 1) \\
P_4 &= (a, b, 1, 1) \\
P_S &= P_1 \cup P_2 \cup P_3 \cup P_4
\end{aligned} \tag{3.8}$$

The new TC is the union between the new paths and the old. There will be
multiple entries for a node pair if there are paths with different lengths.

$$TC_{possible} = TC_{old} \cup P_S$$

The entries for a common node pair with different lengths have to be
reduced to one using aggregate functions. The minimum path length will be
the minimum path found in the group with a common node pair, and the
maximum path length will be the maximum path found in the group with a
common node pair. The maximum and minimum paths may come from different
tuples within the group.
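
The insertion with path lengths can be sketched with a dictionary standing in for the four-field table. The representation, the function name and the "U" label for the unknown sire are assumptions made for the illustration. Treating a and b as zero-length paths to themselves lets a single double loop cover $P_1$ to $P_4$, and the `min` and `max` calls play the part of the aggregate functions.

```python
def insert_edge(tc, a, b):
    """tc maps (start, end) -> (min_len, max_len); returns the new mapping."""
    into_a = {i: lengths for (i, j), lengths in tc.items() if j == a}
    into_a[a] = (0, 0)                  # a reaches itself in zero edges
    out_of_b = {j: lengths for (i, j), lengths in tc.items() if i == b}
    out_of_b[b] = (0, 0)                # b reaches itself in zero edges
    updated = dict(tc)
    for i, (min_i, max_i) in into_a.items():
        for j, (min_j, max_j) in out_of_b.items():
            # Length of the joined path: path to a, the new edge, path from b.
            new_min = min_i + 1 + min_j
            new_max = max_i + 1 + max_j
            old_min, old_max = updated.get((i, j), (new_min, new_max))
            # Reduce entries for the same node pair to a single tuple.
            updated[(i, j)] = (min(old_min, new_min), max(old_max, new_max))
    return updated


# Edges of the example pedigree in figure 2.2 (parent, offspring).
EDGES = [("U", 4), (1, 4), (1, 3), (2, 3), (4, 5), (3, 5), (5, 6), (2, 6)]

tc = {}
for a, b in EDGES:
    tc = insert_edge(tc, a, b)
print(tc[(2, 6)])
offspring_of_1 = [j for (i, j), (shortest, _) in tc.items()
                  if i == 1 and shortest == 1]
print(sorted(offspring_of_1))
```

For the example pedigree the entry for the pair (2, 6) holds a shortest path of one edge and a longest path of three edges, matching figure 3.3.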

3.4.6 Building a transitive closure from a depth first search

In section 3.4.7, it is mentioned that the transitive closure can be generated
from node codes. Node codes can be built from a depth first search of an
adjacency list. We have chosen to maintain the transitive closure using in-
cremental maintenance. However, the system effectively keeps the adjacency
list as the animal numbers of the parents are kept in the animal record to
speed up the generation of animal data. This list could be used to generate
the node codes and check the incrementally maintained transitive closure.

3.4.7 Alternative to using a transitive closure


We have chosen to represent the pedigree as a transitive closure, but alter-
native methods have also been developed.

Adjacency list
An adjacency list has pointers that point to the next record in the chain. An
adjacency list is another name for an edge table. The list contains a record
for every edge, and the record contains a start node and an end node. The
data is complete, but no structure information is present and hence, to find
ancestors and descendants, links have to be followed using multiple queries.
As mentioned above, the application effectively maintains an adjacency list
as the sire and dam animal numbers are kept in the animal record.

Node codes
Node codes have been proposed by Elliott, Akgul, Mayes, & Ozsoyoglu, 2007
as a method of representing pedigrees. A node code table is larger than a
transitive closure, as every path is recorded in a string along with the
traversed edges. Each node code is recorded in a table along with an animal
identity.
The node codes for the example pedigree (see figure 2.2) are shown in
figure 3.4. The foundation animals are given codes 0 to n from left to right.
The codes for the offspring of a node are given in sibling order. The node
codes record all paths that can be taken from the foundation animals to the
node, and a separator records the sex of the node traversed. Elliott et al.,
2007 uses “.”, “,” and “;” to denote female, male and don’t know.
If an offspring is inserted into the database, the node code of the
offspring is created by copying the parent node codes, appending a
separator based on the parent’s sex and then appending the offspring code.
The insertion of a leaf is simple, but the insertion of an ancestor is a
little bit more problematic as the codes for all offspring have to be
altered. Elliott et al., 2007 allocated the codes by performing a depth
first search on a pedigree that has been built and encoded using an
adjacency list. The development of algorithms for incremental additions to
the pedigree and concurrent maintenance of the node code table would be an
interesting exercise.

[Graph of the example pedigree]

Animal   Codes
1        0
2        1
3        0.1
         1,0
4        0.0
5        0.0.0
         0.1,0
         1,0,0
6        0.0.0.0
         1,1

Figure 3.4: Node codes

Nodes that have common ancestors have node codes with a common prefix.
This fact can be used to generate the inbreeding coefficient using the node
codes for a particular animal.
The selection of ancestors and offspring is similar to the finding of the
same data using a transitive closure. Offspring are selected by selecting node
codes that have a prefix in common with the ancestor of interest. An ancestor
may have many node codes but all of these prefixes will be propagated to
the children with the path taken from ancestor to offspring encoded in the
offspring’s suffix.
All the ancestors can be found by selecting records with node codes that
have the same prefix strings as the node codes of the animal of interest.

Changing node codes to a transitive closure


As mentioned in the introduction to this section, node codes record all
possible paths to a node. From the node codes, we can determine the start
node, end node and number of edges. If the maximum path length for every
node pair is determined and placed in a table along with the start nodes and
end nodes, we will have the transitive closure with the maximum path lengths.
Such a table, along with the method described in section 2.12, is an
alternative to the method described in Elliott et al., 2007 for the
calculation of inbreeding coefficients.
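
A minimal sketch of this conversion follows. It assumes node codes are stored per animal as strings, that every prefix ending on a complete code is itself the node code of an ancestor, and it uses the codes of animals 1 to 5 from figure 3.4 as sample data. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Node codes of animals 1 to 5, as listed in figure 3.4.
NODE_CODES = {
    1: ["0"],
    2: ["1"],
    3: ["0.1", "1,0"],
    4: ["0.0"],
    5: ["0.0.0", "0.1,0", "1,0,0"],
}


def prefixes(code):
    """Return every prefix of a node code that ends on a complete node."""
    ends = [m.start() for m in re.finditer(r"[.,;]", code)] + [len(code)]
    return [code[:end] for end in ends]


def closure_from_node_codes(node_codes):
    """Return {(ancestor, descendant): (min_depth, max_depth)}."""
    owner = {code: animal
             for animal, codes in node_codes.items()
             for code in codes}
    depths = defaultdict(list)
    for animal, codes in node_codes.items():
        for code in codes:
            chain = prefixes(code)
            # Each proper prefix names an ancestor; the number of
            # separators between prefix and full code is the edge count.
            for i, ancestor_code in enumerate(chain[:-1]):
                edges = len(chain) - 1 - i
                depths[(owner[ancestor_code], animal)].append(edges)
    return {pair: (min(d), max(d)) for pair, d in depths.items()}


tc = closure_from_node_codes(NODE_CODES)
```

Each key of `tc` corresponds to one row of the transitive closure, with the shortest and longest path lengths taken over all node codes that contain the ancestor's code as a prefix.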

3.4.8 Building the transitive closure one herd at a time


It was found that edge insertion and deletion times become excessive when
the pedigree contains more than ten thousand animals; this degraded the
user experience for both large and small herds. The decision was taken to
incrementally maintain the transitive closure for a user’s herd and update
the links between all herds (the main industry transitive closure) in the
background. This section details the steps that need to be taken to support
such a strategy.

Which transitive closure should be used?


A user tends to use herd data more often than he or she updates it. The
industry wide herd provides industry wide details of an animal’s progeny and
ancestors so it is a preferable transitive closure to use. The system will only
use the herd transitive closure for animal display if it contains updates that
have not been transferred to the full herd.

Updating the main transitive closure


When a link is inserted into or deleted from the local herd, it is added to
a table (the main update table) of updates that are to be applied to the
main transitive closure. The main transitive closure is up to date when
the main update table contains no updates for the local herd.
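
The selection rule in the last two subsections can be illustrated with a small in-memory sketch. The class, its method names and the row layout are assumptions made for the example; the application keeps the main update table in the database.

```python
class ClosureSelector:
    """Tracks pending herd updates (a stand-in for the main update table)."""

    def __init__(self):
        # Each row: (herd, operation, parent_node, child_node).
        self.main_update_table = []

    def record(self, herd, operation, parent_node, child_node):
        """Queue a link insert or delete made in a local herd."""
        self.main_update_table.append(
            (herd, operation, parent_node, child_node))

    def closure_for_display(self, herd):
        """Use the herd closure only while the herd has pending updates."""
        pending = any(row[0] == herd for row in self.main_update_table)
        return "herd" if pending else "main"

    def run_background_update(self, apply_to_main):
        """Apply queued updates to the main closure, oldest first."""
        while self.main_update_table:
            apply_to_main(self.main_update_table.pop(0))
```

A caller would pass the routine that performs the edge insert or unlink on the main transitive closure as `apply_to_main`.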

3.5 Conclusion
It is possible to incrementally maintain the transitive closure, the minimum
path length and the maximum path length. Node codes as used by Elliott
et al., 2007 were introduced and their relationship to the transitive closure
considered. The next chapter considers the implementation of the algorithms
introduced in this chapter using PostgreSQL. As mentioned previously,
chapter 5 will consider how the transitive closure is used within the
application.

Chapter 4

Incremental maintenance of a transitive closure using SQL

This chapter looks at using PostgreSQL version 7.1 and version 8.0 to
implement the incremental maintenance of a transitive closure. The theory
underpinning this code is discussed in chapter 3.

4.1 PostgreSQL transitive closure edge insert


Writing code that utilized the improvements delivered with PostgreSQL
version 8 did not significantly improve the performance of the edge insert
code, so the version 7.1 code is used in version 8 systems.
As the code maintains the maximum path length, this code extends the
work presented by Pang et al., 2005.

4.1.1 PostgreSQL version 7.1 code


TRUNCATE TABLE tc_temp;
TRUNCATE TABLE deltas;

Possible paths created by the insertion of the edge are contained in tc_temp.
This includes paths that already exist in the transitive closure, that is it
includes the intersect set. The intersect set is the set of paths that exists in
tc_temp and the transitive closure.
INSERT INTO tc_temp (
SELECT from_node,child_node AS to_node ,max_depth+1 AS max_depth,
min_depth+1 AS min_depth
FROM parent_child_tc WHERE to_node=parent_node
UNION
SELECT parent_node AS from_node, to_node , max_depth+1 AS max_depth,
min_depth+1 AS min_depth
FROM parent_child_tc WHERE from_node=child_node
UNION
SELECT TC1.from_node AS from_node,TC2.to_node AS to_node,
((TC1.max_depth)+(TC2.max_depth)+1) AS max_depth,
((TC1.min_depth)+(TC2.min_depth)+1) AS min_depth
FROM parent_child_tc AS TC1, parent_child_tc AS TC2
WHERE TC1.to_node=parent_node AND TC2.from_node=child_node
UNION
SELECT parent_node AS from_node,child_node AS to_node,
1 AS max_depth,1 AS min_depth
);

Remove the intersect set to leave the new paths, and only the new paths, in
deltas.
INSERT INTO deltas (
SELECT * FROM tc_temp WHERE NOT EXISTS (
SELECT * FROM parent_child_tc
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node = tc_temp.to_node
)
);
Insert the new paths into the transitive closure.
INSERT INTO parent_child_tc SELECT * FROM deltas;
The paths in tc_temp that were already in the transitive closure can have
different path lengths. We make sure the maximum path length is set to the
maximum path found in the set with the same start and end node.
UPDATE parent_child_tc SET max_depth=tc_temp.max_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node=tc_temp.to_node AND
parent_child_tc.max_depth < tc_temp.max_depth ;
And we make sure the minimum path is set to the minimum path.
UPDATE parent_child_tc SET min_depth=tc_temp.min_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
parent_child_tc.to_node=tc_temp.to_node AND
parent_child_tc.min_depth > tc_temp.min_depth ;
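
The edge insert steps can be exercised end to end with SQLite from the Python standard library. This is a hedged sketch rather than the PostgreSQL code: the schema is an assumption, bound parameters replace the parent_node and child_node placeholders, and correlated subqueries stand in for UPDATE ... FROM.

```python
import sqlite3


def make_db():
    """Create an empty closure table and work table (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE parent_child_tc (from_node, to_node, max_depth, min_depth);
        CREATE TABLE tc_temp (from_node, to_node, max_depth, min_depth);
    """)
    return db


def insert_edge(db, parent, child):
    """Add the edge parent -> child and update parent_child_tc."""
    args = {"p": parent, "c": child}
    db.execute("DELETE FROM tc_temp")
    # Every path that could be created by the new edge.
    db.execute("""
        INSERT INTO tc_temp (from_node, to_node, max_depth, min_depth)
        SELECT from_node, :c, max_depth + 1, min_depth + 1
          FROM parent_child_tc WHERE to_node = :p
        UNION
        SELECT :p, to_node, max_depth + 1, min_depth + 1
          FROM parent_child_tc WHERE from_node = :c
        UNION
        SELECT a.from_node, b.to_node,
               a.max_depth + b.max_depth + 1, a.min_depth + b.min_depth + 1
          FROM parent_child_tc AS a, parent_child_tc AS b
         WHERE a.to_node = :p AND b.from_node = :c
        UNION
        SELECT :p, :c, 1, 1
    """, args)
    # Insert only the node pairs that are not yet in the closure.
    db.execute("""
        INSERT INTO parent_child_tc (from_node, to_node, max_depth, min_depth)
        SELECT t.from_node, t.to_node, t.max_depth, t.min_depth
          FROM tc_temp AS t
         WHERE NOT EXISTS (
               SELECT 1 FROM parent_child_tc AS tc
                WHERE tc.from_node = t.from_node AND tc.to_node = t.to_node)
    """)
    # Widen the stored path lengths for pairs that already existed.
    for agg, col, cmp in (("MAX", "max_depth", ">"), ("MIN", "min_depth", "<")):
        db.execute(f"""
            UPDATE parent_child_tc
               SET {col} = (SELECT {agg}(t.{col}) FROM tc_temp AS t
                             WHERE t.from_node = parent_child_tc.from_node
                               AND t.to_node = parent_child_tc.to_node)
             WHERE EXISTS (SELECT 1 FROM tc_temp AS t
                            WHERE t.from_node = parent_child_tc.from_node
                              AND t.to_node = parent_child_tc.to_node
                              AND t.{col} {cmp} parent_child_tc.{col})
        """)
```

Inserting A to B, B to C and then A to C leaves the pair (A, C) with a minimum path length of one and a maximum of two, which is the behaviour the two UPDATE statements above are there to provide.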

4.2 PostgreSQL transitive closure edge unlink
The code presented maintains the minimum and maximum path length
between nodes. Pang et al., 2005 presents code to maintain the minimum path
length; the code presented here extends that work in the following ways. The
code doesn’t delete edges from the transitive closure when “SUSPECT” is
removed; this is possible because the minimum path length is maintained.
The code uses “SUSPECT” and the transitive closure remaining after
“SUSPECT” is removed (“TRUSTY” in version 7.1) to generate a fragment table
that contains all paths that may be useful in the regeneration of lost paths.
If many paths don’t go through the deleted edge, this is faster than
regenerating the lost paths directly from trusted paths. The code maintains
the maximum path length as well as the minimum path length.
The edge being deleted runs from parent_node to child_node. The transitive
closure table has the columns from_node, to_node, min_depth and max_depth.

PostgreSQL version 8 code
Truncating the table is faster than creating it. TRUNCATE is a PostgreSQL
extension.
TRUNCATE TABLE SUSPECT, fragments, delta ;

Paths that go through the link to be deleted are added to the set “SUS-
PECT”. Paths in “SUSPECT” will be removed if the minimum path length
is greater than one.
INSERT INTO SUSPECT (
SELECT X.from_node AS from_node, Y.to_node AS to_node
FROM parent_child_tc AS X,parent_child_tc AS Y
WHERE X.to_node = parent_node AND Y.from_node= child_node
UNION
SELECT X.from_node AS from_node, child_node AS to_node
FROM parent_child_tc AS X
WHERE X.to_node = parent_node
UNION
SELECT parent_node AS from_node, X.to_node AS to_node
FROM parent_child_tc AS X
WHERE X.from_node = child_node
UNION
SELECT parent_node , child_node
);

Deleting is faster than generating a new table. We delete from parent_child_tc
any path in “SUSPECT” that has a minimum path length greater than one.
We also remove the edge that is being deleted.

DELETE FROM parent_child_tc AS tc USING SUSPECT AS sp
WHERE (
((sp.from_node=tc.from_node) AND
(sp.to_node=tc.to_node) AND
(tc.min_depth<>1)) OR
(tc.from_node= parent_node AND tc.to_node= child_node )
)
;

We didn’t delete anything from the transitive closure if it has a minimum
path length equal to one (other than the edge being deleted). This leaves in
the transitive closure entries with a minimum path of one, but an unknown
maximum path length. To deal with this, we set the maximum path length
to one if a record with the same start and end node is in “SUSPECT”. We
don’t have to test the minimum path length as the previous code will have
removed paths that are in the “SUSPECT” set if the minimum path length
was greater than one.
UPDATE parent_child_tc SET max_depth = 1 FROM SUSPECT WHERE
(parent_child_tc.from_node=SUSPECT.from_node) AND
(parent_child_tc.to_node= SUSPECT.to_node)
;
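
The three steps so far (build “SUSPECT”, delete, reset the maximum path length) can be tried on a small closure using SQLite. This is a sketch under assumptions: the schema is invented for the example, and because SQLite has no DELETE ... USING, EXISTS subqueries are used instead. The regeneration of lost paths described next is deliberately left out.

```python
import sqlite3


def unlink_first_steps(db, parent, child):
    """Run the SUSPECT, DELETE and max_depth reset steps for one edge."""
    args = {"p": parent, "c": child}
    db.execute("DELETE FROM suspect")
    # Every path that may pass through the edge parent -> child.
    db.execute("""
        INSERT INTO suspect (from_node, to_node)
        SELECT x.from_node, y.to_node
          FROM parent_child_tc AS x, parent_child_tc AS y
         WHERE x.to_node = :p AND y.from_node = :c
        UNION
        SELECT x.from_node, :c FROM parent_child_tc AS x WHERE x.to_node = :p
        UNION
        SELECT :p, x.to_node FROM parent_child_tc AS x WHERE x.from_node = :c
        UNION
        SELECT :p, :c
    """, args)
    # Drop suspect paths longer than one edge, plus the edge itself.
    db.execute("""
        DELETE FROM parent_child_tc
         WHERE (min_depth <> 1 AND EXISTS (
                   SELECT 1 FROM suspect AS sp
                    WHERE sp.from_node = parent_child_tc.from_node
                      AND sp.to_node = parent_child_tc.to_node))
            OR (from_node = :p AND to_node = :c)
    """, args)
    # Surviving suspect entries are direct edges with an unknown maximum.
    db.execute("""
        UPDATE parent_child_tc SET max_depth = 1
         WHERE EXISTS (SELECT 1 FROM suspect AS sp
                        WHERE sp.from_node = parent_child_tc.from_node
                          AND sp.to_node = parent_child_tc.to_node)
    """)


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    CREATE TABLE suspect (from_node, to_node);
    -- Closure of the edges A->B, B->C, A->C and C->D.
    INSERT INTO parent_child_tc VALUES
        ('A','B',1,1), ('B','C',1,1), ('A','C',1,2),
        ('C','D',1,1), ('B','D',2,2), ('A','D',2,3);
""")
unlink_first_steps(db, "B", "C")
remaining = sorted(db.execute("SELECT * FROM parent_child_tc"))
```

In this example the path from A to D is removed even though it still exists through the edge from A to C; restoring such paths is the job of the fragment and delta steps.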

The fragments table is a subset of the transitive closure that can regenerate
paths that should not have been deleted. The idea is simple: if a remaining
path in the transitive closure has neither a start node nor an end node that
appears in the deleted set, the path can’t be helpful in regenerating paths
that should not have been lost. Finding this subset takes less time than
working with the full “TRUSTY” set when regenerating the paths.
INSERT INTO fragments (
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM parent_child_tc AS pc JOIN SUSPECT AS sp
ON (sp.from_node=pc.from_node)
UNION
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM parent_child_tc AS pc JOIN SUSPECT AS sp
ON (sp.to_node=pc.to_node)
);

Using the fragment set we construct delta, the set of paths that may have
been lost but should not have been. This will contain paths that are already
in the transitive closure; these additional paths are deleted in the next
step. Three self joins of fragments are required to recover all possible
lost paths (see section 3.4.3).
INSERT INTO delta (
SELECT fr1.from_node,fr2.to_node,
(fr1.min_depth+fr2.min_depth) AS min_depth,
(fr1.max_depth+fr2.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2
WHERE fr1.to_node=fr2.from_node
UNION
SELECT fr1.from_node,fr3.to_node,
(fr1.min_depth+1+fr3.min_depth) AS min_depth,
(fr1.max_depth+1+fr3.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
WHERE (fr1.to_node=fr2.from_node) AND
(fr2.to_node = fr3.from_node)
);

All we want is the paths that should not have been lost.
DELETE FROM delta
USING parent_child_tc AS pc
WHERE (
(delta.from_node=pc.from_node) AND
(delta.to_node=pc.to_node)
)
;

Among the lost paths there can be multiple paths between the same pair of
nodes, but we are only interested in the shortest and longest path of each
group (common start and end node).
INSERT INTO parent_child_tc
(SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
FROM delta AS a
GROUP BY a.from_node,a.to_node
)
;

PostgreSQL version 7.1 code


Version 7.1 doesn’t support the truncation of multiple tables with one com-
mand.
TRUNCATE TABLE SUSPECT;
TRUNCATE TABLE fragments;
TRUNCATE TABLE delta;
TRUNCATE TABLE TRUSTY;
TRUNCATE TABLE tc_new;

Version 7.1 and version 8.0 create “SUSPECT” using the same strategy.
INSERT INTO SUSPECT (
SELECT X.from_node AS from_node, Y.to_node AS to_node
FROM parent_child_tc AS X, parent_child_tc AS Y
WHERE X.to_node=parent_node AND Y.from_node=child_node
UNION
SELECT X.from_node AS from_node, child_node AS to_node
FROM parent_child_tc AS X
WHERE X.to_node=parent_node
UNION
SELECT parent_node AS from_node,X.to_node AS to_node
FROM parent_child_tc AS X
WHERE X.from_node=child_node
UNION
SELECT parent_node,child_node
);

Version 8 creates the trusted set by deleting “SUSPECT” from the transitive
closure. The DELETE statement in version 7.1 is not as well developed, so
instead a “TRUSTY” table has to be created. A left outer join is used
instead of the NOT EXISTS code used in the code presented by Guozhu et al.,
1999 because the 7.1 optimizer is not smart enough to convert the query
used by NOT EXISTS to a left outer join. Instead it executes the inner query
in a loop, resulting in very poor performance. This is not a universal problem
with version 7.1, as the NOT EXISTS clause used in the edge insert code was
correctly optimized. The problem disappeared in version 8.
INSERT INTO TRUSTY (
SELECT parent_child_tc.from_node,parent_child_tc.to_node,
parent_child_tc.min_depth,parent_child_tc.max_depth
FROM parent_child_tc LEFT OUTER JOIN SUSPECT ON (
SUSPECT.from_node=parent_child_tc.from_node AND
SUSPECT.to_node = parent_child_tc.to_node
)
WHERE (SUSPECT.from_node IS null) AND
(parent_child_tc.min_depth<>1)
UNION
SELECT parent_child_tc.from_node,parent_child_tc.to_node,1,1
FROM parent_child_tc
WHERE (parent_child_tc.min_depth=1) AND
(NOT(
parent_child_tc.from_node=parent_node AND
parent_child_tc.to_node=child_node
)
)
);
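
The left outer join used above is an anti-join: it returns the same rows as the NOT EXISTS form. The sketch below shows the equivalence on a tiny data set using SQLite; the schema and rows are invented for the illustration, and it says nothing about how either form is planned by PostgreSQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    CREATE TABLE suspect (from_node, to_node);
    INSERT INTO parent_child_tc VALUES
        ('A','B',1,1), ('A','C',2,2), ('B','C',1,1);
    INSERT INTO suspect VALUES ('A','C');
""")

# Rows of the closure with no matching row in suspect, written two ways.
not_exists_rows = db.execute("""
    SELECT tc.from_node, tc.to_node
      FROM parent_child_tc AS tc
     WHERE NOT EXISTS (
           SELECT 1 FROM suspect AS sp
            WHERE sp.from_node = tc.from_node AND sp.to_node = tc.to_node)
     ORDER BY 1, 2
""").fetchall()

left_join_rows = db.execute("""
    SELECT tc.from_node, tc.to_node
      FROM parent_child_tc AS tc
      LEFT OUTER JOIN suspect AS sp
        ON (sp.from_node = tc.from_node AND sp.to_node = tc.to_node)
     WHERE sp.from_node IS NULL
     ORDER BY 1, 2
""").fetchall()
```

The `IS NULL` test works because an unmatched closure row is padded with NULLs for the columns of the joined table.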

The version 7.1 fragment set is created in a similar manner to the version
8 set. The version 8 code uses the reduced transitive closure set created by
deleting “SUSPECT” from the original transitive closure, while the version
7.1 code uses the “TRUSTY” table created above.
INSERT INTO fragments (

SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
(sp.from_node=pc.from_node)
UNION
SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
(sp.to_node=pc.to_node)
);

Version 8 code created delta and then deleted paths that were already in the
transitive closure that remained after “SUSPECT” was removed. Because
DELETE is not as well developed, version 7.1 creates a separate table tc_new
and then only puts into delta those paths not in “TRUSTY”. The version
7.1 code to create tc_new is similar to the version 8 code that creates the
initial delta.
INSERT INTO tc_new (
SELECT fr1.from_node,fr2.to_node,
(fr1.min_depth+fr2.min_depth) AS min_depth,
(fr1.max_depth+fr2.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2
WHERE fr1.to_node=fr2.from_node
UNION
SELECT fr1.from_node,fr3.to_node,
(fr1.min_depth+1+fr3.min_depth) AS min_depth,
(fr1.max_depth+1+fr3.max_depth) AS max_depth
FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
WHERE (fr1.to_node=fr2.from_node) AND
(fr2.to_node = fr3.from_node)
);

In version 8, trusted paths are deleted from delta, while in version 7.1, paths
in tc_new that are not in “TRUSTY” are put in delta. Once again, we use
a left outer join because NOT EXISTS is not being converted to an outer join
by the version 7.1 optimizer.
INSERT INTO delta (
SELECT tc_new.from_node,tc_new.to_node,
tc_new.max_depth,tc_new.min_depth
FROM tc_new LEFT OUTER JOIN TRUSTY ON
(TRUSTY.from_node=tc_new.from_node AND
TRUSTY.to_node=tc_new.to_node)
WHERE TRUSTY.from_node IS null
);

Because DELETE is not as well developed, parent_child_tc is rebuilt from
scratch.
TRUNCATE parent_child_tc;

The new parent_child_tc contains “TRUSTY” and a single entry
for each path found in delta, with the entry containing the maximum and
minimum path lengths within the group.
INSERT INTO parent_child_tc
(SELECT * FROM TRUSTY)
UNION (
SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
FROM delta AS a
GROUP BY a.from_node,a.to_node
)
;

4.3 Conclusion
We have considered the PostgreSQL version 7.1 and version 8.0 code needed
to implement the algorithms presented in chapter 3. Different algorithms
are used for each PostgreSQL version because of PostgreSQL version 7.1
limitations. Different SQL databases offer slightly different versions of SQL,
so the code presented here may require modification if different database
systems are used. The next chapter looks at how the transitive closure is
used within the application, and the SQL code needed to implement the
desired functionality.

Chapter 5

Using the transitive closure, the SQL code

This chapter discusses the various ways the transitive closure has been
used to meet the application goals.

5.1 Introduction to database table structure


To understand the following code, some of the database table structure needs
to be introduced (see figure 5.1). Users of Merrrit are allowed to build pedi-
grees backwards; that is, the user is allowed to enter data for recent animals
and then add data for older animals as it becomes available. Older data is
often less accessible because it is stored on obsolete systems, or because it
is in paper form requiring additional work to convert it into electronic form.
When the system is used in this way, the identification of older animals is dis-
covered when the sire and dam are loaded as data within an animal record.
In this situation the only information available is the animal’s identity and
sex (a sire is male, a dam is female).
When a parent is identified and it is not in the database, an animal_select
record is created, but an animal record is not. The animal_select record
contains the internal animal identification (created by the system), the tag
data used by the farmer to identify the animal, and the sex.
The animal record can identify four other animals: the sire, the biological
mother, the birth mother and the raising mother. When embryo transfer
has occurred the biological mother and birth mother will be different. The
raising mother can be different when offspring are mis-mothered. The sire,
dam, birth mother and raising mother are recorded in the animal record
using internal numbers; external identification data is only stored in the
select record.
The parent_child_tc contains an entry if there is a genetic path between
two animals. The longest path linking the two animals, and the shortest path,
are maintained as part of the record.

5.2 Valid birth date


Merrrit allows the user to build the pedigree backwards. Animals are added
to the database when input data specifies an unknown animal as a sire or
dam. The birth dates are set only if a record giving that data is found. The
birth date of a parent has to be earlier than the birth date of descendants.
To enforce this restriction, the transitive closure is used to collect the earliest
birth date of all descendants. The earliest birth date is obtained with the
following code.

SELECT min(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.to_node = an.animal
WHERE tc.from_node = 'animal_of_interest' AND
tc.min_depth = 1
;

Figure 5.1: Animal record structure

animal_select table (id, tag details, sex)
    The animal identification and sex. Created when a record is required
    for a parent or for an animal whose data is being entered. When this
    record is created, an internal animal number is allocated by the
    system (animal id).
animal table (id, more data)
    All other animal details. This record is only created if the animal is
    referenced in some way other than being a parent. That is, not all
    animals have an animal record, and not all animals have a birth date.

The PostgreSQL version 7.1 optimizer has problems with the following code.
The optimizer fails to note that the select can be converted to a join, so
the inner select ends up in a loop and the performance is terrible.
SELECT min(birth_date) FROM animals WHERE
(animals.animal IN
( SELECT to_node AS animal
FROM parent_child_tc
WHERE from_node= 'animal_of_interest'
)
)
;

Merrrit users can also build the database in a forward direction. If this is
done, we must check that the new animal is not being born before its parents.
The following code selects the latest birth date of the parents.
SELECT max(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.from_node = an.animal
WHERE tc.to_node = 'animal_of_interest' AND tc.min_depth = 1
;
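
Taken together, the two queries bound the birth date an animal may be given. The sketch below combines them using SQLite; it assumes birth dates are stored as ISO 8601 strings (so that string comparison orders them correctly), and the schema, helper names and sample rows are invented for the example.

```python
import sqlite3


def birth_date_bounds(db, animal):
    """Return (latest parent birth date, earliest offspring birth date)."""
    latest_parent = db.execute("""
        SELECT max(an.birth_date) FROM parent_child_tc AS tc
          LEFT JOIN animals AS an ON tc.from_node = an.animal
         WHERE tc.to_node = ? AND tc.min_depth = 1
    """, (animal,)).fetchone()[0]
    earliest_child = db.execute("""
        SELECT min(an.birth_date) FROM parent_child_tc AS tc
          LEFT JOIN animals AS an ON tc.to_node = an.animal
         WHERE tc.from_node = ? AND tc.min_depth = 1
    """, (animal,)).fetchone()[0]
    return latest_parent, earliest_child


def birth_date_is_valid(db, animal, birth_date):
    """A date is valid if it lies strictly between the two bounds."""
    latest_parent, earliest_child = birth_date_bounds(db, animal)
    if latest_parent is not None and birth_date <= latest_parent:
        return False
    if earliest_child is not None and birth_date >= earliest_child:
        return False
    return True


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    CREATE TABLE animals (animal, birth_date);
    -- A is a parent of B and B is a parent of C; B has no animal record yet.
    INSERT INTO parent_child_tc VALUES
        ('A','B',1,1), ('B','C',1,1), ('A','C',2,2);
    INSERT INTO animals VALUES ('A','2001-09-01'), ('C','2005-10-01');
""")
```

A bound of `None` means no parent or offspring with a recorded birth date was found, in which case that side of the check is skipped.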

5.3 Can’t change the sex of an animal with descendants
When creating an animal_select record because the animal was specified as
a sire or a dam in an offspring’s record, we set the sex; a dam is female
and a sire is male. Input records seen later cannot change the sex without
corrupting the pedigree. In other words, if there are no descendants the
animal’s sex can be changed; if there are, it cannot.
The following code counts the number of descendants.
SELECT count(*) FROM parent_child_tc AS tc
WHERE tc.from_node='animal_of_interest';
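
A guard built on this count might look like the following sketch, which uses SQLite and an invented helper name; how the application wires the check into its input handling is not shown in the text.

```python
import sqlite3


def can_change_sex(db, animal):
    """The sex may be changed only if the animal has no descendants."""
    (descendants,) = db.execute(
        "SELECT count(*) FROM parent_child_tc AS tc WHERE tc.from_node = ?",
        (animal,)).fetchone()
    return descendants == 0


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    INSERT INTO parent_child_tc VALUES (1, 3, 1, 1), (2, 3, 1, 1);
""")
```

Here animals 1 and 2 are recorded as parents of animal 3, so only animal 3 may still have its sex changed.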

5.4 Calculation of the inbreeding coefficient
To calculate the inbreeding coefficient, we need a vector of ancestors from
the first animal to enter the pedigree to the most recent. This was discussed
in section 2.12.
The short piece of code below performs this task. In the following code
the animal we are interested in is referred to as the animal_of_interest.
The transitive closure contains a from_node and a to_node. The from_node
is the ancestor node, the to_node is the descendant node. If the to_node
equals the animal_of_interest the from_node gives the animal number
of an ancestor. The max_depth data tells how many generations back the
animal entered the pedigree. The min_depth data tells how recently the
animal appears in the pedigree.
To get the records in order, the result is sorted on the depth field, deepest
first. The LEFT OUTER JOIN with the animal table picks up the sire and dam
animal records, if they exist. The SELECT after the UNION is picking up the
details for the animal we want the information on. An animal does not have
a transitive closure entry pointing to itself.

SELECT tc.from_node,an.sire,an.dam,tc.max_depth
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
ON tc.from_node = an.animal
WHERE
(tc.to_node='animal_of_interest') AND
(tc.max_depth < 'max_depth')
UNION
SELECT ans.animal AS from_node,an.sire,an.dam,0 AS max_depth
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
ON (ans.animal = an.animal)
WHERE ans.animal='animal_of_interest'
ORDER BY max_depth DESC ;
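
To see the ordering this query produces, the sketch below runs an equivalent statement in SQLite on a five animal pedigree. The table layouts and the sample data are assumptions made for the example, and bound parameters replace the placeholders.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    CREATE TABLE animals (animal, sire, dam);
    CREATE TABLE animal_select (animal, sex);
    -- Animal 5 has sire 3 and dam 4; animals 1 and 2 are foundation
    -- animals with no animal record.
    INSERT INTO animals VALUES (3, 2, 1), (4, NULL, 1), (5, 3, 4);
    INSERT INTO animal_select VALUES
        (1,'F'), (2,'M'), (3,'M'), (4,'F'), (5,'F');
    INSERT INTO parent_child_tc VALUES
        (1,3,1,1), (2,3,1,1), (1,4,1,1), (3,5,1,1),
        (4,5,1,1), (1,5,2,2), (2,5,2,2);
""")


def ancestor_vector(db, animal, max_depth):
    """Ancestors of `animal`, deepest first, ending with the animal itself."""
    return db.execute("""
        SELECT tc.from_node, an.sire, an.dam, tc.max_depth AS max_depth
          FROM parent_child_tc AS tc
          LEFT OUTER JOIN animals AS an ON tc.from_node = an.animal
         WHERE tc.to_node = :a AND tc.max_depth < :d
        UNION
        SELECT ans.animal, an.sire, an.dam, 0 AS max_depth
          FROM animal_select AS ans
          LEFT OUTER JOIN animals AS an ON ans.animal = an.animal
         WHERE ans.animal = :a
         ORDER BY max_depth DESC
    """, {"a": animal, "d": max_depth}).fetchall()


vector = ancestor_vector(db, 5, 10)
```

The order of animals that share the same depth is not defined by the query, so code consuming the vector should rely only on the depth ordering.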

5.5 Descendant records


To display the descendants, we need data from the descendant records. We
could follow the links in the animal records and read in new records as the
descendants are discovered (this would require a depth first search). It is
faster to use the transitive closure and read the entire set with one database
read request. Care must be taken when using this code as memory usage can
be a problem if an animal has too many descendants. This issue is discussed
further in section 6.2.
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
ON (tc.to_node = an.animal)
JOIN animal_select AS ans
ON (tc.to_node = ans.animal)
WHERE (tc.from_node='animal_of_interest')
UNION
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
ON (ans.animal=an.animal)
WHERE ans.animal='animal_of_interest'
;
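
The mix of JOIN and LEFT OUTER JOIN matters here: every descendant has an animal_select record, but not necessarily an animal record. The sketch below runs a cut-down version of the query in SQLite and loads the result into a dictionary keyed by animal number; the reduced column list and the sample data are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    CREATE TABLE animals (animal, sire, dam);
    CREATE TABLE animal_select (animal, sex);
    INSERT INTO animals VALUES (3, 2, 1), (4, NULL, 1), (5, 3, 4);
    INSERT INTO animal_select VALUES
        (1,'F'), (2,'M'), (3,'M'), (4,'F'), (5,'F');
    INSERT INTO parent_child_tc VALUES
        (1,3,1,1), (2,3,1,1), (1,4,1,1), (3,5,1,1),
        (4,5,1,1), (1,5,2,2), (2,5,2,2);
""")


def descendant_records(db, animal):
    """Read the animal and all its descendants with one query."""
    rows = db.execute("""
        SELECT ans.animal, ans.sex, an.sire, an.dam
          FROM parent_child_tc AS tc
          LEFT OUTER JOIN animals AS an ON (tc.to_node = an.animal)
          JOIN animal_select AS ans ON (tc.to_node = ans.animal)
         WHERE tc.from_node = :a
        UNION
        SELECT ans.animal, ans.sex, an.sire, an.dam
          FROM animal_select AS ans
          LEFT OUTER JOIN animals AS an ON (ans.animal = an.animal)
         WHERE ans.animal = :a
    """, {"a": animal})
    return {row[0]: row[1:] for row in rows}


records = descendant_records(db, 1)
```

Animal 1 has no animal record, so its sire and dam come back as `None`, yet it is still present in the result because of the outer join.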

5.6 Ancestor records


To display the pedigree, the ancestor animal_select records are required.
To obtain this with one database read, the transitive closure is used to select
the records.
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
ON (tc.from_node = an.animal)
JOIN animal_select AS ans ON (tc.from_node = ans.animal)
WHERE (tc.to_node='animal_of_interest') AND
(tc.max_depth < 'max_depth')
UNION
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an ON (ans.animal=an.animal)
WHERE ans.animal='animal_of_interest';

5.7 Conclusion
The transitive closure is used to maintain data integrity, to select the records
needed to display the pedigree and descendants, and to generate the data
structure needed to calculate the inbreeding coefficient. We now move on to
a new topic: efficiently displaying the pedigree and descendants using a
browser.

Chapter 6

Displaying pedigrees and descendants

A human pedigree aims to display relationships and siblings common to
that relationship in birth order (Bennett et al., 1995). This creates complex
structures that are difficult to display, and as a result there is a body of litera-
ture that considers the problem (Tores & Barillot, 2001). The flow of genetic
material is important when displaying animal pedigrees, not relationships.
The flow of genetic material to an individual diploid can be represented by
a binary tree.
A survey of the animal genetics literature indicates animal pedigrees still
tend to use the Greek symbols for Mars (♂) and Venus (♀) to denote male and
female, with the standard symbols (circles and squares) used by human
geneticists (Schott, 2005) not being favored.
The pedigree displayed by Merrrit revolves around an individual animal. One
diagram shows the animal’s ancestors and another its offspring.

6.1 Ancestors
A tree structure is used to display the pedigree, with the animal of
interest to the left and the ancestors to the right. Next to each animal
node is a link that can be used to display the full animal record. The
structure to be displayed is created using PHP classes from the pear package
HTML_TreeMenu (Radi & Heyes, 2008). The PHP code that uses these classes
links together objects (instances of the PHP class) to describe the desired
menu. The printMenu() method then uses the resulting structure to output
JavaScript that the browser uses to display the menu. The browser puts
together the tree display using icons pointed to when the PHP class objects
are added to the PHP structure described above.

Figure 6.1: Partially opened Merrrit pedigree display for animal 4461

Figure 6.1 shows an example pedigree. The arrows embedded in the
graphics call JavaScript code when they are clicked, and are used to open
and close pedigree branches. When the branch is open, the arrow points into
the branch, and when the branch is closed (details not displayed) the arrow
points down. The depth of the undisplayed pedigree can be determined using
the “depth” data displayed as part of the animal’s URL.
The application code to create the PHP structure used to output the
JavaScript is found in the file /class/pedigree.php. The method used to
create a node in the tree is recursive, calling itself if there are ancestors to
itself. The recursion returns an object which is linked to a new object that is
then returned. In other words, it is a classic depth first search using recursive
code. The recursive code also returns how deeply the recursion is nested, and
the depth value returned is used as part of the data string sent to the browser
as the link text.
To reduce the number of calls to the SQL server, the code uses the tran-
sitive closure to select all of an animal’s ancestor records in one read. The
depth first search then becomes a memory operation that is reasonably fast.
The pedigree downloaded to the browser contains all known ancestors,
but only the top four branches are displayed open. This reduces the size
of the initially displayed tree. If interested, the user can click on the arrow
icons to open closed branches.
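
The recursive construction described in this section can be sketched as follows. This is an illustration in Python rather than the PHP of /class/pedigree.php: the record layout, the dictionary used for a tree node and the convention that an animal with no known parents has depth zero are all assumptions.

```python
def build_node(animal, records):
    """Build the pedigree tree for `animal` from records held in memory.

    `records` maps an animal number to its (sire, dam) pair, as read in
    one query using the transitive closure. Unknown parents are None.
    """
    sire, dam = records.get(animal, (None, None))
    # Recurse into each known parent: a depth first search.
    ancestors = [build_node(parent, records)
                 for parent in (sire, dam) if parent is not None]
    # Depth is the number of ancestor generations below this node.
    depth = 1 + max(node["depth"] for node in ancestors) if ancestors else 0
    return {"animal": animal, "depth": depth, "ancestors": ancestors}


tree = build_node(5, {5: (3, 4), 3: (2, 1), 4: (None, 1)})
```

Because all records are already in memory, the recursion performs no further database reads; the depth stored in each node is what would be shown alongside a closed branch.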

6.2 Descendants
A tree structure is used to display descendants with the animal of interest
on the left and offspring to the right. The initial rendering displays all the
offspring (see figure 6.2) but none of the offspring’s descendants. Arrows are
added to the graphics if there are not too many offspring; these arrows may
be used to open up branches that show the identity of the descendants of the
offspring.
Animals early in the pedigree of a large herd can have many offspring.
For example, one of the herds used to test the system, which has over 10000
records, had an early buck, KASHMERE NERO, that had considerable
influence on that herd and had over 5000 descendants. Using the transitive
closure to read all records needed to display all descendants can be dangerous,
and if not constrained, the server may not have enough memory allocated
to the task for record storage. Further, if communication speeds are
moderate and the number of progeny is not restrained, downloading all progeny
information to the browser can take considerable time.
Figure 6.2: Displaying descendants. The offspring of the animal are initially
rendered, with the branches detailing the descendants of the offspring
closed. Opening branches displays additional descendants.

The transitive closure is used to determine the number of descendants
before the display of descendants is started. If there are too many
descendants, the transitive closure is used to read only the offspring
records whose minimum path length equals one, and the offspring are displayed
using the set of records thus read. Using natural breeding methods, the
number of children is limited by the number of offspring that can be sired in
a year (for goats less than 200) and by the number of years the sire is
sexually active (for goats less than 10), giving an upper limit of
approximately 2000 records. With modern breeding methods,
there is no guarantee that this solution will work under all circumstances.
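
The decision described above can be sketched as a single helper that counts first and then chooses which rows to read. The threshold parameter, helper name and schema are assumptions; the text does not state the limit the application uses.

```python
import sqlite3

# Upper bound from the text: fewer than 200 offspring sired per year over
# fewer than 10 sexually active years.
OFFSPRING_UPPER_BOUND = 200 * 10


def descendants_to_display(db, animal, limit):
    """Return all descendants, or only the offspring if there are too many."""
    (count,) = db.execute(
        "SELECT count(*) FROM parent_child_tc WHERE from_node = ?",
        (animal,)).fetchone()
    sql = "SELECT to_node FROM parent_child_tc WHERE from_node = ?"
    if count > limit:
        # Too many descendants: keep only the direct offspring.
        sql += " AND min_depth = 1"
    return sorted(row[0] for row in db.execute(sql, (animal,)))


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parent_child_tc (from_node, to_node, min_depth, max_depth);
    INSERT INTO parent_child_tc VALUES
        (1,2,1,1), (1,3,1,1), (2,4,1,1), (1,4,2,2);
""")
```

With a limit of 10 the call returns all three descendants of animal 1; with a limit of 2 it falls back to the two direct offspring.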
If the number of descendants is within reasonable limits, all descendant
records are selected using the transitive closure (see section 5.5). As
with the code that displays the pedigree, a recursive routine located in the
file /class/pedigree.php links class objects together to describe the tree.
The class method printMenu() then uses the structure to output
JavaScript code to the browser. The JavaScript code is then used by the
browser to display the descendant tree.
The number of offspring and descendants of each animal node displayed
is determined using the transitive closure and the results are included in the
data displayed (see figure 6.2).

6.3 Conclusion
This chapter discusses using a browser to display an animal pedigree and the
animal’s descendants. The next chapter concludes the thesis and discusses
future research.

Chapter 7

Thesis conclusion and future research

7.1 Thesis conclusion
Previous work in this area has used strings that describe the genetic path
between animals (node codes) to improve the performance of pedigree searches
and the calculation of inbreeding coefficients. This application maintains the
minimum and maximum path length, an adjacency list and a transitive
closure. The structures used are maintained incrementally. If the above data
structures are maintained, it is possible to enforce sensible birth dates in
real time, calculate the inbreeding coefficient and present in real time the
pedigree and descendant trees.
Incrementally maintaining a transitive closure is viable if the number of
animals is less than ten thousand. If there are larger herds than this, then
herd transitive closures can be maintained, with the industry herd updates
occurring in the background.
It is possible to incrementally update the BLUP equations, but the num-
ber and size of simultaneous equations to be solved prevent us from gener-
ating results in real time. The solution is to allow the user to initiate the
calculation for their herd.
Pedigrees and descendant trees can be displayed by a browser using the
pear package HTML_TreeMenu.

7.2 Future research


Designing useful artifacts is complex due to the need for creative
advances in domain areas in which existing theory is often insufficient.
“As technical knowledge grows, IT is applied to new application
areas that were not previously believed to be amenable
to IT support” (Markus et al. 2002, p. 180). The resultant IT
artifacts extend the boundaries of human problem solving and organizational
capabilities by providing intellectual as well as computational
tools. Theories regarding their application and impact
will follow their development and use. (Hevner et al., 2004)

This thesis focuses on the creative advance in the domain area; the artifact was
created to test the ideas. As predicted by Hevner et al. (2004), the creation
of the artifact opens up new research possibilities.

7.2.1 Can node codes be maintained incrementally?


Previous work in this area has used node codes to speed up pedigree queries.
This work has shown that an adjacency list, minimum path length, maximum
path length and a transitive closure can be used to speed up similar
queries. We have also shown that the data structures used can be maintained
incrementally. A proof that the node codes can or cannot be maintained
incrementally would be interesting.

7.2.2 Did the application meet its underlying design goal?

Increasing the use of EBVs across the cashmere industry was the underlying
aim of the project. We aimed to make the effort required more rewarding by
providing instant feedback, and the process easier to use by offering different
methods to build the pedigree tree. It would be interesting to observe how
the use of EBVs within the industry changed as a result of this work. The
creation of the artifact makes this research possible.

7.2.3 Are production gains increased?


The cashmere industry has been running a national fleece competition for
decades, where prizes are awarded for the most valuable fleece. The competition
is dominated by two breeders, one of whom has a genetics background
and has been using EBVs for many years. The other is an old-fashioned
breeder using phenotypes and a feel for what is right. The latter herd has
won the most valuable fleece award for almost a decade.
The two herds (one is in Queensland and the other in Victoria) are difficult
to compare. It would be interesting to have answers to the following questions.

- Does the provision of EBVs based on best linear unbiased prediction
help experienced breeders?

- Will the provision of EBV data as described above alter the decisions
of experienced breeders?

- Do experienced breeders and the methodologies described in the thesis
pick the same animals as the best animal?

With EBVs now available to the entire industry, the herd belonging to
the breeder who has a ‘feel’ for the problem could be randomly divided, both
methods applied and the outcomes compared.
An argument could be mounted that taking the ‘feel’ out of the selection
is advantageous, as a solid outcome can be had by anyone willing to undertake
systematic recording and the application of the methodology. Whether this
argument is reasonable is another possible research question.

7.2.4 Will this work affect Lambplan?
Lambplan and Merinoplan are industry wide programs that use best linear
unbiased prediction to calculate EBVs for the lamb meat and merino industries.
Lambplan and Merinoplan are run by Sheep Genetics, a joint program
created by Meat and Livestock Australia (MLA) and Australian Wool Innovation
Limited (AWI). Both systems are based on a centralized database
and batch calculations. There is a flat rate of $330 and a charge of $1.65 per
animal with a maximum charge of $2750 per breeder (Fee Schedule, 2009).
This is down from the 2004 charge of $355 flat rate and $2.32 per animal
(McNair, 2004). In 2009 the per animal charge only applies to the current
drop, and historic data (the very foundation of EBVs based on BLUP)
is stored for free. To use the service, you export your pedigree and phenotype
records and receive in return a file containing EBVs and inbreeding
coefficients. Your data must be correctly formatted before you send it.
Because of the cashmere industry’s small size, developing a solution for it
was unattractive to the current operator. This forced the cashmere
industry to develop its own solution, which resulted in the creation of Merrrit,
the subject of this thesis. We aimed to develop a system that could
be self managed. The techniques developed could be used to self generate
within herd EBVs and inbreeding coefficients for any industry, and this could
reduce the cost. Industries would still need centralized control to generate
across herd EBVs, as it is unlikely any breeder will entrust his or her full
dataset to another.
The author has listened to lamb and merino breeders complain about
the cost of their current solutions and the accuracy of the EBVs provided.
An understanding of best linear unbiased prediction leads one to ask: do the
complaints about the EBVs provided result from poor industry data
structure (not enough cross herd links)? This is research that should be
undertaken by the Lambplan and Merinoplan providers.
The possibility of extending this solution to other industries results in
several interesting research questions:
- How general is the dissatisfaction with Lambplan and Merinoplan
EBVs and why has it occurred?
- If Lambplan and Merinoplan users were supplied with tools to display
their pedigree and to get inbreeding coefficients and within herd EBVs
without waiting for the industry calculation, would their satisfaction
with the service improve?
- Industry wide EBVs are only valuable if breeders believe the results.
For accurate across industry EBVs you need sound genetic links across
herds. Are Lambplan and Merinoplan results questionable because of
poor data structure?

- Because cross herd/flock links are difficult and costly to maintain, one
needs to ask the question: as a general rule, are breeders looking for
within herd EBVs or industry wide EBVs?

- Would greater productivity gains be achieved if costs were reduced by
making available online systems for the generation of in herd EBVs?

References

Bennett, R. L., Steinhaus, K. A., Uhrich, S. B., O’Sullivan, C. K., Resta, R. G., &
Lochner-Doyle, D. (1995). Recommendations for standardized human
pedigree nomenclature. Journal of Genetic Counseling, 4 (4).
Dong, G., & Su, J. (1995). Incremental and decremental evaluation of
transitive closure by first-order queries. Information and Computation,
120 (1), 101–106.
Elliott, B., Akgul, S. F., Mayes, S., & Ozsoyoglu, Z. M. (2007). Efficient
evaluation of inbreeding queries on pedigree data. In SSDBM ’07: Proceedings
of the 19th international conference on scientific and statistical
database management. Washington, DC, USA: IEEE Computer Society.
Elliott, B., Akgul, S. F., Ozsoyoglu, Z. M., & Manilich, E. (2006). A framework
for querying pedigree data. In SSDBM ’06: Proceedings of the 18th
international conference on scientific and statistical database management
(pp. 71–80). Washington, DC, USA: IEEE Computer Society.
Faraway, J. (2002). Practical regression and ANOVA using R.
http://cran.r-project.org/.
Fee schedule. (2009). Website. PO Box U254 UNE, Armidale, NSW, 2351:
Sheep Genetics. (http://www.sheepgenetics.org.au)
Goldberger, A. S. (1962). Best linear unbiased prediction in the generalized
linear regression model. Journal of the American Statistical Association,
57 (298), 369–375.
Guozhu, D., Leonid, L., Jianwen, S., & Limsoon, W. (1999). Maintaining
transitive closure of graphs in SQL. Int. J. Information Technology,
5.
Henderson, C. R. (1975). Best linear unbiased estimation and prediction
under a selection model. Biometrics, 31 (2), 423–447.
Henderson, C. R. (1976). A simple method for computing the inverse of a
numerator relationship matrix used in the prediction of breeding values.
Biometrics, 32 (1), 69–83.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004, March). Design
science in information systems research. MIS Quarterly, 28 (1), 75–105.
James, A. T. (2009). Estimating breeding values for better cashmere - using
the Australian cashmere. PO Box 4776 KINGSTON ACT 2604: Rural
Industries Research and Development Corporation.
Lynch, M., & Walsh, B. (1997). Genetics and analysis of quantitative traits.
23 Plumtree Road, Sunderland, MA 01375 U.S.A: Sinauer Associates,
Inc.
McNair, S. (2004). MLA responds to Lambplan critics. Website. Stock and Land.
(http://sl.farmonline.com.au/news/state/agribusiness-and-general/general/mla-responds-to-lambplan-critics/9187.aspx)
Mrode, R. (1996). Linear models for the prediction of animal breeding values.
Wallingford Oxon: CAB International.
Mrode, R. (2005). Linear models for the prediction of animal breeding values
(2 ed.). Wallingford Oxon: CAB International.
Pang, C., Dong, G., & Ramamohanarao, K. (2005). Incremental maintenance
of shortest distance and transitive closure in first-order logic and SQL.
ACM Transactions on Database Systems, 30 (3), 698–721.
Quaas, R. L. (1976). Computing the diagonal elements and inverse of a large
numerator relationship matrix. Biometrics, 32 (4), 949–953.
Radi, H., & Heyes, R. (2008). HTML_TreeMenu. Website.
(http://pear.php.net/package/html_treemenu.html)
Rozanov, Y. A. (1964). Probability theory: a concise course (R. A. Silverman,
Ed.). New York: Dover Publications Inc.
Schott, G. D. (2005). Sex symbols ancient and modern: their origins and
iconography on the pedigree. British Medical Journal, 331 (7531), 1509–
1510.
Strang, G. (2003). Introduction to linear algebra. Wellesley:
Wellesley-Cambridge Press.
Tores, F., & Barillot, E. (2001, February). The art of pedigree drawing:
algorithmic aspects. Bioinformatics (Oxford, England), 17 (2), 174–
179.
Yahoo user interface. (2009). Website. (http://developer.yahoo.com/yui/)

Appendices

Appendix A

Code Structure

The key to successful large project development is a solid foundation.
A large project is completed when the code base is abandoned, not before.
Design goals change, algorithms are discovered, users generate new ideas and
new team members schooled in the latest buzz words come and go. As the
code base grows, the investment increases and the option of starting the
project over fades. This chapter documents the foundations. Hopefully we
have a foundation flexible enough to support the project over the long term.

A.1 Input values and return values


All method and function inputs are passed by value. The problem inherent
in C type languages (that they can only return one value) is dealt with
by having functions and methods return an array if multiple data items are
required. PHP associative arrays make this methodology very attractive as
the index name identifies the data item.

A.2 Error codes


Functions and methods return data in an associative array entry called error_code
if there are problems. The application can use the error_code to control its
actions. The methods and functions also return an error_string if there are
problems; the error_string is used to tell the user about the problem. The
error string should be returned in the user’s language. Code that generates
error strings is stored in the language directory (/ENGLISH).
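
The return convention can be illustrated with a minimal PHP sketch. It is not taken from the application: the function name parse_tag_number, the error code value BAD_TAG and the message text are hypothetical. Only the index names data, error_code and error_string follow the text.

```php
<?php
// Hypothetical helper: parse a tag number and report problems through the
// error_code / error_string entries of the returned associative array.
function parse_tag_number($input) {
    $trimmed = trim($input);
    if (!preg_match('/^[0-9]+$/', $trimmed)) {
        return [
            'error_code'   => 'BAD_TAG',   // used by the application to decide what to do
            'error_string' => "'{$input}' is not a valid tag number.", // shown to the user
        ];
    }
    return ['data' => (int) $trimmed];     // success: no error entries are present
}

$result = parse_tag_number('4021');
if (isset($result['error_code'])) {
    echo $result['error_string'], "\n";
} else {
    echo 'Tag: ', $result['data'], "\n";
}
```

The caller tests for the presence of error_code rather than inspecting the data, so a function can return any value, including zero or an empty string, as valid data.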

A.3 The table


The application code is driven by tables; if a table is needed, the table
address is always passed into code as a call parameter. The code makes the
assumption that index values within a table row will be there and that the
values are for a particular use. The code base makes no assumptions on the
names of fields in the database or their encoding. To add new data fields to
the system, we add the fields to the database table and to the tables that
drive the application. Tables can be found in the files:
\ENGLISH\translation_array_animals.php,
\ENGLISH\translation_array_shearing.php,
\ENGLISH\translation_array_x.php and
\ENGLISH\translation_array_herd_standard.php.

The arrays are stored in a language directory as they contain the strings to
be used as column headings when the data is displayed.
This design breaks the development problem into two parts: code to
interpret the tables and the development of the tables. Hopefully the options
supported by the table interpreter are broad enough to accommodate most
of the changes that will invariably be required as the application ages.

A.3.1 base
Data can be entered into the application using the web interface or via files
containing comma separated fields, with the field type being set using head-
ings. The file input code supports quite a variety of data encodings, but only
one encoding is needed for the web interface. The one encoding needed is
the base encoding.

A.3.2 db string
The name of the field in the base database table (the unjoined table). Which
table is the base table depends on the field, and the location is given in the
table entry: location.

A.3.3 dbt string


There is always a data structure maintained that has the tables joined. This
is the field name in the joined table.

A.3.4 strings
As mentioned above, files can be used to enter data into the system. The
table entry points to a table of strings that can be used as heading strings in
upload files. The heading string selects a table entry to handle the column.
The code that selects the table entry does a string match on the table pointed
to by this table entry.

A.3.5 data convert


This points to an object that inputs and outputs the data from the database.
The methods of this object are discussed further in section A.4.

A.3.6 type
Data can be of two types: data used to select the record, and data stored in
the record. This field identifies which of the two types applies.

A.3.7 logic
The tables control the select logic. If this is set to AND, then the select field
is used as an AND condition. If set to OR, then the select field is used as an
OR condition. The different tag types (first tag, second tag etc.) are set as
OR conditions, while the herd is set as an AND condition. Different herds can
have the same tag because the animal has to belong to a particular herd. Within
a herd, different tag types can be used to identify the animals. If a tag type
is to be useful, each animal within the herd has to have a unique tag.
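
The effect of the logic entry can be sketched as follows. This is an illustration only, not the application's code: the function build_where, the column names herd, first_tag and second_tag, and the use of ? placeholders are assumptions.

```php
<?php
// Hypothetical table rows: each select field names its database column
// (db_string) and how it joins the WHERE clause (logic).
$select_fields = [
    ['db_string' => 'herd',       'logic' => 'AND'],
    ['db_string' => 'first_tag',  'logic' => 'OR'],
    ['db_string' => 'second_tag', 'logic' => 'OR'],
];

// Build a parameterised WHERE clause. OR fields are grouped in brackets and
// the group is ANDed with the AND fields.
function build_where(array $select_fields, array $values) {
    $and = [];
    $and_params = [];
    $or = [];
    $or_params = [];
    foreach ($select_fields as $field) {
        $name = $field['db_string'];
        if (!isset($values[$name])) {
            continue;                       // no value supplied for this field
        }
        if ($field['logic'] === 'AND') {
            $and[] = "{$name} = ?";
            $and_params[] = $values[$name];
        } else {
            $or[] = "{$name} = ?";
            $or_params[] = $values[$name];
        }
    }
    if ($or) {
        $and[] = '(' . implode(' OR ', $or) . ')';
    }
    // The OR group is appended last, so its parameters also come last.
    return [
        'sql'    => implode(' AND ', $and),
        'params' => array_merge($and_params, $or_params),
    ];
}

$where = build_where($select_fields, [
    'herd' => 'DYN', 'first_tag' => '4021', 'second_tag' => '4021',
]);
echo $where['sql'], "\n";
```

With these inputs an animal is matched when it belongs to herd DYN and carries tag 4021 as either its first or its second tag.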

A.3.8 heading string


The string to be used as the data’s name when displaying the data item.
Different languages will have different headings.

A.3.9 location
The location of the data is given by location and db_string.

A.4 Data entry and display


The tables mentioned above contain a pointer to a class object. The class
object methods are used to input and output data. Checking that a date is
within acceptable range is not an application problem, but is a data input
problem and the data input/output objects look after it. The data entry
classes are stored in the file \ENGLISH\data_conversion.php. The input
strings used to set database values and the display string for particular
database values are language dependent, and hence, the file is located in
the language directory.

Figure A.1: Application data input and output (figure labels: input_to_db,
test_db_change, get_data, output_from_db, Database)

A.4.1 input to db
This class method is used to convert user input into the database format.
As an example, the database stores the sex as ’m’ or ’f’, but an English user
inputs ’doe’, ’buck’ and many other options to represent the sex. The input to
the method is the user string. As the method has to be able to return an error
message or a result, the method returns an associative array with three possible
index values: data, error_code and error_string. If the input data is
successfully converted, then there will be no error_code or error_string
fields in the associative array. The code in this method should clean out
attempts at SQL injection and cross site scripting.
This method should only concern itself with the field in question as
test_db_change looks after how the data relates to the rest of the database.
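
A minimal PHP sketch of such a method for the sex field follows. It is not the application's class: the class name SexConvert, the error code BAD_SEX and every accepted input other than 'doe' and 'buck' are assumptions made for the example.

```php
<?php
// Hypothetical data conversion class for the sex field.
class SexConvert {
    // Accepted English input strings mapped to the stored database code.
    // Only 'doe' and 'buck' come from the text; the rest are assumed.
    private $to_db = [
        'doe'  => 'f', 'female' => 'f', 'f' => 'f',
        'buck' => 'm', 'male'   => 'm', 'm' => 'm',
    ];

    public function input_to_db($user_string) {
        $key = strtolower(trim($user_string));
        if (!isset($this->to_db[$key])) {
            return [
                'error_code'   => 'BAD_SEX',
                'error_string' => 'Sex must be entered as doe or buck.',
            ];
        }
        // Only values from the fixed list above ever reach the database.
        return ['data' => $this->to_db[$key]];
    }
}

$convert = new SexConvert();
print_r($convert->input_to_db(' Doe '));
```

Because the result is looked up in a fixed list, arbitrary user text is never passed on, which is one simple way of meeting the injection requirement for a field with few legal values. Free text fields would need escaping or parameterised queries instead.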

A.4.2 test db change


After the application has worked out which animal the input data is referring
to, read the relevant animal records and worked out which fields are being
altered, this method is called. The test_db_change method in a class looking
after birth dates would check that the birth date is valid; that is, make
sure that it is after the birth dates of the animal’s ancestors and before
those of its progeny. This method has
access to and returns the update array so it has the option of removing or
altering the update and the fields updated or returning an error message.
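
The birth date check can be sketched as below. This is a simplified illustration under stated assumptions, not the application's method: the signature, the update key birth_date, the error codes and the idea of passing in already fetched ancestor and progeny dates are all hypothetical. Dates are assumed to be ISO yyyy-mm-dd strings, which sort correctly when compared as strings.

```php
<?php
// Hypothetical birth date check. Returns the (possibly altered) update array
// under the key 'update', or an error_code / error_string pair.
function test_db_change(array $update, array $ancestor_dates, array $progeny_dates) {
    if (!isset($update['birth_date'])) {
        return ['update' => $update];       // nothing for this class to check
    }
    $birth = $update['birth_date'];
    foreach ($ancestor_dates as $date) {
        if (strcmp($birth, $date) <= 0) {   // born on or before an ancestor
            return [
                'error_code'   => 'BIRTH_BEFORE_ANCESTOR',
                'error_string' => "Birth date {$birth} must be after ancestor birth date {$date}.",
            ];
        }
    }
    foreach ($progeny_dates as $date) {
        if (strcmp($birth, $date) >= 0) {   // born on or after one of its progeny
            return [
                'error_code'   => 'BIRTH_AFTER_PROGENY',
                'error_string' => "Birth date {$birth} must be before progeny birth date {$date}.",
            ];
        }
    }
    return ['update' => $update];
}

$result = test_db_change(['birth_date' => '2005-09-01'], ['2002-10-12'], ['2007-09-20']);
echo isset($result['error_code']) ? $result['error_string'] : 'birth date accepted', "\n";
```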

A.4.3 get data


Some data doesn’t have a one to one relationship to the database. When
such data is entered, the test_db_change method works out which fields
should be changed and alters the update array accordingly. On a read this
method generates the composite field from the data read from the database.
An example is fleece variance. The test micron and standard deviation are
stored in the database and the variance is generated from those fields.

A.4.4 output from db


There is a one to one relationship between input to the method and output
from the method. Internal database codes are converted to display strings.
For example, the sex codes “m” and “f” are converted to “buck” and “doe”.

Appendix B

Ajax

Asynchronous JavaScript and XML (AJAX) is a current buzz word. It’s
actually all very simple and has a lot to do with HTML and very little to do
with XML. A JavaScript application is event driven and the code is available
for execution while a web page is being displayed. JavaScript can place a
request to a server that is independent of the page request the user sent to get
the page being displayed; this is the asynchronous part. Different browser
brands require different code to start the request and this complicates the
JavaScript code a little but nothing more. The replies to the asynchronous
request are events that are used to call some JavaScript code. The document
that is displayed to the user is structured because it was generated using
HTML, and the location of the various elements within the structure is
described by the document object model (the DOM). Using
the DOM, the programmer can work out which object to modify if he or
she wishes to change the data displayed by the object. The JavaScript that
receives the reply to the asynchronous request retrieves the response string
(which may be encoded as XML data, but then again may not), decodes it
and modifies the objects that describe the document. The browser then updates
the displayed document.
All that is required on the server side is a page that responds to the
asynchronous request. The server side request is just another GET and the
server doesn’t know or care that it is an asynchronous request. Having the
server output XML code is optional.

B.1 File upload progress


Merrrit allows users to upload files that are used to update the database.
This option is provided to upload data contained in historic databases and to
make it easier to enter large data sets. The files can contain hundreds of
records and the update can take several minutes. AJAX is used to provide the
user with a progress report. When the database is being updated in this
manner, one web server thread is performing the update, and another is
responding to the browser’s request for a progress report.

Figure B.1: Intertask communication (figure labels: Browser, task1, task2,
file, Database; file sent to task1; asynchronous request to task 2)

Intertask communication on servers that we don’t control can be a problem,
so the application uses a file for communication (see figure B.1). The thread
doing the update seeks the communication file to a known location (zero) and
writes the number of the input file line it is currently processing to the
communication file. The thread replying to the browser’s
asynchronous request reads the data from the communication file. A Unix
system will present the file reading thread with the last value written by the
file writing thread.
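
The writing side is not listed in this appendix. A minimal PHP sketch of what it could look like is given here; the function name record_count_write, the fixed width padding and the temporary file path are assumptions made for the example, not the application's code.

```php
<?php
// Hypothetical writer side: overwrite the progress file from offset zero
// with the number of the input line being processed.
function record_count_write($handle, $line_number) {
    rewind($handle);                         // seek to the known location (zero)
    // Pad to a fixed width so a shorter number fully overwrites a longer one.
    fwrite($handle, sprintf("%-10d\n", $line_number));
    fflush($handle);                         // make the value visible to the reader
}

// Stand-in for the per-session file kept in the application's temp directory.
$path = sys_get_temp_dir() . '/progress_' . uniqid();
$handle = fopen($path, 'w+');
foreach ([1, 2, 3] as $line_number) {
    // ... update the database from input line $line_number ...
    record_count_write($handle, $line_number);
}
fclose($handle);

// A reader opening the file now sees the last value written.
$last = trim(file_get_contents($path));
echo $last, "\n";
unlink($path);
```

Writing at a fixed offset keeps the file one line long, so the reader never has to scan for the latest entry.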
The server code is found in the file ajax_upload_animal.php, and the code
and the comments pretty much say all that needs to be said.
<?php
// Directory that contains this script.
$file_path = $_SERVER['SCRIPT_FILENAME'];
$app_directory = dirname($file_path);

// This was sent as a hidden field in the
// file screen_upload_add_animal.php etc.
// It is made from the session id, which
// is different for each active user.
// basename() stops the request naming a file outside the temp directory.
$ajax_data_file =
    (isset($_REQUEST['ajax_data_file']))
    ? basename($_REQUEST['ajax_data_file'])
    : '';

// The file read by this code is updated in
// screen_upload_add_animal.php etc. as records
// are added to the database.
// We are responding to an async read using the JavaScript
// found in the file animal_upload.js
function record_count_open($app_directory, $session_id) {
    $path = "{$app_directory}/temp/{$session_id}";
    // No id or no file yet: return false rather than raise a warning.
    if ($session_id === '' || !is_file($path)) return false;
    return fopen($path, "r");
}

// The server writes the line at offset zero so
// we only need to read the first line.
function record_count_read($handle) {
    if (!$handle) return '0';
    $output = fgets($handle);
    fclose($handle);
    // An empty file means nothing has been written yet.
    return ($output === false) ? '0' : $output;
}

$handle = record_count_open($app_directory, $ajax_data_file);
$records = record_count_read($handle);
echo $records;
?>

Note that we are outputting only one value and we do not decorate it
with XML encoding. The JavaScript code is downloaded to the browser
as JavaScript normally is; that is, with a <script></script> tag sent as
part of the header.
<script src="animal_upload.js" type="text/javascript"></script>

The JavaScript sent to the browser pretty much documents what needs to
be done at the browser end:
function ajaxFunction() {
    var xmlHttp;
    try {
        // Firefox, Opera 8.0+, Safari
        xmlHttp = new XMLHttpRequest();
    }
    catch (e) {
        // Internet Explorer
        try {
            xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e) {
            try {
                xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
            }
            catch (e) {
                // So they don't get a record load update,
                // don't upset the user
                // alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }

    // Called on every state change; state 4 means the reply is complete.
    xmlHttp.onreadystatechange = function () {
        if (xmlHttp.readyState == 4) {
            document.ajax_form.records_loaded.value =
                xmlHttp.responseText;
        }
    };

    // Name of the progress file, taken from the hidden form field.
    var dataFile = './ajax_upload_animal.php?ajax_data_file='
        + encodeURIComponent(document.ajax_form.ajax_data_file.value);
    xmlHttp.open("GET", dataFile, true);
    xmlHttp.send(null);
}

This code only contains the code to initiate the request and deal with the
response. The code to deal with the response is contained in the method:
xmlHttp.onreadystatechange.

The method writes to the DOM element:
document.ajax_form.records_loaded.value.

The name of the DOM element depends on the document structure. ajax_form
and records_loaded are names given in the HTML code sent by the server to
display the original page.
As mentioned above, JavaScript code is event driven. The event to initiate
the asynchronous request is created when the submit button is clicked. The
HTML code is in the file originally sent to display the page and has the
following form:

<input type="submit"
class="form"
size="60"
name="accept"
value="submit"
onclick="window.setInterval('ajaxFunction()', 2000)"
>

The interval event set by the code above will end when a new page is sent
to the browser by the server thread processing the submitted file (the user
file containing records to update the database). Care needs to be taken with
the server thread timeout value as files that add large numbers of records
to large transitive closures can take a long time to process. Using AJAX to
provide feedback stops the user panicking but not the server.

B.2 Yahoo user interface; selecting fields to display
“The YUI Library is a set of utilities and controls, written in JavaScript, for
building richly interactive web applications using techniques such as DOM
scripting, DHTML and AJAX” (Yahoo User Interface, 2009).
The YUI consists of several JavaScript libraries that extend the functionality
of JavaScript; to this is added application specific code written by
the application developer. The code written by the application developer can
ask the browser to load the JavaScript libraries from the Yahoo server, or the
developer can move the required files to a computer under his or her control and
send code to the browser that asks for the libraries to be loaded from there.
Merrrit is an application used by many different breeders and has support
for many different data fields. For example, there are four options for
identifying an animal: primary tag, secondary tag, name and e tag. A user will
only want the fields he or she uses displayed in the data editing screens.
There needs to be a screen the user can use to select the fields of interest.
As the Yahoo User Interface offers drag and drop support, drag and drop is
an option.
The JavaScript code that looks after drag and drop manipulates the DOM tree
but doesn’t send an asynchronous request to the server.

Figure B.2: Field select drag and drop

The application specific code is found in the file /class/display_order.php.
The list of available fields is created using the table described in section A.3.
The options selected for display by the user are sent to the server when the
user clicks the submit button. The options selected are encoded in a semicolon
separated string which is saved in a herd specific database table.
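
Encoding and decoding such a string takes only a few lines. The sketch below is illustrative rather than the application's code: the function names and the field names are hypothetical, and only the semicolon separator comes from the text.

```php
<?php
// Join the selected field names, in display order, into one string.
function encode_display_order(array $fields) {
    return implode(';', $fields);
}

// Split a stored string back into an ordered list of field names.
function decode_display_order($stored) {
    $fields = array_filter(explode(';', $stored), function ($field) {
        return $field !== '';      // ignore empty entries, e.g. from a trailing ';'
    });
    return array_values($fields);  // renumber the list from zero
}

$stored = encode_display_order(['primary_tag', 'name', 'sex']);
echo $stored, "\n";
print_r(decode_display_order($stored));
```

The position of a name in the string doubles as its display order, so no separate ordering column is needed. The scheme relies on field names never containing a semicolon.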

Appendix C

The Application

C.1 Overview
The application screen is divided into four areas (see figure C.2):

- A top left icon area.

- A browser area down the left side, which is used to navigate the appli-
cation.

- A link bar across the top, which has two important items, login and
log-off. These links are used by herd owners to log onto and off the
application.

- The active display area located at the bottom right.

The application has a public face and a herd owners’ face (see figure C.1).
The public face presents screens that can be used to view the data and
the herd owners’ face presents screens to view, edit, and upload data as
well as screens to setup the fields to be displayed. When a user enters the
application, the browser area displays the public tree, which contains a herd
branch. There is a list of herds under the herd branch. Clicking on the herd
code displays herd details, and clicking on the arrow to the left of the herd
code opens another branch. Within that branch, the user finds screens to
display animal data. When entering the application, the active screen area
displays an application summary.

C.2 Login and logoff


In the previous section it was mentioned that users get the option of editing
data and setting up the herd displays if they are logged on.
To log on, you click the login link found in the link bar that runs across
the top of the application screen. When the link is clicked, the browser is
replaced with the login screen. The herd and password have to be entered
for a successful login. The [give up] button abandons the login attempt.

C.3 The herd screens


If a user clicks on a herd code, herd details are displayed in the active display
area. Figure C.3 shows a typical screen. The herd screen is divided into four:

- Text that provides an overview of the herd. The text can be altered by
the herd owner using the herd edit screen.

- Data from the herd record. The herd owner can alter which fields are
displayed using the herd setup screen and the contents of the fields
using the herd edit screen.

- A picture, which the herd owner can upload using the herd edit screen.

- System generated data. For example, the number of animal births
registered against the herd.

If the herd owner provides a link to his/her web site, the link is presented in
the title bar.

Figure C.1: Public and logged on browsers. The public may look at the data.
Once logged on, the herd owners may edit data, control the fields displayed,
upload files to alter data in batch lots, and download data into files on
their local machines.

Figure C.2: Entry screen with screen areas highlighted.

Figure C.3: Public screen: herd data

C.3.1 Herd edit screen


If the herd owner is logged on, there is an edit branch in the browser tree,
and within that branch is a link to the herd edit screen. A representative
screen is shown in figure C.4.
The herd edit screen is divided into three parts, and each part can be
updated independently.

- Herd details are entered using a text box. HTML tags can be used
to add headings and highlight text etc. [upload herd details] must be
clicked to transfer data to the server.

- To change the picture displayed, a file must be uploaded, and there is a
[Browse] button to select the file. After the correct file has been found,
the [upload photo] button must be clicked to upload the new file.

- Fields to edit the herd record. The fields in this set are set up using
the herd setup screen. More fields can be displayed in the herd edit
screen than in the herd screen.
For instance, if the herd password is one of the displayed fields, it is
displayed in the herd edit screen but not in the herd screen. In the herd
setup screen, the fields that are displayed in both the herd edit screen
and the herd screen are displayed in a different colour.

Figure C.4: Herd edit screen.

C.3.2 Herd setup screen


If the herd owner is logged on, there is a setup branch in the browser tree,
and within that branch is a link to the herd setup screen. An example herd
setup screen is shown in figure C.5.
The application has a wide range of data fields to suit the needs of many
different users. Users don’t want their screens cluttered up with unused data.
The herd setup screen is used by the user to select the fields of interest and
the width of the selected fields.
The herd setup screen is divided into two areas: one is used to select the
fields to display and the order in which they are displayed, and the other is
used to set the field width.

Displayed fields and order


You set the displayed fields and order using a drag and drop application
based on the Yahoo User Interface (see section B.2). There are two columns:
the first contains available fields that have not been selected for display, and
the second column displays the fields selected. The field order in the display
column will set the field display order in the edit herd screen, the herd screen
and in the field length section of this screen.
Brown fields will be displayed in both the edit herd screen and the herd
screen. Purple fields will be displayed in the edit herd screen only.
To move a field from the available column to the display column, move
the mouse over the field, press and hold the left mouse button, drag
the field over to the display column and release the button (drop
it). Dragging and dropping the fields in the display column alters the order
of display. The [Update display] button must be clicked to send the changes
to the server.

Figure C.5: Herd setup screen.

Figure C.6: Animal screen, display only

Field width
The server displays the current list of selected fields in the field width section.
Different field widths can be set and the [update width] button is used to
upload the changes to the server. It is best to select your fields, set the order
and click [update display] and then set the field width in the new screen.

C.4 Animal screens


The animal screen comes in a public and private version. The public version
is used to display a summary of several animals and the private version can
be used to edit the displayed data. When the user is logged onto a herd, the
animal screen is found in the browser’s edit branch, and the link brings up
the private version.
The first animals displayed and the order of display are determined by
the tests and values set above the columns. An example screen is shown in
figure C.6.
The fields displayed in the public version of the animal screen are set
using the animal public setup screen. The link to this screen is found in
the setup branch, and the icon used to represent this screen contains a large
yellow mallet. Clicking on the link next to the icon displays a screen that
can be used to beat the public animal screen into some sort of order.
The fields displayed to a logged on user are set in the animal setup screen,
a link to which can also be found in the setup branch, with an icon that
contains a small hammer. Delicate use of this tool results in a private animal
screen set up to meet your editing needs.
As mentioned above, the fields above the columns are used to request the
display of records from a set starting point and in a set order. You have a
choice of two tests: equal to, or greater than or equal to. If you want to
display all DYN does with a primary tag above 4000, you would set the column
headings up as shown in figure C.6.
There are three buttons and two fields below the displayed data. The
[display] button takes you back to the start of the list of records set by the
column headings, the [next] button takes you to the next group and the
[previous] button takes you back to the previous group. The record field is
used to set the number of records displayed on the screen and the offset field
is used to set the offset within the record set. The offset field changes when
the [next] and [previous] buttons are used.
Next to each record is a view button, which takes you to the screen that
will display the full animal record (see section C.5).
The private and public versions of the animal screen are almost the same.
If you have the right to edit a record, the private version allows you to enter
new data using the fields that display the current data, and a delete button
appears next to the record. In the private version, the view button takes you
to a screen that can be used to edit all animal details, including the animal
phenotype records.

C.4.1 Public animal setup screen


If the user is logged on, the public animal setup screen has a link in the
browser’s setup branch.
To meet the needs of a varied user base, the animal records contain many
fields. Take as an example the data used to identify animals. Some breeders
give their animals a name, a primary tag, a secondary tag and an electronic
tag. The same breeder may only want to make the primary tag public.
Another breeder may use only the primary tag, and will not want the
private or public animal screen cluttered with fields that are not used. To
further complicate matters, different users may want the same data displayed
in different ways.
The public animal setup screen is divided into two areas: one is used to
select the fields to display and the order in which they are displayed, and the
other is used to set the field width.

Displayed fields and order


You set the displayed fields and order using a drag and drop application
based on the Yahoo User Interface (see section B.2). There are two columns,
the first of which contains available fields that have not been selected for
display, and the second displays the fields selected. The field order in the
display column will set the field display order in the public animal screen.
To move a field from the available column to the display column, move
the mouse over the field, press and hold the left mouse button, drag the
field over to the display column and release the button (drop it). Dragging
and dropping the fields in the display column alters the order of display.
The [Update display] button must be clicked to send the changes to the
server. The [Update display] button is located under the available
data. As there are many options, you will probably have to use the scroll bar
on the right hand side to find it.

Field width
The server displays the current list of selected fields in the field width section.
As the display field and order section is quite long, you may need to use the
scroll bar on the right hand side to see this section. Different field widths
can be set for each selected field and the [update width] button can be used
to upload the changes to the server. It is best to select your fields, set the
order and click [update display], and then set the field width in the new screen.
To display many items on one line, it pays to keep the field widths as
small as possible.

C.4.2 Animal setup screen


This screen is used to set up the user’s animal screen (the private animal
screen). A separate setup screen gives the user the option of using the animal
screen to edit and view private data. This screen functions in the same way
as the screen described in the previous section.

Figure C.7: Animal detail screen

C.5 Detail screen


The detail screen is used to view an animal’s data, which includes ancestor
and descendant information, phenotype records, animal data and calculated
values. The screen comes in two forms: the public version displays data, and
the private version allows you to edit the displayed data.

Select fields
The detail screen (see figure C.7 for a partial display) is long and divided
into several areas. Along the top are fields used to select a record. The fields
displayed to select a record are the select fields picked for display in the detail
setup screen (discussed later). First tag, second tag, name, e tag and herd
are the select fields.
The herd must be set and the system insists that all tag fields are unique
within the herd. Therefore, any tag field that has been set can be used to
search for an animal.

Animal record
The animal record section displays animal record data. If you are logged on,
the data can be edited where displayed and updated by clicking the update
button located under this section.

Photograph
A photo of the animal can be added. If the user is not logged on, the photo
is displayed to the right of the animal record data. If the user is logged on,
the screen contains a field to select a file containing an animal image and a
button to initiate a file upload.

Pedigree
The pedigree is displayed next. Chapter 6 discusses the displaying of
pedigrees in some detail. The animal identifications displayed in the pedigree
are links to the detail screens for the animals identified. The tree structure
displays how the animals are related.

Descendants
The descendants are displayed next. Chapter 6 discusses the displaying of
descendants in some detail. The descendant tree is available for browsing
pleasure, but if the number of descendants is too large, only the offspring
links are displayed.

Phenotype records
The phenotype records follow the descendants. At the time of writing, only
shearing records are implemented.

C.5.1 Detail setup screen


If the user is logged on, the detail setup screen has a link in the browser’s
setup branch. The screen is set up in a manner similar to the other setup
screens. There are sections for the animal record and supported phenotype
records. As with the animal screen, there are different setup screens for the
public and private versions of the detail screen.

C.6 Upload screens


The browser displays an upload branch when the herd owner is logged on.
These screens are used to upload data files. There are screens to add records,
update records and to delete records.

C.6.1 Upload file structure


The files can be comma separated files exported from the user’s current herd
recording system, hand crafted files created by the user, or comma separated
files downloaded from this system. An example file is shown in figure C.8.
//File to build R.A.Mrode Example
first tag;sire first tag;dam first tag;birth_DD/MM/YY;sex
E3;E1;E2;01/9/99;doe
E4;E1;UNSET;1/12/00;buck
E5;E4;E3;1/12/01;buck
E6;E5;E2;1/12/02;buck

Figure C.8: Example animal add file

Files may contain comments, which start with a // and go to the end of
the line. Comments are ignored by the decoder. Comments in hand created
files are useful, as they can be used to comment out records that you are
unsure of, and add details that are not used by the system. Figure C.8 has
a comment telling us that the file is used to create the example used in R.
A. Mrode’s book.
The first active line must contain a list of column names separated by the
same separator as that used in the rest of the file. The system selects the
separator, checking that the separator creates more than one field and that
the number of fields found in the first and second active line are the same.
Valid separators are ’^’, ’,’, ’:’, ’;’ and tab. The upload screen displays
a list of valid heading strings. For a discussion of how they are stored in the
application, see section A.3.4.
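The separator rule described above can be sketched in a few lines of Python. This is only an illustration of the two stated checks (more than one field, and the same field count on the first and second active lines); it is not the code used by the system. The function names and the order in which candidate separators are tried are assumptions.

```python
# Candidate separators from the text; the order tried is an assumption.
CANDIDATE_SEPARATORS = ["^", ",", ":", ";", "\t"]


def active_lines(text):
    """Return the lines that still hold data once // comments are removed."""
    lines = []
    for raw in text.splitlines():
        content = raw.split("//", 1)[0].strip(" ")  # keep tabs, drop spaces
        if content.strip():
            lines.append(content)
    return lines


def detect_separator(text):
    """Pick the first candidate that splits the heading line into more than
    one field and gives the same field count on the second active line."""
    lines = active_lines(text)
    if len(lines) < 2:
        raise ValueError("need a heading line and at least one record")
    header, first_record = lines[0], lines[1]
    for sep in CANDIDATE_SEPARATORS:
        n_fields = len(header.split(sep))
        if n_fields > 1 and n_fields == len(first_record.split(sep)):
            return sep
    raise ValueError("no valid separator found")


# The first lines of the file in figure C.8.
example = """//File to build R.A.Mrode Example
first tag;sire first tag;dam first tag;birth_DD/MM/YY;sex
E3;E1;E2;01/9/99;doe
E4;E1;UNSET;1/12/00;buck
"""
separator = detect_separator(example)
header_fields = active_lines(example)[0].split(separator)
```

For the figure C.8 file this selects the semicolon, because it is the only candidate that splits the heading line at all.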
Hand created files may use " or ’ as a ditto mark that copies the field value
set in the previous line into the current line. This speeds up data entry. For
example, twins have the same sire and dam, so the tag details from the first
kid record can be reused for the second.

Figure C.9: Three steps to upload a file
As mentioned above, the number of fields in the first and second active
line must be the same, but the last field can be dropped in following lines.
It pays to make the last field a comment, as adding or leaving the comment
off is then optional.
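A minimal sketch of how ditto marks and a dropped last field might be decoded is given below. It assumes the two ditto characters are the plain double and single quote, that a dropped last field is stored as an empty string, and that the rows have already been split into fields. The tags K1 and K2 and the function name are hypothetical.

```python
# Assumed ditto characters: plain double quote and plain single quote.
DITTO_MARKS = {'"', "'"}


def decode_records(header, rows):
    """Expand ditto marks and pad a dropped last field.

    header is the list of column names; rows is a list of field lists.
    """
    records = []
    previous = None
    for fields in rows:
        if len(fields) == len(header) - 1:
            fields = fields + [""]  # the optional last field was left off
        if len(fields) != len(header):
            raise ValueError(f"wrong number of fields: {fields}")
        expanded = []
        for i, value in enumerate(fields):
            if value.strip() in DITTO_MARKS:
                if previous is None:
                    raise ValueError("ditto mark on the first record")
                value = previous[i]  # copy the value from the line above
            expanded.append(value)
        records.append(dict(zip(header, expanded)))
        previous = expanded
    return records


header = ["first tag", "sire first tag", "dam first tag", "comment"]
rows = [
    ["K1", "E1", "E2", "first twin"],
    ["K2", '"', '"'],  # same sire and dam as K1, comment left off
]
kids = decode_records(header, rows)
```

Making the comment the last column, as the text recommends, is what lets the second row stop after the dam field.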

C.6.2 Upload sequence


The upload operation proceeds in three stages (see figure C.9), after which
accepted data is loaded:

- A screen to select and upload the file is presented.

- The system decodes the file, and checks that the data types within the
fields are correct. If there are errors, the table is presented back to the
user, with error fields highlighted and a [reject] button.

- If the data types are correct, the system checks that the proposed data
will not violate database integrity tests. For example: it ensures that
birth dates are in order, and if we are adding records, that there is no
record for the animal present already. If the data doesn’t pass these
tests, it is presented back to the user with the faulty records highlighted
and a [reject] button. If all tests are passed, the data is presented back
with an [accept] and [reject] button.

- If the data is good and the user clicks the [accept] button, the data is
loaded into the database.
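The checking stages above can be sketched as two passes over the decoded records. This is an illustration only: the record keys, the set of valid sex values, and the reading of "birth dates are in order" as "a parent must be born before its offspring" are assumptions, and the real system works against the database rather than a dictionary.

```python
from datetime import datetime

VALID_SEX = {"doe", "buck"}  # assumed from the values in figure C.8


def parse_birth(text):
    """Parse a DD/MM/YY date as used in the example add file."""
    return datetime.strptime(text, "%d/%m/%y").date()


def check_types(records):
    """Data type stage: return (row index, field) pairs that fail to decode."""
    errors = []
    for i, rec in enumerate(records):
        try:
            parse_birth(rec["birth"])
        except ValueError:
            errors.append((i, "birth"))
        if rec["sex"].lower() not in VALID_SEX:
            errors.append((i, "sex"))
    return errors


def check_integrity(records, births_in_db):
    """Integrity stage: return indexes of records that already exist or
    that are born no later than a parent with a known birth date.
    births_in_db maps existing tags to a birth date, or None if unknown."""
    known = dict(births_in_db)
    faulty = []
    for i, rec in enumerate(records):
        born = parse_birth(rec["birth"])
        parents = [rec["sire"], rec["dam"]]
        too_young = any(
            known.get(p) is not None and known[p] >= born for p in parents
        )
        if rec["tag"] in known or too_young:
            faulty.append(i)
        else:
            known[rec["tag"]] = born
    return faulty


def review_upload(records, births_in_db):
    """Return the buttons offered to the user and the highlighted errors."""
    type_errors = check_types(records)
    if type_errors:
        return ["reject"], type_errors
    faulty = check_integrity(records, births_in_db)
    if faulty:
        return ["reject"], faulty
    return ["accept", "reject"], []


db = {"E1": None, "E2": None}
good = [
    {"tag": "E3", "sire": "E1", "dam": "E2", "birth": "01/9/99", "sex": "doe"},
    {"tag": "E4", "sire": "E1", "dam": "UNSET", "birth": "1/12/00", "sex": "buck"},
    {"tag": "E5", "sire": "E4", "dam": "E3", "birth": "1/12/01", "sex": "buck"},
]
buttons, problems = review_upload(good, db)
```

Only when both passes return no errors is the [accept] button offered, which matches the sequence described above.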

C.7 Download files


If the user is logged on, the browser displays a download branch containing
download screens. The download screens are used to initiate the downloading
of a semicolon separated data file. The first line in the file describes the
columns, with the fields separated by a semicolon, and the following lines
contain data, one line per record. Users download data files if they wish to
use local applications to analyze the data.
The download screen has a column of available fields and a download
column that contains a list of fields that have been selected for inclusion in
the download file (see figure C.10). The order of the fields in the download
column determines the order of the fields in the download file. To drop a
field into the download column, drag it from the available column.

Figure C.10: File download screen

If the columns selected for download change, the [update fields] button
needs to be clicked. When the updated list is returned from the server, the
download can be started by clicking the [download] button.
Breeders with the privilege to do so can download the records from all
herds. This privilege is needed if the user is calculating industry wide esti-
mated breeding values.
first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;
E1;BUCK;;TST;;;
E2;DOE;;TST;;;
E3;DOE;1999-09-01;TST;E1;E2;
E4;BUCK;2000-12-01;TST;E1;UNSET;
E5;BUCK;2001-12-01;TST;E4;E3;
E6;BUCK;2002-12-01;TST;E5;E2;
UNSET;;DOE;;TST;;;

Figure C.11: Example animal download file
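A user who wants to analyze a downloaded file locally might read it along the following lines. This is a sketch using Python's standard csv module on a few rows taken from figure C.11; it assumes every line ends with a trailing semicolon, as in the figure, and that dates are in ISO form.

```python
import csv
import io
from datetime import date

# A few rows from figure C.11; a real file would be opened from disk.
download = """first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;
E1;BUCK;;TST;;;
E3;DOE;1999-09-01;TST;E1;E2;
E5;BUCK;2001-12-01;TST;E4;E3;
"""

reader = csv.reader(io.StringIO(download), delimiter=";")
# The trailing semicolon produces an empty last column, which is dropped.
header = [name for name in next(reader) if name]
animals = []
for row in reader:
    record = dict(zip(header, row))  # zip ignores the empty trailing field
    born = record["birth_date"]
    record["birth_date"] = date.fromisoformat(born) if born else None
    animals.append(record)
```

Empty fields, such as the unknown birth date of E1, are kept as empty strings or None so that missing data stays distinguishable from real values.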

Appendix D

Linear Regression

Chapter 2 mentioned linear regression and then moved on. This appendix
discusses the topic further.
The covariance over the variance gives the regression coefficient which is
the gradient of a line fitted to the data using least squares. Linear regression
can be explained using linear algebra, statistics or calculus. All three methods
provide some insight.

D.1 Calculus
A line has the formula:

$$ y = bx + a \tag{D.1} $$

To fit the line you alter a and b until the predicted value is as close to the
actual value as possible. The difference between the actual and the predicted
value is the residual. To find the minimum, you need a formula that tells
you how the sum of the squared residuals changes over the data set as a and b
vary. For a point, with f(x_i) = bx_i + a, the residual is:

$$ r_i = y_i - f(x_i) \tag{D.2} $$

You can square both sides:

$$ r_i^2 = (y_i - f(x_i))^2 \tag{D.3} $$

This gives a quadratic surface with the sum of the squared residuals on one
axis and the values of a and b on the other two. A quadratic surface has a
minimum or a maximum where the derivative is zero. We want to find the minimum
as we alter a and b. We take the partial derivatives relative to a and b and
solve the resulting simultaneous equations.

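$$
\begin{aligned}
\frac{\partial (r^2)}{\partial a} &= 2r \frac{\partial r}{\partial a} \\
r &= y_i - b x_i - a \\
\frac{\partial r}{\partial a} &= -1 \\
0 &= 2r \frac{\partial r}{\partial a} \\
0 &= -2(y_i - b x_i - a)
\end{aligned}
\tag{D.4}
$$

$$
\begin{aligned}
\frac{\partial (r^2)}{\partial b} &= 2r \frac{\partial r}{\partial b} \\
r &= y_i - b x_i - a \\
\frac{\partial r}{\partial b} &= -x_i \\
0 &= 2r \frac{\partial r}{\partial b} \\
0 &= -2(y_i - b x_i - a)(x_i)
\end{aligned}
\tag{D.5}
$$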

Formulas D.4 and D.5 give the zero-derivative conditions for a single point.
If we sum over the whole set and put the result in matrix form, we get:
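$$
\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}
\begin{bmatrix} b \\ a \end{bmatrix}
=
\begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}
\tag{D.6}
$$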

Solving for a:

$$ a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \tag{D.7} $$

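Dividing the numerator and denominator by n gives you:

$$ a = \frac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - \frac{n^2}{n}\left(\frac{\left(\sum x_i\right)^2}{n^2}\right)} = \frac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - n\bar{x}^2} \tag{D.8} $$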

Solving for b:

$$ b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \tag{D.9} $$

Dividing the numerator and denominator by n gives you:

$$ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \tag{D.10} $$

It is important to note:

1. That normal distributions have nothing to do with this, as we are
simply fitting a function using least squares.

2. That least squares works out well because the sum of squared residuals is
a quadratic surface, so it has a single minimum where the first
derivatives are zero.

3. That b is the gradient of the line and a is where the line cuts through the
y axis when x is zero.
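As a check on the algebra, the 2 by 2 system D.6 can be solved directly from the five sums it contains. The sketch below is an illustration in Python; the function name and the small data set are made up for the example, and the data lies exactly on the line y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = b*x + a by solving the 2x2 system D.6."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # zero when all x values are equal
    if det == 0:
        raise ValueError("x values must not all be the same")
    a = (sy * sxx - sx * sxy) / det  # intercept
    b = (n * sxy - sx * sy) / det    # gradient
    return a, b


a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Because the points lie on a line, the fit recovers the intercept 1 and the gradient 2 with all residuals zero.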

D.2 Statistics
Variance was introduced into the lexicon by Fisher in his 1918 paper (Lynch
& Walsh, 1997) and is defined as:
$$ \sigma_{xx} = \frac{\sum (x_i^2) - n\bar{x}^2}{n - 1} \tag{D.11} $$
Covariance is defined as:
$$ \sigma_{xy} = \frac{\sum (x_i y_i) - n\bar{x}\bar{y}}{n - 1} \tag{D.12} $$
The regression coefficient is defined as:

$$ b = \sigma_{xy} / \sigma_{xx} \tag{D.13} $$

Using D.11, D.12 and D.13:


$$ b = \frac{\left(\sum (y_i x_i) - n\bar{x}\bar{y}\right) / (n - 1)}{\left(\sum x_i^2 - n\bar{x}^2\right) / (n - 1)} \tag{D.14} $$

After simplification:

$$ b = \frac{\sum y_i x_i - n\bar{x}\bar{y}}{\sum (x_i^2) - n\bar{x}^2} \tag{D.15} $$

Which gets us back to D.10.


Two important points to note are:

1. The regression coefficient is nothing more than the gradient of the line
fitted using least squares.

2. Statisticians tell us to use n − 1 to calculate the sample standard
deviation and n for the population standard deviation. All that matters
in this application is consistent use, as the value used cancels out (see
footnote 1).

1. There is good reason to use n − 1 when calculating the standard deviation: we are
working with differences, and the number of differences is one less than the number of
points. We are still dealing with differences when dealing with the entire population, so
mere mortals should still use n − 1. Statisticians, being the high priests of science, are
communicating with god to obtain prior knowledge, so they can use n.
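The point that the divisor cancels can be checked numerically. The sketch below computes the regression coefficient as covariance over variance, once with n − 1 and once with n. The function names and the data are made up for the illustration.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys, divisor):
    """(sum(x*y) - n*mean(x)*mean(y)) / divisor, the form used in D.12.
    Calling it with ys = xs gives the variance of D.11."""
    n = len(xs)
    cross = sum(x * y for x, y in zip(xs, ys))
    return (cross - n * mean(xs) * mean(ys)) / divisor


def regression_coefficient(xs, ys, use_sample_divisor):
    """Covariance over variance (D.13) with either n - 1 or n as divisor."""
    n = len(xs)
    divisor = n - 1 if use_sample_divisor else n
    return covariance(xs, ys, divisor) / covariance(xs, xs, divisor)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_sample = regression_coefficient(xs, ys, True)
b_population = regression_coefficient(xs, ys, False)
```

Both calls give the same gradient, because the divisor appears in the numerator and the denominator of D.14.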

D.3 Normal equation
If you think in terms of subspaces, the derivation of the normal equations is
straightforward. The matrix equation we wish to solve is:

$$ y = X\hat{\beta} + e $$

y is a column vector in a hyperspace; the number of rows gives the dimension.
$X\hat{\beta}$ is a column vector that lies in a subspace of the hyperspace. The
dimension of the subspace is determined by the number of column vectors in X
(figure D.1 shows a plane defined by two column vectors in X; you only need
two column vectors because a subspace has to go through zero). e is the column
vector you have to add onto $X\hat{\beta}$ to get back to y. If e is orthogonal
to the subspace, and so to every column of X, then e is as short as possible
and each of those dot products is zero, or using matrix notation:

$$ X^T e = 0 $$

Figure D.1: Projecting a 3D point onto a 2D plane (labels: x1, x2, y, Xb, e)

Looking at figure D.1 we can see:

$$ e = y - X\hat{\beta} $$

So:

$$ X^T (y - X\hat{\beta}) = 0 $$

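Rearrange and you have the normal equation:

$$
\begin{aligned}
X^T y - X^T X\hat{\beta} &= 0 \\
X^T y &= X^T X\hat{\beta} \\
(X^T X)^{-1} X^T y &= (X^T X)^{-1} X^T X\hat{\beta} \\
(X^T X)^{-1} X^T y &= I\hat{\beta} \\
\hat{\beta} &= (X^T X)^{-1} X^T y
\end{aligned}
$$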
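For a straight line the normal equation can be worked through by hand, or with a few lines of plain Python as sketched below. Here X is assumed to have two columns, the x values and a column of ones, so that X^T X is the 2 by 2 matrix of D.6. The function name and the data are made up for the illustration.

```python
def normal_equation_line(xs, ys):
    """Solve (X^T X) beta = X^T y for X with columns [x_i, 1].

    Returns ((b, a), X^T X, residuals)."""
    X = [[x, 1.0] for x in xs]
    # X^T X is 2x2 and X^T y has two entries.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(2)]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    if det == 0:
        raise ValueError("X^T X is singular")
    # Apply the inverse of the 2x2 matrix to X^T y.
    b = (xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det
    a = (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det
    residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
    return (b, a), xtx, residuals


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta, xtx, e = normal_equation_line(xs, ys)
# X^T e: the residual vector dotted with each column of X.
xte = [sum(x * r for x, r in zip(xs, e)), sum(e)]
```

Both entries of xte come out as zero to within rounding, which is the orthogonality condition the derivation starts from, and xtx holds the same sums as the matrix in D.6.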
