Data Structures CAATTs Extraction

Data Structures and CAATTs for Data Extraction
Data Structures (TRICIA)
Two fundamental components:
Organization: the way records are physically arranged on the secondary storage device
Access method: technique used to locate records and to navigate through the database
or file
File Processing Operations
1. Retrieve a record by key
2. Insert a record
3. Update a record
4. Read a file
5. Find next record
6. Scan a file
7. Delete a record
Data Structures
Flat file structures
Sequential structure [Figure 8-1]

All records in contiguous storage spaces in specified sequence (key field)
Sequential files are simple & easy to process
Application reads from beginning in sequence
If only small portion of file being processed, inefficient method
Does not permit accessing a record directly
Efficient: 4, 5 sometimes 3
Inefficient: 1, 2, 6, 7 usually 3
Indexed structure
In addition to data file, separate index file
Contains physical address in data file of each indexed record
Indexed random file [Figure 8-2]
Records are created without regard to physical proximity to other related

records
Physical organization of index file itself may be sequential or random
Random indexes are easier to maintain, sequential more difficult
Advantage over sequential: rapid searches
Other advantages: processing individual records, efficient usage of disk storage
Efficient: 1, 2, 3, 7
Inefficient: 4
Virtual Storage Access Method (VSAM) [Figure 8-3]
Large files, routine batch processing
Moderate degree of individual record processing
Used for files across cylinders
Uses number of indexes, with summarized content
Access time for single record is slower than Indexed Sequential or Indexed
Random
Disadvantage: does not perform record insertions efficiently requires physical

relocation of all records beyond that point SOS
Has 3 physical components: indexes, prime data storage area, overflow area
[Figure 8-4]
Might have to search index, prime data area, and overflow area slowing down
access time
Integrating overflow records into prime data area, then reconstructing indexes
reorganizes ISAM files
Very Efficient: 4, 5, 6
Moderately Efficient: 1, 3
Inefficient: 2, 7
Hashing Structure
o Employs algorithm to convert primary key into physical record storage address [Figure
8-5]
No separate index necessary
Advantage: access speed
Disadvantage
Inefficient use of storage
Different keys may create same address
Inefficient: 4, 5, 7
Pointer Structure
Stores the address (pointer) of related record in a field with each data record
[Figure 8-6]
Records stored randomly
Pointers provide connections b/w records
Pointers may also provide links of records b/w files [Figure 8-7]
Types of pointers [Figure 8-8]:

Physical address actual disk storage location
Advantage: Access speed
Disadvantage: if related record moves, pointer must be

changed & w/o logical reference, a pointer could be lost
causing referenced record to be lost
Relative address relative position in the file (135th)
Must be manipulated to convert to physical address
Logical address primary key of related record
Key value is converted by hashing to physical address
Inefficient: 4, 5, 7
Database Conceptual Models (JOSIA)
Refers to the particular method used to organize records in a database.
a.k.a. logical data structures
Objective: develop the database efficiently so that data can be accessed quickly and easily.
There are three main models:
hierarchical (tree structure)
network
relational
Most existing databases are relational. Some legacy systems use hierarchical or network
databases.
The Relational Model
The relational model portrays data in the form of two dimensional tables.
Its strength is the ease with which tables may be linked to one another.
a major weakness of hierarchical and network databases
Relational model is based on the relational algebra functions of restrict, project, and join.
The Relational Algebra Functions
Restrict, Project, and Join

Associations and Cardinality
Association
Represented by a line connecting two entities
Described by a verb, such as ships, requests, or receives
Cardinality the degree of association between two entities
The number of possible occurrences in one table that are associated with a single
occurrence in a related table
Used to determine primary keys and foreign keys

Examples of Entity Associations
Properly Designed Relational Tables
Each row in the table must be unique in at least one attribute, which is the primary key.
Tables are linked by embedding the primary key into the related table as a foreign key.
The attribute values in any column must all be of the same class or data type.
Each column in a given table must be uniquely named.
Tables must conform to the rules of normalization, i.e., free from structural dependencies or
anomalies.
Three Types of Anomalies (RIZ GEL)
Insertion Anomaly: A new item cannot be added to the table until at least one entity uses a
particular attribute item.
Deletion Anomaly: If an attribute item used by only one entity is deleted, all information about
that attribute item is lost.
Update Anomaly: A modification on an attribute must be made in each of the rows in which the
attribute appears.
Anomalies can be corrected by creating additional relational tables.

Advantages of Relational Tables
Removes all three types of anomalies.
Various items of interest (customers, inventory, sales) are stored in separate tables.
Space is used efficiently.
Very flexible users can form ad hoc relationships.
The Normalization Process
A process which systematically splits unnormalized complex tables into smaller tables that
meet two conditions:
all nonkey (secondary) attributes in the table are dependent on the primary key
all nonkey attributes are independent of the other nonkey attributes
When unnormalized tables are split and reduced to third normal form, they must then be linked
together by foreign keys.
Steps in the Normalization Process

Accountants and Data Normalization
Update anomalies can generate conflicting and obsolete database values.
Insertion anomalies can result in unrecorded transactions and incomplete audit trails.
Deletion anomalies can cause the loss of accounting records and the destruction of audit trails.
Accountants should understand the data normalization process and be able to determine
whether a database is properly normalized.
Six Phases in Designing Relational Databases (SHELA)
1. Identify entities
identify the primary entities of the organization
construct a data model of their relationships
2. Construct a data model showing entity associations
determine the associations between entities
model associations into an ER diagram
3. Add primary keys and attributes
assign primary keys to all entities in the model to uniquely identify records
every attribute should appear in one or more user views
4. Normalize and add foreign keys
remove repeating groups, partial and transitive dependencies
assign foreign keys to be able to link tables
5. Construct the physical database
create physical tables
populate tables with data
6. Prepare the user views
normalized tables should support all required views of system users
user views restrict users from having access to unauthorized data

Auditors and Data Normalization
Database normalization is a technical matter that is usually the responsibility of systems

professionals.
The subject has implications for internal control that make it the concern of auditors also.
Most auditors will never be responsible for normalizing an organizations databases; they should
have an understanding of the process and be able to determine whether a table is properly
normalized.
In order to extract data from tables to perform audit procedures, the auditor first needs to know
how the data are structured.
Embedded Audit Module (EAM)
Identify important transactions live while they are being processed and extract them [Figure 8-
26]
Examples
Errors
Fraud
Compliance
SAS 109, SAS 94, SAS 99 / S-OX
Disadvantages:
Operational efficiency can decrease performance, especially if testing is extensive
Verifying EAM integrity - such as environments with a high level of program

maintenance
Status: increasing need, demand, and usage of COA/EAM/CA
Generalized Audit Software (GAS)
Brief history
1950-1967 nascent field, little tools or techniques (e.g., K. Davis in Viet Nam)
October 1967 Haskins & Sells, Ken Stringer, AUDITAPE
1967-1970 AICPA efforts for one GAS, Big 8 each developed their own
c1970 first commercial GAS CARS
2000 commercial GAS is common place
Importance of GAS in history of IS auditing

Popular because:
o GAS software is easy to use and requires little computer background
o Many products are platform independent, works on mainframes and PCs
o Auditors can perform tests independently of IT staff
o GAS can be used to audit the data currently being stored in most file structures and formats
Simple structures [Figure 8-27]
Complex structures [Figures 8-28, 8-29]
Auditing issues:
Auditor must sometime rely on IT personnel to produce files/data
Risk that data integrity is compromised by extraction procedures
Auditors skilled in programming better prepared to avoid these pitfalls
ACL
ACL is a proprietary version of GAS
Leader in the industry
Designed as an auditor-friendly meta-language (i.e., contains commonly used auditor tests)
Access to data generally easy with ODBC interface

Data Structures CAATTs Extraction

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Structures CAATTs Extraction

Uploaded by

Copyright:

Available Formats

Data Structures and CAATTs for Data Extraction

Data Structures (TRICIA)

Two fundamental components:

File Processing Operations

1. Retrieve a record by key

5. Find next record

Flat file structures

Sequential structure [Figure 8-1]

Sequential files are simple & easy to process

Application reads from beginning in sequence

If only small portion of file being processed, inefficient method

Does not permit accessing a record directly

In addition to data file, separate index file

Contains physical address in data file of each indexed record

Indexed random file [Figure 8-2]

Records are created without regard to physical proximity to other related

Physical organization of index file itself may be sequential or random

Random indexes are easier to maintain, sequential more difficult

Advantage over sequential: rapid searches

Other advantages: processing individual records, efficient usage of disk storage

Virtual Storage Access Method (VSAM) [Figure 8-3]

Large files, routine batch processing

Moderate degree of individual record processing

Used for files across cylinders

Uses number of indexes, with summarized content

Disadvantage: does not perform record insertions efficiently requires physical

Records stored randomly

Pointers provide connections b/w records

Types of pointers [Figure 8-8]:

Advantage: Access speed

Disadvantage: if related record moves, pointer must be

Relative address relative position in the file (135th)

Must be manipulated to convert to physical address

Logical address primary key of related record

Key value is converted by hashing to physical address

Database Conceptual Models (JOSIA)

Refers to the particular method used to organize records in a database.

a.k.a. logical data structures

There are three main models:

hierarchical (tree structure)

The Relational Model

a major weakness of hierarchical and network databases

The Relational Algebra Functions

Restrict, Project, and Join

Represented by a line connecting two entities

Described by a verb, such as ships, requests, or receives

Cardinality the degree of association between two entities

Used to determine primary keys and foreign keys

Properly Designed Relational Tables

Each column in a given table must be uniquely named.

Three Types of Anomalies (RIZ GEL)

Anomalies can be corrected by creating additional relational tables.

Removes all three types of anomalies.

Space is used efficiently.

Very flexible users can form ad hoc relationships.

The Normalization Process

all nonkey attributes are independent of the other nonkey attributes

Steps in the Normalization Process

Update anomalies can generate conflicting and obsolete database values.

Six Phases in Designing Relational Databases (SHELA)

identify the primary entities of the organization

construct a data model of their relationships