
BLOOM JOIN

A bloom join reduces the amount of data (number of tuples) transferred compared to a semi join by using a bloom filter.
It represents the join keys of one relation as a compact bit vector, which is shipped between sites instead of the keys themselves.
Because only tuples whose keys match the bit vector are sent back, it avoids transferring most unnecessary data, making a bloom join generally more efficient than a semi join.
SEMI JOIN
Transfers more data than a bloom join, because it ships the actual join-attribute values rather than a compact bit vector.
Generally less efficient than a bloom join.
It does not use a bit vector.
It may still transfer unnecessary data.
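The idea above can be sketched in a few lines of Python. This is an illustrative toy, not a real distributed implementation: the relations, filter size, and hash scheme are all invented for the example, and "shipping" is just passing the bit vector between two in-memory lists.

```python
# Toy bloom join sketch. Site A builds a bit vector from its join keys and
# ships only that vector to site B; B filters its tuples before sending
# candidates back, so non-matching tuples are never transferred.

def build_bloom_filter(keys, size=64, num_hashes=3):
    """Set num_hashes bits per key in a bit vector of the given size."""
    bits = 0
    for key in keys:
        for i in range(num_hashes):
            bits |= 1 << (hash((key, i)) % size)
    return bits

def might_contain(bits, key, size=64, num_hashes=3):
    """False means 'definitely absent'; True may be a false positive."""
    return all(bits >> (hash((key, i)) % size) & 1 for i in range(num_hashes))

# Relation R at site A, relation S at site B, joined on the first column.
r_tuples = [(1, "alice"), (2, "bob"), (3, "carol")]
s_tuples = [(2, "sales"), (3, "hr"), (9, "ops")]

bloom = build_bloom_filter(k for k, _ in r_tuples)                # shipped A -> B
candidates = [t for t in s_tuples if might_contain(bloom, t[0])]  # filtered at B
result = [(rk, rv, sv) for rk, rv in r_tuples
          for sk, sv in candidates if rk == sk]                   # final join at A
```

False positives from the filter (e.g. key 9 slipping through) only cost extra transfer; the final equality join still removes them, so the result is always correct.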

CENTRALIZED DBMS
A centralized database keeps its data in storage devices at a single location, connected to a single CPU.
A centralized database is easier to maintain and keep updated, since all the data is stored in one place.
It is easier to maintain data integrity and avoid data duplication.
Processing requests becomes slow as requests arrive from multiple clients.
There is a lot of load on the single centralized server.
Keeping data up to date is easier.
The cost of maintenance is lower.
There is less complexity.
Designing a database for a centralized DBMS is easy.

DISTRIBUTED DBMS
A distributed database system keeps its data in storage devices that may be located at different geographical sites, managed by a distributed DBMS.
A distributed database is harder to maintain, since the data is scattered across different sites.
Requests can be processed very quickly, because
the databases are parallelized, balancing the load across several servers.
However, keeping the data up to date in a distributed database system requires additional work,
which increases the cost of maintenance,
adds a lot of complexity, and requires additional software.
Designing a distributed database is more complex than designing a centralized one.
Two-phase commit protocol
Two-phase commit is a standard protocol in distributed transactions for achieving ACID properties. Each transaction has a coordinator who initiates and coordinates the transaction (Begg & Connolly 2002, p.749).
In the two-phase commit the coordinator sends a prepare message to all participants (nodes) and waits for their votes. The coordinator then sends the global decision to all participants; every participant waits for this decision from the coordinator before committing or aborting the transaction. If committing, the coordinator records the decision in a log and sends a commit message to all participants. If any participant aborts, the coordinator sends a rollback message and the transaction is undone using the log file created earlier. The advantage is that all participants reach a decision consistently, yet independently (Skeen).
However, the two-phase commit protocol also has a limitation: it is a blocking protocol (Begg & Connolly 2002, p.749). Participants block resources while waiting for a message from the coordinator; if the coordinator fails, a participant will continue to wait and may never resolve its transaction, so the resource could be blocked indefinitely. Likewise, the coordinator blocks resources while waiting for replies from participants, and can also block indefinitely if no acknowledgement is received from a participant. Begg and Connolly suggest that the likelihood of a block happening is rare (2002, p.749), which is most likely why systems still use the two-phase commit protocol.
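The two phases described above can be sketched as follows. This is a minimal single-process illustration with invented names; a real implementation would add network messaging, write-ahead logging, timeouts, and recovery.

```python
# Minimal two-phase commit sketch. Phase 1: every participant votes on a
# prepare message. Phase 2: the coordinator commits only if all votes are
# yes, otherwise it broadcasts a rollback (abort) to everyone.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "INIT"

    def prepare(self):              # phase 1: vote yes/no
        self.state = "READY" if self.can_commit else "ABORTED"
        return self.can_commit

    def finish(self, decision):     # phase 2: apply the global decision
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]          # phase 1: collect votes
    decision = "COMMITTED" if all(votes) else "ABORTED"  # coordinator decides
    for p in participants:                               # phase 2: broadcast
        p.finish(decision)
    return decision

nodes = [Participant("n1"), Participant("n2"), Participant("n3", can_commit=False)]
print(two_phase_commit(nodes))  # ABORTED: one participant voted no
```

The blocking problem discussed above lives in the gaps this sketch glosses over: between `prepare()` returning and `finish()` arriving, a real participant holds its locks and can only wait.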




Three-phase commit protocol
An alternative to the two-phase commit protocol used by many database systems is the three-phase commit. Dale Skeen describes the three-phase commit as a non-blocking protocol, developed to avoid the failures that occur in two-phase commit transactions.
As with the two-phase commit, the three-phase also has a coordinator who initiates and coordinates the transaction (Begg & Connolly 2002, p.750). However, the three-phase protocol introduces a third phase
called the pre-commit. The aim of this is to 'remove the uncertainty period for participants that have committed and are waiting for the global abort or commit message from the coordinator' (Begg & Connolly 2002,
p.750). When receiving a pre-commit message, participants know that all others have voted to commit. If a pre-commit message has not been received, the participant aborts and releases any blocked resources.
ASSOCIATION RULE MINING
Association rules have been used broadly in many application domains for finding patterns in data. A pattern reveals combinations of events that occur at the same
time. One of the best-known domains is business, where discovering patterns or associations supports effective decision making and marketing. Other areas where
association rule mining can be applied include finding patterns in biological databases, market basket analysis of library circulation data, studying protein
composition, and analysing population and economic census data.
Recent studies have produced various algorithms for finding association rules; the best known is the Apriori algorithm. However, the
complexity and performance of mining algorithms remain an active research area, as they have to mine ever larger sets of data items; most studies focus on how
to simplify association rules and improve algorithm performance.
Support: The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed as a
percentage of the total number of records in the database
Confidence: Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of
transactions that include all items in the antecedent.
Lift: Lift is the ratio of confidence to expected confidence. It tells us how much the probability of the "then"
(consequent) part increases given the "if" (antecedent) part.
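The three measures defined above can be computed directly on a toy transaction set. The transactions and the rule {bread} → {butter} are invented for this example.

```python
# Support, confidence, and lift for the rule {bread} -> {butter}
# over a small invented transaction database.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
sup = support(antecedent | consequent)   # transactions with both: 2/4 = 0.5
conf = sup / support(antecedent)         # 0.5 / 0.75 = 2/3
lift = conf / support(consequent)        # (2/3) / 0.5 = 4/3 > 1: positive association
print(sup, conf, lift)
```

A lift above 1 means buying bread raises the probability of buying butter relative to the baseline; a lift of exactly 1 would mean the two items are independent.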

PHANTOM DEADLOCKS
In distributed deadlock detection, the delay in propagating local information might cause the deadlock detection algorithms to identify deadlocks that do not really exist.
Such situations are called phantom deadlocks and they lead to unnecessary aborts
Phantom deadlocks are deadlocks that are falsely detected in a distributed system due to internal delays but do not actually exist. For example, if a process releases
a resource R1 and then issues a request for R2, and the release message is lost or delayed, a coordinator (deadlock detector) could falsely conclude that a deadlock
exists, because requesting R2 while still appearing to hold R1 would form a cycle.
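The scenario above can be made concrete with a wait-for graph. The process names and the stale snapshot are invented for illustration: the detector runs cycle detection on its (delayed) view of the graph, not on the true state.

```python
# Phantom deadlock sketch: a delayed release message leaves a stale edge in
# the detector's snapshot of the wait-for graph, producing a false cycle.

def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {process: process_it_waits_on}."""
    for start in wait_for:
        seen, node = set(), start
        while node in wait_for:        # follow the chain of waits
            if node in seen:
                return True            # revisited a node: cycle found
            seen.add(node)
            node = wait_for[node]
    return False

true_state = {"P1": "P2"}                  # P1 waits for P2; no deadlock
stale_snapshot = {"P1": "P2", "P2": "P1"}  # delayed release: P2 still "waits" on P1
print(has_cycle(true_state), has_cycle(stale_snapshot))  # False True
```

The detector working on `stale_snapshot` reports a deadlock and aborts a transaction unnecessarily, even though `true_state` is cycle-free.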

Goals of search engines:
Effectiveness (quality): to retrieve the most relevant set of documents for a query
Process text and store text statistics to improve relevance
Efficiency (speed): process queries from users as fast as possible
Use specialized data structures
Two Major Functions
Search engine components support two major functions
The index process: building data structures that enable searching
The query process: using those data structures to produce a ranked list of documents for a user's query



Text Acquisition:
Identifying and making available the documents
that will be searched
How?
Crawling or scanning the web, a corporate intranet, or other sources of information
Building a document data store containing the text and metadata for all the documents

Crawlers:
Document feeds:
A mechanism for accessing a real-time stream of documents
RSS: a common standard used for web feeds for content such as news, blogs, or video
Conversion:
Converting a variety of formats (e.g., HTML, XML, PDF) into a consistent text and metadata format
Resolving encoding problems
Using ASCII (7 bits) or extended ASCII (8 bits) for English
Using Unicode (16 bits) for international languages
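The encoding point above is easy to demonstrate: documents arrive as bytes in different encodings and are all decoded into one internal Unicode representation. The sample strings are invented for the example.

```python
# Conversion step sketch: normalize bytes in several encodings to Unicode.

ascii_bytes = b"plain ASCII text"          # 7-bit ASCII
latin1_bytes = "café".encode("latin-1")    # extended 8-bit encoding
utf16_bytes = "日本語".encode("utf-16")     # international text

texts = [
    ascii_bytes.decode("ascii"),
    latin1_bytes.decode("latin-1"),
    utf16_bytes.decode("utf-16"),
]
print(texts)  # all three are now ordinary Unicode strings
```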

Firstly, a data mart contains the programs, data, software, and hardware of a specific department of a company.
There can be separate data marts for finance, sales, production, or marketing.
All these data marts are different, but they can be coordinated.
The data mart of one department is different from that of another department, and though indexed,
this system is not suitable for a huge database, as it is designed to meet the requirements of a particular department.
It is easier to manage and takes less time to process.
Data marts are quick and easy to use because they hold small amounts of data.
Data marts are inexpensive.

Data warehousing is not limited to a particular department; a data warehouse represents the database of a complete organization.
The data stored in a data warehouse is more detailed, though indexing is light since it has to store huge amounts of information.
It is also difficult to manage and takes a long time to process.
Data warehouses are harder to use because they hold large amounts of data.
Data warehousing is also more expensive.

Top-down tree construction schema:
Examine training database and find best splitting predicate for the root node
Partition training database
Recurse on each child node

TOP DOWN TREE CONSTRUCTION
BuildTree(Node t, Training database D, Split Selection Method S)
(1) Apply S to D to find the splitting criterion for t
(2) if (t is not a leaf node)
(3)   Create children nodes of t
(4)   Partition D into children partitions
(5)   Recurse BuildTree on each child node with its partition
(6) endif
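Step (1) of the schema above, finding the best splitting criterion, can be sketched with Gini impurity (the measure used by CART). The dataset and the exhaustive threshold search are invented for illustration; a full BuildTree would recurse on the two partitions this step produces.

```python
# One split-selection step: choose the (feature, threshold) pair that
# minimizes the weighted Gini impurity of the resulting partitions.

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {c: labels.count(c) for c in set(labels)}
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(rows, labels):
    """Return (feature_index, threshold, impurity) of the best binary split."""
    best = (None, None, float("inf"))
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best[2]:
                best = (f, threshold, score)
    return best

rows = [(2.0,), (3.0,), (10.0,), (11.0,)]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))  # splits feature 0 at 3.0 with impurity 0.0
```

Here the classes separate perfectly at threshold 3.0, so the weighted impurity drops to 0 and both children would become leaf nodes.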

Three algorithmic components:
Split selection (CART, C4.5, ID3)
Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

POLYINSTANTIATION
Polyinstantiation is the concept of a type (class, database row, or otherwise) being instantiated into multiple independent instances (objects, copies).
It may also indicate, as in the case of database polyinstantiation, that two different instances have the same name (identifier, primary key).

Although useful from a security standpoint, polyinstantiation raises several problems:
Moral scrutiny, since it involves lying
Providing consistent views
Explosion in the number of rows
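Database polyinstantiation can be sketched with a table whose effective key is really (key, classification). The ship and mission values are a classic invented example: low-cleared users see a cover story, while cleared users see the real row with the same primary key.

```python
# Polyinstantiation sketch: two independent rows share the visible primary
# key "Enterprise"; a user's clearance level selects which instance is seen.

LEVELS = {"UNCLASSIFIED": 0, "SECRET": 1}

rows = {
    ("Enterprise", "UNCLASSIFIED"): "exploration",  # cover story (the "lie")
    ("Enterprise", "SECRET"): "spying",             # real mission
}

def view(clearance):
    """One row per key: the highest-classified instance the user may see."""
    visible = {}
    for (key, cls), mission in rows.items():
        if LEVELS[cls] <= LEVELS[clearance]:
            if key not in visible or LEVELS[cls] > visible[key][0]:
                visible[key] = (LEVELS[cls], mission)
    return {k: m for k, (_, m) in visible.items()}

print(view("UNCLASSIFIED"))  # {'Enterprise': 'exploration'}
print(view("SECRET"))        # {'Enterprise': 'spying'}
```

This also shows where the listed problems come from: the cover story is deliberately false (the "lying" objection), each view must stay internally consistent, and every polyinstantiated key multiplies the number of stored rows.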
