
TEXT MINING FOR DOCUMENTS ANNOTATION AND

ONTOLOGY SUPPORT

A seminar report submitted in partial fulfillment of the requirements for the award of

BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
By
MANDALANENI VIJAY KUMAR (07U91A1235)

Under the Esteemed guidance of


SHAHEDA AKTHAR, ASSOCIATE PROFESSOR
DEPARTMENT OF INFORMATION TECHNOLOGY
SRI MITTAPALLI COLLEGE OF ENGINEERING
(Affiliated to Jawaharlal Nehru Technological University,
Kakinada)
TUMMALAPALEM, NH-5, GUNTUR – 522233
SRI MITTAPALLI COLLEGE OF ENGINEERING
DEPARTMENT OF INFORMATION TECHNOLOGY
TUMMALAPALEM, NH-5
GUNTUR-522233
(2007-2011)

CERTIFICATE
This is to certify that MANDALANENI VIJAY KUMAR (07U91A1235)
has satisfactorily completed the seminar entitled “TEXT MINING FOR
DOCUMENTS ANNOTATION AND ONTOLOGY SUPPORT” in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in
Information Technology by JAWAHARLAL NEHRU TECHNOLOGICAL
UNIVERSITY, KAKINADA.

STAFF-INCHARGE                    HEAD OF THE DEPARTMENT
(SHAHEDA AKTHAR) M.S (Ph.D)       (Sri P.VASU) M.Tech
Associate Professor               Associate Professor
ACKNOWLEDGEMENT
“GUIDANCE IS PERVASIVE AND IT ENLIGHTENS”
We express our sincere thanks where they are due.
It is with immense pleasure that we express our gratitude to our guide,
SHAHEDA AKTHAR, who guided and encouraged us at every step of the seminar
work. Her moral support and guidance throughout the seminar helped us to a great extent.
We express our sincere thanks to our beloved Chairman, Sri M.KOTESWARA
RAO, and the Secretary, Sri M.B.V.SATYNARAYANA garu, for providing support and
a stimulating environment for developing the seminar.
We express our deep sense of reverence and profound gratitude to Dr. E.V.KRISHNA
RAO, Principal, Sri Mittapalli College of Engineering, for providing great support
and the resources needed to carry out the seminar.
Our sincere thanks go to P.VASU, HOD, Dept. of IT, for his co-operation and
guidance in helping us make our seminar successful and complete in all aspects. We are
grateful for his precious guidance and suggestions.
We also extend our gratitude to all other teaching staff and lab technicians for
their constant support and advice throughout the seminar. Last but not least, we thank
our friends who directly or indirectly helped us in the successful completion of our
seminar.

MANDALANENI VIJAY KUMAR (07U91A1235)

CONTENTS
1.Introduction

2.Knowledge Discovery in Texts


2.1.Particular Steps of the KDT Process
2.2.Text Data Mining Tasks

3.Representation of Textual Documents


3.1.Classical Information Retrieval Models
3.2.Term selection/reduction

4.Text Mining Methods


4.1.Clustering/Visualization
4.2.Association Rules
4.3.Classification Models

5.Exploitation of KDT in Webocracy


5.1.Clustering/Visualization
5.2.Association rules
5.3.Classification Models

6.Conclusion

7.Acknowledgements

8.References
ABSTRACT

E-mail nowadays is a security hazard. Many viruses and worms use e-mail to spread
themselves throughout the Internet, and almost every day new types of worms and
viruses appear. It is of vital importance for administrators and users to keep mail security
up-to-date.
There are three steps of filtering that every mail should be subjected to: attachment
filters, virus filters, and spam filters.
You should encrypt your e-mail for the same reason that you don't write all of your
correspondence on the back of a post card. E-mail is actually far less secure than the
postal system. With the post office, you at least put your letter inside an envelope to hide
it from casual snooping.
The PGP program, the PKC encryption algorithm, the MIME and S/MIME protocols, and RSA
are explained.
What Is Email?

Email, also sometimes written as e-mail, is simply the shortened form of electronic mail, a
system for receiving, sending, and storing electronic messages. Email has gained
popularity with the spread of the Internet. In many cases, email has become the preferred
method of communication.

Though there is some degree of uncertainty as to when email was invented, the father of
the modern version is generally regarded to be American Ray Tomlinson. Before
Tomlinson, messages could be sent between users, but only when they were connected to
the same computer. Even once computers were networked, messages could not be
targeted to a particular individual. Tomlinson devised a way to address email to certain
users, and thus was credited for one of the most important communication inventions in
the 20th century.
INTRODUCTION
E-mail nowadays is a security hazard. Many viruses and worms use e-mail to spread
themselves throughout the Internet, and almost every day new types of worms and
viruses appear. It is of vital importance for administrators and users to keep mail security
up-to-date.

Hack:
1. To write program code.
2. To modify a program, often in an unauthorized manner, by changing the code itself.
3. Code that is written to provide extra functionality to an existing program.
4. An inelegant and usually temporary solution to a problem

Hacker: A slang term for a computer enthusiast, i.e., a person who enjoys learning
programming languages and computer systems and can often be considered an expert on
the subject(s). Among professional programmers, depending on how it is used, the term can
be either complimentary or derogatory, although it is developing an increasingly
derogatory connotation. The pejorative sense of hacker is becoming more prominent
largely because the popular press has coopted the term to refer to individuals who gain
unauthorized access to computer systems for the purpose of stealing and corrupting data.
Hackers, themselves, maintain that the proper term for such individuals is cracker.

EMAIL VIRUSES

Email viruses spread in two main ways:

Attachments: Viruses commonly hide in programs sent as email attachments, and run
when the user double-clicks on the program to start it. Therefore, you shouldn't run
programs received as email attachments unless you have a virus protection program
running and the attachment is from a trusted source.

For example, a greeting card program forwarded from a friend of a friend is not
from a trusted source, and there is nothing to stop it from running malicious system
programming code behind its animated presentation once you start it running on your
machine. You should also be wary of opening documents that might contain scripts and
macros (see below). Some attachments will have two extensions to try to trick you into
believing they are just a harmless data file and not a program, such as
"coolpicture.jpg.exe".
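The double-extension trick can be checked mechanically. The following is only a minimal sketch: the function name `looks_disguised` and the short list of executable extensions are assumptions made for illustration, not a complete or authoritative list.

```python
from pathlib import PurePath

# Hypothetical, deliberately short list of extensions that run as programs.
EXECUTABLE_SUFFIXES = {".exe", ".com", ".bat", ".scr", ".pif", ".vbs"}


def looks_disguised(filename: str) -> bool:
    """Return True when an executable suffix hides behind another extension."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in EXECUTABLE_SUFFIXES


for name in ("coolpicture.jpg.exe", "coolpicture.jpg", "setup.exe"):
    print(name, looks_disguised(name))
```

Only the first name is reported as disguised; a plain "setup.exe" is an honest executable and would be caught by an ordinary attachment filter instead.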

Scripts: One of the first script viruses was a MIME virus that attacked older versions of
programs like Netscape Mail, Microsoft Outlook, and Eudora, and could under certain
rare conditions run a damaging program as soon as the email was simply opened. In a
variation on an old hacker technique, the attached MIME file was given a very long name
that then triggered a bug that allowed the end of the name to be run as a series of
instructions, which could then be written to do damaging things to your computer.

However, Visual Basic (VBasic) script viruses became very real, and have continued to
do considerable damage across the Internet. VBasic is a very flexible and deeply
powerful program development environment used by Microsoft for their operating
system, office automation, and Internet applications. This means that VBasic viruses can
run from anywhere in the Microsoft software architecture and affect the entire system,
from email to operating system, giving them unprecedented reach and power.
The first widespread VBasic virus was Melissa, which brought down several of the
largest corporations in the world for several days in late March 1999.

Melissa traveled in a Microsoft Word document and was triggered when the document
was opened. It then launched the associated Microsoft Outlook email program, read the user's
email address book, and sent copies of itself to the first fifty names. This clever
architecture was quickly followed by many variants programmed by hackers around the
world, including the KAK virus that triggered as soon as an email was opened, and the
BubbleBoy virus that triggered as soon as the email was viewed in the preview pane.
VARIOUS ASPECTS OF EMAIL SECURITY

1. Email filtering.
2. Web email vulnerabilities.
3. Reaper exploit.
4. Email encryption.

EMAIL FILTERING

There are three steps of filtering that every mail should be subjected to:
Attachment filters: These are used to block executable attachments, such as .exe files.
Many other attachment types are also executable. Recently, exploits in image processing
libraries have been made public, which allows viruses to spread through image files such as
gif or jpg. Attachment filters require little processing and little maintenance (they do
not go out of date, but you must make sure you block all attachment types used as virus
carriers). However, they are ineffective if a virus author uses a more complex method of
spreading the infection by wrapping the virus into an archive file, e.g. a zip archive
(unless you choose to block archives as well).
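An attachment filter of this kind can be sketched with Python's standard `email` package. The block list and the helper name `blocked_attachments` are assumptions for illustration; a real filter would need a much longer list and would run inside the mail server.

```python
from email.message import EmailMessage

# Assumed block list for illustration; .zip is included to block archives too.
BLOCKED_SUFFIXES = (".exe", ".scr", ".vbs", ".zip")


def blocked_attachments(msg: EmailMessage) -> list[str]:
    """Return the filenames of attachments whose extension is on the block list."""
    hits = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if name.lower().endswith(BLOCKED_SUFFIXES):
            hits.append(name)
    return hits


msg = EmailMessage()
msg["Subject"] = "Greetings"
msg.set_content("See the attached card.")
msg.add_attachment(b"MZ...", maintype="application",
                   subtype="octet-stream", filename="card.jpg.exe")
msg.add_attachment("plain notes", filename="notes.txt")
print(blocked_attachments(msg))
```

The check looks only at the filename, which is why it needs so little processing, and also why a virus hidden inside an allowed file type passes straight through.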

Virus filters: These are used to scan all attachments for known viruses. The virus database
must be constantly updated to reliably detect the latest threats. As the update of virus
filter databases lags behind, there is a window of vulnerability where viruses can pass
undetected into users' mailboxes. By blocking executable attachments, an attachment
blocker can close this window, to a point: users must still be instructed to be very careful
with the content of archive files that passed both the attachment blocker and the virus
checker.

Spam filters: Finally, most mail traffic nowadays is spam. Good spam filters are able to
capture about 90% of all spam mails, while at the same time false positives (legitimate
mail incorrectly flagged as spam) are very rare.

WEB EMAIL

There is an unexpected vulnerability to confidentiality of personal information with some
web based email services. When you click a link on a web page, the HTTP protocol
sends the URL of the current page to the new page (in the Referer request header).
Therefore, if you access your email through a web based email service and click on a link
in an email, the URL of the current web page is passed to the new page. This can cause
unexpected compromise of personal information with web email services that put account
information in the URL of the web page, since this information is transmitted to the
server of any third party web page you access through your web email account. This
information can include your email address, login ID, and even your actual name.
In most cases the information can't be used to actually access your web email account,
since most services have implemented password and other protections, but it can reveal
more personal information than is available through other normal web communications.
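To make the leak concrete, the sketch below shows what a third-party server could read if the webmail URL arrives in the HTTP Referer header. The URL, its parameter names, and the helper `leaked_fields` are all invented for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical webmail URL that keeps account details in its query string.
current_page = ("https://webmail.example.com/inbox"
                "?login=jdoe&email=jdoe%40example.com&msg=42")

# A browser following a link may send the current page's URL as the Referer.
outgoing_headers = {"Referer": current_page}


def leaked_fields(referer: str) -> dict[str, str]:
    """Return the query parameters a third-party server could read."""
    query = urlsplit(referer).query
    return {key: values[0] for key, values in parse_qs(query).items()}


print(leaked_fields(outgoing_headers["Referer"]))
```

Nothing here requires breaking into the account: the receiving server simply parses a header that the browser sent voluntarily.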

REAPER EXPLOIT
Email confidentiality can also be compromised by macro viruses like the reaper exploit,
where the virus waits in the background and sends your reply or forward of an email back
to the hacker, and then travels with your email to divulge copies of replies or forwards by
the recipients back to the hacker as well. The term survives mainly as a historical
reference rather than because it is in common current use.

ENCRYPTION

You should encrypt your e-mail for the same reason that you don't write all of your
correspondence on the back of a post card. E-mail is actually far less secure than the
postal system. With the post office, you at least put your letter inside an envelope to hide
it from casual snooping. Take a look at the header area of any e-mail message that
you receive and you will see that it has passed through a number of nodes on its way to
you. Every one of these nodes presents the opportunity for snooping. Encryption in no
way should imply illegal activity. It is simply intended to keep personal thoughts
personal.
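The nodes a message passed through are recorded in its Received header lines. The short sketch below lists them with Python's standard `email` package; the raw message and its host names are made up for the example.

```python
from email import message_from_string

# Made-up raw message; host names are placeholders, not real servers.
raw = (
    "Received: from relay2.example.net by mx.example.org; Mon, 1 Jan 2001 10:00:03 +0000\n"
    "Received: from relay1.example.com by relay2.example.net; Mon, 1 Jan 2001 10:00:02 +0000\n"
    "Received: from sender-pc by relay1.example.com; Mon, 1 Jan 2001 10:00:01 +0000\n"
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: Hello\n"
    "\n"
    "Just saying hi.\n"
)

msg = message_from_string(raw)
hops = msg.get_all("Received", [])
# Each mail server prepends its own Received line, so the last one is the first hop.
for number, hop in enumerate(reversed(hops), start=1):
    print(number, hop.split(";")[0])
```

Each printed line is a machine that held the unencrypted message, however briefly.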

MIME

Short for Multipurpose Internet Mail Extensions, a specification for formatting non-
ASCII messages so that they can be sent over the Internet. Many e-mail clients now
support MIME, which enables them to send and receive graphics, audio, and video files
via the Internet mail system. In addition, MIME supports messages in character sets other
than ASCII. There are many predefined MIME types, such as GIF graphics files and
PostScript files. It is also possible to define your own MIME types. In addition to e-mail
applications, Web browsers also support various MIME types. This enables the browser
to display or output files that are not in HTML format. The content types defined by
MIME standards are also of importance outside of e-mail, such as in communication
protocols like HTTP for the World Wide Web. HTTP requires that data be transmitted in
the context of e-mail-like messages, although the data most often is not actually e-mail.
MIME is specified in six linked RFC memoranda: RFC 2045, RFC 2046, RFC 2047,
RFC 4288, RFC 4289 and RFC 2049, which together define the specifications.

According to MIME co-creator Nathaniel Borenstein, the MIME-Version header was
introduced to allow MIME to change, to advance to version 2.0 and so forth, but this
decision led to the opposite outcome, making it nearly impossible to create a new version
of the standard.

"We did not adequately specify how to handle a future MIME version," Borenstein said.
"So if you write something that knows 1.0, what should you do if you encounter 2.0 or
1.1? I sort of thought it was obvious but it turned out everyone implemented that in
different ways. And the result is that it would be just about impossible for the Internet to
ever define a 2.0 or a 1.1."[2]

The Content-Type header indicates the Internet media type of the message content,
consisting of a type and subtype, for example

Content-Type: text/plain

Through the use of the multipart type, MIME allows messages to have parts arranged in a
tree structure where the leaf nodes are any non-multipart content type and the non-leaf
nodes are any of a variety of multipart types. This mechanism supports:

• simple text messages using text/plain (the default value for "Content-Type: ")
• text plus attachments (multipart/mixed with a text/plain part and other non-text
parts). A MIME message including an attached file generally indicates the file's
original name with the "Content-disposition:" header, so the type of file is
indicated both by the MIME content-type and the (usually OS-specific) filename
extension
• reply with original attached (multipart/mixed with a text/plain part and the
original message as a message/rfc822 part)
• alternative content, such as a message sent in both plain text and another format
such as HTML (multipart/alternative with the same content in text/plain and
text/html forms)
• image, audio, video and application (for example, image/jpeg, audio/mp3,
video/mp4, and application/msword and so on)
• many other message constructs
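The tree structure described above can be seen by building a small message with Python's standard `email` package and printing the content type of every part. This is a sketch only; the subject, texts, and the four bytes standing in for an image are placeholders.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Report"
msg.set_content("Plain-text version of the message.")        # text/plain
msg.add_alternative("<p>HTML version of the message.</p>", subtype="html")
msg.add_attachment(b"\xff\xd8\xff\xe0", maintype="image",
                   subtype="jpeg", filename="photo.jpg")     # placeholder bytes


def show_tree(part, depth=0):
    """Print each part's content type, indented by its depth in the tree."""
    print("  " * depth + part.get_content_type())
    if part.is_multipart():
        for child in part.iter_parts():
            show_tree(child, depth + 1)


show_tree(msg)
```

The root is a multipart/mixed node holding a multipart/alternative node (with text/plain and text/html leaves) and an image/jpeg leaf, matching the "alternative content" and "text plus attachments" cases in the list.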

Since RFC 2822, conforming message header names and values should be ASCII
characters; values that contain non-ASCII data should use the MIME encoded-word
syntax (RFC 2047) instead of a literal string. This syntax uses a string of ASCII
characters indicating both the original character encoding (the "charset") and the content-
transfer-encoding used to map the bytes of the charset into ASCII characters.

The form is: "=?charset?encoding?encoded text?=".


• charset may be any character set registered with IANA. Typically it would be the
same charset as the message body.
• encoding can be either "Q" denoting Q-encoding that is similar to the quoted-
printable encoding, or "B" denoting base64 encoding.
• encoded text is the Q-encoded or base64-encoded text.

Difference between Q-encoding and quoted-printable

The ASCII codes for the question mark ("?") and equals sign ("=") may not be
represented directly as they are used to delimit the encoded-word. The ASCII code for
space may not be represented directly because it could cause older parsers to split up the
encoded word undesirably. To make the encoding smaller and easier to read the
underscore is used to represent the ASCII code for space creating the side effect that
underscore cannot be represented directly. Use of encoded words in certain parts of
headers imposes further restrictions on which characters may be represented directly.

For example,

Subject: =?iso-8859-1?Q?=A1Hola,_se=F1or!?=

is interpreted as "Subject: ¡Hola, señor!".
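The same header value can be decoded with Python's standard `email.header` module, shown here as a small sketch. The variable names are arbitrary.

```python
from email.header import decode_header, Header

raw_value = "=?iso-8859-1?Q?=A1Hola,_se=F1or!?="

# decode_header returns (bytes, charset) pairs, one per encoded-word or plain run.
decoded_bytes, charset = decode_header(raw_value)[0]
subject = decoded_bytes.decode(charset)
print(subject)

# Going the other way: Header produces an ASCII-only encoded-word for the text.
print(Header(subject, charset="iso-8859-1").encode())
```

The re-encoded form may differ in detail from the original string (for instance in which characters are escaped), since more than one valid encoded-word exists for the same text.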

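The encoded-word format is not used for the names of the headers themselves (for example
Subject); these are always in English in the raw message. When viewing a message with a
non-English e-mail client, the header names are usually translated by the client.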

S/MIME

S/MIME (Secure / Multipurpose Internet Mail Extensions) is a protocol that adds digital
signatures and encryption to Internet MIME (Multipurpose Internet Mail Extensions) messages
described in RFC 1521.

MIME is the official proposed standard format for extended Internet electronic mail.
Internet e-mail messages consist of two parts, the header and the body. The header forms
a collection of field/value pairs structured to provide information essential for the
transmission of the message. The structure of these headers can be found in RFC 822.
The body is normally unstructured unless the e-mail is in MIME format. MIME defines
how the body of an e-mail message is structured. The MIME format permits e-mail to
include enhanced text, graphics, audio, and more in a standardized manner via MIME-
compliant mail systems. However, MIME itself does not provide any security services.
The purpose of S/MIME is to define such services, following the syntax given in PKCS
#7 for digital signatures and encryption. The MIME body section
carries a PKCS #7 message, which itself is the result of cryptographic processing on
other MIME body sections. S/MIME standardization has transitioned into IETF, and sets
of documents describing S/MIME version 3 have been published there.

RSA ALGORITHM
RSA is a public-key cryptosystem defined by Rivest, Shamir, and Adleman.
For example, plaintexts are positive integers up to 2^512. Keys are quadruples
(p,q,e,d), with p a 256-bit prime number, q a 258-bit prime number, and d and e large
numbers with (de - 1) divisible by (p-1)(q-1). We define E_K(P) = P^e mod pq and D_K(C)
= C^d mod pq. All quantities are readily computed with classic and modern number
theoretic algorithms: Euclid's algorithm for computing the greatest common divisor
covers the classic part, and more recently explored computational approaches to
finding large `probable' primes, such as the Fermat test, provide the modern part.
Now E_K is easily computed from the pair (pq,e)---but, as far as anyone knows,
there is no easy way to compute D_K from the pair (pq,e). So whoever generates K can
publish (pq,e). Anyone can send a secret message to him; he is the only one who can
read the messages.
Key generation

RSA involves a public key and a private key. The public key can be known to everyone
and is used for encrypting messages. Messages encrypted with the public key can only be
decrypted using the private key. The keys for the RSA algorithm are generated the
following way:

1. Choose two distinct prime numbers p and q.
   o For security purposes, the integers p and q should be chosen at random,
     and should be of similar bit-length. Prime integers can be efficiently found
     using a primality test.
2. Compute n = pq.
   o n is used as the modulus for both the public and private keys.
3. Compute φ(n) = (p – 1)(q – 1), where φ is Euler's totient function.
4. Choose an integer e such that 1 < e < φ(n) and gcd(e,φ(n)) = 1, i.e. e and φ(n) are
   coprime.
   o e is released as the public key exponent.
   o e having a short bit-length and small Hamming weight results in more
     efficient encryption - most commonly 0x10001 = 65537. However, small
     values of e (such as 3) have been shown to be less secure in some settings.
5. Determine d = e^(-1) mod φ(n); i.e. d is the multiplicative inverse of e mod φ(n).
   o This is often computed using the extended Euclidean algorithm.
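The key generation steps can be followed with deliberately tiny numbers. The primes 61 and 53, the exponent 17, and the message 65 below are assumptions chosen only so the arithmetic stays readable; real keys use primes hundreds of digits long, and this sketch omits the padding that practical RSA requires.

```python
from math import gcd

# Tiny primes chosen only for readability; never use sizes like this in practice.
p, q = 61, 53
n = p * q                      # modulus shared by the public and private keys
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent
assert 1 < e < phi and gcd(e, phi) == 1
d = pow(e, -1, phi)            # multiplicative inverse of e mod phi (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)
recovered = pow(ciphertext, d, n)
print("n =", n, "phi =", phi, "d =", d)
print("ciphertext =", ciphertext, "recovered =", recovered)
```

Python's three-argument `pow` with exponent -1 performs the modular inversion that the text attributes to the extended Euclidean algorithm.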

Public Key Cryptography

Public Key Cryptography (PKC) relies on mathematical operations that are easy to perform
but very hard to reverse, a property that enables the creation of modern-day secure
communication channels on the Internet.

The main feature of PKC is the use of two keys for each person, a public key and a
private key, where either key can decrypt a message encrypted with the other. Each key is
almost impossible to find out from the other, and if the keys are long enough the method
is effectively unbreakable with currently known techniques.
The elegant PKC architecture enables clever creation of a secure communications system
for distributed participants, which is exactly what is needed for the Internet. The
technology is the basis of the field of Public Key Infrastructure (PKI), and the basis of the
industry standard Rivest, Shamir, Adleman (RSA) encryption algorithm.
Public Key Cryptography (PKC) History
Public Key Cryptography (PKC) uses two keys, a "public key" and a "private key", to
implement an encryption algorithm that doesn't require two parties to first exchange a
secret key in order to conduct secure communications. In a nice mathematical twist, this
conceptual breakthrough also enables an elegant implementation of digital signatures.

In a classic cryptosystem, we have encryption functions E_K and decryption
functions D_K such that D_K(E_K(P)) = P for any plaintext P. In a public-key
cryptosystem, E_K can be easily computed from some ``public key'' X which in turn is
computed from K. X is published, so that anyone can encrypt messages. If decryption D_K
cannot be easily computed from public key X without knowledge of private key K, but
readily with knowledge of K, then only the person who generated K can decrypt messages.
That's the essence of public-key cryptography, introduced by Diffie and Hellman in 1976.

Role of the session key in public key schemes


In virtually all public key systems, the encryption and decryption times are very lengthy
compared to other block-oriented algorithms such as DES for equivalent data sizes. Therefore in
most implementations of public-key systems, a temporary, random `session key' of much smaller
length than the message is generated for each message and alone encrypted by the public key
algorithm. The message is actually encrypted using a faster secret-key (symmetric) algorithm
with the session key. At the receiver side, the session key is decrypted using the public-key
algorithm and the recovered `plaintext' key is used to decrypt the message.

The session key approach blurs the distinction between `keys' and `messages': in the
scheme, the message includes the key, and the key itself is treated as an encryptable `message'.
Under this dual-encryption approach, the overall cryptographic strength is limited by the
weaker of the public-key and secret-key algorithms.

How Public Key Cryptography (PKC) Works


The security of the standard Public Key Cryptography (PKC) algorithm RSA is founded on
the mathematical difficulty of finding two prime factors of a very large number.
Historically, most encryption systems depended on a secret key that two or more parties used
to decrypt information encrypted by a commonly agreed method. The main idea of PKC is the
use of two unique keys for each participant, with a bi-directional encryption mechanism that can
use either key to decrypt information encrypted with the other key, as described below:
Public key: One of the keys allocated to each person is called the "public key", and is
published in an open directory somewhere where anyone can easily look it up, for example by
email address.

Private key: Each person keeps their other key secret, which is then called their "private key".
If John wants to send an encrypted email to Mary, he encrypts his message with Mary's public
key, and then sends it to her. He doesn't need to be worried about interception or eavesdropping
since the only person that can read the message is Mary, because she is the only one that has the
corresponding private key that can decrypt it. This powerful architecture has three profound
consequences:

Geography: The sender and the recipient no longer need to meet or use some other potentially
insecure method to exchange a common secret key. Since everyone has their own set of keys,
then anyone can securely communicate with anyone else by first looking up their public key and
using that to encrypt the message, enabling secure communication even across great distances
over a network (like the Internet).

Digital signatures: A sender can digitally sign a message by encrypting their name (or
some other meaningful document) with their private key and then attaching it to the message.
The recipient can verify that the message came from the sender by decrypting the signature
with the sender's public key. If the decryption works and produces a readable signature, then
the message came from the sender, because only the sender could have encrypted the signature
with the private key in the first place.

Security: The disclosure of a key doesn't compromise all of the communications on a network,
since disclosure of public keys is intended, and only messages sent to one person are affected by
the disclosure of a private key.

Details: Both the RSA and Cocks algorithms are based on a mathematical expression built on
the multiplication of two large prime numbers (a prime number is a number divisible only by
1 and itself). For example, the following numbers are each the product of two
prime numbers:
Product   Primes
15        3 x 5
77        7 x 11
221       13 x 17
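For products this small, the two primes are easy to recover by trial division, as the following sketch shows. The function name `smallest_factor_pair` is made up for the example, and the approach becomes hopeless long before numbers reach the size used for real keys.

```python
def smallest_factor_pair(n: int) -> tuple[int, int]:
    """Factor n by trial division, returning (p, q) with p the smallest factor > 1."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    return 1, n  # no divisor found: n itself is prime


for product in (15, 77, 221):
    print(product, "=", "%d x %d" % smallest_factor_pair(product))
```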

While the RSA and Cocks algorithms are similar, RSA is described in the following because it is
the more general case and was published first. Essentially, the public key is the product of two
randomly selected large prime numbers, and the secret key is the two primes themselves. The
algorithm encrypts data using the product, and decrypts it with the two primes, and vice versa.
A mathematical description of the encryption and decryption expressions is shown below:
Encryption: C = M^e (modulo n)
Decryption: M = C^d (modulo n)
Where:
M = the plain-text message expressed as an integer number.
C = the encrypted message expressed as an integer number.
n = the product of two randomly selected, large primes p and q.
d = a large, random integer relatively prime to (p-1)*(q-1).
e = the multiplicative inverse of d, that is:
( e * d ) = 1 ( modulo ( p - 1 ) * ( q - 1 ) )
The public key is the pair of numbers ( n, e ).
The private key is the pair of numbers ( n, d ).
This algorithm is secure because of the great mathematical difficulty of finding the two
prime factors of a large number, and of finding the private key d from the public key (n, e).
This is difficult because no efficient method of finding the two prime factors of a large
number is publicly known. The simplest method, checking all the possibilities one by one,
isn't practical because there are so many prime numbers. For
example, a 128 bit public key would be a number between 1 and
340,282,366,920,938,000,000,000,000,000,000,000,000

Now, first Euclid proved that there are an infinite number of primes. Then, the work of
Legendre, Gauss, Littlewood, Te Riele, Tchebycheff, Sylvester, Hadamard, de la Vallée Poussin,
Atle Selberg, Paul Erdös, Hardy, Wright, and Von Koch showed that the number of prime
numbers between one and n is approximately n / ln(n). Therefore, there are about
2^128 / ln(2^128) = 3,835,341,275,459,350,000,000,000,000,000,000,000
different prime numbers below 2^128. That means that even with enough computing power to
check one trillion of these numbers a second, it would take more than 121,617,874,031,562,000
years to check them all. That's about 10 million times longer than the universe has existed so far.
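The estimate above can be reproduced in a few lines. The only assumption added here is a year of 365 days; the rate of one trillion checks per second comes from the text.

```python
import math

key_bits = 128
upper = 2 ** key_bits
primes_below = upper / math.log(upper)     # prime number theorem estimate n / ln(n)

checks_per_second = 1e12                   # "one trillion" from the text
seconds_per_year = 365 * 24 * 3600         # assumed 365-day year
years = primes_below / checks_per_second / seconds_per_year

print(f"2^{key_bits}        ~ {upper:.4e}")
print(f"primes below ~ {primes_below:.4e}")
print(f"years needed ~ {years:.4e}")
```

The result, on the order of 10^17 years, concerns only this one-by-one search; it says nothing about faster factoring methods.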

Therefore, unless someone makes a very large and unexpected mathematical
breakthrough, it's practically impossible to find out the private key from a public key with RSA
encryption, making it one of the most secure methods ever invented. However, please note that
like almost all encryption systems, the RSA algorithm is still vulnerable to plain-text attacks,
when a third party can repeatedly choose (or otherwise knows) some of the text to be encrypted
and can examine the result. In addition, the promised development of quantum computers over
the next several decades that can effectively perform many calculations simultaneously may be
able to break the RSA algorithm relatively quickly.

Encrypting email is the only way to guarantee its confidentiality in transit. The most widely
used method of email encryption uses Pretty Good Privacy, which integrates directly with your
email application.
PRETTY GOOD PRIVACY (PGP)

PGP is a program that gives your electronic mail something that it otherwise doesn't have:
Privacy. It does this by encrypting your mail so that nobody but the intended person can read it.
When encrypted, the message looks like a meaningless jumble of random characters. PGP has
proven itself quite capable of resisting even the most sophisticated forms of analysis aimed at
reading the encrypted text. PGP can also be used to apply a digital signature to a message
without encrypting it. This is normally used in public postings where you don't want to hide
what you are saying, but rather want to allow others to confirm that the message actually came
from you. Once a digital signature is created, it is impossible for anyone to modify either the
message or the signature without the modification being detected by PGP.
While PGP is easy to use, it does give you enough rope so that you can hang yourself. You
should become thoroughly familiar with the various options in PGP before using it to send
serious messages. For example, giving the command "PGP -sat <filename>" will only sign a
message, it will not encrypt it. Even though the output looks like it is encrypted, it really isn't.
Anybody in the world would be able to recover the original text.
PGP provides a confidentiality and authentication service that can be used for electronic
mail and file storage applications. It is available free worldwide in versions that run on a
variety of platforms, including Windows, Unix, Macintosh, and many more. In addition, the
commercial version satisfies users who want a product that comes with vendor support.

Operational Description
The actual operation of PGP consists of five services.
1. Authentication
2. Confidentiality
3. Compression
4. E-Mail Compatibility
5. Segmentation
Authentication

1. The sender creates a message.
2. SHA-1 is used to generate a 160-bit hash code of the message.
3. The hash code is encrypted with RSA using the sender’s private key and the result is
prepended to the message.
4. The receiver uses RSA with the sender’s public key to decrypt and recover the hash code.
5. The receiver generates a new hash code for the message and compares it with the decrypted
hash code. If the two match, the message is accepted as authentic.

The combination of SHA-1 and RSA provides an effective digital signature scheme.
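The authentication steps can be sketched with Python's `hashlib` and plain modular arithmetic. This is not how PGP itself is implemented: the key below is a toy built from two Mersenne primes (an assumption made only to obtain a modulus larger than a 160-bit hash without extra libraries), and no signature padding is applied.

```python
import hashlib
from math import gcd

# Toy key built from two Mersenne primes; fine for a demo, useless for real security.
p, q = 2**89 - 1, 2**107 - 1
n, phi = p * q, (p - 1) * (q - 1)
e = 65537
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # sender's private exponent (Python 3.8+)


def sha1_int(message: bytes) -> int:
    """Return the 160-bit SHA-1 hash code of the message as an integer."""
    return int.from_bytes(hashlib.sha1(message).digest(), "big")


def sign(message: bytes) -> int:
    """Transform the hash code with the sender's private key."""
    return pow(sha1_int(message), d, n)


def verify(message: bytes, signature: int) -> bool:
    """Recover the hash code with the public key and compare it with a fresh hash."""
    return pow(signature, e, n) == sha1_int(message)


msg = b"Meet at noon."
sig = sign(msg)
print(verify(msg, sig), verify(b"Meet at one.", sig))
```

The second check fails because changing the message changes its hash code, which no longer matches the value recovered from the signature.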

Confidentiality
Another basic service provided by PGP is confidentiality, which is provided by encrypting
messages to be transmitted or to be stored locally as files. In both cases, the symmetric
encryption algorithm CAST-128 may be used. Alternatively, IDEA or 3DES may be used. The
64-bit cipher feedback (CFB) mode is used.

In PGP, each symmetric key is used only once, i.e. a new key is generated as a random 128-bit
number for each message. Thus, although this is referred to in the documentation as a session
key, it is in reality a one-time key, because it is used only once. The session key is
bound to the message and transmitted with it. To protect the key, it is encrypted with the
receiver’s public key.

1. The sender generates a message and a random 128-bit number to be used as a session key
for this message only.
2. The message is encrypted using CAST-128 (or IDEA or 3DES) with the session key.
3. The session key is encrypted with RSA, using the recipient's public key, and prepended
to the message.
4. The receiver uses RSA with its private key to decrypt and recover the session key.
5. The session key is used to decrypt the message.
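
The five confidentiality steps can be sketched as follows. Two things here are assumptions made
only so the example runs with the Python standard library: the recipient's key is a toy,
unpadded RSA key built from two Mersenne primes, and the symmetric cipher is a simple SHA-256
keystream XOR standing in for CAST-128 (it is not CAST-128, and it is not suitable for real
use). The function names are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple

# Toy RSA key for the recipient (assumption, illustration only).
P = 2**89 - 1
Q = 2**107 - 1
N = P * Q                              # public modulus, about 196 bits
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in symmetric cipher (NOT CAST-128): XOR with a SHA-256 keystream.

    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def encrypt(message: bytes) -> Tuple[int, bytes]:
    session_key = secrets.token_bytes(16)                    # step 1: 128-bit key
    ciphertext = keystream_xor(session_key, message)         # step 2
    wrapped = pow(int.from_bytes(session_key, "big"), E, N)  # step 3
    return wrapped, ciphertext


def decrypt(wrapped: int, ciphertext: bytes) -> bytes:
    session_key = pow(wrapped, D, N).to_bytes(16, "big")     # step 4
    return keystream_xor(session_key, ciphertext)            # step 5


wrapped_key, ciphertext = encrypt(b"The meeting is moved to Friday.")
print(decrypt(wrapped_key, ciphertext))
```

Because a fresh session key is drawn for every call to encrypt, the same plaintext sent twice
yields two different ciphertexts.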

Compression
As a default, PGP compresses the message after applying the signature but before
encryption. In the notation used to describe PGP's operation, Z denotes compression and
Z inverse denotes decompression.

1. The signature is generated before compression for two reasons:
a. It is preferable to sign an uncompressed message so that one can store only the
uncompressed message together with the signature for future verification. If one signed a
compressed document, then it would be necessary either to store a compressed version of the
message for later verification or to recompress the message when verification is required.
b. Even if one were willing to generate dynamically a recompressed message for
verification, PGP's compression algorithm presents a difficulty. The algorithm is not
deterministic; various implementations of the algorithm achieve different tradeoffs in running
speed versus compression ratio and, as a result, produce different compressed
forms. However, these different compression algorithms are interoperable because any
version of the algorithm can correctly decompress the output of any other version. Applying
the hash function and signature after compression would constrain all PGP implementations
to the same version of the compression algorithm.
2. Message encryption is applied after compression to strengthen cryptographic security.
Because the compressed message has less redundancy than the original plaintext,
cryptanalysis is more difficult.
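
The point about differing compressed forms can be seen with a short experiment. The sketch
below uses Python's zlib module as a stand-in for PGP's compression step (an assumption; it is
not PGP's own code) and treats two compression levels as if they were two different
implementations.

```python
import zlib

message = b"PGP signs first, then compresses, then encrypts. " * 20

# Two "implementations" that trade speed against compression ratio.
fast = zlib.compress(message, 1)
best = zlib.compress(message, 9)

# The compressed forms differ, so a signature computed over the compressed
# bytes would tie every implementation to one exact compressor.
print(fast == best)

# Either form decompresses to the same plaintext, so a signature computed
# over the uncompressed message stays verifiable.
print(zlib.decompress(fast) == message and zlib.decompress(best) == message)

# Compression also removes redundancy before encryption.
print(len(message), len(best))
```

The first comparison is expected to be false and the second true, which is why the hash code is
taken over the uncompressed message.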

E-Mail Compatibility
When PGP is used, at least part of the block to be transmitted is encrypted. If only the
signature service is used, then the message digest is encrypted. If the confidentiality service is
used, the message plus signature are encrypted. Thus, part or all of the resulting block consists
of a stream of arbitrary 8-bit octets. However, many electronic mail systems only permit the use
of blocks consisting of ASCII text. To accommodate this restriction, PGP
provides the service of converting the raw 8-bit binary stream to a stream of printable ASCII
characters. The scheme used for this purpose is radix-64 conversion. Each group of three octets of
binary data is mapped into four ASCII characters. This format also appends a CRC to detect
transmission errors.
The use of radix 64 expands a message by 33%. Fortunately, the session key
and signature portions of the message are relatively compact, and the plaintext message has been
compressed. In fact, the compression should be more than enough to compensate for the radix-64
expansion.
One noteworthy aspect of the radix-64 algorithm is that it blindly converts the input stream to
radix-64 format regardless of content, even if the input happens to be ASCII text. Thus, if a
message is signed but not encrypted and the conversion is applied to the entire block, the
output is unreadable to the casual observer, which provides a certain level of confidentiality.
As an option, PGP can be configured to convert to radix-64 format only the signature portion of
signed plaintext messages. This enables the human recipient to read the message without using
PGP.
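
The three-octets-to-four-characters mapping and the resulting 33% expansion can be checked with
Python's base64 module, which implements the same radix-64 alphabet. This is a sketch of the
conversion only; the CRC that PGP appends is left out, and the sample data is made up.

```python
import base64

# Three octets become four printable ASCII characters.
print(base64.b64encode(b"abc"))          # b'YWJj'

# Arbitrary 8-bit data: 240 octets become 320 characters (a 33% expansion).
binary = bytes(range(240))
armored = base64.b64encode(binary)
print(len(binary), len(armored))         # 240 320
print(armored.isascii())                 # True

# The conversion is blind: plain ASCII text is converted as well and is no
# longer readable to a casual observer, although anyone can reverse it.
hidden = base64.b64encode(b"Signed but not encrypted")
print(hidden)
print(base64.b64decode(hidden))          # b'Signed but not encrypted'
```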

Segmentation

E-mail facilities often are restricted to a maximum message length. For example, many of the
facilities accessible through the Internet impose a maximum length of 50,000 octets. Any
message longer than that must be broken up into smaller segments, each of which is mailed
separately.
To accommodate this restriction, PGP automatically subdivides a message that is too
large into segments that are small enough to send via e-mail. The segmentation is done after all
of the other processing, including the radix-64 conversion. Thus, the session key component and
signature component appear only once, at the beginning of the first segment. At the receiving
end, PGP must strip off all e-mail headers and reassemble the entire original block.
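
Segmentation and reassembly amount to slicing the finished block and joining the pieces again.
A minimal sketch, assuming the 50,000-octet limit mentioned above and ignoring the e-mail
headers that a real receiver would have to strip; the function names are hypothetical.

```python
from typing import List

MAX_SEGMENT = 50_000  # maximum octets per e-mail, taken from the example above


def segment(block: bytes, limit: int = MAX_SEGMENT) -> List[bytes]:
    """Split the fully processed block into pieces no longer than limit."""
    return [block[i:i + limit] for i in range(0, len(block), limit)]


def reassemble(segments: List[bytes]) -> bytes:
    """Receiving end: join the segments back into the original block."""
    return b"".join(segments)


block = b"A" * 120_000
parts = segment(block)
print([len(p) for p in parts])       # [50000, 50000, 20000]
print(reassemble(parts) == block)    # True
```

Because the block is split only after all other processing, the session key and signature sit at
the front of the first piece and nowhere else.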

KEY FEATURES OF EMAIL

• Email Is A Push Technology

• Email Waits For You

• Email Is One-To-Many
• Email Is Almost Free.

How to Choose a Good Password

Do not use
1. Names:
a. of yourself, including nicknames;
b. of your spouse or significant other, of your parents, children, siblings, pets, or
other family members;
c. of fictional characters, especially ones from fantasy or sci-fi stories like the Lord
of the Rings or Star Trek;
d. of any place or proper noun;
e. of computers or computer systems;
f. any combination of any of the above.
2. Numbers, including:
a. your phone number;
b. your social security number;
c. anyone's birthday;
d. your driver's licence number or licence plate;
e. your room number or address;
f. any common number like 3.1415926 or 1.618034;
g. any series such as 1248163264;
h. any combination of any of the above.
3. Any username in any form, including:
a. capitalized (Joeuser);
b. doubled (joeuserJoeuser);
c. reversed (resueoJ);
d. reflected (joeuserResueoj);
e. with numbers or symbols appended (Joeuser!).
4. Any word in any dictionary in any language in any form.
5. Any word you think isn't in a dictionary, including:
a. any slang word or obscenity;
b. any technical term or jargon (BartleMUD, microfortnight, Oobleck).
6. Any common phrase:
a. ``Go ahead, make my day.''
b. ``Brother, can you spare a dime?''
c. ``1 fish, 2 fish, red fish, blue fish.''
7. Simple patterns, including:
a. passwords of all the same letter;
b. simple keyboard patterns (qwerty, asdfjkl);
c. anything that someone might easily recognize if they see you typing it.
8. Any information about you that is easily obtainable:
a. favorite color;
b. favorite rock group.
9. Any object that is in your field of vision at your workstation.
10. Any password that you have used in the past.

There are programs (and they are easy to write) which will crack passwords that are based on the
above.
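
To show how little effort such a program needs, here is a minimal sketch that recognises the
username-based passwords from item 3 of the list. The helper names and the set of trailing
characters that are stripped are assumptions made for this example; a real cracking tool would
try many more transformations and word lists.

```python
def username_variants(username: str) -> set:
    """Build the simple username variants listed in item 3."""
    lower = username.lower()
    cap = lower.capitalize()
    return {
        lower,                              # joeuser
        cap,                                # capitalized: Joeuser
        lower + cap,                        # doubled: joeuserJoeuser
        cap[::-1],                          # reversed: resueoJ
        lower + lower[::-1].capitalize(),   # reflected: joeuserResueoj
    }


def is_username_based(password: str, username: str) -> bool:
    """Return True if the password is a variant with digits/symbols appended."""
    stripped = password.rstrip("!@#$%^&*0123456789")
    return stripped in username_variants(username)


for candidate in ["Joeuser!", "resueoJ", "joeuserResueoj", "HmPwaCciaCccP?"]:
    print(candidate, is_username_based(candidate, "joeuser"))
```

Only the last candidate, which is not derived from the username, passes this check.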

Do
1. Change your password every three to six months. Changing once every term should be
considered an absolute minimum frequency.
2. Use both upper and lower case letters.
3. Use numbers and special symbols (!@#$) with letters.
4. Create simple mnemonics (memory aids) or compounds that are easily remembered, yet
hard to decipher:
a. ``3laR2s2uaPA$$WDS!'' for ``Three-letter acronyms are too short to use as
passwords!''
b. ``IwadaSn,atCwt2bmP,btc't.'' for ``It was a dark and stormy night, and the
crackers were trying to break my password, but they couldn't.''
c. ``HmPwaCciaCccP?'' for ``How many passwords would a cracker crack if a
cracker could crack passwords?''
5. Use two or more words together (Yet_Another_Example).
6. Use misspelled words (WhutdooUmeenIkan'tSpel?).
7. Use a minimum of eight characters. You may use up to 255 characters on Athena, and
generally the longer the password, the more secure it is.
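
Some of the rules above are mechanical enough to check automatically. The sketch below tests
only length, mixed case, and the presence of numbers and special symbols; the function name and
messages are hypothetical, and passing these checks does not show that a password avoids
dictionary words or the other weaknesses listed earlier.

```python
import string


def check_password(password: str) -> list:
    """Return the list of mechanical rules that the password breaks."""
    problems = []
    if len(password) < 8:
        problems.append("shorter than eight characters")
    if not (any(c.islower() for c in password)
            and any(c.isupper() for c in password)):
        problems.append("needs both upper and lower case letters")
    if not (any(c.isdigit() for c in password)
            and any(c in string.punctuation for c in password)):
        problems.append("needs numbers and special symbols")
    return problems


print(check_password("3laR2s2uaPA$$WDS!"))   # []
print(check_password("qwerty"))
```

An empty list means that no mechanical rule was broken.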

Never!
Finally, NEVER write your password down anywhere, nor share your password with anyone,
including your best friend, your academic advisor, or an on-line consultant!

CONCLUSION

E-mail nowadays is a security hazard. Many viruses and worms use e-mail to spread themselves
throughout the Internet, and almost every day new types of worms and viruses appear. It is of
vital importance for administrators and users to keep mail security up to date.
There are three steps of filtering that every mail should be subjected to: attachment filters,
virus filters, and spam filters.
You should encrypt your e-mail for the same reason that you don't write all of your
correspondence on the back of a postcard. E-mail is actually far less secure than the postal
system. With the post office, you at least put your letter inside an envelope to hide it from casual
snooping.

