
C.1. Creating the web (8 hours)

- Practical activities linked to developing different types of web pages; be able to evaluate when a particular type of web page is most appropriate.

C.1.1 Distinguish between the Internet and World Wide Web (web)

INTERNET = the entire global network of interconnected computers and routers used for sending data.
WORLD WIDE WEB (WEB) = WWW = the system of interlinked web pages and websites accessed over the Internet.

C.1.2 Describe how the web is constantly evolving

- Differences between:
  o The early forms of the web
  o Web 2.0
  o The semantic web
  o Later developments
- Possibilities and limitations associated with the evolution of the web.

EARLY WEB (WEB 1.0)

http://nostalgiacafe.proboards.com/thread/133/1990s-internet-world-wide-web
http://www.businessinsider.com/big-brands-90s-websites-look-terrible-2013-4?op=1
= simple websites that distribute information.
NETSCAPE NAVIGATOR was a proprietary web browser of the WEB 1.0 era.
WEB 1.0 = a library (you can use it as a source of information, but you can't contribute to or change the information in any way).
WEB 2.0

FIREFOX follows the Web 2.0 philosophy.

- More interactive (user-produced content)
- More multimedia intensive (audio, video etc.)
- Social (blogs, wikis, social networks: Facebook, Tumblr, Wikipedia, Twitter)

IS WEB 1.0 STILL USEFUL? YES.


- The Web 2.0 philosophy is to create web pages that visitors can affect or change.

E.g. 1: AMAZON allows visitors to post product reviews. Future visitors will have a chance to read them, which might influence their decision to buy the product.
BUT a RESTAURANT might have a web page that shows the current menu. While the menu might evolve over time, the webmaster wouldn't want visitors to be able to make changes.
E.g. 2: WIKIPEDIA = an online encyclopaedia that allows visitors to make changes to most articles. Ideally, with enough people contributing to Wikipedia entries, the most accurate and relevant information about every subject will eventually be part of each article.
Unfortunately, because anyone can change entries, it's possible for someone to post false or misleading information.
WEB 3.0 (SEMANTIC WEB)

= all data on the web is interconnected, like a super-database.

Need to see a movie and eat at a good Italian restaurant?
How many web pages would you have to surf through to find a solution?
Use WEB 3.0.
Type: "I want to see a movie and eat at a good Italian restaurant. What are my options?"

Web 3.0 will:
- Analyse your question
- Search the internet for all possible answers
- Organize the results for you

AND THAT'S NOT ALL:
- It will be your personal assistant: it will learn what you are interested in (the more you use the web, the more it will record, and the less specific you will need to be with your questions).

E.g.:
Question: "Where should I go for lunch?"
The browser will:
- consult its records of what you like and dislike
- take into account your current location
- suggest a list of restaurants.

WEB 3.0 search engines could not only find the keywords in your search, but also interpret the context of your request.
SEMANTIC WEB

= proposes to help computers read and use the web. METADATA added to web pages can make the existing WWW machine-readable.
METADATA = simply machine-readable data that describes other data.

In the semantic web, metadata are invisible to people reading the page, but visible to computers.

C.1.3 Identify the characteristics of the following:

- Hypertext transfer protocol (HTTP)
- Hypertext mark-up language (HTML)
- Hypertext transfer protocol secure (HTTPS)
- Uniform resource locator (URL)
- Extensible mark-up language (XML)
- Extensible stylesheet language transformation (XSLT)
- JavaScript
- Cascading style sheet (CSS)

HTTP:

Characteristics:
- STATELESS (each transaction between the client and server is independent, and no state is set based on a previous transaction or condition)
- USES REQUESTS FROM THE CLIENT TO THE SERVER AND RESPONSES FROM THE SERVER TO THE CLIENT FOR SENDING AND RECEIVING DATA

HTTP is the protocol used to exchange or transfer hypertext.
- It functions as a REQUEST-RESPONSE protocol in the CLIENT-SERVER computing model. A web browser, for example, may be the client and an application running on the computer hosting a website may be the server.
- The client submits an HTTP request message to the server.
- The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client.
- The response contains completion status information about the request and may also contain requested content in its message body.
- HTTP is an application-layer protocol designed within the framework of the Internet Protocol Suite.
HTTP SESSION:

An HTTP session is a sequence of network request-response transactions.
- An HTTP client initiates a request by establishing a TCP connection to a particular port on a server.
- An HTTP server listening on that port waits for a client's request message.
- Upon receiving the request, the server sends back a status line and a message of its own. The body of this message is typically the requested resource, although an error message or other information may also be returned.
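As an illustration (not part of the syllabus notes), here is a minimal sketch of one request-response transaction using Python's standard http.client module; the host example.com is just a placeholder:

```python
import http.client

# The client establishes a TCP connection to a particular port (80) on the server.
conn = http.client.HTTPConnection("example.com", 80)

# The client submits an HTTP request message to the server.
conn.request("GET", "/index.html")

# The server sends back a status line and a message of its own;
# the body is typically the requested resource.
response = conn.getresponse()
print(response.status, response.reason)   # e.g. 200 OK
print(response.read()[:200])              # first bytes of the returned HTML

conn.close()
```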

HTTPS:

= a communications protocol for secure communication over a computer network.
- It is the result of layering HTTP on top of the SSL (SECURE SOCKETS LAYER) / TLS (TRANSPORT LAYER SECURITY) protocol, thus adding the security capabilities of SSL/TLS to standard HTTP communications.
- The security of HTTPS is therefore that of the underlying TLS, which uses long-term public and secret keys to exchange a short-term session key to encrypt the data flow between client and server.
- HTTPS provides AUTHENTICATION of the website and associated web server that one is communicating with, to avoid a man-in-the-middle attack (an attacker who has the ability to both monitor and alter or inject messages into a communication channel).
- HTTPS provides BIDIRECTIONAL ENCRYPTION of communications between a client and a server.

HTML

= the standard markup language used to create web pages.
- A web browser can read HTML files and compose them into visible or audible web pages.

Every HTML file contains 2 main sections:
- The head (information about the document itself, such as its title)
- The body (the information to be displayed)

The browser determines how the page should be displayed based on the tags. (The tags guide the browser, and a page can look different in different browsers.)
XML

= allows the creator of a document to describe its contents by defining his/her own set of tags.
- It does not replace HTML, it enriches it.
- Like HTML, XML is made up of tagged data. But when you write an XML document, you are not confined to a predefined set of tags, because there are none. You can create any set of tags necessary to describe the data in your document. The focus is not on how that data should be formatted; the focus is on what the data is.
- Documents in XML can be generated automatically with relative ease.
- XML is a markup specification language and XML files are data.
- Unlike HTML, whose tags focus on the format of the displayed data, XML tags specify the nature of the data.
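A small sketch of the idea using Python's standard xml.etree.ElementTree module; the tag names (catalogue, book, title, year) are invented purely for illustration:

```python
import xml.etree.ElementTree as ET

# Build a document with our own tags: the tags describe what the data IS,
# not how it should be displayed.
catalogue = ET.Element("catalogue")
book = ET.SubElement(catalogue, "book")
ET.SubElement(book, "title").text = "Web Science Notes"
ET.SubElement(book, "year").text = "2015"

xml_text = ET.tostring(catalogue, encoding="unicode")
print(xml_text)   # <catalogue><book><title>Web Science Notes</title>...</catalogue>

# Any program can read the data back by tag name.
root = ET.fromstring(xml_text)
print(root.find("./book/title").text)   # Web Science Notes
```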

XSLT

The format of and relationships among XML tags are defined in a DOCUMENT TYPE DEFINITION (DTD) document. A set of XSLT rules defines the way the content of an XML document is turned into another format suitable for the current needs of a user.
- The original document is not changed; rather, a new document is created based on the content of the existing one.
- In order for a web browser to be able to apply an XSL transformation to an XML document automatically on display, an XML stylesheet processing instruction can be inserted into the XML.
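A rough sketch of applying an XSL transformation outside the browser, assuming the third-party lxml library is available; the stylesheet below simply turns a made-up list of dishes into an HTML list, and a new document is produced while the original XML stays unchanged:

```python
from lxml import etree   # assumption: lxml is installed (pip install lxml)

xml_doc = etree.XML("<menu><dish>Pizza</dish><dish>Pasta</dish></menu>")

xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/menu">
    <html><body><ul>
      <xsl:for-each select="dish">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul></body></html>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt_doc)   # compile the stylesheet
html_doc = transform(xml_doc)      # creates a NEW document; xml_doc is unchanged
print(str(html_doc))
```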

JAVASCRIPT

- Is a dynamic computer programming language.
- It is most commonly used as part of web browsers, whose implementations allow client-side scripts to interact with the user, control the browser, communicate asynchronously and alter the document content that is displayed.
- It is almost entirely object-based.
- It is prototype-based.

CSS

= a style sheet language used for describing the look and formatting of a document written in a markup language.
- It is designed primarily to enable the separation of document content from document presentation, including elements such as the layout, colours and fonts.
- This separation can improve content accessibility, provide more flexibility and control in the specification of presentation characteristics, enable multiple HTML pages to share formatting by specifying the relevant CSS in a separate CSS file, and reduce complexity and repetition in the structural content (such as the semantically insignificant tables that were widely used to format pages before consistent CSS rendering was available in all major browsers).

A STYLESHEET = a set of instructions, each of which tells a browser how to draw a particular element on a page.
CSS is the W3 Consortium standard for defining the visual presentation of web pages. The basic purpose of CSS is to allow designers to define style declarations (formatting details such as fonts, element sizes and colours) and to apply those styles to selected portions of HTML pages using selectors: references to an element or group of elements to which the style is applied.

C.1.4 Identify the characteristics of the following:

- Uniform resource identifier (URI)
- URL

URI
= a string of characters used to identify the name of a resource.
= defines all types of names and addresses that refer to objects on the World Wide Web.
A URL is an example of a URI.

C.1.5. Describe the purpose of a URL.

URL

= a specific character string that constitutes a reference to an internet resource.
- It is used to specify the web document we wish to view (it includes the protocol type, the domain name/IP address and the file location/path).

Every HTTP URL consists of:
- The scheme name (PROTOCOL)
- A colon and 2 slashes
- A host, normally given as a domain name, but sometimes as a literal IP address
- Optionally a colon followed by a port number
- The full path of the resource (file location)
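A quick sketch with Python's standard urllib.parse module showing how these parts can be picked out of a URL (the URL itself is made up):

```python
from urllib.parse import urlparse

parts = urlparse("http://www.example.com:8080/docs/page.html?lang=en")

print(parts.scheme)    # 'http'            -> the scheme name (protocol)
print(parts.hostname)  # 'www.example.com' -> the host (domain name)
print(parts.port)      # 8080              -> the optional port number
print(parts.path)      # '/docs/page.html' -> the full path of the resource
print(parts.query)     # 'lang=en'         -> optional query string
```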

C.1.6 Describe how a domain name server functions

DOMAIN NAME SERVER (DNS)

- Converts a domain name into an IP address.
- The ISP tells the computer which DNS server it should use.
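A tiny sketch of the lookup using Python's standard socket module; the operating system forwards the query to the DNS server it has been configured to use (example.com is a placeholder):

```python
import socket

# Resolve a domain name into an IP address via the configured DNS server.
ip_address = socket.gethostbyname("example.com")
print(ip_address)   # e.g. '93.184.216.34'
```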

C.1.7 Identify the characteristics of:

- internet protocol (IP)
- transmission control protocol (TCP)
- file transfer protocol (FTP)

IP = INTERNET PROTOCOL

= specifies the format of packets (also called datagrams) and the addressing scheme.
- Most networks combine IP with a higher-level protocol called TCP (Transmission Control Protocol), which establishes a virtual connection between a destination and a source.
- IP allows you to address a package and drop it into the system, but there's no direct link between you and the recipient.
- TCP/IP, on the other hand, establishes a connection between 2 hosts so that they can send messages back and forth for a period of time.
- IP provides an unreliable datagram service between hosts.
TCP = Transmission Control Protocol

= provides reliable data delivery.
- It uses IP for datagram delivery.
- It compensates for loss, delay, duplication and similar problems in Internet components.

FEATURES OF TCP:
- CONNECTION-ORIENTED: an application requests a connection to a destination and uses the connection to transfer data (IP does not use connections; each datagram is sent independently).
- POINT-TO-POINT: a TCP connection has 2 endpoints (no broadcast/multicast).
- RELIABILITY: TCP guarantees that data will be delivered without loss, duplication or transmission errors.
- FULL-DUPLEX: endpoints can exchange data in both directions simultaneously.
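A minimal sketch of these features using Python's standard socket module: the client opens one point-to-point connection and can then both send and receive over it (the host, port and request are placeholders):

```python
import socket

# Connection-oriented: the client requests a connection to one destination
# (point-to-point) before any data is transferred.
with socket.create_connection(("example.com", 80)) as sock:
    # Full-duplex: the same connection is used to send and to receive.
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    reply = sock.recv(1024)   # TCP delivers the bytes without loss or duplication
    print(reply.decode(errors="replace"))
```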

FTP = File Transfer Protocol

= the most common protocol for moving files between 2 locations.
- It uses a control connection to send commands between an FTP client and an FTP server.
- The file transfer itself is sent on a separate connection called the data connection.
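A sketch using Python's standard ftplib module; the server name, credentials and file name are placeholders. The commands travel over the control connection, while the listing and file contents arrive over a separate data connection:

```python
from ftplib import FTP

ftp = FTP("ftp.example.com")                  # control connection (port 21)
ftp.login("anonymous", "guest@example.com")

# Commands go over the control connection; the results of LIST and RETR
# are delivered over a separate data connection.
ftp.retrlines("LIST")
with open("report.txt", "wb") as f:
    ftp.retrbinary("RETR report.txt", f.write)

ftp.quit()
```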

C.1.8 Outline the different components of a web page

THE COMPONENTS OF A WEB PAGE
- Meta tags
- Title
- The banner (an area at the top of the page that is often the same on all the pages)
- The menu
- The content area
- Footer
- Corner
- Images
- Headlines/titles
- Body content
- Navigation
- Credits

C.1.9 Explain the importance of protocols and standards on the web

PROTOCOLS enable compatibility through a common language used internationally.
- The standards and protocols provide instructions for the implementation of a project; they are often implicit instructions.
- Standards are frequently the rules of a profession or field, and protocols are typically the instructions and tools.
- A standard = an agreed-upon way of doing something or measuring something.

C.1.10 Describe the different types of web page

- Personal pages, blogs, search engine pages, forums.

C.1.11 Explain the differences between a static web page and a dynamic web page.

- Include analysis of static HTML web pages and dynamic web pages, e.g. PHP, ASP.NET, Java Servlets.

STATIC WEB PAGES
- Display exactly the same information whenever anyone visits the site.
- They can include text, video and images.

DYNAMIC WEB PAGES
- Are capable of producing different content for different visitors from the same source code file, based on what OS the visitor uses, whether he/she is using a PC or a mobile device, and the source that referred the visitor.

C.1.12 Explain the functions of a browser.

BROWSER

= a software application for retrieving, presenting and traversing information resources on the WWW.
The primary purpose of a web browser is to bring information resources to the user ("retrieval" or "fetching"), allowing them to view the information ("display", "rendering"), and then access other information ("navigation", "following links").
This process begins when the user inputs a Uniform Resource Locator (URL). The prefix of the URL, the Uniform Resource Identifier or URI, determines how the URL will be interpreted.
Once the resource has been retrieved, the web browser will display it. HTML and associated content (image files, formatting information such as CSS, etc.) is passed to the browser's layout engine to be transformed from markup into an interactive document, a process known as "rendering".
Information resources may contain hyperlinks to other information resources. Each link contains the URI of a resource to go to. When a link is clicked, the browser navigates to the resource indicated by the link's target URI, and the process of bringing content to the user begins again.

C.1.13 Evaluate the use of client-side scripting and server-side scripting in web pages

There are two main ways to customise web pages and make them more interactive. The two are often used together because they do very different things.

Scripts

A script is a set of instructions. For web pages they are instructions either to the web browser (client-side scripting) or to the server (server-side scripting). These are explained in more detail below.
- Scripts provide change to a web page. Any page which changes each time you visit it (or during a visit) probably uses scripting.
All log-on systems, some menus, almost all photograph slideshows and many other pages use scripts. Google uses scripts to fill in your search term for you, to place advertisements, to find the thing you are searching for and so on. Amazon uses scripting to list products and record what you have bought.
Client-side

The client is the system on which the web browser is running. JavaScript is the main client-side scripting language for the web. Client-side scripts are interpreted by the browser. The process with client-side scripting is:
- the user requests a web page from the server
- the server finds the page and sends it to the user
- the page is displayed on the browser with any scripts running during or after display

So client-side scripting is used to make web pages change after they arrive at the browser. It is useful for making pages a bit more interesting and user-friendly. It can also provide useful gadgets such as calculators, clocks etc., but on the whole it is used for appearance and interaction.
Client-side scripts rely on the user's computer. If that computer is slow they may run slowly. They may not run at all if the browser does not understand the scripting language. As they have to run on the user's system, the code which makes up the script is there in the HTML for the user to look at (and copy or change).

Server-side

The server is where the web page and other content are kept. The server sends pages to the user/client on request. The process is:
- the user requests a web page from the server
- the script in the page is interpreted by the server, creating or changing the page content to suit the user and the occasion and/or passing data around
- the page in its final form is sent to the user and then cannot be changed using server-side scripting

The use of HTML forms or clever links allows data to be sent to the server and processed. The results may come back as a second web page.
Server-side scripting tends to be used for allowing users to have individual accounts and providing data from databases. It allows a level of privacy, personalisation and provision of information that is very powerful. E-commerce, MMORPGs and social networking sites all rely heavily on server-side scripting. PHP and ASP.NET are the two main technologies for server-side scripting.
- The script is interpreted by the server, meaning that it will always work the same way. Server-side scripts are never seen by the user (so they can't copy your code). They run on the server and generate results which are sent to the user. Running all these scripts puts a lot of load onto the server but none on the user's system.
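The notes name PHP and ASP.NET as the main server-side technologies; purely to illustrate the idea, here is a minimal sketch in Python using the standard http.server module, where the page is generated on the server for each request and only the finished HTML reaches the client:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from datetime import datetime

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The "script" runs on the server: the HTML is generated per request,
        # so different visitors (or visits) can receive different content.
        page = (f"<html><body><h1>Hello</h1>"
                f"<p>Server time: {datetime.now()}</p></body></html>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(page.encode())   # only the final HTML is sent to the client

HTTPServer(("localhost", 8000), Handler).serve_forever()
```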

The combination

A site such as Google, Amazon or Facebook will use both types of scripting:
- server-side handles logging in, personal information and preferences, and provides the specific data which the user wants (and allows new data to be stored)
- client-side makes the page interactive, displaying or sorting data in different ways if the user asks for that by clicking on elements with event triggers

C.1.14 Describe how web pages can be connected to underlying data sources

C.1.15 Describe the function of the common gateway interface (CGI)

CGI is the part of the Web server that can communicate with other programs
running on the server. With CGI, the Web server can call up a program, while
passing user-specific data to the program (such as what host the user is
connecting from, or input the user has supplied using HTML form syntax). The
program then processes that data and the server passes the program's response
back to the Web browser.

The common gateway interface (CGI) is a standard way for a Web server to pass
a Web user's request to an application program and to receive data back to
forward to the user.

1. The Web surfer fills out a form and clicks "Submit". The information in the form is sent over the Internet to the Web server.
2. The Web server grabs the information from the form and passes it to the CGI software. The CGI software performs whatever validation of this information is required. For instance, it might check to see if an e-mail address is valid. If this is a database program, the CGI software prepares a database statement to add, edit, or delete information from the database.
3. The CGI software then executes the prepared database statement, which is passed to the database driver.
4. The database driver acts as a middleman and performs the requested action on the database itself.
5. The results of the database action are then passed back to the database driver.
6. The database driver sends the information from the database to the CGI software.
7. The CGI software takes the information from the database and manipulates it into the format that is desired.
8. If any static HTML pages need to be created, the CGI program accesses the Web server computer's file system and reads, writes, and/or edits files.
9. The CGI software then sends the result it wants the Web surfer's browser to see back to the Web server.
10. The Web server sends the result it got from the CGI software back to the Web surfer's browser.
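A minimal sketch of a CGI program written in Python, using the legacy standard cgi module (available up to Python 3.12); in a real deployment it would sit in the server's cgi-bin directory, and the form field name 'email' is invented for illustration:

```python
#!/usr/bin/env python3
import cgi

# The web server passes the form data (plus environment details such as the
# client's host) to this program through the CGI interface.
form = cgi.FieldStorage()
email = form.getvalue("email", "")   # hypothetical form field

# ... here the program could validate the address and query a database ...

# Whatever is printed is handed back to the web server, which forwards it
# to the web surfer's browser.
print("Content-Type: text/html")
print()
print(f"<html><body><p>Thank you, we received: {email}</p></body></html>")
```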

C.1.16 Evaluate the structure of different types of web pages.

C.2. Searching the web (6 hours)

C.2.1 Define the term search engine.

= a program that searches for and identifies items in a database that correspond to keywords or characters specified by the user, used especially for finding particular sites on the World Wide Web.

C.2.2. Distinguish between surface web and deep web.

DEEP WEB = INVISIBLE WEB = HIDDEN WEB

= that portion of World Wide Web content that is not indexed by standard search engines.
The deep web consists of data that you won't locate with a simple Google search. No one really knows how big the deep web really is, but it's hundreds (or perhaps even thousands) of times bigger than the surface web. This data isn't necessarily hidden on purpose; it's just hard for current search engine technology to find and make sense of it.
The surface web consists of data that search engines can find and then offer up in response to your queries.
But in the same way that only the tip of an iceberg is visible to observers, a traditional search engine sees only a small amount of the information that's available: a measly 0.03 percent.

C.2.3. Outline the principles of searching algorithms used by search engines.

- PageRank
- HITS algorithm
PAGERANK

PageRank is one of the methods Google uses to determine a page's relevance or importance.
In short, PageRank is a vote, by all the other pages on the web, about how important a page is. A link to a page counts as a vote of support. If there's no link there's no support (but it's an abstention from voting rather than a vote against the page).
Quoting from the original Google paper, PageRank is defined like this:
"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web."
...but that's not too helpful, so let's break it down into sections:
1. PR(Tn) - Each page has a notion of its own self-importance. That's PR(T1) for the first page in the web all the way up to PR(Tn) for the last page.
2. C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is C(T1), C(Tn) for page n, and so on for all pages.
3. PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page n, the share of the vote page A will get is PR(Tn)/C(Tn).
4. d(...) - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is damped down by multiplying it by 0.85 (the factor d).
5. (1 - d) - The (1 - d) bit at the beginning is a bit of probability maths magic so that "the sum of all web pages' PageRanks will be one": it adds in the bit lost by the d(...). It also means that if a page has no links to it (no backlinks) it will still get a small PR of 0.15 (i.e. 1 - 0.85). (Aside: the Google paper says "the sum of all pages" but they mean the normalised sum, otherwise known as the average to you and me.)

How is PageRank Calculated?

This is where it gets tricky. The PR of each page depends on the PR of the pages pointing to it. But we won't know what PR those pages have until the pages pointing to them have their PR calculated, and so on... And when you consider that page links can form circles, it seems impossible to do this calculation!
But actually it's not that bad. Remember this bit of the Google paper:
"PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web."
What that means to us is that we can just go ahead and calculate a page's PR without knowing the final value of the PR of the other pages. That seems strange but, basically, each time we run the calculation we're getting a closer estimate of the final value. So all we need to do is remember each value we calculate and repeat the calculations lots of times until the numbers stop changing much.
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
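A minimal sketch of the iterative calculation in Python on a tiny made-up link graph of three pages; repeating the formula makes each PR value settle on its final value, as described above:

```python
# Tiny made-up web: each page lists the pages it links OUT to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                               # damping factor
pr = {page: 1.0 for page in links}     # initial guess for every page

for _ in range(50):                    # repeat until the numbers stop changing much
    new_pr = {}
    for page in links:
        # Sum the shares PR(T)/C(T) from every page T that links to 'page'.
        incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)   # approx {'A': 1.16, 'B': 0.64, 'C': 1.19}; the average is 1
```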

HITS (hyperlink-induced topic search)

http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html
= an algorithm that makes use of the link structure of the web in order to discover and rank pages relevant for a particular topic.
Suppose you want to search for the "best automobile makers in the last 4 years". When you ask a search engine this question, it will count all occurrences of the given words in a given set of documents. The results might be different from what you expected (think of a dictionary, or of a web page that repeats the phrase "automobile maker = car manufacturer" one billion times: this web page would be the first displayed by the query engine).
So a different ranking system is needed in order to find those pages that are authoritative for a given query. Page i is called an authority for the query "automobile makers" if it contains valuable information on the subject (these are the pages truly relevant to the given query). To support the system, another category of web pages relevant to the process of finding the authoritative pages is defined, called HUBS. The hubs' role is to advertise the authoritative pages. They contain useful links towards the authoritative pages; they point the search engine in the right direction.
For a better understanding: the authoritative pages could be the official sites of car manufacturers, e.g. www.bmw.com, and a hub could be a blog where people discuss the cars they purchased, or a page that contains rankings of the cars (recommending the official manufacturers' sites).

The HITS algorithm identifies good authorities and hubs for a topic by assigning two numbers to a page: an authority weight and a hub weight. The weights are defined recursively. A higher authority weight occurs if the page is pointed to by pages with high hub weights. A higher hub weight occurs if the page points to many pages with high authority weights.
A good hub increases the authority weight of the pages it points to. A good authority increases the hub weight of the pages that point to it. The idea is then to apply the two operations above alternately until equilibrium values for the hub and authority weights are reached.
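A rough sketch of the alternating updates in Python on a tiny made-up graph; each pass recomputes authority weights from hub weights and hub weights from authority weights, then normalises, until the values reach equilibrium:

```python
# Tiny made-up graph: page -> pages it links to.
links = {
    "blog":  ["bmw", "audi"],   # a hub: it links to the authorities
    "forum": ["bmw"],
    "bmw":   [],
    "audi":  [],
}
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority weight: sum of the hub weights of pages pointing to this page.
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # Hub weight: sum of the authority weights of the pages this page points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # Normalise so the weights do not grow without bound.
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(max(auth, key=auth.get))   # 'bmw'  -> the strongest authority
print(max(hub, key=hub.get))     # 'blog' -> the best hub
```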

C.2.4. Describe how a web crawler functions.

http://blog.woorank.com/2014/07/mp-explain-me-how-a-crawler-workslike-im-five/
http://www.google.com/insidesearch/howsearchworks/crawlingindexing.html
http://en.wikipedia.org/wiki/Web_crawler

Web crawlers = bots = web spiders = web robots
= computer programs that scan the web, reading everything they find. The crawlers scan web pages to see what words they contain, and where those words are used. The crawler turns its findings into a giant index. The index is basically a big list of words and the web pages that feature them. This information is indexed by the search engine.

A WEB CRAWLER:
- creates a copy of every web page it visits (for later indexing by the search engine)
- usually starts at a popular site
- searches a page for links to other pages
- follows these links and repeats the process
- initially looks for the file robots.txt for instructions on pages to ignore (duplicate content, irrelevant pages)
- can be used to retrieve email addresses (for spam)
- can be used by webmasters for checking the integrity of a site
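A stripped-down sketch of the process in Python using only the standard library (robots.txt handling, politeness delays and most error handling are omitted; the starting URL is a placeholder):

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    to_visit, seen, copies = [start_url], set(), {}
    while to_visit and len(copies) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode(errors="replace")
        except Exception:
            continue                    # skip pages that cannot be fetched
        copies[url] = html              # keep a copy for later indexing
        collector = LinkCollector()
        collector.feed(html)
        # Follow the links found on this page and repeat the process.
        to_visit.extend(urljoin(url, link) for link in collector.links)
    return copies

pages = crawl("http://example.com/")    # placeholder starting site
print(list(pages))
```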

C.2.5. Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler

- THIS IS NOT ALWAYS A TRANSITIVE RELATIONSHIP (a transitive relationship means that if a piece of data in a meta-tag, named A, is related to another piece of data in a meta-tag, named B, and B is in turn related to another piece of data C, then A is also related to C).

Some spiders pay more attention to words occurring in:
- titles
- sub-titles
- meta tags
Others will index every word.

Meta tags:
- are inserted by the web designer/owner
- contain keywords and concepts (helps to clarify meaning)
- the description/title can be shown in the search results
- "noindex, nofollow" in the robots meta tag can instruct crawlers not to index pages

Students should understand that keywords can be misleading.
Data may not always have the intended meaning.
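A small sketch of how a crawler might read the meta tags of a page using Python's standard html.parser module; the HTML snippet and its keywords are invented:

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Records the content of every <meta name="..."> tag, as a crawler might."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content", "")

html = """<html><head>
<meta name="keywords" content="web science, crawler, indexing">
<meta name="description" content="Revision notes on searching the web">
<meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(html)
print(reader.meta["keywords"])   # keywords supplied by the page owner
print(reader.meta["robots"])     # 'noindex, nofollow' tells crawlers to skip the page
```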

C.2.6. Discuss the use of parallel web crawling.

A parallel crawler is a crawler that runs multiple processes in parallel.
The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
The expansion of the web has led to new search engine initiatives which include parallelization of web crawlers.
Parallel web crawlers are designed to:
- maximize performance
- minimize overheads
- avoid duplication
- communicate with each other (to avoid the duplication above)
- work on different geographical areas

C.2.7. Outline the purpose of web-indexing in search engines.

The purpose of storing an index is to optimize speed and performance in
finding relevant documents for a search query. Without an index, the search
engine would scan every document in the corpus, which would require
considerable time and computing power. For example, while an index of 10,000
documents can be queried within milliseconds, a sequential scan of every word
in 10,000 large documents could take hours. The additional computer storage
required to store the index, as well as the considerable increase in the time
required for an update to take place, are traded off for the time saved during
information retrieval.
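A minimal sketch in Python of the idea behind such an index: a map built once from each word to the documents containing it, so answering a query is a lookup rather than a scan of every document (the documents are made up):

```python
from collections import defaultdict

documents = {                                 # made-up corpus
    "page1.html": "deep web content not indexed by search engines",
    "page2.html": "search engines crawl and index the surface web",
    "page3.html": "recipes for italian food",
}

# Build the inverted index once: word -> set of pages containing it.
index = defaultdict(set)
for page, text in documents.items():
    for word in text.lower().split():
        index[word].add(page)

# A query is now a fast lookup instead of a scan of every document.
query = ["search", "engines"]
results = set.intersection(*(index[w] for w in query))
print(results)   # {'page1.html', 'page2.html'}
```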
Web-crawlers retrieve copies of each web page visited. Each page is
inspected to determine its ranking for specific search terms.
Metadata web indexing involves assigning keywords or phrases to web pages or web
sites within a metadata tag (or "meta-tag") field, so that the web page or web site can be
retrieved with a search engine that is customized to search the keywords field. This may or may
not involve using keywords restricted to a controlled vocabulary list. This method is commonly
used by search engine indexing.
The search engine indexing process takes the detailed information collected by the search engine spider (web crawler) and analyses it.
The information the search engine spider found on each page is analysed, building a list of the words and phrases within the document, taking into account:
- the number of times a word/phrase is used on the web page.
The search engines consider a web page where a word or phrase is used to excess to be spam and an unethical attempt to get listed without providing quality content for the website visitors.
- the weight of the word/phrase.
The weight of the word/phrase increases in value depending on where it is located (top of the document, sub-headings, text links, meta tags, title). Each search engine has a different way of analysing this information, therefore results differ between search engines: a web page may do well in the results of one search engine and poorly in another.
The indexed information is saved in a database, waiting for someone to do a search.
When someone does a search at the search engine (or directory), the words they entered in the search box are compared with the indexed information in the search engine's database, and the list of web pages the engine considers most relevant to the search is returned.
None of the search engines publish how often their indexes are updated.
They are always looking for fresh new content to add to their index, so adding new original content to your website or blog will help.
As other people find your website or blog on their own and start to link to it because they find it valuable to their readers, the search engine bots will find and follow the link. This will give the search engine bot an opportunity to see if you have any new content to index.
Making sure that all the links within your website/blog are working is another way for the search engine bots to find new and updated content to index.
Going overboard on looking for linking partners, though, isn't going to help get your website or blog indexed or reindexed. The search engines are wise to this and may even stop visiting your site, eliminating you from the search engine indexing process.

C.2.8. Suggest how web developers can create pages that appear more prominently in search engine results.

- Students will be expected to test specific data in a range of search engines, for example examining time taken, number of hits, quality of returns.
- Ensure the site has high-quality information.
- Get other indexed sites to link to your site; encourage others to link to you.
- Identify the keywords for which you would like to be found.
- Place keywords in prime locations (headlines and section titles, link text, page title metadata, page description metadata, page text, page URL).
- Ensure a search-friendly website architecture (ensure there's a simple link to every page on your site, include content early in each HTML page, use standard header tags, be careful of duplicate pages).
- Keep your site fresh (updated).

http://www.idealware.org/articles/found_on_search_engines.php

C.2.9. Describe the different metrics used by search engines.

How do different search engines compare? Parameters to look at include:
- recall (finding the relevant page in an index)
- precision (ranking a page correctly)
- relevance
- coverage
- customization
- user experience

Today, the major search engines use many metrics to determine the value of external links. Some of these metrics include:
- The trustworthiness of the linking domain.
- The popularity of the linking page.
- The relevancy of the content between the source page and the target page.
- The anchor text used in the link.
- The amount of links to the same page on the source page.
- The amount of domains that link to the target page.
- The amount of variations that are used as anchor text in links to the target page.
- The ownership relationship between the source and target domains.

In addition to these metrics, external links are important for two main reasons:

1. Popularity

Whereas traffic is a "messy" metric and difficult for search engines to measure accurately (according to Yahoo! search engineers), external links are both a more stable metric and an easier metric to measure. This is because traffic numbers are buried in private server logs while external links are publicly visible and easily stored. For this reason and others, external links are a great metric for determining the popularity of a given web page. This metric (which is roughly similar to toolbar PageRank) is combined with relevancy metrics to determine the best results for a given search query.

2. Relevancy

Links provide relevancy clues that are tremendously valuable for search engines. The anchor text used in links is usually written by humans (who can interpret web pages better than computers) and is usually highly reflective of the content of the page being linked to. Many times this will be a short phrase (e.g. "best aircraft article") or the URL of the target page (e.g. http://www.best-aircraft-articles.com).
The target and source pages and domains cited in a link also provide valuable relevancy metrics for search engines. Links tend to point to related content. This helps search engines establish knowledge hubs on the Internet that they can then use to validate the importance of a given web document.

C.2.10. Explain why the effectiveness of a search engine is determined by the assumptions made when developing it.

- Understand that the ability of the search engine to produce the required results is based primarily on the assumptions used when developing the algorithms that underpin it.

C.2.11. Discuss the use of white hat and black hat search engine optimization.

White hat (links from C.2.8):
- new sites send an XML site map to Google
- include a robots.txt file
- add the site to Google's Webmaster Tools to warn you if the site is uncrawlable
- make sure the H1 tag contains your main keyword
- page titles contain keywords
- relevant keywords with each image
- the site has a suitable keyword density (but no keyword stuffing)
Students should be able to explain the effect of the above techniques.

Black hat:
- hidden content
- keyword stuffing
- link farms
- etc.

Black hat SEO refers to attempts to improve rankings in ways that are not approved by search engines and involve deception; they go against current search engine guidelines. White hat SEO refers to the use of good-practice methods to achieve high search engine rankings; they comply with search engine guidelines.
Black hat SEO is more frequently used by those who are looking for a quick financial return on their website, rather than a long-term investment in it. Black hat SEO can possibly result in your website being banned from a search engine; however, since the focus is usually on quick, high-return business models, most experts who use black hat SEO tactics consider being banned from search engines a somewhat irrelevant risk.
In search engine optimization (SEO) terminology, white hat SEO refers to the usage of optimization strategies, techniques and tactics that focus on a human audience as opposed to search engines, and that completely follow search engine rules and policies.
For example, a website that is optimized for search engines, yet focuses on relevancy and organic ranking, is considered to be optimized using white hat SEO practices. Some examples of white hat SEO techniques include using keywords and keyword analysis, backlinking, link building to improve link popularity, and writing content for human readers.
White hat SEO is more frequently used by those who intend to make a long-term investment in their website. It is also called ethical SEO.
Students should be able to assess both how the above techniques function and their degree of success.

C.2.12. Outline future challenges to search engines as the web continues to grow.

- Issues such as error management, lack of quality assurance of information uploaded.

Areas being developed are:
- Concept-based searching
- Natural language queries (e.g. Ask.Jeeves.com)

Future challenges:
- Error management
- Lack of quality assurance of information uploaded

Search engines will need to evolve to remain effective as the web grows.

C.3 Distributed approaches to the web (6 hours)

C.3.1. Define the terms: mobile computing, ubiquitous computing, peer-2-peer network, grid computing.

MOBILE COMPUTING = NOMADIC COMPUTING = the use of portable computing devices + mobile communication technology.
E.g.: use of laptops, tablets, netbooks, smartphones.
UBIQUITOUS COMPUTING = PERVASIVE COMPUTING = the incorporation of scalable computing devices into all aspects of everyday life.
E.g.: smart electricity meters, sensor networks, devices that talk to each other.
PEER-2-PEER = P2P = a network without distinctive clients and servers, where individual nodes (peers) act as both suppliers and receivers of data.
E.g.: streaming of files (particularly music and video files).
GRID COMPUTING = the use of remote computers to act as a virtual unit.
= the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files.
E.g.: used for solving data-intensive problems.

C.3.2. Compare the major features of: mobile computing, ubiquitous computing, peer-2-peer network, grid computing.

MOBILE COMPUTING:
- All components reduced in size
- Light
- Portable
- Includes battery power
- Slimmed-down OS
- Low-power microprocessors
- Solid state drives

UBIQUITOUS COMPUTING:
- Microprocessors embedded into everyday objects
- Use wireless technologies to connect with the internet
- Unobtrusive
- Devices can talk to each other (communication)

PEER-2-PEER:
- Each computer can function either as a client or a server
- Used primarily for file-sharing
- Software searches the P2P network for a requested file
- Can use a central server for coordination
- Performance increases as the number of nodes increases

GRID COMPUTING:
- Distributed system with central control
- Geographically dispersed
- Little/no communication between nodes
- = a virtual supercomputer for solving single tasks
- Highly scalable
- Uses standard computers

C.3.3. Distinguish between interoperability and open standards.

An open standard is a standard that is publicly available and has various rights
to use associated with it or various properties of how it was designed (e.g. open
process).
Open standards rely on a broadly consultative and inclusive group, that
discusses and debates the technical and economic merits, demerits and
feasibility of a proposed common protocol. After the doubts and reservations of
all members are addressed, the resulting common document is endorsed as a
common standard.
This document is subsequently released to the public, and henceforth
becomes an open standard. It is usually published and is available freely or at
a nominal cost to any and all comers.
Interoperability must be distinguished from open standards. Although the
goal of each is to provide effective and efficient exchange between computer
systems, the mechanisms for accomplishing that goal differ. Open standards
imply interoperability by definition, while interoperability does not, by itself,
imply wider exchange between a range of products, or similar products from
several different vendors, or even between past and future revisions of the same
product. Interoperability may be developed post-facto, as a special measure
between two products, while excluding the rest, or when a vendor is forced to
adapt its system to make it interoperable with a dominant system.
OPEN STANDARDS are:
- Publicly available
- Often developed collaboratively
- Royalty-free

INTEROPERABILITY:
- Allows different systems to work together

Examples of open standards:
- WWW architecture specified by W3C
- Internet Protocol
- HTML

Example of interoperability:
- producing web pages viewable in standards-compliant web browsers, on various operating systems such as Windows, Macintosh and Linux, and on devices such as PCs, PDAs and mobile phones, based on the latest web standards.

C.3.4. Describe the range of hardware used by distributed networks.

- Consider developments in mobile technology that have facilitated the growth of distributed networks.

A distributed system consists of a collection of autonomous computers, connected through a network and distributed operating system software, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility.

C.3.5. Explain why distributed systems may act as a catalyst to a greater decentralization of the web.

- Decentralization has increased international-mindedness.
http://motherboard.vice.com/en_uk/read/bitcoin-internet-the-plan-for-a-decentralized-web-that-runs-on-cryptocurrency

Distributed systems could act as a catalyst (a factor that makes the change happen faster) to a greater decentralization of the web by eliminating the need for huge servers in data centres (which are usually owned by the tech giants) and using instead the spare hard drive space of users' devices (which amounts to a distributed system). Applications would sit on top of the system so we could use the Internet much as usual, but our data wouldn't be stored as complete files on one server or local network. It would be shredded and encrypted, then dispersed in pieces across different computers in the network.
https://www.youtube.com/watch?v=RdGH40oUVDY

C.3.6. Distinguish between lossless and lossy compression.

LOSSY COMPRESSION refers to data compression techniques in which some amount of data is lost.
Lossy compression technologies attempt to eliminate redundant or unnecessary information. Most video compression technologies, such as MPEG, use a lossy technique.
LOSSLESS COMPRESSION refers to data compression techniques in which no data is lost.
The PKZIP compression technology is an example of lossless compression. For most types of data, lossless compression techniques can reduce the space needed by only about 50%.
For greater compression, we must use a lossy compression technique. Only certain types of data (graphics, audio and video) can tolerate lossy compression.
You must use a lossless compression technique when compressing data and programs.
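A quick sketch of lossless compression with Python's standard zlib module; after decompression the data is bit-for-bit identical, which is why only lossless techniques are acceptable for programs and general data:

```python
import zlib

original = b"AAAAABBBBBCCCCC" * 100          # made-up, highly repetitive data

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))  # far fewer bytes to transmit
print(restored == original)                  # True: nothing was lost
```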

C.3.7. Evaluate the use of decompression software in the transfer of information.

The most important reason for compressing data is that more and more we share data. The web and its underlying networks have limitations on bandwidth that define the maximum number of bits or bytes that can be transmitted from one place to another in a fixed amount of time.
Decompression is the reconversion of compressed data into its original (or nearly original) form so that it can be heard, read and/or seen as normal.

C.4. The evolving web (10 hours)

C.4.1. Discuss how the web has supported new methods of online interaction such as social networking.

- Issues linked to the growth of new internet technologies such as Web 2.0 and how they have shaped interactions between different stakeholders of the web.

Technologies and features that have led to social networking, blogs, wikis and Skype/Google Hangouts include:
- Interaction
- Collaboration
- User-generated content
- Virtual communities
- Ajax, which allows JavaScript to upload/download new data from the server (without reloading the page)
- XML formatting and the Document Object Model (DOM)
- Flash for playing video and audio
- The use of widgets (e.g. calendars)
- The SLATES acronym

C.4.2. Describe how cloud computing is different from client-server architecture.

- Address only the major differences.

In client-server architecture, the server is usually local (in the same building, or at least in a building nearby); it is accessed over a private network and only from within the company; nobody outside has access to it. Cloud computing, by contrast, is accessed through the Internet, and you don't know which companies are hosting your data.
In cloud computing you are dependent on the Internet, and performance varies depending on the connection speed.

C.4.3. Discuss the effects of the use of cloud computing for specified organizations.

- Include public and private clouds.
- Consider SECURITY, COSTS and EXPERTISE for both large and small organizations.

By choosing a public cloud solution, an organization can offload much of the management responsibility to its cloud vendor.
In a private cloud scenario, there is significant demand on resources to specify, purchase, house, update, maintain and safeguard the physical infrastructure. Financially, deploying a private cloud can also create a large initial capital expense, with subsequent investment required as new equipment and capacity are added.
In a public cloud scenario, capital expense is virtually eliminated; the financial burden is shifted to a fee-for-service, often based on utilization and data volume.
Maintaining and securing public cloud infrastructure is the responsibility of the vendor, enabling the customer organization to streamline IT operations and minimize time and money spent on system upkeep.

C.4.4. Discuss the management of issues such as copyright and intellectual property on the web.

- Investigate sites such as Turnitin and Creative Commons.

Intellectual property (IP) is a legal term that refers to creations of the mind.
Copyright is a legal term used to describe the rights that creators have over their literary and artistic works. Works covered by copyright range from books, music, paintings, sculpture and films, to computer programs, databases, advertisements, maps and technical drawings.
A patent is an exclusive right granted for an invention. Generally speaking, a patent provides the patent owner with the right to decide how (or whether) the invention can be used by others. In exchange for this right, the patent owner makes technical information about the invention publicly available in the published patent document.
A trademark is a sign capable of distinguishing the goods or services of one enterprise from those of other enterprises. Trademarks date back to ancient times when craftsmen used to put their signature or "mark" on their products.
http://www.digitalenterprise.org/ip/ip.html
http://www.wipo.int/about-ip/en/

C.4.5. Describe the interrelationship between privacy, identification and authentication.

Internet privacy involves the right or mandate of personal privacy concerning the storing, repurposing, provision to third parties, and displaying of information pertaining to oneself via the Internet.
Identification is the process whereby a network element recognizes a valid user's identity. Authentication is the process of verifying the claimed identity of a user.
Transport Layer Security (TLS) and its predecessor, Secure Sockets
Layer (SSL), are cryptographic protocols designed to provide
communication security over the Internet.
They use X.509 certificates and hence asymmetric
cryptography to authenticate the counterparty with whom they are
communicating, and to exchange a symmetric key.
This session key is then used to encrypt data flowing between the
parties. This allows for data/message confidentiality, and message
authentication codes for message integrity and as a by-product, message
authentication.
Several versions of the protocols are in widespread use in applications
such as web browsing, electronic mail, Internet faxing, instant messaging,
and voice-over-IP (VoIP).
SSL is a protocol developed by Netscape for transmitting private documents via the Internet.
SSL uses a cryptographic system that uses two keys to encrypt data: a public key known to everyone and a private or secret key known only to the recipient of the message. Both Netscape Navigator and Internet Explorer support SSL, and many websites use the protocol to obtain confidential user information, such as credit card numbers. By convention, URLs that require an SSL connection start with https: instead of http:.
Another protocol for transmitting data securely over the World Wide Web is Secure HTTP (S-HTTP). Whereas SSL creates a secure connection between a client and a server, over which any amount of data can be sent securely, S-HTTP is designed to transmit individual messages securely.

TLS is a protocol that guarantees privacy and data integrity between client/server applications communicating over the Internet.
The TLS protocol is made up of two layers:
- The TLS Record Protocol: layered on top of a reliable transport protocol, such as TCP, it ensures that the connection is private by using symmetric data encryption, and it ensures that the connection is reliable. The TLS Record Protocol is also used for encapsulation of higher-level protocols, such as the TLS Handshake Protocol.
- The TLS Handshake Protocol: allows authentication between the server and client, and the negotiation of an encryption algorithm and cryptographic keys before the application protocol transmits or receives any data.
TLS is application protocol-independent. Higher-level protocols can layer on top of the TLS protocol transparently.

C.4.6. Describe the role of network architecture, protocols and standards in the future development of the web

- The future development of the web will have an effect on the rules and structures that support it.

C.4.7. Explain why the web may be creating unregulated monopolies

- The web is creating new multinational online oligarchies.

Examples:
- Web browsers (Microsoft)
- Cloud computing (Microsoft)
- Facebook dominating social networking
- LinkedIn

Dangers:
- One social networking site, search engine or browser creating a monopoly, limiting innovation.
- ISPs favouring some content.
- Mobile phone operators blocking competitor sites.

C.4.8. Discuss the effects of a decentralized and democratic web.

- The web has changed users' behaviours and "removed" international boundaries.

http://www.theguardian.com/technology/2010/jan/24/internet-revolution-changing-world
http://en.wikipedia.org/wiki/Decentralization
