
A VISION OF THE ROLE AND FUTURE OF WEB ARCHIVES

Kalev H. Leetaru (kalev.leetaru5@gmail.com)
Graduate School of Library and Information Science
University of Illinois

Imagine a world in which libraries and archives had never existed.

No institutions had ever systematically collected or preserved our collective cultural past: every book, letter, or document was created, read and then immediately thrown away. What would we know about our past? Yet, that is precisely what is happening with the web: more and more of our daily lives occur within the digital world, yet more than two decades after the birth of the modern web, the libraries and archives of this world are still just being formed.

We've reached an incredible point in society. Every single day a quarter-billion photographs are uploaded to Facebook, 300 billion emails are sent and 340 million tweets are posted to Twitter. There are more than 644 million websites, with 150,000 new ones added each day, and upwards of 156 million blogs. Even more incredibly, the growth rate of content creation in the digital world is exploding. The entire New York Times over the last 60 years contained around 3 billion words. More than 8 billion words are posted to Twitter every single day. That's right: every 24 hours there are 2.5 times as many words posted to Twitter as there were in every article of every issue of the paper of record of the United States over the last half century.

By some estimates there have been 50 trillion words in all of the books published over the last half millennium. At its current growth rate, Twitter will reach that milestone less than three years from now. Nearly a third of the planet's population is now connected to the internet and there are as many cell phones as there are people on earth. Yet, for the most part we consume all of this information as it arrives and discard it just as quickly, giving little thought to posterity. That's where web archives come in: to make sure that a few years, decades, centuries, and millennia from now we will still have at least a partial written record of human society at the dawn of the twenty-first century.

THE WEB ARCHIVE IN TODAY'S WORLD

The loss of the Library of Alexandria, once the greatest library on earth, created an enormous hole in our understanding of the ancient world. Imagine if that library had not only persisted to the present day, but had continued to collect materials through the millennia. Yet, in the web era, we are repeating this cycle of loss, not through a fire or other sudden event like the one that destroyed the Library of Alexandria, but rather through inaction: we are simply not collecting it.

The dawn of the digital world exists in the archives of just a few organizations. Many mailing lists and early services like Gopher have largely been lost, while organizations such as Google have invested considerable resources in resurrecting others like USENET. The earliest years of the web are gone forever, but beginning in 1996 the Internet Archive began capturing snapshots, giving us one of the few records of the early iterations of this world. Organizations like the International Internet Preservation Consortium (IIPC) are helping to bring web archivists from across the world and across disciplines together to share experiences and best practices and to forge collaborations to help advance these critical efforts.


UNINTENDED USES

Archives exist to preserve a sample of the world for future generations. They accept that they cannot archive everything and don't try to: they operate as an opportunistic collector. Traditional humanities and social sciences scholarship was designed around these limitations: the tradition of deep reading of a small number of works in the humanities was born out of this model. Yet, a new generation of researchers is increasingly using archives in ways they weren't intended for, and needs a greater array of information on how those archives are created in order to anticipate biases and impacts on their findings.

The Library of Congress Chronicling America site, while technically a web-delivered digital library rather than a web archive, offers an example of why greater insight into the archiving process is critical for research. Using the site recently for a project, my search returned ten times as many hits for my topic in El Paso, Texas newspapers as it did for New York City. Further inspection showed this was actually because the Chronicling America site had more content from El Paso newspapers during this time period than it did from New York City papers, rather than a reflection of El Paso papers covering my topic in more detail. Part of this issue stems from the acquisition model of Chronicling America: each individual state determines the order in which it digitizes newspapers printed within its borders. One state might begin with smaller papers while another begins with larger papers; one state might digitize a particular year from every paper while another might digitize the entirety of each paper in turn. Chronicling America also excludes papers that have already been digitized by commercial vendors: thus New York City's largest paper, the New York Times, is not present in the archive. This landscape introduces significant artifacts into searches, but normalization procedures can help address them. In order to do so, however, a bibliography is needed that lists every page from every paper that has been included in the archive. This would have allowed me to switch my search results from a raw count of matching newspaper pages into a percent of all pages from each city, which would have accounted for there being more content from El Paso than from New York City.

This is even clearer when conducting searches of the historic New York Times. A search of the Times for any keyword over the period 1945 to the present will show its use declining by 50% over that period. This is not a reflection of that term declining in use, but rather reflects the fact that the Times itself shrunk by more than half over this period. Similarly, searches covering the year 1978 will show an 88-day period where the term was never used. This is not because the term dropped out of favor during that period, but rather because a machinists' strike halted the paper's publication entirely. Having an index of the total number of articles published each day (and thus the possible universe of articles the term could have been used in) allows the raw counts to be normalized to yield the true picture of the term's usage. However, no web archive today offers such a master index of its holdings.
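To make the normalization concrete, here is a minimal sketch of the calculation such a master index would enable, assuming a hypothetical inventory file (page_inventory.csv, with city and total_pages columns) and illustrative hit counts; none of these names or numbers come from Chronicling America itself.

```python
import csv

def normalized_rates(hits_per_city, inventory_csv):
    """Convert raw keyword hit counts into the percent of all archived pages
    from each city that matched, using a (hypothetical) page inventory."""
    totals = {}
    with open(inventory_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["city"]] = int(row["total_pages"])
    return {
        city: 100.0 * hits / totals[city]
        for city, hits in hits_per_city.items()
        if totals.get(city)  # skip cities missing from the inventory
    }

# Illustrative numbers only: raw counts can favor El Paso simply because more
# El Paso pages were digitized; the normalized rates remove that artifact.
print(normalized_rates({"El Paso": 500, "New York City": 50}, "page_inventory.csv"))
```

The same pattern applies to the New York Times example: dividing each day's keyword count by the total number of articles published that day separates a term's true usage from the shrinking (or striking) paper itself.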
One of the core optimizations used by web crawlers can have a significant impact on certain classes of research. Nearly every web archive uses crawlers designed to measure the rate of change of a site (i.e., how often on average pages on that site change) in order to crawl sites that change more often faster than those that rarely change. This allows bandwidth and disk storage to be prioritized towards sites that change often, rather than storing a large number of identical snapshots of a site that never changes. However, sometimes it is precisely that rare change that is most interesting. For example, when studying how White House press releases had changed, I was examining pages that should never show any change whatsoever, and when there was a change, I needed to know the specific day on which the change occurred to reconcile it with the political winds at the time. However, the rarity of change on that portion of the site meant that snapshots were often months or sometimes years apart, making it impossible to narrow some changes down below the level of several years.

In other analyses, the dynamic alteration of the recrawl rate itself is a problem. For example, when studying the inner workings of the Drudge Report over the last half-decade, a key research question revolved around the rate at which various elements of that site changed. If the rate of snapshotting was being varied by a software algorithm based on the very phenomena I was measuring, that would strongly bias my findings. In that particular case I was lucky enough to find a specialty archive that existed solely to archive the Drudge Report, and which had collected snapshots every 2 minutes, nonstop, for more than 6 years.

This is not an easy problem, as archives must balance their very limited resources between crawling for new pages and recrawling existing pages looking for changes. Within recrawling, they must balance the need to pinpoint changes to the narrowest timeframe possible against ensuring they capture as many changes as possible from high-velocity sites.

Finally, the very notion of what constitutes "change" varies dramatically among research projects. Has a page changed if it still looks the same, but an HTML tag was changed? What about if the title changes, or the background color? Does a change in the navigation bar at the top count the same as a change to the body text? There are as many answers to these questions as there are research projects, and no single solution satisfies them all. When looking at changes to White House press releases, only a change to a page title or body text counted as a change, while the Internet Archive counted all of the myriad edits and additions to the White House navigation bar as changes. This required downloading every single snapshot of each page and applying our own filters to extract and compare the body text ourselves. One possible solution might be the incorporation of hybrid hierarchical structural and semantic document models that allow a user to indicate which areas of the document he or she cares about and to return only those snapshots in which that section has changed.
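As a minimal sketch of how such a section-aware notion of change might work, the following assumes the researcher can point to the region of the page she cares about with a CSS selector; the selector and the snapshot list are illustrative inputs, not an existing archive interface. Only that region's text is hashed, and a snapshot is kept only when the hash differs from the previous one.

```python
import hashlib
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def section_fingerprint(html, css_selector):
    """Hash only the text of the page region the researcher cares about."""
    node = BeautifulSoup(html, "html.parser").select_one(css_selector)
    text = node.get_text(" ", strip=True) if node else ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_snapshots(snapshots, css_selector="div.body-text"):
    """Given [(timestamp, html), ...] in date order, return the timestamps of
    snapshots whose selected section differs from the preceding snapshot."""
    kept, last = [], None
    for timestamp, html in snapshots:
        fingerprint = section_fingerprint(html, css_selector)
        if fingerprint != last:
            kept.append(timestamp)
            last = fingerprint
    return kept
```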

WHAT TO KEEP?

As noted in the introduction to this blog post, the digital world is experiencing explosive growth, producing more content in a few hours than was produced in the greater part of a century in the print era. This growth is giving us an incredible view of global society and enabling communication, collaboration, and social research at scales unimaginable even a decade ago, yet the richer this archive becomes, the harder it is to archive. The very volume of material that makes the web so exciting as a communications platform means there is simply too much of it to keep. Even in the era of books, there were simply too many of them for any library to keep, but at least we could assume that some library somewhere was probably collecting the books that we weren't: an assumption that isn't necessarily true in the digital world yet.

An age-old mechanism for dealing with overflow is to determine which works are the most important and which can be discarded. Yet, how do we decide what constitutes noise and what should be kept? Talk to a historian writing a biography of a historic figure and he or she will likely point to routine day-to-day letters and diary entries as a critical source of information on that person's mood, feelings, and beliefs. Emerging research on using Twitter to forecast the stock market or measure public sentiment is finding that only when one considers the entirety of all 340 million tweets each day do the key patterns emerge. A tweet of "I'm outside hanging the laundry, such a beautiful day" might at first seem a prime candidate for discarding, but by its very nature it reflects an author feeling calm, secure, and relaxed: critical population-level dynamics of great interest to social scientists. Another mechanism is to discard highly similar works, such as multiple editions of the same work. Yet, an emerging area of research on the web is the tracing of "memes," which are variations of a quote or story that evolve as they are forwarded across users and communities, much like a real-time version of the telephone game. It is critical for such research to be able to access every version of a story, not just the most recent.

The rise of dual electronic + print publishing pipelines has led to the need to collect two copies of a work, instead of just a single authoritative print edition. Digital editions of books released as websites may include videos, photographs, multimedia, and interactive features that provide a very different experience from the print copy. Even in subject domains where print is still the official record, digital has become the de facto record through its ease of access. How many citizens travel to their nearest Federal Depository Library and browse the latest edition of the Public Papers of the President to find press releases and statements by their government? Most likely turn instead to the White House's website, yet a study I coauthored in 2008 found that official US government press releases on the White House website were being continually edited, with key information added and removed and dates changed over time to reflect changing political realities. In a world in which information is so easily changed, and even supposedly immutable government material changes with a click of a mouse, how do we as web archivists capture this world and make it available?

This brings up one very critical distinction between the print and digital eras: the concept of change. In the print era, an archive simply needed to collect an item as it was published. If a book was subsequently changed, the publisher would issue a new edition and notify the library of its availability. A book sitting on a shelf was static: if 20 libraries each held a copy of that book, they could be reasonably certain that all 20 copies were identical to each other. In the digital era, we must constantly scour for new pages to archive, but we also have a new role: checking our existing archive for change. Every single page ever saved by the archive must be rechecked on a regular basis to see if it has changed. Websites don't make this easy. A study of the Chicago Tribune I conducted for the Center for Research Libraries in 2011 found there was no single master list of articles published on the Tribune's site each day, and the RSS feeds were sorted by popularity, not date. To ensure one archived every new article posted to the site, an archivist would have to monitor all 105 main topic pages on the Tribune's site every few hours or risk losing new articles on a news-heavy day. At the level of the web as a whole, one can monitor the DNS domain registry to get a continually updated list of every domain name in existence. However, even this provides only a list of websites like "cnn.com," not a list of all of the pages on that site.

In the era of books, a library needn't purchase a work the day it was released, as most books continued to be printed and available for at least months, if not years, afterwards. A library could wait a year or two until it had sufficient budget or space to collect it. Web pages, on the other hand, may have half-lives measured in seconds to minutes. They can change constantly, with no notice, and the velocity of change can be extreme. In addition, more content is arriving in streaming format on the web. Archiving Twitter requires being able to collect and save over 4,000 messages per second in real time, with no ability to go back for missed ones. A network outage of 10 minutes means 2.5 million tweets that have been lost forever. In the web world, content producers set the schedule for collection and archivists must adhere to those schedules.
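Returning to the Chicago Tribune example above, here is a minimal sketch of the monitoring loop an archivist is forced into when no master article list exists: poll every topic page on some interval and queue any link not seen before. The topic-page URLs, polling interval, and absence of any link filtering are placeholders for illustration, not the Tribune's actual structure.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_PAGES = ["https://example.com/news", "https://example.com/sports"]  # placeholders

def page_links(topic_url):
    """Return the set of absolute links currently visible on one topic page."""
    html = requests.get(topic_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(topic_url, a["href"]) for a in soup.find_all("a", href=True)}

seen = set()
while True:  # runs indefinitely; a real crawler would persist 'seen' to disk
    for topic in TOPIC_PAGES:
        new = page_links(topic) - seen
        for url in new:
            print("queue for archiving:", url)  # hand off to the crawler here
        seen |= new
    time.sleep(3 * 60 * 60)  # re-poll every few hours, per the discussion above
```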
Myron Gutmann, Assistant Director of the National Science Foundation's Directorate for Social, Behavioral & Economic Sciences, gave a talk earlier this year in which he argued that in the print era the high cost of producing information meant that whatever was published was worth keeping, because there were so many layers of review. In contrast, the tremendously low cost of publication in the digital era means anyone can publish anything without any form of review. This raises the question, even in scholarly disciplines, of what is worth keeping. If an archive becomes too full, and a massive community of researchers is served by one set of content while just 10 users are served by another collection of material, whose voice matters the most in what is deleted? How do we make decisions about what to keep? Historically those decisions were made by librarians or archivists by themselves, but as users and data miners become increasingly heavy users of archives, this raises the question of how to engage those communities in these critical decisions.

THE RISE OF THE PARALLEL WEB

When we speak of archiving the web we often think of the web as a single monolithic entity in which all content that is produced or consumed via a web browser is accessible for archiving. The original vision of the web was based on this ideal: an open, unified platform in which all material was available to all users. For the most part this vision survived the early years of the web, as users strove to reach the greatest possible audience. Yet, a new trend has begun over the past half-decade, corresponding with the rise of social media: the creation of parallel versions of the web.

Every one of those quarter-billion photographs uploaded to Facebook each day is posted and consumed via the web, whether through a browser on a desktop or a mobile app on a smartphone. Yet, despite transiting the same physical telecommunications infrastructure as the rest of the web, those photos are stored in a parallel web, owned and controlled entirely by a commercial entity. They are not part of the public web and thus not available to web archives. In many ways this is no different from the libraries and archives of the print era. Libraries focused on collecting books and pamphlets, while a good deal of communication and culture occurred in letters, diaries, drawings, and artwork that have largely been lost. The difference in the digital era is that instead of being scattered across individual households, all of this material is already being centralized into commercially owned archives and libraries.

Not everyone desires every conversation of theirs to be preserved for posterity, but in the print era one had a choice: a letter or diary or photograph was a physical object, held by its owner, that could be passed down to later generations. How many of us have come across a shoebox of old photographs or letters from a grandparent? In the digital era, a company holds that material on our behalf, and while most have terms of service that agree we own our material, only one major social media platform today offers an export button that allows us to download a copy of the material we have given it over the years: Google Plus, via Google Takeout. Twitter has recognized the importance of the communications that occur via its service and has made a feed of its content available to the Library of Congress for archiving for posterity. Most others, like Facebook and international platforms like Weibo or VK (formerly VKontakte), have not. Facebook in effect has become a parallel version of the web, hosted on the web but walled off from it, with no means for users to archive their material for the future.

Twitter offers a shining example of how such platforms can interact with the web archiving community and ensure that their material is archived for future generations. Self-archiving services like Google Takeout offer an intermediate step in which users at least retain the ability to make their own archival copy of their contributions to the web for future generations.
As more of the web moves behind paywalls, password protection, and other mechanisms, creating more and more parallel versions of the web, there must be greater discussion within the web archiving community about how we reach out to these services to find ways of ensuring users of these communities may archive their material for the future.

DATA MINING

For millennia, scholarship in archives and libraries has meant intensive reading of a small number of works. In the past decade the digital humanities and computational social sciences have led to the growing use of computerized analysis of archives, in which software algorithms are used to identify patterns and point to areas of interest in the data. Digital archives have largely been built around this earlier access modality of deep reading, while computational techniques need rapid access to vast volumes of content, often encompassing the entire archive. New programming interfaces and access policies are needed to enable this new generation of scholarship using web archives.

Informal discussions with web archivists suggest a chicken-or-the-egg dilemma in this regard: data miners want to analyze archives, but can't without the necessary programmatic interfaces, and archives for the most part want to encourage use of their holdings, but don't know what interfaces to support without working with data miners. Few archives today support the necessary programmatic interfaces for automated access to their collections, and those that do tend to be aimed at metadata, rather than full-text content, and use library-centric protocols and mindsets. Some have fairly complex interfaces, with very fine-grained toolkits for each possible use scenario. The few that offer data exports offer an either-or proposition: you either download a ZIP file of the entire contents of everything in the archive or you get nothing; there is no in-between. There are some bright spots, though: the National Endowment for the Humanities has made initial steps towards helping archivists and data miners work together through grand challenge programs like its Digging into Data initiative, where a selection of archives made their content available to awardees for large-scale data mining.

Yet, one only has to look at Twitter for a model of what archives could do. Twitter provides only a single small programming interface with a few very basic options, but through that interface it has been able to support an ecosystem of nearly every imaginable research question and tool. It even offers a tiered cost-recovery model: users needing only small quantities of data (a "sip") can access the feed for free, while the rest are charged at a tiered pricing model based on the quantity of data they need, up to the entirety of all 340M tweets at the highest level. Finally, the interfaces provided by Twitter are compatible with the huge numbers of analytical, visualization, and filtering tools provided by the Googles and Yahoos of the world with their open cloud toolkits. If archives took the same approach with a standardized interface like Twitter's, researchers could leverage these huge ecosystems for the study of the web itself.

For some archives, the bottleneck has become the size of the data, which has become too large to share via the network. Through a partnership with Google, data miners can request from the HathiTrust a copy of the 1800-1924 Google Books archive, consisting of around 750 million pages of material. Instead of receiving a download link, users must pay the cost of purchasing and shipping a box full of USB drives, because networks, even between research universities, simply cannot keep up with the size of datasets used today. In the sciences, some of the largest projects, such as the Large Synoptic Survey Telescope, are going as far as to purchase and house an entire computing cluster in the same machine room as the data archive and allow researchers to submit proposals to run their programs on the cluster, because even with USB drives the data is simply too large to copy.
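To illustrate how simple such a standardized interface could look from the researcher's side, here is a minimal sketch against a purely hypothetical archive endpoint; the URL, parameters, and JSON field names are invented for illustration and do not describe any existing archive's API.

```python
import requests

ARCHIVE_API = "https://archive.example.org/api"  # hypothetical endpoint

def list_snapshots(page_url):
    """Ask the (hypothetical) archive which snapshots it holds for a URL."""
    resp = requests.get(ARCHIVE_API + "/snapshots", params={"url": page_url}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # e.g. [{"timestamp": "20120115083000"}, ...]

def fetch_snapshot(page_url, timestamp):
    """Retrieve the stored content of one snapshot."""
    resp = requests.get(ARCHIVE_API + "/content",
                        params={"url": page_url, "timestamp": timestamp}, timeout=30)
    resp.raise_for_status()
    return resp.text

# Everything beyond this point -- filtering, visualization, large-scale mining --
# could then be built by the research community itself, as it has been around Twitter.
for snap in list_snapshots("http://example.com/page.html"):
    html = fetch_snapshot("http://example.com/page.html", snap["timestamp"])
```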

Not all of the barriers to offering bulk data mining access to archives are technical: copyright and other legal restrictions can present significant complications. Though even here technology can provide a possible alternative: non-consumptive analysis, in which software algorithms perform surface-level analyses rather than deep reading of text, may satisfy the requirements of copyright. In other cases, transformations of copyrighted material into another form, such as a word list, as was done with the Google Books Ngrams dataset, may provide possible solutions.

Not everyone appreciates or understands the value web archives provide society, and they are constantly under pressure just to find enough funds to keep the power running. This is an area where partnering with researchers may help: there are only a few sources of funding for the creation and operation of web archives, compared with the myriad funding opportunities for research. The increased bandwidth, hardware load, and other resource requirements of large data mining projects come at a real cost, but at the same time they directly demonstrate the value of those archives to new audiences and disciplines that may be able to partner with those archives on proposals, potentially opening new funding opportunities.

USER INSIGHT

While some archives cannot offer access to their holdings for legal reasons and instead serve only as an archive of last resort, most archives would hold little value to their constituents if they were not able to provide some level of access to the content they archived. User interfaces as a whole today are designed for casual browsing by non-expert users, with simplicity and ease of use as their core principles. As archives become a growing source for scholarly research, they must address several key areas of need in supporting these more advanced users:

Inventory. There is a critical need for better visibility into the precise holdings of each archive. With most digital libraries of digitized materials a visitor can browse through the collection from start to end, though even there one usually can't export a CSV file containing a master list of everything in that collection. Most web archives, on the other hand, are accessible only through a direct lookup mechanism where the user types in a URL and gets back any matching snapshots. Archives only store copies of material; they don't provide an index to it or even a listing of what they hold: it is assumed that this role is provided elsewhere. For domains that have been deleted or now house unrelated content, this is not always the case. This would be akin to libraries dropping their reading rooms, stacks, and card catalogs, and storing all of their books in a robotic warehouse. Instead of browsing or requesting a book by title or category, one could only request a book by its ISBN code, which had to be known beforehand, and it was someone else's responsibility to store those codes. A tremendous step forward would be a list from each archive of all of the root domains that it has one or more pages from, but ultimately having a list of all URLs, along with the number of snapshots and a list of the dates of those snapshots, would enable an entirely new form of access to these archives. This data could be used by researchers and others to come up with new ways of accessing and interacting with the data held by these archives.

MetaSearch. With better inventory data, we could build metasearch tools that act as the digital equivalent of WorldCat for web archives. Web archives today operate more like document archives than libraries: they hold content, but they themselves often have no idea of the full extent of what they hold. A scholar looking for a particular print document might have to spend months or even years scouring archives all over the world looking for one that holds a copy of that document, whereas if she was looking for a book, a simple search on WorldCat would turn up a list of every participating library that held a copy in its electronic catalog. This is possible because libraries have invested in maintaining inventories of their holdings and standardizing the way in which those inventories are stored, so that third parties can aggregate them and develop services that allow users to search across those inventories. Imagine being able to type in a URL and see every copy from every web archive in the world, rather than just the copies held by any one archive.

Specialty Archives. Metasearch would allow federated search across all archives, but this also raises the concern about backups of smaller specialty archives. Larger whole-web archives like the Internet Archive still can't possibly archive everything that exists. Specialty archives fill this niche, often with institutional focuses or through a researcher creating an archive of material on a particular niche topic for her own use. Often these archives are created for a particular research project and then discarded when that paper is published. How do we bring these into the fold? Perhaps some mechanism is needed for allowing those archives to submit to a network of web archives and say, essentially, "if you're interested, here you go"? They would need to be marked separately, since their content was produced outside of the main archives' processes, but as web crawlers become easier to use and more researchers create their own specialty curated collections, should we have mechanisms to allow them to be archived, to leverage their resources to penetrate areas of the web we might not be able to?

Citability. For archives to be useful in scholarly research, a particular snapshot of a page must have a permanent identifier that can be cited in the references list of a publication. The Internet Archive provides an ideal example of this, in which each snapshot has its own permanent URL that includes both the page URL and the exact timestamp of that snapshot (a minimal example of constructing such a URL follows this list). This URL can be cited in a publication in the same format as any other web page. Yet, not every archive provides this type of access; some make use of AJAX (interactive JavaScript applications) that provides a more desktop-like browsing experience but masks the URL for each snapshot, making it impossible to point others to that copy.
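For instance, an Internet Archive snapshot URL embeds the capture timestamp directly in its path, so a citable reference can be assembled from nothing more than the page URL and a 14-digit capture time; the page and timestamp below are made up for illustration.

```python
def wayback_citation(page_url, timestamp):
    """Build a permanent, citable Internet Archive snapshot URL.

    timestamp is the 14-digit YYYYMMDDhhmmss capture time of the snapshot.
    """
    return "https://web.archive.org/web/%s/%s" % (timestamp, page_url)

# e.g. https://web.archive.org/web/20010404120000/http://www.example.gov/release.html
print(wayback_citation("http://www.example.gov/release.html", "20010404120000"))
```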

TECHNICAL INSIGHT

In the modern era libraries and archives have existed decoupled from their researchers: a professional class collected and curated their collections, and scholars traveled to whichever institution held the materials they needed. Few records exist as to why a given library collected this work rather than that one, and as scholars we simply accept this. Yet, perhaps in the digital era we can do better, as most of these decisions are stored in emails, memos, and other materials, all of them searchable and indexable. Web crawlers are seeded with starting URLs and crawl based on deterministic software algorithms, both of which can be documented for scholars.

Most web archives operate as black boxes designed for casual browsing and retrieval of individual objects, without asking too many questions about how that object got there. This is in stark contrast to digitized archives, in which every conceivable piece of metadata is collected. A visitor to the Internet Archive today encounters an odd experience: retrieving a digitized book yields a wealth of information on how that digital copy came to be, from the specific library it came from to the name of the person who operated the scanner that photographed it, while retrieving a web page yields only a list of available snapshot dates.

Snapshot Timestamps. All archives store an internal timestamp recording the precise moment when a page snapshot was downloaded, but their user interfaces often mask this information. For example, when examining changes in White House press releases, we found that clicking on a snapshot for April 4, 2001 in the Internet Archive would always take us to a snapshot of the page we requested, but if we looked in the URL bar (the Internet Archive includes the timestamp of the snapshot in the URL), we noticed that occasionally the snapshot we were ultimately given was from days or weeks before or after our requested date. Upon further research, we found that some archives automatically redirect a user to the nearest date when a given snapshot date becomes unavailable due to hardware failure or other reasons. This is an ideal behavior for a casual user, but for an expert user tracing how changes in a page correspond to political events occurring each day, it is problematic. Archives should provide a notice when a requested snapshot is not available, allowing the user to decide whether to proceed to the closest available one or to select another date.

Page Versus Site Timestamps. Some archives display only a single timestamp for all pages collected from a given site during a particular crawl: usually the time at which the crawlers started archiving that site. Even a medium-sized site may take hours or days to fully crawl when rate limiting and other factors are taken into account, and for some users it is imperative to know the precise moment when each page was requested, not when the crawlers first entered the site. Most archives store this information, so it is simply a matter of providing access to it via the user interface for those users requesting it.

Crawl Algorithms. Not every site can be crawled in its entirety: some sites may simply be too large or have complex linking structures that make it difficult to find every page, or they may be dynamically generated. Some research questions may be affected by the algorithm used to crawl the site (depth-first vs. breadth-first), the seed URLs used to enter the site (the front page, table of contents pages, content pages, etc.), where the crawl was aborted (if it was), which pages errored during the crawl (and thus whose links were not crawled), and so on. If, for example, one wishes to estimate the size of a dynamic database-driven website, such factors can be used to draw estimates of its total size and composition, but only if users can access these technical characteristics of the crawl.

Raw Source Access. Current archives are designed to provide a transparent "time machine" view of the web, where clicking on a snapshot attempts to render the page in a modern browser in a way that reproduces, as faithfully as possible, what it originally looked like when it was captured. However, a page might contain embedded HTML instructions such as a <META REFRESH> tag or JavaScript code that automatically forwards the browser to a new URL. This may happen transparently without the user noticing. In our study of White House press releases, we were especially interested in pages that had been blanked out, where a press release had been replaced with a <META REFRESH> tag and an editorial comment in an HTML comment in the page. Clicking on these pages using the Internet Archive interface simply forwarded us to the new URL indicated by the refresh command, so we had to download the pages via a special downloading software package in order to review the source code of the page without being redirected. This is a relatively rare scenario, but it would be helpful for archives to provide a "view source" mode, where clicking on a snapshot takes the user right to the source code of a page instead of trying to display it (a minimal sketch of the fetch-and-inspect workaround follows this list).

Crawler Physical Location. Several major foreign news outlets embargo content or present different selections or versions of their content depending on where the visitor's computer is physically located. A visitor accessing such a site will see a very different picture depending on whether she is in the United States, the United Kingdom, China, or Russia. This is a growing issue, as more sites adopt content management systems that dynamically adjust the structure and layout of the site for each individual visitor based on their actions as they click through the site. Analyses of such sites require information on where the crawlers were physically located and the exact order of pages they requested from the site. As with the other recommendations listed above, this information is already held by most archives; it is simply a matter of making it more available to users.
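Until archives offer such a "view source" mode, a researcher can at least fetch a snapshot's stored HTML directly and inspect the markup rather than letting a browser act on it. Here is a minimal sketch of the fetch-and-inspect step referenced in the Raw Source Access item above: it scans for a META REFRESH directive instead of following it. The snapshot address is a placeholder, and the regular expression is deliberately simple.

```python
import re
import requests

META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*?url=([^"\'>]+)', re.IGNORECASE)

def meta_refresh_target(snapshot_url):
    """Fetch an archived page's stored HTML and report any META REFRESH target,
    rather than letting a browser silently follow the redirect."""
    html = requests.get(snapshot_url, timeout=30).text
    match = META_REFRESH.search(html)
    return match.group(1).strip() if match else None

# placeholder snapshot address, for illustration only
target = meta_refresh_target("https://archive.example.org/snapshot/20030318/press-release.html")
if target:
    print("page has been blanked out and now forwards to:", target)
```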

FIDELITY AND LINKAGE

Fidelity. Modern web archiving platforms capture not only the HTML code of a page, but also interpret the HTML and associated CSS files to compile a list of images, CSS files, JavaScript code, and other files necessary to properly display the page, and archive these as well. The rise of interactive and highly multimedia web pages is challenging this approach, as pages may have embedded Flash or AJAX/JavaScript applications, streaming video, and embedded widgets displaying information from other sites. No longer limited to high-design or highly technical sites, these features are making their way into more traditional websites, such as the news media. For example, the BBC's website includes both Flash and JavaScript animations on its front page, while the Chicago Tribune's site includes Flash animations on its front page that respond to mouseovers and animate or perform other actions. The BBC also includes an embedded JavaScript widget that displays advertisements. Both sites include extensive embedded streaming Flash-based video. Many of these tools reference data or JavaScript code on other sites: for example, many sites now make use of Google's Visualization API toolkit for interactive graphs and displays and simply link to the code housed on Google's site. On the one hand, we might dismiss advertisements and embedded content as not worth archiving, yet a rich literature in the advertising discipline addresses the psychological impact of advertisements and other sidebar material on the processing of information in the web era. Even digitized historical newspaper archives have been very careful to offer access to the entire scanned page image, to allow scholars to access advertisements and layout information rather than just focusing on the article text. Excluding dynamic content will make it impossible for scholars of the future to understand how advertisements were used on the web. Yet, simply saving a copy of a Flash or AJAX widget may not be sufficient, as technical dependencies may render it unexecutable 20 years from now. One possibility might be creating a screen capture of each page as it is archived, to provide at least a coarse snapshot of what that page looked like to a visitor of the time period (a minimal sketch of this idea follows below).

Web/Social Linkage. Many sites are making use of social media platforms like Twitter and Facebook as part of their overall information ecosystem. For example, the front page of the Chicago Tribune prominently links to its Facebook page, where editors post a curated assortment of links to Tribune content over the course of each day. Visitors "like" stories and post comments on the Facebook summary of the story, creating a rich environment of commentary that exists in parallel to the original web page on the Tribune site. Other sites allow commentary directly on their webpages through a user comments section. Some sites may only allow comments for a few days after a page is posted, while others may allow comments years later. This social narrative is an integral part of the content seen by visitors of the time, yet how do we properly preserve this material, especially from linked social media platform profiles?
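As a minimal sketch of the screen-capture idea raised in the Fidelity item above, a scriptable headless browser can render each page as a contemporary visitor would have seen it and store a full-page image alongside the crawled files. Playwright is used here purely as one option; any equivalent tool would serve, and the URL is only an example.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def capture_page(url, png_path):
    """Render a page in a headless browser and save a full-page screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let scripts and widgets settle
        page.screenshot(path=png_path, full_page=True)
        browser.close()

capture_page("http://www.example.com/", "front_page.png")
```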

CONCLUSIONS AND THE ROLE OF ARCHIVES

As web archives mature and expand, a growing question revolves around the role of these archives in society. What should their primary mission(s) be and how can they best fulfill those roles? At their most basic level, I believe web archives fulfill three primary roles: Preservation, Research, and Authentication, in that order.

Preservation. First and foremost, web archives preserve the web. They act as the web equivalent of the archive or library, constantly monitoring for new content, requesting a copy of that content, and keeping a copy of it for posterity. In this role, their mission is to acquire and preserve the web for future generations, with access being primarily through basic browsing and retrieval. Some archives, for legal reasons, may not even be able to provide access to their holdings during the lifetime of the organizations providing them content, instead holding that material under embargo for a certain number of years, but ensuring its continued survival for future generations.

Research. A unique and emerging use of archives is as a research service for scholars. Very few academics, especially in the social sciences and humanities, have the computational expertise or resources to crawl and download large portions of the web for research. Commercial web crawling companies like Google do not provide their data for research, and thus web archives provide a fundamentally unique and enabling resource for the study of the web that scholars can turn to. Even more critically, many key humanities and social science questions revolve around how ideas and communication change over time, and web archives capture the only view of change on the web. In this role, the secondary mission of archives is to provide access to their holdings that goes beyond the basic browsing needed for casual use or deep scholarly reading of a small number of works, towards providing programmatic tools and access policies that support computational data mining of large portions of their holdings.

Authentication. A final emerging use of archives is as an authentication service. Web data is highly mutable, changing constantly, and there is no way to authenticate whether the page I see today is the same as what I saw yesterday, especially if the change is a small one. It took more than five years for changes to White House press releases to be spotted via copies held in the Internet Archive, and even then the discovery was entirely by accident. Third-party archives allow authentication of what a page looked like at a given moment. One could even imagine someday a browser plugin that, as a user browsed certain sites on the web (such as government pages, perhaps medical or other pages), would compare each page with the most recent copy stored by a network of web archives and display an indicator as to whether the page has changed since it was last archived, as well as highlight those changes (a minimal sketch of that comparison step follows this list). In this role, the third, peripheral mission of the web archive is to act as a disinterested third party that can authenticate and verify the contents of a given web page at a given moment in time.
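A minimal sketch of the comparison step such a plugin or service would need: reduce the live page and an archived copy to their visible text and report the differences. How an archive exposes its "most recent snapshot" is left open here, and the archived URL below is a placeholder.

```python
import difflib

import requests
from bs4 import BeautifulSoup

def visible_text(url):
    """Reduce a page to its visible text lines so trivial markup edits are ignored."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [line for line in soup.get_text("\n").splitlines() if line.strip()]

def archive_diff(live_url, archived_url):
    """Return a unified diff between an archived snapshot and the live page."""
    return "\n".join(difflib.unified_diff(
        visible_text(archived_url), visible_text(live_url),
        fromfile="archived copy", tofile="live page", lineterm=""))

print(archive_diff("http://www.example.gov/release.html",
                   "https://archive.example.org/latest/www.example.gov/release.html"))
```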

Wikipedia offers an intriguing vision of what the ultimate web archive might look like. Every edit to every page since the inception of the site has been archived and is available at a mouse click, allowing a visitor or scholar to trace the entire history of every word. Every operation taken on the site and the complete source code of every algorithm used for its various automated processes are fully documented and made available, offering complete technical transparency. Finally, a dedicated bulk download page is maintained from which researchers may download a ZIP file containing the entirety of the site and every edit ever performed, which has made Wikipedia a mainstay of considerable social and computer science research.

As our digital world continues to grow at a breathtaking pace and more and more of our daily lives occur within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.
